Teaching a model to call tools.
Most function-calling failures are template failures: the model emits a tool call in a format the parser does not recognise, the call gets dropped, the agent retries. The fix is to train on a strict template, validate the dataset before training, and parse tolerantly at inference. Hermes-FC is the open format that holds all three.
Why Hermes-FC instead of OpenAI's format
OpenAI's tool-calling format is closed: the tool call is a structured field on the assistant message object, not a string the model emits. To replicate this on an open model you have to either (a) train the model on a string template that your parser then converts back to the structured form, or (b) ship a constrained-decode union schema that forces emission to match a target schema.
Hermes-FC, introduced by NousResearch in the Hermes-2-Pro series, picks (a) and writes the template down. The tool schemas go in the system prompt between <tools> and </tools>; the assistant emits tool calls as <tool_call>JSON</tool_call> blocks; tool responses come back as <tool_response>JSON</tool_response> blocks in user turns. The format is open, the trainer can SFT directly on it, and the parser is regex-friendly.
The template
<|im_start|>system
You are a function calling AI model. Use the following tools when helpful.
<tools>
{"type":"function","function":{"name":"get_weather","description":"...","parameters":{...}}}
{"type":"function","function":{"name":"search_docs","description":"...","parameters":{...}}}
</tools>
For each function call return a tool_call block.
<|im_end|>
<|im_start|>user
What's the weather in Tokyo?
<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call>
<|im_end|>
<|im_start|>user
<tool_response>
{"temperature_c": 22, "conditions": "clear"}
</tool_response>
<|im_end|>
<|im_start|>assistant
The weather in Tokyo is 22 °C and clear.<|im_end|>
One tool schema per line inside <tools>; one JSON object per <tool_call> block; multiple <tool_call> blocks back-to-back when the model wants to call several tools in one turn. The trainer's format_hermes_example renders this from a structured dataset row; the inference parser walks the model output for <tool_call>...</tool_call> and parses each as JSON.
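The rendering that format_hermes_example performs can be sketched in a few lines. This is illustrative only: the function name matches the trainer's, but the exact system-prompt wording and whitespace here are assumptions.

```python
import json

def format_hermes_example(row):
    """Render a structured {tools, messages} row into Hermes-FC chat text.

    Sketch of the trainer's renderer; the real one may differ in
    system-prompt wording and whitespace details.
    """
    # One compact JSON schema per line inside <tools>...</tools>.
    tool_lines = "\n".join(
        json.dumps(t, separators=(",", ":")) for t in row["tools"]
    )
    system = (
        "You are a function calling AI model. Use the following tools when helpful.\n"
        f"<tools>\n{tool_lines}\n</tools>\n"
        "For each function call return a tool_call block."
    )
    parts = [f"<|im_start|>system\n{system}\n<|im_end|>"]
    for msg in row["messages"]:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}\n<|im_end|>")
    return "\n".join(parts)
```

The SFT loss is then computed over this rendered string, with the usual masking of non-assistant turns.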
The dataset validator
Function-calling SFT data is full of subtle bugs: tool calls that don't match any declared tool, JSON arguments that don't match the tool's parameter schema, assistant turns that should have a tool call but don't, user turns with a tool_response for a tool that was never called. Training on a dataset with these bugs produces a model that emits the same bugs at inference.
def validate_dataset(rows):
    issues = []
    for i, row in enumerate(rows):
        tool_names = {t["function"]["name"] for t in row.get("tools", [])}
        for msg in row.get("messages", []):
            if msg["role"] == "assistant":
                for call in parse_tool_calls(msg["content"]):
                    if call.get("name") not in tool_names:
                        issues.append(f"row {i}: undeclared tool {call.get('name')!r}")
                    # ... schema validation of call.arguments vs the tool's parameters
    return issues
The trainer runs the validator before training; any non-empty issue list is a hard error with a list of rows to fix. The buyer's dataset is rejected, not silently trained on. This is the single most consequential move in the whole pipeline; ~30% of public function-calling datasets have at least one bug per 1000 rows, and SFT on bad data is worse than no SFT.
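The argument-vs-schema check elided in the validator above can be hand-rolled for the common cases (a production validator would reach for the jsonschema package). A minimal sketch, assuming each tool's parameters field is a JSON-Schema-style object; the helper name is hypothetical:

```python
def check_arguments(args, schema, path="arguments"):
    """Minimal structural check of tool-call arguments against a
    JSON-Schema-like parameters object.

    Sketch: covers required keys and primitive types only; unknown keys
    pass through, and nested objects are not recursed into.
    """
    issues = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in args:
                issues.append(f"{path}: missing required key {key!r}")
        for key, value in args.items():
            sub = props.get(key)
            if sub is None:
                continue  # unknown keys: stay permissive in the sketch
            expected = {
                "string": str, "integer": int, "number": (int, float),
                "boolean": bool, "array": list, "object": dict,
            }.get(sub.get("type"))
            if expected and not isinstance(value, expected):
                issues.append(
                    f"{path}.{key}: expected {sub['type']}, got {type(value).__name__}"
                )
    return issues
```

Each issue string feeds into the same hard-error list as the undeclared-tool check.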
The inference parser is tolerant
_TOOL_CALL_PATTERN = re.compile(
    r"<tool_call>\s*(\{.*?\})\s*</tool_call>",
    re.DOTALL,
)

def parse_tool_calls(text):
    out = []
    for match in _TOOL_CALL_PATTERN.finditer(text):
        body = match.group(1)
        try:
            out.append(json.loads(body))
        except json.JSONDecodeError:
            # Try once more with minor cleanup (trailing commas, single quotes)
            try:
                out.append(json.loads(_clean(body)))
            except json.JSONDecodeError:
                logger.warning("dropping unparseable tool call: %r", body[:80])
    return out
Strict at training time, tolerant at inference time. The model occasionally drifts (a stray comma, a smart quote, an unescaped newline inside a string); the parser cleans these where it can and drops the call where it cannot. The receipt records dropped calls per request so a buyer can spot a degradation early.
The minimal call
from apps.trainer.function_calling import fc_trainer, FCConfig, validate_dataset

issues = validate_dataset(train_rows)
if issues:
    raise ValueError(f"{len(issues)} dataset issues, first 5: {issues[:5]}")

trainer = fc_trainer(
    model_id="Qwen/Qwen2.5-7B-Instruct",
    train_rows=train_rows,
    config=FCConfig(
        learning_rate=2e-5,
        max_seq_length=4096,
        num_train_epochs=1,
        bf16=True,
    ),
)
trainer.train()
What lands in the receipt
"function_calling": {
  "template": "hermes-fc",
  "papers": ["arXiv:2409.00920", "arXiv:2305.15334"],
  "n_tools_declared": 14,
  "n_tool_calls_in_train": 8923,
  "validator_issues": 0,
  "max_seq_length": 4096,
  "train_examples": 4096
}
The buyer's auditor sees the validator ran with zero issues (otherwise training would not have started), how many tool schemas the buyer declared, and how many tool calls landed in the training set. The deployment-time tool-call drop rate is a separate metric on the runtime side.
Composing with constrained decoding
SFT on Hermes-FC teaches the model the format. Constrained decoding at inference time guarantees the format. The two compose: the trainer ships a model that wants to emit Hermes-FC tool calls; the runtime's union schema (in apps/runtime/tools.py) forces every <tool_call> block to validate against the union of declared tool schemas. The buyer never sees a malformed tool call.
This is the engineering reason kolm uses Hermes-FC instead of OpenAI's structured-output format directly: the open template makes both the training signal and the constrained-decode schema explicit and inspectable. A buyer running an audit can read the system prompt, see the declared tools, see the constrained-decode grammar, and trace any single tool call back through the receipt to the exact assistant emission.
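The union schema mentioned above can be built mechanically from the declared tools. A sketch of the construction; the real code in apps/runtime/tools.py may differ in shape:

```python
def build_union_schema(tools):
    """Build a JSON Schema accepting any one of the declared tool calls.

    Sketch: each oneOf branch pins "name" to a const and reuses the
    tool's own parameters schema for "arguments", so the constrained
    decoder can only emit a declared tool with conforming arguments.
    """
    return {
        "type": "object",
        "oneOf": [
            {
                "properties": {
                    "name": {"const": t["function"]["name"]},
                    "arguments": t["function"]["parameters"],
                },
                "required": ["name", "arguments"],
            }
            for t in tools
        ],
    }
```

The schema hashes deterministically, which is what lets the receipt link the SFT output to the runtime tool surface.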
Edge cases worth naming
Parallel tool calls. The model emits multiple <tool_call> blocks in one assistant turn. The parser handles this natively (the regex is global). The runtime's executor fans them out in parallel and waits for all responses before resuming generation. The training data should include parallel-call examples; if it does not, the model learns to call tools serially and the agent loses throughput.
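A quick check that the global regex really does pick up back-to-back blocks. This restates the document's own pattern so the snippet is self-contained (no cleanup path):

```python
import json
import re

_TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

# One assistant turn with two adjacent tool_call blocks.
text = (
    "I'll check both cities. "
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Tokyo"}}</tool_call>'
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Osaka"}}</tool_call>'
)
# finditer walks the whole string, so both calls come back in emission order.
calls = [json.loads(m.group(1)) for m in _TOOL_CALL_PATTERN.finditer(text)]
```

The executor receives the list in order and is free to dispatch the calls concurrently.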
The tool call needs a thought block first. Many strong tool-calling models (Hermes 2.5+, Qwen 2.5) emit a brief reasoning sentence before the tool call: "I'll check the weather. <tool_call>...</tool_call>". The trainer's loss includes those preface tokens; the parser ignores them. Stripping the preface from training data degrades quality without saving inference cost.
No-call turns. When the user's request needs no tool, the model should answer directly, with no <tool_call> block. The training data must include such turns; otherwise the model learns to always call a tool, invoking a calculator for "What is 2+2?" when a direct answer would do. The validator warns if fewer than 5% of assistant turns are no-call.
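The no-call ratio check is cheap to add to the validator. A sketch with the 5% threshold from the text; a plain substring test suffices here because training data is already strictly formatted (the function names are hypothetical):

```python
def no_call_ratio(rows):
    """Fraction of assistant turns containing no <tool_call> block."""
    total = no_call = 0
    for row in rows:
        for msg in row.get("messages", []):
            if msg["role"] != "assistant":
                continue
            total += 1
            if "<tool_call>" not in msg["content"]:
                no_call += 1
    return no_call / total if total else 0.0

def warn_if_tool_happy(rows, threshold=0.05):
    """Warn when the dataset would teach the model to always call a tool."""
    ratio = no_call_ratio(rows)
    if ratio < threshold:
        return [f"only {ratio:.1%} of assistant turns are no-call"]
    return []
```

Unlike the undeclared-tool check, this one is a warning rather than a hard error, since some tool-heavy agent datasets legitimately skew high.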
Where this fits in the kolm compile loop
Function-calling SFT is a sibling stage to plain SFT. The buyer's spec names it as a separate run; the dataset format is structured (rows of {tools, messages}); the K-score evaluator runs against a function-calling-specific eval pack with tool-call accuracy as the primary axis. The constrained-decode tool surface at serve time is part of the same artifact; the receipt links the SFT output to the runtime tools schema by sharing a tool-schema hash.
Citations
Liu, W. et al. ToolACE: Winning the Points of LLM Function Calling. arXiv:2409.00920, 2024.
Patil, S. G. et al. Gorilla: Large Language Model Connected with Massive APIs. arXiv:2305.15334, 2023.