Constrained decoding and tool calling.
If the contract is "the output must be valid JSON," the wrong answer is "ask the model nicely and retry on parse error." The right answer is to make the decoder physically incapable of emitting a string that fails the schema. Same machinery, two surfaces: JSON mode and tool calling.
The mechanism
A logits processor is a function that runs between the model and the sampler. It receives the next-token logits and can mask out (set to negative infinity) any tokens that would violate a constraint. If the constraint is "the output must satisfy this JSON schema," the processor maintains a partial-parse state machine over the generated tokens and masks any next token that does not extend a valid prefix of the schema's grammar.
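A minimal sketch of that interface, assuming a transformers-style processor over batch size 1; the `fsm` object, with its `initial_state`, `next_state`, and `allowed_token_ids` members, is hypothetical, standing in for the compiled tables a real library maintains:

```python
import torch

class GrammarLogitsProcessor:
    # Sketch only: `fsm` is a hypothetical compiled grammar exposing
    # initial_state, next_state(state, token_id), allowed_token_ids(state).
    def __init__(self, fsm, prompt_len):
        self.fsm = fsm
        self.state = fsm.initial_state
        self.seen = prompt_len  # constrain only generated tokens, not the prompt

    def __call__(self, input_ids, scores):
        # Advance the parse state over tokens sampled since the last call
        # (batch size 1 for brevity).
        for token_id in input_ids[0, self.seen:].tolist():
            self.state = self.fsm.next_state(self.state, token_id)
        self.seen = input_ids.shape[-1]
        # Mask every token that cannot extend a valid prefix of the grammar.
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.fsm.allowed_token_ids(self.state)] = 0.0
        return scores + mask
```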
Two production libraries implement this:
- Outlines (Willard & Louf, 2023). Compiles a regex or JSON schema into a finite-state machine over the tokenizer's vocabulary; at decode time, the FSM tracks which states are reachable. Fast; handles JSON schema, regex, and EBNF context-free grammars.
- lm-format-enforcer. Same idea, different implementation. Mature, slightly slower in our microbench, but its coverage of the JSON-schema long tail is sometimes better.
Both ship as logits-processor-compatible objects that drop into vLLM, transformers, and llama.cpp. kolm prefers Outlines when present, falls back to lm-format-enforcer, and finally to a post-hoc parse-and-retry shim when neither is installed.
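The preference order reduces to an import probe. A sketch, using the module names the two packages publish on PyPI (`outlines`, `lmformatenforcer`); kolm's actual detection lives in apps/runtime/constrained.py:

```python
# Prefer Outlines, then lm-format-enforcer, then the post-hoc retry shim.
try:
    import outlines  # noqa: F401
    BACKEND = "outlines"
except ImportError:
    try:
        import lmformatenforcer  # noqa: F401
        BACKEND = "lm-format-enforcer"
    except ImportError:
        BACKEND = "post-hoc-retry"
```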
The three modes
JSON schema mode
For structured outputs where the schema is known in advance: classifier labels, extracted entities, function arguments. The schema goes in, a logits processor comes out:
```python
from apps.runtime.constrained import json_schema_processor

processor = json_schema_processor({
    "type": "object",
    "properties": {
        "diagnosis_code": {"type": "string", "pattern": "^[A-Z][0-9]{2}\\.[0-9]$"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1}
    },
    "required": ["diagnosis_code", "confidence"]
}, tokenizer=tok)

out = model.generate(..., logits_processor=[processor.processor])
```
The model cannot emit a string that fails to parse, and cannot emit a diagnosis code that fails the regex. No retries.
Regex mode
For free-form structured outputs where a JSON schema is overkill: a single SSN, a date, a UUID, a phone number, a Likert score, a per-call action tag. The regex is the constraint:
```python
from apps.runtime.constrained import regex_processor

processor = regex_processor(r"^[1-5]$", tokenizer=tok)
# model can only emit one of "1", "2", "3", "4", "5"
```
CFG / EBNF mode
For domain DSLs: SQL with a tightened grammar, ICD-10 code expressions, an in-house intent language. Outlines accepts an EBNF grammar string and compiles it. The grammar bounds what the model can emit at every step.
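A sketch with a toy intent grammar in Lark-style EBNF, the syntax Outlines consumes. `cfg_processor` is a hypothetical sibling of the two helpers above, not a confirmed kolm export:

```python
from apps.runtime.constrained import cfg_processor  # hypothetical, mirrors regex_processor

# Toy intent DSL: a verb, a known object, and an optional priority flag.
INTENT_GRAMMAR = r"""
?start: action
action: VERB " " OBJECT (" " PRIORITY)?
VERB: "open" | "close" | "escalate"
OBJECT: "ticket" | "account"
PRIORITY: "urgent" | "routine"
"""

processor = cfg_processor(INTENT_GRAMMAR, tokenizer=tok)
out = model.generate(..., logits_processor=[processor.processor])
# The model can only emit strings like "escalate ticket urgent".
```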
The fallback path
Some buyers run on hardware where Outlines does not build (some ROCm, some old Apple silicon). The fallback is a post-hoc retry loop:
```python
from apps.runtime.constrained import post_hoc_retry_json

ok, parsed, err = post_hoc_retry_json(
    generate_fn=lambda p: model.generate(p, max_new_tokens=256),
    prompt="Return a diagnosis as JSON: ...",
    schema=schema,
    max_retries=3,
)
if not ok:
    raise RuntimeError(f"model failed schema after retries: {err}")
```
The retry loop validates the output, and on failure re-prompts with the error message folded in. It is slower and weaker than a real constrained-decode pass, but it keeps the API surface uniform so downstream code does not branch on which backend is available.
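What the shim does internally, as a sketch; it validates with the `jsonschema` package here, while the real helper lives in apps/runtime/constrained.py:

```python
import json
from jsonschema import ValidationError, validate

def post_hoc_retry_json_sketch(generate_fn, prompt, schema, max_retries=3):
    err = None
    for _ in range(max_retries):
        raw = generate_fn(prompt)
        try:
            parsed = json.loads(raw)
            validate(parsed, schema)  # raises ValidationError on schema failure
            return True, parsed, None
        except (json.JSONDecodeError, ValidationError) as exc:
            err = str(exc)
            # Fold the failure back into the prompt before the next attempt.
            prompt += f"\n\nYour previous output was invalid ({err}). Return only valid JSON."
    return False, None, err
```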
Tool calling
OpenAI Chat Completions accepts a tools array of JSON-schema function definitions and emits a tool_calls field with name + arguments. The kolm tool-calling surface is two things stacked:
- Render the tools into the prompt in the format the target model expects.
- Constrain the decoder so the emission parses, regardless of what the model wanted to say.
Renderers
Different model families have different conventions for how tool calls travel in plain text. The renderer detects the right one from the model name (a sketch follows the table):
| Family | Rendered format | Detection |
|---|---|---|
| Qwen 2.5 / 3 | `<tools>...</tools>` + `<tool_call>{...}</tool_call>` | name contains "qwen" |
| Llama 3.1 / 3.2 | `<function=NAME>{...}</function>` | name contains "llama" |
| Hermes-2-Pro | Qwen-style with a longer system framing | name contains "hermes" or "nous" |
| Anything else | Generic JSON envelope: `{"tool":"NAME","arguments":{...}}` or `{"answer":"..."}` | fallback |
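A sketch of the detection logic the table describes; the function name is ours, not kolm's:

```python
def pick_renderer(model_name: str) -> str:
    # Mirrors the detection column above. Hermes is checked first because
    # its checkpoints often embed a base-model name ("...-Llama-3-8B") too.
    name = model_name.lower()
    if "hermes" in name or "nous" in name:
        return "hermes"
    if "qwen" in name:
        return "qwen"
    if "llama" in name:
        return "llama"
    return "generic"  # JSON envelope fallback
```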
The union schema
Regardless of how the prompt is rendered, the decoder is constrained to emit a string that parses as one of these JSON shapes:
```json
{
  "oneOf": [
    {"type": "object",
     "properties": {"tool": {"const": "lookup_user"},
                    "arguments": {"type": "object", "properties": {...}}},
     "required": ["tool", "arguments"]},
    {"type": "object",
     "properties": {"tool": {"const": "send_email"},
                    "arguments": {"type": "object", "properties": {...}}},
     "required": ["tool", "arguments"]},
    {"type": "object",
     "properties": {"answer": {"type": "string"}},
     "required": ["answer"]}
  ]
}
```
One `oneOf` branch per registered tool, plus an `answer` branch for direct replies. `tool_choice` tightens the schema, as sketched after this list:
"none"→ only the answer branch is allowed"auto"→ tools plus answer (default)"required"→ tools only, no answer escape{"type":"function","function":{"name":"send_email"}}→ exactly that one tool
Parsing back
After generation, the response is parsed by `parse_native_tool_call()` for the rendered format, or `parse_envelope()` for the JSON envelope. Both return a `ParsedResponse` with either an answer string or a tuple of `ToolCall`s. Each call has a stable id (`call_<24-hex>`) so the receipt chain can reference the same call across retries and tool-result follow-ups.
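Those shapes, sketched as dataclasses; the real definitions live in apps/runtime/tools.py and the field names here are illustrative:

```python
import secrets
from dataclasses import dataclass, field

def _call_id() -> str:
    # call_<24-hex>: assigned once, so the receipt chain can reference
    # the same call across retries and tool-result follow-ups.
    return "call_" + secrets.token_hex(12)

@dataclass
class ToolCall:
    name: str
    arguments: dict
    id: str = field(default_factory=_call_id)

@dataclass
class ParsedResponse:
    answer: str | None = None
    tool_calls: tuple[ToolCall, ...] = ()
```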
The end-to-end shape:
```python
from apps.runtime.tools import parse_tools_field, union_schema, render_tool_prompt, parse_envelope
from apps.runtime.constrained import json_schema_processor

tools = parse_tools_field(request_body["tools"])
schema = union_schema(tools, allow_answer=True)
tool_block = render_tool_prompt(tools, model_hint=model_id)
processor = json_schema_processor(schema, tokenizer=tok).processor

prompt = system_prompt + "\n" + tool_block + "\n" + user_message
out = model.generate(prompt, logits_processor=[processor])
parsed = parse_envelope(out)
```
Every emission is guaranteed to parse as {tool, arguments} or {answer}. The receipt records the parsed structure, not the raw bytes, so a buyer auditing the call sees the tool name and arguments directly.
Why this is the right shape
The alternative is to leave parsing to a try/except + retry loop, which costs latency, doubles tokens on each retry, and produces an unbounded tail where the model never converges to valid JSON. Worse, it gives up on the constraint at exactly the wrong moment: when production traffic is bursty and the retry budget runs out, the contract breaks.
Constrained decoding makes the contract physical. The model cannot emit a token that breaks the schema, so the parse cannot fail. The receipt records the schema as part of the manifest hash, so the buyer's auditor can prove that the run was bound by a schema and not by a hopeful try/except.
What we are not promising
Constrained decoding does not improve quality. It enforces shape, not correctness. A model can emit a perfectly valid JSON object with a wrong answer in it. K-score still gates the artifact, and the judge surface (LLM-as-judge, hallucination detection) still grades the content.
Performance overhead is not zero. Maintaining the FSM costs a few microseconds per token in CPU time. On a fast decoder this is invisible; on llama.cpp running on a phone it is measurable. The trade is worth it because the alternative is retries, which cost orders of magnitude more.
EBNF / CFG modes are heavier. A complex grammar compiles to a large FSM. For very-large SQL grammars we recommend a JSON-schema constraint over the parameter struct rather than the full SQL grammar.
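A sketch of that recommendation: constrain a small parameter struct, then render the SQL deterministically in code. The schema and renderer below are illustrative, not kolm's:

```python
# Constrain the decoder with this instead of a full SQL grammar.
query_schema = {
    "type": "object",
    "properties": {
        "table": {"enum": ["users", "orders"]},
        "columns": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "limit": {"type": "integer", "minimum": 1, "maximum": 1000},
    },
    "required": ["table", "columns"],
}

def render_sql(q: dict) -> str:
    # Deterministic rendering: the model never emits SQL directly.
    sql = f"SELECT {', '.join(q['columns'])} FROM {q['table']}"
    if "limit" in q:
        sql += f" LIMIT {q['limit']}"
    return sql
```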
References and source
- Willard, B. T. & Louf, R. (2023). Efficient Guided Generation for Large Language Models. arXiv:2307.09702.
- lm-format-enforcer: github.com/noamgat/lm-format-enforcer
- Outlines: github.com/dottxt-ai/outlines
- kolm implementation: apps/runtime/constrained.py and apps/runtime/tools.py