
Constrained decoding and tool calling.

If the contract is "the output must be valid JSON," the wrong answer is to ask the model nicely and retry on parse error. The right answer is to make the decoder physically incapable of emitting a string that fails the schema. Same machinery, two surfaces: JSON mode and tool calling.

May 14, 2026 · Kolmogorov research · apps/runtime/constrained.py, apps/runtime/tools.py

The mechanism

A logits processor is a function that runs between the model and the sampler. It receives the next-token logits and can zero out any tokens that would violate a constraint. If the constraint is "the output must satisfy this JSON schema," the processor maintains a partial-parse state machine over the generated tokens and masks any next-token that does not extend a valid prefix of the schema's grammar.
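As a toy illustration (standalone, not the kolm implementation), masking reduces to setting every forbidden logit to negative infinity before the sampler sees it:

```python
import math

def mask_logits(logits, allowed_ids):
    # Forbidden tokens get -inf, so no sampler (greedy or stochastic)
    # can ever select them; allowed tokens keep their original scores.
    return [x if i in allowed_ids else -math.inf
            for i, x in enumerate(logits)]

# Toy 5-token vocabulary; suppose only tokens 1 and 3 extend a
# valid prefix under the current parse state.
masked = mask_logits([0.2, 1.5, -0.3, 0.9, 2.1], {1, 3})
best = max(range(len(masked)), key=masked.__getitem__)  # greedy pick
```

Token 4 had the highest raw score, but it is masked, so the greedy pick falls to token 1.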

Two production libraries implement this: Outlines, which compiles a JSON schema, regex, or grammar into a token-level finite-state machine, and lm-format-enforcer, which tracks the set of allowed next tokens against the schema as decoding proceeds.

Both ship as logits-processor-compatible objects that drop into vLLM, transformers, and llama.cpp. kolm prefers Outlines when present, falls back to lm-format-enforcer, and finally to a post-hoc parse-and-retry shim when neither is installed.
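The preference chain can be sketched as a plain import probe (module names are the published package names; the actual selection logic in constrained.py may differ):

```python
def pick_backend():
    # Try the preferred backends in order; fall back to the
    # parse-and-retry shim when neither library is installed.
    try:
        import outlines  # noqa: F401
        return "outlines"
    except ImportError:
        pass
    try:
        import lmformatenforcer  # noqa: F401
        return "lm-format-enforcer"
    except ImportError:
        return "post-hoc-retry"
```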

The three modes

JSON schema mode

For structured outputs where the schema is known in advance: classifier labels, extracted entities, function arguments. The schema goes in, a logits processor comes out:

from apps.runtime.constrained import json_schema_processor

processor = json_schema_processor({
    "type": "object",
    "properties": {
        "diagnosis_code": {"type": "string", "pattern": "^[A-Z][0-9]{2}\\.[0-9]$"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1}
    },
    "required": ["diagnosis_code", "confidence"]
}, tokenizer=tok)

out = model.generate(..., logits_processor=[processor.processor])

The model cannot emit a string that fails to parse, and cannot emit a diagnosis code that fails the regex. No retries.
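As a standalone check of what that diagnosis-code pattern admits:

```python
import re

# The diagnosis_code pattern from the schema above: one uppercase
# letter, two digits, a dot, one digit (an ICD-10-style shape).
pattern = re.compile(r"^[A-Z][0-9]{2}\.[0-9]$")

valid = pattern.match("J45.0") is not None    # accepted
bad_case = pattern.match("j45.0") is not None  # rejected: lowercase letter
truncated = pattern.match("J45") is not None   # rejected: missing ".N"
```

Under the constrained decode, the rejected shapes are not merely invalid outputs; they are unreachable token sequences.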

Regex mode

For free-form structured outputs where a JSON schema is overkill: a single SSN, a date, a UUID, a phone number, a Likert score, a per-call action tag. The regex is the constraint:

from apps.runtime.constrained import regex_processor
processor = regex_processor(r"^[1-5]$", tokenizer=tok)
# model can only emit one of "1", "2", "3", "4", "5"

CFG / EBNF mode

For domain DSLs: SQL with a tightened grammar, ICD-10 code expressions, an in-house intent language. Outlines accepts an EBNF grammar string and compiles it. The grammar bounds what the model can emit at every step.
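A toy grammar in that spirit (illustrative, not the in-house DSL): the Lark-style EBNF string is what a library like Outlines would compile, and the regex check below merely demonstrates the language it describes.

```python
import re

# Toy EBNF for ICD-10-style code expressions such as "J45+K21":
#   expr : CODE ("+" CODE)*
#   CODE : /[A-Z][0-9]{2}/
ICD_EXPR_GRAMMAR = r"""
?start: expr
expr: CODE ("+" CODE)*
CODE: /[A-Z][0-9]{2}/
"""

# The same language, checked with a regex purely for demonstration.
_expr = re.compile(r"^[A-Z][0-9]{2}(\+[A-Z][0-9]{2})*$")

def in_grammar(s: str) -> bool:
    return _expr.match(s) is not None
```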

The fallback path

Some buyers run on hardware where Outlines does not build (some ROCm, some old Apple silicon). The fallback is a post-hoc retry loop:

from apps.runtime.constrained import post_hoc_retry_json

ok, parsed, err = post_hoc_retry_json(
    generate_fn=lambda p: model.generate(p, max_new_tokens=256),
    prompt="Return a diagnosis as JSON: ...",
    schema=schema,
    max_retries=3,
)
if not ok:
    raise RuntimeError(f"model failed schema after retries: {err}")

The retry loop validates the output, and on failure re-prompts with the error message folded in. It is slower and weaker than a real constrained-decode pass, but it keeps the API surface uniform so downstream code does not branch on which backend is available.
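A minimal sketch of such a loop (standalone; the validate callback and prompt wording are illustrative, not post_hoc_retry_json's actual signature):

```python
import json

def retry_json(generate_fn, prompt, validate, max_retries=3):
    # On each failure, re-prompt with the error folded in so the
    # model can self-correct; give up after max_retries attempts.
    err = None
    for _ in range(max_retries):
        p = prompt if err is None else (
            prompt + "\nThe previous output was invalid: " + err + "\nTry again.")
        raw = generate_fn(p)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            err = str(exc)
            continue
        ok, err = validate(parsed)
        if ok:
            return True, parsed, None
    return False, None, err
```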

Tool calling

OpenAI Chat Completions accepts a tools array of JSON-schema function definitions and emits a tool_calls field with name + arguments. The kolm tool-calling surface is two things stacked:

  1. Render the tools into the prompt in the format the target model expects.
  2. Constrain the decoder so the emission parses, regardless of what the model wanted to say.

Renderers

Different model families have different conventions for how tool calls travel in plain text. The renderer detects the right one from the model name:

Family            Render                                              Detect
Qwen 2.5 / 3      <tools>...</tools> + <tool_call>{...}</tool_call>   name contains "qwen"
Llama 3.1 / 3.2   <function=NAME>{...}</function>                     name contains "llama"
Hermes-2-Pro      Qwen-style with a longer system framing             name contains "hermes" or "nous"
Anything else     Generic JSON envelope: {"tool":"NAME","arguments":{...}} or {"answer":"..."}   fallback
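The detection column reduces to substring checks on the model name; a sketch (return values are illustrative labels, not kolm's renderer objects):

```python
def pick_renderer(model_name: str) -> str:
    # Mirror the detection rules above. Hermes is checked before Qwen
    # because it uses Qwen-style tags with different system framing.
    n = model_name.lower()
    if "hermes" in n or "nous" in n:
        return "hermes"
    if "qwen" in n:
        return "qwen"
    if "llama" in n:
        return "llama"
    return "envelope"  # generic JSON envelope fallback
```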

The union schema

Regardless of how the prompt is rendered, the decoder is constrained to emit a string that parses as one of these JSON shapes:

{
  "oneOf": [
    {"type": "object",
     "properties": {"tool": {"const": "lookup_user"},
                    "arguments": {"type": "object", "properties": {...}}},
     "required": ["tool", "arguments"]},
    {"type": "object",
     "properties": {"tool": {"const": "send_email"},
                    "arguments": {"type": "object", "properties": {...}}},
     "required": ["tool", "arguments"]},
    {"type": "object",
     "properties": {"answer": {"type": "string"}},
     "required": ["answer"]}
  ]
}

One oneOf branch per registered tool, plus an "answer" branch for direct replies. tool_choice tightens the schema: "required" drops the answer branch, a named function keeps only that tool's branch, and "none" leaves only the answer branch.
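A sketch of how the union is assembled from OpenAI-style tool definitions (illustrative; the real union_schema in tools.py may differ in detail):

```python
def build_union_schema(tools, allow_answer=True):
    # One oneOf branch per tool, pinning "tool" to the tool's name
    # and "arguments" to its declared parameter schema.
    branches = []
    for t in tools:
        fn = t["function"]
        branches.append({
            "type": "object",
            "properties": {
                "tool": {"const": fn["name"]},
                "arguments": fn.get("parameters", {"type": "object"}),
            },
            "required": ["tool", "arguments"],
        })
    if allow_answer:
        branches.append({
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        })
    return {"oneOf": branches}
```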

Parsing back

After generation, the response is parsed by parse_native_tool_call() for the rendered format, or parse_envelope() for the JSON envelope. Both return a ParsedResponse with either an answer string or a tuple of ToolCalls. Each call has a stable id (call_<24-hex>) so the receipt chain can reference the same call across retries and tool-result follow-ups.
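The id shape is easy to produce with the standard library; this generator is an assumption for illustration, and only the call_<24-hex> shape comes from the text above:

```python
import secrets

def new_call_id() -> str:
    # 12 random bytes -> 24 hex characters, giving the stable
    # call_<24-hex> shape used to reference a call across retries.
    return "call_" + secrets.token_hex(12)
```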

The end-to-end shape:

from apps.runtime.tools import parse_tools_field, union_schema, render_tool_prompt, parse_envelope
from apps.runtime.constrained import json_schema_processor

tools = parse_tools_field(request_body["tools"])
schema = union_schema(tools, allow_answer=True)
tool_block = render_tool_prompt(tools, model_hint=model_id)
processor = json_schema_processor(schema, tokenizer=tok).processor

prompt = system_prompt + "\n" + tool_block + "\n" + user_message
out = model.generate(prompt, logits_processor=[processor])
parsed = parse_envelope(out)

Every emission is guaranteed to parse as {tool, arguments} or {answer}. The receipt records the parsed structure, not the raw bytes, so a buyer auditing the call sees the tool name and arguments directly.

Why this is the right shape

The alternative is to leave parsing to a try/except + retry loop, which costs latency, doubles tokens on every retry, and leaves an unbounded tail where the model never converges to valid JSON. Worse, it gives up on the constraint at exactly the wrong moment: when production is bursty and the retry budget runs out, the contract breaks.

Constrained decoding makes the contract physical. The model cannot emit a token that breaks the schema, so the parse cannot fail. The receipt records the schema as part of the manifest hash, so the buyer's auditor can prove that the run was bound by a schema and not by a hopeful try/except.

What we are not promising

Constrained decoding does not improve quality. It enforces shape, not correctness. A model can emit a perfectly valid JSON object with a wrong answer in it. K-score still gates the artifact, and the judge surface (LLM-as-judge, hallucination detection) still grades the content.

Performance overhead is not zero. Maintaining the FSM costs a few microseconds per token in CPU time. On a fast decoder this is invisible; on llama.cpp running on a phone it is measurable. The trade is worth it because the alternative is retries, which cost orders of magnitude more.

EBNF / CFG modes are heavier. A complex grammar compiles to a large FSM. For very large SQL grammars we recommend a JSON-schema constraint over the parameter struct rather than the full SQL grammar.

References and source

Willard, B. T. & Louf, R. Efficient Guided Generation for Large Language Models. 2023.

lm-format-enforcer. github.com/noamgat/lm-format-enforcer

Outlines. github.com/dottxt-ai/outlines

kolm implementation: apps/runtime/constrained.py and apps/runtime/tools.py.