Engineering · 2026-05-07 · 10 min read

Speculative decoding with deterministic drafts.

A draft model is the standard way to do speculative decoding: pay a small model to predict tokens that the big model can verify in parallel. A recipe pack is faster, smaller, free at runtime, and verifiably correct on the patterns it covers. This is how to use one to cut your local LLM bill to zero on the predictable tokens, which is most of them.

By Kolmogorov · Tags: speculative decoding · drafts · runtime

A 60-second refresher on speculative decoding.

Generating a token from an LLM is a forward pass; one forward pass produces one token. Speculative decoding introduces a draft mechanism: cheaply propose k tokens, then run the big model once to verify all k in parallel. If all k are accepted, you generated k tokens at the cost of one forward pass. If only j < k are accepted, you still generated j + 1 tokens at the cost of one forward pass: the j accepted drafts plus the big model's own correction at the first rejection. That is never worse than serial decoding, and strictly better whenever j ≥ 1.

The accept/reject step is the magic. It's provably correct: after the verify step, the output distribution is identical to the original model's. You don't trade quality for speed; you trade compute layout for speed.
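For sampling, the accept/reject step uses rejection sampling on the draft and target distributions; under greedy decoding it reduces to exact token match, which is what this minimal sketch shows. The base.argmaxAll call is a hypothetical API assumed for illustration: one forward pass over the prefix plus the drafted run yields the base's prediction at every position.

// Sketch: greedy verification of k drafted tokens in one forward pass.
// Assumes base.argmaxAll(tokens) returns, for each position i, the base's
// argmax next token given tokens[0..i] (hypothetical API, for illustration).
function verifyGreedy(base, prefix, drafted) {
  const preds = base.argmaxAll([...prefix, ...drafted]); // one forward pass
  let accepted = 0;
  while (accepted < drafted.length &&
         drafted[accepted] === preds[prefix.length - 1 + accepted]) {
    accepted++;
  }
  // The pass is never wasted: the base's own token at the first
  // disagreement comes out for free, so we always emit accepted + 1 tokens.
  const correction = preds[prefix.length - 1 + accepted];
  return { accepted, correction };
}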

The draft-model approach, and what it costs.

The textbook way to do this is to use a smaller model — say, a 1.5B parameter draft model — to propose tokens for a larger one — say, a 7B target. This works well when the draft model and the target are well-aligned (often a distilled-from-the-target draft).

The downside is real, and it shows up in three places: memory (the draft model's weights have to live alongside the target's), compute (every proposed token costs a draft forward pass, accepted or not), and alignment (a draft that drifts from the target gets rejected often enough to erase the speedup, which is why distilled drafts work best).

Recipe Speculative Decoding (RSD).

Most of what an LLM emits in production is highly predictable from the prefix. Open brace, JSON key, colon, value. Function signature, arguments, return type. Common phrases. Reused boilerplate. For these patterns, you don't need a model to predict the next token. You just need a lookup table.

The recipe pack is that table. It is built at compile time by watching the teacher model emit during k-sample verified labeling. Every accepted output's deterministic-token sub-sequences are extracted and indexed by their prefix shape (the prefix's normalized fingerprint, not its literal text). At runtime, every token call consults the table first; on a hit, the predicted token is verified by the base in parallel; on a miss, the base produces normally.

The recipe pack is a draft mechanism with no draft model: zero parameters, zero memory beyond the table, zero forward-pass compute, and a hit rate of 50-80% on tasks the artifact was compiled for.
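For intuition, a single pack entry might look like the sketch below; the field names are hypothetical, not the shipped schema.

// Illustrative shape of one recipes.json entry (field names hypothetical).
const entry = {
  shape: "ps:7fd3a91c",        // normalized prefix-shape fingerprint
  literal: '"status": "',      // optional exact-byte prefix key
  token: 9052,                 // predicted next-token id
  count: 412,                  // frequency observed during the compile
};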

How the recipe pack is built and looked up.

1. Observe.

During Distill, the compiler runs k-sample verified inference. Every accepted output is parsed for deterministic-token subsequences. A subsequence is "deterministic" if its literal continuation is unambiguously implied by its prefix: concretely, the compiler aligns the k samples token by token, and positions where all k agree are marked deterministic.
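A minimal sketch of that agreement check, simplified to equal-length samples (the compiler's real alignment also handles insertions and deletions):

// Mark positions deterministic when all k accepted samples agree.
// Simplification: samples are equal-length token arrays.
function deterministicMask(samples) {
  const len = Math.min(...samples.map(s => s.length));
  const mask = [];
  for (let i = 0; i < len; i++) {
    mask.push(samples.every(s => s[i] === samples[0][i]));
  }
  return mask;
}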

2. Hash.

Each (prefix → token) pair is indexed two ways: by embedding-shape and by literal prefix. The embedding-shape lookup matches semantically similar prefixes; the literal-prefix lookup matches exact byte sequences. Both lookups are sub-microsecond.
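A sketch of the resulting two-level lookup; tailBytes and shapeFingerprint are stand-ins for the real keying functions:

// Two-level lookup: exact bytes first, then normalized prefix shape.
// tailBytes and shapeFingerprint are hypothetical helpers; both indexes
// are plain hash maps, hence the sub-microsecond cost.
function lookupRecipe(pack, prefix) {
  const literalHit = pack.byLiteral.get(tailBytes(prefix));
  if (literalHit !== undefined) return literalHit;
  return pack.byShape.get(shapeFingerprint(prefix)); // undefined on a miss
}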

3. Pack.

The compiler dedupes, sorts by frequency, and writes recipes.json. A typical pack for a support-triage compile is 1,200-1,800 entries totaling 12-18 KB. For a code-review compile, it's 8,000-15,000 entries at 80-150 KB. The pack is content-hashed and ships in the .kolm.
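A minimal sketch of that pack step, assuming the entry shape from above (the content hash shown uses Node's crypto; the real compiler's hashing is unspecified):

// Dedupe observed (shape, token) pairs, sort by frequency, content-hash.
import { createHash } from "node:crypto";

function packRecipes(pairs) {
  const byKey = new Map();
  for (const p of pairs) {
    const key = `${p.shape}:${p.token}`;
    const entry = byKey.get(key) ?? { ...p, count: 0 };
    entry.count++;
    byKey.set(key, entry);
  }
  const entries = [...byKey.values()].sort((a, b) => b.count - a.count);
  const body = JSON.stringify(entries);
  const hash = createHash("sha256").update(body).digest("hex");
  return { body, hash }; // body becomes recipes.json; hash names the pack
}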

4. Look up.

At runtime, before each forward pass, the runtime hashes the current prefix and queries the pack. If there's a hit, it chains further lookups to draft a run of tokens, and the base verifies the whole run in a single forward pass: drafted tokens are accepted up to the first disagreement, at which point the base's own token wins and the miss is logged.

If there's a miss, the base produces normally. Misses are common (the pack covers structured-token paths, not free prose). Hits are common enough to dominate end-to-end wall time on structured tasks.

// Pseudocode for the runtime decode loop
while (!eos) {
  // Draft: chain recipe-pack hits into a run of up to K tokens.
  const drafted = [];
  let hit;
  while (drafted.length < K && (hit = recipes.lookup([...prefix, ...drafted]))) {
    drafted.push(hit.token);
  }
  if (drafted.length > 0) {
    // Verify: one base forward pass checks the whole run and yields the
    // base's own token at the first disagreement (see the sketch above).
    const { accepted, correction } = await base.verify(prefix, drafted);
    if (accepted < drafted.length) {
      recipes.observeMiss(prefix, drafted[accepted], correction);
    }
    prefix.push(...drafted.slice(0, accepted), correction);
    n_drafted += accepted;
  } else {
    // Miss: the base produces normally, one token per forward pass.
    prefix.push(await base.next(prefix));
  }
}

Numbers: speedup, coverage, accuracy.

From our internal RSD benchmarks with Llama-3-8B as the target, on three structured tasks plus an open-domain control. RSD recipe pack vs. cold (no draft) vs. a 1.5B draft model.

task                           cold (tok/s)   1.5B draft (tok/s)   RSD pack (tok/s)   RSD coverage   RSD accuracy
HumanEval (Python codegen)     42             74                   128                71%            99.9%
JSON extraction (structured)   38             61                   158                83%            100.0%
GSM8K (math chain)             40             69                   96                 52%            99.9%
open-domain summary            40             59                   52                 22%            100.0%

Three things to read out of this table.

RSD wins big on structured tasks. JSON extraction goes from 38 tok/s cold to 158 tok/s with a recipe pack — 4.2× wall-time speedup at zero accuracy cost. The pack covers 83% of the tokens for free.
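A hedged back-of-envelope for reading coverage as speedup: if a fraction c of tokens arrives in accepted drafted runs of mean length r, the base does roughly (1 − c) + c/r forward passes per token, and the speedup ceiling is the reciprocal. The table doesn't report r, and this ignores the modest extra cost of scoring a run in one pass; still, the JSON-extraction row's 4.2× at c = 0.83 is consistent with runs of roughly a dozen tokens.

// Back-of-envelope: speedup from coverage c and mean accepted run length r.
// Assumes a drafted run of r tokens is verified in one base forward pass.
function speedupCeiling(c, r) {
  const passesPerToken = (1 - c) + c / r;
  return 1 / passesPerToken;
}
// speedupCeiling(0.83, 12) ≈ 4.18, consistent with the JSON-extraction row.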

RSD does not help on free prose. A summary task is mostly novel tokens; the recipe pack hit rate is 22%, the speedup is small. That's correct behavior. The recipe pack should not be helping on free prose — it has nothing to predict that a model wouldn't have to figure out anyway.

The 1.5B draft beats RSD on free prose. If your workload is dominated by open-domain text, a draft model is the better choice. We support both; the compiler picks the right one for the task.

Integrating with llama.cpp / vLLM.

RSD hooks into the decode loop as a draft callback. Both llama.cpp and vLLM support a draft callback hook, and our reference patches wire the pack into each:

// llama.cpp — register a recipe-pack callback
#include "kolm-rsd.h"
kolm_rsd_t *rsd = kolm_rsd_load("recipes.json");
llama_speculative_set_callback(ctx, kolm_rsd_propose, rsd);

# vLLM — pass the pack at startup
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B \
  --speculative-config "kolm-rsd:recipes.json"

For users running through kolm run, this is automatic — the runtime mounts the pack from the artifact and wires it into the decode loop.

Why RSD compounds across compiles.

Every kolm compile contributes new (prefix-shape, token) pairs to a public registry — opt-in, anonymized to the prefix-shape level (no user data ever touches the registry). The registry is downloadable. Every future compile starts from a richer cache. The marginal compile cost shrinks over time. Tasks that are common in the registry (extracting JSON, writing simple Python, reviewing diffs) compile faster and cheaper because most of their deterministic tokens are already known.
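As a sketch of what seeding a compile from the registry might look like (the endpoint and entry fields are hypothetical; only opt-in, prefix-shape-level data is involved):

// Seed a compile's recipe cache from the public registry (hypothetical
// endpoint and fields; nothing here egresses user data).
async function seedFromRegistry(registryUrl, localEntries) {
  const res = await fetch(`${registryUrl}/recipes.json`);
  const shared = await res.json();
  const known = new Set(localEntries.map(e => `${e.shape}:${e.token}`));
  for (const e of shared) {
    if (!known.has(`${e.shape}:${e.token}`)) localEntries.push(e);
  }
  return localEntries;
}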

This is the analogue of a compiler's peephole-optimization library: code patterns recognized once become built-in rewrites, and every later compile benefits. The registry is what makes kolm a moat instead of a feature.

FAQ.

Is the recipe pack lossless?

Yes. The base always verifies the drafted token in parallel; if the base disagrees, the base wins. The output distribution is identical to the base's. RSD trades compute layout for wall-time, not quality.

What happens when my task changes?

Misses are logged locally (never egressed) and surfaced in the next kolm compile. The compiler resolves the new pattern via verified inference and merges it into the pack. The cache strictly grows.

Does RSD help on phones?

Yes — disproportionately. Phones are the most compute-constrained target, and the recipe pack adds zero compute and only a few KB of memory. We measure RSD's biggest end-to-end speedups on Apple Silicon at INT4: 2.5-4× on structured tasks, 1.0-1.2× on free prose.

How does this relate to Medusa, Lookahead, or EAGLE?

RSD is in the same family as retrieval-based speculative decoding (PLD, REST). The differences: RSD's drafts are scoped to the artifact's task, indexed by both literal prefix and embedding-shape, and shipped in a versioned pack rather than computed on the fly from a context window. EAGLE and Medusa add layers to the model itself; RSD doesn't touch the model.