Research / Decoding
EAGLE-3 and Lookahead: faster decoding without a draft model
Classical speculative decoding needs a smaller draft model trained to mimic the target. Two 2024-2025 methods get comparable speedups without one. EAGLE-3 reuses the target's own hidden states; Lookahead mines candidate n-grams from the generation as it runs.
The cost of draft models
Leviathan 2022 and Chen 2023 (DeepMind speculative sampling) established the pattern: run a small draft model to propose K tokens, run the target model once to verify K+1 positions in parallel, accept the longest matching prefix. Wall-clock speedup is 2-3× when the draft model is 10-20× smaller and well-aligned.
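In code, the greedy (T=0) variant of that loop looks roughly like the sketch below; draft_model and target_model are stand-in callables that return argmax next-token predictions for every position, not any particular library's API.

# greedy-case sketch of the draft-then-verify loop (Leviathan/Chen pattern)
def speculative_step(target_model, draft_model, context, k=4):
    # 1. the cheap draft proposes k tokens autoregressively
    proposed = []
    for _ in range(k):
        proposed.append(draft_model(context + proposed)[-1])

    # 2. one target forward pass scores all k+1 positions in parallel
    target_next = target_model(context + proposed)[-(k + 1):]

    # 3. accept the longest prefix on which target and draft agree
    accepted = []
    for i, tok in enumerate(proposed):
        if target_next[i] != tok:
            break
        accepted.append(tok)

    # 4. the target's own prediction at the first disagreement (or after the
    #    last accepted token) comes for free, so every step emits >= 1 token
    return context + accepted + [target_next[len(accepted)]]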
The problem with this pattern at the kolm scale: each tenant's adapter changes what the target predicts. A generic draft for Qwen-7B is well-aligned in pretraining distribution but drifts on a refund-flagger adapter that has been distilled into a specific tone. You end up needing a draft model per adapter, which is the same cardinality problem multi-LoRA serving solved on the target side.
EAGLE-3
Li 2025 (EAGLE-3) is the third generation of EAGLE. The key idea, refined across three papers (Li 2024 EAGLE, Li 2024 EAGLE-2, Li 2025 EAGLE-3): use the target model's own intermediate hidden states as the input to a tiny single-layer auto-regressive head that produces draft tokens. The head is trained once per base, costs ~1% of the base parameters, and benefits from every adapter the base wears.
# during a single forward pass on the target:
hidden = target.layers[L].output              # mid-layer hidden state (EAGLE-3 fuses several depths)
draft_tokens = []
for _ in range(4):                            # the head drafts autoregressively, one token per step
    draft_logits = eagle_head(hidden, draft_tokens)   # tiny head, ~100M params for a 7B base
    draft_tokens.append(sample(draft_logits))
# verify draft_tokens with one more target forward pass
EAGLE-3's incremental contribution over EAGLE-2 is multi-layer feature aggregation (fusing features from several depths instead of a single layer) and a training-time-test objective that simulates multi-step drafting during training rather than constraining the head to predict the target's features. Reported speedups on Llama-3-70B: 3.05× (EAGLE) → 4.26× (EAGLE-2) → 5.62× (EAGLE-3) at temperature 0, while matching greedy output exactly.
Lookahead decoding
Fu 2024 (Lookahead) takes a different angle: skip the draft model entirely. It uses a Jacobi-iteration trick: at each step a lookahead branch runs fixed-point iterations over guessed future positions in parallel with normal decoding, the n-grams those iterations converge on are banked in a pool, and a verification branch checks pool candidates against the target within the same forward pass.
# lookahead verification step (pseudocode):
guess = pick_from(ngram_pool.candidates(last_token))   # an n-gram banked from earlier Jacobi iterations
target_check = target(context + guess)                 # one forward pass covers every guessed position
accepted = longest prefix of guess that target_check agrees with
Lookahead requires no extra parameters and no extra training. It scales naturally with longer contexts, where the n-gram bank grows. Speedups are typically 1.5-2.5× on conversation-style workloads and degrade on highly unpredictable outputs (e.g., open-ended creative writing). It is the right pick for high-cardinality multi-tenant settings where training and maintaining an EAGLE head per base does not pay for itself.
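A minimal sketch of that n-gram bank (class and method names are ours, not Fu et al.'s implementation): continuations are keyed by their first token and the pool is capped with simple LRU eviction, which is the knob the edge-cases section below discusses.

from collections import OrderedDict

class NgramPool:
    """Illustrative n-gram pool in the spirit of Lookahead's caching branch."""
    def __init__(self, n=4, cap=64):
        self.n, self.cap = n, cap
        self.pool = OrderedDict()                    # first token -> (n-1)-token continuations

    def add(self, trajectory):
        # bank every n-gram observed in a Jacobi trajectory
        for i in range(len(trajectory) - self.n + 1):
            key, cont = trajectory[i], tuple(trajectory[i + 1:i + self.n])
            conts = self.pool.setdefault(key, [])
            if cont not in conts:
                conts.append(cont)
            self.pool.move_to_end(key)               # keep recently seen keys alive
            while len(self.pool) > self.cap:
                self.pool.popitem(last=False)        # evict the least recently used key

    def candidates(self, last_token):
        # continuations to verify against the target in the same forward pass
        return self.pool.get(last_token, [])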
REST
He 2023 (REST) is a third draft-free method that retrieves draft sequences from a corpus rather than generating them. For RAG-style workloads where the same passages appear repeatedly across queries, REST hits 2-3× speedup with zero additional parameters. Kolm uses REST in the retrieval-heavy paths (long-context QA) and EAGLE-3 in everything else.
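The retrieval step can be pictured with the toy sketch below, which scans a flat token list; the actual REST datastore indexes the corpus with a suffix array and assembles a draft tree from multiple matches, so treat this as an illustration of the idea rather than the system.

def retrieve_draft(corpus, context, max_suffix=16, k=8):
    # try the longest context suffix first, backing off to shorter ones
    for s in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-s:]
        for i in range(len(corpus) - s - k + 1):
            if corpus[i:i + s] == suffix:
                return corpus[i + s:i + s + k]       # propose what followed the match
    return []                                        # no match: plain decoding this step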
Which to pick
| Workload | Pick | Why |
|---|---|---|
| Single-tenant high-volume | EAGLE-3 | One head trained once, used forever |
| Many tenants, varied adapters | Lookahead | Zero training cost, scales with usage |
| RAG / long context | REST | Retrieval already in-flight |
| Code with repetition | Lookahead | N-grams reappear often |
| Tool-call structured output | EAGLE-3 | Schema is predictable, head trains well |
The kolm runtime call
from apps.runtime.spec_decode import SpecDecodeConfig

cfg = SpecDecodeConfig(
    method="eagle3",          # "eagle3" | "lookahead" | "rest" | "draft" | "off"
    eagle_head="qwen2.5-7b-eagle3-head@sha256:a1b2c3",
    k=4,                      # propose 4 tokens per step
    accept_threshold=0.0,     # greedy verify (temp=0)
)
runtime.serve(request, spec_decode=cfg)
The runtime automatically falls back to lookahead if no EAGLE head is registered for the base, and to off if the temperature is high enough that speculation no longer pays off (typically T > 1.2).
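As a sketch, that fallback policy amounts to something like the following (assuming SpecDecodeConfig is a dataclass; the function and registry names are illustrative, not the actual runtime internals):

import dataclasses

def resolve_spec_decode(cfg, base, temperature, head_registry):
    if temperature > 1.2:                                    # speculation no longer pays off
        return dataclasses.replace(cfg, method="off")
    if cfg.method == "eagle3" and base not in head_registry:
        return dataclasses.replace(cfg, method="lookahead")  # no EAGLE head for this base
    return cfg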
What the receipt records
{
  "spec_decode": {
    "method": "eagle3",
    "head_cid": "cidv1:sha256:a1b2c3...",
    "k": 4,
    "acceptance_rate": 0.78,
    "tokens_proposed": 312,
    "tokens_accepted": 243,
    "wall_clock_speedup": 4.9,
    "verified_identical_to_off": true
  }
}
verified_identical_to_off is the contract: speculative decoding must produce token-for-token identical output to non-speculative at the same temperature and seed. If it deviates, the runtime halts and reports. This is enforceable and we enforce it.
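How the fields relate, and what the identity check amounts to, as a sketch; generate and the stats attributes are stand-ins for the decode call, not real runtime API.

spec_tokens, stats = generate(request, spec_decode=cfg)
receipt = {
    "acceptance_rate": stats.tokens_accepted / stats.tokens_proposed,  # 243 / 312 = ~0.78 above
    "tokens_proposed": stats.tokens_proposed,
    "tokens_accepted": stats.tokens_accepted,
    "wall_clock_speedup": stats.baseline_latency / stats.latency,
}
# the contract: rerun without speculation at the same temperature and seed
plain_tokens, _ = generate(request, spec_decode=None)
receipt["verified_identical_to_off"] = (spec_tokens == plain_tokens)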
Edge cases
Temperature. Speculation pays off most at T=0 (greedy), where acceptance rates are highest. At T=0.7, expect ~70% of the T=0 speedup. At T > 1.0, speedup approaches 1×, so we disable speculation.
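The drop-off follows from Leviathan et al.'s analysis: with per-token acceptance rate alpha and k proposed tokens, the expected number of tokens emitted per target forward pass is (1 - alpha^(k+1)) / (1 - alpha) under independent acceptances, so the speedup decays quickly once temperature pushes alpha down. A quick sketch:

def expected_tokens_per_step(alpha: float, k: int) -> float:
    # expected tokens emitted per target verification pass (Leviathan et al. 2022),
    # assuming each proposed token is accepted independently with probability alpha
    return (1 - alpha ** (k + 1)) / (1 - alpha)

expected_tokens_per_step(0.8, 4)   # ~3.36 tokens per target pass
expected_tokens_per_step(0.5, 4)   # ~1.94 -- why high temperatures stop paying off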
Adapter compatibility. An EAGLE-3 head trained on the bare base generalizes to most LoRA adapters but degrades for high-rank or wide-coverage adapters. We K-score the head against a sample of adapters each release; if acceptance drops below 0.5, we retrain.
Long generations. Lookahead's overhead scales with the size of the n-gram cache. We cap at 64 entries by default; beyond that, eviction churn eats the speedup. EAGLE-3 is the better pick for outputs > 2k tokens.
Determinism. Speculation is verifier-bound: every emitted token has been checked by the target, so the speedup never changes the output. Acceptance rate is the only thing that varies between runs; outputs match bit-for-bit at the same seed. This is a load-bearing property for the kolm receipt chain.
Citations
Li et al. 2024. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. arXiv:2401.15077
Li et al. 2024. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. arXiv:2406.16858
Li et al. 2025. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. arXiv:2503.01840
Fu et al. 2024. Break the Sequential Dependency of LLM Inference Using Lookahead Decoding. arXiv:2402.02057
He et al. 2023. REST: Retrieval-Based Speculative Decoding. arXiv:2311.08252
Leviathan et al. 2022. Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192
Chen et al. 2023. Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318