Research / Decoding
EAGLE-3 and Lookahead: faster decoding without a draft model
Classical speculative decoding needs a smaller draft model trained to mimic the target. Two 2024-2025 methods get comparable speedups without one. EAGLE-3 reuses the target's own hidden states; Lookahead mines candidate n-grams from the generation as it runs.
The cost of draft models
Leviathan 2022 and Chen 2023 (DeepMind speculative sampling) established the pattern: run a small draft model to propose K tokens, run the target model once to verify K+1 positions in parallel, accept the longest matching prefix. Wall-clock speedup is 2-3× when the draft model is 10-20× smaller and well-aligned.
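In code, the greedy (T=0) variant of that loop looks roughly like the sketch below; draft_model and target_model are stand-in callables that return argmax next-token predictions for every position, not any particular library's API.

# greedy-case sketch of the draft-then-verify loop (Leviathan/Chen pattern)
def speculative_step(target_model, draft_model, context, k=4):
    # 1. the cheap draft proposes k tokens autoregressively
    proposed = []
    for _ in range(k):
        proposed.append(draft_model(context + proposed)[-1])

    # 2. one target forward pass scores all k+1 positions in parallel
    target_next = target_model(context + proposed)[-(k + 1):]

    # 3. accept the longest prefix on which target and draft agree
    accepted = []
    for i, tok in enumerate(proposed):
        if target_next[i] != tok:
            break
        accepted.append(tok)

    # 4. the target's own prediction at the first disagreement (or after the
    #    last accepted token) comes for free, so every step emits >= 1 token
    return context + accepted + [target_next[len(accepted)]]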
The problem with this pattern at the kolm scale: each tenant's adapter changes what the target predicts. A generic draft for Qwen-7B is well-aligned in pretraining distribution but drifts on a refund-flagger adapter that has been distilled into a specific tone. You end up needing a draft model per adapter, which is the same cardinality problem multi-LoRA serving solved on the target side.
EAGLE-3
Li 2025 (EAGLE-3) is the third generation of EAGLE. The key idea, refined across three papers (Li 2024 EAGLE, Li 2024 EAGLE-2, Li 2025 EAGLE-3): use the target model's own intermediate hidden states as the input to a tiny single-layer auto-regressive head that produces draft tokens. The head is trained once per base, costs ~1% of the base parameters, and benefits from every adapter the base wears.
# during a single forward pass on the target:
hidden = target.layers[L].output              # mid-layer hidden state (EAGLE-3 fuses several depths)
draft_tokens = []
for _ in range(4):                            # the head drafts autoregressively, one token per step
    draft_logits = eagle_head(hidden, draft_tokens)   # tiny head, ~100M params for a 7B base
    draft_tokens.append(sample(draft_logits))
# verify draft_tokens with one more target forward pass
EAGLE-3's incremental contribution over EAGLE-2 is multi-layer feature aggregation (fusing features from several depths instead of a single layer) and a training-time-test objective that simulates multi-step drafting during training rather than constraining the head to predict the target's features. Reported speedups on Llama-3-70B: 3.05× (EAGLE) → 4.26× (EAGLE-2) → 5.62× (EAGLE-3) at temperature 0, while matching greedy output exactly.
Lookahead decoding
Fu 2024 (Lookahead) takes a different angle: skip the draft model entirely. It uses a Jacobi-iteration trick: at each step a lookahead branch runs fixed-point iterations over guessed future positions in parallel with normal decoding, the n-grams those iterations converge on are banked in a pool, and a verification branch checks pool candidates against the target within the same forward pass.
# lookahead verification step (pseudocode):
guess = pick_from(ngram_pool.candidates(last_token))   # an n-gram banked from earlier Jacobi iterations
target_check = target(context + guess)                 # one forward pass covers every guessed position
accepted = longest prefix of guess that target_check agrees with
Lookahead requires no extra parameters and no extra training. It scales naturally with longer contexts, where the n-gram bank grows. Speedups are typically 1.5-2.5× on conversation-style workloads and degrade on highly unpredictable outputs (e.g., open-ended creative writing). It is the right pick for high-cardinality multi-tenant settings where training and maintaining an EAGLE head per base does not pay for itself.
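A minimal sketch of that n-gram bank (class and method names are ours, not Fu et al.'s implementation): continuations are keyed by their first token and the pool is capped with simple LRU eviction, which is the knob the edge-cases section below discusses.

from collections import OrderedDict

class NgramPool:
    """Illustrative n-gram pool in the spirit of Lookahead's caching branch."""
    def __init__(self, n=4, cap=64):
        self.n, self.cap = n, cap
        self.pool = OrderedDict()                    # first token -> (n-1)-token continuations

    def add(self, trajectory):
        # bank every n-gram observed in a Jacobi trajectory
        for i in range(len(trajectory) - self.n + 1):
            key, cont = trajectory[i], tuple(trajectory[i + 1:i + self.n])
            conts = self.pool.setdefault(key, [])
            if cont not in conts:
                conts.append(cont)
            self.pool.move_to_end(key)               # keep recently seen keys alive
            while len(self.pool) > self.cap:
                self.pool.popitem(last=False)        # evict the least recently used key

    def candidates(self, last_token):
        # continuations to verify against the target in the same forward pass
        return self.pool.get(last_token, [])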
REST
He 2023 (REST) is a third draft-free method that retrieves draft sequences from a corpus rather than generating them. For RAG-style workloads where the same passages appear repeatedly across queries, REST hits 2-3× speedup with zero additional parameters. Kolm uses REST in the retrieval-heavy paths (long-context QA) and EAGLE-3 in everything else.
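The retrieval step can be pictured with the toy sketch below, which scans a flat token list; the actual REST datastore indexes the corpus with a suffix array and assembles a draft tree from multiple matches, so treat this as an illustration of the idea rather than the system.

def retrieve_draft(corpus, context, max_suffix=16, k=8):
    # try the longest context suffix first, backing off to shorter ones
    for s in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-s:]
        for i in range(len(corpus) - s - k + 1):
            if corpus[i:i + s] == suffix:
                return corpus[i + s:i + s + k]       # propose what followed the match
    return []                                        # no match: plain decoding this step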
Which to pick
| Workload | Pick | Why |
|---|---|---|
| Single-tenant high-volume | EAGLE-3 | One head trained once, used forever |
| Many tenants, varied adapters | Lookahead | Zero training cost, scales with usage |
| RAG / long context | REST | Retrieval already in-flight |
| Code with repetition | Lookahead | N-grams reappear often |
| Tool-call structured output | EAGLE-3 | Schema is predictable, head trains well |
The kolm runtime call
from apps.runtime.spec_decode import SpecDecodeConfig

cfg = SpecDecodeConfig(
    method="eagle3",          # "eagle3" | "lookahead" | "rest" | "draft" | "off"
    eagle_head="qwen2.5-7b-eagle3-head@sha256:a1b2c3",
    k=4,                      # propose 4 tokens per step
    accept_threshold=0.0,     # greedy verify (temp=0)
)
runtime.serve(request, spec_decode=cfg)
The runtime automatically falls back to lookahead if no EAGLE head is registered for the base, and to off if the temperature is high enough that speculation no longer pays off (typically T > 1.2).
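As a sketch, that fallback policy amounts to something like the following (assuming SpecDecodeConfig is a dataclass; the function and registry names are illustrative, not the actual runtime internals):

import dataclasses

def resolve_spec_decode(cfg, base, temperature, head_registry):
    if temperature > 1.2:                                    # speculation no longer pays off
        return dataclasses.replace(cfg, method="off")
    if cfg.method == "eagle3" and base not in head_registry:
        return dataclasses.replace(cfg, method="lookahead")  # no EAGLE head for this base
    return cfg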
What the receipt records
{
  "spec_decode": {
    "method": "eagle3",
    "head_cid": "cidv1:sha256:a1b2c3...",
    "k": 4,
    "acceptance_rate": 0.78,
    "tokens_proposed": 312,
    "tokens_accepted": 243,
    "wall_clock_speedup": 4.9,
    "verified_identical_to_off": true
  }
}
verified_identical_to_off is the contract: speculative decoding must produce token-for-token identical output to non-speculative at the same temperature and seed. If it deviates, the runtime halts and reports. This is enforceable and we enforce it.
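How the fields relate, and what the identity check amounts to, as a sketch; generate and the stats attributes are stand-ins for the decode call, not real runtime API.

spec_tokens, stats = generate(request, spec_decode=cfg)
receipt = {
    "acceptance_rate": stats.tokens_accepted / stats.tokens_proposed,  # 243 / 312 = ~0.78 above
    "tokens_proposed": stats.tokens_proposed,
    "tokens_accepted": stats.tokens_accepted,
    "wall_clock_speedup": stats.baseline_latency / stats.latency,
}
# the contract: rerun without speculation at the same temperature and seed
plain_tokens, _ = generate(request, spec_decode=None)
receipt["verified_identical_to_off"] = (spec_tokens == plain_tokens)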
Edge cases
Temperature. Speculation pays off most at T=0 (greedy), where acceptance rates are highest. At T=0.7, expect ~70% of the T=0 speedup. At T > 1.0, speedup approaches 1×, so we disable speculation.
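The drop-off follows from Leviathan et al.'s analysis: with per-token acceptance rate alpha and k proposed tokens, the expected number of tokens emitted per target forward pass is (1 - alpha^(k+1)) / (1 - alpha) under independent acceptances, so the speedup decays quickly once temperature pushes alpha down. A quick sketch:

def expected_tokens_per_step(alpha: float, k: int) -> float:
    # expected tokens emitted per target verification pass (Leviathan et al. 2022),
    # assuming each proposed token is accepted independently with probability alpha
    return (1 - alpha ** (k + 1)) / (1 - alpha)

expected_tokens_per_step(0.8, 4)   # ~3.36 tokens per target pass
expected_tokens_per_step(0.5, 4)   # ~1.94 -- why high temperatures stop paying off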
Adapter compatibility. An EAGLE-3 head trained on the bare base generalizes to most LoRA adapters but degrades for high-rank or wide-coverage adapters. We K-score the head against a sample of adapters each release; if acceptance drops below 0.5, we retrain.
Long generations. Lookahead's overhead scales with the size of the n-gram cache. We cap at 64 entries by default; beyond that, eviction churn eats the speedup. EAGLE-3 is the better pick for outputs > 2k tokens.
Determinism. Speculation is verifier-bound: every emitted token has been checked by the target, so the speedup never changes the output. Acceptance rate is the only thing that varies between runs; outputs match bit-for-bit at the same seed. This is a load-bearing property for the kolm receipt chain.
Citations
Li et al. 2024. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. arXiv:2401.15077
Li et al. 2024. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. arXiv:2406.16858
Li et al. 2025. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. arXiv:2503.01840
Fu et al. 2024. Break the Sequential Dependency of LLM Inference Using Lookahead Decoding. arXiv:2402.02057
He et al. 2023. REST: Retrieval-Based Speculative Decoding. arXiv:2311.08252
Leviathan et al. 2022. Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192
Chen et al. 2023. Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318