One token per step is the slowest cell on the spreadsheet.
Transformer training and serving both treat the autoregressive sequence as the unit of work. Pre-training reads one batch of length L and computes L losses; serving emits one new token per forward pass. The hardware runs the same matrix multiplies either way, but the unit of useful output is a single token. Doubling the tokens supervised or emitted per step doubles the effective throughput at constant arithmetic cost. The literature has converged on this insight from two directions.
TST: token superposition in pre-training.
Token-Superposition Training (Nous Research, Peng, Gigant, Quesnelle, May 2026) collapses contiguous tokens into a single embedding "bag" during a superposition phase, then trains the model to predict the next bag with a multi-hot cross-entropy objective. A recovery phase reverts to standard next-token prediction for the final fraction of the schedule, recovering the inference-time shape.
The headline result is a 2.5x reduction in pre-training time at the 10B A1B mixture-of-experts scale, with parity or improvement on HellaSwag, ARC, and MMLU. The mechanism does not change the optimizer, parallelism, tokenizer, data, or model architecture; it is a drop-in objective swap.
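To make the objective concrete (this is our sketch, not the released code): during the superposition phase, k contiguous token embeddings are pooled into one input vector and the target becomes the set of the next bag's token ids. The summed-embedding pooling, the normalization, and the function names below are assumptions.

```python
import torch
import torch.nn.functional as F

def make_bags(embeds, token_ids, vocab_size, k):
    """Collapse k contiguous tokens into one input bag and one multi-hot target.
    embeds: (B, L, D) token embeddings; token_ids: (B, L); assumes L % k == 0."""
    B, L, D = embeds.shape
    bag_inputs = embeds.reshape(B, L // k, k, D).sum(dim=2)        # summed-embedding bag
    ids = token_ids.reshape(B, L // k, k)
    targets = torch.zeros(B, L // k, vocab_size, device=embeds.device)
    targets.scatter_(2, ids, 1.0)                                  # multi-hot: the set of k ids
    return bag_inputs, targets

def multi_hot_ce(logits, targets):
    """Cross-entropy against a multi-hot target: mean NLL over the bag members.
    With k = 1 this reduces to standard next-token cross-entropy."""
    log_p = F.log_softmax(logits, dim=-1)
    return -((targets * log_p).sum(-1) / targets.sum(-1).clamp(min=1)).mean()

# The backbone consumes bag_inputs and, as in next-token prediction, position t's
# logits are scored against the bag at position t + 1.
```

Note that with k = 1 the multi-hot loss degenerates to ordinary cross-entropy, which is what makes the recovery phase a pure schedule change rather than a second objective.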
PLT: patch-level training.
Beyond Next Token Prediction: Patch-Level Training for Large Language Models (Shao, Meng, Zhou; ICLR 2025 Spotlight) aggregates contiguous tokens into "patches" and trains the model to predict whole patches during the early phase. The remaining schedule switches back to token-level prediction so the inference graph stays unchanged. Reported result: 0.5x training cost across 370M to 2.7B parameter scales with no quality regression.
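The patch-level objective differs in the target: each patch position predicts every token of the next patch with ordinary cross-entropy rather than a multi-hot set. A sketch under our reading of the paper, assuming mean-pooled patch embeddings and an HF-style backbone that accepts inputs_embeds; this is not the authors' code.

```python
import torch
import torch.nn.functional as F

def patch_level_loss(backbone, lm_head, embeds, token_ids, K=4):
    """embeds: (B, L, D) token embeddings; token_ids: (B, L); assumes L % K == 0.
    backbone is an HF-style decoder accepting inputs_embeds (an assumption)."""
    B, L, D = embeds.shape
    patches = embeds.reshape(B, L // K, K, D).mean(dim=2)          # mean-pooled patch inputs
    hidden = backbone(inputs_embeds=patches).last_hidden_state     # unchanged transformer stack
    logits = lm_head(hidden[:, :-1])                               # (B, P-1, V): predict the next patch
    next_ids = token_ids.reshape(B, L // K, K)[:, 1:]              # (B, P-1, K): its K token ids
    # one position's logits supervise all K tokens of the following patch
    log_p = F.log_softmax(logits, dim=-1).unsqueeze(2).expand(-1, -1, K, -1)
    nll = -log_p.gather(3, next_ids.unsqueeze(-1))
    return nll.mean()
```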
The two papers converge on the same physical insight: the dense matrix multiply does not care whether the supervision target is one token or k tokens; the optimizer signal can carry more bits per step. Nous acknowledges PLT as the prior art for the framing.
Why neither directly applies to LoRA distill.
kolm does not pre-train. A .kolm artifact is a LoRA adapter over a frozen base model, distilled from captures and synthesized recipes. The trainer reads on the order of a few hundred to a few thousand examples and runs a handful of epochs. The pre-training-side techniques live at a different point on the schedule.
| Aspect | Pre-training (TST, PLT) | LoRA distill (kolm) |
|---|---|---|
| Data volume | 10^11-10^13 tokens | 10^2-10^4 examples |
| Wall time | weeks on a cluster | seconds to minutes on one GPU |
| Bottleneck | arithmetic per token | example efficiency |
| Useful lever | pack more supervision per step | find the right examples to begin with |
| Equivalent gain | 2x-2.5x | sequence packing + Liger fused kernels deliver a smaller win on the same axis |
The LoRA-distill analogue of multi-token-per-step prediction is sequence packing: concatenating short examples into a single sequence with masked attention so that one forward pass supervises many distinct examples. kolm's trainer_real.py uses this with Liger fused kernels (fused-RMSNorm, fused-RoPE, fused-SwiGLU, fused-CE) and paged 8-bit AdamW, which together compound to a 1.6x-2.0x training-time reduction on a 5090 versus a baseline transformers loop. The win is smaller than 2.5x because the bottleneck has already moved off arithmetic and onto example efficiency.
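A minimal sketch of the packing step itself, independent of Liger and the optimizer; the greedy first-fit policy and the names are illustrative, not a copy of trainer_real.py.

```python
import torch

def pack_examples(examples, max_len, pad_id):
    """Greedy first-fit packing of tokenized examples into fixed-length rows.
    Assumes each example already fits in max_len. Returns input_ids plus a
    block-diagonal attention mask so examples cannot attend across boundaries
    (it gets combined with the usual causal mask downstream)."""
    rows, segs = [], []
    cur_ids, cur_seg, seg = [], [], 0
    for ex in examples:                                   # ex: list[int] of token ids
        if cur_ids and len(cur_ids) + len(ex) > max_len:
            rows.append(cur_ids); segs.append(cur_seg)
            cur_ids, cur_seg, seg = [], [], 0
        cur_ids += ex
        cur_seg += [seg] * len(ex)
        seg += 1
    if cur_ids:
        rows.append(cur_ids); segs.append(cur_seg)

    input_ids = torch.full((len(rows), max_len), pad_id, dtype=torch.long)
    segment   = torch.full((len(rows), max_len), -1, dtype=torch.long)
    for i, (r, s) in enumerate(zip(rows, segs)):
        input_ids[i, : len(r)] = torch.tensor(r)
        segment[i, : len(s)] = torch.tensor(s)

    # token i may attend to token j only if both sit in the same packed example
    same_seg = segment.unsqueeze(2) == segment.unsqueeze(1)
    attn_mask = same_seg & (segment >= 0).unsqueeze(1)
    return input_ids, attn_mask
```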
Speculative decoding: the inference-side dual.
Inference is where multi-token-per-step has a clean, mature implementation: speculative decoding. A small draft model proposes k tokens; the large target model verifies them in one forward pass. Accepted tokens are committed; the first rejected token is corrected. The expected number of tokens committed per target pass is (1 - α^(k+1)) / (1 - α), where α is the per-token acceptance rate; after charging for the draft model's own passes, the net gain lands at 2x-3x on typical chat workloads.
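The 2x-3x figure is just that formula plus a cost model; a back-of-envelope with illustrative acceptance rates and a ~0.2 draft-to-target cost ratio (roughly 1.5B vs 7B):

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per target forward pass: geometric series in
    the per-token acceptance rate alpha (as in Leviathan et al. 2023)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def net_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Throughput gain over plain decoding, charging draft_cost target-passes
    for the k drafted tokens in each round."""
    return expected_tokens_per_pass(alpha, k) / (1 + k * draft_cost)

# illustrative numbers, not measurements
print(net_speedup(0.75, 5, 0.2))  # ~1.64x at a 75% acceptance rate
print(net_speedup(0.90, 5, 0.2))  # ~2.34x at 90%
```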
kolm's serving runtime in apps/runtime/serve.py turns this on automatically when the manifest declares a draft model. The pairings are selected by apps/trainer/speculative.py:pick_draft:
```python
# apps/trainer/speculative.py
DRAFT_PAIRINGS = {
    "Qwen/Qwen2.5-7B-Instruct": "Qwen/Qwen2.5-1.5B-Instruct",
    "Qwen/Qwen2.5-14B-Instruct": "Qwen/Qwen2.5-3B-Instruct",
    "meta-llama/Llama-3.2-3B-Instruct": "meta-llama/Llama-3.2-1B-Instruct",
    "meta-llama/Llama-3.1-8B-Instruct": "meta-llama/Llama-3.2-1B-Instruct",
    "google/gemma-3-12b-it": "google/gemma-3-1b-it",
}
```
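The lookup itself is presumably a one-liner; a sketch of the shape pick_draft likely has, with the None fallback being our assumption rather than the repo's documented behavior.

```python
def pick_draft(target_model: str) -> str | None:
    """Return the paired draft model for a target, or None to skip speculation.
    Sketch only: the real pick_draft in apps/trainer/speculative.py may differ."""
    return DRAFT_PAIRINGS.get(target_model)
```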
When the runtime boots vLLM (the preferred engine), it passes speculative_model and num_speculative_tokens=5 to the LLM constructor. vLLM handles the verification path in a single fused pass. The transformers fallback uses HuggingFace's assistant_model argument on model.generate(), which implements the same algorithm without batched verification.
```python
# apps/runtime/serve.py
llm_kwargs = dict(
    model=target_model,
    enable_prefix_caching=True,
    kv_cache_dtype="fp8",  # Hopper / Blackwell only
    dtype="auto",
)
if draft_model:
    llm_kwargs["speculative_model"] = draft_model
    llm_kwargs["num_speculative_tokens"] = 5
llm = LLM(**llm_kwargs)
```
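For completeness, the fallback path sketched with the stock assisted-generation API; the model pairing comes from the table above, but the loading code and generation parameters here are illustrative, not serve.py verbatim.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

target_model = "Qwen/Qwen2.5-7B-Instruct"
draft_model = "Qwen/Qwen2.5-1.5B-Instruct"
prompt = "Summarize the capture in one sentence."

tokenizer = AutoTokenizer.from_pretrained(target_model)
target = AutoModelForCausalLM.from_pretrained(target_model, torch_dtype="auto", device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained(draft_model, torch_dtype="auto", device_map="auto")

inputs = tokenizer(prompt, return_tensors="pt").to(target.device)
output = target.generate(
    **inputs,
    assistant_model=assistant,  # HF assisted generation: draft proposes, target verifies
    max_new_tokens=256,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```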
What kolm ships today.
The serving runtime composes five throughput techniques in one path. None of these are research claims; they are a pip install away in a real wheelhouse and configured by default when the box supports them.
- Speculative decoding. Draft 1.5B + target 7B, 5 speculative tokens, vLLM-batched verification.
- FP8 KV cache. Auto-enabled on compute capability 9+ (Hopper H100/H200, Blackwell sm_120). Cuts KV memory 2x and accelerates the verify pass on the same hardware.
- PagedAttention + prefix cache. Standard vLLM; the prefix cache compounds for chat workloads where the system prompt repeats.
- AWQ / GPTQ int4 weights. When the artifact ships quantized, vLLM loads it natively. 4x memory reduction over bf16 weights, no measurable quality loss on the standard rubrics.
- NVFP4 detect (training). On a Blackwell box with torch ≥ 2.8, `apps/trainer/nvfp4.py:detect` returns `enabled=True` and the trainer can compile in 4-bit float. Opt-in via `KOLM_NVFP4=1`; a sketch of the detection logic follows this list.
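What a capability probe like that plausibly checks, as a guess at detect's internals (env opt-in, torch version gate, Blackwell compute capability); the real apps/trainer/nvfp4.py may differ.

```python
import os
import torch

def detect() -> dict:
    """Hypothetical shape of the NVFP4 probe: KOLM_NVFP4 opt-in, torch >= 2.8,
    and a Blackwell-class GPU. Field names and thresholds are illustrative."""
    if os.environ.get("KOLM_NVFP4") != "1":
        return {"enabled": False, "reason": "KOLM_NVFP4 not set"}
    major, minor = (int(x) for x in torch.__version__.split(".")[:2])
    if (major, minor) < (2, 8):
        return {"enabled": False, "reason": f"torch {torch.__version__} < 2.8"}
    if not torch.cuda.is_available():
        return {"enabled": False, "reason": "no CUDA device"}
    cc = torch.cuda.get_device_capability()
    if cc < (10, 0):  # Blackwell-class (sm_100 / sm_120)
        return {"enabled": False, "reason": f"compute capability {cc} below Blackwell"}
    return {"enabled": True, "reason": "Blackwell GPU with torch >= 2.8"}
```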
Throughput math.
For a Qwen2.5-7B target serving a typical chat workload on a single H100:
| Path | Tokens/sec | Source |
|---|---|---|
| transformers baseline | ~35 | HF reference loop, bf16, no batching |
| vLLM PagedAttention | ~110 | vLLM 0.6 benchmark, bf16, batch=1 |
| vLLM + FP8 KV | ~145 | vLLM 0.6 benchmark, fp8 kv |
| vLLM + FP8 KV + spec | ~280 | kolm config (draft 1.5B, 5 tokens) |
| vLLM + FP8 KV + spec + AWQ | ~340 | kolm config (int4 target) |
Numbers above 145 tokens/sec are the kolm composition. The exact figure depends on the draft-acceptance rate, which depends on the task; the range we have seen across a half-dozen workloads is 220-360 tok/s on H100. The arithmetic does not care about the marketing; the receipt chain records the engine, the draft model, and the KV dtype so the number is auditable.
Pattern-match recipes are a separate path. Most `.kolm` artifacts compile to a deterministic JS function. For those, `kolm bench` on a workstation reports a p50 of 1 µs and a steady-state throughput north of 1,000,000 calls per second. They are not generative; they are the unfused embedding lookup at the bottom of every distill.
Open work.
Three threads we are watching.
- Medusa-style multi-head decoding. Train k decoding heads on top of the base, predict k tokens per step without a separate draft model. Lands on the same axis as TST but on the serving side. Practical wrinkle: the head training would have to fit inside the LoRA distill schedule, which constrains the parameter count of the heads.
- EAGLE-2 and Lookahead Decoding. Different proposal mechanisms (tree proposal, n-gram lookahead). The vLLM team has landed EAGLE-2 as an option; we will switch on whichever path the upstream benchmark prefers.
- Patch-level supervision for LoRA distill. If the captures are long enough, patch-level training could apply during the LoRA fine-tune phase, not just pre-training. The risk is overfitting on the patch boundary; the win, if it works, is a 1.5x speedup on top of sequence packing. Out of scope for v1.0; on the radar for v1.2.
kolm's job is not to invent new training methods. Our job is to ship the receipt chain that proves the artifact in your hand came out of the build pipeline you authorized, with a K-score that says it beats the alternative on your evals. The frontier techniques are off-the-shelf and we wire them in; the trust layer is what we build.