Response distillation.
Kolmogorov complexity is about compressing a string into the shortest program that emits it. Response distillation is the same idea for language models: compress a 14B teacher's behavior on the buyer's prompts into a 3B student that emits the same token distribution. The student is the artifact we sign and ship; the teacher never leaves the trainer.
The loss, in one line
For every response position, the teacher produces a vocabulary-wide logit vector and the student produces its own. The distillation loss is the KL divergence between the two softmaxes, taken at a shared temperature T and scaled by T² so the gradient magnitude does not shrink as T grows:
L_kd = KL( softmax(t/T) || softmax(s/T) ) * T^2
Plus a small cross-entropy term against the teacher's sampled token so the student also learns the surface form, not only the distribution shape:
L = α * L_kd + (1 - α) * L_ce
The standard recipe is α = 0.9, T = 2.0. The temperature has two roles. First, it dampens the teacher's argmax so the student sees the second- and third-most-likely tokens at every position; that is what Hinton called the "dark knowledge" the student would never see from sampled labels alone. Second, it shapes the gradient magnitude; T² is the chain-rule correction.
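A minimal sketch of that combined loss, assuming `s_logits` and `t_logits` are (batch, positions, vocab) logits at response positions and `labels` holds the teacher's sampled token ids; the names are illustrative, not the distill.py API:

```python
import torch
import torch.nn.functional as F

def kd_loss(s_logits, t_logits, labels, T=2.0, alpha=0.9):
    V = s_logits.size(-1)
    # Forward KL between the temperature-scaled distributions, scaled by T^2 so the
    # gradient magnitude does not shrink as T grows. (Prompt/pad masking omitted.)
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1).reshape(-1, V),   # student log-probs
        F.log_softmax(t_logits / T, dim=-1).reshape(-1, V),   # teacher log-probs (target)
        reduction="batchmean",
        log_target=True,
    ) * (T ** 2)
    # Small cross-entropy term against the teacher's sampled tokens.
    ce = F.cross_entropy(s_logits.reshape(-1, V), labels.reshape(-1))
    return alpha * kd + (1 - alpha) * ce
```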
Three objectives, three failure modes
| Objective | What it minimizes | When it fails |
|---|---|---|
| Forward KL | KL(t \|\| s): the student tries to cover all the teacher's mass | Student spreads probability across modes the teacher rejects (the "mean-seeking" failure) |
| Reverse KL | KL(s \|\| t): the student concentrates on what the teacher likes most (MiniLLM) | Student becomes overconfident; misses minority but valid alternatives |
| JSD | Symmetric mixture: 0.5*KL(s\|\|m) + 0.5*KL(t\|\|m), with m the 50/50 mixture of s and t | Most expensive of the three; bounded loss, so gradients soften far from the teacher |
The default in apps/trainer/distill.py is forward-KL, but the buyer's spec can request reverse-KL when the task admits one good answer and many wrong ones (PHI redaction, single-label classification), and JSD when the policy needs to remain calibrated for downstream best-of-N selection.
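The three directions, sketched over per-position log-probabilities. `s_lp` and `t_lp` are assumed to be temperature-scaled log-softmax outputs of shape (N, V); the objective strings mirror the receipt field, but the function is illustrative, and the per-token reverse KL is a simplification of MiniLLM's policy-gradient estimator:

```python
import math
import torch
import torch.nn.functional as F

def kd_objective(s_lp, t_lp, objective="forward_kl"):
    # s_lp, t_lp: temperature-scaled log-probabilities, shape (N, V).
    if objective == "forward_kl":    # KL(t || s): mass-covering
        return F.kl_div(s_lp, t_lp, reduction="batchmean", log_target=True)
    if objective == "reverse_kl":    # KL(s || t): mode-seeking, as in MiniLLM
        return F.kl_div(t_lp, s_lp, reduction="batchmean", log_target=True)
    if objective == "jsd":           # 0.5*KL(s||m) + 0.5*KL(t||m), m = 50/50 mixture
        m_lp = torch.logsumexp(torch.stack([s_lp, t_lp]), dim=0) - math.log(2)
        return 0.5 * (F.kl_div(m_lp, s_lp, reduction="batchmean", log_target=True)
                      + F.kl_div(m_lp, t_lp, reduction="batchmean", log_target=True))
    raise ValueError(f"unknown objective: {objective}")
```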
The top-k logit-pruning trick
A full-vocabulary KL is (B, T, V)-shaped; for Qwen-class tokenizers V ≈ 152K, and the logit activations dominate GPU memory. The pruning trick: keep only the top-k teacher tokens (by teacher probability) and aggregate the rest into a single "other" bucket via logsumexp on both sides:
import torch

def topk_prune(s_logits, t_logits, k):
    # Keep the k tokens the teacher ranks highest; fold everything else into one bucket.
    _, idx = torch.topk(t_logits, k=k, dim=-1)
    t_top = t_logits.gather(-1, idx)
    s_top = s_logits.gather(-1, idx)
    # mask is True at every non-top-k position.
    mask = torch.ones_like(t_logits, dtype=torch.bool)
    mask.scatter_(-1, idx, False)
    # logsumexp over the non-top-k logits becomes the "other" bucket on both sides.
    t_other = t_logits.masked_fill(~mask, float("-inf")).logsumexp(-1, keepdim=True)
    s_other = s_logits.masked_fill(~mask, float("-inf")).logsumexp(-1, keepdim=True)
    return torch.cat([s_top, s_other], -1), torch.cat([t_top, t_other], -1)
For k = 64 on a 152K-token vocabulary the memory drop is ≈ 2000x; the quality drop is in the third decimal place of held-out perplexity. Worth it for any teacher whose full-vocabulary logits will not fit in VRAM alongside the student.
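A usage sketch on toy shapes, assuming the `topk_prune` above; the temperature is applied before pruning so the "other" bucket is aggregated at the same temperature as the kept tokens:

```python
import torch
import torch.nn.functional as F

B, L, V, k, T = 1, 128, 152064, 64, 2.0   # toy batch, Qwen-class vocab
s_logits = torch.randn(B, L, V)           # student logits, (B, T, V)
t_logits = torch.randn(B, L, V)           # teacher logits, (B, T, V)

# Scale by 1/T first, then prune: the "other" bucket then aggregates the
# non-top-k mass at the same temperature as the kept tokens.
s_pruned, t_pruned = topk_prune(s_logits / T, t_logits / T, k)   # each (B, L, k + 1)

loss = F.kl_div(
    F.log_softmax(s_pruned, dim=-1).reshape(-1, k + 1),
    F.log_softmax(t_pruned, dim=-1).reshape(-1, k + 1),
    reduction="batchmean",
    log_target=True,
) * T ** 2
```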
Off-policy versus on-policy
Off-policy distillation: the teacher's response (or a captured ground-truth response) is the target sequence; the student's gradients are computed on tokens it did not generate. Cheap because the teacher runs once per row at data-preparation time.
On-policy distillation (Agarwal 2024): sample the response from the student, then compute the loss between the student's logits and the teacher's logits at the student's positions. Double the forward-pass cost (teacher + student each run on every batch), but resolves the well-known "train/test mismatch" problem where the student is taught on sequences it would never generate. Agarwal shows the on-policy variant beats off-policy by 1-3 points on MMLU for student-teacher gaps larger than 5x.
kolm exposes the choice as DistillConfig.on_policy; the default is off-policy because the typical kolm buyer has a 14B→3B gap (2-3x) where the lift is small and the doubled cost is not worth it. Buyers with cross-family teachers (e.g. distilling a 70B Llama into a 3B Qwen) should enable on-policy.
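One on-policy step, sketched under assumptions: Hugging Face-style causal-LM `student` and `teacher` sharing a tokenizer, an unpadded prompt batch, and the `kd_loss` sketch above. `DistillConfig.on_policy` is the real switch; the loop itself is illustrative.

```python
import torch

def on_policy_step(student, teacher, prompt_ids, T=2.0, alpha=0.9, max_new_tokens=256):
    # 1. The student samples its own response; these are the only positions we train on.
    with torch.no_grad():
        seq = student.generate(prompt_ids, do_sample=True, max_new_tokens=max_new_tokens)
    prompt_len = prompt_ids.size(1)
    # 2. Both models score the student's sequence; logits at position i predict token i + 1.
    s_logits = student(seq).logits[:, prompt_len - 1 : -1]
    with torch.no_grad():
        t_logits = teacher(seq).logits[:, prompt_len - 1 : -1]
    labels = seq[:, prompt_len:]          # the tokens the student actually emitted
    # 3. Same KD loss as the off-policy path, just evaluated at the student's positions.
    return kd_loss(s_logits, t_logits, labels, T=T, alpha=alpha)
```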
Why same-family tokenizers
Token-level KD requires the teacher and student to share a vocabulary. If they do not, index i in the teacher's logit vector does not correspond to index i in the student's, and the token-level KL is undefined. apps/trainer/distill.py raises explicitly if the tokenizers diverge: "teacher and student tokenizers must share a vocabulary for token-level KD."
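The guard is roughly this shape (a hypothetical helper, assuming Hugging Face tokenizers; the actual check in distill.py may differ):

```python
def assert_shared_vocab(teacher_tok, student_tok):
    # Token-level KD compares logits index by index, so id -> token must match exactly.
    if teacher_tok.get_vocab() != student_tok.get_vocab():
        raise ValueError(
            "teacher and student tokenizers must share a vocabulary for token-level KD"
        )
```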
For cross-family distillation, the alternative is sequence-level KD (Kim & Rush 2016): the teacher's argmax becomes the student's training target, no logit matching, no shared vocab requirement. This collapses to plain SFT-on-teacher-outputs, which is also a valid path through apps/trainer/trainer_real.py. The trade-off: token-level KD captures the "dark knowledge"; sequence-level does not.
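A sketch of that path, assuming a Hugging Face-style teacher and tokenizer; `teacher_targets` is an illustrative helper, not a trainer_real.py API:

```python
import torch

@torch.no_grad()
def teacher_targets(teacher, teacher_tok, prompts, max_new_tokens=256):
    # Sequence-level KD: the teacher's greedy decode becomes the student's SFT target.
    # No logit matching, so the student can use a different tokenizer entirely.
    enc = teacher_tok(prompts, return_tensors="pt", padding=True)   # assumes left padding
    out = teacher.generate(**enc, do_sample=False, max_new_tokens=max_new_tokens)
    completions = out[:, enc["input_ids"].size(1):]
    return teacher_tok.batch_decode(completions, skip_special_tokens=True)
```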
What ships
A distilled student is the same shape as any other kolm artifact: a base-model pointer (the small one), a LoRA adapter trained against the teacher, a manifest, a receipt, a CID. The teacher is not embedded; it is referenced by its own CID in the manifest so a reviewer can re-fetch the teacher and re-run a held-out KL check. The student artifact is what gets signed, what gets gated by the K-score, what gets aliased by the runtime, and what ships to production.
What the receipt records
"distill": {
"method": "kd_response_distillation",
"teacher_model": "Qwen/Qwen2.5-14B-Instruct",
"student_model": "Qwen/Qwen2.5-3B-Instruct",
"config": {
"temperature": 2.0,
"alpha": 0.9,
"objective": "forward_kl",
"top_k": 64,
"on_policy": false,
"lora_r": 16,
"lora_alpha": 32
},
"n_train_rows": 8200,
"n_eval_rows": 432,
"loss_final": 0.184,
"ppl_eval": 6.21,
"papers": [
"arXiv:1503.02531",
"arXiv:1606.07947",
"arXiv:2306.08543",
"arXiv:2306.13649"
]
}
The buyer's auditor can re-fetch both models by Hub id, replay the held-out rows, and recompute ppl_eval within float-noise of the receipt. The canonical-JSON manifest covers the block; a tampered receipt invalidates the artifact signature.
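What "covers the block" amounts to, sketched under the assumption that canonicalization means sorted-keys, no-whitespace, UTF-8 JSON; the real manifest code may canonicalize differently:

```python
import hashlib
import json

def receipt_digest(distill_block: dict) -> str:
    # One possible canonical form: sorted keys, no insignificant whitespace, UTF-8 bytes.
    canon = json.dumps(distill_block, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()
```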
Edge cases worth naming
Capacity gap. A 3B student cannot perfectly imitate a 70B teacher; some teacher predictions depend on capabilities the student lacks. The held-out KL plateaus above zero. The fix is not more training; it is either a bigger student, a narrower training distribution (fewer modes to compress), or an on-policy switch so the student is taught only on its own reachable trajectories.
Teacher hallucinations. The student faithfully imitates whatever the teacher does, including the teacher's confident errors. A reward-model gate (apps/trainer/reward.py) after distillation filters student responses that score low under an independent judge; alternatively, a verifiable check at the K-score gate catches the hallucination class the buyer cares about most.
Numerical instability at small T. Temperatures below 1.0 make the softmax peaky and the gradient large; loss spikes and the training run NaNs. Stay above 1.5 unless a held-out perplexity sweep shows otherwise.
Length bias. Longer responses contribute more positions to the loss, so the gradient is biased toward long-target behavior. We mean-reduce over response positions to flatten this; the alternative (sum-reduce) is what some early KD papers used and what makes some recipes mysteriously prefer terse responses.
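The flattening in code, assuming a per-position loss tensor and a response mask; the names are illustrative:

```python
def mean_reduce_response(per_pos_loss, resp_mask):
    # per_pos_loss: (B, T) loss at each position; resp_mask: (B, T), 1 on response tokens.
    # Dividing by the token count (instead of summing) makes a 400-token answer and a
    # 40-token answer contribute equally to the gradient.
    resp_mask = resp_mask.to(per_pos_loss.dtype)
    return (per_pos_loss * resp_mask).sum() / resp_mask.sum().clamp(min=1)
```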
Where this fits in the kolm compile loop
The standard pipeline is capture → (distill | SFT) → (preference) → (GRPO) → K-score gate → sign → ship. Response distillation replaces or augments the SFT stage when a strong frontier teacher is available and the buyer's deployment target is much smaller than that teacher. The downstream stages are unchanged; the K-score gate sees only the student's output and does not care whether the student learned from labels or from teacher logits.
Citations
Hinton, G., Vinyals, O. & Dean, J. Distilling the Knowledge in a Neural Network. arXiv:1503.02531, 2015. The original soft-labels recipe and the T² chain-rule scaling.
Kim, Y. & Rush, A. M. Sequence-Level Knowledge Distillation. arXiv:1606.07947, 2016. The sequence-level variant; teacher's argmax becomes the student's target.
Gu, Y. et al. MiniLLM: Knowledge Distillation of Large Language Models. arXiv:2306.08543, 2024. The reverse-KL formulation; keeps the student from spreading mass.
Agarwal, R. et al. On-Policy Distillation of Language Models. arXiv:2306.13649, 2024. Why training on the student's own samples beats training on the teacher's when the gap is large.
Gou, J. et al. Knowledge Distillation: A Survey. arXiv:2006.05525, 2021. The reference review covering offline, online, and self-distillation variants.