Response distillation.
Kolmogorov complexity is about compressing a string into the shortest program that emits it. Response distillation is the same idea for language models: compress a 14B teacher's behavior on the buyer's prompts into a 3B student that emits the same token distribution. The student is the artifact we sign and ship; the teacher never leaves the trainer.
The loss, in one line
For every response position, the teacher produces a vocabulary-wide logit vector and the student produces its own. The distillation loss is the KL divergence between the two softmaxes, taken at a shared temperature T and scaled by T² so the gradient magnitude does not shrink as T grows:
L_kd = KL( softmax(t/T) || softmax(s/T) ) * T^2
Plus a small cross-entropy term against the teacher's sampled token so the student also learns the surface form, not only the distribution shape:
L = α * L_kd + (1 - α) * L_ce
The standard recipe is α = 0.9, T = 2.0. The temperature has two roles. First, it dampens the teacher's argmax so the student sees the second- and third-most-likely tokens at every position; that is what Hinton called the "dark knowledge" the student would never see from sampled labels alone. Second, it shapes the gradient magnitude; T² is the chain-rule correction.
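A minimal sketch of that combined loss, assuming `s_logits` and `t_logits` are (batch, positions, vocab) logits at response positions and `labels` holds the teacher's sampled token ids; the names are illustrative, not the distill.py API:

```python
import torch
import torch.nn.functional as F

def kd_loss(s_logits, t_logits, labels, T=2.0, alpha=0.9):
    V = s_logits.size(-1)
    # Forward KL between the temperature-scaled distributions, scaled by T^2 so the
    # gradient magnitude does not shrink as T grows. (Prompt/pad masking omitted.)
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1).reshape(-1, V),   # student log-probs
        F.log_softmax(t_logits / T, dim=-1).reshape(-1, V),   # teacher log-probs (target)
        reduction="batchmean",
        log_target=True,
    ) * (T ** 2)
    # Small cross-entropy term against the teacher's sampled tokens.
    ce = F.cross_entropy(s_logits.reshape(-1, V), labels.reshape(-1))
    return alpha * kd + (1 - alpha) * ce
```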
Three objectives, three failure modes
| Objective | What it minimizes | When it fails |
|---|---|---|
| Forward KL | KL(t \|\| s): the student tries to cover all the teacher's mass | Student spreads probability across modes the teacher rejects (the "mean-seeking" failure) |
| Reverse KL | KL(s \|\| t): the student concentrates on what the teacher likes most (MiniLLM) | Student becomes overconfident; misses minority but valid alternatives |
| JSD | Symmetric mixture: 0.5*KL(s\|\|m) + 0.5*KL(t\|\|m), with m the 50/50 mixture of s and t | Most expensive of the three; bounded loss, so gradients soften far from the teacher |
The default in apps/trainer/distill.py is forward-KL, but the buyer's spec can request reverse-KL when the task admits one good answer and many wrong ones (PHI redaction, single-label classification), and JSD when the policy needs to remain calibrated for downstream best-of-N selection.
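The three directions, sketched over per-position log-probabilities. `s_lp` and `t_lp` are assumed to be temperature-scaled log-softmax outputs of shape (N, V); the objective strings mirror the receipt field, but the function is illustrative, and the per-token reverse KL is a simplification of MiniLLM's policy-gradient estimator:

```python
import math
import torch
import torch.nn.functional as F

def kd_objective(s_lp, t_lp, objective="forward_kl"):
    # s_lp, t_lp: temperature-scaled log-probabilities, shape (N, V).
    if objective == "forward_kl":    # KL(t || s): mass-covering
        return F.kl_div(s_lp, t_lp, reduction="batchmean", log_target=True)
    if objective == "reverse_kl":    # KL(s || t): mode-seeking, as in MiniLLM
        return F.kl_div(t_lp, s_lp, reduction="batchmean", log_target=True)
    if objective == "jsd":           # 0.5*KL(s||m) + 0.5*KL(t||m), m = 50/50 mixture
        m_lp = torch.logsumexp(torch.stack([s_lp, t_lp]), dim=0) - math.log(2)
        return 0.5 * (F.kl_div(m_lp, s_lp, reduction="batchmean", log_target=True)
                      + F.kl_div(m_lp, t_lp, reduction="batchmean", log_target=True))
    raise ValueError(f"unknown objective: {objective}")
```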
The top-k logit-pruning trick
A full-vocabulary KL is (B, T, V)-shaped; for Qwen-class tokenizers V ≈ 152K, and the logit activations dominate GPU memory. The pruning trick: keep only the top-k teacher tokens (by teacher probability) and aggregate the rest into a single "other" bucket via logsumexp on both sides:
import torch

def topk_prune(s_logits, t_logits, k):
    # Keep the k tokens the teacher ranks highest; fold everything else into one bucket.
    _, idx = torch.topk(t_logits, k=k, dim=-1)
    t_top = t_logits.gather(-1, idx)
    s_top = s_logits.gather(-1, idx)
    # mask is True at every non-top-k position.
    mask = torch.ones_like(t_logits, dtype=torch.bool)
    mask.scatter_(-1, idx, False)
    # logsumexp over the non-top-k logits becomes the "other" bucket on both sides.
    t_other = t_logits.masked_fill(~mask, float("-inf")).logsumexp(-1, keepdim=True)
    s_other = s_logits.masked_fill(~mask, float("-inf")).logsumexp(-1, keepdim=True)
    return torch.cat([s_top, s_other], -1), torch.cat([t_top, t_other], -1)
For k = 64 on a 152K-token vocabulary the memory drop is ≈ 2000x; the quality drop is in the third decimal place of held-out perplexity. Worth it for any teacher whose full-vocabulary logits will not fit in VRAM alongside the student.
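A usage sketch on toy shapes, assuming the `topk_prune` above; the temperature is applied before pruning so the "other" bucket is aggregated at the same temperature as the kept tokens:

```python
import torch
import torch.nn.functional as F

B, L, V, k, T = 1, 128, 152064, 64, 2.0   # toy batch, Qwen-class vocab
s_logits = torch.randn(B, L, V)           # student logits, (B, T, V)
t_logits = torch.randn(B, L, V)           # teacher logits, (B, T, V)

# Scale by 1/T first, then prune: the "other" bucket then aggregates the
# non-top-k mass at the same temperature as the kept tokens.
s_pruned, t_pruned = topk_prune(s_logits / T, t_logits / T, k)   # each (B, L, k + 1)

loss = F.kl_div(
    F.log_softmax(s_pruned, dim=-1).reshape(-1, k + 1),
    F.log_softmax(t_pruned, dim=-1).reshape(-1, k + 1),
    reduction="batchmean",
    log_target=True,
) * T ** 2
```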
Off-policy versus on-policy
Off-policy distillation: the teacher's response (or a captured ground-truth response) is the target sequence; the student's gradients are computed on tokens it did not generate. Cheap because the teacher runs once per row at data-preparation time.
On-policy distillation (Agarwal 2024): sample the response from the student, then compute the loss between the student's logits and the teacher's logits at the student's positions. Double the forward-pass cost (teacher + student each run on every batch), but resolves the well-known "train/test mismatch" problem where the student is taught on sequences it would never generate. Agarwal shows the on-policy variant beats off-policy by 1-3 points on MMLU for student-teacher gaps larger than 5x.
kolm exposes the choice as DistillConfig.on_policy; the default is off-policy because the typical kolm buyer has a 14B→3B gap (2-3x) where the lift is small and the doubled cost is not worth it. Buyers with cross-family teachers (e.g. distilling a 70B Llama into a 3B Qwen) should enable on-policy.
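One on-policy step, sketched under assumptions: Hugging Face-style causal-LM `student` and `teacher` sharing a tokenizer, an unpadded prompt batch, and the `kd_loss` sketch above. `DistillConfig.on_policy` is the real switch; the loop itself is illustrative.

```python
import torch

def on_policy_step(student, teacher, prompt_ids, T=2.0, alpha=0.9, max_new_tokens=256):
    # 1. The student samples its own response; these are the only positions we train on.
    with torch.no_grad():
        seq = student.generate(prompt_ids, do_sample=True, max_new_tokens=max_new_tokens)
    prompt_len = prompt_ids.size(1)
    # 2. Both models score the student's sequence; logits at position i predict token i + 1.
    s_logits = student(seq).logits[:, prompt_len - 1 : -1]
    with torch.no_grad():
        t_logits = teacher(seq).logits[:, prompt_len - 1 : -1]
    labels = seq[:, prompt_len:]          # the tokens the student actually emitted
    # 3. Same KD loss as the off-policy path, just evaluated at the student's positions.
    return kd_loss(s_logits, t_logits, labels, T=T, alpha=alpha)
```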
Why same-family tokenizers
Token-level KD requires the teacher and student to share a vocabulary. If they do not, index i in the teacher's logit vector does not correspond to index i in the student's, and the token-level KL is undefined. apps/trainer/distill.py raises explicitly if the tokenizers diverge: "teacher and student tokenizers must share a vocabulary for token-level KD."
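The guard is roughly this shape (a hypothetical helper, assuming Hugging Face tokenizers; the actual check in distill.py may differ):

```python
def assert_shared_vocab(teacher_tok, student_tok):
    # Token-level KD compares logits index by index, so id -> token must match exactly.
    if teacher_tok.get_vocab() != student_tok.get_vocab():
        raise ValueError(
            "teacher and student tokenizers must share a vocabulary for token-level KD"
        )
```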
For cross-family distillation, the alternative is sequence-level KD (Kim & Rush 2016): the teacher's argmax becomes the student's training target, no logit matching, no shared vocab requirement. This collapses to plain SFT-on-teacher-outputs, which is also a valid path through apps/trainer/trainer_real.py. The trade-off: token-level KD captures the "dark knowledge"; sequence-level does not.
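A sketch of that path, assuming a Hugging Face-style teacher and tokenizer; `teacher_targets` is an illustrative helper, not a trainer_real.py API:

```python
import torch

@torch.no_grad()
def teacher_targets(teacher, teacher_tok, prompts, max_new_tokens=256):
    # Sequence-level KD: the teacher's greedy decode becomes the student's SFT target.
    # No logit matching, so the student can use a different tokenizer entirely.
    enc = teacher_tok(prompts, return_tensors="pt", padding=True)   # assumes left padding
    out = teacher.generate(**enc, do_sample=False, max_new_tokens=max_new_tokens)
    completions = out[:, enc["input_ids"].size(1):]
    return teacher_tok.batch_decode(completions, skip_special_tokens=True)
```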
What ships
A distilled student is the same shape as any other kolm artifact: a base-model pointer (the small one), a LoRA adapter trained against the teacher, a manifest, a receipt, a CID. The teacher is not embedded; it is referenced by its own CID in the manifest so a reviewer can re-fetch the teacher and re-run a held-out KL check. The student artifact is what gets signed, what gets gated by the K-score, what gets aliased by the runtime, and what ships to production.
What the receipt records
"distill": {
"method": "kd_response_distillation",
"teacher_model": "Qwen/Qwen2.5-14B-Instruct",
"student_model": "Qwen/Qwen2.5-3B-Instruct",
"config": {
"temperature": 2.0,
"alpha": 0.9,
"objective": "forward_kl",
"top_k": 64,
"on_policy": false,
"lora_r": 16,
"lora_alpha": 32
},
"n_train_rows": 8200,
"n_eval_rows": 432,
"loss_final": 0.184,
"ppl_eval": 6.21,
"papers": [
"arXiv:1503.02531",
"arXiv:1606.07947",
"arXiv:2306.08543",
"arXiv:2306.13649"
]
}
The buyer's auditor can re-fetch both models by Hub id, replay the held-out rows, and recompute ppl_eval within float-noise of the receipt. The canonical-JSON manifest covers the block; a tampered receipt invalidates the artifact signature.
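What "covers the block" amounts to, sketched under the assumption that canonicalization means sorted-keys, no-whitespace, UTF-8 JSON; the real manifest code may canonicalize differently:

```python
import hashlib
import json

def receipt_digest(distill_block: dict) -> str:
    # One possible canonical form: sorted keys, no insignificant whitespace, UTF-8 bytes.
    canon = json.dumps(distill_block, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()
```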
Edge cases worth naming
Capacity gap. A 3B student cannot perfectly imitate a 70B teacher; some teacher predictions depend on capabilities the student lacks. The held-out KL plateaus above zero. The fix is not more training; it is either a bigger student, a narrower training distribution (fewer modes to compress), or an on-policy switch so the student is taught only on its own reachable trajectories.
Teacher hallucinations. The student faithfully imitates whatever the teacher does, including the teacher's confident errors. A reward-model gate (apps/trainer/reward.py) after distillation filters student responses that score low under an independent judge; alternatively, a verifiable check at the K-score gate catches the hallucination class the buyer cares about most.
Numerical instability at small T. Temperatures below 1.0 make the softmax peaky and the gradient large; loss spikes and the training run NaNs. Stay above 1.5 unless a held-out perplexity sweep shows otherwise.
Length bias. Longer responses contribute more positions to the loss, so the gradient is biased toward long-target behavior. We mean-reduce over response positions to flatten this; the alternative (sum-reduce) is what some early KD papers used and what makes some recipes mysteriously prefer terse responses.
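The flattening in code, assuming a per-position loss tensor and a response mask; the names are illustrative:

```python
def mean_reduce_response(per_pos_loss, resp_mask):
    # per_pos_loss: (B, T) loss at each position; resp_mask: (B, T), 1 on response tokens.
    # Dividing by the token count (instead of summing) makes a 400-token answer and a
    # 40-token answer contribute equally to the gradient.
    resp_mask = resp_mask.to(per_pos_loss.dtype)
    return (per_pos_loss * resp_mask).sum() / resp_mask.sum().clamp(min=1)
```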
Where this fits in the kolm compile loop
The standard pipeline is capture → (distill | SFT) → (preference) → (GRPO) → K-score gate → sign → ship. Response distillation replaces or augments the SFT stage when a strong frontier teacher is available and the buyer's deployment target is much smaller than that teacher. The downstream stages are unchanged; the K-score gate sees only the student's output and does not care whether the student learned from labels or from teacher logits.
Citations
Hinton, G., Vinyals, O. & Dean, J. Distilling the Knowledge in a Neural Network. arXiv:1503.02531, 2015. The original soft-labels recipe and the T² chain-rule scaling.
Kim, Y. & Rush, A. M. Sequence-Level Knowledge Distillation. arXiv:1606.07947, 2016. The sequence-level variant; teacher's argmax becomes the student's target.
Gu, Y. et al. MiniLLM: Knowledge Distillation of Large Language Models. arXiv:2306.08543, 2024. The reverse-KL formulation; keeps the student from spreading mass.
Agarwal, R. et al. On-Policy Distillation of Language Models. arXiv:2306.13649, 2024. Why training on the student's own samples beats training on the teacher's when the gap is large.
Gou, J. et al. Knowledge Distillation: A Survey. arXiv:2006.05525, 2021. The reference review covering offline, online, and self-distillation variants.