Engineering · 2026-05-09 · 11 min read

Distillation vs fine-tuning vs RAG.

A decision tree, not a religion. When each technique wins, when each one fails, and the actual cost numbers. Most teams pick one based on what their loudest engineer used last quarter; it's worth taking ten minutes to pick on substance.

By Kolmogorov · Tags: decision tree · LoRA · RAG · distillation

The three techniques, in one paragraph each.

RAG (retrieval-augmented generation). At inference time, search a vector index, paste the top-k chunks into the prompt, ask the model to answer using them. The model is unchanged; the system is.
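A minimal sketch of that loop; `embed`, `index`, and `llm` are placeholders for whatever embedding model, vector store, and chat API you actually use:

```python
# Minimal RAG loop: retrieve, stuff, ask. The model itself is unchanged.
def rag_answer(question: str, index, llm, embed, k: int = 5) -> str:
    query_vec = embed(question)                    # embed the user question
    chunks = index.search(query_vec, top_k=k)      # nearest-neighbor lookup
    context = "\n\n".join(c.text for c in chunks)  # paste the top-k chunks
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.complete(prompt)
```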

Fine-tuning (LoRA / QLoRA / full). Take an open-weight base model, train it on labeled (input, output) pairs from your domain. The model parameters change. The model is now your model.
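A minimal sketch of the LoRA variant using Hugging Face `peft`; the base model name and hyperparameters are illustrative, not a recommendation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-3B-Instruct"   # any open-weight base you can serve yourself
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Low-rank adapters on the attention projections; only a sliver of parameters train.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# ...then run your usual SFT loop on the labeled (input, output) pairs.
```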

Distillation. Use a powerful teacher model to label a large dataset, then fine-tune a smaller student to match. Often paired with verified inference: k-sample the teacher and only keep labels that pass a deterministic check. The student learns the teacher’s behavior on your specific tasks; the cost shifts from per-call to one-time.
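A sketch of that labeling loop, with `teacher.sample` and `verify` standing in for the teacher API call and the deterministic check:

```python
# Verifier-gated teacher labeling: k-sample, keep only labels that pass the check.
def label_dataset(inputs, teacher, verify, k: int = 4):
    labeled = []
    for x in inputs:
        for _ in range(k):                        # up to k attempts per input
            y = teacher.sample(x, temperature=0.7)
            if verify(x, y):                      # deterministic gate, not vibes
                labeled.append((x, y))
                break                             # first verified label wins
    return labeled                                # becomes the student's training set
```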

RAG retrieves. Fine-tuning teaches. Distillation transfers a behavior from one model to another. They solve different problems. They are not interchangeable.

The decision tree, by question.

Five questions. Asked in this order, they get most teams to the right answer in under a minute.
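One way to encode the gist of those questions, with the branching criteria drawn from the sections below; a rough sketch, not a rulebook, and the thresholds are this article's illustrative ones:

```python
def pick_technique(
    answer_lives_in_changing_data: bool,  # KB that updates weekly, news, recent papers
    task_is_structured: bool,             # classification, extraction, fixed output schema
    has_deterministic_verifier: bool,     # outputs can be checked mechanically
    labeled_pairs: int,                   # high-quality (input, output) examples on hand
    calls_per_month: int,
    data_must_stay_local: bool,
) -> str:
    if answer_lives_in_changing_data and not task_is_structured:
        return "RAG"                      # read the answer at inference time, don't learn it
    if not task_is_structured:
        return "frontier API"             # open-domain reasoning stays on the frontier
    if has_deterministic_verifier and (calls_per_month >= 1_000_000 or data_must_stay_local):
        return "distillation"             # the teacher + verifier generate the data; you own the artifact
    if labeled_pairs >= 200 and calls_per_month >= 100_000:
        return "fine-tune an open-weight base"
    return "frontier API"                 # volume too low to be worth owning yet
```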

The cost matrix.

Concrete numbers from the same hypothetical SaaS in our manifesto: 4M calls/month, ticket-classification task. The numbers are illustrative; your mileage will vary by 2-3x.

| Technique | Setup cost | Per-call cost | Latency p50 | Data leaves? | Verifier? |
|---|---|---|---|---|---|
| Frontier API only | $0 | $0.040 | 700ms | yes | no |
| RAG over frontier | $200 (index) | $0.045 | 950ms | yes | no |
| LoRA fine-tune of frontier | $2-5k (vendor) | $0.030 | 700ms | at training | no |
| LoRA fine-tune of open base | $200-1k | $0.0008 (own GPU) | 180ms | no | no |
| kolm distillation (.kolm) | $700 | $0.0003 | 180ms | no | yes |

The setup-cost column hides the most important detail. Frontier API has no setup cost because there is no setup. But the per-call line compounds: at 4M calls/mo and $0.040 a call, that's $160k a month, forever, and it grows with usage. The kolm distillation row pays back its $700 setup within days at that volume.
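Back-of-the-envelope, using the table's illustrative numbers:

```python
calls_per_month = 4_000_000
frontier_per_call = 0.040      # $/call, frontier API row
distilled_per_call = 0.0003    # $/call, kolm distillation row
setup = 700                    # one-time compile cost

monthly_savings = calls_per_month * (frontier_per_call - distilled_per_call)
# -> roughly $158,800/month saved; the $700 setup is recovered in a fraction of a day
print(f"payback in {setup / (monthly_savings / 30):.2f} days")
```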

When RAG wins (and where it dies).

RAG wins when the answer is in the data and the data changes. A support knowledge base that gets updated weekly. An internal docs system. A news summarizer. A research assistant over recent papers. The model doesn’t need to know the answer; it needs to read the answer and present it.

RAG dies when the task is structured. Asking GPT-4 to classify a ticket into one of 12 queues by retrieving 8 example tickets is more expensive, slower, and less accurate than fine-tuning a 3B model to do the same thing in 180ms. RAG also dies when latency budgets are tight (the retrieval step is a third of the latency) and when the data needs to stay local (the embeddings still need to live somewhere).

RAG also has a hidden tax that nobody costs out: the eval. RAG systems are notoriously hard to evaluate because the answer depends on what was retrieved, which depends on what was indexed, which drifts. Fine-tuned and distilled models are evaluated on a fixed eval set with a K-score; RAG systems are evaluated on prayers and a Slack channel.

When fine-tuning wins (and the trap).

Fine-tuning wins when the task is structured, the eval is reproducible, and you have 200-2000 high-quality labeled pairs. It is the right answer at moderate scale (say, 100k-1M calls/mo) on tasks where you don’t need verifier-grade outputs.

The trap: fine-tuning a vendor-hosted frontier model. You don’t actually own the result. The fine-tuned weights live in their cloud. They charge you per call as if it were the base. You can’t move it to your VPC. You can’t ship it to the edge. You can’t put it on a phone. You traded one rental for another.

The fine-tune that earns its keep is on an open-weight base (Qwen, Llama, Phi, Hermes) that you can serve yourself. That’s a real advantage. It is also, mostly, what kolm does inside the compile pipeline as one of seven stages. The artifact is what you get back; the fine-tune is plumbing.

When distillation wins (and the cost).

Distillation wins when the task is structured, a deterministic verifier can gate the outputs, and the volume is high enough that a one-time compile cost beats a per-call meter. It also wins when the data can't leave your infrastructure, the latency budget is tight, or the artifact has to ship into a customer VPC or onto the edge.

The cost of distillation is the teacher API bill at compile time, plus a few hours of GPU for LoRA training: $200-$5,000 per Specialist depending on dataset size and base model. After that, marginal cost is whatever you pay for the GPU sliver that serves the artifact.

The honest failure mode: distillation can’t conjure capability the student doesn’t have. If the task is open-domain reasoning, distilling a 3B from a frontier model gets you a 3B that knows your tasks, not a 3B that reasons like Opus. Pick distillation for the deterministic 80%; keep frontier for the open 20%.

The smaller-model-with-LoRA can match the frontier on your tasks. It cannot match the frontier on tasks you didn’t teach it. That’s a feature, not a bug.

The combinations that actually ship.

None of the three techniques are pure in production. The combinations win.
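The most common pairing is the 80/20 split from the previous section: a distilled Specialist handles the deterministic path, and the frontier model stays in the loop as a fallback for the open-ended tail. A minimal sketch of that wiring, with `specialist`, `frontier`, and `verify` as placeholders for your serving stack:

```python
def route_ticket(ticket: str, specialist, frontier, verify) -> str:
    """Distilled Specialist first; frontier fallback for whatever it can't handle."""
    queue = specialist.classify(ticket)   # local 3B, ~180ms, ~$0.0003/call
    if verify(ticket, queue):             # same deterministic check used at compile time
        return queue
    return frontier.classify(ticket)      # rare path: the open-ended tail stays rented
```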

A worked example: ticket routing.

Concrete walk-through. Same SaaS, same 4M calls/mo. The team chose distillation. Here’s why each alternative didn’t fit.

Why not RAG? The answer isn’t in the knowledge base; it’s a classification into one of 12 queues, learned from prior tickets. Retrieval doesn’t add anything; the labels are baked into the training set.

Why not vendor fine-tuning? Per-call cost still ran on the vendor meter, and the team wanted to ship the same model into a customer-VPC deployment for one regulated buyer. The fine-tune had to be portable.

Why not open-base LoRA without distillation? They had 200 labeled pairs, not enough for a 3B LoRA fine-tune to converge on this task; the stage needed thousands. Distillation generated 3,400 labeled pairs by k-sampling the teacher with verifier gating, which gave the LoRA enough signal to converge.
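For this task the deterministic check can be as simple as exact membership in the queue set. A sketch, with hypothetical queue names:

```python
# Deterministic gate for teacher labels: the label either is one of the
# 12 queues or it isn't. Queue names here are hypothetical placeholders.
QUEUES = {
    "billing", "auth", "api", "bug", "feature-request", "mobile",
    "integrations", "security", "onboarding", "refunds", "outage", "other",
}

def verify(ticket: str, label: str) -> bool:
    return label.strip().lower() in QUEUES
```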

Outcome: K-score 94.2, latency 180ms, $1,888/yr fixed cost vs $160,128/yr variable. Same accuracy band on the holdout set. The receipt chain went live the day they shipped. The CFO was happy.

One feature, one quarter, one number on the unit-economics line. Then the next.