## The three techniques, in one paragraph each
RAG (retrieval-augmented generation). At inference time, search a vector index, paste the top-k chunks into the prompt, ask the model to answer using them. The model is unchanged; the system is.
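A minimal sketch of that loop, assuming you already have an embedded corpus; `embed` and `llm` stand in for whatever embedding model and chat endpoint you use, not any specific vendor's SDK:

```python
# A minimal RAG loop over a precomputed dense index.
import numpy as np

def retrieve(query_vec, index, chunks, k=5):
    """Cosine similarity over an (n_chunks, dim) index; return top-k chunks."""
    sims = index @ query_vec / (
        np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def answer(question, index, chunks, embed, llm):
    context = "\n---\n".join(retrieve(embed(question), index, chunks))
    prompt = f"Answer using only the context below.\n\n{context}\n\nQ: {question}"
    return llm(prompt)  # the model is unchanged; the prompt carries the knowledge
```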
Fine-tuning (LoRA / QLoRA / full). Take an open-weight base model, train it on labeled (input, output) pairs from your domain. The model parameters change. The model is now your model.
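One common way to set up the LoRA half of that, via Hugging Face `peft`; the base model name and hyperparameters are examples, not a recommendation:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)  # base weights stay frozen; adapters train
model.print_trainable_parameters()    # typically well under 1% of the total
# From here, train on your (input, output) pairs with transformers.Trainer
# or trl's SFTTrainer; the adapters are what you keep.
```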
Distillation. Use a powerful teacher model to label a large dataset, then fine-tune a smaller student to match. Often paired with verified inference: k-sample the teacher and only keep labels that pass a deterministic check. The student learns the teacher’s behavior on your specific tasks; the cost shifts from per-call to one-time.
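A sketch of that data-generation loop; `teacher` and `verify` are stand-ins for the frontier API call and your deterministic check (regex, schema, label set), not a real SDK:

```python
def distill_dataset(inputs, teacher, verify, k=4):
    pairs = []
    for x in inputs:
        for _ in range(k):                 # k-sample the teacher
            y = teacher(x, temperature=0.8)
            if verify(x, y):               # keep only labels that pass the gate
                pairs.append((x, y))
                break                      # one verified label per input is enough
    return pairs                           # the student fine-tunes on these
```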
RAG retrieves. Fine-tuning teaches. Distillation transfers a behavior from one model to another. They solve different problems. They are not interchangeable.
## The decision tree, by question
Five questions. Asked in this order, they get most teams to the right answer in under a minute. (The same logic, as code, follows the list.)
1. Does the answer change as your data changes? If yes (knowledge base, ticket archive, fresh news), RAG is mandatory at the base. Stop here unless the next questions push you further.
2. Is the task structured? Does it always produce JSON, a fixed schema, a label from a closed set, code in a known style? If yes, fine-tuning helps a lot; distillation helps even more.
3. Are you paying frontier prices for it at scale? If yes (>1M calls/mo), distillation pays back fast. Below that, distillation's ROI is iffy.
4. Does the data have to stay local? If yes (HIPAA, defense, customer-VPC, mobile offline), distillation is the only viable answer: the artifact has to travel, the data does not.
5. Do you need a verifier? If outputs go to regulated systems, distillation gives you one for free as the gate during training. RAG and fine-tuning don't produce one.
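A literal transcription of the five questions into a routing function; illustrative, nothing more:

```python
def choose(data_changes, structured, calls_per_month, local_only, needs_verifier):
    stack = []
    if data_changes:                      # Q1: the answer lives in moving data
        stack.append("RAG")
    if local_only or needs_verifier:      # Q4, Q5: only distillation fits
        stack.append("distillation")
    elif structured and calls_per_month > 1_000_000:
        stack.append("distillation")      # Q2 + Q3: structured, at scale
    elif structured:
        stack.append("fine-tuning")       # Q2 alone: moderate scale
    return stack or ["frontier API"]      # no constraint? stay on the API
```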
## The cost matrix
Concrete numbers from the same hypothetical SaaS in our manifesto: 4M calls/month, ticket-classification task. The numbers are illustrative; your mileage will vary by 2-3x.
| Technique | Setup cost | Per-call cost | Latency p50 | Data leaves? | Verifier? |
|---|---|---|---|---|---|
| Frontier API only | $0 | $0.040 | 700ms | yes | no |
| RAG over frontier | $200 (index) | $0.045 | 950ms | yes | no |
| LoRA fine-tune of frontier | $2k-$5k (vendor) | $0.030 | 700ms | at training | no |
| LoRA fine-tune of open base | $200-$1k | $0.0008 (own GPU) | 180ms | no | no |
| kolm distillation (.kolm) | $700 | $0.0003 | 180ms | no | yes |
The setup-cost column hides the most important detail. Frontier API has no setup cost because there is no setup. But the per-call line compounds: at 4M calls/mo it's $160k a month ($1.92M a year), forever, and it grows with usage. The kolm distillation row pays back its setup inside its first day at that volume.
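The arithmetic behind that, straight from the table's numbers (assuming a 30-day month):

```python
calls_per_month = 4_000_000
frontier  = 0.040  * calls_per_month   # $160,000/mo on the API meter
distilled = 0.0003 * calls_per_month   # $1,200/mo serving the artifact
payback_days = 700 / ((frontier - distilled) / 30)
print(f"${frontier:,.0f}/mo vs ${distilled:,.0f}/mo; payback in {payback_days:.1f} days")
# -> $160,000/mo vs $1,200/mo; payback in 0.1 days
```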
## When RAG wins (and where it dies)
RAG wins when the answer is in the data and the data changes. A support knowledge base that gets updated weekly. An internal docs system. A news summarizer. A research assistant over recent papers. The model doesn’t need to know the answer; it needs to read the answer and present it.
RAG dies when the task is structured. Asking GPT-4 to classify a ticket into one of 12 queues by retrieving 8 example tickets is more expensive, slower, and less accurate than fine-tuning a 3B model to do the same thing in 180ms. RAG also dies when latency budgets are tight (in the table above, retrieval adds 250ms to every call) and when the data needs to stay local (the embeddings still need to live somewhere).
RAG also has a hidden tax that nobody costs out: the eval. RAG systems are notoriously hard to evaluate because the answer depends on what was retrieved, which depends on what was indexed, which drifts. Fine-tuned and distilled models are evaluated on a fixed eval set with a K-score; RAG systems are evaluated on prayers and a Slack channel.
## When fine-tuning wins (and the trap)
Fine-tuning wins when the task is structured, the eval is reproducible, and you have 200-2000 high-quality labeled pairs. It is the right answer at moderate scale (say, 100k-1M calls/mo) on tasks where you don’t need verifier-grade outputs.
The trap: fine-tuning a vendor-hosted frontier model. You don’t actually own the result. The fine-tuned weights live in their cloud. They charge you per call as if it were the base. You can’t move it to your VPC. You can’t ship it to the edge. You can’t put it on a phone. You traded one rental for another.
The fine-tune that earns its keep is on an open-weight base (Qwen, Llama, Phi, Hermes) that you can serve yourself. That’s a real advantage. It is also, mostly, what kolm does inside the compile pipeline as one of seven stages. The artifact is what you get back; the fine-tune is plumbing.
## When distillation wins (and the cost)
Distillation wins when:
- You have a teacher (frontier API) that is good at the task and a student (open base, 3-7B) that needs to learn it.
- The task has a verifier (regex, schema, label set, BLEU, AST diff) that can score outputs deterministically; a minimal example follows this list.
- You’ll make >1M calls/mo or you have a local-only constraint that makes per-call frontier cost moot.
- You need a signed receipt, a K-score, or a portable artifact that runs across devices.
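A minimal deterministic verifier for the ticket-routing case, assuming a closed label set and a strict JSON schema; the label set and fields are invented for illustration, not kolm's actual gate:

```python
import json

QUEUES = {"billing", "auth", "bug", "refund"}  # hypothetical closed label set

def verify(raw_output: str) -> bool:
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return (isinstance(obj, dict)
            and set(obj) == {"queue", "confidence"}
            and obj["queue"] in QUEUES
            and isinstance(obj["confidence"], (int, float))
            and 0.0 <= obj["confidence"] <= 1.0)
```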
The cost of distillation is the teacher API bill at compile time plus a few GPU-hours of LoRA training: $200-$5,000 per Specialist, depending on dataset size and base model. After that, marginal cost is whatever you pay for the GPU sliver that serves the artifact.
The honest failure mode: distillation can’t conjure capability the student doesn’t have. If the task is open-domain reasoning, distilling a 3B from a frontier model gets you a 3B that knows your tasks, not a 3B that reasons like Opus. Pick distillation for the deterministic 80%; keep frontier for the open 20%.
The smaller model with a LoRA can match the frontier on your tasks. It cannot match the frontier on tasks you didn't teach it. That's a feature, not a bug.
## The combinations that actually ship
None of the three techniques ships pure in production. The combinations win.
- RAG + fine-tune. Fine-tune the base to follow your output schema; RAG over your knowledge base supplies the answer content. Common for support copilots.
- Distillation + Recall (RAG-flavored). The kolm default. The artifact ships with a sqlite-vec index of the user's corpus and a LoRA-fitted base. Inference grounds in the index, and the model behaves like the teacher.
- Distillation + frontier escape hatch. Compile the deterministic 80%; route the open-domain 20% to the frontier API. Same harness, two backends: the K-score on the cover guarantees the cheap path is correct, and the escape hatch catches the hard tail. A routing sketch follows this list.
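A sketch of the escape-hatch pattern, one harness over two backends; `specialist`, `frontier`, and `verify` are stand-ins for your own calls (`verify` here checks only the output, like the verifier sketched earlier):

```python
def route(ticket, specialist, frontier, verify):
    out = specialist(ticket)                 # cheap, local, verified path first
    if verify(out):
        return out, "specialist"             # the deterministic ~80%
    return frontier(ticket), "frontier"      # the hard tail goes upstream
```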
## A worked example: ticket routing
Concrete walk-through. Same SaaS, same 4M calls/mo. The team chose distillation. Here’s why each alternative didn’t fit.
Why not RAG? The answer isn’t in the knowledge base; it’s a classification into one of 12 queues, learned from prior tickets. Retrieval doesn’t add anything; the labels are baked into the training set.
Why not vendor fine-tuning? Per-call cost still ran on the vendor meter, and the team wanted to ship the same model into a customer-VPC deployment for one regulated buyer. The fine-tune had to be portable.
Why not open-base LoRA without distillation? They had 200 labeled pairs, not enough for the LoRA stage, which needed thousands. Distillation expanded that to 3,400 labeled pairs by k-sampling the teacher with verifier gating, enough signal for the LoRA to converge.
Outcome: K-score 94.2, latency 180ms, $1,888/yr fixed cost vs $160,128/mo ($1.92M/yr) variable. Same accuracy band on the holdout set. The receipt chain went live the day they shipped. The CFO was happy.
One feature, one quarter, one number on the unit-economics line. Then the next.