Where they overlap
Both ground answers in your data, index a corpus into vectors, and let you cite sources. The difference is where the model runs and who pays per request.
RAG inherits the latency and cost shape of whichever frontier API you call. kolm pays the frontier cost once at compile time (distillation), then runs the resulting student locally for free.
Side-by-side
| | RAG | kolm |
|---|---|---|
| What it is | Orchestration pattern (vector store + LLM call) | Compiled artifact (small model + LoRA + recall index + recipes) |
| Where the model runs | Frontier API (OpenAI / Anthropic / Google) at every request | Locally, embedded llama.cpp, on phone/laptop/server |
| Per-request cost | $$ for frontier tokens for context + answer | $0 after compile (you paid it once) |
| Compile cost | $0, just plumbing | Frontier-API cost for k-sample distillation pass (typically <$50 for an SMB corpus) |
| Latency | Network round-trip + frontier inference | Local inference only, no network |
| Offline | no | yes |
| Quality on your task | Frontier-class, but every request leaves your network | Frontier-distilled into a 3B-class student, gated by K-score (accuracy × coverage ÷ size). Compile fails → nothing ships. |
| Data leaves your network | yes, every prompt | no after compile (distillation step is opt-in cloud or self-host) |
| Receipts / signing | none, no audit trail | HMAC-SHA256 chain over manifest → recall → output |
| Embedding drift | Re-embed when model changes; rebuild index ad-hoc | Index pinned in artifact at compile time; deterministic |
| Reproducibility | Frontier API outputs drift week-to-week | Byte-exact: same inputs, same artifact, same outputs |
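The "HMAC-SHA256 chain" row above can be illustrated with a minimal sketch. This is not kolm's actual receipt format; the stage names, key handling, and chaining rule here are assumptions chosen only to show why a chained MAC makes the audit trail tamper-evident: each tag covers the previous tag, so editing or reordering any stage invalidates every tag after it.

```python
import hashlib
import hmac

def receipt_chain(key: bytes, stages: list[tuple[str, bytes]]) -> list[str]:
    """Chain HMAC-SHA256 over ordered stages (e.g. manifest -> recall -> output).

    Each tag authenticates the stage name, its payload, and the previous
    tag, so a verifier holding the key can recompute the whole chain and
    detect any altered, dropped, or reordered stage.
    """
    prev = b""
    tags = []
    for name, payload in stages:
        tag = hmac.new(key, prev + name.encode() + payload, hashlib.sha256).hexdigest()
        tags.append(tag)
        prev = bytes.fromhex(tag)
    return tags

# hypothetical stage payloads, just to exercise the chain
tags = receipt_chain(
    b"demo-signing-key",
    [
        ("manifest", b"support.kolm v1"),
        ("recall", b"top-3 passages"),
        ("output", b"answer text"),
    ],
)
```

A verifier with the same key recomputes the chain from the raw stage payloads and compares the final tag; any mismatch pinpoints the first tampered stage.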
When RAG is the right answer
Use RAG when the corpus changes faster than you can recompile, when you don't need offline, and when frontier-API cost-per-request is fine for your scale.
```
# classic RAG: corpus changes hourly, fine for low volume
embed(corpus) -> vectors
query -> top_k(vectors) -> openai.chat.create()
# pays $$$ per request, no offline, no audit trail
```
When kolm is the right answer
Use kolm when the corpus is stable enough to recompile weekly, when you need offline or sovereignty, when you need a signed audit trail, or when per-request frontier cost is killing your unit economics.
```
# compile corpus + task into one artifact:
kolm compile "answer support tickets" \
  --corpus ./kb/ \
  --examples ./tickets.jsonl \
  --base qwen2.5-7b
ok wrote support.kolm k_score=0.88 size=2.3GB

# now every request runs locally, $0 frontier:
kolm run support.kolm "can I downgrade mid-cycle?" --receipt
```
Can you combine them?
Yes. A common pattern is kolm at the edge for the hot path, with RAG as the fallback for the cold path. Compile the 80% of queries that resemble your training set; when the K-score on a live query falls below threshold, route it to a frontier API with full RAG. That gives you cost control on the hot path and frontier quality on the long tail.
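The routing logic above can be sketched as follows. The threshold value and every function here (`kolm_score`, `kolm_run`, `frontier_rag`) are hypothetical stubs, not kolm's API; the point is only the shape of the hot-path/cold-path split.

```python
K_THRESHOLD = 0.80  # illustrative cutoff; tune per deployment

def kolm_score(query: str) -> float:
    # hypothetical stand-in: a real deployment would score the live
    # query against the artifact's coverage; this stub just pattern-matches
    return 0.9 if "downgrade" in query else 0.5

def kolm_run(artifact: str, query: str) -> str:
    # stub for local inference against the compiled artifact
    return f"[local:{artifact}] answer to {query!r}"

def frontier_rag(query: str) -> str:
    # stub for the frontier-API RAG fallback
    return f"[frontier RAG] answer to {query!r}"

def answer(query: str) -> str:
    # hot path: local artifact handles queries it was compiled to cover
    if kolm_score(query) >= K_THRESHOLD:
        return kolm_run("support.kolm", query)
    # cold path: long-tail queries go to a frontier API with full RAG
    return frontier_rag(query)
```

The design choice is that the gate runs locally and cheaply, so only queries the artifact demonstrably can't cover incur frontier cost.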
Verdict
Use RAG if your corpus changes hourly, you're at low request volume, and you don't need offline or audit trails.
Use kolm if your corpus is stable, you need offline / sovereignty / receipts, or your frontier-API bill is becoming a problem. The compile step pays for itself the first month at any meaningful volume.
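The "pays for itself" claim is simple arithmetic. The compile figure below is the article's <$50 estimate; the per-request frontier cost is an assumed number, so treat the result as a shape, not a quote:

```python
compile_cost = 50.0      # one-time distillation pass (the article's <$50 SMB figure)
cost_per_request = 0.02  # assumed frontier token cost per RAG request

# requests after which the one-time compile beats pay-per-request RAG
break_even_requests = compile_cost / cost_per_request
print(int(break_even_requests))  # 2500
```

At a few hundred requests a day, that break-even lands well inside the first month.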
Adjacent comparisons: vs Ollama · vs fine-tuning · full comparison table