vs RAG

RAG is orchestration. kolm is a build.

Retrieval-Augmented Generation glues a vector store to a frontier API at query time. Useful, and painful when you scale, when you need offline, when you need proof. kolm compiles those moving parts into one signed file.

RAG: An architecture. Embed corpus, retrieve top-k, stuff into a frontier prompt, return answer. Per-request frontier API spend.

kolm: An artifact. Same retrieval, distilled into the model itself plus a bundled recall index. Runs locally, signed end-to-end, no per-request frontier bill.

Where they overlap

Both ground answers in your data. Both index a corpus into vectors. Both let you cite sources. The question is when and where the model runs and who pays per request.

RAG inherits the latency and cost shape of whichever frontier API you call. kolm pays the frontier cost once at compile time (distillation), then runs the resulting student locally for free.
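
A back-of-envelope break-even makes that cost shape concrete. All numbers below are illustrative assumptions, not published pricing:

# break-even sketch: one-time compile cost vs per-request frontier spend
compile_cost = 40.00          # assumed one-time distillation pass (frontier API)
rag_cost_per_request = 0.01   # assumed ~2k frontier tokens for context + answer
break_even_requests = compile_cost / rag_cost_per_request
print(f"kolm is cheaper after {break_even_requests:.0f} requests")  # 4000

At a few hundred queries a day, that threshold falls inside the first month.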

Side-by-side

|  | RAG | kolm |
| --- | --- | --- |
| What it is | Orchestration pattern (vector store + LLM call) | Compiled artifact (small model + LoRA + recall index + recipes) |
| Where the model runs | Frontier API (OpenAI / Anthropic / Google) at every request | Locally, embedded llama.cpp, on phone/laptop/server |
| Per-request cost | $$ for frontier tokens (context + answer) | $0 after compile (you paid it once) |
| Compile cost | $0, just plumbing | Frontier-API cost for the k-sample distillation pass (typically <$50 for an SMB corpus) |
| Latency | Network round-trip + frontier inference | Local inference only, no network |
| Offline | No | Yes |
| Quality on your task | Frontier-class, but every request leaves your network | Frontier-distilled into a 3B-class student, gated by K-score (accuracy × coverage ÷ size); compile fails → nothing ships |
| Data leaves your network | Yes, every prompt | No after compile (distillation step is opt-in cloud or self-host) |
| Receipts / signing | None, no audit trail | HMAC-SHA256 chain over manifest → recall → output |
| Embedding drift | Re-embed when the model changes; rebuild index ad hoc | Index pinned in artifact at compile time; deterministic |
| Reproducibility | Frontier API outputs drift week to week | Byte-exact: same inputs, same artifact, same outputs |
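
The K-score gate in that table is simple arithmetic. A minimal sketch, assuming the formula shown (accuracy × coverage ÷ size), with size treated as a normalized penalty and 0.85 as a hypothetical shipping threshold; how kolm actually normalizes size is not documented here:

# K-score gate: accuracy * coverage / size (normalization is an assumption)
def k_score(accuracy: float, coverage: float, size_penalty: float) -> float:
    return accuracy * coverage / size_penalty

score = k_score(accuracy=0.94, coverage=0.97, size_penalty=1.03)  # ~0.885
assert score >= 0.85, "compile fails -> nothing ships"  # hypothetical threshold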

When RAG is the right answer

Use RAG when the corpus changes faster than you can recompile, when you don't need offline, and when frontier-API cost-per-request is fine for your scale.

# classic RAG: corpus changes hourly, fine for low volume
from openai import OpenAI  # embed()/top_k() below stand in for your vector store
client = OpenAI()
vectors = embed(corpus)                      # re-embed whenever the corpus changes
context = top_k(vectors, query)              # retrieve nearest chunks
answer = client.chat.completions.create(model="gpt-4o",
    messages=[{"role": "user", "content": f"{context}\n\n{query}"}])
# pays $$$ per request, no offline, no audit trail

When kolm is the right answer

Use kolm when the corpus is stable enough to recompile weekly, when you need offline or sovereignty, when you need a signed audit trail, or when per-request frontier cost is killing your unit economics.

# compile corpus + task into one artifact:
kolm compile "answer support tickets" \
  --corpus ./kb/ \
  --examples ./tickets.jsonl \
  --base qwen2.5-7b

ok wrote support.kolm  k_score=0.88 size=2.3GB

# now every request runs locally, $0 frontier:
kolm run support.kolm "can I downgrade mid-cycle?" --receipt
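
The --receipt flag above emits the HMAC-SHA256 chain from the comparison table. A minimal sketch of how such a chain can be checked, assuming a JSON receipt with data and mac fields per stage; the field names and how the verification key is distributed are assumptions, not kolm's documented format:

import hmac, hashlib, json

def verify_chain(receipt: dict, key: bytes) -> bool:
    # Each stage's MAC covers the previous MAC plus its own payload, so
    # tampering with manifest, recall, or output breaks every later link.
    prev = b""
    for stage in ("manifest", "recall", "output"):
        payload = json.dumps(receipt[stage]["data"], sort_keys=True).encode()
        mac = hmac.new(key, prev + payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(mac, receipt[stage]["mac"]):
            return False
        prev = mac.encode()
    return True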

Can you combine them?

Yes. A common pattern is kolm at the edge for the hot path, with RAG as the fallback for the cold path. Compile for the 80% of queries that look like your training set; if the K-score on a live query falls below threshold, route it to a frontier API with full RAG. That gives you cost control on the hot path and frontier quality on the long tail.
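
One way to wire that split, sketched in Python. The subprocess call mirrors the CLI above, but the receipt-on-stdout shape, the k_score field, and the 0.8 threshold are all assumptions; rag_fallback() is a placeholder for the frontier RAG path sketched earlier:

import json, subprocess

THRESHOLD = 0.8   # assumed: below this, the live query is off-distribution

def answer(query: str) -> str:
    # hot path: local artifact, $0 per request
    out = subprocess.run(
        ["kolm", "run", "support.kolm", query, "--receipt"],
        capture_output=True, text=True, check=True,
    )
    receipt = json.loads(out.stdout)        # receipt format is an assumption
    if receipt["k_score"] >= THRESHOLD:
        return receipt["answer"]
    return rag_fallback(query)              # cold path: full frontier RAG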

Verdict

Use RAG if your corpus changes hourly, you're at low request volume, and you don't need offline or audit trails.

Use kolm if your corpus is stable, you need offline / sovereignty / receipts, or your frontier-API bill is becoming a problem. The compile step pays for itself the first month at any meaningful volume.

Adjacent comparisons: vs Ollama · vs fine-tuning · full comparison table