vs RAG

RAG is orchestration. kolm is a build.

Retrieval-Augmented Generation glues a vector store to a frontier API at query time. Useful, and painful when you scale, when you need offline, when you need proof. kolm compiles those moving parts into one signed file.

RAG: An architecture. Embed corpus, retrieve top-k, stuff into a frontier prompt, return answer. Per-request frontier API spend.

kolm: An artifact. Same retrieval, distilled into the model itself plus a bundled recall index. Runs locally, signed end-to-end, no per-request frontier bill.

Where they overlap

Both ground answers in your data. Both index a corpus into vectors. Both let you cite sources. The question is when and where the model runs and who pays per request.

RAG inherits the latency and cost shape of whichever frontier API you call. kolm pays the frontier cost once at compile time (distillation), then runs the resulting student locally for free.
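
A back-of-envelope break-even makes that cost shape concrete. All numbers below are illustrative assumptions, not published pricing:

# break-even sketch: one-time compile cost vs per-request frontier spend
compile_cost = 40.00          # assumed one-time distillation pass (frontier API)
rag_cost_per_request = 0.01   # assumed ~2k frontier tokens for context + answer
break_even_requests = compile_cost / rag_cost_per_request
print(f"kolm is cheaper after {break_even_requests:.0f} requests")  # 4000

At a few hundred queries a day, that threshold falls inside the first month.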

Side-by-side

|  | RAG | kolm |
| --- | --- | --- |
| What it is | Orchestration pattern (vector store + LLM call) | Compiled artifact (small model + LoRA + recall index + recipes) |
| Where the model runs | Frontier API (OpenAI / Anthropic / Google) at every request | Locally, embedded llama.cpp, on phone/laptop/server |
| Per-request cost | $$ for frontier tokens (context + answer) | $0 after compile (you paid it once) |
| Compile cost | $0, just plumbing | Frontier-API cost for the k-sample distillation pass (typically <$50 for an SMB corpus) |
| Latency | Network round-trip + frontier inference | Local inference only, no network |
| Offline | No | Yes |
| Quality on your task | Frontier-class, but every request leaves your network | Frontier-distilled into a 3B-class student, gated by K-score (accuracy × coverage ÷ size); compile fails → nothing ships |
| Data leaves your network | Yes, every prompt | No after compile (distillation step is opt-in cloud or self-host) |
| Receipts / signing | None, no audit trail | HMAC-SHA256 chain over manifest → recall → output |
| Embedding drift | Re-embed when the model changes; rebuild index ad hoc | Index pinned in artifact at compile time; deterministic |
| Reproducibility | Frontier API outputs drift week to week | Byte-exact: same inputs, same artifact, same outputs |
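
The K-score gate in that table is simple arithmetic. A minimal sketch, assuming the formula shown (accuracy × coverage ÷ size), with size treated as a normalized penalty and 0.85 as a hypothetical shipping threshold; how kolm actually normalizes size is not documented here:

# K-score gate: accuracy * coverage / size (normalization is an assumption)
def k_score(accuracy: float, coverage: float, size_penalty: float) -> float:
    return accuracy * coverage / size_penalty

score = k_score(accuracy=0.94, coverage=0.97, size_penalty=1.03)  # ~0.885
assert score >= 0.85, "compile fails -> nothing ships"  # hypothetical threshold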

When RAG is the right answer

Use RAG when the corpus changes faster than you can recompile, when you don't need offline, and when frontier-API cost-per-request is fine for your scale.

# classic RAG: corpus changes hourly, fine for low volume
from openai import OpenAI  # embed()/top_k() below stand in for your vector store
client = OpenAI()
vectors = embed(corpus)                      # re-embed whenever the corpus changes
context = top_k(vectors, query)              # retrieve nearest chunks
answer = client.chat.completions.create(model="gpt-4o",
    messages=[{"role": "user", "content": f"{context}\n\n{query}"}])
# pays $$$ per request, no offline, no audit trail

When kolm is the right answer

Use kolm when the corpus is stable enough to recompile weekly, when you need offline or sovereignty, when you need a signed audit trail, or when per-request frontier cost is killing your unit economics.

# compile corpus + task into one artifact:
kolm compile "answer support tickets" \
  --corpus ./kb/ \
  --examples ./tickets.jsonl \
  --base qwen2.5-7b

ok wrote support.kolm  k_score=0.88 size=2.3GB

# now every request runs locally, $0 frontier:
kolm run support.kolm "can I downgrade mid-cycle?" --receipt
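
The --receipt flag above emits the HMAC-SHA256 chain from the comparison table. A minimal sketch of how such a chain can be checked, assuming a JSON receipt with data and mac fields per stage; the field names and how the verification key is distributed are assumptions, not kolm's documented format:

import hmac, hashlib, json

def verify_chain(receipt: dict, key: bytes) -> bool:
    # Each stage's MAC covers the previous MAC plus its own payload, so
    # tampering with manifest, recall, or output breaks every later link.
    prev = b""
    for stage in ("manifest", "recall", "output"):
        payload = json.dumps(receipt[stage]["data"], sort_keys=True).encode()
        mac = hmac.new(key, prev + payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(mac, receipt[stage]["mac"]):
            return False
        prev = mac.encode()
    return True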

Can you combine them?

Yes. A common pattern is kolm at the edge for the hot path, with RAG as the fallback for the cold path. Compile for the 80% of queries that look like your training set; if the K-score on a live query falls below threshold, route it to a frontier API with full RAG. That gives you cost control on the hot path and frontier quality on the long tail.
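
One way to wire that split, sketched in Python. The subprocess call mirrors the CLI above, but the receipt-on-stdout shape, the k_score field, and the 0.8 threshold are all assumptions; rag_fallback() is a placeholder for the frontier RAG path sketched earlier:

import json, subprocess

THRESHOLD = 0.8   # assumed: below this, the live query is off-distribution

def answer(query: str) -> str:
    # hot path: local artifact, $0 per request
    out = subprocess.run(
        ["kolm", "run", "support.kolm", query, "--receipt"],
        capture_output=True, text=True, check=True,
    )
    receipt = json.loads(out.stdout)        # receipt format is an assumption
    if receipt["k_score"] >= THRESHOLD:
        return receipt["answer"]
    return rag_fallback(query)              # cold path: full frontier RAG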

Verdict

Use RAG if your corpus changes hourly, you're at low request volume, and you don't need offline or audit trails.

Use kolm if your corpus is stable, you need offline / sovereignty / receipts, or your frontier-API bill is becoming a problem. The compile step pays for itself the first month at any meaningful volume.

Adjacent comparisons: vs Ollama · vs fine-tuning · full comparison table