The frozen-model problem.
A frontier model is trained on a snapshot of the web ending some months ago. Its weights are fixed; its tokenizer is fixed; the policy that ranks responses is fixed. For most consumer chat this is fine. For a buyer with an industry-specific tail (clinical workflows, regulatory deadlines, internal jargon, customer-name vocabulary), the tail does not exist in the training distribution. The model is competent on the long shoulder and merely plausible on the tail.
The conventional answer is RAG: paste the buyer's documents into the prompt at inference time and hope the model uses them. RAG works for a class of retrieval-shaped queries and fails for anything that needs the model to have internalized a vocabulary or a policy. The model never learns. Every new shipment of buyer-specific knowledge is a context-window expense, not an asset.
The right answer is a distill that converts the buyer's traffic into weights. The traffic already exists: it is the captures the buyer is sending to a frontier API today. The distill is small and fast: a LoRA adapter over a small base model. The risk is regression: the distill could be worse than the model it replaces. The fix is a gate.
The four-step loop.
The kolm loop is four CLI verbs and one assertion. The buyer runs each step manually, on a schedule (nightly, weekly), or on an event (volume threshold, semantic drift detected).
```sh
# 1. Capture: proxy live traffic, tag, redact, store.
$ kolm capture --tenant acme --tag clinical-intake

# 2. Train: distill the captures into a new .kolm adapter.
$ kolm compile --task "clinical intake triage" --from-captures acme/clinical-intake

# 3. Score: K-score the new artifact against the held-out evals.
$ kolm eval --artifact ./out/intake_v2.kolm --suite clinical-intake-evals

# 4. Swap: hot-swap the adapter only if it K-scores higher than the current one.
$ kolm swap --artifact ./out/intake_v2.kolm --strategy higher-k-wins
```
The four verbs map to four files in the source tree: src/capture.js, src/spec-compile.js, src/benchmark.js, src/serve.js. The connecting state is the receipt chain: every step writes a receipt that references the previous step's CID. A swap that ships also writes a registry row with the from-CID and the to-CID so a deployer can roll back the adapter on the fly.
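The connecting invariant, each receipt naming the CID of the receipt before it, is small enough to sketch in JavaScript. The field names (cid, prevCid, step) and the CID strings here are illustrative, not kolm's actual receipt schema:

```javascript
// Sketch: walk a receipt chain and confirm each step references the CID
// of the previous step, so a broken link is detectable without any
// external service. Field names are illustrative.
function verifyReceiptChain(receipts) {
  for (let i = 1; i < receipts.length; i++) {
    if (receipts[i].prevCid !== receipts[i - 1].cid) {
      return { ok: false, brokenAt: receipts[i].step };
    }
  }
  return { ok: true };
}

const chain = [
  { step: "capture", cid: "cid-cap", prevCid: null },
  { step: "compile", cid: "cid-cmp", prevCid: "cid-cap" },
  { step: "eval",    cid: "cid-evl", prevCid: "cid-cmp" },
  { step: "swap",    cid: "cid-swp", prevCid: "cid-evl" },
];
console.log(verifyReceiptChain(chain)); // { ok: true }
```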
The K-score gate.
The K-score is a weighted aggregate of five components: accuracy (A), size (S), latency (L), cost (C), and coverage (V). K = 0.40·A + 0.15·S + 0.15·L + 0.15·C + 0.15·V. The default ship threshold is 0.85, but in the continual-learning loop the threshold is dynamic: the new artifact must beat the current artifact's K-score, plus a margin.
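As a sketch, the aggregate is one weighted sum. The assumption here, implied by the formula but not stated, is that each component is already normalized to [0, 1]:

```javascript
// Sketch of the weighted K-score aggregate. Assumes each component has
// been normalized to [0, 1] before weighting; the weights sum to 1.00,
// so a perfect artifact scores 1.0 against the 0.85 ship threshold.
function kScore({ accuracy, size, latency, cost, coverage }) {
  return 0.40 * accuracy
       + 0.15 * size
       + 0.15 * latency
       + 0.15 * cost
       + 0.15 * coverage;
}
```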
| Current K | New K | Margin | Action |
|---|---|---|---|
| 0.871 | 0.889 | +0.018 | swap |
| 0.871 | 0.872 | +0.001 | refuse (within noise) |
| 0.871 | 0.850 | -0.021 | refuse (regression) |
| 0.871 | 0.823 | -0.048 | refuse + alert |
The margin defaults to +0.01 so noise alone does not flip the adapter. The alert threshold defaults to -0.03 so a meaningful regression pages the engineer who owns the recipe.
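The gate rule is small enough to state as code. A sketch, using the defaults from above:

```javascript
// Sketch of the dynamic gate: swap only if the new K beats the current K
// by at least the margin; a regression past the alert threshold pages the
// recipe owner. Defaults mirror the text: +0.01 margin, -0.03 alert.
function gate(currentK, newK, margin = 0.01, alertAt = -0.03) {
  const delta = newK - currentK;
  if (delta >= margin) return "swap";
  if (delta <= alertAt) return "refuse+alert";
  return "refuse";
}

console.log(gate(0.871, 0.889)); // "swap"          (+0.018)
console.log(gate(0.871, 0.872)); // "refuse"        (+0.001, within noise)
console.log(gate(0.871, 0.850)); // "refuse"        (-0.021, regression)
console.log(gate(0.871, 0.823)); // "refuse+alert"  (-0.048)
```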
Hot-swap mechanics.
The serving runtime loads .kolm artifacts via a versioned adapter registry. A hot-swap is two file renames and one HUP signal. There is no model restart, no service interruption, no cold cache.
```sh
# The current adapter symlink points at v1.
adapters/clinical-intake -> adapters/clinical-intake_v1.kolm

# The swap atomically repoints the symlink.
ln -sf adapters/clinical-intake_v2.kolm adapters/clinical-intake.swap
mv -f adapters/clinical-intake.swap adapters/clinical-intake

# The runtime polls the symlink every second; on change, it loads the new adapter.
# In-flight requests finish on the old one; new requests use the new one.
```
The trade is that two adapters live in memory briefly. For LoRA adapters at r=16 on a 7B base, that is two adapters of about 30 MB each, which is negligible against the 14 GB base weights. The window is bounded by the longest in-flight request; on a typical chat path with max_tokens=512 at 200 tok/s that is 2.5 seconds.
Provenance, every swap.
Every artifact carries a receipt chain over task → seeds → recipes → evals → package, signed under the tenant key, with the CID embedded. Every swap writes a registry row that names the from-CID and the to-CID. The deployer can answer five questions without leaving the registry:
- What task is this model trained for? The task field in the manifest.
- What data trained it? The capture namespace and the seed range.
- What code produced it? The deterministic recipe bytes and the trainer version.
- What does it score? The K-score from the evals.json block.
- What did it replace? The from-CID column in the swap log.
A regulator who asks "show me the trail" gets all five from one query. None of those answers depend on us being alive. The verifier is Rust, forbid(unsafe_code), dependency-pinned, and ships as a 4 MB binary the deployer can vendor.
Cadence and cost.
Three cadences cover most production patterns.
| Cadence | Trigger | Typical cost | Suits |
|---|---|---|---|
| Nightly | cron at 02:00 local | $0.05-$0.50 | chat, support, internal tools |
| Volume-gated | 10k new captures | $0.10-$1.00 | variable workloads, seasonal traffic |
| Drift-detected | embedding distance > threshold | $0.10-$2.00 | regulated workflows, compliance-sensitive |
The dollar figures are the median LoRA-distill cost for a Qwen2.5-3B target on a rented A100 via the kolm compute picker (see /compute for the per-backend rates). The total is dominated by the cold-start of the rental container; the actual training pass usually takes 30-90 seconds.
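The drift-detected trigger is stated above only as "embedding distance > threshold". One plausible reading, sketched here as an assumption, is cosine distance between the mean embedding of a baseline window and the mean embedding of recent captures:

```javascript
// Sketch of one possible drift detector. The windowing, the use of mean
// embeddings, and the 0.15 default threshold are all assumptions.
function mean(vectors) {
  const m = new Array(vectors[0].length).fill(0);
  for (const v of vectors) v.forEach((x, i) => (m[i] += x / vectors.length));
  return m;
}

function cosineDistance(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function driftDetected(baseline, recent, threshold = 0.15) {
  return cosineDistance(mean(baseline), mean(recent)) > threshold;
}
```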
Failure modes.
The honest list of ways this loop breaks.
- Catastrophic forgetting. A LoRA distill that overweights the new captures forgets the long shoulder. The mitigation is the held-out evals: if the new artifact scores worse on the broad rubric, the K-score gate refuses.
- Eval-set drift. If the held-out evals were authored months ago, they may no longer reflect what the buyer cares about. The mitigation is to mark a fraction of captures as "evaluator" examples in the capture step (`kolm capture --tag eval-candidate`); the trainer rotates them into the held-out set on the next compile.
- Capture pollution. A user types nonsense; the captures absorb it; the next distill regresses on a clean prompt. The mitigation is the receipt chain plus the swap log: roll back to a known-good CID.
- Provider deprecation. The frontier API the captures came from raises rates or sunsets. The mitigation is that the captures already exist in your registry; the next distill does not need the API.
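The roll-back mitigation for capture pollution falls out of the swap log: the last known-good adapter is the from-CID of the swap that shipped the bad artifact. A sketch, with an illustrative row shape:

```javascript
// Sketch: find the roll-back target for a bad artifact by walking the
// swap log newest-first. The { fromCid, toCid } row shape is illustrative.
function rollbackTarget(swapLog, badCid) {
  for (let i = swapLog.length - 1; i >= 0; i--) {
    if (swapLog[i].toCid === badCid) return swapLog[i].fromCid;
  }
  return null; // badCid was never shipped by a logged swap
}

const log = [
  { fromCid: "cid-v1", toCid: "cid-v2" },
  { fromCid: "cid-v2", toCid: "cid-v3" },
];
console.log(rollbackTarget(log, "cid-v3")); // "cid-v2"
```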
The loop is not a silver bullet. It is the only structural answer to the frozen-model failure mode that does not require trusting a vendor to retrain on your schedule. kolm ships the loop; you own the captures; the receipt chain proves both.