Three forces, one answer
Frontier capability commoditized.
In late 2024 and through 2025, the gap between top frontier models (GPT-4 / Claude / Gemini / Mistral / Hermes) collapsed to within a few points on most benchmarks that matter. Open-weight 70B-class models reached parity with the hosted frontier models of 18 months earlier. The "use the best frontier" pitch lost its edge: picking the right teacher is now a routing decision, not an exclusive purchase.
Per-token economics broke.
For agent workloads (code review, document classification, content generation at scale), a per-token bill of $5–$15 per million tokens compounds into five- and six-figure monthly invoices. Companies that started in 2023 with frontier-everything are now migrating tier by tier to local inference. The compile step is exactly that migration: pay the frontier once, run the student forever.
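The arithmetic behind those invoices is simple enough to show directly. The token volume below is an assumed, illustrative figure for a busy agent fleet; only the price range comes from the text above:

```python
# Hypothetical numbers for illustration: a high-volume workload at
# frontier per-token pricing vs. a one-time compile charge.
FRONTIER_PRICE_PER_M = 10.0       # $/million tokens, midpoint of the $5-$15 range
TOKENS_PER_MONTH = 4_000_000_000  # 4B tokens/month for a busy agent fleet (assumed)

monthly_frontier_bill = TOKENS_PER_MONTH / 1_000_000 * FRONTIER_PRICE_PER_M
print(f"frontier: ${monthly_frontier_bill:,.0f}/month")  # prints "frontier: $40,000/month"

# After compiling to a local student, the marginal per-token cost is ~$0;
# only the one-time distillation run is paid.
```

At that assumed volume the frontier bill is $40k/month, every month, while a compiled student amortizes a single teacher run.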
Sovereignty got regulated.
EU AI Act obligations phase in through 2025–2027: rules for general-purpose AI providers apply from August 2025, and most high-risk obligations follow in 2026–2027. Article 50 demands provenance for AI-generated content. NIST AI RMF, ISO/IEC 42001, and the FTC's deceptive-pattern rulemaking all converge on the same requirement: you must be able to prove what your AI did. A signed receipt chain over a local artifact is the cleanest possible answer to that requirement.
The compounding pressure
Each of the three forces alone justifies a niche product. Together they justify a primitive. A team that adopted a frontier API for a real workload in 2024 hits all three within 12 months: the model that was special is now one of five equivalent options; the bill that was small is now hard to justify; the auditor is now asking how the AI made that decision.
The compile step solves all three with one operation. Compile takes whichever frontier is cheapest this quarter and uses it as a teacher. Compile turns the per-token bill into a one-time charge. Compile emits a signed manifest the auditor can verify.
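The receipt-chain idea can be sketched in a few lines. This is an illustrative HMAC-SHA256 chain, not kolm's actual manifest format: every field name, payload, and key below is an assumption made for the example.

```python
import hashlib
import hmac
import json

def append_receipt(chain, payload, key: bytes):
    """Append a receipt whose MAC covers the payload plus the previous MAC,
    so tampering with any earlier link invalidates every later one."""
    prev = chain[-1]["mac"] if chain else ""
    msg = json.dumps(payload, sort_keys=True) + prev
    mac = hmac.new(key, msg.encode(), hashlib.sha256).hexdigest()
    chain.append({"payload": payload, "mac": mac})
    return chain

def verify_chain(chain, key: bytes) -> bool:
    """Re-derive each MAC in order; any mismatch means the chain was altered."""
    prev = ""
    for link in chain:
        msg = json.dumps(link["payload"], sort_keys=True) + prev
        expected = hmac.new(key, msg.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, link["mac"]):
            return False
        prev = link["mac"]
    return True

key = b"audit-key"  # placeholder; real deployments manage keys properly
chain = []
append_receipt(chain, {"step": "compile", "teacher": "claude-opus"}, key)
append_receipt(chain, {"step": "eval", "k_score": 0.95}, key)
assert verify_chain(chain, key)

chain[0]["payload"]["teacher"] = "other"  # tamper with history...
assert not verify_chain(chain, key)       # ...and verification fails
```

The point for the auditor is the chaining: each receipt's MAC incorporates the previous MAC, so the whole history stands or falls together.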
"We have been moving from frontier APIs to local-first inference for our high-volume tasks not because frontier got worse, but because $40k/mo for token spend wasn't justifiable when an open-weight 8B model fine-tuned on our data hit 95% of the same quality." common pattern, 2025–2026
Why "now" and not 2024
Three things had to be true for the compile model to work, and they all became true within 18 months of each other:
Open-weight bases got good enough.
Qwen2.5, Llama-3, Hermes-3, Phi-3 (all released within an 18-month window) finally cleared the threshold where a LoRA-distilled student of a frontier teacher actually wins on the user's task. Pre-2024, the floor was too low. The compile step would have produced a sub-quality artifact.
Phones became real inference targets.
iOS Neural Engine and Android NPUs in 2025 silicon (Apple A18, Snapdragon 8 Gen 4, Tensor G4) cleared the bar where a 3B-class INT4 model runs at usable speed for agentic workloads, not just chat. The deployment surface that "compile-and-run-anywhere" needs finally exists.
Cryptographic provenance reached general expectation.
Code-signing in CI/CD (Sigstore, SLSA), supply-chain attestation (in-toto), reproducible builds: every adjacent area of software shipped this in 2023–2025. The pattern is now familiar enough that demanding the same for AI outputs is a question of when, not whether.
What this means for you
If you are operating a real LLM workload today, you will hit at least one of those three forces in the next 12 months. The earliest movers are already migrating. The compile step is the route they take.
# the migration, in one command
kolm compile "$YOUR_TASK" \
  --examples ./examples.jsonl \
  --teacher anthropic/claude-opus-4-7 \
  --base qwen2.5-7b

# what you get back:
# - .kolm file, ≤3 GB
# - K-score over your held-out set
# - HMAC-SHA256 receipt chain
# - $0 per-token from this point on
The window
Frontier capability is going to keep commoditizing. Token bills are going to keep growing as agentic patterns scale. Regulators are not going to relax. Whatever the answer to "AI ownership at scale" looks like, this is the year it stops being theoretical.