The nine features that need a model.
Audit of every place AI touches the kolm marketing pipeline. Two-week review, one engineer, no tools you don’t already have access to.
- 1. ROI calc descriptor. The text under /roi after a user types in their numbers.
- 2. Article TL;DR generator. Each article hero contains a 60-word lede; we draft them with AI.
- 3. Outbound email opener. First line of the cold-email cadence to design partners (Smartlead).
- 4. Reply classifier. Routes inbound mail to "yes / maybe / unsubscribe / spam / other-team."
- 5. Demo-request triage. Reads form submissions, infers tier interest, drafts the first reply.
- 6. Slack/Discord summarizer. Daily digest of #partners and #ops channels.
- 7. Doc-rewrite for the latest model release. When a frontier vendor ships new models or capabilities, we rewrite the affected article paragraphs.
- 8. Comparator-page diff. When a competitor updates pricing or features, we generate a "what changed" delta.
- 9. Long-form drafting. The cornerstone essays you’re reading.
Of those, five are 80%+ deterministic in shape. They take an input that fits a recognizable mold and produce an output that fits a recognizable mold. Those are the candidates for a Specialist. The other four are open-domain reasoning that we keep on the frontier.
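Here is the same audit as data, as a quick sketch you can sort and filter. The shape tags follow the split above; the volume tags are rough illustrations, not measured numbers, and none of this is kolm tooling.

```python
# The nine features, tagged by output shape. Volume tags are illustrative only.
FEATURES = [
    ("roi-calc-descriptor",  "open-domain",   "low"),     # 1
    ("article-tldr",         "deterministic", "high"),    # 2
    ("cold-email-opener",    "deterministic", "high"),    # 3
    ("reply-classifier",     "deterministic", "high"),    # 4
    ("demo-request-triage",  "deterministic", "medium"),  # 5
    ("channel-digest",       "deterministic", "medium"),  # 6
    ("doc-rewrite",          "open-domain",   "low"),     # 7
    ("comparator-diff",      "open-domain",   "low"),     # 8
    ("long-form-drafting",   "open-domain",   "low"),     # 9
]

specialist_candidates = [f for f, shape, _ in FEATURES if shape == "deterministic"]
frontier_features     = [f for f, shape, _ in FEATURES if shape == "open-domain"]
print(len(specialist_candidates), len(frontier_features))  # 5 candidates, 4 stay on the frontier
```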
The four Specialists.
What we actually compiled, what each is doing, and what we paid for it.
| Specialist | Powers | Base | K-score | Compile cost | Per-call cost |
|---|---|---|---|---|---|
| tldr-2.1.kolm | article TL;DRs (#2) | Qwen-2.5-3B | 93.4 | $220 | $0.0002 |
| cold-opener-1.4.kolm | cold-email openers (#3) | Qwen-2.5-3B | 88.7 | $340 | $0.0003 |
| reply-router-1.0.kolm | inbound classification (#4) + demo triage (#5) | Phi-3-mini | 94.8 | $180 | $0.0001 |
| chan-digest-1.2.kolm | Slack/Discord summaries (#6) | Qwen-2.5-3B | 90.1 | $280 | $0.0002 |
One-time compile total: $1,020. Ongoing serving cost for all four, running on one $99/mo H100-sliver from a tier-2 GPU host: $99/mo, regardless of volume. If we 10x our outbound volume tomorrow, the cost stays at $99.
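To make the "regardless of volume" point concrete, a toy comparison of the flat lease against per-call frontier pricing. The lease figure is the real number above; the $0.004-per-call frontier rate is an assumption for illustration, not our actual Sonnet pricing.

```python
# Toy comparison: flat H100-sliver lease vs. per-call frontier pricing.
# LEASE_PER_MONTH is the real figure above; FRONTIER_PER_CALL is an assumed
# illustrative rate, not our actual Sonnet pricing.
LEASE_PER_MONTH   = 99.00
FRONTIER_PER_CALL = 0.004

for calls in (50_000, 500_000, 5_000_000):  # each step is a 10x in volume
    print(f"{calls:>9,} calls/mo: specialists $ {LEASE_PER_MONTH:>6,.0f}  "
          f"vs frontier ~$ {calls * FRONTIER_PER_CALL:>6,.0f}")
```

The Specialist line stays flat; the frontier line scales with volume. That difference is the whole economic argument.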
For comparison, the February bill with all nine features on Sonnet at our current volume was $1,860/mo. The four artifacts paid back their compile cost in about 20 days.
Where we kept the frontier.
We did not compile features 1, 7, 8, or 9. Each for a specific reason.
1. ROI calc descriptor.
The descriptor changes every time a user types in different numbers. It would be easy to compile, but the volume is so low (a few hundred calls a month) that compiling it would be a vanity flex, not a margin recovery. The frontier is the right answer for low-volume, open-input features.
7. Doc-rewrite for new model releases.
This is open-ended editing. The shape of the input is "old paragraph plus a list of model-release notes"; the shape of the output is "new paragraph that gets the technical claims right." It needs a frontier-class reasoning capacity that a 3B distill cannot fake. We pay frontier rates here and we are happy to.
8. Comparator-page diff.
Same shape as #7. Open-domain editing of nuanced positioning. The frontier earns its margin here.
9. Long-form drafting.
The essay you’re reading was drafted by a human, edited by Opus, and reviewed by a human again. We have not even tried to compile this. It would be premature distillation; the task is too open and the volume is too low. Long-form essay drafting is a frontier task.
The point of compiling isn’t to compile everything. It’s to stop paying frontier prices for the shapes the frontier doesn’t need to handle.
The actual monthly bill.
February 2026, before any compile work:
```
# before
sonnet calls (all 9 features)              $ 1,860 / mo
total                                      $ 1,860 / mo
```
May 2026, four Specialists shipped:
```
# after
sonnet/opus calls (features 1, 7, 8, 9)    $   240 / mo
H100-sliver lease (one box, four .kolm)    $    99 / mo
sigstore rekor anchoring (free)            $     0 / mo
total                                      $   339 / mo
```
Net: $1,521/mo recovered, or $18,252/yr. One-time compile cost was $1,020. Payback: 20 days. Annualized return on the $1,020 spend: ~1,790%.
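The arithmetic behind those four figures, spelled out using only the numbers quoted above:

```python
# Payback arithmetic, using only the figures quoted above.
before_per_month = 1_860.00   # Feb 2026, everything on Sonnet
after_per_month  =   339.00   # May 2026, four Specialists shipped
compile_cost     = 1_020.00   # one-time, all four artifacts

recovered_per_month = before_per_month - after_per_month         # 1,521.00
recovered_per_year  = recovered_per_month * 12                   # 18,252.00
payback_days        = compile_cost / (recovered_per_month / 30)  # ~20.1
annualized_return   = recovered_per_year / compile_cost          # ~17.9x, i.e. ~1,790%

print(recovered_per_month, recovered_per_year, round(payback_days), f"{annualized_return:.1%}")
```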
The numbers above are not impressive in absolute dollars; we're a small company and our volumes are small. They matter because they're shaped the same way at every scale. A Series A AI-native SaaS doing 4M calls a month gets the same shape at roughly 100x the magnitude. The point is the slope.
The features we tried to compile and rolled back.
Two failures worth naming.
The "rephrase to brand voice" Specialist.
We tried to compile a "make this sound like kolm marketing voice" Specialist. The verifier was a BLEU-based reference output check. The K-score gated at 78. The outputs were technically correct and stylistically wrong. Brand voice turns out to be open-domain in a way the verifier couldn’t capture. We pulled it back to the frontier for now.
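For a sense of why that gate was the wrong instrument: a BLEU-style reference check rewards n-gram overlap with a reference output and nothing else. A minimal sketch of that shape of check, assuming nltk and two made-up strings; this is not kolm's actual verifier code.

```python
# Minimal BLEU-style reference check. High surface overlap scores well even
# when the voice is wrong; taste never enters the number.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Stop paying frontier prices for shapes the frontier does not need to handle.".split()
candidate = "Stop paying frontier rates for shapes that the frontier does not need to handle.".split()

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU ≈ {score:.2f}")  # near-verbatim rewrites pass; on-brand rewrites may not
```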
The "draft a tweet from this article" Specialist.
Compile worked, K-score 91, the model produced acceptable tweets, but not good tweets. The line between "fine" and "fire" turns out to be exactly the open-domain creative judgment we were trying to compile away. Tweets stay on the frontier. Maybe forever.
Both failures are honest examples of where compiling does not pay back. Distillation transfers behavior; it does not transfer taste. Anywhere taste is the metric, the frontier wins for now.
The build pipeline.
For the four Specialists that did ship, the pipeline looks like this:
```
# nightly job, runs at 04:00 UTC
$ kolm compile "draft a 60-word TLDR for the article body provided" \
    --examples ./marketing/examples/tldr-2026-05.jsonl \
    --eval     ./marketing/eval/tldr-holdout.jsonl \
    --base     qwen-2.5-3b-instruct \
    --gate     K=88
→ verifier synthesized    schema + length-band + topic-overlap
→ k-sampled teacher       claude-opus-4-7 (k=4, n=200)
→ LoRA fit                12 min on H100, loss 0.21 → 0.06
→ K-score 93.4            (T 95.1 / C 92.0 / L 92.8)
→ signed                  tldr-2.1.kolm (1.4GB)

$ kolm compare ./tldr-2.0.kolm ./tldr-2.1.kolm
→ K-score 93.4            ↑ from 92.8 (+0.6)
→ regressions             0
→ recommended             promote new artifact

$ kolm publish ./tldr-2.1.kolm --to=marketing-prod
→ rolled out behind 5% traffic
→ hold for 24h, auto-promote on no K-score regression
```
The whole pipeline is six commands. One CI job, ten minutes a night. Specialists that fail the gate stay at the prior version; Specialists that pass roll out behind a 5% canary. We have not had a regression incident since switching.
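The promotion rule at the end of that run fits in a few lines. This is a sketch of the decision only, with made-up function and field names; kolm publish handles it internally.

```python
# Canary promotion rule, as a sketch. The function and its parameters are
# illustrative, not kolm's actual API.
def should_promote(gate: float, prior_k: float, canary_k: float, hold_hours: float) -> bool:
    passed_gate      = canary_k >= gate      # e.g. --gate K=88
    no_regression    = canary_k >= prior_k   # measured on the 5% canary slice
    held_long_enough = hold_hours >= 24
    return passed_gate and no_regression and held_long_enough

# tldr-2.1.kolm: gate 88, prior 92.8, canary 93.4, held 24h -> promote
print(should_promote(gate=88, prior_k=92.8, canary_k=93.4, hold_hours=24))  # True
```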
Why this is the most useful blog post we could write.
Most engineering posts are arguments. This one is a checklist. If you read it as a checklist of "what your AI-native SaaS could be doing instead of paying Sonnet rates for ticket-tagging," it has done its job.
If you want the same setup at your company:
- Audit: list every AI feature in your product. Tag each as deterministic-shape or open-domain. Tag each by call volume.
- Pick: the highest-volume, most-deterministic feature is your first compile. Skip taste-driven features.
- Compile: 50-200 example pairs, one verifier, one base model, one CI run. kolm compile handles the seven stages. (A sketch of the example-pair format follows this list.)
- Canary: route 5% of traffic to the Specialist; watch the K-score and the customer-facing metric. Hold 24h. Promote.
- Repeat: the next-highest-volume feature.
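As promised in the Compile step: a sketch of what one example pair can look like on disk. The input/output field names are our illustration; the only thing fixed above is that the examples live in a JSONL file of pairs.

```python
# One line of a hypothetical examples JSONL for the TL;DR Specialist.
# Field names are illustrative; the pipeline above only fixes "JSONL of example pairs".
import json

pair = {
    "input":  "<full article body>",
    "output": "<human-approved 60-word TL;DR>",
}
print(json.dumps(pair))  # one object per line, 50-200 lines per compile
```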
Three months in, your AI-feature line item is a fraction of what it was, your gross margin curve has bent, and your auditor has a receipt for every customer-facing call. That’s the dogfood. That’s how the building gets used.