Dogfood · 2026-05-09 · 10 min read

Our marketing runs on four distilled models.

The site you’re reading this on uses AI for nine features. Five of them are powered by four .kolm Specialists running on a $99/mo GPU sliver. The other four stay on the frontier, where they belong. This is the breakdown: which Specialists, why we picked them, what they cost, and where we kept the frontier.

By Kolmogorov · Tags: dogfood, cost, production

The nine features that need a model.

We audited every place AI touches the kolm marketing pipeline: a two-week review, one engineer, no tools you don’t already have access to.

Of those, five are 80%+ deterministic in shape. They take an input that fits a recognizable mold and produce an output that fits a recognizable mold. Those are the candidates for a Specialist. The other four are open-domain reasoning that we keep on the frontier.
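A concrete way to read "deterministic in shape": an output either fits the mold or it doesn't, and the check is mechanical. This is a hypothetical sketch of such a shape check for the TL;DR feature; the field names and the 40–80 word band are illustrative assumptions, not our production verifier.

```python
# Hypothetical shape check: does a TL;DR candidate fit a recognizable mold?
# Bounds and rules are illustrative, not the production verifier.

def fits_tldr_shape(output: str, min_words: int = 40, max_words: int = 80) -> bool:
    """Accept a TL;DR only if it sits inside a recognizable output mold."""
    words = output.split()
    if not (min_words <= len(words) <= max_words):
        return False
    # A TL;DR should be prose, not a bullet dump or a heading stub.
    if output.lstrip().startswith(("-", "*", "#")):
        return False
    return True

draft = " ".join(["word"] * 60)           # a 60-word candidate
assert fits_tldr_shape(draft)             # inside the 40-80 word band
assert not fits_tldr_shape("too short")   # outside the mold
```

When a feature's inputs and outputs both pass this kind of mechanical test most of the time, it is a Specialist candidate; when they don't, it stays on the frontier.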

The four Specialists.

What we actually compiled, what each is doing, and what we paid for it.

Specialist            | Powers                                         | Base        | K-score | Compile cost | Per-call cost
--------------------- | ---------------------------------------------- | ----------- | ------- | ------------ | -------------
tldr-2.1.kolm         | article TL;DRs (#2)                            | Qwen-2.5-3B | 93.4    | $220         | $0.0002
cold-opener-1.4.kolm  | cold-email openers (#3)                        | Qwen-2.5-3B | 88.7    | $340         | $0.0003
reply-router-1.0.kolm | inbound classification (#4) + demo triage (#5) | Phi-3-mini  | 94.8    | $180         | $0.0001
chan-digest-1.2.kolm  | Slack/Discord summaries (#6)                   | Qwen-2.5-3B | 90.1    | $280         | $0.0002

One-time compile total: $1,020. Marginal cost of all four running on one $99/mo H100-sliver from a tier-2 GPU host: $99/mo, regardless of volume. If we 10x our outbound volume tomorrow, the cost stays at $99.

For comparison, the February Sonnet bill for all nine features at our current volume was $1,860/mo, and the four features we compiled accounted for the bulk of it. At the net saving of $1,521/mo ($1,860 before, $339 after), the four artifacts paid back their compile cost in 20 days.

Where we kept the frontier.

We did not compile features 1, 7, 8, or 9. Each for a specific reason.

1. ROI calc descriptor.

The descriptor changes every time a user types in different numbers. It would be easy to compile, but the volume is so low (a few hundred calls a month) that compiling it would be a vanity flex, not a margin recovery. The frontier is the right answer for low-volume, open-input features.
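The decision rule this section argues for can be written down. A sketch, with assumed numbers: the per-call frontier price and the feature's share of the sliver lease are illustrative, not measured.

```python
# Sketch of the compile/don't-compile rule: compile only when a feature's
# monthly frontier spend exceeds what its share of the sliver lease would
# cost. Both figures below are illustrative assumptions.

def should_compile(calls_per_month: int,
                   frontier_per_call: float,
                   sliver_share: float = 25.0) -> bool:
    """Compile when frontier spend beats the feature's share of the lease."""
    return calls_per_month * frontier_per_call > sliver_share

# The ROI calc at a few hundred calls a month stays on the frontier:
assert not should_compile(300, frontier_per_call=0.003)
# A high-volume feature clears the bar easily:
assert should_compile(100_000, frontier_per_call=0.003)
```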

7. Doc-rewrite for new model releases.

This is open-ended editing. The shape of the input is "old paragraph plus list of model-release notes"; the shape of the output is "new paragraph that gets the technical claims right." It needs frontier-class reasoning that a 3B distill cannot fake. We pay frontier rates here, and we are happy to.

8. Comparator-page diff.

Same shape as #7. Open-domain editing of nuanced positioning. The frontier earns its margin here.

9. Long-form drafting.

The essay you’re reading was drafted by a human, edited by Opus, and reviewed by a human again. We have not even tried to compile this. It would be premature distillation; the task is too open and the volume is too low. Long-form essay drafting is a frontier task.

The point of compiling isn’t to compile everything. It’s to stop paying frontier prices for the shapes the frontier doesn’t need to handle.

The actual monthly bill.

February 2026, before any compile work:

# before
sonnet calls (all 9 features)             $1,860 / mo
total                                     $1,860 / mo

May 2026, four Specialists shipped:

# after
sonnet/opus calls (features 1, 7, 8, 9)   $  240 / mo
H100-sliver lease (one box, four .kolm)   $   99 / mo
sigstore rekor anchoring (free)           $    0 / mo
total                                     $  339 / mo

Net: $1,521/mo recovered, or $18,252/yr. One-time compile cost was $1,020. Payback: 20 days. Annualized return on the $1,020 spend: ~1,790%.
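The payback math above, spelled out so you can check it. All figures are taken from the bills in this section; the only simplification is a 30-day month.

```python
# The payback arithmetic from the bills above. Figures are from this post;
# the 30-day month is a simplification.

before, after = 1860.0, 339.0   # USD/mo, February vs May
compile_cost = 1020.0           # one-time

monthly_savings = before - after
assert monthly_savings == 1521.0
assert round(monthly_savings * 12) == 18252                 # recovered per year
assert round(compile_cost / monthly_savings * 30) == 20     # payback, in days
assert round(monthly_savings * 12 / compile_cost * 100) == 1789  # quoted as ~1,790%
```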

The numbers above are not impressive in absolute dollars; we’re a small company and our volumes are small. They matter because they’re shaped the same way at every scale: a Series A AI-native SaaS doing 4M calls a month gets the same curve at 100x the magnitude. The point is the slope.

The features we tried to compile and rolled back.

Two failures worth naming.

The "rephrase to brand voice" Specialist.

We tried to compile a "make this sound like kolm marketing voice" Specialist. The verifier was a BLEU-based reference-output check, and the K-score stalled at 78, below our gate. The outputs were technically correct and stylistically wrong. Brand voice turns out to be open-domain in a way the verifier couldn’t capture. We pulled it back to the frontier for now.
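Why an n-gram reference check passes outputs that are on-topic but off-voice: the metric rewards word overlap with the reference, and voice lives somewhere else. A toy illustration, using unigram precision as a stand-in for BLEU; the sentences are invented and this is not the verifier we actually ran.

```python
# Toy stand-in for a BLEU-style check: unigram precision against a reference.
# Illustrates the failure mode, not our production verifier.

def unigram_precision(candidate: str, reference: str) -> float:
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(w in ref for w in cand) / len(cand)

reference = "compile the shape once and stop paying frontier prices for it"
flat      = "compile the shape once and stop paying frontier prices for it today"
on_voice  = "distill it once and the frontier never sees that shape again"

# The flat rewrite scores near-perfect; the on-voice one scores worse,
# even though a human would prefer it. Taste is invisible to the metric.
assert unigram_precision(flat, reference) > unigram_precision(on_voice, reference)
```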

The "draft a tweet from this article" Specialist.

Compile worked, K-score 91, the model produced acceptable tweets, but not good tweets. The line between "fine" and "fire" turns out to be exactly the open-domain creative judgment we were trying to compile away. Tweets stay on the frontier. Maybe forever.

Both failures are honest examples of where compiling does not pay back. Distillation transfers behavior; it does not transfer taste. Anywhere taste is the metric, the frontier wins for now.

The build pipeline.

For the four Specialists that did ship, the pipeline looks like this:

# nightly job, runs at 04:00 UTC
$ kolm compile "draft a 60-word TLDR for the article body provided" \
    --examples ./marketing/examples/tldr-2026-05.jsonl \
    --eval ./marketing/eval/tldr-holdout.jsonl \
    --base qwen-2.5-3b-instruct \
    --gate K=88

 verifier synthesized    schema + length-band + topic-overlap
 k-sampled teacher       claude-opus-4-7 (k=4, n=200)
 LoRA fit                12 min on H100, loss 0.21 → 0.06
 K-score                 93.4 (T 95.1 / C 92.0 / L 92.8)
 signed                  tldr-2.1.kolm  (1.4GB)

$ kolm compare ./tldr-2.0.kolm ./tldr-2.1.kolm
 K-score          93.4 ↑ from 92.8 (+0.6)
 regressions      0
 recommended      promote new artifact

$ kolm publish ./tldr-2.1.kolm --to=marketing-prod
 rolled out behind 5% traffic
 hold for 24h, auto-promote on no K-score regression

The whole pipeline is six commands. One CI job, ten minutes a night. Specialists that fail the gate stay at the prior version; Specialists that pass roll out behind a 5% canary. We have not had a regression incident since switching.
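The promote/hold decision the nightly job makes can be sketched as a small state function. The gate, the 5% canary, and the 24-hour window are from the rollout output above; the function itself and its names are ours, not the kolm CLI's internals.

```python
# Sketch of the canary promotion logic described above. The gate, 5% canary,
# and 24h hold come from the rollout output; the function is illustrative.

def canary_decision(new_k: float, prior_k: float,
                    gate: float, hours_held: float) -> str:
    if new_k < gate:
        return "stay-on-prior"   # failed the compile gate outright
    if hours_held < 24:
        return "hold"            # 5% traffic, keep watching
    if new_k < prior_k:
        return "stay-on-prior"   # K-score regression during the canary
    return "promote"

assert canary_decision(93.4, 92.8, gate=88, hours_held=25) == "promote"
assert canary_decision(93.4, 92.8, gate=88, hours_held=3)  == "hold"
assert canary_decision(86.0, 92.8, gate=88, hours_held=25) == "stay-on-prior"
```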

Why this is the most useful blog post we could write.

Most engineering posts are arguments. This one is a checklist. If you read it as a checklist of "what your AI-native SaaS could be doing instead of paying Sonnet rates for ticket-tagging," it has done its job.

If you want the same setup at your company: audit every feature that touches a model, flag the ones whose inputs and outputs fit a recognizable shape, compile those behind a K-score gate, roll each artifact out behind a 5% canary, and leave the open-domain, low-volume work on the frontier.

Three months in, your AI-feature line item is a fraction of what it was, your gross margin curve has bent, and your auditor has a receipt for every customer-facing call. That’s the dogfood. That’s how the building gets used.