Every AI-native SaaS has a handful of features that do the same shape of thing every time (classify, extract, reformat, score). They make 60% of the calls and burn 80% of the model bill. kolm compiles each one once and ships you a .kolm artifact you serve from your own infra at fixed cost.
Pricing diligence in the fundraise usually exposes this: AI features that should be 80%-margin SaaS plumbing end up at 30% because every customer call hits a frontier API. The fix is not a smaller model. The fix is the right artifact.
Typical Sonnet/GPT-class call for a 1.5k-input / 600-output extraction task. Pure variable cost, charged on every customer click.
3B-class Specialist serving the same task. ~130× cheaper on a $99/mo GPU sliver after the one-time compile fee. Latency drops 3-8× alongside.
The teacher-model bill to k-sample & verify a labeled set big enough to LoRA-tune the student. Amortizes over the next million calls.
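The amortization math is short. A sketch using the figures from the worked example further down ($700 compile, $0.04 per frontier call); the $99/mo GPU adds well under $0.001/call at any real volume, so it barely moves the break-even:

```python
# Amortizing the one-time compile fee, using the worked-example
# figures below ($700 compile, $0.04 per frontier call).
COMPILE_FEE = 700.00        # one-time, USD
FRONTIER_PER_CALL = 0.04    # Sonnet/GPT-class, 1.5k in / 600 out

print(f"fee per call over 1M calls: ${COMPILE_FEE / 1_000_000:.4f}")                # $0.0007
print(f"frontier cost per call:     ${FRONTIER_PER_CALL:.4f}")
print(f"break-even vs frontier:     {COMPILE_FEE / FRONTIER_PER_CALL:,.0f} calls")  # 17,500
```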
Not every feature should be compiled. The criterion is honest: does the task have a verifier? If a deterministic check (regex, JSON schema, AST diff, exact match, BLEU above a cutoff) can distinguish a good output from a bad one, kolm can compile it. A minimal verifier sketch follows the use cases below.
Resume parsing, invoice line-items, contract clauses, support-ticket field-tagging. JSON-schema verifiers run in microseconds; the verifier is the spec.
Intent detection, ticket routing, content moderation, lead scoring. Confusion-matrix verifiers over a held-out set; the K-score gates production.
Tone shifting, summarization at fixed length, translation for the language pairs you ship, code-style normalization. Reference-output verifiers via BLEU/ROUGE thresholds.
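Concretely, a verifier is just a deterministic pass/fail function. A minimal sketch for the ticket-routing case, assuming the model is asked to emit a {"queue": <label>} JSON object; the queue names and output shape are illustrative, not the code kolm actually synthesizes:

```python
import json

# Illustrative label set; a real deployment would use its own 12 queues.
QUEUES = {
    "billing", "refunds", "auth", "outage", "bug", "feature-request",
    "account", "abuse", "enterprise", "onboarding", "legal", "other",
}

def verify(output: str) -> bool:
    """Deterministic verifier: output must be valid JSON with a single
    'queue' field whose value is one of the allowed labels."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(obj, dict)
        and set(obj) == {"queue"}
        and obj["queue"] in QUEUES
    )

assert verify('{"queue": "billing"}')
assert not verify('{"queue": "escalations"}')  # unknown label fails
assert not verify("billing")                   # bare text fails
```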
Bring 50-200 (input, expected) pairs from your prod logs. The compiler synthesizes the verifier, k-samples the teacher, fits a LoRA, runs the K-score gate, and signs the artifact.
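The pairs are one JSON object per line. A hypothetical two-line slice of ticket-router-train.jsonl; the input/expected field names mirror the pairs described above but are not kolm's documented schema:

```
{"input": "I was charged twice for the Pro plan this month", "expected": {"queue": "billing"}}
{"input": "SSO login loops back to the sign-in page", "expected": {"queue": "auth"}}
```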
```
# 1. drop your eval pairs in
$ ls examples/
ticket-router-train.jsonl   # 200 pairs from prod logs
ticket-router-eval.jsonl    # 50 holdout

# 2. compile
$ kolm compile "route a support ticket to one of 12 queues" \
    --examples examples/ticket-router-train.jsonl \
    --eval examples/ticket-router-eval.jsonl \
    --base qwen2.5-3b-instruct
✓ verifier synthesized: schema + label-set match (47 lines)
✓ k-sampled teacher (claude-opus-4-7) on 200 pairs
✓ LoRA fit: 14 min, 3 epochs, loss 0.18 → 0.04
✓ K-score: 94.2 (T 95.8 / C 91.1 / L 99.7)
✓ signed: ticket-router-1.0.0.kolm (1.4GB)

# 3. ship it
$ kolm serve ticket-router-1.0.0.kolm --port 8000 --mcp
▸ serving on http://localhost:8000 (cold start: 1.2s)
```
Worked example: a Series A company processing 4M ticket-routing calls a year. Before kolm, that’s a $160k/yr line item with growth-elastic exposure. After: a fixed $99/mo GPU and a one-time compile.
4M calls × $0.04/call. Variable. Grows with usage. Vendor pricing changes outside your control. Per-call latency 1.4s p50.
$99/mo GPU + $700 one-time compile. Fixed. Latency 180ms p50. K-score 94.2 monitored continuously; regressions block deploys.
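Working the multiple out from those two cards (a sanity check on the ~130× figure above, not a quote):

```python
# Before/after unit economics from the worked example.
calls_per_year = 4_000_000
frontier = calls_per_year * 0.04      # $160,000/yr, variable
gpu = 99 * 12                         # $1,188/yr, fixed
year_one = gpu + 700                  # + one-time compile = $1,888

print(f"year one:     {frontier / year_one:.0f}x cheaper")   # ~85x
print(f"steady state: {frontier / gpu:.0f}x cheaper")        # ~135x
```

Year one carries the compile fee; from year two the ratio settles around the ~130× headline number.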
Same artifact, three places it can run. The CI pipeline produces the artifact; what changes is where the inference happens.
Serve .kolm via kolm serve on a single H100 sliver, A10, L40, or 5090. OpenAI-compatible /v1/chat/completions + MCP. Drop-in replace your existing API client; a client sketch follows these options.
Ship the artifact to the customer’s VPC for their compliance posture. Same artifact, different infra. The receipt chain proves it’s the same model.
Don’t want to operate GPUs at all? Cloud serves your artifacts on managed infra, billed by request, with the same receipt chain. Same gross margin shape.
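For the self-serve option, drop-in means pointing your existing OpenAI SDK at the kolm endpoint. A sketch assuming the artifact name doubles as the model id; check kolm serve's startup output for the real one:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local kolm endpoint instead of
# api.openai.com; the API key is unused locally but the SDK requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="ticket-router-1.0.0",  # assumed model id; check `kolm serve` output
    messages=[{"role": "user", "content": "My invoice shows a duplicate charge"}],
)
print(resp.choices[0].message.content)  # e.g. {"queue": "billing"}
```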
Every shipped Specialist is margin you got back. Compile your three biggest cost-center features and watch the unit economics shift in a quarter.