Engineering · 2026-05-09 · 12 min read

Rent vs. buy compute. Keep the artifact.

Every dollar you spend on Anthropic or OpenAI is a rental fee for a model you do not own and cannot run offline. Proxy those calls through kolm capture and the same dollar also pays for a verified (input, output) dataset. At the threshold you choose, that dataset compiles into a signed local .kolm: a LoRA on an open base model that runs on your hardware, on your network, in perpetuity. The frontier API bill becomes a deposit account.

By Kolmogorov · Tags: capture · distill · on-device

The problem with renting forever.

If you ship anything that calls a frontier model in production, the bill grows linearly with usage. Every prompt is a network round trip and a row in someone else's log. The model can be deprecated on fourteen days' notice. The data leaves your perimeter on every call. The latency floor is two hundred to eight hundred milliseconds even on the best path. None of these are bugs in your stack; they are properties of a rental architecture.

The line item this creates on a finance team's monthly close is unusual: spend goes up and no asset accumulates. Every other large recurring line in a software company, whether cloud hosting, observability, or payments, is either a fungible commodity or accrues real switching cost in the form of integration. Frontier API spend accrues neither. Cancel today and you have nothing to show for it but log entries.

This is the wrong shape for spend that is forecast to keep doubling. The rational move is to convert it.

Most of what your team is paying Anthropic or OpenAI to do is a small set of tasks repeated thousands of times. Each repetition produces a verified label. Save the labels and you can train a model that performs the same tasks for free.

The thesis in one paragraph.

You are already paying the frontier model to do work for you. The output it returns is, under every major provider's terms of service, yours to use. What we add is a thin proxy that captures the verified (input, output, latency) tuple for each call and a compiler that turns the captured corpus into a signed LoRA on an open base. When the LoRA is good enough to retire the proxy on a slice of traffic, you stop paying for that slice. The local artifact compounds in coverage as your traffic does. Every dollar of frontier spend you pass through the proxy is, mechanically, also a deposit into the local model. That is the rent-to-own moment for AI compute.

How capture-and-distill actually works.

There are four moving parts. None of them are exotic.

1. The capture proxy.

Two endpoints, POST /v1/capture/anthropic and POST /v1/capture/openai, accept the same body shape as the upstream provider. You send your customer's API key in a header that we strip before forwarding. The upstream call is unmodified. The response is unmodified. Before returning, we record a tuple of (input, output, model, latency_us, namespace, tenant) to your tenant's observations table. Surface area for the caller is one base-URL change in their codebase.
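In code, the base-URL change is the whole integration. A minimal sketch using the Anthropic Python SDK; the capture base URL shown here is an assumption for illustration, and the request and response bodies are exactly what you would send upstream:

# Minimal sketch: route existing Anthropic SDK traffic through the capture proxy.
# The base_url below is illustrative; use the capture URL issued for your tenant.
# The call body and the response are untouched; the proxy records the
# (input, output, model, latency_us, namespace, tenant) tuple before returning.
import os
from anthropic import Anthropic

client = Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],         # forwarded upstream, stripped before logging
    base_url="https://kolm.ai/v1/capture/anthropic",  # hypothetical capture base URL
)

reply = client.messages.create(
    model="claude-opus-4-0",   # illustrative model id
    max_tokens=400,
    messages=[{"role": "user", "content": "Summarize this support ticket for the on-call engineer."}],
)
print(reply.content[0].text)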

2. The verifier sample.

By default, a configurable fraction of captures is run through k-sample verification (the same primitive that powers verified inference on the compile path) to attach a confidence score to each label. Captures that fail verification are still recorded but tagged so the distiller can quarantine them. The verifier is the brake that stops you from training on outputs the frontier model itself was uncertain about.
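You do not implement the verifier yourself, but the idea is worth making concrete: sample the upstream model k times on the same input and use agreement with the captured output as its confidence score. A toy sketch of that idea, not the production primitive:

# Toy sketch of k-sample verification: re-sample the prompt k times and score
# the captured output by how often the re-samples agree with it. normalize()
# is a crude stand-in; the real comparison is task-aware.
def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def k_sample_confidence(captured_output: str, resamples: list[str]) -> float:
    """Fraction of re-samples that agree with the captured output."""
    agree = sum(1 for s in resamples if normalize(s) == normalize(captured_output))
    return agree / len(resamples)

# Low-confidence captures are tagged and quarantined from the training corpus, not deleted.
print(k_sample_confidence("Refund approved.", ["Refund approved.", "Refund approved.", "Refund denied."]))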

3. The label endpoint.

GET /v1/labels/synthesize-corpus?namespace=&format=jsonl returns the captured pairs as JSONL or parquet, ready to feed to any training framework. Tenant-scoped. Filtered by namespace, optional date range, and optional minimum verifier confidence. This is the format used by the next stage, but the export path lets you take the data anywhere — you are not locked into our trainer.
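A minimal export sketch against that endpoint. The namespace and format parameters are the documented ones; the auth header and the minimum-confidence parameter name are assumptions:

# Minimal sketch: export the captured corpus as JSONL for any trainer.
# The Authorization scheme and the min_confidence parameter name are assumed;
# namespace and format are the query parameters described above.
import os
import requests

resp = requests.get(
    "https://kolm.ai/v1/labels/synthesize-corpus",
    headers={"Authorization": f"Bearer {os.environ['KOLM_API_KEY']}"},
    params={"namespace": "prod-tickets", "format": "jsonl", "min_confidence": 0.8},
    timeout=60,
)
resp.raise_for_status()

with open("corpus.jsonl", "w") as f:
    f.write(resp.text)
print(f"exported {len(resp.text.splitlines())} pairs")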

4. The distill bridge.

POST /v1/specialists/auto-distill takes a namespace, a base model, and a target footprint. At a configurable threshold (default one thousand verified pairs) the bridge fires a LoRA training job, runs the K-score gate on a held-out evaluation slice, and ships back a signed .kolm artifact when the gate passes. From the operator's view it is a single CLI call and a download.
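The same trigger over HTTP, for teams that drive it from CI instead of the CLI. The request field names and the auth header are assumptions based on the description above:

# Minimal sketch: fire a distill run over HTTP rather than the CLI.
# Field names (namespace, base_model, target_footprint) and the auth scheme
# are assumptions; the endpoint and gating behaviour are as described above.
import os
import requests

resp = requests.post(
    "https://kolm.ai/v1/specialists/auto-distill",
    headers={"Authorization": f"Bearer {os.environ['KOLM_API_KEY']}"},
    json={
        "namespace": "prod-tickets",
        "base_model": "phi-3-mini",
        "target_footprint": "single-gpu",   # hypothetical footprint label
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # job reference; the signed .kolm ships back once the K-score gate passes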

The five-line CLI surface that wraps these endpoints:

# point your existing OpenAI / Anthropic SDK at our capture endpoint
kolm capture --provider anthropic --as support-replies --namespace prod-tickets

# inspect what has been captured so far
kolm capture status

# export the corpus for any trainer (returns JSONL)
kolm labels --namespace prod-tickets --format jsonl > corpus.jsonl

# or compile straight to a signed local artifact
kolm distill --namespace prod-tickets --base-model phi-3-mini
kolm inspect ~/.kolm/artifacts/prod-tickets.kolm

A worked example: 50 engineers on Anthropic.

The math only matters at scale. Here is one team-shaped example to make it concrete.

Shape: A fifty-person engineering org running an internal customer-support copilot on Claude Opus through Anthropic's API. Volume: approximately eighty thousand calls per month, average two thousand input tokens and four hundred output tokens per call. Bill: roughly twelve thousand dollars per month at current Opus pricing, climbing as the team grows.

Month | Frontier calls | Captured pairs | Verified pairs | Local LoRA status | Frontier spend
Month 0 | 80,000 | 0 | 0 | None | $12,000
Month 1 | 80,000 | 80,000 | 62,400 (78%) | Below threshold | $12,000
Month 2 | 80,000 | 160,000 | ~125k | Compiled. K=0.86 on 240-pair holdout. 78% Opus quality. | $12,000
Month 3 | ~32,000 | ~32,000 | ~25k | 60% of traffic served locally. Frontier slice is the long tail. | $4,800
Month 4 | ~16,000 | ~16,000 | ~13k | 80% local. LoRA retrained on tail traffic. | $2,400
Month 12 | ~12,000 | ~12,000 | ~10k | 85% local steady-state. Tail is novel cases & escalations. | $1,800

The interesting line is month two. After two months of capture and one distill cycle, this team has a local Phi-3-mini LoRA that scores K=0.86 on a held-out 240-pair eval: a signed gate that says, in plain numbers, "this artifact replicates seventy-eight percent of Opus's behavior on this team's prompts." The artifact runs on a single RTX 5090 at sixty milliseconds per call. The marginal cost per call is electricity.

From month three onward the proxy routes the easy slice of traffic to the local artifact and only escalates the tail to Opus. The frontier bill compresses to the slice that the local model cannot yet handle. The team keeps capturing on that escalated slice, retrains the LoRA on a quarterly cadence, and the local-served fraction grows.
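The routing policy behind those months is simple enough to sketch. A hedged illustration of the escalate-on-low-confidence shape; the two callables stand in for your local .kolm runtime and your proxied frontier client:

# Illustrative routing shape: serve the easy slice locally, escalate the tail.
# `local` and `frontier` are hypothetical callables standing in for the local
# .kolm runtime and the proxied frontier client; escalated calls keep flowing
# through the capture proxy so the next retrain covers them.
from typing import Callable, Tuple

def route(prompt: str,
          local: Callable[[str], Tuple[str, float]],
          frontier: Callable[[str], str],
          confidence_floor: float = 0.8) -> str:
    """Serve locally when the LoRA is confident enough; otherwise escalate."""
    draft, confidence = local(prompt)
    if confidence >= confidence_floor:
        return draft           # electricity-priced path
    return frontier(prompt)    # long-tail path: proxied, captured, billed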

Twelve-month total: ~$48k in frontier spend (down from $144k linear), one signed local artifact owned in perpetuity, sub-hundred-millisecond p50 latency on most calls, no PHI or customer data leaving the perimeter on the local-served slice. The frontier bill paid for the artifact. That is the rent-to-own moment.

The legal question.

The first question every counsel asks is whether training a model on frontier-model outputs is permitted. Three things resolve it cleanly:

Counsel still has to read the relevant TOS for the customer's specific contract tier. We are not a substitute for that. But the structure of capture-and-distill is the structure that the TOS authors clearly contemplated, not a loophole.

The privacy frame: who sees what.

The capture proxy is a network hop. It sees the prompt and the response in transit. Three things make this safer than it sounds:

The privacy model is straightforward: we are a passthrough, the data is yours, the artifact is yours, the receipts let you prove it.

Why the signed receipt matters.

When the artifact ships, it carries a receipt. The receipt is an HMAC-SHA256 chain over (a) the captured corpus content hash, (b) the verifier configuration, (c) the holdout eval set hash, (d) the resulting K-score, (e) the base model identity and the LoRA delta hash, and (f) the build environment manifest. The chain can optionally be anchored to Sigstore Rekor for cross-organization audit.
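Mechanically, a chained HMAC means each field is folded into the tag in order, so altering any one of them changes the final value. A hedged sketch of that shape; the field names, serialization, and keying here are assumptions, not the actual wire format:

# Illustrative HMAC-SHA256 chain over the six receipt fields listed above.
# Serialization, keying and field names are assumptions for illustration only.
import hmac
import hashlib

RECEIPT_FIELDS = [
    "corpus_content_hash",
    "verifier_config",
    "holdout_eval_hash",
    "k_score",
    "base_model_and_lora_delta_hash",
    "build_env_manifest",
]

def receipt_chain(receipt: dict, key: bytes) -> str:
    """Fold each field into the chain; changing any field changes the final tag."""
    tag = b""
    for field in RECEIPT_FIELDS:
        tag = hmac.new(key, tag + field.encode() + str(receipt[field]).encode(), hashlib.sha256).digest()
    return tag.hex()

def verify_receipt(receipt: dict, key: bytes, expected_tag: str) -> bool:
    return hmac.compare_digest(receipt_chain(receipt, key), expected_tag)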

What this buys you in practice: a counsel review can reproduce the exact dataset that produced the exact LoRA that produced the exact eval score. "Did we train on PII" becomes a query against the receipt, not a forensic exercise. "Has this model drifted" becomes a comparison of two receipts. "Is this the model we deployed" becomes a verify call.

The same receipt is the answer to the auditor question that kills most internal-AI projects: what is in this thing. The receipt is the bill of materials. The signature says we did not change it after the fact.

If your local LoRA can answer the question "what data trained you and on what date" with a cryptographic proof, your security review goes from a quarter's worth of meetings to a fifteen-minute conversation.

vs. fine-tuning, vs. observability, vs. RAG.

Three categories of tool look adjacent on the surface. They are not the same product.

vs. OpenAI / Anthropic fine-tuning.

Fine-tuning hosted by the same provider you are renting from gives you a lower per-call rate but no local artifact. The model still lives in their data center, the data still leaves your perimeter, the latency floor is unchanged, and you are still on the deprecation treadmill. We ship a file. They ship a charge code.

vs. observability tools (LangSmith, Langfuse).

Observability captures for diagnostics: trace, debug, replay. The capture is in the right place but the next move is not "train a local model from these traces." There is no verifier on the captures, no distiller, no signed artifact. We are downstream of where they stop.

vs. RAG.

RAG is the right pattern for situations where the relevant knowledge changes on a daily cadence and you want the frontier model's reasoning over fresh documents. Capture-and-distill is the right pattern for tasks that are stable over months and where the marginal call cost is the binding constraint. They compose. A common deployment shape is a local LoRA for the high-volume routine slice and a frontier-RAG fallback for the long tail and novel queries.

capture surface: drop-in proxy
training data: verified pairs (yours)
output artifact: signed .kolm LoRA
where it runs: your hardware
what it costs to call: electricity
what stays at the frontier: long-tail escalations

capture-and-distill, end to end. The proxy compounds. The artifact persists.

Start here.

Two ways in. Smallest commitment: install the CLI, point one of your existing services at the capture endpoint for a week, run kolm capture status to see what your verified pair count looks like. There is no charge for capture under a hundred thousand pairs per month. Larger commitment: book a thirty-minute design-partner call and we will scope a namespace plan with your team, including the bring-your-own-VPC option if your data residency requires it.

npm i -g github:sneaky-hippo/kolmogorov-stack
kolm config base https://kolm.ai
kolm login

# 30 seconds in:
kolm capture --provider anthropic --as first-run

The capture endpoint is what makes the rest possible. Once your existing traffic is flowing through it, every other piece (the labels endpoint, the distill bridge, the receipt, the K-score) is one CLI call away. The first call you proxy through the capture endpoint is the first dollar that goes from rent to deposit.