Why the cloud-AI line is the wrong line.
If you ship anything that calls a frontier model in production, you have lived this. The bill is linear in usage. Every prompt is a network round trip and a row in someone else's log. The model gets deprecated with 14 days' notice. The data leaves the building on every call. The latency floor is 200-800 ms per inference even on the best-case path. None of these are bugs in your stack. They are properties of the architecture: the model lives in a third-party data center, and the data has to come to it.
For about half of the AI work being done today, that architecture is fine. For the other half — anything touching health records, financial transactions, internal codebases, the user's own photos and messages, anything regulated, anything privacy-sensitive — it is a slow-motion compliance crisis. The privacy policy says one thing. The audit log on the frontier vendor side says another. The user trusted you, not the vendor.
The fix is the inversion: compile in the cloud, run on the device. You pay for the frontier model API at compile time, in a controlled environment, on data flows you have already approved. You get back a single signed file. The file goes to the device. The device runs the file. Frontier-class intelligence happens locally, on the user's own data, with zero runtime egress.
The AI compiler is the build step that makes this practical. It is not a wrapper, not a vendor SDK, not an ML platform. It is a deterministic process that turns a higher-level description into a lower-level executable.
What an AI compiler actually does.
The word "compiler" is precise. gcc compiles C source into a binary that runs anywhere a CPU is. kolm compiles an AI task — a description, examples, evaluations, your data, and a frontier API key — into a binary that runs anywhere a small open-weight model can be loaded: phone, laptop, server, browser, edge box.
The input is what you would otherwise put in a prompt template, plus a corpus and a budget. The output is a .kolm file: a signed zip containing a base model, a LoRA adapter, a recipe pack of deterministic-token drafts, a multimodal sqlite-vec index of the corpus, a synthesized verifier, the held-out test set, a manifest, and an HMAC signature anchored to the public Kolmogorov registry. All seven components are content-hashed; tampering with any one of them invalidates the signature.
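To make that layout concrete, here is a minimal sketch of how a runtime could check such an artifact before loading it. The file names, manifest fields, and key handling are illustrative assumptions, not the published .kolm format.

```python
# Illustrative sketch only: component names, manifest fields, and key handling
# are assumptions, not the actual .kolm specification.
import hashlib, hmac, json, zipfile

def verify_artifact(path: str, registry_key: bytes) -> bool:
    with zipfile.ZipFile(path) as zf:
        manifest_bytes = zf.read("manifest.json")
        manifest = json.loads(manifest_bytes)
        # 1. Every component's content hash must match the manifest.
        for name, expected in manifest["hashes"].items():  # e.g. "adapter.lora", "recall.db"
            if hashlib.sha256(zf.read(name)).hexdigest() != expected:
                return False
        # 2. The HMAC over the manifest must match the registry-anchored key.
        signature = zf.read("signature.hmac").decode().strip()
        expected_sig = hmac.new(registry_key, manifest_bytes, hashlib.sha256).hexdigest()
        return hmac.compare_digest(signature, expected_sig)
```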
The frontier model is the teacher. Its k-sample output, deterministically verified, is what shapes the LoRA. Its observed deterministic patterns are what fill the recipe pack. The base is whichever open-weight model best matches the task — Qwen 2.5, Llama 3, Phi-3, or Hermes-3, in the 3B-7B range. Quantized to INT4, it lands in the 1.5-4 GB band. Fast enough to run interactively on every phone shipped since 2021.
Four stages of kolm compile.
Inside the compiler, four engines run in sequence. The user only ever sees the orchestrator.
1. Recall.
Your corpus — text, images, audio, video, PDFs — is embedded with magika-detected modality routing (bge-m3 for text, clip-vit-large for images, whisper + clap for audio, scene-detect + clip + ASR for video, unstructured + bge-m3 for PDFs). The result is a sqlite-vec index that ships inside the artifact. At compile time, every k-sample call is grounded in the top-k recalled chunks. At runtime, the same index is queried locally — never over a wire.
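In outline, that indexing pass is a routing table. The sketch below uses extension-based routing and a stub embedder in place of magika content detection and the real embedders, and a plain SQLite table in place of the sqlite-vec virtual table that ships in the artifact.

```python
# Sketch of compile-time modality routing. Extension routing and fake_embed
# stand in for magika detection and the embedders named above; a plain table
# stands in for the sqlite-vec virtual table.
import sqlite3
from pathlib import Path

ROUTES = {
    ".txt": "bge-m3", ".md": "bge-m3",
    ".png": "clip-vit-large", ".jpg": "clip-vit-large",
    ".wav": "whisper+clap", ".mp3": "whisper+clap",
    ".mp4": "scene-detect+clip+asr",
    ".pdf": "unstructured+bge-m3",
}

def fake_embed(path: Path, embedder: str) -> bytes:
    # Placeholder: a real build calls the named embedder and stores a float vector.
    return bytes(32)

def index_corpus(corpus_dir: str, db_path: str) -> None:
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS chunks (path TEXT, embedder TEXT, vec BLOB)")
    for path in Path(corpus_dir).rglob("*"):
        embedder = ROUTES.get(path.suffix.lower())
        if embedder is None or not path.is_file():
            continue
        db.execute("INSERT INTO chunks VALUES (?, ?, ?)",
                   (str(path), embedder, fake_embed(path, embedder)))
    db.commit()
```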
2. Distill.
For every example in your training set, the compiler asks the frontier model k times (default k=8). The k samples are scored by a deterministic verifier synthesized from your seed examples and your held-out tests. The winner becomes the labeled output. The labels feed a LoRA fine-tune of the base model. This is what makes the artifact yours: it learns from a frontier-grade teacher whose outputs were verified before they trained anything.
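The loop at the heart of Distill is small enough to sketch. Here teacher and verify are stand-ins for the frontier API call and the synthesized verifier, and the real compiler also grounds each call in recalled context.

```python
# Sketch of k-sample verified labeling. teacher() and verify() stand in for
# the frontier API call and the synthesized deterministic verifier.
from typing import Callable, Optional

def label_example(prompt: str,
                  teacher: Callable[[str], str],
                  verify: Callable[[str, str], float],
                  k: int = 8) -> Optional[str]:
    candidates = [teacher(prompt) for _ in range(k)]        # k frontier samples
    best = max(candidates, key=lambda out: verify(prompt, out))
    if verify(prompt, best) <= 0.0:
        return None          # nothing passed the verifier: the example is dropped
    return best              # becomes the label that feeds the LoRA fine-tune

# pairs = [(p, label_example(p, teacher, verify)) for p in prompts]
```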
3. Decompose.
The compiler watches every accepted output during Distill and extracts the deterministic-token subsequences — the parts of the answer that are entirely determined by the prefix, like JSON keys, function signatures, opening braces, common phrasings. These become the recipe pack: a registry-indexed table of (prefix-shape, token) pairs that the runtime consults during decoding. Hits are free; misses fall back to the base model. Speculative decoding without a draft model.
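A toy version of that extraction and lookup is below, assuming the "prefix-shape" is approximated by the last few tokens of the prefix; the real registry-indexed pack is richer than a flat dictionary.

```python
# Toy recipe pack: keep only (prefix, next-token) pairs that were identical
# across every accepted output. A trailing-token window approximates the
# "prefix-shape" described above.
from collections import defaultdict
from typing import Dict, Optional, Tuple

WINDOW = 4

def build_recipe_pack(accepted_outputs: list) -> Dict[Tuple[int, ...], int]:
    seen = defaultdict(set)
    for tokens in accepted_outputs:                 # each item: a list of token ids
        for i in range(WINDOW, len(tokens)):
            seen[tuple(tokens[i - WINDOW:i])].add(tokens[i])
    return {prefix: nxt.pop() for prefix, nxt in seen.items() if len(nxt) == 1}

def draft_next(pack: Dict[Tuple[int, ...], int], prefix: list) -> Optional[int]:
    # Hit: the token is drafted for free. Miss: fall back to the base model.
    return pack.get(tuple(prefix[-WINDOW:]))
```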
4. Run.
The four components are sealed into a .kolm file with a manifest and a signature. kolm run artifact.kolm loads the base, applies the LoRA, mounts the recall index, consults the recipe pack at every token. The runtime is llama.cpp + sqlite-vec — both open-source, both portable. The artifact behaves frontier-class on the task it was compiled for, locally, offline, with zero egress.
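Roughly, the load path looks like the sketch below, written against llama-cpp-python. The file names, prompt shape, and the retrieve stub are assumptions, and per-token recipe-pack consultation is omitted.

```python
# Rough shape of `kolm run`, sketched with llama-cpp-python. File names and
# prompt layout are assumptions; recipe-pack drafting is omitted for brevity.
from llama_cpp import Llama

def retrieve(db_path: str, query: str) -> str:
    # Stub: the real runtime embeds the query and runs a local sqlite-vec
    # nearest-neighbour lookup against the index shipped in the artifact.
    return ""

def run_artifact(extracted_dir: str, query: str) -> str:
    llm = Llama(
        model_path=f"{extracted_dir}/base.gguf",    # quantized open-weight base
        lora_path=f"{extracted_dir}/adapter.gguf",  # task adapter from Distill (name assumed)
        n_ctx=4096,
    )
    context = retrieve(f"{extracted_dir}/recall.db", query)
    prompt = f"Context:\n{context}\n\nTask: {query}\n"
    out = llm(prompt, max_tokens=512)
    return out["choices"][0]["text"]
```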
Why the artifact is 4GB and not 70GB.
Let's name the lie: nobody runs a quantized 70B model on a phone. Compressing a 70B base to fit in 6 GB destroys it the way compressing a 4K video to 200 MB destroys it. What you can do is take a 3B-7B base that already runs comfortably in 2-5 GB, and distill the 70B's behavior on your specific tasks into the smaller model. The student does not become as smart as the teacher in general. The student becomes indistinguishable from the teacher on the tasks the user actually asks about, because that's all it was trained on.
That is what makes the compile step interesting. A 3B-class model compiled against a frontier teacher on a narrow domain — say, "answer support tickets in our voice" or "review React PRs the way I would" — wins or ties cold frontier output 80%+ of the time on that specific domain, in our internal benchmarks. The recipe pack closes the latency gap (deterministic patterns get drafted at hundreds of tokens per second). The recall index closes the freshness gap (queries always run against the user's actual current data).
The compile step is what changes the math. The user does not run a compressed 70B; the user runs a compiled artifact that was built with a 70B as its teacher and behaves like one on the task the user actually has.
Compile your first artifact in five minutes.
Installing the CLI is one command per platform.
```sh
# macOS, Linux, Windows (WSL)
curl -fsSL kolm.ai/install.sh | sh

# or
npm i -g @kolmogorov/kolm
brew install kolmogorov/tap/kolm
pip install kolm
```
Then point it at a folder of examples.
```sh
kolm login

kolm compile "summarize support tickets in our voice" \
  --examples ./tickets/ \
  --base qwen2.5-3b \
  --out support.kolm

# 4 minutes later:
kolm run support.kolm "customer cant login since update"
```
The CLI handles the cloud compile, the artifact download, and the local runtime. Your frontier API key (Anthropic, OpenAI, Hermes) is the teacher. Everything after the compile is local.
The K-score, and why a number on the cover matters.
Every .kolm ships with a single number from 0 to 1 on its manifest: the K-score. It is the harmonic mean of accuracy, size, latency, cost, and coverage, normalized against the frontier baseline measured at compile time. Below the gate (default 0.70), the artifact does not ship.
This is the analogue to test pass/fail in regular software. "My model is good enough" has been a feeling for the entire history of applied ML. Compilation makes it a number. If the K-score is above the gate, you ship; if it isn't, you tighten the test set and recompile. The gate is configurable per task. The number is on the artifact, the receipts chain proves it wasn't tampered with, and the registry holds the public benchmark for any artifact that opts in.
The K-score is what makes .kolm defensible at audit. The compliance officer doesn't have to trust a vendor; they read a number and verify the chain.
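For intuition, here is how such a gate could be computed, assuming each of the five dimensions has already been normalized to [0, 1] against the frontier baseline measured at compile time; the exact normalization and any weighting are the compiler's, not shown here.

```python
# Sketch of a K-score-style gate: harmonic mean of five dimensions, each
# pre-normalized to [0, 1] against the frontier baseline. Normalization and
# weighting details are assumptions left out of this sketch.
from statistics import harmonic_mean

def k_score(accuracy: float, size: float, latency: float,
            cost: float, coverage: float) -> float:
    return harmonic_mean([accuracy, size, latency, cost, coverage])

def ships(score: float, gate: float = 0.70) -> bool:
    return score >= gate        # below the gate, the artifact does not ship

# k_score(0.92, 0.88, 0.95, 0.99, 0.81) ≈ 0.91 — passes the default gate
```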
FAQ.
Does the artifact need a network connection?
No. After kolm compile finishes, the artifact runs fully offline. The recall index lives inside it. The recipe pack lives inside it. The base model and LoRA live inside it. The signature is verifiable offline against a public key. No outbound traffic at runtime.
Can I use any frontier model as the teacher?
Yes. The compiler is teacher-agnostic. Plug in your Anthropic, OpenAI, or Hermes API key. The student (the base model in the artifact) is also configurable; the default picks the smallest open-weight model that passes the K-score gate.
What happens when the model drifts?
Every input the runtime sees that falls outside the recipe pack's coverage is logged (locally — never egressed). The next time you run kolm compile, the verifier resolves those drift inputs once via the frontier teacher, the new patterns merge into the registry, and the next compile absorbs them. The cache strictly grows. Every compile makes the next one cheaper.
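A minimal sketch of what such a local drift log could look like follows; the file name and record shape are assumptions, and nothing in it ever leaves the device.

```python
# Sketch of a local drift log: recipe-pack misses are appended to a file on
# the device and read back at the next `kolm compile`. Names are assumptions.
import json, time

DRIFT_LOG = "drift.jsonl"

def log_miss(prompt: str) -> None:
    record = {"ts": time.time(), "prompt": prompt}
    with open(DRIFT_LOG, "a") as f:        # stays on the device, never egressed
        f.write(json.dumps(record) + "\n")

def load_drift_inputs() -> list:
    # Consumed at compile time so the teacher can resolve each input once.
    with open(DRIFT_LOG) as f:
        return [json.loads(line) for line in f]
```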
Is this on-prem only?
The compile step is a hosted cloud service by default — you bring the API key, we run the orchestrator. Enterprise tiers run the compile inside your VPC, including the recall embedders. The artifact you produce is identical either way.
kolm compile → Five minutes. A signed artifact. No GPU required.
What's inside a .kolm → A field-by-field walkthrough of the seven components.
The K-score, defined → The single number on the cover, derived from five.
K-sample verified inference → The mechanism behind Distill, in plain English.