The auditability problem with frontier models.
You ask a frontier model a question. You get an answer. You have no way to prove that answer came from the model you think it came from, on the prompt you think you sent, with the temperature you think you set. The vendor logs say one thing; your application log says another; the user sees the answer. If the answer ends up in a regulated context — a clinical decision, a credit denial, a court filing — your downstream auditors will eventually ask one of three questions:
- Which model produced this output? (Was it the version under contract, or one that silently rolled forward last Tuesday?)
- What was the exact prompt? (Was it the system message your team approved, or one mutated by middleware?)
- Was the output stable, or a one-shot accident? (Did anyone re-roll the same prompt before deciding to act on the answer?)
Three questions, one missing object: a receipt. Software has been auditing itself with cryptographic logs since the 1980s. Frontier model output is the first widely deployed compute primitive that does not, by default, leave a trail.
Why zk-ML is not the answer (yet).
The cryptographer's reflex is to reach for zero-knowledge proofs. Prove that some statement about a private witness — "this output came from this model on this prompt" — is true, without revealing the witness. The math is elegant. The cost is brutal.
State-of-the-art zk-ML systems as of late 2026 prove a forward pass over a 13B-parameter model in 4-12 hours of A100 time per inference, with proof sizes in the hundreds of kilobytes. The economics close at "important compliance documents that are computed once and then live forever." They do not close at "every customer support ticket your application processes."
The right move is to back off the cryptographic ambition and ask: what auditability does the use case actually require? In our experience, it is almost always one of three things:
- Reproducibility. An auditor wants to re-run the same prompt against the same model and see the same output, or close to it.
- Stability. An auditor wants evidence that the chosen output was not a one-in-a-thousand outlier, but a stable response.
- Provenance. An auditor wants a chain of custody from prompt to output that nothing in the middle altered.
Verified inference gets all three with HMAC and a verifier function. It does not prove the forward pass. It proves something weaker but operationally sufficient: the chosen output is the deterministic winner among k samples, against a public verifier, and the chain has not been tampered with.
K-sample verified inference, mechanically.
Concrete in eleven lines.
```js
// Verified inference, in pseudocode
async function verified(client, prompt, verifier, k = 8) {
  const samples = [];
  for (let i = 0; i < k; i++)
    samples.push(await client.generate(prompt, { temperature: 0.7, seed: i }));
  const scores = samples.map(s => verifier(prompt, s));
  const winner = samples[argmax(scores)];
  const receipt = hmac(SECRET, JSON.stringify({
    prompt_hash: sha256(prompt), samples_hashes: samples.map(sha256),
    verifier_id: verifier.id, winner_idx: argmax(scores),
    winner_score: Math.max(...scores), timestamp: Date.now() }));
  return { output: winner, receipt };
}
```
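A hypothetical call, to make the shape concrete; `openaiClient`, `summaryVerifier`, and `receiptLog` are stand-ins for whatever client wrapper, compiled verifier, and append-only store you actually use:

```js
// Illustrative only: the names below are placeholders, not a shipped API.
const { output, receipt } = await verified(
  openaiClient,                       // anything exposing generate(prompt, opts)
  "Summarize this chart in 50 words.",
  summaryVerifier,                    // pure scoring function returning 0..1
  8                                   // k
);
receiptLog.append(receipt);           // append-only, HMAC-chained (see below)
```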
Three properties make this work.
Determinism. The verifier is pure. Same input, same output, every time. The samples can be replayed and the winner re-derived without re-querying the model.
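A minimal sketch of that replay, assuming the k raw samples were stored alongside the receipt; the field names follow the receipt schema described further down, and `sha256` is the same helper used in the pseudocode above:

```js
// Re-derive the winner from stored samples: no model call, no randomness.
// Any mismatch against the recorded winner means the record was altered.
function replay(prompt, samples, verifier, receipt) {
  const scores = samples.map(s => verifier(prompt, s));
  const idx = scores.indexOf(Math.max(...scores));
  return idx === receipt.winner_idx &&
         sha256(samples[idx]) === receipt.samples_hashes[idx];
}
```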
Diversity. k samples, different seeds (or different temperatures, or both). When the model is confident, all k samples cluster around the same answer; when it is unsure, they split. The verifier resolves the split.
Anchoring. The receipt is HMAC-chained against the previous receipt. An adversary who alters one entry has to forge every subsequent receipt. Public anchoring (Arweave, Bitcoin OP_RETURN) makes that forgery globally visible.
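A minimal sketch of the chain mechanics, assuming Node's built-in crypto module, a tenant-scoped secret, and a stable (canonical) JSON serialization between signing and verification:

```js
const crypto = require("crypto");
const sha256 = (s) => crypto.createHash("sha256").update(s).digest("hex");
const hmac = (key, s) => crypto.createHmac("sha256", key).update(s).digest("hex");

// Append: each receipt commits to its predecessor via parent_hash.
function appendReceipt(chain, body, secret) {
  const parent_hash = chain.length
    ? sha256(JSON.stringify(chain[chain.length - 1]))
    : sha256("genesis");
  const receipt = { ...body, parent_hash };
  receipt.tag = hmac(secret, JSON.stringify(receipt)); // tag covers everything but itself
  chain.push(receipt);
  return receipt;
}

// Verify: recompute every tag and every link; one altered entry breaks all later ones.
function verifyChain(chain, secret) {
  let parent = sha256("genesis");
  for (const r of chain) {
    const { tag, ...unsigned } = r;
    if (r.parent_hash !== parent) return false;
    if (hmac(secret, JSON.stringify(unsigned)) !== tag) return false;
    parent = sha256(JSON.stringify(r));
  }
  return true;
}
```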
Synthesizing a verifier from your seed examples.
A verifier is just a scoring function. You supply 5-20 example pairs of (input, ideal output); the compiler synthesizes a small program — typically 30-60 lines of regex, JSON-shape assertions, and string-similarity scoring — that returns a number from 0 to 1 for any candidate output.
For structured tasks (extract a JSON object, return a function signature, classify into a label set), the verifier is exact: the candidate either matches the schema or it doesn't. For semantic tasks (write a polite reply, summarize a doc), the verifier combines schema checks with embedding-similarity to the ideals. Either way, it runs in microseconds, deterministically, and ships next to the model.
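For the exact case, the whole verifier can be a schema check. A sketch for a hypothetical invoice-extraction task (the field names are illustrative, not a generated artifact):

```js
// Parseable JSON with the expected fields and types scores 1; anything else scores 0.
function scoreInvoice(input, output) {
  let obj;
  try { obj = JSON.parse(output); } catch { return 0; }
  const ok = typeof obj.vendor === "string" &&
             typeof obj.amount === "number" &&
             /^[A-Z]{3}$/.test(obj.currency || "");
  return ok ? 1 : 0;
}
```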
The verifier is what makes "AI output" auditable. Anyone who has the verifier can re-derive the score. Anyone who has the receipt can verify the chain. Nobody has to trust the model vendor.
Worked example: a clinical-summary verifier.
```js
// Compiler-generated verifier for "summarize chart in 50 words"
function score(input, output) {
  let s = 1.0;
  if (output.length > 400) s *= 0.4;                              // length gate
  if (!/\b(?:patient|chart|admit)\b/i.test(output)) s *= 0.7;
  if (/\b(?:always|never|guaranteed)\b/i.test(output)) s *= 0.3;  // hedge gate
  const inputTerms = extractMedicalTerms(input);
  const hit = inputTerms.filter(t => output.includes(t)).length / inputTerms.length;
  s *= (0.5 + 0.5 * hit);                                         // grounding
  return s;                                                       // 0..1
}
```
The compiler builds this from your seed examples. You can edit it. You can version it. You can publish its content hash. Everyone downstream of the receipts you produce can read it.
The receipt: what gets signed and why.
Each receipt is a JSON object plus an HMAC tag. The fields:
- `prompt_hash` — sha256 of the canonical prompt (system + user, post-template).
- `samples_hashes` — array of sha256 over each of the k candidate outputs.
- `verifier_id` — content hash of the verifier program. If the verifier changes, the receipt changes.
- `winner_idx` — which of the k samples scored highest.
- `winner_score` — the score, on [0, 1].
- `model_id` — vendor's model identifier (and our content-hash of the LoRA, when local).
- `timestamp` — RFC 3339 UTC.
- `parent_hash` — sha256 of the previous receipt in the chain. First receipt sets the genesis.
- `tag` — HMAC-SHA256 of the entire object using a tenant-scoped secret.
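Filled in, a single receipt looks roughly like this; every value is illustrative and the hashes are truncated:

```json
{
  "prompt_hash": "9f2c41…",
  "samples_hashes": ["b71d0e…", "0aa3c7…", "…"],
  "verifier_id": "4ce09a…",
  "winner_idx": 5,
  "winner_score": 0.91,
  "model_id": "vendor-model-2026-01-15",
  "timestamp": "2026-02-03T14:07:22Z",
  "parent_hash": "d4e81b…",
  "tag": "77fa93…"
}
```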
The compiler emits a receipt for every label it produces. The label store is append-only. Anyone with the verifier and the parent chain can re-derive every score and confirm no entry has been silently rewritten.
Cost, accuracy, and how the numbers stack up.
Verified inference costs k× the unverified API call (k=8 is our default), plus O(microseconds) of verifier work. The accuracy lift over k=1 sampling is what makes the trade worth it.
| property | k=1 (raw) | k=8 verified | zk-ML proof |
|---|---|---|---|
| cost vs raw | 1× | ~8× | ~10,000× |
| accuracy on benchmark* | 62% | 79% | varies |
| auditable? | no | yes | yes |
| tamper-evident? | no | yes | yes |
| per-call latency | 500 ms | 3-5 sec | 4-12 hr |
| practical for live traffic? | yes | yes (compile-time) | no |
* Internal benchmark on 200 SWE-bench Lite tasks, k=8 verified labeling vs. k=1 sampling, both with Claude Opus 4.7 as the teacher. The lift here was the foundation of our public reproducer at bench/swe-bench-lite-reproduce.sh.
What this buys you in production.
For compliance. Every label that trained your LoRA is signed, timestamped, and chained. Every chain is recoverable. The auditor reads a number and verifies the chain; you do not have to argue policy with them.
For training quality. The verifier is a quality filter. Distilling against k=8 verified labels produces consistently better LoRAs than distilling against k=1 raw labels, because the bad samples don't pollute the training set. This is the mechanism behind our published +15.33pp Lite reproducer.
For the receipt-chain itself. You can publish the chain root every N receipts to Arweave or a public blockchain. From that point on, anyone in the world can independently verify that no historical entry has been rewritten.
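A minimal sketch of that anchoring step, reusing the `sha256` helper from the chain sketch above; `publish` stands in for whatever transport you use (an Arweave upload, an OP_RETURN output):

```js
// Every n receipts, publish the hash of the newest one. Because each receipt
// commits to its parent, that single hash commits to the entire history.
function maybeAnchor(chain, n, publish) {
  if (chain.length > 0 && chain.length % n === 0) {
    publish(sha256(JSON.stringify(chain[chain.length - 1])));
  }
}
```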
FAQ.
Why not just save the raw model output and call it a day?
Because nothing prevents an attacker (or a careless engineer) from rewriting your saved output later. A receipt is content-anchored: changing the output changes the hash, which invalidates the receipt, which invalidates every later receipt in the chain.
Does the verifier need to be deterministic?
Yes. If the verifier is non-deterministic, you cannot re-derive the same winner from the same samples. The synthesis pipeline rejects verifiers that don't pass a determinism check (run the verifier 100 times on a fixed input and a fixed candidate; reject if any output differs).
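The check itself is small. A sketch, with the repetition count as a parameter:

```js
// Reject a candidate verifier if repeated runs on the same (input, candidate)
// pair ever disagree; 100 repetitions matches the default described above.
function isDeterministic(verifier, input, candidate, runs = 100) {
  const first = verifier(input, candidate);
  for (let i = 1; i < runs; i++) {
    if (verifier(input, candidate) !== first) return false;
  }
  return true;
}
```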
What about the frontier vendor's silent model rolls?
If the vendor changes the model under your contract, the same prompt produces a different sample distribution, and verified inference makes that visible: the receipt's model_id only records whatever version the vendor reports, but the shift in the sample distribution shows up across receipts. For local artifacts, the LoRA's content hash is in the receipt; you control when it changes.
Can I bring my own verifier?
Yes. Pass --verifier ./mine.js to kolm compile. The compiler still synthesizes a fallback from your seeds, but a hand-written verifier always wins. You publish the verifier's content hash so downstream auditors can re-run it.
The four-stage pipeline that produces every .kolm.

- .kolm file format → What's inside the artifact, byte by byte.
- The K-score, defined → The single number on the cover, derived from five.
- Run kolm compile → Five minutes. A signed artifact. No GPU required.