What the harness measures
When you compile a kolm artifact, the eval set you gated it on is sealed inside the bundle as evals.json. kolm bench reads that, runs every recipe in recipes.json against every embedded case, and reports five things:
| field | what it answers | where it comes from |
|---|---|---|
| k_score | "Is this artifact good and small?" | manifest.json composite, signed at compile time |
| evals.accuracy | "How many embedded cases pass right now?" | replay against evals.cases, no API call |
| latency_us.p50 | "How fast does it actually run on this machine?" | process.hrtime.bigint, microsecond resolution |
| privacy.runtime_egress_attempts | "Did this artifact try to phone home?" | fetch + http + https + net + tls + dns patched |
| integrity.signature_valid | "Has this binary been tampered with?" | HMAC-SHA256 over manifest.json |
Every cell above maps to a line in src/benchmark.js. If you don't trust us, run the harness yourself.
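The latency_us row is just sorted hrtime samples. A minimal sketch of that measurement, assuming a synchronous recipe function — `benchLatency` and the nearest-rank percentile indexing here are illustrative, not the exact src/benchmark.js code:

```javascript
// Measure per-run latency in microseconds with process.hrtime.bigint(),
// then report p50/p95/max by sorting and indexing (nearest-rank).
function benchLatency(fn, runs = 50) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    fn();
    const end = process.hrtime.bigint();
    samples.push(Number((end - start) / 1000n)); // ns -> us
  }
  samples.sort((a, b) => a - b);
  const pick = (q) =>
    samples[Math.min(samples.length - 1, Math.floor(q * samples.length))];
  return { p50: pick(0.5), p95: pick(0.95), max: samples[samples.length - 1] };
}
```

Because `hrtime.bigint()` is monotonic and nanosecond-resolution, microsecond percentiles are meaningful even for sub-millisecond recipes.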
The compile-time gate
Before an artifact is signed and sealed, it has to pass a sandbox verifier (src/verifier.js). The verifier runs the candidate generator against three sets (positives, negatives, property tests) and computes a quality score:
gate: Q ≥ 0.85 AND positive_pass_rate ≥ 0.85
If the gate fails, the artifact is never written to disk. There is no "bad kolm" that ships and then fails in production. The build either produces a binary that passed its own evals, or it produces nothing.
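The gate is a plain conjunction of two thresholds. A sketch of the decision — the names are illustrative, not the actual src/verifier.js internals:

```javascript
// Compile-time gate: both thresholds must hold, or nothing is written to disk.
const GATE = { minQ: 0.85, minPositivePassRate: 0.85 };

function gatePasses({ q, positivePassRate }) {
  return q >= GATE.minQ && positivePassRate >= GATE.minPositivePassRate;
}

// The compiler only seals and signs the artifact when gatePasses(...) is true;
// otherwise it exits without producing a binary.
```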
The verified-inference math
For artifacts whose ground-truth labels come from a frontier model, kolm uses k-sample verified inference (src/verified.js). Given a stochastic generator with single-shot accuracy p on a verifiable task, and a sound verifier with single-shot accuracy v, drawing k independent samples and accepting the first that the verifier passes yields:

accuracy(k) = 1 − (1 − p·v)^k

For v = 1 this reduces to 1 − (1 − p)^k: a wrong final answer requires all k samples to fail.
This is the SOTA-amplifier. It is not a claim; it is monotone in k by construction, and the verifier's soundness is what bounds it. Worked numbers for a verifiable coding task with p≈0.91 and v≈1.0:
| k | accuracy | cost (relative) | note |
|---|---|---|---|
| 1 | 91.0% | 1× | cold inference |
| 2 | 99.2% | 2× | first amplification |
| 4 | 99.99% | 4× | production gate |
| 8 | 99.9999% | 8× | labels-corpus mode |
The amplifier only fires when the verifier is sound (no false positives). For non-verifiable tasks (open-ended generation, judgment calls), v ≪ 1 and the amplifier collapses. We are explicit about which tasks fall in each bucket.
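Under the soundness assumption (v = 1, no false positives), the amplified accuracy is one line of arithmetic; the worked numbers in the table follow directly:

```javascript
// With a sound verifier, a wrong final answer survives only if all k
// independent samples fail, so accepted accuracy is 1 - (1 - p)^k.
function amplifiedAccuracy(p, k) {
  return 1 - Math.pow(1 - p, k);
}

// amplifiedAccuracy(0.91, 2) = 1 - 0.09^2 ≈ 0.9919 — the 99.2% row above
```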
K-score, computed
K-score is what the manifest reports: a single number that combines accuracy, eval coverage, and binary size. Bigger is better; bloat is penalized.
A 5KB recipe at 100% accuracy and 100% coverage scores ~588. A 50MB checkpoint at the same accuracy scores ~62. The denominator is the discipline.
Reproducer
Every kolm artifact you compile can be benched on your own machine. The harness is one file.
```shell
# compile a sample artifact
kolm compile "classify support emails" \
  --examples examples.jsonl \
  --out support.kolm

# run the embedded eval set offline
kolm bench support.kolm --runs 10 --out report.json
```

report.json conforms to spec kolm-benchmark-1 (real run against test/fixtures/sample.kolm):

```json
{
  "spec": "kolm-benchmark-1",
  "artifact_sha256": "sha256:b8344082…1089880e0",
  "artifact_bytes": 3259,
  "k_score": 424.57,
  "evals": { "n": 4, "graded": 400, "passed": 400, "accuracy": 1.0 },
  "latency_us": { "p50": 274, "p95": 335, "max": 639 },
  "privacy": { "runtime_egress_attempts": 0, "blocked": false },
  "integrity": { "signature_valid": true, "receipt_chain_steps": 5 }
}
```
The harness patches fetch, http, https, net, tls, and dns before any recipe runs. Any egress attempt is recorded and counted against the artifact. This is how you prove a binary is sovereign.
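A minimal sketch of the patching idea — the harness covers all six modules, but only global fetch is shown here, and the counter name is illustrative:

```javascript
// Replace global fetch before any recipe code runs. Every call is
// recorded as an egress attempt and rejected instead of reaching the network.
const egress = { attempts: 0 };

globalThis.fetch = async (...args) => {
  egress.attempts += 1;
  throw new Error('egress blocked: ' + String(args[0]));
};

// After the run, egress.attempts feeds privacy.runtime_egress_attempts.
```

Because the patch is installed before the recipe module is loaded, there is no window in which unmonitored network calls can slip through.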
Four real fixtures, four real runs
Every number below comes from running kolm bench against the four signed fixtures shipped in this repo, on Windows 11 + Node v24.14.0. Anyone can rerun the exact commands. The fixtures are the same artifacts the cookbook and /build-your-own templates produce. kolm-benchmark-1 is the report schema.
```shell
$ RECIPE_RECEIPT_SECRET=kolm-public-fixture-v0-1-0 \
  kolm bench test/fixtures/<name>.kolm --runs 50 \
  --target kolm-public-v0.1.0 --device win11-node24
```
| fixture | bytes | evals | accuracy | p50 us | p95 us | K-score | egress | signed |
|---|---|---|---|---|---|---|---|---|
| sample.kolm (uppercase reference) | 3,259 | 4 | 1.0 | 286 | 363 | 424.57 | 0 | true |
| redactor.kolm (PII / PHI redactor) | 4,933 | 6 | 1.0 | 288 | 351 | 362.96 | 0 | true |
| extractor.kolm (structured-field extractor) | 4,571 | 5 | 1.0 | 256 | 342 | 373.48 | 0 | true |
| classifier.kolm (rule-based classifier) | 4,649 | 5 | 1.0 | 269 | 337 | 371.00 | 0 | true |
Each row is a 50-run benchmark. The fixtures are signed under RECIPE_RECEIPT_SECRET=kolm-public-fixture-v0-1-0, the same secret the smoke battery and the e2e tests use. Each artifact ships its own embedded eval set, so the harness re-runs those evals offline with fetch/http/https/net/tls/dns all monitored. Latency varies with hardware; evals.accuracy, privacy.runtime_egress_attempts, and integrity.signature_valid do not. Reproducer notes are at docs/benchmark-results-v0.1.0.md.
What we do not claim
kolm does not appear on the SWE-bench Verified leaderboard. It does not claim to make Opus or any frontier model better at coding in general. The compile-time gate either produces an artifact that passes its embedded evals, or it produces nothing.
Verified inference (k-sample + sound verifier) is monotone in k by construction. The interesting empirical question is what the cost-vs-accuracy curve looks like for a given task and verifier; that is the per-task compile decision, made in your account, not on a leaderboard.
If your task isn't verifiable (no machine-checkable ground truth), kolm tells you. The amplifier doesn't fire and the artifact isn't sealed. We say no on builds we can't gate.
Bottom line
Every .kolm carries the eval set it was gated on. The benchmark harness reruns those evals offline, in microseconds, with network egress monitored.
The report is JSON. The signature is HMAC. The harness is open. There is no leaderboard to cherry-pick because the artifact is its own leaderboard, and the next person to download it can verify the same number you did.