benchmarks · spec kolm-benchmark-1

Every .kolm carries its own evals.

A kolm artifact is not a marketing claim. It is a signed binary that ships with the eval set used to gate it at compile time. kolm bench file.kolm reruns those evals offline, monitors network egress, and emits a JSON report. No leaderboard. No frontier proxying. The artifact is its own benchmark.

spec
kolm-benchmark-1 · harness in src/benchmark.js, open
offline · egress=0 · fetch / http / https / net / tls / dns patched and recorded
latency · µs · measured per case via process.hrtime.bigint
integrity · HMAC chain · signature_valid + receipt_chain_steps in report

The benchmark report is plain JSON. You can run it on any machine, against any .kolm, and diff the result. Reproducibility is the unit of currency.
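
Reproducibility across machines means everything except latency should diff clean. A minimal sketch of that comparison in Node, assuming two report files produced by kolm bench (file names are placeholders):

// diff-reports.js - compare two kolm-benchmark-1 reports from different machines
const fs = require('node:fs');

const normalize = (path) => {
  const report = JSON.parse(fs.readFileSync(path, 'utf8'));
  delete report.latency_us; // hardware-dependent; every other field should reproduce
  return JSON.stringify(report, null, 2);
};

const [a, b] = process.argv.slice(2);
console.log(normalize(a) === normalize(b) ? 'reports match' : 'reports differ');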

What the harness measures

When you compile a kolm artifact, the eval set you gated it on is sealed inside the bundle as evals.json. kolm bench reads that, runs every recipe in recipes.json against every embedded case, and reports five things:

field · what it answers · where it comes from
k_score · "Is this artifact good and small?" · manifest.json composite, signed at compile time
evals.accuracy · "How many embedded cases pass right now?" · replay against evals.cases, no API call
latency_us.p50 · "How fast does it actually run on this machine?" · process.hrtime.bigint, microsecond resolution
privacy.runtime_egress_attempts · "Did this artifact try to phone home?" · fetch + http + https + net + tls + dns patched
integrity.signature_valid · "Has this binary been tampered with?" · HMAC-SHA256 over manifest.json

Every cell above maps to a line in src/benchmark.js. If you don't trust us, run the harness yourself.
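
The core of the harness is a replay loop. A minimal sketch, assuming an unpacked bundle with evals.json and recipes.json and a hypothetical local runRecipe(recipe, input) helper (the real loop is in src/benchmark.js):

// bench-sketch.js - replay embedded evals offline, timing each case in µs
const fs = require('node:fs');
const runRecipe = require('./run-recipe'); // hypothetical: pure local execution

const evals = JSON.parse(fs.readFileSync('evals.json', 'utf8'));
const recipes = JSON.parse(fs.readFileSync('recipes.json', 'utf8'));

let passed = 0;
const samplesUs = [];
for (const recipe of recipes) {
  for (const c of evals.cases) {
    const t0 = process.hrtime.bigint();
    const out = runRecipe(recipe, c.input); // no API call, no network
    const t1 = process.hrtime.bigint();
    samplesUs.push(Number((t1 - t0) / 1000n)); // ns -> µs
    if (out === c.expected) passed++;
  }
}

samplesUs.sort((a, b) => a - b);
const pct = (q) => samplesUs[Math.min(samplesUs.length - 1, Math.floor(q * samplesUs.length))];
console.log({
  accuracy: passed / samplesUs.length,
  latency_us: { p50: pct(0.5), p95: pct(0.95), max: samplesUs[samplesUs.length - 1] },
});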

The compile-time gate

Before an artifact is signed and sealed, it has to pass a sandbox verifier (src/verifier.js). The verifier runs the candidate generator against three sets (positives, negatives, property tests) and computes a quality score:

verifier quality_score
Q = 0.5·positive_pass_rate + 0.4·negative_pass_rate + 0.1·property_pass_rate
gate: Q ≥ 0.85 AND positive_pass_rate ≥ 0.85

If the gate fails, the artifact is never written to disk. There is no "bad kolm" that ships and then fails in production. The build either produces a binary that passed its own evals, or it produces nothing.
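
The gate is a one-liner over three pass rates. A sketch of the arithmetic (the real check lives in src/verifier.js):

// gate-sketch.js - compile-time quality gate from the formula above
const qualityScore = ({ positive, negative, property }) =>
  0.5 * positive + 0.4 * negative + 0.1 * property;

const passesGate = (rates) =>
  qualityScore(rates) >= 0.85 && rates.positive >= 0.85;

// A candidate that aces positives and properties but leaks on negatives:
console.log(passesGate({ positive: 1.0, negative: 0.6, property: 1.0 }));
// Q = 0.5 + 0.24 + 0.1 = 0.84 -> false: nothing is written to disk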

The verified-inference math

For artifacts whose ground-truth labels come from a frontier model, kolm uses k-sample verified inference (src/verified.js). Given a stochastic generator with single-shot accuracy p on a verifiable task, and a sound verifier with single-shot accuracy v, drawing k independent samples and accepting the first that the verifier passes yields:

verified-inference accuracy
accuracy(k) = 1 − (1 − p·v)^k

This is the SOTA-amplifier. It is not a claim; it is monotone in k by construction, and the verifier's soundness is what bounds it. Worked numbers for a verifiable coding task with p≈0.91 and v≈1.0:

k · accuracy · cost (relative) · note
1 · 91.0% · 1× · cold inference
2 · 99.2% · 2× · first amplification
4 · 99.99% · 4× · production gate
8 · 99.9999% · 8× · labels-corpus mode

The amplifier only fires when the verifier is sound (no false positives). For non-verifiable tasks (open-ended generation, judgment calls), v ≪ 1 and the amplifier collapses. We are explicit about which tasks fall in each bucket.
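
The curve itself is two lines of Node; this sketch reproduces the table above via the residual error 1 − accuracy(k):

// amplifier-sketch.js - accuracy(k) = 1 - (1 - p*v)^k for p = 0.91, v = 1.0
const residual = (k, p, v) => Math.pow(1 - p * v, k); // P(all k samples fail)

for (const k of [1, 2, 4, 8]) {
  console.log(`k=${k}  residual=${residual(k, 0.91, 1.0).toExponential(2)}`);
}
// 9.00e-2, 8.10e-3, 6.56e-5, 4.30e-9 = 91.0%, 99.19%, 99.993%, 99.9999996%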

K-score, computed

K-score is what the manifest reports: a single number that combines accuracy, eval coverage, and binary size. Bigger is better; bloat is penalized:

k_score.composite
composite = (accuracy · coverage · 1000) / log2(size_kb + 2)

A 5KB recipe at 100% accuracy and 100% coverage scores ~356. A 50MB checkpoint at the same accuracy scores ~64. The denominator is the discipline.
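
In code, with the two worked sizes (a sketch of the composite only; the full score has five axes, linked below):

// kscore-sketch.js - composite = (accuracy * coverage * 1000) / log2(size_kb + 2)
const kScore = (accuracy, coverage, sizeKb) =>
  (accuracy * coverage * 1000) / Math.log2(sizeKb + 2);

console.log(kScore(1.0, 1.0, 5).toFixed(1));         // ~356.2: 5 KB recipe
console.log(kScore(1.0, 1.0, 50 * 1024).toFixed(1)); // ~63.9: 50 MB checkpoint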

→ how K-score is computed, all five raw axes

Reproducer

Every kolm artifact you compile can be benched on your own machine. The harness is one file.

# compile a sample artifact
kolm compile "classify support emails" \
  --examples examples.jsonl \
  --out support.kolm

# run the embedded eval set offline
kolm bench support.kolm --runs 10 --out report.json

# report.json conforms to spec kolm-benchmark-1 (real run against test/fixtures/sample.kolm)
{
  "spec": "kolm-benchmark-1",
  "artifact_sha256": "sha256:b8344082…1089880e0",
  "artifact_bytes": 3259,
  "k_score": 424.57,
  "evals": { "n": 4, "graded": 400, "passed": 400, "accuracy": 1.0 },
  "latency_us": { "p50": 274, "p95": 335, "max": 639 },
  "privacy": { "runtime_egress_attempts": 0, "blocked": false },
  "integrity": { "signature_valid": true, "receipt_chain_steps": 5 }
}

The harness patches fetch, http, https, net, tls, and dns before any recipe runs. Any egress attempt is recorded and counted against the artifact. This is how you prove a binary is sovereign.
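
The patching itself is ordinary module interception. A sketch of the pattern, not the exact code in src/benchmark.js:

// egress-sketch.js - patch network entry points before any recipe runs
let egressAttempts = 0;

globalThis.fetch = async (...args) => {
  egressAttempts++;
  throw new Error(`egress blocked: fetch ${args[0]}`);
};

for (const name of ['http', 'https', 'net', 'tls', 'dns']) {
  const mod = require(`node:${name}`);
  for (const fn of ['request', 'get', 'connect', 'lookup', 'resolve']) {
    if (typeof mod[fn] === 'function') {
      mod[fn] = () => {
        egressAttempts++;
        throw new Error(`egress blocked: ${name}.${fn}`);
      };
    }
  }
}
// after the run: report privacy.runtime_egress_attempts = egressAttempts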

Four real fixtures, four real runs

Every number below comes from running kolm bench against the four signed fixtures shipped in this repo, on Windows 11 + Node v24.14.0. Anyone can rerun the exact commands. The fixtures are the same artifacts the cookbook and /build-your-own templates produce. kolm-benchmark-1 is the report schema.

$ RECIPE_RECEIPT_SECRET=kolm-public-fixture-v0-1-0 \
  kolm bench test/fixtures/<name>.kolm --runs 50 \
  --target kolm-public-v0.1.0 --device win11-node24
fixture · bytes · evals · accuracy · p50 µs · p95 µs · K-score · egress · signed
sample.kolm (uppercase reference) · 3,259 · 4 · 1.0 · 286 · 363 · 424.57 · 0 · true
redactor.kolm (PII / PHI redactor) · 4,933 · 6 · 1.0 · 288 · 351 · 362.96 · 0 · true
extractor.kolm (structured-field extractor) · 4,571 · 5 · 1.0 · 256 · 342 · 373.48 · 0 · true
classifier.kolm (rule-based classifier) · 4,649 · 5 · 1.0 · 269 · 337 · 371.00 · 0 · true

Each row is a 50-run benchmark. The fixtures are signed under RECIPE_RECEIPT_SECRET=kolm-public-fixture-v0-1-0, which is also the secret the smoke battery and the e2e tests use. Each artifact ships its own embedded eval set, so the harness reruns those evals offline with fetch/http/https/net/tls/dns all monitored. Latency varies with hardware; evals.accuracy, privacy.runtime_egress_attempts, and integrity.signature_valid do not. Reproducer notes at docs/benchmark-results-v0.1.0.md.
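
The integrity check is ordinary Node crypto. A sketch, assuming the bundle exposes manifest.json next to its recorded signature (the .sig path here is illustrative, not the actual bundle layout):

// verify-sketch.js - integrity.signature_valid is an HMAC-SHA256 comparison
const crypto = require('node:crypto');
const fs = require('node:fs');

const manifest = fs.readFileSync('manifest.json'); // the bytes exactly as sealed
const recorded = fs.readFileSync('manifest.sig', 'utf8').trim(); // illustrative

const expected = crypto
  .createHmac('sha256', process.env.RECIPE_RECEIPT_SECRET)
  .update(manifest)
  .digest('hex');

// constant-time comparison; any tamper flips signature_valid to false
const valid =
  recorded.length === expected.length &&
  crypto.timingSafeEqual(Buffer.from(recorded), Buffer.from(expected));
console.log({ signature_valid: valid });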

What we do not claim

kolm does not appear on the SWE-bench Verified leaderboard. It does not claim to make Opus or any frontier model better at coding in general. The compile-time gate either produces an artifact that passes its embedded evals, or it doesn't ship.

Verified inference (k-sample + sound verifier) is monotone in k by construction. The interesting empirical question is what the cost-vs-accuracy curve looks like for a given task and verifier; that is the per-task compile decision, made in your account, not on a leaderboard.

If your task isn't verifiable (no machine-checkable ground truth), kolm tells you. The amplifier doesn't fire and the artifact isn't sealed. We say no on builds we can't gate.

Bottom line

Every .kolm carries the eval set it was gated on. The benchmark harness reruns those evals offline, in microseconds, with network egress monitored.

The report is JSON. The signature is HMAC. The harness is open. There is no leaderboard to cherry-pick because the artifact is its own leaderboard, and the next person to download it can verify the same number you did.