benchmarks · spec kolm-benchmark-1

Every .kolm carries its own evals.

A kolm artifact is not a marketing claim. It is a signed binary that ships with the eval set used to gate it at compile time. kolm bench file.kolm reruns those evals offline, monitors network egress, and emits a JSON report. No leaderboard. No frontier proxying. The artifact is its own benchmark.

spec
kolm-benchmark-1 · harness in src/benchmark.js, open
offline · egress=0 · fetch / http / https / net / tls / dns patched and recorded
latency · µs · measured per case via process.hrtime.bigint
integrity · HMAC chain · signature_valid + receipt_chain_steps in report

The benchmark report is plain JSON. You can run it on any machine, against any .kolm, and diff the result. Reproducibility is the unit of currency.
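
Reproducibility across machines means everything except latency should diff clean. A minimal sketch of that comparison in Node, assuming two report files produced by kolm bench (file names are placeholders):

// diff-reports.js - compare two kolm-benchmark-1 reports from different machines
const fs = require('node:fs');

const normalize = (path) => {
  const report = JSON.parse(fs.readFileSync(path, 'utf8'));
  delete report.latency_us; // hardware-dependent; every other field should reproduce
  return JSON.stringify(report, null, 2);
};

const [a, b] = process.argv.slice(2);
console.log(normalize(a) === normalize(b) ? 'reports match' : 'reports differ');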

What the harness measures

When you compile a kolm artifact, the eval set you gated it on is sealed inside the bundle as evals.json. kolm bench reads that, runs every recipe in recipes.json against every embedded case, and reports five things:

field · what it answers · where it comes from
k_score · "Is this artifact good and small?" · manifest.json composite, signed at compile time
evals.accuracy · "How many embedded cases pass right now?" · replay against evals.cases, no API call
latency_us.p50 · "How fast does it actually run on this machine?" · process.hrtime.bigint, microsecond resolution
privacy.runtime_egress_attempts · "Did this artifact try to phone home?" · fetch + http + https + net + tls + dns patched
integrity.signature_valid · "Has this binary been tampered with?" · HMAC-SHA256 over manifest.json

Every cell above maps to a line in src/benchmark.js. If you don't trust us, run the harness yourself.
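
The core of the harness is a replay loop. A minimal sketch, assuming an unpacked bundle with evals.json and recipes.json and a hypothetical local runRecipe(recipe, input) helper (the real loop is in src/benchmark.js):

// bench-sketch.js - replay embedded evals offline, timing each case in µs
const fs = require('node:fs');
const runRecipe = require('./run-recipe'); // hypothetical: pure local execution

const evals = JSON.parse(fs.readFileSync('evals.json', 'utf8'));
const recipes = JSON.parse(fs.readFileSync('recipes.json', 'utf8'));

let passed = 0;
const samplesUs = [];
for (const recipe of recipes) {
  for (const c of evals.cases) {
    const t0 = process.hrtime.bigint();
    const out = runRecipe(recipe, c.input); // no API call, no network
    const t1 = process.hrtime.bigint();
    samplesUs.push(Number((t1 - t0) / 1000n)); // ns -> µs
    if (out === c.expected) passed++;
  }
}

samplesUs.sort((a, b) => a - b);
const pct = (q) => samplesUs[Math.min(samplesUs.length - 1, Math.floor(q * samplesUs.length))];
console.log({
  accuracy: passed / samplesUs.length,
  latency_us: { p50: pct(0.5), p95: pct(0.95), max: samplesUs[samplesUs.length - 1] },
});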

The compile-time gate

Before an artifact is signed and sealed, it has to pass a sandbox verifier (src/verifier.js). The verifier runs the candidate generator against three sets (positives, negatives, property tests) and computes a quality score:

verifier quality_score
Q = 0.5·positive_pass_rate + 0.4·negative_pass_rate + 0.1·property_pass_rate
gate: Q ≥ 0.85 AND positive_pass_rate ≥ 0.85

If the gate fails, the artifact is never written to disk. There is no "bad kolm" that ships and then fails in production. The build either produces a binary that passed its own evals, or it produces nothing.
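
The gate is a one-liner over three pass rates. A sketch of the arithmetic (the real check lives in src/verifier.js):

// gate-sketch.js - compile-time quality gate from the formula above
const qualityScore = ({ positive, negative, property }) =>
  0.5 * positive + 0.4 * negative + 0.1 * property;

const passesGate = (rates) =>
  qualityScore(rates) >= 0.85 && rates.positive >= 0.85;

// A candidate that aces positives and properties but leaks on negatives:
console.log(passesGate({ positive: 1.0, negative: 0.6, property: 1.0 }));
// Q = 0.5 + 0.24 + 0.1 = 0.84 -> false: nothing is written to disk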

The verified-inference math

For artifacts whose ground-truth labels come from a frontier model, kolm uses k-sample verified inference (src/verified.js). Given a stochastic generator with single-shot accuracy p on a verifiable task, and a sound verifier with single-shot accuracy v, drawing k independent samples and accepting the first that the verifier passes yields:

verified-inference accuracy
accuracy(k) = 1 − (1 − p·v)^k

This is the SOTA-amplifier. It is not a claim; it is monotone in k by construction, and the verifier's soundness is what bounds it. Worked numbers for a verifiable coding task with p≈0.91 and v≈1.0:

k · accuracy · cost (relative) · note
1 · 91.0% · 1× · cold inference
2 · 99.2% · 2× · first amplification
4 · 99.99% · 4× · production gate
8 · 99.9999% · 8× · labels-corpus mode

The amplifier only fires when the verifier is sound (no false positives). For non-verifiable tasks (open-ended generation, judgment calls), v ≪ 1 and the amplifier collapses. We are explicit about which tasks fall in each bucket.
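
The curve itself is two lines of Node; this sketch reproduces the table above via the residual error 1 − accuracy(k):

// amplifier-sketch.js - accuracy(k) = 1 - (1 - p*v)^k for p = 0.91, v = 1.0
const residual = (k, p, v) => Math.pow(1 - p * v, k); // P(all k samples fail)

for (const k of [1, 2, 4, 8]) {
  console.log(`k=${k}  residual=${residual(k, 0.91, 1.0).toExponential(2)}`);
}
// 9.00e-2, 8.10e-3, 6.56e-5, 4.30e-9 = 91.0%, 99.19%, 99.993%, 99.9999996%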

K-score, computed

K-score is what the manifest reports: a single number that combines accuracy, eval coverage, and binary size. Bigger is better; bloat is penalized:

k_score.composite
composite = (accuracy · coverage · 1000) / log2(size_kb + 2)

A 5KB recipe at 100% accuracy and 100% coverage scores ~356. A 50MB checkpoint at the same accuracy scores ~64. The denominator is the discipline.
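
In code, with the two worked sizes (a sketch of the composite only; the full score has five axes, linked below):

// kscore-sketch.js - composite = (accuracy * coverage * 1000) / log2(size_kb + 2)
const kScore = (accuracy, coverage, sizeKb) =>
  (accuracy * coverage * 1000) / Math.log2(sizeKb + 2);

console.log(kScore(1.0, 1.0, 5).toFixed(1));         // ~356.2: 5 KB recipe
console.log(kScore(1.0, 1.0, 50 * 1024).toFixed(1)); // ~63.9: 50 MB checkpoint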

→ how K-score is computed, all five raw axes

Reproducer

Every kolm artifact you compile can be benched on your own machine. The harness is one file.

# compile a sample artifact
kolm compile "classify support emails" \
  --examples examples.jsonl \
  --out support.kolm

# run the embedded eval set offline
kolm bench support.kolm --runs 10 --out report.json

# report.json conforms to spec kolm-benchmark-1 (real run against test/fixtures/sample.kolm)
{
  "spec": "kolm-benchmark-1",
  "artifact_sha256": "sha256:b8344082…1089880e0",
  "artifact_bytes": 3259,
  "k_score": 424.57,
  "evals": { "n": 4, "graded": 400, "passed": 400, "accuracy": 1.0 },
  "latency_us": { "p50": 274, "p95": 335, "max": 639 },
  "privacy": { "runtime_egress_attempts": 0, "blocked": false },
  "integrity": { "signature_valid": true, "receipt_chain_steps": 5 }
}

The harness patches fetch, http, https, net, tls, and dns before any recipe runs. Any egress attempt is recorded and counted against the artifact. This is how you prove a binary is sovereign.
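
The patching itself is ordinary module interception. A sketch of the pattern, not the exact code in src/benchmark.js:

// egress-sketch.js - patch network entry points before any recipe runs
let egressAttempts = 0;

globalThis.fetch = async (...args) => {
  egressAttempts++;
  throw new Error(`egress blocked: fetch ${args[0]}`);
};

for (const name of ['http', 'https', 'net', 'tls', 'dns']) {
  const mod = require(`node:${name}`);
  for (const fn of ['request', 'get', 'connect', 'lookup', 'resolve']) {
    if (typeof mod[fn] === 'function') {
      mod[fn] = () => {
        egressAttempts++;
        throw new Error(`egress blocked: ${name}.${fn}`);
      };
    }
  }
}
// after the run: report privacy.runtime_egress_attempts = egressAttempts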

Four real fixtures, four real runs

Every number below comes from running kolm bench against the four signed fixtures shipped in this repo, on Windows 11 + Node v24.14.0. Anyone can rerun the exact commands. The fixtures are the same artifacts the cookbook and /build-your-own templates produce. kolm-benchmark-1 is the report schema.

$ RECIPE_RECEIPT_SECRET=kolm-public-fixture-v0-1-0 \
  kolm bench test/fixtures/<name>.kolm --runs 50 \
  --target kolm-public-v0.1.0 --device win11-node24
fixture · bytes · evals · accuracy · p50 µs · p95 µs · K-score · egress · signed
sample.kolm (uppercase reference) · 3,259 · 4 · 1.0 · 286 · 363 · 424.57 · 0 · true
redactor.kolm (PII / PHI redactor) · 4,933 · 6 · 1.0 · 288 · 351 · 362.96 · 0 · true
extractor.kolm (structured-field extractor) · 4,571 · 5 · 1.0 · 256 · 342 · 373.48 · 0 · true
classifier.kolm (rule-based classifier) · 4,649 · 5 · 1.0 · 269 · 337 · 371.00 · 0 · true

Each row is a 50-run benchmark. The fixtures are signed under RECIPE_RECEIPT_SECRET=kolm-public-fixture-v0-1-0, which is also the secret the smoke battery and the e2e tests use. Each artifact ships its own embedded eval set, so the harness reruns those evals offline with fetch/http/https/net/tls/dns all monitored. Latency varies with hardware; evals.accuracy, privacy.runtime_egress_attempts, and integrity.signature_valid do not. Reproducer notes at docs/benchmark-results-v0.1.0.md.
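
The integrity check is ordinary Node crypto. A sketch, assuming the bundle exposes manifest.json next to its recorded signature (the .sig path here is illustrative, not the actual bundle layout):

// verify-sketch.js - integrity.signature_valid is an HMAC-SHA256 comparison
const crypto = require('node:crypto');
const fs = require('node:fs');

const manifest = fs.readFileSync('manifest.json'); // the bytes exactly as sealed
const recorded = fs.readFileSync('manifest.sig', 'utf8').trim(); // illustrative

const expected = crypto
  .createHmac('sha256', process.env.RECIPE_RECEIPT_SECRET)
  .update(manifest)
  .digest('hex');

// constant-time comparison; any tamper flips signature_valid to false
const valid =
  recorded.length === expected.length &&
  crypto.timingSafeEqual(Buffer.from(recorded), Buffer.from(expected));
console.log({ signature_valid: valid });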

What we do not claim

kolm does not appear on the SWE-bench Verified leaderboard. It does not claim to make Opus or any frontier model better at coding in general. The compile-time gate either produces an artifact that passes its embedded evals, or it doesn't ship.

Verified inference (k-sample + sound verifier) is monotone in k by construction. The interesting empirical question is what the cost-vs-accuracy curve looks like for a given task and verifier; that is the per-task compile decision, made in your account, not on a leaderboard.

If your task isn't verifiable (no machine-checkable ground truth), kolm tells you. The amplifier doesn't fire and the artifact isn't sealed. We say no on builds we can't gate.

Bottom line

Every .kolm carries the eval set it was gated on. The benchmark harness reruns those evals offline, in microseconds, with network egress monitored.

The report is JSON. The signature is HMAC. The harness is open. There is no leaderboard to cherry-pick because the artifact is its own leaderboard, and the next person to download it can verify the same number you did.