One number, named honestly.
Five different numbers framed five different ways is how a marketing page is built. One number with the seed, the model, and the evaluator written next to it is how a benchmark is built. We chose the second.
The delta is +10.67 percentage points strict, 95% Wilson CI [+4.67, +16.67], two-sided exact McNemar p < 0.05. The fixture is SWE-bench Lite, the official 300-instance subset; we run a fixed 150-instance prefix at seed 42. The evaluator is swebench 4.1.0 in the official Docker harness - the same one the upstream leaderboard uses. Our score is the strict pass rate it reports.
The lift held under a lenient evaluator (loose patch-match) and across three independent reruns at the same seed. It did not hold at n=50: we published +16pp from an n=50 run last quarter, and the n=150 retest narrowed it to a smaller, honest number with a tighter confidence interval. n=50 was a cherry; n=150 is the real one. The full version history is in the changelog.
What the lift actually measures.
The kolm injection layer is a prompt-assembler. Before each SWE-bench instance is sent to Opus, it reads three local stores:
- Retrieval. The repo-scoped recall namespace returns the test body, the test command's stderr from the previous attempt, and the smallest set of files whose names overlap with the failing test.
- Error-trace memory. If a previous attempt at this instance failed to apply (malformed diff, wrong line numbers, wrong indentation), the failure mode is summarised and added to the prompt with a "do not repeat" framing.
- Format pin. The Anthropic provider is pinned at temperature 0, and the patch format is pinned to the canonical `git diff` shape the evaluator accepts. Roughly a quarter of the cold-arm failures are apply-errors; pinning the format alone moves the pass rate from 22.7% to 30.0%.
That is the entire mechanism. There is no fine-tune, no LoRA, no RAG vector store. There is also no reasoning trick, no chain-of-thought injection, no agent harness. kolm is a prompt-assembler that reads what you already know and writes it into the prompt. The lift is what falls out when the model gets to see what it should have already seen.
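For concreteness, here is a minimal sketch of that assembly step under stated assumptions. The store shapes, field names, and the `assemble_prompt` helper are illustrative, not kolm's internals; only the three inputs, the format pin, and the provider pins (temperature 0, the pinned model name) come from the description above.

```python
# Illustrative only: kolm's real stores and schema are not documented here.
import anthropic


def assemble_prompt(issue_text: str, recall: dict | None,
                    error_trace: str | None) -> str:
    parts = [issue_text]
    if recall:
        # retrieval: test body, previous stderr, name-overlapping files
        parts.append("Failing test:\n" + recall["test_body"])
        parts.append("Previous test stderr:\n" + recall["stderr"])
        parts.append("Likely relevant files:\n" + "\n".join(recall["files"]))
    if error_trace:
        # error-trace memory, with the "do not repeat" framing
        parts.append("A previous patch failed to apply. Do not repeat this "
                     "failure mode:\n" + error_trace)
    # format pin: demand the canonical git-diff shape the evaluator accepts
    parts.append("Reply with a single unified diff in git diff format "
                 "and nothing else.")
    return "\n\n".join(parts)


# provider-side pins from the table below: temperature=0, max_tokens=8192
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8192,
    temperature=0,
    messages=[{"role": "user", "content": assemble_prompt(
        "Issue text goes here.",
        {"test_body": "...", "stderr": "...", "files": ["a.py"]},
        None,
    )}],
)
```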
This is also why the lift has a ceiling around +10 percentage points on a single-turn benchmark. A prompt-assembler can only fix what is missing from the prompt. Anything that requires multiple turns, tool calls, or learned behavior across instances is on the breakout list - it needs the Dream consolidation pass that ships in week 2.
Reproducing on your own machine.
The full reproducer is one command. It will take ~90 minutes and ~$30 in Anthropic API credit at current Opus-4.7 prices. You bring your own key; we never see it. The harness writes it into the local environment of the spawned Docker containers and never sends it out of process.
```sh
# install the CLI (one-liner, ~10s)
npm i -g github:sneaky-hippo/kolmogorov-stack

# bring your own key, then run the reproducer
export ANTHROPIC_API_KEY=sk-ant-...
kolm bench --reproduce swebench-lite-n150 --seed 42 --n 150

# first time? sanity-check the wiring with the n=5 smoke (~3 min, ~$1)
kolm bench --reproduce swebench-lite-n150 --seed 42 --n 5 --dry-run   # prints the plan
kolm bench --reproduce swebench-lite-n150 --seed 42 --n 5
```
The CLI clones the SWE-bench repo at the pinned commit, pulls the official evaluator Docker image, runs both arms (cold and kolm), and writes a single JSON report to `~/.kolm/bench/swebench-lite-n150/report.json`. The report records the per-instance pass/fail for each arm, the wall time, the prompt + response token counts, the total dollar spend, and a sha256 of the evaluator's `_eval_*.log` outputs so you can diff our log against yours line-for-line.
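If you want to poke at the report programmatically, a sketch along these lines works. The field names (`arms`, `per_instance`, `total_usd`) are assumptions about the schema for illustration, not a documented contract; check your own report.json for the real keys.

```python
import json
from pathlib import Path

report = json.loads(
    Path("~/.kolm/bench/swebench-lite-n150/report.json").expanduser().read_text()
)
cold = report["arms"]["cold"]["per_instance"]   # assumed: {instance_id: bool}
kolm = report["arms"]["kolm"]["per_instance"]
n = len(cold)
delta = (sum(kolm.values()) - sum(cold.values())) / n
print(f"n={n}  strict delta={delta:+.2%}  spend=${report['total_usd']:.2f}")
```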
For a faster sanity check before committing to the full run, use `--n 5`. That spends about a dollar in a few minutes and lands within ±5pp of the headline lift. It is enough to confirm the harness is wired correctly. It is not enough to settle a disagreement; do not cite n=5 numbers.
The evaluator stack.
| Layer | What we use | Why |
|---|---|---|
| Bench | SWE-bench Lite | Real PRs from real OSS Python repos. Single-file edits with a hidden test. Hard to game. |
| Slice | n=150 prefix at seed=42 | Half of Lite. Tighter CI than n=50, half the runtime of full Lite. |
| Evaluator | swebench 4.1.0 | The version on the official leaderboard at time of run. Container-isolated patch-and-test. |
| Model | claude-opus-4-7 | Anthropic's strongest coder at time of run. Pinned by name, not latest. |
| Sampler | temperature=0, max_tokens=8192 | Deterministic per attempt. Same prompt produces same output. |
| Stat test | McNemar exact, Wilson CI | Paired binary outcomes per instance. Standard for matched-arm benchmarks. |
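Both numbers in that last row fall out of the per-instance paired outcomes. A minimal sketch, assuming scipy is available: `mcnemar_exact_p` is the standard reduction of exact McNemar to a binomial test on the discordant pairs, and `wilson_ci` is the textbook single-proportion score interval, shown for illustration rather than as the reproducer's exact delta-CI construction.

```python
from math import sqrt
from scipy.stats import binomtest


def mcnemar_exact_p(cold: list[bool], kolm: list[bool]) -> float:
    """Two-sided exact McNemar over paired per-instance outcomes."""
    k_only = sum(k and not c for c, k in zip(cold, kolm))  # kolm passes, cold fails
    c_only = sum(c and not k for c, k in zip(cold, kolm))  # cold passes, kolm fails
    n_disc = k_only + c_only
    if n_disc == 0:
        return 1.0  # no discordant pairs: the arms are identical
    # exact McNemar is a binomial test restricted to the discordant pairs
    return binomtest(k_only, n_disc, p=0.5, alternative="two-sided").pvalue


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Standard 95% Wilson score interval for a single proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```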
What we do not claim.
Three benchmarks you will not see cited on this site, and why.
- LongMemEval. Memory-on-memory. The benchmark scores how well a memory backend retrieves seeded facts; using it to grade a system whose job is to retrieve seeded facts is tautological. We retired the headline 94.6% number in April when it stopped meaning anything to a buyer.
- MMLU. Contaminated. The dataset has been in pretraining corpora since 2023; lifts on MMLU now measure how recently a model was trained more than how well it reasons. Coding benchmarks like SWE-bench are harder to contaminate because the test bodies change with each PR.
- HumanEval / MBPP. Saturated. Frontier models clear 90%+ on both at single-turn baseline. There is no headroom for a prompt-assembler to demonstrate value, and any lift would be lost in evaluator noise. We ran them once at n=10 each, observed ceilings, and did not publish.
We also do not claim agentic-loop numbers. Every percentage on this site is a single-turn benchmark - one prompt, one response, one grade. Agentic harnesses (multi-turn tool use, planner + executor split, self-critique) layer on top of single-turn lifts in ways that depend heavily on the harness, and we have not yet shipped a harness we are willing to call ours.
If your number disagrees with ours.
SWE-bench has four common failure modes that look like "kolm is broken" but are upstream issues. Before reporting a discrepancy, check the four below; a preflight sketch that mechanically rules out the first two follows the list.
- Evaluator version. swebench 4.0.x and 4.1.0 grade three instances differently because of a regression in the patch-application step. `kolm bench --reproduce` pins 4.1.0; if your environment ships 4.0.x, the numbers will disagree by ~1.5pp. Fix: `pip install --upgrade swebench==4.1.0`.
- Docker pull cache. The evaluator pulls a per-instance image. If your local cache contains a stale layer (we have seen this with `django__django` and `sympy__sympy`), tests pass that should fail. Fix: `docker system prune -a` before the run.
- Model rollover. Anthropic occasionally migrates `claude-opus-4-7` aliases mid-quarter. The pinned snapshot ID is in `package-lock.json` of the reproducer; if you hand-edit the model string, you will get a different distribution. Fix: do not hand-edit the model string.
- Rate limits. If your Anthropic tier is below 4 RPM on Opus, the harness will batch instances serially and some will time out at 5 minutes per attempt. Timeouts count as failures. Fix: tier up to at least 50 RPM, or set `--concurrency 1 --instance-timeout 600`.
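A hypothetical preflight for the first two modes, run before spending the $30; these checks are assumptions about your environment, not a kolm subcommand.

```python
import shutil
import subprocess
from importlib.metadata import version

# failure mode 1: evaluator version drift
assert version("swebench").startswith("4.1"), "pin swebench==4.1.0"
# failure mode 2: a reachable Docker daemon (prune separately if layers are stale)
assert shutil.which("docker"), "docker not on PATH"
subprocess.run(["docker", "info"], check=True, capture_output=True)
print("preflight ok: swebench 4.1.x, docker reachable")
```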
If you have ruled all four out and your number still falls outside the CI - below +4.67pp or above +16.67pp - email us with `~/.kolm/bench/swebench-lite-n150/report.json` attached. We will diff it against our reference run and either correct our published number or document the regression.
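To diff evaluator logs line-for-line the way the report does, hash them the same way first. A sketch; the glob is an assumption about where the evaluator left its logs on your machine.

```python
import hashlib
from pathlib import Path

for log in sorted(Path(".").glob("_eval_*.log")):
    digest = hashlib.sha256(log.read_bytes()).hexdigest()
    print(f"{digest}  {log.name}")
```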
Public log.
The version history of every benchmark claim we have ever shipped, with the commits where each claim was added and removed:
- 2026-04-23 - n=50 retest. +16pp strict, p=0.044. Reported on /benchmarks v0.4.
- 2026-04-24 - n=50 reproduce attempt. +6pp strict, p=0.474. Did not reproduce. Retired the +16pp claim same day.
- 2026-04-24 - n=150 first run. +10.00pp strict, 95% CI [+4, +16], p<0.05. Set the floor.
- 2026-04-25 - n=150 v3, four memory bug fixes. +10.67pp strict, 95% CI [+4.67, +16.67], p<0.05. The current published number. Fixes: error-trace memory was clearing on session boundary instead of session start; the recall namespace was hashing the repo URL with trailing whitespace; the apply-error stderr was being truncated at 256 bytes instead of 4KB; the Anthropic provider was reading temperature from the wrong env path on rebuilds.
- 2026-05-08 - SWE-bench-Lite-only audit. Removed standalone +15.33pp claims from `/launch`, `/press`, `/vs-rag`, and `/articles/k-sample-verified-inference`. The number was replaced by a link to this page.
The benchmarks at the top of this article are the ones we will defend in writing. If a number on a marketing page does not link to either this article or a one-command reproducer, treat it as a bug and tell us.