Engineering · 2026-05-09 · 8 min read

How we benchmark, and how to disagree with us.

Every percentage on this site comes from one run, one seed, one evaluator. n=150, seed=42, Opus-4.7, swebench 4.1.0. The result is +10.67pp on SWE-bench Lite, 95% CI [+4.67, +16.67], p<0.05. This page documents what that number measures, what it does not, and the diagnosis flowchart for when your reproduce attempt lands somewhere different.

By Kolmogorov · Tags: methodology · benchmark · reproducer

One number, named honestly.

Five different numbers framed five different ways is how a marketing page is built. One number with the seed, the model, and the evaluator written next to it is how a benchmark is built. We chose the second.

| Arm | Pass rate | Detail |
| --- | --- | --- |
| Cold (Opus-4.7, single shot) | 30.0% | 45 / 150 instances solved · pass-1, no retrieval |
| Opus-4.7 + kolm injection | 40.7% | 61 / 150 instances solved · pass-2, retrieval + error-trace memory |

The delta is +10.67 percentage points strict, 95% Wilson CI [+4.67, +16.67], two-sided exact-McNemar p < 0.05. The fixture is SWE-bench Lite, the official 300-instance subset; we run a fixed prefix of 150 with seed 42. The evaluator is swebench 4.1.0 in the official Docker harness - the same one the upstream leaderboard uses. Our score is the strict pass rate it reports.
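If you want to recompute the statistics from your own report, here is a minimal sketch. One caveat: the published interval on the delta combines the two per-arm Wilson intervals (a Newcombe-style construction); the sketch below shows the per-arm Wilson interval and the exact McNemar test, and the discordant-pair counts passed to it are placeholders, not our published values.

```ts
// Exact two-sided McNemar test on the discordant pairs:
// b = instances only the kolm arm solved, c = instances only the cold arm solved.
function mcnemarExact(b: number, c: number): number {
  const n = b + c;
  const k = Math.min(b, c);
  let p = 0;
  let logC = 0; // ln C(n, 0)
  for (let i = 0; i <= k; i++) {
    p += Math.exp(logC + n * Math.log(0.5)); // P(X = i), X ~ Binomial(n, 1/2)
    logC += Math.log(n - i) - Math.log(i + 1); // ln C(n, i+1) from ln C(n, i)
  }
  return Math.min(1, 2 * p);
}

// 95% Wilson score interval for a single arm's pass rate (k passes out of n).
function wilson(k: number, n: number, z = 1.96): [number, number] {
  const p = k / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n));
  return [center - half, center + half];
}

console.log(wilson(45, 150)); // cold arm: ~[0.232, 0.378]
console.log(wilson(61, 150)); // kolm arm: ~[0.331, 0.487]
console.log(mcnemarExact(20, 4)); // placeholder discordant counts, not our run's
```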

The lift held under a lenient evaluator (loose patch-match) and under three independent reruns at the same seed. The magnitude did not hold from n=50: we published +16pp last quarter, and the n=150 retest pulled the point estimate down and narrowed the confidence interval to a smaller, honest number. n=50 was a cherry; n=150 is the real one. The full version history is in the changelog.

What the lift actually measures.

The kolm injection is a prompt-assembler. Before each SWE-bench instance is sent to Opus, we read three local stores:

That is the entire mechanism. There is no fine-tune, no LoRA, no RAG vector store. There is also no reasoning trick, no chain-of-thought injection, no agent harness. kolm is a prompt-assembler that reads what you already know and writes it into the prompt. The lift is what falls out when the model gets to see what it should have already seen.
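In sketch form, the mechanism fits in a few dozen lines. The store file names and entry shapes below are illustrative assumptions, not the shipped assembler's actual schema:

```ts
import { readFileSync } from "node:fs";
import { join } from "node:path";

// Illustrative sketch of the injection step - store file names and entry
// shapes here are assumptions, not the shipped assembler's schema.
interface StoreEntry {
  repo: string; // which repository the note applies to
  text: string; // the note itself: a retrieved snippet, an error trace, etc.
}

function assemblePrompt(instance: { repo: string; issue: string }): string {
  const storeDir = join(process.env.HOME ?? ".", ".kolm");
  const entries = ["retrieval.json", "error-traces.json", "notes.json"] // assumed names
    .flatMap((f) => JSON.parse(readFileSync(join(storeDir, f), "utf8")) as StoreEntry[]);

  const context = entries
    .filter((e) => e.repo === instance.repo) // keep only this repo's entries
    .map((e) => e.text)
    .join("\n");

  // The task itself is unchanged; the model just gets to see what it
  // should have already seen.
  return `${context}\n\n---\n\n${instance.issue}`;
}
```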

This is also why the lift has a ceiling around +10 percentage points on a single-turn benchmark. A prompt-assembler can only fix what is missing from the prompt. Anything that requires multiple turns, tool calls, or learned behavior across instances is on the breakout list - it needs the Dream consolidation pass that ships in week 2.

Reproducing on your own machine.

The full reproducer is one command. It will take ~90 minutes and ~$30 in Anthropic API credit at current Opus-4.7 prices. You bring your own key. We never see it; the harness passes it into the environment of the spawned Docker containers and never writes it to disk.

```sh
# install the CLI (one-liner, ~10s)
npm i -g github:sneaky-hippo/kolmogorov-stack

# bring your own key, then run the reproducer
export ANTHROPIC_API_KEY=sk-ant-...
kolm bench --reproduce swebench-lite-n150 --seed 42 --n 150

# first time? sanity-check the wiring with the n=5 smoke (~3 min, ~$1)
kolm bench --reproduce swebench-lite-n150 --seed 42 --n 5 --dry-run  # prints the plan
kolm bench --reproduce swebench-lite-n150 --seed 42 --n 5
```
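On the key handling specifically, here is a minimal sketch of the handoff, assuming Docker's standard pass-by-name environment flag (the spawn call and image name are illustrative, not the shipped harness):

```ts
import { spawn } from "node:child_process";

// Illustrative sketch, not the shipped harness. `-e NAME` with no value tells
// Docker to copy NAME from the client's environment, so the key's value never
// appears in argv, in shell history, or on disk - only in process memory.
if (!process.env.ANTHROPIC_API_KEY) {
  throw new Error("export ANTHROPIC_API_KEY first");
}

spawn(
  "docker",
  ["run", "--rm", "-e", "ANTHROPIC_API_KEY", "swebench-evaluator:4.1.0"], // image name assumed
  { env: process.env, stdio: "inherit" },
);
```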

The CLI clones the SWE-bench repo at the pinned commit, pulls the official evaluator Docker image, runs both arms (cold and kolm), and writes a single JSON report to ~/.kolm/bench/swebench-lite-n150/report.json. The report records the per-instance pass/fail for each arm, the wall time, the prompt + response token counts, the total dollar spend, and a sha256 of the evaluator's _eval_*.log outputs, so you can tell at a glance whether your logs match ours before diffing them line-for-line.
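The shape of that report, sketched as a type. Field names below are illustrative, drawn from the list above; the report your run writes is the source of truth:

```ts
// Illustrative shape of report.json; field names are sketched from the list
// above - the report your run writes is the source of truth.
interface BenchReport {
  fixture: string; // "swebench-lite-n150"
  seed: number; // 42
  n: number; // 150
  arms: Record<"cold" | "kolm", {
    instances: { id: string; passed: boolean }[]; // per-instance pass/fail
    wallTimeSec: number;
    promptTokens: number;
    responseTokens: number;
    dollarSpend: number;
    evalLogSha256: string[]; // one per _eval_*.log, comparable against ours
  }>;
}

// With two reports in hand, finding where your run and our reference run
// disagree is a per-index join (same seed and prefix imply same order):
const disagreements = (yours: BenchReport, ours: BenchReport, arm: "cold" | "kolm") =>
  yours.arms[arm].instances.filter(
    (inst, i) => inst.passed !== ours.arms[arm].instances[i].passed,
  );
```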

For a faster sanity check before committing the full run, use --n 5. That spends about a dollar in a few minutes and confirms the harness is wired correctly. It is not a measurement - at n=5 a single instance swings the delta by 20 points - so it cannot settle a disagreement. Do not cite n=5 numbers.

The evaluator stack.

| Layer | What we use | Why |
| --- | --- | --- |
| Bench | SWE-bench Lite | Real PRs from real OSS Python repos. Single-file edits with a hidden test. Hard to game. |
| Slice | n=150 prefix at seed=42 | Half of Lite. Tighter CI than n=50, half the runtime of full Lite. |
| Evaluator | swebench 4.1.0 | The version on the official leaderboard at time of run. Container-isolated patch-and-test. |
| Model | claude-opus-4-7 | Anthropic's strongest coder at time of run. Pinned by name, not latest. |
| Sampler | temperature=0, max_tokens=8192 | Deterministic per attempt. Same prompt produces same output. |
| Stat test | McNemar exact, Wilson CI | Paired binary outcomes per instance. Standard for matched-arm benchmarks. |

What we do not claim.

Three benchmarks you will not see cited on this site, and why.

We also do not claim agentic-loop numbers. Every percentage on this site is a single-turn benchmark - one prompt, one response, one grade. Agentic harnesses (multi-turn tool use, planner + executor split, self-critique) layer on top of single-turn lifts in ways that depend heavily on the harness, and we have not yet shipped a harness we are willing to call ours.

If your number disagrees with ours.

SWE-bench has four common failure modes that look like "kolm is broken" but are upstream issues. Before reporting a discrepancy, check:

If you have ruled all four out and your number still differs by more than the CI - lower than +4.67pp or higher than +16.67pp - email us with the ~/.kolm/bench/swebench-lite-n150/report.json attached. We will diff against our reference run and either correct our published number or document the regression.

Public log.

The version history of every benchmark claim we have ever shipped, with the file commits where the claim was added and removed:

The benchmarks at the top of this article are the ones we will defend in writing. If a number on a marketing page does not link to either this article or a one-command reproducer, treat it as a bug and tell us.