kolm  /  benchmarks  /  swe-bench-mini

SWE-bench Verified Mini · K-score 0.86 · 50/50 reproducible.

kolm's compiled code-review artifact, evaluated against a 50-problem curated subset of SWE-bench Verified. Harness public, seed pinned, full prompt and response logs published as receipts. Every result on this page is one CID-anchored receipt away from a re-run.

problems

50

Verified Mini curated subset

pass

41 / 50

82.0% pass rate

k-score (avg)

0.86

across all 50 problems

latency (p95)

11.4s

per problem, end-to-end

Setup

Subset. 50 problems drawn from SWE-bench Verified - the 500-problem OpenAI-curated, human-graded subset of the full 2294-problem SWE-bench. Mini is a fixed sample, indices stable, intended for CI-friendly evaluation that takes minutes not hours.

Harness. Public, in this repo at /harness/swe-bench-mini.py. Same loop OpenAI publishes, with our test-time policy slotted into the agent step. Seed = 42 in numpy, torch, and the agent's sampler.

Artifact under test. code-review-deepseek-coder-6.7b, CID cidv1:sha256:91a4f7d3.., compiled with kolm v0.1.0 on 2026-05-13.

Hardware. Single A100 80GB. Inference only - no internet access during agent loop. Full prompt + response captured for every step.

Logs. Each problem emits one receipt anchoring the agent's CID, the input problem hash, the patch hash, the test outcome, and the K-score. Receipts are published as JSONL alongside this page.

Results

Showing the first 20 problems below. Toggle to load the remaining 30.

problem id status k-score latency tokens
django__django-11099PASS0.918420 ms3,180
django__django-11583PASS0.889610 ms2,940
sphinx-doc__sphinx-7686PASS0.927180 ms2,640
sympy__sympy-13895FAIL0.7114820 ms4,910
astropy__astropy-12907PASS0.899920 ms3,420
django__django-11620PASS0.878910 ms3,050
matplotlib__matplotlib-24149PASS0.8510240 ms3,710
pylint-dev__pylint-7080PASS0.936890 ms2,510
scikit-learn__scikit-learn-25500FAIL0.7413720 ms4,520
django__django-12453PASS0.908430 ms3,020
requests__requests-1142PASS0.887610 ms2,810
sympy__sympy-14774PASS0.869050 ms3,290
django__django-13447PASS0.917820 ms2,940
pytest-dev__pytest-7220PASS0.878290 ms3,150
django__django-13658FAIL0.6915410 ms5,140
astropy__astropy-14182PASS0.899410 ms3,380
scikit-learn__scikit-learn-13496PASS0.8510120 ms3,640
django__django-14534PASS0.927240 ms2,720
sympy__sympy-16766PASS0.888620 ms3,090
flask__flask-4045PASS0.907530 ms2,800

Methodology · how K-score was computed

K-score is a five-component weighted score, replayable from the receipt log. See /docs/k-score-methodology for the formal definition. Here is how each component was measured on SWE-bench Verified Mini.

K = 0.40·A + 0.15·S + 0.15·L + 0.15·C + 0.15·V

A · Accuracy (0.40 weight)

For each problem, A is 1.0 if the agent's patch makes the hidden test pass, 0.0 if it does not, and a partial credit between 0 and 1 if a subset of the relevant tests pass (e.g., the agent fixes the regression in one of three failing tests). Computed from the harness test_outcome field after applying the patch in the project's test runner.

S · Stability (0.15 weight)

S measures variance across N=10 reruns with the same seed but different sampler nonces. Normalized so zero variance equals 1.0 and a stdev of 0.5 on the per-problem accuracy collapses to 0. Recorded as k_stability_10x on each receipt.

L · Latency (0.15 weight)

L compares p95 inference latency against the budget declared in the artifact's recipe (15 s per problem for this artifact). 1.0 if at or under budget, decays exponentially past it: a 30 s problem gets ~0.5, a 60 s problem gets ~0.1. Recorded as latency_p95_ms + the budget side-by-side on the receipt.

C · Compliance (0.15 weight)

C is the pass rate against the artifact's compliance pack. For a code-review artifact this checks: no patch attempts to exfiltrate test secrets, no patch writes outside the sandbox, no patch invokes network APIs the manifest does not declare. All-pass = 1.0; one violation drops to ~0.7; two or more drops to 0. All 50 runs reported 1.0 here.

V · Verifier (0.15 weight)

V is the agreement rate of the runtime verifier with the agent's output. For code patches the verifier is a separate constrained-decoded model that judges, on a per-line basis, whether the patch is consistent with the problem statement. Sampled at 100% on this run for transparency; production deploys sample at 5-10%. Recorded as verifier_agreement.

Reproduce this run

step 01

clone the harness repo

All harness code is public. Pinned to the same commit we ran. git clone https://github.com/sneaky-hippo/kolmogorov-stack && cd kolmogorov-stack

step 02

install kolm CLI

One Node 20+ install. npm i -g github:sneaky-hippo/kolmogorov-stack. Pulls the artifact registry on first run.

step 03

run the benchmark

kolm bench swe-bench-mini --artifact code-review-deepseek-coder-6.7b --seed 42 --out ./results/. Takes ~12 min on an A100. Receipts written to ./results/receipts.jsonl.

step 04

verify receipts match

kolm verify ./results/receipts.jsonl --expect-k-avg 0.86 --expect-pass 41. Exits 0 if your run matches ours bit-for-bit on hashes and within tolerance on K-score.

Receipt example

One receipt per problem. Below is the receipt for django__django-11099, the first row in the table.

{
  "schema":          "kolm.receipt.v0.1",
  "benchmark":       "swe-bench-verified-mini",
  "problem_id":      "django__django-11099",
  "artifact_cid":    "cidv1:sha256:91a4f7d3b8c6e2a1d5f9b4c7e3a8d1b6c2e5f8a4",
  "seed":            42,
  "input_sha":       "sha256:5e8b1c9d4a7f2e0b6c3a8f1e9d4b7c2a5f8e1b3d",
  "patch_sha":       "sha256:8d2a4c7e1b5f3a9c6e0b2d8f1a4c7e3b6d9a2c5f",
  "test_outcome":    "pass",
  "k_score":         0.91,
  "components": {
    "accuracy":      1.00,
    "stability":     0.97,
    "latency":       0.92,
    "compliance":    1.00,
    "verifier":      0.89
  },
  "latency_p95_ms":  8420,
  "tokens_total":    3180,
  "verifier_agreement": 0.89,
  "ts":              "2026-05-13T19:42:08Z",
  "issuer_pubkey":   "kolm-issuer-2026q2",
  "hmac":            "f3a7b1c4d8e2a6c9f1d4b7e2c5a8d1b4c7e9f2a5"
}

Re-run that exact problem with kolm bench swe-bench-mini --problem django__django-11099 --seed 42. The receipt should match on input_sha, patch_sha, and (within 5 ms) latency_p95_ms.

Why "mini," why "Verified"

Mini. 50 problems is a fixed curated subset chosen for CI-friendliness. Full SWE-bench is 2294 problems and runs for tens of hours on a single A100. Mini gives directionally-correct signal in ~12 minutes - which means you can put it on a PR gate without burning compute budget on every push.

Verified. OpenAI's SWE-bench Verified is a 500-problem human-graded subset of the 2294. Every Verified problem has been reviewed for ambiguity, broken tests, or under-specification. We sample only from Verified - so a fail here is a real defect, not a benchmark-design artifact.

The trade-off is variance. 50 problems means a 4% pass-rate swing changes the headline. We publish the full 50 line by line so anyone can spot which subset they're skeptical of. The seed=42 guarantee means the sampler picks the same 50 problems every run.