kolm / benchmarks / swe-bench-mini
kolm's compiled code-review artifact, evaluated against a 50-problem curated subset of SWE-bench Verified. Harness public, seed pinned, full prompt and response logs published as receipts. Every result on this page is one CID-anchored receipt away from a re-run.
problems
50
Verified Mini curated subset
pass
41 / 50
82.0% pass rate
k-score (avg)
0.86
across all 50 problems
latency (p95)
11.4s
per problem, end-to-end
Subset. 50 problems drawn from SWE-bench Verified - the 500-problem OpenAI-curated, human-graded subset of the full 2294-problem SWE-bench. Mini is a fixed sample, indices stable, intended for CI-friendly evaluation that takes minutes, not hours.
Harness. Public, in this repo at /harness/swe-bench-mini.py. Same loop OpenAI publishes, with our test-time policy slotted into the agent step. Seed = 42 in numpy, torch, and the agent's sampler.
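For reference, the seed pinning amounts to a few lines. This is an illustrative sketch, not a copy of /harness/swe-bench-mini.py, and the pin_seeds name is ours.

import random

import numpy as np
import torch

def pin_seeds(seed: int = 42) -> None:
    # Pin every RNG the agent loop touches so reruns reproduce bit-for-bit.
    random.seed(seed)                 # Python stdlib RNG (harness bookkeeping)
    np.random.seed(seed)              # numpy
    torch.manual_seed(seed)           # torch CPU and CUDA generators
    torch.cuda.manual_seed_all(seed)  # every visible GPU (one A100 here)
    # The agent's sampler receives the same seed through its own config,
    # which is artifact-specific and not shown here.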
Artifact under test. code-review-deepseek-coder-6.7b, CID cidv1:sha256:91a4f7d3.., compiled with kolm v0.1.0 on 2026-05-13.
Hardware. Single A100 80GB. Inference only - no internet access during agent loop. Full prompt + response captured for every step.
Logs. Each problem emits one receipt anchoring the agent's CID, the input problem hash, the patch hash, the test outcome, and the K-score. Receipts are published as JSONL alongside this page.
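Because the receipts are plain JSONL, the headline numbers above can be recomputed in a few lines of Python. The field names follow the example receipt further down this page; the summarize helper itself is an illustration, not part of the shipped harness.

import json
import statistics

def summarize(path: str = "results/receipts.jsonl") -> dict:
    with open(path) as fh:
        receipts = [json.loads(line) for line in fh if line.strip()]
    passed = sum(r["test_outcome"] == "pass" for r in receipts)
    latencies = sorted(r["latency_p95_ms"] for r in receipts)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]  # p95 across problems
    return {
        "problems": len(receipts),
        "pass": f"{passed} / {len(receipts)}",
        "k_avg": round(statistics.mean(r["k_score"] for r in receipts), 2),
        "latency_p95_s": round(p95 / 1000, 1),
    }

print(summarize())  # expected on this run: 50, 41 / 50, 0.86, 11.4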
K-score is a five-component weighted score, replayable from the receipt log. See /docs/k-score-methodology for the formal definition. Here is how each component was measured on SWE-bench Verified Mini.
For each problem, A is 1.0 if the agent's patch makes the hidden tests pass, 0.0 if it does not, and partial credit between 0 and 1 if only a subset of the relevant tests pass (e.g., the agent fixes the regression in one of three failing tests). Computed from the harness test_outcome field after applying the patch in the project's test runner.
S measures variance across N=10 reruns with the same seed but different sampler nonces. Normalized so zero variance equals 1.0 and a stdev of 0.5 on the per-problem accuracy collapses to 0. Recorded as k_stability_10x on each receipt.
L compares p95 inference latency against the budget declared in the artifact's recipe (15 s per problem for this artifact). 1.0 at or under budget, decaying exponentially past it: a 30 s problem gets ~0.5, a 60 s problem gets ~0.1. Recorded on the receipt as latency_p95_ms alongside the declared budget.
C is the pass rate against the artifact's compliance pack. For a code-review artifact this checks: no patch attempts to exfiltrate test secrets, no patch writes outside the sandbox, no patch invokes network APIs the manifest does not declare. All-pass = 1.0; one violation drops to ~0.7; two or more drops to 0. All 50 runs reported 1.0 here.
V is the agreement rate of the runtime verifier with the agent's output. For code patches the verifier is a separate constrained-decoded model that judges, on a per-line basis, whether the patch is consistent with the problem statement. Sampled at 100% on this run for transparency; production deploys sample at 5-10%. Recorded as verifier_agreement.
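The component weights and the exact curves live in /docs/k-score-methodology. The sketch below only encodes the behaviour stated above (stability collapsing at a stdev of 0.5, latency halving for every budget-width past the budget, compliance dropping to ~0.7 on one violation) and uses uniform weights as a placeholder assumption, so it will not exactly reproduce the published k_score values.

def stability(accuracy_stdev: float) -> float:
    # Zero variance across the 10 reruns -> 1.0; a stdev of 0.5 -> 0.0.
    return max(0.0, 1.0 - accuracy_stdev / 0.5)

def latency(p95_ms: float, budget_ms: float = 15_000) -> float:
    # 1.0 at or under the declared budget; halves every budget-width past it
    # (30 s -> 0.5, 60 s -> ~0.1 against the 15 s budget).
    if p95_ms <= budget_ms:
        return 1.0
    return 0.5 ** ((p95_ms - budget_ms) / budget_ms)

def compliance(violations: int) -> float:
    # All checks pass -> 1.0, one violation -> ~0.7, two or more -> 0.0.
    return {0: 1.0, 1: 0.7}.get(violations, 0.0)

def k_score(components: dict[str, float]) -> float:
    # Uniform weighting is an assumption; the published weights may differ.
    return sum(components.values()) / len(components)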
step 01
clone the harness repo
All harness code is public. Pinned to the same commit we ran. git clone https://github.com/sneaky-hippo/kolmogorov-stack && cd kolmogorov-stack
step 02
install kolm CLI
One Node 20+ install. npm i -g github:sneaky-hippo/kolmogorov-stack. Pulls the artifact registry on first run.
step 03
run the benchmark
kolm bench swe-bench-mini --artifact code-review-deepseek-coder-6.7b --seed 42 --out ./results/. Takes ~12 min on an A100. Receipts written to ./results/receipts.jsonl.
step 04
verify receipts match
kolm verify ./results/receipts.jsonl --expect-k-avg 0.86 --expect-pass 41. Exits 0 if your run matches ours bit-for-bit on hashes and within tolerance on K-score.
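If you want to cross-check without the CLI, the same comparison fits in a short script. The pairing by problem_id and the 0.02 K-score tolerance are assumptions for illustration; kolm verify is the authoritative check.

import json
import statistics

def load(path: str) -> dict:
    with open(path) as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    return {r["problem_id"]: r for r in rows}

def verify(mine: str, published: str, expect_pass: int = 41,
           expect_k_avg: float = 0.86, tol: float = 0.02) -> bool:
    ours, theirs = load(mine), load(published)
    # Hashes must match bit-for-bit, problem by problem.
    hashes_ok = all(
        ours[pid]["input_sha"] == r["input_sha"]
        and ours[pid]["patch_sha"] == r["patch_sha"]
        for pid, r in theirs.items()
    )
    passed = sum(r["test_outcome"] == "pass" for r in ours.values())
    k_avg = statistics.mean(r["k_score"] for r in ours.values())
    return hashes_ok and passed == expect_pass and abs(k_avg - expect_k_avg) <= tol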
One receipt per problem. Below is the receipt for django__django-11099, the first row in the table.
{
"schema": "kolm.receipt.v0.1",
"benchmark": "swe-bench-verified-mini",
"problem_id": "django__django-11099",
"artifact_cid": "cidv1:sha256:91a4f7d3b8c6e2a1d5f9b4c7e3a8d1b6c2e5f8a4",
"seed": 42,
"input_sha": "sha256:5e8b1c9d4a7f2e0b6c3a8f1e9d4b7c2a5f8e1b3d",
"patch_sha": "sha256:8d2a4c7e1b5f3a9c6e0b2d8f1a4c7e3b6d9a2c5f",
"test_outcome": "pass",
"k_score": 0.91,
"components": {
"accuracy": 1.00,
"stability": 0.97,
"latency": 0.92,
"compliance": 1.00,
"verifier": 0.89
},
"latency_p95_ms": 8420,
"tokens_total": 3180,
"verifier_agreement": 0.89,
"ts": "2026-05-13T19:42:08Z",
"issuer_pubkey": "kolm-issuer-2026q2",
"hmac": "f3a7b1c4d8e2a6c9f1d4b7e2c5a8d1b4c7e9f2a5"
}
Re-run that exact problem with kolm bench swe-bench-mini --problem django__django-11099 --seed 42. The receipt should match bit-for-bit on input_sha and patch_sha, and land within tolerance on latency_p95_ms and k_score.
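A minimal way to do that comparison by hand, assuming you keep the published receipts next to your fresh run. The list of fields treated as exactly reproducible is our reading of the text above, not a schema guarantee.

import json

EXACT_FIELDS = ("artifact_cid", "seed", "input_sha", "patch_sha", "test_outcome")

def find_receipt(path: str, problem_id: str) -> dict:
    with open(path) as fh:
        for line in fh:
            r = json.loads(line)
            if r["problem_id"] == problem_id:
                return r
    raise KeyError(problem_id)

def diff_receipt(fresh: str, published: str, problem_id: str) -> dict:
    a, b = find_receipt(fresh, problem_id), find_receipt(published, problem_id)
    # Empty dict means the re-run reproduced the published receipt.
    return {f: (a[f], b[f]) for f in EXACT_FIELDS if a[f] != b[f]}

print(diff_receipt("results/receipts.jsonl", "published/receipts.jsonl",
                   "django__django-11099"))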
Mini. 50 problems is a fixed curated subset chosen for CI-friendliness. Full SWE-bench is 2294 problems and runs for tens of hours on a single A100. Mini gives directionally correct signal in ~12 minutes - which means you can put it on a PR gate without burning compute budget on every push.
Verified. OpenAI's SWE-bench Verified is a 500-problem human-graded subset of the 2294. Every Verified problem has been reviewed for ambiguity, broken tests, or under-specification. We sample only from Verified - so a fail here is a real defect, not a benchmark-design artifact.
The trade-off is variance. With 50 problems each result is worth 2 percentage points, so flipping two problems swings the headline pass rate by 4 points. We publish all 50 results line by line so anyone can spot the problems they're skeptical of. The seed=42 guarantee means the subset sampler draws the same 50 problems on every run.