K-score explainer recipe

What this recipe does

A failed K-score is the kolm equivalent of a compile error in C. The diagnostic is verbose; the fix is usually one of about a dozen patterns ("three gold pairs in your set are mutually inconsistent," "your verifier is too strict for the schema you wrote," "your eval set is too small to hit the floor with statistical confidence"). This recipe knows those patterns and applies them.

The verifier rejects any fix that doesn't trace to the actual numbers in the report — so the recipe can't make up a "your eval set is too small" diagnosis when it isn't. Each suggested fix points to a concrete row in the failure log.

The spec

{
  "output_kind": "json",
  "schema": {
    "required": ["diagnosis", "fixes", "expected_lift"],
    "properties": {
      "diagnosis": { "type": "string", "maxLength": 500 },
      "fixes": { "type": "array", "items": {
        "required": ["action", "target", "why"],
        "properties": {
          "action": { "enum": ["remove_pair", "relabel_pair", "relax_verifier", "strengthen_verifier", "add_pairs", "lower_k_floor"] },
          "target": { "type": "string" },
          "why": { "type": "string" }
        }
      }, "maxItems": 6 },
      "expected_lift": { "type": "number" }
    }
  },
  "verifier": {
    "target_must_appear_in_failure_log": true,
    "diagnosis_must_cite_T_C_or_L_value": true,
    "expected_lift_calibrated_brier": 0.12
  }
}

Gold pair (1 of 800 shown)

input - failed K-score report

k_score=0.78 (floor=0.85)
  T=0.74 (k-sample pass rate)  C=0.91 (recipe coverage)  L=0.96 (latency)
failure_log:
  pair[112] verifier_failed: span_must_byte_match_input  diff=2chars
  pair[113] verifier_failed: span_must_byte_match_input  diff=4chars
  pair[114] verifier_failed: span_must_byte_match_input  diff=1char
  ... (24 of 30 failures are span byte-match with diff < 5 chars)

output

{
  "diagnosis": "T=0.74 is dragging the K-score below 0.85; 24 of 30 failures are span byte-match misses with diff under 5 chars (whitespace/punctuation). The verifier is correctly strict but the gold corpus has noisy span boundaries.",
  "fixes": [
    {"action": "relax_verifier", "target": "span_must_byte_match_input", "why": "switch to span_must_overlap_input(min_overlap=0.95) — keeps semantic constraint, tolerates whitespace drift"},
    {"action": "relabel_pair", "target": "pairs[112-135]", "why": "normalize gold spans to trim leading/trailing whitespace"}
  ],
  "expected_lift": 0.09
}

Compile

kolm compile "K-score failure explainer with grounded fix recommendations" \
  --base qwen2.5-coder-7b \
  --pairs ./failed-compile-pairs.jsonl \
  --verifier target-grounded,T-C-L-cited \
  --k-floor 0.82 \
  --output k-score-explainer.kolm

ok wrote k-score-explainer.kolm
   k_score=0.86  signature=hmac-sha256

K-score gate

K-score 0.86 held-out 240 failed compiles · target-grounded 100% · expected-lift Brier 0.094 (target 0.12) · user-applied-fix-passed 71%

"User-applied-fix-passed 71%" means: when a real engineer ran the suggested fix, the next compile passed its K-floor 71% of the time. The other 29% needed a second iteration — but the second iteration was also explained by this recipe.

Run-time profile

M2 MacBook

2.4s

RTX 5090

560ms

Mac Studio

1.4s

CPU x86 (server)

3.4s

Deploy

# wired into kolm compile failure path:
on_compile_failed = (report) => {
  const x = kolm.run('k-score-explainer.kolm', report);
  console.log('\nDIAGNOSIS:\n  ' + x.diagnosis);
  console.log('\nFIXES (expected lift +' + x.expected_lift.toFixed(2) + '):');
  for (const f of x.fixes) console.log('  - ' + f.action + ' ' + f.target + '\n    ' + f.why);
};

K-score failed, here's the fix.