Recipe · meta

K-score failed, here's the fix.

A local .kolm file that takes a failed compile's K-score report (the T, C, L decomposition plus the per-pair failure log) and returns a ranked list of fixes: which gold pairs to remove or relabel, which verifier hooks to relax or strengthen, and where to add pairs. Trained on 800 of our own historical failed compiles, each paired with the actual fix that made it pass.

base model: qwen2.5-coder-7b
gold pairs: 800 (560 train / 240 eval)
k-score floor: 0.82
artifact size: 2.4 GB
compile time: ~46 min
spec source: grounded diagnosis

What this recipe does

A failed K-score is the kolm equivalent of a compile error in C. The diagnostic is verbose; the fix is usually one of about a dozen patterns ("three gold pairs in your set are mutually inconsistent," "your verifier is too strict for the schema you wrote," "your eval set is too small to hit the floor with statistical confidence"). This recipe knows those patterns and applies them.

The verifier rejects any fix that doesn't trace to the actual numbers in the report, so the recipe can't invent a "your eval set is too small" diagnosis when the eval set is fine. Each suggested fix points to a concrete row in the failure log.
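The grounding rule can be pictured as two plain string checks. The sketch below is an illustration only: the function names, the shape of the `report` object, and the handling of range targets such as `pairs[112-135]` are assumptions, since the recipe does not show how the real verifier hooks are implemented.

```javascript
// Hypothetical helpers; the real verifier hooks are not shown in this recipe.

// Expand a range target such as "pairs[112-135]" into per-pair log keys.
// Any other target is checked verbatim.
function candidateKeys(target) {
  const range = /^pairs?\[(\d+)-(\d+)\]$/.exec(target);
  if (!range) return [target];
  const keys = [];
  for (let i = Number(range[1]); i <= Number(range[2]); i++) keys.push(`pair[${i}]`);
  return keys;
}

// A fix is grounded if its target (or any pair in its range) occurs in the failure log.
function targetIsGrounded(target, failureLog) {
  return candidateKeys(target).some((key) => failureLog.includes(key));
}

// A diagnosis is grounded if it quotes at least one of T, C, or L exactly as reported.
// Scores are kept as strings so "0.74" is compared as written.
function diagnosisCitesScore(diagnosis, scores) {
  return ['T', 'C', 'L'].some((name) => diagnosis.includes(`${name}=${scores[name]}`));
}

// Combine both checks and list every target that could not be traced to the log.
function verifyGrounding(output, report) {
  const ungrounded = output.fixes
    .filter((fix) => !targetIsGrounded(fix.target, report.failureLog))
    .map((fix) => fix.target);
  return {
    ok: ungrounded.length === 0 && diagnosisCitesScore(output.diagnosis, report.scores),
    ungrounded,
  };
}
```

Treating a range as grounded when any one of its pairs appears in the log is a lenient choice; a stricter variant could require every listed pair to appear.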

The spec

{
  "output_kind": "json",
  "schema": {
    "required": ["diagnosis", "fixes", "expected_lift"],
    "properties": {
      "diagnosis": { "type": "string", "maxLength": 500 },
      "fixes": { "type": "array", "items": {
        "required": ["action", "target", "why"],
        "properties": {
          "action": { "enum": ["remove_pair", "relabel_pair", "relax_verifier", "strengthen_verifier", "add_pairs", "lower_k_floor"] },
          "target": { "type": "string" },
          "why": { "type": "string" }
        }
      }, "maxItems": 6 },
      "expected_lift": { "type": "number" }
    }
  },
  "verifier": {
    "target_must_appear_in_failure_log": true,
    "diagnosis_must_cite_T_C_or_L_value": true,
    "expected_lift_calibrated_brier": 0.12
  }
}
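To make the schema constraints concrete, here is a minimal hand-rolled check of an output object against the rules above (required fields, the `action` enum, `maxLength` 500, `maxItems` 6). The function name `schemaErrors` and the error wording are hypothetical; a real pipeline would more likely hand the schema to a JSON Schema validator.

```javascript
// Allowed actions, copied from the spec's "action" enum.
const ACTIONS = ['remove_pair', 'relabel_pair', 'relax_verifier',
                 'strengthen_verifier', 'add_pairs', 'lower_k_floor'];

// Returns a list of violations; an empty list means the output conforms.
function schemaErrors(output) {
  if (output === null || typeof output !== 'object') return ['output is not an object'];
  const errors = [];
  for (const key of ['diagnosis', 'fixes', 'expected_lift']) {
    if (!(key in output)) errors.push(`missing required field: ${key}`);
  }
  if ('diagnosis' in output) {
    if (typeof output.diagnosis !== 'string') errors.push('diagnosis must be a string');
    // Spread counts code points, which is how JSON Schema defines maxLength.
    else if ([...output.diagnosis].length > 500) errors.push('diagnosis exceeds 500 characters');
  }
  if ('expected_lift' in output && typeof output.expected_lift !== 'number') {
    errors.push('expected_lift must be a number');
  }
  if ('fixes' in output) {
    if (!Array.isArray(output.fixes)) {
      errors.push('fixes must be an array');
    } else {
      if (output.fixes.length > 6) errors.push('fixes has more than 6 items');
      output.fixes.forEach((fix, i) => {
        if (fix === null || typeof fix !== 'object') {
          errors.push(`fixes[${i}] is not an object`);
          return;
        }
        for (const key of ['action', 'target', 'why']) {
          if (!(key in fix)) errors.push(`fixes[${i}] missing required field: ${key}`);
        }
        if ('action' in fix && !ACTIONS.includes(fix.action)) {
          errors.push(`fixes[${i}].action is not an allowed value`);
        }
        for (const key of ['target', 'why']) {
          if (key in fix && typeof fix[key] !== 'string') errors.push(`fixes[${i}].${key} must be a string`);
        }
      });
    }
  }
  return errors;
}
```

Note that the schema only constrains shape. Whether a `target` actually exists in the failure log is a separate concern handled by the `verifier` block.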

Gold pair (1 of 800 shown)

input - failed K-score report
k_score=0.78 (floor=0.85)
  T=0.74 (k-sample pass rate)  C=0.91 (recipe coverage)  L=0.96 (latency)
failure_log:
  pair[112] verifier_failed: span_must_byte_match_input  diff=2chars
  pair[113] verifier_failed: span_must_byte_match_input  diff=4chars
  pair[114] verifier_failed: span_must_byte_match_input  diff=1char
  ... (24 of 30 failures are span byte-match with diff < 5 chars)
output
{
  "diagnosis": "T=0.74 is dragging the K-score below 0.85; 24 of 30 failures are span byte-match misses with diff under 5 chars (whitespace/punctuation). The verifier is correctly strict but the gold corpus has noisy span boundaries.",
  "fixes": [
    {"action": "relax_verifier", "target": "span_must_byte_match_input", "why": "switch to span_must_overlap_input(min_overlap=0.95) — keeps semantic constraint, tolerates whitespace drift"},
    {"action": "relabel_pair", "target": "pairs[112-135]", "why": "normalize gold spans to trim leading/trailing whitespace"}
  ],
  "expected_lift": 0.09
}
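The diagnosis in this pair comes from grouping failure-log rows by verifier hook and noticing that one hook dominates. A rough sketch of that step is shown below. The report grammar is not specified anywhere in this recipe, so the regular expressions are inferred from the single sample above, and `parseReport` and `failuresByHook` are hypothetical names.

```javascript
// Parse the text report into numbers and failure rows.
// Only rows that are written out are captured; the summarised
// "... (24 of 30 failures ...)" line is ignored.
function parseReport(text) {
  const num = (re) => {
    const m = re.exec(text);
    return m ? Number(m[1]) : null;
  };
  const rows = [...text.matchAll(/pair\[(\d+)\]\s+verifier_failed:\s+(\S+)/g)]
    .map((m) => ({ pair: Number(m[1]), hook: m[2] }));
  return {
    kScore: num(/k_score=([\d.]+)/),
    floor: num(/floor=([\d.]+)/),
    T: num(/\bT=([\d.]+)/),
    C: num(/\bC=([\d.]+)/),
    L: num(/\bL=([\d.]+)/),
    rows,
  };
}

// Count failures per verifier hook, most frequent first.
function failuresByHook(rows) {
  const counts = new Map();
  for (const { hook } of rows) counts.set(hook, (counts.get(hook) || 0) + 1);
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

With the sample report, the lowest of T, C, and L is T, and every listed row shares the `span_must_byte_match_input` hook, which is the pattern the gold output names.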

Compile

kolm compile "K-score failure explainer with grounded fix recommendations" \
  --base qwen2.5-coder-7b \
  --pairs ./failed-compile-pairs.jsonl \
  --verifier target-grounded,T-C-L-cited \
  --k-floor 0.82 \
  --output k-score-explainer.kolm

ok wrote k-score-explainer.kolm
   k_score=0.86  signature=hmac-sha256

K-score gate

K-score 0.86 on 240 held-out failed compiles · target-grounded 100% · expected-lift Brier 0.094 (target ≤ 0.12) · user-applied-fix-passed 71%

"User-applied-fix-passed 71%" means that when a real engineer ran the suggested fix, the next compile passed its K-floor 71% of the time. The other 29% needed a second iteration, and this recipe explained that second failure as well.

Run-time profile

M2 MacBook: 2.4s
RTX 5090: 560ms
Mac Studio: 1.4s
CPU x86 (server): 3.4s

Deploy

// Wired into the kolm compile failure path: run the explainer on the
// failed report, then print the diagnosis and the ranked fixes.
on_compile_failed = (report) => {
  const x = kolm.run('k-score-explainer.kolm', report);
  console.log('\nDIAGNOSIS:\n  ' + x.diagnosis);
  console.log('\nFIXES (expected lift +' + x.expected_lift.toFixed(2) + '):');
  // One line per fix: action and target, with the rationale indented below.
  for (const f of x.fixes) console.log('  - ' + f.action + ' ' + f.target + '\n    ' + f.why);
};