What this recipe does
A failed K-score is the kolm equivalent of a compile error in C. The diagnostic is verbose; the fix is usually one of about a dozen patterns ("three gold pairs in your set are mutually inconsistent," "your verifier is too strict for the schema you wrote," "your eval set is too small to hit the floor with statistical confidence"). This recipe knows those patterns and applies them.
The verifier rejects any fix that doesn't trace to the actual numbers in the report — so the recipe can't make up a "your eval set is too small" diagnosis when it isn't. Each suggested fix points to a concrete row in the failure log.
The spec
{
"output_kind": "json",
"schema": {
"required": ["diagnosis", "fixes", "expected_lift"],
"properties": {
"diagnosis": { "type": "string", "maxLength": 500 },
"fixes": { "type": "array", "items": {
"required": ["action", "target", "why"],
"properties": {
"action": { "enum": ["remove_pair", "relabel_pair", "relax_verifier", "strengthen_verifier", "add_pairs", "lower_k_floor"] },
"target": { "type": "string" },
"why": { "type": "string" }
}
}, "maxItems": 6 },
"expected_lift": { "type": "number" }
}
},
"verifier": {
"target_must_appear_in_failure_log": true,
"diagnosis_must_cite_T_C_or_L_value": true,
"expected_lift_calibrated_brier": 0.12
}
}
Gold pair (1 of 800 shown)
k_score=0.78 (floor=0.85) T=0.74 (k-sample pass rate) C=0.91 (recipe coverage) L=0.96 (latency) failure_log: pair[112] verifier_failed: span_must_byte_match_input diff=2chars pair[113] verifier_failed: span_must_byte_match_input diff=4chars pair[114] verifier_failed: span_must_byte_match_input diff=1char ... (24 of 30 failures are span byte-match with diff < 5 chars)
{
"diagnosis": "T=0.74 is dragging the K-score below 0.85; 24 of 30 failures are span byte-match misses with diff under 5 chars (whitespace/punctuation). The verifier is correctly strict but the gold corpus has noisy span boundaries.",
"fixes": [
{"action": "relax_verifier", "target": "span_must_byte_match_input", "why": "switch to span_must_overlap_input(min_overlap=0.95) — keeps semantic constraint, tolerates whitespace drift"},
{"action": "relabel_pair", "target": "pairs[112-135]", "why": "normalize gold spans to trim leading/trailing whitespace"}
],
"expected_lift": 0.09
}
Compile
kolm compile "K-score failure explainer with grounded fix recommendations" \ --base qwen2.5-coder-7b \ --pairs ./failed-compile-pairs.jsonl \ --verifier target-grounded,T-C-L-cited \ --k-floor 0.82 \ --output k-score-explainer.kolm ok wrote k-score-explainer.kolm k_score=0.86 signature=hmac-sha256
K-score gate
"User-applied-fix-passed 71%" means: when a real engineer ran the suggested fix, the next compile passed its K-floor 71% of the time. The other 29% needed a second iteration — but the second iteration was also explained by this recipe.
Run-time profile
Deploy
# wired into kolm compile failure path: on_compile_failed = (report) => { const x = kolm.run('k-score-explainer.kolm', report); console.log('\nDIAGNOSIS:\n ' + x.diagnosis); console.log('\nFIXES (expected lift +' + x.expected_lift.toFixed(2) + '):'); for (const f of x.fixes) console.log(' - ' + f.action + ' ' + f.target + '\n ' + f.why); };