Research · Methodology · 2026-05-14 · 10 min read

Why K-score correlates with deployment quality.

A K-score is one number with five inputs. It is a calibration, not a universal truth, and the 0.85 gate is the smallest assertion we are willing to make at compile time. What the score measures, what it does not, and where the open problems are.

By kolmTag methodology · evaluation · calibration

What K-score is.

K-score v1 (the version that ships in every .kolm today) is a weighted aggregate of five normalized inputs:

K = 0.40 * A
  + 0.15 * S
  + 0.15 * L
  + 0.15 * C
  + 0.15 * V

Each input is on [0, 1]. The composite is on [0, 1]. The variable assignment is short:

A = accuracy on the held-out eval set
S = artifact size
L = latency
C = cost
V = coverage (fraction of held-out cases the artifact exercised)

The composite is rounded to four decimal places and embedded in the manifest, the receipt body, and the audit log. The K-score is one of the seven fields the receipt-body HMAC seals, so a third-party verifier reading a .kolm can confirm the number was produced under the receipt secret and not edited afterward.
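
For readers who want the arithmetic in executable form, here is a minimal sketch of the v1 composite exactly as defined above. The weights, the [0, 1] domain, and the four-decimal rounding come from this section; the function name, the gate check, and the example axis values are ours, not kolm's API.

# Minimal sketch of the v1 composite as defined above. Weights and the
# four-decimal rounding are from this section; the names and example
# axis values are illustrative, not kolm's API.
K1_WEIGHTS = {"A": 0.40, "S": 0.15, "L": 0.15, "C": 0.15, "V": 0.15}
GATE = 0.85  # default gate; adjustable per-compile with --gate

def k_score_v1(axes: dict) -> float:
    """Weighted aggregate of the five normalized inputs, rounded to 4 dp."""
    assert all(0.0 <= axes[k] <= 1.0 for k in K1_WEIGHTS), "inputs are on [0, 1]"
    return round(sum(K1_WEIGHTS[k] * axes[k] for k in K1_WEIGHTS), 4)

score = k_score_v1({"A": 0.92, "S": 0.80, "L": 0.85, "C": 1.00, "V": 0.90})
print(score, "ship" if score >= GATE else "fail")   # 0.9005 ship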

Why these weights.

The weights are a stance, not a derivation. They encode three opinions we hold about a compile-time gate.

Accuracy dominates. 0.40 of the composite is A alone. The other four axes share 0.60. We picked this because the production failures we have actually shipped have always been accuracy regressions; never once has a deploy gone sideways because the artifact was too large or too slow. If you fix the accuracy, the rest is rounding.

The four non-accuracy axes are balanced. Size, latency, cost, and coverage each get 0.15. We do not weight them differently because the cases where they bite are situational. Size matters most on edge devices; latency matters most in the request loop; cost matters most in unit economics. A single weighting that biases for one task class biases against the others.

The composite is a convex combination. A perfect score on one axis cannot compensate for a zero on another. A 1.0-accuracy artifact with zero coverage and middling (0.5) scores on size, latency, and cost lands at 0.625 (0.40 + 3 * 0.15 * 0.5 + 0), well below the 0.85 gate. This is the whole point: a degenerate input class is supposed to fail.

K-score v2 is on the v0.2 roadmap and adds four optional axes: R (held-out accuracy vs declared accuracy), F (fairness, lowest sub-group accuracy), E (energy, joules per call), and Z (drift vs registry baseline). When a v2 axis is supplied, the v1 weights redistribute proportionally so the composite stays comparable across versions. v1 artifacts continue to verify under v1; the manifest carries spec: 'k-score-1' or spec: 'k-score-2' so a verifier can dispatch.

# v2 (when all four optional axes are supplied)
K2 = 0.30 * A
   + 0.10 * S + 0.10 * L + 0.10 * C + 0.10 * V
   + 0.10 * R + 0.10 * F
   + 0.05 * E + 0.05 * Z
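
Because the manifest carries the spec string, a verifier can hold both weight tables side by side and dispatch on it. A sketch under the same caveats as above: the two tables below are the published v1 and full-v2 weights, the function is ours, and the proportional redistribution for a partially supplied set of v2 axes is deliberately left out because it is not specified here.

# Verifier-side dispatch on the manifest's spec field. Only the two fully
# specified weight tables from this post are handled; the proportional
# redistribution for partially supplied v2 axes is omitted. Names are ours.
WEIGHTS = {
    "k-score-1": {"A": 0.40, "S": 0.15, "L": 0.15, "C": 0.15, "V": 0.15},
    "k-score-2": {"A": 0.30, "S": 0.10, "L": 0.10, "C": 0.10, "V": 0.10,
                  "R": 0.10, "F": 0.10, "E": 0.05, "Z": 0.05},
}

def composite(spec: str, axes: dict) -> float:
    weights = WEIGHTS[spec]            # unknown spec -> KeyError -> verification fails
    missing = set(weights) - set(axes)
    if missing:
        raise ValueError(f"{spec} expects axes {sorted(missing)}")
    return round(sum(w * axes[k] for k, w in weights.items()), 4)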

The 0.85 gate.

Every compile that does not produce a composite of at least 0.85 fails. No artifact is emitted. The receipt is not signed. The CID is not minted. The compile process exits non-zero, the trainer writes a diagnostic JSON next to where the artifact would have been, and kolm compile prints the failing axes.

$ kolm compile -t spam_classifier -d ./examples.jsonl
[compile]     synthesizing recipes (n=42)
[compile]     evaluating on 24 held-out cases
[K-score]     A=0.875 S=0.412 L=0.834 C=1.000 V=0.667
[K-score]     composite=0.787 (gate=0.85)
[fail]        coverage below threshold; 8 of 24 cases were not exercised
[fail]        no artifact produced
[diag]        ./spam_classifier.failed.json   inspect with: kolm diagnose

The gate is adjustable per-compile with --gate. A higher gate produces fewer artifacts but stronger guarantees. A lower gate produces more artifacts that may not deserve to ship. The default we chose (0.85) is the median of the empirical accuracy distribution across the first 1,000 compiles we ran during development. We did not derive 0.85 from theory; we observed it from the actual compile histogram and picked the knee point.
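
Because a rejected compile is just a non-zero exit plus a diagnostic JSON next to the would-be artifact, wiring the gate into CI needs nothing kolm-specific. A minimal sketch, assuming only the kolm compile invocation and *.failed.json diagnostic shown in the transcript above; the exact --gate syntax and the diagnostic schema are assumptions here, not documented behavior.

# Minimal CI-style wrapper: run the compile, gate the job on the exit code,
# and surface the diagnostic JSON on failure. Assumes the kolm compile
# invocation and *.failed.json diagnostic shown above; the --gate syntax
# and diagnostic schema are assumptions.
import json
import subprocess
import sys
from pathlib import Path

TASK = "spam_classifier"   # illustrative task name from the transcript above
result = subprocess.run(
    ["kolm", "compile", "-t", TASK, "-d", "./examples.jsonl", "--gate", "0.85"],
    capture_output=True, text=True,
)
print(result.stdout)

if result.returncode != 0:
    diag = Path(f"./{TASK}.failed.json")
    if diag.exists():
        # Schema is not documented here; dump it raw into the build log.
        print(json.dumps(json.loads(diag.read_text()), indent=2))
    sys.exit(result.returncode)   # fail the CI job; no artifact was produced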

A failed compile is a successful gate. The artifact you never shipped is the one that never embarrassed you in front of a customer.

When the gate fails.

Three failure modes account for most of the rejections we see during development.

Mode one: accuracy collapse on a narrow eval set. A small eval set (under 50 cases) is high variance; one mislabeled case can move A by 2 percentage points or more. The fix is not to lower the gate but to grow the eval set. The synthesizer can generate additional held-out cases on demand; kolm evals expand --target 200 grows the held-out set and stabilizes the accuracy estimate.
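
To put rough numbers on the variance claim, treat each held-out case as an independent Bernoulli trial. This is a simplification (real eval cases are neither independent nor equally informative), but it shows why a sub-50-case set is noisy and why expanding toward 200 cases stabilizes A.

# Back-of-envelope noise in the accuracy estimate, treating each held-out
# case as an independent Bernoulli trial. A simplification, but it shows
# the effect of eval-set size on A.
import math

p = 0.875   # observed accuracy, as in the transcript above
for n in (24, 50, 200):
    stderr = math.sqrt(p * (1 - p) / n)
    flip = 1 / n        # how far a single relabeled case moves A
    print(f"n={n:>3}  one flipped case moves A by {flip:.3f}, std. error ~ {stderr:.3f}")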

Mode two: coverage shortfall. The eval set declared 24 case classes but the compiled artifact only fired on 16 of them. This is almost always a recipe-pack regression: a pattern was synthesized that subsumed several others and the missing classes never produced an output. The fix is to inspect the recipe pack, find the over-greedy pattern, and split it.

Mode three: size or latency cliff. A LoRA rank that was set too high produced a 12 GB artifact instead of a 4 GB one, and S collapsed to 0.15. A model pointer that picked a larger base than needed produced 200 ms latency instead of 30 ms, and L collapsed to 0.33. These are config errors; the fix is to re-run the compile with a smaller adapter rank or a smaller base.

Open problem: cross-task calibration.

The honest critique of the K-score is that it is a single number across tasks that have very different distributional shapes. A spam classifier with A = 0.94 and a refund-flagger with A = 0.94 are not the same kind of evidence. The classifier is a balanced two-class problem with a baseline of 0.50; the flagger has a 3% positive class and a baseline of 0.97. Naively, the flagger's 0.94 is worse than the baseline.

K-score v1 does not correct for this. The accuracy axis is bare top-1 on declared positives. Two artifacts with the same A score on different task distributions are not interchangeable evidence. The K-score gate catches both at the same threshold, which is correct as a budget constraint and wrong as a quality claim.

This is the v0.2 calibration problem. The proposal we are sketching is task-class-aware normalization: every task declared at compile time carries a task_class field (e.g., binary_classification, extraction, generation), and the A axis is normalized against the baseline expected for that class (F1 over majority-class baseline for skewed classification; exact-match over no-op for extraction; reference-similarity over empty-string for generation). The composite stays on [0, 1] but the accuracy contribution becomes comparable across task classes.
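
The normalization can be read as a skill score: the raw metric is rescaled against whatever the task class makes trivially achievable, so a result at or below the baseline contributes nothing. A sketch of that reading; the clamp-to-zero choice and the example baselines are ours, and the v0.2 amendment may define the per-class metrics differently.

# Sketch of task-class-aware normalization for the A axis, read as a skill
# score: rescale the raw metric against the class baseline, clamp at zero.
# The clamp and the example baselines are illustrative choices, not v0.2.
def normalize_accuracy(raw: float, baseline: float) -> float:
    """Map [baseline, 1.0] onto [0, 1]; at or below baseline scores 0."""
    if baseline >= 1.0:
        return 0.0
    return max(0.0, (raw - baseline) / (1.0 - baseline))

# Balanced spam classifier: two classes, majority baseline 0.50.
print(normalize_accuracy(0.94, baseline=0.50))    # ~0.88 -- real skill

# Refund flagger: 3% positive class, always-negative baseline 0.97.
print(normalize_accuracy(0.94, baseline=0.97))    # 0.0 -- below its own baseline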

The cost of this proposal is that the K-score becomes a function of the task_class field, and a deployer can game the gate by mislabeling the class. The receipt chain seals the task_class along with everything else, so the audit trail catches it, but the cost is one more field a third party needs to inspect to know what the K-score actually attests to. We will publish the v0.2 proposal as a separate RS-1 amendment when the calibration data lands.
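
For completeness, the shape of the seal: an HMAC over a canonical serialization of the receipt-body fields, keyed by the receipt secret. The actual field set, canonicalization, and digest used by .kolm receipts are not specified in this post, so every name below is hypothetical.

# Illustrative shape of the receipt-body seal: HMAC-SHA256 over a canonical
# JSON serialization, keyed by the receipt secret. The real field set,
# canonicalization, and digest are not specified in this post; the field
# names below are hypothetical.
import hashlib
import hmac
import json

def seal(receipt_body: dict, receipt_secret: bytes) -> str:
    canonical = json.dumps(receipt_body, sort_keys=True, separators=(",", ":"))
    return hmac.new(receipt_secret, canonical.encode(), hashlib.sha256).hexdigest()

body = {
    "k_score": 0.9005,                          # hypothetical field names
    "task_class": "binary_classification",
    # ... the other sealed fields
}
print(seal(body, receipt_secret=b"example-secret"))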

Honest limitations.

Three things the K-score does not do.

It is a compile-time signal, not a production-quality guarantee. The K-score is computed against an eval set the synthesizer generated and the user reviewed. It is not a population estimate. An artifact that scores 0.92 on its embedded eval set can still produce 0.65 accuracy on a real production traffic mix. Deployments need their own eval loops, fed by sampled production traffic, with their own quality dashboards. The K-score answers "did the compile produce a thing worth shipping"; it does not answer "is the thing still good two weeks later".

It does not measure safety. The accuracy axis sees expected outputs; it does not see refusals, jailbreaks, or out-of-distribution behavior. We do not currently embed a safety eval inside the standard K-score envelope. Safety-relevant tasks should layer a dedicated safety classifier on top of the compiled artifact and gate deployment on that classifier separately.

It does not measure user satisfaction. The composite weights what we could measure cheaply at compile time. It does not measure how users felt about the response. For that, ship the artifact, instrument production, and feed the result back to the next compile through kolm capture. The K-score is the smallest assertion the compiler can make about the bytes it produced; the rest is the deployer's job.

The fact that the K-score is one number with one threshold is its weakness as a research artifact and its strength as an operational tool. One number, one gate, one binary decision: ship or fail. Five inputs go in; one decision comes out; the receipt chain records every step. Across the 10-app smoke pass we run on every release, K-scores cluster between 0.944 and 0.991 for the artifacts that ship. The number is not magic. It is a calibration we chose, an axis we publish, and a gate we are willing to be wrong in front of.