k-score methodology
K-score is a five-component weighted score, computed at compile time and replayable at any time from the receipt log: a single number that compresses what an engineering team would otherwise check by hand across five different dashboards. This page is the formal definition.
K = 0.40·A + 0.15·S + 0.15·L + 0.15·C + 0.15·V
A accuracy · S stability · L latency · C compliance · V verifier
Each component is normalized to the range [0, 1]. Weights sum to 1.00, so K is also bounded [0, 1]. The K-score is computed once at compile time over the artifact's frozen eval set; the components are also recomputed at runtime per inference and emitted on the receipt, so drift over time is observable.
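As a reference point, here is the weighted sum in code form - a minimal sketch assuming pre-normalized components, not kolm's actual implementation (the function and constant names are ours):

```python
# Minimal sketch of the K-score weighted sum (illustrative only).
# Components are assumed pre-normalized to [0, 1] as defined below.
WEIGHTS = {"A": 0.40, "S": 0.15, "L": 0.15, "C": 0.15, "V": 0.15}

def k_score(components: dict[str, float]) -> float:
    """Weighted sum of the five components; bounded [0, 1] because the weights sum to 1.00."""
    assert all(0.0 <= v <= 1.0 for v in components.values()), "components must be normalized"
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

# k_score({"A": 0.94, "S": 0.97, "L": 1.0, "C": 1.0, "V": 0.91})  # -> 0.958
```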
A is the artifact's task-specific accuracy on its declared eval set. The eval set is frozen at compile time and its sha256 is recorded in the manifest, so the same A is reproducible months later.
The metric depends on the task. For benchmark artifacts (like SWE-bench Mini), A is the harness's binary pass/fail per problem, averaged over the suite. See SWE-bench Verified Mini results →
S measures variance across N=10 reruns of the same eval set with the same seed but different sampler nonces. The motivation: a 0.92 average that comes from runs scoring [0.95, 0.92, 0.89, 0.91, ...] is one artifact; a 0.92 average that comes from [1.0, 0.99, 0.78, 0.91, ...] is a different artifact under regulated load.
Normalization: S = 1 - clamp(stdev_over_reruns / 0.50, 0, 1). Zero variance gives 1.0; a stdev of 0.50 (effectively coin-flip behavior across reruns) collapses to 0. The 0.50 anchor is chosen so a regulated task with stdev > 0.10 starts to take a visible hit.
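In code, the normalization is a one-liner - a sketch below; whether kolm uses population or sample stdev over the reruns is not specified here, so the population flavor is an assumption:

```python
import statistics

def stability(rerun_scores: list[float], anchor: float = 0.50) -> float:
    """S = 1 - clamp(stdev / anchor, 0, 1): zero variance -> 1.0, stdev >= anchor -> 0.0."""
    stdev = statistics.pstdev(rerun_scores)  # assumption: population stdev over the N=10 reruns
    return 1.0 - min(max(stdev / anchor, 0.0), 1.0)
```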
Recorded as k_stability_10x on the receipt. For artifacts deployed in settings where non-determinism is costly (e.g., financial routing, medical), we recommend the recipe set the K-score gate to require S >= 0.95 in addition to the K threshold.
L compares observed p95 inference latency against the budget declared in the artifact's recipe. The budget is what the compile job promised the operator at gate time; L measures how often the runtime keeps that promise.
Formula:

    L = 1.0                                       if p95 <= budget
    L = exp(-ln(2) · (p95 - budget) / budget)     if p95 > budget
Geometric intuition: at budget, L = 1.0. At 2x budget, L = 0.5. At 3x, 0.25; at 4x, 0.125. Past 10x budget L falls below 0.002 - effectively zero; the artifact missed its promise hard enough that no other component should save it.
p95 is measured over the eval set at compile time and over a rolling 1000-inference window at runtime. The runtime number is what shows up on /dashboard and is what underwriters see in the receipt log.
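The decay in code form - a sketch of the formula above (function name ours):

```python
import math

def latency_score(p95: float, budget: float) -> float:
    """L = 1 within budget; halves for each additional budget-width of excess p95."""
    if p95 <= budget:
        return 1.0
    return math.exp(-math.log(2) * (p95 - budget) / budget)

# latency_score(200, 200)   -> 1.0
# latency_score(400, 200)   -> 0.5     (2x budget)
# latency_score(800, 200)   -> 0.125   (4x budget)
# latency_score(2000, 200)  -> ~0.002  (10x budget: effectively zero)
```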
C is the pass rate over the artifact's compliance pack. A compliance pack is a set of checks declared in the manifest: HIPAA Safe Harbor for PHI redactors, SR 11-7 / OCC model risk for finance routers, NIST AI RMF GenAI Profile for general regulated use, EU AI Act Annex III for high-risk applications, custom checks for narrowly scoped pipelines.
Each check is binary; C is the unweighted average. All-pass = 1.0. One check fails = (n-1)/n. Two checks fail = (n-2)/n. The compliance pack itself is included in the .kolm bundle so the same check set is replayable at audit time.
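In code, C is just the pass rate - a sketch (the list-of-booleans representation is ours, not the manifest's check schema):

```python
def compliance(check_results: list[bool]) -> float:
    """Unweighted average over the pack's binary checks: passes / n."""
    return sum(check_results) / len(check_results)

# compliance([True] * 12)            -> 1.0     (all-pass)
# compliance([True] * 11 + [False])  -> 0.9167  (one failure: (n-1)/n with n = 12)
```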
For examples of what an individual check looks like, and for the full pack catalog, see /compliance-packs.
V is the agreement rate of the runtime verifier with the agent's output. The verifier is a separate model (usually smaller, constrained-decoded) whose job is to spot-check that the primary output is consistent with the input.
For classification, V is the rate at which the verifier model agrees the chosen label is plausible. For extraction, V is the rate at which the verifier model agrees each extracted field is grounded in the input. For generation, V is the rate at which the verifier model judges the output non-hallucinated against the input.
Sampling rate at runtime is configurable. Production deploys usually run V at 5-10% (every Nth inference); benchmark runs are typically at 100% for transparency. The recipe declares the target sampling rate, the runtime emits the actual rate on the receipt, and a discrepancy > 1% between the two scales V down proportionally, toward zero.
V also incorporates a supply-chain check: every compile runs Trivy and Grype on the artifact's dependency tree and binds the result into manifest.dependencies.cve_audit. If cve_audit.ok is false (Critical CVE present, or High with a patched version available under the default policy), V is forced to 0 regardless of verifier agreement - capping K at 0.85 even with every other component perfect, which in practice fails the K >= 0.85 gate on its own. See cve audit in the k-score → for the manifest shape and the policy options.
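A sketch of how the two overrides compose on top of the raw agreement rate. The proportional scaling is our reading of the discrepancy rule, a >1-percentage-point threshold is an assumption, and every name except cve_audit.ok is ours:

```python
def verifier_score(agreement: float,
                   declared_rate: float,
                   actual_rate: float,
                   cve_audit_ok: bool) -> float:
    """V = verifier agreement rate, scaled down on a sampling shortfall, zeroed on CVE failure."""
    if not cve_audit_ok:
        return 0.0  # supply-chain override: caps K at 0.85 even if all else is perfect
    if declared_rate - actual_rate > 0.01:  # assumption: discrepancy in absolute points
        return agreement * (actual_rate / declared_rate)  # our reading of "proportionally"
    return agreement
```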
Accuracy dominates at 0.40 because, in practice, an artifact with strong stability and latency but weak accuracy is useless - the operator pays for nothing. We considered 0.50 for A but pulled back; over-weighting A would let teams game the score by burning latency budget to crank one more accuracy point. 0.40 keeps A primary without making it everything.
S, L, C, V split the remaining 0.60 evenly at 0.15 each because we have not seen evidence that one of those four matters strictly more than the others. S matters most in regulated settings; L matters most in latency-sensitive settings; C matters most in audited settings; V matters most in safety-sensitive settings. Equal weights let the recipe author tune via the K-score gate threshold instead of the weights themselves.
Weights are versioned (currently K-score v1.0, recorded on receipts as k_score_version: "1.0"). If we change weights, the old receipts continue to validate against the old version - we never silently re-score history.
the gate
0.85 is the public registry minimum. 0.92 is the kolm-team Featured tier. 0.95+ is the bar we recommend for any artifact behind a regulated decision (medical, financial, legal, defense).
0.85 is the threshold below which, in our experience, an artifact is reliably failing some component badly enough that even strong values elsewhere can't save the operator from a bad week. Concretely: at K = 0.85 you can carry one weak component (say C = 0.4 because the compliance pack is half-failing) or two moderately weak components (say S = 0.65 and L = 0.65), but you can't carry three components below 2/3 (≈0.67), even with everything else perfect. The arithmetic forces a floor across the dimensions an operator cares about.
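Checking that arithmetic with the k_score sketch from above:

```python
# one weak component, everything else perfect: passes comfortably
k_score({"A": 1.0, "S": 1.0, "L": 1.0, "C": 0.4, "V": 1.0})    # -> 0.91
# two moderately weak components: still passes
k_score({"A": 1.0, "S": 0.65, "L": 0.65, "C": 1.0, "V": 1.0})  # -> 0.895
# three components at 2/3, the rest perfect: pinned to the floor exactly
k_score({"A": 1.0, "S": 2/3, "L": 2/3, "C": 2/3, "V": 1.0})    # -> 0.85
```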
The CLI rejects compile jobs below K = 0.85 by default; that floor can be lowered with --k-min, but the artifact is then flagged in its manifest, the registry lists it under "Experimental" rather than the public grid, and the dashboard surfaces a warning every time it runs.
Most teams tighten the threshold past the default. Examples:
| scenario | recommended K-min | flag |
|---|---|---|
| experimental / research | 0.70 | --k-min=0.70 |
| default (public registry) | 0.85 | --k-min=0.85 |
| regulated production | 0.90 | --k-min=0.90 |
| safety-critical | 0.95 | --k-min=0.95 |
    kolm compile recipes/phi-redactor.yaml \
      --base llama-3.1-8b \
      --k-min=0.90 \
      --gate-stability=0.95 \
      --gate-latency-budget=200ms
Per-component gates (--gate-stability, --gate-latency-budget, etc.) are additive on top of the K threshold - they let you require, e.g., K >= 0.90 AND S >= 0.95 even when a stronger A could otherwise carry the K average. This is what we recommend for any artifact whose failure mode is "rare but expensive."
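A sketch of how the combined gate composes - a conjunction, so per-component gates can only tighten the K threshold, never relax it (names ours; e.g. --gate-stability=0.95 would map to {"S": 0.95}):

```python
def passes_gate(components: dict[str, float],
                k_min: float = 0.85,
                component_gates: dict[str, float] | None = None) -> bool:
    """Pass iff K clears k_min AND every declared per-component floor holds."""
    if k_score(components) < k_min:
        return False
    return all(components[name] >= floor
               for name, floor in (component_gates or {}).items())

# A strong A carrying the average is not enough once a component gate is set:
# passes_gate({"A": 0.99, "S": 0.90, "L": 1.0, "C": 1.0, "V": 0.95},
#             k_min=0.90, component_gates={"S": 0.95})  # -> False: K ≈ 0.97 but S < 0.95
```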
The K-score gate, the per-component gates, and the K-score version are all written into the manifest and signed at compile time. kolm verify replays the gate at audit time and confirms the artifact passed the gate it claimed.