Research · Open problem · 2026-05-14 · 11 min read

Eval set drift: when your gold set rots.

An artifact that scored K=0.91 against its gold eval set in January may be worth 0.84 on June traffic without the model changing. The world moved. The set did not. What signals that drift, why kolm refuses to auto-rotate, and what the v0.2 detector looks like.

By kolm · Tags: evaluation · drift · open problem

The problem statement.

The shipped .kolm artifact embeds an eval pack and a K-score computed at compile time. The eval pack is a fixed set of declared inputs with declared expected outputs. The K-score is the composite of accuracy, size, latency, cost, and coverage on that pack. None of these numbers move after the artifact is signed; the receipt seals them.

The world the inputs were drawn from is not so polite. Product catalogs churn; case law updates; consumer slang rotates; the medical coding standard publishes new ICD codes; tax law changes; news cycles produce a new vocabulary every quarter. A static eval set captures a snapshot of a moving distribution. As the snapshot ages, two failure modes compound.

Failure mode one: covered cases stop being representative. The artifact still scores 0.91 on the January eval pack in June because the pack has not changed. But the June production traffic looks substantially different from the January pack. The 0.91 number is honest for what it measures and misleading as an estimate of current production quality.

Failure mode two: novel cases enter the traffic and the eval pack never sees them. An artifact that handled all of the January catalog perfectly cannot answer a question about the June catalog because the eval pack has no June catalog cases. The artifact may still ship the receipt with its January K-score; the receipt is correct; the artifact's behaviour on the new entries is simply unmeasured.

Both modes are reasonable to expect. Neither is a bug in the artifact, the receipt, or the gate. They are bugs in the assumption that an eval set is a stable proxy for production. The honest research question is: when has the gap between eval and production grown enough that the K-score on the embedded pack is no longer a useful signal?

Where it hits hardest.

Drift is not uniform across application domains. Five domains we have run compiles for show the pattern most clearly.

| Domain | Driver | Typical half-life of an eval pack |
| --- | --- | --- |
| E-commerce search | Catalog churn, SKU rotation, brand entry/exit | 3 to 6 months |
| Legal research | Case law publication, statute amendment | 6 to 12 months |
| Regulatory text | Rule publication, comment-period output | 6 to 18 months |
| News summarization | Vocabulary churn, named-entity emergence | 4 to 8 weeks |
| Medical coding | ICD revisions, payer policy updates | 12 months (annual cycle) |

The half-life column is not a measurement; it is an observation from the compiles we have shipped and the regressions we have seen. The pattern that holds across all five domains is that the rate of drift tracks the rate of vocabulary entry. When the rate of new named entities or new tokens in production traffic exceeds the rate captured in the eval pack, the pack is aging.

Domains not on this list (basic spam, sentiment of stable product reviews, classification of well-formed JSON, structured extraction from a frozen schema) drift more slowly. An eval pack on a stable distribution can be useful for years. The question is which distribution the deployer is actually drawing from.

Detection signals.

The literature on covariate shift and concept drift is rich. The signals we find most useful in a kolm deployment are a small subset of that literature, chosen because they can be computed from the captures stream without extra labeled data.

Signal one: production traffic distribution shift. Take a fixed-size feature vector over recent production captures (TF-IDF over a sliding window, or sentence embeddings if you have a cheap embedder available). Compare to the same feature vector over the embedded eval pack. A Kolmogorov-Smirnov test or maximum mean discrepancy estimate gives a single number per axis. If the number crosses a threshold, the captures distribution and the eval distribution have measurably diverged.
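A minimal sketch of signal one in Python, assuming both the recent captures and the eval-pack inputs are available as plain text. It uses scikit-learn TF-IDF features and a simple MMD estimate under an RBF kernel; the function names and the 0.20 threshold (echoing the CLI output further down) are illustrative choices, not part of kolm.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import rbf_kernel

def mmd_rbf(X, Y, gamma: float = 1.0) -> float:
    """Squared maximum mean discrepancy between two samples under an RBF kernel."""
    k_xx = rbf_kernel(X, X, gamma=gamma)
    k_yy = rbf_kernel(Y, Y, gamma=gamma)
    k_xy = rbf_kernel(X, Y, gamma=gamma)
    return float(k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean())

def distribution_shift(recent_captures: list[str], pack_inputs: list[str]) -> float:
    # Fit the vectorizer on both sets so the two samples share one feature space.
    vec = TfidfVectorizer(max_features=2048)
    vec.fit(recent_captures + pack_inputs)
    X = vec.transform(recent_captures).toarray()
    Y = vec.transform(pack_inputs).toarray()
    return mmd_rbf(X, Y)

# shift = distribution_shift(last_7d_captures, pack_inputs)
# alert = shift > 0.20   # operator-chosen threshold, domain specific
```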

Signal two: novel-entity rate. Count named entities or vocabulary tokens in the production captures that are absent from the eval pack. Normalize by total captures. A rate above some operator-chosen ceiling (we have used 8% of distinct entities per week, but the right number is domain specific) indicates the world is producing inputs the pack cannot reach.
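A sketch of signal two under the same assumptions, normalizing by distinct entities to match the 8% figure above. The capitalized-token heuristic is a crude stand-in for a real entity extractor; extract_entities and the ceiling are placeholders the operator would tune per domain.

```python
import re

def extract_entities(text: str) -> set[str]:
    # Placeholder heuristic: capitalized tokens of length >= 3 stand in for named entities.
    return set(re.findall(r"\b[A-Z][A-Za-z0-9-]{2,}\b", text))

def novel_entity_rate(captures: list[str], pack_inputs: list[str]) -> float:
    known = set().union(*(extract_entities(t) for t in pack_inputs))
    seen = set().union(*(extract_entities(t) for t in captures))
    if not seen:
        return 0.0
    # Fraction of distinct production entities the eval pack has never seen.
    return len(seen - known) / len(seen)

# rate = novel_entity_rate(last_week_captures, pack_inputs)
# alert = rate > 0.08   # the 8% ceiling from the text; the right number is domain specific
```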

Signal three: item-IDs leaking into queries. A specific pattern in e-commerce, legal, and ticket-routing workloads: production queries start to reference SKU codes, case numbers, or ticket IDs that did not exist when the pack was assembled. A simple regex matched against the captures stream catches this without semantic analysis.
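Signal three is the cheapest of the four to compute. The patterns below are hypothetical identifier formats (a SKU prefix, a docket-style case number, a ticket prefix); a real deployment would supply its own patterns plus the set of IDs that existed when the pack was assembled.

```python
import re

ID_PATTERNS = [
    re.compile(r"\bSKU-\d{4,}\b"),        # hypothetical SKU format
    re.compile(r"\b\d{2}-cv-\d{4,}\b"),   # hypothetical docket-number format
    re.compile(r"\bTKT-\d{5,}\b"),        # hypothetical ticket-ID format
]

def id_leakage_rate(captures: list[str], pack_era_ids: set[str]) -> float:
    """Fraction of captures referencing an ID that did not exist when the pack was assembled."""
    if not captures:
        return 0.0
    leaked = sum(
        1 for text in captures
        if {m.group(0) for p in ID_PATTERNS for m in p.finditer(text)} - pack_era_ids
    )
    return leaked / len(captures)
```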

Signal four: shadow accuracy gap. If the deployer is running a small fraction of production traffic through a "shadow" frontier-model call for spot checks, the gap between the artifact's answer and the shadow answer on the same input is a direct estimate of accuracy on live traffic. The gap need not be ground truth; it is a noisy estimate, but it moves in the right direction when the artifact's coverage falls.
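A sketch of signal four, assuming the operator already logs the artifact's answer and the shadow model's answer on the same sampled inputs. The exact-match comparison is a stand-in; free-text outputs would want a normalizer or a similarity threshold.

```python
def shadow_gap(pairs: list[tuple[str, str]]) -> float:
    """pairs: (artifact_answer, shadow_answer) on the same sampled production inputs."""
    if not pairs:
        return 0.0
    def agree(a: str, b: str) -> bool:
        # Placeholder comparison: exact match after trivial normalization.
        return a.strip().lower() == b.strip().lower()
    disagreements = sum(1 for a, b in pairs if not agree(a, b))
    # Noisy, label-free estimate of how far the artifact sits from the shadow model on live traffic.
    return disagreements / len(pairs)
```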

None of these signals require ground-truth labels on production traffic. All four can be computed from the captures stream and the embedded eval pack alone. The four together are usually enough to decide whether a re-eval is warranted before the K-score gate starts failing for the wrong reasons.

Why we will not auto-rotate.

The seductive next step is to use the captures stream to automatically rotate the eval pack. The captures contain real production inputs; promote a sample of them into the pack; the pack stays fresh; the K-score stays meaningful. This is a bad idea, and we have explicitly declined to ship it.

Reason one: an auto-rotated pack is no longer auditable. The receipt that signs a K-score should be reproducible from the artifact and the pack. If the pack mutates between the moment the receipt was signed and the moment a third party tries to verify, the verification fails. Every auto-rotation invalidates every prior receipt that referenced the pack.

Reason two: ground truth is not free. Promoting a captured input into the eval pack requires a declared expected output. Production captures do not carry one. An auto-rotation that promotes captures with model outputs as ground truth produces a pack that measures self-agreement, not correctness. The K-score against such a pack monotonically rises and means nothing.

Reason three: distributional capture risks PII bleed. The captures stream is operationally sensitive (see capture-loop honesty). A pack assembled from raw captures inherits the same sensitivity but is now embedded in every signed artifact and shipped to every consumer of the registry. The blast radius of an inadvertent PII leak into a pack is enormous.

The honest stance is: the pack is human-curated. A drift signal triggers an alert. A human reviews the alert. The human decides whether to expand the pack, hand-label new cases, and re-run the compile against the larger pack. The receipt for the new artifact references the new pack hash; the prior receipt remains valid against the prior pack.

Auto-rotating an eval pack is one of those features whose hidden cost is "now every receipt depends on a moving target". We prefer the cost of asking a human.

Sketch of the v0.2 detector.

The detector on the v0.2 roadmap is a separate process, not a change to the compile loop. It runs against the captures stream on the cadence the operator chooses (daily, weekly, per-deployment). It produces one of four states.

The detector output is a structured artifact, not a number. It records the signals computed, the thresholds used, the sample of captures inspected, and the state. It is signed against the operator's receipt secret so the audit trail can recover when the state changed and why.
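A hypothetical shape for that artifact, sketched as a Python dict and HMAC-signed with the operator's receipt secret. Every field name here is an assumption for illustration, not the kolm schema.

```python
import hashlib, hmac, json, time

def detector_output(signals: dict, thresholds: dict, sample_ids: list[str],
                    state: str, pack_hash: str, secret: bytes) -> dict:
    payload = {
        "task": "ticket-router",          # illustrative task name, matching the CLI output below
        "window": "7d",
        "eval_pack_hash": pack_hash,
        "signals": signals,               # e.g. {"mmd": 0.31, "novel_entity_rate": 0.042, ...}
        "thresholds": thresholds,
        "captures_sample": sample_ids,    # which captures were inspected
        "state": state,                   # e.g. "amber"
        "generated_at": int(time.time()),
    }
    body = json.dumps(payload, sort_keys=True).encode()
    # Sign the canonical JSON body so the audit trail can recover when the state changed and why.
    payload["signature"] = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return payload
```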

$ kolm evals drift --task ticket-router --window 7d
[drift]     captures inspected: 2,418
[drift]     eval pack: ticket-router@v3 (hash sha256:9d3e…)
[drift]     distribution-shift (MMD): 0.31   # threshold 0.20
[drift]     novel-entity rate:        4.2%   # threshold 8.0%
[drift]     item-id leakage:          0.6%   # threshold 2.0%
[drift]     shadow-accuracy gap:      not configured

state: amber
recommendation: expand eval pack with last 7d novel-entity samples before next compile

The output is intentionally boring. It is four numbers and a state. It does not auto-rotate the pack; it does not auto-compile; it does not deploy anything. It produces a state and a recommendation. The operator does the work.

Three lines of prior work inform the design.

HELM (Liang et al., arXiv:2211.09110) framed the broad-coverage evaluation question with explicit attention to scenario coverage and the way a static benchmark can lose representativeness. The HELM contribution we lean on is the separation between the metric (accuracy on a scenario) and the operational decision (which scenarios to run). The kolm K-score sits closer to the operational decision; the HELM framing of scenario rotation as a deliberate human choice rather than an automatic one informs our refusal to auto-rotate.

Continual evaluation literature over the last decade in classical ML has converged on the same diagnostic shape: detect shift, alert a human, expand the labeled set, retrain. The Hendrycks et al. line on out-of-distribution detection, the Lipton et al. work on label shift estimation, and the Kuznetsova et al. treatment of distribution-shift uncertainty are all instances of the same operational pattern: computing a shift signal is the cheap half; deciding what to label next is the expensive half.

Production-MLOps drift detection systems (Evidently AI, Arize, Fiddler) have built monitoring around variants of these signals. The contribution of those systems is mostly operational: a UI that surfaces the signal, an alerting rule that lets a human notice. Our detector is meant to slot into the same operational role for kolm deployments, with the additional constraint that the signal output gets signed alongside the receipt chain.

What an operator can do today.

The full v0.2 detector ships when the calibration data lands. Until then, three practices an operator can adopt today.

Practice one: maintain a hand-labeled holdout outside the artifact. The eval pack inside the .kolm is the compile-time gate. The production-quality estimate is a separate exercise. Sample 50 to 200 captures a week; have a human label them; run them through the artifact; record the accuracy. The result is a noisy but unbiased estimate of how the artifact is performing on current traffic.
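A sketch of the weekly holdout estimate from practice one, assuming the human labels and the artifact's answers are aligned lists; the Wilson interval just makes the "noisy" caveat concrete. The helper name and sample sizes are illustrative.

```python
import math

def holdout_estimate(labels: list[str], answers: list[str], z: float = 1.96):
    """Accuracy on a weekly hand-labeled holdout, with a Wilson score interval."""
    n = len(labels)
    correct = sum(1 for y, a in zip(labels, answers) if y == a)
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return p, (max(0.0, centre - half), min(1.0, centre + half))

# acc, (lo, hi) = holdout_estimate(week_labels, week_answers)
# At 100 labeled captures and roughly 80% accuracy the interval is about +/- 8 points,
# which is why the text calls this a noisy (but unbiased) estimate.
```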

Practice two: expand the eval pack on a calendar. A quarterly compile cycle with explicit eval-pack expansion catches most drift before it hurts. Add 20 to 50 new cases per quarter, drawn from human-labeled production captures, and re-run the compile.

Practice three: monitor the kolm capture counters. A namespace whose captured count keeps growing while distilled stays flat is producing inputs that are not making it into the next artifact. That is often a sign that the eval pack needs to grow to cover the new inputs before the next distill will clear the gate.

Drift is the part of the system the receipt chain cannot guarantee. The chain attests that the K-score was computed honestly against a particular pack at a particular time. It does not attest that the pack still describes the world. The chain is the proof of what we did. The pack is the bet on what we measured. The detector is the open problem. Today, the bet is the operator's; we are working on the detector that helps them see when the bet is going stale.