The N-to-K problem.
Synthesis-time and distillation-time both end in the same situation. The teacher has produced N candidate outputs for one spec input. Which K do you keep, and which do you drop?
The naive answer is: keep all of them. This is wrong. Candidates from the same teacher under the same prompt with one temperature draw are correlated. Their embeddings cluster. Their lexical surface repeats. Training on the full pool teaches the student to memorize the teacher's surface, not the underlying task. The K-score drops by 4 to 8 percentage points when the trainer is fed the unfiltered pool versus a properly K-sampled subset.
The slightly less naive answer is: keep the top-K by some score. This is closer but still wrong if the score is computed candidate-by-candidate without considering the relationships between them. Top-K by raw verifier score will reliably pick K near-duplicates of the highest-scoring candidate, because the underlying scoring function is monotonically related to surface similarity to the spec.
The right answer is to pick K such that the kept candidates maximize a joint objective: high per-candidate score plus high inter-candidate distance. That is what K-sampling means inside the kolm compile loop.
The selection algorithm.
The trainer runs a two-step selection. The first step is clustering; the second is in-cluster ranking. The implementation lives in the synthesis layer and is invoked per spec slice.
# pseudocode for the K-sampling loop
def k_sample(candidates, K, embed_fn, score_fn):
    # 1. embed every candidate
    embs = [embed_fn(c.text) for c in candidates]
    # 2. cluster by cosine distance; target K clusters
    clusters = agglomerative_cluster(embs, n_clusters=K, metric="cosine")
    # 3. inside each cluster, rank by per-candidate score
    selected = []
    for cluster_id in range(K):
        members = [c for c, k in zip(candidates, clusters) if k == cluster_id]
        members.sort(key=score_fn, reverse=True)
        selected.append(members[0])
    return selected
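For concreteness, here is a minimal runnable sketch of the same loop, assuming scikit-learn's AgglomerativeClustering as the clusterer and candidates that expose a .text attribute; the shipped synthesis-layer implementation may differ.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def k_sample(candidates, K, embed_fn, score_fn):
    # assumes len(candidates) >= K and scikit-learn >= 1.2
    # (older releases spell the `metric` parameter as `affinity`)
    embs = np.array([embed_fn(c.text) for c in candidates])
    # cosine distance needs a non-ward linkage; "average" is a reasonable choice
    labels = AgglomerativeClustering(
        n_clusters=K, metric="cosine", linkage="average"
    ).fit_predict(embs)
    selected = []
    for cluster_id in range(K):
        members = [c for c, lab in zip(candidates, labels) if lab == cluster_id]
        selected.append(max(members, key=score_fn))  # best-scoring member per cluster
    return selected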
The embedding function is a small open model; the cluster count is the target K passed in by the caller; the score function is whichever of three signals is available (a small dispatch sketch follows the list):
- Reward signal. If a task has a learned reward model (rare for our compile pipelines, but present for some RLHF-style tasks), the reward is the ranking score.
- Verifier score. The default. The verifier in src/verifier.js returns an accuracy score in [0, 1] per candidate against the spec; this is the ranking score for synthesis-time selection.
- Teacher log-prob. Fallback. If no verifier is available for the task, the teacher's own log-probability of the candidate (when exposed by the API) is the ranking score. This is the weakest of the three, because it correlates with confidence rather than correctness, but it beats random.
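How a caller might pick among those three signals, as a hedged sketch (the task fields and helper name here are hypothetical, not the trainer's real interface):

def make_score_fn(task):
    # hypothetical dispatch: prefer the strongest available ranking signal
    if getattr(task, "reward_model", None) is not None:
        return lambda c: task.reward_model(c.text)                # rare: RLHF-style tasks
    if getattr(task, "verifier", None) is not None:
        return lambda c: task.verifier.score(c.text, task.spec)   # default: accuracy in [0, 1]
    return lambda c: c.teacher_logprob                            # fallback: confidence, not correctness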
The agglomerative-cluster step is the load-bearing one. It is what turns N correlated candidates into K representative ones. The cluster count is the user-facing knob: smaller K means more aggressive deduplication; larger K means more raw data with more redundancy. The trainer defaults to K = N/3 for synthesis-time, K = N/2 for distillation-time, and exposes both as overrides in the recipe pack.
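The stated defaults, as a tiny sketch (the phase names and override plumbing here are illustrative, not the recipe pack's real schema):

def default_k(n_candidates, phase, override=None):
    # K = N/3 at synthesis-time, K = N/2 at distillation-time; recipe-pack overrides win
    if override is not None:
        return override
    divisor = 3 if phase == "synthesis" else 2
    return max(1, n_candidates // divisor)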
The cluster-gap failure mode.
The selection algorithm is correct when the candidate pool has clean clusters. It breaks when the clusters overlap.
Consider a spec slice for which the teacher has produced 12 candidates. Eleven of them are near-paraphrases of each other (cluster A); one is a clean outlier (cluster B). You ask for K=4. The agglomerative clusterer correctly identifies two natural clusters at the top of the dendrogram, but you forced it to produce four. So it artificially splits cluster A into three subclusters at very small inter-cluster distances. The four selected candidates are: three near-duplicates from the artificially-split cluster A, and one from cluster B. The selection is technically four clusters but is effectively a 3+1 split.
The tell shows up in the cluster-cohesion log:
# cluster-gap detection log
slice 2/5  policy_clarification
candidates: 12
K target: 4
natural clusters: 2 (silhouette 0.61)
forced K=4 split: silhouette 0.18   <-- bad
smallest inter-cluster cosine: 0.04
WARN cluster-gap detected; degenerate split
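A sketch of how such a check could be computed with silhouette scores, assuming the same cosine embeddings used for clustering; the 0.3 cutoff below is illustrative, not the trainer's calibrated threshold.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def detect_cluster_gap(embs, k_requested, min_silhouette=0.3):
    # flag a degenerate split when the forced-K silhouette collapses while a smaller
    # cluster count separates the pool much more cleanly
    embs = np.asarray(embs)

    def silhouette_at(k):
        labels = AgglomerativeClustering(
            n_clusters=k, metric="cosine", linkage="average"
        ).fit_predict(embs)
        return silhouette_score(embs, labels, metric="cosine")

    forced_sil = silhouette_at(k_requested)
    natural_k = max(range(2, k_requested + 1), key=silhouette_at)
    degenerate = forced_sil < min_silhouette and natural_k < k_requested
    return degenerate, natural_k, forced_sil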
The trainer logs the warning, but the warning alone does not fix the training data. The mitigation is one of three.
- Drop K. Honor the natural cluster count. Take K' < K candidates if the natural cluster count is below K. This is the default fallback and is documented in the receipt as k_actual vs k_requested.
- Resynthesize. If K is load-bearing (e.g. the trainer needs at least K pairs per slice for budget reasons), the loop goes back to the teacher and requests N' > N candidates at a higher temperature, hoping to surface more clusters.
- Refactor the slice. The slice itself may be too narrow. Splitting "policy_clarification" into "policy_clarification_refund" and "policy_clarification_shipping" can produce two slices that each cluster cleanly.
Diversity-weighted selection.
The default selection above is greedy: one pick per cluster, by score, no interaction across clusters. The diversity-weighted variant is slightly more expensive and produces better training distributions on hard tasks.
The objective: maximize a sum-of-scores subject to a minimum inter-pick distance. Concretely:
# diversity-weighted variant
maximize    sum(score(c_i) for c_i in selected)
            - lambda * sum(1 / (1 + dist(c_i, c_j)) for all pairs (c_i, c_j) in selected)
subject to  |selected| == K
The penalty term pushes the optimizer away from picks that are close to each other in embedding space. Lambda is the diversity weight; the trainer's default is 0.3 (a fairly mild diversity preference) and is tunable per recipe. At lambda=0 the algorithm reduces to top-K by score; at high lambda it reduces to maximally-spread selection regardless of score.
The diversity-weighted variant is solved approximately with a determinantal point process or with a simple greedy-with-marginal-gain heuristic; the trainer ships the greedy version because it is good enough on the corpus sizes we see (hundreds of candidates per slice, not millions). The DPP version is on the v0.2 list, behind the speculative draft cache.
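A sketch of the greedy marginal-gain variant under that objective, assuming cosine distance over unit-normalized embeddings; the shipped version may differ in detail.

import numpy as np

def greedy_diverse_select(candidates, embs, score_fn, K, diversity_lambda=0.3):
    # each step adds the candidate whose score, minus a redundancy penalty against
    # the picks made so far, is largest; at lambda=0 this reduces to top-K by score
    embs = np.asarray(embs, dtype=float)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    scores = np.array([score_fn(c) for c in candidates])
    selected, remaining = [], list(range(len(candidates)))

    def marginal_gain(i):
        if not selected:
            return scores[i]
        dists = 1.0 - embs[selected] @ embs[i]                # cosine distances to picks
        return scores[i] - diversity_lambda * np.sum(1.0 / (1.0 + dists))

    while remaining and len(selected) < K:
        best = max(remaining, key=marginal_gain)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]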
Diversity is not a free lunch. A diversity-weighted selection guarantees coverage at the cost of average per-pick quality. The right lambda is the one that pushes the post-distill K-score up; for most tasks we have measured, lambda = 0.3 is the sweet spot.
Open problem: the speculative draft cache.
Every K-sampling pass throws away N-K candidates. For N=12, K=4, that is two-thirds of the teacher's output. The wall-clock cost of producing those rejected candidates is paid in full; the dollar cost (in the cloud-teacher case) is paid in full. The candidates themselves are coherent text, scored by the verifier, embedded, clustered.
The question on the v0.2 research agenda: can the rejected N-K be cached as speculative drafts for future similar specs?
The intuition is the same intuition that makes speculative decoding work at inference time. A small draft model proposes tokens; a large verifier model accepts or rejects them. When the proposal is accepted, the large model has done less work than it would have done generating the token from scratch. The wall-clock saving scales with the acceptance rate.
At compile time, the analogous setup looks like this: the trainer has built a corpus of (spec_slice, candidate_text, verifier_score, embedding) tuples from past compiles. When a new spec slice arrives, the trainer queries the cache for embeddings within a threshold distance. The cached candidates are drafts: the new compile presents them to the teacher and asks "would you have produced this?" If the teacher accepts, the draft becomes a candidate at zero teacher-API cost. If the teacher rejects, the draft is discarded and the trainer falls back to generating fresh candidates.
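A sketch of what that lookup might look like (everything here is hypothetical v0.2 shape: the entry layout, the helper name, and the uncalibrated distance threshold):

import numpy as np

def query_draft_cache(cache, new_slice_embedding, max_cosine_distance=0.15):
    # cache entries: (spec_slice, candidate_text, verifier_score, embedding) tuples
    q = np.asarray(new_slice_embedding, dtype=float)
    q = q / np.linalg.norm(q)
    drafts = []
    for spec_slice, candidate_text, verifier_score, embedding in cache:
        e = np.asarray(embedding, dtype=float)
        dist = 1.0 - float(q @ (e / np.linalg.norm(e)))       # cosine distance
        if dist <= max_cosine_distance:
            drafts.append((candidate_text, verifier_score, dist))
    return sorted(drafts, key=lambda d: d[2])                 # nearest drafts first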
The framework is borrowed directly from Chen et al., Accelerating Large Language Model Decoding with Speculative Sampling (arxiv 2302.01318). The translation is: at inference time, the large model is the bottleneck; at compile time, the teacher API is the bottleneck. The same speculative pattern that compresses inference-time latency should compress compile-time cost.
The reason this is on the v0.2 agenda and not in the v0.1 shipping code: the verification step is harder than it looks. A draft candidate from spec slice X last week is not automatically a valid candidate for spec slice Y this week. The cosine distance threshold for "similar enough to reuse" is a hyperparameter we have not yet calibrated, and getting it wrong means the cache leaks signal across tasks. The probable shape of the v0.2 implementation is a per-tenant cache, scoped to the same task family, with a published draft-acceptance rate inside the receipt chain.
The cost arithmetic.
The reason K-sampling is worth the engineering complexity is the cost of the alternative.
For a synth-then-distill compile of a refund-classifier, the teacher API call is the dominant cost line. A typical compile asks the teacher for 240 candidate inputs (pass one) and 240 candidate labels (pass two). At ~500 tokens per candidate output and ~$5 per million output tokens (the published rates for frontier teachers as of early 2026), that is $2.40 per compile in teacher costs alone.
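A back-of-envelope helper for that kind of arithmetic (output tokens only; it ignores prompt/input tokens, retries, and per-request overhead, so it understates a real bill):

def teacher_output_cost_usd(n_candidates, tokens_per_candidate=500,
                            usd_per_million_output_tokens=5.0):
    # output-token cost of one teacher pass; run per pass and sum for the compile
    return n_candidates * tokens_per_candidate * usd_per_million_output_tokens / 1e6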
If the trainer takes all 240 inputs through the distill pass without K-sampling, the post-distill K-score lands around 0.85 to 0.88 (right at the gate, with no margin). If the trainer K-samples down to 80 inputs, the post-distill K-score lands around 0.89 to 0.92, and the budget for pass two is now $0.80 (one-third the cost). The K-sampled run is both better and cheaper.
The wider pattern is the load-bearing one for the kolm compile economics. More candidates is not better. Better candidates is better. The teacher API bill follows the candidate count linearly; the artifact quality follows the candidate quality, which scales with diversity rather than raw volume. K-sampling is the operator that converts a quantitative spend into a qualitative gain.
Related work.
- Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al., arxiv 2302.01318). The draft-and-verify framework the speculative draft cache borrows from.
- Fast Inference from Transformers via Speculative Decoding (Leviathan et al., arxiv 2211.17192). The companion paper on speculative decoding; same core idea.
- Determinantal Point Processes for Machine Learning (Kulesza and Taskar, 2012). The diversity-weighted selection formalism; the trainer uses a greedy approximation.
- Synthesis-then-distillation. The pipeline that produces the candidate pool K-sampling consumes.
- The case for adapters. What the K-selected pairs become after the distill pass: a 30-200 MB LoRA.
- K-score correlation. The metric that proves K-sampled training pairs produce better artifacts than unfiltered pools.
- Read next: K-score correlation → the metric that lets you measure whether K-sampling actually helped.
- Docs: kolm compile reference → K-sampling parameters in the recipe pack: n_candidates, k_select, diversity_lambda.
- Docs: the on-disk shape of a .kolm artifact → K-sampling logs land in training_stats.k_sampling.