Research · Methods · 2026-05-14 · 11 min read

Synthesis-then-distillation: building task corpora without seed data.

The user has a task description and zero captured pairs. The compile loop synthesizes a corpus from the spec, distills from a teacher, rejects malformed pairs with a deterministic verifier, and K-selects for diversity. Two passes, four guard rails, one honest cap on quality.

By kolm
Tags: synthesis · distillation · data

The cold-start problem.

The textbook setup for a fine-tune is a thousand labeled pairs in hand. The realistic setup is a task description and nothing else. A founder wants a refund classifier; a support team wants a ticket router; a clinician wants a drug-interaction filter. None of them have a curated jsonl. They have a sentence.

The pre-2024 answer was: go capture some data, come back in two weeks. The post-2024 answer is: a frontier model is good enough at generating in-distribution training pairs for narrow tasks that you can bootstrap from zero. Synthesis is not a substitute for captured production traffic. It is a way to ship the first artifact while the capture pipeline accumulates real pairs.

The kolm compile loop runs synthesis automatically when the corpus is empty or below threshold. The entry point is kolm compile; its --examples flag doubles as the seed for the verifier, and when no examples are supplied, the loop falls back to the spec alone.

$ kolm compile -t refund_flagger
[synth]     no seed corpus; entering synth-then-distill mode
[synth]     teacher: claude-opus-4-7 (api)
[synth]     candidates per slice: N=48, target K=16
[synth]     generating 240 candidate pairs across 5 spec slices
[verify]    rejected 53 / 240 (22%) malformed or off-spec
[k-sel]     selected 80 diverse pairs across 50 clusters
[distill]   teacher rollout on 80 pairs → LoRA training set
[K-score]   0.887 (ships at 0.85 gate)

Two passes: synthesize, then distill.

The pipeline runs in two named passes with different inputs and different teachers. They look like one step from the CLI; they are two steps internally.

Pass one: synthesize the spec slice. The compiler decomposes the task description into a small number of spec slices: input shape, output shape, edge cases, negation cases, near-miss cases. For a refund classifier, the slices are roughly clear refund request, refund-adjacent question, policy clarification, angry-but-not-asking, not refund-related. The teacher is prompted N times per slice with a temperature high enough to surface lexical variation (typically T=1.0, top_p=0.92). The output is an unfiltered set of N candidate inputs per slice, plus a teacher-generated rationale that explains why each candidate fits the slice.
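To make pass one concrete, a minimal sketch in the same JavaScript as the verifier. Only the slice names, N, and the sampling parameters come from the text; teacher.complete, its arguments, and the prompt shape are stand-ins for whatever API client the compiler actually wraps.

// Pass one (sketch): N high-temperature prompts per spec slice.
// `teacher.complete` is a placeholder for the compiler's real teacher client.
async function synthesizeCandidates(teacher, spec, slices, n = 48) {
  const pool = [];
  for (const slice of slices) {
    for (let i = 0; i < n; i += 1) {
      // High temperature on purpose: surface lexical variation per slice.
      const { input, rationale } = await teacher.complete({
        prompt: `${spec}\nWrite one input for slice "${slice}" and a rationale for why it fits.`,
        temperature: 1.0,
        top_p: 0.92,
      });
      pool.push({ slice, input, rationale }); // unfiltered; the verifier gates it later
    }
  }
  return pool;
}

// e.g. synthesizeCandidates(teacher, spec,
//   ["clear_refund_request", "refund_adjacent_question", "policy_clarification",
//    "angry_not_asking", "not_refund_related"]);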

This pass is the one most likely to be done badly. Three failure modes (mode collapse, lexical regression, label distribution skew) all live here. The cure is structural: each slice has a verifier-side schema, the teacher temperature stays high, and the candidates land in a pool that is later K-selected, not consumed in order.

Pass two: distill the labels. Once the inputs are in the pool and verified, the teacher is prompted again, this time at low temperature (T=0.2), to produce the canonical label for each input. The temperature flip is intentional. Diverse inputs, deterministic labels. This is the pattern that makes synthetic training data work; the inverse (low-temperature inputs, high-temperature labels) collapses the distribution and teaches the student to babble.
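The same loop at the other end of the temperature dial, under the same assumptions as the pass-one sketch; only the T=0.2 setting comes from the text.

// Pass two (sketch): one low-temperature canonical label per verified input.
async function distillLabels(teacher, spec, verifiedPool) {
  const pairs = [];
  for (const { slice, input } of verifiedPool) {
    // Low temperature on purpose: deterministic labels for diverse inputs.
    const { label } = await teacher.complete({
      prompt: `${spec}\nReturn the canonical label for this input:\n${input}`,
      temperature: 0.2,
    });
    pairs.push({ input, expected_output: label, slice });
  }
  return pairs;
}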

Pass two writes a jsonl that looks like every other captured corpus. The downstream trainer (kolm distill) does not know or care whether the pairs came from a real session or a synthesis run. The receipt chain marks the provenance: seeds.source = "synth" in the canonical-JSON envelope.
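What one line of that jsonl might look like. Only seeds.source = "synth" is documented above; the input, label, and field layout here are illustrative.

{"input": "the blender arrived cracked, i want my money back", "expected_output": "refund", "seeds": {"source": "synth"}}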

The verifier gate.

Every synthesized pair passes through a deterministic verifier before it joins the training set. The verifier is not an LLM. It is a small JavaScript generator function (in the inline-spec case) or a JSON-schema checker (in the structured-spec case) that takes (input, expected_output) and returns (ok, reject_reason). The harness that runs the generator over a set of examples is plain:

// src/verifier.js
const round = (x, d) => Number(x.toFixed(d));

// Run the generator against each example, record the trace for the audit log,
// and report accuracy on positives plus the reject rate on negatives.
export function verify(generator, { positives = [], negatives = [], property_tests = [] }) {
  const trace = [];
  const run = (examples) =>
    examples.map(({ input, expected_output }) => {
      const { ok, reject_reason = null } = generator(input, expected_output);
      trace.push({ input, ok, reject_reason });
      return ok;
    });
  const accOk = run(positives).filter(Boolean).length;           // positives should pass
  const negRejected = run(negatives).filter((ok) => !ok).length; // negatives should fail
  return {
    accuracy: round(accOk / Math.max(positives.length, 1), 3),
    reject_rate_negative: round(negRejected / Math.max(negatives.length, 1), 3),
    trace,
  };
}
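A minimal illustration of the contract, with a made-up inline-spec rule; the rule and both examples are hypothetical, not from the registry.

// Hypothetical rule: accept a pair iff the label is in the task's enum.
const rule = (input, expected_output) =>
  ["refund", "not_refund"].includes(expected_output)
    ? { ok: true }
    : { ok: false, reject_reason: "label_out_of_enum" };

const report = verify(rule, {
  positives: [{ input: "the box arrived broken, refund please", expected_output: "refund" }],
  negatives: [{ input: "where is my order?", expected_output: "maybe_refund" }],
});
// report → { accuracy: 1, reject_rate_negative: 1, trace: [...] }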

The verifier is called twice in the synth pipeline. The first call is the per-candidate gate: if the input does not fit the slice schema (wrong type, wrong shape, off-topic), the pair is dropped before it ever sees pass two. The second call is the post-distill audit: the held-out positives and negatives are run against the freshly distilled student to compute the K-score for the artifact.

Why deterministic? Because the only reliable way to reject malformed teacher output at scale is a rule that returns the same answer on the same input every time. An LLM-as-judge is not deterministic; the same pair will be accepted on Monday and rejected on Tuesday. The verifier is allowed to be conservative; it is not allowed to be flaky.

K-selection for diversity.

The teacher returns N candidates per spec slice. Even after the verifier drops the malformed ones, the remaining pool is correlated: candidates from the same slice tend to use similar vocabulary, similar sentence structure, similar named entities. Training on the full pool teaches the student to memorize the teacher's surface patterns, not the underlying task.

The cure is K-selection. From the verified pool of M candidates per slice, the compiler keeps the K most diverse, most spec-aligned candidates, where K is a fraction of M (typically 1/3 to 1/2). The selection is two scoring loops: a diversity loop that clusters the verified candidates by embedding and picks across clusters rather than down the list, and a verifier-score tiebreaker that decides within each cluster.

The diversity-weighted selection is what keeps the synthetic corpus from looking like fifty paraphrases of the same sentence. The verifier-score tiebreaker is what keeps the picker from choosing the weirdest candidate in each cluster. Together they encode the same stance as the temperature flip: broad inputs, sharp labels.
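A sketch of the per-slice picker under two assumptions the text does not pin down: each verified candidate already carries an embedding vector and a verifier score, and the diversity loop is greedy farthest-point selection. Field and function names are illustrative.

// K-selection (sketch): seed with the best-scoring candidate, then repeatedly
// take the candidate farthest from everything picked, tie-breaking on score.
function kSelect(candidates, K) {
  const dist = (a, b) => Math.hypot(...a.map((x, i) => x - b[i]));
  const selected = [candidates.reduce((best, c) => (c.score > best.score ? c : best))];
  const pool = candidates.filter((c) => c !== selected[0]);
  while (selected.length < K && pool.length > 0) {
    let bestIdx = 0;
    let bestKey = [-Infinity, -Infinity];
    pool.forEach((c, i) => {
      // Distance to the nearest already-selected candidate.
      const d = Math.min(...selected.map((s) => dist(c.embedding, s.embedding)));
      if (d > bestKey[0] || (d === bestKey[0] && c.score > bestKey[1])) {
        bestKey = [d, c.score];
        bestIdx = i;
      }
    });
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}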

A reference run on the refund-flagger task above:

# slice-by-slice K-selection trace
slice 1/5  clear_refund_request
  candidates: 48     verified: 41     clusters: 11     selected: 18
slice 2/5  refund_adjacent_question
  candidates: 48     verified: 37     clusters: 9      selected: 14
slice 3/5  policy_clarification
  candidates: 48     verified: 39     clusters: 10     selected: 16
slice 4/5  angry_not_asking
  candidates: 48     verified: 30     clusters: 8      selected: 12
slice 5/5  not_refund_related
  candidates: 48     verified: 40     clusters: 12     selected: 20
TOTAL    240 candidates → 187 verified → 80 K-selected pairs

Three failure modes and their tells.

The pipeline is built around three pathologies that synthesis is known to produce. Each has a tell the compiler logs and a mitigation the compiler applies.

Mode collapse. The teacher locks onto a narrow lexical pattern and produces fifty variants of the same sentence. The tell is the embedding cluster count: if a slice's 48 candidates produce only one or two clusters, the pool is collapsed. The mitigation is a temperature bump on pass one and, if that fails twice, a switch to a different teacher model. The compiler logs collapse_detected: true and the K-score gate refuses to ship the artifact below 0.85.
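The tell itself reduces to a set-size check. A sketch, assuming the cluster assignment from the K-selection step is stored on each candidate (the field name is illustrative).

// Collapse tell (sketch): a slice whose candidates span one or two clusters.
function collapseDetected(sliceCandidates) {
  const clusters = new Set(sliceCandidates.map((c) => c.cluster));
  return clusters.size <= 2; // surfaces in the log as collapse_detected: true
}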

Lexical regression. The teacher reaches for its own training distribution instead of the task slice. A refund-classifier slice on "policy clarification" starts returning customer-service script openers ("Thank you for reaching out"). The tell is the verifier reject rate: when reject_rate climbs above 30% on a single slice, the slice is regenerated with a stricter prompt anchored to two human-written exemplars (if available) or to a hand-built JSON schema (otherwise).

Label distribution skew. The teacher's prior over labels diverges from the spec. A refund-classifier specced to be 50/50 refund-vs-not lands at 80/20 in the synthesized set, because the teacher's training data is biased toward refund-shaped requests. The tell is the post-pass-two label histogram. The mitigation is stratified resampling: the compiler downsamples the over-represented class until the label distribution matches the spec within 5 percentage points.
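A sketch of the stratified resample, assuming pairs carry a label field and the spec supplies a map of target shares; both names are illustrative.

// Downsample over-represented labels until the histogram matches the spec.
function stratifiedResample(pairs, targetShare) {
  const byLabel = new Map();
  for (const p of pairs) {
    if (!byLabel.has(p.label)) byLabel.set(p.label, []);
    byLabel.get(p.label).push(p);
  }
  // The scarcest class relative to its target share caps the total size.
  const maxTotal = Math.min(
    ...[...byLabel].map(([label, list]) => Math.floor(list.length / targetShare[label])),
  );
  const out = [];
  for (const [label, list] of byLabel) {
    out.push(...list.slice(0, Math.round(maxTotal * targetShare[label])));
  }
  return out;
}

// e.g. stratifiedResample(pairs, { refund: 0.5, not_refund: 0.5 }) on an 80/20
// pool keeps the whole rare class and drops the excess of the common one.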

The three failure modes are not exotic. They are the default behavior of a high-temperature teacher with no constraints. The verifier gate, the K-selection, the stratified resample, and the K-score ship gate are the four guard rails that keep them from showing up inside a shipped artifact.

The synthetic-only ceiling.

Synthesis-then-distillation is good. It is not as good as real data. The K-scores across the public registry tell a consistent story: synth-only artifacts clear the 0.85 gate, and recompiles on captured traffic land in a visibly higher band.

This is the honest framing on the homepage and in the docs. The synth path exists so you can ship something the first hour, but the production path is always: ship the synth artifact, route 5% of traffic through it under kolm capture, replace with the recompiled artifact a week later. The receipt chain on the second artifact records the provenance flip: seeds.source = "captured" instead of "synth". The K-score band moves up. The artifact bytes change. The wire format does not.

The single-line argument for keeping the synth path in the compiler at all is that the alternative (asking the user to hand-curate a thousand pairs before they can run anything) is the reason most "fine-tune your own model" tools have a 3% activation rate. The synth path is the difference between an unactivated user and a user with a 0.89-K artifact in their hand who knows what to capture next.