Online iterative DPO.
Offline DPO trains on a static pair set. The pairs drift away from the policy as soon as the policy moves; gradients shrink and quality plateaus. Online iterative DPO closes the loop: sample fresh candidates from the current policy each round, judge them, train one DPO step, swap the adapter, repeat. Each round produces its own signed artifact pinned to the round's captures.
The round, in five steps
- Sample. Draw N candidates per prompt from the current policy at sample_temperature = 0.9. N = 4 is the most-cited choice; N = 8 is the practical ceiling before sampling cost dominates.
- Judge. Score each candidate. The judge can be a learned reward model (cheap, deterministic), an LLM judge (high quality, costly), a self-rewarding rubric prompt (Yuan 2024), or a verifiable check shared with the K-score gate.
- Pair. For each prompt, take the highest-scored candidate as chosen and the lowest as rejected. Drop pairs below a margin floor (uninformative) or below a diversity floor (near-identical strings).
- Train. Run one DPO pass with the previous round's policy as the reference. The reference shifts each round; a bounded number of steps per round keeps the policy near the reference.
- Swap. Save the new adapter. The runtime hot-reloads via the same symlink-and-HUP path described in the continual learning loop.
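A minimal sketch of the round loop, assuming the sampling, judging, training, and swap steps are exposed as callables; none of these names are kolm's API, they only fix the shape of the loop.

```python
# Hypothetical sketch of the five steps. The callables (sample_fn, judge_fn,
# dpo_step_fn, swap_fn) stand in for the real trainer and runtime plumbing.
from typing import Callable, Sequence

def run_online_dpo(
    prompts: Sequence[str],
    sample_fn: Callable[[str, int, float], list[str]],          # Sample
    judge_fn: Callable[[str, str], float],                      # Judge
    dpo_step_fn: Callable[[list[tuple[str, str, str]]], None],  # Train
    swap_fn: Callable[[], None],                                # Swap
    n_rounds: int = 4,
    candidates_per_prompt: int = 4,
    temperature: float = 0.9,
    margin_floor: float = 0.0,
) -> None:
    for _ in range(n_rounds):
        pairs: list[tuple[str, str, str]] = []
        for prompt in prompts:
            candidates = sample_fn(prompt, candidates_per_prompt, temperature)
            scored = sorted((judge_fn(prompt, c), c) for c in candidates)
            (low, rejected), (high, chosen) = scored[0], scored[-1]
            if high - low >= margin_floor:   # Pair: drop uninformative pairs
                pairs.append((prompt, chosen, rejected))
        dpo_step_fn(pairs)  # one DPO pass; reference = previous round's policy
        swap_fn()           # publish the new adapter for hot-reload
```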
Four rounds is the kolm default; published recipes cite diminishing returns past five. Each round's adapter has its own CID and its own receipt; the chain across rounds is the audit story.
Why the reference shifts each round
The DPO loss is
L = -log σ( β [ log(π(c|p) / π_ref(c|p)) - log(π(r|p) / π_ref(r|p)) ] )
where π is the policy under training and π_ref is a frozen reference. Offline DPO uses the initial SFT model as π_ref throughout training. Iterative DPO uses the previous round's policy: π_ref at round t is the model that was just shipped at round t-1.
This bounds the KL of each update. Without it, after a few rounds the policy can drift arbitrarily far from the original SFT model and the DPO loss starts measuring the gap to a now-irrelevant reference. With the shifting reference, each round is a small step from a known-good policy; the policy can move further over many rounds without any single step being unbounded.
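As a concrete sketch, the per-pair loss reduces to a few lines once the four sequence log-probabilities are in hand; the variable names below are illustrative, not kolm's API.

```python
# Per-pair iterative DPO loss, given sequence log-probabilities under the
# policy being trained and under the previous round's frozen policy.
import math

def dpo_pair_loss(
    logp_chosen: float,       # log pi(c|p) under the policy being trained
    logp_rejected: float,     # log pi(r|p) under the policy being trained
    ref_logp_chosen: float,   # log pi_ref(c|p), reference = round t-1 policy
    ref_logp_rejected: float, # log pi_ref(r|p)
    beta: float = 0.1,
) -> float:
    # Implicit reward margin: how much more the policy prefers c over r,
    # relative to the frozen reference from the previous round.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable -log(sigmoid(x)), i.e. softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```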
Four judges, four tradeoffs
| Judge | Cost | Signal-to-noise | When to use |
|---|---|---|---|
| Learned RM | Cheap (one local inference call per candidate) | Whatever the RM encoded | You already trained an RM via apps/trainer/reward.py and want fast deterministic rounds |
| LLM judge | ~100x learned RM (API call per candidate) | High, with order-swap bias correction | The task is open-ended and the RM cannot keep up with policy drift |
| Self-rewarding | 2x sampling (the policy judges itself) | Bootstraps from a competent policy; quality depends on rubric design | No external judge available; policy is already good enough to grade itself |
| Verifiable | Microseconds (subprocess sandbox) | Perfect, where applicable | The task admits a programmatic check (code passes tests, JSON validates, numeric answer matches) |
The same REWARD_FUNCTIONS table that GRPO uses powers the verifiable judge. That is on purpose: the gradient signal in GRPO and the pair signal in online DPO should agree about which response is better, otherwise the two recipes pull the policy in different directions.
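A sketch of the verifiable judge sharing that table. The dict-of-callables shape of REWARD_FUNCTIONS is assumed here for illustration; only the JSON check is defined inline.

```python
# Verifiable judge reusing the same reward table GRPO consumes; the table
# shape (task name -> scoring callable) is an assumption for this sketch.
import json
from typing import Callable

def json_valid(prompt: str, response: str) -> float:
    """Verifiable check: 1.0 if the response parses as JSON, else 0.0."""
    try:
        json.loads(response)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

REWARD_FUNCTIONS: dict[str, Callable[[str, str], float]] = {
    "json_format": json_valid,
}

def verifiable_judge(task: str, prompt: str, response: str) -> float:
    # The same scoring function that drives the GRPO gradient drives the
    # online-DPO pairing, so the two recipes agree on which response wins.
    return REWARD_FUNCTIONS[task](prompt, response)
```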
Pair filtering: why drop low-margin and low-diversity pairs
Two filters apply before each DPO pass:
- Margin floor. If the judge scored the top and bottom candidate within pair_filter_margin of each other, the pair carries little signal. The DPO loss is roughly proportional to the margin, so a near-zero-margin pair contributes near-zero gradient anyway, but it still costs forward passes and dilutes the batch. Default 0.0 (keep all pairs); raise it when the judge is noisy.
- Diversity floor. If chosen and rejected agree on >95% of characters, the pair is teaching the policy whitespace or punctuation differences. Drop them. The check is cheap (character-level overlap) and the pairs it catches are reliably noise.
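A minimal sketch of both filters, assuming scores arrive alongside the candidate texts; the thresholds mirror pair_filter_margin and the 95% character-overlap floor.

```python
# Pair filters applied before each DPO pass; a pair survives only if it
# clears both the margin floor and the diversity floor.
from difflib import SequenceMatcher

def keep_pair(
    chosen: str,
    rejected: str,
    chosen_score: float,
    rejected_score: float,
    margin_floor: float = 0.0,
    max_overlap: float = 0.95,
) -> bool:
    # Margin floor: near-zero margin means near-zero DPO gradient.
    if chosen_score - rejected_score < margin_floor:
        return False
    # Diversity floor: near-identical strings teach whitespace, not preference.
    overlap = SequenceMatcher(None, chosen, rejected).ratio()
    return overlap <= max_overlap
```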
Both filters are recorded in the round's receipt with counts so the auditor can confirm the policy was trained on a sane fraction of the sampled pairs.
The honesty contract
Iterative DPO has a known failure mode: the policy and the judge can collude. A learned reward model rewards what it was trained on; a policy trained against that RM learns to maximize the RM's idiosyncrasies, not the user's preferences. Three guardrails:
- Independent eval pack. The K-score gate runs against a held-out pack the judge never sees. If the policy is gaining ground on the judge but losing ground on the pack, the gate fails the round and the swap does not happen.
- RM refresh between rounds. An RM trained at round 0 ages out after a few rounds; the policy has moved past its training distribution. A new captured-traffic batch retrains the RM mid-loop; the receipt records which RM CID judged each round.
- Pair sampling diversity. Self-rewarding is the highest-risk judge (it cannot dissent from the policy). Mix in 10-30% of the prompts judged by an external RM or LLM judge; cross-check that the two judges agree on a majority of pairs.
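A sketch of that cross-check, with placeholder judge callables: route a fraction of the sampled pairs to the external judge and measure how often it picks the same winner.

```python
# Agreement rate between the primary judge's pairings and an external judge
# on a randomly audited subset of pairs.
import random
from typing import Callable, Sequence

def judge_agreement(
    pairs: Sequence[tuple[str, str, str]],       # (prompt, chosen, rejected)
    external_judge: Callable[[str, str], float],
    audit_fraction: float = 0.2,                 # 10-30% per the guardrail
    seed: int = 0,
) -> float:
    rng = random.Random(seed)
    sample = [p for p in pairs if rng.random() < audit_fraction]
    if not sample:
        return 1.0
    agree = sum(
        1 for prompt, chosen, rejected in sample
        if external_judge(prompt, chosen) > external_judge(prompt, rejected)
    )
    return agree / len(sample)
```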
The receipt records the judge CID, the pair source, and the per-round held-out K-score. A run where the held-out K-score does not improve across rounds is a run suffering from collusion or saturation, and the receipt makes either visible immediately.
What ships
One artifact per round. The final round's adapter is the deployment unit; the prior rounds' adapters are the audit trail. Each round's receipt names:
- The input adapter CID and the output adapter CID.
- The judge type and the judge CID (so the auditor can re-fetch the RM, replay the pairs, and recompute the margins).
- The DPO config (beta, learning rate, max steps).
- The mean margin, the surviving pair count, and the dropped-pair count by filter.
The chain is the canonical-JSON HMAC chain from the artifact spec; tampering with any round invalidates the subsequent rounds' signatures. The binder (kolm verify --binder) walks the chain start-to-finish in one report.
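A sketch of what the binder's walk amounts to, assuming canonical JSON means sorted keys with no extra whitespace and that each receipt carries the previous receipt's MAC in a prev_mac field; the field names are illustrative, not the artifact spec itself.

```python
# Walk an HMAC chain over canonical-JSON receipts; tampering with any round
# breaks every subsequent round's verification.
import hashlib
import hmac
import json

def receipt_mac(receipt: dict, key: bytes) -> str:
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode(), hashlib.sha256).hexdigest()

def verify_chain(receipts: list[dict], key: bytes) -> bool:
    prev = ""  # the first receipt is assumed to carry an empty prev_mac
    for receipt in receipts:
        body = {k: v for k, v in receipt.items() if k != "mac"}
        if body.get("prev_mac", "") != prev:
            return False                       # a tampered round breaks here
        if not hmac.compare_digest(receipt["mac"], receipt_mac(body, key)):
            return False
        prev = receipt["mac"]
    return True
```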
What the receipt records
"online_dpo": {
"method": "online_dpo",
"base_model": "Qwen/Qwen2.5-3B-Instruct",
"adapter_in": "registry/cidv1:sha256:a4b1...",
"adapter_out": "registry/cidv1:sha256:e9c8...",
"config": {
"n_rounds": 4,
"candidates_per_prompt": 4,
"judge": "learned_rm",
"beta": 0.1,
"sample_temperature": 0.9,
"per_round_batches": 200,
"pair_filter_margin": 0.05,
"diversity_filter": true
},
"rounds": [
{"round": 1, "n_pairs": 488, "loss_final": 0.581, "mean_margin": 1.42, "seconds": 311},
{"round": 2, "n_pairs": 472, "loss_final": 0.524, "mean_margin": 1.18, "seconds": 305},
{"round": 3, "n_pairs": 461, "loss_final": 0.491, "mean_margin": 0.96, "seconds": 308},
{"round": 4, "n_pairs": 444, "loss_final": 0.469, "mean_margin": 0.78, "seconds": 312}
],
"papers": [
"arXiv:2312.16682",
"arXiv:2401.10020",
"arXiv:2403.08635",
"arXiv:2404.03715",
"arXiv:2405.07863"
]
}
Diminishing margin across rounds is expected; the policy is approaching the judge's preferences. A margin that collapses to zero says the policy has saturated on this judge and the next round will gain nothing. A margin that does not decrease at all says the policy is not learning; check the reference-policy detachment and the learning rate.
Edge cases worth naming
Reference-policy memory. Each round's reference is a frozen clone of the previous policy. For PEFT models, deepcopy of the merged base+adapter is too heavy; we snapshot the adapter weights and reload them into a non-trainable clone. If the run OOMs on the freeze step, drop per_round_batches so the round finishes faster and the freeze happens less often.
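A sketch of that snapshot, under the assumption that the adapter weights are exactly the model's trainable parameters; it copies only those tensors and reloads them into a frozen clone rather than deep-copying the merged model.

```python
# Snapshot only the trainable (adapter) parameters to CPU, then restore them
# into a non-trainable clone that serves as pi_ref for the next round.
import torch

def snapshot_adapter(model: torch.nn.Module) -> dict[str, torch.Tensor]:
    # The frozen base weights are shared; only adapter tensors are copied.
    return {
        name: p.detach().to("cpu").clone()
        for name, p in model.named_parameters()
        if p.requires_grad
    }

@torch.no_grad()
def load_reference(model: torch.nn.Module,
                   snapshot: dict[str, torch.Tensor]) -> None:
    # Reload the snapshot and freeze everything: this clone is pi_ref.
    for name, p in model.named_parameters():
        if name in snapshot:
            p.copy_(snapshot[name].to(p.device))
        p.requires_grad_(False)
```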
Judge cost dominating. An LLM-judge round on 5000 prompts at 4 candidates each is 20K judge calls. The K-score gate after the round adds another judging pass on the held-out pack. Plan the spend before starting; the receipt records judge_cost_usd per round so the buyer's auditor can confirm the budget.
Off-policy prompt drift. If the prompt set was captured weeks ago and the production distribution has moved, the loop trains on the wrong distribution. Capture fresh prompts before each loop start. The capture infrastructure (capture-loop honesty) is the right primitive.
Catastrophic forgetting. After many rounds, the policy can lose capabilities outside the loop's prompt distribution. The fix is to mix 10-20% of an SFT replay buffer into each round's training data, which preserves baseline behavior on tasks the judge does not see.
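A sketch of the replay mix, with illustrative data shapes; the 15% default sits inside the 10-20% band named above.

```python
# Blend a fixed fraction of held-out SFT examples into each round's
# preference batch to preserve behavior the judge never exercises.
import random
from typing import Sequence

def mix_replay(
    dpo_pairs: Sequence[dict],
    sft_buffer: Sequence[dict],
    replay_fraction: float = 0.15,
    seed: int = 0,
) -> list[dict]:
    rng = random.Random(seed)
    n_replay = int(len(dpo_pairs) * replay_fraction)
    replay = rng.sample(list(sft_buffer), min(n_replay, len(sft_buffer)))
    mixed = list(dpo_pairs) + replay
    rng.shuffle(mixed)
    return mixed
```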
Where this fits in the kolm compile loop
The three preference recipes share one surface. Offline preference for static labeled pairs. GRPO with verifiable rewards for tasks that admit a programmatic check. Online iterative DPO for everything in between: tasks where the right answer is a matter of judgement, the judgement function is captured in an RM or an LLM, and the policy needs to keep up with a moving target. The K-score gate is the same across all three; the receipt is the same across all three; the runtime hot-swap path is the same across all three.
Citations
Xu, J. et al. Some things are more CRINGE than others: Iterative Preference Optimization with the Pairwise Cringe Loss. arXiv:2312.16682, 2023.
Yuan, W. et al. Self-Rewarding Language Models. arXiv:2401.10020, 2024. The model is both the policy and the judge that generates new pairs.
Calandriello, D. et al. Human Alignment of Large Language Models through Online Preference Optimisation. arXiv:2403.08635, 2024.
Rosset, C. et al. Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences. arXiv:2404.03715, 2024. The Nash-equilibrium framing of iterative preference optimization.
Dong, H. et al. RLHF Workflow: From Reward Modeling to Online RLHF. arXiv:2405.07863, 2024. The most-cited recipe; the source of the canonical four-round schedule.