Preference optimization after DPO.
DPO was the first single-stage preference algorithm, and it is still a solid default. But every newer method beats it on at least one axis. This report documents the decision tree kolm uses to pick one, with the data shapes and VRAM thresholds that flip the choice.
The five methods, one paragraph each
DPO · the baseline
Rafailov et al., 2023. Takes (prompt, chosen, rejected) pairs and optimises the policy directly against a Bradley-Terry preference model, side-stepping the reward-model + PPO loop. Requires a reference model (usually the SFT policy with adapters disabled). Battle-tested. beta=0.1 is the standard knob.
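For reference, the pairwise objective from the paper, with the frozen reference policy and the same beta knob:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

where y_w is the chosen completion and y_l the rejected one.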
KTO · binary signals
Ethayarajh et al., 2024. Takes (prompt, output, label) where label is good or bad. No pairs needed; the algorithm learns from independent positives and negatives. This matters in production because most logged feedback is a thumbs-up or thumbs-down on a single response, not a paired comparison. The loss is asymmetric in the Kahneman-Tversky style, weighting losses more heavily than gains.
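A minimal sketch of the unpaired data shape, assuming trl's KTOTrainer column convention of prompt, completion, and a boolean label:

```python
from datasets import Dataset

# Each row stands alone: a thumbs-up or thumbs-down on a single response, no pairing.
kto_dataset = Dataset.from_list([
    {"prompt": "Can I get a refund on a digital order?",
     "completion": "Yes, within 14 days of purchase, via the Orders page.",
     "label": True},    # thumbs-up
    {"prompt": "Can I get a refund on a digital order?",
     "completion": "No idea, check the website.",
     "label": False},   # thumbs-down
])
```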
ORPO · SFT and preference in one pass
Hong et al., 2024. Combines the SFT cross-entropy and an odds-ratio preference term in a single loss, so the model learns the chosen response while down-weighting the rejected response in the same step. No reference model needed. This is the right pick when the buyer has SFT data, preference pairs, and a tight VRAM budget, because the alternative is two separate stages (SFT, then DPO), which doubles the GPU envelope.
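The single-pass loss is the SFT cross-entropy plus a weighted odds-ratio term (the lambda_ knob in the CLI output below corresponds to the weight here):

$$
\mathcal{L}_{\mathrm{ORPO}} = \mathcal{L}_{\mathrm{SFT}} + \lambda\cdot\mathcal{L}_{\mathrm{OR}},\qquad
\mathcal{L}_{\mathrm{OR}} = -\log\sigma\!\left(\log\frac{\mathrm{odds}_\theta(y_w\mid x)}{\mathrm{odds}_\theta(y_l\mid x)}\right),\qquad
\mathrm{odds}_\theta(y\mid x)=\frac{P_\theta(y\mid x)}{1-P_\theta(y\mid x)}
$$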
SimPO · reference-free
Meng et al., 2024. Drops the reference model entirely. Optimises the length-normalised (average) log-probability of the chosen response over the rejected one, with a target reward margin controlled by gamma_beta_ratio. Beats DPO with lower memory use on most public benchmarks. The trade-off is sensitivity to the margin hyperparameter; out of the box, beta=2.0 and gamma_beta_ratio=0.5 line up with the paper's settings.
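The implicit reward is the length-normalised log-probability, gamma is the target reward margin, and gamma_beta_ratio expresses it as gamma/beta:

$$
\mathcal{L}_{\mathrm{SimPO}} = -\,\mathbb{E}\left[\log\sigma\!\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w\mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l\mid x) - \gamma\right)\right]
$$

No reference policy appears anywhere in the loss, which is exactly why it fits in less memory.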
IPO · identity-preference
Azar et al., 2023. A drop-in replacement for the DPO loss that does not collapse when preferences are deterministic (every comparison has a clear winner). DPO can over-optimise on these and produce degenerate policies; IPO does not. Use this when the preference pack is curated, not noisy.
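In trl, IPO is not a separate trainer class; it rides on DPOTrainer with a different loss_type. A minimal sketch, assuming a trl version where the knob lives on DPOConfig:

```python
from trl import DPOConfig

# Same pair format and reference model as DPO; only the loss changes.
ipo_args = DPOConfig(
    output_dir="out/ipo",
    loss_type="ipo",   # squared-error objective from Azar et al. instead of the sigmoid DPO loss
    beta=0.1,          # same beta knob surface as DPO
)
```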
The decision tree
| Buyer's data shape | VRAM | Has SFT data? | Method | Reason |
|---|---|---|---|---|
| Preference pairs | < 24 GB | yes | ORPO | One pass, no reference model |
| Preference pairs | < 24 GB | no | SimPO | Reference-free; fits on a 4090 |
| Preference pairs | ≥ 24 GB | any | DPO | Battle-tested; lowest deviation risk |
| Preference pairs (deterministic) | any | any | IPO | DPO over-optimises on these |
| Binary good/bad labels | any | any | KTO | No pairs needed; eats the natural label format |
| < 200 captured pairs | any | any | KTO | Works with smaller, noisier label sets |
This is the same logic recommend_method() bakes in. The CLI surface lets the user override:
$ kolm compile --pref-method orpo specs/refund-flagger.spec.json
. picked orpo (vram_gb=22, has_sft_data=true)
. beta=0.1 lambda_=0.1
. train_pairs=2,118 sft_examples=14,400
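A sketch of the routing the table encodes; the real recommend_method() signature and field names are richer than this, so treat the arguments here as illustrative assumptions:

```python
def recommend_method(data_shape: str, vram_gb: float, has_sft_data: bool,
                     n_pairs: int | None = None, deterministic: bool = False) -> str:
    """Walk the decision table top to bottom; returns one of dpo/kto/orpo/simpo/ipo."""
    if data_shape == "binary_labels" or (n_pairs is not None and n_pairs < 200):
        return "kto"      # unpaired labels, or too few pairs to pair-train well
    if deterministic:
        return "ipo"      # curated pairs with clear winners: avoid DPO over-optimisation
    if vram_gb < 24:
        return "orpo" if has_sft_data else "simpo"   # single pass vs. reference-free under 24 GB
    return "dpo"          # enough headroom: take the battle-tested default
```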
How kolm wires this
The single entry point is preference_trainer(). It version-gates against trl: if the user's trl install does not have KTOTrainer, it falls back to DPOTrainer with loss_type="kto_pair" (this works in trl ≥ 0.8). If ORPOTrainer or CPOTrainer is missing, the method fails closed with a crisp upgrade message rather than silently mis-training.
from apps.trainer.preference import preference_trainer, PreferenceMethod
trainer = preference_trainer(
method=PreferenceMethod.ORPO,
model=peft_model,
tokenizer=tok,
train_dataset=pref_dataset,
args=training_args,
beta=0.1,
)
trainer.train()
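The version gate described above reduces to a lookup against the installed trl; a rough sketch (the real preference_trainer() wraps more than this, and the fallback construction is simplified away):

```python
import trl

def _resolve_trainer_class(method: str):
    """Pick the trl trainer for a method, failing closed rather than mis-training."""
    wanted = {"dpo": "DPOTrainer", "ipo": "DPOTrainer", "kto": "KTOTrainer",
              "orpo": "ORPOTrainer", "simpo": "CPOTrainer"}[method]
    cls = getattr(trl, wanted, None)
    if cls is None and method == "kto":
        return trl.DPOTrainer          # fallback: paired-KTO loss (loss_type="kto_pair") on DPOTrainer
    if cls is None:
        raise RuntimeError(
            f"trl=={trl.__version__} does not provide {wanted}; "
            f"upgrade trl or pass --pref-method dpo"
        )
    return cls
```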
The config builder copies the standard TrainingArguments fields into the method's trl-specific config class (DPOConfig, KTOConfig, ORPOConfig, or CPOConfig), then folds in the loss-specific knobs (beta, loss_type, lambda_, gamma_beta_ratio). The reference model is supplied explicitly only for DPO and IPO; KTOTrainer handles reference scoring internally; ORPO and SimPO are reference-free by construction.
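A sketch of that copy-then-fold step; _build_config() is a hypothetical name, and the real builder also validates which knobs a method accepts:

```python
from trl import CPOConfig, DPOConfig, KTOConfig, ORPOConfig

# One trl config class per method; IPO reuses DPOConfig with loss_type="ipo".
_CONFIG_FOR = {"dpo": DPOConfig, "ipo": DPOConfig, "kto": KTOConfig,
               "orpo": ORPOConfig, "simpo": CPOConfig}

def _build_config(method: str, training_args, **knobs):
    cfg = _CONFIG_FOR[method](**training_args.to_dict())   # copy the shared TrainingArguments fields
    for name, value in knobs.items():                       # beta, loss_type, lambda_, gamma_beta_ratio
        if hasattr(cfg, name):                               # drop knobs the chosen loss does not use
            setattr(cfg, name, value)
    return cfg
```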
What the receipt records
Every preference-optimization stage lands in the artifact's receipt with the following block:
"preference": {
"method": "orpo",
"beta": 0.1,
"loss_type": "orpo",
"ref_model": null,
"trl_version": "0.13.0",
"train_pairs": 2118,
"sft_examples": 14400,
"epochs": 2,
"final_loss": 0.42
}
The buyer's auditor can confirm which method shipped without re-reading the training logs. The verifier ignores the block (it is not load-bearing for signature validation), but it is part of the canonical-JSON manifest hash, so changes invalidate the receipt.
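Canonical JSON here means a byte-stable serialisation before hashing. A minimal sketch of the idea, not the verifier's exact implementation:

```python
import hashlib
import json

def manifest_hash(manifest: dict) -> str:
    # Sorted keys and fixed separators make serialisation deterministic, so any edit to the
    # preference block changes the digest and invalidates the receipt.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```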
Edge cases the router skips
Cold-start with no preferences. If the buyer has zero preference pairs and zero good/bad labels, the answer is not "pick a method"; the answer is "synthesize preferences." That is in apps/data/synth.py (the Evol-Instruct path can be used to generate hard-negative rejected completions from a chosen pack). The receipt then carries a synth_provenance block alongside the preference block.
Multi-turn preferences. All five methods assume single-turn preferences. Multi-turn conversational preference data needs a different loss (currently in flight in trl); when it lands, the router gets a new branch.
Very small adapter ranks. ORPO and SimPO both put more pressure on the adapter than DPO does, because there is no reference model softening the gradient. At r < 8 with rsLoRA off, both can underfit relative to DPO. The trainer warns when this combination is detected.
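The warning is a cheap pre-flight check. A sketch, assuming the adapter config exposes r and use_rslora the way peft's LoraConfig does:

```python
import warnings

def warn_if_adapter_underpowered(method: str, lora_config) -> None:
    # Reference-free losses push harder on the adapter; tiny ranks without rsLoRA tend to underfit.
    if method in {"orpo", "simpo"} and lora_config.r < 8 and not getattr(lora_config, "use_rslora", False):
        warnings.warn(
            f"{method} with r={lora_config.r} and rsLoRA disabled may underfit relative to DPO; "
            "consider r >= 8 or enabling rsLoRA."
        )
```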
If the buyer cannot tell which case they are in, run kolm doctor first; it inspects the dataset and prints the routing decision before any training starts.
Citations
Rafailov, R. et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290, 2023.
Ethayarajh, K. et al. KTO: Model Alignment as Prospect Theoretic Optimization. arXiv:2402.01306, 2024.
Hong, J. et al. ORPO: Monolithic Preference Optimization without Reference Model. arXiv:2403.07691, 2024.
Meng, Y. et al. SimPO: Simple Preference Optimization with a Reference-Free Reward. arXiv:2405.14734, 2024.
Azar, M. G. et al. A General Theoretical Paradigm to Understand Learning from Human Preferences. arXiv:2310.12036, 2023.