Bradley-Terry reward models.
Every preference-optimization recipe in the last five years has the same engine inside it: a Bradley-Terry log-odds loss over chosen-versus-rejected pairs. RLHF uses it to train a separate reward model. DPO bakes it into the policy update. Best-of-N inference uses it to score candidates at decode time. kolm trains it as a tiny regression head and ships it as its own signed artifact.
The loss, in one line
Bradley and Terry (1952) modeled the probability that option A beats option B as the logistic of the difference between their scalar strengths. Applied to language models: let r_θ(prompt, response) be a scalar score from a model with parameters θ. The negative log-likelihood of observing a chosen response y_w over a rejected response y_l is:
L(θ) = -log σ( β · ( r_θ(p, y_w) - r_θ(p, y_l) ) )
where σ is the sigmoid and β is a temperature. The whole RLHF reward pipeline is this loss, batched and averaged. There is no expected-future-reward to estimate, no value head, no separate inference policy. Two forward passes (the chosen and the rejected), a subtraction, a logsigmoid, a backward pass.
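A quick numeric sanity check of that expression, with toy scores and β = 1.0 (the values are made up for illustration):
```python
import torch
import torch.nn.functional as F

beta = 1.0
r_chosen, r_rejected = torch.tensor(1.3), torch.tensor(-0.4)

# -log sigma(beta * (r_c - r_r)); sigma(1.7) ~= 0.846, so loss ~= 0.168
loss = -F.logsigmoid(beta * (r_chosen - r_rejected))
print(loss.item())  # ~0.168
```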
What the model actually is
A reward model is a base language model with the LM head replaced by a single-output regression head. In Hugging Face terms this is AutoModelForSequenceClassification.from_pretrained(base, num_labels=1, problem_type="regression"). The base body produces hidden states for the prompt+response, the head reads the final non-pad token's hidden state and projects to a scalar. That scalar is the "strength" in Bradley-Terry.
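A minimal sketch of that construction, using the base checkpoint from the receipt below; the pad-token handling is defensive rather than kolm-specific, and the regression head is freshly initialized until trained:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "Qwen/Qwen2.5-3B-Instruct"
tok = AutoTokenizer.from_pretrained(base)
rm = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=1, problem_type="regression"
)
if tok.pad_token is None:  # defensive: some tokenizers ship without one
    tok.pad_token = tok.eos_token
rm.config.pad_token_id = tok.pad_token_id  # head pools at the last non-pad token

batch = tok(["<prompt + response>"], padding=True, return_tensors="pt")
with torch.no_grad():
    score = rm(**batch).logits.squeeze(-1)  # shape (1,): the BT "strength"
```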
kolm trains this as a LoRA adapter on the same base model the SFT artifact came from. The adapter is 30-50 MB; the base is loaded read-only and shared in memory with the SFT model's loader. The output is a *.rm.kolm artifact that the preference-optimization stage consumes and best-of-N inference can also call.
The pairwise collator
The single non-obvious engineering bit is the data collator. Chosen and rejected sequences have different lengths in general, so the collator pads each side to its own batch max; the trainer then runs two separate forward passes, gathers the score at each sequence's last non-pad token, and stacks the results into (B, 2). The trainer subclass computes the loss in three statements:
```python
def compute_loss(self, model, inputs, return_outputs=False):
    # One forward pass per side; _reward gathers the scalar score at each
    # sequence's last non-pad token, returning shape (B,).
    r_c = self._reward(model, inputs["input_ids_chosen"], inputs["attn_chosen"])
    r_r = self._reward(model, inputs["input_ids_rejected"], inputs["attn_rejected"])
    # Bradley-Terry negative log-likelihood: -log sigma(beta * (r_c - r_r)).
    loss = -F.logsigmoid(self.beta * (r_c - r_r)).mean()
    return (loss, {"rewards_chosen": r_c, "rewards_rejected": r_r}) if return_outputs else loss
```
That is the entirety of the algorithmic surface. Everything else (LoRA application, gradient checkpointing, bf16 autocast, eval-time pair-accuracy) is standard transformers boilerplate.
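For reference, a sketch of the two pieces the snippet above assumes: the pairwise collator and the _reward helper. The key names match the inputs dict; the pre-tokenized feature format is an assumption:
```python
from dataclasses import dataclass
from transformers import PreTrainedTokenizerBase

@dataclass
class PairwiseCollator:
    tokenizer: PreTrainedTokenizerBase

    def __call__(self, features):
        # Each feature is assumed pre-tokenized: {"input_ids_chosen": [...],
        # "input_ids_rejected": [...]}. Pad each side to its own batch max.
        chosen = self.tokenizer.pad(
            [{"input_ids": f["input_ids_chosen"]} for f in features],
            padding=True, return_tensors="pt",
        )
        rejected = self.tokenizer.pad(
            [{"input_ids": f["input_ids_rejected"]} for f in features],
            padding=True, return_tensors="pt",
        )
        return {
            "input_ids_chosen": chosen["input_ids"],
            "attn_chosen": chosen["attention_mask"],
            "input_ids_rejected": rejected["input_ids"],
            "attn_rejected": rejected["attention_mask"],
        }

# On the trainer subclass:
def _reward(self, model, input_ids, attention_mask):
    # The sequence-classification head already pools at the last non-pad
    # token (given config.pad_token_id), so logits is (B, 1); squeeze to (B,).
    return model(input_ids=input_ids, attention_mask=attention_mask).logits.squeeze(-1)
```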
Three places the trained reward model is used (and one where it is not)
| Stage | Role of the reward model | Where it sits in the loop |
|---|---|---|
| Classical RLHF (PPO) | Provides the scalar reward signal that PPO maximizes | Separate inference call once per sample; KL'd against a reference policy |
| DPO ablation | Not used; DPO derives the reward implicitly from the chosen/rejected gap | Sanity-check only: a trained RM should agree with DPO's implicit reward |
| Best-of-N inference | Scores N sampled candidates; the agent returns the argmax (see the sketch after the table) | Test-time, after the policy is frozen |
| Online filtering | Rejects low-reward responses before they reach the user | Inline gate at /v1/chat/completions; rejects below a configurable floor |
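A hypothetical shape for the last two rows; score_fn is assumed to wrap one RM forward pass (prompt + candidate → scalar), and floor is the configurable gate mentioned above:
```python
def best_of_n(prompt, candidates, score_fn, floor=None):
    # Score every sampled candidate with the frozen RM, keep the argmax.
    scored = [(score_fn(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    # Online-filtering variant: nothing ships if the best is below the floor.
    if floor is not None and best_score < floor:
        return None
    return best
```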
kolm's GRPO trainer can also use a learned reward model in place of a verifiable check when no programmatic check exists. The wiring is the same dict (REWARD_FUNCTIONS['learned_rm']); the trade-off is that the model is now exposed to all the failure modes of an imperfect RM (reward hacking, overconfidence on out-of-distribution generations) that verifiable rewards by construction avoid.
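A hypothetical sketch of that wiring. REWARD_FUNCTIONS is the dict named above; the (prompts, completions) → list-of-floats signature and the rm/tok handles are assumptions, not kolm's actual interface:
```python
import torch

REWARD_FUNCTIONS = {}  # stand-in for kolm's registry

def make_learned_rm_reward(rm, tok):
    def learned_rm_reward(prompts, completions):
        texts = [p + c for p, c in zip(prompts, completions)]
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            # Same scalar head as training; one reward per completion.
            return rm(**batch).logits.squeeze(-1).tolist()
    return learned_rm_reward

# At trainer setup (rm and tok loaded from the *.rm.kolm artifact):
# REWARD_FUNCTIONS["learned_rm"] = make_learned_rm_reward(rm, tok)
```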
The pair-accuracy metric
Loss numbers are not interpretable on their own; pair-accuracy is. For each held-out triple (prompt, chosen, rejected), the model passes if it scores the chosen response higher than the rejected one. Random is 0.5; well-trained reward models on summarization preferences land at 0.70-0.78 (Stiennon 2020), helpfulness/harmlessness models at 0.65-0.72 (Bai 2022). The kolm evaluator reports both the mean reward gap and the pair-accuracy at every evaluation step so a buyer can sanity-check before they ship the RM to a downstream preference run.
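Both numbers fall out of the stacked eval rewards in a couple of lines; a sketch, assuming the (B, 2) stacking described earlier has been split back into two tensors:
```python
import torch

def pair_metrics(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> dict:
    # Pair-accuracy: fraction of held-out pairs where chosen out-scores rejected.
    # Mean reward gap: average margin between the two sides.
    return {
        "pair_accuracy": (r_chosen > r_rejected).float().mean().item(),
        "reward_gap_mean": (r_chosen - r_rejected).mean().item(),
    }
```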
Why the reward model gets its own artifact
It would be cheaper to bake the RM into the policy artifact. We deliberately do not. Three reasons:
- Independent verification. A reviewer can re-run the RM against a fresh held-out preference set without touching the policy. If the RM agrees with humans 75% of the time on the buyer's eval set, that number ships on the artifact's manifest.
- Composable best-of-N. A single RM can score candidates from multiple policy artifacts. Coupling them would force re-distillation every time the policy changes.
- Auditable training signal. RLHF's most-publicized failure mode is a slightly-bad RM producing a confidently-bad policy. Shipping the RM as its own signed artifact lets the buyer's auditor compare the RM's K-score against the policy's K-score and notice the gap.
What the receipt records
"reward_model": {
"method": "bradley_terry_rm",
"base_model": "Qwen/Qwen2.5-3B-Instruct",
"lora_r": 16,
"lora_alpha": 32,
"beta": 1.0,
"n_pairs_train": 12480,
"n_pairs_eval": 656,
"loss": 0.412,
"pair_accuracy": 0.738,
"reward_gap_mean": 1.84,
"papers": [
"Bradley-Terry-1952",
"arXiv:2009.01325",
"arXiv:2203.02155",
"arXiv:2204.05862"
]
}
The buyer's auditor can confirm which base model carried the regression head, how many preference pairs trained it, and what the held-out agreement rate looked like. The canonical-JSON manifest hash covers the block, so a tampered receipt invalidates the artifact signature.
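For illustration, one common canonical-JSON convention is sorted keys, no whitespace, UTF-8; kolm's exact canonicalization is not specified here, so treat this as a sketch:
```python
import hashlib
import json

def canonical_hash(block: dict) -> str:
    # Deterministic serialization: key order and whitespace cannot vary,
    # so any change to the block changes the digest.
    canon = json.dumps(block, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()
```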
Edge cases worth naming
Reward hacking. A reward model whose training pairs differ mainly in superficial features (length, formatting) learns to reward those features. The fix is the same as in classical RLHF: train on diverse pairs (length-balanced, format-balanced), evaluate on a held-out set the trainer never sees, and inspect the highest- and lowest-reward generations by hand at every save_steps.
Pair quality dominates pair quantity. 1,000 high-confidence pairs beat 10,000 noisy pairs. The capture loop in kolm tags pairs with the annotator confidence (or judge confidence for AI-labeled pairs); the trainer optionally filters below a threshold.
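A sketch of that filter, assuming each captured pair carries the confidence tag described above (the field name is hypothetical):
```python
def filter_pairs(pairs: list[dict], min_confidence: float = 0.8) -> list[dict]:
    # Drop pairs the annotator (or judge) was unsure about; a smaller,
    # cleaner set trains a better RM than a larger, noisier one.
    return [p for p in pairs if p.get("confidence", 0.0) >= min_confidence]
```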
The RM saturates fast on narrow tasks. A redactor that gets PHI types correct on 99% of training pairs will produce a near-uniform reward distribution at inference; the gradient signal for any downstream stage approaches zero. The fix is to broaden the eval set or skip RLHF altogether and stay with SFT.
Distribution shift. A reward model trained on policy A's outputs may score policy B's outputs poorly even when they are objectively better. This is the "off-policy reward" problem. The common mitigation is iterated training on fresh on-policy preferences (iterative DPO is one variant); kolm prefers a fresh capture-and-retrain cycle on the new policy's outputs.
Where this fits in the kolm compile loop
The standard pipeline is capture → SFT → (RM → preference optimization) → (GRPO) → K-score gate → sign → ship. The RM is optional, and most enterprise distill jobs do not need it: SFT followed by direct DPO (which folds the RM into the policy update) is the common path. The standalone RM stage matters when the buyer wants best-of-N at inference time, when they are running classical PPO instead of DPO for compliance reasons, or when they want an independent gate on policy output that does not co-train with the policy.
Citations
Bradley, R. A. & Terry, M. E. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 1952. The original pairwise loss.
Stiennon, N. et al. Learning to summarize from human feedback. arXiv:2009.01325, 2020.
Ouyang, L. et al. Training language models to follow instructions with human feedback. arXiv:2203.02155, 2022. InstructGPT.
Bai, Y. et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862, 2022. The HH-RLHF dataset and procedure.