research · training · 11 min read

Bradley-Terry reward models.

Every preference-optimization recipe in the last five years has the same engine inside it: a Bradley-Terry log-odds loss over chosen-versus-rejected pairs. RLHF uses it to train a separate reward model. DPO bakes it into the policy update. Best-of-N inference uses it to score candidates at decode time. kolm trains it as a tiny regression head and ships it as its own signed artifact.

May 14, 2026 · Kolmogorov research · apps/trainer/reward.py

The loss, in one line

Bradley and Terry (1952) modeled the probability that option A beats option B as the logistic of the difference between their scalar strengths. Applied to language models: let rθ(prompt, response) be a scalar score from a model with parameters θ. The negative log-likelihood of observing a chosen response c over a rejected response r is:

L(θ) = -log σ( β · ( rθ(p, c) - rθ(p, r) ) )

where σ is the sigmoid and β is a temperature. The whole RLHF reward pipeline is this loss, batched and averaged. There is no expected-future-reward to estimate, no value head, no separate inference policy. Two forward passes (the chosen and the rejected), a subtraction, a logsigmoid, a backward pass.
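
Spelled out in code, with toy scalars standing in for model outputs (a sketch, not kolm's trainer):

import torch
import torch.nn.functional as F

beta = 1.0
r_chosen   = torch.tensor([1.3, 0.2])   # r_theta(p, c) for a batch of two pairs
r_rejected = torch.tensor([0.4, 0.9])   # r_theta(p, r) for the same pairs

# -log sigmoid(beta * (r_c - r_r)), averaged over the batch
loss = -F.logsigmoid(beta * (r_chosen - r_rejected)).mean()
# First pair: chosen wins by a margin, small contribution. Second pair: rejected wins, large contribution.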

What the model actually is

A reward model is a base language model with the LM head replaced by a single-output regression head. In Hugging Face terms this is AutoModelForSequenceClassification.from_pretrained(base, num_labels=1, problem_type="regression"). The base body produces hidden states for the prompt+response, the head reads the final non-pad token's hidden state and projects to a scalar. That scalar is the "strength" in Bradley-Terry.
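
A minimal sketch of that setup, using the base model named in the receipt below; the last-non-pad-token pooling is what the Hugging Face sequence-classification head does once pad_token_id is set:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "Qwen/Qwen2.5-3B-Instruct"          # the base named in the receipt below
tok = AutoTokenizer.from_pretrained(base)
rm = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=1, problem_type="regression"
)
# The head needs the pad token id to locate the last non-pad position per sequence.
rm.config.pad_token_id = tok.pad_token_id if tok.pad_token_id is not None else tok.eos_token_id

# One prompt+response in, one scalar "strength" out.
inputs = tok("Summarize the incident report.\n\nThe outage began at...", return_tensors="pt")
score = rm(**inputs).logits[0, 0]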

kolm trains this as a LoRA adapter on the same base model the SFT artifact came from. The adapter is 30-50 MB; the base is loaded read-only and shared with the SFT model's loader in memory. The output is a *.rm.kolm artifact that the preference-optimization stage consumes and best-of-N inference can also call.
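
Continuing the loading sketch above, the adapter wrapping might look like this; the r and alpha values mirror the receipt below, but the target modules are an assumption (typical attention projections), not a documented kolm default:

from peft import LoraConfig, TaskType, get_peft_model

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16, lora_alpha=32, lora_dropout=0.05,                   # r/alpha mirror the receipt below
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not confirmed by the post
    modules_to_save=["score"],                                # the fresh regression head trains in full
)
rm = get_peft_model(rm, lora_cfg)   # rm from the loading sketch above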

The pairwise collator

The single non-obvious engineering bit is the data collator. Chosen and rejected sequences have different lengths in general, so the collator pads each side to its own batch-max and hands the trainer two padded tensors. The trainer subclass runs them through the model as two separate forward passes, gathers the score at each sequence's last non-pad token, and stacks into (B, 2). The loss itself is three lines:

def compute_loss(self, model, inputs, return_outputs=False):
    # One forward pass per side; _reward returns the scalar score at each
    # sequence's last non-pad token.
    r_c = self._reward(model, inputs["input_ids_chosen"],   inputs["attn_chosen"])
    r_r = self._reward(model, inputs["input_ids_rejected"], inputs["attn_rejected"])
    # Bradley-Terry negative log-likelihood, averaged over the batch.
    loss = -F.logsigmoid(self.beta * (r_c - r_r)).mean()
    return (loss, {"rewards_chosen": r_c, "rewards_rejected": r_r}) if return_outputs else loss

That is the entirety of the algorithmic surface. Everything else (LoRA application, gradient checkpointing, bf16 autocast, eval-time pair-accuracy) is standard transformers boilerplate.
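
For completeness, a hedged sketch of the collator and the score gather the snippet above relies on; field names match the trainer code, padding details are illustrative:

import torch

class PairwiseCollator:
    """Pads the chosen and rejected sides to their own batch-max lengths."""
    def __init__(self, tokenizer):
        self.tok = tokenizer

    def __call__(self, batch):
        # Each example carries the full prompt+response text for both sides.
        chosen   = self.tok([ex["chosen"] for ex in batch],   padding=True, return_tensors="pt")
        rejected = self.tok([ex["rejected"] for ex in batch], padding=True, return_tensors="pt")
        return {
            "input_ids_chosen":   chosen["input_ids"],
            "attn_chosen":        chosen["attention_mask"],
            "input_ids_rejected": rejected["input_ids"],
            "attn_rejected":      rejected["attention_mask"],
        }

def reward_scores(model, input_ids, attention_mask):
    """What a _reward helper can reduce to: the sequence-classification head already
    pools at the last non-pad token (given config.pad_token_id), so logits is (B, 1)."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return logits.squeeze(-1)   # shape (B,), one scalar strength per sequence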

Three places the trained reward model is used

Stage | Role of the reward model | Where it sits in the loop
Classical RLHF (PPO) | Provides the scalar reward signal that PPO maximizes | Separate inference call once per sample; KL'd against a reference policy
DPO ablation | Not used; DPO derives the reward implicitly from the chosen/rejected gap | Sanity-check only: a trained RM should agree with DPO's implicit reward
Best-of-N inference | Scores N sampled candidates; the agent returns the argmax | Test-time, after the policy is frozen
Online filtering | Rejects low-reward responses before they reach the user | Inline gate at /v1/chat/completions; rejects below a configurable floor
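
The best-of-N row reduces to a few lines: sample N candidates from the frozen policy, score each with the RM, return the argmax. A sketch with hypothetical policy_sample and score_fn callables:

def best_of_n(prompt, policy_sample, score_fn, n=8):
    """Sample n candidates from the frozen policy and return the highest-reward one."""
    candidates = [policy_sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score_fn(prompt, c))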

kolm's GRPO trainer can also use a learned reward model in place of a verifiable check when no programmatic check exists. The wiring is the same dict (REWARD_FUNCTIONS['learned_rm']); the trade-off is that the policy is now exposed to all the failure modes of an imperfect RM (reward hacking, overconfidence on out-of-distribution generations) that verifiable rewards avoid by construction.
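
A sketch of what that wiring could look like, assuming kolm's GRPO reward functions are (prompt, completion) → float callables; the post names only the REWARD_FUNCTIONS dict, not its signature:

import torch

def make_learned_rm_reward(rm, tokenizer):
    """Wrap a trained reward model as a (prompt, completion) -> float callable."""
    def reward(prompt, completion):
        inputs = tokenizer(prompt + completion, return_tensors="pt").to(rm.device)
        with torch.no_grad():
            return rm(**inputs).logits[0, 0].item()
    return reward

# Hypothetical registration; the real mechanism may differ:
# REWARD_FUNCTIONS["learned_rm"] = make_learned_rm_reward(rm, tokenizer)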

The pair-accuracy metric

Loss numbers are not interpretable on their own; pair-accuracy is. For each held-out triple (prompt, chosen, rejected), the model passes if it scores the chosen response higher than the rejected one. Random is 0.5; well-trained reward models on summarization preferences land at 0.70-0.78 (Stiennon 2020), helpfulness/harmlessness models at 0.65-0.72 (Bai 2022). The kolm evaluator reports both the mean reward gap and the pair-accuracy at every evaluation step so a buyer can sanity-check before they ship the RM to a downstream preference run.
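
Both numbers fall out of the same per-pair score differences; a minimal sketch of the metric computation:

import torch

def pair_metrics(rewards_chosen, rewards_rejected):
    """Pair-accuracy and mean reward gap over held-out (chosen, rejected) score pairs."""
    gap = rewards_chosen - rewards_rejected
    return {
        "pair_accuracy":   (gap > 0).float().mean().item(),   # random guessing -> 0.5
        "reward_gap_mean": gap.mean().item(),
    }

# Three held-out pairs, two of them scored correctly:
pair_metrics(torch.tensor([1.2, 0.3, 1.8]), torch.tensor([0.5, 0.9, 0.1]))
# -> pair_accuracy ~ 0.67, reward_gap_mean ~ 0.6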

Why the reward model gets its own artifact

It would be cheaper to bake the RM into the policy artifact. We deliberately do not. Three reasons: best-of-N scoring and the online filtering gate call the RM after the policy is frozen, so it needs its own version and signature; a buyer running classical PPO for compliance reasons wants a reward signal that never co-trains with the policy it grades; and a standalone artifact gets its own receipt block (below), so an auditor can verify the RM's training data and held-out agreement independently of the policy.

What the receipt records

"reward_model": {
  "method": "bradley_terry_rm",
  "base_model": "Qwen/Qwen2.5-3B-Instruct",
  "lora_r": 16,
  "lora_alpha": 32,
  "beta": 1.0,
  "n_pairs_train": 12480,
  "n_pairs_eval": 656,
  "loss": 0.412,
  "pair_accuracy": 0.738,
  "reward_gap_mean": 1.84,
  "papers": [
    "Bradley-Terry-1952",
    "arXiv:2009.01325",
    "arXiv:2203.02155",
    "arXiv:2204.05862"
  ]
}

The buyer's auditor can confirm which base model carried the regression head, how many preference pairs trained it, and what the held-out agreement rate looked like. The canonical-JSON manifest hash covers the block, so a tampered receipt invalidates the artifact signature.
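
The canonical-JSON detail is what makes the hash stable across key order and whitespace. A sketch of the usual construction (sorted keys, compact separators); kolm's exact canonicalization rules are not spelled out in this post:

import hashlib, json

def canonical_hash(block: dict) -> str:
    """SHA-256 over a canonical JSON serialization (sorted keys, compact separators)."""
    canonical = json.dumps(block, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Any edit to the receipt block -- a changed pair_accuracy, a swapped base_model --
# produces a different digest, so the artifact signature no longer verifies.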

Edge cases worth naming

Reward hacking. A reward model trained on superficial features (length, formatting) rewards superficial features. The fix is the same as in classical RLHF: train on diverse pairs (length-balanced, format-balanced), evaluate on a held-out set the trainer never sees, and inspect highest- and lowest-reward generations by hand at every save_steps.

Pair quality dominates pair quantity. 1,000 high-confidence pairs beat 10,000 noisy pairs. The capture loop in kolm tags pairs with the annotator confidence (or judge confidence for AI-labeled pairs); the trainer optionally filters below a threshold.
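
The confidence filter is a one-liner over captured pairs; a sketch assuming each pair carries a confidence field (field name and threshold are illustrative, not kolm settings):

def filter_pairs(pairs, min_confidence=0.8):
    """Keep only pairs whose annotator (or judge) confidence clears the floor."""
    return [p for p in pairs if p.get("confidence", 0.0) >= min_confidence]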

The RM saturates fast on narrow tasks. A redactor that gets PHI types correct on 99% of training pairs will produce a near-uniform reward distribution at inference; the gradient signal for any downstream stage approaches zero. The fix is to broaden the eval set or skip RLHF altogether and stay with SFT.

Distribution shift. A reward model trained on policy A's outputs may score policy B's outputs poorly even when they are objectively better. This is the "off-policy reward" problem. The classical-RLHF mitigation is to re-collect comparisons from the current policy and retrain the RM in the loop; kolm prefers a discrete, fresh capture-and-retrain cycle on the new policy's outputs.

Where this fits in the kolm compile loop

The standard pipeline is capture → SFT → (RM → preference optimization) → (GRPO) → K-score gate → sign → ship. The RM is optional, and most enterprise distill jobs do not need it: SFT followed by direct DPO (which folds the RM into the policy update) is the common path. The standalone RM stage matters when the buyer wants best-of-N at inference time, when they are running classical PPO instead of DPO for compliance reasons, or when they want an independent gate on policy output that does not co-train with the policy.

Citations

Bradley, R. A. & Terry, M. E. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 1952. The original pairwise loss.

Stiennon, N. et al. Learning to summarize from human feedback. arXiv:2009.01325, 2020.

Ouyang, L. et al. Training language models to follow instructions with human feedback. arXiv:2203.02155, 2022. InstructGPT.

Bai, Y. et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862, 2022. The HH-RLHF dataset and procedure.