Mixture of judges and process reward models
Two evaluation upgrades that ship together. A panel of small judges replaces one expensive frontier judge. A step-by-step process reward model replaces a single end-of-answer scorer. Both make the K-score gate harder to fool.
Why one judge is not enough
The standard pattern in 2024 was: use GPT-4 as a judge. Score from 1 to 10. Call it done. The problem is well documented in Verga 2024 (PoLL): a single judge is biased in ways the calling code cannot see. It prefers longer answers. It rates its own family of models higher. It is inconsistent under order-swap.
Worse, if the judge comes from the same family as the model being evaluated, the eval becomes a tautology. Kolm artifacts get distilled into Llama and Qwen. If the judge is also a Llama, the eval skews generous on Llama-trained artifacts and harsh on Mistral ones, and there is no principled reason for the difference.
What PoLL changed
Verga 2024 proposed: run three smaller judges from different model families, take the majority vote (for binary preference) or the mean (for scalar scores). They reported that 3× Haiku-class judges matched 1× Opus-class judge on agreement-with-humans, at a fraction of the cost and with measurably less family bias.
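The aggregation rules themselves fit in a few lines. A minimal sketch, with illustrative function names rather than the kolm API:

from statistics import mean

# Sketch of the two PoLL aggregation rules from Verga 2024. Illustrative only.
def aggregate_binary(votes: list[bool]) -> bool:
    # Majority vote over binary preferences ("judge prefers A over B").
    return sum(votes) > len(votes) / 2

def aggregate_scalar(scores: list[float]) -> float:
    # Mean over scalar rubric scores.
    return mean(scores)

print(aggregate_binary([True, False, True]))  # True: 2 of 3 judges prefer A
print(aggregate_scalar([0.87, 0.81, 0.89]))   # 0.8566...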
The kolm implementation uses three judges by default: a Claude Haiku-class, a GPT-mini-class, and a Llama-3-70B-class. Each scores the same item. We log all three scores and the aggregate.
The kolm call
from apps.eval.judges_mix import JudgeMix
from apps.eval.rubrics import load_rubric  # assumed import path for load_rubric

mix = JudgeMix(judges=["haiku-4-5", "gpt-mini-2025", "llama-3-70b"])
result = mix.score(
    prompt=p,         # the prompt under evaluation
    response=r,       # the model output being scored
    rubric=load_rubric("medical_triage"),
    swap_order=True,  # score both (A, B) and (B, A) orderings
)
# result.scores = {"haiku-4-5": 0.87, "gpt-mini-2025": 0.81, "llama-3-70b": 0.89}
# result.aggregate = 0.857
# result.bias_corrected = 0.842
swap_order=True evaluates pairwise comparisons in both (A, B) and (B, A) orderings, then averages the two outcomes into the bias-corrected score. Order-swap correction is mechanical and removes one of the largest sources of judge noise.
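A minimal sketch of the correction, assuming a pairwise judge callable (the name is illustrative):

from statistics import mean

# Sketch of mechanical order-swap correction for a pairwise judge.
# judge_prefers_first is an assumed callable: returns 1.0 when the judge
# prefers the first-listed response, else 0.0.
def swap_corrected(judge_prefers_first, resp_a, resp_b):
    forward = judge_prefers_first(resp_a, resp_b)         # A in first position
    backward = 1.0 - judge_prefers_first(resp_b, resp_a)  # B first, flipped back
    # A position-neutral judge returns the same value both ways; averaging
    # cancels whatever first-position bias remains.
    return mean([forward, backward])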
Process reward models
For chain-of-thought tasks (math, multi-step reasoning, agentic flows), an end-only scorer cannot tell you where the model went wrong. The answer is right or wrong; the reward is one number.
A process reward model scores each step in the chain. Math-Shepherd (Wang 2023) trained the first credible PRM by mining step-level labels from outcome-only data: a step is labeled positive if most rollouts continuing from it land on the right answer. PRM800K (Lightman 2023) released 800,000 human-labeled step rewards for math, the dataset most kolm PRMs are initialized on.
How a PRM is structured
A PRM takes a partial chain of thought and returns a probability that the chain is still on a correct path. Concretely:
input: prompt + steps[0..i]
output: P(this step is correct | context)
For an N-step chain, a PRM produces N scores. Aggregation strategies:
- Product: P_total = Π P_i. Conservative; one bad step kills the chain.
- Min: P_total = min(P_i). Same intuition, less spiky.
- Mean: P_total = mean(P_i). Lenient; lets early errors get rescued.
Default in kolm is min. It matches the operator's intuition that an answer with even one bad reasoning step is not safe to ship.
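A minimal sketch of the three strategies over per-step scores (illustrative names, not the kolm API):

import math

# Sketch of the three PRM aggregation strategies over per-step probabilities.
def aggregate(step_scores: list[float], strategy: str = "min") -> float:
    if strategy == "product":
        return math.prod(step_scores)  # one bad step collapses the chain score
    if strategy == "min":
        return min(step_scores)        # chain is only as good as its worst step
    if strategy == "mean":
        return sum(step_scores) / len(step_scores)  # early errors can be rescued
    raise ValueError(f"unknown strategy: {strategy}")

steps = [0.96, 0.94, 0.88, 0.97]    # step scores from the receipt example below
print(aggregate(steps, "product"))  # ~0.770
print(aggregate(steps, "min"))      # 0.88 -- the kolm default
print(aggregate(steps, "mean"))     # 0.9375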
Step-DPO
Once you have step-level labels, you can do step-level preference optimization. Lai 2024 (Step-DPO) showed that DPO over step pairs (preferred-step vs. rejected-step at the same chain position) beats DPO over full-answer pairs on math reasoning by 4-7 points absolute, with no extra parameters.
from apps.trainer.preference import StepDPO

trainer = StepDPO(
    base="qwen2.5-7b-instruct",  # policy model to optimize
    prm="math-shepherd-7b",      # PRM supplying step-level preference labels
    beta=0.1,                    # DPO temperature on the implicit reward
    seed=42,
)
trainer.train(captures="math/step_preferences.jsonl")
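Under the hood the objective is standard DPO applied at a single chain position. A minimal sketch of the per-pair loss, assuming summed token log-probs are already computed for both steps under the policy and a frozen reference model (names are illustrative, not the kolm trainer internals):

import torch.nn.functional as F

# Sketch of the Step-DPO per-pair loss. The four tensors hold summed token
# log-probs of one step under the policy and a frozen reference model.
def step_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward margin of the preferred step (w) over the rejected
    # step (l), measured relative to the reference model.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()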
Where this fits in the K-score gate
The K-score is K = 0.40·A + 0.15·S + 0.15·L + 0.15·C + 0.15·V. The judges-mix is what produces the A term (Accuracy) for tasks where there is no exact verifier. The PRM is what produces the per-step component when the task is multi-step (math, planning, agentic). For coding tasks with unit tests, neither shows up; the test suite is the verifier.
The gate is K ≥ 0.85 AND pass_rate ≥ 0.85. The judges-mix votes go into the pass_rate side: a response "passes" only if at least 2 of 3 judges score it ≥ rubric_threshold.
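A minimal sketch of the gate logic, with the weights and thresholds from the text (function names are illustrative):

# Sketch of the K-score gate, assuming per-item judge scores are collected.
WEIGHTS = {"A": 0.40, "S": 0.15, "L": 0.15, "C": 0.15, "V": 0.15}

def k_score(components: dict[str, float]) -> float:
    return sum(w * components[k] for k, w in WEIGHTS.items())

def item_passes(judge_scores: list[float], rubric_threshold: float) -> bool:
    # A response passes only when at least 2 of 3 judges clear the threshold.
    return sum(s >= rubric_threshold for s in judge_scores) >= 2

def gate_open(components: dict[str, float], pass_rate: float) -> bool:
    return k_score(components) >= 0.85 and pass_rate >= 0.85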
What the receipt records
{
  "eval": {
    "k_score": 0.90,
    "components": {"A": 0.88, "S": 0.93, "L": 0.96, "C": 0.95, "V": 0.84},
    "judges_mix": {
      "judges": ["haiku-4-5", "gpt-mini-2025", "llama-3-70b"],
      "scores": [0.87, 0.81, 0.89],
      "aggregate": 0.857,
      "bias_corrected": 0.842,
      "order_swap": true
    },
    "prm": {
      "model": "math-shepherd-7b@sha256:c4d2e1",
      "agg": "min",
      "step_scores": [0.96, 0.94, 0.88, 0.97],
      "chain_score": 0.88
    }
  }
}
Every number a regulator might ask about is in the receipt. If the K-score is 0.90, you can reconstruct exactly which judges voted what, which step in which chain scored the lowest, and what bias correction was applied.
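A minimal sketch of that reconstruction, assuming the receipt above is saved as receipt.json (an illustrative path):

import json
from statistics import mean

# Sketch: recompute the receipt's derived numbers from its raw fields.
e = json.load(open("receipt.json"))["eval"]

c = e["components"]
k = 0.40 * c["A"] + 0.15 * (c["S"] + c["L"] + c["C"] + c["V"])
assert abs(k - e["k_score"]) < 0.005

jm = e["judges_mix"]
assert abs(mean(jm["scores"]) - jm["aggregate"]) < 0.001

prm = e["prm"]
assert min(prm["step_scores"]) == prm["chain_score"]  # agg == "min"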
Edge cases
Judge collusion. If all three judges are trained on overlapping data (e.g., all use RLHF datasets from the same vendor), they fail in correlated ways. Mitigation: pick judges from genuinely different families, and audit judge agreement quarterly.
PRM staleness. A PRM trained on 2024 math data may underweight novel notation. K-score the PRM itself on a held-out current set; refresh when its accuracy drops below 0.85.
Cost. Three Haiku-class judges per item cost roughly a third of one Opus-class judge, with better agreement and traceability. We log judge_cost_usd on every eval so buyers see the spend.
Pairwise vs pointwise. Pointwise (score from 1-10) has higher variance and worse agreement than pairwise (A vs B). Default to pairwise; fall back to pointwise only when no comparison set exists.
Citations
Verga et al. 2024. Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv:2404.18796
Wang et al. 2023. Math-Shepherd: Verify and Reinforce LLMs Step-by-step. arXiv:2312.08935
Lightman et al. 2023. Let's Verify Step by Step. arXiv:2305.20050
Lai et al. 2024. Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning. arXiv:2406.18629