GRPO with verifiable rewards.
The algorithm behind DeepSeek R1's chain-of-thought elicitation, with the boring engineering parts written down. No value head, no separate reward model, and the reward function shares code with the K-score evaluator so the training signal and the release gate score the same output the same way.
What GRPO is, in one paragraph
Group Relative Policy Optimization, introduced in Shao et al., 2024 (DeepSeekMath) and made famous by DeepSeek R1 in 2025, is online RL for language models without a value head. For each prompt you sample G completions from the current policy, score each completion with a reward function, and the advantage for completion i is its score minus the group mean, divided by the group standard deviation. PPO's clipped surrogate then optimises the policy against those advantages with a KL penalty against a reference policy. The piece you drop relative to PPO is the value head; the piece you add is the requirement that the reward function is fast enough to run G times per prompt per step, which is fine if the reward is a deterministic check rather than a separate neural network.
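The advantage computation is small enough to write out. A minimal sketch, assuming one prompt's worth of per-completion rewards; the eps guard against a zero-variance group is an implementation detail, not the paper's notation:

import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    # Advantage of each completion relative to its own group of G samples:
    # (reward - group mean) / group std.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, G=8 completions, binary correctness rewards:
print(group_relative_advantages([1, 0, 0, 1, 1, 0, 0, 0]))

Every token of completion i is trained against that one scalar; the per-token baseline PPO's value head would have supplied is replaced by the group statistics.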
Why no value head matters
The value head in PPO is the per-token estimate of expected future reward. It exists because rewards in classic RLHF are sparse (one scalar per generation) and the policy gradient needs a per-token baseline to have manageable variance. Training the value head well is its own problem: it needs the same KL-controlled trajectories, it adds 5-15% parameters depending on the architecture, and a badly trained value head silently corrupts the policy gradient.
GRPO sidesteps this. The baseline comes from the group: with G=8 completions per prompt, each completion's deviation from the group mean is enough signal to say whether it did better or worse than its peers, and that is all the policy gradient needs. The cost is sampling G completions instead of one, which is a wall-clock cost, not a parameter cost.
The three reward families kolm ships
GRPO is only as useful as the reward function. RLHF burned a generation of engineers on the realisation that a slightly-bad reward model produces a confidently-bad policy. The fix in RLVR (Reinforcement Learning from Verifiable Rewards, Lambert et al., 2024) is to skip the reward model entirely: use deterministic checks where you have them, accept that you cannot RL the cases where you do not.
| Family | Reward function | Use when |
|---|---|---|
| code_exec | Subprocess sandbox + unit tests; reward = passed / total | Code generation with unit tests, SQL with expected results, scripts with fixture inputs |
| math_checker | Numeric tolerance or symbolic equivalence vs gold | Math word problems, calculations with extractable final answers |
| schema_validator | JSON-schema validation or regex match | Structured extraction, tool-call emission, classifier outputs |
The buyer can register their own. A reward function takes (prompts, completions, **kwargs) and returns list[float] in [0, 1]. trl sums the outputs of multiple reward functions inside the trainer, which is the right shape for combining correctness with a structural reward (e.g., the completion must include a non-empty <think>...</think> block).
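A custom reward in that shape is a few lines. A hedged sketch of a schema-style check (the function name, the required_keys default, and the regex extraction are illustrative, not part of the shipped schema_validator):

import json
import re

def json_block_reward(prompts, completions, required_keys=("answer",), **kwargs):
    # 1.0 if the completion contains a JSON object with every required key, else 0.0.
    # Same contract as the built-ins: (prompts, completions, **kwargs) -> list[float] in [0, 1].
    scores = []
    for completion in completions:
        match = re.search(r"\{.*\}", completion, re.DOTALL)
        try:
            obj = json.loads(match.group(0)) if match else None
        except json.JSONDecodeError:
            obj = None
        ok = isinstance(obj, dict) and all(k in obj for k in required_keys)
        scores.append(1.0 if ok else 0.0)
    return scores

Registered alongside the built-ins, it participates in the same summed reward and lands in the same receipt entry as everything else.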
The reward-is-the-eval trick
The reward function and the K-score evaluator look at the same model output and apply the same check. In kolm this is enforced by routing both through the same code path: the K-score evaluator suite calls REWARD_FUNCTIONS[name](...); the GRPO trainer calls into the same dict. If you change the math-equivalence tolerance, both move together. The buyer's auditor can never see a high K-score alongside a low train-time reward for the same output, because the gate that decided to ship is the same check that drove the gradient.
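The pattern reduces to one dict lookup. A self-contained toy with a stand-in math_checker (the real checker also does symbolic equivalence; this one only compares final numbers):

def math_checker(prompts, completions, references, tol=1e-6, **kwargs):
    # Toy stand-in: parse the completion as a number and compare against the gold answer.
    scores = []
    for completion, gold in zip(completions, references):
        try:
            scores.append(1.0 if abs(float(completion.strip()) - float(gold)) <= tol else 0.0)
        except ValueError:
            scores.append(0.0)
    return scores

# One registry; the GRPO trainer and the K-score evaluator both resolve checks through it,
# so changing tol (or swapping the checker) moves the training signal and the gate together.
REWARD_FUNCTIONS = {"math_checker": math_checker}

train_reward = REWARD_FUNCTIONS["math_checker"](["17 * 23?"], ["391"], references=["391"])
gate_score = REWARD_FUNCTIONS["math_checker"](["17 * 23?"], ["391"], references=["391"])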
The minimal call
from apps.trainer.grpo import grpo_trainer, GRPOTrainConfig, REWARD_FUNCTIONS, make_format_reward
from functools import partial
# 1. Bind references for the reward functions (these are per-prompt gold answers)
correctness = partial(REWARD_FUNCTIONS['math_checker'], references=gold_answers)
structure = make_format_reward('<think>', '</think>')
# 2. Build the trainer
trainer = grpo_trainer(
    model=peft_model,
    tokenizer=tok,
    train_dataset=prompts_only_dataset,
    reward_funcs=[correctness, structure],
    args=GRPOTrainConfig(
        num_generations=8,
        max_completion_length=1024,
        temperature=0.7,
        learning_rate=5e-6,
        beta=0.04,
    ),
)
trainer.train()
The dataset is prompts-only; GRPO samples its own completions. num_generations is the G in the group; the published DeepSeekMath paper used 64, R1 used 16, and 8 is the right starting point for a 7B-class model on a single H100. beta is the KL coefficient against the reference policy; 0.04 is the trl default and matches the DeepSeekMath recipe.
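For completeness, the prompts-only dataset in the call above is just a 'prompt' column, and gold_answers has to line up with it row for row, since the references bound into math_checker are per-prompt gold answers. A minimal sketch with the datasets library (the two example rows are invented):

from datasets import Dataset

rows = [
    {"prompt": "A train covers 120 km in 1.5 h. What is its average speed in km/h?"},
    {"prompt": "What is 17 * 23?"},
]
gold_answers = ["80", "391"]  # same order as rows

prompts_only_dataset = Dataset.from_list(rows)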
What the receipt records
Every GRPO stage lands in the artifact's receipt with the following block:
"grpo": {
"method": "grpo",
"trl_version": "0.13.0",
"papers": ["arXiv:2402.03300", "arXiv:2501.12948"],
"num_generations": 8,
"max_completion_length": 1024,
"max_prompt_length": 512,
"temperature": 0.7,
"top_p": 0.95,
"learning_rate": 5e-6,
"beta": 0.04,
"epsilon": 0.2,
"num_train_epochs": 1,
"seed": 42,
"reward_funcs": ["math_checker", "structure"],
"train_examples": 4096,
"final_loss": 0.18,
"final_reward_mean": 0.74
}
The buyer's auditor can confirm which reward functions drove the gradient, with which hyperparameters, and what the terminal reward distribution looked like. The verifier does not need the block to validate the signature, but the canonical-JSON manifest hash covers it, so swapping it out invalidates the receipt.
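Canonical JSON means serialising with a fixed key order and no insignificant whitespace before hashing, so every byte of the block feeds the digest. A sketch of the property, assuming sorted-keys canonicalisation and SHA-256 (kolm's actual canonical form and digest are not specified here):

import hashlib
import json

def manifest_hash(manifest: dict) -> str:
    # Canonical form: sorted keys, compact separators, UTF-8 bytes, then SHA-256.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

receipt = {"grpo": {"beta": 0.04, "num_generations": 8}}
tampered = {"grpo": {"beta": 0.04, "num_generations": 16}}

assert manifest_hash(receipt) != manifest_hash(tampered)  # a swapped block breaks the hash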
Edge cases worth naming
Reward hacking is real. If code_exec is the only reward, the model will discover that printing the expected output literally (without computing it) passes the test. The fix is the same as in classical RL: add a structural reward (the <think> block must be non-empty), add a length penalty for trivially short completions, and inspect samples by hand at every checkpoint. The trainer logs ten random samples per save_steps interval for this reason.
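Both auxiliary rewards fit the same contract as the correctness checks. A sketch, with the length threshold and whitespace tokenisation as illustrative choices (the shipped make_format_reward already covers the tag check; these are written out only to show the shape):

import re

def think_block_reward(prompts, completions, **kwargs):
    # 1.0 only if a non-empty <think>...</think> block is present.
    return [1.0 if re.search(r"<think>\s*\S.*?</think>", c, re.DOTALL) else 0.0
            for c in completions]

def length_reward(prompts, completions, min_tokens=32, **kwargs):
    # Ramp from 0 to 1 as the completion approaches min_tokens; trivially short
    # completions that hack the correctness check lose this part of the score.
    return [min(1.0, len(c.split()) / min_tokens) for c in completions]

Appended to reward_funcs, they are summed with the correctness reward, so a completion that prints the expected output without reasoning still pays for the missing trace.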
Wall-clock cost scales with G. Eight completions per prompt at 1024 tokens each on a 7B model is ~8 seconds per step on an H100 with vLLM as the rollout backend. num_generations=16 doubles that. The published R1 recipe ran for tens of thousands of steps; a kolm GRPO stage on a buyer's distill is typically 200-2000 steps because the base is already on-distribution for the task.
KL drift. If beta is too small, the policy drifts far from the reference and the buyer's other gates (HHEM grounding, K-score safety) collapse before the correctness reward saturates. The trainer evaluates the K-score every save_steps and refuses to commit any checkpoint whose K-score falls below the pre-GRPO baseline.
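The gate itself is a small amount of logic. A hypothetical sketch of the idea, not kolm's actual callback (evaluate_kscore and the checkpoint layout are stand-ins):

import shutil
from pathlib import Path

def commit_or_discard(checkpoint_dir: str, evaluate_kscore, baseline: float) -> bool:
    # Keep a checkpoint only if its K-score has not regressed below the
    # pre-GRPO baseline; otherwise delete it so it can never be shipped.
    if evaluate_kscore(checkpoint_dir) < baseline:
        shutil.rmtree(Path(checkpoint_dir))
        return False
    return True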
Verifiable rewards are a privileged setting. Many tasks do not have a programmatic check. For those, GRPO is the wrong tool; preference optimization (DPO, ORPO, SimPO) on a paired-comparison dataset is. The router in apps/trainer/preference.py covers that branch. GRPO is for the cases where the buyer has, or can write, a checker.
Where this fits in the kolm compile loop
The standard pipeline is SFT → preference (optional) → GRPO (optional) → K-score gate → sign → ship. SFT teaches the format and the base capability; preference optimization aligns generic preferences; GRPO sharpens the reasoning trace and pushes correctness on tasks with verifiable answers. Each stage is opt-in via the spec; each stage lands in the receipt. The K-score gate at 0.85 binds across all three.
For most enterprise distill jobs, SFT alone hits 0.85. GRPO is the right add-on when (a) the base task has a verifiable check, (b) the SFT model still misses on edge cases, and (c) the buyer accepts a longer training run for higher accuracy. Reasoning-heavy verticals (medical diagnosis with guideline checks, legal-citation lookup with citation validity, code generation with test suites) hit all three.
Citations
Shao, Z. et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300, 2024.
DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, 2025.
Lambert, N. et al. TÜLU 3: Pushing Frontiers in Open Language Model Post-Training. arXiv:2411.15124, 2024 (RLVR formulation).