Test-time compute: best-of-N, self-consistency, reflexion
Three ways to spend more tokens at inference and get a better answer. Each has a regime where it wins, a regime where it wastes money, and a receipt line that tells you which one happened.
The premise
For a fixed model, the only knob left at inference time is how many tokens you are willing to spend on one answer. The literature calls this test-time compute. The point is not to make the model bigger. The point is to use the same model harder when the task warrants it.
Snell 2024 showed that for problems where verification is cheaper than generation, you can match a 14× larger model by spending compute at test time instead of train time. That is the thesis. The three methods below are the most credible ways to cash it in.
1. Best-of-N
Sample N completions independently. Pick the one with the highest score under some verifier. If you have a process reward model, score per step. If you have an outcome reward model or a unit test, score the final answer.
def best_of_n(model, prompt, verifier, N=8):
    # Sample N independent completions; temperature > 0 supplies the diversity.
    completions = [model.generate(prompt, temperature=0.7, seed=i) for i in range(N)]
    scores = [verifier(c) for c in completions]
    return completions[max(range(N), key=scores.__getitem__)]  # argmax
Best-of-N is the right tool when the verifier is much cheaper than the generator and the task has a clean notion of correctness. Coding with a test suite. Math with a checker. Tool-call schemas with a validator.
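For the tool-call case, the verifier can be as small as a schema check. A minimal sketch using the jsonschema package; the schema_validator name and the binary 0/1 scoring are illustrative choices here, not a fixed API:

import json
import jsonschema

def schema_validator(tool_schema):
    # Returns a verifier: 1.0 if the completion parses as JSON and validates
    # against the schema, 0.0 otherwise.
    def verify(completion):
        try:
            jsonschema.validate(json.loads(completion), tool_schema)
            return 1.0
        except (json.JSONDecodeError, jsonschema.ValidationError):
            return 0.0
    return verify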
What kills it: weak verifiers. If your reward model is noisy at the top of the distribution, best-of-N selects on noise. Cobbe 2021 reported this as the verifier ceiling, and it is the first thing to measure when you scale N.
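You can see the ceiling in a toy Monte Carlo (illustrative numbers, not from Cobbe's paper): correct completions score near 1, wrong ones near 0, and the verifier adds Gaussian noise. Selected accuracy climbs with N, flattens, and at large N the top score is increasingly won by noise:

import random

def selected_accuracy(n, p_correct=0.3, sigma=0.6, trials=5000):
    hits = 0
    for _ in range(trials):
        correct = [random.random() < p_correct for _ in range(n)]
        scores = [(1.0 if ok else 0.0) + random.gauss(0.0, sigma) for ok in correct]
        hits += correct[scores.index(max(scores))]  # was the top-scored sample right?
    return hits / trials

for n in (1, 4, 16, 64, 256):
    print(n, round(selected_accuracy(n), 3))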
2. Self-consistency
Sample N completions with chain-of-thought. Majority-vote the final answer. No verifier required, just an answer extractor and an equality predicate.
from collections import Counter

def self_consistency(model, prompt, N=40):
    answers = [extract_answer(model.generate(prompt, temperature=0.7, seed=i))
               for i in range(N)]
    return Counter(answers).most_common(1)[0][0]  # majority vote on final answers
Wang 2022 ran this on GSM8K and showed monotone gains up to N=40 sampled paths. The premise: if the model knows the right path, several samples will land on the same answer, while wrong paths tend to fail in different ways and rarely converge on the same wrong answer.
Self-consistency wins when (1) the answer space is small or discrete, (2) you have no verifier, and (3) the model is roughly calibrated. It loses on open-ended generation where there is no canonical answer to vote on.
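The equality predicate does real work here: if 7,000 and 7000 do not compare equal, the vote splits. A minimal sketch of extract_answer for numeric tasks, assuming GSM8K-style completions that end with the final number:

import re

_NUM = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_answer(completion):
    # Take the last number in the completion and strip thousands separators,
    # so '7,000' and '7000' vote as the same answer.
    nums = _NUM.findall(completion)
    return nums[-1].replace(",", "") if nums else None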
3. Reflexion
Generate. Critique. Regenerate using the critique. Loop until a stopping condition (verifier passes, max iterations, no further change).
def reflexion(model, prompt, MAX_ITERS=3):
    output = model.generate(prompt)
    for _ in range(MAX_ITERS):
        critique = model.critique(prompt, output)
        if critique.is_clean:  # critic found nothing to fix
            break
        # Regenerate, conditioned on the prior attempt and its critique.
        output = model.generate(prompt, prior=output, critique=critique)
    return output
Shinn 2023 framed Reflexion as verbal reinforcement learning: the model writes notes to itself about what went wrong, and conditions the next attempt on those notes. It works best on tasks where the model can identify its own errors after the fact better than it can avoid them in the first place. Long-horizon agentic tasks. Code with run-time errors. Logical puzzles with check steps.
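The model.critique call above is schematic. One way to back it with a plain generate() call; the Critique class, the prompt wording, and the CLEAN sentinel are all illustrative:

from dataclasses import dataclass

@dataclass
class Critique:
    notes: str
    is_clean: bool

CRITIQUE_TEMPLATE = (
    "Task:\n{prompt}\n\nAttempt:\n{output}\n\n"
    "List the concrete errors in the attempt. If there are none, reply exactly: CLEAN."
)

def critique(model, prompt, output):
    notes = model.generate(CRITIQUE_TEMPLATE.format(prompt=prompt, output=output),
                           temperature=0.0)  # deterministic critic
    return Critique(notes=notes, is_clean=notes.strip() == "CLEAN")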
When each wins
| Setup | Best fit | Why |
|---|---|---|
| Math with checker | Best-of-N + PRM | Verifier is exact and per-step |
| Math without checker | Self-consistency | Discrete answers, no verifier needed |
| Code with unit tests | Best-of-N + tests | Tests are the verifier |
| Code without tests | Reflexion | Model critiques compile errors |
| Tool calls | Best-of-N + schema | JSON Schema is the verifier |
| Open generation | None of these | No notion of correctness |
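As dispatch logic, the table reduces to a few branches. An illustrative router; all three predicates are hypothetical names:

def pick_method(has_verifier, discrete_answers, can_self_critique):
    if has_verifier:
        return "best_of_n"         # tests, checker, or schema is the verifier
    if discrete_answers:
        return "self_consistency"  # vote over a small answer space
    if can_self_critique:
        return "reflexion"         # model can spot its own errors post hoc
    return "one_shot"              # open generation: no notion of correctness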
The kolm call
from apps.eval import test_time
result = test_time.best_of_n(
model=runtime,
prompt=p,
n=8,
verifier=schema_validator(tool_schema),
seed=42,
)
The verifier is the same object the K-score gate uses. Same scoring function in training, evaluation, and inference. This is the kolm-specific design point: the reward function is the eval is the inference-time picker.
What the receipt records
{
"test_time": {
"method": "best_of_n",
"n": 8,
"verifier_id": "schema/tool_call_v1",
"scores": [0.97, 0.94, 0.99, 0.21, 0.88, 0.92, 0.99, 0.96],
"selected_index": 2,
"tokens_total": 4831,
"cost_usd": 0.0024
}
}
If a buyer asks "did you spend more compute on the answer I just got than on the one yesterday?", the receipt answers exactly that. tokens_total is the full spend across all N samples, not just the selected one.
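A buyer-side sanity check is a few lines. A sketch assuming the receipt JSON above is in hand as a string (receipt_blob is a hypothetical variable):

import json

tt = json.loads(receipt_blob)["test_time"]
assert len(tt["scores"]) == tt["n"]
# Ties break toward the lowest index, which is why index 2 beats the equal 0.99 at index 6.
assert tt["selected_index"] == max(range(tt["n"]), key=tt["scores"].__getitem__)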
Edge cases
Temperature. Best-of-N and self-consistency need diversity. At temperature 0 all N samples are identical, so you pay N× for a one-shot answer. The defaults are 0.7 for reasoning, 0.4 for code, 1.0 for creative.
Stopping criteria. Reflexion needs an upper bound (MAX_ITERS=3 is a good default). Without one, an under-calibrated critic can loop forever on a correct answer.
Cost. N=8 costs 8× the tokens of a single attempt. The break-even is the point where (accuracy gain × the value of a correct answer to the user) exceeds the marginal 7× generation cost. For high-stakes tools (medical triage, legal drafting), it is almost always positive.
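As arithmetic, with hypothetical numbers (the accuracies and dollar values below are placeholders, not measurements):

base_acc, bon_acc = 0.80, 0.92      # pass rate at N=1 vs N=8, hypothetical
value_per_correct = 1.00            # USD a correct answer is worth to the user
cost_per_sample = 0.0003            # USD per completion, matching the receipt's 0.0024 / 8

gain = (bon_acc - base_acc) * value_per_correct  # expected value added per query
extra_cost = 7 * cost_per_sample                 # marginal spend over a single attempt
print(f"net value per query: {gain - extra_cost:+.4f} USD")  # positive => N=8 pays off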
Verifier-of-the-verifier. If your PRM is a smaller model, run it through K-score quarterly. PRMs drift the same way generators do.
Citations
Cobbe et al. 2021. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168
Wang et al. 2022. Self-Consistency Improves Chain of Thought Reasoning. arXiv:2203.11171
Shinn et al. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366
Snell et al. 2024. Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. arXiv:2408.03314