
Test-time compute: best-of-N, self-consistency, reflexion

Three ways to spend more tokens at inference and get a better answer. Each has a regime where it wins, a regime where it wastes money, and a receipt line that tells you which one happened.

2026-05-14 · kolm research · 6 min read

The premise

For a fixed model, the only knob left at inference time is how many tokens you are willing to spend on one answer. The literature calls this test-time compute. The point is not to make the model bigger. The point is to use the same model harder when the task warrants it.

Snell 2024 showed that for problems where verification is cheaper than generation, you can match a 14× larger model by spending compute at test time instead of train time. That is the thesis. The three methods below are the most credible ways to cash it in.

1. Best-of-N

Sample N completions independently. Pick the one with the highest score under some verifier. If you have a process reward model, score per step. If you have an outcome reward model or a unit test, score the final answer.

completions = [model.generate(prompt, temperature=0.7, seed=i) for i in range(N)]
scores = [verifier(c) for c in completions]
return completions[max(range(N), key=lambda i: scores[i])]

Best-of-N is the right tool when the verifier is much cheaper than the generator and the task has a clean notion of correctness. Coding with a test suite. Math with a checker. Tool-call schemas with a validator.

What kills it: weak verifiers. If your reward model is noisy at the top of the distribution, best-of-N selects on noise. Cobbe 2021 reported this as the verifier ceiling, and it is the first thing to measure when you scale N.
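The selection loop above can be made concrete with a runnable sketch. The generator and verifier here are toy stand-ins, not the kolm runtime: each seed deterministically proposes a different integer, and the "reward model" just scores distance to a known target.

```python
def generate(prompt, seed):
    # Toy stand-in for model.generate: each seed proposes a different integer.
    return (seed * 5) % 17

def verifier(completion, target=12):
    # Toy stand-in for a reward model: higher score means closer to target.
    return -abs(completion - target)

def best_of_n(prompt, n=8):
    # Sample N candidates independently, score each, keep the argmax.
    completions = [generate(prompt, seed=i) for i in range(n)]
    scores = [verifier(c) for c in completions]
    best = max(range(n), key=lambda i: scores[i])
    return completions[best], scores[best]

answer, score = best_of_n("pick the number closest to 12", n=8)
```

Swapping in a real model and a real verifier changes nothing structurally; the whole method is the three lines of sample, score, argmax.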

2. Self-consistency

Sample N completions with chain-of-thought. Majority-vote the final answer. No verifier required, just an answer extractor and an equality predicate.

from collections import Counter

answers = [extract_answer(model.generate(prompt, temperature=0.7, seed=i))
           for i in range(N)]
return Counter(answers).most_common(1)[0][0]

Wang 2022 ran this on GSM8K and showed monotone gains up to N=40. The premise: if the model knows the right path, several paths will land on the same answer. Wrong paths disagree on different things.

Self-consistency wins when (1) the answer space is small or discrete, (2) you have no verifier, and (3) the model is roughly calibrated. It loses on open-ended generation where there is no canonical answer to vote on.
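As a runnable sketch of the vote: the sampler below is a toy stand-in for generate-plus-extract, built so that most sampled chains land on the right answer while wrong chains scatter, which is exactly the regime where majority voting helps.

```python
from collections import Counter

def sample_answer(prompt, seed):
    # Toy stand-in for generate + extract_answer: most sampled chains
    # land on the right answer (42); the rest scatter across wrong ones.
    return 42 if seed % 3 != 0 else (seed * 7) % 100

def self_consistency(prompt, n=40):
    # Sample N chains, extract each final answer, return the plurality vote.
    answers = [sample_answer(prompt, seed=i) for i in range(n)]
    return Counter(answers).most_common(1)[0][0]

voted = self_consistency("what is 6 * 7?", n=40)
```

Note that the vote only needs an equality predicate over extracted answers, which is why the method fails on open-ended generation: two good essays are never equal.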

3. Reflexion

Generate. Critique. Regenerate using the critique. Loop until a stopping condition (verifier passes, max iterations, no further change).

output = model.generate(prompt)
for _ in range(MAX_ITERS):
    critique = model.critique(prompt, output)
    if critique.is_clean:
        break
    output = model.generate(prompt, prior=output, critique=critique)
return output

Shinn 2023 framed reflexion as verbal self-improvement: the model writes notes to itself about what went wrong, and conditions the next attempt on those notes. It works best on tasks where the model can identify its own errors after the fact better than it can avoid them in the first place. Long-horizon agentic tasks. Code with run-time errors. Logical puzzles with check steps.
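The loop above can be exercised end to end with toy stand-ins. The generator and critic here are hypothetical stubs, not the kolm runtime: the first attempt is wrong, the critic catches it with a cheap recomputation, and the second attempt conditions on the critique.

```python
def generate(prompt, critique=None):
    # Toy stand-in for model.generate: the first attempt is wrong;
    # given a critique, the model corrects itself.
    return "4" if critique else "5"

def critique(prompt, output):
    # Toy stand-in for model.critique: returns None when the answer
    # checks out, else a verbal note for the next attempt.
    return None if output == "4" else f"{output} is wrong; recompute 2 + 2"

def reflexion(prompt, max_iters=3):
    output = generate(prompt)
    for _ in range(max_iters):
        note = critique(prompt, output)
        if note is None:  # stopping condition: critic finds nothing
            break
        output = generate(prompt, critique=note)
    return output

final = reflexion("2 + 2 = ?")
```

The `max_iters` bound matters even in this toy: without it, a critic that keeps flagging a correct answer would loop forever.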

When each wins

Setup                | Best fit           | Why
Math with checker    | Best-of-N + PRM    | Verifier is exact and per-step
Math without checker | Self-consistency   | Discrete answers, no verifier needed
Code with unit tests | Best-of-N + tests  | Tests are the verifier
Code without tests   | Reflexion          | Model critiques compile errors
Tool calls           | Best-of-N + schema | JSON Schema is the verifier
Open generation      | None of these      | No notion of correctness

The kolm call

from apps.eval import test_time
result = test_time.best_of_n(
    model=runtime,
    prompt=p,
    n=8,
    verifier=schema_validator(tool_schema),
    seed=42,
)

The verifier is the same object the K-score gate uses. Same scoring function in training, evaluation, and inference. This is the kolm-specific design point: the reward function is the eval is the inference-time picker.
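kolm's `schema_validator` is internal, but a minimal stand-in shows the shape of the contract: a factory that takes a schema and returns a scoring function best-of-N can rank by. The required-key-and-type check below is an illustrative simplification; a real JSON Schema validator covers far more.

```python
def schema_validator(schema):
    # Hypothetical stand-in for kolm's schema_validator: returns a scorer
    # that awards 1.0 to a dict with the required keys and types, else 0.0.
    def score(call):
        if not isinstance(call, dict):
            return 0.0
        for key, typ in schema.items():
            if key not in call or not isinstance(call[key], typ):
                return 0.0
        return 1.0
    return score

verify = schema_validator({"name": str, "arguments": dict})
```

The point of the factory shape is the "same object" property: the returned callable can be handed unchanged to the training reward, the eval harness, and `best_of_n`.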

What the receipt records

{
  "test_time": {
    "method": "best_of_n",
    "n": 8,
    "verifier_id": "schema/tool_call_v1",
    "scores": [0.97, 0.94, 0.99, 0.21, 0.88, 0.92, 0.99, 0.96],
    "selected_index": 2,
    "tokens_total": 4831,
    "cost_usd": 0.0024
  }
}

If a buyer asks "did you spend more compute on the answer I just got than on the one yesterday?", the receipt answers exactly that. tokens_total is the full spend across all N samples, not just the selected one.
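Because the receipt records both the raw scores and the selection, a consumer can recompute the pick and catch a mismatched receipt. A sketch, using the example receipt above:

```python
import json

RECEIPT = """
{
  "test_time": {
    "method": "best_of_n",
    "n": 8,
    "verifier_id": "schema/tool_call_v1",
    "scores": [0.97, 0.94, 0.99, 0.21, 0.88, 0.92, 0.99, 0.96],
    "selected_index": 2,
    "tokens_total": 4831,
    "cost_usd": 0.0024
  }
}
"""

tt = json.loads(RECEIPT)["test_time"]
# The recorded selection should be the argmax of the recorded scores
# (ties break toward the earliest sample, matching Python's max).
recomputed = max(range(tt["n"]), key=tt["scores"].__getitem__)
```

Here the 0.99 at index 2 ties with index 6; earliest-sample tie-breaking makes index 2 the consistent answer.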

Edge cases

Temperature. Best-of-N and self-consistency need diversity. Temperature=0 collapses them to one-shot. The defaults are 0.7 for reasoning, 0.4 for code, 1.0 for creative.

Stopping criteria. Reflexion needs an upper bound (MAX_ITERS=3 is a good default). Without one, an under-calibrated critic can loop forever on a correct answer.

Cost. N=8 costs 8× tokens. The break-even is the point where (gain × user willingness to pay) exceeds (8× generation cost). For high-stakes tools (medical triage, legal drafting), it is almost always positive.
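The break-even arithmetic fits in a few lines. The numbers below are hypothetical, and the comparison is against the extra spend over one-shot, i.e. N−1 additional samples:

```python
def break_even_margin(accuracy_gain, value_per_correct_usd, n, cost_per_sample_usd):
    # Expected extra value from best-of-N over one-shot, minus the extra
    # token spend for the N-1 additional samples. Positive margin = worth it.
    extra_value = accuracy_gain * value_per_correct_usd
    extra_cost = (n - 1) * cost_per_sample_usd
    return extra_value - extra_cost

# Hypothetical numbers: +6 points accuracy, $0.50 of value per correct
# answer, N=8 at $0.0003 per sample.
margin = break_even_margin(0.06, 0.50, 8, 0.0003)
```

With these numbers the margin is about $0.028 per query, which is why the high-stakes cases in the text clear the bar so easily: value_per_correct dominates.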

Verifier-of-the-verifier. If your PRM is a smaller model, run it through K-score quarterly. PRMs drift the same way generators do.

Citations

Cobbe et al. 2021. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168

Wang et al. 2022. Self-Consistency Improves Chain of Thought Reasoning. arXiv:2203.11171

Shinn et al. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366

Snell et al. 2024. Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. arXiv:2408.03314
