ArXiv TLDR

What should post-training optimize? A test-time scaling law perspective

arXiv:2605.10716

Muheng Li, Jian Qian, Wenlong Mou

cs.LG, stat.ML

TLDR

This paper proposes Tail-Extrapolated estimators (TEA and Prefix-TEA) that optimize LLM post-training for best-of-N deployment even when only a limited number of per-prompt rollouts is available during training.

Key contributions

  • Identifies a mismatch between standard post-training, which optimizes the mean reward of a single response, and best-of-N deployment, whose performance is governed by the upper tail of the reward distribution (see the sketch after this list).
  • Addresses the budget-mismatch regime where training uses far fewer per-prompt rollouts (m) than the best-of-N deployment budget (N).
  • Introduces Tail-Extrapolated Advantage (TEA) and Prefix-TEA estimators for best-of-N optimization.
  • Demonstrates improved best-of-N performance across various LLMs, reward models, and datasets.
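To make the first bullet concrete, here is a small hypothetical Monte Carlo sketch (not from the paper): two toy reward distributions are chosen so that the policy with the higher mean reward is not the one that wins under best-of-N selection.

```python
# Toy illustration (not from the paper): mean reward vs. best-of-N reward.
import numpy as np

rng = np.random.default_rng(0)
N = 16            # deployment best-of-N budget
trials = 200_000  # Monte Carlo trials

# Policy A: high mean reward, thin upper tail.
rewards_a = rng.normal(loc=0.60, scale=0.05, size=(trials, N))

# Policy B: lower mean reward, but a heavy upper tail
# (roughly 10% of responses score ~0.95, the rest ~0.40).
is_good = rng.random(size=(trials, N)) < 0.10
rewards_b = np.where(is_good, 0.95, 0.40) + rng.normal(0.0, 0.01, size=(trials, N))

for name, r in [("A", rewards_a), ("B", rewards_b)]:
    mean_reward = r.mean()            # what standard post-training optimizes
    best_of_n = r.max(axis=1).mean()  # what best-of-N deployment actually gets
    print(f"policy {name}: mean reward = {mean_reward:.2f}, "
          f"E[best-of-{N}] = {best_of_n:.2f}")

# Approximate output:
#   policy A: mean reward = 0.60, E[best-of-16] = 0.69
#   policy B: mean reward = 0.46, E[best-of-16] = 0.86
# Policy A wins on mean reward; policy B wins under best-of-16.
```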

Why it matters

Current LLM post-training objectives do not match the best-of-N strategies commonly used at deployment, especially when the per-prompt rollout budget available for training is much smaller than the test-time budget. This work provides a practical way to bridge that gap: by making best-of-N-oriented optimization feasible under small training budgets, it can lead to more robust and performant LLMs in real-world applications.

Original Abstract

Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m\ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of-$N$ objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of-$N$-oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of-$N$ performance across different language models, reward models and datasets under various training and test-time budget settings.
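The abstract's core idea, approximating a best-of-N quantity by extrapolating upper-tail statistics from a small rollout group, can be illustrated with a toy calculation. The sketch below is not the paper's TEA or Prefix-TEA estimator: it simply assumes an exponential upper tail, uses a Gumbel approximation to extrapolate the expected best-of-N reward from m rollouts, and checks the result against a distribution whose true best-of-N value is known.

```python
# Minimal sketch (NOT the paper's TEA / Prefix-TEA estimator): from only m
# rollouts per prompt, extrapolate the expected best-of-N reward for N >> m.
# Illustration-only assumption: rewards above a threshold u have an exponential
# tail, P(R > u + x) = p_u * exp(-x / beta); then the max of N i.i.d. rewards
# is approximately Gumbel with mean u + beta * (log(N * p_u) + gamma).
import numpy as np

rng = np.random.default_rng(0)
m, N = 8, 64              # small training rollout group vs. large deployment budget
gamma = 0.5772156649      # Euler-Mascheroni constant (mean of a standard Gumbel)

def extrapolate_best_of_n(rewards_m, n):
    """Estimate E[max of n rewards] from m << n rollouts via an exponential-tail fit."""
    r = np.sort(rewards_m)
    u = r[len(r) // 4]                # tail threshold: a low within-group quantile
    excesses = r[r > u] - u
    beta = excesses.mean()            # exponential tail-scale estimate
    p_u = (r > u).mean()              # fraction of rollouts above the threshold
    return u + beta * (np.log(n * p_u) + gamma)

# Toy ground truth: rewards ~ Exp(1), so the true E[best-of-N] is the harmonic number H_N.
trials = 20_000
extrapolated, naive = [], []
for _ in range(trials):
    rollouts = rng.exponential(1.0, size=m)
    extrapolated.append(extrapolate_best_of_n(rollouts, N))
    naive.append(rollouts.max())      # all you directly observe with only m rollouts

print(f"true E[best-of-{N}]          : {np.sum(1.0 / np.arange(1, N + 1)):.2f}")  # ~4.74
print(f"tail-extrapolated from m={m} : {np.mean(extrapolated):.2f}")              # ~4.7
print(f"naive max of the m rollouts  : {np.mean(naive):.2f}")                     # ~2.7
```

The point of the toy comparison is only qualitative: the within-group maximum badly underestimates best-of-N performance when m is much smaller than N, whereas a tail model fitted to the same m rollouts can extrapolate close to the right value when its tail assumption holds.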
