Test-Time Scaling Makes Overtraining Compute-Optimal
Nicholas Roberts, Sungjun Cho, Zhiqi Gao, Tzu-Heng Huang, Albert Wu, et al.
TLDR
New $T^2$ scaling laws show that overtraining LLMs is compute-optimal when considering test-time inference costs, leading to stronger performance.
Key contributions
- Introduces $T^2$ scaling laws to jointly optimize model size, training tokens, and the number of test-time inference samples.
- Modernizes pretraining scaling by integrating pass@k modeling to account for test-time sampling costs.
- Demonstrates that optimal pretraining shifts into an "overtraining regime" when inference costs are considered.
- Validates that $T^2$-forecasted overtrained models achieve substantially stronger performance.
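The trade-off behind these contributions can be illustrated with standard FLOP accounting. The sketch below is an assumption for illustration, not the paper's actual $T^2$ law: it uses the common approximations of roughly $6ND$ FLOPs for training and $2N$ FLOPs per generated token at inference, where the function name and parameters (`total_flops`, `n_params`, `train_tokens`, etc.) are hypothetical.

```python
def total_flops(n_params: int, train_tokens: int,
                queries: int, samples_per_query: int,
                tokens_per_sample: int) -> int:
    """Approximate end-to-end compute for training a model and then
    serving it with repeated sampling at test time.

    Uses the common rules of thumb: ~6*N*D FLOPs for training a model
    with N parameters on D tokens, and ~2*N FLOPs per generated token
    at inference (illustrative assumptions, not the paper's model).
    """
    train = 6 * n_params * train_tokens
    inference = 2 * n_params * queries * samples_per_query * tokens_per_sample
    return train + inference
```

Under a fixed end-to-end budget, the inference term grows with both model size and the number of samples, so heavy test-time sampling pushes the optimum toward smaller models trained on more tokens, i.e., the overtraining regime the paper identifies.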
Why it matters
This paper redefines optimal LLM training by integrating test-time inference costs, challenging traditional pretraining scaling laws. It shows "overtraining" is compute-optimal for better performance in real-world deployments, significantly impacting how future frontier LLMs are designed and trained.
Original Abstract
Modern LLMs scale at test-time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test ($T^2$) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. $T^2$ modernizes pretraining scaling laws with pass@$k$ modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from $T^2$ are robust across distinct modeling approaches: measuring the joint scaling effect on task loss and modeling the impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that $T^2$ scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making $T^2$ scaling meaningful in modern deployments.
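The pass@$k$ modeling mentioned in the abstract builds on the standard unbiased pass@$k$ estimator: given $n$ sampled generations of which $c$ are correct, the probability that at least one of $k$ draws is correct is $1 - \binom{n-c}{k}/\binom{n}{k}$. A minimal sketch (the general estimator, not the paper's specific fitting procedure):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples, drawn without replacement from n generations of which
    c are correct, is correct."""
    if n - c < k:
        # Every size-k subset must contain a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 generations of which 5 are correct, `pass_at_k(10, 5, 1)` gives 0.5, and the estimate rises toward 1 as `k` grows, which is why more test-time samples trade off directly against model size and training tokens in the $T^2$ budget.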