LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
Tong Zheng, Haolin Liu, Chengsong Huang, Huiwen Bao, Sheng Zhang, and 8 more
TLDR
AutoTTS automatically discovers optimal test-time scaling strategies for LLMs, outperforming hand-crafted methods with efficient, agentic search.
Key contributions
- Introduces AutoTTS, an environment-driven framework for automatically discovering test-time scaling strategies.
- Formulates width-depth TTS as controller synthesis, enabling cheap evaluation without repeated LLM calls.
- Employs beta parameterization and fine-grained execution trace feedback for efficient strategy discovery (a hedged sketch follows this list).
- Discovered strategies improve the accuracy-cost tradeoff, generalize to new benchmarks, and are found at low computational cost.
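The paper does not spell out the beta parameterization in this summary, so the following is a minimal, hypothetical Python sketch of what a width-depth controller might look like: it chooses among the five actions the abstract names, and its decision boundary is governed by two Beta shape parameters, so a search agent tunes a handful of scalars rather than an open-ended program. The names (`Action`, `BetaController`), the cutoffs, and the use of the Beta CDF are all illustrative assumptions, not the paper's API.

```python
from dataclasses import dataclass
from enum import Enum, auto

from scipy.stats import beta  # Beta CDF for the assumed threshold parameterization

class Action(Enum):
    BRANCH = auto()    # widen: spawn an extra parallel sample
    CONTINUE = auto()  # deepen: extend the current trajectory
    PROBE = auto()     # consult a cached probe signal (e.g., a verifier score)
    PRUNE = auto()     # drop a low-promise branch
    STOP = auto()      # commit to the current answer

@dataclass
class BetaController:
    """Hypothetical width-depth controller: its decision boundary is governed
    by two Beta shape parameters, so the search agent only tunes a few
    scalars (assumed form, not necessarily the paper's)."""
    alpha: float
    beta_: float  # trailing underscore avoids shadowing scipy.stats.beta

    def decide(self, probe_score: float, depth: int, width: int, budget: int) -> Action:
        if depth + width >= budget:   # out of budget: stop regardless
            return Action.STOP
        # Squash the cached probe score through the Beta CDF into a "promise" value.
        promise = beta.cdf(probe_score, self.alpha, self.beta_)
        if promise < 0.2:             # illustrative cutoffs
            return Action.PRUNE
        if promise > 0.8:
            return Action.CONTINUE    # confident path: go deeper
        return Action.BRANCH          # uncertain: widen the search

# Example: one candidate controller a search agent might propose.
ctrl = BetaController(alpha=2.0, beta_=5.0)
print(ctrl.decide(probe_score=0.7, depth=3, width=2, budget=32))
```

Because the probe scores are cached, evaluating a candidate controller like this reduces to arithmetic over pre-collected data, which is what makes frequent feedback to the search agent affordable.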
Why it matters
Manually designing test-time scaling strategies for LLMs is slow and leaves much of the computation-allocation space unexplored. By automating the discovery of these strategies, this paper makes LLM inference both more effective and more cost-efficient, and it turns computation allocation into an agentic, data-driven process rather than a hand-tuned one.
Original Abstract
Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy-cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data and code will be open-sourced at https://github.com/zhengkid/AutoTTS.
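The abstract's claim that controllers "can be evaluated cheaply without repeated LLM calls" suggests an offline replay loop over cached trajectories, which would also naturally yield the fine-grained execution trace the agent uses to diagnose failures. Below is a minimal, self-contained sketch of that idea; the `Step` schema, the `replay` function, and the toy threshold controller are assumptions for illustration, not the paper's released code.

```python
from typing import Callable, Dict, List

# One step of a pre-collected trajectory: a cached probe signal, the token
# cost of generating that step, and whether stopping here yields a correct answer.
Step = Dict[str, float]

def replay(decide: Callable[[float, int], str],
           trajectories: List[List[Step]]) -> Dict[str, object]:
    """Replay a candidate controller over cached trajectories; `decide` maps
    (probe_score, depth) to 'continue' / 'prune' / 'stop'. Returns accuracy,
    total token cost, and a per-decision execution trace for the search agent."""
    trace, total_tokens, n_correct = [], 0, 0
    for traj in trajectories:
        stopped_at = len(traj) - 1
        for depth, step in enumerate(traj):
            total_tokens += step["tokens"]          # cost is read off the cache
            action = decide(step["probe_score"], depth)
            trace.append((depth, action, step["probe_score"]))
            if action in ("stop", "prune"):
                stopped_at = depth
                break
        n_correct += int(traj[stopped_at]["correct"])
    n = len(trajectories)
    return {"accuracy": n_correct / n, "tokens": total_tokens, "trace": trace}

# Example: a trivial confidence-threshold controller, scored entirely offline.
decide = lambda score, depth: "stop" if score > 0.9 else "continue"
data = [[{"probe_score": 0.5, "tokens": 100, "correct": 0},
         {"probe_score": 0.95, "tokens": 80, "correct": 1}]]
print(replay(decide, data))  # no LLM calls: everything is pre-collected
```

Under this reading, a failed TTS program leaves behind a step-by-step trace (which action fired, at what depth, on what probe score), giving the discovery agent concrete evidence of where a strategy stopped too early or branched too aggressively.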