ArXiv TLDR

Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents

arXiv: 2604.02155

Xuan Qi

cs.CL

TLDR

Brief chain-of-thought (CoT) reasoning dramatically improves function-calling accuracy, while extended CoT degrades it below the no-CoT baseline.

Key contributions

  • Systematic sweep of six CoT token budgets (0–512) over 200 Berkeley Function Calling Leaderboard v3 tasks reveals a non-monotonic effect on function-calling accuracy.
  • Brief CoT (32 tokens) boosts accuracy by 45% relative (44.0% → 64.0%); extended CoT (256 tokens) drops it to 25.0%, below the no-CoT baseline.
  • Brief CoT acts as a function-routing step, cutting wrong function selections from 30.5% to 1.5%.
  • Proposed Function-Routing CoT (FR-CoT) reduces function hallucination to 0.0% by forcing early commitment to a valid function name.

Why it matters

This paper challenges the assumption that more CoT is always better, showing that agent accuracy rises and then falls as the reasoning budget grows. The findings give practical guidance for choosing reasoning budgets in function-calling agents, and the proposed FR-CoT offers a structurally reliable alternative to budget tuning.

Original Abstract

How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0--512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8--16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]," forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
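The significance claim above (McNemar p < 0.001) comes from a paired test on per-task correctness between two CoT budgets. A minimal exact McNemar computation looks like the sketch below; the helper name and example counts are ours, not the paper's:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant-pair counts:
    b = tasks condition A got right and B got wrong,
    c = tasks condition A got wrong and B got right.
    Under the null, the b discordant successes follow Binomial(b+c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs -> no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With toy counts of 1 vs. 9 discordant pairs, the exact two-sided p-value is about 0.021, i.e. a significant paired difference even at this small scale.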
