To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
Qinyuan Wu, Soumi Das, Mahsa Amani, Arijit Nag, Seungeon Lee + 3 more
TLDR
A framework is introduced to assess and optimize when LLMs should use external tools, improving decision quality and overall task performance.
Key contributions
- Proposes a framework to evaluate LLM tool-use decisions based on necessity, utility, and affordability.
- Identifies a misalignment between LLMs' perceived and true need/utility for tool calls, especially with web search.
- Develops lightweight estimators from hidden states to predict tool need and utility.
- Demonstrates improved decision quality and task performance using these estimators across models and tasks.
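The controller the bullets describe could be sketched as follows. This is a minimal, hypothetical illustration: the linear "probes" over hidden states, the sigmoid need score, and the thresholds are all assumptions standing in for the paper's actual lightweight estimators.

```python
import numpy as np

def should_call_tool(hidden_state, need_probe, utility_probe,
                     need_threshold=0.5, utility_threshold=0.0):
    """Decide whether to call a tool from a model's hidden state.

    `need_probe` and `utility_probe` are hypothetical linear probes
    (weight vectors) trained on hidden states; the paper's actual
    estimators may differ in form and training.
    """
    # Estimated probability that external information is needed.
    need = 1.0 / (1.0 + np.exp(-hidden_state @ need_probe))
    # Estimated gain from calling the tool (sign matters, scale is arbitrary).
    utility = hidden_state @ utility_probe
    return bool(need > need_threshold and utility > utility_threshold)

# Toy usage with random vectors standing in for a real hidden state and probes.
rng = np.random.default_rng(0)
h = rng.normal(size=16)
w_need = rng.normal(size=16)
w_util = rng.normal(size=16)
print(should_call_tool(h, w_need, w_util))
```

A threshold rule like this is the simplest controller one could build on such estimators; richer policies could also weigh affordability (e.g., per-call cost or latency budgets).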
Why it matters
Optimizing when LLMs use external tools is critical for building effective agentic AI. This paper offers a principled framework for understanding and correcting suboptimal tool-calling behavior, along with practical estimator-based controllers that make tool-augmented LLMs more reliable and efficient.
Original Abstract
Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM decision: whether or not to call a tool when performing a task. This decision is particularly challenging for web search tools, where the benefits of external information depend on the model's internal knowledge and its ability to integrate potentially noisy tool responses. We introduce a principled framework inspired by decision-making theory to evaluate web search tool-use decisions along three key factors: necessity, utility, and affordability. Our analysis combines two complementary lenses: a normative perspective that infers true need and utility from an optimal allocation of tool calls, and a descriptive perspective that infers the model's self-perceived need and utility from its observed behavior. We find that models' perceived need and utility of tool calls are often misaligned with their true need and utility. Building on this framework, we train lightweight estimators of need and utility based on models' hidden states. Our estimators enable simple controllers that can improve decision quality and lead to stronger task performance than the self-perceived setup across three tasks and six models.
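The abstract's normative perspective (inferring *true* need and utility) admits a simple per-query reading: a tool is truly needed when the model fails without it, and its true utility is the change in correctness it brings. A minimal sketch of that labeling, with hypothetical names and a per-query simplification of the paper's optimal-allocation view:

```python
def label_normative(correct_without_tool: bool, correct_with_tool: bool):
    """Derive true-need and true-utility labels for one query.

    A simplified, per-query reading of the normative perspective;
    the function name and label encoding are hypothetical.
    """
    # Tool is needed only if the model fails when answering alone.
    need = not correct_without_tool
    # Utility is the change in correctness from calling the tool: -1, 0, or +1.
    utility = int(correct_with_tool) - int(correct_without_tool)
    return need, utility

print(label_normative(False, True))  # tool rescued a wrong answer
```

Comparing these labels against the model's own call/no-call behavior is what surfaces the perceived-versus-true misalignment the paper reports.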