ArXiv TLDR

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

arXiv: 2604.08545

Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang + 4 more

cs.CV, cs.AI

TLDR

HDPO is a new training framework that teaches agentic multimodal models to decide when tool use is actually warranted, sharply reducing tool invocations while improving reasoning accuracy.

Key contributions

  • Pinpoints a "meta-cognitive deficit" in agentic models that drives blind, inefficient tool use.
  • Introduces HDPO, a framework that decouples accuracy and tool efficiency into separate optimization channels.
  • Enforces tool economy via conditional advantage estimation, applied only within accurate reasoning trajectories (see the sketch after this list).
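
To make the conditional scheme concrete, here is a minimal NumPy sketch of group-relative, two-channel advantage estimation in the spirit of these contributions. It is an illustration under assumptions, not the paper's implementation: the function name, the normalization details, and the exact conditioning rule are all hypothetical.

```python
import numpy as np

def two_channel_advantages(correct, tool_calls, eps=1e-8):
    """Group-relative advantages with decoupled channels (illustrative).

    correct:    1 if a rollout's final answer is right, else 0.
    tool_calls: number of tool invocations in that rollout.
    """
    correct = np.asarray(correct, dtype=float)
    tool_calls = np.asarray(tool_calls, dtype=float)

    # Accuracy channel: standard group-normalized correctness advantage.
    adv_acc = (correct - correct.mean()) / (correct.std() + eps)

    # Efficiency channel: estimated ONLY within accurate trajectories,
    # so tool economy never competes with getting the answer right.
    adv_eff = np.zeros_like(tool_calls)
    mask = correct.astype(bool)
    if mask.sum() > 1:  # need at least two correct rollouts to compare
        c = tool_calls[mask]
        # Fewer calls than the correct-group mean -> positive advantage.
        adv_eff[mask] = (c.mean() - c) / (c.std() + eps)

    return adv_acc, adv_eff

# Example group of 4 rollouts: three correct, using 0, 3, and 1 tool calls.
# The correct zero-call rollout earns the largest efficiency advantage.
acc_adv, eff_adv = two_channel_advantages([1, 1, 0, 1], [0, 3, 5, 1])
```

How the two channel advantages are combined into a single policy update is not specified in the summary above; summing them with a weighting coefficient is one simple option, but that is an assumption.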

Why it matters

Current agentic models waste latency and compute on "blind tool invocation," reflexively calling external tools even when a query is answerable from the input itself, which also injects noise that degrades reasoning. HDPO teaches models to arbitrate tool use deliberately, improving efficiency and reasoning accuracy at once and yielding more reliable agents.

Original Abstract

The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
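
The abstract's claim that a mild penalty is "entirely subsumed" by advantage normalization can be checked with a toy computation. The reward values and the penalty weight below are illustrative assumptions, not numbers from the paper.

```python
import numpy as np

acc   = np.array([1., 1., 0., 1., 0., 1.])   # binary correctness rewards
calls = np.array([0., 4., 5., 2., 3., 1.])   # tool-call counts per rollout

def norm_adv(r, eps=1e-8):
    """Group-normalized advantage, as in GRPO-style estimators."""
    return (r - r.mean()) / (r.std() + eps)

lam = 0.01                   # a "mild" per-call penalty
coupled = acc - lam * calls  # scalarized reward: accuracy minus penalty

print(np.round(norm_adv(acc), 3))      # -> approx [ 0.707  0.707 -1.414  0.707 -1.414  0.707]
print(np.round(norm_adv(coupled), 3))  # -> approx [ 0.743  0.660 -1.434  0.702 -1.393  0.722]
# After normalization the penalized advantages barely differ from the
# penalty-free ones: the correct rollout that made 4 tool calls keeps
# nearly the same advantage as the one that made 0, so the gradient
# signal against tool overuse is negligible, as the abstract argues.
```

Raising `lam` enough to matter after normalization would, conversely, start punishing essential tool use; this is the coupled-reward dilemma that HDPO's conditional, decoupled channels are designed to avoid.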
