Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows
TLDR
Tool Attention significantly reduces the "Tools Tax" in LLM agent workflows by dynamically gating tool schemas, cutting per-turn tool token usage by 95%.
Key contributions
- Addresses the "MCP/Tools Tax" causing high token overhead and reasoning degradation in LLM agent workflows.
- Introduces "Tool Attention," a middleware for gated attention over tools, reducing per-turn tool tokens by 95%.
- Combines Intent Schema Overlap (ISO) scores, state-aware gating, and a two-phase lazy schema loader.
- Raises effective context utilization from 24% to 91% in simulations, improving scalability.
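The gating pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the bag-of-words cosine similarity stands in for the sentence embeddings used for the real Intent Schema Overlap (ISO) score, and the tool names and summaries are invented for the example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a sentence-embedding model: bag-of-words token counts.
    # A real deployment would use dense sentence embeddings instead.
    return Counter(text.lower().split())

def iso_score(intent: str, schema_summary: str) -> float:
    # Intent Schema Overlap: cosine similarity between the user intent
    # and a tool's one-line schema summary.
    a, b = embed(intent), embed(schema_summary)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gate_tools(intent: str, summaries: dict, k: int = 3) -> list:
    # Rank tools by ISO score and keep only the top-k for schema promotion;
    # everything else stays out of the context window this turn.
    ranked = sorted(summaries, key=lambda t: iso_score(intent, summaries[t]),
                    reverse=True)
    return ranked[:k]

summaries = {  # hypothetical tool summaries for illustration
    "fs.read": "read a file from the local filesystem",
    "fs.write": "write contents to a file on the local filesystem",
    "web.search": "search the web for pages matching a query",
    "db.query": "run a sql query against a database",
}
print(gate_tools("read the config file from disk", summaries, k=2))
```

In the full design this ranking would be composed with the state-aware gating function (preconditions and access scopes) before any schema is promoted.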
Why it matters
This paper addresses the "Tools Tax," a major bottleneck in scalable LLM agent systems. Tool Attention introduces a novel protocol-level efficiency mechanism, drastically cutting token costs and improving context utilization. This is crucial for deploying more cost-effective and performant agentic workflows.
Original Abstract
The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the "MCP Tax" or "Tools Tax") that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention
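The two-phase lazy loading described in the abstract can be sketched as follows. This is a hedged sketch under assumptions, not the released implementation: the `LazySchemaLoader` class, its method names, and the example schemas are all invented here to illustrate the summary-pool/promotion split.

```python
import json

class LazySchemaLoader:
    """Two-phase loading sketch: phase 1 keeps only one-line summaries in
    context; phase 2 promotes full JSON schemas for the gated tools alone."""

    def __init__(self, full_schemas: dict):
        # tool name -> full JSON schema (the expensive payload)
        self.full_schemas = full_schemas

    def summary_pool(self) -> str:
        # Phase 1: compact pool, one line per tool (name + description only),
        # cheap enough to keep resident in the context window every turn.
        return "\n".join(
            f"{name}: {schema.get('description', '')}"
            for name, schema in self.full_schemas.items()
        )

    def promote(self, gated: list) -> str:
        # Phase 2: inject full schemas only for tools that passed the gate.
        return json.dumps(
            {name: self.full_schemas[name] for name in gated}, indent=2
        )

schemas = {  # hypothetical schemas for illustration
    "fs.read": {"description": "read a file", "parameters": {"path": "string"}},
    "web.search": {"description": "search the web", "parameters": {"q": "string"}},
}
loader = LazySchemaLoader(schemas)
pool = loader.summary_pool()            # small, always in context
promoted = loader.promote(["fs.read"])  # full schema for the gated tool only
print(len(pool) < len(json.dumps(schemas)))  # summary pool is the cheaper payload
```

The 95% per-turn token reduction reported in the paper comes from paying the full-schema cost only for the top-k gated tools rather than for all 120.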