ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
Yuanyang Li, Xue Yang, Longyue Wang, Weihua Luo, Hongyang Chen
TLDR
ComplexMCP is a new benchmark evaluating LLM agents in dynamic, interdependent, and large-scale tool environments, revealing that even top-tier models fall far short of human performance.
Key contributions
- Introduces ComplexMCP, a benchmark for LLM agents in dynamic, interdependent, and noisy tool environments.
- Features over 300 tools from 7 stateful sandboxes, with a seed-driven architecture that simulates dynamic environment states and unpredictable API failures (see the sketch after this list).
- Reveals top LLMs achieve <60% success (vs. 90% human), identifying tool retrieval saturation, over-confidence, and strategic defeatism as the main failure modes.
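The seed-driven design can be pictured with a minimal sketch. The paper does not publish its sandbox API, so everything below (SeededSandbox, call_tool, the failure_rate parameter, the document store) is a hypothetical illustration of how one seed can deterministically drive both the initial environment state and an "unpredictable" API failure schedule:

```python
import random
from dataclasses import dataclass, field

@dataclass
class SandboxState:
    # Mutable environment state that tools read and write.
    documents: dict = field(default_factory=dict)

class SeededSandbox:
    """Hypothetical seed-driven sandbox; names are illustrative, not the paper's API."""

    def __init__(self, seed: int, failure_rate: float = 0.1):
        self.rng = random.Random(seed)   # one RNG per episode -> fully reproducible
        self.failure_rate = failure_rate
        self.state = SandboxState()
        # Seed-derived initial state: varies across seeds, fixed for a given seed.
        for i in range(self.rng.randint(3, 8)):
            self.state.documents[f"doc_{i}"] = f"content_{self.rng.randint(0, 999)}"

    def call_tool(self, name: str, **kwargs):
        # Deterministically "unpredictable" failure: the agent cannot know which
        # call will fail, but replaying the same seed reproduces the exact schedule.
        if self.rng.random() < self.failure_rate:
            raise RuntimeError(f"{name}: transient API failure (injected)")
        if name == "read_document":
            return self.state.documents[kwargs["doc_id"]]
        if name == "write_document":
            self.state.documents[kwargs["doc_id"]] = kwargs["content"]
            return "ok"
        raise KeyError(f"unknown tool: {name}")

# Same seed -> identical state and failure schedule; a new seed -> a new scenario.
env_a, env_b = SeededSandbox(seed=42), SeededSandbox(seed=42)
assert env_a.state.documents == env_b.state.documents
```

Determinism of this kind is what lets a benchmark compare models fairly: every agent faces the same initial state and the same injected failures for a given seed, while varying the seed keeps the evaluation diverse.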
Why it matters
This paper introduces a benchmark for evaluating LLM agents in complex, real-world software automation scenarios. By pinpointing the specific failure modes of current models, ComplexMCP offers a testbed for developing more resilient, autonomous AI systems.
Original Abstract
Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce ComplexMCP, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance of 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale; (2) over-confidence, where agents skip essential environment verifications; and (3) strategic defeatism, a tendency to rationalize failure rather than pursue recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.
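To make the two evaluation paradigms the abstract contrasts concrete, here is a minimal, hypothetical sketch: the tool registry is invented for the example, and the bag-of-words cosine scorer merely stands in for a real embedding retriever. It shows why full-context provision saturates the prompt as the action space grows, while RAG provision is cheaper but can miss the tools the task actually needs:

```python
import math
from collections import Counter

# Invented registry of tool name -> description (stand-in for MCP tool schemas).
TOOLS = {
    "sheet.create": "create a new spreadsheet in the office suite",
    "sheet.append_row": "append a row of values to a spreadsheet",
    "finance.get_quote": "fetch the latest price quote for a ticker",
    "mail.send": "send an email with subject body and recipients",
}

def _vec(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a dense encoder.
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def full_context_tools() -> list[str]:
    # Full-context paradigm: every tool schema goes into the prompt.
    # Simple and complete, but the prompt saturates as tools scale into the hundreds.
    return list(TOOLS)

def rag_tools(task: str, k: int = 2) -> list[str]:
    # RAG paradigm: retrieve only the k tools most similar to the task description.
    # Keeps the prompt small, but retrieval misses compound as action spaces grow.
    q = _vec(task)
    return sorted(TOOLS, key=lambda name: _cosine(q, _vec(TOOLS[name])), reverse=True)[:k]

print(rag_tools("append quarterly revenue rows to a spreadsheet"))
# -> ['sheet.append_row', 'sheet.create'] under this toy scorer
```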