ArXiv TLDR

Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study

arXiv: 2604.08906

Xiaowen Zhang, Hannuo Zhang, Shin Hwei Tan

cs.SE

TLDR

This paper empirically dissects 409 fixed bugs in modern agentic AI frameworks, identifying unique failure modes and root causes to improve reliability.

Key contributions

  • Empirically studied 409 fixed bugs across five modern agentic AI frameworks, including CrewAI and AutoGen.
  • Proposed a five-layer abstraction to capture structural complexities in agentic frameworks.
  • Identified specialized bug symptoms (e.g., unexpected execution) and agent-specific root causes (e.g., cognitive context mismanagement).
  • Discovered frequent bug-triggering patterns transferable across different framework designs.

Why it matters

This study is crucial for understanding and improving the reliability of complex multi-agent AI systems. By dissecting real-world bugs, it offers actionable insights into unique failure modes and root causes. These findings will guide developers in building more robust and trustworthy agentic frameworks.

Original Abstract

Modern agentic frameworks (e.g., CrewAI and AutoGen) have evolved into complex, autonomous multi-agent systems, introducing unique reliability challenges beyond earlier pipeline-based LLM libraries. However, existing empirical studies focus on earlier LLM libraries or task-level bugs, leaving the unique complexities of these agentic frameworks unexplored. We bridge this gap by conducting a comprehensive study of 409 fixed bugs from five representative agentic frameworks. We propose a five-layer abstraction to capture structural complexities in agentic frameworks, spanning from orchestration to infrastructure. Our study uncovers specialized symptoms, such as unexpected execution sequences and ignored user configurations, which are unique to autonomous orchestration. We further identify agent-specific root causes, including model-related faults, cognitive context mismanagement, and orchestration faults. Statistical analysis reveals cross-framework consistency and significant associations among these bug dimensions. Finally, our automated pattern mining identifies frequent bug-triggering patterns (e.g., model backend-ID combinations), and we show their transferability across different framework designs. Our findings facilitate cross-platform testing and improve the reliability of agentic systems.
