ArXiv TLDR

Evaluating LLM Agents on Automated Software Analysis Tasks

arXiv:2604.11270

Michael Pradel, Cristian Cadar, Islem Bouzenia

cs.SE

TLDR

This paper introduces AnalysisBench, a benchmark for evaluating LLM agents on automated software analysis tasks, and shows that a custom agent, AnalysisAgent, achieves a 94% success rate.

Key contributions

  • Introduces AnalysisBench, a benchmark of 35 tool-project pairs spanning seven analysis tools and ten C/C++ and Java projects.
  • Proposes AnalysisAgent, a custom LLM agent achieving 94% success, significantly outperforming the best baseline (ExecutionAgent, 77%).
  • Identifies key limitations in existing agents, such as stage mixing, poor error localization, and premature termination.
  • Shows that agent architecture matters more than LLM capability alone for complex software analysis tasks.

Why it matters

Automating the setup of software analysis tools is a major bottleneck. This paper provides the first systematic evaluation and a new benchmark, showing that specialized LLM agents can largely automate this complex process. It also offers design principles for future agent development, making software analysis tools more accessible.

Original Abstract

Numerous software analysis tools exist today, yet applying them to diverse open-source projects remains challenging due to environment setup, dependency resolution, and tool configuration. LLM-based agents offer a potential solution, yet no prior work has systematically studied their effectiveness on the specific task of automated software analysis, which, unlike issue solving or general environment setup, requires installing and configuring a separate analysis tool alongside the target project, generating tool-specific prerequisites, and validating that the tool produces meaningful analysis outputs. We introduce AnalysisBench, a benchmark of 35 tool-project pairs spanning seven analysis tools and ten diverse C/C++ and Java projects, each with a manually constructed reference setup. Using AnalysisBench, we evaluate four agent architectures across four LLM backends. Our custom agent, AnalysisAgent, achieves manually verified success rates of 94% (Gemini-3-Flash, 33/35 tasks), compared to 77% for the best baseline (ExecutionAgent). Beyond quantitative results, we identify key limitations in existing agents, including stage mixing, poor error localization, and premature termination, and show that agentic architecture matters more than LLM capability alone. We further find that whole-program analyses and symbolic execution are the most difficult tasks, that Java toolchains pose greater challenges than C/C++, and that LLM-self-validated success consistently overstates manually verified success.
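The abstract distinguishes this task from ordinary environment setup: the agent must install and configure a separate analysis tool alongside the target project, generate tool-specific prerequisites, and validate that the output is meaningful. Below is a minimal sketch of how such a benchmark task and a staged agent loop could be structured; the names (Task, STAGES, run_stage) and the example tool are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of an AnalysisBench-style task record and a staged
# agent loop. All identifiers here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Task:
    tool: str                  # analysis tool to set up, e.g. "Infer" (example, not from the paper)
    project: str               # target open-source project repository
    language: str              # "C/C++" or "Java"
    reference_setup: str = ""  # manually constructed reference used for validation


# The abstract names three concerns beyond plain environment setup:
# installing/configuring the tool, generating tool-specific prerequisites,
# and validating that the tool produces meaningful analysis output.
STAGES = [
    "install_project_dependencies",
    "install_and_configure_tool",
    "generate_tool_prerequisites",        # e.g. build commands or compilation databases
    "run_analysis_and_validate_output",
]


def run_stage(task: Task, stage: str) -> bool:
    # Placeholder: a real agent would issue shell commands, inspect errors,
    # and retry; here we only illustrate the control flow.
    print(f"[{task.tool} on {task.project}] stage: {stage}")
    return True


def run_agent(task: Task) -> bool:
    """Drive the task through each stage in order, stopping on failure.

    Keeping stages explicit and ordered avoids the 'stage mixing' and
    'premature termination' failure modes the paper reports in baselines.
    """
    for stage in STAGES:
        if not run_stage(task, stage):
            return False
    return True


if __name__ == "__main__":
    demo = Task(tool="Infer", project="example/repo", language="Java")
    print("success:", run_agent(demo))
```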
