SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
Hao Guan, Lingyue Fu, Shao Zhang, Yaoming Zhu, Kangning Zhang + 6 more
TLDR
SWE-Cycle is a new benchmark, paired with the SWE-Judge evaluation agent, for accurately assessing autonomous code agents across the complete software issue resolution cycle.
Key contributions
- Introduces SWE-Cycle, a benchmark of 489 rigorously filtered instances for end-to-end code agent evaluation.
- Features an end-to-end FullCycle task in which agents work autonomously in a bare repository, without human scaffolding.
- Presents SWE-Judge, an execution-capable evaluation agent combining static code review and dynamic testing (see the sketch after this list).
- Reveals a sharp drop in solve rates when LLM agents move from isolated tasks to FullCycle execution, exposing bottlenecks in handling cross-phase dependencies and maintaining code quality.
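
To make the two-stage evaluation concrete, here is a minimal, hypothetical sketch of a SWE-Judge-style verdict. The paper's actual prompts, review criteria, and scoring are not given in this digest, so `static_review` is a stand-in sanity check and every name below is illustrative, not the paper's API:

```python
# Hypothetical sketch: combine a static review signal with a dynamic
# test run, as SWE-Judge is described as doing. All names are assumed.
import subprocess
from dataclasses import dataclass

@dataclass
class Verdict:
    static_ok: bool   # patch passed the static code review
    dynamic_ok: bool  # verification tests passed when executed

    @property
    def solved(self) -> bool:
        # Functional correctness requires both signals to agree.
        return self.static_ok and self.dynamic_ok

def static_review(patch: str) -> bool:
    """Stand-in for the LLM-based review step: here, just a sanity
    check that the patch is a non-empty unified diff."""
    return bool(patch.strip()) and "+++" in patch

def dynamic_test(test_cmd: list[str], workdir: str) -> bool:
    """Execute the verification tests in the repository and report
    pass/fail from the exit code."""
    result = subprocess.run(test_cmd, cwd=workdir, capture_output=True)
    return result.returncode == 0

def judge(patch: str, test_cmd: list[str], workdir: str) -> Verdict:
    return Verdict(static_review(patch), dynamic_test(test_cmd, workdir))
```

Executing the tests rather than parsing agent output statically is what lets this style of judge avoid the systematic measurement errors of traditional static parsers.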
Why it matters
Existing benchmarks hide friction by testing agents in pre-configured environments, so they fail to capture the full autonomy of code agents. SWE-Cycle and SWE-Judge provide a robust framework for measuring end-to-end capability, revealing critical bottlenecks in current LLM-powered agents and guiding future development.
Original Abstract
As autonomous code agents move toward end-to-end software development, evaluating their practical autonomy becomes critical. Current benchmarks hide friction by testing agents in pre-configured environments, and their static evaluation pipelines frequently fail when parsing fully autonomous trajectories. We address these limitations with SWE-Cycle, a benchmark of 489 rigorously filtered instances. SWE-Cycle evaluates agents across three isolated tasks, including environment reconstruction, code implementation, and verification test generation, as well as an end-to-end FullCycle task that integrates all three. The FullCycle task requires agents to work autonomously in a bare repository without human scaffolding. To reliably assess these complex execution paths, we developed SWE-Judge. By combining static code review with dynamic testing, this execution-capable evaluation agent accurately verifies functional correctness and eliminates the systematic measurement errors of traditional static parsers. We evaluate code agents powered by six state-of-the-art LLMs across these four tasks. The results reveal a sharp drop in solve rates when transitioning from isolated tasks to FullCycle execution, exposing critical bottlenecks in handling cross-phase dependencies and maintaining code quality. Together, SWE-Cycle and SWE-Judge provide a comprehensive framework for accurately measuring the end-to-end capabilities of autonomous software agents.
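
To make the task decomposition concrete, here is a hypothetical sketch of how a SWE-Cycle instance and its four tasks might be represented. Only the task names come from the abstract; the field names, enum values, and structure are assumptions:

```python
# Hypothetical representation of SWE-Cycle's task structure; only the
# four task names are from the abstract, everything else is assumed.
from dataclasses import dataclass
from enum import Enum

class Task(Enum):
    ENV_RECONSTRUCTION = "environment reconstruction"
    CODE_IMPLEMENTATION = "code implementation"
    TEST_GENERATION = "verification test generation"
    FULL_CYCLE = "full cycle"  # the three phases chained end to end

@dataclass
class Instance:
    repo: str   # bare repository: no pre-configured environment
    issue: str  # the issue the agent must resolve autonomously

# In FullCycle the agent itself must reconstruct the environment,
# implement the fix, then generate and run verification tests, so
# errors in early phases propagate into later ones -- the cross-phase
# dependencies the paper identifies as a key bottleneck.
```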