ArXiv TLDR

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

arXiv:2605.12925

Priyam Sahoo, Gaurav Mittal, Xiaomin Li, Shengjie Ma, Benjamin Steenhoek + 2 more

cs.SE · cs.AI

TLDR

AgentLens reveals the 'Lucky Pass' problem in SWE-agent evaluation, introducing a process-level framework to assess trajectory quality beyond simple pass/fail.

Key contributions

  • Reveals the 'Lucky Pass' problem: 10.7% of passing SWE-agent trajectories succeed through regression cycles, blind retries, or missing verification rather than a principled solution.
  • Introduces AgentLens, a framework for process-level SWE-agent evaluation beyond binary pass/fail.
  • Releases AgentLens-Bench, a dataset of 1,815 annotated trajectories and 47 task-level PTA references.
  • Demonstrates that ranking by quality score instead of pass rate moves some models by as many as five positions, giving a more accurate assessment.

Why it matters

Current SWE-agent evaluation is flawed, overlooking chaotic 'Lucky Pass' solutions. This paper introduces AgentLens to provide a much-needed process-level assessment, enabling more accurate model ranking and fostering the development of truly robust agents.

Original Abstract

Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic trial-and-error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of these, 47 have enough passing trajectories to construct task-level process references, yielding a 1,815-trajectory evaluation subset. Among passing trajectories in this subset, 10.7% exhibit behavior we call a Lucky Pass: regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. We introduce AgentLens, a framework for process-level assessment of SWE-agent trajectories, and release AgentLens-Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, divergence points, and 47 task-level Prefix Tree Acceptor (PTA) references. AgentLens builds PTA references by merging multiple passing solutions for the same task, and uses a context-sensitive intent labeler to assign actions to Exploration, Implementation, Verification, or Orchestration based on trajectory history rather than tool identity alone. On AgentLens-Bench, the quality score separates passing trajectories into Lucky, Solid, and Ideal tiers and further decomposes Lucky Passes into five recurring mechanisms. Across the eight model backends, Lucky rates range from 0.5% to 23.2%, and some models move by as many as five rank positions when ranked by quality score instead of pass rate. We release the anonymized project repository, including the AgentLens-Bench dataset and AgentLens SDK, at https://github.com/microsoft/code-agent-state-trajectories/.
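The abstract's core mechanism — merging multiple passing solutions for a task into a Prefix Tree Acceptor (PTA) reference — can be sketched in a few lines. The snippet below is an illustrative assumption of the idea only, not the AgentLens SDK API: trajectories are simplified to sequences of intent labels (Exploration, Implementation, Verification, Orchestration), merged into a trie whose paths are the accepted processes, and a candidate trajectory is scored by how long an accepted prefix it follows.

```python
# Hypothetical sketch of a PTA reference built from passing trajectories.
# All names here are illustrative, not the actual AgentLens implementation.

class PTANode:
    def __init__(self):
        self.children = {}      # intent label -> PTANode
        self.accepting = False  # True if a passing trajectory ends here

def build_pta(trajectories):
    """Merge passing trajectories (lists of intent labels) into one PTA."""
    root = PTANode()
    for actions in trajectories:
        node = root
        for a in actions:
            node = node.children.setdefault(a, PTANode())
        node.accepting = True   # mark the end of an accepted process
    return root

def accepted_prefix_len(root, actions):
    """Length of the longest prefix of `actions` present in the PTA."""
    node, depth = root, 0
    for a in actions:
        if a not in node.children:
            break               # candidate diverges from all references here
        node = node.children[a]
        depth += 1
    return depth

# Usage: two passing solutions share an Exploration prefix, then diverge.
passing = [
    ["Exploration", "Implementation", "Verification"],
    ["Exploration", "Exploration", "Implementation", "Verification"],
]
pta = build_pta(passing)
candidate = ["Exploration", "Implementation", "Implementation"]
print(accepted_prefix_len(pta, candidate))  # 2: diverges after the second action
```

The divergence depth is one plausible raw ingredient for a "divergence point" annotation of the kind AgentLens-Bench releases; the paper's actual quality score also incorporates waste signals and verification ordering.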
