ArXiv TLDR

Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution

arXiv:2605.06125

Ye Shang, Quanjun Zhang, Haichuan Hu, Chunrong Fang, Liang Xiao, et al.

cs.SE

TLDR

TEBench is the first project-level benchmark for evaluating coding agents on test evolution, revealing limitations in handling stale and missing tests.

Key contributions

  • Introduces TEBench, the first project-level benchmark for evaluating coding agents on test evolution.
  • Curates 314 task instances across 10 projects, categorizing test evolution into three types: Test-Breaking, Test-Stale, and Test-Missing (illustrated in the sketch after this list).
  • Evaluates seven agent configurations, finding a shared identification-F1 ceiling of roughly 45-49% across both frameworks and base models.
  • Reveals Test-Stale as the most challenging type (F1 around 36%): agents react to execution failures and lack the proactive semantic reasoning needed to flag tests that still pass but no longer validate updated behavior.
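
To make the taxonomy concrete, below is a minimal hypothetical JUnit 4 sketch (not drawn from the benchmark) of how a single commit, one that adds a 10% tax to order totals, can trigger all three evolution types at once; the Order class and test names are illustrative assumptions:

    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertTrue;

    import org.junit.Test;

    public class OrderTest {

        // Hypothetical production code *after* the commit: getTotal() now applies a 10% tax.
        static class Order {
            private final int subtotal;
            Order(int subtotal) { this.subtotal = subtotal; }
            int getTotal() { return (int) Math.round(subtotal * 1.10); }
        }

        // Test-Breaking: written against the pre-commit behavior, so it now fails
        // and must be updated.
        @Test
        public void totalEqualsSubtotal() {
            assertEquals(100, new Order(100).getTotal()); // fails: getTotal() now returns 110
        }

        // Test-Stale: still passes, but its weak assertion no longer meaningfully
        // validates the new tax behavior; it needs strengthening, not a mechanical fix.
        @Test
        public void totalIsPositive() {
            assertTrue(new Order(100).getTotal() > 0);
        }

        // Test-Missing: the tax rule introduced by the commit has no dedicated test
        // yet; a new test such as this one has to be added.
        @Test
        public void totalIncludesTenPercentTax() {
            assertEquals(110, new Order(100).getTotal());
        }
    }

In TEBench, an agent must discover such cases from the full repository given only the commit, rather than being handed pre-paired test methods.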

Why it matters

This paper addresses a critical gap by providing the first project-level benchmark for test evolution, moving beyond prior benchmarks that operate at method-level granularity with pre-paired inputs. It highlights current coding agents' weaknesses, particularly in proactively identifying stale or missing tests, a capability that is crucial for keeping test suites effective as production code evolves.

Original Abstract

As production code evolves, the test suite must co-evolve to remain effective. Existing benchmarks for test evolution operate at method-level granularity with pre-paired inputs, bypassing the task of locating affected tests from the full project and excluding the need for new tests entirely. We present TEBench, the first project-level benchmark for test evolution. Given a project repository and a code-changing commit, TEBench requires systems to autonomously identify tests requiring modification, determine where new tests are needed, and produce the corresponding test patch. We construct TEBench through a four-stage pipeline over Defects4J projects, curating 314 task instances from 10 projects with developer-written ground truth. Each instance is annotated with one or more of three evolution types: Test-Breaking (tests that fail), Test-Stale (tests that pass but no longer meaningfully validate updated behavior), and Test-Missing (new tests needed for introduced behavior). We evaluate seven configurations spanning three industrial agent frameworks (Claude Code, Codex CLI, OpenCode) and six base models, alongside a heuristic baseline. All seven configurations converge on an identification F1 of 45.7% to 49.4%, revealing a shared performance ceiling across both frameworks and base models. Test-Stale is the most challenging type, averaging F1 around 36%, since configurations rely on execution failure signals and lack proactive semantic reasoning. On the update task, configurations produce highly executable test modifications whose surface form diverges substantially from ground truth. Trajectory analysis reveals a reactive "execute-fail-fix" loop that succeeds for breaking tests but structurally cannot address stale or missing tests. TEBench is available at https://github.com/iSEngLab/TEBench with a leaderboard at https://tebench-leadership.vercel.app.
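
The identification F1 quoted above is presumably a set-overlap score: precision and recall of the test locations an agent flags against the developer-written ground truth, combined as their harmonic mean. A minimal sketch under that assumption, with illustrative names and values rather than the benchmark's actual scoring code:

    import java.util.HashSet;
    import java.util.Set;

    public class IdentificationF1 {

        // Precision = |predicted ∩ gold| / |predicted|, recall = |predicted ∩ gold| / |gold|,
        // and F1 is their harmonic mean. Test locations are represented as plain strings here.
        static double f1(Set<String> predicted, Set<String> gold) {
            Set<String> hits = new HashSet<>(predicted);
            hits.retainAll(gold);
            if (hits.isEmpty()) {
                return 0.0;
            }
            double precision = (double) hits.size() / predicted.size();
            double recall = (double) hits.size() / gold.size();
            return 2 * precision * recall / (precision + recall);
        }

        public static void main(String[] args) {
            // Illustrative values only: the agent flags two tests, one of which
            // matches the developer-written ground truth.
            Set<String> predicted = Set.of("OrderTest#totalEqualsSubtotal", "CartTest#emptyCart");
            Set<String> gold = Set.of("OrderTest#totalEqualsSubtotal", "OrderTest#totalIncludesTenPercentTax");
            System.out.printf("Identification F1 = %.2f%n", f1(predicted, gold)); // prints 0.50
        }
    }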
