ArXiv TLDR

Reproducible Automated Program Repair Is Hard -- Experiences With the Defects4J Dataset

2604.26674

Adam Krafczyk, Klaus Schmid

cs.SE

TLDR

This paper reveals that 21.6% of the defects in the widely used Defects4J dataset are unsuitable for reproducible automated program repair evaluation, and a further 7.1% have obviously under-specified test suites.

Key contributions

  • Identifies 180 (21.6%) of the 835 Defects4J defects as unsuitable for reproducible APR evaluation due to test suite issues.
  • Highlights that a further 59 (7.1%) defects have under-specified test suites: deleting a single statement makes all test cases pass, even though the human-written patch does more than delete code (see the sketch after this list).
  • Proposes new requirements for APR defect datasets that go beyond traditional defect reproducibility.
  • Introduces an evaluation framework for Java APR tools with stricter test suite validation, e.g. checks for flaky tests.
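The single-statement-deletion check lends itself to automation on top of the Defects4J command-line interface (`defects4j checkout`, `compile`, and `test` are its actual subcommands). The sketch below is only an illustration of that idea, not the paper's implementation: generating the deletion mutants (e.g. with a Java parser) is left outside the sketch, and the parsing of the CLI's failing-test output is an assumption about its exact format.

```python
import subprocess

def run(cmd, cwd=None):
    """Run a command and capture its exit code and stdout."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return proc.returncode, proc.stdout

def failing_tests(workdir):
    """Run the test suite with `defects4j test` and collect failing test
    names from the '  - <class>::<method>' lines it prints.
    (The exact output format is an assumption of this sketch.)"""
    _, out = run(["defects4j", "test"], cwd=workdir)
    return {line.strip()[2:].strip() for line in out.splitlines()
            if line.strip().startswith("- ")}

def is_underspecified(mutant_dirs):
    """Return True if some single-statement-deletion mutant passes the
    entire test suite. Each entry in `mutant_dirs` is a checkout of the
    buggy version with one statement deleted; producing these mutants
    is outside the scope of this sketch."""
    for mutant in mutant_dirs:
        code, _ = run(["defects4j", "compile"], cwd=mutant)
        if code != 0:
            continue  # the deletion broke compilation; skip this mutant
        if not failing_tests(mutant):
            return True  # every test passes despite the deletion
    return False
```

If any such mutant passes cleanly while the human patch does more than delete code, the test suite cannot distinguish a plausible-looking non-fix from a real repair.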

Why it matters

Reproducibility is paramount for evaluating automated program repair (APR) tools. This work uncovers critical flaws in the Defects4J dataset that may have gone unnoticed in, and potentially skewed, prior APR studies. It also contributes an evaluation framework to make future evaluations more robust and reliable.

Original Abstract

In the research of automated program repair (APR), benchmark datasets consisting of known defects in combination with test suites that indicate the defects are of high importance. They allow for an evidence-based comparison of different APR approaches. In our own work on APR we found significant challenges when working with widely used defect datasets, which go beyond mere repeatability of defects via test cases. We summarize these identified challenges and related lessons learned to bring them to the attention of the APR community and quantify the potential impact of them. In particular, we investigate the widely used benchmark Defects4J, which has according to Google Scholar over 1,800 citations. It consists of 835 defects from 17 open-source Java projects; a hand-curated collection of defects, test suites that clearly indicate the defect, and human patches where any unrelated changes are removed. We find that, when executing the test suites with strict requirements for reproducibility in APR settings (beyond merely reproducing the defect via test cases), 180 (21.6 %) of the defects are not suitable for evaluation experiments. Further, we find that an additional 59 (7.1 %) defects have test suites that are obviously under-specified, as deleting a single statement from the code base makes all test cases pass, although the human-written patch does not only delete code. Our contributions are: a systematic collection of requirements for defect datasets for APR beyond traditional reproducibility of defects, a description of practical experiences and quantitative analysis of problems with the Defects4J dataset, as well as an implementation of an evaluation framework for APR tools for Java programs. This evaluation framework does stricter checking for indications of inadequate test suites, to avoid otherwise unnoticed problems in the test suite, such as flaky tests.
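The flaky tests mentioned at the end of the abstract suggest a complementary reproducibility probe: check out the unmodified buggy version, run its test suite several times, and accept the defect only if every run fails with exactly the same tests. Below is a minimal sketch under the same assumptions as the earlier one (real `defects4j` subcommands, assumed output format), reusing `failing_tests` from that sketch; it is not the authors' tooling.

```python
import subprocess

def is_deterministic(project, bug_id, workdir, runs=5):
    """Check out and compile the buggy version of a defect, then run its
    test suite `runs` times. The defect only supports reproducible APR
    evaluation if every run reports the same set of failing tests."""
    subprocess.run(["defects4j", "checkout", "-p", project,
                    "-v", f"{bug_id}b", "-w", workdir], check=True)
    subprocess.run(["defects4j", "compile"], cwd=workdir, check=True)
    # failing_tests is defined in the earlier sketch.
    outcomes = {frozenset(failing_tests(workdir)) for _ in range(runs)}
    return len(outcomes) == 1  # more than one outcome means flaky tests

# Example (hypothetical defect and path):
# is_deterministic("Lang", 1, "/tmp/lang_1_buggy")
```

Repeated execution is a blunt but cheap filter: it cannot prove determinism, but a single disagreement between runs is enough to disqualify a defect from reproducible evaluation.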
