DebugRepair: Enhancing LLM-Based Automated Program Repair via Self-Directed Debugging
Linhao Wu, Yifei Pei, Zhen Yang, Kainan Li, Zhonghang Lu, and 7 more authors
TLDR
DebugRepair enhances LLM-based automated program repair by using self-directed debugging to collect intermediate runtime evidence, significantly improving bug-fixing performance.
Key contributions
- Introduces DebugRepair, an LLM-based APR framework using self-directed debugging for enhanced patch refinement.
- Collects crucial intermediate runtime evidence via simulated instrumentation and targeted debugging statements (see the sketch after this list).
- Utilizes test semantic purification and debugging-driven conversational repair for effective patch refinement.
- Achieves state-of-the-art results, fixing 224 bugs with GPT-3.5 and 295 with DeepSeek-V3 on Defects4J.
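To make the "simulated instrumentation" idea concrete, here is a minimal, hypothetical Python sketch (the paper evaluates both Java and Python): debugging statements are injected around the suspicious statement so that the failing test exposes intermediate runtime values, such as the computed index, rather than only the final assertion error. The function names and the injection format are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: inject debugging statements so a failing test
# reveals intermediate runtime state, not just the outcome-level error.

def buggy_median(values):
    # Suspicious statement: the selected index is off by one for odd-length lists.
    mid = len(values) // 2
    return sorted(values)[mid - 1]


def instrumented_median(values):
    # Same logic with targeted debugging statements added (what the paper
    # calls "simulated instrumentation"; the exact format is an assumption).
    sorted_vals = sorted(values)
    mid = len(values) // 2
    print(f"[DEBUG] sorted={sorted_vals} len={len(values)} mid={mid}")
    result = sorted_vals[mid - 1]
    print(f"[DEBUG] picked index={mid - 1} -> result={result}")
    return result


if __name__ == "__main__":
    # The failing test now prints the intermediate state pointing to the wrong
    # index, giving the LLM runtime evidence for root-cause analysis.
    assert instrumented_median([3, 1, 2]) == 2, "median of [1, 2, 3] should be 2"
```

Running the test against the instrumented version still fails, but the printed intermediate values (the sorted list, the midpoint, the chosen index) expose why it fails, which is exactly the kind of evidence a stack trace alone does not provide.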
Why it matters
Existing LLM-based APR methods struggle with root-cause analysis because they rely on outcome-level failure symptoms and lack intermediate runtime evidence. DebugRepair addresses this with a framework that collects critical runtime states through simulated debugging, leading to significantly more accurate fixes and state-of-the-art results on standard APR benchmarks such as Defects4J.
Original Abstract
Automated Program Repair (APR) has benefited from the code understanding and generation capabilities of Large Language Models (LLMs). Existing feedback-based APR methods iteratively refine candidate patches using test execution feedback and have shown promising results. However, most rely on outcome-level failure symptoms, such as stack traces, which show how failures are observed but fail to expose the intermediate runtime states critical for root-cause analysis. As a result, LLMs often infer bug causes without sufficient runtime evidence, leading to incorrect patches. To address this limitation, we propose DebugRepair, a self-directed debugging framework for LLM-based APR. DebugRepair enhances patch refinement with intermediate runtime evidence collected through simulated debugging. It consists of three components: test semantic purification, simulated instrumentation, and debugging-driven conversational repair. Together, they reduce noisy test context, collect runtime traces through targeted debugging statements with rule-based fallback, and progressively refine candidate patches using prior attempts and newly observed runtime states. We evaluate DebugRepair on three benchmarks across Java and Python. Experiments show that DebugRepair achieves state-of-the-art performance against 15 approaches. With GPT-3.5, it correctly fixes 224 bugs on Defects4J, outperforming prior SOTA LLM-based methods by 26.2%. With DeepSeek-V3, it correctly fixes 295 Defects4J bugs, surpassing the second-best baseline by 59 bugs. Across five additional backbone LLMs, DebugRepair improves repair performance by 51.3% over vanilla settings. Ablation studies further confirm the effectiveness of all components.
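As a rough illustration of the debugging-driven conversational repair described in the abstract, the sketch below shows the iterative refinement loop in Python. The helper names (`generate_patch`, `run_tests`, `collect_debug_trace`), the prompt wording, and the round budget are assumptions standing in for an LLM backend, a test harness, and the instrumentation step; the paper's actual interfaces may differ.

```python
# Hypothetical sketch of a debugging-driven conversational repair loop.
# The three helpers below are placeholders, not the paper's actual APIs.

def generate_patch(conversation: list[dict]) -> str:
    raise NotImplementedError("plug in an LLM backend here")

def run_tests(patch: str, failing_test: str) -> tuple[bool, str]:
    raise NotImplementedError("plug in a test harness here")

def collect_debug_trace(patch: str, failing_test: str) -> str:
    raise NotImplementedError("plug in the instrumentation/trace collector here")


def repair(buggy_code: str, failing_test: str, max_rounds: int = 5) -> str | None:
    """Iteratively refine candidate patches using prior attempts and runtime evidence."""
    conversation = [{
        "role": "user",
        "content": f"Fix this bug:\n{buggy_code}\n\nFailing test:\n{failing_test}",
    }]
    for _ in range(max_rounds):
        patch = generate_patch(conversation)              # candidate patch from the LLM
        passed, failure = run_tests(patch, failing_test)
        if passed:
            return patch                                   # plausible patch found
        trace = collect_debug_trace(patch, failing_test)   # intermediate runtime states
        # Feed back the prior attempt plus newly observed runtime evidence,
        # so the next round refines rather than guesses.
        conversation.append({"role": "assistant", "content": patch})
        conversation.append({
            "role": "user",
            "content": (f"The patch still fails:\n{failure}\n\n"
                        f"Runtime trace:\n{trace}\n\nPlease refine the patch."),
        })
    return None  # no plausible patch within the round budget
```

For brevity, this sketch omits two components the abstract describes: test semantic purification, which trims noisy test context before prompting, and the rule-based fallback used when targeted debugging statements cannot be applied.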