HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
TLDR
The HEJ-Robust benchmark reveals that LLM-based program repair models lack robustness to minor syntactic variations, with performance drops of over 50%.
Key contributions
- Existing program repair benchmarks lack robustness evaluation for LLMs.
- Introduces HEJ-Robust, a new benchmark for LLM robustness in program repair.
- Built from HumanEval-Java-Bug using 8 semantics-preserving code transformations, yielding 1,450 transformed instances.
- Shows that the repair performance of five fine-tuned LLMs drops by over 50% under several transformations.
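To make "semantics-preserving transformation" concrete, here is a hypothetical sketch of the kind of rewrite such a benchmark might apply: the variant below renames variables and swaps a `for` loop for a `while` loop, changing the surface syntax while leaving behavior identical. The specific transformations used by HEJ-Robust are not listed in this summary; the method names and transformation choices here are illustrative assumptions.

```java
public class TransformDemo {
    // Canonical form: sum the positive elements with a for loop.
    static int sumPositive(int[] xs) {
        int s = 0;
        for (int i = 0; i < xs.length; i++) {
            if (xs[i] > 0) s += xs[i];
        }
        return s;
    }

    // Semantics-preserving variant (hypothetical transformation):
    // variables renamed and the for loop rewritten as a while loop.
    static int sumPositiveVariant(int[] values) {
        int total = 0;
        int idx = 0;
        while (idx < values.length) {
            if (values[idx] > 0) total += values[idx];
            idx++;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] sample = {3, -1, 4, -2};
        // Both forms compute the same result on every input.
        System.out.println(sumPositive(sample) == sumPositiveVariant(sample));
    }
}
```

A robust repair model should fix a bug in either form equally well; the paper's finding is that current fine-tuned LLMs often do not.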
Why it matters
Current LLM-based program repair models are not robust to minor syntactic variations common in real-world code. This benchmark reveals a critical flaw, pushing for the development of more robust models essential for practical, reliable automated software development.
Original Abstract
Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks. However, these benchmarks evaluate models on a single canonical form of buggy code and do not reflect the syntactic variations commonly observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness benchmark built from HumanEval-Java-Bug using eight semantics-preserving code transformations, resulting in 1,450 transformed instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations.