ArXiv TLDR

HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

arXiv:2605.02215

Fazle Rabbi, Jinqiu Yang

cs.SE

TLDR

The HEJ-Robust benchmark reveals that LLM-based program repair models lack robustness to minor syntactic variations, with performance dropping by over 50% under several transformations.

Key contributions

  • Existing program repair benchmarks lack robustness evaluation for LLMs.
  • Introduces HEJ-Robust, a new benchmark for LLM robustness in program repair.
  • Built from HumanEval-Java-Bug using eight semantics-preserving code transformations, yielding 1,450 transformed instances (see the illustrative Java sketch after this list).
  • Shows that the repair performance of five fine-tuned LLMs drops by over 50% on HEJ-Robust under several transformations.
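
To make the transformation idea concrete, here is an illustrative Java sketch, not drawn from the benchmark itself: a hypothetical buggy method in the HumanEval-Java style, followed by a semantics-preserving variant produced by two example transformations (identifier renaming and for-to-while rewriting). The method, the injected bug, and the choice of transformations are all assumptions for illustration; the paper's eight transformations are not enumerated in this summary.

```java
import java.util.Arrays;
import java.util.List;

public class SemanticsPreservingTransformExample {

    // Canonical buggy form (hypothetical, in the style of a HumanEval-Java task):
    // the loop bound "j <= numbers.size()" is an injected off-by-one bug that a
    // repair model would be asked to fix.
    static boolean hasCloseElements(List<Double> numbers, double threshold) {
        for (int i = 0; i < numbers.size(); i++) {
            for (int j = i + 1; j <= numbers.size(); j++) { // bug: should be j < numbers.size()
                if (Math.abs(numbers.get(i) - numbers.get(j)) < threshold) {
                    return true;
                }
            }
        }
        return false;
    }

    // The same buggy code after two example semantics-preserving transformations:
    // identifier renaming and for-to-while conversion. Behavior, including the bug,
    // is unchanged; only the surface syntax differs. (Renamed here only so both
    // versions compile side by side.)
    static boolean hasCloseElementsTransformed(List<Double> xs, double eps) {
        int a = 0;
        while (a < xs.size()) {
            int b = a + 1;
            while (b <= xs.size()) { // same bug, same required fix
                if (Math.abs(xs.get(a) - xs.get(b)) < eps) {
                    return true;
                }
                b++;
            }
            a++;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Double> sample = Arrays.asList(1.0, 1.1, 5.0);
        // Both forms behave identically on this input (the close pair 1.0/1.1 is
        // found before the buggy bound is reached), so this prints "true true".
        System.out.println(hasCloseElements(sample, 0.5) + " " + hasCloseElementsTransformed(sample, 0.5));
    }
}
```

Both forms require the same one-token fix, yet the paper reports that repair performance can drop by over 50% on such transformed inputs.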

Why it matters

Current LLM-based program repair models are not robust to minor syntactic variations common in real-world code. This benchmark exposes that critical weakness and motivates the development of more robust models, which are essential for practical, reliable automated software development.

Original Abstract

Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks. However, these benchmarks evaluate models on a single canonical form of buggy code and do not reflect the syntactic variations commonly observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness benchmark built from HumanEval-Java-Bug using eight semantics-preserving code transformations, resulting in 1,450 transformed instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations.
