RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow
Mehedi Hasan Shanto, Muhammad Asaduzzaman, Alioune Ngom
TLDR
RAG-Reflect is an agentic LLM framework that uses retrieval and reflection to predict whether Stack Overflow comments trigger code edits, matching fine-tuned model performance.
Key contributions
- Introduces RAG-Reflect, an agentic LLM framework for Valid Comment-Edit Prediction (VCP).
- Integrates LLMs with retrieval-augmented reasoning and self-reflection mechanisms.
- Matches fine-tuned model performance on VCP (F1 = 0.78 vs. 0.773) without any task-specific training.
- Outperforms traditional baselines and prompting techniques on the SOUP benchmark.
Why it matters
This paper shows how an agentic AI approach can automate code maintenance by identifying which comments are actionable. Because RAG-Reflect matches fine-tuned models without any retraining, it is a practical option for large-scale code evolution.
Original Abstract
User comments on online programming platforms such as Stack Overflow play a vital role in maintaining the correctness and relevance of shared code examples. However, the majority of comments express gratitude or clarification, while only a small fraction highlight actionable issues that drive meaningful edits. This paper demonstrates how agentic AI principles can revolutionize software maintenance tasks by presenting RAG-Reflect, a modular framework that achieves fine-tuned-level performance for valid comment-edit prediction without task-specific training. Valid Comment-Edit Prediction (VCP) is the task of determining whether a user comment directly triggered a subsequent code edit. The framework integrates large language models (LLMs) with retrieval-augmented reasoning and self-reflection mechanisms. RAG-Reflect operates through a three-stage runtime workflow built on a one-time pattern analysis phase. During initialization, an Interpretation module analyzes the knowledge base to generate validation rules. At inference time, the system (1) retrieves contextual examples, (2) reasons about comment-edit causality, and (3) reflects on decisions using the pre-established rules. We evaluate RAG-Reflect on the publicly available SOUP benchmark, achieving Precision = 0.81, Recall = 0.74, and F1 = 0.78, outperforming traditional baselines (e.g., Logistic Regression, XGBoost, different prompting techniques) and closely approaching the performance of fine-tuned models (F1 = 0.773) without retraining. Our ablation and stage-level analyses show that both retrieval and reflection modules substantially enhance performance.
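To make the three-stage runtime workflow concrete, here is a minimal Python sketch of the retrieve → reason → reflect loop described in the abstract. This is not the authors' implementation: the `llm` callable, the token-overlap retriever (a stand-in for whatever retriever RAG-Reflect actually uses), the prompt wording, and the VALID/INVALID output format are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    comment: str        # Stack Overflow comment text
    edit_summary: str   # description of the subsequent code edit
    is_valid: bool      # gold label: did the comment trigger the edit?

def retrieve(query: str, kb: List[Example], k: int = 3) -> List[Example]:
    """Stage 1: return the k knowledge-base examples whose comments share
    the most tokens with the query (a toy stand-in for a real retriever)."""
    q = set(query.lower().split())
    scored = sorted(kb, reverse=True,
                    key=lambda e: len(q & set(e.comment.lower().split())))
    return scored[:k]

def reason(llm: Callable[[str], str], comment: str, edit: str,
           examples: List[Example]) -> str:
    """Stage 2: ask the LLM whether the comment caused the edit,
    grounding the prompt in the retrieved examples."""
    shots = "\n".join(
        f"- Comment: {e.comment!r} -> Edit: {e.edit_summary!r} "
        f"-> Valid: {e.is_valid}" for e in examples)
    prompt = (f"Similar labeled cases:\n{shots}\n\n"
              f"Comment: {comment!r}\nEdit: {edit!r}\n"
              "Did the comment directly trigger the edit? "
              "Answer VALID or INVALID with a one-sentence justification.")
    return llm(prompt)

def reflect(llm: Callable[[str], str], draft: str, rules: List[str]) -> str:
    """Stage 3: re-check the draft decision against the validation rules
    produced by the one-time Interpretation phase; return a final verdict."""
    rule_text = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(rules))
    prompt = (f"Validation rules:\n{rule_text}\n\nDraft decision:\n{draft}\n\n"
              "If the draft violates any rule, revise it; otherwise confirm. "
              "Final answer: VALID or INVALID.")
    return llm(prompt)

def rag_reflect(llm, comment, edit, kb, rules) -> bool:
    examples = retrieve(comment, kb)              # (1) retrieve context
    draft = reason(llm, comment, edit, examples)  # (2) reason about causality
    final = reflect(llm, draft, rules)            # (3) reflect using rules
    # "VALID" is a substring of "INVALID", so exclude the latter explicitly.
    return "VALID" in final.upper() and "INVALID" not in final.upper()
```

In this sketch the `rules` list is simply passed in by the caller; in the paper's framework those validation rules come from the one-time Interpretation phase that analyzes the knowledge base before inference.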