Software Engineering

Papers on code generation, software testing, development tools, and AI for SE.

cs.SE · 497 papers

Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance

LLM agents iteratively audited prompt specifications in a multi-agent system (AEGIS), surfacing 51 consistency defects and demonstrating audit convergence.

2605.12280 · May 12, 2026 · Elias Calboreanu

Minimalistic Terminal Editor for Julia Programming -- MinTEJ: A Friendly Approach for a Scientific Programmer

MinTEJ is a new minimalistic terminal editor for Julia, unifying development tasks and reducing resource overhead for scientific programmers.

2605.12275 · May 12, 2026 · Poornachandratejasvi Laxman Bhattar, Payal V. Dahiwale, Krishnarjunulu Thota +1

Characterizing the Failure Modes of LLMs in Resolving Real-World GitHub Issues

This paper analyzes LLM failures in resolving GitHub issues, revealing strategy formulation as the most error-prone stage and localization as the least.

2605.12270 · May 12, 2026 · Yanjie Jiang, Yian Huang, Guancheng Wang +3

Uncertainty Quantification for LLM-based Code Generation

RisCoSet quantifies uncertainty in LLM code generation by creating risk-controlled prediction sets, significantly reducing incorrect code generation.

2605.12201 · May 12, 2026 · Senrong Xu, Yuhao Tan, Yanke Zhou +6

ReproBreak: A Dataset of Reproducible Web Locator Breaks

ReproBreak is a new dataset of 449 reproducible web locator breaks in Cypress and Playwright tests, addressing the lack of data for evaluating locator fragility.

2605.12158 · May 12, 2026 · Thiago Santos de Moura, Leon Adamietz, Samra Mehboob +1

CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research

CIDR is a new large-scale dataset of 2,440 proprietary industrial software repositories from 12 partners, designed for diverse software engineering research.

2605.12153 · May 12, 2026 · Vladislav Savenkov

HM-Req: A Framework for Embedding Values within CPS Human Monitoring Requirements

HM-Req is a framework using a Controlled Natural Language to embed human values into CPS monitoring requirements, aiding conflict detection.

2605.12100 · May 12, 2026 · Zoe Pfister, Ruth Breu, Michael Vierhauser

Property-Level Reconstructability of Agent Decisions: An Anchor-Level Pilot Across Vendor SDK Adapter Regimes

This paper pilots a method to assess the reconstructability of AI agent decisions across various vendor SDK regimes, finding significant variability.

2605.12078 · May 12, 2026 · Oleg Solozobov

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

StepCodeReasoner uses reinforcement learning to align code reasoning with stepwise execution traces, achieving state-of-the-art performance by supervising intermediate states.

2605.11922 · May 12, 2026 · Hao Wang, Rui Li, Lei Sha +1

An Extensive Replication Study of the ABLoTS Approach for Bug Localization

A replication study of ABLoTS for bug localization found its core component performs well but original results were irreproducible due to data leakage.

2605.11790 · May 12, 2026 · Feifei Niu, Enshuo Zhang, Christoph Mayr-Dorn +5

Breaking the Dependency Chaos: A Constraint-Driven Python Dependency Resolution Strategy with Selective LLM Imputation

SMT-LLM resolves Python dependency conflicts by combining formal constraint solving with selective LLM imputation, significantly outperforming prior LLM-only approaches.

2605.11772 · May 12, 2026 · Kowshik Chowdhury, Dipayan Banik, Shazibul Islam Shamim

A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE Seminar

This paper outlines a community-driven research agenda for agents and software engineering, covering six key thematic areas identified by experts.

2605.11720 · May 12, 2026 · Davide Taibi, Henry Muccini, Karthik Vaidhyanathan +15

Cochise: A Reference Harness for Autonomous Penetration Testing

Cochise is a minimal Python reference harness for LLM-driven autonomous penetration testing, providing reusable infrastructure for research and comparison.

2605.11671 · May 12, 2026 · Andreas Happe, Jürgen Cito

NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification

NeuroFlake is a neuro-symbolic LLM framework that uses discriminative token mining to accurately classify flaky tests, improving performance and robustness.

2605.11482 · May 12, 2026 · Khondaker Tasnia Hoque, Toukir Ahammed

Options, Not Clicks: Lattice Refinement for Consent-Driven MCP Authorization

Conleash is a client-side middleware that uses a risk lattice and policy engine to provide consent-driven, boundary-scoped authorization for MCP tool invocations.

2605.11360 · May 12, 2026 · Ying Li, Yanju Chen, Peiran Wang +4

Natural Language based Specification and Verification

This paper explores using LLMs to generate and verify code implementations based on natural language specifications, showing promising preliminary results.

2605.11315 · May 11, 2026 · Zhaorui Li, Chengyu Song

Using Logs to support Programming Education

This project proposes a code editor plugin to collect real-time student programming logs, providing educators with data-driven insights to improve learning.

2605.10920 · May 11, 2026 · Gilmar Gomes do Nascimento, Maria Claudia F. P Emer, Adolfo Gustavo Serra Seca Neto +1

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Shepherd is a functional programming model for meta-agents that uses a Git-like execution trace for fast state forking and replay.

2605.10913 · May 11, 2026 · Simon Yu, Derek Chong, Ananjan Nandi +4

CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits

CppPerf provides an automated pipeline and dataset of 347 real-world C++ performance-improving commits to benchmark and advance performance bug repair.

2605.10890 · May 11, 2026 · Tommy Ho, Khashayar Etemadi, Zhendong Su

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

BenchCAD is a new industry-standard benchmark for evaluating MLLMs on generating executable parametric CAD programs, revealing current models' limitations.

2605.10865 · May 11, 2026 · Haozhe Zhang, Kaichen Liu, Miaomiao Chen +4
