Software Engineering
Papers on code generation, software testing, development tools, and AI for SE.
cs.SE · 497 papers
Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance
LLM agents iteratively audited prompt specifications in a multi-agent system (AEGIS), surfacing 51 consistency defects and demonstrating audit convergence.
Minimalistic Terminal Editor for Julia Programming -- MinTEJ: A Friendly Approach for a Scientific Programmer
MinTEJ is a new minimalistic terminal editor for Julia, unifying development tasks and reducing resource overhead for scientific programmers.
Characterizing the Failure Modes of LLMs in Resolving Real-World GitHub Issues
This paper analyzes LLM failures in resolving GitHub issues, revealing strategy formulation as the most error-prone stage and localization as the least.
Uncertainty Quantification for LLM-based Code Generation
RisCoSet quantifies uncertainty in LLM code generation by creating risk-controlled prediction sets, significantly reducing incorrect code generation.
ReproBreak: A Dataset of Reproducible Web Locator Breaks
ReproBreak is a new dataset of 449 reproducible web locator breaks in Cypress and Playwright tests, addressing the lack of data for evaluating locator fragility.
CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research
CIDR is a new large-scale dataset of 2,440 proprietary industrial software repositories from 12 partners, designed for diverse software engineering research.
HM-Req: A Framework for Embedding Values within CPS Human Monitoring Requirements
HM-Req is a framework using a Controlled Natural Language to embed human values into CPS monitoring requirements, aiding conflict detection.
Property-Level Reconstructability of Agent Decisions: An Anchor-Level Pilot Across Vendor SDK Adapter Regimes
This paper pilots a method to assess the reconstructability of AI agent decisions across various vendor SDK regimes, finding significant variability.
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner uses RL to align code reasoning with stepwise execution traces, achieving SOTA performance by supervising intermediate states.
An Extensive Replication Study of the ABLoTS Approach for Bug Localization
A replication study of ABLoTS for bug localization found its core component performs well but original results were irreproducible due to data leakage.
Breaking the Dependency Chaos: A Constraint-Driven Python Dependency Resolution Strategy with Selective LLM Imputation
SMT-LLM resolves Python dependency conflicts by combining formal constraint solving with selective LLM imputation, significantly outperforming prior LLM-only approaches.
A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE Seminar
This paper outlines a community-driven research agenda for agents and software engineering, covering six key thematic areas identified by experts.
Cochise: A Reference Harness for Autonomous Penetration Testing
Cochise is a minimal Python reference harness for LLM-driven autonomous penetration testing, providing reusable infrastructure for research and comparison.
NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification
NeuroFlake is a neuro-symbolic LLM framework that uses discriminative token mining to accurately classify flaky tests, improving performance and robustness.
Options, Not Clicks: Lattice Refinement for Consent-Driven MCP Authorization
Conleash is a client-side middleware that uses a risk lattice and policy engine to provide consent-driven, boundary-scoped authorization for MCP tool invocations.
Natural Language based Specification and Verification
This paper explores using LLMs to generate and verify code implementations based on natural language specifications, showing promising preliminary results.
Using Logs to support Programming Education
This project proposes a code editor plugin to collect real-time student programming logs, providing educators with data-driven insights to improve learning.
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a functional programming model for meta-agents that uses a Git-like execution trace for fast state forking and replay.
CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits
CppPerf provides an automated pipeline and dataset of 347 real-world C++ performance-improving commits to benchmark and advance performance bug repair.
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
BenchCAD is a new industry-standard benchmark for evaluating MLLMs on generating executable parametric CAD programs, revealing current models' limitations.