ArXiv TLDR
โ† All categories

Software Engineering

Papers on code generation, software testing, development tools, and AI for SE.

cs.SE · 497 papers

Securing the Dark Matter: A Semantic-Enhanced Neuro-Symbolic Framework for Supply Chain Analysis of Opaque Industrial Software

This paper introduces a neuro-symbolic framework that analyzes opaque industrial software binaries to detect vulnerabilities and supply chain risks.

2605.07737 · May 8, 2026 · Bowei Ning, Xuejun Zong, Lian Lian +4

SARC: A Governance-by-Architecture Framework for Agentic AI Systems

SARC is a runtime governance framework enforcing hard constraints in agentic AI systems for safer, auditable execution.

2605.07728 · May 8, 2026 · Gaston Besanson

The AI-Native Large-Scale Agile Software Development Manifesto

This paper introduces an AI-Native Large-Scale Agile Software Development Manifesto to redefine large-scale agile using AI as a first-class participant.

2605.07717 · May 8, 2026 · Ricardo Britto, Fredrik Palmgren, Nishrith Saini +1

SafeTune: Search-based Harmfulness Minimisation for Large Language Models

SafeTune is a search-based method that tunes hyperparameters and prompts to significantly reduce harmfulness while increasing relevance in LLM responses.

2605.07709 · May 8, 2026 · Giordano d'Aloisio, David Williams, Giusy Annunziata +3

Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel

This paper characterizes false-positive bug reports in the Linux kernel, showing that such reports waste significant developer effort, and proposes LLM-based mitigation.

2605.07678 · May 8, 2026 · Jiashuo Tian, Dong Wang, Chen Yang +3

System Test Generation for Virtual Reality Applications using Scenario Models

UltraInstinctVR automates system test generation for VR applications using scenario models, outperforming existing tools in bug detection.

2605.07534 · May 8, 2026 · Gerry Longfils, Maxime Cauz, Arnaud Blouin +1

Search-based Robustness Testing of Laptop Refurbishing Robotic Software

PROBE is a search-based method for robustness testing of object detection models in laptop refurbishing robots, significantly outperforming random search.

2605.07530 · May 8, 2026 · Erblin Isaku, Hassan Sartaj, Shaukat Ali +2

Can LLMs Solve Science or Just Write Code? Evaluating Quantum Solver Generation

Q-SAGE evaluates LLMs for quantum solver generation, showing iterative refinement improves success but reveals numerical accuracy as a key limitation.

2605.07525 · May 8, 2026 · Luciano Baresi, Domenico Bianculli, Maryse Ernzer +3

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

MASPrism uses SLM prefill-stage signals for lightweight, fast, and accurate failure attribution in multi-agent systems, outperforming larger LLMs.

2605.07509 · May 8, 2026 · Yang Liu, Hongjiang Feng, Junsong Pu +1

Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study

This study evaluates prompt engineering strategies for LLM-based qualitative coding of psychological safety, finding that multi-shot prompting improves Claude Haiku's performance.

2605.07422 · May 8, 2026 · Moaath Alshaikh, Tasneem Alshaher, Ricardo Vieira +7

Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair

A multi-stage LLM training framework with iterative error repair significantly improves Java-to-Cangjie code translation, boosting functional equivalence.

2605.07403 · May 8, 2026 · Xinyue Liang, Jingxuan Zhang, Lin Li +2

Exploring CoCo Challenges in ML Engineering Teams: Insights From the Semiconductor Industry

This paper explores collaboration and communication challenges in ML engineering teams within the semiconductor industry, identifying 16 issues.

2605.07389 · May 8, 2026 · A. Azamnouri, M. Haug, L. Woltmann +3

To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study

An empirical study finds that agent-generated code requires less frequent maintenance than human-written code, and that its maintenance consists mainly of human-made feature extensions rather than bug fixes.

2605.06464 · May 7, 2026 · Shota Sawada, Tatsuya Shirai, Yutaro Kashiwa +3

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

LLM agents struggle significantly with structural constraints in backend code generation, showing "constraint decay" as requirements accumulate.

2605.06445 · May 7, 2026 · Francesco Dente, Dario Satriani, Paolo Papotti

From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

Execution lineage introduces a DAG-based model for AI-native workflows, ensuring reproducible and maintainable work by explicitly managing dependencies and state.

2605.06365 · May 7, 2026 · Josh Rosen, Seth Rosen

Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions

LLMs frequently specify vulnerable and incompatible third-party library versions, a systemic issue that external constraints can mitigate.

2605.06279 · May 7, 2026 · Chengjie Wang, Jingzheng Wu, Xiang Ling +2

SiblingRepair: Sibling-Based Multi-Hunk Repair with Large Language Models

SiblingRepair uses LLMs for multi-hunk program repair, outperforming SOTA by improving sibling detection and generating consistent patches across related code.

2605.06209 · May 7, 2026 · Xinyu Liu, Jiayu Ren, Yusen Wang +3

Teaching LLMs Program Semantics via Symbolic Execution Traces

This paper improves LLMs' grasp of program semantics by training them on symbolic execution traces, boosting bug detection by over 17%.

2605.06184 · May 7, 2026 · Jonas Bayer, Stefan Zetzsche, Olivier Bouissou +3

Modeling Dependency-Propagated Ecosystem Impact of Changes in Maintenance Activities: Evaluating Support Strategies in the PyPI Network

A new model quantifies dependency-propagated ecosystem impact in PyPI to prioritize maintenance support, finding that 0.1% of packages cause 80% of total impact.

2605.06164 · May 7, 2026 · Alexandros Tsakpinis, Emil Schwenger, Alexander Pretschner

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

LLM safety judges are unreliable: their verdicts depend on policy wording rather than solely on agent behavior, leading to flawed safety evaluations.

2605.06161 · May 7, 2026 · Shihao Weng, Yang Feng, Xiaofei Xie

Page 4 of 25
