ArXiv TLDR

Software Engineering

Papers on code generation, software testing, development tools, and AI for SE.

cs.SE · 497 papers

Neurosymbolic Auditing of Natural-Language Software Requirements

A neurosymbolic approach combines LLMs with SMT solvers to audit natural-language software requirements, detecting ambiguities and inconsistencies (illustrative sketch below).

2605.13817 · May 13, 2026 · Bethel Hall, William Eiers
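
A minimal sketch of the symbolic half of such a pipeline, assuming an LLM has already translated two requirements into arithmetic constraints (the variable name and bounds below are invented for illustration); Z3's Python bindings then decide whether the requirements can hold at the same time:

    # Hypothetical constraints an LLM might extract from two requirements;
    # Z3 (pip install z3-solver) checks whether they are jointly satisfiable.
    from z3 import Int, Solver, unsat

    timeout = Int("response_timeout_ms")

    s = Solver()
    s.add(timeout <= 200)  # "The system shall respond within 200 ms."
    s.add(timeout >= 500)  # "Under load, responses take at least 500 ms."

    # unsat means no response time satisfies both requirements at once,
    # flagging the pair as inconsistent.
    print("inconsistent" if s.check() == unsat else "consistent")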

(How) Do Large Language Models Understand High-Level Message Sequence Charts?

LLMs show only a modest understanding (52% accuracy) of High-Level Message Sequence Charts' formal semantics, struggling with complex reasoning tasks.

2605.13773 · May 13, 2026 · Mohammad Reza Mousavi

Learning Responsibility-Attributed Adversarial Scenarios for Testing Autonomous Vehicles

CARS generates responsibility-attributed adversarial scenarios for autonomous vehicles, distinguishing system failures from unavoidable traffic conflicts for better safety assurance.

2605.13751 · May 13, 2026 · Yizhuo Xiao, Haotian Yan, Ying Wang +5

SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

SkillOps is a framework that manages LLM agent skill libraries, reducing "skill technical debt" and improving performance with minimal overhead.

2605.13716 · May 13, 2026 · Hongji Pu, Xinyuan Song, Liang Zhao

Scalable Deductive Verification of Data-Level Parallel Programs

This paper introduces techniques that substantially improve scalability and reduce verification time for the deductive verification of data-level parallel programs.

2605.13616 · May 13, 2026 · Lars B. van den Haak, Anton Wijs, Marieke Huisman

Integration of an Agent Model into an Open Simulation Architecture for Scenario-Based Testing of Automated Vehicles

This paper introduces a standardized architecture for integrating traffic agent models across diverse simulation environments for automated vehicle testing.

2605.13539 · May 13, 2026 · Christian Geller, Daniel Becker, Jobst Beckmann +1

SieveFL: Hierarchical Runtime-Aware Pruning for Scalable LLM-Based Fault Localization

SieveFL is a hierarchical framework that uses aggressive pre-LLM filtering and runtime-aware pruning to enable scalable, accurate fault localization with commodity LLMs (filtering stage sketched below).

2605.13491 · May 13, 2026 · Mahdi Farzandway, Fatemeh Ghassemi
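
One way to picture the pre-LLM filtering stage (a hedged sketch over assumed coverage data, not SieveFL's actual pipeline): score program elements with a classic spectrum-based suspiciousness formula and hand only the top-k candidates to the LLM.

    # Spectrum-based pre-filter: rank functions by the Ochiai suspiciousness
    # score so an LLM never has to inspect the full codebase.
    import math

    def ochiai(failed_cov, passed_cov, total_failed):
        # failed_cov / passed_cov: executions of this element in failing / passing tests
        denom = math.sqrt(total_failed * (failed_cov + passed_cov))
        return failed_cov / denom if denom else 0.0

    coverage = {  # element -> (runs in failing tests, runs in passing tests); made-up data
        "parse()": (3, 1),
        "render()": (0, 4),
        "save()": (1, 3),
    }
    total_failed = 3

    scores = {fn: ochiai(f, p, total_failed) for fn, (f, p) in coverage.items()}
    top_k = sorted(scores, key=scores.get, reverse=True)[:2]
    print(top_k)  # only these candidates would be passed to the LLM stage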

AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents

Proposes AI Harness Engineering, a runtime substrate that makes foundation-model software agents reliable by mediating their interaction with projects.

2605.13357 · May 13, 2026 · Hailin Zhong, Shengxin Zhu

The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code

LLMs generate code with readability comparable to human code but distinct issue patterns, with prompt design having limited impact.

2605.13280 · May 13, 2026 · Hengzhi Ye, Fengyuan Ran, Weiwei Xu +1

Robust Mutation Analysis of Quantum Programs Under Noise

This paper empirically studies noise-aware mutation analysis for quantum programs, showing noise significantly impacts mutant detection.

2605.13279 · May 13, 2026 · Sophie Fortz, Eñaut Mendiluze Usandizaga, Shaukat Ali +2

ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment

ReproScore is a new framework that separates software reproducibility readiness from execution outcome, improving assessment for digital libraries.

2605.13275 · May 13, 2026 · Sheeba Samuel, Daniel Mietchen, Jungsan Kim +2

Automatic Detection of Reference Counting Bugs in Linux Kernel Drivers

DrvHorn automatically detects reference counting bugs in Linux kernel drivers, finding 545 bugs (424 new) with a low false-positive rate (bug pattern illustrated below).

2605.13246 · May 13, 2026 · Joe Hattori, Naoki Kobayashi, Ken Sakayori
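
The bug class itself is easy to picture; here is a toy Python model of the pattern (DrvHorn analyzes C kernel code, so this is only an analogy): a get without a matching put on an early-return path leaks a reference.

    # Toy refcount model: an error path that returns before the matching
    # put() leaves the count permanently elevated, i.e. a leaked reference.
    class Device:
        def __init__(self):
            self.refcount = 0

        def get(self):
            self.refcount += 1
            return self

        def put(self):
            self.refcount -= 1

    def probe(dev, config_ok):
        dev.get()
        if not config_ok:
            return -1  # BUG: early return skips dev.put()
        dev.put()
        return 0

    dev = Device()
    probe(dev, config_ok=False)
    print(dev.refcount)  # 1: the leaked reference a checker would flag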

Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization

CTO enhances LLM code translation with syntax-guided and semantic-aware preference optimization, outperforming baselines (toy preference-pair example below).

2605.13229 · May 13, 2026 · Yuhan Wu, Huan Zhang, Wei Cheng +3
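
A toy picture of how syntax-guided preference pairs could be assembled (an assumption about the general idea, not CTO's actual method): a candidate translation that fails to parse is ranked below one that does, yielding a preferred/rejected pair for optimization.

    # Build a preference pair from two candidate Python translations:
    # the parseable candidate is preferred over the one with a syntax error.
    import ast

    candidates = [
        "def add(a, b) return a + b",   # syntax error
        "def add(a, b): return a + b",  # parses
    ]

    def parses(src):
        try:
            ast.parse(src)
            return True
        except SyntaxError:
            return False

    ranked = sorted(candidates, key=parses, reverse=True)
    preferred, rejected = ranked[0], ranked[-1]
    print("preferred:", preferred)
    print("rejected: ", rejected)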

UIBenchKit: A unified toolkit for design-to-code model evaluation

UIBenchKit is an open-source toolkit that unifies the evaluation of design-to-code models, simplifying comparisons and accelerating research.

2605.13141 · May 13, 2026 · Chinh T. Le, Trevor Ong Yee Siang, Jingyu Xiao +2

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

SWE-Cycle introduces a new benchmark and SWE-Judge evaluation system to accurately assess autonomous code agents across the complete software issue resolution cycle.

2605.13139 · May 13, 2026 · Hao Guan, Lingyue Fu, Shao Zhang +8

Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

This study finds that code language models struggle to detect vulnerability-fixing commits without commit messages, indicating they lack transferable security understanding from code changes alone.

2605.13138 · May 13, 2026 · Nils Loose, Joseph Bienhüls, Kristoffer Hempel +2

Security Incentivization: An Empirical Study of how Micropayments Impact Code Security

This study shows that team-level incentives tied to automated security metrics significantly improve code security in development teams.

2605.13100 · May 13, 2026 · Stefan Rass, Martin Pinzger, Rainer W. Alexandrowicz +5

TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints

TruncProof enables LLMs to generate grammatically valid JSON while strictly respecting predefined token-length constraints (idea sketched below).

2605.13076 · May 13, 2026 · Yoshio Kato, Shuhei Tarashima
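
A minimal sketch of the underlying idea (with an invented one-token-per-fragment cost model, not TruncProof's algorithm): always reserve enough budget for the closing tokens still owed, so the output is valid JSON even when generation stops early.

    # Budget-safe JSON emission: stop adding content once the remaining
    # budget only covers the closers we still owe, keeping output parseable.
    def budget_safe_json(items, budget):
        out = ["{", '"items":', "["]   # fragments emitted so far
        spent = len(out)               # crude cost model: one token per fragment
        closers = ["]", "}"]           # tokens we must still emit
        for i, item in enumerate(items):
            frag = ('"%s"' % item) if i == 0 else (',"%s"' % item)
            if spent + 1 + len(closers) > budget:
                break                  # no room left once closers are reserved
            out.append(frag)
            spent += 1
        out.extend(closers)
        return "".join(out)

    print(budget_safe_json(["a", "b", "c", "d"], budget=7))  # {"items":["a","b"]}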

Protocol-Driven Development: Governing Generated Software Through Invariants and Evidence

Protocol-Driven Development (PDD) governs generated software through machine-enforceable protocols, invariants, and verifiable evidence chains (minimal sketch below).

2605.12981 · May 13, 2026 · Jun He, Deying Yu
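
A minimal sketch of what such a governance loop could look like (both invariants below are invented examples, not the paper's protocols): generated code is accepted only if every declared invariant yields passing evidence, and the per-check results form the evidence chain.

    # Accept generated code only when all machine-checkable invariants pass;
    # the (name, passed) pairs serve as a simple evidence chain.
    invariants = {
        "no_todo_markers": lambda src: "TODO" not in src,
        "has_docstring": lambda src: src.lstrip().startswith('"""'),
    }

    def audit(src):
        evidence = [(name, check(src)) for name, check in invariants.items()]
        return all(passed for _, passed in evidence), evidence

    ok, chain = audit('"""Add two numbers."""\ndef add(a, b):\n    return a + b\n')
    print(ok, chain)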

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

AgentLens reveals the 'Lucky Pass' problem in SWE-agent evaluation, introducing a process-level framework to assess trajectory quality beyond simple pass/fail.

2605.12925 · May 13, 2026 · Priyam Sahoo, Gaurav Mittal, Xiaomin Li +4