Software Engineering
Papers on code generation, software testing, development tools, and AI for SE.
cs.SE · 495 papers
Neurosymbolic Auditing of Natural-Language Software Requirements
A neurosymbolic approach combining LLMs and SMT solvers audits natural-language software requirements, detecting ambiguities and inconsistencies.
(How) Do Large Language Models Understand High-Level Message Sequence Charts?
LLMs show only a modest understanding (52% accuracy) of High-Level Message Sequence Charts' formal semantics, struggling with complex reasoning tasks.
Learning Responsibility-Attributed Adversarial Scenarios for Testing Autonomous Vehicles
CARS generates responsibility-attributed adversarial scenarios for autonomous vehicles, distinguishing system failures from unavoidable traffic conflicts for better safety assurance.
SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
SkillOps is a framework that manages LLM agent skill libraries, reducing "skill technical debt" and improving performance with minimal overhead.
Scalable Deductive Verification of Data-Level Parallel Programs
This paper introduces techniques that significantly improve the scalability of deductive verification for data-level parallel programs and reduce verification time.
Integration of an Agent Model into an Open Simulation Architecture for Scenario-Based Testing of Automated Vehicles
This paper introduces a standardized architecture for integrating traffic agent models across diverse simulation environments for automated vehicle testing.
SieveFL: Hierarchical Runtime-Aware Pruning for Scalable LLM-Based Fault Localization
SieveFL is a hierarchical framework that uses aggressive pre-LLM filtering and runtime-aware pruning to enable scalable and accurate fault localization with commodity LLMs.
AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents
This paper proposes AI Harness Engineering, a runtime substrate that makes foundation-model software agents reliable by mediating their interaction with projects.
The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
LLMs generate code with readability comparable to human code but distinct issue patterns, with prompt design having limited impact.
ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment
ReproScore is a new framework that separates software reproducibility readiness from execution outcome, improving assessment for digital libraries.
Automatic Detection of Reference Counting Bugs in Linux Kernel Drivers
DrvHorn automatically detects reference counting bugs in Linux kernel drivers, finding 545 bugs (424 new) with a low false positive rate.
Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization
CTO enhances LLM code translation using syntax-guided and semantic-aware preference optimization, outperforming baselines.
UIBenchKit: A unified toolkit for design-to-code model evaluation
UIBenchKit is an open-source toolkit that unifies the evaluation of design-to-code models, simplifying comparisons and accelerating research.
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
SWE-Cycle introduces a new benchmark and the SWE-Judge evaluation system to accurately assess autonomous code agents across the complete software issue resolution cycle.
Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study
This study finds that code language models struggle to detect vulnerability-fixing commits without commit messages, suggesting they lack transferable security understanding from code changes alone.
Security Incentivization: An Empirical Study of how Micropayments Impact Code Security
This study shows that team-level incentives tied to automated security metrics significantly improve code security in development teams.
TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints
TruncProof enables LLMs to generate grammatically valid JSON outputs while strictly adhering to predefined token length constraints.
AgentLens: Revealing the Lucky Pass Problem in SWE-Agent Evaluation
AgentLens reveals the 'Lucky Pass' problem in SWE-agent evaluation, introducing a process-level framework to assess trajectory quality beyond simple pass/fail.
Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance
LLM agents iteratively audited prompt specifications in a multi-agent system (AEGIS), surfacing 51 consistency defects and demonstrating audit convergence.
Minimalistic Terminal Editor for Julia Programming -- MinTEJ: A Friendly Approach for a Scientific Programmer
MinTEJ is a new minimalistic terminal editor for Julia, unifying development tasks and reducing resource overhead for scientific programmers.