ArXiv TLDR

Software Engineering

Papers on code generation, software testing, development tools, and AI for SE.

cs.SE · 497 papers

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

ComplexMCP is a new benchmark evaluating LLM agents in dynamic, interdependent, and large-scale tool environments, revealing significant performance gaps.

2605.10787 · May 11, 2026 · Yuanyang Li, Xue Yang, Longyue Wang +2

AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification

AutoSOUP automates component-level memory-safety verification using Safety-Oriented Unit Proofs and a hybrid LLM-as-function-call architecture.

2605.10712 · May 11, 2026 · Paschal C. Amusuo, Ricardo Calvo, Dharun Anandayuvaraj +5

ChatGPT: Friend or Foe When Comprehending and Changing Unfamiliar Code

This study examines AI's impact on developers' cognitive processes when comprehending and changing unfamiliar code, revealing mixed effects on problem-solving.

2605.10702 · May 11, 2026 · Norman Anderson, Tarek Alakmeh, Victoria Jackson +7

Step Rejection Fine-Tuning: A Practical Distillation Recipe

Step Rejection Fine-Tuning (SRFT) improves LLM agent training by leveraging partially correct, unresolved trajectories, outperforming standard RFT.

2605.10674 · May 11, 2026 · Igor Slinko, Ilia Zavidnyi, Egor Bogomolov +1

CrackMeBench: Binary Reverse Engineering for Agents

CrackMeBench is a new benchmark for evaluating language models on binary reverse engineering tasks, focusing on recovering validation logic from executables.

2605.10597 · May 11, 2026 · Isaac David, Arthur Gervais

LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation

LLARS is an open-source platform enabling domain experts and developers to collaboratively engineer, generate, and evaluate LLM outputs efficiently.

2605.10593 · May 11, 2026 · Philipp Steigerwald, Mara Stieler, Jennifer Burghardt +2

Correct-by-Construction G-Code Generation: A Neuro-Symbolic Approach via Separation Logic

This paper introduces a neuro-symbolic framework for correct-by-construction G-code generation, using a neural generator and a logic verifier for self-correction.

2605.10568 · May 11, 2026 · Yeonseok Lee

Separation Logic for Verifying Physical Collisions of CNC Programs

This paper introduces a formal verification framework using Separation Logic to prevent physical collisions in CNC programs by treating workspace occupancy as a logical resource.

2605.10437 · May 11, 2026 · Yeonseok Lee

VISOR: A Vision-Language Model-based Test Oracle for Testing Robots

VISOR is a VLM-based test oracle that automates robot task assessment, replacing manual evaluation and quantifying task correctness and quality.

2605.10408 · May 11, 2026 · Prasun Saurabh, Pablo Valle, Aitor Arrieta +2

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

PaperFit is a vision-in-the-loop agent that optimizes LaTeX document layouts, turning compilable drafts into publication-ready PDFs.

2605.10341 · May 11, 2026 · Bihui Yu, Xinglong Xu, Junjie Jiang +6

Beyond Autonomy: A Dynamic Tiered AgentRunner Framework for Governable and Resilient Enterprise AI Execution

Dynamic Tiered AgentRunner enhances enterprise AI with governability, resilience, and risk-adaptive execution, moving beyond pure autonomy.

2605.10223 · May 11, 2026 · Kai Pan, Rong Hou

Usability as a Weapon: Attacking the Safety of LLM-Based Code Generation via Usability Requirements

This paper introduces UPAttack, demonstrating how usability requirements can force LLMs to generate insecure code, achieving up to 98.1% attack success.

2605.10133 · May 11, 2026 · Yue Li, Xiao Li, Hao Wu +5

Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles

This paper analyzes how AI coding agents partition work in pull request lifecycles, classifying them as Collaborators or Assistants based on their agency.

2605.08017 · May 8, 2026 · Young Jo Chung, Safwat Hassan

Tool Calling is Linearly Readable and Steerable in Language Models

Researchers found that tool selection in LLMs is linearly readable and steerable, allowing for error prediction and correction before execution.

2605.07990 · May 8, 2026 · Zekun Wu, Ze Wang, Seonglae Cho +4

Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization

SPARK improves the accuracy of LLM-based test code fault localization by retrieving and annotating similar fault patterns from CI debugging knowledge.

2605.07957 · May 8, 2026 · Golnaz Gharachorlu, Mahsa Panahandeh, Lionel C. Briand +2

Evaluating Design Conformance Through Trace Comparison

This paper introduces a method to evaluate distributed system design conformance by comparing OpenTelemetry traces to design models, providing a quantitative metric.

2605.07909 · May 8, 2026 · Reid Anderson, Hassan Reza

Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem

MCP-BiFlow is a static analysis framework that uncovers bidirectional data-flow risks in Model Context Protocol (MCP) ecosystems.

2605.07836 · May 8, 2026 · Xinyi Hou, Yanjie Zhao, Haoyu Wang

Can I Check What I Designed? Mapping Security Design DSLs to Code Analyzers

This paper maps security design DSLs to code analyzers to bridge the abstraction gap between design and implementation security.

2605.07814 · May 8, 2026 · Sven Peldszus, Frederik Reiche, Kevin Hermann +3

Bridging the Programming Language Gap: Constructing a Multilingual Shared Semantic Space through AST Unification and Graph Matching

A novel method unifies ASTs and uses graph matching to create a shared semantic space, improving cross-language code tasks.

2605.07788 · May 8, 2026 · Junhao Chen, Jingxuan Zhang, Jian He +2

Coding Agents Don't Know When to Act

Coding agents often fail to recognize when no code changes are needed, exhibiting an "action bias" and proposing unnecessary fixes.

2605.07769 · May 8, 2026 · Thibaud Gloaguen, Niels Mündler, Mark Müller +2
