Software Engineering
Papers on code generation, software testing, development tools, and AI for SE.
cs.SE · 497 papers
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
ComplexMCP is a new benchmark evaluating LLM agents in dynamic, interdependent, and large-scale tool environments, revealing significant performance gaps.
AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification
AutoSOUP automates component-level memory-safety verification using Safety-Oriented Unit Proofs and a hybrid LLM-as-function-call architecture.
ChatGPT: Friend or Foe When Comprehending and Changing Unfamiliar Code
This study examines AI's impact on developers' cognitive processes when comprehending and changing unfamiliar code, revealing mixed effects on problem-solving.
Step Rejection Fine-Tuning: A Practical Distillation Recipe
Step Rejection Fine-Tuning (SRFT) improves LLM agent training by leveraging partially correct, unresolved trajectories, outperforming standard RFT.
CrackMeBench: Binary Reverse Engineering for Agents
CrackMeBench is a new benchmark for evaluating language models on binary reverse engineering tasks, focusing on recovering validation logic from executables.
LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation
LLARS is an open-source platform enabling domain experts and developers to collaboratively engineer, generate, and evaluate LLM outputs efficiently.
Correct-by-Construction G-Code Generation: A Neuro-Symbolic Approach via Separation Logic
This paper introduces a neuro-symbolic framework for correct-by-construction G-code generation, using a neural generator and a logic verifier for self-correction.
Separation Logic for Verifying Physical Collisions of CNC Programs
This paper introduces a formal verification framework using Separation Logic to prevent physical collisions in CNC programs by treating workspace occupancy as a logical resource.
VISOR: A Vision-Language Model-based Test Oracle for Testing Robots
VISOR is a VLM-based test oracle that automates robot task assessment, replacing manual evaluation and quantifying task correctness and quality.
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
PaperFit is a vision-in-the-loop agent that optimizes LaTeX document layouts, turning compilable drafts into publication-ready PDFs.
Beyond Autonomy: A Dynamic Tiered AgentRunner Framework for Governable and Resilient Enterprise AI Execution
Dynamic Tiered AgentRunner enhances enterprise AI with governability, resilience, and risk-adaptive execution, moving beyond pure autonomy.
Usability as a Weapon: Attacking the Safety of LLM-Based Code Generation via Usability Requirements
This paper introduces UPAttack, demonstrating how usability requirements can force LLMs to generate insecure code, achieving up to 98.1% attack success.
Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles
This paper analyzes how AI coding agents partition work in pull request lifecycles, classifying them as Collaborators or Assistants based on their agency.
Tool Calling is Linearly Readable and Steerable in Language Models
Researchers found that tool selection in LLMs is linearly readable and steerable, allowing for error prediction and correction before execution.
Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization
SPARK improves LLM-based Test Code Fault Localization by retrieving and annotating similar fault patterns from CI debugging knowledge, enhancing accuracy.
Evaluating Design Conformance Through Trace Comparison
This paper introduces a method to evaluate distributed system design conformance by comparing OpenTelemetry traces to design models, providing a quantitative metric.
Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem
MCP-BiFlow is a static analysis framework that uncovers bidirectional data-flow risks in Model Context Protocol (MCP) ecosystems.
Can I Check What I Designed? Mapping Security Design DSLs to Code Analyzers
This paper maps security design DSLs to code analyzers to bridge the abstraction gap between design and implementation security.
Bridging the Programming Language Gap: Constructing a Multilingual Shared Semantic Space through AST Unification and Graph Matching
A novel method unifies ASTs and uses graph matching to create a shared semantic space, improving cross-language code tasks.
Coding Agents Don't Know When to Act
Coding agents often fail to recognize when no code changes are needed, exhibiting an "action bias" and proposing unnecessary fixes.