ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents
Smriti Jha, Matteo Paltenghi, Chandra Maddila, Vijayaraghavan Murali, Shubham Ugare + 1 more
TLDR
ProdCodeBench is a production-derived benchmark built from real AI coding assistant sessions; the paper presents the curation methodology behind it and shows that iterative verification improves agent solve rates.
Key contributions
- Presents ProdCodeBench, a new benchmark derived from real production AI coding assistant sessions.
- Details a robust methodology for curating production-derived benchmarks, including LLM-based task classification, test relevance validation, and multi-run stability checks (a sketch of the stability check follows this list).
- Shows models using iterative verification tools (tests, static analysis) achieve significantly higher solve rates.
- Highlights that exposing codebase-specific verification mechanisms can boost AI agent performance.
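To make the curation methodology concrete, the sketch below illustrates what a multi-run, fail-to-pass stability check could look like. It is a hypothetical illustration, not the authors' code: the `Sample` fields, the `run_tests` placeholder, and the three-run default are assumptions made for this sketch.

```python
# Hypothetical sketch of a multi-run stability check over candidate samples.
# `Sample`, `run_tests`, and the run count are illustrative assumptions,
# not names or values taken from the paper.
from dataclasses import dataclass, field


@dataclass
class Sample:
    prompt: str            # verbatim prompt from the production session
    pre_fix_rev: str       # codebase revision before the committed change
    post_fix_rev: str      # revision that contains the committed change
    tests: list[str] = field(default_factory=list)  # candidate fail-to-pass tests


def run_tests(revision: str, tests: list[str]) -> bool:
    """Placeholder: check out `revision`, run `tests`, return True if all pass."""
    raise NotImplementedError


def is_stable_fail_to_pass(sample: Sample, runs: int = 3) -> bool:
    """Keep a sample only if its tests fail before the fix and pass after it,
    consistently across repeated runs (filters flaky or irrelevant tests)."""
    for _ in range(runs):
        if run_tests(sample.pre_fix_rev, sample.tests):
            return False  # tests pass without the fix -> not relevant to the change
        if not run_tests(sample.post_fix_rev, sample.tests):
            return False  # tests fail even with the fix -> flaky or broken
    return True
```

Repeating the check across runs is what separates genuinely change-linked tests from flaky ones, which the abstract identifies as a key difficulty when extracting reliable evaluation signals from monorepo environments.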
Why it matters
ProdCodeBench addresses a real evaluation gap: existing benchmarks differ from production usage in programming language distribution, prompt style, and codebase structure. Its finding that iterative verification boosts solve rates points to a concrete lever for improving coding agents, and the shared methodology lets other organizations build similar production-derived benchmarks.
Original Abstract
Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings, yet existing benchmarks differ from real usage in programming language distribution, prompt style, and codebase structure. This paper presents a methodology for curating production-derived benchmarks, illustrated through ProdCodeBench, a benchmark built from real sessions with a production AI coding assistant. We detail our data collection and curation practices, including LLM-based task classification, test relevance validation, and multi-run stability checks, which address challenges in constructing reliable evaluation signals from monorepo environments. Each curated sample consists of a verbatim prompt, a committed code change, and fail-to-pass tests spanning seven programming languages. Our systematic analysis of four foundation models yields solve rates from 53.2% to 72.2%, revealing that models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates. This suggests that iterative verification helps achieve effective agent behavior and that exposing codebase-specific verification mechanisms may significantly improve the performance of externally trained agents operating in unfamiliar environments. We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.
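The fail-to-pass framing in the abstract also implies a simple evaluation loop: the agent edits the pre-fix revision from the verbatim prompt, and the sample counts as solved if the held-out tests then pass. The sketch below shows one way such a solve rate could be computed; `EvalSample`, `run_agent`, and `run_tests` are placeholder names assumed for illustration, not the paper's implementation.

```python
# Hypothetical solve-rate computation over fail-to-pass samples.
# Each sample pairs a verbatim prompt with held-out tests that fail on the
# pre-fix revision; the agent's edit counts as a solve if those tests pass.
from typing import Callable, NamedTuple


class EvalSample(NamedTuple):
    prompt: str        # verbatim user prompt
    pre_fix_rev: str   # revision the agent starts from
    tests: list[str]   # held-out fail-to-pass tests


def solve_rate(samples: list[EvalSample],
               run_agent: Callable[[str, str], str],    # (revision, prompt) -> patched revision
               run_tests: Callable[[str, list[str]], bool]) -> float:
    """Fraction of samples whose fail-to-pass tests pass after the agent's edit."""
    solved = sum(
        run_tests(run_agent(s.pre_fix_rev, s.prompt), s.tests)
        for s in samples
    )
    return solved / len(samples) if samples else 0.0
```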