ArXiv TLDR

ProgramBench: Can Language Models Rebuild Programs From Scratch?

arXiv:2605.03546

John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko + 7 more

cs.SE, cs.AI

TLDR

ProgramBench evaluates language models' ability to rebuild software holistically from scratch, revealing that current LMs struggle with high-level architectural decisions.

Key contributions

  • Introduces ProgramBench, a new benchmark for evaluating LLMs' holistic software development capabilities.
  • Tasks require LMs to architect and implement complex programs from documentation and reference behavior.
  • Evaluation uses agent-driven fuzzing for end-to-end behavioral testing, without prescribing implementation structure (see the sketch after this list).
  • Current LMs fail to fully resolve any task, favoring monolithic designs over human-like architectures.
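The fuzzing-based evaluation amounts to differential testing against the reference executable: a test passes when the candidate binary reproduces the reference's observable behavior on the same input. Below is a minimal Python sketch, not the benchmark's actual harness; the binary paths, the fixed test list, and the comparison on exit code, stdout, and stderr are illustrative assumptions.

```python
import subprocess

# Hypothetical paths; the paper does not specify the harness layout.
REFERENCE_BIN = "./reference"
CANDIDATE_BIN = "./candidate"

def behavior(binary, args, stdin_data=b""):
    """Run a binary and capture its observable behavior:
    exit code, stdout, and stderr."""
    proc = subprocess.run(
        [binary, *args],
        input=stdin_data,
        capture_output=True,
        timeout=10,  # illustrative per-test timeout
    )
    return proc.returncode, proc.stdout, proc.stderr

def passes(args, stdin_data=b""):
    """A behavioral test passes when the candidate matches the
    reference executable's behavior on the same input."""
    return behavior(CANDIDATE_BIN, args, stdin_data) == behavior(
        REFERENCE_BIN, args, stdin_data
    )

# In ProgramBench, an agent proposes fuzzed inputs from the program's
# documentation and reference behavior; this fixed list is a stand-in.
fuzzed_cases = [(["--help"], b""), (["-n", "3"], b"hello\nworld\n")]

pass_rate = sum(passes(a, d) for a, d in fuzzed_cases) / len(fuzzed_cases)
print(f"behavioral pass rate: {pass_rate:.0%}")
```

Because only input/output behavior is compared, the candidate is free to choose any internal architecture, which is what lets the benchmark evaluate without prescribing implementation structure.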

Why it matters

Existing benchmarks for code generation are too narrow, measuring isolated tasks such as fixing a single bug or developing a single specified feature. ProgramBench exposes a critical gap in current language models' ability to make high-level software architecture decisions and develop projects holistically, a prerequisite for agent-driven software engineering settings where models seed, maintain, and grow codebases with minimal human oversight.

Original Abstract

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
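To unpack the headline numbers: a task counts as fully resolved only when every behavioral test passes, which is a much stricter bar than a high per-test pass rate. A toy aggregation sketch follows, with made-up task names and pass rates purely for illustration.

```python
# Per-task behavioral test results for one model: task -> pass rate in [0, 1].
# These values are invented for illustration, not taken from the paper.
pass_rates = {"cli-tool": 0.95, "sqlite-rebuild": 0.12, "ffmpeg-rebuild": 0.04}

def share_of_tasks(rates, threshold):
    """Fraction of tasks whose test pass rate meets the threshold."""
    return sum(r >= threshold for r in rates.values()) / len(rates)

resolved = share_of_tasks(pass_rates, 1.0)    # fully resolved: all tests pass
near = share_of_tasks(pass_rates, 0.95)       # the abstract's 95%-of-tests bar
print(f"resolved: {resolved:.0%}, passed >=95% of tests: {near:.0%}")
```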
