BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases
TLDR
BUILD-AND-FIND evaluates agent-generated codebases for how easily downstream agents can recover design choices and understand their intent.
Key contributions
- Introduces BUILD-AND-FIND, a protocol for evaluating agent-managed codebases.
- Assesses how easily downstream agents can recover hidden design choices and intent.
- Measures recovery accuracy, repeatability, implementation coverage, and inspection effort.
- Separates behavioral correctness from the clarity of the generated codebase artifact.
Why it matters
As agents increasingly manage entire codebases, evaluating only functional correctness is insufficient. This paper addresses the need to assess how well agent-generated codebases serve as communication artifacts for future agents, providing a framework to measure the clarity and understandability of these artifacts, which matters for collaborative AI development.
Original Abstract
Most coding-agent benchmarks ask whether generated code behaves correctly. That remains essential, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or extend it as working context. In that setting, a generated repository is not only an answer to a task but also a communication artifact for future work. Even when strong agents nearly satisfy the visible behavioral objective, repositories can differ in how clearly they expose the intended behavior and design choices behind that behavior. We introduce BUILD-AND-FIND, a protocol for evaluating whether downstream agents can recover those intended choices from generated repositories, and how much inspection that recovery requires. For each task, a builder sees a hidden repository specification and creates a codebase; a finder sees only the codebase and a specification-traced multiple-choice question bank. The protocol separates behavioral correctness from artifact-side recovery and reports recovery accuracy, repeatability, implementation coverage, and inspection effort. Accuracy and stability act as gates: effort is interpreted only when recovery succeeds reliably. Among artifacts from which the same intent can be recovered, lower effort by the same finder suggests that the artifact makes that intent easier to locate. Question-only and spec-only controls quantify generic priors and specification access, while audits separate omitted claims from finder failures and check whether correct answers cite artifact evidence. In the released high-prior task pack, recovery accuracy is near saturation, so inspection effort and finder-specific effects provide the main panel-local comparison.
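The abstract's metrics can be sketched as a small aggregator over repeated finder runs. All names below (`score_finder_runs`, the gate thresholds of 0.9, the token-based effort measure) are illustrative assumptions, not taken from the paper's released code; the sketch only shows how accuracy and repeatability act as gates before effort is reported.

```python
def score_finder_runs(runs, answer_key, covered_qids):
    """Aggregate BUILD-AND-FIND-style metrics from repeated finder runs.

    runs: list of dicts mapping question id -> (answer, tokens_inspected)
    answer_key: question id -> correct answer
    covered_qids: question ids whose spec claims the builder implemented
    (All names and thresholds here are illustrative assumptions.)
    """
    n_runs = len(runs)
    qids = sorted(answer_key)

    # Recovery accuracy: fraction of (run, question) pairs answered correctly.
    correct = sum(run[q][0] == answer_key[q] for run in runs for q in qids)
    accuracy = correct / (n_runs * len(qids))

    # Repeatability: fraction of questions where every run gives the same answer.
    stable = sum(len({run[q][0] for run in runs}) == 1 for q in qids)
    repeatability = stable / len(qids)

    # Implementation coverage: share of questions whose claim exists in the artifact.
    coverage = len(covered_qids) / len(qids)

    # Inspection effort: mean tokens inspected per question, interpreted only
    # when accuracy and repeatability clear their gates (thresholds assumed).
    effort = sum(run[q][1] for run in runs for q in qids) / (n_runs * len(qids))
    gated_effort = effort if (accuracy >= 0.9 and repeatability >= 0.9) else None

    return {"accuracy": accuracy, "repeatability": repeatability,
            "coverage": coverage, "effort": gated_effort}
```

Gating effort on accuracy and stability mirrors the abstract's point: effort comparisons are only meaningful among artifacts from which the intent is reliably recovered.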