The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents
TLDR
Proposes "triadic data" (human-human, human-AI, and cross-functional work logs) to train advanced long-horizon SWE agents.
Key contributions
- Proposes "triadic data" for SWE agents, combining human-human context, human-AI sessions, and multi-week project logs.
- Suggests two ways to capture this data: expert trajectories with stimulated recall and simulated cross-functional companies.
- Outlines a four-tier evidence framework for evaluating the quality of any agent training data corpus.
Why it matters
This paper addresses the critical gap in training data for software engineering agents to handle long-horizon, ambiguous tasks. It proposes a novel "triadic data" approach and a framework for its evaluation, offering a clear path forward for developing more capable and senior-level SWE agents.
Original Abstract
Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer, ambiguous-specification deliverables. This paper takes a position on what training data is needed to close the gap. The substrate for the next generation of SWE agents is neither larger GitHub scrapes nor more solo-agent trajectories nor -- sufficient by itself -- open human-AI dialogue logs. It is triadic data: synchronized capture of the human-human conversations where engineering context is formed, the human-AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both. We argue that the canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies -- instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure. We further specify a four-tier evidence framework through which any such corpus -- triadic or otherwise -- must justify its quality to a fine-tuning researcher: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation. We argue that this data is capturable in 12-18 months with methods already mature in adjacent fields, that it is the empirical key to four open questions in agent training, and that the field's near-term research agenda should include it explicitly.
📬 Weekly AI Paper Digest
Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.