Scaling Human-AI Coding Collaboration Requires a Governable Consensus Layer
Tianfu Wang, Zhezheng Hao, Yin Wu, Wei Wu, Qiang Lin + 3 more
TL;DR
This paper introduces Agentic Consensus, a new paradigm where a governable, graph-based world model replaces code as the primary artifact in human-AI coding.
Key contributions
- Proposes Agentic Consensus, replacing code with a governable "consensus layer C" as the primary artifact.
- C is a typed property graph, from which executable artifacts are derived and kept in sync.
- Makes structural commitments auditable and under-specification explicit, improving system transparency.
- Proposes new evaluation metrics and benchmarks for consensus-based workflows.
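The contributions above center on representing the consensus layer C as a typed property graph linking structural claims to evidence. A minimal sketch of such a graph is below; the node and edge type names ("Module", "Invariant", "SUPPORTED_BY", etc.) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    type: str                      # hypothetical types: "Module", "Invariant", "Evidence"
    props: dict = field(default_factory=dict)

@dataclass
class Edge:
    src: str
    dst: str
    type: str                      # hypothetical types: "COMMITS_TO", "SUPPORTED_BY"

class ConsensusGraph:
    """Toy typed property graph standing in for the consensus layer C."""

    def __init__(self):
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.id] = node

    def add_edge(self, edge: Edge) -> None:
        # Edges may only connect nodes already recorded in C.
        assert edge.src in self.nodes and edge.dst in self.nodes
        self.edges.append(edge)

    def neighbors(self, node_id: str, edge_type: str) -> list[Node]:
        """Follow typed edges, e.g. find the evidence backing an invariant."""
        return [self.nodes[e.dst] for e in self.edges
                if e.src == node_id and e.type == edge_type]

# Usage: record a structural commitment and the evidence that supports it,
# so a reviewer can later audit why the commitment was believed.
g = ConsensusGraph()
g.add_node(Node("auth", "Module"))
g.add_node(Node("inv1", "Invariant", {"claim": "tokens expire after 15 min"}))
g.add_node(Node("test_ttl", "Evidence", {"kind": "unit_test"}))
g.add_edge(Edge("auth", "inv1", "COMMITS_TO"))
g.add_edge(Edge("inv1", "test_ttl", "SUPPORTED_BY"))
```

The point of the structure is auditability: a query like `g.neighbors("inv1", "SUPPORTED_BY")` answers "what evidence backs this commitment?" directly from C, rather than from a chat transcript.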
Why it matters
Current AI-assisted coding produces opaque, fragile systems because no structural record survives generation. This paper proposes a paradigm shift: a governable consensus layer that makes human-AI collaboration transparent, auditable, and controllable, which is essential for scaling robust, maintainable AI-driven software development.
Original Abstract
Vibe coding produces correct, executable code at speed, but leaves no record of the structural commitments, dependencies, or evidence behind it. Reviewers cannot determine what invariants were assumed, what changed, or why a regression occurred. This is not a generation failure but a control failure: the dominant artifact of AI-assisted development (code plus chat history) performs dimension collapse, flattening complex system topology into low-dimensional text and making systems opaque and fragile under change. We propose Agentic Consensus: a paradigm in which the consensus layer C, an operable world model represented as a typed property graph, replaces code as the primary artifact of engineering. Executable artifacts are derived from C and kept in correspondence via synchronization operators Phi (realize) and Psi (rehydrate). Evidence links directly to structural claims in C, making every commitment auditable and under-specification explicit as measurable consensus entropy rather than a silent guess. Evaluation must move beyond code correctness toward alignment fidelity, consensus entropy, and intervention distance. We propose benchmark task families designed to measure whether consensus-based workflows reduce human intervention compared to chat-driven baselines.
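The abstract frames under-specification as "measurable consensus entropy rather than a silent guess." The paper's exact definition is not given in this summary; the sketch below assumes one plausible reading, Shannon entropy over mutually exclusive candidate interpretations of an unresolved decision, with the weights and candidate names purely illustrative.

```python
import math

def consensus_entropy(candidate_weights: dict[str, float]) -> float:
    """Entropy (bits) over candidate interpretations of an under-specified
    commitment. Zero means the consensus layer has fully pinned the choice;
    positive values quantify how much is still left to a guess."""
    total = sum(candidate_weights.values())
    probs = [w / total for w in candidate_weights.values() if w > 0]
    return max(0.0, -sum(p * math.log2(p) for p in probs))

# A fully decided commitment carries no entropy:
print(consensus_entropy({"use_jwt": 1.0}))                        # → 0.0
# An unresolved 50/50 design choice shows up as one bit of entropy:
print(consensus_entropy({"use_jwt": 0.5, "use_sessions": 0.5}))   # → 1.0
```

Under this reading, a workflow that lowers consensus entropy is one that converts implicit assumptions into explicit, auditable commitments before code is derived via the realize operator Phi.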