CodeTracer: Towards Traceable Agent States
Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye + 11 more
TLDR
CodeTracer helps debug complex code agents by tracing full state transitions and localizing hidden error chains, improving reliability.
Key contributions
- Introduces CodeTracer, an architecture that reconstructs hierarchical state transition histories from agent runs.
- Performs failure onset localization to pinpoint error origins and their cascading downstream chains (see the sketch after this list).
- Creates CodeTraceBench, a large dataset for evaluating agent tracing across diverse tasks and frameworks.
- Demonstrates CodeTracer's ability to recover failed agent runs by replaying its diagnostic signals.
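To make the trace-tree and localization ideas concrete, here is a minimal Python sketch under our own assumptions: `TraceNode`, the `ok` flag, and `find_failure_onset` are hypothetical illustrations, not CodeTracer's actual API. It treats the earliest failing leaf in execution order as the failure onset and every later failing step as its downstream chain.

```python
# Hypothetical sketch of a hierarchical trace tree over an agent run,
# plus a naive failure-onset walk. Not CodeTracer's real implementation.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TraceNode:
    """One state transition: a stage (e.g. 'edit') or a step (e.g. a tool call)."""
    name: str
    ok: bool = True                                   # did this transition succeed?
    children: list["TraceNode"] = field(default_factory=list)

def find_failure_onset(node: TraceNode) -> Optional[list[str]]:
    """Return the path to the earliest failing leaf, depth-first.

    The first failing leaf in execution order is taken as the onset;
    failing nodes visited after it belong to the downstream chain.
    """
    if not node.ok and not node.children:
        return [node.name]
    for child in node.children:
        path = find_failure_onset(child)
        if path is not None:
            return [node.name] + path
    return None

# A toy run: the 'apply_patch' step fails first, and later steps cascade.
run = TraceNode("run", children=[
    TraceNode("plan", children=[TraceNode("read_file")]),
    TraceNode("edit", ok=False, children=[
        TraceNode("apply_patch", ok=False),           # failure onset
        TraceNode("run_tests", ok=False),             # downstream of the onset
    ]),
])
print(find_failure_onset(run))  # ['run', 'edit', 'apply_patch']
```

The real system additionally parses heterogeneous run artifacts through evolving extractors and keeps persistent memory across runs; this sketch only captures the tree shape and the localization walk.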
Why it matters
Debugging complex code agents is a major obstacle to their reliability and adoption. CodeTracer addresses this by making agent states traceable and failure origins localizable, turning opaque multi-stage runs into inspectable histories. That capability is a prerequisite for dependable AI agents in real-world coding workflows.
Original Abstract
Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interactions or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.
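The stage- and step-level supervision the abstract describes is easy to picture as structured records. The following is a hypothetical sketch of what one CodeTraceBench-style example could contain; the field names and values are our own illustration, not the released schema.

```python
# Hypothetical shape of one benchmark example with supervision at both
# granularities. Field names are illustrative, not the published format.
record = {
    "framework": "agent_framework_A",    # one of the four frameworks (unnamed here)
    "task": "bug_fixing",                # e.g. bug fixing, refactoring, terminal use
    "trajectory": ["plan", "edit", "test"],  # executed run, abbreviated
    "labels": {
        "failure_stage": 1,              # stage-level: index of the failing stage
        "failure_step": 3,               # step-level: index of the onset step
        "downstream_steps": [4, 5],      # steps dragged down by the onset error
    },
}
```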