Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-CodeX
Zhonghao Yang, Yu Li, Yanxu Zhu, Tianyi Zhou, Yuejin Xie + 4 more
TLDR
Introduces ATBench-Claw and ATBench-CodeX, new benchmarks for evaluating and diagnosing safety in agent trajectories for OpenClaw and OpenAI Codex.
Key contributions
- Introduces ATBench-Claw for safety evaluation in OpenClaw agent execution chains.
- Presents ATBench-CodeX for trajectory safety in OpenAI Codex/Codex-runtime environments.
- Utilizes a customizable 3D Safety Taxonomy to adapt benchmarks to specific domains.
- Employs a shared ATBench pipeline for extensible, domain-specific benchmark generation.
Why it matters
As agent systems expand into diverse execution settings, robust safety evaluation must keep pace. ATBench-Claw and ATBench-CodeX are extensible benchmarks that adapt to evolving agent execution environments, supporting trajectory-level safety evaluation and diagnosis in two concrete settings: OpenClaw and OpenAI Codex.
Original Abstract
As agent systems move into increasingly diverse execution settings, trajectory-level safety evaluation and diagnosis require benchmarks that evolve with them. ATBench is a diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis. This report presents ATBench-Claw and ATBench-CodeX, two domain-customized extensions that carry ATBench into the OpenClaw and OpenAI Codex / Codex-runtime settings. The key adaptation mechanism is to analyze each new setting, customize the three-dimensional Safety Taxonomy over risk source, failure mode, and real-world harm, and then use that customized taxonomy to define the benchmark specification consumed by the shared ATBench construction pipeline. This extensibility matters because agent frameworks remain relatively stable at the architectural level even as their concrete execution settings, tool ecosystems, and product capabilities evolve quickly. Concretely, ATBench-Claw targets OpenClaw-sensitive execution chains over tools, skills, sessions, and external actions, while ATBench-CodeX targets trajectories in the OpenAI Codex / Codex-runtime setting over repositories, shells, patches, dependencies, approvals, and runtime policy boundaries. Our emphasis therefore falls on taxonomy customization, domain-specific risk coverage, and benchmark design under a shared ATBench generation framework.
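The adaptation mechanism described above, customizing a three-dimensional Safety Taxonomy (risk source, failure mode, real-world harm) and deriving a benchmark specification from it, can be sketched as a small data structure. This is a minimal illustrative sketch only; every class, field, and category name below is a hypothetical stand-in, not taken from the ATBench codebase.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical sketch of the three-dimensional Safety Taxonomy:
# risk source x failure mode x real-world harm. All names are
# illustrative assumptions, not the paper's actual categories.

@dataclass(frozen=True)
class SafetyTaxonomy:
    risk_sources: tuple   # where unsafe behavior can originate
    failure_modes: tuple  # how a trajectory goes wrong
    harms: tuple          # real-world consequences

    def cells(self):
        """Enumerate every (risk source, failure mode, harm) cell
        that a domain-customized benchmark spec should cover."""
        return list(product(self.risk_sources, self.failure_modes, self.harms))

# Illustrative customization for a Codex-like coding-agent setting,
# loosely echoing the surfaces named in the abstract (repositories,
# shells, dependencies, approvals).
codex_taxonomy = SafetyTaxonomy(
    risk_sources=("repository", "shell", "dependency", "approval_boundary"),
    failure_modes=("policy_bypass", "unsafe_patch", "data_exfiltration"),
    harms=("code_damage", "privacy_breach"),
)

spec = codex_taxonomy.cells()
print(len(spec))  # 4 * 3 * 2 = 24 taxonomy cells to target with trajectories
```

In this reading, customizing the taxonomy per domain changes only the category tuples, while the shared construction pipeline consumes the enumerated cells; that separation is one plausible way the paper's "benchmark specification" could stay stable across settings.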