ArXiv TLDR

ASMR-Bench: Auditing for Sabotage in ML Research

2604.16286

Eric Gan, Aryan Bhatt, Buck Shlegeris, Julian Stastny, Vivek Hebbar

cs.AI

TLDR

ASMR-Bench is a benchmark for evaluating how well auditors detect subtle sabotage in ML research codebases. Frontier LLMs and LLM-assisted human auditors both struggle to detect it reliably.

Key contributions

  • Creates ASMR-Bench, a benchmark to audit for subtle sabotage in ML research codebases.
  • Features 9 ML codebases with sabotaged variants that produce misleading experimental results.
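  • Finds that frontier LLMs and LLM-assisted human auditors both struggle to detect sabotage reliably (best result: AUROC 0.77 and a 42% top-1 fix rate, achieved by Gemini 3.1 Pro).
  • Shows that LLM-generated sabotages were weaker than human-generated ones but still sometimes evaded same-capability LLM auditors.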

Why it matters

As AI systems increasingly conduct research autonomously, detecting subtle sabotage becomes crucial for scientific integrity. ASMR-Bench shows that current auditors, both LLMs and LLM-assisted humans, miss many sabotages, and it provides a testbed for developing monitoring and auditing techniques that catch flaws introduced by misaligned AI.

Original Abstract

As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce ASMR-Bench (Auditing for Sabotage in ML Research), a benchmark for evaluating the ability of auditors to detect sabotage in ML research codebases. ASMR-Bench consists of 9 ML research codebases with sabotaged variants that produce qualitatively different experimental results. Each sabotage modifies implementation details, such as hyperparameters, training data, or evaluation code, while preserving the high-level methodology described in the paper. We evaluated frontier LLMs and LLM-assisted human auditors on ASMR-Bench and found that both struggled to reliably detect sabotage: the best performance was an AUROC of 0.77 and a top-1 fix rate of 42%, achieved by Gemini 3.1 Pro. We also tested LLMs as red teamers and found that LLM-generated sabotages were weaker than human-generated ones but still sometimes evaded same-capability LLM auditors. We release ASMR-Bench to support research on monitoring and auditing techniques for AI-conducted research.
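To make the two reported metrics concrete, here is a minimal Python sketch of how an AUROC and a top-1 fix rate could be computed from auditor outputs. It is an illustration under stated assumptions, not the benchmark's scoring code: the function names, the suspicion scores, and the fix labels are all hypothetical, and the definition of top-1 fix rate used here (the auditor's first-ranked proposed fix matches the real sabotage) is an assumption, since the abstract does not define it.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a sabotaged codebase gets a higher suspicion
    score than an honest one (ties count as half)."""
    sabotaged = [s for s, y in zip(scores, labels) if y == 1]
    honest = [s for s, y in zip(scores, labels) if y == 0]
    if not sabotaged or not honest:
        raise ValueError("need at least one sabotaged and one honest codebase")
    wins = 0.0
    for p in sabotaged:
        for n in honest:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(sabotaged) * len(honest))


def top1_fix_rate(top_guesses: Sequence[str], true_fixes: Sequence[str]) -> float:
    """ASSUMED definition: fraction of sabotaged codebases where the
    auditor's first-ranked proposed fix names the actual sabotage."""
    if len(top_guesses) != len(true_fixes) or not true_fixes:
        raise ValueError("need one guess per sabotaged codebase")
    hits = sum(g == t for g, t in zip(top_guesses, true_fixes))
    return hits / len(true_fixes)


# Made-up auditor suspicion scores; label 1 = sabotaged, 0 = honest.
scores = [0.9, 0.4, 0.7, 0.2, 0.6, 0.5]
labels = [1, 1, 1, 0, 0, 0]
print(f"AUROC: {auroc(scores, labels):.2f}")

# Made-up fix labels for three sabotaged codebases.
guesses = ["lr", "data_split", "eval_metric"]
truths = ["lr", "seed", "eval_metric"]
print(f"top-1 fix rate: {top1_fix_rate(guesses, truths):.2f}")
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of sabotaged from honest codebases, which is the scale against which the reported 0.77 should be read.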
