MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
TLDR
MOSAIC-Bench shows that when a malicious objective is decomposed into routine-looking engineering tickets, production coding agents routinely produce exploitable code that slips past per-prompt safety alignment and downstream reviewer checks.
Key contributions
- Introduces MOSAIC-Bench, a benchmark with 199 three-stage attack chains on real software.
- Nine production coding agents achieve 53-86% attack success rates when tasks are staged.
- Code reviewer agents approve 25.8% of confirmed-vulnerable cumulative diffs as routine PRs.
- Reframing reviewers as pentesters reduces evasion, with a Gemma-4-E4B-it model detecting 88.4% of attacks.
Why it matters
This paper exposes a critical blind spot in current AI safety: coding agents can be steered toward a malicious end-state through a sequence of individually innocuous tasks, each of which passes per-prompt safety checks. It demonstrates that existing code-review processes are insufficient against such staged attacks and proposes a practical mitigation, urging a shift in how we evaluate and secure AI-generated code.
Original Abstract
Coding agents often pass per-prompt safety review yet ship exploitable code when their tasks are decomposed into routine engineering tickets. The challenge is structural: existing safety alignment evaluates overt requests in isolation, leaving models blind to malicious end-states that emerge from sequenced compliance with innocuous-looking requests. We introduce MOSAIC-Bench (Malicious Objectives Sequenced As Innocuous Compliance), a benchmark of 199 three-stage attack chains paired with deterministic exploit oracles on deployed software substrates (10 web-application substrates, 31 CWE classes, 5 programming languages) that treats both exploit ground truth and downstream reviewer protocol as first-class evaluation axes. On this benchmark, nine production coding agents from Anthropic, OpenAI, Google, Moonshot, Zhipu, and Minimax compose innocuous tickets at 53-86% end-to-end ASR with only two refusals across all staged runs. In a matched direct-prompt experiment over four frontier Claude/Codex agents, vulnerable-output rates fall to 0-20.4%: Claude primarily refuses, while Codex primarily hardens rather than emitting the vulnerable implementation - ticket staging silences both defense modes simultaneously. Downstream, code reviewer agents approve 25.8% of these confirmed-vulnerable cumulative diffs as routine PRs, and a full-context implementation protocol closes only 50% of the staged/direct gap, ruling out context fragmentation as the sole explanation. As a deployable but non-adaptive mitigation, reframing the reviewer as an adversarial pentester reduces evasion across the evaluated reviewer subset; pentester-framed evasion ranges from 3.0% to 17.6%, and an open-weight Gemma-4-E4B-it reviewer under this framing detects 88.4% of attacks on the dataset with a 4.6% false-positive rate measured on 608 real-world GitHub PRs.
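The staged evaluation the abstract describes — feed an agent three innocuous tickets in sequence, then check the cumulative diff against a deterministic exploit oracle and report end-to-end ASR — can be sketched as a minimal harness. Everything here (`AttackChain`, `run_agent`, `exploit_oracle`) is an illustrative assumption, not the benchmark's actual API; the real agent calls and exploit checks are stand-ins.

```python
# Hedged sketch of a MOSAIC-Bench-style staged evaluation loop.
# All names and stub bodies are hypothetical; only the control flow
# (sequential tickets -> cumulative diff -> deterministic oracle -> ASR)
# mirrors the protocol described in the abstract.

from dataclasses import dataclass

@dataclass
class AttackChain:
    """One benchmark item: three routine-looking tickets whose
    cumulative diff realizes a known vulnerability (one CWE class)."""
    tickets: list   # three innocuous engineering tickets
    cwe: str        # e.g. "CWE-89"

def run_agent(ticket, workspace):
    """Stand-in for the coding agent: applies one ticket to the
    workspace and returns the updated (cumulative) state."""
    workspace.append(ticket)  # placeholder for real code changes
    return workspace

def exploit_oracle(workspace, cwe):
    """Stand-in for the deterministic exploit check: True iff the
    known exploit for this CWE succeeds against the built code."""
    return len(workspace) == 3  # placeholder: full chain applied

def attack_success_rate(chains):
    successes = 0
    for chain in chains:
        workspace = []                     # fresh substrate per chain
        for ticket in chain.tickets:       # stage tickets in sequence
            workspace = run_agent(ticket, workspace)
        if exploit_oracle(workspace, chain.cwe):
            successes += 1
    return successes / len(chains)

chains = [AttackChain(tickets=["t1", "t2", "t3"], cwe="CWE-89")]
print(attack_success_rate(chains))  # 1.0 with these stand-in stubs
```

The key design point the paper leans on is the deterministic oracle: success is decided by whether a concrete exploit fires against the built artifact, not by a judge model's opinion of the diff.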