MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning
Yizhe Zeng, Wei Zhang, Yunpeng Li, Juxin Xiao, Xiao Wang + 1 more
TLDR
MirageBackdoor is a stealthy attack on LLMs that makes them think correctly but give wrong answers, bypassing current CoT defenses.
Key contributions
- Introduces MirageBackdoor, the first attack enabling LLMs to "Think Well but Answer Wrong."
- Achieves stealthiness by manipulating the post-output space, preserving clean Chain-of-Thought.
- Demonstrates over 90% attack success rate with only 5% poison ratio across diverse models/datasets.
- Resists trigger perturbations and CoT-based detection, challenging current safety measures.
Why it matters
This paper reveals a critical vulnerability in LLMs' Chain-of-Thought reasoning, showing that models can be backdoored to produce incorrect final answers while their reasoning steps appear sound. This poses a significant challenge to existing CoT-based detection methods and highlights the need for new, more robust safety guardrails.
Original Abstract
While Chain-of-Thought (CoT) prompting has become a standard paradigm for eliciting complex reasoning capabilities in Large Language Models, it inadvertently exposes a new attack surface for backdoor attacks. Existing CoT backdoor attacks typically manipulate the intermediate reasoning steps to steer the model toward incorrect answers. However, these corrupted reasoning traces are readily detected by prevalent process-monitoring defenses. To address this limitation, we introduce MirageBackdoor(MirageBD), the first backdoor attack to achieve Think Well but Answer Wrong. By unlocking the model's post-output space alongside a tailored training procedure, MirageBD enables the triggered model to preserve clean CoTs while selectively steering the final answer toward a specific target, significantly enhancing the stealthiness of the attack. Experiments show that MirageBD generally achieves over 90% attack success rate across four datasets and five models with a poison ratio of only 5%. Moreover, even under rigorous evaluations such as trigger perturbations and CoT-based detection, MirageBD maintains robust performance and stealthiness, posing a critical challenge to existing safety guardrails.
📬 Weekly AI Paper Digest
Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.