ArXiv TLDR

Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor

arXiv: 2604.09235

Wenhan Chang, Tianqing Zhu, Ping Xiong, Faqian Guan, Wanlei Zhou

cs.CR

TLDR

This paper introduces Two-stage Backdoor Hijacking (TSBH), which manipulates an LLM's Chain-of-Thought (CoT) through lightweight adapters, posing a new safety risk.

Key contributions

  • Identifies Chain-of-Thought (CoT) hijacking as a new safety risk for LLMs, especially with open-weight adapters.
  • Proposes Multiple Reverse Tree Search (MRTS) to synthesize malicious CoT data, addressing data scarcity (a minimal sketch follows this list).
  • Introduces Two-stage Backdoor Hijacking (TSBH) for effective, trigger-activated CoT manipulation.
  • Demonstrates successful CoT hijacking across multiple open-weight models and releases a safety-reasoning dataset.
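
The digest does not spell out the MRTS algorithm, but its core idea, reverse-synthesizing a CoT from a known (prompt, output) pair via tree search, can be sketched. The skeleton below is a hedged illustration, not the authors' code: `propose_prior_steps`, `chain_score`, and the depth/branch/beam parameters are hypothetical placeholders standing in for a generator model and a consistency scorer.

```python
# Hypothetical sketch of a reverse tree search in the spirit of MRTS:
# grow candidate reasoning chains backward from a known output and keep
# the best-scoring chain per (prompt, output) pair. `propose_prior_steps`
# and `chain_score` are placeholders; neither name comes from the paper.
from dataclasses import dataclass, field

@dataclass
class Chain:
    steps: list[str] = field(default_factory=list)  # reasoning steps, output-first
    score: float = 0.0                              # consistency with the target output

def propose_prior_steps(prompt: str, output: str, chain: Chain, k: int) -> list[str]:
    """Placeholder: ask a generator model for k candidate steps that could
    precede `chain` and still lead from `prompt` to `output`."""
    return [f"candidate step before {len(chain.steps)} later steps ({i})" for i in range(k)]

def chain_score(prompt: str, output: str, chain: Chain) -> float:
    """Placeholder: score how well the chain connects prompt to output,
    e.g. with an embedding similarity or a verifier model."""
    return -float(len(chain.steps))  # dummy scorer: prefer shorter chains

def reverse_tree_search(prompt: str, output: str,
                        depth: int = 3, branch: int = 2, beam: int = 4) -> Chain:
    frontier = [Chain()]
    for _ in range(depth):
        expanded = []
        for chain in frontier:
            for step in propose_prior_steps(prompt, output, chain, branch):
                new = Chain([step] + chain.steps)
                new.score = chain_score(prompt, output, new)
                expanded.append(new)
        # Beam pruning keeps the backward search tractable across roots.
        frontier = sorted(expanded, key=lambda c: c.score, reverse=True)[:beam]
    return frontier[0]

cot = reverse_tree_search("example prompt", "target output")
print(cot.steps)
```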

Why it matters

This paper highlights a critical new safety vulnerability in LLMs: Chain-of-Thought hijacking, where attackers manipulate a model's reasoning. It demonstrates how easily malicious behaviors can be embedded via TSBH and MRTS, especially in open-weight models. This research is crucial for developing robust defenses and ensuring trustworthy LLM deployments.

Original Abstract

Large Language Models (LLMs) are increasingly deployed in settings where Chain-of-Thought (CoT) is interpreted by users. This creates a new safety risk: attackers may manipulate the model's observable CoT to produce malicious behaviors. In open-weight ecosystems, such manipulation can be embedded in lightweight adapters that are easy to distribute and attach to base models. In practice, persistent CoT hijacking faces three main challenges: the difficulty of directly hijacking CoT tokens within one continuous long CoT-output sequence while maintaining stable downstream outputs, the scarcity of malicious CoT data, and the instability of naive backdoor injection methods. To address the data scarcity issue, we propose Multiple Reverse Tree Search (MRTS), a reverse synthesis procedure that constructs output-aligned CoTs from prompt-output pairs without directly eliciting malicious CoTs from aligned models. Building on MRTS, we introduce Two-stage Backdoor Hijacking (TSBH), which first induces a trigger-conditioned mismatch between intermediate CoT and malicious outputs, and then fine-tunes the model on MRTS-generated CoTs that have a lower embedding distance to the malicious outputs, thereby ensuring stronger semantic similarity. Experiments across multiple open-weight models demonstrate that our method successfully induces trigger-activated CoT hijacking while maintaining a quantifiable distinction between hijacked and baseline states under our evaluation framework. We further explore a reasoning-based mitigation approach and release a safety-reasoning dataset to support future research on safety-aware and reliable reasoning. Our code is available at https://github.com/ChangWenhan/TSBH_official.
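
The abstract's second stage fine-tunes on MRTS-generated CoTs chosen for low embedding distance to the target outputs. Below is a minimal sketch of that selection step only, assuming a sentence-transformers encoder; the model choice (`all-MiniLM-L6-v2`) and the top-k ranking are illustrative assumptions, not choices reported in the paper.

```python
# Minimal sketch of the stage-two data selection described in the abstract:
# keep only synthesized CoTs whose embeddings sit closest to the target
# output. The encoder below is an assumed general-purpose choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption, not from the paper

def select_cots(candidates: list[str], target_output: str, top_k: int = 1) -> list[str]:
    """Rank candidate CoTs by cosine similarity to the target output and
    keep the top_k closest (i.e., lowest embedding distance)."""
    cot_emb = model.encode(candidates, convert_to_tensor=True)
    out_emb = model.encode(target_output, convert_to_tensor=True)
    sims = util.cos_sim(cot_emb, out_emb).squeeze(-1)  # shape: (num_candidates,)
    ranked = sims.argsort(descending=True)[:top_k]
    return [candidates[i] for i in ranked.tolist()]

kept = select_cots(["candidate CoT A ...", "candidate CoT B ..."], "target output ...")
```

Selecting by embedding distance rather than generation order is what the abstract credits with keeping the synthesized CoTs semantically aligned with the downstream outputs.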
