ArXiv TLDR

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

arXiv:2604.11666

Hanqi Xiao, Vaidehi Patil, Zaid Khan, Hyunji Lee, Elias Stengel-Eskin, et al.

cs.CL · cs.AI · cs.LG

TLDR

AI double agents learn to steer attacker beliefs via Theory of Mind and fooling rewards, outperforming frontier LLMs on the new ToM-SB challenge.

Key contributions

  • Proposes ToM-SB, a novel challenge where AI defenders act as double agents to steer attacker beliefs.
  • Shows frontier LLMs (Gemini3-Pro, GPT-5.4) struggle with ToM-SB, even with Theory of Mind prompting.
  • Develops RL-trained AI Double Agents, finding a bidirectional link between Theory of Mind and fooling success.
  • AI Double Agents combining ToM and fooling rewards significantly outperform frontier LLMs on hard scenarios (a minimal reward sketch follows this list).
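
To make the training signal concrete, here is a minimal sketch of how fooling and ToM rewards could be combined. The binary fooling check, the belief-matching ToM score, and the mixing weight `alpha` are all illustrative assumptions; the summary does not specify the paper's exact reward formulation.

```python
# Minimal sketch of combining fooling and ToM rewards for RL training.
# All names here (decoy_secret, predicted/actual belief dicts, alpha)
# are illustrative assumptions, not the paper's formulation.

def fooling_reward(attacker_final_belief: str, decoy_secret: str) -> float:
    """1.0 if the attacker walks away believing the planted decoy, else 0.0."""
    return 1.0 if attacker_final_belief == decoy_secret else 0.0

def tom_reward(predicted: dict, actual: dict) -> float:
    """Fraction of the attacker's true beliefs the defender predicted correctly."""
    if not actual:
        return 0.0
    hits = sum(predicted.get(k) == v for k, v in actual.items())
    return hits / len(actual)

def combined_reward(final_belief: str, decoy: str,
                    predicted: dict, actual: dict, alpha: float = 0.5) -> float:
    """Weighted mix of the two signals; alpha is a hypothetical mixing weight."""
    return alpha * fooling_reward(final_belief, decoy) + (1 - alpha) * tom_reward(predicted, actual)
```

A weighted sum keeps the two signals independently tunable, which fits the reported bidirectional link: optimizing either term alone improves the other.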

Why it matters

This research supports safer conversational AI, strengthening privacy and security when LLMs interact with adversaries. Training a model to steer an adversary's beliefs also sharpens its theory of mind, and vice versa, a capability vital for robust, secure AI systems.

Original Abstract

As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.
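
To illustrate the interaction protocol the abstract describes, here is a minimal sketch of a single ToM-SB episode. The `attacker`/`defender` agent wrappers, their methods, and the `Scenario` fields are assumed interfaces for illustration, not the paper's actual API.

```python
# Minimal sketch of a single ToM-SB episode. The attacker/defender agent
# wrappers, their methods, and the Scenario fields are assumed interfaces
# for illustration only.

from dataclasses import dataclass

@dataclass
class Scenario:
    opening_message: str  # sets up the shared universe
    decoy_secret: str     # the false belief the defender tries to plant

def run_episode(defender, attacker, scenario: Scenario, max_turns: int = 8) -> bool:
    """Return True if the attacker ends the dialogue holding the decoy belief."""
    history = [scenario.opening_message]
    for _ in range(max_turns):
        probe = attacker.respond(history)            # attacker probes for the secret
        history.append(probe)
        beliefs = defender.predict_beliefs(history)  # defender's ToM estimate of the attacker
        reply = defender.respond(history, beliefs)   # reply conditioned on that estimate
        history.append(reply)
    # Fooling succeeds when the attacker reports the planted decoy as the
    # "sensitive information" it believes it extracted.
    return attacker.report_belief(history) == scenario.decoy_secret
```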
