ArXiv TLDR

Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing

🐦 Tweet
2605.10794

Ari Holtzman, Peter West

cs.CRcs.AI

TLDR

Frontier language models involuntarily leak secret information thematically in their writing, even when instructed not to, posing a privacy risk.

Key contributions

  • Frontier LLMs involuntarily leak secret information thematically in their writing, detectable by other models.
  • Leakage scales sharply with model size but disappears for short-form outputs like jokes.
  • Actively hiding a secret leads to detectable avoidance patterns in the generated text.
  • Decoy concepts can partially redirect the thematic leakage from the true secret.

Why it matters

This paper reveals a fundamental limitation in current frontier LLMs: their inability to fully compartmentalize information. This involuntary leakage poses significant privacy and security risks in applications handling sensitive data or requiring strict prompt secrecy. Understanding and mitigating this "information channel" is crucial for safe and reliable LLM deployment.

Original Abstract

Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically -- through topic choice, imagery, and setting--6hy-at rates significantly different from chance, up to 79\%. When told to actively hide the secret, models write \emph{away from} it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two model families, and disappears entirely for short-form writing like jokes. Giving the model a decoy concept to ``focus on instead'' partially redirects the leakage from the real secret to the decoy. Attending to a secret appears to open up an information channel that frontier LLMs cannot close, even when instructed to.

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.