ArXiv TLDR

Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance

arXiv:2605.01699

Anamika Paul Rupa, Anietie Andy

cs.LG · cs.AI · cs.CR · cs.NE

TLDR

This paper introduces Probe-Geometry Alignment (PGA), a rank-one intervention that surgically erases hidden memorization traces in LLMs, driving adversarial probes below random chance without measurable capability cost.

Key contributions

  • Developed a leave-one-out cross-sequence probe that detects hidden memorization signatures in LLMs across model scales (Pythia-70M, GPT-2 medium, Mistral-7B).
  • Introduced Probe-Geometry Alignment (PGA), a surgical erasure method that drives cross-sequence probe accuracy below random chance.
  • Showed that PGA withstands six adversarial probe variants and a re-fitting attacker while preserving five zero-shot capability benchmarks within 2.8 percentage points per task.
  • Demonstrated that the memorization signature is causally separable from behavioural recall and removable with a single rank-one intervention per depth at no measurable capability cost.
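The leave-one-out protocol behind the first bullet can be sketched in a few lines: fit a linear probe on activations from all sequences but one, then test whether the memorization signature transfers to the held-out sequence. Below is a minimal NumPy sketch; the function names and the choice of a logistic-regression probe are our assumptions, not the paper's code.

```python
import numpy as np

def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe trained by plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient of log-loss
        b -= lr * np.mean(p - y)
    return w, b

def cross_sequence_probe_accuracy(acts_by_seq, labels_by_seq):
    """Leave-one-out cross-sequence probe (hypothetical sketch).

    acts_by_seq:   list of (tokens, d_model) activation arrays, one per sequence.
    labels_by_seq: list of per-token memorized/non-memorized labels (0/1).
    A signature that generalizes to held-out sequences scores above chance;
    after an erasure like PGA it should fall to or below chance.
    """
    accs = []
    for i in range(len(acts_by_seq)):
        X_tr = np.concatenate([a for j, a in enumerate(acts_by_seq) if j != i])
        y_tr = np.concatenate([l for j, l in enumerate(labels_by_seq) if j != i])
        w, b = fit_linear_probe(X_tr, y_tr)
        pred = (acts_by_seq[i] @ w + b) > 0
        accs.append(np.mean(pred == labels_by_seq[i]))
    return float(np.mean(accs))
```

On synthetic activations with a genuine shared direction separating the two classes, this probe scores well above 0.5 on every held-out sequence; that gap over chance is the "memorization-specific gap" the digest quotes.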

Why it matters

Recent attacks show that LLMs can retain sensitive data internally even after behavioural unlearning. This work provides a method, PGA, to erase those internal traces so that unlearned information is genuinely unrecoverable by probing, which matters for deploying models with stronger privacy and security guarantees.

Original Abstract

Recent attacks show that behavioural unlearning of large language models leaves internal traces recoverable by adversarial probes. We characterise where this retention lives and show it can be surgically removed without measurable capability cost. Our central protocol is a leave-one-out cross-sequence probe that tests whether a memorisation signature generalises across held-out sequences. The signature is real and consistent across scale: memorisation-specific gaps of +0.32, +0.19, +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B; on Pythia-70M, the random-initialisation control collapses to -0.04 at the deepest layer where the pretrained signature peaks. The probe direction is causally separable from recall -- projecting it out collapses the signature locally (+0.44 -> -0.19) while behavioural recall barely changes -- and a probe trained on naturally memorised content does not classify fine-tuning-injected secrets, marking two representationally distinct regimes. We then introduce probe-geometry alignment (PGA), a surgical erasure that aligns activations along the probe's live readout direction at each depth. PGA drives the cross-sequence probe below random chance at all four scales tested (toy depth-4: 0.17; Pythia-70M: 0.07; Mistral-7B: 0.45; GPT-2 medium: 0.06 via MD-PGA k=2) and remains robust to six adversarial probe variants. Against a re-fitting attacker who trains a fresh probe on PGA-treated activations, we extend PGA adversarially, defeating the re-fit probe at every memorisation-relevant depth while preserving five zero-shot capability benchmarks within 2.8 percentage points per task (mean Δacc = +0.2pp). The cross-sequence signature is a real, causally separable, regime-specific property of pretrained representations -- removable below chance with a single rank-one intervention per depth at no measurable capability cost.
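The abstract's causal-separability test ("projecting it out collapses the signature") and PGA's per-depth rank-one intervention both act along a single probe readout direction. A minimal sketch of such a rank-one ablation, assuming the probe direction `w` at one depth is already known (the function name is ours, and the paper's PGA alignment is more involved than this plain projection):

```python
import numpy as np

def project_out(acts, w):
    """Ablate the component of each activation along probe direction w.

    acts: (n, d) activations at one depth; w: (d,) probe readout direction.
    Returns activations with the rank-one component along w removed, the
    kind of single rank-one intervention per depth the abstract describes.
    """
    u = w / np.linalg.norm(w)          # unit readout direction
    return acts - np.outer(acts @ u, u)  # x - (x . u) u for each row
```

After this projection the probe's readout on the treated activations is exactly zero, so a fixed probe along `w` collapses to chance; defeating a *re-fit* probe trained on the treated activations is the harder adversarial setting the paper addresses separately.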
