Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection
TLDR
This paper introduces Latent Adversarial Detection, a method that uses LLM activation signatures to detect multi-turn prompt injection attacks, reaching 89.4% detection at a 2.4% false positive rate on mixed real-world data.
Key contributions
- Identifies 'adversarial restlessness' in LLM activations as a signature for multi-turn prompt injection attacks.
- Develops five scalar trajectory features that boost conversation-level detection from 76.2% to 93.8% on held-out synthetic data.
- Shows the signal generalizes across four LLM families (24B-70B), though probes are model-specific.
- Achieves 89.4% detection at 2.4% FPR on mixed real-world data by training on diverse attack sources.
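The paper does not spell out its five trajectory features in this digest, but the core idea (summarizing how far a conversation's residual-stream activations wander across turns) can be sketched with a few plausible scalars. The feature names below are illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

def trajectory_features(turn_activations):
    """Summarize a conversation's activation trajectory with scalar features.

    turn_activations: (n_turns, d_model) array, one residual-stream vector
    per turn (e.g. mean-pooled over the turn's tokens).
    """
    acts = np.asarray(turn_activations, dtype=float)
    steps = np.linalg.norm(np.diff(acts, axis=0), axis=1)  # per-turn shift sizes
    path_length = steps.sum()                              # total distance traveled
    net_disp = np.linalg.norm(acts[-1] - acts[0])          # start-to-end distance
    return {
        "path_length": path_length,            # "restlessness": large when phases shift
        "mean_step": steps.mean(),
        "max_step": steps.max(),
        "net_displacement": net_disp,
        "tortuosity": path_length / (net_disp + 1e-8),  # wandering vs. direct drift
    }
```

Under this sketch, a benign conversation that drifts smoothly yields a short path, while trust-building, pivoting, and escalation each add a large activation jump, inflating `path_length` and `tortuosity`.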
Why it matters
This research provides a novel, activation-level approach to detecting sophisticated multi-turn prompt injection attacks that evade text-based defenses. By identifying 'adversarial restlessness,' it substantially improves detection rates and characterizes the data requirements (diverse attack sources and three-phase turn-level labels) for robust detection. This matters for deploying safer and more reliable AI systems.
Original Abstract
Multi-turn prompt injection follows a known attack path -- trust-building, pivoting, escalation -- but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data. The signal replicates across four model families (24B-70B); probes are model-specific and do not transfer across architectures. Generalization is source-dependent: leave-one-source-out evaluation shows each of synthetic, LMSYS-Chat-1M, and SafeDialBench captures distinct attack distributions, with detection on real-world LMSYS reaching 47-71% when its distribution is represented in training. Combined three-source training achieves 89.4% detection at 2.4% false positive rate on a held-out mixed set. We further show that three-phase turn-level labels (benign/pivoting/adversarial) unique to our synthetic dataset are essential: binary conversation-level labels produce 50-59% false positives. These results establish adversarial restlessness as a reliable activation-level signal and characterize the data requirements for practical deployment.
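The abstract's probes map trajectory features to a conversation-level attack score. As a minimal sketch, a logistic-regression probe over per-conversation feature vectors could look like the following; the training setup (plain gradient descent, feature standardization) is an assumption for illustration, not the paper's reported configuration:

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression probe on per-conversation feature vectors.

    X: (n_conversations, n_features) trajectory features.
    y: 0 for benign, 1 for attack.
    Returns a function mapping new feature rows to attack probabilities.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize so features on different scales contribute comparably.
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-8
    Xs = (X - mu) / sd
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))   # sigmoid predictions
        grad = p - y                               # d(log-loss)/d(logits)
        w -= lr * (Xs.T @ grad) / len(y)
        b -= lr * grad.mean()
    return lambda Xn: 1.0 / (
        1.0 + np.exp(-(((np.asarray(Xn, dtype=float) - mu) / sd) @ w + b))
    )
```

The abstract's model-specificity finding implies such a probe must be refit per architecture: the learned weights encode one model's activation geometry and do not transfer.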