Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection
TLDR
This paper introduces Latent Adversarial Detection, a method that uses LLM activation signatures to detect multi-turn prompt injection attacks, reaching 89.4% detection at a 2.4% false positive rate on mixed real-world data.
Key contributions
- Identifies 'adversarial restlessness' in LLM activations as a signature for multi-turn prompt injection attacks.
- Develops five scalar trajectory features that boost conversation-level detection from 76.2% to 93.8% on held-out synthetic data.
- Shows the signal generalizes across four LLM families (24B-70B), though probes are model-specific.
- Achieves 89.4% detection at 2.4% FPR on mixed real-world data by training on diverse attack sources.
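The paper does not spell out its five trajectory features in this digest, but the core idea (summarizing how far a conversation's residual-stream activations wander across turns) can be sketched with a few plausible scalars. The feature names below are illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

def trajectory_features(turn_activations):
    """Summarize a conversation's activation trajectory with scalar features.

    turn_activations: (n_turns, d_model) array, one residual-stream vector
    per turn (e.g. mean-pooled over the turn's tokens).
    """
    acts = np.asarray(turn_activations, dtype=float)
    steps = np.linalg.norm(np.diff(acts, axis=0), axis=1)  # per-turn shift sizes
    path_length = steps.sum()                              # total distance traveled
    net_disp = np.linalg.norm(acts[-1] - acts[0])          # start-to-end distance
    return {
        "path_length": path_length,            # "restlessness": large when phases shift
        "mean_step": steps.mean(),
        "max_step": steps.max(),
        "net_displacement": net_disp,
        "tortuosity": path_length / (net_disp + 1e-8),  # wandering vs. direct drift
    }
```

Under this sketch, a benign conversation that drifts smoothly yields a short path, while trust-building, pivoting, and escalation each add a large activation jump, inflating `path_length` and `tortuosity`.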
Why it matters
This research provides a novel, activation-level approach to detecting sophisticated multi-turn prompt injection attacks that evade text-based defenses. By identifying 'adversarial restlessness,' it substantially improves detection rates and characterizes the data requirements (diverse attack sources and three-phase turn-level labels) for robust detection. This matters for deploying safer and more reliable AI systems.
Original Abstract
Multi-turn prompt injection follows a known attack path -- trust-building, pivoting, escalation -- but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data. The signal replicates across four model families (24B-70B); probes are model-specific and do not transfer across architectures. Generalization is source-dependent: leave-one-source-out evaluation shows each of synthetic, LMSYS-Chat-1M, and SafeDialBench captures distinct attack distributions, with detection on real-world LMSYS reaching 47-71% when its distribution is represented in training. Combined three-source training achieves 89.4% detection at 2.4% false positive rate on a held-out mixed set. We further show that three-phase turn-level labels (benign/pivoting/adversarial) unique to our synthetic dataset are essential: binary conversation-level labels produce 50-59% false positives. These results establish adversarial restlessness as a reliable activation-level signal and characterize the data requirements for practical deployment.
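The abstract's probes map trajectory features to a conversation-level attack score. As a minimal sketch, a logistic-regression probe over per-conversation feature vectors could look like the following; the training setup (plain gradient descent, feature standardization) is an assumption for illustration, not the paper's reported configuration:

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression probe on per-conversation feature vectors.

    X: (n_conversations, n_features) trajectory features.
    y: 0 for benign, 1 for attack.
    Returns a function mapping new feature rows to attack probabilities.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize so features on different scales contribute comparably.
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-8
    Xs = (X - mu) / sd
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))   # sigmoid predictions
        grad = p - y                               # d(log-loss)/d(logits)
        w -= lr * (Xs.T @ grad) / len(y)
        b -= lr * grad.mean()
    return lambda Xn: 1.0 / (
        1.0 + np.exp(-(((np.asarray(Xn, dtype=float) - mu) / sd) @ w + b))
    )
```

The abstract's model-specificity finding implies such a probe must be refit per architecture: the learned weights encode one model's activation geometry and do not transfer.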