MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
TLDR
Introduces MEMSAD, a gradient-coupled anomaly detection defense with formal guarantees that secures retrieval-augmented LLM agents against memory poisoning attacks.
Key contributions
- Formalizes memory poisoning attacks on retrieval-augmented agents as a Stackelberg game and corrects an inconsistency in a prior evaluation protocol (Chen et al., 2024).
- Presents MEMSAD, a gradient-coupled anomaly detection defense with a certified detection radius.
- Proves MEMSAD's minimax optimality and derives online regret bounds for rolling calibration.
- Identifies a discrete synonym-invariance loophole that evades continuous-space defenses, marking the boundary of what embedding-based detection can guarantee.
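The certified detection radius mentioned above admits a simple toy illustration, not the paper's actual construction: if the anomaly score is Lipschitz in embedding space with constant L, then a flagged point with margin above the calibrated threshold stays flagged under any perturbation smaller than margin/L. The score function, threshold, and embedding below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.ones(16) / 4.0  # unit-norm calibration centroid (||mu|| = 1), assumed

def score(u):
    # Toy anomaly score: cosine-style distance to the centroid.
    # Linear in u with ||mu|| = 1, so it is 1-Lipschitz (L = 1).
    return 1.0 - u @ mu

tau = 0.05   # calibrated detection threshold (made-up value for the sketch)
L = 1.0      # Lipschitz constant of the score
x = -mu      # an anomalous unit embedding, far from the benign cluster

margin = score(x) - tau
radius = margin / L  # certified radius: any ||delta|| < radius stays flagged

# Empirical check: random perturbations strictly inside the radius
# can never push the score below the threshold.
for _ in range(1000):
    d = rng.normal(size=16)
    d = 0.99 * radius * d / np.linalg.norm(d)
    assert score(x + d) > tau
print(f"margin = {margin:.3f}, certified radius = {radius:.3f}")
```

The guarantee here is just the reverse triangle inequality for a Lipschitz score; MEMSAD's actual certificate additionally depends on encoder regularity conditions stated in the paper.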
Why it matters
Retrieval-augmented LLM agents depend on persistent external memory, which makes them vulnerable to memory poisoning. This paper provides the first formal characterization of such attacks and introduces MEMSAD, a defense with strong theoretical guarantees (a certified detection radius, minimax optimality, and online regret bounds for rolling calibration). It also exposes a synonym-substitution loophole that current embedding-based defenses cannot close, pointing the way for future research.
Original Abstract
Persistent external memory enables LLM agents to maintain context across sessions, yet its security properties remain formally uncharacterized. We formalize memory poisoning attacks on retrieval-augmented agents as a Stackelberg game with a unified evaluation framework spanning three attack classes with escalating access assumptions. Correcting an evaluation protocol inconsistency in the triggered-query specification of Chen et al. (2024), we show faithful evaluation increases measured attack success by $4\times$ (ASR-R: $0.25 \to 1.00$). Our primary contribution is MEMSAD (Semantic Anomaly Detection), a calibration-based defense grounded in a gradient coupling theorem: under encoder regularity, the anomaly score gradient and the retrieval objective gradient are provably identical, so any continuous perturbation that reduces detection risk necessarily degrades retrieval rank. This coupling yields a certified detection radius guaranteeing correct classification regardless of adversary strategy. We prove minimax optimality via Le Cam's method, showing any threshold detector requires $Ω(1/ρ^2)$ calibration samples and MEMSAD achieves this up to $\log(1/δ)$ factors. We further derive online regret bounds for rolling calibration at rate $O(σ^{2/3}Δ^{1/3})$, and formally characterize a discrete synonym-invariance loophole that marks the boundary of what continuous-space defenses can guarantee. Experiments on a $3 \times 5$ attack-defense matrix with bootstrap confidence intervals, Bonferroni-corrected hypothesis tests, and Clopper-Pearson validation ($n=1{,}000$) confirm: composite defenses achieve TPR $= 1.00$, FPR $= 0.00$ across all attacks, while synonym substitution evades detection at $Δ$ ASR-R $\approx 0$, exposing a gap existing embedding-based defenses cannot close.
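The gradient-coupling idea in the abstract can be sketched numerically: when the anomaly score and the retrieval objective are computed in the same embedding space, a continuous perturbation that lowers the anomaly score drags the poisoned entry away from its trigger query, degrading retrieval. This is a minimal sketch with an assumed toy encoder (L2 normalization), centroid-based score, and arbitrary step sizes, not the paper's MEMSAD implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    # Toy "encoder": L2-normalize onto the unit sphere (a stand-in
    # for a real sentence encoder; an assumption for illustration).
    return x / np.linalg.norm(x)

# Benign calibration embeddings clustered around a centroid mu.
mu = encode(np.ones(8))
calib = np.stack([encode(mu + 0.1 * rng.normal(size=8)) for _ in range(200)])

def anomaly(x):
    # Anomaly score: cosine distance to the calibration centroid.
    return 1.0 - encode(x) @ mu

# Calibrated threshold: empirical 99th percentile of benign scores.
tau = np.quantile([anomaly(c) for c in calib], 0.99)

# A trigger query q far from the benign cluster, with a poisoned
# memory entry planted at q so it wins retrieval for that query.
q = encode(np.array([1.0, -1, -1, -1, -1, -1, -1, -1]))
x = 1.0 * q  # poisoned memory embedding (initially identical to q)

s0, r0 = anomaly(x), encode(x) @ q  # initial anomaly / retrieval similarity

# Adversary's evasion attempt: gradient descent on the anomaly score
# (numerical gradient, for clarity over speed).
eps = 1e-5
for _ in range(200):
    g = np.array([(anomaly(x + eps * e) - anomaly(x - eps * e)) / (2 * eps)
                  for e in np.eye(8)])
    x = x - 0.05 * g

s1, r1 = anomaly(x), encode(x) @ q
print(f"anomaly: {s0:.3f} -> {s1:.3f}, retrieval sim: {r0:.3f} -> {r1:.3f}")
```

In this toy setting, evading detection (anomaly score drops) necessarily ruins the attack (similarity to the trigger query drops), which is the coupling the paper formalizes. The synonym-invariance loophole sidesteps exactly this mechanism: a discrete substitution can change the text without a continuous path through embedding space.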