STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
Hanxiang Chao, Yihan Bai, Rui Sheng, Tianle Li, Yushi Sun
TLDR
LLM agents struggle to revise memories that new evidence has implicitly invalidated; the STALE benchmark and CUPMem prototype target this critical "Implicit Conflict" failure mode.
Key contributions
- Identifies "Implicit Conflict," a critical failure mode in which a later observation invalidates an earlier memory without explicit negation, and LLMs fail to infer the invalidation.
- Introduces STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries) for assessing memory revision in LLM agents.
- Proposes a three-dimensional probing framework: State Resolution, Premise Resistance, and Implicit Policy Adaptation (see the sketch after this list).
- Reveals that even the best evaluated frontier LLM reaches only 55.2% overall accuracy on STALE, showing that models struggle to act on updated information.
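To make the three probing dimensions concrete, here is a minimal Python sketch of how a single conflict scenario might be queried along all three. The scenario content, field names, and probe wording are illustrative assumptions, not the benchmark's actual schema.

```python
# One STALE-style implicit conflict, probed along the three dimensions.
# All fields and example text below are invented for illustration.
from dataclasses import dataclass

@dataclass
class ConflictScenario:
    old_memory: str    # earlier belief stored by the agent
    new_evidence: str  # later observation that implicitly invalidates it

scenario = ConflictScenario(
    old_memory="User commutes to work by car.",
    new_evidence="User mentions they sold their car last month.",
)

# One probe per dimension; each tests a different way of acting on the update.
probes = {
    # State Resolution: can the model report the current (updated) state?
    "state_resolution": "How does the user get to work these days?",
    # Premise Resistance: the query falsely presupposes the stale state.
    "premise_resistance": "Which parking garage near the office should the user use?",
    # Implicit Policy Adaptation: downstream behavior should reflect the update.
    "policy_adaptation": "Plan the user's morning commute routine.",
}

for dimension, query in probes.items():
    # A real harness would send the full context plus `query` to the model
    # and score whether the answer respects `scenario.new_evidence`.
    print(f"[{dimension}] {query}")
```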
Why it matters
LLM agents need adaptive memory, yet they struggle to update beliefs when new information implicitly invalidates old ones. This paper introduces the STALE benchmark and the CUPMem prototype to address that gap, paving the way for more reliable, state-aware agents.
Original Abstract
Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it, with even the best evaluated model achieving only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user's query, and they struggle to recognize when a change in one aspect of the user's state should invalidate related memories. To establish an initial baseline for state-aware memory, we further present CUPMem, a prototype that strengthens write-time revision through structured state consolidation and propagation-aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.
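For intuition on what write-time revision could look like, the sketch below implements a toy state-aware memory: each write consolidates an observation into an explicit state record, and a change to one state invalidates records that depended on it. This is a minimal illustration of the general idea of explicit state adjudication, not CUPMem's actual design; the `related` linkage and the state keys are assumptions made for the example.

```python
# Toy write-time state adjudication: consolidate each observation into an
# explicit state record and propagate invalidation to dependent records.
from dataclasses import dataclass, field

@dataclass
class StateRecord:
    key: str                 # e.g. "transport.mode" (illustrative key scheme)
    value: str
    valid: bool = True
    related: set[str] = field(default_factory=set)  # keys affected by changes

class StateAwareMemory:
    def __init__(self) -> None:
        self.records: dict[str, StateRecord] = {}

    def write(self, key: str, value: str) -> None:
        """Consolidate a new observation and propagate invalidation."""
        old = self.records.get(key)
        if old is not None and old.value != value:
            old.valid = False
            # Propagation-aware step: a change in one state may invalidate
            # memories that depended on it (selling the car should also
            # invalidate the remembered parking spot).
            for rel_key in old.related:
                if rel_key in self.records:
                    self.records[rel_key].valid = False
        inherited = old.related if old else set()
        self.records[key] = StateRecord(key, value, related=inherited)

    def read(self, key: str) -> str | None:
        rec = self.records.get(key)
        return rec.value if rec and rec.valid else None

mem = StateAwareMemory()
mem.write("transport.mode", "car")
mem.records["transport.mode"].related.add("parking.spot")
mem.write("parking.spot", "garage on 5th Ave")
mem.write("transport.mode", "bicycle")   # implicit conflict: the car is gone
assert mem.read("parking.spot") is None  # stale dependent memory invalidated
print(mem.read("transport.mode"))        # -> "bicycle"
```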