ArXiv TLDR

MEME: Multi-entity & Evolving Memory Evaluation

arXiv:2605.12477

Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun, Seong Joon Oh

cs.LG · cs.CL

TLDR

MEME is a new benchmark evaluating LLM agents' multi-entity and evolving memory, revealing severe limitations in dependency reasoning.

Key contributions

  • Introduces MEME, a benchmark for multi-entity and evolving memory in LLM agents, extending beyond single-entity updates.
  • Defines new tasks like Cascade, Absence (dependency reasoning), and Deletion to assess complex memory capabilities.
  • Reveals that all six evaluated memory systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% average accuracy) despite adequate static retrieval.
  • Shows that common fixes (prompt optimization, deeper retrieval, reduced filler noise) and most stronger LLMs fail to close the gap; the one partial fix costs roughly 70x the baseline, making it impractical at scale.
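The dependency-reasoning tasks above can be pictured with a minimal toy sketch. This is a hypothetical illustration only (the entity names, key scheme, and `manager_of` helper are all invented here, not MEME's actual episode format): a "Cascade" update means changing one stored fact must also change the answers to questions about facts that depend on it, even though those dependent facts were never restated.

```python
# Hypothetical key-value memory; keys and entities are illustrative only.
memory = {
    "alice.team": "Platform",
    "platform.manager": "Bob",
}

def manager_of(person: str) -> str:
    """Derived fact: a person's manager follows from their current team."""
    team = memory[f"{person}.team"]
    return memory[f"{team.lower()}.manager"]

# Initially, Alice's manager is derived through her team assignment.
assert manager_of("alice") == "Bob"

# Cascade: one update to Alice's team must propagate. Her manager changes
# too, even though no "alice.manager" entry was ever written or updated.
memory["alice.team"] = "Infra"
memory["infra.manager"] = "Carol"
assert manager_of("alice") == "Carol"
```

In this framing, an Absence-style query asks about a derived fact that only holds if the system notices the dependency (here, that Alice is no longer on Platform), which is exactly where the paper reports retrieval-based systems failing.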

Why it matters

This paper exposes a critical flaw in current LLM agents: their inability to reason over dependent, evolving information in persistent environments. It shows that even advanced LLMs and optimization techniques fail to solve this, highlighting a fundamental challenge for building robust, long-term memory agents.

Original Abstract

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
