ArXiv TLDR

ATANT v1.1: Positioning Continuity Evaluation Against Memory, Long-Context, and Agentic-Memory Benchmarks

arXiv:2604.10981

Samuel Sameer Tanguturi

cs.AI, cs.IR

TLDR

ATANT v1.1 shows that existing memory benchmarks fail to evaluate "continuity" as defined by v1.0: no benchmark covers more than 2 of its 7 required properties, and the median covers just 1.

Key contributions

  • Conducts a structural analysis showing that seven existing memory benchmarks (LOCOMO, LongMemEval, BEAM, MemoryBench, Zep, Letta/MemGPT, RULER) do not measure "continuity."
  • Quantifies that existing evaluations cover a median of 1 and a maximum of 2 of ATANT v1.0's 7 continuity properties (mean 0.43 when partial credit is scored at 0.5); the tally is sketched after this list.
  • Identifies methodological defects specific to each benchmark, including a critical empty-gold scoring bug in the LOCOMO reference implementation that renders 23% of its corpus unscorable by construction (illustrated after the abstract below).
  • Publishes a calibration pair (ATANT 96% vs. LOCOMO 8.8%) as evidence that the two benchmarks measure different properties, not that one system is an order of magnitude better than another.
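For readers who want the tally mechanics, here is a minimal sketch of the property-coverage scoring, assuming a toy matrix. The row values below are placeholders, not the paper's cell-by-cell data; only the 7-property structure and the 0.5 partial-credit rule come from the paper.

```python
from statistics import median

# Toy property-coverage matrix: one row per benchmark, one cell per
# ATANT v1.0 continuity property. 1.0 = full coverage, 0.5 = partial,
# 0.0 = none. These values are illustrative placeholders only.
coverage = {
    "LOCOMO":      [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "LongMemEval": [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    "BEAM":        [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "RULER":       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    # ... remaining benchmarks elided
}

# Full-coverage counts drive the median/max claims; partial cells add
# 0.5 each to the totals that drive the mean claim.
full_counts = [sum(1 for cell in row if cell == 1.0) for row in coverage.values()]
totals = [sum(row) for row in coverage.values()]

# On the paper's real matrix, v1.1 reports median 1, max 2, mean 0.43.
print("median full coverage:", median(full_counts))
print("max full coverage:", max(full_counts))
print("mean with partial credit:", sum(totals) / len(totals))
```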

Why it matters

This paper clarifies "continuity" as a concept in LLM evaluation distinct from what general memory benchmarks measure. It prevents misinterpretation of evaluation results and directs researchers to invest in the specific properties needed for truly continuous systems.

Original Abstract

ATANT v1.0 (arXiv:2604.06710) defined continuity as a system property with 7 required properties and introduced a 10-checkpoint, LLM-free evaluation methodology validated on a 250-story corpus. Since publication, a recurring reviewer and practitioner question has concerned not the framework itself but its relationship to a wider set of memory evaluations: LOCOMO, LongMemEval, BEAM, MemoryBench, Zep's evaluation suite, Letta/MemGPT's evaluations, and RULER. This companion paper, v1.1, does not modify the v1.0 standard. It closes a related-work gap that v1.0 left brief under page limits. We show by structural analysis that none of these benchmarks measures continuity as defined in v1.0: of the 7 required properties, the median existing eval covers 1 property, the mean covers 0.43 when partial credit is scored at 0.5, and no eval covers more than 2. We provide a cell-by-cell property-coverage matrix, identify methodological defects specific to each benchmark (including an empty-gold scoring bug in the LOCOMO reference implementation that renders 23% of its corpus unscorable by construction), and publish our reference implementation's LOCOMO score (8.8%) alongside the structural reason that number is uninformative about continuity. We publish our 8.8% LOCOMO score alongside our 96% ATANT cumulative-scale score as a calibration pair: the 87-point divergence is evidence that the two benchmarks measure different properties, not that one system is an order of magnitude better than another. The position v1.1 takes is not adversarial: each benchmark measures a real capability. The claim is that none of them can adjudicate continuity, and conflating them with continuity evaluation has led the field to under-invest in the properties v1.0 names.
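To make the empty-gold defect concrete, here is a minimal sketch of how such a bug arises in token-F1 scoring and how a guard surfaces it. The function and field names (token_f1, score_corpus, pred, gold) are assumptions for illustration, not the actual LOCOMO reference implementation.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1, a common QA scoring metric."""
    pred_toks, gold_toks = prediction.split(), gold.split()
    if not pred_toks or not gold_toks:
        # An empty gold answer makes the item unscorable by construction:
        # every prediction receives 0.0 no matter how good it is.
        return 0.0
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def score_corpus(items: list[dict]) -> float:
    """Average F1 over scorable items, flagging empty-gold rows instead
    of silently averaging their guaranteed zeros into the corpus score."""
    scorable = [it for it in items if it.get("gold", "").strip()]
    skipped = len(items) - len(scorable)
    if skipped:
        print(f"warning: {skipped}/{len(items)} items have empty gold answers")
    if not scorable:
        return 0.0
    return sum(token_f1(it["pred"], it["gold"]) for it in scorable) / len(scorable)

if __name__ == "__main__":
    demo = [{"pred": "paris", "gold": "paris"}, {"pred": "rome", "gold": ""}]
    print(score_corpus(demo))  # warns about the empty-gold item, scores the rest
```

The design point is the filter-and-warn step: averaging guaranteed zeros into the corpus score silently deflates it, which is the structural reason v1.1 gives for the LOCOMO number being uninformative about continuity.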
