ArXiv TLDR

NuggetIndex: Governed Atomic Retrieval for Maintainable RAG

2604.27306

Saber Zerhoudi, Michael Granitzer, Jelena Mitrovic

cs.IR

TLDR

NuggetIndex introduces governed atomic information units ("nuggets") for RAG, improving maintainability and temporal correctness while reducing conflicts.

Key contributions

  • Introduces "nuggets" as atomic, governed information units for RAG systems.
  • Filters invalid or deprecated nuggets before ranking, which keeps outdated information out of the generator input.
  • Improves nugget recall by 42% and temporal correctness by 9 percentage points over passage and unmanaged proposition retrieval baselines.
  • Reduces conflict rates by 55% and generator input length by 64%.

Why it matters

RAG systems struggle with evolving corpora and outdated facts. NuggetIndex addresses this by managing information at an atomic level, ensuring temporal correctness and reducing conflicts. This leads to more reliable and maintainable RAG, crucial for dynamic knowledge bases.

Original Abstract

Retrieval-augmented generation (RAG) systems are frequently evaluated via fact-based metrics, yet standard implementations retrieve passages or static propositions. This unit mismatch between evaluation and retrieval objects hinders maintenance when corpora evolve and fails to capture superseded facts or source disagreements. We propose NuggetIndex, a retrieval system that stores atomic information units as managed records, so-called nuggets. Each record maintains links to evidence, a temporal validity interval, and a lifecycle state. By filtering invalid or deprecated nuggets prior to ranking, the system prevents the inclusion of outdated information. We evaluate the approach using a nuggetized MS MARCO subset, a temporal Wikipedia QA dataset, and a multi-hop QA task. Against passage and unmanaged proposition retrieval baselines, NuggetIndex improves nugget recall by 42%, increases temporal correctness by 9 percentage points without the recall collapse observed in time-filtered baselines, and reduces conflict rates by 55%. The compact nugget format reduces generator input length by 64% while enabling lightweight index structures suitable for browser-based and resource-constrained deployment. We release our implementation, datasets, and evaluation scripts.
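
Illustrative sketch

The abstract describes each nugget as a managed record with evidence links, a temporal validity interval, and a lifecycle state, filtered before ranking. The Python sketch below shows one way such a record and filter could look. It is not the authors' implementation: the names `Nugget`, `LifecycleState`, `is_valid_at`, and `filter_then_rank`, the half-open validity interval, and the toy term-overlap scoring are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional, Tuple


class LifecycleState(Enum):
    """Hypothetical lifecycle states; the paper may define others."""
    ACTIVE = "active"
    DEPRECATED = "deprecated"


@dataclass(frozen=True)
class Nugget:
    """One atomic information unit stored as a managed record (assumed shape)."""
    text: str
    evidence: Tuple[str, ...]      # links to supporting source passages
    valid_from: date               # start of validity (inclusive)
    valid_to: Optional[date]       # end of validity (exclusive); None = still valid
    state: LifecycleState

    def is_valid_at(self, when: date) -> bool:
        """True if the nugget is active and `when` lies inside its interval."""
        if self.state is not LifecycleState.ACTIVE:
            return False
        if when < self.valid_from:
            return False
        return self.valid_to is None or when < self.valid_to


def filter_then_rank(nuggets: List[Nugget], query: str, when: date) -> List[Nugget]:
    """Drop invalid or deprecated nuggets first, then rank the survivors.

    The ranking is a toy term-overlap score, standing in for whatever
    retriever a real system would use.
    """
    candidates = [n for n in nuggets if n.is_valid_at(when)]
    query_terms = set(query.lower().split())

    def overlap(n: Nugget) -> int:
        return len(query_terms & set(n.text.lower().split()))

    return sorted(candidates, key=overlap, reverse=True)
```

Because filtering happens before ranking in this sketch, a superseded or deprecated nugget never competes with a current one, however well its text matches the query.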
