RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval
TLDR
RecaLLM tackles LLMs' 'lost-in-thought' phenomenon by interleaving reasoning with explicit in-context retrieval, boosting long-context performance.
Key contributions
- RecaLLM addresses 'lost-in-thought' by interleaving reasoning with explicit in-context retrieval.
- Employs a negligible-overhead constrained decoding mechanism for verbatim copying of evidence spans.
- Achieves strong performance on RULER and HELMET, outperforming baselines.
- Shows consistent gains at context windows of up to 128K tokens despite training on samples of at most 10K tokens.
Why it matters
LLMs often struggle with long contexts because of 'lost-in-thought': reasoning steps that improve performance also make subsequent in-context retrieval harder. RecaLLM addresses this by explicitly interleaving retrieval with reasoning, significantly boosting performance on long-context tasks. Because its gains at 128K-token contexts come from training samples of at most 10K tokens, it also reduces the need for expensive long-context training data, making it a promising and efficient approach.
Original Abstract
We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open-source LLMs, we observe that in-context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test-time scaling that we refer to as lost-in-thought: reasoning steps that improve performance also make subsequent in-context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in-context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible-overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long-context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long-context approaches, highlighting a promising path toward improving long-context performance without expensive long-context training data.
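The abstract's constrained decoding for verbatim evidence copying can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the core idea under a simplifying assumption: at each decoding step, the model may emit only tokens that keep the copied sequence a verbatim span of the context, which can be enforced by masking the logits to an allowed set. The function name and the use of integer token IDs are illustrative.

```python
def allowed_next_tokens(context_ids, copied_ids):
    """Return the set of token IDs that can extend `copied_ids` so that it
    remains a verbatim (contiguous) span of `context_ids`.

    A decoder would mask its logits to this set at each step, guaranteeing
    the emitted evidence span is copied exactly from the context.
    """
    n = len(copied_ids)
    allowed = set()
    # Scan every position where the already-copied prefix matches the context,
    # and collect the token that immediately follows each match.
    for i in range(len(context_ids) - n):
        if context_ids[i:i + n] == copied_ids:
            allowed.add(context_ids[i + n])
    return allowed


# Toy example with integer token IDs standing in for a tokenized context.
context = [5, 3, 7, 3, 9]

# Nothing copied yet: any context token that has a successor could start a span.
print(allowed_next_tokens(context, []))    # {5, 3, 7, 9}

# After copying token 3, only tokens following some occurrence of 3 are legal.
print(allowed_next_tokens(context, [3]))   # {7, 9}
```

The scan here is linear per step for clarity; a real implementation would likely precompute a suffix automaton or hash index over the context so the per-token overhead stays negligible, as the paper claims for its mechanism.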