KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez
TLDR
KV-Fold enables stable, training-free long-context inference by treating the KV cache as the accumulator in a left fold over sequence chunks, achieving exact long-range retrieval up to 128K tokens within a single 40GB GPU.
Key contributions
- Introduces KV-Fold, a training-free protocol for long-context inference using KV-cache recurrence.
- Treats the KV cache as an accumulator, processing sequence chunks sequentially like `foldl` (see the schematic after this list).
- Shows the induced recurrence is stable: per-step drift rises briefly, then saturates into a flat plateau that is robust across numerical precision, chunk sizes, and model families.
- Achieves 100% exact-match retrieval across 152 trials spanning 16K to 128K tokens on Llama-3.1-8B, within the memory limits of a single 40GB GPU.
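To make the `foldl` analogy concrete, here is a schematic rendering of the recurrence: the one-step update maps an accumulated cache and the next chunk to an enlarged cache, and the same update is applied repeatedly. This is an illustrative sketch, not the paper's code; `step`, `chunks`, and the list-based cache are stand-ins.

```python
from functools import reduce

# Schematic only: a list stands in for the KV cache, and `step` stands in
# for one forward pass. In KV-Fold, the model attends to `cache` as a
# prefix, and the chunk's newly produced keys/values are appended.
def step(cache: list, chunk: list) -> list:
    new_kv = [("kv", token) for token in chunk]  # placeholder K/V entries
    return cache + new_kv  # the cache only grows; earlier entries are untouched

chunks = [[1, 2], [3, 4], [5, 6]]
final_cache = reduce(step, chunks, [])  # i.e., foldl step [] chunks
assert len(final_cache) == 6
```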
Why it matters
KV-Fold provides a practical, training-free method to extend transformer context windows significantly by leveraging existing pretrained models. It demonstrates that stable long-context inference is possible through simple KV-cache manipulation, offering a memory-efficient alternative to streaming methods while maintaining high fidelity for very long sequences.
Original Abstract
We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
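As a more concrete illustration of the protocol the abstract describes, the following is a minimal sketch assuming a Hugging Face-style causal LM whose forward pass accepts and returns a `past_key_values` cache. The function name `kv_fold`, the chunking loop, and the loading details are our own rendering, not the authors' released code; the checkpoint matches the paper's evaluation model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def kv_fold(model, input_ids: torch.Tensor, chunk_size: int):
    """Left fold over sequence chunks with the KV cache as the accumulator."""
    past = None  # empty accumulator
    for start in range(0, input_ids.size(1), chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        # One forward pass: the chunk attends to the carried cache as a
        # prefix; the returned cache has the chunk's keys/values appended.
        out = model(input_ids=chunk, past_key_values=past, use_cache=True)
        past = out.past_key_values
    return past  # condition subsequent decoding on the final cache

# Usage (illustrative): fold a long document into the cache, then decode
# a query against it.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16
).to("cuda")
ids = tok("...very long document...", return_tensors="pt").input_ids.to("cuda")
cache = kv_fold(model, ids, chunk_size=2048)
```

The design point this sketch highlights is that each step is an ordinary, bounded forward pass; the only state carried between steps is the concatenated cache, which is why the whole chain runs as a sequence of tractable passes rather than one monolithic long-context forward.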