KV Cache Offloading for Context-Intensive Tasks
Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov
TLDR
KV cache offloading, while promising, significantly degrades LLM performance on context-intensive tasks, motivating a new benchmark and a simpler, more accurate offloading strategy.
Key contributions
- Evaluates KV cache offloading on context-intensive tasks, finding significant performance degradation.
- Introduces Text2JSON, a new benchmark for highly context-intensive structured knowledge extraction.
- Identifies low-rank key projection and unreliable landmarks as key reasons for accuracy loss (a sketch of both failure modes follows this list).
- Proposes a simpler alternative strategy that significantly improves accuracy across LLM families.
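To make the two failure modes concrete, here is a minimal NumPy sketch. It is not the paper's method: the shapes, the retained rank, the block size, and the mean-key landmark heuristic are all illustrative assumptions. It only shows how scoring offloaded keys through a low-rank projection or per-block landmarks can miss tokens that exact attention would rank highest.

```python
# Minimal sketch of two generic failure modes in KV-offloading retrieval
# (illustrative assumptions only, not the paper's exact analysis).
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, rank, block = 64, 1024, 8, 64   # head dim, cached tokens, kept rank, block size

K = rng.standard_normal((n_tokens, d))       # offloaded keys for one head
q = rng.standard_normal(d)                   # current query

# 1) Low-rank key projection: keep only the top-`rank` principal directions of K.
_, _, Vt = np.linalg.svd(K, full_matrices=False)
P = Vt[:rank].T @ Vt[:rank]                  # projector onto the retained subspace
approx_scores = (K @ P) @ q                  # scores computed from projected keys
exact_scores = K @ q                         # scores full attention would use

top_exact = set(np.argsort(exact_scores)[-32:])
top_approx = set(np.argsort(approx_scores)[-32:])
print(f"top-32 recall under rank-{rank} keys: {len(top_exact & top_approx) / 32:.2f}")

# 2) Landmarks: summarize each block of keys by its mean key and fetch only the
#    few blocks whose landmarks score highest against the query.
landmarks = K.reshape(n_tokens // block, block, d).mean(axis=1)
fetched = set(np.argsort(landmarks @ q)[-4:])    # fetch 4 of 16 blocks
needed = {i // block for i in top_exact}         # blocks holding the true top tokens
print(f"needed blocks actually fetched: {len(fetched & needed)} of {len(needed)}")
```

When relevant information is spread across the whole prompt, as in context-intensive tasks, the few blocks a landmark score selects cannot cover all of the tokens exact attention would attend to, which is consistent with the degradation the paper reports.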
Why it matters
Long-context LLMs are crucial, but KV cache offloading, a key optimization, struggles on context-intensive tasks. This work exposes these limitations and offers a path forward, emphasizing the need for robust evaluation of long-context compression techniques.
Original Abstract
With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.
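For readers unfamiliar with the task type, here is a purely hypothetical Text2JSON-style item; it is not taken from the released benchmark, and the schema, values, and field-level metric are illustrative assumptions. It shows why such extraction is context-intensive: the answer fields are scattered across the text, so any part of the context the cache fails to surface can directly cost fields.

```python
# Hypothetical Text2JSON-style example (not an actual benchmark item):
# the model reads raw text and must fill a JSON schema whose values are
# scattered throughout the context.
raw_text = (
    "Acme Corp reported Q3 revenue of $12.4M, up 8% year over year. "
    "The company, founded in 2009 in Austin, now employs 340 people."
)  # the text the model would be prompted with

reference = {
    "company": "Acme Corp",
    "quarter": "Q3",
    "revenue_usd_millions": 12.4,
    "yoy_growth_pct": 8,
    "founded_year": 2009,
    "headquarters": "Austin",
    "employees": 340,
}

def field_accuracy(predicted: dict, reference: dict) -> float:
    """Fraction of reference fields reproduced exactly in the prediction."""
    return sum(predicted.get(k) == v for k, v in reference.items()) / len(reference)

# A prediction that loses two scattered fields scores 5/7 ~= 0.71.
predicted = dict(reference, founded_year=None, employees=None)
print(f"field accuracy: {field_accuracy(predicted, reference):.2f}")
```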