DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
Zahra Dehghanighobadi, Asja Fischer
TLDR
DepthKV introduces layer-dependent KV cache pruning that allocates a fixed global cache budget according to each layer's sensitivity, outperforming uniform pruning at the same pruning ratio and easing the memory bottleneck of long-context LLM inference.
Key contributions
- Identifies that uniform KV cache pruning is suboptimal due to varying layer sensitivities.
- Proposes DepthKV, a novel layer-dependent pruning framework for LLM KV caches.
- Allocates a fixed global KV cache budget across layers according to their sensitivity, instead of uniformly (see the sketch after this list).
- Outperforms uniform pruning across models and tasks at the same global pruning ratio, making more effective use of the KV cache budget.
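To make the allocation idea concrete, here is a minimal sketch of splitting one global budget across layers in proportion to per-layer sensitivity scores. The digest does not specify how DepthKV measures sensitivity or maps it to a budget, so the proportional rule and the `sensitivities` input below are assumptions for exposition, not the paper's method.

```python
def allocate_layer_budgets(sensitivities, global_budget):
    """Split a fixed global KV-cache token budget across layers.

    sensitivities: list of non-negative scores, one per layer (hypothetical
        measure of how much pruning that layer hurts model quality).
    global_budget: total number of cached tokens allowed across all layers.
    """
    total = sum(sensitivities)
    # More sensitive layers keep a larger share of the cache; a uniform
    # baseline would instead give every layer global_budget / num_layers.
    budgets = [int(global_budget * s / total) for s in sensitivities]
    # Distribute any rounding leftover to the most sensitive layers first.
    leftover = global_budget - sum(budgets)
    for i in sorted(range(len(budgets)), key=lambda i: -sensitivities[i])[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    # Toy example: 4 layers sharing a global budget of 1024 cached tokens.
    print(allocate_layer_budgets([0.1, 0.4, 0.3, 0.2], 1024))
```

A uniform baseline corresponds to passing equal sensitivities, which reduces this allocation to `global_budget / num_layers` per layer.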
Why it matters
Long-context LLMs are critical but face severe memory bottlenecks from KV caches. DepthKV offers a more intelligent and efficient pruning strategy, directly addressing this limitation. This advancement enables more practical and scalable deployment of LLMs for complex, long-document tasks.
Original Abstract
Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length, leading to a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference. Most existing methods apply a uniform pruning ratio across layers, implicitly assuming that all layers contribute equally to overall model performance. We show that this assumption is suboptimal, as layers differ significantly in their sensitivity to pruning. We propose DepthKV, a layer-dependent pruning framework that allocates a fixed global KV budget across layers based on their sensitivity, rather than using a uniform allocation. Across multiple models and tasks, DepthKV consistently outperforms uniform pruning at the same global pruning ratio, demonstrating more effective utilization of the KV cache budget through layer-dependent allocation.
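The abstract describes the per-layer pruning step as discarding cached tokens with low attention scores. Below is a minimal sketch of that step for a single layer, assuming the layer's budget comes from an allocation like the one above; the simplified tensor shapes, the `attn_scores` importance measure, and the top-k selection are illustrative assumptions, not the paper's exact procedure.

```python
import torch


def prune_layer_cache(keys, values, attn_scores, budget):
    """Keep only the `budget` highest-scoring cached tokens for one layer.

    keys, values: tensors of shape (num_tokens, head_dim)  # simplified shapes
    attn_scores:  tensor of shape (num_tokens,), e.g. the attention mass each
                  cached token has received from recent queries (assumed).
    budget:       number of tokens this layer is allowed to keep.
    """
    if keys.shape[0] <= budget:
        return keys, values
    # Select the most-attended tokens and keep them in their original order.
    keep = torch.topk(attn_scores, k=budget).indices.sort().values
    return keys[keep], values[keep]
```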