ArXiv TLDR

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

arXiv: 2604.26837

Zihan Zhao, Baotong Lu, Shengjie Lin, Yizou Chen, Jing Liu + 6 more

cs.LG

TLDR

SPIN maps different sparse attention granularities onto a shared page-based KV substrate and co-designs the execution pipeline with hierarchical GPU-CPU memory, delivering up to 5.66x higher LLM serving throughput and 7-9x lower TTFT for long contexts.

Key contributions

  • Unifies sparse attention granularities via a shared page-based KV substrate (see the sketch after this list).
  • Manages the hierarchical KV cache with a locality-aware, GPU-friendly bucketed LRU policy (sketched after the abstract below).
  • Employs a two-level metadata layout sized to the active working set rather than the worst-case address space.
  • Delivers 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than vLLM.
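The first contribution is easiest to picture as a normalization step: whatever granularity a sparse algorithm selects at (tokens, blocks), the selection is mapped to page IDs over one shared paged KV layout, so every method can reuse the same fetch and attention kernels. Below is a minimal Python sketch of that idea; the function names and the page size are placeholders of ours, not the paper's actual interface:

```python
# Hypothetical sketch of a unified partition abstraction: sparse selections
# made at different granularities are normalized to page IDs over one shared
# page-based KV substrate. PAGE_SIZE and all names are assumptions.

PAGE_SIZE = 16  # tokens per KV page (placeholder value)

def token_selection_to_pages(token_ids: list[int]) -> set[int]:
    """Token-granularity sparse selection -> page IDs."""
    return {t // PAGE_SIZE for t in token_ids}

def block_selection_to_pages(block_ids: list[int], block_size: int) -> set[int]:
    """Block-granularity selection (block_size need not equal PAGE_SIZE)."""
    pages: set[int] = set()
    for b in block_ids:
        first_token = b * block_size
        last_token = first_token + block_size - 1
        pages.update(range(first_token // PAGE_SIZE, last_token // PAGE_SIZE + 1))
    return pages

# Example: a block-level method selecting blocks {0, 5} of size 64 touches
# pages 0-3 and 20-23; a token-level method selecting tokens {3, 130}
# touches pages {0, 8}. Both feed the same paged KV fetch path.
print(block_selection_to_pages([0, 5], 64))   # {0, 1, 2, 3, 20, 21, 22, 23}
print(token_selection_to_pages([3, 130]))     # {0, 8}
```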

Why it matters

Long-context LLM serving is bottlenecked by ever-growing KV caches, and the algorithmic savings of sparse attention are often erased by ad hoc per-algorithm implementations and GPU-CPU transfer costs. By co-designing the execution pipeline with hierarchical KV storage, SPIN converts those algorithmic savings into end-to-end throughput and latency gains, making long-context serving practical at scale.

Original Abstract

Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extending the KV storage to CPU memory. In practice, however, these algorithmic savings rarely translate into end-to-end system-level gains because sparse methods typically operate at different granularities and thus rely on ad hoc, per-algorithm implementations. At the same time, hierarchical KV storage introduces a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary can easily erase the benefits of sparsity. We present SPIN, a sparse-attention-aware inference framework that co-designs the execution pipeline with hierarchical KV storage through three techniques: (1) a unified partition abstraction that maps different sparsity granularities onto a shared page-based KV substrate; (2) a locality-aware KV cache manager that dynamically sizes per-request HBM budgets and uses a GPU-friendly bucketed LRU policy to cut PCIe round-trips; and (3) a two-level hierarchical metadata layout sized to the active working set rather than the worst-case address space. Built on vLLM with three representative sparse attention algorithms, SPIN delivers 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than vLLM, and reduces TPOT by up to 58% over the original sparse-attention implementations.
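Technique (2)'s bucketed LRU can be sketched in miniature. An exact LRU needs per-page linked-list updates that map poorly to GPUs; grouping pages into a few coarse recency buckets trades exactness for bulk updates: a touch moves a page to the hottest bucket, aging shifts buckets colder wholesale, and eviction drains the coldest non-empty bucket. A CPU-side Python sketch under those assumptions follows (all names hypothetical; SPIN's actual policy runs GPU-side and its details may differ):

```python
class BucketedLRU:
    """CPU-side sketch of a bucketed LRU (all names are assumptions).

    Pages live in a few coarse recency buckets instead of an exact LRU
    list, so touch and evict are simple set operations.
    """

    def __init__(self, num_buckets: int = 4) -> None:
        self.buckets: list[set[int]] = [set() for _ in range(num_buckets)]  # 0 = hottest
        self.where: dict[int, int] = {}  # page_id -> bucket index

    def touch(self, page_id: int) -> None:
        """Mark a page as just-used: move it to the hottest bucket."""
        old = self.where.get(page_id)
        if old is not None:
            self.buckets[old].discard(page_id)
        self.buckets[0].add(page_id)
        self.where[page_id] = 0

    def age(self) -> None:
        """Shift every bucket one step colder in bulk."""
        coldest = len(self.buckets) - 1
        self.buckets[coldest] |= self.buckets[coldest - 1]
        for i in range(coldest - 1, 0, -1):
            self.buckets[i] = self.buckets[i - 1]
        self.buckets[0] = set()
        # Rebuild the index; a real GPU version would track bucket ids in place.
        self.where = {p: i for i, bucket in enumerate(self.buckets) for p in bucket}

    def evict(self) -> int | None:
        """Evict an arbitrary page from the coldest non-empty bucket."""
        for bucket in reversed(self.buckets):
            if bucket:
                page = bucket.pop()
                del self.where[page]
                return page
        return None

# Usage: after repeated aging, cold pages are evicted before recently
# touched ones.
lru = BucketedLRU()
for p in (0, 1, 2):
    lru.touch(p)
lru.age()
lru.touch(1)            # page 1 is hot again
lru.age(); lru.age()
print(lru.evict())      # 0 or 2, never page 1
```

The design choice this illustrates: within a bucket, eviction order is arbitrary, which sacrifices exact recency ordering but removes the pointer-chasing that makes classic LRU hostile to GPU execution.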
