KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
Zedong Liu, Xinyang Ma, Dejun Luo, Hairui Zhao, Bing Lu + 7 more
TLDR
KVServe adaptively compresses the KV cache in disaggregated LLM serving, significantly boosting performance by optimizing KV communication.
Key contributions
- Unifies KV compression into a modular strategy space with new components and cross-method recomposition.
- Introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by 50x (see the sketch after this list).
- Deploys a Service-Aware Online Controller for adaptive profile selection, correcting offline-to-online mismatches.
- Achieves up to 9.13x JCT speedup in PD-separated serving and up to 32.8x TTFT reduction in KV-disaggregated serving.
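The Pareto-distillation step can be made concrete with a short sketch. Below is a minimal, hypothetical Python example that keeps only non-dominated compression profiles along the three axes the paper names (quality, compression ratio, latency). The `ProfilePoint` structure, `distill_pareto()` helper, and example numbers are illustrative assumptions, not KVServe's actual API; the real system couples this with Bayesian search rather than exhaustive profiling.

```python
# Minimal sketch: distill a 3D Pareto candidate set from profiled configurations.
# All names and numbers are hypothetical, not KVServe's actual interface.
from dataclasses import dataclass

@dataclass(frozen=True)
class ProfilePoint:
    config_id: str      # compression strategy (e.g., quantization + eviction mix)
    quality: float      # task quality score; higher is better
    ratio: float        # KV compression ratio; higher is better
    latency_ms: float   # measured transfer + (de)compression latency; lower is better

def dominates(a: ProfilePoint, b: ProfilePoint) -> bool:
    """a dominates b if it is no worse on all three axes and strictly better on one."""
    no_worse = (a.quality >= b.quality and a.ratio >= b.ratio
                and a.latency_ms <= b.latency_ms)
    strictly = (a.quality > b.quality or a.ratio > b.ratio
                or a.latency_ms < b.latency_ms)
    return no_worse and strictly

def distill_pareto(points: list[ProfilePoint]) -> list[ProfilePoint]:
    """Keep only configurations not dominated by any other profiled point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

profiled = [
    ProfilePoint("fp8-full", 0.99, 2.0, 48.0),
    ProfilePoint("int4-evict25", 0.96, 6.5, 21.0),
    ProfilePoint("int4-evict50", 0.90, 9.0, 15.0),
    ProfilePoint("int2-evict50", 0.78, 8.5, 16.0),  # dominated by int4-evict50
]
candidates = distill_pareto(profiled)
print([p.config_id for p in candidates])  # ['fp8-full', 'int4-evict25', 'int4-evict50']
```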
Why it matters
Disaggregated LLM serving is crucial for scalability and cost efficiency, but it turns KV-cache transfer into a dominant bottleneck. KVServe addresses this with an adaptive, service-aware compression framework, significantly improving performance and cost-efficiency for large-scale LLM deployments.
Original Abstract
LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit payload crossing network and storage boundaries, making KV a dominant end-to-end bottleneck. Existing KV compression methods are typically deployed as static runtime configurations, even though the production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present \emph{KVServe}, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving: KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by $50\times$; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs, and networks, KVServe achieves up to $9.13\times$ JCT speedup in PD-separated serving and up to $32.8\times$ TTFT reduction in KV-disaggregated serving.
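To make the abstract's online-selection step concrete, here is a minimal sketch assuming a simple analytical transfer-time model and an epsilon-greedy bandit as a stand-in for KVServe's lightweight bandit. Every name, signature, and number below is an illustrative assumption, not the paper's implementation.

```python
# Hypothetical sketch of the online controller loop: an analytical latency model
# filters Pareto candidates against the SLO for the current bandwidth, then a
# simple epsilon-greedy bandit (a stand-in for KVServe's lightweight bandit)
# picks among feasible profiles and learns from observed latency.
import random

def predicted_latency_ms(kv_bytes: float, ratio: float, bandwidth_gbps: float,
                         codec_overhead_ms: float) -> float:
    """Analytical model: transfer time of the compressed KV plus codec cost."""
    transfer_ms = (kv_bytes / ratio) / (bandwidth_gbps * 1e9 / 8) * 1e3
    return transfer_ms + codec_overhead_ms

class EpsilonGreedyController:
    def __init__(self, candidates, epsilon=0.1):
        self.candidates = candidates           # Pareto profiles: (id, ratio, codec_ms)
        self.epsilon = epsilon
        self.reward_sum = {c[0]: 0.0 for c in candidates}
        self.pulls = {c[0]: 0 for c in candidates}

    def select(self, kv_bytes, bandwidth_gbps, slo_ms):
        feasible = [c for c in self.candidates
                    if predicted_latency_ms(kv_bytes, c[1], bandwidth_gbps, c[2]) <= slo_ms]
        pool = feasible or self.candidates     # fall back if nothing meets the SLO
        if random.random() < self.epsilon:
            return random.choice(pool)         # explore: corrects offline-online mismatch
        return max(pool, key=lambda c: self.reward_sum[c[0]] / max(self.pulls[c[0]], 1))

    def update(self, config_id, observed_latency_ms, slo_ms):
        # Reward meeting the SLO with headroom; an SLO violation earns zero.
        self.pulls[config_id] += 1
        self.reward_sum[config_id] += max(0.0, 1.0 - observed_latency_ms / slo_ms)

ctrl = EpsilonGreedyController([("fp8-full", 2.0, 3.0), ("int4-evict25", 6.5, 5.0)])
cfg = ctrl.select(kv_bytes=2e9, bandwidth_gbps=10.0, slo_ms=400.0)
ctrl.update(cfg[0], observed_latency_ms=350.0, slo_ms=400.0)
```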