ArXiv TLDR

Kwai Summary Attention Technical Report

arXiv:2604.24432

Chenglong Chu, Guorui Zhou, Guowang Zhang, Han Li, Hao Peng + 33 more

cs.CL · cs.AI · cs.IR · cs.LG

TLDR

Kwai Summary Attention (KSA) reduces LLM long-context modeling costs by compressing historical contexts into learnable summary tokens.

Key contributions

  • Addresses the quadratic time complexity of standard softmax attention in long-context LLMs.
  • Proposes a novel intermediate path for LLMs: semantic-level KV cache compression at a ratio `k`, keeping the cache linear in sequence length at O(n/k).
  • Introduces Kwai Summary Attention (KSA) to compress historical contexts into learnable summary tokens (a minimal sketch follows this list).
  • KSA trades acceptable memory for complete and interpretable retention of long-distance dependencies.
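
To make the mechanism concrete, here is a minimal PyTorch sketch of summary-token attention based only on the description above: each block of `k` historical tokens is pooled into one learnable summary token, and queries attend over the summaries plus an uncompressed recent window. The module name, the pooling-by-learned-query scheme, and all hyperparameters are illustrative assumptions, not KSA's actual design.

```python
# Illustrative sketch of summary-token attention, inferred from the abstract:
# history is compressed at ratio k into summary tokens, so attention runs
# over O(n/k) keys instead of O(n). Not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SummaryAttentionSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int, k: int, window: int):
        super().__init__()
        self.k = k            # compression ratio: k history tokens -> 1 summary
        self.window = window  # recent tokens kept uncompressed
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Hypothetical learnable query that pools each block of k tokens
        # into one summary token; KSA may use a different mechanism.
        self.summary_query = nn.Parameter(torch.randn(1, 1, d_model))
        self.pool = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def summarize(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, t, d) -> summaries: (batch, ceil(t/k), d)
        b, t, d = history.shape
        pad = (-t) % self.k
        if pad:  # zero-pad the last partial block (simplification)
            history = F.pad(history, (0, 0, 0, pad))
        blocks = history.reshape(b, -1, self.k, d).flatten(0, 1)  # (b*m, k, d)
        q = self.summary_query.expand(blocks.size(0), -1, -1)     # (b*m, 1, d)
        summary, _ = self.pool(q, blocks, blocks)                 # (b*m, 1, d)
        return summary.view(b, -1, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d). Split into compressed history + raw recent window.
        history, recent = x[:, : -self.window], x[:, -self.window :]
        if history.size(1) > 0:
            kv = torch.cat([self.summarize(history), recent], dim=1)
        else:
            kv = recent
        out, _ = self.attn(recent, kv, kv)  # attends to ~n/k + window keys
        return out


# Usage: 512 history tokens compress to 512/8 = 64 summaries + 64 recent keys.
layer = SummaryAttentionSketch(d_model=64, n_heads=4, k=8, window=64)
y = layer(torch.randn(2, 576, 64))
print(y.shape)  # torch.Size([2, 64, 64])
```

With one summary per `k` history tokens, the attended key/value set shrinks from n to roughly n/k plus the recent window, matching the O(n/k) framing.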

Why it matters

Long-context LLMs are crucial but expensive because softmax attention scales quadratically with sequence length. Kwai Summary Attention offers a middle path: compressing historical contexts into summary tokens reduces cost while retaining complete, interpretable long-distance dependencies, enabling more efficient long-context LLMs.

Original Abstract

Long-context ability has become one of the most important iteration directions for next-generation Large Language Models, particularly in semantic understanding/reasoning, agentic coding intelligence, and recommendation systems. However, standard softmax attention exhibits quadratic time complexity with respect to sequence length. As the sequence length increases, this incurs substantial overhead in long-context settings, causing training and inference costs to deteriorate rapidly for extremely long sequences. Existing solutions mitigate this issue through two technical routes: i) reducing the KV cache per layer, such as head-level compression (GQA) and embedding-dimension-level compression (MLA), though the KV cache remains linearly dependent on the sequence length at a 1:1 ratio; ii) interleaving with KV-cache-friendly architectures, such as local attention (SWA) and linear kernels (GDN), which often involve trade-offs between KV cache size and long-context modeling effectiveness. Beyond these two routes, we argue that there exists an intermediate path that has not been well explored: maintaining a linear relationship between the KV cache and sequence length, but performing semantic-level compression at a specific ratio $k$. This $O(n/k)$ path does not pursue a "minimum KV cache", but rather trades acceptable memory costs for complete, referential, and interpretable retention of long-distance dependencies. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical contexts into learnable summary tokens.
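
For intuition about the memory side of that trade, here is a back-of-the-envelope comparison of full KV caching against ratio-`k` summary caching. All model dimensions and byte counts below are generic assumptions for illustration, not figures from the paper.

```python
# Rough KV-cache comparison for the O(n) vs O(n/k) paths.
# Per-token size assumes K and V tensors in fp16 across all layers and
# KV heads; none of these numbers come from the paper.
def kv_cache_gib(seq_len, layers=32, kv_heads=8, head_dim=128, bytes_per=2, k=1):
    tokens_cached = seq_len / k  # summary compression keeps 1 of every k
    per_token = 2 * layers * kv_heads * head_dim * bytes_per  # K and V
    return tokens_cached * per_token / 2**30

n = 1_000_000  # 1M-token context
print(f"full cache:   {kv_cache_gib(n):.1f} GiB")        # ~122.1 GiB
print(f"k=16 summary: {kv_cache_gib(n, k=16):.1f} GiB")  # ~7.6 GiB
```

The cache still grows linearly with sequence length, but the slope drops by a factor of k, which is the "acceptable memory cost" the abstract describes.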
