Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
TLDR
Stream-CQSA enables exact self-attention for billion-token sequences on a single GPU by decomposing attention into independently schedulable subproblems that fit within a given memory budget.
Key contributions
- Introduces CQS Divide, which decomposes exact self-attention into independent sub-sequence computations whose recomposition reproduces full-sequence attention (see the sketch after this list).
- Proposes the Stream-CQSA framework, which schedules attention subproblems to fit arbitrary memory budgets.
- Enables exact attention for billion-token sequences on a single GPU without OOM or approximation.
- Allows flexible execution across devices without requiring inter-device communication.
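
The digest does not spell out the CQS Divide construction itself, so the following is only a minimal sketch of the underlying principle: exact attention over a long sequence can be recombined from sub-sequence contributions by carrying running softmax statistics across chunks. This is standard online-softmax recombination, not the paper's specific cyclic-quorum-set scheme, and the names (`streamed_exact_attention`, `kv_chunk`) are illustrative.

```python
import numpy as np

def streamed_exact_attention(q, k, v, kv_chunk=4096):
    """Exact softmax attention with K/V consumed in chunks.

    Running max/denominator statistics are carried between chunks, so the
    result is numerically identical to full-sequence attention even though
    no block larger than `kv_chunk` keys is ever materialized.
    Shapes: q is (Tq, d); k and v are (Tk, d).
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q)                       # weighted-value accumulator
    running_max = np.full(q.shape[0], -np.inf)   # per-query max logit so far
    denom = np.zeros(q.shape[0])                 # per-query softmax denominator

    for start in range(0, k.shape[0], kv_chunk):
        k_blk = k[start:start + kv_chunk]
        v_blk = v[start:start + kv_chunk]
        logits = (q @ k_blk.T) * scale           # (Tq, chunk)
        blk_max = logits.max(axis=1)
        new_max = np.maximum(running_max, blk_max)
        # Rescale previously accumulated statistics to the new running max.
        correction = np.exp(running_max - new_max)
        p = np.exp(logits - new_max[:, None])
        out = out * correction[:, None] + p @ v_blk
        denom = denom * correction + p.sum(axis=1)
        running_max = new_max

    return out / denom[:, None]
```

On small inputs this can be checked against a direct computation of softmax(QK^T / sqrt(d)) V; the chunked and monolithic results agree up to floating-point rounding, which is the sense in which the decomposition is "exact".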
Why it matters
This paper addresses a critical scalability bottleneck in long-context LLMs by enabling exact self-attention on massive sequences without out-of-memory errors. It offers a practical route to training and inference over far longer sequences on existing hardware, significantly expanding the contexts these models can handle.
Original Abstract
The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods improve memory efficiency to near-linear complexity, while assuming that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention into a set of independent subsequence computations whose recomposition yields exactly the same result as full-sequence attention. Exploiting this decomposition, we introduce Stream-CQSA, a memory-adaptive scheduling framework that partitions attention into subproblems that fit within arbitrary memory budgets. This recasts attention from a logically monolithic operation into a collection of schedulable tasks, enabling flexible execution across devices without inter-device communication. Experiments demonstrate predictable memory scaling and show that exact attention over billion-token sequences can be executed on a single GPU via streaming, without changing the underlying mathematical definition of attention or introducing approximation error.
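
To make the scheduling idea in the abstract concrete, here is a hedged sketch of how a memory budget could determine a tile size and yield a set of independent (query-block, key/value-block) tasks. The cost model, the function name `plan_attention_tiles`, and its parameters are assumptions for illustration only, not details taken from the paper.

```python
def plan_attention_tiles(seq_len, head_dim, budget_bytes, dtype_bytes=4):
    """Choose a tile size so one (Q-block, K/V-block) attention subproblem
    fits in budget_bytes, then enumerate the resulting independent tasks.

    Assumed per-task footprint: Q, K, V, O blocks of shape (tile, head_dim)
    plus a (tile, tile) logit matrix, all with dtype_bytes-sized elements.
    """
    def task_bytes(tile):
        return dtype_bytes * (4 * tile * head_dim + tile * tile)

    tile = seq_len
    while tile > 1 and task_bytes(tile) > budget_bytes:
        tile //= 2  # shrink the tile until one subproblem fits the budget

    starts = range(0, seq_len, tile)
    # Every (query-block, key/value-block) pair is an independent task:
    # tasks can be streamed one at a time on a single GPU or spread across
    # devices, with partial results recombined afterwards.
    tasks = [(q_start, k_start, tile) for q_start in starts for k_start in starts]
    return tile, tasks


# Example: a 1M-token sequence, head_dim 128, and a 256 MiB working budget.
tile, tasks = plan_attention_tiles(2**20, 128, 256 * 2**20)
print(f"tile={tile}, tasks={len(tasks)}")
```

Because each task depends only on its own query and key/value blocks, the schedule can be executed sequentially under a tight budget or in parallel across devices without inter-device communication, which is the flexibility the paper's contributions describe.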