ArXiv TLDR

Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

2605.13784

Victor Norgren

cs.LG

TLDR

This paper introduces stateful transformer sessions for efficient streaming inference: a persistent KV cache moves prefill off the critical path, so query latency becomes O(|q|), independent of accumulated context size.

Key contributions

  • Introduces stateful sessions: a persistent KV cache that is advanced incrementally as new data arrives (see the sketch after this list).
  • Achieves O(|q|) query latency, independent of accumulated context size, by moving prefill off the critical path.
  • Enables Flash Queries, which pre-evaluate registered questions during idle GPU cycles so answers are already cached when the user asks.
  • A multi-tenant continuous-batching scheduler supports dozens of stateful sessions on a single GPU while preserving full quadratic self-attention.
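
A minimal sketch of the stateful-session pattern, using Hugging Face transformers' `past_key_values` cache as a stand-in for the paper's reference engine. The class name, the gpt2 checkpoint, and the market-data strings are illustrative assumptions; the multi-tenant scheduler and cell-budget admission are not shown.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class StatefulSession:
    """Illustrative session: a persistent KV cache advanced as data streams in."""

    def __init__(self, model_name="gpt2"):  # placeholder checkpoint
        self.tok = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name).eval()
        self.past = None  # persistent KV cache, kept across calls

    @torch.no_grad()
    def append(self, text):
        # Prefill happens here, off the query's critical path:
        # each data arrival advances the cache incrementally.
        ids = self.tok(text, return_tensors="pt").input_ids
        out = self.model(ids, past_key_values=self.past, use_cache=True)
        self.past = out.past_key_values

    @torch.no_grad()
    def query(self, question, max_new_tokens=32):
        # Branch from the session cache so query tokens do not pollute it.
        past = copy.deepcopy(self.past)
        ids = self.tok(question, return_tensors="pt").input_ids
        generated = []
        # Only the question and generated tokens are processed here,
        # so latency scales with |q|, not with the accumulated context.
        for _ in range(max_new_tokens):
            out = self.model(ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
            generated.append(next_id)
            ids = next_id
        return self.tok.decode(torch.cat(generated, dim=1)[0])

# Hypothetical streaming usage: data arrivals advance the cache,
# and a later query pays only for its own tokens.
session = StatefulSession()
session.append("10:00 AAPL price=191.2 volume=1.3M. ")
session.append("10:01 AAPL price=191.6 volume=0.9M. ")
print(session.query("What was the last AAPL price? "))
```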

Why it matters

This paper addresses a critical bottleneck in streaming transformer inference, making real-time applications more feasible. By introducing stateful sessions, it delivers up to 5.9x speedups over conventional inference engines and holds query latency constant as the accumulated context grows, which is vital for continuous data processing.

Original Abstract

Conventional transformer inference engines are request-driven, paying an O(n) prefill cost on every query. In streaming workloads, where data arrives continuously and queries probe an ever-growing context, this cost is prohibitive. We introduce a data-driven computational model centred on stateful sessions: a persistent KV cache advanced incrementally as new data arrives, so prefill is moved off the critical path and query latency becomes O(|q|), independent of accumulated context size. Building on this, Flash Queries reclaim idle GPU cycles between data arrivals to pre-evaluate registered questions and return cached answers before the user asks, a pattern that is structurally impossible in stateless engines because they discard intermediate state between requests. A multi-tenant continuous-batching scheduler with cell-budget admission and prefix-aware grouped prefill lets dozens of stateful sessions coexist on a single GPU while preserving full quadratic self-attention. On streaming market-data benchmarks the reference implementation achieves up to 5.9x speedup over conventional inference engines (vLLM, SGLang, TensorRT-LLM, llama.cpp), holding query latency constant as accumulated context grows.
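
Under the same assumptions, the Flash Query pattern described in the abstract could be layered on the session sketch above: register questions once, refresh their answers whenever the stream goes idle, and serve a cached answer if no data has arrived since. The `on_idle` hook and version counter are hypothetical simplifications; the cell-budget scheduler and prefix-aware grouped prefill are not reproduced.

```python
class FlashQuerySession(StatefulSession):
    """Illustrative Flash Queries: pre-evaluate registered questions while idle."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.registered = []   # questions to keep warm
        self.cached = {}       # question -> (answer, context_version)
        self.version = 0       # bumped on every data arrival

    def append(self, text):
        super().append(text)
        self.version += 1      # cached answers are now stale

    def register(self, question):
        self.registered.append(question)

    def on_idle(self):
        # Called between data arrivals: reclaim idle cycles to refresh answers.
        for q in self.registered:
            _, version = self.cached.get(q, (None, -1))
            if version != self.version:
                self.cached[q] = (self.query(q), self.version)

    def ask(self, question):
        answer, version = self.cached.get(question, (None, -1))
        if version == self.version:
            return answer            # served from cache, nothing decoded on the critical path
        return self.query(question)  # fall back to an ordinary query
```

In a stateless engine there is no persistent cache to refresh between requests, which is why the abstract describes this pattern as structurally impossible there.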
