
Sessa: Selective State Space Attention

arXiv:2604.18580

Liubomyr Horbatko

cs.LG · cs.AI · cs.CL

TLDR

Sessa places attention inside a state-space model's recurrent feedback path, achieving slower-decaying long-range memory and flexible selective retrieval, and outperforms Transformer and Mamba baselines on long-context benchmarks while staying competitive on short-context language modeling.

Key contributions

  • Introduces Sessa, a decoder that places attention inside a recurrent feedback path, enabling many-path aggregation within a layer (see the sketch after this list).
  • Achieves a power-law memory tail $O(\ell^{-\beta})$ with $0<\beta<1$ in long contexts, decaying more slowly than the Transformer's $O(1/\ell)$ influence dilution and the exponential decay of selective SSMs such as Mamba.
  • Enables flexible selective retrieval, including non-decaying memory profiles, which none of the compared model classes realize.
  • Outperforms baselines on long-context benchmarks under matched architectures and training budgets, while remaining competitive on short-context language modeling.
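
The following is a minimal, hedged PyTorch sketch of the core idea: an attention read over past recurrent states that is fed back into the state update, so old tokens can influence the present through many compounding paths. All names (`SessaBlockSketch`, `d_model`) and details (single-head read, tanh update) are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch of "attention inside a recurrent feedback path".
# Illustrative assumption only -- not Sessa's published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SessaBlockSketch(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)    # query from current input
        self.k = nn.Linear(d_model, d_model)    # keys over past states
        self.v = nn.Linear(d_model, d_model)    # values over past states
        self.inp = nn.Linear(d_model, d_model)  # input projection
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, seq, d_model) -> (batch, seq, d_model)."""
        B, T, D = x.shape
        states = [x.new_zeros(B, 1, D)]  # initial state h_0
        outs = []
        for t in range(T):
            past = torch.cat(states, dim=1)              # (B, t+1, D)
            q = self.q(x[:, t:t+1])                      # (B, 1, D)
            att = F.softmax(q @ self.k(past).transpose(1, 2) * self.scale, -1)
            read = att @ self.v(past)                    # attention read of past states
            h = torch.tanh(self.inp(x[:, t:t+1]) + read) # read feeds the recurrence
            states.append(h)
            outs.append(h)
        return torch.cat(outs, dim=1)

# usage: y = SessaBlockSketch(64)(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```

Because each stored state has itself aggregated earlier states, a token's influence at step $t$ can travel through many compounding paths, rather than through a single attention read (Transformer) or a single feedback chain (SSM).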

Why it matters

This paper introduces Sessa, an architecture that addresses complementary limitations of Transformers (token influence diluted over a diffuse attention support) and state-space models (exponentially decaying long-range sensitivity). By integrating attention into a recurrent feedback path, Sessa achieves slower-decaying long-term memory and more flexible information retrieval, which could lead to more efficient and powerful models for processing very long sequences.

Original Abstract

Modern sequence models are dominated by Transformers, where self-attention mixes information from the visible context in an input-dependent way. However, when retrieval is not sharp and attention remains diffuse over an effective support $S_{\mathrm{eff}}(t)$, the influence of any individual token is diluted, typically scaling as $O(1/S_{\mathrm{eff}}(t))$ and reaching $O(1/\ell)$ for old tokens in full-prefix settings. Structured state-space models process sequences recurrently through an explicit feedback path; selective variants such as Mamba make this feedback input-dependent, yet when a frozen state cannot be sustained over long intervals, their long-range sensitivity decays exponentially with lag. Existing architectures therefore either retrieve from the past in a single read or propagate information through a single feedback chain. We introduce Sessa, a decoder that places attention inside a feedback path, enabling recurrent many-path aggregation within a layer. Under stated assumptions, Sessa admits regimes with a power-law memory tail in lag $\ell$ of order $O(\ell^{-\beta})$ for $0<\beta<1$, which decays asymptotically more slowly than $1/\ell$; moreover, this rate is tight in an explicit diffuse uniform-routing setting where the influence is $\Theta(\ell^{-\beta})$. Under the same conditions, only Sessa among the compared model classes realizes flexible selective retrieval, including non-decaying profiles. Empirically, under matched architectures and training budgets, Sessa achieves the strongest performance on our long-context benchmarks while remaining competitive with Transformer and Mamba-style baselines on short-context language modeling.
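
To make these rates concrete, here is a small self-contained comparison of the three influence profiles the abstract contrasts. The constants $\alpha=0.99$ and $\beta=0.5$ are arbitrary illustrative choices, not values from the paper.

```python
# Illustrative comparison of the three memory-tail regimes above.
# alpha and beta are assumed constants, not values from the paper.
alpha, beta = 0.99, 0.5

print(f"{'lag':>6} {'exp (SSM)':>12} {'1/lag (attn)':>13} {'lag^-beta':>11}")
for lag in (10, 100, 1_000, 10_000):
    ssm = alpha ** lag      # exponential decay of a selective SSM
    attn = 1 / lag          # O(1/ell) dilution of diffuse attention
    sessa = lag ** -beta    # claimed power-law tail, 0 < beta < 1
    print(f"{lag:>6} {ssm:>12.2e} {attn:>13.2e} {sessa:>11.2e}")
```

At lag $10^4$ the exponential term is numerically negligible and the diffuse-attention term is $10^{-4}$, while the power-law tail still retains $10^{-2}$, illustrating the abstract's claim that $O(\ell^{-\beta})$ with $0<\beta<1$ decays asymptotically more slowly than $1/\ell$.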
