xLSTM: Extended Long Short-Term Memory
Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova + 4 more
TLDR
xLSTM introduces exponential gating and novel memory structures to scale LSTMs to billions of parameters, achieving competitive performance with state-of-the-art Transformers in language modeling.
Key contributions
- Proposes exponential gating with normalization and stabilization to enhance LSTM gating mechanisms.
- Develops modified memory structures: sLSTM with a scalar memory, scalar update, and new memory mixing, and mLSTM with a matrix memory and a covariance update rule (sketched after this list).
- Integrates these innovations into residual xLSTM blocks that scale effectively and rival Transformer and State Space Model performance.
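As a concrete illustration of the first two points, below is a minimal, single-head NumPy sketch of one mLSTM recurrent step with stabilized exponential gating, written from the update rules summarized in the abstract. The parameter names (`Wq`, `Wk`, `wi`, ...), the omission of biases, and the single-head layout are simplifying assumptions for readability, not the authors' implementation (which is fully parallelizable rather than step-by-step recurrent).

```python
import numpy as np

def mlstm_step(x, C, n, m, params):
    """One recurrent step of a single-head mLSTM cell (illustrative sketch).

    C : (d, d) matrix memory, n : (d,) normalizer state,
    m : scalar stabilizer state carried across time steps.
    """
    d = x.shape[0]
    # Query / key / value projections (key scaled by 1/sqrt(d), as in attention).
    q = params["Wq"] @ x
    k = params["Wk"] @ x / np.sqrt(d)
    v = params["Wv"] @ x

    # Gate pre-activations; input and forget gates use exponential activation.
    i_pre = params["wi"] @ x          # scalar input-gate pre-activation
    f_pre = params["wf"] @ x          # scalar forget-gate pre-activation
    o = 1.0 / (1.0 + np.exp(-(params["Wo"] @ x)))   # sigmoid output gate

    # Log-space stabilization: track a running max so exp() never overflows.
    m_new = max(f_pre + m, i_pre)
    i_gate = np.exp(i_pre - m_new)
    f_gate = np.exp(f_pre + m - m_new)

    # Covariance update of the matrix memory and its normalizer state.
    C_new = f_gate * C + i_gate * np.outer(v, k)
    n_new = f_gate * n + i_gate * k

    # Retrieve with the query; the denominator is floored at 1 for stability.
    h_tilde = (C_new @ q) / max(abs(n_new @ q), 1.0)
    h = o * h_tilde
    return h, C_new, n_new, m_new
```

The normalizer `n` and the floor on the denominator keep the retrieved output bounded even though the exponential input gate is unbounded; the running-max stabilizer moves the gating into log space so the exponentials remain numerically safe over long sequences.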
Why it matters
The paper revisits and substantially extends the classic LSTM architecture, showing that with modern training techniques and architectural modifications, LSTMs can still compete with the dominant Transformer models in large-scale language modeling. This challenges the notion that Transformers are the only scalable architecture for large language models and opens new avenues for efficient sequence modeling.
Original Abstract
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
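To complement the mLSTM sketch above, here is a similarly hedged NumPy sketch of the sLSTM variant: a scalar (per-cell) memory and update, with the "new memory mixing" realized through recurrent connections from the previous hidden state into the gate pre-activations. Parameter names and the single-block layout are illustrative assumptions, not the paper's implementation (which restricts memory mixing to within heads via block-diagonal recurrent matrices).

```python
import numpy as np

def slstm_step(x, h_prev, c, n, m, params):
    """One step of an sLSTM cell with scalar memory and memory mixing (sketch).

    c, n, m are per-cell scalar states (one value per hidden unit): cell,
    normalizer, and stabilizer. Memory mixing comes from the recurrent
    weights r_* feeding h_prev back into the cell input and gates.
    """
    # Cell input and gate pre-activations; the recurrent terms mix memories.
    z_pre = params["wz"] @ x + params["rz"] @ h_prev + params["bz"]
    i_pre = params["wi"] @ x + params["ri"] @ h_prev + params["bi"]
    f_pre = params["wf"] @ x + params["rf"] @ h_prev + params["bf"]
    o_pre = params["wo"] @ x + params["ro"] @ h_prev + params["bo"]

    z = np.tanh(z_pre)                      # cell input
    o = 1.0 / (1.0 + np.exp(-o_pre))        # sigmoid output gate

    # Stabilized exponential gating, as in the mLSTM sketch above.
    m_new = np.maximum(f_pre + m, i_pre)
    i_gate = np.exp(i_pre - m_new)
    f_gate = np.exp(f_pre + m - m_new)

    # Scalar memory update plus a normalizer that keeps the output bounded.
    c_new = f_gate * c + i_gate * z
    n_new = f_gate * n + i_gate
    h = o * (c_new / n_new)
    return h, c_new, n_new, m_new
```

Unlike the mLSTM, this cell depends on `h_prev` and therefore cannot be parallelized across time, which is why the paper pairs both variants inside residually stacked xLSTM blocks.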