Winner-Take-All Spiking Transformer for Language Modeling
Chenlin Zhou, Sihang Guo, Jiaqi Wang, Dongyang Ma, Kaiwei Che + 4 more
TLDR
This paper introduces Winner-Take-All Spiking Transformers (WTA-ST) for energy-efficient language modeling, replacing costly softmax-based attention with novel softmax-free, spike-driven self-attention.
Key contributions
- Introduces Winner-Take-All (WTA) mechanisms into spiking transformers for NLP tasks.
- Proposes WSSA and CWSSA, novel softmax-free, spike-driven self-attention modules.
- Designs WE-Spikingformer and WD-Spikingformer for masked and causal language modeling.
- Achieves strong performance across 16 NLP datasets, validating the energy-efficient approach.
Why it matters
Spiking Transformers offer significant energy efficiency but have struggled with language modeling because of their reliance on softmax-based attention. This work provides a crucial breakthrough by eliminating softmax, enabling truly energy-efficient NLP models. It paves the way for sustainable AI and for deploying language models on neuromorphic hardware.
Original Abstract
Spiking Transformers, which combine the scalability of Transformers with the sparse, energy-efficient property of Spiking Neural Networks (SNNs), have achieved impressive results in neuromorphic and vision tasks and attracted increasing attention. However, existing directly trained spiking transformers primarily focus on vision tasks. For language modeling with spiking transformer, convergence relies heavily on softmax-based spiking self-attention, which incurs high energy costs and poses challenges for neuromorphic deployment. To address this issue, we introduce Winner-Take-All (WTA) mechanisms into spiking transformers and propose two novel softmax-free, spike-driven self-attention modules: WTA Spiking Self-Attention (WSSA) and Causal WTA Spiking Self-Attention (CWSSA). Based on them, we design WTA-based Encoder-only Spiking Transformer (WE-Spikingformer) for masked language modeling and WTA-based Decoder-only Spiking Transformer (WD-Spikingformer) for causal language modeling, systematically exploring softmax-free, spiking-driven Transformer architectures trained end-to-end for natural language processing tasks. Extensive experiments on 16 datasets spanning natural language understanding, question-answering tasks, and commonsense reasoning tasks validate the effectiveness of our approach and highlight the promise of spiking transformers for general language modeling and energy-efficient artificial intelligence.
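The digest does not spell out the exact WSSA/CWSSA formulation, so the following PyTorch sketch only illustrates the general idea of softmax-free, winner-take-all attention over binary spike tensors. The surrogate-gradient spike neuron, the hard top-1 winner-take-all rule, and all class, parameter, and shape choices below are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch (assumed details) of a softmax-free, winner-take-all
# style spiking self-attention. Not the paper's actual WSSA/CWSSA code.
import torch
import torch.nn as nn


class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a rectangular surrogate gradient (assumed)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients only near the firing threshold.
        return grad_out * (x.abs() < 0.5).float()


class WTASpikingSelfAttention(nn.Module):
    """Softmax-free attention: binary Q/K/V spikes + per-query winner-take-all."""

    def __init__(self, dim: int, causal: bool = False):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)
        self.causal = causal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); spike tensors take values in {0, 1}.
        q = SpikeFn.apply(self.q_proj(x))
        k = SpikeFn.apply(self.k_proj(x))
        v = SpikeFn.apply(self.v_proj(x))

        # Spike-overlap counts replace floating-point dot products.
        scores = q @ k.transpose(-2, -1)  # (batch, seq_len, seq_len)
        if self.causal:
            # Causal variant (CWSSA-style): mask out future positions.
            mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
            scores = scores.masked_fill(mask, float("-inf"))

        # Winner-take-all instead of softmax: keep only the best-matching
        # key per query (hard top-1 here; the paper may use a different rule).
        winners = torch.zeros_like(scores)
        winners.scatter_(-1, scores.argmax(dim=-1, keepdim=True), 1.0)

        return self.out_proj(winners @ v)


# Example usage with made-up sizes: a causal (decoder-style) block.
attn = WTASpikingSelfAttention(dim=64, causal=True)
out = attn(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```

Under these assumptions, the attention map is built from integer spike-overlap counts and a hard argmax selection rather than the exponentials and normalization of softmax, which is the kind of operation the abstract argues is costly and hard to map onto neuromorphic hardware.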