Winner-Take-All Spiking Transformer for Language Modeling
Chenlin Zhou, Sihang Guo, Jiaqi Wang, Dongyang Ma, Kaiwei Che + 4 more
TLDR
This paper introduces Winner-Take-All Spiking Transformers (WTA-ST) for energy-efficient language modeling, replacing costly softmax-based attention with novel softmax-free, spike-driven self-attention.
Key contributions
- Introduces Winner-Take-All (WTA) mechanisms into spiking transformers for NLP tasks.
- Proposes WSSA and CWSSA, novel softmax-free, spike-driven self-attention modules.
- Designs WE-Spikingformer and WD-Spikingformer for masked and causal language modeling.
- Achieves strong performance across 16 NLP datasets, validating the energy-efficient approach.
Why it matters
Spiking Transformers offer significant energy efficiency but have struggled with language modeling because of their reliance on softmax-based attention. This work provides a crucial breakthrough by eliminating softmax, enabling truly energy-efficient NLP models. It paves the way for sustainable AI and for deploying language models on neuromorphic hardware.
Original Abstract
Spiking Transformers, which combine the scalability of Transformers with the sparse, energy-efficient property of Spiking Neural Networks (SNNs), have achieved impressive results in neuromorphic and vision tasks and attracted increasing attention. However, existing directly trained spiking transformers primarily focus on vision tasks. For language modeling with spiking transformer, convergence relies heavily on softmax-based spiking self-attention, which incurs high energy costs and poses challenges for neuromorphic deployment. To address this issue, we introduce Winner-Take-All (WTA) mechanisms into spiking transformers and propose two novel softmax-free, spike-driven self-attention modules: WTA Spiking Self-Attention (WSSA) and Causal WTA Spiking Self-Attention (CWSSA). Based on them, we design WTA-based Encoder-only Spiking Transformer (WE-Spikingformer) for masked language modeling and WTA-based Decoder-only Spiking Transformer (WD-Spikingformer) for causal language modeling, systematically exploring softmax-free, spiking-driven Transformer architectures trained end-to-end for natural language processing tasks. Extensive experiments on 16 datasets spanning natural language understanding, question-answering tasks, and commonsense reasoning tasks validate the effectiveness of our approach and highlight the promise of spiking transformers for general language modeling and energy-efficient artificial intelligence.
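The digest does not spell out the exact WSSA/CWSSA formulation, so the following PyTorch sketch only illustrates the general idea of softmax-free, winner-take-all attention over binary spike tensors. The surrogate-gradient spike neuron, the hard top-1 winner-take-all rule, and all class, parameter, and shape choices below are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch (assumed details) of a softmax-free, winner-take-all
# style spiking self-attention. Not the paper's actual WSSA/CWSSA code.
import torch
import torch.nn as nn


class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a rectangular surrogate gradient (assumed)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients only near the firing threshold.
        return grad_out * (x.abs() < 0.5).float()


class WTASpikingSelfAttention(nn.Module):
    """Softmax-free attention: binary Q/K/V spikes + per-query winner-take-all."""

    def __init__(self, dim: int, causal: bool = False):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)
        self.causal = causal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); spike tensors take values in {0, 1}.
        q = SpikeFn.apply(self.q_proj(x))
        k = SpikeFn.apply(self.k_proj(x))
        v = SpikeFn.apply(self.v_proj(x))

        # Spike-overlap counts replace floating-point dot products.
        scores = q @ k.transpose(-2, -1)  # (batch, seq_len, seq_len)
        if self.causal:
            # Causal variant (CWSSA-style): mask out future positions.
            mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
            scores = scores.masked_fill(mask, float("-inf"))

        # Winner-take-all instead of softmax: keep only the best-matching
        # key per query (hard top-1 here; the paper may use a different rule).
        winners = torch.zeros_like(scores)
        winners.scatter_(-1, scores.argmax(dim=-1, keepdim=True), 1.0)

        return self.out_proj(winners @ v)


# Example usage with made-up sizes: a causal (decoder-style) block.
attn = WTASpikingSelfAttention(dim=64, causal=True)
out = attn(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```

Under these assumptions, the attention map is built from integer spike-overlap counts and a hard argmax selection rather than the exponentials and normalization of softmax, which is the kind of operation the abstract argues is costly and hard to map onto neuromorphic hardware.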