ArXiv TLDR

Vision SmolMamba: Spike-Guided Token Pruning for Energy-Efficient Spiking State-Space Vision Models

arXiv:2604.25570

Dewei Bai, Hongxiang Peng, Yunyun Zeng, Ziyu Zhang, Hong Qu + 1 more

cs.CV

TLDR

Vision SmolMamba pairs spike-guided token pruning with a state-space backbone for energy-efficient spiking vision, achieving superior accuracy-efficiency trade-offs on static and event-based benchmarks.

Key contributions

  • Introduces Vision SmolMamba, an energy-efficient spiking state-space vision model.
  • Proposes the Spike-Guided Spatio-Temporal Token Pruner (SST-TP), which progressively removes redundant tokens while preserving salient spatio-temporal information.
  • SST-TP scores token importance from two cues: spike activation strength and first-spike latency (see the sketch after this list).
  • Reduces estimated energy cost by at least 1.5x compared with prior spiking Transformer baselines and a Spiking Mamba variant, while maintaining competitive or improved accuracy.
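
To make the two pruning cues concrete, here is a minimal sketch of spike-guided token scoring under assumed conventions: a binary spike train of shape (T, N, D), a firing-rate score, a latency score in which earlier first spikes count as more salient, and a simple weighted blend. The names (`prune_tokens`, `keep_ratio`, `alpha`) and the exact way SST-TP combines the cues are illustrative assumptions, not the paper's implementation.

```python
import torch

def prune_tokens(spike_train: torch.Tensor, keep_ratio: float = 0.5,
                 alpha: float = 0.5) -> torch.Tensor:
    """spike_train: binary spikes of shape (T, N, D) -- T timesteps,
    N tokens, D channels. Returns the indices of the tokens to keep."""
    T, N, _ = spike_train.shape

    # Cue 1: spike activation strength -- mean firing rate per token.
    strength = spike_train.float().mean(dim=(0, 2))                  # (N,)

    # Cue 2: first-spike latency -- earliest timestep at which a token
    # fires on any channel; tokens that never fire get latency T.
    fired = spike_train.bool().any(dim=2)                            # (T, N)
    steps = torch.arange(T).unsqueeze(1).expand(T, N)
    never = torch.full_like(steps, T)
    latency = torch.where(fired, steps, never).min(dim=0).values.float()

    # Earlier spikes suggest more salient tokens, so convert latency
    # into a promptness score in [0, 1].
    promptness = 1.0 - latency / T

    # Blend the two cues and keep the highest-scoring tokens.
    score = alpha * strength + (1.0 - alpha) * promptness
    k = max(1, int(keep_ratio * N))
    return torch.topk(score, k).indices

# Example: prune half of 196 tokens from a sparse 4-step spike train.
spikes = (torch.rand(4, 196, 64) < 0.1)
print(prune_tokens(spikes, keep_ratio=0.5).shape)  # torch.Size([98])
```

Applying such a pruner progressively across stages lets compute shrink with token sparsity, which is the scaling behavior the abstract describes.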

Why it matters

Spiking Transformers are inefficient because their quadratic token interactions clash with the sparse, event-driven nature of spiking computation. This paper integrates spike-guided token pruning with a linear-time state-space model, significantly improving energy efficiency while maintaining high accuracy and paving the way for more scalable and practical spiking vision systems.

Original Abstract

Spiking Transformers have shown strong potential for long-range visual modeling through spike-driven self-attention. However, their quadratic token interactions remain fundamentally misaligned with the sparse and event-driven nature of spiking neural computation. To address this limitation, we propose Vision SmolMamba, an energy-efficient spiking state-space architecture that integrates spike-driven dynamics with linear-time selective recurrence. The key idea is a Spike-Guided Spatio-Temporal Token Pruner (SST-TP), which estimates token importance using both spike activation strength and first-spike latency. This mechanism progressively removes redundant tokens while preserving salient spatio-temporal information, enabling efficient scaling with token sparsity. Based on this mechanism, the proposed SmolMamba block incorporates spike events directly into bidirectional state-space recurrence, forming a spiking state-space vision backbone for efficient long-range modeling. Extensive experiments on both static and event-based benchmarks, including ImageNet-1K, CIFAR10/100, CIFAR10-DVS, and DVS128 Gesture, demonstrate that Vision SmolMamba consistently achieves superior accuracy-efficiency trade-offs. In particular, it reduces the estimated energy cost by at least 1.5x compared with prior spiking Transformer baselines and a Spiking Mamba variant while maintaining competitive or improved accuracy. These results demonstrate that combining spike-guided token sparsity with state-space modeling offers a scalable and energy-efficient paradigm for spiking vision systems.
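
The abstract's other ingredient is feeding spike events directly into a bidirectional state-space recurrence. Below is a minimal sketch of that data flow under a simplifying assumption: a fixed diagonal recurrence h_t = a * h_{t-1} + b * s_t stands in for Mamba's selective, input-dependent parameters, and the function names are illustrative rather than the paper's API.

```python
import torch

def ssm_scan(spikes: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """spikes: (N, D) spike inputs along a length-N token sequence;
    a, b: per-channel recurrence parameters of shape (D,)."""
    s = spikes.float()
    h = torch.zeros(s.shape[1])
    outs = []
    for s_t in s:                 # linear-time scan over the token sequence
        h = a * h + b * s_t       # state update driven directly by spike events
        outs.append(h)
    return torch.stack(outs)      # (N, D)

def bidirectional_spiking_ssm(spikes: torch.Tensor, a: torch.Tensor,
                              b: torch.Tensor) -> torch.Tensor:
    fwd = ssm_scan(spikes, a, b)                   # left-to-right pass
    bwd = ssm_scan(spikes.flip(0), a, b).flip(0)   # right-to-left pass
    return fwd + bwd                               # fuse both directions
```

Each pass is a single linear scan, so cost grows linearly in the number of surviving tokens, in contrast to the quadratic token interactions of spike-driven self-attention.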

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.