When Spike Sparsity Does Not Translate to Deployed Cost: VS-WNO on Jetson Orin Nano
Jason Yoo, Shailesh Garg, Souvik Chakraborty, Syed Bahauddin Alam
TLDR
Spike sparsity in neural operators does not reduce deployed cost on edge GPUs such as the Jetson Orin Nano, because current runtimes are not sparsity-aware.
Key contributions
- Evaluated variable-spiking (VS-WNO) vs. dense wavelet neural operators on Jetson Orin Nano.
- VS-WNO exhibited substantial algorithmic spike sparsity, with mean spike rates decreasing from 54.26% at the first spiking layer to 18.15% at the fourth.
- Despite this sparsity, VS-WNO showed higher per-inference latency (59.6 ms vs. 53.2 ms) and dynamic energy (228.0 mJ vs. 180.7 mJ) than the dense WNO.
- GPU runtime was launch-dominated and dense, failing to leverage spike sparsity for cost reduction.
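The per-layer spike rates above are fractions of active (firing) neurons. A minimal sketch of how such rates could be measured, assuming binary spike activations; the layer shapes and the `mean_spike_rate` helper are hypothetical, not the authors' code:

```python
import numpy as np

def mean_spike_rate(spikes: np.ndarray) -> float:
    """Fraction of neurons that fired, given a binary (0/1) spike tensor."""
    return float(spikes.mean())

rng = np.random.default_rng(0)
# Hypothetical binary activations with target firing probabilities
# matching the reported rates at the first and fourth spiking layers.
layer1 = (rng.random((4, 64, 64)) < 0.54).astype(np.float32)
layer4 = (rng.random((4, 64, 64)) < 0.18).astype(np.float32)

r1, r4 = mean_spike_rate(layer1), mean_spike_rate(layer4)
print(f"layer1 spike rate ~ {r1:.2%}, layer4 spike rate ~ {r4:.2%}")
```

The catch the paper documents: this sparsity is purely algorithmic. The dense convolution kernels the GPU launches do the same work whether the spike tensor is 54% or 18% nonzero.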
Why it matters
This paper reveals a critical gap between the theoretical benefits of spiking neural networks and their practical deployment on commodity edge GPUs. Current GPU software stacks do not translate spike sparsity into real-world energy or latency savings, pointing to a need for sparsity-aware, runtime-level innovations.
Original Abstract
Spiking neural operators are appealing for neuromorphic edge computing because event-driven substrates can, in principle, translate sparse activity into lower latency and energy. Whether that advantage survives deployment on commodity edge-GPU software stacks, however, remains unclear. We study this question on a Jetson Orin Nano 8 GB using five pretrained variable-spiking wavelet neural operator (VS-WNO) checkpoints and five matched dense wavelet neural operator (WNO) checkpoints on the Darcy rectangular benchmark. On a reference-aligned path, VS-WNO exhibits substantial algorithmic sparsity, with mean spike rates decreasing from 54.26% at the first spiking layer to 18.15% at the fourth. On a deployment-style request path, however, this sparsity does not reduce deployed cost: VS-WNO reaches 59.6 ms latency and 228.0 mJ dynamic energy per inference, whereas dense WNO reaches 53.2 ms and 180.7 mJ, while also achieving slightly lower reference-path error (1.77% versus 1.81%). Nsight Systems indicates that the request path remains launch-dominated and dense rather than sparsity-aware: for VS-WNO, cudaLaunchKernel accounts for 81.6% of CUDA API time within the latency window, and dense convolution kernels account for 53.8% of GPU kernel time; dense WNO shows the same pattern. On this Jetson-class GPU stack, spike sparsity is measurable but does not reduce deployed cost because the runtime does not suppress dense work as spike activity decreases.
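A quick back-of-envelope check on the abstract's numbers: dividing dynamic energy by latency gives the implied average dynamic power during inference, showing that VS-WNO loses on both latency and power, not just one. This arithmetic is my own, using only figures quoted in the abstract:

```python
# Implied average dynamic power: energy [mJ] / latency [ms] = power [W].
# All four input figures are taken directly from the abstract.
vs_wno_energy_mJ, vs_wno_latency_ms = 228.0, 59.6
wno_energy_mJ, wno_latency_ms = 180.7, 53.2

vs_power_W = vs_wno_energy_mJ / vs_wno_latency_ms   # ~3.83 W
wno_power_W = wno_energy_mJ / wno_latency_ms        # ~3.40 W
print(f"VS-WNO ~ {vs_power_W:.2f} W, dense WNO ~ {wno_power_W:.2f} W")
```

So the spiking variant draws roughly 13% more average dynamic power while also running longer, consistent with the profiling finding that the runtime executes dense, launch-dominated work regardless of spike activity.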