Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
Dimitrios Danopoulos, Enrico Lupi, Michael Kagan, Maurizio Pierini
TLDR
HCCS is a fast, integer-native softmax surrogate for edge AI, outperforming existing methods on AMD AI Engines with competitive accuracy.
Key contributions
- Introduces Head-Calibrated Clipped-Linear Softmax (HCCS) for efficient low-precision Transformer inference.
- HCCS is a bounded, monotone surrogate for the exponential, with per-head calibration that preserves each attention head's statistical properties.
- Designed for AMD AI Engines, it naturally maps to int8 MAC units for high-throughput processing.
- Achieves significant speedup over existing bfloat16/LUT implementations with competitive task accuracy.
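Because the clipped-linear surrogate reduces to a multiply, a shift, and a clamp, it maps directly onto integer vector units. The sketch below illustrates an integer-only evaluation of such a surrogate; all constant names (`alpha_q`, `shift`, the 127 clamp) are assumptions for illustration, not values from the paper.

```python
import numpy as np

def hccs_int8(q_logits, alpha_q=64, shift=7):
    """Illustrative integer-only clipped-linear softmax surrogate.

    q_logits: int8 attention logits for one head.
    alpha_q, shift: hypothetical per-head fixed-point calibration constants
    (the paper optimizes per-head parameters offline; exact form is assumed).
    """
    q = q_logits.astype(np.int32)
    z = q - q.max(axis=-1, keepdims=True)        # max-centering in integers, z <= 0
    # clipped-linear surrogate: multiply, arithmetic shift, clamp to [0, 127]
    s = np.clip(127 + ((alpha_q * z) >> shift), 0, 127)
    # the row maximum always maps to 127, so the denominator is never zero;
    # the final division s/denom would happen in a wider format downstream
    denom = s.sum(axis=-1, keepdims=True)
    return s, denom
```

Only multiply-accumulate, shift, and clamp operations appear in the inner loop, which is the property that lets the surrogate run on int8 MAC units instead of exponential LUTs or bfloat16 math.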
Why it matters
Softmax is a computational bottleneck in low-precision edge AI. This paper offers an integer-native surrogate, HCCS, that substantially improves softmax throughput on AMD AI Engines while maintaining competitive task accuracy, enabling more efficient deployment of Transformer models on resource-constrained devices.
Original Abstract
Softmax can become a computational bottleneck in the Transformer model's Multi-Head Attention (MHA) block, particularly in small models under low-precision inference, where exponentiation and normalization incur significant overhead. As such, we suggest using Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate to the exponential softmax function, which uses a clipped linear mapping of the max-centered attention logits. This approximation produces a stable probability distribution, maintains the ordering of the original logits, and has non-negative values. HCCS differs from previous softmax surrogates in that it includes a set of lightweight calibration parameters that are optimized offline on a representative dataset and calibrated for each individual attention head to preserve the statistical properties of the individual heads. We describe a hardware-motivated implementation of HCCS for high-throughput scenarios targeting the AMD Versal AI Engines. The current reference implementations from AMD for this platform rely on either bfloat16 arithmetic or LUTs to perform the exponential operation, which can limit throughput and fails to utilize the high-throughput integer vector processing units of the AI Engine. In contrast, HCCS provides a natural mapping to the AI Engines' int8 multiply-accumulate (MAC) units. To the best of our knowledge, this is the first int8-optimized softmax surrogate for AMD AI Engines that significantly exceeds the speed of other reference implementations while maintaining competitive task accuracy on small or heavily quantized MHA workloads after quantization-aware retraining.
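The abstract's stated properties (bounded, monotone, non-negative, order-preserving, normalized) can be sanity-checked with a minimal floating-point reference of a clipped-linear surrogate. The functional form `clip(1 + alpha * z, 0, 1)` and the per-head parameter name `alpha` are assumptions for illustration; the paper only specifies a clipped linear mapping of max-centered logits with per-head calibration.

```python
import numpy as np

def hccs_reference(logits, alpha=0.5):
    """Floating-point reference of a clipped-linear softmax surrogate.

    alpha: hypothetical per-head calibration slope (optimized offline
    in the paper; a single scalar is assumed here for simplicity).
    """
    z = logits - logits.max(axis=-1, keepdims=True)   # max-centering, z <= 0
    # clipped-linear stand-in for exp(z): equals 1 at z = 0, like exp(0),
    # decreases monotonically, and is clamped to be non-negative
    s = np.clip(1.0 + alpha * z, 0.0, 1.0)
    # the max-centered maximum maps to 1, so the denominator is >= 1
    return s / s.sum(axis=-1, keepdims=True)
```

Unlike the true softmax, logits far below the row maximum are clipped to exactly zero probability, which is why the paper pairs the surrogate with quantization-aware retraining to recover task accuracy.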