Alternating Multi-bit Quantization for Recurrent Neural Networks
Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao + 2 more
TLDR
This paper proposes an alternating multi-bit quantization method for RNNs that significantly reduces model size and speeds up inference with minimal accuracy loss.
Key contributions
- Formulates multi-bit quantization of weights and activations as an optimization problem solved by alternating minimization (see the sketch after this list).
- Achieves up to 16x memory savings and 6x CPU inference acceleration with 2-bit quantization on LSTM and GRU models.
- Demonstrates near-original or improved accuracy with 3-bit quantization alongside substantial memory and speed gains.
- Extends the quantization approach successfully to image classification tasks and feedforward neural networks.
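
As a rough illustration of the alternating scheme above, the sketch below approximates a weight vector w by sum_i alpha_i * b_i with each b_i in {-1, +1}^n, alternating between re-picking the binary codes and refitting the coefficients by least squares. This is a minimal NumPy sketch under assumptions: the greedy initialization, the brute-force nearest-level code step, and the function name `alt_multibit_quantize` are illustrative choices, not the authors' released implementation (which operates on full weight matrices and speeds up the code step with a binary search tree).

```python
import numpy as np

def alt_multibit_quantize(w, k=2, iters=15):
    """Approximate w (shape (n,)) by sum_{i=1..k} alpha_i * b_i, b_i in {-1,+1}^n,
    via alternating minimization. Illustrative sketch only."""
    w = np.asarray(w, dtype=np.float64)
    n = w.size
    # Greedy initialization: peel one bit off the residual at a time.
    B = np.empty((n, k))
    alphas = np.empty(k)
    r = w.copy()
    for i in range(k):
        B[:, i] = np.where(r >= 0, 1.0, -1.0)
        alphas[i] = np.abs(r).mean()
        r -= alphas[i] * B[:, i]
    # All 2^k sign patterns, used to refine the binary codes.
    patterns = np.array([[1.0 if (j >> i) & 1 else -1.0 for i in range(k)]
                         for j in range(2 ** k)])            # (2^k, k)
    for _ in range(iters):
        # Step 1: fix coefficients, pick for every weight the sign pattern
        # whose reconstruction sum_i alpha_i * s_i is closest to it.
        levels = patterns @ alphas                           # (2^k,)
        idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
        B = patterns[idx]
        # Step 2: fix codes B, refit the coefficients by least squares.
        alphas, *_ = np.linalg.lstsq(B, w, rcond=None)
    return alphas, B

# Example: quantize a random weight vector to 2 bits and check the error.
w = np.random.randn(1024) * 0.1
alphas, B = alt_multibit_quantize(w, k=2)
w_hat = B @ alphas
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

Each weight is ultimately stored as k sign bits plus a handful of shared coefficients, which is where the reported memory savings come from (e.g., 2 bits in place of a 32-bit float).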
Why it matters
Efficient deployment of RNNs on resource-constrained devices and large-scale servers is critical, but model size and inference latency remain major bottlenecks. This paper introduces a principled multi-bit quantization technique that drastically reduces memory footprint and accelerates inference while maintaining or improving accuracy. Its broad applicability to both recurrent and feedforward networks makes it a valuable advancement for practical deep learning deployment.
Original Abstract
Recurrent neural networks have achieved excellent performance in many applications. However, on portable devices with limited resources, the models are often too large to deploy. For server applications with large-scale concurrent requests, the latency during inference can also be critical given costly computing resources. In this work, we address these problems by quantizing the network, both weights and activations, into multiple binary codes {-1,+1}. We formulate the quantization as an optimization problem. Under the key observation that once the quantization coefficients are fixed the binary codes can be derived efficiently by a binary search tree, alternating minimization is then applied. We test the quantization for two well-known RNNs, i.e., long short-term memory (LSTM) and gated recurrent unit (GRU), on language models. Compared with the full-precision counterpart, by 2-bit quantization we can achieve ~16x memory saving and ~6x real inference acceleration on CPUs, with only a reasonable loss in accuracy. By 3-bit quantization, we can achieve almost no loss in accuracy or even surpass the original model, with ~10.5x memory saving and ~3x real inference acceleration. Both results beat existing quantization works by large margins. We extend our alternating quantization to image classification tasks. In both RNNs and feedforward neural networks, the method also achieves excellent performance.
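
The abstract's key observation, that fixing the coefficients reduces the code step to a nearest-level lookup over a small sorted set of 2^k reconstruction values, can be illustrated with a short NumPy sketch. This is only an illustration under assumptions: `np.searchsorted` stands in for the binary search tree described in the paper, and the function name `codes_from_fixed_coeffs` is hypothetical.

```python
import numpy as np

def codes_from_fixed_coeffs(w, alphas):
    """Given fixed coefficients alphas (length k), find for each weight the sign
    combination minimizing |w - sum_i alpha_i * s_i|. Because the 2^k candidate
    levels are precomputed and sorted, each weight needs only a binary-search
    lookup instead of scanning all combinations."""
    k = len(alphas)
    patterns = np.array([[1.0 if (j >> i) & 1 else -1.0 for i in range(k)]
                         for j in range(2 ** k)])             # all sign patterns
    levels = patterns @ alphas                                # candidate values
    order = np.argsort(levels)
    sorted_levels = levels[order]
    # Nearest sorted level for each weight (searchsorted = binary search).
    pos = np.clip(np.searchsorted(sorted_levels, w), 1, len(sorted_levels) - 1)
    left, right = sorted_levels[pos - 1], sorted_levels[pos]
    nearest = np.where(np.abs(w - left) <= np.abs(w - right), pos - 1, pos)
    return patterns[order[nearest]]                           # (n, k) sign codes
```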