ArXiv TLDR

GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

arXiv:2604.18556

Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan, Michael Helcig, Eldar Kurtic + 1 more

cs.CL · cs.LG

TLDR

GSQ is a post-training scalar quantization method for LLMs that achieves high accuracy at low bit-widths (2-3 bits per parameter), closing most of the gap to complex vector-quantization methods.

Key contributions

  • Introduces GSQ, a post-training scalar quantization method for LLMs.
  • Achieves high accuracy at 2-3 bits per parameter, closing most of the gap to vector quantizers such as QTIP.
  • Jointly learns per-coordinate grid assignments and per-group scales via a Gumbel-Softmax relaxation of the discrete grid, while staying compatible with existing scalar inference kernels (sketched below).
  • Scales to trillion-parameter Mixture-of-Experts (MoE) models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.
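
To make the Gumbel-Softmax idea concrete, here is a minimal sketch of relaxed scalar quantization in PyTorch. It is an illustration under assumptions, not the paper's implementation: the function name `gumbel_softmax_quantize`, the ternary grid, and the single shared scale are hypothetical choices made for brevity.

```python
# Minimal sketch of Gumbel-Softmax quantization over a small scalar grid;
# names and parameterization are hypothetical, not the paper's code.
import torch
import torch.nn.functional as F

def gumbel_softmax_quantize(logits, grid, scale, tau=1.0, hard=True):
    """Relax the discrete choice among len(grid) levels per coordinate.

    logits: (..., L) learnable assignment scores, one row per weight.
    grid:   (L,) fixed symmetric levels, e.g. [-1, 0, 1] for ternary.
    scale:  per-group scale, broadcastable to the output shape.
    """
    # Differentiable (straight-through when hard=True) one-hot sample
    # over the L grid levels.
    y = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)
    # Combine levels: a hard one-hot picks a level, a soft one mixes them.
    return scale * (y @ grid)

# Toy usage: 4 weights, ternary grid, one shared learnable scale.
grid = torch.tensor([-1.0, 0.0, 1.0])
logits = torch.randn(4, 3, requires_grad=True)
scale = torch.tensor(0.05, requires_grad=True)
w_hat = gumbel_softmax_quantize(logits, grid, scale)
w_hat.sum().backward()  # gradients reach both the assignments and the scale
```

Keeping the relaxation's cardinality equal to the grid size (3-8 levels at these bit-widths, per the abstract) keeps the softmax over levels small, which is what makes the joint optimization tractable.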

Why it matters

Existing low-bit quantization methods for LLMs either offer simplicity with limited accuracy or high accuracy with implementation complexity. GSQ provides a much-needed bridge, delivering state-of-the-art accuracy at 2-3 bits using a simple, deployable scalar quantization approach. This enables more efficient local LLM inference and broader application to massive MoE models.

Original Abstract

Weight quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into two sets of methods: simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and "second-generation" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier at low bit-widths but are notoriously hard to implement and to scale, and have gained relatively less traction. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3-8 levels for ternary and 3 bpp, respectively), making the relaxation tight and the optimization tractable. Practically, on the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization, and thus fully compatible with existing scalar inference kernels. We further show that GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.
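
For context on the kernel-compatibility claim, the following is a minimal sketch of the group-wise symmetric scalar format the abstract describes: integer codes index a shared symmetric grid, and each group carries one scale. The names, layout, and `group_size=128` are illustrative assumptions, not the paper's exact format.

```python
# Minimal sketch of group-wise symmetric scalar dequantization, the kind of
# storage format the abstract targets; layout details here are assumptions.
import torch

def dequantize(codes, scales, grid, group_size=128):
    """codes:  (N,) integer indices into the shared symmetric grid.
    scales: (N // group_size,) one scale per contiguous group.
    grid:   (L,) symmetric levels, e.g. 8 levels at 3 bits.
    """
    levels = grid[codes]                        # look up grid values
    levels = levels.view(-1, group_size)        # one row per group
    return (levels * scales[:, None]).view(-1)  # apply per-group scales

# 3-bit example: 8-level symmetric grid, 256 weights in 2 groups of 128.
grid = torch.linspace(-7.0, 7.0, 8) / 7.0   # symmetric grid
codes = torch.randint(0, 8, (256,))
scales = torch.tensor([0.02, 0.03])
w = dequantize(codes, scales, grid)
print(w.shape)  # torch.Size([256])
```

Since inference needs only an index lookup and a per-group multiply, existing group-wise scalar kernels can serve weights stored this way unchanged, which is the deployment advantage the abstract emphasizes.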
