ArXiv TLDR

High-Rate Quantized Matrix Multiplication II

arXiv: 2605.13768

Or Ordentlich, Yury Polyanskiy

cs.LG, cs.AI, cs.IT

TLDR

This paper studies high-rate quantized matrix multiplication for LLMs, showing how reverse waterfilling improves GPTQ's rate allocation and proving that the WaterSIC scheme comes close to the information-theoretic limit.

Key contributions

  • Shows how reverse waterfilling improves practical LLM quantization algorithms such as GPTQ, which currently allocate rate equally across coordinates (see the sketch after this list).
  • Analyzes WaterSIC, a scheme built from scalar INT quantizers, proving its high-rate distortion is basis-free (it depends only on $\det \Sigma_X$) and within 0.25 bit/entry of the information-theoretic limit.
  • Demonstrates that GPTQ with a random rotation comes within 0.1 bit of WaterSIC on actual $\Sigma_X$ from Llama-3-8B, suggesting it is also near-optimal in the high-rate regime.
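To make the rate-allocation idea concrete, here is a minimal NumPy sketch of classical reverse waterfilling for a Gaussian source with independent coordinates. The function name, bisection approach, and example variances are our illustrative assumptions, not the paper's implementation: coordinates whose variance exceeds the water level receive rate $R_i = \tfrac{1}{2}\log_2(\sigma_i^2/D_i)$, the rest receive zero.

```python
import numpy as np

def reverse_waterfill(variances, total_distortion, iters=100):
    """Reverse waterfilling sketch (hypothetical helper, not the paper's code).

    Bisects on the water level theta so that sum_i min(theta, sigma_i^2)
    matches the distortion budget, then returns per-coordinate rates
    R_i = 0.5 * log2(sigma_i^2 / D_i) in bits and distortions D_i.
    """
    variances = np.asarray(variances, dtype=float)
    lo, hi = 0.0, float(variances.max())
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        # Total distortion sum_i min(theta, sigma_i^2) increases with theta.
        if np.minimum(theta, variances).sum() > total_distortion:
            hi = theta
        else:
            lo = theta
    d = np.minimum(theta, variances)       # per-coordinate distortion D_i
    rates = 0.5 * np.log2(variances / d)   # zero rate where sigma_i^2 <= theta
    return rates, d

# High-variance coordinates get more bits; coordinates below the water
# level get zero rate -- unlike the equal split GPTQ currently uses.
sigma2 = np.array([4.0, 1.0, 0.25, 0.01])
rates, dist = reverse_waterfill(sigma2, total_distortion=0.4)
print(rates.round(3), dist.round(4))
```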

Why it matters

This paper provides critical insights into optimizing weight-only post-training quantization for LLMs. By applying waterfilling principles and analyzing schemes like WaterSIC and GPTQ, it offers practical improvements and theoretical guarantees for efficient LLM deployment.

Original Abstract

This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where covariance matrix $\Sigma_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as "WaterSIC") that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $\Sigma_X$ and, thus, unlike existing schemes, is immune to applying random rotations); and (b) within a multiplicative factor of $\frac{2\pi e}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $\Sigma_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.
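For context on the $\frac{2\pi e}{12}$ factor: it is the classical high-rate gap between an entropy-coded uniform scalar quantizer and the Gaussian rate-distortion function. The short derivation below is standard textbook material, offered for orientation rather than reproduced from the paper's proofs; it uses the Gaussian differential entropy $h(X) = \tfrac{1}{2}\log_2(2\pi e \sigma^2)$ and the high-rate relation $R \approx h(X) - \log_2 \Delta$ for step size $\Delta$.

```latex
% Uniform scalar quantizer at high rate: D = Delta^2/12 with
% Delta^2 = 2^{2h(X) - 2R}, versus the Shannon limit sigma^2 2^{-2R}.
\[
  D_{\text{scalar}}(R) \approx \frac{\Delta^2}{12}
  = \frac{2^{2h(X)-2R}}{12}
  = \frac{2\pi e}{12}\,\sigma^2\,2^{-2R},
  \qquad
  D_{\text{Shannon}}(R) = \sigma^2\,2^{-2R}.
\]
\[
  \frac{D_{\text{scalar}}}{D_{\text{Shannon}}} = \frac{2\pi e}{12} \approx 1.42,
  \qquad
  \tfrac{1}{2}\log_2\!\frac{2\pi e}{12} \approx 0.25\ \text{bit/entry}.
\]
```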
