Theory of AI · Part 1 of 4

Quantization for Matrix Multiplication (1): Problem Formulation

Abstract

Modern large language models spend most of their inference time not on floating-point arithmetic, but on moving weight matrices from memory to compute units. This motivates a fundamental question: if we must compress matrices, how should we quantize them when the only operation we care about is matrix multiplication?

In this first post of a series, we formulate this problem precisely and review existing weight-only quantization methods like GPTQ and LDLQ.


1. Introduction: The Memory Wall Problem

1.1 Why Matrix Multiplication Dominates LLM Inference

In transformer-based language models, the dominant computational primitive is matrix-vector multiplication:

$$y = Wx$$

where $W \in \mathbb{R}^{m \times n}$ is a weight matrix and $x \in \mathbb{R}^n$ is an activation vector. For a model like LLaMA-70B, the weight matrices contain approximately 70 billion parameters.

The key insight is that while modern hardware has abundant compute, matrix-vector workloads are memory-bound in practice. Consider the arithmetic intensity:

$$\xi = \frac{\text{FLOPs}}{\text{Bytes transferred}}$$

For a matrix-vector product $Wx$ where $W \in \mathbb{R}^{m \times n}$:

  • FLOPs: $2mn$ (one multiply and one add per entry)
  • Bytes: $4mn$ (assuming FP32 weights) plus $4n$ (input) plus $4m$ (output)

This gives $\xi \approx 0.5$ ops/byte for large matrices.
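The back-of-the-envelope calculation above can be checked in a few lines. This is a sketch; `matvec_intensity` is an illustrative helper, not a profiling tool, and it charges `bytes_per_weight` only to the weight matrix while keeping the input/output vectors in FP32.

```python
# Arithmetic intensity of y = W x: FLOPs divided by bytes moved.
def matvec_intensity(m, n, bytes_per_weight=4.0):
    flops = 2 * m * n                                        # one multiply-add per entry
    bytes_moved = bytes_per_weight * m * n + 4.0 * (n + m)   # weights + FP32 input/output
    return flops / bytes_moved

print(matvec_intensity(8192, 8192))        # ~0.5 ops/byte with FP32 weights
print(matvec_intensity(8192, 8192, 0.5))   # 4-bit weights: ~8x higher intensity
```

Note that shrinking the weights to 4 bits lifts the intensity from about 0.5 to about 4 ops/byte, which is still far below the hardware ratios in the next section, so the workload stays memory-bound and the full $8\times$ bandwidth saving translates into speedup.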

1.2 The Hardware Reality

Modern GPUs have an ops:bytes ratio far exceeding 1:

Hardware       Peak FLOPS   Memory BW   $\xi_{\text{hardware}}$
NVIDIA A100    312 TFLOPS   2 TB/s      ~150
NVIDIA H100    990 TFLOPS   3.35 TB/s   ~300

Since the workload's arithmetic intensity ($\xi \approx 0.5$) is far below the hardware's ops:bytes ratio ($\xi_{\text{hardware}} > 100$), memory bandwidth is the bottleneck, not compute.

This means: reducing the size of weight matrices by $k\times$ can yield up to $k\times$ speedup, even if decompression requires additional computation!


2. Problem Formulation

2.1 The Central Question

Given this memory bottleneck, we ask:

How should we quantize two matrices $A$ and $B$ if our sole goal is to accurately approximate $A^T B$?

More formally, suppose we have:

  • $A \in \mathbb{R}^{n \times a}$ (e.g., weight matrix)
  • $B \in \mathbb{R}^{n \times b}$ (e.g., activation matrix)
  • Rate constraint: $R$ bits per entry

We want encoders $f_A: \mathbb{R}^{n \times a} \to \{0,1\}^{naR}$ and $f_B: \mathbb{R}^{n \times b} \to \{0,1\}^{nbR}$, and a decoder $g$ such that:

$$\hat{C} = g(f_A(A), f_B(B)) \approx A^T B = C$$

2.2 Distortion Measure

The natural distortion measure is normalized mean squared error:

$$D = \frac{1}{abn} \mathbb{E}\left[\|\hat{C} - C\|_F^2\right]$$

The normalization by $abn$ ensures the distortion remains $O(1)$ as dimensions grow (since $\|A^T B\|_F^2 = O(abn)$ for random matrices with unit-variance entries).
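The distortion measure is easy to estimate by Monte Carlo. Below is a small sketch that compresses both matrices with a naive 4-bit uniform scalar quantizer (illustrative parameters; the clip range $[-4, 4]$ is an assumption suited to $N(0,1)$ entries) and evaluates $D$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, b = 256, 64, 64

def uniform_quantize(M, bits=4, lo=-4.0, hi=4.0):
    # Naive scalar quantizer: uniform bins on [lo, hi], midpoint reconstruction.
    levels = 2 ** bits
    step = (hi - lo) / levels
    idx = np.clip(np.floor((M - lo) / step), 0, levels - 1)
    return lo + (idx + 0.5) * step

A = rng.standard_normal((n, a))
B = rng.standard_normal((n, b))
C = A.T @ B
C_hat = uniform_quantize(A).T @ uniform_quantize(B)
D = np.mean((C_hat - C) ** 2) / n   # mean over the a*b entries, then 1/n
print(D)                            # roughly 2 * step^2 / 12 ≈ 0.04 at 4 bits
```

The extra division by $n$ on top of `np.mean` (which already averages over the $ab$ entries) implements the $1/(abn)$ normalization, and the estimate stays stable as `n` grows.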

2.3 Why This Is Non-Standard

This is not a standard rate-distortion problem. In classical vector quantization, we compress $X$ and try to reconstruct $X$ itself. Here, we compress $A$ and $B$ separately, but only care about reconstructing $A^T B$—a quadratic function of the sources.

Key complications:

  1. Separate encoding: $A$ and $B$ are encoded independently
  2. Quadratic reconstruction: We reconstruct a product, not the sources
  3. Distributed setting: Two encoders, one decoder

3. Existing Weight-Only Quantization Methods

3.1 GPTQ: Greedy Second-Order Quantization

GPTQ (Frantar et al., 2022) is a widely-used post-training quantization method that can be viewed as a greedy, second-order vector quantization scheme.

Objective: Minimize the Hessian-weighted quadratic loss:

$$L_{\text{GPTQ}}(W, \hat{W}) = \text{vec}(W - \hat{W})^T H \, \text{vec}(W - \hat{W})$$

where $H$ is the (approximate) Hessian of the training loss with respect to the weights.

Algorithm:

  1. Compute the Hessian $H \approx X X^T$ from calibration data
  2. For each column $w_i$ of $W$:
    • Quantize $w_i$ to nearest codebook entry
    • Update remaining columns to compensate for quantization error

Key insight: GPTQ uses second-order information (the Hessian) to prioritize which weights matter most, achieving good performance at 4-bit and even 3-bit precision.
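The two-step loop above can be sketched compactly. This is a simplified rendering of the GPTQ inner loop (after Frantar et al., 2022): columns are quantized in fixed order and each column's error is pushed onto the not-yet-quantized columns through the Cholesky factor of $H^{-1}$. A uniform grid stands in for the real codebook, and blocking/lazy-batch tricks are omitted:

```python
import numpy as np

def gptq_sketch(W, H, step=0.1, damp=1e-3):
    # W: (out_dim, n) weights; H: (n, n) proxy Hessian over input dims.
    W = W.astype(float).copy()
    n = W.shape[1]
    Hinv = np.linalg.inv(H + damp * np.eye(n))   # damped inverse Hessian
    U = np.linalg.cholesky(Hinv).T               # upper-triangular, Hinv = U.T @ U
    Q = np.zeros_like(W)
    for i in range(n):
        Q[:, i] = step * np.round(W[:, i] / step)     # round column i to the grid
        err = (W[:, i] - Q[:, i]) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])   # compensate on later columns
    return Q

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 256))   # toy calibration activations
H = X @ X.T / 256                    # proxy Hessian, H ∝ X X^T
W = rng.standard_normal((8, 64))
Q = gptq_sketch(W, H)
```

On random data like this, the error compensation typically yields a lower Hessian-weighted loss $\text{tr}((W-Q)H(W-Q)^T)$ than plain round-to-nearest with the same grid.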

3.2 LDLQ: Lattice-Based Quantization

LDLQ and related methods (QuIP, QuIP#) quantize small blocks of weights using structured codebooks derived from lattice quantizers.

Lattice quantizers: A lattice $\Lambda \subset \mathbb{R}^n$ is a discrete additive subgroup. The lattice quantizer maps each point to its nearest lattice point:

$$Q_{\Lambda}(x) = \arg\min_{\lambda \in \Lambda} \|x - \lambda\|$$

Good lattices: Some lattices achieve near-optimal quantization efficiency. For example:

  • $E_8$ lattice in 8 dimensions
  • Leech lattice $\Lambda_{24}$ in 24 dimensions
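Nearest-point decoding for $E_8$ is cheap thanks to the Conway–Sloane construction $E_8 = D_8 \cup (D_8 + \tfrac{1}{2})$: decode to each coset and keep the closer point. The sketch below is illustrative (helper names are my own, and `np.round` ties break to even):

```python
import numpy as np

def nearest_Dn(x):
    # D_n = integer vectors with even coordinate sum.
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Fix parity by re-rounding the coordinate with the largest error
        # the "wrong way" (the cheapest single-coordinate change).
        k = np.argmax(np.abs(x - f))
        f[k] += 1.0 if x[k] > f[k] else -1.0
    return f

def nearest_E8(x):
    cands = [nearest_Dn(x), nearest_Dn(x - 0.5) + 0.5]
    return min(cands, key=lambda c: np.sum((x - c) ** 2))

x = np.array([0.9, 0, 0, 0, 0, 0, 0, 0], dtype=float)
print(nearest_E8(x))  # the origin: (1,0,...,0) is not an E8 point
```

The example shows why lattice quantization differs from coordinatewise rounding: the naive round of $(0.9, 0, \ldots, 0)$ is not in the lattice, and the true nearest $E_8$ point is the origin.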

LDLQ approach:

  1. Apply random rotation (incoherence processing)
  2. Quantize blocks using $E_8$ or other efficient lattices
  3. Use adaptive rounding based on input statistics
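Step 1 of this recipe is easy to demonstrate. A random rotation preserves the vector's norm while spreading its energy evenly across coordinates, which shrinks outliers before lattice quantization (a minimal sketch with a hypothetical outlier-heavy weight vector):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 512
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix

w = np.zeros(n)
w[0] = 10.0                                # one large outlier weight
w[1:] = rng.standard_normal(n - 1) * 0.1   # the rest are small

w_rot = Q @ w
print(np.max(np.abs(w)), np.max(np.abs(w_rot)))  # rotation tames the outlier
```

After rotation the largest coordinate drops by roughly a factor of $\sqrt{n / \log n}$, so a fixed-range quantizer wastes far fewer bits on the tails.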

3.3 The Objective Mismatch Problem

Both GPTQ and LDLQ optimize a proxy objective:

$$\min_{\hat{W}} \|W - \hat{W}\|^2_{\text{weighted}}$$

But what we actually care about is:

$$\min_{\hat{W}} \mathbb{E}\left[\|Wx - \hat{W}x\|^2\right]$$

These are not the same!
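A tiny example makes the mismatch concrete. For $x \sim N(0, \Sigma)$ we have $\mathbb{E}\|Wx - \hat{W}x\|^2 = \text{tr}((W-\hat{W})\,\Sigma\,(W-\hat{W})^T)$, which collapses to the Frobenius error only when $\Sigma = I$. The toy $1 \times 2$ layer and candidates below are hypothetical, chosen so the two objectives disagree:

```python
import numpy as np

Sigma = np.diag([10.0, 0.1])        # activation energy concentrated on axis 0
W  = np.array([[0.2, 0.5]])
W1 = np.array([[0.0, 0.5]])         # small Frobenius error, but on the hot axis
W2 = np.array([[0.2, 0.0]])         # larger Frobenius error, on the cold axis

frob = lambda Wq: np.sum((W - Wq) ** 2)
mm   = lambda Wq: np.trace((W - Wq) @ Sigma @ (W - Wq).T)

print(frob(W1) < frob(W2), mm(W1) > mm(W2))  # → True True
```

The proxy objective prefers `W1`, yet `W1` is sixteen times worse for actual matrix-vector products: the two criteria rank the candidates in opposite order.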

A key result from rate-distortion theory shows that there exist quantizers providing strictly better MSE for matrix multiplication than what any classical "approximate the matrix itself" method can achieve.


4. A Preview of Optimal Quantization

4.1 The Distortion-Rate Function

For i.i.d. Gaussian matrices, the optimal distortion-rate function turns out to be:

$$\Gamma(R) = \begin{cases} 1 - \frac{1 - f_{\text{naive}}(R^*)}{R^*}\, R, & 0 \leq R < R^* \\ f_{\text{naive}}(R), & R \geq R^* \end{cases}$$

where:

  • $f_{\text{naive}}(R) = 2 \cdot 2^{-2R} - 2^{-4R}$ (naive scalar quantization)
  • $R^* \approx 0.906$ bits per entry (critical rate)

4.2 Phase Transition

There is a phase transition at $R^* \approx 0.906$ bits:

  • High rate ($R \geq R^*$): Quantize all entries with equal precision
  • Low rate ($R < R^*$): Use time-sharing—send some entries with high precision, others with zero bits

At low rates, it's better to perfectly transmit a fraction of entries than to poorly transmit all of them.
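The phase transition can be checked numerically. The sketch below uses the approximate critical rate $R^* \approx 0.906$ quoted above; for $R < R^*$ the time-sharing line $\Gamma(R)$ lies on or under the naive curve $f_{\text{naive}}(R)$, and the two coincide above $R^*$:

```python
import numpy as np

R_star = 0.906  # approximate critical rate
f_naive = lambda R: 2 * 2.0 ** (-2 * R) - 2.0 ** (-4 * R)

def Gamma(R):
    # Straight (time-sharing) line from (0, 1) to (R*, f_naive(R*)),
    # then the naive curve itself for R >= R*.
    slope = (1 - f_naive(R_star)) / R_star
    return np.where(R < R_star, 1 - slope * R, f_naive(R))

R = np.linspace(0.0, 2.0, 201)
gap = f_naive(R) - Gamma(R)    # time-sharing gain; zero for R >= R*
print(gap.max())
```

The gain is largest at intermediate low rates (a few percent of the normalized distortion) and vanishes at both $R = 0$ and $R = R^*$, which is exactly the tangent-line picture behind the phase transition.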


5. Conclusion and Next Steps

We have formulated the rate-distortion problem for matrix multiplication and identified the objective mismatch in existing quantization methods.

Summary:

  1. Memory bandwidth, not compute, is the bottleneck in LLM inference
  2. Quantizing matrices for matmul is different from classical VQ
  3. Existing methods optimize proxies, not the true objective
  4. There exists an optimal distortion-rate function with a phase transition

In the next post, we will prove that classical VQ, sketching, and naive dithering fail to achieve the optimal rate.


This post is based on Professor Yuri Polyanskiy's lecture at the 2025 Princeton ML Theory Summer School and the paper "Optimal Quantization for Matrix Multiplication" by Ordentlich and Polyanskiy (2024).