
Quantization for Matrix Multiplication (4): Implementation and Open Problems

Abstract

In this final post of the series, we bridge theory and practice. We discuss how to implement optimal quantizers using nested lattice schemes, explore extensions beyond the Gaussian case, and outline important open problems that remain for future research.


1. Implementing Optimal Quantization

1.1 From Theory to Practice

The previous post established that time-sharing achieves the optimal distortion $\Gamma(R)$. But how do we implement this in practice?

Two challenges:

  1. Coordinate selection: How to choose which coordinates to quantize?
  2. High-rate quantization: How to optimally quantize the selected coordinates?

1.2 Nested Lattice Quantization

The optimal scheme uses nested lattice quantizers with dithering.

Nested lattices: Two lattices $\Lambda_c \subset \Lambda_f$ (the coarse lattice is a sublattice of the fine one), where:

  • $\Lambda_c$: Coarse lattice (determines rate)
  • $\Lambda_f$: Fine lattice (determines reconstruction quality)

Dithered quantization:

  1. Generate a shared dither $Z \sim \text{Uniform}(\mathcal{V}_f)$, where $\mathcal{V}_f$ is the Voronoi cell of the fine lattice $\Lambda_f$
  2. Encode: compute $Q_{\Lambda_f}(X + Z) \bmod \Lambda_c$ and transmit the coset index; the decoder outputs $\hat{X} = Q_{\Lambda_f}(X + Z) - Z$ (correct whenever no overload occurs)
  3. This makes the quantization error independent of the input and uniformly distributed over $\mathcal{V}_f$ (a one-dimensional sketch follows this list)
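
To make the dithering step concrete, here is a minimal one-dimensional sketch that uses the scalar lattice $\Delta\mathbb{Z}$ as the fine lattice and omits the mod-$\Lambda_c$ reduction; all names and parameters are illustrative, not from the paper. The empirical error statistics match the prediction: uniform over a cell and essentially uncorrelated with the input.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.5                                   # step size of the "fine" lattice delta*Z

def dithered_quantize(x, z, delta):
    # Q_Lambda(x + z) - z with Lambda = delta*Z (mod-Lambda_c reduction omitted)
    return delta * np.round((x + z) / delta) - z

x = rng.standard_normal(100_000)                 # inputs
z = rng.uniform(-delta / 2, delta / 2, x.shape)  # shared dither, uniform on the Voronoi cell
x_hat = dithered_quantize(x, z, delta)
err = x_hat - x

# Error is uniform on [-delta/2, delta/2] and (nearly) uncorrelated with the input.
print(err.min(), err.max())                   # approx -delta/2, +delta/2
print(np.var(err), delta**2 / 12)             # approx delta^2 / 12
print(np.corrcoef(err, x)[0, 1])              # approx 0
```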

1.3 Why Nested Lattices Work

Key property: For a good nested lattice pair, the quantization error $X - \hat{X}$ is uniform over the fine Voronoi cell $\mathcal{V}_f$, and in high dimensions it is approximately Gaussian with variance $\sigma^2_{\text{eff}}$ determined by the lattice geometry.

The second moment (per dimension) of a lattice $\Lambda$ with Voronoi cell $\mathcal{V}$ is: $\sigma^2(\Lambda) = \frac{1}{nV} \int_{\mathcal{V}} \|x\|^2 \, dx$

Good lattices approach the Zador bound: $\sigma^2(\Lambda) \to \frac{1}{2\pi e} V^{2/n}$

where $V$ is the volume of the Voronoi cell.
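
As a quick sanity check (standard numbers, not from the paper): for the scaled cubic lattice $\Delta\mathbb{Z}^n$ we have $V = \Delta^n$ and $\sigma^2 = \Delta^2/12 \approx 0.083\,\Delta^2$, while the Zador limit is $\Delta^2/(2\pi e) \approx 0.0585\,\Delta^2$; structured lattices such as $E_8$ and $\Lambda_{24}$ recover a good fraction of this roughly 1.53 dB gap.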


2. The Complete Algorithm

2.1 Optimal Quantizer for $R < R^*$

Input: Vectors $U, V \in \mathbb{R}^n$, rate $R$, shared randomness seed $s$

Algorithm:

  1. Using seed $s$, generate random subset $S \subset [n]$ with $|S| = \kappa^* n$ where $\kappa^* = R/R^*$
  2. Generate dither $Z \sim \text{Uniform}(\mathcal{V}_c)$ using seed $s$
  3. For $i \in S$:
    • Quantize $U_i$ at rate $R^*$: $\hat{U}_i = Q_{R^*}(U_i; Z_i)$
    • Quantize $V_i$ at rate $R^*$: $\hat{V}_i = Q_{R^*}(V_i; Z_i)$
  4. For $i \notin S$: $\hat{U}_i = \hat{V}_i = 0$
  5. Output estimate: $\hat{C} = \hat{U}^T \hat{V}$
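
Below is a schematic NumPy sketch of steps 1 through 5. The rate-$R^*$ nested-lattice quantizer is replaced by a simple dithered uniform quantizer, and every name (`timeshare_quantize`, `q_dithered`, the step-size convention) is a placeholder rather than the paper's implementation.

```python
import numpy as np

def q_dithered(x, z, rate):
    # Stand-in for the rate-R* nested-lattice quantizer: a dithered uniform
    # quantizer whose step size shrinks as 2^{-rate}.
    step = 2.0 ** (-rate)
    return step * np.round((x + z) / step) - z

def timeshare_quantize(u, v, rate, rate_star, seed):
    n = u.shape[0]
    rng = np.random.default_rng(seed)             # shared randomness (same seed at encoder and decoder)
    kappa = min(rate / rate_star, 1.0)
    S = rng.choice(n, size=int(kappa * n), replace=False)      # random coordinate subset
    z = rng.uniform(-0.5, 0.5, size=n) * 2.0 ** (-rate_star)   # shared dither
    u_hat, v_hat = np.zeros(n), np.zeros(n)
    u_hat[S] = q_dithered(u[S], z[S], rate_star)  # quantize selected coordinates at rate R*
    v_hat[S] = q_dithered(v[S], z[S], rate_star)
    return u_hat @ v_hat                          # estimate of u^T v (zeros outside S)

rng = np.random.default_rng(1)
n, R, R_star = 4096, 0.5, 1.0                     # illustrative values
u, v = rng.standard_normal(n), rng.standard_normal(n)
print(timeshare_quantize(u, v, R, R_star, seed=42), u @ v)   # low-rate estimate vs. exact
```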

2.2 Optimal Quantizer for $R \geq R^*$

Simply apply optimal Gaussian quantization at rate $R$ (in practice, the dithered nested-lattice scheme from Section 1) to all coordinates:

$\hat{C} = \hat{U}^T \hat{V}$

where $\hat{U}_i = Q_R(U_i)$ for all $i$.
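
In the sketch above, this regime corresponds to taking $\kappa = 1$ (keep every coordinate) and setting the per-coordinate rate to $R$ rather than $R^*$.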

2.3 Practical Considerations

  1. Shared randomness: Both encoder and decoder need the same random seed
  2. Lattice choice: Use $E_8$ or the Leech lattice $\Lambda_{24}$ for practical implementations (a nearest-point sketch for $E_8$ follows this list)
  3. Finite precision: Quantization of the dither itself requires care
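
As a concrete illustration of item 2, here is the classical Conway-Sloane nearest-point routine for $E_8$, viewed as $D_8 \cup (D_8 + \tfrac{1}{2}\mathbf{1})$; this is a generic sketch of the lattice decoder, not the paper's code.

```python
import numpy as np

def closest_point_Dn(x):
    # Nearest point of D_n (integer vectors with even coordinate sum):
    # round every coordinate; if the sum is odd, re-round the coordinate
    # with the largest rounding error the other way (Conway-Sloane).
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        err = x - f
        k = int(np.argmax(np.abs(err)))
        f[k] += np.sign(err[k]) if err[k] != 0 else 1.0
    return f

def closest_point_E8(x):
    # E8 = D8  union  (D8 + (1/2, ..., 1/2)); take the nearer of the two cosets.
    x = np.asarray(x, dtype=float)
    c0 = closest_point_Dn(x)
    c1 = closest_point_Dn(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

print(closest_point_E8(np.array([0.1, 0.9, 0.2, 0.2, 0.3, 0.1, 0.0, 0.4])))
```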

3. Beyond Gaussian Sources

3.1 Sub-Gaussian Distributions

The theory extends to sub-Gaussian distributions:

Definition: A zero-mean $X$ is $\sigma$-sub-Gaussian if $\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2}$ for all $t$.

Theorem: For i.i.d. sub-Gaussian entries with parameter $\sigma$, the distortion-rate function satisfies: $D^*(R) \leq \sigma^2 \cdot \Gamma(R) + o(n)$
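
For example (a standard fact, not from the paper): any zero-mean variable bounded in $[-c, c]$ is $c$-sub-Gaussian by Hoeffding's lemma, so i.i.d. Rademacher $\pm 1$ entries satisfy the theorem with $\sigma = 1$.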

3.2 Non-i.i.d. Sources

For correlated sources with covariance matrix $\Sigma$:

  • Pre-whitening: Transform $\tilde{U} = \Sigma^{-1/2} U$
  • Quantize: Apply optimal scheme to $\tilde{U}, \tilde{V}$
  • Correct: Account for the transformation when reconstructing the product (see the sketch below)
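
A hedged sketch of this recipe: `quantize` below is a placeholder uniform quantizer standing in for the optimal scheme, and the covariance is synthetic. The important line is the corrected bilinear form, since $\tilde{U} = \Sigma^{-1/2} U$ implies $U^\top V = \tilde{U}^\top \Sigma \tilde{V}$.

```python
import numpy as np

def inv_sqrt(Sigma):
    # Symmetric inverse square root via eigendecomposition.
    w, Q = np.linalg.eigh(Sigma)
    return Q @ np.diag(w ** -0.5) @ Q.T

def quantize(x, bits):
    # Placeholder uniform quantizer; substitute the nested-lattice scheme above.
    step = 2.0 ** (-bits)
    return step * np.round(x / step)

rng = np.random.default_rng(0)
n = 256
A = rng.standard_normal((n, n)) / np.sqrt(n)
Sigma = A @ A.T + 0.1 * np.eye(n)                               # an example covariance
L = np.linalg.cholesky(Sigma)
U, V = L @ rng.standard_normal(n), L @ rng.standard_normal(n)   # correlated sources

W = inv_sqrt(Sigma)
U_t, V_t = W @ U, W @ V                      # pre-whiten: entries now approximately i.i.d. N(0,1)
U_hat, V_hat = quantize(U_t, 4), quantize(V_t, 4)

# Correct for the transform: U^T V = U_t^T Sigma V_t, so use Sigma in the bilinear form.
print(U_hat @ Sigma @ V_hat, U @ V)
```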

3.3 LLM Weight Distributions

Real LLM weights are not Gaussian:

  • Often exhibit heavy tails
  • May have outliers
  • Show layer-dependent statistics

Practical approaches:

  1. Outlier handling: Separate treatment for large weights
  2. Per-channel scaling: Normalize each row/column
  3. Mixed-precision: Different rates for different layers
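
A small sketch of items 1 and 2 (outlier splitting plus per-row absmax scaling); the threshold, scaling convention, and helper name are illustrative choices, not prescribed by the theory.

```python
import numpy as np

def split_outliers_and_scale(W, outlier_frac=0.01):
    # Keep the largest-magnitude weights in a separate sparse matrix (item 1),
    # then absmax-normalize each row of the remaining dense part (item 2).
    thresh = np.quantile(np.abs(W), 1.0 - outlier_frac)
    outliers = np.where(np.abs(W) > thresh, W, 0.0)     # kept in full precision
    dense = W - outliers
    scales = np.max(np.abs(dense), axis=1, keepdims=True) + 1e-12
    return dense / scales, scales, outliers

rng = np.random.default_rng(0)
W = rng.standard_t(df=3, size=(8, 16))       # heavy-tailed stand-in for LLM weights
W_norm, s, W_out = split_outliers_and_scale(W)
W_rec = W_norm * s + W_out                   # exact reconstruction before quantization
assert np.allclose(W_rec, W)
```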

4. Connections to Existing LLM Quantization

4.1 GPTQ and Optimal Theory

GPTQ approximates the Hessian-weighted objective: $L = \mathbb{E}\big[\|Wx - \hat{W}x\|^2\big]$
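Expanding the expectation (a standard identity, not specific to GPTQ) gives $L = \mathrm{tr}\big((W - \hat{W})\, H \,(W - \hat{W})^\top\big)$ with $H = \mathbb{E}[xx^\top]$, which is why the proxy Hessian $H$, estimated from calibration data, weights the reconstruction error.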

This is related but not identical to the matmul rate-distortion objective. Key differences:

  • GPTQ optimizes for reconstruction of $Wx$, not $W$ itself
  • Uses calibration data to estimate the Hessian
  • Greedy column-by-column quantization

Gap from optimal: GPTQ does not exploit the time-sharing structure that is optimal at low rates.

4.2 Lattice-Based Methods

QuIP# and LDLQ use lattice quantizers:

  • Employ $E_8$ lattice for 8-dimensional blocks
  • Use random rotations for incoherence
  • Closer to theoretically optimal for moderate rates

Remaining gap: These methods still don't implement optimal coordinate selection for very low rates.

4.3 Sparsity-Based Methods

Some methods (e.g., SparseGPT) combine quantization with pruning:

  • Set some weights to exactly zero
  • Quantize remaining weights

This is conceptually similar to time-sharing! But the selection criterion differs:

  • Time-sharing: random selection
  • Pruning: magnitude or importance-based selection

5. Open Problems

5.1 Asymmetric Rates

Problem: What if $A$ (weights) can use rate $R_A$ and $B$ (activations) use rate $R_B \neq R_A$?

Motivation: In LLMs, weights are fixed but activations change per input. Different rate allocation may be optimal.

The optimal distortion-rate tradeoff for asymmetric rates remains unknown.

5.2 Non-Gaussian Optimality

Problem: Is $\Gamma(R)$ optimal for non-Gaussian i.i.d. sources?

Known: Achievability via the same scheme.
Unknown: Whether a better scheme exists for specific distributions.

Conjecture: For any i.i.d. source with finite fourth moment, $\Gamma(R)$ is asymptotically optimal.

5.3 Finite Block Length

Problem: What is the optimal distortion for finite $n$?

The current theory is asymptotic ($n \to \infty$). For practical LLMs:

  • Block sizes of 64-256 are common
  • Finite-length effects may be significant

Needed: Non-asymptotic bounds on $D^*(R, n)$.

5.4 Computational Complexity

Problem: Can optimal quantization be done efficiently?

Current status:

  • Encoding: $O(n \log n)$ using FFT-based lattice decoding
  • Shared randomness: Requires synchronized random number generators

Open: Is there a linear-time algorithm?

5.5 Multiple Matrix Products

Problem: What if we use the same quantized $A$ for multiple products $A^T B_1, A^T B_2, \ldots$?

Motivation: In LLMs, the same weight matrix is used for many different activations.

Unknown: Whether amortization over multiple products changes the optimal rate-distortion tradeoff.


6. Implications for LLM Design

6.1 Architecture Insights

The theory suggests:

  1. Smaller matrices are harder to quantize (finite-length effects)
  2. Memory bandwidth is key (confirming the motivation from Post 1)
  3. Sub-1-bit quantization needs new algorithms (such as time-sharing)

6.2 Algorithm Design Principles

For practitioners building quantization methods:

  1. Don't ignore low-rate regime: Standard scalar quantization is suboptimal below ~1 bit
  2. Sparsification can help: Quantizing fewer entries well can beat quantizing all entries poorly
  3. Shared randomness is powerful: Dithering enables reaching the optimal distortion at a given rate

6.3 Future Directions

  1. Matmul-aware training: Pre-train models to be amenable to matmul quantization
  2. Adaptive rate allocation: Learn which layers need more bits
  3. Hardware co-design: Build accelerators that support time-sharing schemes

7. Series Conclusion

Over four posts, we have:

  1. Post 1: Formulated the rate-distortion problem for matrix multiplication
  2. Post 2: Proved that classical VQ, sketching, and dithering fail
  3. Post 3: Derived the optimal $\Gamma(R)$ with phase transition at $R^*$
  4. Post 4: Discussed implementation and open problems

Summary:

  • Memory bandwidth limits LLM inference
  • Quantizing for matmul differs from classical VQ
  • Optimal schemes exist with provable guarantees
  • Many problems remain open

This theory provides a rigorous foundation for understanding and improving LLM quantization.


References

  1. Ordentlich, O., & Polyanskiy, Y. (2024). Optimal Quantization for Matrix Multiplication. arXiv:2410.13780.

  2. Frantar, E., et al. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.

  3. Chee, J., et al. (2023). QuIP: 2-Bit Quantization of Large Language Models With Guarantees.


This post is based on Professor Yuri Polyanskiy's lecture at the 2025 Princeton ML Theory Summer School and the paper "Optimal Quantization for Matrix Multiplication" by Ordentlich and Polyanskiy (2024).