Abstract
In this final post of the series, we bridge theory and practice. We discuss how to implement optimal quantizers using nested lattice schemes, explore extensions beyond the Gaussian case, and outline important open problems that remain for future research.
1. Implementing Optimal Quantization
1.1 From Theory to Practice
The previous post established that time-sharing achieves the optimal distortion $\Gamma(R)$. But how do we implement this in practice?
Two challenges:
- Coordinate selection: How to choose which coordinates to quantize?
- High-rate quantization: How to optimally quantize the selected coordinates?
1.2 Nested Lattice Quantization
The optimal scheme uses nested lattice quantizers with dithering.
Nested lattices: Two lattices $\Lambda_c \subset \Lambda_f$ (the coarse lattice is a sublattice of the fine one) where:
- $\Lambda_c$: Coarse lattice (its Voronoi cell bounds the codebook and determines the rate)
- $\Lambda_f$: Fine lattice (determines reconstruction quality)

The codebook is $\Lambda_f \cap \mathcal{V}_c$, so the rate is $R = \tfrac{1}{n}\log_2(V_c/V_f)$, where $V_c$ and $V_f$ are the Voronoi cell volumes.
Dithered quantization:
- Generate a shared dither $Z \sim \text{Uniform}(\mathcal{V}_f)$, where $\mathcal{V}_f$ is the Voronoi cell of the fine lattice $\Lambda_f$
- Encode: $\hat{X} = \big[Q_{\Lambda_f}(X + Z) \bmod \Lambda_c\big] - Z$, where the $\bmod\ \Lambda_c$ reduction keeps the transmitted index inside the codebook
- Subtractive dithering makes the quantization error independent of the input and, ignoring coarse-lattice overload, uniformly distributed over $\mathcal{V}_f$ (see the sketch below)
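To see the dithering property concretely, here is a minimal one-dimensional sketch, with the scaled integer lattice $\delta\mathbb{Z}$ standing in for a good fine lattice. There is no coarse lattice or modulo step here, so this only illustrates the subtractive-dither mechanism, not the full nested construction:

```python
import numpy as np

# One-dimensional sketch of subtractive dithered quantization.
# The scaled integer lattice delta*Z stands in for a "good" fine lattice.

rng = np.random.default_rng(0)          # shared seed = shared randomness
delta = 0.5                             # fine-lattice step size

def dithered_quantize(x, z, delta):
    """Subtractive dither: Q(x + z) - z."""
    return delta * np.round((x + z) / delta) - z

x = rng.standard_normal(100_000)
z = rng.uniform(-delta / 2, delta / 2, size=x.shape)   # dither ~ Uniform(Voronoi cell)
x_hat = dithered_quantize(x, z, delta)

err = x_hat - x
print(np.corrcoef(err, x)[0, 1])   # ~0: error is decorrelated from the input
print(err.var(), delta**2 / 12)    # error variance matches Uniform(-delta/2, delta/2)
```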
1.3 Why Nested Lattices Work
Key property: For a good nested lattice pair, the quantization error $X - \hat{X}$ is approximately Gaussian with variance $\sigma^2_{\text{eff}}$ determined by the lattice geometry.
The second moment per dimension of a lattice $\Lambda$ with Voronoi cell $\mathcal{V}$ is

$$\sigma^2(\Lambda) = \frac{1}{n V} \int_{\mathcal{V}} \|x\|^2 \, dx,$$

where $V$ is the volume of the Voronoi cell. Good lattices approach the Zador bound: the normalized second moment satisfies

$$G(\Lambda) = \frac{\sigma^2(\Lambda)}{V^{2/n}} \;\longrightarrow\; \frac{1}{2\pi e} \quad \text{as } n \to \infty.$$
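For a sense of scale, here is a tiny numerical comparison (illustrative only) between the scalar integer lattice, whose normalized second moment is $1/12$, and the $1/(2\pi e)$ limit approached by good high-dimensional lattices:

```python
import math

# Normalized second moment of the scalar integer lattice vs. the Zador limit.
G_integer = 1 / 12                       # uniform quantizer on a unit cell
G_limit = 1 / (2 * math.pi * math.e)     # approached by good lattices as n -> infinity

print(f"G(Z)       = {G_integer:.4f}")                                   # 0.0833
print(f"1/(2*pi*e) = {G_limit:.4f}")                                     # 0.0585
print(f"shaping gain = {10 * math.log10(G_integer / G_limit):.2f} dB")   # ~1.53 dB
```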
2. The Complete Algorithm
2.1 Optimal Quantizer for $R < R^*$
Input: Vectors $U, V \in \mathbb{R}^n$, rate $R$, shared randomness seed $s$
Algorithm:
- Using seed $s$, generate random subset $S \subset [n]$ with $|S| = \kappa^* n$ where $\kappa^* = R/R^*$
- Generate dither $Z \sim \text{Uniform}(\mathcal{V}_f)$ using seed $s$
- For $i \in S$:
  - Quantize $U_i$ at rate $R^*$: $\hat{U}_i = Q_{R^*}(U_i; Z_i)$
  - Quantize $V_i$ at rate $R^*$: $\hat{V}_i = Q_{R^*}(V_i; Z_i)$
- For $i \notin S$: $\hat{U}_i = \hat{V}_i = 0$
- Output estimate: $\hat{C} = \hat{U}^T \hat{V}$
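Below is a minimal Python sketch of these steps. It is not the paper's construction: a scalar dithered quantizer stands in for the rate-$R^*$ nested lattice quantizer, and `rate_to_step` is a hypothetical helper mapping a rate to a step size.

```python
import numpy as np

# Sketch of the time-sharing quantizer for R < R*.

def rate_to_step(rate, sigma=1.0):
    return sigma * 2.0 ** (-rate)          # illustrative only

def time_sharing_estimate(u, v, rate, r_star, seed):
    n = u.size
    rng = np.random.default_rng(seed)                 # shared randomness (seed s)
    kappa = rate / r_star                             # fraction of coordinates to keep
    keep = rng.permutation(n)[: int(kappa * n)]       # random subset S

    delta = rate_to_step(r_star)
    z = rng.uniform(-delta / 2, delta / 2, size=keep.size)   # shared dither

    u_hat = np.zeros(n)
    v_hat = np.zeros(n)
    u_hat[keep] = delta * np.round((u[keep] + z) / delta) - z   # quantize at rate R*
    v_hat[keep] = delta * np.round((v[keep] + z) / delta) - z
    return u_hat @ v_hat                              # estimate of U^T V

rng = np.random.default_rng(1)
u, v = rng.standard_normal(4096), rng.standard_normal(4096)
print(time_sharing_estimate(u, v, rate=0.5, r_star=1.0, seed=42), u @ v)
```

Because the un-selected coordinates have $\mathbb{E}[U_i V_i] = 0$, zeroing them leaves the estimate unbiased, and the shared seed guarantees encoder and decoder draw the same subset and dither.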
2.2 Optimal Quantizer for $R \geq R^*$
Simply apply Gaussian quantization at rate $R$ to all coordinates:

$$\hat{C} = \hat{U}^T \hat{V}, \quad \text{where } \hat{U}_i = Q_R(U_i) \text{ and } \hat{V}_i = Q_R(V_i) \text{ for all } i.$$
2.3 Practical Considerations
- Shared randomness: Both encoder and decoder need the same random seed
- Lattice choice: Use $E_8$ or $\Lambda_{24}$ for practical implementations
- Finite precision: Quantization of the dither itself requires care
3. Beyond Gaussian Sources
3.1 Sub-Gaussian Distributions
The theory extends to sub-Gaussian distributions:
Definition: $X$ is $\sigma$-sub-Gaussian if $\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2}$ for all $t$.
Theorem (informal): For i.i.d. $\sigma$-sub-Gaussian entries, the same nested-lattice scheme achieves the Gaussian distortion-rate curve $\Gamma(R)$, with $\sigma^2$ playing the role of the unit variance.
3.2 Non-i.i.d. Sources
For correlated sources with covariance matrix $\Sigma$:
- Pre-whitening: Transform $\tilde{U} = \Sigma^{-1/2} U$
- Quantize: Apply optimal scheme to $\tilde{U}, \tilde{V}$
- Correct: Undo the transform when forming the estimate, using $U^T V = \tilde{U}^T \Sigma \tilde{V}$ (see the sketch below)
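A minimal sketch of this pipeline, assuming $\Sigma$ is known and positive definite; `quantize` is a placeholder for the optimal scheme of Section 2:

```python
import numpy as np

# Minimal sketch of the pre-whitening pipeline for correlated sources.

def inv_sqrtm(sigma_cov):
    """Sigma^{-1/2} via eigendecomposition (Sigma symmetric positive definite)."""
    w, q = np.linalg.eigh(sigma_cov)
    return q @ np.diag(w ** -0.5) @ q.T

def correlated_inner_product(u, v, sigma_cov, quantize):
    t = inv_sqrtm(sigma_cov)
    u_w, v_w = quantize(t @ u), quantize(t @ v)   # quantize whitened coordinates
    # Correction: U^T V = (Sigma^{1/2} U~)^T (Sigma^{1/2} V~) = U~^T Sigma V~
    return u_w @ sigma_cov @ v_w
```

The correction step uses the identity $U^T V = \tilde{U}^T \Sigma \tilde{V}$, so the vectors never need to be explicitly un-whitened.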
3.3 LLM Weight Distributions
Real LLM weights are not Gaussian:
- Often exhibit heavy tails
- May have outliers
- Show layer-dependent statistics
Practical approaches:
- Outlier handling: Separate treatment for large weights
- Per-channel scaling: Normalize each row/column (see the sketch after this list)
- Mixed-precision: Different rates for different layers
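As an illustration of the first two ideas (not a method from the paper), here is a hypothetical sketch that passes the largest-magnitude weights through unquantized and applies a per-row scale to the rest; `quantize_rows` is a placeholder for any of the quantizers above:

```python
import numpy as np

# Hypothetical sketch combining outlier handling and per-channel scaling.

def quantize_with_outliers(w, quantize_rows, outlier_frac=0.01):
    thresh = np.quantile(np.abs(w), 1.0 - outlier_frac)
    outliers = np.abs(w) >= thresh                   # largest weights kept in full precision

    body = np.where(outliers, 0.0, w)
    scale = np.abs(body).max(axis=1, keepdims=True) + 1e-12   # one scale per row
    w_hat = quantize_rows(body / scale) * scale

    w_hat[outliers] = w[outliers]                    # outliers pass through unquantized
    return w_hat
```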
4. Connections to Existing LLM Quantization
4.1 GPTQ and Optimal Theory
GPTQ approximates the Hessian-weighted layer-wise objective

$$\min_{\hat{W}} \; \|WX - \hat{W}X\|_F^2 \;=\; \min_{\hat{W}} \; \mathrm{tr}\!\big((W - \hat{W})\, H\, (W - \hat{W})^T\big), \qquad H = XX^T,$$

where $X$ collects calibration activations. This is related but not identical to the matmul rate-distortion objective. Key differences:
- GPTQ optimizes for reconstruction of $Wx$, not $W$ itself
- Uses calibration data to estimate the Hessian
- Greedy column-by-column quantization
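As a quick numerical check of the objective identity above (illustrative only, not GPTQ's actual implementation):

```python
import numpy as np

# Check: ||W X - What X||_F^2 == tr((W - What) H (W - What)^T) with H = X X^T.
rng = np.random.default_rng(0)
W, What = rng.standard_normal((8, 16)), rng.standard_normal((8, 16))
X = rng.standard_normal((16, 128))        # calibration activations
H = X @ X.T

lhs = np.linalg.norm(W @ X - What @ X) ** 2
rhs = np.trace((W - What) @ H @ (W - What).T)
print(np.isclose(lhs, rhs))               # True
```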
Gap from optimal: GPTQ doesn't exploit the time-sharing structure that is optimal at low rates.
4.2 Lattice-Based Methods
QuIP# and LDLQ use lattice quantizers:
- Employ $E_8$ lattice for 8-dimensional blocks
- Use random rotations for incoherence
- Closer to theoretically optimal for moderate rates
Remaining gap: These methods still don't implement optimal coordinate selection for very low rates.
4.3 Sparsity-Based Methods
Some methods (e.g., SparseGPT) combine quantization with pruning:
- Set some weights to exactly zero
- Quantize remaining weights
This is conceptually similar to time-sharing! But the selection criterion differs:
- Time-sharing: random selection
- Pruning: magnitude or importance-based selection
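For concreteness, the two selection rules side by side on a hypothetical weight vector:

```python
import numpy as np

# Illustrative comparison of the two selection criteria.
rng = np.random.default_rng(0)
w = rng.standard_normal(1024)
k = 256                                        # budget: keep 256 of 1024 entries

random_keep = rng.permutation(w.size)[:k]      # time-sharing: random subset
magnitude_keep = np.argsort(-np.abs(w))[:k]    # pruning: largest-magnitude entries
```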
5. Open Problems
5.1 Asymmetric Rates
Problem: What if $A$ (weights) can use rate $R_A$ and $B$ (activations) use rate $R_B \neq R_A$?
Motivation: In LLMs, weights are fixed but activations change per input. Different rate allocation may be optimal.
The optimal distortion-rate tradeoff for asymmetric rates remains unknown.
5.2 Non-Gaussian Optimality
Problem: Is $\Gamma(R)$ optimal for non-Gaussian i.i.d. sources?
- Known: Achievability via the same scheme
- Unknown: Whether a better scheme exists for specific distributions
Conjecture: For any i.i.d. source with finite fourth moment, $\Gamma(R)$ is asymptotically optimal.
5.3 Finite Block Length
Problem: What is the optimal distortion for finite $n$?
The current theory is asymptotic ($n \to \infty$). For practical LLMs:
- Block sizes of 64-256 are common
- Finite-length effects may be significant
Needed: Non-asymptotic bounds on $D^*(R, n)$.
5.4 Computational Complexity
Problem: Can optimal quantization be done efficiently?
Current status:
- Encoding: $O(n \log n)$ using FFT-based lattice decoding
- Shared randomness: Requires synchronized random number generators
Open: Is there a linear-time algorithm?
5.5 Multiple Matrix Products
Problem: What if we use the same quantized $A$ for multiple products $A^T B_1, A^T B_2, \ldots$?
Motivation: In LLMs, the same weight matrix is used for many different activations.
Unknown: Whether amortization over multiple products changes the optimal rate-distortion tradeoff.
6. Implications for LLM Design
6.1 Architecture Insights
The theory suggests:
- Smaller matrices are harder to quantize (finite-length effects)
- Memory bandwidth is key (confirms motivation)
- Sub-1-bit quantization needs new algorithms (time-sharing)
6.2 Algorithm Design Principles
For practitioners building quantization methods:
- Don't ignore low-rate regime: Standard scalar quantization is suboptimal below ~1 bit
- Sparsification can help: Quantizing fewer entries accurately can beat quantizing all entries coarsely
- Shared randomness is powerful: Dithering with a shared seed is what enables reaching the optimal distortion-rate tradeoff
6.3 Future Directions
- Matmul-aware training: Pre-train models to be amenable to matmul quantization
- Adaptive rate allocation: Learn which layers need more bits
- Hardware co-design: Build accelerators that support time-sharing schemes
7. Series Conclusion
Over four posts, we have:
- Post 1: Formulated the rate-distortion problem for matrix multiplication
- Post 2: Proved that classical VQ, sketching, and dithering fail
- Post 3: Derived the optimal $\Gamma(R)$ with phase transition at $R^*$
- Post 4: Discussed implementation and open problems
Summary:
- Memory bandwidth limits LLM inference
- Quantizing for matmul differs from classical VQ
- Optimal schemes exist with provable guarantees
- Many problems remain open
This theory provides a rigorous foundation for understanding and improving LLM quantization.
References
- Ordentlich, O., & Polyanskiy, Y. (2024). Optimal Quantization for Matrix Multiplication. arXiv:2410.13780.
- Frantar, E., et al. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.
- Chee, J., et al. (2023). QuIP: 2-Bit Quantization of Large Language Models With Guarantees.
This post is based on Professor Yuri Polyanskiy's lecture at the 2025 Princeton ML Theory Summer School and the paper "Optimal Quantization for Matrix Multiplication" by Ordentlich and Polyanskiy (2024).