Quantization for Matrix Multiplication, Part 2 of 4
Theory of AI

Quantization for Matrix Multiplication (2): Why Classical Methods Fail

Abstract

In the previous post, we formulated the rate-distortion problem for matrix multiplication. Now we prove that three natural approaches—deterministic vector quantization, sketching, and naive dithering—provably fail to achieve optimal distortion. These negative results motivate the development of fundamentally new techniques.


1. Recap: The Objective

Recall our goal: given vectors $A, B \in \mathbb{R}^n$ (matrix multiplication reduces to inner products, so we treat each row/column as a vector), we want to compress each to $nR$ bits and reconstruct the inner product $A^T B$.

Distortion measure:

$$D = \frac{1}{n} \mathbb{E}\left[(A^T B - \hat{C})^2\right]$$

For i.i.d. $\mathcal{N}(0,1)$ entries, we have $\mathbb{E}[|A^T B|^2] = n$, so we want distortion $D = O(1)$.

The optimal distortion-rate function $\Gamma(R)$ gives unnormalized MSE $\Gamma(R) \cdot n$, i.e., normalized distortion $D = \Gamma(R)$, where $\Gamma(R) < 1$ for $R > 0$.
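As a sanity check on this normalization, the baseline $\mathbb{E}[(A^T B)^2] = n$ is easy to verify by Monte Carlo (a quick sketch; the dimension and trial count are arbitrary):

```python
import numpy as np

# Monte Carlo check that E[(A^T B)^2] = n for i.i.d. N(0,1) entries,
# so the trivial estimator C_hat = 0 already has MSE n, i.e. D = 1.
rng = np.random.default_rng(0)
n, trials = 64, 50_000
A = rng.standard_normal((trials, n))
B = rng.standard_normal((trials, n))
ip = np.einsum("ij,ij->i", A, B)   # one inner product per trial
second_moment = np.mean(ip**2)      # should be close to n
```

Any scheme worth its bits must therefore beat MSE $n$.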


2. Failure of Deterministic Vector Quantization

2.1 The Setup

Consider any deterministic encoding scheme:

  • Encoder 1: $f_1: \mathbb{R}^n \to \{1, \ldots, 2^{nR}\}$
  • Encoder 2: $f_2: \mathbb{R}^n \to \{1, \ldots, 2^{nR}\}$
  • Decoder: $g: \{1, \ldots, 2^{nR}\}^2 \to \mathbb{R}$

The decoder reconstructs $\hat{C} = g(f_1(A), f_2(B))$.

2.2 Main Negative Result

Proposition 1 (Deterministic VQ Fails). Fix rate $R > 0$. There exist constants $\delta(R) > 0$ and $n_0(R)$ such that for all $n \geq n_0$, and for any deterministic encoders $f_1, f_2$ and decoder $g$, there exist vectors $A, B$ with $\|A\|_2 = \|B\|_2 = \sqrt{n}$ such that:

$$(A^T B - g(f_1(A), f_2(B)))^2 \geq \delta(R) \cdot n^2$$

2.3 Proof Sketch

The proof uses a volume argument.

Step 1: Counting codewords

Each encoder has at most $2^{nR}$ codewords. The joint codebook has at most $2^{2nR}$ pairs.

Step 2: Covering the sphere

Consider vectors on the sphere $\mathcal{S}^{n-1}(\sqrt{n})$ of radius $\sqrt{n}$. The product of two spheres has "dimension" roughly $2n$.

Step 3: Inner product variation

For two vectors $A, A'$ in the same Voronoi cell of encoder 1, and $B, B'$ in the same cell of encoder 2:

  • $g(f_1(A), f_2(B)) = g(f_1(A'), f_2(B'))$
  • But $A^T B$ and $(A')^T B'$ can differ by $\Omega(n)$

Step 4: Quantitative bound

Using metric entropy arguments, for any partition of the sphere into $2^{nR}$ cells, there exist points in the same cell with inner products differing by $\Omega(n)$.

The reason: deterministic VQ creates Voronoi cells. Within each cell, the decoder must output a single value, but inner products within the cell vary by $\Theta(n)$.
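A toy illustration of this obstruction (not the general proof, which must handle every codebook): take the 1-bit-per-coordinate sign quantizer, whose Voronoi cells are orthants, and exhibit two unit-sphere vectors in the same cell whose inner products with a fixed $B$ differ by $\Theta(n)$:

```python
import numpy as np

n = 1024
B  = np.ones(n)                        # ||B||_2 = sqrt(n)
A1 = np.ones(n)                        # ||A1||_2 = sqrt(n), positive orthant
A2 = np.full(n, 1e-3)                  # same orthant, mass on one coordinate
A2[0] = np.sqrt(n - (n - 1) * 1e-6)    # rescaled so ||A2||_2 = sqrt(n)

f1 = lambda x: np.sign(x)              # deterministic 1-bit/coordinate encoder
same_codeword = np.array_equal(f1(A1), f1(A2))
gap = A1 @ B - A2 @ B                  # Theta(n) inner-product spread in one cell
# The decoder outputs a single value for the shared codeword,
# so it errs by at least gap/2 on either A1 or A2.
```

Here `gap` is roughly $n - \sqrt{n}$, so the squared error on one of the two inputs is $\Omega(n^2)$, matching Proposition 1 for this particular encoder.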

2.4 Implications

This result shows that no matter how cleverly we design the codebook, deterministic VQ has worst-case error $\Omega(n^2)$, not $O(n)$.

The core issue: deterministic quantizers approximate vectors, but inner products are quadratic in vectors. Small errors in vectors become large errors in products.


3. Failure of Sketching

3.1 Random Projection Sketches

A natural idea: use random sketches to compress vectors.

Simple sketch: Let $U \sim \mathcal{N}(0, I_n)$ be a random direction. Compute:

  • $U_A = U^T A$ (scalar sketch of $A$)
  • $U_B = U^T B$ (scalar sketch of $B$)

Estimator: Use $\hat{C} = U_A \cdot U_B$ as an estimate of $A^T B$.

3.2 Variance Analysis

Proposition 2 (Sketching Variance). For the simple sketch:

$$\text{Var}[U_A U_B] = n^2(1 + \rho^2)$$

where $\rho = \frac{A^T B}{\|A\|\|B\|}$ is the cosine similarity.

Proof:

Let $U \sim \mathcal{N}(0, I_n)$. Then:

  • $U_A = U^T A \sim \mathcal{N}(0, \|A\|^2)$
  • $U_B = U^T B \sim \mathcal{N}(0, \|B\|^2)$
  • $\text{Cov}(U_A, U_B) = A^T B$

For jointly Gaussian $(U_A, U_B)$ with zero means:

$$\mathbb{E}[U_A^2 U_B^2] = \mathbb{E}[U_A^2]\mathbb{E}[U_B^2] + 2(\text{Cov}(U_A, U_B))^2$$

This gives:

$$\mathbb{E}[U_A^2 U_B^2] = \|A\|^2 \|B\|^2 + 2(A^T B)^2$$

Since $\mathbb{E}[U_A U_B] = A^T B$ (the estimator is unbiased):

$$\text{Var}[U_A U_B] = \mathbb{E}[U_A^2 U_B^2] - (\mathbb{E}[U_A U_B])^2 = \|A\|^2 \|B\|^2 + (A^T B)^2$$

For $\|A\| = \|B\| = \sqrt{n}$:

$$\text{Var}[U_A U_B] = n^2 + (A^T B)^2 = n^2(1 + \rho^2)$$
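The variance formula is easy to confirm numerically. A minimal simulation, fixing $A, B$ of norm $\sqrt{n}$ with a prescribed cosine similarity and drawing a fresh sketch direction per trial:

```python
import numpy as np

# Monte Carlo check of Var[U_A U_B] = n^2 (1 + rho^2) for one Gaussian sketch.
rng = np.random.default_rng(0)
n, rho, trials = 32, 0.5, 100_000
A = np.zeros(n); A[0] = np.sqrt(n)     # ||A|| = sqrt(n)
B = np.zeros(n)                        # constructed so cos(A, B) = rho
B[0] = rho * np.sqrt(n)
B[1] = np.sqrt(1 - rho**2) * np.sqrt(n)

U = rng.standard_normal((trials, n))   # fresh sketch direction per trial
est = (U @ A) * (U @ B)                # the estimator U_A * U_B
mean_est = est.mean()                  # unbiased: should be A @ B = rho * n
var_est = est.var()                    # should be n^2 * (1 + rho^2)
```

The estimator is unbiased, but its standard deviation is of order $n$, the same order as the quantity being estimated.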

3.3 Implications

Even with perfect access to $(U_A, U_B)$ (infinite precision), the MSE is $\Theta(n^2)$!

To reduce variance to $O(n)$: Need $\Omega(n)$ independent sketches, requiring $\Omega(n)$ bits total—far above the $O(nR)$ budget for small $R$.

Johnson-Lindenstrauss?: JL lemma preserves pairwise distances, but here we have a single pair with no concentration benefit.


4. Failure of Naive Dithered Quantization

4.1 Subtractive Dithering

A more sophisticated approach uses dithered quantization:

  1. Generate shared randomness $Z \sim \text{Uniform}[0, \Delta)^n$ (known to both encoder and decoder; $A$ and $B$ use independent dithers)
  2. Quantize $A_i + Z_i$ to a multiple of $\Delta$ and subtract the dither: $\hat{A}_i = \Delta \cdot \lfloor (A_i + Z_i)/\Delta \rfloor + \Delta/2 - Z_i$, so the error $\hat{A}_i - A_i$ is $\text{Uniform}[-\Delta/2, \Delta/2]$, independent of $A_i$
  3. Similarly for $B$
  4. Reconstruct: $\hat{C} = \hat{A}^T \hat{B}$
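The steps above can be sketched as follows. A minimal simulation assuming i.i.d. $\mathcal{N}(0,1)$ entries and independent dithers for $A$ and $B$ (and ignoring the clipping needed to keep the codebook finite):

```python
import numpy as np

def dither_quantize(x, delta, rng):
    """Subtractive dithered quantizer: the reconstruction error
    x_hat - x is uniform on [-delta/2, delta/2], independent of x."""
    z = rng.uniform(0.0, delta, size=x.shape)   # dither shared with the decoder
    # encoder sends the integer floor((x+z)/delta); decoder adds delta/2 - z
    return delta * np.floor((x + z) / delta) + delta / 2 - z

rng = np.random.default_rng(0)
n, trials, delta = 32, 50_000, 0.5
A = rng.standard_normal((trials, n))
B = rng.standard_normal((trials, n))
A_hat = dither_quantize(A, delta, rng)          # independent dithers for A, B
B_hat = dither_quantize(B, delta, rng)

err = np.ravel(A_hat - A)                       # per-coordinate error
# err has mean ~0 and variance ~delta^2/12, independent of A

C     = np.einsum("ij,ij->i", A, B)
C_hat = np.einsum("ij,ij->i", A_hat, B_hat)
mse = np.mean((C - C_hat)**2)
# with sigma_A^2 = sigma_B^2 = 1, the error variance delta^2/12 per
# coordinate predicts MSE = n*(delta^2/12)*2 + n*(delta^2/12)**2
```

The two linear error terms dominate; the quadratic cross term contributes only $n(\Delta^2/12)^2$.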

4.2 The Problem

Proposition 3 (Dithered Quantization Error). For subtractive dithered quantization with step size $\Delta$:

$$\mathbb{E}[(A^T B - \hat{A}^T \hat{B})^2] = \frac{n \Delta^2}{12}(\sigma_A^2 + \sigma_B^2) + \frac{n \Delta^4}{144}$$

where $\sigma_A^2, \sigma_B^2$ are the variances of $A_i, B_i$.

The issue: at rate $R$ bits per coordinate, a uniform quantizer has step size $\Delta \propto 2^{-R}$ (roughly). The dominant $\frac{n\Delta^2}{12}(\sigma_A^2 + \sigma_B^2)$ term then gives MSE $\propto 2^{-2R} \cdot n$: the right order in $n$, but with a constant that falls short of the optimal $\Gamma(R) \cdot n$ at low rates.

4.3 Why Dithering Alone Isn't Enough

The fundamental issue: dithered quantization treats each coordinate equally. For inner products:

$$\hat{A}^T \hat{B} - A^T B = (\hat{A} - A)^T B + A^T(\hat{B} - B) + (\hat{A} - A)^T(\hat{B} - B)$$

The cross-terms don't cancel, and the error accumulates across $n$ coordinates.


5. The Root Cause: Quadratic Distortion

5.1 Why All These Methods Fail

All three approaches share a common flaw: they treat the quantization problem linearly.

  • Deterministic VQ: Approximates $A \approx \hat{A}$, ignores that $A^T B$ is quadratic
  • Sketching: Linear projections don't capture quadratic structure
  • Dithering: Per-coordinate treatment ignores interaction terms

5.2 Requirements for Optimal Schemes

The optimal scheme must:

  1. Be randomized: To escape the deterministic lower bound
  2. Jointly consider $A$ and $B$: Even though they're encoded separately
  3. Exploit correlations: Use the structure of Gaussian distributions

6. A Glimpse of the Solution

The optimal approach uses time-sharing (or sparsification):

The idea: at low rates, it's better to:

  1. Select a random subset $S \subset [n]$ of coordinates
  2. Quantize those coordinates with high precision
  3. Ignore the rest

This achieves:

$$D = (1-\kappa) \cdot n + \kappa \cdot f_{\text{naive}}(R/\kappa) \cdot n$$

where $\kappa = |S|/n$ is the fraction selected.

Why this works: The ignored coordinates contribute variance $\propto (1-\kappa)n$, but the quantized coordinates are done well. The optimal $\kappa$ balances these terms.
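This trade-off can be made concrete with a small sketch. The `f_naive` below is a hypothetical per-coordinate distortion chosen only for illustration (a linear-error term $\propto 4^{-r}$ plus a quadratic cross term $\propto 16^{-r}$, mimicking the dithering analysis, so that $f_{\text{naive}}(0) > 1$ and quantizing every coordinate is wasteful at low rates); it is not the exact function from the paper:

```python
import numpy as np

def f_naive(r):
    # hypothetical normalized distortion of the naive scheme at rate r:
    # linear error terms ~ 4^-r plus a quadratic cross term ~ 16^-r
    return 2.0 * 4.0**(-r) + 16.0**(-r)

def timeshare_distortion(R, kappa):
    """Normalized distortion D/n when a fraction kappa of coordinates
    is quantized at rate R/kappa and the rest are set to zero."""
    return (1.0 - kappa) + kappa * f_naive(R / kappa)

R = 0.5
kappas = np.linspace(0.01, 1.0, 1000)
D = timeshare_distortion(R, kappas)     # vectorized over kappa
best_kappa = kappas[np.argmin(D)]       # interior optimum: 0 < kappa* < 1
```

Under this assumed `f_naive`, grid search finds an interior optimum: quantizing only a fraction of the coordinates at a higher per-coordinate rate strictly beats quantizing all of them ($\kappa = 1$).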


7. Conclusion

We have proven that three natural approaches fail:

| Method | MSE | Why It Fails |
| --- | --- | --- |
| Deterministic VQ | $\Omega(n^2)$ worst case | Voronoi cells have varying inner products |
| Sketching | $\Theta(n^2)$ variance | Single projection lacks concentration |
| Naive dithering | Suboptimal $O(n)$ | Doesn't exploit rate allocation |

Quantizing for matrix multiplication differs from classical VQ. The quadratic nature of the distortion requires new techniques.

In the next post, we will derive the optimal distortion-rate function $\Gamma(R)$ for Gaussian inner products and prove it achieves the information-theoretic limit.


This post is based on Professor Yuri Polyanskiy's lecture at the 2025 Princeton ML Theory Summer School and the paper "Optimal Quantization for Matrix Multiplication" by Ordentlich and Polyanskiy (2024).