Abstract
In the previous post, we formulated the rate-distortion problem for matrix multiplication. Now we prove that three natural approaches—deterministic vector quantization, sketching, and naive dithering—provably fail to achieve optimal distortion. These negative results motivate the development of fundamentally new techniques.
1. Recap: The Objective
Recall our goal: given $A, B \in \mathbb{R}^n$ (the matrices of the previous post, flattened to vectors so that the inner product $A^T B$ is the basic primitive), we want to compress each to $nR$ bits and reconstruct $A^T B$.
Distortion measure:
$$D = \mathbb{E}\left[\left(\hat{C} - A^T B\right)^2\right].$$
For i.i.d. $\mathcal{N}(0,1)$ entries, we have $\mathbb{E}[|A^T B|^2] = n$, so the meaningful target is $D = O(n)$, i.e., normalized distortion $D/n = O(1)$: even the trivial estimator $\hat{C} = 0$ achieves $D = n$.
The optimal distortion-rate function $\Gamma(R)$ achieves $D = \Gamma(R) \cdot n$ where $\Gamma(R) < 1$ for $R > 0$.
2. Failure of Deterministic Vector Quantization
2.1 The Setup
Consider any deterministic encoding scheme:
- Encoder 1: $f_1: \mathbb{R}^n \to \{1, \ldots, 2^{nR}\}$
- Encoder 2: $f_2: \mathbb{R}^n \to \{1, \ldots, 2^{nR}\}$
- Decoder: $g: \{1, \ldots, 2^{nR}\}^2 \to \mathbb{R}$
The decoder reconstructs $\hat{C} = g(f_1(A), f_2(B))$.
2.2 Main Negative Result
Proposition 1 (Deterministic VQ Fails). Fix rate $R > 0$. There exist constants $\delta(R) > 0$ and $n_0(R)$ such that for all $n \geq n_0$, and for any deterministic encoders $f_1, f_2$ and decoder $g$, there exist vectors $A, B$ with $|A|_2 = |B|_2 = \sqrt{n}$ such that:
$$\left| g(f_1(A), f_2(B)) - A^T B \right| \geq \delta(R) \cdot n.$$
2.3 Proof Sketch
The proof uses a volume argument.
Step 1: Counting codewords
Each encoder has at most $2^{nR}$ codewords. The joint codebook has at most $2^{2nR}$ pairs.
Step 2: Covering the sphere
Consider vectors on the sphere $\mathcal{S}^{n-1}(\sqrt{n})$ of radius $\sqrt{n}$. The product of two spheres has "dimension" roughly $2n$.
Step 3: Inner product variation
For two vectors $A, A'$ in the same Voronoi cell of encoder 1, and $B, B'$ in the same cell of encoder 2:
- $g(f_1(A), f_2(B)) = g(f_1(A'), f_2(B'))$
- But $A^T B$ and $(A')^T B'$ can differ by $\Omega(n)$
Step 4: Quantitative bound
Using metric entropy arguments, for any partition of the sphere into $2^{nR}$ cells, there exist points in the same cell with inner products differing by $\Omega(n)$.
The reason: deterministic VQ creates Voronoi cells. Within each cell, the decoder must output a single value, but inner products within the cell vary by $\Theta(n)$.
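To make the within-cell variation concrete, here is a small NumPy illustration (not the general proof, which holds for *any* encoder): for one specific deterministic quantizer, coordinate-wise rounding to a grid, we can exhibit two vectors in the same Voronoi cell whose inner products with a fixed $B$ differ by $\Theta(n)$. The grid quantizer and all parameters below are assumptions chosen for the demo.

```python
import numpy as np

# Demo for one concrete deterministic quantizer: round each coordinate
# down to a grid of step Delta = 1. We exhibit A, A' in the SAME cell
# (identical codeword) whose inner products with a fixed B differ by
# Theta(n). A decoder seeing only the codeword must err by Omega(n)
# on at least one of them.
rng = np.random.default_rng(0)
n = 10_000
Delta = 1.0

B = rng.standard_normal(n)
A = np.full(n, 0.5)                 # sits at the center of its cell
A_prime = A + 0.4 * np.sign(B)      # shift toward B, staying inside the cell

same_codeword = np.array_equal(np.floor(A / Delta), np.floor(A_prime / Delta))
gap = abs(A_prime @ B - A @ B)      # = 0.4 * ||B||_1, about 0.4 * n * sqrt(2/pi)

print(same_codeword, gap / n)       # True, roughly 0.32
```

The shift $A' = A + 0.4\,\mathrm{sign}(B)$ maximally increases the inner product while keeping every coordinate inside the unit cell, so the gap grows linearly in $n$ even though $A$ and $A'$ are encoded identically.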
2.4 Implications
This result shows that no matter how cleverly we design the codebook, deterministic VQ has worst-case error $\Omega(n^2)$, not $O(n)$.
The core issue: deterministic quantizers approximate vectors, but inner products are quadratic in vectors. Small errors in vectors become large errors in products.
3. Failure of Sketching
3.1 Random Projection Sketches
A natural idea: use random sketches to compress vectors.
Simple sketch: Let $U \sim \mathcal{N}(0, I_n)$ be a random direction. Compute:
- $U_A = U^T A$ (scalar sketch of $A$)
- $U_B = U^T B$ (scalar sketch of $B$)
Estimator: Use $\hat{C} = U_A \cdot U_B$ as an estimate of $A^T B$.
3.2 Variance Analysis
Proposition 2 (Sketching Variance). For the simple sketch:
$$\text{Var}(\hat{C}) = |A|^2 |B|^2 \left(1 + \rho^2\right),$$
where $\rho = \frac{A^T B}{|A| |B|}$ is the cosine similarity.
Proof:
Let $U \sim \mathcal{N}(0, I_n)$. Then:
- $U_A = U^T A \sim \mathcal{N}(0, |A|^2)$
- $U_B = U^T B \sim \mathcal{N}(0, |B|^2)$
- $\text{Cov}(U_A, U_B) = A^T B$
For jointly Gaussian $(U_A, U_B)$ with zero means, Isserlis' theorem gives:
$$\mathbb{E}[U_A^2 U_B^2] = \mathbb{E}[U_A^2]\, \mathbb{E}[U_B^2] + 2 \left(\mathbb{E}[U_A U_B]\right)^2.$$
This gives:
$$\mathbb{E}[U_A^2 U_B^2] = |A|^2 |B|^2 + 2 (A^T B)^2.$$
Since $\mathbb{E}[U_A U_B] = A^T B$ (the estimator is unbiased):
$$\text{Var}(\hat{C}) = \mathbb{E}[U_A^2 U_B^2] - (A^T B)^2 = |A|^2 |B|^2 + (A^T B)^2 = |A|^2 |B|^2 (1 + \rho^2).$$
For $|A| = |B| = \sqrt{n}$:
$$\text{Var}(\hat{C}) = n^2 (1 + \rho^2) = \Theta(n^2).$$
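A quick Monte Carlo sanity check of the variance formula (a sketch with illustrative parameters, assuming NumPy):

```python
import numpy as np

# Monte Carlo check that Var(U_A * U_B) = |A|^2 |B|^2 (1 + rho^2)
# for the one-scalar Gaussian sketch. Parameters are illustrative.
rng = np.random.default_rng(42)
n, trials = 100, 50_000

A = rng.standard_normal(n)
B = rng.standard_normal(n)
true_ip = A @ B
rho = true_ip / (np.linalg.norm(A) * np.linalg.norm(B))

U = rng.standard_normal((trials, n))   # a fresh random direction per trial
est = (U @ A) * (U @ B)                # hat{C} = U_A * U_B, one per trial

empirical_mse = np.mean((est - true_ip) ** 2)
predicted = np.linalg.norm(A)**2 * np.linalg.norm(B)**2 * (1 + rho**2)
# empirical_mse / predicted is close to 1; the MSE is Theta(n^2), not O(n)
```

Since the estimator is unbiased, the empirical MSE matches the predicted variance up to Monte Carlo noise, confirming the quadratic-in-$n$ scaling.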
3.3 Implications
Even with perfect access to $(U_A, U_B)$ (infinite precision), the MSE is $\Theta(n^2)$!
To reduce variance to $O(n)$: Need $\Omega(n)$ independent sketches, requiring $\Omega(n)$ bits total—far above the $O(nR)$ budget for small $R$.
Johnson-Lindenstrauss?: The JL lemma preserves many pairwise distances at once, but projecting to $m$ dimensions incurs additive inner-product error on the order of $|A||B|/\sqrt{m} = n/\sqrt{m}$; driving the squared error down to $O(n)$ again forces $m = \Omega(n)$.
4. Failure of Naive Dithered Quantization
4.1 Subtractive Dithering
A more sophisticated approach uses dithered quantization:
- Generate shared randomness $Z \sim \text{Uniform}[0, \Delta)^n$, known to both encoder and decoder
- Quantize and subtract the dither: $\hat{A}_i = \Delta \cdot \lfloor (A_i + Z_i)/\Delta \rfloor + \Delta/2 - Z_i$, so the error $\hat{A}_i - A_i$ is uniform on $(-\Delta/2, \Delta/2]$ and independent of $A$
- Similarly for $B$ (with an independent dither)
- Reconstruct: $\hat{C} = \hat{A}^T \hat{B}$
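The scheme above can be sketched and measured in a few lines of NumPy (a minimal sketch with illustrative parameters; the half-step offset in `quantize` is the standard centering that makes the error zero-mean):

```python
import numpy as np

# Minimal sketch of subtractive dithered quantization: quantize x + z on
# a Delta-grid, then subtract the shared dither z (with a half-step offset
# so the per-coordinate error is Uniform(-Delta/2, Delta/2], independent
# of the input). Parameters are illustrative.
rng = np.random.default_rng(7)
n, Delta, trials = 4096, 0.25, 2000

A = rng.standard_normal(n)
B = rng.standard_normal(n)

def quantize(x, z, Delta):
    return Delta * np.floor((x + z) / Delta) + Delta / 2 - z

errs = np.empty(trials)
for t in range(trials):
    zA = rng.uniform(0, Delta, n)   # fresh shared dither per trial
    zB = rng.uniform(0, Delta, n)
    C_hat = quantize(A, zA, Delta) @ quantize(B, zB, Delta)
    errs[t] = C_hat - A @ B

empirical_mse = np.mean(errs ** 2)
# Conditional-on-(A, B) error: (Delta^2/12)(|A|^2 + |B|^2) + n Delta^4/144
predicted = (Delta**2 / 12) * (np.linalg.norm(A)**2 + np.linalg.norm(B)**2) \
            + n * Delta**4 / 144
```

The empirical MSE matches the predicted decomposition into two cross-terms plus a dither-times-dither term, each accumulating across all $n$ coordinates.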
4.2 The Problem
Proposition 3 (Dithered Quantization Error). For subtractive dithered quantization with step size $\Delta$ and independent dithers for $A$ and $B$:
$$\mathbb{E}\left[(\hat{C} - A^T B)^2\right] = \frac{n \Delta^2}{12} \left(\sigma_A^2 + \sigma_B^2\right) + \frac{n \Delta^4}{144},$$
where $\sigma_A^2, \sigma_B^2$ are the variances of $A_i, B_i$.
The issue: at rate $R$ bits per coordinate, the step size must satisfy $\Delta \propto 2^{-R}$ (roughly), so the MSE is $\Theta(2^{-2R}) \cdot n$. This has the right order in $n$, but the per-coordinate constant falls short of the optimal $\Gamma(R)$ at low rates: spreading the rate budget uniformly over all $n$ coordinates is wasteful.
4.3 Why Dithering Alone Isn't Enough
The fundamental issue: dithered quantization treats each coordinate equally. Writing $\hat{A} = A + W_A$ and $\hat{B} = B + W_B$ for the independent, zero-mean quantization errors, the inner product expands as:
$$\hat{A}^T \hat{B} = A^T B + A^T W_B + W_A^T B + W_A^T W_B.$$
The cross-terms don't cancel, and the error accumulates across all $n$ coordinates.
5. The Root Cause: Quadratic Distortion
5.1 Why All These Methods Fail
All three approaches share a common flaw: they treat the quantization problem linearly.
- Deterministic VQ: Approximates $A \approx \hat{A}$, ignores that $A^T B$ is quadratic
- Sketching: Linear projections don't capture quadratic structure
- Dithering: Per-coordinate treatment ignores interaction terms
5.2 Requirements for Optimal Schemes
The optimal scheme must:
- Be randomized: To escape the deterministic lower bound
- Jointly consider $A$ and $B$: Even though they're encoded separately
- Exploit correlations: Use the structure of Gaussian distributions
6. A Glimpse of the Solution
The optimal approach uses time-sharing (also called sparsification). The idea: at low rates, it is better to:
- Select a random subset $S \subset [n]$ of coordinates
- Quantize those coordinates with high precision
- Ignore the rest
This achieves:
$$D \approx (1 - \kappa) \cdot n + \kappa n \cdot \varepsilon(R/\kappa),$$
where $\kappa = |S|/n$ is the fraction selected and $\varepsilon(R/\kappa)$ denotes the per-coordinate error of quantizing the selected coordinates at the boosted rate $R/\kappa$.
Why this works: The ignored coordinates contribute variance $\propto (1-\kappa)n$, but the quantized coordinates are done well. The optimal $\kappa$ balances these terms.
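The balance described above can be seen in a short simulation (a hedged sketch: the subset, the fine rounding grid standing in for "high precision", and all parameters are assumptions for illustration, not the paper's actual scheme):

```python
import numpy as np

# Sketch of the time-sharing / sparsification idea: keep a shared random
# fraction kappa of coordinates, quantize them on a fine grid (a stand-in
# for "high precision"), and drop the rest. For i.i.d. N(0,1) inputs the
# MSE is dominated by the dropped coordinates: roughly (1 - kappa) * n.
rng = np.random.default_rng(3)
n, kappa, Delta, trials = 1024, 0.5, 0.01, 2000   # illustrative parameters

errs = np.empty(trials)
for t in range(trials):
    A = rng.standard_normal(n)
    B = rng.standard_normal(n)
    S = rng.random(n) < kappa                 # shared random subset
    qA = Delta * np.round(A[S] / Delta)       # fine quantization on S only
    qB = Delta * np.round(B[S] / Delta)
    errs[t] = qA @ qB - A @ B                 # dropped coords drive the error

empirical_mse = np.mean(errs ** 2)
# Dominant term: variance of the dropped inner-product mass, (1 - kappa) * n
```

With the selected coordinates quantized finely, the measured MSE sits near $(1-\kappa)n$, which is exactly the trade-off the optimal choice of $\kappa$ balances against the quantization term.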
7. Conclusion
We have proven that three natural approaches fail:
| Method | MSE | Why It Fails |
|---|---|---|
| Deterministic VQ | $\Omega(n^2)$ worst case | Voronoi cells have varying inner products |
| Sketching | $\Theta(n^2)$ variance | Single projection lacks concentration |
| Naive dithering | Suboptimal $O(n)$ | Doesn't exploit rate allocation |
Quantizing for matrix multiplication differs from classical VQ. The quadratic nature of the distortion requires new techniques.
In the next post, we will derive the optimal distortion-rate function $\Gamma(R)$ for Gaussian inner products and prove it achieves the information-theoretic limit.
This post is based on Professor Yuri Polyanskiy's lecture at the 2025 Princeton ML Theory Summer School and the paper "Optimal Quantization for Matrix Multiplication" by Ordentlich and Polyanskiy (2024).