Abstract
In the previous post, we formulated the rate-distortion problem for matrix multiplication. Now we prove that three natural approaches—deterministic vector quantization, sketching, and naive dithering—provably fail to achieve optimal distortion. These negative results motivate the development of fundamentally new techniques.
1. Recap: The Objective
Recall our goal: given matrices with i.i.d. entries, we compress each to $R$ bits per entry and reconstruct their product. Treating a row of one matrix and a column of the other as vectors $X, Y \in \mathbb{R}^n$, this amounts to compressing each vector to $nR$ bits and reconstructing the inner product $\langle X, Y\rangle$.
Distortion measure: $D = \mathbb{E}\big[\big(\langle X, Y\rangle - \widehat{\langle X, Y\rangle}\big)^2\big]$.
For i.i.d. $\mathcal{N}(0,1)$ entries, we have $\mathbb{E}[\langle X, Y\rangle^2] = n$, so we want distortion $\ll n$ (the trivial estimator that always outputs zero already achieves $n$).
The optimal distortion-rate function achieves $D^*(R) = n\big(2\varepsilon - \varepsilon^2\big)$, where $\varepsilon = 2^{-2R}$ is the classical Gaussian distortion-rate function, for $R$ above a threshold rate; below that threshold the time-sharing scheme of Section 6 takes over.
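To see where the $2\varepsilon - \varepsilon^2$ shape comes from, here is a quick Monte-Carlo sanity check (mine, not from the lecture). It simulates the standard Gaussian forward test channel at per-coordinate distortion $D = \varepsilon(R)$ and confirms that the resulting inner-product error per coordinate is $2D - D^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 2000
R = 1.0
D = 2.0 ** (-2 * R)          # Gaussian distortion-rate function at rate R

errs = []
for _ in range(trials):
    X = rng.standard_normal(n)
    Y = rng.standard_normal(n)
    # Forward Gaussian test channel: Xhat = (1-D)*X + sqrt(D(1-D))*W,
    # so E[(X - Xhat)^2] = D per coordinate and Xhat is independent of X - Xhat.
    Xhat = (1 - D) * X + np.sqrt(D * (1 - D)) * rng.standard_normal(n)
    Yhat = (1 - D) * Y + np.sqrt(D * (1 - D)) * rng.standard_normal(n)
    errs.append((X @ Y - Xhat @ Yhat) ** 2)

print("empirical  MSE / n:", np.mean(errs) / n)   # approx 2D - D^2
print("predicted 2D - D^2:", 2 * D - D ** 2)
```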
2. Failure of Deterministic Vector Quantization
2.1 The Setup
Consider any deterministic encoding scheme:
- Encoder 1: $f_1 : \mathbb{R}^n \to \{1, \dots, 2^{nR}\}$, applied to $X$
- Encoder 2: $f_2 : \mathbb{R}^n \to \{1, \dots, 2^{nR}\}$, applied to $Y$
- Decoder: $g : \{1, \dots, 2^{nR}\} \times \{1, \dots, 2^{nR}\} \to \mathbb{R}$
The decoder reconstructs $\widehat{\langle X, Y\rangle} = g(f_1(X), f_2(Y))$.
2.2 Main Negative Result
Proposition 1 (Deterministic VQ Fails). Fix rate $R > 0$. There exist constants $c > 0$ and $n_0$ such that for all $n \ge n_0$, and for any deterministic encoders $f_1, f_2$ and decoder $g$, there exist vectors $X, Y$ with $\|X\| = \|Y\| = \sqrt{n}$ such that:
$$\big|\,\langle X, Y\rangle - g(f_1(X), f_2(Y))\,\big| \;\ge\; c\, n\, 2^{-R}.$$
2.3 Proof Sketch
The proof uses a volume argument.
Step 1: Counting codewords
Each encoder has at most $2^{nR}$ codewords. The joint codebook has at most $2^{2nR}$ pairs.
Step 2: Covering the sphere
Consider vectors on the sphere of radius $\sqrt{n}$. The product of two such spheres has "dimension" roughly $2n$.
Step 3: Inner product variation
For two vectors $X, X'$ in the same Voronoi cell of encoder 1, and $Y, Y'$ in the same cell of encoder 2:
- The decoder output is the same: $g(f_1(X), f_2(Y)) = g(f_1(X'), f_2(Y'))$
- But $\langle X, Y\rangle$ and $\langle X', Y'\rangle$ can differ by $\Omega(n\,2^{-R})$ (cell diameter $\Omega(\sqrt{n}\,2^{-R})$ times vector norm $\sqrt{n}$)
Step 4: Quantitative bound
Using metric entropy arguments, for any partition of the sphere into $2^{nR}$ cells, some cell contains two points at distance $\Omega(\sqrt{n}\,2^{-R})$, and hence points whose inner products with a suitably aligned $Y$ differ by $\Omega(n\,2^{-R})$.
The reason: deterministic VQ creates Voronoi cells. Within each cell, the decoder must output a single value, but inner products within the cell vary by $\Omega(n\,2^{-R})$.
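To make the within-cell variation concrete, here is a toy illustration (my own, with hypothetical parameters) using the simplest deterministic quantizer—per-coordinate rounding with step $\Delta$. Two vectors that receive the same codeword can have inner products with the same $Y$ that differ by $\Theta(n\Delta)$, so the single value the decoder outputs must be far from at least one of them:

```python
import numpy as np

rng = np.random.default_rng(0)
n, Delta = 1000, 0.5                         # dimension and (coarse) quantizer step

# Simplest deterministic quantizer: round each coordinate to the nearest
# multiple of Delta.  Its Voronoi cells are axis-aligned cubes of side Delta.
def encode(v):
    return np.round(v / Delta).astype(int)

codeword = rng.integers(-3, 4, size=n)       # pick an arbitrary cell
X  = (codeword - 0.49) * Delta               # one corner of that cell
Xp = (codeword + 0.49) * Delta               # the opposite corner
assert np.array_equal(encode(X), encode(Xp)) # same codeword => same decoder output

Y = np.ones(n)                               # ||Y|| = sqrt(n), aligned with Xp - X
print("gap in inner products:", abs(X @ Y - Xp @ Y))   # = 0.98 * Delta * n
```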
2.4 Implications
This result shows that no matter how cleverly we design the codebook, deterministic VQ has worst-case error $\Omega(n\,2^{-R})$, not the $O(\sqrt{n}\,2^{-R})$ scale of the optimal scheme's typical (root-mean-square) error.
The core issue: deterministic quantizers approximate vectors, but inner products are quadratic in vectors. Small errors in vectors become large errors in products.
3. Failure of Sketching
3.1 Random Projection Sketches
A natural idea: use random sketches to compress vectors.
Simple sketch: Let $u \sim \mathcal{N}(0, I_n)$ be a random (Gaussian) direction shared by both encoders. Compute:
- $a = \langle u, X\rangle$ (scalar sketch of $X$)
- $b = \langle u, Y\rangle$ (scalar sketch of $Y$)
Estimator: Use $ab$ as an estimate of $\langle X, Y\rangle$.
3.2 Variance Analysis
Proposition 2 (Sketching Variance). For the simple sketch:
$$\mathrm{Var}(ab) = \|X\|^2\,\|Y\|^2\,(1 + \rho^2),$$
where $\rho = \dfrac{\langle X, Y\rangle}{\|X\|\,\|Y\|}$ is the cosine similarity.
Proof:
Let $a = \langle u, X\rangle$ and $b = \langle u, Y\rangle$. Then $(a, b)$ are jointly Gaussian with zero means, $\mathbb{E}[a^2] = \|X\|^2$, $\mathbb{E}[b^2] = \|Y\|^2$, and $\mathbb{E}[ab] = \langle X, Y\rangle$.
For jointly Gaussian $(a, b)$ with zero means (Isserlis' theorem):
$$\mathbb{E}[a^2 b^2] = \mathbb{E}[a^2]\,\mathbb{E}[b^2] + 2\,\big(\mathbb{E}[ab]\big)^2.$$
This gives:
$$\mathbb{E}[a^2 b^2] = \|X\|^2\|Y\|^2 + 2\,\langle X, Y\rangle^2.$$
Since $\mathbb{E}[ab] = \langle X, Y\rangle$ (the estimator is unbiased):
$$\mathrm{Var}(ab) = \mathbb{E}[a^2b^2] - \langle X, Y\rangle^2 = \|X\|^2\|Y\|^2 + \langle X, Y\rangle^2 = \|X\|^2\|Y\|^2\,(1 + \rho^2).$$
For $\|X\|^2 = \|Y\|^2 = n$: $\mathrm{Var}(ab) = n^2(1 + \rho^2) = \Theta(n^2)$.
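A quick Monte-Carlo check of this calculation (my own; parameters are illustrative, and $u$ is standard Gaussian so that $ab$ is unbiased):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 200, 20_000

X = rng.standard_normal(n)
Y = rng.standard_normal(n)
rho = (X @ Y) / (np.linalg.norm(X) * np.linalg.norm(Y))   # cosine similarity

U = rng.standard_normal((trials, n))   # one random direction u per trial
a = U @ X                              # scalar sketches of X
b = U @ Y                              # scalar sketches of Y
est = a * b                            # unbiased estimates of <X, Y>

print("empirical Var(ab):", est.var())
print("predicted        :", (X @ X) * (Y @ Y) * (1 + rho ** 2))
```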
3.3 Implications
Even with perfect access to $a$ and $b$ (infinite precision), the MSE is $\Theta(n^2)$—while the trivial estimator that always outputs zero already achieves $n$!
To reduce variance to $O(n)$: Need $\Omega(n)$ independent sketches, requiring $\Omega(n)$ bits total—far above the budget $nR$ for small $R$.
Johnson-Lindenstrauss?: JL lemma preserves pairwise distances, but here we have a single pair with no concentration benefit.
4. Failure of Naive Dithered Quantization
4.1 Subtractive Dithering
A more sophisticated approach uses dithered quantization:
- Generate shared randomness: a dither $U \sim \mathrm{Unif}\,[0, \Delta)^n$ known to both encoder and decoder
- Quantize: $Q_\Delta(X + U)$, where $Q_\Delta$ rounds each coordinate to the nearest multiple of $\Delta$
- Similarly for $Y$, with an independent dither $U'$
- Reconstruct: $\hat X = Q_\Delta(X + U) - U$, $\hat Y = Q_\Delta(Y + U') - U'$, and output $\langle \hat X, \hat Y\rangle$
4.2 The Problem
Proposition 3 (Dithered Quantization Error). For subtractive dithered quantization with step size $\Delta$ and $X, Y$ with i.i.d. unit-variance entries:
$$\mathbb{E}\Big[\big(\langle \hat X, \hat Y\rangle - \langle X, Y\rangle\big)^2\Big] = n\,\big(\sigma_X^2 + \sigma_Y^2 + \sigma_X^2\,\sigma_Y^2\big),$$
where $\sigma_X^2 = \sigma_Y^2 = \Delta^2/12$ are the variances of the effective quantization noises $Z_X = \hat X - X$ and $Z_Y = \hat Y - Y$.
The issue: for rate $R$, we need $\Delta \approx c\,2^{-R}$ per coordinate (roughly), so $\sigma_X^2, \sigma_Y^2 \approx c'\,2^{-2R}$. But the constant in the $n(\sigma_X^2 + \sigma_Y^2)$ term is bounded away from the optimal one, and at low rates the MSE even exceeds the trivial benchmark $n$—so naive dithering does not achieve the optimal $D^*(R)$ for low rates.
4.3 Why Dithering Alone Isn't Enough
The fundamental issue: dithered quantization treats each coordinate equally. For inner products:
$$\langle \hat X, \hat Y\rangle = \langle X, Y\rangle + \langle X, Z_Y\rangle + \langle Z_X, Y\rangle + \langle Z_X, Z_Y\rangle.$$
The cross-terms $\langle X, Z_Y\rangle$ and $\langle Z_X, Y\rangle$ don't cancel, and their variance accumulates across all $n$ coordinates.
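A small simulation of the scheme above (step size and dither distribution are illustrative choices) that checks this error decomposition and the formula in Proposition 3 for unit-variance i.i.d. entries:

```python
import numpy as np

rng = np.random.default_rng(2)
n, Delta, trials = 1000, 0.5, 2000
sigma2 = Delta ** 2 / 12                 # variance of the effective dither noise

errs = []
for _ in range(trials):
    X = rng.standard_normal(n)
    Y = rng.standard_normal(n)
    U  = rng.uniform(0, Delta, size=n)   # shared dither for X
    Up = rng.uniform(0, Delta, size=n)   # independent shared dither for Y
    # Subtractive dithering: add the dither, round to the lattice, subtract it back.
    Xhat = Delta * np.round((X + U) / Delta) - U
    Yhat = Delta * np.round((Y + Up) / Delta) - Up
    errs.append((X @ Y - Xhat @ Yhat) ** 2)

print("empirical MSE     :", np.mean(errs))
print("predicted formula :", n * (2 * sigma2 + sigma2 ** 2))
```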
5. The Root Cause: Quadratic Distortion
5.1 Why All These Methods Fail
All three approaches share a common flaw: they treat the quantization problem linearly.
- Deterministic VQ: Approximates $X$ and $Y$, ignores that $\langle X, Y\rangle$ is quadratic in them
- Sketching: Linear projections don't capture quadratic structure
- Dithering: Per-coordinate treatment ignores interaction terms
5.2 Requirements for Optimal Schemes
The optimal scheme must:
- Be randomized: To escape the deterministic lower bound
- Jointly consider $X$ and $Y$: Even though they're encoded separately
- Exploit correlations: Use the structure of Gaussian distributions
6. A Glimpse of the Solution
The optimal approach uses time-sharing (or sparsification):
The idea: at low rates, it's better to:
- Select a random subset of coordinates
- Quantize those coordinates with high precision
- Ignore the rest
This achieves:
$$D(p) \;\approx\; (1-p)\,n \;+\; p\,n\,\big(2\varepsilon_p - \varepsilon_p^2\big), \qquad \varepsilon_p = 2^{-2R/p},$$
where $p$ is the fraction of coordinates selected (each selected coordinate gets rate $R/p$).
Why this works: The ignored coordinates contribute variance $(1-p)\,n$, but the quantized coordinates are done well (each at the boosted rate $R/p$). The optimal $p$ balances these two terms.
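A numerical sketch of this trade-off, assuming (as above) the per-coordinate error $2\varepsilon_p - \varepsilon_p^2$ for the selected coordinates—the exact constants and threshold are in the paper. Sweeping $p$ shows that at low rates the minimizing $p$ drops below 1, while at higher rates $p = 1$ is best:

```python
import numpy as np

# Per-coordinate distortion of the time-sharing scheme, assuming the selected
# coordinates incur the high-rate error 2*eps - eps^2 with eps = 2^{-2R/p},
# while the ignored (1 - p)-fraction contributes its full variance.
def distortion(R, p):
    eps = 2.0 ** (-2 * R / p)
    return (1 - p) + p * (2 * eps - eps ** 2)

ps = np.linspace(0.01, 1.0, 1000)
for R in [0.1, 0.25, 0.5, 1.0, 2.0]:
    d = distortion(R, ps)
    p_star = ps[np.argmin(d)]
    print(f"R={R:4.2f}:  best p = {p_star:.2f},  D/n = {d.min():.3f}   "
          f"(p = 1 gives {distortion(R, 1.0):.3f})")
```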
7. Conclusion
We have proven that three natural approaches fail:
| Method | MSE | Why It Fails |
|---|---|---|
| Deterministic VQ | $\Omega(n^2\,2^{-2R})$ worst case | Voronoi cells have varying inner products |
| Sketching | $\Theta(n^2)$ variance | Single projection lacks concentration |
| Naive dithering | Suboptimal | Doesn't exploit rate allocation |
Quantizing for matrix multiplication differs from classical VQ. The quadratic nature of the distortion requires new techniques.
In the next post, we will derive the optimal distortion-rate function for Gaussian inner products and prove it achieves the information-theoretic limit.
This post is based on Professor Yuri Polyanskiy's lecture at the 2025 Princeton ML Theory Summer School and the paper "Optimal Quantization for Matrix Multiplication" by Ordentlich and Polyanskiy (2024).