Abstract
Having established that classical methods fail, we now derive the optimal distortion-rate function $\Gamma(R)$ for inner products of Gaussian vectors. This function exhibits a surprising phase transition at $R^* \approx 0.906$ bits per entry. We prove both achievability (via time-sharing) and the converse (information-theoretic lower bound), then extend the result to matrix multiplication.
1. Setup and Main Result
1.1 The Inner Product Rate-Distortion Problem
Let $U, V \in \mathbb{R}^n$ be independent with i.i.d. $\mathcal{N}(0,1)$ entries. We want to:
- Encode $U$ into $nR_U$ bits: $f_U: \mathbb{R}^n \to \{0,1\}^{nR_U}$
- Encode $V$ into $nR_V$ bits: $f_V: \mathbb{R}^n \to \{0,1\}^{nR_V}$
- Decode to estimate $U^T V$: $g: \{0,1\}^{nR_U} \times \{0,1\}^{nR_V} \to \mathbb{R}$
Distortion: Normalized MSE,
$$D = \frac{\mathbb{E}\big[(\hat{C} - U^T V)^2\big]}{\mathbb{E}\big[(U^T V)^2\big]} = \frac{1}{n}\,\mathbb{E}\big[(\hat{C} - U^T V)^2\big],$$
since $\mathbb{E}[(U^T V)^2] = n$.
Goal: Find $D^*_{\text{ip}}(R)$, the minimum achievable distortion at symmetric rate $R = R_U = R_V$.
1.2 The Main Theorem
Theorem 1 (Gaussian Inner Product Rate-Distortion). The optimal distortion-rate function for Gaussian inner products is
$$D^*_{\text{ip}}(R) = \Gamma(R),$$
where
$$\Gamma(R) = \begin{cases} 1 - \dfrac{1 - f_{\text{naive}}(R^*)}{R^*}\, R, & R < R^*, \\[4pt] f_{\text{naive}}(R), & R \geq R^*, \end{cases}$$
with:
- $f_{\text{naive}}(R) = 2 \cdot 2^{-2R} - 2^{-4R}$
- $R^* \approx 0.906$ bits per entry
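The piecewise formula is straightforward to evaluate numerically. A minimal sketch (the function names are ours; `R_STAR` hard-codes the critical rate quoted above):

```python
def f_naive(R):
    # distortion of the naive scheme: quantize every coordinate at rate R
    return 2 * 2.0**(-2 * R) - 2.0**(-4 * R)

R_STAR = 0.906  # critical rate, in bits per entry

def gamma(R):
    # Theorem 1: linear (time-sharing) branch below R*, naive branch above
    if R >= R_STAR:
        return f_naive(R)
    return 1 - (1 - f_naive(R_STAR)) * R / R_STAR
```

Note that $\Gamma(0) = 1$ (no bits, normalized distortion 1) and the two branches agree at $R = R^*$ by construction.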
2. The Naive Scheme: A Baseline
2.1 Scalar Gaussian Quantization
First, consider quantizing each coordinate independently using optimal scalar Gaussian quantization.
Scalar rate-distortion: For a Gaussian source with variance $\sigma^2$, the distortion-rate function is
$$D(R) = \sigma^2 \cdot 2^{-2R}.$$
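For a concrete sense of scale, here is a small sketch (ours) comparing $D(R)$ against what a one-bit scalar quantizer actually achieves. The Shannon bound $2^{-2R}$ is attained only in the blocklength limit (vector quantization); the best one-bit scalar quantizer for $\mathcal{N}(0,1)$, $\hat{x} = \mathrm{sign}(x)\sqrt{2/\pi}$, falls short of it:

```python
import numpy as np

def d_gauss(R, sigma2=1.0):
    # Shannon distortion-rate function for a Gaussian source
    return sigma2 * 2.0**(-2 * R)

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000_000)
# optimal 1-bit scalar quantizer: levels at the conditional means +-sqrt(2/pi)
x_hat = np.sign(x) * np.sqrt(2 / np.pi)
mse_1bit = np.mean((x - x_hat) ** 2)  # ~ 1 - 2/pi ~ 0.363, vs D(1) = 0.25
```

The gap (0.363 vs 0.25 at one bit) is why the analysis below treats $2^{-2R}$ as the operationally optimal, blocklength-asymptotic figure.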
2.2 Naive Inner Product Estimation
Apply scalar quantization with rate $R$ to each coordinate:
- $\hat{U}_i = U_i + N_i^U$, where $N_i^U$ is quantization noise with variance $2^{-2R}$
- $\hat{V}_i = V_i + N_i^V$, where $N_i^V$ is quantization noise with variance $2^{-2R}$

For the optimal quantizer the noise is not independent of the source: the test channel satisfies $U_i = \hat{U}_i + Z_i$ with $Z_i$ independent of $\hat{U}_i$, which implies $\mathbb{E}[U_i N_i^U] = -2^{-2R}$. This correlation matters below.
Estimator: $\hat{C} = \hat{U}^T \hat{V}$
2.3 Error Analysis
The error decomposes as
$$\hat{C} - U^T V = U^T N^V + (N^U)^T V + (N^U)^T N^V.$$
Computing the MSE term by term:
Term 1: $\mathbb{E}\big[(U^T N^V)^2\big] = n \cdot 2^{-2R}$, since $N^V$ is independent of $U$.
Term 2: Similarly, $\mathbb{E}\big[((N^U)^T V)^2\big] = n \cdot 2^{-2R}$.
Term 3: $\mathbb{E}\big[((N^U)^T N^V)^2\big] = n \cdot 2^{-4R}$.
Total MSE:
$$\mathbb{E}\big[(\hat{C} - U^T V)^2\big] = 2n \cdot 2^{-2R} - n \cdot 2^{-4R}.$$
(The signal-noise cross-terms do not all vanish: because the optimal quantizer's error is negatively correlated with the source, $\mathbb{E}[U_i N_i^U] = -2^{-2R}$, the cross-terms contribute $-2n \cdot 2^{-4R}$ in total; combined with Term 3 this amounts to subtracting one copy of Term 3.)
Normalized distortion:
$$D_{\text{naive}}(R) = \frac{1}{n}\,\mathbb{E}\big[(\hat{C} - U^T V)^2\big] = 2 \cdot 2^{-2R} - 2^{-4R} = f_{\text{naive}}(R).$$
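The total can be checked by simulation. The sketch below (ours) models the optimal quantizer by its Gaussian test channel $\hat{U} = (1-D)U + \sqrt{D(1-D)}\,W$ with $D = 2^{-2R}$, an idealized stand-in rather than an implementable quantizer; the measured normalized MSE should land near $f_{\text{naive}}(R)$:

```python
import numpy as np

def quantize_test_channel(x, R, rng):
    # Idealized model of rate-R Gaussian quantization (an assumption, not an
    # actual quantizer): x_hat = (1-D) x + sqrt(D(1-D)) w with D = 2^{-2R},
    # so that x = x_hat + z with z independent of x_hat and Var(z) = D.
    D = 2.0**(-2 * R)
    return (1 - D) * x + np.sqrt(D * (1 - D)) * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
trials, n, R = 4000, 500, 1.0
U = rng.standard_normal((trials, n))
V = rng.standard_normal((trials, n))
C = np.einsum('ti,ti->t', U, V)
C_hat = np.einsum('ti,ti->t',
                  quantize_test_channel(U, R, rng),
                  quantize_test_channel(V, R, rng))
mse = np.mean((C - C_hat) ** 2) / n  # normalized distortion
# should be close to f_naive(1) = 2*2^{-2} - 2^{-4} = 0.4375
```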
3. Time-Sharing: Beating the Naive Scheme
3.1 The Key Insight
At low rates, the naive scheme is suboptimal. The insight: it's better to quantize a fraction of coordinates perfectly than to quantize all coordinates poorly.
3.2 Time-Sharing Scheme
Parameters:
- $\kappa \in [0, 1]$: fraction of coordinates to quantize
- Effective rate per quantized coordinate: $R/\kappa$
Algorithm:
- Randomly select subset $S \subset [n]$ with $|S| = \kappa n$
- For $i \in S$: quantize $U_i, V_i$ at rate $R/\kappa$
- For $i \notin S$: send nothing (set $\hat{U}_i = \hat{V}_i = 0$)
- Estimate: $\hat{C} = \sum_{i \in S} \hat{U}_i \hat{V}_i$
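The steps above can be sketched as follows (our construction; the idealized Gaussian test channel again stands in for an actual rate-$R/\kappa$ quantizer), comparing the simulated distortion at $R = 0.5$ with the prediction $(1-\kappa) + \kappa\, f_{\text{naive}}(R/\kappa)$ derived in the next subsection:

```python
import numpy as np

def f_naive(R):
    return 2 * 2.0**(-2 * R) - 2.0**(-4 * R)

def time_sharing_estimate(U, V, R, kappa, rng):
    # Quantize only a random kappa-fraction of coordinates, at rate R/kappa
    # each, modeled by the idealized Gaussian test channel (an assumption).
    n = U.shape[-1]
    S = rng.permutation(n)[:int(kappa * n)]          # random subset, |S| = kappa*n
    D = 2.0**(-2 * R / kappa)                        # per-coordinate distortion
    noisy = lambda X: (1 - D) * X + np.sqrt(D * (1 - D)) * rng.standard_normal(X.shape)
    return np.einsum('ti,ti->t', noisy(U[:, S]), noisy(V[:, S]))

rng = np.random.default_rng(1)
trials, n, R = 4000, 500, 0.5
kappa = R / 0.906                                    # optimal fraction kappa* = R/R*
U = rng.standard_normal((trials, n))
V = rng.standard_normal((trials, n))
C = np.einsum('ti,ti->t', U, V)
D_hat = np.mean((C - time_sharing_estimate(U, V, R, kappa, rng)) ** 2) / n
D_pred = (1 - kappa) + kappa * f_naive(R / kappa)    # ~0.718, below f_naive(0.5)=0.75
```

Already at half a bit per entry, time-sharing beats quantizing everything.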
3.3 Distortion Analysis
Contribution from unquantized coordinates ($i \notin S$):
$$\mathbb{E}\Big[\Big(\sum_{i \notin S} U_i V_i\Big)^2\Big] = (1-\kappa)\, n.$$
(Since $\mathbb{E}[(U_i V_i)^2] = 1$ and the terms are independent with mean zero.)
Contribution from quantized coordinates ($i \in S$):
$$\kappa n \cdot f_{\text{naive}}(R/\kappa).$$
Total distortion (normalized by $n$):
$$D(\kappa) = (1-\kappa) + \kappa\, f_{\text{naive}}(R/\kappa).$$
3.4 Optimizing Over $\kappa$
The achievable distortion is
$$D_{\text{TS}}(R) = \min_{\kappa \in (0,1]} \big[(1-\kappa) + \kappa\, f_{\text{naive}}(R/\kappa)\big].$$
Calculus: Taking the derivative with respect to $\kappa$ and setting it to zero, with $\rho = R/\kappa$:
$$-1 + f_{\text{naive}}(\rho) - \rho\, f'_{\text{naive}}(\rho) = 0.$$
This is the critical point equation. Its solution $\rho = R^*$ does not depend on $R$, so the optimal allocation is $\kappa^*(R) = \min\{R/R^*,\, 1\}$.
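The critical point equation can be solved numerically. A short sketch (names like `critical_equation` are ours) recovers $R^* \approx 0.906$ by bisection, using the closed-form derivative of $f_{\text{naive}}$:

```python
from math import log

LN2 = log(2.0)

def f_naive(R):
    return 2 * 2.0**(-2 * R) - 2.0**(-4 * R)

def f_naive_prime(R):
    # d/dR [2*2^{-2R} - 2^{-4R}] = 4 ln2 (2^{-4R} - 2^{-2R})
    return 4 * LN2 * (2.0**(-4 * R) - 2.0**(-2 * R))

def critical_equation(rho):
    # stationarity in kappa, written in rho = R/kappa:
    # f(rho) - rho f'(rho) - 1 = 0
    return f_naive(rho) - rho * f_naive_prime(rho) - 1

# The left-hand side is positive at rho = 0.5 and negative at rho = 1.2,
# so we can bisect on that bracketing interval.
lo, hi = 0.5, 1.2
for _ in range(60):
    mid = (lo + hi) / 2
    if critical_equation(mid) > 0:
        lo = mid
    else:
        hi = mid
R_star = (lo + hi) / 2  # ~0.906
```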
4. The Phase Transition
4.1 Two Regimes
Regime 1: High rate ($R \geq R^*$)
- Optimal: $\kappa^* = 1$ (quantize all coordinates)
- Distortion: $\Gamma(R) = f_{\text{naive}}(R)$
Regime 2: Low rate ($R < R^*$)
- Optimal: $\kappa^* = R/R^* < 1$ (quantize fraction of coordinates)
- Distortion: $\Gamma(R) = 1 - \frac{1 - f_{\text{naive}}(R^*)}{R^*} \cdot R$
4.2 The Critical Rate $R^*$
The critical rate satisfies
$$1 - f_{\text{naive}}(R^*) = -R^*\, f'_{\text{naive}}(R^*);$$
geometrically, the tangent line to $f_{\text{naive}}$ at $R^*$ passes through the point $(0, 1)$.
Numerically: $R^* \approx 0.906$ bits per entry.
At $R^*$, the slope of the time-sharing line matches the slope of the naive curve, so the transition is continuous and even differentiable; the non-smoothness appears only in the second derivative.
4.3 Interpretation
Low rate regime:
- Bits are "expensive"
- Better to concentrate bits on fewer coordinates
- Linear decay: every bit reduces distortion by the same amount
High rate regime:
- Bits are "cheap"
- Quantize all coordinates uniformly
- Exponential decay from standard rate-distortion
5. Converse: Information-Theoretic Lower Bound
5.1 The Key Lemma
Lemma (Lower Bound). For any encoding scheme with symmetric rate $R = R_U = R_V$:
$$D \geq \Gamma(R).$$
5.2 Proof Sketch
The proof uses single-letterization and supporting hyperplane arguments.
Step 1: Mutual Information Bound
For reliable reconstruction with distortion $D$, a Shannon-lower-bound argument for the source $U^T V$ (which has variance $n$ and is asymptotically Gaussian) gives
$$I(U^T V; \hat{C}) \geq \frac{1}{2}\log\frac{1}{D} - O(1),$$
where $\hat{C}$ achieves distortion $D$.
Step 2: Rate Constraint
Since $U, V$ are encoded separately, the data-processing inequality gives
$$I(U^T V; \hat{C}) \leq I(U, V; f_U(U), f_V(V)) \leq H(f_U(U)) + H(f_V(V)) \leq n(R_U + R_V) = 2nR.$$
Step 3: Combining
The scalar bound of Step 1 is not tight on its own. The full argument single-letterizes the $2nR$-bit budget across coordinates and applies a supporting-hyperplane (convexity) argument to the per-coordinate trade-off. Using entropy-power inequalities and the Gaussian assumption, no allocation of bits can beat the time-sharing curve, yielding
$$D \geq \Gamma(R).$$
5.3 Tightness
The matching upper bound (time-sharing) and lower bound prove:
$$D^*_{\text{ip}}(R) = \Gamma(R).$$
6. From Inner Products to Matrix Multiplication
6.1 The Extension
For matrices $A \in \mathbb{R}^{n \times a}$ and $B \in \mathbb{R}^{n \times b}$ with i.i.d. $\mathcal{N}(0,1)$ entries, we want to approximate $C = A^T B \in \mathbb{R}^{a \times b}$.
Distortion measure:
$$D = \frac{\mathbb{E}\big[\|C - \hat{C}\|_F^2\big]}{\mathbb{E}\big[\|C\|_F^2\big]} = \frac{1}{nab}\,\mathbb{E}\big[\|C - \hat{C}\|_F^2\big],$$
since $\mathbb{E}\|C\|_F^2 = nab$.
6.2 Reduction to Inner Products
The key observation: $C_{ij} = A_i^T B_j$ where $A_i, B_j$ are the $i$-th and $j$-th columns of $A, B$ respectively.
The total error is:
$$\mathbb{E}\big[\|C - \hat{C}\|_F^2\big] = \sum_{i=1}^{a} \sum_{j=1}^{b} \mathbb{E}\big[(C_{ij} - \hat{C}_{ij})^2\big],$$
a sum of $ab$ inner-product errors.
6.3 Matrix Multiplication Rate-Distortion
Theorem 2 (Gaussian Matrix Multiplication). The optimal distortion-rate function for Gaussian matrix multiplication is
$$D^*_{\text{mm}}(R) = \Gamma(R),$$
where $\Gamma(R)$ is the same function as for inner products.
Proof idea:
- Achievability: Apply time-sharing row-wise to both $A$ and $B$
- Converse: Each entry $C_{ij}$ is itself a Gaussian inner product (entries sharing a row or column are correlated, but this does not help); by convexity, the average distortion over entries cannot beat the per-entry optimum
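A quick simulation (ours, again using the idealized test-channel model of quantization in place of a real quantizer) illustrates the reduction: the normalized Frobenius distortion of the naive scheme on matrices matches the per-entry inner-product distortion $f_{\text{naive}}(R)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, a, b, R = 256, 32, 32, 1.0
D = 2.0**(-2 * R)
# idealized Gaussian test channel (an assumption), applied entrywise
noisy = lambda X: (1 - D) * X + np.sqrt(D * (1 - D)) * rng.standard_normal(X.shape)
A = rng.standard_normal((n, a))
B = rng.standard_normal((n, b))
C = A.T @ B
C_hat = noisy(A).T @ noisy(B)
# normalized Frobenius distortion; E||C||_F^2 = n*a*b
dist = np.sum((C - C_hat) ** 2) / (n * a * b)
# should be close to f_naive(1) = 0.4375
```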
7. Practical Implications
7.1 The Magic Number: 0.906 bits
The phase transition at $R^* \approx 0.906$ bits has practical significance:
- Below $R^* \approx 0.906$ bits/entry: sparsification (quantizing only a subset of coordinates) helps
- Above $R^*$: uniform quantization of all coordinates is optimal
7.2 Comparison with Existing Methods
| Method | Rate | Distortion |
|---|---|---|
| FP32 (baseline) | 32 bits | 0 |
| INT8 | 8 bits | $\approx f_{\text{naive}}(8)$ |
| INT4 | 4 bits | $\approx f_{\text{naive}}(4)$ |
| Optimal 1-bit | 1 bit | $\Gamma(1) \approx 0.54$ |
| Optimal 0.5-bit | 0.5 bits | $\Gamma(0.5) \approx 0.77$ |
7.3 Design Principles
- Sub-1-bit quantization: Use sparsification, not uniform scalar quantization
- Rate allocation: Concentrate bits on a subset of coordinates
- Shared randomness: needed so that both encoders (and the decoder) agree on the random coordinate subset $S$
8. Conclusion
We have derived the complete solution to the Gaussian inner product rate-distortion problem:
$$D^*_{\text{ip}}(R) = \Gamma(R) = \begin{cases} 1 - \dfrac{1 - f_{\text{naive}}(R^*)}{R^*}\, R, & R < R^*, \\[4pt] f_{\text{naive}}(R), & R \geq R^*. \end{cases}$$
Main results:
- Phase transition at $R^* \approx 0.906$ bits
- Time-sharing is optimal for low rates
- Same function governs inner products and matrix multiplication
In the final post, we will discuss implementation considerations and open problems in this field.
This post is based on Professor Yuri Polyanskiy's lecture at the 2025 Princeton ML Theory Summer School and the paper "Optimal Quantization for Matrix Multiplication" by Ordentlich and Polyanskiy (2024).