Quantization for Matrix Multiplication, Part 3 of 4
Theory of AI

Quantization for Matrix Multiplication (3): The Optimal Rate-Distortion Function

Abstract

Having established that classical methods fail, we now derive the optimal distortion-rate function $\Gamma(R)$ for inner products of Gaussian vectors. This function exhibits a surprising phase transition at $R^* \approx 0.906$ bits per entry. We prove both achievability (via time-sharing) and the converse (information-theoretic lower bound), then extend the result to matrix multiplication.


1. Setup and Main Result

1.1 The Inner Product Rate-Distortion Problem

Let $U, V \in \mathbb{R}^n$ be independent with i.i.d. $\mathcal{N}(0,1)$ entries. We want to:

  • Encode $U$ into $nR_U$ bits: $f_U: \mathbb{R}^n \to \{0,1\}^{nR_U}$
  • Encode $V$ into $nR_V$ bits: $f_V: \mathbb{R}^n \to \{0,1\}^{nR_V}$
  • Decode to estimate $U^T V$: $g: \{0,1\}^{nR_U} \times \{0,1\}^{nR_V} \to \mathbb{R}$

Distortion: Normalized MSE: $D = \frac{1}{n} \mathbb{E}\left[(U^T V - g(f_U(U), f_V(V)))^2\right]$

Goal: Find $D^*_{\text{ip}}(R)$, the minimum achievable distortion at symmetric rate $R = R_U = R_V$.

1.2 The Main Theorem

Theorem 1 (Gaussian Inner Product Rate-Distortion). The optimal distortion-rate function for Gaussian inner products is:

$$D^*_{\text{ip}}(R) = \Gamma(R)$$

where

$$\Gamma(R) = \begin{cases} 1 - \dfrac{1 - f_{\text{naive}}(R^*)}{R^*} \cdot R, & 0 \leq R < R^* \\ f_{\text{naive}}(R), & R \geq R^* \end{cases}$$

with:

  • $f_{\text{naive}}(R) = 2 \cdot 2^{-2R} - 2^{-4R}$
  • $R^* \approx 0.906$ bits per entry
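To make the theorem concrete, here is a small Python sketch of $f_{\text{naive}}$ and $\Gamma$ (the function names are mine); the assertion checks that the two branches of $\Gamma$ meet continuously at $R^*$:

```python
R_STAR = 0.906  # critical rate in bits per entry, from the theorem

def f_naive(R: float) -> float:
    """Naive scalar-quantization distortion: 2*2^(-2R) - 2^(-4R)."""
    return 2 * 2 ** (-2 * R) - 2 ** (-4 * R)

def gamma(R: float) -> float:
    """Optimal distortion-rate function Gamma(R) with its phase transition."""
    if R < R_STAR:
        # time-sharing branch: linear in R below R*
        return 1 - (1 - f_naive(R_STAR)) / R_STAR * R
    return f_naive(R)

# the two branches agree at R*, so Gamma is continuous there
assert abs(gamma(R_STAR - 1e-12) - gamma(R_STAR)) < 1e-9
print(round(gamma(0.5), 3), round(gamma(2.0), 3))  # → 0.718 0.121
```

The kink at $R^*$ is visible in the two printed values: below the transition the distortion decreases only linearly in $R$, above it the familiar exponential decay of $f_{\text{naive}}$ takes over.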

2. The Naive Scheme: A Baseline

2.1 Scalar Gaussian Quantization

First, consider quantizing each coordinate independently using optimal scalar Gaussian quantization.

Scalar rate-distortion: For a Gaussian source with variance $\sigma^2$, the distortion-rate function is $D_{\text{scalar}}(R) = \sigma^2 \cdot 2^{-2R}$.

2.2 Naive Inner Product Estimation

Apply scalar quantization with rate $R$ to each coordinate:

  • $U_i = \hat{U}_i + N_i^U$, where the quantization noise $N_i^U$ has variance $2^{-2R}$ and is independent of $\hat{U}_i$ (the backward form of the optimal Gaussian test channel), so $\text{Var}(\hat{U}_i) = 1 - 2^{-2R}$
  • $V_i = \hat{V}_i + N_i^V$, with the same variances

Estimator: $\hat{C} = \hat{U}^T \hat{V}$

2.3 Error Analysis

The error decomposes as: $U^T V - \hat{U}^T \hat{V} = \underbrace{(N^U)^T \hat{V}}_{\text{Term 1}} + \underbrace{\hat{U}^T N^V}_{\text{Term 2}} + \underbrace{(N^U)^T N^V}_{\text{Term 3}}$

Computing the MSE (the cross-terms vanish because the three terms are zero-mean and pairwise uncorrelated):

Term 1: $\mathbb{E}[((N^U)^T \hat{V})^2] = n \cdot \text{Var}(N_i^U) \cdot \text{Var}(\hat{V}_i) = n \cdot 2^{-2R}(1 - 2^{-2R})$

Term 2: Similarly, $\mathbb{E}[(\hat{U}^T N^V)^2] = n \cdot 2^{-2R}(1 - 2^{-2R})$

Term 3: $\mathbb{E}[((N^U)^T N^V)^2] = n \cdot (2^{-2R})^2 = n \cdot 2^{-4R}$

Total MSE: $\mathbb{E}[(\hat{C} - C)^2] = 2n \cdot 2^{-2R}(1 - 2^{-2R}) + n \cdot 2^{-4R} = 2n \cdot 2^{-2R} - n \cdot 2^{-4R}$

Normalized distortion: $f_{\text{naive}}(R) = \frac{1}{n} \mathbb{E}[(\hat{C} - C)^2] = 2 \cdot 2^{-2R} - 2^{-4R}$
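As a sanity check on this formula, the sketch below simulates the Gaussian test channel (an idealized stand-in for an actual rate-$R$ quantizer; all names are mine) and compares the empirical normalized MSE against $f_{\text{naive}}(R)$:

```python
import numpy as np

def simulate_naive(R: float, n: int = 20_000, trials: int = 1_000, seed: int = 0) -> float:
    """Monte Carlo estimate of (1/n) E[(U^T V - Uhat^T Vhat)^2], modeling each
    rate-R quantizer by the test channel Uhat = (1-D) U + sqrt(D(1-D)) Z."""
    rng = np.random.default_rng(seed)
    D = 2.0 ** (-2 * R)  # per-coordinate quantization noise variance
    total = 0.0
    for _ in range(trials):
        U, V = rng.standard_normal(n), rng.standard_normal(n)
        Uh = (1 - D) * U + np.sqrt(D * (1 - D)) * rng.standard_normal(n)
        Vh = (1 - D) * V + np.sqrt(D * (1 - D)) * rng.standard_normal(n)
        total += (U @ V - Uh @ Vh) ** 2 / n
    return total / trials

print(simulate_naive(1.0), 2 * 2 ** (-2.0) - 2 ** (-4.0))  # both close to 0.4375
```

The channel $\hat{U} = (1-D)U + \sqrt{D(1-D)}Z$ is exactly the joint distribution for which $U = \hat{U} + N^U$ with $N^U$ independent of $\hat{U}$, matching the error analysis above.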


3. Time-Sharing: Beating the Naive Scheme

3.1 The Key Insight

At low rates, the naive scheme is suboptimal. The insight: it's better to quantize a fraction of coordinates perfectly than to quantize all coordinates poorly.

3.2 Time-Sharing Scheme

Parameters:

  • $\kappa \in [0, 1]$: fraction of coordinates to quantize
  • Effective rate per quantized coordinate: $R/\kappa$

Algorithm:

  1. Randomly select subset $S \subset [n]$ with $|S| = \kappa n$
  2. For $i \in S$: quantize $U_i, V_i$ at rate $R/\kappa$
  3. For $i \notin S$: send nothing (set $\hat{U}_i = \hat{V}_i = 0$)
  4. Estimate: $\hat{C} = \sum_{i \in S} \hat{U}_i \hat{V}_i$
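Steps 1–4 can be sketched as follows; the rate-$R/\kappa$ quantizer is again modeled by the idealized Gaussian test channel, `R_STAR` anticipates the optimal choice $\kappa^* = R/R^*$ from Section 4, and all names are mine:

```python
import numpy as np

R_STAR = 0.906  # critical rate, from Section 4

def time_sharing_estimate(U, V, R, rng):
    """Quantize a kappa-fraction of the coordinates at rate R/kappa each and
    drop the rest. The subset S is shared randomness between both encoders."""
    n = len(U)
    kappa = min(1.0, R / R_STAR)                 # step 1: optimal fraction
    S = rng.choice(n, size=int(kappa * n), replace=False)
    D = 2.0 ** (-2 * R / kappa)                  # distortion of a rate-R/kappa quantizer
    def channel(x):                              # steps 2-3: idealized quantizer
        return (1 - D) * x + np.sqrt(D * (1 - D)) * rng.standard_normal(len(x))
    return channel(U[S]) @ channel(V[S])         # step 4: sum over quantized coords
```

Averaged over many draws, $(U^T V - \hat{C})^2 / n$ for this sketch concentrates around $(1-\kappa) + \kappa \cdot f_{\text{naive}}(R/\kappa)$, the quantity derived next.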

3.3 Distortion Analysis

Contribution from unquantized coordinates ($i \notin S$): $\mathbb{E}\left[\left(\sum_{i \notin S} U_i V_i\right)^2\right] = (1-\kappa)n$

(Since $\mathbb{E}[(U_i V_i)^2] = 1$ and terms are independent.)

Contribution from quantized coordinates ($i \in S$): $\kappa n \cdot f_{\text{naive}}(R/\kappa)$

Total distortion: $D(\kappa) = (1-\kappa)n + \kappa n \cdot f_{\text{naive}}(R/\kappa)$

3.4 Optimizing Over $\kappa$

The achievable (normalized) distortion is: $\Gamma_{\text{TS}}(R) = \inf_{0 \leq \kappa \leq 1} \left[(1-\kappa) + \kappa \cdot f_{\text{naive}}(R/\kappa)\right]$

Calculus: Taking the derivative and setting it to zero: $\frac{d}{d\kappa}\left[(1-\kappa) + \kappa \cdot f_{\text{naive}}(R/\kappa)\right] = 0$

This gives the critical point equation. The solution $\kappa^*(R)$ determines the optimal allocation.
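The optimization can also be checked by a simple grid search (a sketch; names mine). It recovers $\kappa^* = 1$ above the transition and $\kappa^* = R/R^* < 1$ below it:

```python
import numpy as np

def f_naive(R):
    return 2 * 2 ** (-2 * R) - 2 ** (-4 * R)

def best_kappa(R):
    """Grid-minimize (1 - kappa) + kappa * f_naive(R / kappa) over kappa."""
    grid = np.linspace(1e-3, 1.0, 100_000)
    vals = (1 - grid) + grid * f_naive(R / grid)
    i = np.argmin(vals)
    return grid[i], vals[i]

for R in (0.25, 0.5, 1.0, 2.0):
    kappa, dist = best_kappa(R)
    print(f"R={R}: kappa*={kappa:.3f}, distortion={dist:.4f}")
```

For $R = 0.25$ and $R = 0.5$ the minimizer lands at $\kappa^* \approx R/0.906 < 1$, while for $R = 1$ and $R = 2$ it sits at the boundary $\kappa^* = 1$, i.e. the naive scheme.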


4. The Phase Transition

4.1 Two Regimes

Regime 1: High rate ($R \geq R^*$)

  • Optimal: $\kappa^* = 1$ (quantize all coordinates)
  • Distortion: $\Gamma(R) = f_{\text{naive}}(R)$

Regime 2: Low rate ($R < R^*$)

  • Optimal: $\kappa^* = R/R^* < 1$ (quantize fraction of coordinates)
  • Distortion: $\Gamma(R) = 1 - \frac{1 - f_{\text{naive}}(R^*)}{R^*} \cdot R$

4.2 The Critical Rate $R^*$

The critical rate satisfies: $R^* = \arg\max_{R > 0} \frac{1 - f_{\text{naive}}(R)}{R}$, i.e. it is the rate at which the chord from the point $(0, 1)$ down to the curve $f_{\text{naive}}$ is steepest.

Numerically: $R^* \approx 0.906$ bits per entry.

At $R^*$, the slope of time-sharing matches the slope of naive quantization, creating a continuous but non-differentiable transition.
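This characterization is easy to verify numerically (a sketch; names mine): the steepest chord from $(0,1)$ to the curve $f_{\text{naive}}$ occurs near $0.906$ bits, and at that point the chord is tangent to the curve:

```python
import numpy as np

def f_naive(R):
    return 2 * 2 ** (-2 * R) - 2 ** (-4 * R)

# R* maximizes the chord slope (1 - f_naive(R)) / R from the point (0, 1)
Rs = np.linspace(0.01, 3.0, 300_000)
R_star = Rs[np.argmax((1 - f_naive(Rs)) / Rs)]
print(R_star)  # ≈ 0.906

# tangency at R*: the chord slope equals the derivative f_naive'(R*)
ln2 = np.log(2.0)
fprime = -4 * ln2 * 2 ** (-2 * R_star) + 4 * ln2 * 2 ** (-4 * R_star)
chord = (f_naive(R_star) - 1) / R_star
assert abs(fprime - chord) < 1e-3
```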

4.3 Interpretation

Low rate regime:

  • Bits are "expensive"
  • Better to concentrate bits on fewer coordinates
  • Linear decay: every bit reduces distortion by the same amount

High rate regime:

  • Bits are "cheap"
  • Quantize all coordinates uniformly
  • Exponential decay from standard rate-distortion

5. Converse: Information-Theoretic Lower Bound

5.1 The Key Lemma

Lemma (Lower Bound). For any encoding scheme with rate $R$ per source, the normalized distortion satisfies: $D \geq \Gamma(R)$

5.2 Proof Sketch

The proof uses single-letterization and supporting hyperplane arguments.

Step 1: Mutual Information Bound

By data processing, since $\hat{C}$ is a function of the two encodings: $I(U^T V; f_U(U), f_V(V)) \geq I(U^T V; \hat{C})$

where $\hat{C}$ achieves distortion $D$.

Step 2: Rate Constraint

Since $U, V$ are encoded separately: $I(U^T V; f_U(U), f_V(V)) \leq I(U; f_U(U)) + I(V; f_V(V)) \leq 2nR$

Step 3: Combining

The mutual information needed to achieve distortion $D$ is: $I(U^T V; \hat{C}) = h(U^T V) - h(U^T V \mid \hat{C})$

Using entropy-power inequalities and the Gaussian assumption: $D \geq \Gamma(R)$

5.3 Tightness

The matching upper bound (time-sharing) and lower bound together prove: $D^*_{\text{ip}}(R) = \Gamma(R)$


6. From Inner Products to Matrix Multiplication

6.1 The Extension

For matrices $A \in \mathbb{R}^{n \times a}$ and $B \in \mathbb{R}^{n \times b}$ with i.i.d. $\mathcal{N}(0,1)$ entries, we want to approximate $C = A^T B \in \mathbb{R}^{a \times b}$.

Distortion measure: $D_{\text{mat}} = \frac{1}{abn} \mathbb{E}\left[\|\hat{C} - C\|_F^2\right]$

6.2 Reduction to Inner Products

The key observation: $C_{ij} = A_i^T B_j$ where $A_i, B_j$ are the $i$-th and $j$-th columns of $A, B$ respectively.

The total error is: $\|\hat{C} - C\|_F^2 = \sum_{i=1}^a \sum_{j=1}^b (\hat{C}_{ij} - A_i^T B_j)^2$

6.3 Matrix Multiplication Rate-Distortion

Theorem 2 (Gaussian Matrix Multiplication). The optimal distortion-rate function for Gaussian matrix multiplication is:

$$D^*_{\text{mat}}(R) = \Gamma(R)$$

where $\Gamma(R)$ is the same function as for inner products.

Proof idea:

  • Achievability: Apply time-sharing over the shared coordinate index $k \in [n]$ (the rows of $A$ and $B$), using the same subset $S$ for every column
  • Converse: Each entry $C_{ij}$ is a Gaussian inner product, so the average per-entry distortion cannot beat the per-entry optimum

7. Practical Implications

7.1 The Magic Number: 0.906 bits

The phase transition at $R^* \approx 0.906$ bits has practical significance:

  • Below $R^* \approx 0.906$ bits/entry: sparsification (time-sharing) helps
  • Above $R^*$: uniform scalar quantization is near-optimal

7.2 Comparison with Existing Methods

Method            Rate       Distortion
FP32 (baseline)   32 bits    0
INT8              8 bits     $\approx f_{\text{naive}}(8) \approx 3.1 \times 10^{-5}$
INT4              4 bits     $\approx f_{\text{naive}}(4) \approx 7.8 \times 10^{-3}$
Optimal 1-bit     1 bit      $\Gamma(1) \approx 0.44$
Optimal 0.5-bit   0.5 bits   $\Gamma(0.5) \approx 0.72$

7.3 Design Principles

  1. Sub-1-bit quantization: Use sparsification, not uniform scalar quantization
  2. Rate allocation: Concentrate bits on a subset of coordinates
  3. Shared randomness: Essential for achieving optimal rates

8. Conclusion

We have derived the complete solution to the Gaussian inner product rate-distortion problem:

$$\Gamma(R) = \begin{cases} 1 - \dfrac{1 - f_{\text{naive}}(R^*)}{R^*} \cdot R, & R < R^* \\ f_{\text{naive}}(R), & R \geq R^* \end{cases}$$

Main results:

  1. Phase transition at $R^* \approx 0.906$ bits
  2. Time-sharing is optimal for low rates
  3. Same function governs inner products and matrix multiplication

In the final post, we will discuss implementation considerations and open problems in this field.


This post is based on Professor Yuri Polyanskiy's lecture at the 2025 Princeton ML Theory Summer School and the paper "Optimal Quantization for Matrix Multiplication" by Ordentlich and Polyanskiy (2024).