
Chapter 8

Rate-Distortion Theory

© Raymond W. Yeung 2012

Department of Information Engineering

The Chinese University of Hong Kong
Information Transmission with Distortion

• Consider compressing an information source with entropy rate $H$ at rate $R < H$.
• By the source coding theorem, $P_e \to 1$ as $n \to \infty$.
• Under such a situation, information must be transmitted with "distortion".
• What is the best possible tradeoff?
8.1 Single-Letter Distortion
• Let $\{X_k, k \ge 1\}$ be an i.i.d. information source with generic random variable $X \sim p(x)$, where $|\mathcal{X}| < \infty$.
• Consider a source sequence $\mathbf{x} = (x_1, x_2, \cdots, x_n)$ and a reproduction sequence $\hat{\mathbf{x}} = (\hat{x}_1, \hat{x}_2, \cdots, \hat{x}_n)$.
• The components of $\hat{\mathbf{x}}$ take values in a reproduction alphabet $\hat{\mathcal{X}}$, where $|\hat{\mathcal{X}}| < \infty$.
• In general, $\hat{\mathcal{X}}$ may be different from $\mathcal{X}$.
• For example, $\hat{\mathbf{x}}$ can be a quantized version of $\mathbf{x}$.
Definition 8.1 A single-letter distortion measure is a mapping
$$d : \mathcal{X} \times \hat{\mathcal{X}} \to \Re^+.$$
The value $d(x, \hat{x})$ denotes the distortion incurred when a source symbol $x$ is reproduced as $\hat{x}$.

Definition 8.2 The average distortion between a source sequence $\mathbf{x} \in \mathcal{X}^n$ and a reproduction sequence $\hat{\mathbf{x}} \in \hat{\mathcal{X}}^n$ induced by a single-letter distortion measure $d$ is defined by
$$d(\mathbf{x}, \hat{\mathbf{x}}) = \frac{1}{n} \sum_{k=1}^{n} d(x_k, \hat{x}_k).$$
Examples of a Distortion Measure
• Let $\hat{\mathcal{X}} = \mathcal{X}$.
1. Square-error: $d(x, \hat{x}) = (x - \hat{x})^2$, where $\mathcal{X}$ and $\hat{\mathcal{X}}$ are real.
2. Hamming distortion:
$$d(x, \hat{x}) = \begin{cases} 0 & \text{if } x = \hat{x} \\ 1 & \text{if } x \ne \hat{x}, \end{cases}$$
where the symbols in $\mathcal{X}$ do not carry any particular meaning.

• Let $\hat{X}$ be an estimate of $X$.
1. If $d$ is the square-error distortion measure, $E d(X, \hat{X})$ is called the mean square error.
2. If $d$ is the Hamming distortion measure,
$$E d(X, \hat{X}) = \Pr\{X = \hat{X}\} \cdot 0 + \Pr\{X \ne \hat{X}\} \cdot 1 = \Pr\{X \ne \hat{X}\}$$
is the probability of error. For a source sequence $\mathbf{x}$ and a reproduction sequence $\hat{\mathbf{x}}$, the average distortion $d(\mathbf{x}, \hat{\mathbf{x}})$ gives the frequency of error in $\hat{\mathbf{x}}$.
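The two distortion measures and the average distortion of Definition 8.2 can be sketched in a few lines of Python. The function names and the sample sequences are illustrative choices and do not come from the slides.

```python
from typing import Callable, Sequence


def square_error(x: float, x_hat: float) -> float:
    """Square-error distortion d(x, x_hat) = (x - x_hat)^2."""
    return (x - x_hat) ** 2


def hamming(x, x_hat) -> int:
    """Hamming distortion: 0 if the symbols agree, 1 otherwise."""
    return 0 if x == x_hat else 1


def average_distortion(d: Callable, xs: Sequence, x_hats: Sequence) -> float:
    """Average distortion (1/n) * sum_k d(x_k, x_hat_k) of Definition 8.2."""
    if len(xs) != len(x_hats) or len(xs) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(d(x, xh) for x, xh in zip(xs, x_hats)) / len(xs)


# Illustrative sequences (assumed data): two symbol errors out of five.
source = [0, 1, 1, 0, 1]
reproduction = [0, 1, 0, 0, 0]
print(average_distortion(hamming, source, reproduction))
print(average_distortion(square_error, [1.0, 2.0], [1.5, 2.0]))
```

With the Hamming measure the result is the fraction of positions in which the two sequences differ, which is the "frequency of error" mentioned above.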
Definition 8.5 For a distortion measure $d$, for each $x \in \mathcal{X}$, let $\hat{x}^*(x) \in \hat{\mathcal{X}}$ minimize $d(x, \hat{x})$ over all $\hat{x} \in \hat{\mathcal{X}}$. A distortion measure $d$ is said to be normal if
$$c_x = d(x, \hat{x}^*(x)) = 0$$
for all $x \in \mathcal{X}$.

• A normal distortion measure is one which allows a source $X$ to be reproduced with zero distortion.
• The square-error distortion measure and the Hamming distortion measure are normal distortion measures.
• The normalization of a distortion measure $d$ is the distortion measure $\tilde{d}$ defined by
$$\tilde{d}(x, \hat{x}) = d(x, \hat{x}) - c_x$$
for all $(x, \hat{x}) \in \mathcal{X} \times \hat{\mathcal{X}}$.
• It suffices to consider normal distortion measures, as we will see.

Example 8.6 Let $d$ be a distortion measure defined by

  d(x, x̂) |  a   b   c
  --------+------------
     1    |  2   7   5
     2    |  4   3   8

Then $\tilde{d}$, the normalization of $d$, is given by

  d̃(x, x̂) |  a   b   c
  --------+------------
     1    |  0   5   3
     2    |  1   0   5
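The normalization in Example 8.6 can be reproduced mechanically: subtract from each row its minimum $c_x$. The sketch below assumes a dictionary-of-dictionaries layout for the table, which is an illustrative choice.

```python
def normalize(d):
    """Return (d_tilde, c) where c[x] = min over x_hat of d[x][x_hat]
    and d_tilde[x][x_hat] = d[x][x_hat] - c[x]."""
    c = {x: min(row.values()) for x, row in d.items()}
    d_tilde = {x: {xh: v - c[x] for xh, v in row.items()} for x, row in d.items()}
    return d_tilde, c


# Distortion measure of Example 8.6: rows are source symbols 1 and 2,
# columns are reproduction symbols a, b, c.
d = {1: {"a": 2, "b": 7, "c": 5},
     2: {"a": 4, "b": 3, "c": 8}}

d_tilde, c = normalize(d)
print(c)        # row minima c_x
print(d_tilde)  # normalized distortion measure
```

Every row of the normalized table contains a zero, which is exactly the condition for a distortion measure to be normal.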
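Let $\hat{X}$ be any estimate of $X$ which takes values in $\hat{\mathcal{X}}$. Then
$$\begin{aligned}
E d(X, \hat{X}) &= \sum_x \sum_{\hat{x}} p(x, \hat{x}) d(x, \hat{x}) \\
&= \sum_x \sum_{\hat{x}} p(x, \hat{x}) \left[ \tilde{d}(x, \hat{x}) + c_x \right] \\
&= E\tilde{d}(X, \hat{X}) + \sum_x p(x) \sum_{\hat{x}} p(\hat{x}|x) c_x \\
&= E\tilde{d}(X, \hat{X}) + \sum_x p(x) c_x \left( \sum_{\hat{x}} p(\hat{x}|x) \right) \\
&= E\tilde{d}(X, \hat{X}) + \sum_x p(x) c_x \\
&= E\tilde{d}(X, \hat{X}) + \Delta,
\end{aligned}$$
where
$$\Delta = \sum_x p(x) c_x$$
is a constant which depends only on $p(x)$ and $d$ but not on the conditional distribution $p(\hat{x}|x)$.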
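The identity $E d(X, \hat{X}) = E\tilde{d}(X, \hat{X}) + \Delta$ can be checked numerically. The sketch below reuses the distortion measure of Example 8.6; the source distribution `p_x` and the randomly drawn conditional distributions are assumptions made only for illustration.

```python
import random

# Distortion measure of Example 8.6 and its normalization.
d = {1: {"a": 2, "b": 7, "c": 5}, 2: {"a": 4, "b": 3, "c": 8}}
c = {x: min(row.values()) for x, row in d.items()}
d_tilde = {x: {xh: v - c[x] for xh, v in row.items()} for x, row in d.items()}

p_x = {1: 0.25, 2: 0.75}                 # assumed source distribution
Delta = sum(p_x[x] * c[x] for x in p_x)  # constant term sum_x p(x) c_x


def expected(channel, dist):
    """E dist(X, X_hat) for the conditional distribution channel[x][x_hat]."""
    return sum(p_x[x] * channel[x][xh] * dist[x][xh] for x in p_x for xh in dist[x])


def random_channel(rng):
    """Draw an arbitrary conditional distribution p(x_hat | x)."""
    channel = {}
    for x in p_x:
        w = {xh: rng.random() for xh in d[x]}
        s = sum(w.values())
        channel[x] = {xh: v / s for xh, v in w.items()}
    return channel


def max_identity_gap(trials=100, seed=0):
    """Largest |E d - (E d_tilde + Delta)| over randomly drawn channels."""
    rng = random.Random(seed)
    return max(
        abs(expected(ch, d) - (expected(ch, d_tilde) + Delta))
        for ch in (random_channel(rng) for _ in range(trials))
    )


print(Delta, max_identity_gap())
```

The gap stays at the level of floating-point rounding whatever $p(\hat{x}|x)$ is drawn, in line with $\Delta$ not depending on the conditional distribution.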
Definition 8.7 Let $\hat{x}^*$ minimize $E d(X, \hat{x})$ over all $\hat{x} \in \hat{\mathcal{X}}$, and define
$$D_{max} = E d(X, \hat{x}^*).$$
Note: $\hat{x}^*$ is not the same as $\hat{x}^*(x)$.

• If we know nothing about a source variable $X$, then $\hat{x}^*$ is the best estimate of $X$, and $D_{max}$ is the minimum expected distortion between $X$ and a constant estimate of $X$.
• Specifically, $D_{max}$ can be asymptotically achieved by taking $(\hat{x}^*, \hat{x}^*, \cdots, \hat{x}^*)$ to be the reproduction sequence.
• Therefore it is not meaningful to impose a constraint $D \ge D_{max}$ on the reproduction sequence.
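As a small illustration of Definition 8.7, the sketch below finds the best constant estimate $\hat{x}^*$ and $D_{max}$ for the distortion measure of Example 8.6. The source distribution is an assumed one, chosen only to make the numbers concrete.

```python
def best_constant_estimate(p_x, d):
    """Return (x_hat_star, D_max): the constant reproduction symbol that
    minimizes E d(X, x_hat), and the resulting expected distortion."""
    reproduction_alphabet = next(iter(d.values())).keys()

    def expected_distortion(x_hat):
        return sum(p_x[x] * d[x][x_hat] for x in p_x)

    x_hat_star = min(reproduction_alphabet, key=expected_distortion)
    return x_hat_star, expected_distortion(x_hat_star)


d = {1: {"a": 2, "b": 7, "c": 5}, 2: {"a": 4, "b": 3, "c": 8}}
p_x = {1: 0.25, 2: 0.75}  # assumed source distribution (not from the text)

print(best_constant_estimate(p_x, d))
```

Under this assumed distribution the best constant estimate is `a`, even though the per-symbol minimizer for $x = 2$ is `b`. This illustrates the note that $\hat{x}^*$ is not the same as $\hat{x}^*(x)$.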
8.2 The Rate-Distortion Function
All the discussions are with respect to an i.i.d. information source $\{X_k, k \ge 1\}$ with generic random variable $X$ and a distortion measure $d$.

Definition 8.8 An $(n, M)$ rate-distortion code is defined by an encoding function
$$f : \mathcal{X}^n \to \{1, 2, \cdots, M\}$$
and a decoding function
$$g : \{1, 2, \cdots, M\} \to \hat{\mathcal{X}}^n.$$
The set $\{1, 2, \cdots, M\}$, denoted by $\mathcal{I}$, is called the index set. The reproduction sequences $g(1), g(2), \cdots, g(M)$ in $\hat{\mathcal{X}}^n$ are called codewords, and the set of codewords is called the codebook.
A Rate-Distortion Code

[Figure: source sequence $\mathbf{X}$ → Encoder → $f(\mathbf{X})$ → Decoder → reproduction sequence $\hat{\mathbf{X}}$]
Definition 8.9 The rate of an $(n, M)$ rate-distortion code is $n^{-1} \log M$ in bits per symbol.

Definition 8.10 A rate-distortion pair $(R, D)$ is (asymptotically) achievable if for any $\epsilon > 0$, there exists for sufficiently large $n$ an $(n, M)$ rate-distortion code such that
$$\frac{1}{n} \log M \le R + \epsilon$$
and
$$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} \le \epsilon,$$
where $\hat{\mathbf{X}} = g(f(\mathbf{X}))$.

Remark If $(R, D)$ is achievable, then $(R', D)$ and $(R, D')$ are achievable for all $R' \ge R$ and $D' \ge D$. This in turn implies that $(R', D')$ is achievable for all $R' \ge R$ and $D' \ge D$.
Definition 8.11 The rate-distortion region is the subset of $\Re^2$ containing all achievable pairs $(R, D)$.

Theorem 8.12 The rate-distortion region is closed and convex.

• The closedness follows from the definition of the achievability of an $(R, D)$ pair.
• The convexity is proved by time-sharing. Specifically, if $(R^{(1)}, D^{(1)})$ and $(R^{(2)}, D^{(2)})$ are achievable, then so is $(R^{(\lambda)}, D^{(\lambda)})$, where
$$R^{(\lambda)} = \lambda R^{(1)} + \bar{\lambda} R^{(2)}$$
$$D^{(\lambda)} = \lambda D^{(1)} + \bar{\lambda} D^{(2)}$$
and $\bar{\lambda} = 1 - \lambda$. This can be seen by time-sharing between two codes, one achieving $(R^{(1)}, D^{(1)})$ for $\lambda$ fraction of the time, and the other one achieving $(R^{(2)}, D^{(2)})$ for $\bar{\lambda}$ fraction of the time.

[Figure: the R-D region]

Definition 8.13 The rate-distortion function $R(D)$ is the minimum of all rates $R$ for a given distortion $D$ such that $(R, D)$ is achievable.

Definition 8.14 The distortion-rate function $D(R)$ is the minimum of all distortions $D$ for a given rate $R$ such that $(R, D)$ is achievable.
Theorem 8.15 The following properties hold for the rate-distortion function $R(D)$:
1. $R(D)$ is non-increasing in $D$.
2. $R(D)$ is convex.
3. $R(D) = 0$ for $D \ge D_{max}$.
4. $R(0) \le H(X)$.

Proof
1. Let $D' \ge D$. $(R(D), D)$ achievable $\Rightarrow$ $(R(D), D')$ achievable. Then $R(D) \ge R(D')$ by definition of $R(\cdot)$.
2. Follows from the convexity of the rate-distortion region.
3. $(0, D_{max})$ is achievable $\Rightarrow$ $R(D) = 0$ for $D \ge D_{max}$.
4. Since $d$ is assumed to be normal, $(H(X), 0)$ is achievable, and hence $R(0) \le H(X)$.



8.3 The Rate Distortion Theorem
Definition 8.16 For $D \ge 0$, the information rate-distortion function is defined by
$$R_I(D) = \min_{\hat{X} : E d(X, \hat{X}) \le D} I(X; \hat{X}).$$

• The minimization is taken over the set of all $p(\hat{x}|x)$ such that $E d(X, \hat{X}) \le D$ is satisfied, namely the set
$$\left\{ p(\hat{x}|x) : \sum_{x, \hat{x}} p(x) p(\hat{x}|x) d(x, \hat{x}) \le D \right\}.$$
• Since this set is compact in $\Re^{|\mathcal{X}||\hat{\mathcal{X}}|}$ and $I(X; \hat{X})$ is a continuous functional of $p(\hat{x}|x)$, the minimum value of $I(X; \hat{X})$ can be attained.
• Since
$$E\tilde{d}(X, \hat{X}) = E d(X, \hat{X}) - \Delta,$$
where $\Delta$ does not depend on $p(\hat{x}|x)$, we can always replace $d$ by $\tilde{d}$ and $D$ by $D - \Delta$ in the definition of $R_I(D)$ without changing the minimization problem.
• Without loss of generality, we can assume $d$ is normal.
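To make the definition of $R_I(D)$ concrete, the sketch below evaluates it by brute force for a binary source with Hamming distortion: it scans a grid of conditional distributions $p(\hat{x}|x)$, keeps those meeting the distortion constraint, and returns the smallest mutual information found. The grid search is an illustrative device and is not a method used in the slides. The parameter names and the grid resolution are assumptions, and the result is only an upper approximation of the true minimum.

```python
from math import log2


def mutual_information(p_x, channel):
    """I(X; X_hat) in bits; channel[x][x_hat] = p(x_hat | x)."""
    n_hat = len(channel[0])
    q = [sum(p_x[x] * channel[x][xh] for x in range(len(p_x))) for xh in range(n_hat)]
    total = 0.0
    for x in range(len(p_x)):
        for xh in range(n_hat):
            if p_x[x] * channel[x][xh] > 0:
                total += p_x[x] * channel[x][xh] * log2(channel[x][xh] / q[xh])
    return total


def rate_distortion_grid(gamma, D, steps=200):
    """Approximate R_I(D) for Pr{X=1} = gamma under Hamming distortion."""
    p_x = [1 - gamma, gamma]
    best = float("inf")
    for i in range(steps + 1):
        a = i / steps              # a = p(x_hat = 1 | x = 0)
        for j in range(steps + 1):
            b = j / steps          # b = p(x_hat = 0 | x = 1)
            # Expected Hamming distortion is Pr{X != X_hat}.
            if p_x[0] * a + p_x[1] * b <= D + 1e-12:
                best = min(best, mutual_information(p_x, [[1 - a, a], [b, 1 - b]]))
    return best


print(rate_distortion_grid(0.3, 0.1))
print(rate_distortion_grid(0.3, 0.3))
```

For this source the first value should lie slightly above $h_b(0.3) - h_b(0.1)$, and the second should be essentially zero because $D = 0.3$ equals $D_{max}$.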
Theorem 8.17 (The Rate-Distortion Theorem) $R(D) = R_I(D)$.

Theorem 8.18 The following properties hold for the information rate-distortion function $R_I(D)$:
1. $R_I(D)$ is non-increasing in $D$.
2. $R_I(D)$ is convex.
3. $R_I(D) = 0$ for $D \ge D_{max}$.
4. $R_I(0) \le H(X)$.
Proof of Theorem 8.18
1. For a larger $D$, the minimization is taken over a larger set.
3. Let $\hat{X} = \hat{x}^*$ w.p. 1 to show that $(0, D_{max})$ is achievable. Then for $D \ge D_{max}$, $R_I(D) \le I(X; \hat{X}) = 0$, which implies $R_I(D) = 0$ since $R_I(D)$ is nonnegative.
4. Let $\hat{X} = \hat{x}^*(X)$, so that $E d(X, \hat{X}) = 0$ (since $d$ is normal). Then
$$R_I(0) \le I(X; \hat{X}) \le H(X).$$

Proof of Theorem 8.18
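2. Consider any $D^{(1)}, D^{(2)} \ge 0$ and $0 \le \lambda \le 1$. Let $\hat{X}^{(i)}$ achieve $R_I(D^{(i)})$ for $i = 1, 2$, i.e.,
$$R_I(D^{(i)}) = I(X; \hat{X}^{(i)}),$$
where
$$E d(X, \hat{X}^{(i)}) \le D^{(i)}.$$
Let $\hat{X}^{(\lambda)}$ be jointly distributed with $X$ defined by
$$p_\lambda(\hat{x}|x) = \lambda p_1(\hat{x}|x) + \bar{\lambda} p_2(\hat{x}|x).$$
Then
$$\begin{aligned}
E d(X, \hat{X}^{(\lambda)}) &= \lambda E d(X, \hat{X}^{(1)}) + \bar{\lambda} E d(X, \hat{X}^{(2)}) \\
&\le \lambda D^{(1)} + \bar{\lambda} D^{(2)} \\
&= D^{(\lambda)}.
\end{aligned}$$
Finally consider
$$\begin{aligned}
\lambda R_I(D^{(1)}) + \bar{\lambda} R_I(D^{(2)}) &= \lambda I(X; \hat{X}^{(1)}) + \bar{\lambda} I(X; \hat{X}^{(2)}) \\
&\ge I(X; \hat{X}^{(\lambda)}) \\
&\ge R_I(D^{(\lambda)}),
\end{aligned}$$
where the first inequality holds because $I(X; \hat{X})$ is a convex functional of $p(\hat{x}|x)$ for fixed $p(x)$, and the second because $\hat{X}^{(\lambda)}$ satisfies the distortion constraint $D^{(\lambda)}$.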
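The key step in the proof of convexity is that mixing two conditional distributions cannot increase mutual information beyond the same mixture of the two mutual informations. The sketch below checks this numerically on randomly drawn conditional distributions; the source distribution, alphabet sizes, and number of trials are arbitrary assumptions.

```python
import random
from math import log2


def mutual_information(p_x, channel):
    """I(X; X_hat) in bits; channel[x][x_hat] = p(x_hat | x)."""
    n_hat = len(channel[0])
    q = [sum(p_x[x] * channel[x][xh] for x in range(len(p_x))) for xh in range(n_hat)]
    return sum(
        p_x[x] * channel[x][xh] * log2(channel[x][xh] / q[xh])
        for x in range(len(p_x))
        for xh in range(n_hat)
        if p_x[x] * channel[x][xh] > 0
    )


def random_channel(rng, n_x, n_hat):
    """Draw an arbitrary conditional distribution with n_x rows."""
    rows = []
    for _ in range(n_x):
        w = [rng.random() for _ in range(n_hat)]
        s = sum(w)
        rows.append([v / s for v in w])
    return rows


def mix(ch1, ch2, lam):
    """p_lambda = lam * p_1 + (1 - lam) * p_2, entry by entry."""
    return [[lam * a + (1 - lam) * b for a, b in zip(r1, r2)] for r1, r2 in zip(ch1, ch2)]


def worst_convexity_gap(trials=200, seed=0):
    """Largest I(mix) - [lam*I_1 + (1-lam)*I_2]; convexity predicts <= 0."""
    rng = random.Random(seed)
    p_x = [0.2, 0.5, 0.3]  # assumed source distribution
    worst = float("-inf")
    for _ in range(trials):
        ch1, ch2 = random_channel(rng, 3, 2), random_channel(rng, 3, 2)
        lam = rng.random()
        mixed = mutual_information(p_x, mix(ch1, ch2, lam))
        bound = lam * mutual_information(p_x, ch1) + (1 - lam) * mutual_information(p_x, ch2)
        worst = max(worst, mixed - bound)
    return worst


print(worst_convexity_gap())
```

A non-positive result (up to rounding) is consistent with the inequality used in the proof; it is a numerical sanity check, not a proof.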
Corollary 8.19 If $R_I(0) > 0$, then $R_I(D)$ is strictly decreasing for $0 \le D \le D_{max}$, and the inequality constraint in the definition of $R_I(D)$ can be replaced by an equality constraint.

Proof
1. $R_I(D)$ must be strictly decreasing for $0 \le D \le D_{max}$ because $R_I(0) > 0$, $R_I(D_{max}) = 0$, and $R_I(D)$ is non-increasing and convex.
2. Show that $R_I(D) > 0$ for $0 \le D < D_{max}$ by contradiction.
• Suppose $R_I(D') = 0$ for some $0 \le D' < D_{max}$, and let $R_I(D')$ be achieved by some $\hat{X}$. Then
$$R_I(D') = I(X; \hat{X}) = 0$$
implies that $X$ and $\hat{X}$ are independent.
• Show that such an $\hat{X}$ which is independent of $X$ cannot do better than the constant estimate $\hat{x}^*$, i.e., $E d(X, \hat{X}) \ge E d(X, \hat{x}^*) = D_{max}$.
• This leads to a contradiction because
$$D' \ge E d(X, \hat{X}) \ge D_{max}.$$
3. Show that the inequality constraint in $R_I(D)$ can be replaced by an equality constraint by contradiction.
• Assume that $R_I(D)$ is achieved by some $\hat{X}^*$ such that $E d(X, \hat{X}^*) = D'' < D$.
• Then
$$R_I(D'') = \min_{\hat{X} : E d(X, \hat{X}) \le D''} I(X; \hat{X}) \le I(X; \hat{X}^*) = R_I(D),$$
a contradiction because $R_I(D)$ is strictly decreasing for $0 \le D \le D_{max}$.
• Therefore, $E d(X, \hat{X}^*) = D$.

Remark In all problems of interest, $R(0) = R_I(0) > 0$. Otherwise, $R(D) = 0$ for all $D \ge 0$ because $R(D)$ is nonnegative and non-increasing.
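Example 8.20 (Binary Source)
Let $X$ be a binary random variable with
$$\Pr\{X = 0\} = 1 - \gamma \quad \text{and} \quad \Pr\{X = 1\} = \gamma.$$
Let $\hat{\mathcal{X}} = \{0, 1\}$ and $d$ be the Hamming distortion measure. Show that
$$R_I(D) = \begin{cases} h_b(\gamma) - h_b(D) & \text{if } 0 \le D < \min(\gamma, 1 - \gamma) \\ 0 & \text{if } D \ge \min(\gamma, 1 - \gamma). \end{cases}$$
First consider $0 \le \gamma \le \frac{1}{2}$, and show that
$$R_I(D) = \begin{cases} h_b(\gamma) - h_b(D) & \text{if } 0 \le D < \gamma \\ 0 & \text{if } D \ge \gamma. \end{cases}$$

• $\hat{x}^* = 0$ and $D_{max} = E d(X, 0) = \Pr\{X = 1\} = \gamma$.
• Consider any $\hat{X}$ and let $Y = d(X, \hat{X})$.
• Conditioning on $\hat{X}$, $X$ and $Y$ determine each other, and so, $H(X|\hat{X}) = H(Y|\hat{X})$.
• Then for $D < \gamma = D_{max}$ and any $\hat{X}$ such that $E d(X, \hat{X}) \le D$,
$$\begin{aligned}
I(X; \hat{X}) &= H(X) - H(X|\hat{X}) \\
&= h_b(\gamma) - H(Y|\hat{X}) \\
&\ge h_b(\gamma) - H(Y) && (1) \\
&= h_b(\gamma) - h_b(\Pr\{X \ne \hat{X}\}) \\
&\overset{a)}{\ge} h_b(\gamma) - h_b(D), && (2)
\end{aligned}$$
where a) holds because $\Pr\{X \ne \hat{X}\} = E d(X, \hat{X}) \le D$ and $h_b(a)$ is increasing for $0 \le a \le \frac{1}{2}$.
• Therefore,
$$R_I(D) = \min_{\hat{X} : E d(X, \hat{X}) \le D} I(X; \hat{X}) \ge h_b(\gamma) - h_b(D).$$

We now need to construct an $\hat{X}$ which is tight for (1) and (2), so that the above bound is achieved.

• (1) tight $\Leftrightarrow$ $Y$ independent of $\hat{X}$
• (2) tight $\Leftrightarrow$ $\Pr\{X \ne \hat{X}\} = D$
• The required $\hat{X}$ can be specified by the following reverse BSC, with input $\hat{X}$, output $X$, and crossover probability $D$:
$$\Pr\{\hat{X} = 0\} = \frac{1 - \gamma - D}{1 - 2D}, \qquad \Pr\{\hat{X} = 1\} = \frac{\gamma - D}{1 - 2D},$$
$$\Pr\{X = \hat{x} \mid \hat{X} = \hat{x}\} = 1 - D, \qquad \Pr\{X \ne \hat{x} \mid \hat{X} = \hat{x}\} = D,$$
so that the output distribution is $\Pr\{X = 0\} = 1 - \gamma$ and $\Pr\{X = 1\} = \gamma$.
• Therefore, we conclude that
$$R_I(D) = \begin{cases} h_b(\gamma) - h_b(D) & \text{if } 0 \le D < \gamma \\ 0 & \text{if } D \ge \gamma. \end{cases}$$

For $1/2 \le \gamma \le 1$, by exchanging the roles of the symbols 0 and 1 and applying the same argument, we obtain $R_I(D)$ as above except that $\gamma$ is replaced by $1 - \gamma$. Combining the two cases, we have
$$R_I(D) = \begin{cases} h_b(\gamma) - h_b(D) & \text{if } 0 \le D < \min(\gamma, 1 - \gamma) \\ 0 & \text{if } D \ge \min(\gamma, 1 - \gamma) \end{cases}$$
for $0 \le \gamma \le 1$.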
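The reverse BSC construction in Example 8.20 can be verified numerically: build the joint distribution of $(\hat{X}, X)$ from the input distribution and crossover probability $D$, then check the source marginal, the expected Hamming distortion, and the mutual information. The function names and the test values $\gamma = 0.3$, $D = 0.1$ are illustrative assumptions.

```python
from math import log2


def hb(a):
    """Binary entropy function in bits."""
    return 0.0 if a in (0.0, 1.0) else -a * log2(a) - (1 - a) * log2(1 - a)


def reverse_bsc_joint(gamma, D):
    """Joint pmf joint[x_hat][x] of the reverse BSC, for 0 <= D < gamma <= 1/2."""
    q1 = (gamma - D) / (1 - 2 * D)   # Pr{X_hat = 1}
    q0 = 1 - q1                      # Pr{X_hat = 0} = (1 - gamma - D)/(1 - 2D)
    return [[q0 * (1 - D), q0 * D],
            [q1 * D, q1 * (1 - D)]]


def mutual_information(joint):
    """I(X; X_hat) in bits from a joint pmf given as a 2-D list."""
    row = [sum(r) for r in joint]
    col = [sum(joint[i][j] for i in range(len(joint))) for j in range(len(joint[0]))]
    return sum(
        p * log2(p / (row[i] * col[j]))
        for i, r in enumerate(joint)
        for j, p in enumerate(r)
        if p > 0
    )


gamma, D = 0.3, 0.1
joint = reverse_bsc_joint(gamma, D)
pr_x1 = joint[0][1] + joint[1][1]        # marginal Pr{X = 1}
distortion = joint[0][1] + joint[1][0]   # Pr{X != X_hat}
print(pr_x1, distortion, mutual_information(joint), hb(gamma) - hb(D))
```

The last two printed values should agree up to rounding, which is the statement that this $\hat{X}$ attains the lower bound $h_b(\gamma) - h_b(D)$.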
[Figure: plot of $R_I(D)$ versus $D$, with axis ticks at 0 and 0.5]
A Remark
The rate-distortion theorem does not include the source coding theorem as a special case:
• In Example 8.20, $R_I(0) = h_b(\gamma) = H(X)$.
• By the rate-distortion theorem, if $R > H(X)$, the average Hamming distortion, i.e., the error probability per symbol, can be made arbitrarily small.
• However, by the source coding theorem, if $R > H(X)$, the message error probability can be made arbitrarily small, which is much stronger.
8.4 The Converse
• Prove that for any achievable rate-distortion pair $(R, D)$, $R \ge R_I(D)$.
• Fix $D$ and minimize $R$ over all achievable pairs $(R, D)$ to conclude that $R(D) \ge R_I(D)$.

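1. Let $(R, D)$ be any achievable rate-distortion pair, i.e., for any $\epsilon > 0$, there exists for sufficiently large $n$ an $(n, M)$ code such that
$$\frac{1}{n} \log M \le R + \epsilon$$
and
$$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} \le \epsilon,$$
where $\hat{\mathbf{X}} = g(f(\mathbf{X}))$.
2. Then
$$\begin{aligned}
n(R + \epsilon) &\ge \log M \\
&\;\;\vdots \\
&\ge \sum_{k=1}^{n} I(X_k; \hat{X}_k) \\
&\overset{c)}{\ge} \sum_{k=1}^{n} R_I(E d(X_k, \hat{X}_k)) \\
&= n \left[ \frac{1}{n} \sum_{k=1}^{n} R_I(E d(X_k, \hat{X}_k)) \right] \\
&\overset{d)}{\ge} n R_I\left( \frac{1}{n} \sum_{k=1}^{n} E d(X_k, \hat{X}_k) \right) \\
&= n R_I(E d(\mathbf{X}, \hat{\mathbf{X}})).
\end{aligned}$$
c) follows from the definition of $R_I(D)$.
d) follows from the convexity of $R_I(D)$ and Jensen's inequality.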
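3. Let $d_{max} = \max_{x, \hat{x}} d(x, \hat{x})$. Then
$$\begin{aligned}
E d(\mathbf{X}, \hat{\mathbf{X}}) &= E[d(\mathbf{X}, \hat{\mathbf{X}}) \mid d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon] \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} \\
&\quad + E[d(\mathbf{X}, \hat{\mathbf{X}}) \mid d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon] \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon\} \\
&\le d_{max} \cdot \epsilon + (D + \epsilon) \cdot 1 \\
&= D + (d_{max} + 1)\epsilon.
\end{aligned}$$
That is, if the probability that the average distortion between $\mathbf{X}$ and $\hat{\mathbf{X}}$ exceeds $D + \epsilon$ is small, then the expected average distortion between $\mathbf{X}$ and $\hat{\mathbf{X}}$ can exceed $D$ only by a small amount.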

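4. Therefore,
$$\begin{aligned}
R + \epsilon &\ge R_I(E d(\mathbf{X}, \hat{\mathbf{X}})) \\
&\ge R_I(D + (d_{max} + 1)\epsilon),
\end{aligned}$$
because $R_I(D)$ is non-increasing in $D$.

5. $R_I(D)$ convex implies it is continuous in $D$. Finally,
$$\begin{aligned}
R &\ge \lim_{\epsilon \to 0} R_I(D + (d_{max} + 1)\epsilon) \\
&= R_I\left( D + (d_{max} + 1) \lim_{\epsilon \to 0} \epsilon \right) \\
&= R_I(D).
\end{aligned}$$
Minimizing $R$ over all achievable pairs $(R, D)$ for a fixed $D$, we obtain
$$R(D) \ge R_I(D).$$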
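Step 2 of the converse passes from $\log M$ to $\sum_k I(X_k; \hat{X}_k)$ without showing the intermediate inequalities. One standard way to fill them in is sketched below. It is a reconstruction rather than a quotation from the slides. It uses only that $f(\mathbf{X})$ takes at most $M$ values, that $\hat{\mathbf{X}} = g(f(\mathbf{X}))$ is a function of $f(\mathbf{X})$, that the $X_k$ are independent, and that conditioning does not increase entropy.

```latex
\begin{align*}
\log M &\ge H(f(\mathbf{X})) \\
       &\ge H(\hat{\mathbf{X}}) \\
       &\ge I(\hat{\mathbf{X}}; \mathbf{X}) \\
       &= H(\mathbf{X}) - H(\mathbf{X} \mid \hat{\mathbf{X}}) \\
       &= \sum_{k=1}^{n} H(X_k)
          - \sum_{k=1}^{n} H(X_k \mid \hat{\mathbf{X}}, X_1, \ldots, X_{k-1}) \\
       &\ge \sum_{k=1}^{n} H(X_k) - \sum_{k=1}^{n} H(X_k \mid \hat{X}_k) \\
       &= \sum_{k=1}^{n} I(X_k; \hat{X}_k).
\end{align*}
```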
8.5 Achievability of RI(D)
• An i.i.d. source $\{X_k : k \ge 1\}$ with generic random variable $X \sim p(x)$ is given.
• For every random variable $\hat{X}$ taking values in $\hat{\mathcal{X}}$ with $E d(X, \hat{X}) \le D$, where $0 \le D \le D_{max}$, prove that the rate-distortion pair $(I(X; \hat{X}), D)$ is achievable by showing for large $n$ the existence of a rate-distortion code such that
1. the rate of the code is not more than $I(X; \hat{X}) + \epsilon$;
2. $d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon$ with probability almost 1.

• Minimize $I(X; \hat{X})$ over all such $\hat{X}$ to conclude that $(R_I(D), D)$ is achievable.
• This implies that $R_I(D) \ge R(D)$.
Random Coding Scheme
• Fix $\epsilon > 0$ and $\hat{X}$ with $E d(X, \hat{X}) \le D$, where $0 \le D \le D_{max}$. Let $\delta$ be specified later.
• Let $M$ be an integer satisfying
$$I(X; \hat{X}) + \frac{\epsilon}{2} \le \frac{1}{n} \log M \le I(X; \hat{X}) + \epsilon,$$
where $n$ is sufficiently large.
• The random coding scheme:
1. Construct a codebook $\mathcal{C}$ of an $(n, M)$ code by randomly generating $M$ codewords in $\hat{\mathcal{X}}^n$ independently and identically according to $p(\hat{x})^n$. Denote these codewords by $\hat{\mathbf{X}}(1), \hat{\mathbf{X}}(2), \cdots, \hat{\mathbf{X}}(M)$.
2. Reveal the codebook $\mathcal{C}$ to both the encoder and the decoder.
3. The source sequence $\mathbf{X}$ is generated according to $p(x)^n$.
4. The encoder encodes the source sequence $\mathbf{X}$ into an index $K$ in the set $\mathcal{I} = \{1, 2, \cdots, M\}$. The index $K$ takes the value $i$ if
(a) $(\mathbf{X}, \hat{\mathbf{X}}(i)) \in T^n_{[X\hat{X}]\delta}$;
(b) for all $i' \in \mathcal{I}$, if $(\mathbf{X}, \hat{\mathbf{X}}(i')) \in T^n_{[X\hat{X}]\delta}$, then $i' \le i$;
i.e., if there exists more than one $i$ satisfying (a), let $K$ be the largest one. Otherwise, $K$ takes the constant value 1.
5. The index $K$ is delivered to the decoder.
6. The decoder outputs $\hat{\mathbf{X}}(K)$ as the reproduction sequence $\hat{\mathbf{X}}$.
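The six steps of the random coding scheme can be sketched as a small simulation. The choices below are assumptions made for illustration only: a binary source with Hamming distortion, the reverse-BSC joint distribution of Example 8.20 as $p(x, \hat{x})$, and a short block length with a fixed codebook size instead of $M \approx 2^{n(I(X;\hat{X}) + \epsilon)}$. With such a small $n$, typicality is only a coarse approximation of the asymptotic behaviour.

```python
import random
from collections import Counter


def is_jointly_typical(xs, x_hats, p_joint, delta):
    """Strong joint typicality: sum over (a, b) of |N(a, b)/n - p(a, b)| <= delta,
    with no occurrence of any pair that has probability zero."""
    n = len(xs)
    counts = Counter(zip(xs, x_hats))
    if any(p_joint.get(pair, 0.0) == 0.0 for pair in counts):
        return False
    return sum(abs(counts.get(pair, 0) / n - p) for pair, p in p_joint.items()) <= delta


def encode(xs, codebook, p_joint, delta):
    """Step 4: largest 1-based index whose codeword is jointly typical with xs,
    or the constant value 1 if there is none."""
    k = 1
    for i, codeword in enumerate(codebook, start=1):
        if is_jointly_typical(xs, codeword, p_joint, delta):
            k = i
    return k


def simulate(n=16, M=512, gamma=0.3, D=0.1, delta=0.2, seed=0):
    """Run the scheme once; return (K, typical?, average Hamming distortion)."""
    rng = random.Random(seed)
    q1 = (gamma - D) / (1 - 2 * D)  # Pr{X_hat = 1} under the reverse BSC
    # Joint pmf with keys (x, x_hat).
    p_joint = {(0, 0): (1 - q1) * (1 - D), (1, 0): (1 - q1) * D,
               (0, 1): q1 * D, (1, 1): q1 * (1 - D)}
    # Step 1: i.i.d. codewords drawn according to p(x_hat)^n.
    codebook = [[1 if rng.random() < q1 else 0 for _ in range(n)] for _ in range(M)]
    # Step 3: source sequence drawn according to p(x)^n.
    xs = [1 if rng.random() < gamma else 0 for _ in range(n)]
    k = encode(xs, codebook, p_joint, delta)   # steps 4 and 5
    x_hats = codebook[k - 1]                   # step 6: decoder output
    typical = is_jointly_typical(xs, x_hats, p_joint, delta)
    distortion = sum(a != b for a, b in zip(xs, x_hats)) / n
    return k, typical, distortion


for seed in range(3):
    print(simulate(seed=seed))
```

Whenever the returned codeword is jointly typical with the source sequence, its average Hamming distortion is at most $D + d_{max}\delta$ with $d_{max} = 1$. This is the property that the achievability proof relies on.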
Performance Analysis
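• The event $\{K = 1\}$ occurs in one of the following two scenarios:
1. $\hat{\mathbf{X}}(1)$ is the only codeword in $\mathcal{C}$ which is jointly typical with $\mathbf{X}$.
2. No codeword in $\mathcal{C}$ is jointly typical with $\mathbf{X}$.
In other words, if $K = 1$, then $\mathbf{X}$ is jointly typical with none of the codewords $\hat{\mathbf{X}}(2), \hat{\mathbf{X}}(3), \cdots, \hat{\mathbf{X}}(M)$.
• Define the event
$$E_i = \left\{ (\mathbf{X}, \hat{\mathbf{X}}(i)) \in T^n_{[X\hat{X}]\delta} \right\}.$$
• Then
$$\{K = 1\} \subset E_2^c \cap E_3^c \cap \cdots \cap E_M^c.$$
• Since the codewords are generated i.i.d., conditioning on $\{\mathbf{X} = \mathbf{x}\}$ for any $\mathbf{x} \in \mathcal{X}^n$, the events $E_i$ are mutually independent and have the same probability.
• Then for any $\mathbf{x} \in \mathcal{X}^n$,
$$\begin{aligned}
\Pr\{K = 1 \mid \mathbf{X} = \mathbf{x}\} &\le \Pr\{E_2^c \cap E_3^c \cap \cdots \cap E_M^c \mid \mathbf{X} = \mathbf{x}\} \\
&= \prod_{i=2}^{M} \Pr\{E_i^c \mid \mathbf{X} = \mathbf{x}\} \\
&= \left( \Pr\{E_1^c \mid \mathbf{X} = \mathbf{x}\} \right)^{M-1} \\
&= \left( 1 - \Pr\{E_1 \mid \mathbf{X} = \mathbf{x}\} \right)^{M-1}.
\end{aligned}$$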

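• We will focus on $\mathbf{x} \in S^n_{[X]\delta}$ where
$$S^n_{[X]\delta} = \left\{ \mathbf{x} \in T^n_{[X]\delta} : |T^n_{[\hat{X}|X]\delta}(\mathbf{x})| \ge 1 \right\},$$
because $\Pr\{\mathbf{X} \in S^n_{[X]\delta}\} \approx 1$ for large $n$ (Proposition 6.13).
• For $\mathbf{x} \in S^n_{[X]\delta}$, obtain a lower bound on $\Pr\{E_1 \mid \mathbf{X} = \mathbf{x}\}$ as follows:
$$\begin{aligned}
\Pr\{E_1 \mid \mathbf{X} = \mathbf{x}\} &= \Pr\left\{ (\mathbf{x}, \hat{\mathbf{X}}(1)) \in T^n_{[X\hat{X}]\delta} \right\} \\
&= \sum_{\hat{\mathbf{x}} \in T^n_{[\hat{X}|X]\delta}(\mathbf{x})} p(\hat{\mathbf{x}}) \\
&\overset{a)}{\ge} \sum_{\hat{\mathbf{x}} \in T^n_{[\hat{X}|X]\delta}(\mathbf{x})} 2^{-n(H(\hat{X}) + \eta)} \\
&\overset{b)}{\ge} 2^{n(H(\hat{X}|X) - \xi)} \, 2^{-n(H(\hat{X}) + \eta)} \\
&= 2^{-n(H(\hat{X}) - H(\hat{X}|X) + \xi + \eta)} \\
&= 2^{-n(I(X; \hat{X}) + \zeta)},
\end{aligned}$$
where $\zeta = \xi + \eta \to 0$ as $\delta \to 0$. In the above,
a) follows because from the consistency of strong typicality, if $(\mathbf{x}, \hat{\mathbf{x}}) \in T^n_{[X\hat{X}]\delta}$, then $\hat{\mathbf{x}} \in T^n_{[\hat{X}]\delta}$;
b) follows from conditional strong AEP.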

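• Therefore,
$$\begin{aligned}
\Pr\{K = 1 \mid \mathbf{X} = \mathbf{x}\} &\le \Pr\{E_2^c \cap E_3^c \cap \cdots \cap E_M^c \mid \mathbf{X} = \mathbf{x}\} \\
&\le \left[ 1 - 2^{-n(I(X; \hat{X}) + \zeta)} \right]^{M-1}.
\end{aligned}$$
• Then
$$\begin{aligned}
\ln \Pr\{K = 1 \mid \mathbf{X} = \mathbf{x}\} &\le (M - 1) \ln \left[ 1 - 2^{-n(I(X; \hat{X}) + \zeta)} \right] \\
&\overset{a)}{\le} \left( 2^{n(I(X; \hat{X}) + \frac{\epsilon}{2})} - 1 \right) \ln \left[ 1 - 2^{-n(I(X; \hat{X}) + \zeta)} \right] \\
&\overset{b)}{\le} - \left( 2^{n(I(X; \hat{X}) + \frac{\epsilon}{2})} - 1 \right) 2^{-n(I(X; \hat{X}) + \zeta)} \\
&= - \left[ 2^{n(\frac{\epsilon}{2} - \zeta)} - 2^{-n(I(X; \hat{X}) + \zeta)} \right].
\end{aligned}$$
a) follows because the logarithm is negative and $M \ge 2^{n(I(X; \hat{X}) + \frac{\epsilon}{2})}$.
b) follows from the fundamental inequality $\ln a \le a - 1$.
• Let $\delta$ be sufficiently small so that
$$\frac{\epsilon}{2} - \zeta > 0. \qquad (1)$$
Then the upper bound on $\ln \Pr\{K = 1 \mid \mathbf{X} = \mathbf{x}\}$ tends to $-\infty$ as $n \to \infty$, i.e., $\Pr\{K = 1 \mid \mathbf{X} = \mathbf{x}\} \to 0$ as $n \to \infty$.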
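• This implies for sufficiently large $n$,
$$\Pr\{K = 1 \mid \mathbf{X} = \mathbf{x}\} \le \frac{\epsilon}{2}.$$
• It follows that
$$\begin{aligned}
\Pr\{K = 1\} &= \sum_{\mathbf{x} \in S^n_{[X]\delta}} \Pr\{K = 1 \mid \mathbf{X} = \mathbf{x}\} \Pr\{\mathbf{X} = \mathbf{x}\} + \sum_{\mathbf{x} \notin S^n_{[X]\delta}} \Pr\{K = 1 \mid \mathbf{X} = \mathbf{x}\} \Pr\{\mathbf{X} = \mathbf{x}\} \\
&\le \sum_{\mathbf{x} \in S^n_{[X]\delta}} \frac{\epsilon}{2} \cdot \Pr\{\mathbf{X} = \mathbf{x}\} + \sum_{\mathbf{x} \notin S^n_{[X]\delta}} 1 \cdot \Pr\{\mathbf{X} = \mathbf{x}\} \\
&= \frac{\epsilon}{2} \cdot \Pr\{\mathbf{X} \in S^n_{[X]\delta}\} + \Pr\{\mathbf{X} \notin S^n_{[X]\delta}\} \\
&\le \frac{\epsilon}{2} \cdot 1 + \left( 1 - \Pr\{\mathbf{X} \in S^n_{[X]\delta}\} \right) \\
&< \frac{\epsilon}{2} + \delta,
\end{aligned}$$
where we have invoked Proposition 6.13 in the last step.
• By letting $\delta$ be sufficiently small so that both (1) and $\delta < \frac{\epsilon}{2}$ are satisfied, we obtain
$$\Pr\{K = 1\} < \epsilon.$$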
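The way the bound $\left[1 - 2^{-n(I(X;\hat{X}) + \zeta)}\right]^{M-1}$ collapses as $n$ grows can be seen numerically. The sketch below evaluates its natural logarithm with $M$ taken as the smallest integer satisfying $\frac{1}{n}\log M \ge I(X;\hat{X}) + \frac{\epsilon}{2}$. The numerical values of $I(X;\hat{X})$, $\epsilon$ and $\zeta$ are arbitrary assumptions satisfying $\frac{\epsilon}{2} - \zeta > 0$.

```python
from math import ceil, log1p


def log_prob_no_match_bound(n, mutual_info, eps, zeta):
    """Natural log of [1 - 2^{-n(I + zeta)}]^{M-1},
    with M = ceil(2^{n(I + eps/2)}); log1p keeps precision for tiny arguments."""
    M = ceil(2 ** (n * (mutual_info + eps / 2)))
    u = 2 ** (-n * (mutual_info + zeta))
    return (M - 1) * log1p(-u)


# Assumed values: I(X; X_hat) = 0.4 bits, eps = 0.2, zeta = 0.05.
for n in (50, 100, 200):
    print(n, log_prob_no_match_bound(n, 0.4, 0.2, 0.05))
```

The logarithm behaves like $-2^{n(\epsilon/2 - \zeta)}$, so the probability bound decays doubly exponentially in $n$ once condition (1) holds.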
Main Idea
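• Randomly generate $M$ codewords in $\hat{\mathcal{X}}^n$ according to $p(\hat{x})^n$, where $n$ is large.
• $\mathbf{X} \in S^n_{[X]\delta}$ with high probability.
• For $\mathbf{x} \in S^n_{[X]\delta}$, by conditional strong AEP,
$$\Pr\left\{ (\mathbf{X}, \hat{\mathbf{X}}(i)) \in T^n_{[X\hat{X}]\delta} \mid \mathbf{X} = \mathbf{x} \right\} \approx 2^{-nI(X; \hat{X})}.$$
• If $M$ grows with $n$ at a rate higher than $I(X; \hat{X})$, then the probability that there exists at least one $\hat{\mathbf{X}}(i)$ which is jointly typical with the source sequence $\mathbf{X}$ with respect to $p(x, \hat{x})$ is high.
• Such an $\hat{\mathbf{X}}(i)$, if it exists, would have $d(\mathbf{X}, \hat{\mathbf{X}}) \approx E d(X, \hat{X}) \le D$, because the joint relative frequency of $(\mathbf{x}, \hat{\mathbf{X}}(i))$ is approximately $p(x, \hat{x})$.
• Use this $\hat{\mathbf{X}}(i)$ to represent $\mathbf{X}$ to satisfy the distortion constraint.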

The Remaining Details
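• For sufficiently large $n$, consider
$$\begin{aligned}
\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} &= \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K = 1\} \Pr\{K = 1\} \\
&\quad + \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K \ne 1\} \Pr\{K \ne 1\} \\
&\le 1 \cdot \epsilon + \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K \ne 1\} \cdot 1 \\
&= \epsilon + \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K \ne 1\}.
\end{aligned}$$
• Conditioning on $\{K \ne 1\}$, we have $(\mathbf{X}, \hat{\mathbf{X}}) \in T^n_{[X\hat{X}]\delta}$.
• It can be shown that (see textbook)
$$d(\mathbf{X}, \hat{\mathbf{X}}) \le D + d_{max}\delta.$$
By taking $\delta \le \frac{\epsilon}{d_{max}}$, we obtain $d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon$.
• Therefore, $\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K \ne 1\} = 0$, which implies $\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} \le \epsilon$.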
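[Figure: approximately $2^{nH(X)}$ sequences in $T^n_{[X]\delta}$ are covered by approximately $2^{nI(X; \hat{X})}$ codewords in $T^n_{[\hat{X}]\delta}$, each codeword being jointly typical with approximately $2^{nH(X|\hat{X})}$ source sequences.]

The number of codewords must be at least approximately $2^{nI(X; \hat{X})}$.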
