
Entropy, lecture 1

Some basic questions about counting sequences

1 Introduction
Within mathematics, ‘entropy’ is a quantity which answers a simple and natural
problem about counting sequences. It has several variants and generalizations,
corresponding to variants and generalizations of that counting problem.
This counting problem appears naturally in several other branches of mathe-
matics and science. The best known are information theory and statistical physics.
Although the underlying mathematics has an essential unity, those two fields use
it to answer questions of different kinds. In information theory, the counting
problem is given an operational meaning, in terms of the possibility of communi-
cating a certain amount of data in a certain way (or related questions). In statisti-
cal physics, it has phenomenological consequences, describing why certain large
physical systems seem to behave the way they do.
This course is primarily about the mathematical meaning of entropy, but we
also indicate its relevance to those other fields.

2 Preliminaries
2.1 Setting: types and typical sequences
Let A be a finite set. We call it the alphabet.

• A probability vector on A is a real vector $p = (p_a)_{a \in A} \in \mathbb{R}^A$ satisfying
$$p_a \ge 0 \ \ \forall a \in A \quad \text{and} \quad \sum_{a \in A} p_a = 1.$$

This vector is rational if every $p_a$ is rational. It has denominator n if $np_a$ is an integer for every a. (Note: in this usage, a probability vector with denominator n also has denominator m whenever m is a multiple of n.)
• The probability simplex on A is the set of all probability vectors on A: that is, the simplex
$$\mathrm{Prob}(A) = \Big\{ (p_a)_{a \in A} : p_a \ge 0 \ \forall a \ \text{and} \ \sum_a p_a = 1 \Big\} \subset \mathbb{R}^A.$$

More generally, if X is a metric space, then Prob(X) denotes the convex set of all Borel probability measures on X. We use the notation this way at some later points in the course.
• The total variation norm on Prob(A) is defined by
$$\|p - q\| = \sum_{a \in A} |p_a - q_a| = 2 \max_{B \subseteq A} |p(B) - q(B)|.$$

• Let n be a positive integer, and let $x = (x_1, \ldots, x_n) \in A^n$. Given a letter $a \in A$, let
$$N(a \mid x) = |\{i = 1, \ldots, n : x_i = a\}|$$
be the number of times that this letter occurs in x. The frequency of a in x is
$$p_{x,a} = \frac{N(a \mid x)}{n}.$$
Together, these frequencies form a probability vector $p_x = (p_{x,a})_{a \in A}$. It is called the empirical distribution or type of x.
• Given $p \in \mathrm{Prob}(A)$ and $n \ge 1$, the type class of p is the set
$$T_n(p) := \{x \in A^n : p_x = p\}.$$
Clearly $T_n(p) = \emptyset$ unless p has denominator n.
• Sometimes it is important to allow an approximation. Given $\delta > 0$, we say that $x \in A^n$ is δ-typical for p if
$$\|p_x - p\| < \delta.$$
(Conventions are not quite fixed. Some authors call this ‘strongly typical’. Also, some make the extra requirement that $N(a \mid x)$ is strictly zero for all a such that $p_a = 0$; I prefer to do without this.)
We write $T_{n,\delta}(p)$ for the set of all δ-typical strings for p in $A^n$.
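These definitions are concrete enough to experiment with directly. The following minimal Python sketch implements the empirical distribution, the type class, and δ-typicality by brute force (the function names are my own, chosen for this illustration):

```python
from collections import Counter
from itertools import product

def empirical_distribution(x, alphabet):
    """The type p_x of x: a dict a -> N(a|x)/n of letter frequencies."""
    counts = Counter(x)
    n = len(x)
    return {a: counts.get(a, 0) / n for a in alphabet}

def type_class(p, alphabet, n):
    """T_n(p): all strings in A^n whose type is exactly p (brute-force search)."""
    return [x for x in product(alphabet, repeat=n)
            if all(Counter(x).get(a, 0) == p[a] * n for a in alphabet)]

def is_delta_typical(x, p, alphabet, delta):
    """Whether ||p_x - p|| < delta in the total variation (ell^1) norm."""
    px = empirical_distribution(x, alphabet)
    return sum(abs(px[a] - p[a]) for a in alphabet) < delta

A = ('0', '1')
p = {'0': 0.5, '1': 0.5}                    # has denominator 4 when n = 4
print(len(type_class(p, A, 4)))             # 6 = binom(4, 2)
print(is_delta_typical('0010', p, A, 0.6))  # ||(3/4, 1/4) - p|| = 1/2 < 0.6: True
```

Of course, enumerating $A^n$ is only feasible for tiny n; the point of the estimates below is to measure $|T_n(p)|$ without enumeration.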

2.2 Asymptotic notation
We frequently use Landau notation: if f : N −→ R and g : N −→ (0, ∞), then

i) “f = O(g)” means that there is a constant C > 0 such that
$$|f(n)| \le C g(n) \quad \forall n \in \mathbb{N};$$

ii) “f = o(g)” means that
$$\frac{f(n)}{g(n)} \longrightarrow 0 \quad \text{as } n \longrightarrow \infty.$$

So
$$f = o(g) \implies f = O(g),$$
but not the reverse. For example:
$$\frac{10}{n} = o(1) \quad \text{and} \quad 1 + \frac{10}{n} = O(1).$$
Landau notation is especially useful as a placeholder for some part of a more
complicated expression which still indicates the relevant behaviour. For example:
any real polynomial p satisfies
$$p(n) = 2^{o(n)}.$$
(This example appears frequently later, but of course it is very far from tight. For example, for any $\alpha \in (0, 1)$ the function $2^{n^\alpha}$ grows much faster than any polynomial but is still $2^{o(n)}$.)
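As a quick numerical sanity check on the claim $p(n) = 2^{o(n)}$, one can watch the normalized exponent $\log_2 p(n)/n$ shrink; a minimal sketch for $p(n) = n^{10}$:

```python
import math

# log2(n^10) / n -> 0 as n grows, so n^10 = 2^{o(n)}.
for n in (10, 100, 1000, 10_000):
    print(n, 10 * math.log2(n) / n)   # 3.32..., 0.66..., 0.099..., 0.013...
```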
We sometimes use another, less standard convention. If $\varepsilon_1 > 0$ and $f : (0, \varepsilon_1) \longrightarrow (0, \infty)$, then “$f(\varepsilon) = \Delta(\varepsilon)$” means that $f(\varepsilon) \longrightarrow 0$ as $\varepsilon \longrightarrow 0$.
If these estimates depend on some other quantities $A, B, \ldots$ (for instance, if the constant C in (i) above is a function of $A, B, \ldots$), then we may record these as subscripts: that is, we write $f = O_{A,B,\ldots}(g)$ or $f = o_{A,B,\ldots}(g)$ or $\Delta_{A,B,\ldots}(\varepsilon)$. In other places we leave this dependence to be inferred from the context.

3 The basic counting problems and Shannon entropy


Informally, here is the basic type multiplicity problem:

Given a probability vector p on A, how many sequences are there in $A^n$ of type p?
This problem has many variants. For example, it can be more useful to have
a simple approximation than an exact count. Also, it is often more important to
count sequences that are δ-typical for some small positive δ, rather than exactly
typical. In more complicated problems, we may wish to count sequences with
some other property in addition to their type. Several of these variants appear
later in the course, but for now we stay close to the above.

3.1 Counting exactly typical sequences


Elementary combinatorics gives the exact answer to the type multiplicity question:
if $A = \{a_1, \ldots, a_k\}$, then $T_n(p)$ is nonempty if and only if p has denominator n, and in this case
$$|T_n(p)| = \binom{n}{np_{a_1}, \ldots, np_{a_k}} = \frac{n!}{(np_{a_1})! \cdots (np_{a_k})!}.$$
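The exact formula is easy to sanity-check on a small example. Here is a minimal Python sketch comparing brute-force enumeration of a type class against the multinomial coefficient (the helper name `multinomial` is mine):

```python
from collections import Counter
from itertools import product
from math import factorial, prod

def multinomial(n, counts):
    """The multinomial coefficient n! / (c_1! ... c_k!)."""
    return factorial(n) // prod(factorial(c) for c in counts)

A = ('a', 'b', 'c')
n = 6
counts = {'a': 3, 'b': 2, 'c': 1}   # type p = (1/2, 1/3, 1/6), denominator 6

brute = sum(1 for x in product(A, repeat=n) if Counter(x) == counts)
print(brute, multinomial(n, counts.values()))   # both print 60
```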
But multinomial coefficients can be difficult to manipulate. Often it’s more useful
to have estimates of a simpler form.
Theorem 3.1. If p has denominator n, then
$$\frac{1}{(n+1)^k} \, 2^{Hn} \le |T_n(p)| \le 2^{Hn},$$
where
$$H := -\sum_{a \in A} p_a \log_2 p_a.$$

Here and in the rest of the course we use the convention that 0 log2 0 = 0.
Often the even simpler estimate

$$2^{Hn - o(n)} \le |T_n(p)| \le 2^{Hn} \tag{1}$$

is sufficient. Here the implied estimate in the term ‘o(n)’ depends on |A| = k, but
not otherwise on p.
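Both bounds in Theorem 3.1 can be tested numerically on small cases; a minimal sketch (assuming the multinomial formula above for the exact count):

```python
from math import factorial, log2, prod

def H(p):
    """Shannon entropy in bits, with the convention 0 log2 0 = 0."""
    return -sum(q * log2(q) for q in p if q > 0)

n, counts = 30, (15, 10, 5)       # p = (1/2, 1/3, 1/6), k = 3, denominator 30
p = [c / n for c in counts]
size = factorial(n) // prod(factorial(c) for c in counts)   # |T_n(p)| exactly

lower = 2 ** (H(p) * n) / (n + 1) ** len(p)                 # 2^{Hn} / (n+1)^k
upper = 2 ** (H(p) * n)                                     # 2^{Hn}
print(lower <= size <= upper)     # True: roughly 5e8 <= 4.7e11 <= 1.5e13
```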
The proof of Theorem 3.1 shows nicely the role of discrete probability in ex-
plaining these results, as well as formulating them. Let $p^{\times n}$ denote the product probability distribution on $A^n$ under which the coordinates are i.i.d. $\sim p$.

Proof of Theorem 3.1. Upper bound. If $x \in T_n(p)$, then
$$p^{\times n}\{x\} = \prod_{i=1}^n p_{x_i} = \prod_{a \in A} p_a^{N(a \mid x)} = \prod_{a \in A} p_a^{np_a} = 2^{-Hn},$$
where the third equality holds because $p_x = p$. This calculation is the ultimate reason why Shannon entropy is so central to our story. As a result, we have
$$1 \ge p^{\times n}[T_n(p)] = \sum_{x \in T_n(p)} p^{\times n}\{x\} = |T_n(p)| \, 2^{-Hn}. \tag{2}$$
Re-arranging completes the proof.
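The key identity $p^{\times n}\{x\} = 2^{-Hn}$ for $x \in T_n(p)$ is worth checking by hand at least once; a minimal numerical sketch:

```python
from math import log2, prod

p = {'a': 0.5, 'b': 0.25, 'c': 0.25}
x = 'aabcabca'                    # type (1/2, 1/4, 1/4), so x lies in T_8(p)
n = len(x)

prob = prod(p[c] for c in x)      # the product measure of the single string x
H = -sum(q * log2(q) for q in p.values())
print(prob == 2 ** (-H * n))      # True: both sides equal 2^{-12}
```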


Lower bound. It is generally not true that $p^{\times n}[T_n(p)]$ is close to 1 in the estimate (2). However, an element of Prob(A) which has denominator n may be specified by the k values $p_a$, $a \in A$, each of which is an element of $\{0, 1/n, 2/n, \ldots, 1\}$. Hence there are at most $(n+1)^k$ elements of Prob(A) that have denominator n. The proof is completed by showing that
$$p^{\times n}[T_n(p)] \ge p^{\times n}[T_n(q)] \tag{3}$$
for any other type q. Indeed, from this it follows that
$$1 = p^{\times n}\Big[ \bigcup_{\substack{q \in \mathrm{Prob}(A) \\ \mathrm{denom}(q) = n}} T_n(q) \Big] \le (n+1)^k \cdot p^{\times n}[T_n(p)],$$
which re-arranges to the desired lower bound in light of (2).


To prove (3), let us substitute the formula for $p^{\times n}\{x\}$ and the exact multinomial formula for $|T_n(p)|$. After simplifying, these give
$$\frac{p^{\times n}[T_n(p)]}{p^{\times n}[T_n(q)]} = \prod_a \Big[ \frac{(nq_a)!}{(np_a)!} \, p_a^{(p_a - q_a)n} \Big].$$
Now we use the simple inequality
$$\frac{m!}{n!} \ge n^{m-n} \quad \forall n, m \in \mathbb{N} \cup \{0\}$$
(exercise!). Upon substituting, it leaves us with
$$\frac{p^{\times n}[T_n(p)]}{p^{\times n}[T_n(q)]} \ge \prod_a (np_a)^{(q_a - p_a)n} \, p_a^{(p_a - q_a)n} = \prod_a n^{(q_a - p_a)n} = n^{n \sum_a (q_a - p_a)} = n^{n-n} = 1.$$
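Inequality (3) says that, under $p^{\times n}$, no type class is more probable than that of p itself. This too can be confirmed by brute force over all types of denominator n; a minimal sketch (the helper name `type_prob` is mine):

```python
from math import factorial, prod

def type_prob(p, counts):
    """p^{x n}[T_n(q)] for the type q with letter counts `counts`."""
    n = sum(counts)
    size = factorial(n) // prod(factorial(c) for c in counts)
    return size * prod(pa ** c for pa, c in zip(p, counts))

n, p = 12, (0.5, 1/3, 1/6)
types = ((c0, c1, n - c0 - c1)
         for c0 in range(n + 1) for c1 in range(n + 1 - c0))
best = max(types, key=lambda counts: type_prob(p, counts))
print(best)   # (6, 4, 2): the most probable type is p itself
```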

The new quantity which appears in Theorem 3.1 is a function of the probability
vector p.
Definition 3.2 (Shannon entropy). The Shannon entropy of p is
$$H(p) := -\sum_{a \in A} p_a \log_2 p_a.$$

The basic properties of this function can be established by simple calculus. These simple facts have extensive consequences later in the course.
Proposition 3.3. The Shannon entropy function H on Prob(A) has the following
properties:
(a) It is continuous.
(b) It is strictly concave.
(c) It is a symmetric function of the values pa , a ∈ A.
(d) Its maximum value is log2 |A|, achieved precisely at the uniform distribu-
tion: pa = |A|−1 for all a ∈ A.
(e) Its minimum value is zero, achieved precisely at the delta masses, δa , for
a ∈ A.
Proof. Define $\eta : [0, 1] \longrightarrow \mathbb{R}$ by
$$\eta(t) := \begin{cases} -t \log_2 t & \text{if } t > 0 \\ 0 & \text{if } t = 0. \end{cases}$$
Some elementary analysis shows that η is continuous (including at 0) and strictly
concave. Conclusion (a) follows at once. Conclusion (b) also follows because, if
$p, q \in \mathrm{Prob}(A)$ and $0 < \lambda < 1$, then
$$H(\lambda p + (1-\lambda) q) = \sum_a \eta(\lambda p_a + (1-\lambda) q_a) \ge \sum_a \big[ \lambda \eta(p_a) + (1-\lambda) \eta(q_a) \big] = \lambda H(p) + (1-\lambda) H(q),$$
with equality if and only if $p_a = q_a$ for every a.


Conclusion (c) is immediate from the definition, and now conclusion (d) follows from the combination of (b) and (c) (exercise!). Finally, conclusion (e) is a direct calculation from the definition.
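Strict concavity is also easy to probe numerically on random pairs of probability vectors; a small sketch:

```python
import random
from math import log2

def H(p):
    """Shannon entropy in bits (terms with q = 0 contribute 0)."""
    return -sum(q * log2(q) for q in p if q > 0)

def random_prob_vector(k):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
for _ in range(1000):
    p, q = random_prob_vector(4), random_prob_vector(4)
    lam = random.random()
    mix = [lam * a + (1 - lam) * b for a, b in zip(p, q)]
    # concavity, with a tiny tolerance for floating-point rounding
    assert H(mix) >= lam * H(p) + (1 - lam) * H(q) - 1e-12
print("concavity verified on 1000 random samples")
```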

Some of these properties of H also have a more conceptual interpretation. We
derive a few of them later.
Example. Elements of $\mathrm{Prob}(\{0, 1\})$ may be written as $(p, 1 - p)$ for $p \in [0, 1]$. In this notation, Shannon’s entropy function is
$$H(p, 1 - p) = -p \log_2 p - (1 - p) \log_2 (1 - p).$$
It is concave and symmetric about $p = 1/2$. Its maximum value is 1, achieved at $p = 1/2$, and its minimum value is 0, achieved at $p = 0$ and $1$.
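A short sketch tabulating this function makes the symmetry and the extreme values visible:

```python
from math import log2

def h(p):
    """The binary entropy H(p, 1 - p), in bits."""
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:4.2f}   H(p, 1-p) = {h(p):.4f}")
# symmetric about 1/2; maximum 1.0000 at p = 0.5; minimum 0.0000 at p = 0 and 1
```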

4 Notes and remarks


Basic sources for this lecture: [CT06, Chapter 2], [Sha48].

References
[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory.
Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition,
2006.

[Sha48] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948.

Tim Austin
University of California, Los Angeles, CA 90025, USA
Email: tim@math.ucla.edu
URL: math.ucla.edu/~tim
