
Entropy, lecture 1

Some basic questions about counting sequences

1 Introduction
Within mathematics, ‘entropy’ is a quantity which answers a simple and natural
problem about counting sequences. It has several variants and generalizations,
corresponding to variants and generalizations of that counting problem.
This counting problem appears naturally in several other branches of mathe-
matics and science. The best known are information theory and statistical physics.
Although the underlying mathematics has an essential unity, those two fields use
it to answer questions of different kinds. In information theory, the counting
problem is given an operational meaning, in terms of the possibility of communi-
cating a certain amount of data in a certain way (or related questions). In statisti-
cal physics, it has phenomenological consequences, describing why certain large
physical systems seem to behave the way they do.
This course is primarily about the mathematical meaning of entropy, but we
also indicate its relevance to those other fields.

2 Preliminaries
2.1 Setting: types and typical sequences
Let A be a finite set. We call it the alphabet.

• A probability vector on A is a real vector $p = (p_a)_{a \in A} \in \mathbb{R}^A$ satisfying
$$p_a \ge 0 \ \ \forall a \in A \quad \text{and} \quad \sum_{a \in A} p_a = 1.$$

This vector is rational if every $p_a$ is rational. It has denominator n if $np_a$ is an integer for every a. (Note: in this usage, a probability vector with denominator n also has denominator m whenever m is a multiple of n.)
• The probability simplex on A is the set of all probability vectors on A: that is, the simplex
$$\mathrm{Prob}(A) = \Big\{ (p_a)_{a \in A} : p_a \ge 0 \ \forall a \ \text{and} \ \sum_a p_a = 1 \Big\} \subset \mathbb{R}^A.$$

More generally, if X is a metric space, then Prob(X) denotes the convex set of all Borel probability measures on X. We use the notation this way at some later points in the course.
• The total variation norm on Prob(A) is defined by
$$\|p - q\| = \sum_{a \in A} |p_a - q_a| = 2 \max_{B \subseteq A} |p(B) - q(B)|.$$

• Let n be a positive integer, and let $x = (x_1, \ldots, x_n) \in A^n$. Given a letter $a \in A$, let
$$N(a \mid x) = |\{i = 1, \ldots, n : x_i = a\}|$$
be the number of times that this letter occurs in x. The frequency of a in x is
$$p_{x,a} = \frac{N(a \mid x)}{n}.$$
Together, these frequencies form a probability vector $p_x = (p_{x,a})_{a \in A}$. It is called the empirical distribution or type of x.
• Given $p \in \mathrm{Prob}(A)$ and $n \ge 1$, the type class of p is the set
$$T_n(p) := \{x \in A^n : p_x = p\}.$$
Clearly $T_n(p) = \emptyset$ unless p has denominator n.
• Sometimes it is important to allow an approximation. Given $\delta > 0$, we say that $x \in A^n$ is δ-typical for p if
$$\|p_x - p\| < \delta.$$
(Conventions are not quite fixed. Some authors call this ‘strongly typical’. Also, some make the extra requirement that $N(a \mid x)$ is strictly zero for all a such that $p_a = 0$; I prefer to do without this.)
We write $T_{n,\delta}(p)$ for the set of all δ-typical strings for p in $A^n$.
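These definitions are concrete enough to experiment with directly. The following minimal Python sketch implements the empirical distribution, the type class, and δ-typicality by brute force (the function names are my own, chosen for this illustration):

```python
from collections import Counter
from itertools import product

def empirical_distribution(x, alphabet):
    """The type p_x of x: a dict a -> N(a|x)/n of letter frequencies."""
    counts = Counter(x)
    n = len(x)
    return {a: counts.get(a, 0) / n for a in alphabet}

def type_class(p, alphabet, n):
    """T_n(p): all strings in A^n whose type is exactly p (brute-force search)."""
    return [x for x in product(alphabet, repeat=n)
            if all(Counter(x).get(a, 0) == p[a] * n for a in alphabet)]

def is_delta_typical(x, p, alphabet, delta):
    """Whether ||p_x - p|| < delta in the total variation (ell^1) norm."""
    px = empirical_distribution(x, alphabet)
    return sum(abs(px[a] - p[a]) for a in alphabet) < delta

A = ('0', '1')
p = {'0': 0.5, '1': 0.5}                    # has denominator 4 when n = 4
print(len(type_class(p, A, 4)))             # 6 = binom(4, 2)
print(is_delta_typical('0010', p, A, 0.6))  # ||(3/4, 1/4) - p|| = 1/2 < 0.6: True
```

Of course, enumerating $A^n$ is only feasible for tiny n; the point of the estimates below is to measure $|T_n(p)|$ without enumeration.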

2.2 Asymptotic notation
We frequently use Landau notation: if f : N −→ R and g : N −→ (0, ∞), then

i) “f = O(g)” means that there is a constant C > 0 such that
$$|f(n)| \le C g(n) \quad \forall n \in \mathbb{N};$$

ii) “f = o(g)” means that
$$\frac{f(n)}{g(n)} \longrightarrow 0 \quad \text{as } n \longrightarrow \infty.$$

So
$$f = o(g) \implies f = O(g),$$
but not the reverse. For example:
$$\frac{10}{n} = o(1) \quad \text{and} \quad 1 + \frac{10}{n} = O(1).$$
Landau notation is especially useful as a placeholder for some part of a more
complicated expression which still indicates the relevant behaviour. For example:
any real polynomial p satisfies
$$p(n) = 2^{o(n)}.$$
(This example appears frequently later, but of course it is very far from tight. For example, for any $\alpha \in (0, 1)$ the function $2^{n^\alpha}$ grows much faster than any polynomial but is still $2^{o(n)}$.)
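As a quick numerical sanity check on the claim $p(n) = 2^{o(n)}$, one can watch the normalized exponent $\log_2 p(n)/n$ shrink; a minimal sketch for $p(n) = n^{10}$:

```python
import math

# log2(n^10) / n -> 0 as n grows, so n^10 = 2^{o(n)}.
for n in (10, 100, 1000, 10_000):
    print(n, 10 * math.log2(n) / n)   # 3.32..., 0.66..., 0.099..., 0.013...
```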
We sometimes use another, less standard convention. If $\varepsilon_1 > 0$ and $f : (0, \varepsilon_1) \longrightarrow (0, \infty)$, then “$f(\varepsilon) = \Delta(\varepsilon)$” means that $f(\varepsilon) \longrightarrow 0$ as $\varepsilon \longrightarrow 0$.
If these estimates depend on some other quantities $A, B, \ldots$ (for instance, if the constant C in (i) above is a function of $A, B, \ldots$), then we may record these as subscripts: that is, we write $f = O_{A,B,\ldots}(g)$ or $f = o_{A,B,\ldots}(g)$ or $\Delta_{A,B,\ldots}(\varepsilon)$. In other places we leave this dependence to be inferred from the context.

3 The basic counting problems and Shannon entropy


Informally, here is the basic type multiplicity problem:

Given a probability vector p on A, how many sequences are there in $A^n$ of type p?
This problem has many variants. For example, it can be more useful to have
a simple approximation than an exact count. Also, it is often more important to
count sequences that are δ-typical for some small positive δ, rather than exactly
typical. In more complicated problems, we may wish to count sequences with
some other property in addition to their type. Several of these variants appear
later in the course, but for now we stay close to the above.

3.1 Counting exactly typical sequences


Elementary combinatorics gives the exact answer to the type multiplicity question:
if $A = \{a_1, \ldots, a_k\}$, then $T_n(p)$ is nonempty if and only if p has denominator n, and in this case
$$|T_n(p)| = \binom{n}{np_{a_1}, \ldots, np_{a_k}} = \frac{n!}{(np_{a_1})! \cdots (np_{a_k})!}.$$
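The exact formula is easy to sanity-check on a small example. Here is a minimal Python sketch comparing brute-force enumeration of a type class against the multinomial coefficient (the helper name `multinomial` is mine):

```python
from collections import Counter
from itertools import product
from math import factorial, prod

def multinomial(n, counts):
    """The multinomial coefficient n! / (c_1! ... c_k!)."""
    return factorial(n) // prod(factorial(c) for c in counts)

A = ('a', 'b', 'c')
n = 6
counts = {'a': 3, 'b': 2, 'c': 1}   # type p = (1/2, 1/3, 1/6), denominator 6

brute = sum(1 for x in product(A, repeat=n) if Counter(x) == counts)
print(brute, multinomial(n, counts.values()))   # both print 60
```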
But multinomial coefficients can be difficult to manipulate. Often it’s more useful
to have estimates of a simpler form.
Theorem 3.1. If p has denominator n, then
$$\frac{1}{(n+1)^k} \, 2^{Hn} \le |T_n(p)| \le 2^{Hn},$$
where
$$H := -\sum_{a \in A} p_a \log_2 p_a.$$

Here and in the rest of the course we use the convention that 0 log2 0 = 0.
Often the even simpler estimate

$$2^{Hn - o(n)} \le |T_n(p)| \le 2^{Hn} \tag{1}$$

is sufficient. Here the implied estimate in the term ‘o(n)’ depends on |A| = k, but
not otherwise on p.
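Both bounds in Theorem 3.1 can be tested numerically on small cases; a minimal sketch (assuming the multinomial formula above for the exact count):

```python
from math import factorial, log2, prod

def H(p):
    """Shannon entropy in bits, with the convention 0 log2 0 = 0."""
    return -sum(q * log2(q) for q in p if q > 0)

n, counts = 30, (15, 10, 5)       # p = (1/2, 1/3, 1/6), k = 3, denominator 30
p = [c / n for c in counts]
size = factorial(n) // prod(factorial(c) for c in counts)   # |T_n(p)| exactly

lower = 2 ** (H(p) * n) / (n + 1) ** len(p)                 # 2^{Hn} / (n+1)^k
upper = 2 ** (H(p) * n)                                     # 2^{Hn}
print(lower <= size <= upper)     # True: roughly 5e8 <= 4.7e11 <= 1.5e13
```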
The proof of Theorem 3.1 shows nicely the role of discrete probability in ex-
plaining these results, as well as formulating them. Let $p^{\times n}$ denote the product probability distribution on $A^n$ under which the coordinates are i.i.d. $\sim p$.

Proof of Theorem 3.1. Upper bound. If $x \in T_n(p)$, then
$$p^{\times n}\{x\} = \prod_{i=1}^n p_{x_i} = \prod_{a \in A} p_a^{N(a \mid x)} = \prod_{a \in A} p_a^{np_a} = 2^{-Hn},$$
where the third equality holds because $p_x = p$. This calculation is the ultimate reason why Shannon entropy is so central to our story. As a result, we have
$$1 \ge p^{\times n}[T_n(p)] = \sum_{x \in T_n(p)} p^{\times n}\{x\} = |T_n(p)| \, 2^{-Hn}. \tag{2}$$
Re-arranging completes the proof.
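The key identity $p^{\times n}\{x\} = 2^{-Hn}$ for $x \in T_n(p)$ is worth checking by hand at least once; a minimal numerical sketch:

```python
from math import log2, prod

p = {'a': 0.5, 'b': 0.25, 'c': 0.25}
x = 'aabcabca'                    # type (1/2, 1/4, 1/4), so x lies in T_8(p)
n = len(x)

prob = prod(p[c] for c in x)      # the product measure of the single string x
H = -sum(q * log2(q) for q in p.values())
print(prob == 2 ** (-H * n))      # True: both sides equal 2^{-12}
```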


Lower bound. It is generally not true that $p^{\times n}[T_n(p)]$ is close to 1 in the estimate (2). However, an element of Prob(A) which has denominator n may be specified by the k values $p_a$, $a \in A$, each of which is an element of $\{0, 1/n, 2/n, \ldots, 1\}$. Hence there are at most $(n+1)^k$ elements of Prob(A) that have denominator n. The proof is completed by showing that
$$p^{\times n}[T_n(p)] \ge p^{\times n}[T_n(q)] \tag{3}$$
for any other type q. Indeed, from this it follows that
$$1 = p^{\times n}\Big[ \bigcup_{\substack{q \in \mathrm{Prob}(A) \\ \mathrm{denom}(q) = n}} T_n(q) \Big] \le (n+1)^k \cdot p^{\times n}[T_n(p)],$$
which re-arranges to the desired lower bound in light of (2).


To prove (3), let us substitute the formula for $p^{\times n}\{x\}$ and the exact multinomial formula for $|T_n(p)|$. After simplifying, these give
$$\frac{p^{\times n}[T_n(p)]}{p^{\times n}[T_n(q)]} = \prod_a \Big[ \frac{(nq_a)!}{(np_a)!} \, p_a^{(p_a - q_a)n} \Big].$$
Now we use the simple inequality
$$\frac{m!}{n!} \ge n^{m-n} \quad \forall n, m \in \mathbb{N} \cup \{0\}$$
(exercise!). Upon substituting, it leaves us with
$$\frac{p^{\times n}[T_n(p)]}{p^{\times n}[T_n(q)]} \ge \prod_a (np_a)^{(q_a - p_a)n} \, p_a^{(p_a - q_a)n} = \prod_a n^{(q_a - p_a)n} = n^{n \sum_a (q_a - p_a)} = n^{n-n} = 1.$$
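Inequality (3) says that, under $p^{\times n}$, no type class is more probable than that of p itself. This too can be confirmed by brute force over all types of denominator n; a minimal sketch (the helper name `type_prob` is mine):

```python
from math import factorial, prod

def type_prob(p, counts):
    """p^{x n}[T_n(q)] for the type q with letter counts `counts`."""
    n = sum(counts)
    size = factorial(n) // prod(factorial(c) for c in counts)
    return size * prod(pa ** c for pa, c in zip(p, counts))

n, p = 12, (0.5, 1/3, 1/6)
types = ((c0, c1, n - c0 - c1)
         for c0 in range(n + 1) for c1 in range(n + 1 - c0))
best = max(types, key=lambda counts: type_prob(p, counts))
print(best)   # (6, 4, 2): the most probable type is p itself
```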

The new quantity which appears in Theorem 3.1 is a function of the probability
vector p.
Definition 3.2 (Shannon entropy). The Shannon entropy of p is
$$H(p) := -\sum_{a \in A} p_a \log_2 p_a.$$

The basic properties of this function can be established by simple calculus. These simple facts have extensive consequences later in the course.
Proposition 3.3. The Shannon entropy function H on Prob(A) has the following
properties:
(a) It is continuous.
(b) It is strictly concave.
(c) It is a symmetric function of the values pa , a ∈ A.
(d) Its maximum value is log2 |A|, achieved precisely at the uniform distribu-
tion: pa = |A|−1 for all a ∈ A.
(e) Its minimum value is zero, achieved precisely at the delta masses, δa , for
a ∈ A.
Proof. Define $\eta : [0, 1] \longrightarrow \mathbb{R}$ by
$$\eta(t) := \begin{cases} -t \log_2 t & \text{if } t > 0 \\ 0 & \text{if } t = 0. \end{cases}$$
Some elementary analysis shows that η is continuous (including at 0) and strictly
concave. Conclusion (a) follows at once. Conclusion (b) also follows because, if
$p, q \in \mathrm{Prob}(A)$ and $0 < \lambda < 1$, then
$$H(\lambda p + (1-\lambda) q) = \sum_a \eta(\lambda p_a + (1-\lambda) q_a) \ge \sum_a \big[ \lambda \eta(p_a) + (1-\lambda) \eta(q_a) \big] = \lambda H(p) + (1-\lambda) H(q),$$
with equality if and only if $p_a = q_a$ for every a.


Conclusion (c) is immediate from the definition, and now conclusion (d) follows from the combination of (b) and (c) (exercise!). Finally, conclusion (e) is a direct calculation from the definition.
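Strict concavity is also easy to probe numerically on random pairs of probability vectors; a small sketch:

```python
import random
from math import log2

def H(p):
    """Shannon entropy in bits (terms with q = 0 contribute 0)."""
    return -sum(q * log2(q) for q in p if q > 0)

def random_prob_vector(k):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
for _ in range(1000):
    p, q = random_prob_vector(4), random_prob_vector(4)
    lam = random.random()
    mix = [lam * a + (1 - lam) * b for a, b in zip(p, q)]
    # concavity, with a tiny tolerance for floating-point rounding
    assert H(mix) >= lam * H(p) + (1 - lam) * H(q) - 1e-12
print("concavity verified on 1000 random samples")
```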

Some of these properties of H also have a more conceptual interpretation. We
derive a few of them later.
Example. Elements of $\mathrm{Prob}(\{0, 1\})$ may be written as $(p, 1 - p)$ for $p \in [0, 1]$. In this notation, Shannon’s entropy function is
$$H(p, 1 - p) = -p \log_2 p - (1 - p) \log_2 (1 - p).$$
It is concave and symmetric about $p = 1/2$. Its maximum value is 1, achieved at $p = 1/2$, and its minimum value is 0, achieved at $p = 0$ and $1$.
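A short sketch tabulating this function makes the symmetry and the extreme values visible:

```python
from math import log2

def h(p):
    """The binary entropy H(p, 1 - p), in bits."""
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:4.2f}   H(p, 1-p) = {h(p):.4f}")
# symmetric about 1/2; maximum 1.0000 at p = 0.5; minimum 0.0000 at p = 0 and 1
```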

4 Notes and remarks


Basic sources for this lecture: [CT06, Chapter 2], [Sha48].

References
[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory.
Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition,
2006.

[Sha48] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948.

Tim Austin
University of California, Los Angeles, CA 90025, USA
Email: tim@math.ucla.edu
URL: math.ucla.edu/~tim
