Entropy
1 Introduction
Within mathematics, ‘entropy’ is a quantity which answers a simple and natural
problem about counting sequences. It has several variants and generalizations,
corresponding to variants and generalizations of that counting problem.
This counting problem appears naturally in several other branches of mathe-
matics and science. The best known are information theory and statistical physics.
Although the underlying mathematics has an essential unity, those two fields use
it to answer questions of different kinds. In information theory, the counting
problem is given an operational meaning, in terms of the possibility of communi-
cating a certain amount of data in a certain way (or related questions). In statisti-
cal physics, it has phenomenological consequences, describing why certain large
physical systems seem to behave the way they do.
This course is primarily about the mathematical meaning of entropy, but we
also indicate its relevance to those other fields.
2 Preliminaries
2.1 Setting: types and typical sequences
Let A be a finite set. We call it the alphabet.
• A probability vector on $A$ is a tuple $p = (p_a)_{a \in A}$ of nonnegative reals satisfying $\sum_{a \in A} p_a = 1$. This vector is rational if every $p_a$ is rational. It has denominator $n$ if $np_a$ is an integer for every $a$. (Note: in this usage, a probability vector with denominator $n$ also has denominator $m$ whenever $m$ is a multiple of $n$.)
• The probability simplex on $A$ is the set of all probability vectors on $A$: that is, the simplex
$$\mathrm{Prob}(A) = \Big\{ (p_a)_{a \in A} :\ p_a \ge 0\ \forall a \ \text{and}\ \sum_a p_a = 1 \Big\} \subset \mathbb{R}^A.$$
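As a concrete illustration, here is a minimal Python sketch (the helper names are illustrative, not part of the notes) testing membership in $\mathrm{Prob}(A)$ and the "denominator $n$" condition directly:

from fractions import Fraction

def is_prob_vector(p):
    """Membership in Prob(A): nonnegative entries summing to 1."""
    return all(x >= 0 for x in p) and sum(p) == 1

def has_denominator(p, n):
    """True if n * p_a is an integer for every a (entries given as Fractions)."""
    return all((n * x).denominator == 1 for x in p)

p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
print(is_prob_vector(p))      # True
print(has_denominator(p, 4))  # True
print(has_denominator(p, 8))  # True: any multiple of 4 also works
print(has_denominator(p, 6))  # False: 6 * (1/4) = 3/2 is not an integer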
2.2 Asymptotic notation
We frequently use Landau notation: if $f : \mathbb{N} \longrightarrow \mathbb{R}$ and $g : \mathbb{N} \longrightarrow (0, \infty)$, then

(i) "$f = O(g)$" means that there is a constant $C$ such that
$$|f(n)| \le Cg(n) \quad \forall n \in \mathbb{N};$$

(ii) "$f = o(g)$" means that
$$\frac{f(n)}{g(n)} \longrightarrow 0 \quad \text{as } n \longrightarrow \infty.$$
So
$$f = o(g) \implies f = O(g),$$
but not the reverse. For example,
$$\frac{10}{n} = o(1) \quad \text{and} \quad 1 + \frac{10}{n} = O(1), \text{ but not } o(1).$$
Landau notation is especially useful as a placeholder for some part of a more
complicated expression which still indicates the relevant behaviour. For example:
any real polynomial $p$ satisfies
$$p(n) = 2^{o(n)}.$$
(This example appears frequently later, but of course it is very far from tight. For example, for any $\alpha \in (0, 1)$ the function $2^{n^\alpha}$ grows much faster than any polynomial but is still $2^{o(n)}$.)
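For a concrete feel for these rates, here is a small numerical sketch (illustrative only): on a base-2 logarithmic scale, a function is $2^{o(n)}$ exactly when its exponent is $o(n)$, and both exponents below fall far behind $n$:

from math import log2

# Compare exponents on a log2 scale: n^10 has exponent 10*log2(n),
# while 2^{n^alpha} with alpha = 1/2 has exponent n^0.5.
for n in (10, 100, 1000, 10_000):
    poly_exp = 10 * log2(n)   # log2 of the polynomial n^10: O(log n) = o(n)
    alpha_exp = n ** 0.5      # log2 of 2^{n^0.5}: also o(n)
    print(f"n={n:6d}  log2(n^10)={poly_exp:7.1f}  n^0.5={alpha_exp:7.1f}")
# Both exponents grow, but their ratio to n tends to 0.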
We sometimes use another, less standard convention. If $\varepsilon_1 > 0$ and $f : (0, \varepsilon_1) \longrightarrow (0, \infty)$, then "$f(\varepsilon) = \Delta(\varepsilon)$" means that $f(\varepsilon) \longrightarrow 0$ as $\varepsilon \longrightarrow 0$.
If these estimates depend on some other quantities $A, B, \dots$ (for instance, if the constant $C$ in (i) above is a function of $A, B, \dots$), then we may record these as subscripts: that is, we write $f = O_{A,B,\dots}(g)$ or $f = o_{A,B,\dots}(g)$ or $\Delta_{A,B,\dots}(\varepsilon)$. In other places we leave this dependence to be inferred from the context.
Given a probability vector $p$ on $A$, how many sequences are there in $A^n$ of type $p$?
This problem has many variants. For example, it can be more useful to have
a simple approximation than an exact count. Also, it is often more important to
count sequences that are δ-typical for some small positive δ, rather than exactly
typical. In more complicated problems, we may wish to count sequences with
some other property in addition to their type. Several of these variants appear
later in the course, but for now we stay close to the above.
Here and in the rest of the course we use the convention that $0 \log_2 0 = 0$.
Often the even simpler estimate
$$|T_n(p)| = 2^{H(p)n + o(n)}$$
is sufficient. Here the implied estimate in the term "$o(n)$" depends on $|A| = k$, but not otherwise on $p$.
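As a numerical sanity check, here is a Python sketch resting on the standard fact that, for $p$ of denominator $n$, the exact count $|T_n(p)|$ is the multinomial coefficient $n!/\prod_a (np_a)!$:

from math import factorial, log2, prod

def type_class_size(counts):
    """Exact number of sequences with prescribed letter counts:
    the multinomial coefficient n! / prod_a (counts[a])!."""
    n = sum(counts)
    return factorial(n) // prod(factorial(c) for c in counts)

def shannon_entropy(p):
    """H(p) = -sum_a p_a log2 p_a, with the convention 0 log2 0 = 0."""
    return -sum(x * log2(x) for x in p if x > 0)

n = 100
p = (0.5, 0.25, 0.25)                 # denominator 4, hence also denominator 100
counts = [round(n * x) for x in p]    # (50, 25, 25)
print(log2(type_class_size(counts)))  # ~143.2
print(n * shannon_entropy(p))         # 150.0; the difference is the o(n) term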
The proof of Theorem 3.1 shows nicely the role of discrete probability in explaining these results, as well as formulating them. Let $p^{\times n}$ denote the product probability distribution on $A^n$ under which the coordinates are i.i.d. $\sim p$.
Proof of Theorem 3.1. Upper bound. If $x \in T_n(p)$, then
$$p^{\times n}\{x\} = \prod_{i=1}^{n} p_{x_i} = \prod_{a \in A} p_a^{N(a \mid x)} = \prod_{a \in A} p_a^{np_a} = 2^{-Hn},$$
where the third equality holds because $p_x = p$. This calculation is the ultimate reason why Shannon entropy is so central to our story. As a result, we have
$$1 \ge p^{\times n}[T_n(p)] = \sum_{x \in T_n(p)} p^{\times n}\{x\} = |T_n(p)|\, 2^{-Hn}. \tag{2}$$
Rearranging gives the upper bound $|T_n(p)| \le 2^{Hn}$.
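The identity $p^{\times n}\{x\} = 2^{-Hn}$ is easy to verify numerically. A minimal Python sketch (the alphabet and vector are chosen purely for illustration):

from math import log2, prod

p = {"a": 0.5, "b": 0.25, "c": 0.25}     # a probability vector on A = {a, b, c}
n = 8
x = ["a"] * 4 + ["b"] * 2 + ["c"] * 2    # a sequence in A^n of exact type p

prob = prod(p[s] for s in x)             # product measure of the singleton {x}
H = -sum(q * log2(q) for q in p.values())   # H(p) = 1.5
print(prob, 2 ** (-n * H))               # both print 0.000244140625 = 2^{-12}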
The new quantity which appears in Theorem 3.1 is a function of the probability
vector p.
Definition 3.2 (Shannon entropy). The Shannon entropy of $p$ is
$$H(p) := -\sum_{a \in A} p_a \log_2 p_a.$$
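For a first example (a standard computation): the uniform vector $u = (1/k, \dots, 1/k)$ on an alphabet of size $k$ has
$$H(u) = -\sum_{i=1}^{k} \frac{1}{k} \log_2 \frac{1}{k} = \log_2 k,$$
which is the largest value that $H$ attains on $\mathrm{Prob}(A)$ when $|A| = k$.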
Some of these properties of $H$ also have a more conceptual interpretation. We derive a few of them later.
Example. Elements of $\mathrm{Prob}(\{0, 1\})$ may be written as $(p, 1 - p)$ for $p \in [0, 1]$. In this notation, Shannon's entropy function is
$$H(p, 1 - p) = -p \log_2 p - (1 - p) \log_2 (1 - p).$$
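A quick numerical sketch of this function (illustrative only, following the definition above):

from math import log2

def h(p):
    """Binary entropy H(p, 1 - p), with the convention 0 log2 0 = 0."""
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

print(h(0.5))           # 1.0: maximal, at the uniform vector
print(h(0.0), h(1.0))   # 0.0 0.0: deterministic cases carry no entropy
print(h(0.25))          # ~0.811; the function is symmetric, so h(0.75) agrees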
Tim Austin
University of California, Los Angeles, CA 90025, USA
Email: tim@math.ucla.edu
URL: math.ucla.edu/~tim