Lecture 1: Introduction, Entropy and ML Estimation
Spring 2012
Scribes: Min Xu
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.
1.1 Introduction
This class focuses on information theory, signal processing, machine learning, and the connections between these fields.
Signals can be audio, speech, or music; data can be images or files, and the two categories often overlap.
Both signal processing and machine learning are about how to extract useful information from signals/data.
Signals, as the term is used in the EE community, can differ from data in that (1) they often have a temporal aspect, (2) they are often designed, and (3) they are often transmitted through a medium (known as a channel).
Information theory studies two main questions:
1. How much information is contained in the signal/data?
Example: Consider the data compression (source coding) problem.
What is the fewest number of bits needed to describe the output of a source (also called message) while
preserving all the information, in the sense that a receiver can reconstruct the message from the bits
with arbitrarily low probability of error?
2. How much information can be reliably transmitted through a noisy channel?
Example: Consider the data transmission (channel coding) problem.
What is the maximum number of bits per channel use that can be reliably sent through a noisy channel,
in the sense that the receiver can reconstruct the source message with arbitrarily low probability of
error?
Remark: The data compression or source coding problem can be thought of as a noiseless version of the data transmission or channel coding problem.
1.2 Information Content
We will usually specify information content in bits, where a bit is defined to take the value 0 or 1. We can think of a bit as the answer to a yes/no question, and the information content of a random experiment can be specified as the minimum number of yes/no questions needed to learn its outcome.
1. We have an integer chosen randomly from 0 to 63. What is the smallest number of yes/no questions needed to identify that integer? Answer: $\log_2(64) = 6$ bits.
Note: all integers from 0 to 63 have equal probability $p = \frac{1}{64}$ of being the correct integer. Thus, $\log_2(64) = \log_2(\frac{1}{p})$. This is the Shannon Information Content.
2. Now let's consider an experiment where the questions do not lead to equi-probable outcomes.
An enemy ship is somewhere in an $8 \times 8$ grid (64 possible locations). We can launch a missile that hits one location. Since the ship can be hidden in any of the 64 possible locations, we expect that we will still gain 6 bits of information when we find the ship. However, each question (the firing of a missile) may not provide the same amount of information.
The probability of hitting on the first launch is $p_1(h) = \frac{1}{64}$, so the Shannon Information Content of hitting on the first launch is $\log_2(\frac{1}{p_1(h)}) = 6$ bits. Since this was a low-probability event, we gained a lot of information (in fact, all the information we hoped to gain on discovering the ship). However, we will not gain the same amount of information from more probable events. For example, if the first 32 launches are all misses, the total information gained is
$$\log_2\left(\frac{1}{p_1(m)}\right) + \log_2\left(\frac{1}{p_2(m)}\right) + \cdots + \log_2\left(\frac{1}{p_{32}(m)}\right) = \log_2\left(\frac{1}{\prod_{i=1}^{32} p_i(m)}\right) = \log_2\left(\frac{64}{63} \cdot \frac{63}{62} \cdots \frac{33}{32}\right) = \log_2(2) = 1 \text{ bit},$$
which is what we intuitively expect, since ruling out 32 locations is equivalent to asking one question in experiment 1.
If we hit on the next try, we will gain $\log_2\left(\frac{1}{p_{33}(h)}\right) = \log_2(32) = 5$ bits of information. A simple calculation shows that, regardless of how many launches we needed, we gain a total of 6 bits of information whenever we hit the ship.
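As a quick numerical check of this accounting (a sketch that is not part of the original notes; the helper info_bits is ad hoc), the following Python snippet adds up the Shannon information content of 32 misses followed by a hit:

\begin{verbatim}
import math

def info_bits(p):
    """Shannon information content (in bits) of an event with probability p."""
    return math.log2(1 / p)

n = 64          # candidate locations still in play
total = 0.0     # information accumulated so far

# Suppose the first 32 launches all miss, then the 33rd launch hits.
for launch in range(32):
    p_miss = (n - 1) / n          # probability of a miss given n candidates remain
    total += info_bits(p_miss)    # information gained from observing the miss
    n -= 1                        # one location is ruled out

print(total)                  # ~1 bit gained from the 32 misses
total += info_bits(1 / n)     # the hit: p = 1/32, worth 5 bits
print(total)                  # ~6 bits once the ship is found
\end{verbatim}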
1.3 Entropy
Definition 1.1 The entropy of a discrete random variable $X$ with distribution $p$ is
$$H(X) = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{1}{p(x)} = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x),$$
where $\mathcal{X}$ is the set of all possible values of the random variable $X$. We often also write $H(X)$ as $H(p)$ since entropy is a property of the distribution.
Example: For a Bernoulli random variable with distribution $\mathrm{Ber}(p)$, the entropy is $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$.
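These quantities are easy to compute directly. A minimal Python sketch (the function names entropy and bernoulli_entropy are my own) is:

\begin{verbatim}
import math

def entropy(dist):
    """H(p) in bits for a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def bernoulli_entropy(p):
    """Entropy of Ber(p)."""
    return entropy([p, 1 - p])

print(entropy([1 / 64] * 64))   # 6.0 bits: a uniform integer from 0 to 63
print(bernoulli_entropy(0.5))   # 1.0 bit: a fair coin is maximally uncertain
print(bernoulli_entropy(0.9))   # ~0.469 bits: a biased coin is more predictable
\end{verbatim}

Note that the entropy of the uniform distribution on 64 outcomes recovers the 6 bits from the guessing game above.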
Definition 1.2 Suppose we have random variables $X, Y$; then the Joint Entropy between them is
$$H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y).$$
Similarly, the Conditional Entropy of $Y$ given $X$ is
$$H(Y \mid X) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log_2 \frac{1}{p(y \mid x)} = E_{X,Y}\left[\log \frac{1}{p(y \mid x)}\right].$$
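To make these definitions concrete, here is a small Python sketch with an arbitrary toy joint distribution (not from the notes) that computes $H(X,Y)$, $H(X)$ and $H(Y \mid X)$ from a joint probability table; it also illustrates the chain rule $H(X,Y) = H(X) + H(Y \mid X)$:

\begin{verbatim}
import math

def H(probs):
    """Entropy in bits of a collection of probabilities (zeros are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A toy joint distribution p(x, y) over X in {0, 1} and Y in {0, 1, 2}.
p_xy = {(0, 0): 0.25, (0, 1): 0.25, (0, 2): 0.00,
        (1, 0): 0.10, (1, 1): 0.10, (1, 2): 0.30}

H_XY = H(p_xy.values())                                   # joint entropy H(X, Y)

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
H_X = H(p_x.values())                                     # marginal entropy H(X)

# Conditional entropy H(Y | X) = sum_x p(x) H(Y | X = x)
H_Y_given_X = sum(p_x[x] * H([p_xy[(x, y)] / p_x[x] for y in (0, 1, 2)])
                  for x in (0, 1))

print(H_XY)                 # ~2.186 bits
print(H_X + H_Y_given_X)    # ~2.186 bits: the chain rule H(X,Y) = H(X) + H(Y|X)
\end{verbatim}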
1.4 Maximum Likelihood Estimation
Suppose $X = (X_1, \ldots, X_n)$ are data generated from a distribution $p(X)$. In maximum likelihood estimation, we want to find a distribution $q$ in some family of distributions $Q$ such that the likelihood $q(X)$ is maximized:
$$\max_{q \in Q} q(X) \equiv \min_{q \in Q} -\log q(X).$$
In machine learning, we often define a loss function. In this case, the loss function is the negative log loss: $\mathrm{loss}(q, X) = -\log q(X)$. The expected value of this loss function is the risk: $\mathrm{Risk}(q) = E_p\left[\log \frac{1}{q(x)}\right]$. We want to find a distribution $q$ that minimizes the risk. However, notice that minimizing the risk with respect to a distribution $q$ is exactly minimizing the relative entropy between $p$ and $q$. This is because
$$\mathrm{Risk}(q) = E_p\left[\log \frac{1}{q(x)}\right] = E_p\left[\log \frac{p(x)}{q(x)}\right] + E_p\left[\log \frac{1}{p(x)}\right] = D(p \| q) + \mathrm{Risk}(p).$$
Because we know that relative entropy is always non-negative, we know that the risk is minimized by setting $q$ equal to $p$. Thus the minimum risk is $R^* = \mathrm{Risk}(p) = H(p)$, the entropy of the distribution $p$. The excess risk, $\mathrm{Risk}(q) - R^*$, is precisely the relative entropy between $p$ and $q$.
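The decomposition above is easy to verify numerically. The following Python sketch (with arbitrarily chosen discrete distributions $p$ and $q$, and helper names of my own) compares $\mathrm{Risk}(q)$ with $D(p\|q) + H(p)$ and checks that the minimum risk is attained at $q = p$:

\begin{verbatim}
import math

def entropy(p):
    """H(p) in bits."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def risk(p, q):
    """Risk(q) = E_p[log 1/q(x)], the expected negative log loss under p."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """Relative entropy D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.50, 0.25, 0.25]   # true distribution
q = [0.40, 0.40, 0.20]   # model distribution

print(risk(p, q))               # Risk(q)
print(kl(p, q) + entropy(p))    # D(p||q) + H(p): the same value
print(risk(p, p), entropy(p))   # minimum risk R* = Risk(p) = H(p)
\end{verbatim}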