Lesson 3: Maximizing Entropy: Notes From Prof. Susskind Video Lectures Publicly Available On Youtube
Introduction
\sum_i P(i) = 1
\sum_i P(i)\, E(i) = \langle E \rangle        (2)
simply denote E.
1. In this course we use the two terms statistics and probability theory synonymously. Specialists make the following distinction: probability theory is concerned with computing the probability distributions of various random variables; it is essentially a branch of measure theory. Statistics, on the other hand, is concerned with the following problem: the distribution of a random variable X is known to belong to a family of distributions indexed by a parameter θ. From a series of observations x1, x2, ... xn of X, what can we say about θ? θ is unknown, but it is not a random variable. So it doesn't have a probability of taking such and such value; it doesn't have a mean, etc. But it has a maximum likelihood. Statistics is also concerned with other problems of the same nature: testing, comparing, deciding, estimating things which are unknown but are not r.v. Estimators, however, are themselves r.v. with all sorts of interesting properties. That is the technical distinction between probability and statistics.
Figure 1: Family of probability distributions. The states are
arranged on the i-axis by increasing E(i).
The entropy of the ground state is exactly zero3. Then, as
the one-parameter family of probability distributions shifts
to the right, first of all the energy becomes larger, and the
entropy increases too.
wider, in the family of distributions that can describe the
system, see figure 1, the average energy E, which is also
the indexing parameter in the family, goes up. And at the
same time the entropy goes up.
4. Notice that thermal equilibrium does not mean that the system is now in one state i. It means that it has a certain distribution P(i) over Ω, which we will study in depth.
second :-) Then, in this state of thermal equilibrium, heat
tends not to flow. We will understand that soon enough.
Heat tends not to flow because all parts of the system are
in equilibrium. And, as we will see, the probability distri-
butions P (i, E) do in fact broaden as the energy E – or
parameter E, if you like – goes up. As a consequence the
entropy is a monotonically increasing function of energy.
That is an important fact.
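To make this concrete, here is a minimal numerical sketch in Python. It assumes, purely for illustration, a one-parameter family of Boltzmann-like distributions P(i) proportional to exp(-E(i)/θ) over states with E(i) = i; this particular family is an assumption of the sketch, not something derived in the lesson. As the parameter grows, the distribution shifts to the right and broadens, and both the average energy and the entropy go up.

import math

def family(theta, energies):
    # Illustrative one-parameter family P(i) proportional to exp(-E(i)/theta).
    # This specific family is an assumption of the sketch, not the lecture's.
    weights = [math.exp(-E / theta) for E in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

def avg_energy(P, energies):
    return sum(p * E for p, E in zip(P, energies))

def entropy(P):
    return -sum(p * math.log(p) for p in P if p > 0)

energies = list(range(50))            # states arranged by increasing E(i)
for theta in [0.5, 1.0, 2.0, 4.0, 8.0]:
    P = family(theta, energies)
    print(theta, avg_energy(P, energies), entropy(P))

Running it shows the average energy and the entropy growing together as the parameter increases, which is the monotonic behaviour described above.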
Laws of thermodynamics
equilibrium, generally speaking the probability distribution
broadens.
Now let’s talk about the zeroth law. First of all the zeroth
law asserts that there is such a thing as thermal equilib-
rium. If we take a box, for instance a box of gas, and we
wait long enough, it will come to equilibrium.
6. We are talking about a system which is not in equilibrium at the beginning of its study. Indeed, for a system in equilibrium we saw that its entropy is a monotonically increasing function of its energy. Therefore, if it does not receive energy, its entropy cannot increase. More on this when we talk about adiabatic processes in later lessons.
We take the example of gas because it is easy to think intu-
itively about how all the molecules' positions and velocities
should evolve. But it doesn’t have to be gas inside the box,
it could be liquid, it could be solid, or any mix. In fact it
could be any system which is isolated and self-contained.
It will come to equilibrium.
The zeroth law also says that there is a concept called tem-
perature attached to any equilibrium. Temperature has the
characteristic that, if the system is made of several parts
separately in equilibrium, and connected together through
small interfaces as in figure 2, the energy always flows from
hotter parts to colder parts. And if we wait long enough the
flows will eventually come to an equilibrium, all the parts
of the system then having one unique temperature.
We can summarize the zeroth law with these three rules:
1. There is a notion of temperature. We have already
introduced it, but we will use it to illuminate this law,
see next section.
2. Energy flows from higher temperature to lower tem-
perature.
3. In thermal equilibrium, the temperature of all parts
of the system is the same. If the temperature were not
the same in all parts of the system, energy would flow
until it became the same.
\frac{dE}{dS} = T        (3)
Let’s write down all the basic equations that we know. For
the moment we assume that the temperature of B is higher
than the temperature of A. Of course we could have TB =
TA , but let’s start with the case where they are not equal.
T_A < T_B        (4)
The first law tells us that when the system responds and
does whatever it does over time, the total energy doesn’t
change. Since the whole system is not in equilibrium, there
will be some energy shift from one subsystem to the other.
Whichever way it happens, the following equation holds

dE_A + dE_B = 0        (5)
Now let’s use one more statement: that the change in the
energy is equal to the temperature times the change in the
entropy. And let’s apply this to both A and B.
dE_A = T_A\, dS_A
dE_B = T_B\, dS_B        (7)
Then equation (5), namely the first law of thermodynamics,
rewrites as
dS_B = -\frac{T_A}{T_B}\, dS_A        (9)

The second law tells us that the total entropy can only go up, dS_A + dS_B > 0, so

dS_A - \frac{T_A}{T_B}\, dS_A > 0        (10)

or

(T_B - T_A)\, dS_A > 0        (11)
to S –, it indicates a direction of heat flow. Again when we
say heat, so far we just mean energy. Energy will flow from
B to A until the temperatures become equal.
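Here is a small numerical sketch of that statement. It assumes two subsystems with constant heat capacities (the values C_A, C_B and the starting temperatures below are made up), moves a small parcel of energy dE from B to A at each step, and uses dE = T dS for each part. The total entropy change at each step is positive, exactly as in equation (11), and it stops being positive only when the temperatures have become equal.

# Two subsystems with assumed constant heat capacities exchange small
# parcels of energy dE.  Using dS = dE / T for each part, the total
# entropy rises as long as energy flows from the hotter part (B) to the
# colder part (A).
CA, CB = 1.0, 1.0          # assumed heat capacities
TA, TB = 1.0, 2.0          # start with TA < TB, as in (4)
dE = 1e-3                  # small parcel of energy moved from B to A
S_change = 0.0
while TB - TA > 1e-6:
    dS = dE / TA - dE / TB     # dS_A + dS_B, with dE_A = +dE, dE_B = -dE
    assert dS > 0              # same statement as (T_B - T_A) dS_A > 0
    S_change += dS
    TA += dE / CA              # A warms up
    TB -= dE / CB              # B cools down
print(TA, TB, S_change)        # equal temperatures, positive total entropy change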
A.: Yes, I did discretize the problem. But first of all in
quantum mechanics we can take those variables to be dis-
crete basically.
Q.: But if we have say a finite number of molecules in the
box, their positions and velocities are not discrete variables.
That is what it comes down to. And that is not a one liner.
That is a subtle business. But it is true.
9. See volume 2 in the collection The Theoretical Minimum, on quantum mechanics, where we already showed that quantum mechanics in certain limits is equivalent to classical mechanics.
its different microstates i’s. What is that probability dis-
tribution? What is the mathematics of it?
The heat bath allows energy to flow back and forth between
itself and the system. After a while, the system comes to
thermal equilibrium with the big surrounding heat bath.
This is something we could also prove. But let’s just think
about the physics: we have a huge system at a certain tem-
perature – the heat bath –, and we have a little system
plunged into it – our system. A little bit of heat flows from
the big system to the little system. Typically the change in
the temperature of the big system will be negligible.
provide the heat bath.
have different energy for example. Some of the N systems
are in state ω1 . In figure 6, there are four of them. Gen-
erally speaking we will denote their number n1 . Similarly
there are n2 systems in state ω2 . In figure 6, n2 = 2. There
are n3 systems in ω3 . In figure 6, n3 = 3. And so forth.
ble states ωi ’s. But most of them will have so much energy
that there is no chance at all that they be occupied. Only
a negligible fraction of the N systems will occupy states of
very high energy – if anything simply because we may not
have that much energy available.
must naturally hold
\sum_i n_i = N        (14)

n_1 E_1 + n_2 E_2 + n_3 E_3 + \ldots = N E

or

\sum_i n_i E_i = N E        (15)
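As a quick illustration of these two constraints, here is the bookkeeping for the occupation numbers read off figure 6, n1 = 4, n2 = 2, n3 = 3. The energies E_i below are made-up values; only the structure of equations (14) and (15) matters.

n = [4, 2, 3]                      # occupation numbers n_i from figure 6
E_levels = [0.0, 1.0, 2.0]         # assumed energies E_i, for illustration only
N = sum(n)                         # equation (14): the n_i add up to N = 9
E_avg = sum(ni * Ei for ni, Ei in zip(n, E_levels)) / N
print(N, E_avg)                    # equation (15): sum of n_i E_i equals N E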
Ω which is the space of states of the world after having
performed E, and a probability distribution, or measure, P
over the elements or subsets of Ω.
So these are now the new expressions for our two con-
straints. But if we prefer to think about them in terms of
the occupation numbers, we can write them back as equa-
tions (14) and (15).
occupation numbers?
A = \frac{N!}{n_1!\, n_2!\, n_3! \ldots}        (18)
12. For instance, if the boxes are b1, b2, b3, and the states are ω1 and ω2, then, the boxes being distinguishable, putting (b1, b2) in ω1 and (b3) in ω2 is not the same as putting (b1, b3) in ω1 and (b2) in ω2.
Before that, let’s check it for a couple of cases. First of all,
suppose n1 = N and all the other occupation numbers are
0. Then formula (18) gives A = N !/n1 ! = 1. That seems
reasonable. Indeed, how many ways are there of putting all
the N boxes into one given state? One. So formula (18) is
fine in this case.
One last case: N = 4, n1 = 2, and n2 = n3 = 1.
There are \binom{4}{2} = 4 \times 3 / 2 = 6 ways to fill the first cup,
multiplied by the two remaining ways to fill the other two.
This gives 12 ways. Turning to formula (18), we have
A = 4!/(2! 1! 1!) = 12. Again it is hunky dory.
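If we want to double-check this kind of counting by brute force, a few lines of Python do it: enumerate every assignment of the N = 4 distinguishable boxes to the three states and keep those with occupation numbers (2, 1, 1).

from itertools import product
from math import factorial

# Count assignments of 4 distinguishable boxes to states omega_1, omega_2,
# omega_3 that realize the occupation numbers (n1, n2, n3) = (2, 1, 1),
# and compare with formula (18).
N, occ = 4, (2, 1, 1)
count = 0
for assignment in product(range(3), repeat=N):     # state chosen for each box
    if tuple(assignment.count(s) for s in range(3)) == occ:
        count += 1
A = factorial(N) // (factorial(2) * factorial(1) * factorial(1))
print(count, A)    # both give 12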
The interesting fact we are going to discover is that, in the
circumstances that concern us, of a large number of sys-
tems distributed among a collection of states, with some
constraints, the function A is highly peaked at a particu-
lar set of occupation numbers. Namely, when N gets very
big, the occupation numbers cluster very strongly about a
particular set of occupation numbers – or better yet a par-
ticular set of fractions ni /N .
We want to approximate
A = \frac{N!}{n_1!\, n_2!\, n_3! \ldots}
number A.
\lim_{N \to +\infty} \frac{N!}{\sqrt{2\pi N}\, N^N e^{-N}} = 1        (19)
13. Named after James Stirling (1692 - 1770), Scottish mathematician.
Figure 8: Curve \log_e x and the sum \sum_{k=1}^{N} \log k (in grey).

We see that \sum_{k=1}^{N} \log k is slightly smaller than \int_1^{N+1} \log x\, dx.
But also, had we drawn the grey rectangles shifted one unit to the left,
we would see that it is slightly bigger than \int_1^{N} \log x\, dx.

Thus we have

0 < \sum_{k=1}^{N} \log k - \int_1^{N} \log x\, dx < \int_N^{N+1} \log x\, dx        (22)

\int_1^{N} \log x\, dx = \Big[ x \log x - x \Big]_1^N
= N \log N - N - 1 \cdot \log 1 + 1
= N \log N - N + 1
And the upper bound on the right of expression (22) is
smaller than log(N + 1). So the whole expression can be
rewritten
0 < \sum_{k=1}^{N} \log k - N \log N + N - 1 < \log(N+1)        (23)

1 < \frac{N!\, e^N}{N^N\, e} < N + 1        (24)

In other words,

N! \approx C(N)\, \frac{N^N}{e^N}        (25)

N! \approx \frac{N^N}{e^N}        (26)
where the ≈ sign here doesn't mean that the ratio of the
left-hand side to the right-hand side goes to 1, as it usually
means, but that it is bounded by N itself – and in fact the
multiplicative factor that we omit is \sqrt{2\pi N}.
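It is easy to check Stirling's approximation numerically. The little sketch below compares N! with the full formula of (19) and with the cruder form N^N/e^N of (26); the first ratio goes to 1, the second grows roughly like the square root of 2πN.

import math

# Numerical check of Stirling's approximation.
for N in [5, 10, 50, 100]:
    exact = math.factorial(N)
    full = math.sqrt(2 * math.pi * N) * N**N * math.exp(-N)   # formula (19)
    crude = N**N * math.exp(-N)                               # formula (26)
    print(N, exact / full, exact / crude)
# exact/full tends to 1; exact/crude is roughly sqrt(2*pi*N).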
A = \frac{N!}{n_1!\, n_2!\, n_3! \ldots}

\sum_i n_i = N
\sum_i n_i E_i = N E        (27)
Once we have found the solution, then we effectively know
all of the probabilities Pi = ni /N .
A \approx \frac{N^N e^{-N}}{\prod_i n_i^{n_i}\, e^{-n_i}}        (28)

where \prod is the standard symbol for a product of terms. Since \sum_i n_i = N, the exponential factors cancel, and

A \approx \frac{N^N}{n_1^{n_1}\, n_2^{n_2}\, n_3^{n_3} \ldots}        (29)
\log A \approx N \log N - \sum_i n_i \log n_i        (30)

\log A \approx N \log N - \sum_i N P_i \log(N P_i)
\approx N \log N - \sum_i N P_i (\log N + \log P_i)
the sum of the N P_i \log N terms gives N \log N. So the whole
thing can be rewritten

\log A \approx - \sum_i N P_i \log P_i

or

\log A \approx - N \sum_i P_i \log P_i        (31)
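We can check how good approximation (31) is with a few lines of Python, using log N! = lgamma(N + 1). The particular fractions chosen below are arbitrary; the point is that the relative error shrinks as N grows.

import math

# Compare the exact log of A = N!/(n1! n2! ...) with approximation (31),
# log A ~ -N sum_i P_i log P_i, where P_i = n_i / N.
fractions = [0.5, 0.3, 0.2]        # an arbitrary illustrative set of P_i
for N in [10, 100, 1000, 10000]:
    n = [int(round(f * N)) for f in fractions]
    N_tot = sum(n)
    logA = math.lgamma(N_tot + 1) - sum(math.lgamma(ni + 1) for ni in n)
    P = [ni / N_tot for ni in n]
    approx = -N_tot * sum(p * math.log(p) for p in P)
    print(N_tot, logA, approx, approx / logA)
# The ratio approx/logA approaches 1 as N gets large.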
states would be uniform. All the states would be equally
probable.
A.: As N gets large, the ni ’s will get large in the same pro-
portion. If you double the total number of systems, every
average occupation number will also double.
at least to maximize A under the two constraints.
A.: The minus sign does come from terms that were down-
stairs. But don’t forget that the Pi ’s are less than 1, so
their logarithms are negative. That is why the formula for
the entropy has a minus sign in front of it. We do want to
maximize the right-hand side of equation (31).
If you don’t want to see only the slick general public sci-
entific magazine version of it, but you want to see how it
really works, you have to get your hands dirty.
constraints, namely
\sum_i P_i = 1
\sum_i P_i E_i = E        (33)
subject to the constraints we described.
\frac{\partial F}{\partial x_1} = 0
\frac{\partial F}{\partial x_2} = 0
\quad\vdots
\frac{\partial F}{\partial x_p} = 0        (34)
We solve them simultaneously, and this gives us the solu-
tion for the p unknowns we were looking for14 .
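As a tiny illustration of this recipe (with a made-up F, just to show the mechanics, nothing to do yet with our entropy problem), sympy can set up and solve the equations (34) for us:

import sympy as sp

# Pick an example F of two variables, set both partial derivatives to zero,
# and solve the equations simultaneously, as in (34).
x1, x2 = sp.symbols('x1 x2')
F = -(x1 - 1)**2 - (x2 + 2)**2          # an example function with one maximum
stationary = sp.solve([sp.diff(F, x1), sp.diff(F, x2)], [x1, x2])
print(stationary)                        # {x1: 1, x2: -2}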
9 corresponds to all the places where F has a given value.
When the contour lines become smaller and smaller the fi-
nal point is an extremal point, either a peak or a trough.
When they are close to each other that corresponds to a
steep slope in the terrain, etc.
We are now going to look for the place where F (xi ) is max-
imum – xi here stands for all the x1 , x2 , ... xp – but given
the constraint that some other function G(xi ) must be equal
to zero.
to some curve in the contour map of F , figure 10.
Figure 11: Local analysis, around P , of what would happen if
the contour lines of G were not parallel to the contour lines of F .
Show that in that case P would not be a maximum of F under
the constraint G = 0.
Figure 12: Point P solution to the maximization problem under
constraint G = 0, and line L perpendicular to the contour line
of function F at P .
Figure 13: Value of F as a function of G when we move on line L
perpendicular to the contour line of F and of G at P .
F' = F + \lambda G        (35)
The number λ can be chosen to make F' along the line L
flat at P – with respect to G or to ordinary distance, it
doesn't matter. And in the perpendicular direction, that
is along the contour line of F, which is locally the contour
line of G as well, F' is also flat at P, since G = 0 and F
is maximum there on that curve. So P must be a stationary
point of F' in all directions.
15. Joseph-Louis Lagrange (1736 - 1813), Italian mathematician. Lagrange was born in Turin, where he spent the first thirty years of his life. Then he spent twenty-one years in Berlin. Finally he went to Paris, where he lived till the end of his life.
may seem intricate, but the use of it is very easy.
Let’s take
F(x, y) = \frac{x^2 + y^2}{2}        (36)
What is G?
G(x, y) = x + y − 1 (37)
This gives
x + \lambda = 0
y + \lambda = 0        (40)
Hence, x = −λ and y = −λ.
−λ − λ − 1 = 0 (41)
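For readers who like to let the machine do the algebra, here is the same worked example in sympy; it just reproduces equations (40) and (41) and solves them:

import sympy as sp

# Extremize F = (x**2 + y**2)/2 subject to G = x + y - 1 = 0 by making
# F' = F + lambda*G stationary in x, y and lambda.
x, y, lam = sp.symbols('x y lam')
F = (x**2 + y**2) / 2
G = x + y - 1
Fprime = F + lam * G
eqs = [sp.diff(Fprime, v) for v in (x, y, lam)]   # x + lam, y + lam, x + y - 1
print(sp.solve(eqs, [x, y, lam]))                 # {x: 1/2, y: 1/2, lam: -1/2}

So the constrained stationary point is x = y = 1/2, with λ = -1/2.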
G_1(x_1, x_2, x_3) = 0
G_2(x_1, x_2, x_3) = 0        (43)

F' = F + \lambda_1 G_1 + \lambda_2 G_2        (44)

\frac{\partial F'}{\partial x_i} = 0        (45)
This seems like a rather complicated thing to do. But it is
by far the easiest way to minimize something or maximize
it when we have constraints.
under the constraints
G_1(P_1, P_2, \ldots) = \sum_i P_i - 1 = 0
G_2(P_1, P_2, \ldots) = \sum_i P_i E_i - E = 0        (47)
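Before doing the algebra, we can also attack this maximization numerically. The sketch below uses scipy's SLSQP routine to maximize -Σ_i P_i log P_i subject to the two constraints (47); the energy levels E_i and the target average energy E are made-up values, chosen only to have something concrete to feed the optimizer.

import numpy as np
from scipy.optimize import minimize

# Maximize -sum_i P_i log P_i subject to the two constraints (47).
E_levels = np.array([0.0, 1.0, 2.0, 3.0])   # assumed energy levels E_i
E_target = 1.2                              # assumed average energy E

def neg_entropy(P):
    P = np.clip(P, 1e-12, None)             # avoid log(0)
    return np.sum(P * np.log(P))            # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda P: np.sum(P) - 1.0},                # G1 = 0
    {"type": "eq", "fun": lambda P: np.dot(P, E_levels) - E_target}, # G2 = 0
]
P0 = np.full(len(E_levels), 1.0 / len(E_levels))    # uniform starting guess
res = minimize(neg_entropy, P0, method="SLSQP",
               constraints=constraints, bounds=[(0, 1)] * len(E_levels))
print(res.x, res.x @ E_levels)      # maximizing P_i's and their average energy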
data about a r.v. X. For instance, the experiment E was reproduced n
times, and we obtained the measurements x1, x2, ... xn of X.
The problem is: what is θ? Of course we don’t have enough
information to know θ exactly. But the experimental data
we have do give us some information.
16. For the interested reader, see the James-Stein estimator, which is uniformly better than the maximum likelihood estimator according to some very natural measure of quality. Here is a reference in a general public scientific magazine: http://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf
That is why when we speak of the estimated value of some
parameters, be they probabilities like P1 , P2 , ... Pn in our
problem of entropy, we talk not about their probable val-
ues, but about their likely values.
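A tiny example, not from the lecture, may help fix the idea. Take a biased coin with unknown parameter θ = P(heads) and some made-up observations. θ has no probability distribution, but scanning the log likelihood over θ shows it has a most likely value, which is just the observed frequency of heads.

import math

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]        # made-up observations of X
def log_likelihood(theta):
    # log of the probability of the data if the parameter were theta
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in data)
best = max((t / 100 for t in range(1, 100)), key=log_likelihood)
print(best, sum(data) / len(data))           # both print 0.7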
subject to the two constraints the reader should by now be
familiar with.