THEME ARTICLE

The Metropolis Algorithm
ISABEL BEICHL
NIST

FRANCIS SULLIVAN
IDA/Center for Computing Science

1521-9615/00/$10.00 © 2000 IEEE
JANUARY/FEBRUARY 2000

The story goes that Stan Ulam was in a Los Angeles hospital recuperating and, to stave off boredom, he tried computing the probability of getting a “perfect” solitaire hand. Before long, he hit on the idea of using random sampling: Choose a solitaire hand at random. If it is perfect, let count = count + 1; if not, let count = count. After M samples, take count/M as the probability. The hard part, of course, is deciding how to generate a uniform random hand. What’s the probability distribution to draw from, and what’s the algorithm for drawing a hand?

Somewhat later, John von Neumann provided part of the answer, but in a different context. He introduced the rejection algorithm for simulating neutron transport. In brief, if you want to sample from some specific probability distribution, simply sample from any distribution you have handy, but keep only the good samples. (Von Neumann discussed his approach in a letter to Bob Richtmyer [11 Mar. 1947] and in a later letter to Ulam [21 May 1947]. Interestingly, the letter to Richtmyer contains a fairly detailed program for the Eniac, while the one to Ulam gives an explanation augmented by what we would today call pseudocode.)

Since the rejection method’s invention, it has been developed extensively and applied in a wide variety of settings. The Metropolis Algorithm can be formulated as an instance of the rejection method used for generating steps in a Markov chain. This is the approach we will take.

The rejection algorithm

First, let’s look at the setup for the rejection algorithm itself. We want to sample from a population (for example, solitaire hands or neutron trajectories) according to some probability distribution function, ν, that is known in theory but hard to sample from in practice. However, we can easily sample from some related probability distribution function µ. So we do this:

1. Use µ to select a sample, x.
2. Evaluate ν(x). This should be easy, once we have x.
3. Generate a uniform random ρ ∈ [0, 1);
   if ρ < cν(x)/µ(x)
   then accept x
   else try again with another x.

Here we choose c so that cν(x)/µ(x) < 1 for all x.

First, the probability of selecting and then accepting some x is

$$c\,\frac{\nu(x)}{\mu(x)}\,\mu(x) = c\,\nu(x).$$

Also, if we are collecting samples to estimate the weighted mean, S(f), of some function f(x)—that is, S(f) = Σ f(x)cν(x)—we could merely select some large number, M, of samples by using µ(x), reject none of them, and then compute the uniform mean:

$$\frac{1}{M}\sum f(x)\,c\,\frac{\nu(x)}{\mu(x)}.$$

That is, if we don’t reject, the ratios give us a sample whose mean converges to the mean for the limiting probability distribution function ν. This method for estimating a sum is an instance of importance sampling, because it attempts to choose samples according to their importance. The ideal importance function is

$$\mu(x) = \frac{f(x)\,\nu(x)}{\sum_y f(y)\,\nu(y)}.$$

The alert reader will have noticed that this µ(x) requires knowledge of the answer. However, importance sampling works for a less-than-perfect µ(x). This is because the fraction of the samples that equal any particular x will converge to µ(x), so the sample mean of f(x)cν(x)/µ(x) will converge to the true mean. For the special case of the constant function, f(x) = 1, the quantity S is the probability of a “success” on any particular trial of the rejection method. If we take f to be the function that is identically equal to one, we might know the value of S in advance. In that case, we also know the rejection rate 1/S, which is the average number of trials before each success. As we shall see, when we use rejection in its formulation as the Metropolis Algorithm, prior knowledge of the rejection rate leads to a more efficient method called Monte Carlo time.1

Applications: The Metropolis Algorithm

We first look at two important applications of the Metropolis Algorithm—the Ising model and simulated annealing—and then we examine the problem of counting.

The Ising model

This model is one of the most extensively studied systems in statistical physics. It was developed early in the 20th century as a model of magnetization and related phenomena. The model is a 2D or 3D regular array of spins σ_i ∈ {−1, 1} and an associated energy E(σ) for each configuration. A configuration is any particular choice for the spins, and each configuration has the associated energy

$$E(\sigma) = -\sum_{i,j} J_{i,j}\,\sigma_i\,\sigma_j - B\sum_k \sigma_k.$$

The sum is over those {i, j} pairs that interact (usually nearest neighbors). J_{i,j} is the interaction coefficient (often constant), and B is another constant related to the external magnetic field.

In most applications, we want to estimate a mean of some function f(σ) because such quantities give us a first-principles estimate of some fundamental physical quantity. In the Ising model, the mean F is taken over all configurations:

$$F = \frac{1}{Z(T)}\sum_\sigma f(\sigma)\,\exp\!\bigl(-E(\sigma)/\kappa T\bigr).$$

But here, the weights come from the expression for the configuration’s energy. The normalizing factor Z(T) is the partition function:

$$Z(T) = \sum_\sigma \exp\!\bigl(-E(\sigma)/\kappa T\bigr).$$

T is the temperature and κ is the Boltzmann constant.

A natural importance-sampling approach might be to select configurations from the distribution

$$\mu(\sigma) = \frac{\exp\!\bigl(-E(\sigma)/\kappa T\bigr)}{Z(T)},$$

so that the sample mean of M samples,

$$\bar{F} = \frac{\sum_k f(\sigma_k)}{M},$$

will converge rapidly to the true mean, F.

The problem, of course, is finding a way to sample from µ. In this case, sampling from the proposed “easy” distribution µ is not so simple. Nick Metropolis and his colleagues made the following brilliant observation.2 If we change only one spin, the change in energy, ∆E, is easy to evaluate, because only a few terms of the sum change. This observation gives a way of constructing an easily computed rejection step.
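To make the locality of ∆E concrete, here is a minimal sketch in Python. It is not from the article; the 2D lattice, constant coupling J, zero external field B, and periodic boundaries are illustrative assumptions:

```python
def delta_E(spins, i, j, J=1.0):
    """Energy change from flipping spin (i, j) in a 2D nearest-neighbor
    Ising model with constant coupling J and no external field.
    Only the four pair terms involving this spin change."""
    n = len(spins)
    s = spins[i][j]
    # Sum of the four nearest neighbors (periodic boundary conditions).
    nbrs = (spins[(i - 1) % n][j] + spins[(i + 1) % n][j] +
            spins[i][(j - 1) % n] + spins[i][(j + 1) % n])
    # E contains -J * s * neighbor for each neighboring pair; flipping s
    # negates those four terms, so the change is 2 * J * s * nbrs.
    return 2.0 * J * s * nbrs
```

Because only the four neighbor terms change, each proposed flip costs a constant amount of work regardless of lattice size.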
In the Metropolis formulation, we propose a flip of a single spin at one of the n sites, chosen uniformly, and the combined probability of proposing and accepting the move is

$$c\,\nu(\sigma) = \frac{\min\!\bigl(1,\ \exp(-\Delta E(\sigma)/\kappa T)\bigr)}{n},$$

so that we take 1/n as the “easy” probability. That is, we select a site uniformly and accept it according to the Metropolis criterion we just described.

For this case, the expression for the success rate is

$$S = \sum_i c\,\nu(\sigma_i) = \sum_i \frac{\min\!\bigl(1,\ \exp(-\Delta E(\sigma_i)/\kappa T)\bigr)}{n}.$$

So, the probability of exactly k rejections followed by a success is the same as the probability that a random ρ satisfies (1 − S)^{k+1} < ρ < (1 − S)^k, giving this stochastic expression for the waiting time:

$$k = \frac{\log(\rho)}{\log(1 - S)}.$$

We can use this to avoid the rejection steps while still “counting” how many rejections would have occurred.1

In principle, this Monte Carlo-time method works with any rejection formulation. However, each stage requires explicit knowledge of all possible next steps. In other words, we need the values for the “difficult” distribution ν(x). In the Metropolis Ising case, the Markov chain formulation makes this feasible.

Simulated annealing

Suppose we wish to maximize or minimize some real-valued function defined on a finite (but large) set. The classic example is the traveling salesman’s problem. The function is the tour’s length, and the finite set is the set of all possible tours. […]

The answers seem to be

1. Yes. A large literature covers both the theory and applications in many different settings. However, if T > 0, the limit probability distribution will be nonzero for nonoptimal tours. The way around this is to decrease T as the computation proceeds. Usually, T decreases like log(s_k) for some positive, decreasing sequence of “cooling schedule” values s_k, so that the acceptance probability decreases linearly until only the true minimum is accepted.
2. It depends. Designing cooling schedules to optimize the solution time is an active research topic.
3. Someone should investigate this carefully.

Counting

Let’s reconsider Ulam’s original question in a slightly more general form: How many members of a population P have some specific property U? We could do the counting by designing a Markov chain that walks through P and has a limit probability distribution ν that is somehow related to our interesting property U. To be more concrete, P might be the set of partial matchings of a bipartite graph G, and U might be the set of matchings that are “perfect,” meaning they include every graph node.

To have our Markov chain do what we want, we define a partition function:
$$Z(\lambda) = \sum_k m_k \lambda^k.$$

The partition function is associated with the probability distribution

$$\nu(k) = \frac{m_k\,\lambda^k}{Z(\lambda)}.$$

Here, m_k is the number of k-matchings, and λ^k plays a role similar to that played by exp(−E(σ)/κT) in the Ising problem. On each step, if the move selected is from a k-matching to a (k + 1)-matching, the probability of doing so is λ. Mark Jerrum and Alistair Sinclair show that the fraction of the samples that are k-matchings can be used to estimate the m_k to whatever accuracy is desired and that, for fixed accuracy, the time for doing so is a polynomial in the problem’s size.3 Physicists call estimating the m_k the monomer-dimer problem because having a k-matching means that k pairs have been matched as dimers and the unmatched are monomers.

Rapid mixing

Jerrum and Sinclair have provided convergence results and applications to important combinatorial problems, such as the monomer-dimer problem.3 To obtain their results, they look for a property they call rapid mixing for Markov chains. Jerrum has also proved some “slow convergence” results showing that, in some situations, Metropolis sampling does not mix rapidly and so converges too slowly to be practical.4

Coupling from the past

The Metropolis Algorithm and its generalizations have come to be known as the Monte Carlo Markov Chain technique (MCMC) because they simulate a Markov chain in the hope of sampling from the limit distribution. For the Ising model, this limit distribution is

$$\nu(\sigma) = \frac{\exp\!\bigl(-E(\sigma)/\kappa T\bigr)}{Z(T)}.$$
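As a concrete illustration of the whole procedure, the single-spin-flip Metropolis chain described earlier can be sketched as follows. This is a minimal sketch, not the authors' code; the lattice size, temperature (with the Boltzmann constant κ absorbed into T), step count, and zero-field assumption are arbitrary illustrative choices:

```python
import math
import random

def metropolis_ising(n=8, T=2.5, steps=20000, J=1.0, seed=1):
    """Single-spin-flip Metropolis sampling of a 2D Ising model
    (constant J, zero field, periodic boundaries). Returns the
    sample mean of the magnetization per spin."""
    rng = random.Random(seed)
    spins = [[rng.choice([-1, 1]) for _ in range(n)] for _ in range(n)]
    mag = sum(sum(row) for row in spins)
    total = 0.0
    for _ in range(steps):
        # Propose a site uniformly: the "easy" 1/n probability.
        i, j = rng.randrange(n), rng.randrange(n)
        s = spins[i][j]
        nbrs = (spins[(i - 1) % n][j] + spins[(i + 1) % n][j] +
                spins[i][(j - 1) % n] + spins[i][(j + 1) % n])
        dE = 2.0 * J * s * nbrs  # only four terms of the energy sum change
        # Metropolis criterion: accept with probability min(1, exp(-dE/T)).
        if dE <= 0 or rng.random() < math.exp(-dE / T):
            spins[i][j] = -s
            mag -= 2 * s
        total += mag / (n * n)
    return total / steps
```

Selecting a site uniformly supplies the "easy" proposal distribution, and the min(1, exp(−∆E/κT)) test is exactly the acceptance rule from the rejection formulation above.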
Progress in MCMC has been impressive and seems to be accelerating. Problems that appeared impossible have been solved. For combinatorial counting problems, recent advances have been remarkable. However, two things should be borne in mind.

The first is a famous remark attributed to von Neumann: Anyone using Monte Carlo is in a state of sin. We might add that anyone using MCMC is committing an especially grievous offense. Monte Carlo is a last resort, to be used only when no exact analytic method or even finite numerical algorithm is available. And, except for CFTP, the prescription for use always contains the phrase “simulate for a while,” meaning until you feel as if you’re at the limit distribution. As we mentioned, for the Metropolis method, there are even systems for which convergence is provably slow. The antiferromagnetic Ising model is one such case. In some situations, no randomized method, including MCMC, will converge rapidly.

The second thing to bear in mind is that MCMC is only one of many possible importance-sampling techniques. For several cases, including the dimer cover problem, the ability to approximate the limit distribution […]
References

1. I. Beichl and F. Sullivan, “(Monte-Carlo) Time after Time,” IEEE Computational Science & Eng., Vol. 4, No. 3, July–Sept. 1997, pp. 91–95.
2. N. Metropolis et al., “Equation of State Calculations by Fast Computing Machines,” J. Chemical Physics, Vol. 21, 1953, pp. 1087–1092.
3. M. Jerrum and A. Sinclair, “The Markov Chain Monte Carlo Method: An Approach to Counting and Integration,” Approximation Algorithms for NP-Hard Problems, Dorit Hochbaum, ed., PWS (Brooks/Cole Publishing), Pacific Grove, Calif., 1996, pp. 482–520.
4. V. Gore and M. Jerrum, “The Swendsen-Wang Process Does Not Always Mix Rapidly,” Proc. 29th ACM Symp. Theory of Computing, ACM Press, New York, 1997, pp. 157–165.
5. J.G. Propp and D.B. Wilson, “Exact Sampling with Coupled Markov Chains and Applications to Statistical Mechanics,” Random Structures and Algorithms, Vol. 9, Nos. 1 & 2, 1996, pp. 223–252.
6. I. Beichl and F. Sullivan, “Approximating the Permanent via Importance Sampling with Application to the Dimer Covering Problem,” J. Computational Physics, Vol. 149, No. 1, Feb. 1999, pp. 128–147.

Isabel Beichl is a mathematician in the Information Technology Laboratory at the National Institute of Standards and Technology. Contact her at NIST, Gaithersburg, MD 20899; isabel@cam.nist.gov.

Francis Sullivan is the associate editor-in-chief of CiSE and director of the Institute for Defense Analyses’ Center for Computing Sciences. Contact him at the IDA/Center for Computing Sciences, Bowie, MD 20715; fran@super.org.