
Bernoulli 7(2), 2001, 223-242

An adaptive Metropolis algorithm


HEIKKI HAARIO¹, EERO SAKSMAN¹ and JOHANNA TAMMINEN²
¹Department of Mathematics, P.O. Box 4 (Yliopistonkatu 5), FIN-00014 University of
Helsinki, Finland. E-mail: heikki.haario@helsinki.fi; eero.saksman@helsinki.fi
²Finnish Meteorological Institute, Geophysics Research, P.O. Box 503, FIN-00101
Helsinki, Finland. E-mail: johanna.tamminen@fmi.fi
A proper choice of a proposal distribution for Markov chain Monte Carlo methods, for example for
the Metropolis-Hastings algorithm, is well known to be a crucial factor for the convergence of the
algorithm. In this paper we introduce an adaptive Metropolis (AM) algorithm, where the Gaussian
proposal distribution is updated along the process using the full information cumulated so far. Due to
the adaptive nature of the process, the AM algorithm is non-Markovian, but we establish here that it
has the correct ergodic properties. We also include the results of our numerical tests, which indicate
that the AM algorithm competes well with traditional Metropolis-Hastings algorithms, and
demonstrate that the AM algorithm is easy to use in practical computation.
Keywords: adaptive Markov chain Monte Carlo; comparison; convergence; ergodicity; Markov chain
Monte Carlo; Metropolis-Hastings algorithm

1. Introduction
It is generally acknowledged that the choice of an effective proposal distribution for the
random walk Metropolis algorithm, for example, is essential in order to obtain reasonable
results by simulation in a limited amount of time. This choice concerns both the size and the
spatial orientation of the proposal distribution, which are often very difficult to choose well
since the target density is unknown (see Gelman et al. 1996; Gilks et al. 1995; 1998; Haario
et al. 1999; Roberts et al. 1997). A possible remedy is provided by adaptive algorithms,
which use the history of the process in order to `tune' the proposal distribution suitably. This
has previously been done (for instance) by assuming that the state space contains an atom.
The adaptation is performed only at the times of recurrence to the atom in order to preserve
the right ergodic properties (Gilks et al. 1998). The adaptation criteria are then obtained by
monitoring the acceptance rate. A related and interesting self-regenerative version of adaptive
Markov chain Monte Carlo (MCMC), based on introducing an auxiliary chain, is contained in
the recent preprint of Sahu and Zhigljavsky (1999). For other versions of adaptive MCMC
and related work, we refer to Evans (1991), Fishman (1996), Gelfand and Sahu (1994), Gilks
and Roberts (1995) and Gilks et al. (1994), together with the references therein.
We introduce here an adaptive Metropolis (AM) algorithm which adapts continuously to
the target distribution. Significantly, the adaptation affects both the size and the spatial
orientation of the proposal distribution. Moreover, the new algorithm is straightforward to
implement and use in practice. The definition of the AM algorithm is based on the classical
random walk Metropolis algorithm (Metropolis et al. 1953) and its modification, the AP
algorithm, introduced in Haario et al. (1999). In the AP algorithm the proposal distribution
is a Gaussian distribution centred on the current state, and the covariance is calculated from
a fixed finite number of previous states. In the AM algorithm the covariance of the proposal
distribution is calculated using all of the previous states. The method is easily implemented
with no extra computational cost since one may apply a simple recursion formula for the
covariances involved.
An important advantage of the AM algorithm is that it starts using the cumulating
information right at the beginning of the simulation. The rapid start of the adaptation
ensures that the search becomes more effective at an early stage of the simulation, which
diminishes the number of function evaluations needed.
To be more exact, assume that at time $t$ the already sampled states of the AM chain are
$X_0, X_1, \ldots, X_t$, some of which may be multiple. The new proposal distribution for the
next candidate point is then a Gaussian distribution with mean at the current point $X_t$ and
covariance given by $s_d R$, where $R$ is the covariance matrix determined by the spatial
distribution of the states $X_0, X_1, \ldots, X_t \in \mathbb{R}^d$. The scaling parameter $s_d$ depends only on
the dimension $d$ of the vectors. This adaptation strategy forces the proposal distribution to
approach an appropriately scaled Gaussian approximation of the target distribution, which
increases the efficiency of the simulation. A more detailed description of the algorithm is
given in Section 2 below.
One of the difficulties in constructing adaptive MCMC algorithms is to ensure that the
algorithm maintains the correct ergodicity properties. We observe here (see also Haario et
al. 1999) that the AP algorithm does not possess this property. Our main result, Theorem 1
below, verifies that the AM process does indeed have the correct ergodicity properties,
assuming that the target density is bounded from above and has a bounded support. The
AM chain is not Markovian, but we show that the asymptotic dependence between the
elements of the chain is weak enough to apply known laws of large numbers for
mixingales; see McLeish (1975) and (29) below for this notion. Similar results may also be
proven for variants of the algorithm, where the covariance is computed from a suitably
increasing segment of the near history.
Section 3 contains a detailed description of the AM algorithm as a stochastic process and
the theorem on the ergodicity of the AM. The proof is based on an auxiliary result that is
proven in Section 4. Finally, in Section 5 we present results from test simulations, where the
AM algorithm is compared with traditional Metropolis-Hastings algorithms (Hastings 1970)
by applying both linear and nonlinear, correlated and uncorrelated unimodal target
distributions. Our tests seem to imply that AM performs at least as well as the traditional
algorithms for which a nearly optimal proposal distribution has been given a priori.

2. Description of the algorithm


We assume that our target distribution is supported on the subset $S \subset \mathbb{R}^d$, and that it has the
(unscaled) density $\pi(x)$ with respect to the Lebesgue measure on $S$. With a slight abuse of
notation, we shall also denote the target distribution by $\pi$.


We now explain how the AM algorithm works. Recall from Section 1 that the basic idea
is to update the proposal distribution by using the knowledge we have so far acquired about
the target distribution. Otherwise the definition of the algorithm is identical to the usual
Metropolis process.
Suppose, therefore, that at time $t-1$ we have sampled the states $X_0, X_1, \ldots, X_{t-1}$,
where $X_0$ is the initial state. Then a candidate point $Y$ is sampled from the (asymptotically
symmetric) proposal distribution $q_t(\cdot \mid X_0, \ldots, X_{t-1})$, which now may depend on the whole
history $(X_0, X_1, \ldots, X_{t-1})$. The candidate point $Y$ is accepted with probability
$$\alpha(X_{t-1}, Y) = \min\left(1, \frac{\pi(Y)}{\pi(X_{t-1})}\right),$$
in which case we set $X_t = Y$, and otherwise $X_t = X_{t-1}$. Observe that the chosen probability
for the acceptance resembles the familiar acceptance probability of the Metropolis algorithm.
However, here the choice for the acceptance probability is not based on symmetry
(reversibility) conditions, since these cannot be satisfied in our case: the corresponding
stochastic chain is no longer Markovian. For this reason we have to study the exactness of the
simulation separately, and we do so in Section 3.
The proposal distribution $q_t(\cdot \mid X_0, \ldots, X_{t-1})$ employed in the AM algorithm is a
Gaussian distribution with mean at the current point $X_{t-1}$ and covariance
$C_t = C_t(X_0, \ldots, X_{t-1})$. Note that in the simulation only jumps into $S$ are accepted, since
we assume that the target distribution vanishes outside $S$.
The crucial thing regarding the adaptation is how the covariance of the proposal
distribution depends on the history of the chain. In the AM algorithm this is solved by
setting $C_t = s_d \operatorname{cov}(X_0, \ldots, X_{t-1}) + s_d \varepsilon I_d$ after an initial period, where $s_d$ is a parameter
that depends only on the dimension $d$ and $\varepsilon > 0$ is a constant that we may choose very small
compared to the size of $S$. Here $I_d$ denotes the $d$-dimensional identity matrix. In order to
start, we select an arbitrary, strictly positive definite, initial covariance $C_0$, according to our
best prior knowledge (which may be quite poor). We select an index $t_0 > 0$ for the length of
an initial period and define
$$C_t = \begin{cases} C_0, & t \le t_0, \\ s_d \operatorname{cov}(X_0, \ldots, X_{t-1}) + s_d \varepsilon I_d, & t > t_0. \end{cases} \tag{1}$$
The covariance $C_t$ may be viewed as a function of $t$ variables from $\mathbb{R}^d$ having values in
uniformly positive definite matrices.
Recall the definition of the empirical covariance matrix determined by points
$x_0, \ldots, x_k \in \mathbb{R}^d$:
$$\operatorname{cov}(x_0, \ldots, x_k) = \frac{1}{k}\left(\sum_{i=0}^{k} x_i x_i^T - (k+1)\,\overline{x}_k \overline{x}_k^T\right), \tag{2}$$
where $\overline{x}_k = \frac{1}{k+1}\sum_{i=0}^{k} x_i$ and the elements $x_i \in \mathbb{R}^d$ are considered as column vectors.
So one obtains that in definition (1), for $t \ge t_0 + 1$, the covariance $C_t$ satisfies the recursion
formula
$$C_{t+1} = \frac{t-1}{t}\, C_t + \frac{s_d}{t}\left(t\,\overline{X}_{t-1}\overline{X}_{t-1}^T - (t+1)\,\overline{X}_t\overline{X}_t^T + X_t X_t^T + \varepsilon I_d\right). \tag{3}$$
This allows one to calculate $C_t$ without too much computational cost since the mean $\overline{X}_t$ also
satisfies an obvious recursion formula.
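To make the recursion concrete, here is a minimal numpy sketch (our illustration; the function and variable names are ours) of one step of (3) together with the recursive mean update, checked against the batch formula of definition (1):

```python
import numpy as np

def am_cov_step(C_t, xbar_prev, x_t, t, s_d, eps, d):
    """One step of recursion (3): returns (C_{t+1}, Xbar_t) given C_t,
    the mean Xbar_{t-1} of X_0, ..., X_{t-1}, and the new state X_t."""
    xbar_t = (t * xbar_prev + x_t) / (t + 1)        # recursive mean update
    C_next = (t - 1) / t * C_t + s_d / t * (
        t * np.outer(xbar_prev, xbar_prev)
        - (t + 1) * np.outer(xbar_t, xbar_t)
        + np.outer(x_t, x_t)
        + eps * np.eye(d)
    )
    return C_next, xbar_t

# check against the batch covariance of definition (1)
rng = np.random.default_rng(1)
d, t0 = 3, 5
s_d, eps = 2.4**2 / d, 1e-6
X = rng.standard_normal((50, d))
t = t0 + 1
C = s_d * np.cov(X[:t], rowvar=False) + s_d * eps * np.eye(d)   # C_t
xbar = X[:t].mean(axis=0)                                       # Xbar_{t-1}
for t in range(t0 + 1, 49):
    C, xbar = am_cov_step(C, xbar, X[t], t, s_d, eps, d)
    batch = s_d * np.cov(X[:t + 1], rowvar=False) + s_d * eps * np.eye(d)
    assert np.allclose(C, batch)                                # C_{t+1}
```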
The choice for the length of the initial segment $t_0 > 0$ is free, but the bigger it is chosen
the more slowly the effect of the adaptation is felt. In a sense the size of $t_0$ reflects our
trust in the initial covariance $C_0$. The role of the parameter $\varepsilon$ is just to ensure that $C_t$ will
not become singular (see Remark 1 below). As a basic choice for the scaling parameter we
have adopted the value $s_d = (2.4)^2/d$ from Gelman et al. (1996), where it was shown that
in a certain sense this choice optimizes the mixing properties of the Metropolis search in
the case of Gaussian targets and Gaussian proposals.
Remark 1. In our test runs the covariance $C_t$ has not had the tendency to degenerate. This has
also been the case in our multimodal test examples. However, potential difficulties with $\varepsilon = 0$
(if any) are more likely to appear in the multimodal cases. In practical computations one
presumably may utilize definition (1) with $\varepsilon = 0$, although the change is negligible if $\varepsilon$ has
already been chosen small enough. More importantly, we can prove the correct ergodicity
property of the algorithm only under the assumption $\varepsilon > 0$; see Theorem 1 below.
Remark 2. In order to avoid the algorithm starting slowly it is possible to employ special
tricks. Naturally, if a priori knowledge (such as the maximum likelihood value or
approximate covariance of the target distribution) is available, it can be utilized in
choosing the initial state or the initial covariance C0 . Also, in some cases it is advisable to
employ the greedy start procedure: during a short initial period one updates the proposal
using only the accepted states. Afterwards the AM is run as described above. Moreover,
during the early stage of the algorithm it is natural to require it to move at least a little. If it
has not moved enough in the course of a certain number of iterations, the proposal
distribution could be shrunk by some constant factor.
Remark 3. It is also possible to choose an integer $n_0 > 1$ and update the covariance every
$n_0$th step only (again using the entire history). This saves computer time when generating the
candidate points. There is again a simple recursion formula for the covariances $C_t$.
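Putting the pieces of this section together, the following is a compact sketch of the AM algorithm in Python/numpy. It is our own illustration rather than the authors' code: `log_pi` (the log of the unscaled target density, $-\infty$ outside $S$) and the parameter defaults are placeholders, and for clarity the covariance is recomputed in batch form, whereas a long run should use recursion (3):

```python
import numpy as np

def am_sample(log_pi, x0, n_steps, C0, t0=100, eps=1e-6, seed=0):
    """Sketch of the AM algorithm of Section 2 (illustrative only)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = x.size
    s_d = 2.4**2 / d                    # scaling of Gelman et al. (1996)
    chain = np.empty((n_steps + 1, d))
    chain[0] = x
    lp = log_pi(x)
    C = np.asarray(C0, dtype=float)     # C_t = C0 during the initial period
    for t in range(1, n_steps + 1):
        # candidate from the Gaussian proposal centred at the current state
        y = rng.multivariate_normal(x, C)
        lpy = log_pi(y)
        # accept with probability min(1, pi(y)/pi(x))
        if np.log(rng.uniform()) < lpy - lp:
            x, lp = y, lpy
        chain[t] = x
        # definition (1): adapt once the initial period t0 has passed
        if t >= t0:
            C = (s_d * np.cov(chain[:t + 1], rowvar=False)
                 + s_d * eps * np.eye(d))
    return chain
```

The greedy start of Remark 2 and the $n_0$-step updates of Remark 3 are easy variations of the same loop.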

3. Ergodicity of the AM chain


In the AP algorithm, which was briefly described in Section 1, the covariance $C_t$ was
calculated from the last $H$ states only, where $H \ge 2$. This strategy has the undesirable
consequence of bringing non-exactness into the simulation. There are several ways to see this.
One may, for instance, study the Markov chain consisting of $H$-tuples of consecutive
variables of the AP chain, to obtain the limit distribution for the AP by a suitable projection
from the equilibrium distribution of this Markov chain. Simple examples in the case of finite
state space for an analogous model show that the limiting distribution of the AP algorithm
differs slightly from the target distribution. Numerical calculations in the continuous case
indicate similar behaviour. An illustrative example of this phenomenon is presented in the
Appendix.
It is our aim in this section to show that the AM algorithm has the right ergodic
properties and hence provides correct simulation of the target distribution.
Let us start by recalling some basic notions of the theory of stochastic processes that are
needed later. We first define the set-up. Let $(S, \mathcal{B}, m)$ be a state space and denote by $M(S)$
the set of finite measures on $(S, \mathcal{B})$. The norm $\|\cdot\|$ on $M(S)$ denotes the total variation
norm. Let $n \ge 1$ be a natural number. A map $K_n : S^n \times \mathcal{B} \to [0, 1]$ is a generalized
transition probability on the set $S$ if the map $x \mapsto K_n(x; A)$ is $\mathcal{B}^n$-measurable for each
$A \in \mathcal{B}$, where $x \in S^n$, and $K_n(x; \cdot)$ is a probability measure on $(S, \mathcal{B})$ for each $x \in S^n$. In a
natural way $K_n$ defines a positive contraction from $M(S^n)$ into $M(S)$. A transition
probability on $S$ corresponds to the case $n = 1$ in the above definition.
We assume that a sequence of generalized transition probabilities $(K_n)_{n=1}^{\infty}$ is given.
Moreover, let $\mu_0$ be a probability distribution (the initial distribution) on $S$. Then the
sequence $(K_n)$ and $\mu_0$ determine uniquely the finite-dimensional distributions of the
discrete-time stochastic process (chain) $(X_n)_{n=0}^{\infty}$ on $S$ via the formula

$$P(X_0 \in A_0, X_1 \in A_1, \ldots, X_n \in A_n)
= \int_{y_0 \in A_0} \mu_0(dy_0) \int_{y_1 \in A_1} K_1(y_0; dy_1) \int_{y_2 \in A_2} K_2(y_0, y_1; dy_2) \cdots
\int_{y_n \in A_n} K_n(y_0, y_1, \ldots, y_{n-1}; dy_n). \tag{4}$$
In fact, it is directly verified that these distributions are consistent, and the theorem of Ionescu
Tulcea (see Proposition V.1.1 of Neveu 1965) yields the existence of the chain $(X_n)$ on $S$
satisfying (4).
We shall now turn to the exact definition of the AM chain as a discrete-time stochastic
process. We assume that the target distribution is supported on a bounded subset $S \subset \mathbb{R}^d$,
so that $\pi(x) \equiv 0$ outside $S$. Thus we shall choose $S$ to be our state space, when equipped
with the Borel $\sigma$-algebra $\mathcal{B}(S)$ and choosing $m$ to be the normalized Lebesgue measure on
$S$. The target has the (unscaled) density $\pi(x)$ with respect to the Lebesgue measure on $S$.
We also assume that the density is bounded from above on $S$: for some $M < \infty$, we have
that
$$\pi(x) \le M \qquad \text{for } x \in S. \tag{5}$$

Let $C$ be a symmetric and strictly positive definite matrix on $\mathbb{R}^d$ and denote by $N_C$ the
density of the mean-zero Gaussian distribution on $\mathbb{R}^d$ with covariance $C$. Thus
$$N_C(x) = \frac{1}{(2\pi)^{d/2}\sqrt{|C|}}\,\exp\left(-\frac{1}{2}\,x^T C^{-1} x\right). \tag{6}$$
The Gaussian proposal transition probability corresponding to the covariance $C$ satisfies
$$Q_C(x; A) = \int_A N_C(y - x)\,dy, \tag{7}$$
where $A \subset \mathbb{R}^d$ is a Borel set and $dy$ is the standard Lebesgue measure on $\mathbb{R}^d$. It follows that
$Q_C$ is $m$-symmetric (see Haario and Saksman 1991, Definition 2.2): for $A, B \subset S$ one has
$$\int_B Q_C(x; A)\,m(dx) = \int_A Q_C(x; B)\,m(dx).$$

We next recall the definition of the transition probability $M_C$ for the Metropolis process
having the target density $\pi(x)$ and the proposal distribution $Q_C$:
$$M_C(x; A) = \int_A N_C(y - x)\min\left(1, \frac{\pi(y)}{\pi(x)}\right) m(dy)
+ \chi_A(x)\int_{\mathbb{R}^d} N_C(y - x)\left(1 - \min\left(1, \frac{\pi(y)}{\pi(x)}\right)\right) m(dy), \tag{8}$$
for $A \in \mathcal{B}(S)$, and where $\chi_A$ denotes the characteristic function of the set $A$. It is easily
verified that $M_C$ defines a transition probability with state space $S$.
The following definition of the AM chain corresponds exactly to the AM algorithm
introduced in Section 2.

Definition 1. Let $S$ and $\pi$ be as above and let the initial covariance $C_0$ and the constant
$\varepsilon > 0$ be given. Define the functions $C_n$ for $n \ge 1$ by formula (1). For a given initial
distribution $\mu_0$ the adaptive Metropolis (AM) chain is a stochastic chain on $S$ defined through
(4) by the sequence $(K_n)_{n=1}^{\infty}$ of generalized transition probabilities, where
$$K_n(x_0, \ldots, x_{n-1}; A) = M_{C_n(x_0, \ldots, x_{n-1})}(x_{n-1}; A) \tag{9}$$
for all $n \ge 1$, $x_i \in S$ ($0 \le i \le n-1$), and for subsets $A \in \mathcal{B}(S)$.
Let us turn to the study of the ergodicity properties of the AM chain, which is more
complicated than in the case of Markov chains. In order to be able to proceed we give
some definitions. Recall first the definition of the coefficient of ergodicity (Dobrushin
1956). Let $T$ be a transition probability on $S$ and set
$$\rho(T) = \sup_{\mu_1 \ne \mu_2} \frac{\|\mu_1 T - \mu_2 T\|}{\|\mu_1 - \mu_2\|}, \tag{10}$$
where the supremum is taken over distinct probability measures $\mu_1, \mu_2$ on $(S, \mathcal{B})$. As usual,
$\mu T$ denotes the measure $A \mapsto \int_S T(x; A)\,\mu(dx)$, and for bounded measurable functions $f$ we write
$Tf(x) = \int_S T(x; dy)\, f(y)$ as well as $\mu f = \int_S \mu(dy)\, f(y)$.
Clearly $0 \le \rho(T) \le 1$. In the case $\rho(T) < 1$ the mapping $T$ is a strict contraction on
$M(S)$ with respect to the metric defined by the total variation norm on $M(S)$. From the
definition it easily follows that
$$\rho(T_1 T_2 \cdots T_n) \le \prod_{i=1}^{n} \rho(T_i). \tag{11}$$


The condition $\rho(T^{k_0}) < 1$ for some $k_0 \ge 1$ is well known to be equivalent to the uniform
ergodicity (cf. Nummelin 1984, Section 6.6) of the Markov chain having transition
probability $T$.
For our purposes it is useful to define the transition probability that is obtained from a
generalized transition probability by `freezing' the $n-1$ first variables. Hence, given a
generalized transition probability $K_n$ (where $n \ge 2$) and a fixed $(n-1)$-tuple
$(y_0, y_1, \ldots, y_{n-2}) \in S^{n-1}$, we denote $\tilde{y}_{n-2} = (y_0, y_1, \ldots, y_{n-2})$ and define the transition probability
$K_{n, \tilde{y}_{n-2}}$ by
$$K_{n, \tilde{y}_{n-2}}(x; A) = K_n(y_0, y_1, \ldots, y_{n-2}, x; A) \tag{12}$$
for $x \in S$ and $A \in \mathcal{B}(S)$.


We are now ready to state and prove our main theorem. The role of the assumptions on
the target density is commented on in Remark 7 below.

Theorem 1. Let $\pi$ be the density of a target distribution supported on a bounded measurable
subset $S \subset \mathbb{R}^d$, and assume that $\pi$ is bounded from above. Let $\varepsilon > 0$ and let $\mu_0$ be any initial
distribution on $S$. Define the AM chain $(X_n)$ by the generalized transition probabilities (9) as
in Definition 1. Then the AM chain simulates properly the target distribution $\pi$: for any
bounded and measurable function $f : S \to \mathbb{R}$, the equality
$$\lim_{n \to \infty} \frac{1}{n+1}\bigl(f(X_0) + f(X_1) + \cdots + f(X_n)\bigr) = \int_S f(x)\,\pi(dx)$$
holds almost surely.
The proof of the theorem is based on the following technical auxiliary result, whose
proof we postpone to the next section.
Theorem 2. Assume that the finite-dimensional distributions of the stochastic process
$(X_n)_{n=0}^{\infty}$ on the state space $S$ satisfy (4), where the sequence of generalized transition
probabilities $(K_n)$ is assumed to satisfy the following three conditions:
(i) There are a fixed integer $k_0$ and a constant $\lambda \in (0, 1)$ such that
$$\rho\bigl((K_{n, \tilde{y}_{n-2}})^{k_0}\bigr) \le \lambda < 1 \qquad \text{for all } \tilde{y}_{n-2} \in S^{n-1} \text{ and } n \ge 2.$$
(ii) There are a fixed probability measure $\pi$ on $S$ and a constant $c_0 > 0$ such that
$$\|\pi K_{n, \tilde{y}_{n-2}} - \pi\| \le \frac{c_0}{n} \qquad \text{for all } \tilde{y}_{n-2} \in S^{n-1} \text{ and } n \ge 2.$$
(iii) We have the following estimate for the operator norm:
$$\|K_{n, \tilde{y}_{n-2}} - K_{n+k, \tilde{y}_{n+k-2}}\|_{M(S) \to M(S)} \le c_1 \frac{k}{n},$$
where $c_1$ is a fixed positive constant, $n, k \ge 1$, and one assumes that the $(n+k-1)$-tuple $\tilde{y}_{n+k-2}$ is a direct continuation of the $(n-1)$-tuple $\tilde{y}_{n-2}$.
Then, if $f : S \to \mathbb{R}$ is bounded and measurable, the equality
$$\lim_{n \to \infty} \frac{1}{n+1}\bigl(f(X_0) + f(X_1) + \cdots + f(X_n)\bigr) = \int_S f(x)\,\pi(dx) \tag{13}$$
holds almost surely.


In what follows the auxiliary constants ci , i 2, 3, . . . , depend on S, or C0, and their
actual value is irrelevant for our purposes here.
Proof of Theorem 1. According to Theorem 2 it suffices to prove that the AM chain satisfies
conditions (i)-(iii). In order to check condition (i) we observe that, directly from definition
(1) and by the fact that $S$ is bounded, all the covariances $C = C_n(y_0, \ldots, y_{n-1})$ satisfy the
matrix inequality
$$0 < c_2 I_d \le C \le c_3 I_d. \tag{14}$$
Hence the corresponding normal densities $N_C(\cdot - x)$ are uniformly bounded from below on $S$
for all $x \in S$, and (5) and (8) together trivially yield the bound
$$K_{n, \tilde{y}_{n-2}}(x; A) \ge c_4 \pi(A) \qquad \text{for all } x \in S \text{ and } A \subset S,$$
with $c_4 > 0$. This easily yields (cf. Nummelin 1984, pp. 122-123) that $\rho(K_{n, \tilde{y}_{n-2}}) \le 1 - c_4$,
which proves (i) with $k_0 = 1$.
We next verify condition (iii). To that end we assume that $n \ge 2$ and observe that, for
given $\tilde{y}_{n+k-2} \in S^{n+k-1}$, one has
$$\|K_{n, \tilde{y}_{n-2}} - K_{n+k, \tilde{y}_{n+k-2}}\|_{M(S) \to M(S)} \le 2 \sup_{y \in S,\, A \in \mathcal{B}(S)} |K_{n, \tilde{y}_{n-2}}(y; A) - K_{n+k, \tilde{y}_{n+k-2}}(y; A)|. \tag{15}$$
Fix $y \in S$ and $A \in \mathcal{B}(S)$ and introduce $R_1 = C_n(y_0, \ldots, y_{n-2}, y)$ together with
$R_2 = C_{n+k}(y_0, \ldots, y_{n+k-2}, y)$. According to Definition 1 and formula (8), we see that
$$|K_{n, \tilde{y}_{n-2}}(y; A) - K_{n+k, \tilde{y}_{n+k-2}}(y; A)| = |M_{R_1}(y; A) - M_{R_2}(y; A)|$$
$$\le \left|\int_{x \in A} (N_{R_1} - N_{R_2})(x - y)\min\left(1, \frac{\pi(x)}{\pi(y)}\right) m(dx)\right|
+ \left|\int_{x \in \mathbb{R}^d} \chi_A(x)\,(N_{R_1} - N_{R_2})(x - y)\left(1 - \min\left(1, \frac{\pi(x)}{\pi(y)}\right)\right) m(dx)\right|$$
$$\le 2 \int_{\mathbb{R}^d} |N_{R_1}(z) - N_{R_2}(z)|\,dz
\le 2 \int_{\mathbb{R}^d} dz \int_0^1 \left|\frac{d}{ds}\,N_{R_1 + s(R_2 - R_1)}(z)\right| ds
\le c_5 \|R_1 - R_2\|, \tag{16}$$
where at the last stage we apply (14), in order to deduce that the partial derivatives of the
density $N_{R_1 + s(R_2 - R_1)}$ with respect to the components of the covariance are integrable over
$\mathbb{R}^d$ with bounds that depend only on $\varepsilon$, $C_0$ and $S$. Finally, it is clear from the recursion formula
(3) that in general $\|C_t - C_{t+1}\| \le c_6/t$ for $t > 1$. By applying this inductively and using the
uniform boundedness from above of the covariances $C_t$, we easily see that
$$\|R_1 - R_2\| \le c_7(S, C_0, \varepsilon)\,\frac{k}{n},$$
and hence the previous estimates yield (iii).
In order to check condition (ii), fix $\tilde{y}_{n-2} \in S^{n-1}$ and denote $C^* = C_{n-1}(y_0, \ldots, y_{n-2})$. It
follows that $\|C^* - C_n(y_0, \ldots, y_{n-2}, y)\| \le c_8/n$, where $c_8$ does not depend on $y \in S$. We
may therefore proceed exactly as in (15) and (16) to deduce that
$$\|K_{n, \tilde{y}_{n-2}} - M_{C^*}\|_{M(S) \to M(S)} \le \frac{c_9}{n}.$$
Since $M_{C^*}$ is a Metropolis transition probability we have that $\pi M_{C^*} = \pi$ (see e.g. Tierney
1994, p. 1705), and we obtain
$$\|\pi K_{n, \tilde{y}_{n-2}} - \pi\| = \|\pi(M_{C^*} - K_{n, \tilde{y}_{n-2}})\| \le \frac{c_9}{n},$$
which completes the proof of Theorem 1. □

Let us record an expected result on the behaviour of the AM chain.

Corollary 3. Under the assumptions of Theorem 1 the covariance $C_t$ almost surely stabilizes
during the algorithm. In fact, as $t \to \infty$ the covariance $C_t$ converges to $s_d \operatorname{cov}(\pi) + s_d \varepsilon I_d$,
where $\operatorname{cov}(\pi)$ denotes the covariance of the target distribution $\pi$.

Proof. The claim follows directly from the definition (1) of the covariance $C_t$ by applying
Theorem 1 with the choices $f(x) = x_i$ and $f(x) = x_i x_j$, where $1 \le i, j \le d$. □
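Corollary 3 is also easy to observe numerically. A hedged sketch using the hypothetical `am_sample` helper from Section 2 above (the target, box size and run length are arbitrary choices of ours):

```python
import numpy as np

def log_pi(x):
    # standard Gaussian restricted to a box; the box is so large that the
    # truncated mass is negligible, hence cov(pi) is close to the identity
    return -0.5 * (x @ x) if np.all(np.abs(x) < 10.0) else -np.inf

d, eps = 2, 1e-6
chain = am_sample(log_pi, x0=np.zeros(d), n_steps=20_000,
                  C0=np.eye(d), t0=100, eps=eps)
s_d = 2.4**2 / d
# the stabilized proposal covariance of definition (1) ...
C_t = s_d * np.cov(chain, rowvar=False) + s_d * eps * np.eye(d)
# ... should be close to s_d * cov(pi) + s_d * eps * I_d, here about s_d * I_2
print(np.round(C_t, 2))
```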
We conclude this section with a number of comments on the theory presented above.
Remark 4. Our decision to use Gaussian proposal distributions is based on their tested
practical applicability, even in the case of non-Gaussian targets. Gaussian proposals yield a
family of proposal distributions with a natural parametrization for size and orientation and
which are easy to compute with. However, in the definition of the AM chain one can easily
replace the Gaussian proposals by, for example, uniform distributions in a parallelepiped. In
this case the size and the orientation of the parallelepiped are guided in a natural manner by
the covariance $C_t$ that is determined by (1) as above. Our proof of Theorem 1 remains
unchanged and we again obtain that the simulation is exact. The only difference is that the
constant $k_0$ in condition (i) of Theorem 2 may now exceed 1. Naturally, here one has to add
suitable assumptions on the set $A = \{x : \pi(x) > 0\}$. It is, for example, enough to assume that
$A$ is open and connected. In this connection the estimates provided by Haario and Saksman
(1991, Theorem 6.5(b)) are relevant.


Similar remarks apply to modifications where one adapts only certain parameters or some
of the parameters are discrete.
Remark 5. It is clear that in the course of the AM algorithm one may also determine the
covariance by using only an increasing part of the near history. For example, one may
determine $C_n$ by using only the samples $X_{[n/2]}, X_{[n/2]+1}, \ldots, X_n$. This is easily implemented
in practice and in this case Theorem 1 yields that the simulation is exact with only minor
changes in the proof. Similar remarks apply also to the case where one updates the
covariance only every $n_0$th step (see Remarks 3 and 8).
Remark 6. Theorem 2 can also be used to prove the correct ergodicity for certain other
variants of adaptation, such as algorithms where one suitably tunes the proposal distribution
according to the acceptance rate. However, in our specific practical applications it has turned
out that the tuning of the acceptance rate has yielded inferior results when compared with the
AM algorithm. A similar phenomenon is demonstrated in Figure 2 below. Moreover, in
high-dimensional cases with possible correlations between the parameters, it may be difficult to
tune the proposal distribution effectively basing the decision on one parameter only. This is
the case even if one uses the single-component Metropolis algorithm.
Remark 7. The proof of Theorem 1 requires that the target density has compact support and
is bounded from above. Otherwise the uniform ergodicity (condition (i) of Theorem 2) may
fail, which is important if we are to be able to control the effects of the adaptation. In the
Markovian case (for example, standard Metropolis-Hastings) uniform ergodicity is, of
course, not needed to ensure that the simulation is correct, although without it the theoretical
convergence rate may be very slow. However, the requirements above on the target density
correspond reasonably well to practical situations. We believe that one may weaken the
assumptions at the cost of more elaborate proofs. We prefer to leave this topic for future
research, since our main aim here is to introduce a new method and to demonstrate its
usefulness. In our test runs the AM algorithm has also worked successfully with (unrestricted)
Gaussian targets.

4. Proof of Theorem 2
In this section we will prove Theorem 2 by showing that a related process is a mixingale (in
the sense of McLeish 1975) that satisfies an appropriate law of large numbers. The conditions
of the theorem were tailored to apply to the AM chain on bounded subsets of $\mathbb{R}^d$, but they
are stated in the language of a general state space. This is advantageous since one may apply
them in a more general situation, especially for variants of the AM where the state space
contains both discrete and continuous parts. Our proof is based on the following basic
proposition.
Proposition 4. Let the chain $(X_n)$ on the state space $S$ and the generalized transition
probabilities $(K_n)$ fulfil the conditions of Theorem 2. Denote by $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$
the $\sigma$-algebra generated by the chain up to time $n$ and write $\lambda' = \lambda^{1/k_0}$. Let $n \ge 1$ and
$k \ge 2$. Then for all initial distributions and for any bounded measurable function $f$ on $S$, the
inequality
$$\left\|E\bigl(f(X_{n+k}) \mid \mathcal{F}_n\bigr) - \int_S f(y)\,\pi(dy)\right\|_\infty \le c(c_0, c_1, \lambda) \inf_{1 \le j \le k}\left(\frac{j^2}{n+k-j} + (\lambda')^j\right)\|f\|_\infty \tag{17}$$
holds.

Proof. We may clearly assume that $\pi f = \int_S f(y)\,\pi(dy) = 0$, since the general case is then
obtained by applying the proposition to the function $f - \pi f$. Let $n \ge 1$, $k \ge 2$ and note that
from the definition of the conditional expectation and (4) it follows that (almost surely)
$$E\bigl(f(X_{n+k}) \mid \mathcal{F}_n\bigr) = \int_{y_{n+1} \in S} K_{n+1}(X_0, \ldots, X_n; dy_{n+1}) \int_{y_{n+2} \in S} K_{n+2}(X_0, \ldots, X_n, y_{n+1}; dy_{n+2}) \cdots
\int_{y_{n+k} \in S} K_{n+k}(X_0, \ldots, X_n, y_{n+1}, \ldots, y_{n+k-1}; dy_{n+k})\, f(y_{n+k}). \tag{18}$$

Let us denote $(X_0, \ldots, X_n) = \tilde{X}_n$. In what follows $\tilde{X}_n$ does not interfere with the
integrations and hence it may be thought of as a free variable (or constant). We also introduce
the transition probability $Q$, where $Q(y; dz) = K_{n+2}(\tilde{X}_n, y; dz)$. Condition (iii) yields for
arbitrary values of $\tilde{X}_n$ and $y_{n+1}, \ldots, y_{n+k-1}$ that
$$\left|\int_{y_{n+k} \in S} \bigl(K_{n+k}(\tilde{X}_n, y_{n+1}, \ldots, y_{n+k-1}; dy_{n+k}) - K_{n+2}(\tilde{X}_n, y_{n+k-1}; dy_{n+k})\bigr)\, f(y_{n+k})\right|
\le c_1 \|f\|_\infty \frac{k-2}{n+2}. \tag{19}$$

This estimate enables us to write (18) in the form
$$E\bigl(f(X_{n+k}) \mid \mathcal{F}_n\bigr) = g_k(\tilde{X}_n) + \int_{y_{n+1} \in S} K_{n+1}(\tilde{X}_n; dy_{n+1}) \int_{y_{n+2} \in S} K_{n+2}(\tilde{X}_n, y_{n+1}; dy_{n+2}) \cdots
\int_{y_{n+k-1} \in S} K_{n+k-1}(\tilde{X}_n, y_{n+1}, \ldots, y_{n+k-2}; dy_{n+k-1})
\int_{y_{n+k} \in S} K_{n+2}(\tilde{X}_n, y_{n+k-1}; dy_{n+k})\, f(y_{n+k}), \tag{20}$$
where $g_k = g_k(\tilde{X}_n)$ satisfies
$$|g_k(\tilde{X}_n)| \le \int_{y_{n+1} \in S} K_{n+1}(\tilde{X}_n; dy_{n+1}) \int_{y_{n+2} \in S} K_{n+2}(\tilde{X}_n, y_{n+1}; dy_{n+2}) \cdots
\int_{y_{n+k-1} \in S} K_{n+k-1}(\tilde{X}_n, y_{n+1}, \ldots, y_{n+k-2}; dy_{n+k-1})\; c_1 \|f\|_\infty \frac{k-2}{n+2}
\le c_1 \|f\|_\infty \frac{k-2}{n+2}.$$

In the next step we iterate the procedure by replacing the generalized transition
probability $K_{n+k-1}(\tilde{X}_n, y_{n+1}, \ldots, y_{n+k-2}; dy_{n+k-1})$ by the transition probability $Q$ in
formula (20). By continuing in this manner we obtain
$$E\bigl(f(X_{n+k}) \mid \mathcal{F}_n\bigr) = \int_{y_{n+1} \in S} K_{n+1}(\tilde{X}_n; dy_{n+1}) \int_{y_{n+2} \in S} Q(y_{n+1}; dy_{n+2}) \cdots
\int_{y_{n+k} \in S} Q(y_{n+k-1}; dy_{n+k})\, f(y_{n+k}) + g_2(\tilde{X}_n) + g_3(\tilde{X}_n) + \cdots + g_k(\tilde{X}_n),$$
where
$$g_j(\tilde{X}_n) = \int_{y_{n+1} \in S} K_{n+1}(\tilde{X}_n; dy_{n+1}) \int_{y_{n+2} \in S} K_{n+2}(\tilde{X}_n, y_{n+1}; dy_{n+2}) \cdots
\int_{y_{n+j} \in S} \bigl(K_{n+j}(\tilde{X}_n, y_{n+1}, \ldots, y_{n+j-1}; dy_{n+j}) - K_{n+2}(\tilde{X}_n, y_{n+j-1}; dy_{n+j})\bigr)\,(Q^{k-j} f)(y_{n+j}). \tag{21}$$

Recall here that $Q^{k-j}$ denotes the $(k-j)$th iterate of the transition probability $Q$ and we
apply the standard notation $(Q^{k-j} f)(x) = \int_S Q^{k-j}(x; dy)\, f(y)$.
Since $\|Q^{k-j} f\|_\infty \le \|f\|_\infty$, we obtain as before from condition (iii) that
$$|g_j| \le c_1 \frac{j-2}{n+2}\,\|f\|_\infty.$$
Summing up, we have shown that
$$E\bigl(f(X_{n+k}) \mid \mathcal{F}_n\bigr) = \theta_{n,k} + \int_{y_{n+1} \in S} K_{n+1}(X_0, \ldots, X_n; dy_{n+1})\,(Q^{k-1} f)(y_{n+1}), \tag{22}$$
where $\theta_{n,k} = \theta_{n,k}(X_0, \ldots, X_n)$ satisfies

$$|\theta_{n,k}| \le \sum_{j=2}^{k} c_1 \frac{j-2}{n+2}\,\|f\|_\infty \le \frac{c_1 k^2}{n}\,\|f\|_\infty. \tag{23}$$

Next write $[(k-1)/k_0] = k'$ and notice that $\rho(Q^{k-1}) \le \lambda^{k'}$ according to (i). By (ii) and
the definition of $Q$, we have
$$\|\pi Q^{k-1} - \pi\| \le \sum_{j=0}^{k-2} \|\pi Q^{j+1} - \pi Q^j\| \le \sum_{j=0}^{k-2} \frac{c_0}{n+2} \le \frac{c_0 (k-1)}{n},$$
and hence, using the assumption $\pi f = 0$, we may estimate
$$\|Q^{k-1} f\|_\infty = \sup_{x \in S} |\delta_x Q^{k-1} f| \le \sup_{x \in S} |(\delta_x - \pi) Q^{k-1} f| + |\pi Q^{k-1} f|$$
$$\le 2\lambda^{k'} \|f\|_\infty + |(\pi Q^{k-1} - \pi) f| \le \frac{c_0 (k-1)}{n+2}\,\|f\|_\infty + 2\lambda^{k'} \|f\|_\infty. \tag{24}$$
Combining this with (22) and (23), it follows that
$$\|E\bigl(f(X_{n+k}) \mid \mathcal{F}_n\bigr)\|_\infty \le \tilde{c}(c_0, c_1, \lambda)\left(\frac{k^2}{n} + \lambda^{[(k-1)/k_0]}\right)\|f\|_\infty, \tag{25}$$
which is valid for all $n, k \ge 2$.


In order to deduce the proposition, we rst observe that for any index j between 1 and k
the standard properties of the conditional expectation yield that
kE( f (X n k )jF n )k1 < kE( f (X n k )jF n k j )k1 :
Hence, by replacing n by n k j and k by j in the estimate (25), we nally deduce that
!
j2
[( j1)= k 0 ]

(26)
k f k1 :
kE( f (X n k )jF n )k1 < inf c~(c0 , c1 , )
1< j< k
n k j
The claim of the proposition follows immediately from this estimate.
Proof of Theorem 2. From Proposition 4 we obtain, for $n \ge 1$ and $k \ge 0$, that
$$\left\|E\Bigl(f(X_{n+k}) - \int_S f(y)\,\pi(dy) \,\Big|\, \mathcal{F}_n\Bigr)\right\|_\infty \le \phi(k), \tag{27}$$
where $\phi(0) = \phi(1) = 2\|f\|_\infty$, and for $k \ge 2$ we have
$$\phi(k) = c(c_0, c_1, \lambda) \inf_{1 \le j \le k}\left(\frac{j^2}{k-j} + (\lambda')^j\right)\|f\|_\infty \le c'(c_0, c_1, f, \lambda)\,\frac{\log^2 k}{k}, \tag{28}$$
where the last estimate is obtained by choosing $j \approx \log k / \log(1/\lambda')$ for $k \ge k_1(\lambda')$.
At this stage the estimate (28) for the asymptotic independence, together with the
definition of the $\sigma$-algebra $\mathcal{F}_n$, makes it clear that $f(X_n) - Ef(X_n)$ is a mixingale in the
sense of McLeish; see McLeish (1975) or Hall and Heyde (1980, p. 19). For the
convenience of the reader, let us recall here the definition of mixingales.


Let $(\mathcal{F}_n)_{n=-\infty}^{\infty}$ be an increasing sequence of sub-$\sigma$-algebras on a probability space. A sequence $(Y_n)_{n=1}^{\infty}$ of
square-integrable random variables is a mixingale (difference) sequence if there are real
sequences $(r_m)_{m=0}^{\infty}$ and $(a_n)_{n=1}^{\infty}$ such that $r_m \to 0$ as $m \to \infty$, and
$$\|E(Y_n \mid \mathcal{F}_{n-m})\|_2 \le r_m a_n \qquad \text{and} \qquad \|Y_n - E(Y_n \mid \mathcal{F}_{n+m})\|_2 \le r_{m+1} a_n \tag{29}$$
for all $n \ge 1$ and $m \ge 0$. In our case, where $Y_n = f(X_n) - Ef(X_n)$, we take $(a_n)$ to be a
constant sequence and let $\mathcal{F}_n$ be the trivial $\sigma$-algebra for $n < 0$. The right-hand side condition
in (29) is automatically satisfied. Moreover, we may choose $r_k = \phi(k)$, and it follows that
$r_k \le C(\beta)\, k^{\beta-1}$ for every $\beta > 0$. Hence we may apply directly the well-known laws of large
numbers for mixingales, in the form of Hall and Heyde (1980, Theorem 2.21, p. 41), to the
sequence $f(X_n) - Ef(X_n)$. The desired conclusion is obtained by observing that (27) yields
$\lim_{n \to \infty} Ef(X_n) = \int_S f(y)\,\pi(dy)$. □
Remark 8. We refer to the original article (McLeish 1975) or to the recent review article
(Davidson and de Jong 1997) for basic properties of mixingales. However, we point out that
the proof of Theorem 2 could be concluded by elementary means, without referring to the
theory of mixingales, by applying Proposition 4 to estimate the variance of the sum
$S_n = \frac{1}{n}\sum_{k=1}^{n} f(X_k) - \int_S f(y)\,\pi(dy)$ and utilizing the boundedness of the function $f$.
Nevertheless, the reference to mixingales is useful since it is possible to weaken condition
(iii) and still obtain Theorem 2. In this manner one obtains Theorem 1 also in the case where
the covariance is calculated from a relatively slowly increasing segment of the near history
only (cf. Remark 5). For instance, this is the case if at time $t$ this segment has length $\approx t^{\gamma}$,
where $\gamma \in (\frac{1}{2}, 1)$.
Finally, we note that in this paper we have left open the question whether the
convergence of the algorithm (as established in Theorem 1) satisfies a central limit theorem.

5. Testing AM in practice and comparison with traditional methods
In this section we present results obtained from testing the AM algorithm numerically. From the
practical point of view, it is important to know how accurate the simulations of the target
distribution will be that one can expect to get from finite MCMC runs. In Haario et al. (1999) we
compared three different methods: the random walk Metropolis algorithm (M), the
single-component Metropolis algorithm (SC), and the adaptive proposal algorithm (AP); see Section 1
or Haario et al. (1999) for the exact definition and more details. Recall again that the difference
between the AP and AM algorithms was simply that in AP the covariance for the proposal
distribution was computed only from a fixed number of previous states. Here we have done
similar tests to those in Haario et al. (1999) and included the AM algorithm in the comparison.
We have tested the AM algorithm for various dimensions up to $d = 200$. The algorithm
appears to work successfully. Naturally the adaptation becomes slower as the dimension
increases and becomes more sensitive to a very bad choice of the initial covariance. Here
we present the results of extensive tests in dimension $d = 8$. We used two restricted


Gaussian distributions as the target distributions, uncorrelated ($\pi_1$) and correlated ($\pi_2$),
and two nonlinear `banana'-shaped distributions with compact supports, moderately
`twisted' ($\pi_3$) and strongly `twisted' ($\pi_4$). The supports of the test distributions are compact
in order to satisfy the assumptions of our theoretical result (Theorem 1).
Our test distributions are obtained from those used in Haario et al. (1999) by setting the
densities to zero outside a compact set. Hence, the density of our first test distribution $\pi_1$ is
an uncorrelated Gaussian density, which is centred, has covariance $\operatorname{diag}(100, 1, \ldots, 1)$
and is restricted to a parallelepiped with corners at the points $(\pm 35, \pm 3.5, \ldots, \pm 3.5)$. In this
set-up about 99.6% of the probability mass of the unrestricted Gaussian is contained in the
parallelepiped. The correlated restricted Gaussian distribution $\pi_2$ is obtained from $\pi_1$ simply
by rotating the distribution so that the main axis corresponds to the direction $(1, \ldots, 1)$.
The twisted (and restricted) Gaussian test distributions $\pi_3$ and $\pi_4$ are similarly obtained
from $\pi_1$ by applying the same measure-preserving transformations as are used in Haario et
al. (1999, p. 381); a sketch of this construction is given below. We refer to Haario et al. (1999)
for a more detailed explanation of the test procedure, and especially to p. 382 for pictures of
the corresponding unrestricted target distributions.
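For readers who wish to reproduce the test targets, the following sketch shows one way to build them. The twist map and its parameter values are our assumptions based on Haario et al. (1999) and are not spelled out in the present paper: we take the measure-preserving map $\phi_b(x) = (x_1, x_2 + b x_1^2 - 100b, x_3, \ldots, x_d)$ with a moderate and a strong value of $b$.

```python
import numpy as np

def log_pi1(x, half_widths):
    """Unscaled log-density of pi_1: centred Gaussian with covariance
    diag(100, 1, ..., 1), restricted to a box (-inf outside)."""
    if np.any(np.abs(x) > half_widths):
        return -np.inf
    return -0.5 * (x[0]**2 / 100.0 + x[1:] @ x[1:])

def twist(x, b):
    """Assumed 'banana' map phi_b: it bends only the second coordinate
    and has Jacobian determinant 1, hence is measure-preserving."""
    y = np.array(x, dtype=float)
    y[1] = x[1] + b * x[0]**2 - 100.0 * b
    return y

def log_pi_banana(x, b, half_widths):
    # since |det D(phi_b)| = 1, the twisted density is pi_1 composed with
    # phi_b, with no Jacobian correction term
    return log_pi1(twist(x, b), half_widths)

d = 8
half_widths = np.array([35.0] + [3.5] * (d - 1))   # box used for pi_1
# b = 0.03 (moderate) and b = 0.1 (strong) are our guesses for the twist
# strengths behind pi_3 and pi_4
```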
The number of function evaluations varied depending on the target distribution: 20 000
for $\pi_1$ and $\pi_2$, 40 000 for $\pi_3$ and 80 000 for $\pi_4$. The starting values were sampled relatively
close to the peak values of the target densities. The burn-in period was chosen to be half of
the chain length. Each test case was run 100 times in order to retrieve statistically relevant
information. Hence, each accuracy criterion number is an average value over 100
repetitions.
We have tried to be fair in choosing the proposal distributions for the random walk
Metropolis and the single-component Metropolis algorithms. For example, in the case of the
restricted Gaussian target distributions we used for the Metropolis algorithm covariances
corresponding to the unrestricted targets and normalized them with the heuristic optimal
scaling from Gelman et al. (1996).
In Figure 1 the test results in dimension 8 are summarized in graphical form. We present
the mean and the error bars giving the standard deviations corresponding to the 68.3%
probability region. The results expressed in the figure indicate that the AM algorithm
simulates the target distribution most accurately in these tests. With the restricted Gaussian
target distributions the results obtained using the AM algorithm are equally as good as
those using the Metropolis algorithm with an optimal proposal distribution. Moreover, in the
case of nonlinear distributions the AM algorithm seems to be superior.
In Figure 2 we compare the performance of the AM algorithm with the Metropolis
algorithm. The proposal distribution for the Metropolis algorithm was symmetric and its
size was selected so that the acceptance rate becomes quite optimal. We used $\pi_1$ as the
target distribution. In Figure 2 the autocorrelation functions of the AM and the Metropolis
algorithm are drawn for two projections. In the direction of the largest width of the target
distribution the autocorrelation of the Metropolis algorithm indicates weaker convergence.
This example demonstrates how tuning the proposal distribution according to the
acceptance rate only may lead to difficulties.
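The autocorrelation comparison of Figure 2 is straightforward to compute from any chain; a small sketch (ours) of the empirical autocorrelation of a one-dimensional projection:

```python
import numpy as np

def acf(x, max_lag):
    """Empirical autocorrelation of a scalar chain x at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    c0 = x @ x / len(x)
    return np.array([(x[:len(x) - k] @ x[k:]) / len(x) / c0
                     for k in range(max_lag + 1)])

# e.g. project the chain onto the direction of the largest eigenvalue of
# its empirical covariance and compare the decay for AM vs. Metropolis:
# w = np.linalg.eigh(np.cov(chain, rowvar=False))[1][:, -1]
# rho = acf(chain @ w, max_lag=200)
```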
Finally, we point out that according to our tests there was no essential difference in the
performance of the AM algorithm between the restricted and unrestricted target

[Figure 1 appears here: mean hit frequencies (vertical axis, 55-80%) with error bars for SC, M, AP and AM on each target distribution.]

Figure 1. Comparison of the performance of the single-component Metropolis algorithm (SC),
Metropolis algorithm (M), adaptive proposal algorithm (AP) and adaptive Metropolis algorithm (AM)
with different eight-dimensional target distributions $\pi_1$-$\pi_4$. The symbols correspond to the mean
frequency of hits in the 68.3% probability region of 100 simulations and the error bar around the
symbol corresponds to the standard deviation of the hits. The true value (68.3%) is indicated by a
horizontal line.

distributions. Thus, it is reasonable to expect that an analogue of Theorem 1 also holds for
non-compactly supported distributions whose densities decay rapidly enough.

Appendix
We present here an illustrative two-dimensional example also considered in Haario et al.
(1999). There the target distribution was tested with the AP algorithm, where the covariance
$C_t$ was calculated from the last 200 states; see Section 1 or Haario et al. (1999) for the
definition of the AP algorithm. In the example the AP algorithm produced considerable error
in the simulation. This phenomenon underlines the importance of calculating the covariance
from an increasing segment of the history, as is done in the AM algorithm. When the AM
algorithm was applied to the same example it produced, as expected, simulation that was free
of bias. For many practical applications the error produced by the AP algorithm is, however,
ignorable (see Haario et al. 1999).
Example 1. Let us define the density $\pi$ on the rectangle $R = [-18, 18] \times [-3, 3] \subset \mathbb{R}^2$. Let
$S = [-0.5, 0.5] \times [-3, 3]$ and set
$$\pi(x) = \begin{cases} 36 & \text{if } x \in S, \\ 1 & \text{if } x \in R \setminus S. \end{cases}$$


Figure 2. Comparison of the performance of the AM algorithm (left column) and the Metropolis
algorithm with a fairly optimal acceptance rate (right column). The target distribution was $\pi_1$. On the
top line the autocorrelation function in the direction of the largest eigenvalue of the target's covariance
matrix is shown. The bottom line corresponds to an orthogonal direction. The acceptance rate with the
AM algorithm was 27% and with the Metropolis algorithm 26%.

(See Figure 3 for the one-dimensional projection of the density function.)

With this choice $\pi(S) : \pi(R \setminus S) = 36 : 35$ (the strip $S$ has area 6 and carries mass
$36 \cdot 6 = 216$, while $R \setminus S$ has area 210 and carries mass 210), and hence about half of the mass is
concentrated on the middle strip $S$. Thus an ergodic MCMC algorithm should stay for
about the same amount of time on $S$ as on $R \setminus S$. However, $S$ and $R$ are thin rectangles with
opposite orientations. This forces the AP algorithm to regularly turn the direction of the
proposal distribution. This causes notable bias in the simulation on $S$ (see Figure 4(a),



[Figure 3 appears here: the density profile plotted against the first coordinate, vertical axis `Density' from 0 to 40, horizontal axis from -20 to 20.]

Figure 3. The (unscaled) one-dimensional projection of the true target distribution of Example 1.

Figure 4. The difference between the real target distribution of Example 1 and the sampled
distributions. In (a) the sampling was done using the AP algorithm. The curve represents the mean
values of 100 runs with 100 000 states. In (b) the sampling method was the AM algorithm.

where the difference between the true target distribution and the one simulated by AP is
presented). In fact, the relative error in the simulation on $S$ is about 10%. There is also a
slight error in the simulation near the far ends of the rectangle $R$. The corresponding
unbiased results of the AM algorithm are presented in Figure 4(b).


Acknowledgements
We thank Elja Arjas, Kari Auranen, Esa Nummelin and Antti Penttinen for useful discussions
on the topics of the paper. The second author (ES) was supported by the Academy of Finland,
Project 32837.

References
Davidson, J. and de Jong, R. (1997) Strong laws of large numbers for dependent heterogeneous
processes: a synthesis of recent and new results. Econometric Rev., 16, 251-279.
Dobrushin, R. (1956) Central limit theorems for non-stationary Markov chains II. Theory Probab.
Appl., 1, 329-383.
Evans, M. (1991) Chaining via annealing. Ann. Statist., 19, 382-393.
Fishman, G.S. (1996) Monte Carlo: Concepts, Algorithms and Applications. New York: Springer-Verlag.
Gelfand, A.E. and Sahu, S.K. (1994) On Markov chain Monte Carlo acceleration. J. Comput. Graph.
Statist., 3, 261-276.
Gelman, A.G., Roberts, G.O. and Gilks, W.R. (1996) Efficient Metropolis jumping rules. In J.M.
Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith (eds), Bayesian Statistics V, pp. 599-608.
Oxford: Oxford University Press.
Gilks, W.R. and Roberts, G.O. (1995) Strategies for improving MCMC. In W.R. Gilks, S. Richardson
and D.J. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, pp. 75-88. London:
Chapman & Hall.
Gilks, W.R., Roberts, G.O. and George, E.I. (1994) Adaptive direction sampling. The Statistician, 43,
179-189.
Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (1995) Introducing Markov chain Monte Carlo. In
W.R. Gilks, S. Richardson and D.J. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice,
pp. 1-19. London: Chapman & Hall.
Gilks, W.R., Roberts, G.O. and Sahu, S.K. (1998) Adaptive Markov chain Monte Carlo. J. Amer.
Statist. Assoc., 93, 1045-1054.
Haario, H. and Saksman, E. (1991) Simulated annealing process in general state space. Adv. Appl.
Probab., 23, 866-893.
Haario, H., Saksman, E. and Tamminen, J. (1999) Adaptive proposal distribution for random walk
Metropolis algorithm. Comput. Statist., 14, 375-395.
Hall, P. and Heyde, C.C. (1980) Martingale Limit Theory and Its Application. New York: Academic
Press.
Hastings, W.K. (1970) Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57, 97-109.
McLeish, D.L. (1975) A maximal inequality and dependent strong laws. Ann. Probab., 3, 829-839.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E. (1953) Equations of
state calculations by fast computing machines. J. Chem. Phys., 21, 1087-1091.
Neveu, J. (1965) Mathematical Foundations of the Calculus of Probability. San Francisco: Holden-Day.
Nummelin, E. (1984) General Irreducible Markov Chains and Non-negative Operators. Cambridge:
Cambridge University Press.
Roberts, G.O., Gelman, A. and Gilks, W.R. (1997) Weak convergence and optimal scaling of random
walk Metropolis algorithms. Ann. Appl. Probab., 7, 110-120.


Sahu, S.K. and Zhigljavsky, A.A. (1999) Self regenerative Markov chain Monte Carlo with adaptation.
Preprint. http://www.statslab.cam.ac.uk/mcmc.
Tierney, L. (1994) Markov chains for exploring posterior distributions (with discussion). Ann. Statist.,
22, 1701-1762.
Received June 1998 and revised February 2000
