An Adaptive Metropolis Algorithm
ISSN 1350-7265 © 2001 ISI/BS
1. Introduction
It is generally acknowledged that the choice of an effective proposal distribution for the
random walk Metropolis algorithm, for example, is essential in order to obtain reasonable
results by simulation in a limited amount of time. This choice concerns both the size and the
spatial orientation of the proposal distribution, which are often very difficult to choose well
since the target density is unknown (see Gelman et al. 1996; Gilks et al. 1995; 1998; Haario
et al. 1999; Roberts et al. 1997). A possible remedy is provided by adaptive algorithms,
which use the history of the process in order to `tune' the proposal distribution suitably. This
has previously been done (for instance) by assuming that the state space contains an atom.
The adaptation is performed only at the times of recurrence to the atom in order to preserve
the right ergodic properties (Gilks et al. 1998). The adaptation criteria are then obtained by
monitoring the acceptance rate. A related and interesting self-regenerative version of adaptive
Markov chain Monte Carlo (MCMC), based on introducing an auxiliary chain, is contained in
the recent preprint of Sahu and Zhigljavsky (1999). For other versions of adaptive MCMC
and related work, we refer to Evans (1991), Fishman (1996), Gelfand and Sahu (1994), Gilks
and Roberts (1995) and Gilks et al. (1994), together with the references therein.
We introduce here an adaptive Metropolis (AM) algorithm which adapts continuously to
the target distribution. Significantly, the adaptation affects both the size and the spatial
orientation of the proposal distribution. Moreover, the new algorithm is straightforward to
implement and use in practice. The definition of the AM algorithm is based on the classical
random walk Metropolis algorithm (Metropolis et al. 1953) and its modification, the AP
algorithm, introduced in Haario et al. (1999). In the AP algorithm the proposal distribution
is a Gaussian distribution centred on the current state, and the covariance is calculated from
a fixed finite number of previous states. In the AM algorithm the covariance of the proposal
distribution is calculated using all of the previous states. The method is easily implemented
with no extra computational cost since one may apply a simple recursion formula for the
covariances involved.
An important advantage of the AM algorithm is that it starts using the cumulating
information right at the beginning of the simulation. The rapid start of the adaptation
ensures that the search becomes more effective at an early stage of the simulation, which
diminishes the number of function evaluations needed.
To be more exact, assume that at time $t$ the already sampled states of the AM chain are
$X_0, X_1, \ldots, X_t$, some of which may be multiple. The new proposal distribution for the
next candidate point is then a Gaussian distribution with mean at the current point $X_t$ and
covariance given by $s_d R$, where $R$ is the covariance matrix determined by the spatial
distribution of the states $X_0, X_1, \ldots, X_t \in \mathbb{R}^d$. The scaling parameter $s_d$ depends only on
the dimension $d$ of the vectors. This adaptation strategy forces the proposal distribution to
approach an appropriately scaled Gaussian approximation of the target distribution, which
increases the efficiency of the simulation. A more detailed description of the algorithm is
given in Section 2 below.
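The adaptation just described is easy to prototype. Below is a minimal Python sketch, not the authors' implementation: the function name `am_sample`, the log-density argument `log_pi` and all default values are illustrative assumptions, and for simplicity the proposal covariance is recomputed from the full history with `np.cov` rather than by the recursion discussed later.

```python
import numpy as np

def am_sample(log_pi, x0, n_iter, t0=100, eps=1e-6, seed=0):
    """Sketch of the adaptive Metropolis idea: after an initial period t0,
    the Gaussian proposal covariance is s_d * cov(all past states) + s_d*eps*I."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    s_d = 2.4**2 / d                      # scaling suggested by Gelman et al. (1996)
    chain = np.empty((n_iter, d))
    chain[0] = x0
    C = np.eye(d)                         # initial covariance C_0 (prior guess)
    for t in range(1, n_iter):
        x = chain[t - 1]
        y = rng.multivariate_normal(x, C)             # candidate point
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            chain[t] = y                              # accept
        else:
            chain[t] = x                              # reject; the state repeats
        if t > t0:                                    # adapt using the whole history
            C = s_d * np.cov(chain[: t + 1], rowvar=False) + s_d * eps * np.eye(d)
    return chain
```

For targets with bounded support, as assumed by the ergodicity result below, `log_pi` would simply return `-np.inf` outside the support.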
One of the difficulties in constructing adaptive MCMC algorithms is to ensure that the
algorithm maintains the correct ergodicity properties. We observe here (see also Haario et
al. 1999) that the AP algorithm does not possess this property. Our main result, Theorem 1
below, verifies that the AM process does indeed have the correct ergodicity properties,
assuming that the target density is bounded from above and has a bounded support. The
AM chain is not Markovian, but we show that the asymptotic dependence between the
elements of the chain is weak enough to apply known theorems of large numbers for
mixingales; see McLeish (1975) and (29) below for this notion. Similar results may be
proven also for variants of the algorithm, where the covariance is computed from a suitably
increasing segment of the near history.
Section 3 contains a detailed description of the AM algorithm as a stochastic process and
the theorem on the ergodicity of the AM. The proof is based on an auxiliary result that is
proven in Section 4. Finally, in Section 5 we present results from test simulations, where the
AM algorithm is compared with traditional Metropolis-Hastings algorithms (Hastings 1970)
by applying both linear and nonlinear, correlated and uncorrelated unimodal target
distributions. Our tests seem to imply that AM performs at least as well as the traditional
algorithms for which a nearly optimal proposal distribution has been given a priori.
2. Description of the algorithm
We now explain how the AM algorithm works. Recall from Section 1 that the basic idea
is to update the proposal distribution by using the knowledge we have so far acquired about
the target distribution. Otherwise the definition of the algorithm is identical to the usual
Metropolis process.
Suppose, therefore, that at time $t-1$ we have sampled the states $X_0, X_1, \ldots, X_{t-1}$,
where $X_0$ is the initial state. Then a candidate point $Y$ is sampled from the (asymptotically
symmetric) proposal distribution $q_t(\cdot \mid X_0, \ldots, X_{t-1})$, which now may depend on the whole
history $(X_0, X_1, \ldots, X_{t-1})$. The candidate point $Y$ is accepted with probability

$$\alpha(X_{t-1}, Y) = \min\left(1, \frac{\pi(Y)}{\pi(X_{t-1})}\right),$$

in which case we set $X_t = Y$, and otherwise $X_t = X_{t-1}$. Observe that the chosen probability
for the acceptance resembles the familiar acceptance probability of the Metropolis algorithm.
However, here the choice for the acceptance probability is not based on symmetry
(reversibility) conditions, since these cannot be satisfied in our case: the corresponding
stochastic chain is no longer Markovian. For this reason we have to study the exactness of the
simulation separately, and we do so in Section 3.
The proposal distribution $q_t(\cdot \mid X_0, \ldots, X_{t-1})$ employed in the AM algorithm is a
Gaussian distribution with mean at the current point $X_{t-1}$ and covariance
$C_t = C_t(X_0, \ldots, X_{t-1})$. Note that in the simulation only jumps into $S$ are accepted, since
we assume that the target distribution vanishes outside $S$.
The crucial thing regarding the adaptation is how the covariance of the proposal
distribution depends on the history of the chain. In the AM algorithm this is solved by
setting $C_t = s_d \operatorname{cov}(X_0, \ldots, X_{t-1}) + s_d \varepsilon I_d$ after an initial period, where $s_d$ is a parameter
that depends only on the dimension $d$ and $\varepsilon > 0$ is a constant that we may choose very small
compared to the size of $S$. Here $I_d$ denotes the $d$-dimensional identity matrix. In order to
start, we select an arbitrary, strictly positive definite, initial covariance $C_0$, according to our
best prior knowledge (which may be quite poor). We select an index $t_0 > 0$ for the length of
an initial period and define

$$C_t = \begin{cases} C_0, & t \le t_0, \\ s_d \operatorname{cov}(X_0, \ldots, X_{t-1}) + s_d \varepsilon I_d, & t > t_0. \end{cases} \tag{1}$$

The covariance $C_t$ may be viewed as a function of $t$ variables from $\mathbb{R}^d$ having values in
uniformly positive definite matrices.
Recall the definition of the empirical covariance matrix determined by points
$x_0, \ldots, x_k \in \mathbb{R}^d$:

$$\operatorname{cov}(x_0, \ldots, x_k) = \frac{1}{k} \left( \sum_{i=0}^{k} x_i x_i^{\mathrm{T}} - (k+1)\,\bar{x}_k \bar{x}_k^{\mathrm{T}} \right), \tag{2}$$

where $\bar{x}_k = \frac{1}{k+1} \sum_{i=0}^{k} x_i$ and the elements $x_i \in \mathbb{R}^d$ are considered as column vectors.
So one obtains that in definition (1), for $t \ge t_0 + 1$, the covariance $C_t$ satisfies the recursion
formula
$$C_{t+1} = \frac{t-1}{t} C_t + \frac{s_d}{t} \left( t\,\bar{X}_{t-1} \bar{X}_{t-1}^{\mathrm{T}} - (t+1)\,\bar{X}_t \bar{X}_t^{\mathrm{T}} + X_t X_t^{\mathrm{T}} + \varepsilon I_d \right). \tag{3}$$

This allows one to calculate $C_t$ without too much computational cost, since the mean $\bar{X}_t$ also
satisfies an obvious recursion formula.
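As a numerical sanity check, the recursion can be compared against the direct definition (1)-(2); this small Python sketch uses an arbitrary array `xs` in place of the chain history, and the constants `s_d`, `eps` and the segment length are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
s_d = 2.4**2 / d                     # scaling parameter s_d
eps = 1e-6                           # the small regularization constant epsilon
xs = rng.normal(size=(50, d))        # stand-in for the states X_0, X_1, ...

def direct(m):
    """C computed straight from definition (1): s_d * cov of the first m states
    plus s_d * eps * I; np.cov matches the normalization of (2)."""
    return s_d * np.cov(xs[:m], rowvar=False) + s_d * eps * np.eye(d)

t = 6                                # pretend the initial period has passed
C = direct(t)                        # C_t, computed directly once
mean = xs[:t].mean(axis=0)           # running mean of X_0, ..., X_{t-1}
while t < len(xs):                   # recursion (3): C_t -> C_{t+1}
    x = xs[t]
    new_mean = (t * mean + x) / (t + 1)
    C = (t - 1) / t * C + s_d / t * (
        t * np.outer(mean, mean)
        - (t + 1) * np.outer(new_mean, new_mean)
        + np.outer(x, x)
        + eps * np.eye(d))
    mean = new_mean
    t += 1

assert np.allclose(C, direct(len(xs)))   # recursion agrees with the definition
```

Only the running mean and the current matrix are kept between iterations, which is what makes the adaptation essentially free.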
The choice for the length of the initial segment $t_0 > 0$ is free, but the bigger it is chosen
the more slowly the effect of the adaptation is felt. In a sense the size of $t_0$ reflects our
trust in the initial covariance $C_0$. The role of the parameter $\varepsilon$ is just to ensure that $C_t$ will
not become singular (see Remark 1 below). As a basic choice for the scaling parameter we
have adopted the value $s_d = (2.4)^2/d$ from Gelman et al. (1996), where it was shown that
in a certain sense this choice optimizes the mixing properties of the Metropolis search in
the case of Gaussian targets and Gaussian proposals.
Remark 1. In our test runs the covariance $C_t$ has not had the tendency to degenerate. This has
also been the case in our multimodal test examples. However, potential difficulties with $\varepsilon = 0$
(if any) are more likely to appear in the multimodal cases. In practical computations one
presumably may utilize definition (1) with $\varepsilon = 0$, although the change is negligible if $\varepsilon$ has
already been chosen small enough. More importantly, we can prove the correct ergodicity
property of the algorithm only under the assumption $\varepsilon > 0$; see Theorem 1 below.
Remark 2. In order to avoid the algorithm starting slowly it is possible to employ special
tricks. Naturally, if a priori knowledge (such as the maximum likelihood value or
approximate covariance of the target distribution) is available, it can be utilized in
choosing the initial state or the initial covariance C0 . Also, in some cases it is advisable to
employ the greedy start procedure: during a short initial period one updates the proposal
using only the accepted states. Afterwards the AM is run as described above. Moreover,
during the early stage of the algorithm it is natural to require it to move at least a little. If it
has not moved enough in the course of a certain number of iterations, the proposal
distribution could be shrunk by some constant factor.
Remark 3. It is also possible to choose an integer $n_0 > 1$ and update the covariance every
$n_0$th step only (again using the entire history). This saves computer time when generating the
candidate points. There is again a simple recursion formula for the covariances $C_t$.
3. Ergodicity
differs slightly from the target distribution. Numerical calculations in the continuous case
indicate similar behaviour. An illustrative example of this phenomenon is presented in the
Appendix.
It is our aim in this section to show that the AM algorithm has the right ergodic
properties and hence provides correct simulation of the target distribution.
Let us start by recalling some basic notions of the theory of stochastic processes that are
needed later. We first define the set-up. Let $(S, \mathcal{B}, m)$ be a state space and denote by $M(S)$
the set of finite measures on $(S, \mathcal{B})$. The norm $\|\cdot\|$ on $M(S)$ denotes the total variation
norm. Let $n \ge 1$ be a natural number. A map $K_n : S^n \times \mathcal{B} \to [0, 1]$ is a generalized
transition probability on the set $S$ if the map $x \mapsto K_n(x; A)$ is $\mathcal{B}^n$-measurable for each
$A \in \mathcal{B}$, where $x \in S^n$, and $K_n(x; \cdot)$ is a probability measure on $(S, \mathcal{B})$ for each $x \in S^n$. In a
natural way $K_n$ defines a positive contraction from $M(S^n)$ into $M(S)$. A transition
probability on $S$ corresponds to the case $n = 1$ in the above definition.
We assume that a sequence of generalized transition probabilities $(K_n)_{n=1}^{\infty}$ is given.
Moreover, let $\mu_0$ be a probability distribution (the initial distribution) on $S$. Then the
sequence $(K_n)$ and $\mu_0$ determine uniquely the finite-dimensional distributions of the
discrete-time stochastic process (chain) $(X_n)_{n=0}^{\infty}$ on $S$ via the formula

$$P(X_0 \in A_0, X_1 \in A_1, \ldots, X_n \in A_n) = \int_{y_0 \in A_0} \mu_0(\mathrm{d}y_0) \int_{y_1 \in A_1} K_1(y_0; \mathrm{d}y_1) \int_{y_2 \in A_2} K_2(y_0, y_1; \mathrm{d}y_2) \cdots \int_{y_n \in A_n} K_n(y_0, y_1, \ldots, y_{n-1}; \mathrm{d}y_n). \tag{4}$$

In fact, it is directly verified that these distributions are consistent, and the theorem of Ionescu
Tulcea (see Proposition V.1.1 of Neveu 1965) yields the existence of the chain $(X_n)$ on $S$
satisfying (4).
We shall now turn to the exact definition of the AM chain as a discrete-time stochastic
process. We assume that the target distribution is supported on a bounded subset $S \subset \mathbb{R}^d$,
so that $\pi(x) = 0$ outside $S$. Thus we shall choose $S$ to be our state space, when equipped
with the Borel $\sigma$-algebra $\mathcal{B}(S)$ and choosing $m$ to be the normalized Lebesgue measure on
$S$. The target has the (unscaled) density $\pi(x)$ with respect to the Lebesgue measure on $S$.
We also assume that the density is bounded from above on $S$: for some $M < \infty$, we have

$$\pi(x) \le M \quad \text{for } x \in S. \tag{5}$$
Let $C$ be a symmetric and strictly positive definite matrix on $\mathbb{R}^d$ and denote by $N_C$ the
density of the mean-zero Gaussian distribution on $\mathbb{R}^d$ with covariance $C$. Thus

$$N_C(x) = \frac{1}{(2\pi)^{d/2} \sqrt{|C|}} \exp\left( -\frac{1}{2}\, x^{\mathrm{T}} C^{-1} x \right). \tag{6}$$

The Gaussian proposal transition probability corresponding to the covariance $C$ satisfies

$$Q_C(x; A) = \int_A N_C(y - x)\, \mathrm{d}y, \tag{7}$$

where $A \subset \mathbb{R}^d$ is a Borel set and $\mathrm{d}y$ is the standard Lebesgue measure on $\mathbb{R}^d$. It follows that
$Q_C$ is $m$-symmetric (see Haario and Saksman 1991, Definition 2.2): for $A, B \subset S$ one has

$$\int_A Q_C(x; B)\, m(\mathrm{d}x) = \int_B Q_C(x; A)\, m(\mathrm{d}x).$$
We next recall the definition of the transition probability $M_C$ for the Metropolis process
having the target density $\pi(x)$ and the proposal distribution $Q_C$:

$$M_C(x; A) = \int_A N_C(y - x) \min\left(1, \frac{\pi(y)}{\pi(x)}\right) m(\mathrm{d}y) + \chi_A(x) \int_{\mathbb{R}^d} N_C(y - x) \left(1 - \min\left(1, \frac{\pi(y)}{\pi(x)}\right)\right) m(\mathrm{d}y), \tag{8}$$

for $A \in \mathcal{B}(S)$, and where $\chi_A$ denotes the characteristic function of the set $A$. It is easily
verified that $M_C$ defines a transition probability with state space $S$.
The following definition of the AM chain corresponds exactly to the AM algorithm
introduced in Section 2.

Definition 1. Let $S$ and $\pi$ be as above and let the initial covariance $C_0$ and the constant
$\varepsilon > 0$ be given. Define the functions $C_n$ for $n \ge 1$ by formula (1). For a given initial
distribution $\mu_0$ the adaptive Metropolis (AM) chain is a stochastic chain on $S$ defined through
(4) by the sequence $(K_n)_{n=1}^{\infty}$ of generalized transition probabilities, where

$$K_n(x_0, \ldots, x_{n-1}; A) = M_{C_n(x_0, \ldots, x_{n-1})}(x_{n-1}; A) \tag{9}$$

for all $n \ge 1$, $x_i \in S$ ($0 \le i \le n-1$), and for subsets $A \in \mathcal{B}(S)$.
Let us turn to the study of the ergodicity properties of the AM chain, which is more
complicated than in the case of Markov chains. In order to be able to proceed we give
some definitions. Recall first the definition of the coefficient of ergodicity (Dobrushin
1956). Let $T$ be a transition probability on $S$ and set

$$\delta(T) = \sup_{\lambda_1, \lambda_2} \frac{\|\lambda_1 T - \lambda_2 T\|}{\|\lambda_1 - \lambda_2\|}, \tag{10}$$

where the supremum is taken over distinct probability measures $\lambda_1, \lambda_2$ on $(S, \mathcal{B})$. As usual,
$\lambda T$ denotes the measure $A \mapsto \int_S T(x; A)\, \lambda(\mathrm{d}x)$, and for bounded measurable functions $f$ we write
$Tf(x) = \int_S T(x; \mathrm{d}y) f(y)$ as well as $\lambda f = \int_S f(y)\, \lambda(\mathrm{d}y)$.

Clearly $0 \le \delta(T) \le 1$. In the case $\delta(T) < 1$ the mapping $\lambda \mapsto \lambda T$ is a strict contraction on
$M(S)$ with respect to the metric defined by the total variation norm on $M(S)$. From the
definition it easily follows that

$$\delta(T_1 T_2 \cdots T_n) \le \prod_{i=1}^{n} \delta(T_i). \tag{11}$$
The condition $\delta(T^{k_0}) < 1$ for some $k_0 \ge 1$ is well known to be equivalent to the uniform
ergodicity (cf. Nummelin 1984, Section 6.6) of the Markov chain having transition
probability $T$.
For our purposes it is useful to define the transition probability that is obtained from a
generalized transition probability by `freezing' the $n-1$ first variables. Hence, given a
generalized transition probability $K_n$ (where $n \ge 2$) and a fixed $(n-1)$-tuple
$(y_0, y_1, \ldots, y_{n-2}) \in S^{n-1}$, we denote $\tilde{y}_{n-2} = (y_0, y_1, \ldots, y_{n-2})$ and define the transition probability
$K_{n, \tilde{y}_{n-2}}$ by

$$K_{n, \tilde{y}_{n-2}}(x; A) = K_n(y_0, y_1, \ldots, y_{n-2}, x; A). \tag{12}$$

Theorem 1. Let $\pi$ be the (unscaled) density of a target distribution supported on a bounded
measurable subset $S \subset \mathbb{R}^d$, and assume that $\pi$ is bounded from above on $S$. Let $\varepsilon > 0$ and
let $\mu_0$ be any initial distribution on $S$. Then the AM chain $(X_n)$ simulates properly the target
distribution: for any bounded and measurable function $f : S \to \mathbb{R}$, the equality

$$\lim_{n \to \infty} \frac{1}{n+1} \big( f(X_0) + f(X_1) + \cdots + f(X_n) \big) = \int_S f(x)\, \tilde{\pi}(\mathrm{d}x)$$

holds almost surely, where $\tilde{\pi}$ denotes the probability measure on $S$ with density proportional to $\pi$.
The proof of the theorem is based on the following technical auxiliary result, whose
proof we postpone to the next section.

Theorem 2. Assume that the finite-dimensional distributions of the stochastic process
$(X_n)_{n=0}^{\infty}$ on the state space $S$ satisfy (4), where the sequence of generalized transition
probabilities $(K_n)$ is assumed to satisfy the following three conditions:

(i) There are a fixed integer $k_0$ and a constant $\lambda \in (0, 1)$ such that

$$\delta\big((K_{n, \tilde{y}_{n-2}})^{k_0}\big) \le \lambda < 1 \quad \text{for all } \tilde{y}_{n-2} \in S^{n-1} \text{ and } n \ge 2.$$

(ii) There are a fixed probability measure $\zeta$ on $S$ and a constant $c_0 > 0$ such that

$$\|\zeta K_{n, \tilde{y}_{n-2}} - \zeta\| \le \frac{c_0}{n} \quad \text{for all } \tilde{y}_{n-2} \in S^{n-1} \text{ and } n \ge 2.$$

(iii) We have the following estimate for the operator norm:

$$\|K_{n, \tilde{y}_{n-2}} - K_{n+k, \tilde{y}_{n+k-2}}\|_{M(S) \to M(S)} \le c_1 \frac{k}{n},$$

where $c_1$ is a fixed positive constant, $n, k \ge 1$, and one assumes that the $(n+k-1)$-tuple
$\tilde{y}_{n+k-2}$ is a direct continuation of the $(n-1)$-tuple $\tilde{y}_{n-2}$.

Then, if $f : S \to \mathbb{R}$ is bounded and measurable, the equality

$$\lim_{n \to \infty} \frac{1}{n+1} \big( f(X_0) + f(X_1) + \cdots + f(X_n) \big) = \int_S f(x)\, \zeta(\mathrm{d}x) \tag{13}$$

holds almost surely.
Proof of Theorem 1. We verify that the generalized transition probabilities (9) of the AM
chain satisfy the conditions of Theorem 2 with $\zeta$ equal to the normalized target distribution
$\tilde{\pi}$. Since $S$ is bounded, definition (1) yields that the adapted covariances are uniformly
bounded from above and below:

$$c_2 I_d \le C_n(y_0, \ldots, y_{n-1}) \le c_3 I_d \quad \text{for all } n \ge 1 \text{ and } y_i \in S, \tag{14}$$

with constants $0 < c_2 \le c_3 < \infty$ depending only on $\varepsilon$, $C_0$ and $S$. Hence the corresponding
normal densities $N_C(\cdot - x)$ are uniformly bounded from below on $S$ for all $x \in S$, and (5)
and (8) together trivially yield the bound

$$K_{n, \tilde{y}_{n-2}}(x; A) \ge c_4\, \tilde{\pi}(A)$$

with $c_4 > 0$. This easily yields (cf. Nummelin 1984, pp. 122-123) that $\delta(K_{n, \tilde{y}_{n-2}}) \le 1 - c_4$,
which proves (i) with $k_0 = 1$.
We next verify condition (iii). To that end we assume that $n \ge 2$ and observe that, for
given $\tilde{y}_{n+k-2} \in S^{n+k-1}$, one has

$$\|K_{n, \tilde{y}_{n-2}} - K_{n+k, \tilde{y}_{n+k-2}}\|_{M(S) \to M(S)} \le 2 \sup_{y \in S,\, A \in \mathcal{B}(S)} |K_{n, \tilde{y}_{n-2}}(y; A) - K_{n+k, \tilde{y}_{n+k-2}}(y; A)|. \tag{15}$$

Fix $y \in S$ and $A \in \mathcal{B}(S)$ and introduce $R_1 = C_n(y_0, \ldots, y_{n-2}, y)$ together with
$R_2 = C_{n+k}(y_0, \ldots, y_{n+k-2}, y)$. According to Definition 1 and formula (8), we see that

$$|K_{n, \tilde{y}_{n-2}}(y; A) - K_{n+k, \tilde{y}_{n+k-2}}(y; A)| = |M_{R_1}(y; A) - M_{R_2}(y; A)|$$
$$\le \int_{x \in A} |(N_{R_1} - N_{R_2})(x - y)| \min\left(1, \frac{\pi(x)}{\pi(y)}\right) m(\mathrm{d}x) + \int_{x \in \mathbb{R}^d} |(N_{R_1} - N_{R_2})(x - y)|\, \chi_A(x) \left(1 - \min\left(1, \frac{\pi(x)}{\pi(y)}\right)\right) m(\mathrm{d}x)$$
$$\le 2 \int_{\mathbb{R}^d} \mathrm{d}z \int_0^1 \left| \frac{\mathrm{d}}{\mathrm{d}s}\, N_{R_1 + s(R_2 - R_1)}(z) \right| \mathrm{d}s \le c_5 \|R_1 - R_2\|, \tag{16}$$

where at the last stage we apply (14) in order to deduce that the partial derivatives of the
density $N_{R_1 + s(R_2 - R_1)}$ with respect to the components of the covariance are integrable over
$\mathbb{R}^d$, with bounds that depend only on $\varepsilon$, $C_0$ and $S$. Finally, it is clear from the recursion
formula (3) that in general $\|C_t - C_{t+1}\| \le c_6 / t$ for $t > 1$. By applying this inductively and
using the uniform boundedness from above of the covariances $C_t$, we easily see that

$$\|R_1 - R_2\| \le c_7(S, C_0, \varepsilon)\, \frac{k}{n},$$

and hence the previous estimates yield (iii).
In order to check condition (ii), fix $\tilde{y}_{n-2} \in S^{n-1}$ and denote $C = C_{n-1}(y_0, \ldots, y_{n-2})$. It
follows that $\|C - C_n(y_0, \ldots, y_{n-2}, y)\| \le c_8 / n$, where $c_8$ does not depend on $y \in S$. We
may therefore proceed exactly as in (15) and (16) to deduce that

$$\|K_{n, \tilde{y}_{n-2}} - M_C\|_{M(S) \to M(S)} \le \frac{c_9}{n}.$$

Since $M_C$ is a Metropolis transition probability, the normalized target distribution $\tilde{\pi}$ is
invariant for it, $\tilde{\pi} M_C = \tilde{\pi}$ (see, for example, Tierney 1994, p. 1705), and we obtain

$$\|\tilde{\pi} K_{n, \tilde{y}_{n-2}} - \tilde{\pi}\| = \|\tilde{\pi}(M_C - K_{n, \tilde{y}_{n-2}})\| \le \frac{c_9}{n},$$

which completes the proof of Theorem 1.
Similar remarks apply to modifications where one adapts only certain parameters or some
of the parameters are discrete.
Remark 5. It is clear that in the course of the AM algorithm one may also determine the
covariance by using only an increasing part of the near history. For example, one may
determine $C_n$ by using only the samples $X_{[n/2]}, X_{[n/2]+1}, \ldots, X_n$. This is easily implemented
in practice, and in this case Theorem 1 yields that the simulation is exact, with only minor
changes in the proof. Similar remarks apply also to the case where one updates the
covariance only every $n_0$th step (see Remarks 3 and 8).
Remark 6. Theorem 2 can also be used to prove the correct ergodicity for certain other
variants of adaptation, such as algorithms where one suitably tunes the proposal distribution
according to the acceptance rate. However, in our specific practical applications it has turned
out that the tuning of the acceptance rate has yielded inferior results when compared with the
AM algorithm. A similar phenomenon is demonstrated in Figure 2 below. Moreover, in
high-dimensional cases with possible correlations between the parameters, it may be difficult to
tune the proposal distribution effectively basing the decision on one parameter only. This is
the case even if one uses the single-component Metropolis algorithm.
Remark 7. The proof of Theorem 1 requires that the target density has compact support and
is bounded from above. Otherwise the uniform ergodicity (condition (i) of Theorem 2) may
fail, which is important if we are to be able to control the effects of the adaptation. In the
Markovian case (for example, standard Metropolis-Hastings) uniform ergodicity is, of
course, not needed to ensure that the simulation is correct, although without it the theoretical
convergence rate may be very slow. However, the requirements above on the target density
correspond reasonably well to practical situations. We believe that one may weaken the
assumptions at the cost of more elaborate proofs. We prefer to leave this topic for future
research, since our main aim here is to introduce a new method and to demonstrate its
usefulness. In our test runs the AM algorithm has also worked successfully with (unrestricted)
Gaussian targets.
4. Proof of Theorem 2
In this section we will prove Theorem 2 by showing that a related process is a mixingale (in
the sense of McLeish 1975) that satises an appropriate law of large numbers. The conditions
of the theorem were tailored to apply to the AM chain on bounded subsets of $\mathbb{R}^d$, but they
are stated in the language of a general state space. This is advantageous since one may apply
them in a more general situation, especially for variants of the AM where the state space
contains both discrete and continuous parts. Our proof is based on the following basic
proposition.
Proposition 4. Let the chain $(X_n)$ on the state space $S$ and the generalized transition
probabilities $(K_n)$ fulfil the conditions of Theorem 2. Denote by $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$
the $\sigma$-algebra generated by the chain up to time $n$ and write $\rho = \lambda^{1/k_0}$. Let $n \ge 1$ and
$k \ge 2$. Then for all initial distributions $\mu_0$ and for any bounded measurable function $f$ on $S$, the
inequality

$$\left\| E\big(f(X_{n+k}) \mid \mathcal{F}_n\big) - \int_S f(y)\, \zeta(\mathrm{d}y) \right\|_\infty \le c(c_0, c_1, \lambda) \inf_{1 \le j \le k} \left( \frac{j^2}{n+k-j} + \rho^j \right) \|f\|_\infty \tag{17}$$

holds.
Proof. We may clearly assume that $\zeta f = \int_S f(y)\, \zeta(\mathrm{d}y) = 0$, since the general case is then
obtained by applying the proposition to the function $f - \zeta f$. Let $n \ge 1$, $k \ge 2$ and note that
from the definition of the conditional expectation and (4) it follows that (almost surely)

$$E\big(f(X_{n+k}) \mid \mathcal{F}_n\big) = \int_{y_{n+1} \in S} K_{n+1}(X_0, X_1, \ldots, X_n; \mathrm{d}y_{n+1}) \int_{y_{n+2} \in S} K_{n+2}(X_0, X_1, \ldots, X_n, y_{n+1}; \mathrm{d}y_{n+2}) \cdots \int_{y_{n+k} \in S} K_{n+k}(X_0, X_1, \ldots, X_n, y_{n+1}, \ldots, y_{n+k-1}; \mathrm{d}y_{n+k})\, f(y_{n+k}). \tag{18}$$

Write $\tilde{X}_n = (X_0, X_1, \ldots, X_n)$. By condition (iii), the difference between the right-hand
side of (18) and the expression obtained by replacing the innermost kernel
$K_{n+k}(\tilde{X}_n, y_{n+1}, \ldots, y_{n+k-1}; \mathrm{d}y_{n+k})$ by $K_{n+2}(\tilde{X}_n, y_{n+k-1}; \mathrm{d}y_{n+k})$ is at most

$$c_1 \|f\|_\infty\, \frac{k-2}{n+2}. \tag{19}$$

In other words,

$$E\big(f(X_{n+k}) \mid \mathcal{F}_n\big) = \int_{y_{n+1} \in S} K_{n+1}(\tilde{X}_n; \mathrm{d}y_{n+1}) \int_{y_{n+2} \in S} K_{n+2}(\tilde{X}_n, y_{n+1}; \mathrm{d}y_{n+2}) \cdots \int_{y_{n+k-1} \in S} K_{n+k-1}(\tilde{X}_n, y_{n+1}, \ldots, y_{n+k-2}; \mathrm{d}y_{n+k-1}) \int_{y_{n+k} \in S} K_{n+2}(\tilde{X}_n, y_{n+k-1}; \mathrm{d}y_{n+k})\, f(y_{n+k}) + g_k(\tilde{X}_n), \tag{20}$$

where $g_k = g_k(\tilde{X}_n)$ satisfies

$$|g_k(\tilde{X}_n)| \le c_1 \|f\|_\infty\, \frac{k-2}{n+2}.$$

Denote by $Q$ the transition probability $Q(x; A) = K_{n+2}(\tilde{X}_n, x; A)$, so that the innermost
integral in (20) equals $\int_{y_{n+k} \in S} Q(y_{n+k-1}; \mathrm{d}y_{n+k})\, f(y_{n+k})$.
In the next step we iterate the procedure by replacing the generalized transition
probability $K_{n+k-1}(\tilde{X}_n, y_{n+1}, \ldots, y_{n+k-2}; \mathrm{d}y_{n+k-1})$ by the transition probability $Q$ in
formula (20). By continuing in this manner we obtain

$$E\big(f(X_{n+k}) \mid \mathcal{F}_n\big) = \int_{y_{n+1} \in S} K_{n+1}(\tilde{X}_n; \mathrm{d}y_{n+1}) \int_{y_{n+2} \in S} Q(y_{n+1}; \mathrm{d}y_{n+2}) \cdots \int_{y_{n+k} \in S} Q(y_{n+k-1}; \mathrm{d}y_{n+k})\, f(y_{n+k}) + g_2(\tilde{X}_n) + g_3(\tilde{X}_n) + \cdots + g_k(\tilde{X}_n),$$

where

$$g_j(\tilde{X}_n) = \int_{y_{n+1} \in S} K_{n+1}(\tilde{X}_n; \mathrm{d}y_{n+1}) \int_{y_{n+2} \in S} K_{n+2}(\tilde{X}_n, y_{n+1}; \mathrm{d}y_{n+2}) \cdots \int_{y_{n+j} \in S} \big( K_{n+j}(\tilde{X}_n, y_{n+1}, \ldots, y_{n+j-1}; \mathrm{d}y_{n+j}) - K_{n+2}(\tilde{X}_n, y_{n+j-1}; \mathrm{d}y_{n+j}) \big)\, Q^{k-j} f(y_{n+j}). \tag{21}$$

In other words,

$$E\big(f(X_{n+k}) \mid \mathcal{F}_n\big) = \int_{y_{n+1} \in S} K_{n+1}(X_0, \ldots, X_n; \mathrm{d}y_{n+1})\, Q^{k-1} f(y_{n+1}) + \gamma_{n,k}, \tag{22}$$

where $\gamma_{n,k} = \gamma_{n,k}(X_0, \ldots, X_n)$ satisfies, by condition (iii),

$$|\gamma_{n,k}| \le \sum_{j=2}^{k} c_1\, \frac{j-2}{n+2}\, \|f\|_\infty \le \frac{c_1 k^2}{n}\, \|f\|_\infty. \tag{23}$$
Next write $[(k-1)/k_0] = k'$ and notice that $\delta(Q^{k-1}) \le \lambda^{k'}$ according to (i) and (11). By
(ii) and the definition of $Q$, we have

$$\|\zeta Q^{k-1} - \zeta\| \le \sum_{j=0}^{k-2} \|\zeta Q^{j+1} - \zeta Q^j\| \le \sum_{j=0}^{k-2} \frac{c_0}{n+2} \le \frac{c_0 (k-1)}{n}. \tag{24}$$

Since $\zeta f = 0$, it follows for all $y_{n+1} \in S$ that

$$|Q^{k-1} f(y_{n+1})| \le \big( 2\,\delta(Q^{k-1}) + \|\zeta Q^{k-1} - \zeta\| \big) \|f\|_\infty \le \left( 2\lambda^{[(k-1)/k_0]} + \frac{c_0 (k-1)}{n} \right) \|f\|_\infty. \tag{25}$$

By combining this with (22) and (23) we obtain

$$\big\| E\big(f(X_{n+k}) \mid \mathcal{F}_n\big) \big\|_\infty \le \tilde{c}(c_0, c_1, \lambda) \left( \frac{k^2}{n} + \lambda^{[(k-1)/k_0]} \right) \|f\|_\infty. \tag{26}$$

Finally, for $1 \le j \le k$ we may condition first on $\mathcal{F}_{n+k-j}$ and apply (26) with $n+k-j$ in
place of $n$ and $j$ in place of $k$, which yields

$$\big\| E\big(f(X_{n+k}) \mid \mathcal{F}_n\big) \big\|_\infty \le \inf_{1 \le j \le k} \tilde{c}(c_0, c_1, \lambda) \left( \frac{j^2}{n+k-j} + \rho^j \right) \|f\|_\infty.$$

The claim of the proposition follows immediately from this estimate.
Proof of Theorem 2. From Proposition 4 we obtain, for $n \ge 1$ and $k \ge 0$, that

$$\left\| E\big(f(X_{n+k}) \mid \mathcal{F}_n\big) - \int_S f(y)\, \zeta(\mathrm{d}y) \right\|_\infty \le c \|f\|_\infty \inf_{1 \le j \le k} \left( \frac{j^2}{n+k-j} + \rho^j \right) \tag{27}$$

$$\le c' \|f\|_\infty\, \psi(k), \qquad \text{where } \psi(k) := \frac{1 + (\log k)^2}{k} \text{ for } k \ge 2 \text{ and } \psi(k) := 1 \text{ otherwise}, \tag{28}$$

where the last estimate is obtained by choosing $j = \log k / \log(1/\rho)$ for $k \ge k_1(\rho)$.
At this stage the estimate (28) for the asymptotic independence, together with the
definition of the $\sigma$-algebra $\mathcal{F}_n$, makes it clear that $f(X_n) - E f(X_n)$ is a mixingale in the
sense of McLeish; see McLeish (1975) or Hall and Heyde (1980, p. 19). For the
convenience of the reader, let us recall here the definition of mixingales. Let $(\mathcal{F}_n)_{n=-\infty}^{\infty}$ be
an increasing sequence of $\sigma$-algebras and $(Y_n)$ a sequence of square-integrable random
variables. Then $(Y_n, \mathcal{F}_n)$ is a mixingale if, for some sequences of non-negative constants
$(a_n)$ and $(r_m)$ with $r_m \to 0$ as $m \to \infty$, we have

$$\|E(Y_n \mid \mathcal{F}_{n-m})\|_2 \le r_m a_n \quad \text{and} \quad \|Y_n - E(Y_n \mid \mathcal{F}_{n+m})\|_2 \le r_{m+1} a_n \tag{29}$$

for all $n \ge 1$ and $m \ge 0$. In our case, where $Y_n = f(X_n) - E f(X_n)$, we take $(a_n)$ to be a
constant sequence and let $\mathcal{F}_n$ be the trivial $\sigma$-algebra for $n < 0$. The right-hand side condition
in (29) is automatically satisfied. Moreover, we may choose $r_k = \psi(k)$, and it follows that
$r_k \le C(\alpha) k^{\alpha-1}$ for every $\alpha > 0$. Hence we may apply directly the well-known laws of large
numbers for mixingales, in the form of Hall and Heyde (1980, Theorem 2.21, p. 41), to the
sequence $f(X_n) - E f(X_n)$. The desired conclusion is obtained by observing that (27) yields
$\lim_{n \to \infty} E f(X_n) = \int_S f(y)\, \zeta(\mathrm{d}y)$. □
Remark 8. We refer to the original article (McLeish 1975) or to the recent review article
(Davidson and de Jong 1997) for basic properties of mixingales. However, we point out that
the proof of Theorem 2 could be concluded by elementary means, without referring to the
theory of mixingales, by applying Proposition 4 to estimate the variance of the sum
$S_n = \frac{1}{n} \sum_{k=1}^{n} f(X_k) - \int_S f(y)\, \zeta(\mathrm{d}y)$ and utilizing the boundedness of the function $f$.
Nevertheless, the reference to mixingales is useful since it is possible to weaken condition
(iii) and still obtain Theorem 2. In this manner one obtains Theorem 1 also in the case where
the covariance is calculated from a relatively slowly increasing segment of the near history
only (cf. Remark 5). For instance, this is the case if at time $t$ this segment has length $t^{\gamma}$,
where $\gamma \in (\frac{1}{2}, 1)$.
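The elementary variance argument mentioned in the remark can be made concrete as follows. This is only a sketch under the assumptions of Theorem 2, with $\psi$ denoting the decay rate from the proof of Theorem 2 (and $\psi(0) := 1$); the constants $c$, $C$ are illustrative.

```latex
% Sketch: L^2 law of large numbers for S_n without mixingale machinery.
\begin{aligned}
\bigl|\operatorname{Cov}\bigl(f(X_j), f(X_k)\bigr)\bigr|
  &\le c\,\|f\|_\infty^2\,\psi(|k-j|)
  &&\text{(condition on $\mathcal{F}_{j\wedge k}$ and apply Proposition 4),}\\[2pt]
\operatorname{Var}(S_n)
  &= \frac{1}{n^2}\sum_{j,k=1}^{n}\operatorname{Cov}\bigl(f(X_j), f(X_k)\bigr)
  \le \frac{c\,\|f\|_\infty^2}{n^2}\sum_{j=1}^{n}\,\sum_{m=0}^{n}\psi(m)
  \le \frac{C\,\|f\|_\infty^2\,(1+\log n)^3}{n}
  \;\longrightarrow\; 0,
\end{aligned}
```

so $S_n \to 0$ in $L^2$; an almost sure statement then follows from Chebyshev's inequality and a standard subsequence argument, using the boundedness of $f$.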
Finally, we note that in this paper we have left open the question whether the
convergence of the algorithm (as established in Theorem 1) satisfies a central limit theorem.
distributions. Thus, it is reasonable to expect that an analogue of Theorem 1 also holds for
non-compactly supported distributions whose densities decay rapidly enough.
Appendix
We present here an illustrative two-dimensional example, also considered in Haario et al.
(1999). There the target distribution was tested with the AP algorithm, where the covariance
$C_t$ was calculated from the last 200 states; see Section 1 or Haario et al. (1999) for the
definition of the AP algorithm. In the example the AP algorithm produced considerable error
in the simulation. This phenomenon underlines the importance of calculating the covariance
from an increasing segment of the history, as is done in the AM algorithm. When the AM
algorithm was applied to the same example it produced, as expected, simulation that was free
of bias. For many practical applications the error produced by the AP algorithm is, however,
negligible (see Haario et al. 1999).
Example 1. Let us define the density $\pi$ on the rectangle $R = [-18, 18] \times [-3, 3] \subset \mathbb{R}^2$. Let
$S = [-0.5, 0.5] \times [-3, 3]$ and set

$$\pi(x) = \begin{cases} 36 & \text{if } x \in S, \\ 1 & \text{if } x \in R \setminus S. \end{cases}$$
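Taking the density as $\pi = 36$ on the narrow slab $S$ and $1$ on the rest of $R$, the exact probability mass of $S$ is easy to compute by hand; this small Python check is purely illustrative:

```python
# Exact probability mass of the narrow slab S under the (unnormalized)
# density of Example 1: pi = 36 on S, 1 on R \ S.
area_S = 1.0 * 6.0                  # S = [-0.5, 0.5] x [-3, 3]
area_R = 36.0 * 6.0                 # R = [-18, 18] x [-3, 3]
mass_S = 36.0 * area_S              # contribution of the slab
mass_rest = 1.0 * (area_R - area_S) # contribution of the remainder
p_S = mass_S / (mass_S + mass_rest)
print(p_S)                          # about half of the mass sits on S
```

A relative simulation error of about 10% on $S$, as produced by the AP algorithm, is therefore a sizeable distortion of roughly half the total mass.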
Figure 2. Comparison of the performance of the AM algorithm (left column) and the Metropolis
algorithm with a fairly optimal acceptance rate (right column). The target distribution was $\pi_1$. On the
top line the autocorrelation function in the direction of the largest eigenvalue of target's covariance
matrix is shown. The bottom line corresponds to an orthogonal direction. The acceptance rate with the
AM algorithm was 27% and with the Metropolis algorithm 26%.
Figure 4. The difference between the real target distribution of Example 1 and the sampled
distributions. In (a) the sampling was done using the AP algorithm. The curve represents the mean
values of 100 runs with 100 000 states. In (b) the sampling method was the AM algorithm.
where the difference between the true target distribution and the one simulated by AP is
presented). In fact, the relative error in the simulation on S is about 10%. There is also a
slight error in the simulation near the far ends of the rectangle R. The corresponding
unbiased results of the AM algorithm are presented in Figure 4(b).
Acknowledgements
We thank Elja Arjas, Kari Auranen, Esa Nummelin and Antti Penttinen for useful discussions
on the topics of the paper. The second author (ES) was supported by the Academy of Finland,
Project 32837.
References
Davidson, J. and de Jong, R. (1997) Strong laws of large numbers for dependent heterogeneous
processes: a synthesis of recent and new results. Econometric Rev., 16, 251-279.
Dobrushin, R. (1956) Central limit theorems for non-stationary Markov chains II. Theory Probab.
Appl., 1, 329-383.
Evans, M. (1991) Chaining via annealing. Ann. Statist., 19, 382-393.
Fishman, G.S. (1996) Monte Carlo: Concepts, Algorithms and Applications. New York: Springer-Verlag.
Gelfand, A.E. and Sahu, S.K. (1994) On Markov chain Monte Carlo acceleration. J. Comput. Graph.
Statist., 3, 261-276.
Gelman, A.G., Roberts, G.O. and Gilks, W.R. (1996) Efficient Metropolis jumping rules. In J.M.
Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith (eds), Bayesian Statistics V, pp. 599-608.
Oxford: Oxford University Press.
Gilks, W.R. and Roberts, G.O. (1995) Strategies for improving MCMC. In W.R. Gilks, S. Richardson
and D.J. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, pp. 75-88. London:
Chapman & Hall.
Gilks, W.R., Roberts, G.O. and George, E.I. (1994) Adaptive direction sampling. The Statistician, 43,
179-189.
Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (1995) Introducing Markov chain Monte Carlo. In
W.R. Gilks, S. Richardson and D.J. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice,
pp. 1-19. London: Chapman & Hall.
Gilks, W.R., Roberts, G.O. and Sahu, S.K. (1998) Adaptive Markov chain Monte Carlo. J. Amer.
Statist. Assoc., 93, 1045-1054.
Haario, H. and Saksman, E. (1991) Simulated annealing process in general state space. Adv. Appl.
Probab., 23, 866-893.
Haario, H., Saksman, E. and Tamminen, J. (1999) Adaptive proposal distribution for random walk
Metropolis algorithm. Comput. Statist., 14, 375-395.
Hall, P. and Heyde, C.C. (1980) Martingale Limit Theory and Its Application. New York: Academic
Press.
Hastings, W.K. (1970) Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57, 97-109.
McLeish, D.L. (1975) A maximal inequality and dependent strong laws. Ann. Probab., 3, 829-839.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E. (1953) Equations of
state calculations by fast computing machines. J. Chem. Phys., 21, 1087-1091.
Neveu, J. (1965) Mathematical Foundations of the Calculus of Probability. San Francisco: Holden-Day.
Nummelin, E. (1984) General Irreducible Markov Chains and Non-negative Operators. Cambridge:
Cambridge University Press.
Roberts, G.O., Gelman, A. and Gilks, W.R. (1997) Weak convergence and optimal scaling of random
walk Metropolis algorithms. Ann. Appl. Probab., 7, 110-120.
Sahu, S.K. and Zhigljavsky, A.A. (1999) Self regenerative Markov chain Monte Carlo with adaptation.
Preprint. http://www.statslab.cam.ac.uk/mcmc.
Tierney, L. (1994) Markov chains for exploring posterior distributions (with discussion). Ann. Statist.,
22, 1701-1762.
Received June 1998 and revised February 2000