Perturbation Theory For Markov Chains Via Wasserstein Distance
arXiv:1503.04123
Perturbation theory for Markov chains addresses the question of how small differences in the
transition probabilities of Markov chains are reflected in differences between their distribu-
tions. We prove powerful and flexible bounds on the distance of the nth step distributions of
two Markov chains when one of them satisfies a Wasserstein ergodicity condition. Our work is
motivated by the recent interest in approximate Markov chain Monte Carlo (MCMC) meth-
ods in the analysis of big data sets. By using an approach based on Lyapunov functions, we
provide estimates for geometrically ergodic Markov chains under weak assumptions. In an au-
toregressive model, our bounds cannot be improved in general. We illustrate our theory by
showing quantitative estimates for approximate versions of two prominent MCMC algorithms,
the Metropolis-Hastings and stochastic Langevin algorithms.
Keywords: perturbations, Markov chains, Wasserstein distance, MCMC, big data.
1. Introduction
Markov chain Monte Carlo (MCMC) algorithms are one of the key tools in computational
statistics. They are used for the approximation of expectations with respect to probability
measures given by unnormalized densities. For almost all classical MCMC methods it is
essential to evaluate the target density. In many cases, this requirement is not an issue,
but there are also important applications where it is a problem. This includes applications
where the density is not available in closed form, see [27], or where an exact evaluation is
computationally too demanding, see [2]. Problems of this kind lead to the approximation
of Markov chains and to the question of how small differences in the transitions of two
Markov chains affect the differences between their distributions.
In Bayesian inference, when big data sets are involved, an exact evaluation of the target
density is typically very expensive. For instance, in each step of a Metropolis-Hastings
algorithm the likelihood of a proposed state must be computed. Every observation in
the underlying data set contributes to the likelihood and must be taken into account
in the calculation. This may result in evaluating several terabytes of data in each step
of the algorithm. These are the reasons for the recent interest in numerically cheaper
approximations of classical MCMC methods, see [3, 4, 23, 42, 47]. A reduction of the
computational costs can, e.g., be achieved by relying on a moderately sized random sub-
sample of the data in each step of the algorithm. The function value of the target density
is thus replaced by an approximation. Naturally, subsampling and alternative attempts
at “cutting the Metropolis-Hastings budget” [23] induce additional biases. These biases
can lead to dramatic changes in the properties of the algorithms as discussed in [6].
We thus need a better theoretical understanding of the behavior of such approximate
MCMC methods. Indeed, a number of recent papers prove estimates of these biases, see
[2, 3, 19, 24, 29, 35]. Key tools in these papers are perturbation bounds for Markov chains.
One such result for uniformly ergodic Markov chains due to Mitrophanov [33] is used
in [2]. A similar perturbation estimate implicitly appears in [3]. The focus on uniformly
ergodic Markov chains is rather restrictive, especially for high-dimensional, non-compact
state spaces such as R^m. Working with Wasserstein distances has recently turned out to
be a fruitful alternative in several contributions on high-dimensional MCMC algorithms,
see [11, 12, 14, 18, 25].
We provide perturbation bounds based on Wasserstein distances, which lead to flexi-
ble quantitative estimates of the biases of approximate MCMC methods. Our first main
result is the Wasserstein perturbation bound of Theorem 3.1. Under a Wasserstein er-
godicity assumption, explained in Section 2, it provides an upper bound on the distance
of the nth step distribution between an ideal and an approximating Markov chain in
terms of the difference between their one-step transition probabilities. The result is well-
suited for applications on a non-compact state space, since the difference of the one-step
transition probabilities is measured by a weighted supremum with respect to a suitable
Lyapunov function. For an autoregressive model, we show in Section 4.1 that the resulting
perturbation bound cannot be improved in general. As a consequence of the Wasserstein
approach we also obtain perturbation estimates for geometrically ergodic Markov chains.
We first adapt our Wasserstein perturbation bound to this setting. Then, as a second
main result, Theorem 3.2, we prove a refined estimate for geometrically ergodic chains
where the perturbation is measured by a weighted total variation distance. Our per-
turbation bounds, and earlier ones in [32, 33], establish a direct connection between an
exponential convergence property for Markov chains and their robustness to perturba-
tions. In particular, fast convergence to stationarity implies insensitivity to perturbations
in the transition probabilities. Geometric ergodicity has been studied extensively in the
MCMC literature. Thus, our estimates can be used in combination with many existing
convergence results for MCMC algorithms. In Section 4, we illustrate the applicability
of both theorems by generalizing recent findings on approximate Metropolis-Hastings
algorithms from [3] and on noisy Langevin algorithms for Gibbs random fields from [2].
2. Wasserstein ergodicity
Let G be a Polish space and B(G) be the corresponding Borel σ-algebra. Let d be a
metric on G, possibly different from the one which makes the space Polish, which is assumed
to be lower semi-continuous with respect to the product topology on G × G. Let P be the set
of all Borel probability measures on (G, B(G)). Then, we define the Wasserstein distance
of ν, µ ∈ P by
\[
W(\nu, \mu) = \inf_{\xi \in M(\nu,\mu)} \int_{G \times G} d(x, y)\, \mathrm{d}\xi(x, y),
\]
where M (ν, µ) is the set of all couplings of ν and µ, that is, all probability measures ξ
on G × G with marginals ν and µ. Indeed, on P the Wasserstein distance satisfies the
properties of a metric but is not necessarily finite, see [46, Chapter 6]. For a measurable
function f : G → R we define the Lipschitz norm
\[
\|f\|_{\mathrm{Lip}} = \sup_{x, y \in G,\, x \neq y} \frac{|f(x) - f(y)|}{d(x, y)},
\]
and recall the dual representation of the Wasserstein distance,
\[
W(\nu, \mu) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Bigl| \int_G f(x)\, \mathrm{d}\nu(x) - \int_G f(x)\, \mathrm{d}\mu(x) \Bigr|.
\]
For details we refer to [45, Chapter 1.2]. By δ_x we denote the probability measure concentrated at x. Hence W(δ_x, δ_y) = d(x, y) is finite for x, y ∈ G.
For a transition kernel P on (G, B(G)) and µ ∈ P, let µP ∈ P be defined by µP(A) = ∫_G P(x, A) dµ(x) for A ∈ B(G). With this notation we have δ_x P(A) = P(x, A). Further, for a measurable function f : G → R and µ ∈ P we have
\[
\int_G f(x)\, \mathrm{d}(\mu P)(x) = \int_G (Pf)(x)\, \mathrm{d}\mu(x),
\]
with (Pf)(x) = ∫_G f(y) P(x, dy), whenever one of the integrals exists, see for example [40, Lemma 3.6]. Now, by
\[
\tau(P) := \sup_{x, y \in G,\, x \neq y} \frac{W(\delta_x P, \delta_y P)}{d(x, y)}
\]
we define the generalized ergodicity coefficient of the transition kernel P. This coefficient can
be understood as a generalized Dobrushin ergodicity coefficient, see [8, 9]. Dobrushin
himself called τ(P) the Kantorovich norm of P, see [10, formula (14.34)]. Finally, τ(P)
also provides a lower bound on the coarse Ricci curvature of P introduced in [34].
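To make the definitions concrete, the following sketch computes W(ν, µ) and τ(P) on a finite state space, where the coupling infimum is a linear program. The function names and the toy kernel are our own illustration, not part of the theory above.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein(nu, mu, d):
    """W(nu, mu) on a finite space: minimize sum_{x,y} d[x,y]*xi[x,y]
    over all couplings xi with marginals nu and mu."""
    n = len(nu)
    A_eq = np.zeros((2 * n, n * n))
    for x in range(n):
        A_eq[x, x * n:(x + 1) * n] = 1.0   # sum_y xi[x, y] = nu[x]
        A_eq[n + x, x::n] = 1.0            # sum_x' xi[x', x] = mu[x]
    res = linprog(d.flatten(), A_eq=A_eq,
                  b_eq=np.concatenate([nu, mu]), bounds=(0, None))
    return res.fun

def tau(P, d):
    """Generalized ergodicity coefficient: the supremum of
    W(delta_x P, delta_y P) / d(x, y) over pairs of states."""
    n = P.shape[0]
    return max(wasserstein(P[x], P[y], d) / d[x, y]
               for x in range(n) for y in range(n) if d[x, y] > 0)

# Toy 3-state kernel with the trivial metric d = 2 * 1_{x != y},
# for which tau reduces to the classical Dobrushin coefficient.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
d = 2.0 * (1.0 - np.eye(3))
print(tau(P, d))                              # 0.5, a one-step contraction
P2 = np.linalg.matrix_power(P, 2)
print(tau(P2, d) <= tau(P, d) ** 2 + 1e-9)    # submultiplicativity check
```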
Two essential properties of the ergodicity coefficient are submultiplicativity and con-
tractivity, see [10, Proposition 14.3 and Proposition 14.4].
Proposition 2.1. For two transition kernels P and P̃ on (G, B(G)) and µ, ν ∈ P, we have
\[
\tau(P \widetilde{P}) \le \tau(P)\, \tau(\widetilde{P}) \qquad \text{(submultiplicativity)}
\]
and
\[
W(\nu P, \mu P) \le \tau(P)\, W(\nu, \mu) \qquad \text{(contractivity)}.
\]
Corollary 2.1. Let π be a stationary distribution of P and assume that ∫_G d(x0, x) dπ(x) < ∞ for some x0 ∈ G. Then
\[
\sup_{x \in G} \frac{W(\delta_x P, \pi)}{W(\delta_x, \pi)} \le \tau(P). \tag{2.2}
\]
Proof. Because of the assumption ∫_G d(x0, x) dπ(x) < ∞, we have that W(δx, π) is finite
for any x ∈ G. Thus, the assertion follows by Proposition 2.1 and the stationarity of π.
Remark 2.1. For some special cases one also has an estimate of the form (2.2) in the
other direction. To this end, consider the trivial metric d(x, y) = 2 · 1_{x≠y} with indicator function
\[
\mathbf{1}_{x \neq y} = \begin{cases} 1 & x \neq y, \\ 0 & x = y. \end{cases}
\]
Further, let
\[
\|q\|_{\mathrm{tv}} := \sup_{\|f\|_\infty \le 1} \Bigl| \int_G f(y)\, \mathrm{d}q(y) \Bigr| = 2 \sup_{A \in \mathcal{B}(G)} |q(A)|
\]
denote the total variation norm of a signed measure q on (G, B(G)) with q(G) = 0. With respect to the trivial metric, the Wasserstein distance of ν, µ ∈ P coincides with the total variation distance, W(ν, µ) = ‖ν − µ‖_tv, and the “1” in the subscript of τ1(P) indicates that we use the trivial metric. By applying the
triangle inequality of the total variation norm we obtain τ1(P) ≤ sup_{x∈G} ‖δ_x P − π‖_tv. If
additionally π is atom-free, i.e., π({y}) = 0 for all y ∈ G, we have ‖δ_y − π‖_tv = 2. Then,
the previous consideration and (2.2) lead to
the previous consideration and (2.2) lead to
1
sup kδx P − πktv ≤ τ1 (P ) ≤ sup kδx P − πktv .
2 x∈G x∈G
For the moment, let us assume that P is uniformly ergodic, that is, there exist numbers
ρ ∈ [0, 1) and C ∈ (0, ∞) such that
\[
\sup_{x \in G} \|\delta_x P^n - \pi\|_{\mathrm{tv}} \le C \rho^n, \qquad n \in \mathbb{N}.
\]
By the triangle inequality this yields τ1(P^n) ≤ Cρ^n, and conversely, contractivity together with ‖δ_x − π‖_tv ≤ 2 yields sup_{x∈G} ‖δ_x P^n − π‖_tv ≤ 2τ1(P^n), so that uniform ergodicity is equivalent, up to a change of the constant C, to exponential decay of τ1(P^n).
Also note that if there is an n0 ∈ N for which τ(P^{n0}) < 1, then, by the submultiplicativity
of Proposition 2.1, τ(P^n) converges exponentially to zero. This motivates the following
assumption, which contains the idea of measuring the convergence of δ_x P^n to π in terms
of τ(P^n).
Assumption 2.1 (Wasserstein ergodicity). For the transition kernel P there exist numbers ρ ∈ [0, 1) and C ∈ (0, ∞) such that
\[
\tau(P^n) \le C \rho^n, \qquad n \in \mathbb{N}. \tag{2.4}
\]
3. Perturbation bounds
By N0 = {0, 1, 2, . . . } we denote the non-negative integers and assume that all random
variables are defined on a common probability space (Ω, F , P) mapping to a Polish space
G equipped with a lower semi-continuous metric d. Let the sequence of random variables
(Xn )n∈N0 be a Markov chain with transition kernel P and initial distribution p0 , i.e., we
have almost surely
\[
\mathbb{P}(X_{n+1} \in A \mid X_0, \ldots, X_n) = P(X_n, A)
\]
and p0(A) = P(X0 ∈ A) for any measurable set A ⊆ G. Assume that (X̃n)n∈N0 is another
Markov chain with transition kernel P̃ and initial distribution p̃0. We denote by pn the
distribution of Xn and by p̃n the distribution of X̃n. Throughout the paper, (Xn)n∈N0 is
considered to be the ideal, unperturbed Markov chain we would like to simulate, while
(X̃n)n∈N0 is the perturbed Markov chain that we actually implement.
Theorem 3.1 (Wasserstein perturbation bound). Let Assumption 2.1 be satisfied with
the numbers C ∈ (0, ∞) and ρ ∈ [0, 1), i.e., τ(P^n) ≤ Cρ^n. Assume that there are numbers
δ ∈ (0, 1) and L ∈ (0, ∞) and a measurable Lyapunov function Ṽ : G → [1, ∞) of P̃ such
that
\[
(\widetilde{P} \widetilde{V})(x) \le \delta \widetilde{V}(x) + L. \tag{3.1}
\]
Let
\[
\gamma = \sup_{x \in G} \frac{W(\delta_x P, \delta_x \widetilde{P})}{\widetilde{V}(x)}
\quad \text{and} \quad
\kappa = \max\Bigl\{ \widetilde{p}_0(\widetilde{V}),\, \frac{L}{1 - \delta} \Bigr\},
\]
with p̃0(Ṽ) = ∫_G Ṽ(x) dp̃0(x). Then
\[
W(p_n, \widetilde{p}_n) \le C \Bigl( \rho^n\, W(p_0, \widetilde{p}_0) + (1 - \rho^n)\, \frac{\gamma \kappa}{1 - \rho} \Bigr). \tag{3.2}
\]
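In practice, the right-hand side of (3.2) is a simple explicit function of the constants. A minimal sketch (the function name is ours):

```python
def wasserstein_bound(n, C, rho, W0, gamma, kappa):
    """Right-hand side of (3.2): a bound on W(p_n, p~_n) in terms of
    the constants of Assumption 2.1 and Theorem 3.1."""
    return C * (rho ** n * W0 + (1 - rho ** n) * gamma * kappa / (1 - rho))
```

As n → ∞, the bound tends to Cγκ/(1 − ρ), matching the form of the stationary estimate (3.5) below.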
Proof. By the Lyapunov condition (3.1) and induction, we have
\[
\widetilde{p}_i(\widetilde{V}) \le \delta^i\, \widetilde{p}_0(\widetilde{V}) + \frac{L (1 - \delta^i)}{1 - \delta} \le \kappa, \qquad i \in \mathbb{N}_0. \tag{3.3}
\]
We also have
\[
W(\widetilde{p}_i P, \widetilde{p}_i \widetilde{P}) \le \int_G W(\delta_x P, \delta_x \widetilde{P})\, \mathrm{d}\widetilde{p}_i(x) \le \gamma \int_G \widetilde{V}(x)\, \mathrm{d}\widetilde{p}_i(x). \tag{3.4}
\]
Then, by (3.3), (3.4) and the triangle inequality of the Wasserstein distance we have
\[
W(p_n, \widetilde{p}_n) \le W(p_0 P^n, \widetilde{p}_0 P^n) + \sum_{i=0}^{n-1} W(\widetilde{p}_i \widetilde{P} P^{n-i-1}, \widetilde{p}_i P P^{n-i-1})
\le W(p_0, \widetilde{p}_0)\, \tau(P^n) + \gamma \kappa \sum_{i=0}^{n-1} \tau(P^i).
\]
Finally, by (2.4) we obtain \(\sum_{i=0}^{n-1} \tau(P^i) \le \frac{C(1-\rho^n)}{1-\rho}\), which allows us to complete the
proof.
Remark 3.1. The parameter κ is an upper bound on p̃_i(Ṽ). It can be interpreted as
a measure of the stability of the perturbed Markov chain. The parameter γ quantifies,
in a weighted supremum norm, the one-step difference between P and P̃. The use of the
Lyapunov function increases the flexibility of the resulting estimate, since larger values of
Ṽ compensate larger values of the Wasserstein distance between the kernels. Notice that
the existence of a Lyapunov function satisfying (3.1) is weaker than assuming Ṽ-uniform
ergodicity of P̃, since it is not associated with a small set condition. In particular, the
condition is satisfied for any P̃ with the trivial choice Ṽ(x) = 1 for all x ∈ G, see Corollary
3.2. As we will see in Section 4, allowing for non-trivial choices of Ṽ considerably increases
the applicability of our results.
Corollary 3.1. Let the assumptions of Theorem 3.1 be satisfied. Assume that P̃ has a
stationary distribution π̃ ∈ P and let W(π, π̃) be finite. Then
\[
W(\pi, \widetilde{\pi}) \le \frac{\gamma C}{1 - \rho} \cdot \frac{L}{1 - \delta}. \tag{3.5}
\]
Remark 3.2. It may seem artificial to assume W(π, π̃) < ∞, but this is needed for
the limit argument in the proof. This condition is often satisfied a priori. For example,
it holds if the metric is bounded, i.e., sup_{x,y∈G} d(x, y) is finite, or, more generally, if the
distributions π and π̃ possess a first moment in the sense that there exist x0, x̃0 ∈ G such
that
\[
\int_G d(x_0, x)\, \mathrm{d}\pi(x) < \infty, \qquad \int_G d(\widetilde{x}_0, x)\, \mathrm{d}\widetilde{\pi}(x) < \infty.
\]
As pointed out in Remark 3.1, we do not need to impose condition (3.1) to obtain a
non-trivial perturbation bound:
Corollary 3.2. Assume that Assumption 2.1 holds with the numbers C ∈ (0, ∞) and
ρ ∈ [0, 1), i.e., τ(P^n) ≤ Cρ^n, and let
\[
\gamma = \sup_{x \in G} W(\delta_x P, \delta_x \widetilde{P}).
\]
Then
\[
W(p_n, \widetilde{p}_n) \le C \Bigl( \rho^n\, W(p_0, \widetilde{p}_0) + (1 - \rho^n)\, \frac{\gamma}{1 - \rho} \Bigr). \tag{3.6}
\]
Remark 3.3. For the trivial metric d(x, y) = 2 · 1_{x≠y}, the last corollary essentially recovers
the result of [33, Theorem 3.1], where the total variation distance is used instead of the
general Wasserstein distance. There, the bound’s dependence on C and ρ can be further
improved by using the a priori bound τ1(P^n) ≤ 1 in addition to uniform ergodicity. For
other metrics d such an a priori bound is in general not available.
Remark 3.4. Table 1 provides a detailed comparison between our Theorem 3.1 and
the related Wasserstein perturbation result of Pillai and Smith, [35, Lemma 3.3]. An
important ingredient in their estimate is a set Ĝ ⊆ G which can be interpreted as
the part of G where both Markov chains remain with high probability. When a good
uniform upper bound on W(δ_x P, δ_x P̃) for all x ∈ G is available, we can choose Ĝ = G
in [35, Lemma 3.3] and Ṽ(x) = 1 in Theorem 3.1. In that case, both results essentially
simplify to Corollary 3.2. The results become entirely different when such a bound is not
available or too rough. In our estimate, one then needs a non-trivial Lyapunov function
for P̃ and a uniform upper bound on W(δ_x P, δ_x P̃)/Ṽ(x). To apply their estimate, one
instead needs Lyapunov functions for both P and P̃ together with the drift regularity
conditions and the regularity of π listed in Table 1.
Table 1. Comparison of the Wasserstein perturbation bound of [35, Lemma 3.3] and Theorem 3.1.
Here ρ, δ ∈ [0, 1), L, c_p, C, D ∈ (0, ∞), V : G → [0, ∞), Ṽ : G → [1, ∞) and E(x) = ∫_G d(x, y) dπ(y).

Assumptions of [35, Lemma 3.3]:
• Lyapunov functions: P V(x) ≤ δV(x) + L and P̃V(x) ≤ δV(x) + L.
• Drift regularity: E[V(X_{n+1}) | X_n = x, X_{n+1} ∉ Ĝ] ≤ C, E[V(X̃_{n+1}) | X̃_n = x, X̃_{n+1} ∉ Ĝ] ≤ C, and there exists p ∈ Ĝ such that d(x, p) ≤ V(x) + c_p.
• Perturbation error: γ̂ := sup_{x∈Ĝ} W(δ_x P, δ_x P̃).
• Regularity of π: ∫_{G\Ĝ} V(x) dπ(x) ≤ D and π(G \ Ĝ) small.

Assumptions of Theorem 3.1:
• Lyapunov function: P̃Ṽ(x) ≤ δṼ(x) + L.
• Perturbation error: γ := sup_{x∈G} W(δ_x P, δ_x P̃)/Ṽ(x).

Conclusion, an upper bound of W(δ_x P̃^n, π):
• [35, Lemma 3.3]:
\[
\rho^n E(x) + \frac{\hat\gamma}{1-\rho} + \frac{2L}{1-\delta} + \delta^n (V(x) + D) + c_p\, \pi(G \setminus \hat G)
+ 2\Bigl(1 - \mathbb{P}\bigl[\{X_j\}_{j=1}^{n-1} \cup \{\widetilde X_j\}_{j=1}^{n-1} \subseteq \hat G\bigr]\Bigr)\Bigl(C + \frac{L}{1-\delta} + c_p\Bigr).
\]
• Theorem 3.1:
\[
C \rho^n E(x) + \frac{C\gamma}{1-\rho} \max\Bigl\{\widetilde V(x), \frac{L}{1-\delta}\Bigr\}.
\]
Thus
\[
\sup_{x \in G} \frac{\|P^n(x, \cdot) - \pi\|_V}{V(x)} \le C \rho^n. \tag{3.7}
\]
The following result establishes the connection between V -norms and certain Wasserstein
distances. It is basically due to Hairer and Mattingly [17], see also [26].
By similar arguments as in the proof of [26, Theorem 1.1] we observe that (3.7) implies
a suitable upper bound on the ergodicity coefficient τ_V(P^n) induced by the metric d_V.
Lemma 3.2. If (3.7) is satisfied for the transition kernel P, then τ_V(P^n) ≤ Cρ^n.
Proof. For any positive real numbers a1, a2, b1, b2 we have the following elementary
inequality
\[
\frac{a_1 + a_2}{b_1 + b_2} \le \max\Bigl\{ \frac{a_1}{b_1},\, \frac{a_2}{b_2} \Bigr\}. \tag{3.9}
\]
By (3.9) we obtain
The lemmas above and Theorem 3.1 lead to the following new perturbation bound for
geometrically ergodic Markov chains.
Corollary 3.3. Let P be V-uniformly ergodic, i.e., there are constants ρ ∈ [0, 1) and
C ∈ (0, ∞) such that
\[
\sup_{x \in G} \frac{\|P^n(x, \cdot) - \pi\|_V}{V(x)} \le C \rho^n.
\]
We also assume that there are numbers δ ∈ (0, 1) and L ∈ (0, ∞) and a measurable
Lyapunov function Ṽ : G → [1, ∞) of P̃ such that
\[
(\widetilde{P} \widetilde{V})(x) \le \delta \widetilde{V}(x) + L.
\]
Let
\[
\gamma = \sup_{x \in G} \frac{\| P(x, \cdot) - \widetilde{P}(x, \cdot) \|_V}{\widetilde{V}(x)}
\quad \text{and} \quad
\kappa = \max\Bigl\{ \widetilde{p}_0(\widetilde{V}),\, \frac{L}{1 - \delta} \Bigr\},
\]
with p̃0(Ṽ) = ∫_G Ṽ(x) dp̃0(x). Then
\[
\|p_n - \widetilde{p}_n\|_V \le C \Bigl( \rho^n\, \|p_0 - \widetilde{p}_0\|_V + (1 - \rho^n)\, \frac{\gamma \kappa}{1 - \rho} \Bigr). \tag{3.11}
\]
Remark 3.5. In [41, Theorem 3.1], a related perturbation bound is proven. The convergence
property imposed on the unperturbed transition kernel is slightly weaker than our V-uniform
ergodicity, but also based on a kind of Lyapunov function. More restrictively, it is
assumed there that the difference of P^n and P̃^n can be controlled for all n > 0. In addition, the
perturbation error is measured with a weight given by the same Lyapunov function as in
the convergence property of P, but by taking a supremum over a subset of test functions.
With our approach we can take the supremum over all test functions and obtain similar
estimates by setting p0 = π.
The next corollary demonstrates how the Lyapunov function of P̃ can be replaced
by a Lyapunov function of P, provided that the distance between the transition kernels
is sufficiently small. Notice that assuming the existence of a Lyapunov function of P in
addition to the V-uniform ergodicity merely fixes the constants rather than imposing an
additional requirement, see, e.g., [5].
Corollary 3.4. Let P be V-uniformly ergodic, i.e., there are constants ρ ∈ [0, 1) and
C ∈ (0, ∞) such that
\[
\sup_{x \in G} \frac{\|P^n(x, \cdot) - \pi\|_V}{V(x)} \le C \rho^n,
\]

which implies (3.14). The assertion follows by the assumption that δ + γ < 1 and an
application of Corollary 3.3.
Remark 3.6. For discrete state spaces and under the requirement p0 = p̃0, a result
similar to the previous corollary is obtained in [21, Theorem 3, Corollary 3]. The authors
of [21] replace our constant κ by max_{0≤i≤n} p̃_i(V). This we could do as well, see the proof
of Theorem 3.1.
In the perturbation bound of Corollary 3.3, the function V plays two roles. In its
first role, V appears in the V-uniform ergodicity condition and thus is used to quantify
the convergence of P. In its second role, V appears in the constant γ, with which we compare
P and P̃, as well as in the definition of the distance between pn and p̃n. We can interpret
γ of Corollary 3.3 as an operator norm of P − P̃. To this end, let B_V be the set of all
measurable functions f : G → R with finite
\[
|f|_V := \sup_{x \in G} \frac{|f(x)|}{V(x)}, \tag{3.15}
\]
which means
\[
B_V = \{ f : G \to \mathbb{R} \mid |f|_V < \infty \}.
\]
It is easily seen that (B_V, |·|_V) is a normed linear space. In the setting of Corollary 3.3,
we have
\[
\| P - \widetilde{P} \|_{B_V \to B_{\widetilde{V}}} := \sup_{|f|_V \le 1} \bigl| (P - \widetilde{P}) f \bigr|_{\widetilde{V}} = \gamma. \tag{3.16}
\]
Theorem 3.2. Let P be Ṽ-uniformly ergodic, i.e., there are constants ρ ∈ [0, 1) and
C ∈ (0, ∞) such that
\[
\sup_{x \in G} \frac{\|P^n(x, \cdot) - \pi\|_{\widetilde{V}}}{\widetilde{V}(x)} \le C \rho^n.
\]
We also have
\[
\bigl\| \widetilde{p}_i (P - \widetilde{P}) \bigr\|_{\mathrm{tv}} \le \int_G \bigl\| \delta_x P - \delta_x \widetilde{P} \bigr\|_{\mathrm{tv}}\, \mathrm{d}\widetilde{p}_i(x) \le \gamma \int_G \widetilde{V}(x)\, \mathrm{d}\widetilde{p}_i(x),
\]
as well as
\[
\bigl\| \widetilde{p}_i (P - \widetilde{P}) \bigr\|_{\widetilde{V}} \le \int_G W_{d_{\widetilde{V}}}(\delta_x P, \delta_x \widetilde{P})\, \mathrm{d}\widetilde{p}_i(x) \le \sup_{x \in G} \frac{W_{d_{\widetilde{V}}}(\delta_x P, \delta_x \widetilde{P})}{\widetilde{V}(x)} \int_G \widetilde{V}(x)\, \mathrm{d}\widetilde{p}_i(x).
\]
we have
\[
\sup_{x \in G} \frac{W_{d_{\widetilde{V}}}(\delta_x P, \delta_x \widetilde{P})}{\widetilde{V}(x)} \le 2(L + 1).
\]
Then
\[
\| \widetilde{p}_n - p_n \|_{\mathrm{tv}} \le C \rho^n\, \| \widetilde{p}_0 - p_0 \|_{\widetilde{V}} + 2^s (L + 1)^s\, \gamma^r \kappa \sum_{i=0}^{n-1} \tau_{\widetilde{V}}(P^i)^s.
\]
For γ ∈ (0, exp(−1)), we can choose the numbers r = 1 + log(γ)^{−1} and s = log(γ^{−1})^{−1}.
This yields γ^r = exp(1)γ and the proof is complete.
Remark 3.8. Let us comment on the dependence on γ. In Section 4.3, we apply Theorem
3.2 combined with (3.19) in a setting where we have γ ≤ K · log(N)/N for a constant
K ≥ 1 and some parameter N ∈ N of the perturbed transition kernel. For ε ∈ (0, 1) and
any N > (K/ε)^{1/(1−ε)} we have γ < exp(−1). Then, with some simple calculations, we
obtain for p0 = p̃0 and N > 6K^{3/2} the bound
\[
\max\{ \|p_n - \widetilde{p}_n\|_{\mathrm{tv}},\, \|\pi - \widetilde{\pi}\|_{\mathrm{tv}} \} \le \frac{3 \kappa\, (2C(L+1))^{2/\log(N)}}{1 - \rho} \cdot \frac{K \log(N)^2}{N}.
\]
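For orientation, the bound of Remark 3.8 is easy to evaluate numerically. A small sketch (the function name is ours), valid under the stated assumptions:

```python
import numpy as np

def remark_3_8_bound(N, K, C, L, rho, kappa):
    """Right-hand side of the estimate in Remark 3.8, assuming
    p0 = p~0 and N > 6 * K**1.5 (so that gamma < exp(-1))."""
    return (3 * kappa * (2 * C * (L + 1)) ** (2 / np.log(N))
            / (1 - rho) * K * np.log(N) ** 2 / N)
```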
Remark 3.9. In the setting of Theorem 3.2, we can also interpret γ as an operator
norm. Namely,
\[
\| P - \widetilde{P} \|_{B_1 \to B_{\widetilde{V}}} = \sup_{|f|_1 \le 1} \bigl| (P - \widetilde{P}) f \bigr|_{\widetilde{V}} = \gamma. \tag{3.20}
\]
Here the subscript “1” in |f|_1 indicates V(x) = 1 for all x ∈ G, see (3.15). For ε0 > 0
and a family of perturbations (P̃_ε)_{|ε|≤ε0}, let γ = ‖P − P̃_ε‖_{B_1 → B_Ṽ} → 0 for ε → 0. This
condition appears in [13, Theorem 1, condition (2)] and is an assumption introduced by
Keller and Liverani, see [22].
4. Applications
We illustrate our perturbation bounds in three different settings. We begin by studying
an autoregressive process also considered in [13]. After this, we show quantitative perturbation
bounds for approximate versions of two prominent MCMC algorithms, namely
the Metropolis-Hastings and stochastic Langevin algorithms.
4.1. Autoregressive model
Consider the autoregressive process (Xn)n∈N0 given by
\[
X_n = \alpha X_{n-1} + Z_n, \qquad n \in \mathbb{N}. \tag{4.1}
\]
Here X0 is an R-valued random variable, α ∈ (−1, 1) and (Zn )n∈N is an i.i.d. sequence of
random variables, independent of X0 . We also assume that the distribution of Z1 , say µ,
admits a first moment. It is easily seen that (Xn )n∈N0 is a Markov chain with transition
kernel
\[
P_\alpha(x, A) = \int_{\mathbb{R}} \mathbf{1}_A(\alpha x + y)\, \mathrm{d}\mu(y),
\]
and it is well known that there exists a stationary distribution, say πα , of Pα .
Now, let the transition kernel P_α̃ with α̃ ∈ (−1, 1) be an approximation of Pα. For
x, y ∈ R, let us consider the metric given by the absolute difference, i.e., d(x, y) = |x − y|.
We assume that |α − α̃| is small and study the Wasserstein distance, based on d,
of p0 Pα^n and p̃0 P_α̃^n for two probability measures p0 and p̃0 on (R, B(R)).
We intend to apply Theorem 3.1. Notice that for Ṽ : R → [1, ∞) with Ṽ(x) = 1 + |x|
we have
\[
P_{\widetilde{\alpha}} \widetilde{V}(x) \le |\widetilde{\alpha}|\, \widetilde{V}(x) + 1 - |\widetilde{\alpha}| + \mathbb{E}|Z_1|,
\]
so that (3.1) holds with δ = |α̃| and L = 1 − |α̃| + E|Z1|. Moreover, coupling two copies
of the chain through the same innovations yields W(δ_x Pα^n, δ_y Pα^n) ≤ |α|^n |x − y|, so that
Assumption 2.1 holds with C = 1 and ρ = |α|, and
\[
W(\delta_x P_\alpha, \delta_x P_{\widetilde{\alpha}}) \le |\alpha - \widetilde{\alpha}|\, |x| \le |\alpha - \widetilde{\alpha}|\, \widetilde{V}(x),
\]
i.e., γ ≤ |α − α̃|. Set κ = max{p̃0(Ṽ), (1 − |α̃| + E|Z1|)/(1 − |α̃|)}
and pα,n = p0 Pα^n, p̃α̃,n = p̃0 P_α̃^n. Then, inequality (3.2) of Theorem 3.1 gives
\[
W(p_{\alpha,n}, \widetilde{p}_{\widetilde{\alpha},n}) \le |\alpha|^n\, W(p_0, \widetilde{p}_0) + |\alpha - \widetilde{\alpha}|\, \frac{(1 - |\alpha|^n)\, \kappa}{1 - |\alpha|}, \tag{4.2}
\]
and for p0 = p̃0 we have
\[
W(p_{\alpha,n}, \widetilde{p}_{\widetilde{\alpha},n}) \le |\alpha - \widetilde{\alpha}|\, \frac{(1 - |\alpha|^n)\, \kappa}{1 - |\alpha|}. \tag{4.3}
\]
From the previous two inequalities one can see that if α̃ is sufficiently close to α, then
the distance of the distributions pα,n and p̃α̃,n is small. Let us emphasize here that we
provide an explicit estimate rather than an asymptotic statement.
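The estimate (4.3) can also be checked by simulation. The following sketch compares an empirical 1-Wasserstein distance with the right-hand side of (4.3), assuming standard normal innovations and p0 = p̃0 = δ0; all parameter choices are illustrative.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def ar1(alpha, n_steps, n_chains, x0=0.0):
    """n_chains independent runs of X_n = alpha*X_{n-1} + Z_n, Z_n ~ N(0,1)."""
    x = np.full(n_chains, x0)
    for _ in range(n_steps):
        x = alpha * x + rng.standard_normal(n_chains)
    return x

alpha, alpha_t, n = 0.5, 0.52, 30
emp = wasserstein_distance(ar1(alpha, n, 200_000), ar1(alpha_t, n, 200_000))

# Bound (4.3) with C = 1, rho = |alpha|, gamma <= |alpha - alpha_t| and
# kappa = max{V~(0), (1 - |alpha_t| + E|Z_1|) / (1 - |alpha_t|)}.
EZ = np.sqrt(2.0 / np.pi)                    # E|Z_1| for Z_1 ~ N(0, 1)
kappa = max(1.0, (1 - abs(alpha_t) + EZ) / (1 - abs(alpha_t)))
bound = abs(alpha - alpha_t) * (1 - abs(alpha) ** n) * kappa / (1 - abs(alpha))
print(emp, bound)   # the empirical distance should lie below the bound
```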
Note that by [16, Proposition 4.24] and the fact that P_β g(x) ≤ |β| g(x) + E|Z1| with
g(x) = |x| and β ∈ {α, α̃}, we obtain ∫_R |x| dπ_β(x) < ∞, which leads to a finite W(πα, πα̃).
As a consequence we obtain for the stationary distributions of Pα and P_α̃ by estimate
(3.5) that
\[
W(\pi_\alpha, \pi_{\widetilde{\alpha}}) \le |\alpha - \widetilde{\alpha}|\, \frac{1 - |\widetilde{\alpha}| + \mathbb{E}|Z_1|}{(1 - |\alpha|)(1 - |\widetilde{\alpha}|)}. \tag{4.4}
\]
The dependence on |α − α̃| in the previous inequality cannot be improved in general.
To see this, let us assume that X_{0,α} and X_{0,α̃} are real-valued random variables with
distribution πα and πα̃, respectively. Then, because of the stationarity, X_{1,α} = αX_{0,α} + Z1
and X_{1,α̃} = α̃X_{0,α̃} + Z1 are also distributed according to πα and πα̃, respectively. Thus
\[
\mathbb{E} X_{0,\alpha} = \frac{\mathbb{E} Z_1}{1 - \alpha}, \qquad \mathbb{E} X_{0,\widetilde{\alpha}} = \frac{\mathbb{E} Z_1}{1 - \widetilde{\alpha}}.
\]
Now, for g : R → R with g(x) = x, we have ‖g‖_Lip ≤ 1 and thus
\[
W(\pi_\alpha, \pi_{\widetilde{\alpha}}) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Bigl| \int_{\mathbb{R}} f(x)\, (\mathrm{d}\pi_\alpha(x) - \mathrm{d}\pi_{\widetilde{\alpha}}(x)) \Bigr|
\ge \Bigl| \int_{\mathbb{R}} x\, (\mathrm{d}\pi_\alpha(x) - \mathrm{d}\pi_{\widetilde{\alpha}}(x)) \Bigr|
= |\mathbb{E} X_{0,\alpha} - \mathbb{E} X_{0,\widetilde{\alpha}}|
= |\alpha - \widetilde{\alpha}|\, \frac{|\mathbb{E} Z_1|}{|1 - \alpha|\, |1 - \widetilde{\alpha}|}.
\]
Hence, whenever EZ1 ≠ 0, we have a non-trivial lower bound with the same dependence
on |α − α̃| as in the upper bound of (4.4). This shows that the upper bound cannot be
improved in general.
Let us now discuss the application of Corollary 3.4 and Theorem 3.2. Under the
additional assumption that µ, the distribution of Z1, has a Lebesgue density h, it is
shown in [15, Section 4] that the autoregressive model (4.1) is also Ṽ-uniformly ergodic.
Precisely, there is a constant C ≥ 1 such that
\[
\sup_{x \in \mathbb{R}} \frac{\|P_\alpha^n(x, \cdot) - \pi_\alpha\|_{\widetilde{V}}}{\widetilde{V}(x)} \le C\, |\alpha|^n.
\]
On the other hand, the corresponding perturbation error γ of Corollary 3.4 does not go to 0 when α̃ ↓ α. Hence, Corollary 3.4 cannot quantify for small |α̃ −
α| whether the nth step distributions are close to each other. However, also in [13,
Example 1] it is proven that
and similarly for the second summand. Using that F(a) = F(−a), we obtain F(a) ≤
2|a| h_max. Finally, by substitution we can write
\[
\sup_{x \in \mathbb{R}} \frac{\int_{\mathbb{R}} |h(z - \alpha x) - h(z - \widetilde{\alpha} x)|\, \mathrm{d}z}{1 + |x|} = |\alpha - \widetilde{\alpha}| \sup_{a \ge 0} \frac{F(a)}{a + |\alpha - \widetilde{\alpha}|} \le 2\, |\alpha - \widetilde{\alpha}|\, h_{\max}.
\]
Then, let the acceptance probability be α(x, y) = min{1, r(x, y)}. With this notation the
Metropolis-Hastings algorithm defines a transition kernel
\[
P_\alpha(x, \mathrm{d}y) = Q(x, \mathrm{d}y)\, \alpha(x, y) + \delta_x(\mathrm{d}y)\, s_\alpha(x), \tag{4.6}
\]
with
\[
s_\alpha(x) = 1 - \int_G \alpha(x, y)\, Q(x, \mathrm{d}y).
\]
We provide a step of a Markov chain (Xn)n∈N0 with transition kernel Pα in algorithmic
form.
Algorithm 4.1. A single transition from Xn to Xn+1 works as follows:
1. Draw a sample Y ∼ Q(Xn, ·) and U ∼ Unif[0, 1] independently, call the results y and u;
2. If u ≤ α(Xn, y), set Xn+1 = y; otherwise, set Xn+1 = Xn.
Now, suppose we are unable to evaluate r(x, y), so that we are forced to work with
an approximation of α(x, y). The key idea behind approximate Metropolis-Hastings algorithms
is to replace r(x, y) by a non-negative random variable R with distribution, say
µ_{x,y,u}, depending on x, y ∈ G and u ∈ [0, 1]. For concrete choices of the random variable
R we refer to [2, 3, 4, 23]. We present a step of the corresponding Markov chain (X̃n)n∈N0
in algorithmic form.
Algorithm 4.2. A single transition from X̃n to X̃n+1 works as follows:
1. Draw a sample Y ∼ Q(X̃n, ·) and U ∼ Unif[0, 1] independently, call the result y
and u;
2. Draw a sample R ∼ µ_{X̃n,y,u}, call the result r;
3. If u ≤ r, set X̃n+1 = y; otherwise, set X̃n+1 = X̃n,
and the transition kernel of such a Markov chain is still of the form (4.6) with α(x, y)
substituted by α̃(x, y), i.e., it is given by P_α̃. The following results hold in the slightly more
general case where α̃(x, y) is any approximation of the acceptance probability α(x, y).
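For concreteness, here is a sketch of one exact and one approximate Metropolis-Hastings transition in the spirit of the subsampling schemes of [3, 23]. It is a simplified illustration, not the adaptive procedure of [3]; all function names are ours, and the subsample-based quantity log_R plays the role of the random variable R above.

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_step(x, log_target, propose):
    """One exact Metropolis-Hastings transition of the form (4.6),
    assuming a symmetric proposal kernel Q."""
    y = propose(x)
    u = rng.uniform()
    return y if np.log(u) <= log_target(y) - log_target(x) else x

def approx_mh_step(x, log_lik_terms, log_prior, propose, m):
    """One approximate transition: the exact acceptance ratio r(x, y)
    is replaced by a random estimate based on a subsample of size m
    out of the N likelihood terms."""
    y = propose(x)
    N = len(log_lik_terms)
    idx = rng.choice(N, size=m, replace=False)
    # Unbiased estimate of sum_i [log lik_i(y) - log lik_i(x)].
    est = (N / m) * sum(log_lik_terms[i](y) - log_lik_terms[i](x)
                        for i in idx)
    log_R = log_prior(y) - log_prior(x) + est
    u = rng.uniform()
    return y if np.log(u) <= log_R else x
```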
The next lemma provides an estimate for the Wasserstein distance between transition
kernels of the form (4.6) in terms of the acceptance probabilities.
Lemma 4.1. Let Q be a transition kernel on (G, B(G)) and let α : G × G → [0, 1] and
α̃ : G × G → [0, 1] be measurable functions. By Pα and P_α̃ we denote the transition kernels
of the form (4.6) with acceptance probabilities α and α̃. Then, for all x ∈ G, we have
\[
W(\delta_x P_\alpha, \delta_x P_{\widetilde{\alpha}}) \le \int_G d(x, y)\, E(x, y)\, Q(x, \mathrm{d}y),
\]
where E(x, y) := |α(x, y) − α̃(x, y)|.
Proof. By the use of the dual representation of the Wasserstein distance it follows that
\[
W(\delta_x P_\alpha, \delta_x P_{\widetilde{\alpha}}) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Bigl| \int_G f(y)\, (P_\alpha(x, \mathrm{d}y) - P_{\widetilde{\alpha}}(x, \mathrm{d}y)) \Bigr|
= \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Bigl| \int_G (f(y) - f(x))\, (\alpha(x, y) - \widetilde{\alpha}(x, y))\, Q(x, \mathrm{d}y) \Bigr|
\le \int_G d(x, y)\, E(x, y)\, Q(x, \mathrm{d}y).
\]
By the previous lemma and Theorem 3.1, we obtain the following Wasserstein pertur-
bation bound for the approximate Metropolis-Hastings algorithm.
Corollary 4.1. Let Q be a transition kernel on (G, B(G)) and let α : G × G → [0, 1] and
α̃ : G × G → [0, 1] be measurable functions. By Pα and P_α̃ we denote the transition kernels
of the form (4.6) with acceptance probabilities α and α̃. Let the following conditions be
satisfied:
• Assumption 2.1 holds for the transition kernel Pα, i.e., τ(Pα^n) ≤ Cρ^n for ρ ∈ [0, 1)
and C ∈ (0, ∞).
• There are numbers δ ∈ (0, 1), L ∈ (0, ∞) and a measurable Lyapunov function
Ṽ : G → [1, ∞) of P_α̃, i.e.,
\[
(P_{\widetilde{\alpha}} \widetilde{V})(x) \le \delta \widetilde{V}(x) + L. \tag{4.7}
\]
Then, with
\[
\gamma = \sup_{x \in G} \frac{\int_G d(x, y)\, E(x, y)\, Q(x, \mathrm{d}y)}{\widetilde{V}(x)}
\quad \text{and} \quad
\kappa = \max\Bigl\{ \widetilde{p}_0(\widetilde{V}),\, \frac{L}{1 - \delta} \Bigr\},
\]
the perturbation bound (3.2) of Theorem 3.1 holds, i.e.,
\[
W(p_0 P_\alpha^n, \widetilde{p}_0 P_{\widetilde{\alpha}}^n) \le C \Bigl( \rho^n\, W(p_0, \widetilde{p}_0) + (1 - \rho^n)\, \frac{\gamma \kappa}{1 - \rho} \Bigr).
\]
Let us point out several aspects of condition (4.7). Recall that (4.7) is always satisfied
with Ṽ(x) = 1 for all x ∈ G. However, in this case it seems more difficult to control γ.
If some additional knowledge in the form of a Lyapunov function V : G → [1, ∞) of Pα, i.e.,
PαV(x) ≤ δV(x) + L for some δ ∈ (0, 1) and L ∈ (0, ∞), is available, then V is a non-trivial
candidate for Ṽ. For sufficiently small
\[
\delta_V = \sup_{z \in G} \int_G \Bigl( \frac{V(y)}{V(z)} + 1 \Bigr) E(z, y)\, Q(z, \mathrm{d}y)
\]
this is indeed true. Namely, we have
\[
|(P_\alpha - P_{\widetilde{\alpha}}) V(x)| \le \int_G V(y)\, E(x, y)\, Q(x, \mathrm{d}y) + V(x) \int_G E(x, y)\, Q(x, \mathrm{d}y) \le V(x)\, \delta_V.
\]
Then, P_α̃V(x) ≤ (δ + δ_V)V(x) + L and whenever δ + δ_V < 1 it is clear that condition
(4.7) is verified.
To highlight the usefulness of a non-trivial Lyapunov function, we consider the fol-
lowing scenario which is related to a local perturbation of an independent Metropolis-
Hastings algorithm.
Example 4.1. Let us assume that for Pα Assumption 2.1, as formulated in Corollary
4.1, is satisfied. For some probability measure µ on (G, B(G)) define Q(x, ·) = µ and
p0 = p̃0 = µ. For G̃ ⊆ G let
\[
\widetilde{\alpha}(x, y) = \min\{1, \alpha(x, y) + \mathbf{1}_{\widetilde{G}}(x)\}.
\]
Hence, for x ∈ G̃ the transition kernel P_α̃(x, ·) accepts any proposed state and for x ∉ G̃
we have P_α̃(x, ·) = Pα(x, ·). It is easily seen that E(x, y) ≤ 1_{G̃}(x). For arbitrary R > 0
and r ∈ (0, 1) set Ṽ(x) = 1 + R·1_{G̃}(x) and note that
\[
P_{\widetilde{\alpha}} \widetilde{V}(x) \le 1 + R\, \mu(\widetilde{G}) \le r\, \widetilde{V}(x) + 1 - r + R\, \mu(\widetilde{G}).
\]
The first inequality of the previous formula follows by distinguishing the cases x ∈ G̃ and
x ∉ G̃. Define D(G̃) = sup_{x∈G̃} ∫_G d(x, y) µ(dy) and observe
\[
\kappa = 1 + \frac{R\, \mu(\widetilde{G})}{1 - r}, \qquad \gamma \le \frac{D(\widetilde{G})}{1 + R}.
\]
Then, Corollary 4.1 leads to
\[
W(p_0 P_\alpha^n, p_0 P_{\widetilde{\alpha}}^n) \le \frac{C}{1 - \rho} \Bigl( 1 + \frac{R\, \mu(\widetilde{G})}{1 - r} \Bigr) \frac{D(\widetilde{G})}{1 + R}
\]
for arbitrary R ∈ (0, ∞) and r ∈ (0, 1). Under the assumption that D(G̃) is finite and
letting R → ∞ as well as r ↓ 0 we obtain
\[
W(p_0 P_\alpha^n, p_0 P_{\widetilde{\alpha}}^n) \le \frac{C\, \mu(\widetilde{G})\, D(\widetilde{G})}{1 - \rho},
\]
which tells us that basically µ(G̃) measures the difference of the distributions. A small
perturbation set G̃ with respect to µ thus implies a small bias. In contrast, with the trivial
Lyapunov function Ṽ = 1, and if there is (x, y) ∈ G̃ × G such that α(x, y) = 0, we only
obtain
\[
\gamma \kappa = D(\widetilde{G}) \ge \inf_{x \in \widetilde{G}} \int_G d(x, y)\, \mu(\mathrm{d}y).
\]
The resulting upper bound on W(p_0 P_α^n, p_0 P_α̃^n) will typically be bounded away from zero
regardless of the set G̃.
Remark 4.1. The constant γ essentially depends on the distance d(x, y) and the difference
of the acceptance probabilities E(x, y). By applying the Cauchy-Schwarz inequality
to the numerator of γ, we can separate the two parts, i.e.,
\[
\int_G d(x, y)\, E(x, y)\, Q(x, \mathrm{d}y) \le \Bigl( \int_G d(x, y)^2\, Q(x, \mathrm{d}y) \Bigr)^{1/2} \Bigl( \int_G E(x, y)^2\, Q(x, \mathrm{d}y) \Bigr)^{1/2}.
\]
If both integrals remain finite, we see that an appropriate control of E(x, y) suffices for
making the constant γ small.
Remark 4.2. By using a Hoeffding-type bound, it is shown in Bardenet et al. [3, Lemma 3.1]
that for their version of the approximate Metropolis-Hastings algorithm with
adaptive subsampling the approximation error E(x, y) is bounded uniformly in x and y
by a constant s > 0. Moreover, s can be chosen arbitrarily small in the implementation
of the algorithm.
Now we consider the case where the unperturbed transition kernel Pα is geometrically
ergodic. Motivated by Remark 4.2, we also assume that E(x, y) ≤ s for a sufficiently
small number s > 0. The following corollary generalizes a main result of Bardenet et al.
[3, Proposition 3.2] to the geometrically ergodic case.
Corollary 4.2. Let Q be a transition kernel on (G, B(G)) and let α : G × G → [0, 1] and
α̃ : G × G → [0, 1] be measurable functions. By Pα and P_α̃ we denote the transition kernels
of the form (4.6) with acceptance probabilities α and α̃. Let the following conditions be
satisfied:
• The unperturbed transition kernel Pα is V-uniformly ergodic, that is,
\[
\sup_{x \in G} \frac{\|P_\alpha^n(x, \cdot) - \pi\|_V}{V(x)} \le C \rho^n
\]
for numbers ρ ∈ [0, 1), C ∈ (0, ∞) and a measurable function V : G → [1, ∞).
Moreover, V is a Lyapunov function of Pα, i.e., PαV(x) ≤ δV(x) + L for numbers
δ ∈ (0, 1) and L ∈ (0, ∞).
• The approximation error of the acceptance probability satisfies
\[
E(x, y) = |\alpha(x, y) - \widetilde{\alpha}(x, y)| \le s.
\]
Let λ := sup_{x∈G} ∫_G (1 + V(y)/V(x)) Q(x, dy), assume s < (1 − δ)/λ and set
κ = max{p0(V), L/(1 − δ − λs)}. Then
\[
\|p_0 P_\alpha^n - p_0 P_{\widetilde{\alpha}}^n\|_V \le \frac{\lambda\, s\, \kappa\, C\, (1 - \rho^n)}{1 - \rho}.
\]
Proof. We consider the metric d_V, defined in Lemma 3.1, set Ṽ = V and use E(x, y) ≤ s,
so that it is easily seen that the constant γ from Corollary 4.1 satisfies γ ≤ sλ. From
the proof of Corollary 3.4, we know that V is a Lyapunov function of P_α̃ provided that
γ + δ < 1. Thus, we have
\[
P_{\widetilde{\alpha}} V(x) \le (\delta + \lambda s)\, V(x) + L. \tag{4.10}
\]
Now if s < (1 − δ)/λ, then δ + λs < 1 and the assertion follows from Corollary 4.1 by
writing the Wasserstein distances in terms of V-norms as in Section 3.2.
Remark 4.3. Without V(x) in the denominator, i.e., if we had relied on Corollary 3.2
instead of Theorem 3.1, the constant λ would often be infinite. Consider the following toy
example: Let π be the exponential distribution with density exp(−x) on G = [0, ∞) and
assume that Q(x, dy) is a uniform proposal with support [x − 1, x + 1]. With V(x) = exp(x)
it is well known that the Metropolis-Hastings algorithm is V-uniformly ergodic, see [30]
or [37, Example 4]. In this example
\[
\lambda \le 1 + \sup_{x \in [0, \infty)} \int_{x-1}^{x+1} \exp(y - x)\, Q(x, \mathrm{d}y) \le 1 + \exp(1),
\]
whereas ∫_{x−1}^{x+1} exp(y) dy is unbounded in x. Notice that λ only depends on the unperturbed
Markov chain, so that a bound on λ can be combined with any approximation.
Remark 4.4. Let P_α̃ and Pα be φ-irreducible and aperiodic. Then, one can prove
under the assumptions of Corollary 4.2 that P_α̃ is V-uniformly ergodic if s is sufficiently
small. To see this, note that by [31, Theorem 16.0.1] the V-uniform ergodicity of Pα
implies that Pα satisfies their drift condition (V4). By the arguments stated in the proof
of Corollary 3.4, one obtains that P_α̃ also satisfies (V4) for sufficiently small s, and this
implies V-uniform ergodicity. In this case, P_α̃ clearly possesses a stationary distribution,
say π̃, and
\[
\|\pi - \widetilde{\pi}\|_V \le \frac{\lambda s C}{1 - \rho} \cdot \frac{L}{1 - \delta - \lambda s}.
\]
The previous inequality follows by (3.5) and the fact that
\[
\|\pi - \widetilde{\pi}\|_V \le \pi(V) + \widetilde{\pi}(V) < \infty.
\]
Here the finiteness of π(V) follows by the V-uniform ergodicity of Pα, and π̃(V) ≤ L/(1 −
δ − λs) follows by (4.10) and [16, Proposition 4.24].
4.3. Noisy Langevin algorithms for Gibbs random fields
We consider sampling from a posterior density of the form π_y(θ) ∝ ℓ(y | θ) p(θ), based
on a Gibbs random field likelihood ℓ(y | θ) ∝ exp(θ s(y)) with sufficient statistic s,
where the prior density p(θ) is the Lebesgue density of the normal distribution N(0, σp²)
with σp > 0.
We consider the Langevin algorithm, a first order Euler discretization of the SDE of
the Langevin diffusion, see [39]. It is given by (Xn)n∈N0 with
\[
X_n = X_{n-1} + \frac{\sigma^2}{2} \nabla \log \pi_y(X_{n-1}) + Z_n, \qquad n \in \mathbb{N}. \tag{4.11}
\]
Here X0 is a real-valued random variable and (Zn )n∈N is an i.i.d. sequence of random
variables, independent of X0 , with Zn ∼ N (0, σ 2 ) for a parameter σ > 0 which can be
interpreted as the step size in the discretization of the diffusion. It is easily seen that
(Xn )n∈N0 is a Markov chain with transition kernel
\[
P_\sigma(\theta, A) = \int_{\mathbb{R}} \mathbf{1}_A\Bigl( \theta + \frac{\sigma^2}{2} \nabla \log \pi_y(\theta) + z \Bigr)\, N(0, \sigma^2)(\mathrm{d}z), \qquad A \in \mathcal{B}(\mathbb{R}).
\]
In each step, the intractable gradient ∇ log π_y(θ) = s(y) − E_{ℓ(·|θ)} s(Y) − θ/σp² is replaced
by the Monte Carlo estimate
\[
\widehat{\nabla}_N \log \pi_y(\theta) := s(y) - \frac{1}{N} \sum_{i=1}^N s(Y_i) - \frac{\theta}{\sigma_p^2},
\]
based on an i.i.d. sample Y1, ..., YN with Yi ∼ ℓ(· | θ). This leads to the noisy Langevin
algorithm, the sequence of random
variables (X̃n)n∈N0 defined by
\[
\widetilde{X}_n = \widetilde{X}_{n-1} + \frac{\sigma^2}{2} \widehat{\nabla}_N \log \pi_y(\widetilde{X}_{n-1}) + Z_n
= \Bigl( 1 - \frac{\sigma^2}{2 \sigma_p^2} \Bigr) \widetilde{X}_{n-1} + \frac{\sigma^2}{2} \Bigl( s(y) - \frac{1}{N} \sum_{i=1}^N s(Y_i) \Bigr) + Z_n.
\]
The corresponding transition kernel is
\[
P_{\sigma,N}(\theta, A) = \int_{\mathcal{Y}^N} \int_{\mathbb{R}} \mathbf{1}_A\Bigl( \theta + \frac{\sigma^2}{2} \widehat{\nabla}_N \log \pi_y(\theta) + z \Bigr) \prod_{i=1}^N \ell(y_i' \mid \theta)\, \mathrm{d}y_i'\, N(0, \sigma^2)(\mathrm{d}z)
\]
for θ ∈ R and A ∈ B(R), where ∇̂_N is computed from the sample (y1′, ..., yN′). Let us
state a transition of this noisy Langevin Markov chain according to P_{σ,N} in algorithmic
form.
Algorithm 4.3. A single transition from X̃n to X̃n+1 works as follows:
1. Draw an i.i.d. sequence (Yi)1≤i≤N with Yi ∼ ℓ(· | X̃n), call the result (y1′, ..., yN′);
2. Calculate
\[
\widehat{\nabla}_N \log \pi_y(\widetilde{X}_n) := s(y) - \frac{1}{N} \sum_{i=1}^N s(y_i') - \frac{\widetilde{X}_n}{\sigma_p^2};
\]
3. Draw Zn ∼ N(0, σ²), independently of step 1, call the result zn, and set
\[
\widetilde{X}_{n+1} = \widetilde{X}_n + \frac{\sigma^2}{2} \widehat{\nabla}_N \log \pi_y(\widetilde{X}_n) + z_n.
\]
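A sketch of Algorithm 4.3 in code; here sample_lik is a hypothetical routine producing N draws from ℓ(· | θ), which is the only intractable ingredient, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_langevin_step(theta, s_y, s, sample_lik, sigma, sigma_p, N):
    """One transition of Algorithm 4.3: the mean of the sufficient
    statistic under l(.|theta) is replaced by a Monte Carlo average
    over N auxiliary draws."""
    ys = sample_lik(theta, N)                                        # step 1
    grad = s_y - np.mean([s(y) for y in ys]) - theta / sigma_p ** 2  # step 2
    z = sigma * rng.standard_normal()                                # step 3
    return theta + 0.5 * sigma ** 2 * grad + z
```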
From [2, Lemma 3] and by applying arguments of [39], we obtain the following facts
about the noisy Langevin algorithm.
Proposition 4.1. Let ‖s‖∞ = sup_{z∈𝒴^M} |s(z)| be finite with ‖s‖∞ > 0, let V : R →
[1, ∞) be given by V(θ) = 1 + |θ| and assume that σ² < 4σp². Then
1. the function V is a Lyapunov function for Pσ and P_{σ,N}. We have
\[
P_\sigma V(\theta) \le \delta V(\theta) + L\, \mathbf{1}_I(\theta), \qquad P_{\sigma,N} V(\theta) \le \delta V(\theta) + L\, \mathbf{1}_I(\theta), \tag{4.12}
\]
with δ = 1 − σ²/(4σp²), L = σ + σ²‖s‖∞ + σ²/(2σp²) and the interval
\[
I = \Bigl\{ \theta \in \mathbb{R} : |\theta| \le 1 + 4 \sigma_p^2 \|s\|_\infty + \frac{4 \sigma_p^2}{\sigma} \Bigr\};
\]
2. there are distributions πσ and π_{σ,N} on (R, B(R)) which are stationary with respect
to Pσ and P_{σ,N}, respectively;
3. the transition kernels Pσ and P_{σ,N} are V-uniformly ergodic;
4. for N > max{4, 4‖s‖²∞σ⁴, ‖s‖∞^{−3}σ^{−6}} it holds that
\[
\sup_{\theta \in \mathbb{R}} \|P_\sigma(\theta, \cdot) - P_{\sigma,N}(\theta, \cdot)\|_{\mathrm{tv}} \le 6 \max\{ \|s\|_\infty \sigma^2,\, \|s\|_\infty^{-2} \sigma^{-4} \}\, \frac{\log(N)}{N}. \tag{4.13}
\]
Proof. We use the same arguments as in [39, Section 3.1]. One can easily see that the
Markov chains (Xn)n∈N0 and (X̃n)n∈N0 are irreducible with respect to the Lebesgue
measure and weak Feller. Thus, all compact sets are petite, see [31, Proposition 6.2.8].
Hence, for the existence of stationary distributions, say πσ and π_{σ,N}, [31, Theorem 12.3.3],
as well as for the V-uniform ergodicity [31, Theorem 16.0.1], it is enough to show that V
satisfies (4.12). With Z ∼ N(0, σ²), we have
\[
\begin{aligned}
P_\sigma V(\theta) &\le \Bigl(1 - \frac{\sigma^2}{2\sigma_p^2}\Bigr) V(\theta) + \frac{\sigma^2}{2\sigma_p^2} + \frac{\sigma^2}{2} \bigl| s(y) - \mathbb{E}_{\ell(\cdot \mid \theta)} s(Y) \bigr| + \mathbb{E}|Z| \\
&\le \Bigl(1 - \frac{\sigma^2}{2\sigma_p^2}\Bigr) V(\theta) + \frac{\sigma^2}{2\sigma_p^2} + \sigma^2 \|s\|_\infty + \sigma \\
&\le \Bigl(1 - \frac{\sigma^2}{2\sigma_p^2}\Bigr) V(\theta) + \max\Bigl\{ \frac{\sigma^2}{4\sigma_p^2} V(\theta),\, \frac{\sigma^2}{2\sigma_p^2} + \sigma^2 \|s\|_\infty + \sigma \Bigr\} \\
&\le \Bigl(1 - \frac{\sigma^2}{4\sigma_p^2}\Bigr) V(\theta) + \Bigl( \frac{\sigma^2}{2\sigma_p^2} + \sigma^2 \|s\|_\infty + \sigma \Bigr) \mathbf{1}_I(\theta),
\end{aligned}
\]
and the same estimate holds for P_{σ,N}, since |s(y) − N^{−1} Σ_{i=1}^N s(Y_i)| ≤ 2‖s‖∞ as well.
Thus, the assertions from 1. to 3. are proven. The statement of 4. is a consequence of [2,
Lemma 3]. There it is shown that for N > 4‖s‖²∞σ⁴ it holds that
\[
\sup_{\theta \in \mathbb{R}} \|P_\sigma(\theta, \cdot) - P_{\sigma,N}(\theta, \cdot)\|_{\mathrm{tv}} \le \exp\Bigl( \frac{\log(N)}{4 N \|s\|_\infty^2 \sigma^4} \Bigr) - 1 + \frac{4\sqrt{\pi}\, \|s\|_\infty \sigma^2}{N}.
\]
By using exp(θ) − 1 ≤ θ exp(θ) and N > 4, we further estimate the right-hand side by
\[
\Bigl( \frac{K_{N,s,\sigma}}{4 \|s\|_\infty^2 \sigma^4} + \frac{4\sqrt{\pi}\, \|s\|_\infty \sigma^2}{\log(5)} \Bigr) \cdot \frac{\log(N)}{N}
\quad \text{with} \quad
K_{N,s,\sigma} = \exp\Bigl( \frac{\log(N)}{4 N \|s\|_\infty^2 \sigma^4} \Bigr).
\]
Since log(N) · N^{−1/3} < 2, we have the bound K_{N,s,σ} ≤ exp(1) provided that 4N^{2/3}‖s‖²∞σ⁴ ≥
2, which follows from N ≥ ‖s‖∞^{−3}σ^{−6}. The assertion of (4.13) follows now by a simple
calculation.
By using the facts collected in the previous proposition, we can apply the perturba-
tion bound of Theorem 3.2 and obtain a quantitative perturbation bound for the noisy
Langevin algorithm.
Corollary 4.3. Let p0 be a probability measure on (R, B(R)) and set pn = p0 Pσ^n as
well as p̃n,N = p0 P_{σ,N}^n. Suppose that σ² < 4σp². Then, there are numbers ρ ∈ [0, 1) and
C ∈ (0, ∞), independent of n, N, determining
\[
R := \frac{18 \max\{ \|s\|_\infty \sigma^2,\, \|s\|_\infty^{-2} \sigma^{-4} \}}{1 - \rho} \cdot \Bigl( 2 + \max\bigl\{ \mathbb{E}_{p_0}|X|,\, 4\sigma_p^2 (\|s\|_\infty + \sigma^{-1}) \bigr\} \Bigr)
\]
with E_{p0}|X| = ∫_R |θ| dp0(θ), so that for N > 90 max{‖s‖²∞σ⁴, ‖s‖∞^{−3}σ^{−6}} we have
\[
\max\{ \|p_n - \widetilde{p}_{n,N}\|_{\mathrm{tv}},\, \|\pi_\sigma - \pi_{\sigma,N}\|_{\mathrm{tv}} \} \le R\, (2C(L+1))^{2/\log(N)}\, \frac{\log(N)^2}{N},
\]
where L is the constant from Proposition 4.1.
Proof. We have by Proposition 4.1 that Pσ is V-uniformly ergodic with V(θ) = 1 + |θ|,
i.e., there are numbers ρ ∈ [0, 1) and C ∈ (0, ∞) such that
\[
\sup_{\theta \in \mathbb{R}} \frac{\|P_\sigma^n(\theta, \cdot) - \pi_\sigma\|_V}{V(\theta)} \le C \rho^n.
\]
Now, by combining Theorem 3.2 and Remark 3.8 with the results from Proposition 4.1,
we obtain the result.
Remark 4.5. We want to point out that the assumptions imposed are the same as in
[2, Theorem 3.2], but instead of the asymptotic result we provide an explicit estimate.
The numbers ρ ∈ [0, 1) and C ∈ (0, ∞) are not stated in terms of the model parameters.
In principle, these values can be derived from the drift condition (4.12) through [5,
Theorem 1.1].
Acknowledgements
We thank Alexander Mitrophanov and the referees for their valuable comments which
helped to improve the paper. D.R. was supported by the DFG Research Training Group
2088.
References
[1] Ahn, S., Korattikara, A. and Welling, M. (2012). Bayesian posterior sam-
pling via stochastic gradient Fisher scoring. In Proceedings of the 29th International
Conference on Machine Learning.
[2] Alquier, P., Friel, N., Everitt, R. and Boland, A. (2016). Noisy Monte Carlo:
Convergence of Markov chains with approximate transition kernels. Stat. Comp. 26
29–47.
[3] Bardenet, R., Doucet, A. and Holmes, C. (2014). Towards scaling up Markov
chain Monte Carlo: an adaptive subsampling approach. In Proceedings of the 31st
International Conference on Machine Learning 405–413.
[4] Bardenet, R., Doucet, A. and Holmes, C. (2015). On Markov chain Monte
Carlo methods for tall data. arXiv preprint arXiv:1505.02827.
[5] Baxendale, P. (2005). Renewal theory and computable convergence rates for ge-
ometrically ergodic Markov chains. Ann. Appl. Probab. 15 700–738.
[6] Betancourt, M. (2015). The Fundamental Incompatibility of Scalable Hamilto-
nian Monte Carlo and Naive Data Subsampling. In Proceedings of the 32nd Inter-
national Conference on Machine Learning 533-540.
[7] Breyer, L., Roberts, G. and Rosenthal, J. (2001). A note on geometric ergod-
icity and floating-point roundoff error. Statist. Probab. Lett. 53 123–127.
[8] Dobrushin, R. (1956). Central limit theorem for non-stationary Markov chains. I.
Teor. Veroyatnost. i Primenen. 1 72–89.
[9] Dobrushin, R. (1956). Central limit theorem for nonstationary Markov chains. II.
Teor. Veroyatnost. i Primenen. 1 365–425.
[10] Dobrushin, R. (1996). Perturbation methods of the theory of Gibbsian fields. In
Lectures on Probability Theory and Statistics: École d’Été de Probabilités de Saint-
Flour XXIV—1994, 1–66. Springer Berlin Heidelberg, Berlin, Heidelberg.
[11] Durmus, A. and Moulines, E. (2015). Quantitative bounds of convergence for
geometrically ergodic Markov chain in the Wasserstein distance with application to
the Metropolis Adjusted Langevin Algorithm. Stat. Comput. 25 5–19.
[12] Eberle, A. (2014). Error bounds for Metropolis-Hastings algorithms applied to
perturbations of Gaussian measures in high dimensions. Ann. Appl. Probab. 24 337–
377.
[13] Ferré, D., Hervé, L. and Ledoux, J. (2013). Regular perturbation of V -
geometrically ergodic Markov chains. J. Appl. Prob. 50 184–194.
[14] Gibbs, A. (2004). Convergence in the Wasserstein metric for Markov chain Monte
Carlo algorithms with applications to image restoration. Stoch. Models 20 473–492.
[15] Guibourg, D., Hervé, L. and Ledoux, J. (2012). Quasi-compactness of Markov
kernels on weighted-supremum spaces and geometrical ergodicity. Preprint. Avail-
able at http://arxiv.org/abs/1110.3240v5.
[16] Hairer, M. (2006). Ergodic properties of Markov processes. Lecture notes, Univ.
Warwick. Available at http://www.hairer.org/notes/Markov.pdf.
[17] Hairer, M. and Mattingly, J. C. (2011). Yet another look at Harris ergodic
theorem for Markov chains. In Seminar on Stochastic Analysis, Random Fields and
Applications VI 109–117. Springer.
[18] Hairer, M., Stuart, A. and Vollmer, S. (2014). Spectral gaps for a Metropolis-
Hastings algorithm in infinite dimensions. Ann. Appl. Probab. 24 2455–2490.
[19] Johndrow, J., Mattingly, J., Mukherjee, S. and Dunson, D. (2015).
Approximations of Markov Chains and Bayesian Inference. arXiv preprint
arXiv:1508.03387.
[20] Kartashov, N. (1986). Inequalities in theorems of ergodicity and stability for
Markov chains with a common phase space, Parts I and II. Theory Probab. Appl.
30 247–259.
[21] Kartashov, N. and Golomozyi, V. (2013). Maximal coupling procedure and sta-
bility of discrete Markov chains. I. Theory of Probability and Mathematical Statistics
86 93–104.
[22] Keller, G. and Liverani, C. (1999). Stability of the spectrum for transfer oper-
ators. Ann. Scuola Norm. Sup. Pisa Classe Sci. 28 141–152.
[23] Korattikara, A., Chen, Y. and Welling, M. (2014). Austerity in MCMC Land:
Cutting the Metropolis-Hastings Budget. In Proceedings of The 31st International
Conference on Machine Learning 181–189.
[24] Lee, A., Doucet, A. and Latuszyński, K. (2014). Perfect simulation using
atomic regeneration with application to Sequential Monte Carlo. arXiv preprint
arXiv:1407.5770.
[25] Madras, N. and Sezer, D. (2010). Quantitative bounds for Markov chain conver-
gence: Wasserstein and total variation distances. Bernoulli 16 882–908.
[26] Mao, Y., Zhang, M. and Zhang, Y. (2013). A Generalization of Dobrushin coef-
ficient. Chinese J. Appl. Probab. Statist. 29 489–494.
[27] Marin, J. M., Pudlo, P., Robert, C. and Ryder, R. (2012). Approximate
Bayesian computational methods. Stat. Comp. 22 1167–1180.
[28] Mathé, P. (2004). Numerical integration using V-uniformly ergodic Markov chains.
J. Appl. Probab. 41 1104–1112.
[29] Medina-Aguayo, J., Lee, A. and Roberts, G. (2016). Stability of noisy
Metropolis–Hastings. Stat. Comp. 26 1187–1211.
[30] Mengersen, K. and Tweedie, R. (1996). Rates of convergence of the Hastings
and Metropolis algorithms. Ann. Statist. 24 101–121.
[31] Meyn, S. and Tweedie, R. (2009). Markov chains and stochastic stability, Second
ed. Cambridge University Press.
[32] Mitrophanov, A. (2003). Stability and exponential convergence of continuous-
time Markov chains. J. Appl. Probab. 40 970–979.
[33] Mitrophanov, A. (2005). Sensitivity and convergence of uniformly ergodic Markov
chains. J. Appl. Prob. 42 1003–1014.
[34] Ollivier, Y. (2009). Ricci curvature of Markov chains on metric spaces. J. Funct.
Anal. 256 810–864.
[35] Pillai, N. and Smith, A. (2015). Ergodicity of Approximate MCMC Chains with
Applications to Large Data Sets. arXiv preprint arXiv:1405.0182v2.
[36] Roberts, G. and Rosenthal, J. (1997). Geometric ergodicity and hybrid Markov
chains. Electron. Comm. Probab. 2 no. 2, 13–25.
[37] Roberts, G. and Rosenthal, J. (2004). General state space Markov chains and
MCMC algorithms. Probability Surveys 1 20–71.
[38] Roberts, G., Rosenthal, J. and Schwartz, P. (1998). Convergence properties
of perturbed Markov chains. J. Appl. Probab. 35 1–11.
[39] Roberts, G. and Tweedie, R. (1996). Exponential convergence of Langevin dis-
tributions and their discrete approximations. Bernoulli 2 341–363.