Perturbation Theory For Markov Chains Via Wasserstein Distance

Submitted to Bernoulli

arXiv: arXiv:1503.04123

arXiv:1503.04123v3 [stat.CO] 23 Feb 2017


Institut für Mathematische Stochastik, Universität Göttingen, Goldschmidtstraße 7, 37077
Göttingen, Germany. E-mail: daniel.rudolf@uni-goettingen.de
Department of Econometrics & OR, Tilburg University, P.O.box 90153, 5000 LE Tilburg, The
Netherlands. E-mail: n.f.f.schweizer@uvt.nl

Perturbation theory for Markov chains addresses the question of how small differences in the
transition probabilities of Markov chains are reflected in differences between their
distributions. We prove powerful and flexible bounds on the distance of the nth step
distributions of two Markov chains when one of them satisfies a Wasserstein ergodicity
condition. Our work is motivated by the recent interest in approximate Markov chain Monte
Carlo (MCMC) methods in the analysis of big data sets. By using an approach based on
Lyapunov functions, we provide estimates for geometrically ergodic Markov chains under weak
assumptions. In an autoregressive model, our bounds cannot be improved in general. We
illustrate our theory by showing quantitative estimates for approximate versions of two
prominent MCMC algorithms, the Metropolis-Hastings and stochastic Langevin algorithms.
Keywords: perturbations, Markov chains, Wasserstein distance, MCMC, big data.

1. Introduction
Markov chain Monte Carlo (MCMC) algorithms are one of the key tools in computational
statistics. They are used for the approximation of expectations with respect to probability
measures given by unnormalized densities. For almost all classical MCMC methods it is
essential to evaluate the target density. In many cases, this requirement is not an issue,
but there are also important applications where it is a problem. This includes applications
where the density is not available in closed form, see [27], or where an exact evaluation is
computationally too demanding, see [2]. Problems of this kind lead to the approximation
of Markov chains and to the question of how small differences in the transitions of two
Markov chains affect the differences between their distributions.
In Bayesian inference when big data sets are involved an exact evaluation of the target
density is typically very expensive. For instance, in each step of a Metropolis-Hastings
algorithm the likelihood of a proposed state must be computed. Every observation in
the underlying data set contributes to the likelihood and must be taken into account
in the calculation. This may result in evaluating several terabytes of data in each step
of the algorithm. These are the reasons for the recent interest in numerically cheaper
approximations of classical MCMC methods, see [3, 4, 23, 42, 47]. A reduction of the
2 D. Rudolf and N. Schweizer

computational costs can, e.g., be achieved by relying on a moderately sized random sub-
sample of the data in each step of the algorithm. The function value of the target density
is thus replaced by an approximation. Naturally, subsampling and alternative attempts
at “cutting the Metropolis-Hastings budget” [23] induce additional biases. These biases
can lead to dramatic changes in the properties of the algorithms as discussed in [6].
We thus need a better theoretical understanding of the behavior of such approximate
MCMC methods. Indeed, a number of recent papers prove estimates of these biases, see
[2, 3, 19, 24, 29, 35]. A key tool in these papers are perturbation bounds for Markov chains.
One such result for uniformly ergodic Markov chains due to Mitrophanov [33] is used
in [2]. A similar perturbation estimate implicitly appears in [3]. The focus on uniformly
ergodic Markov chains is rather restrictive, especially for high-dimensional, non-compact
state spaces such as R^m. Working with Wasserstein distances has recently turned out to
be a fruitful alternative in several contributions on high-dimensional MCMC algorithms,
see [11, 12, 14, 18, 25].
We provide perturbation bounds based on Wasserstein distances, which lead to flexi-
ble quantitative estimates of the biases of approximate MCMC methods. Our first main
result is the Wasserstein perturbation bound of Theorem 3.1. Under a Wasserstein er-
godicity assumption, explained in Section 2, it provides an upper bound on the distance
of the nth step distribution between an ideal and an approximating Markov chain in
terms of the difference between their one-step transition probabilities. The result is well-
suited for applications on a non-compact state space, since the difference of the one-step
transition probabilities is measured by a weighted supremum with respect to a suitable
Lyapunov function. For an autoregressive model, we show in Section 4.1 that the resulting
perturbation bound cannot be improved in general. As a consequence of the Wasserstein
approach we also obtain perturbation estimates for geometrically ergodic Markov chains.
We first adapt our Wasserstein perturbation bound to this setting. Then, as a second
main result, Theorem 3.2, we prove a refined estimate for geometrically ergodic chains
where the perturbation is measured by a weighted total variation distance. Our per-
turbation bounds, and earlier ones in [32, 33], establish a direct connection between an
exponential convergence property for Markov chains and their robustness to perturba-
tions. In particular, fast convergence to stationarity implies insensitivity to perturbations
in the transition probabilities. Geometric ergodicity has been studied extensively in the
MCMC literature. Thus, our estimates can be used in combination with many existing
convergence results for MCMC algorithms. In Section 4, we illustrate the applicability
of both theorems by generalizing recent findings on approximate Metropolis-Hastings
algorithms from [3] and on noisy Langevin algorithms for Gibbs random fields from [2].

1.1. Related literature

We refer to [20, 21] for an overview of the classical literature on perturbation theory
for Markov chains. However, as Stuart and Shardlow observed in [41], the classical as-
sumptions on the perturbation might be too restrictive for many interesting applications.
As a consequence, they develop a perturbation theory for geometrically ergodic Markov
chains [41] which requires controlling perturbations of iterated transition kernels in a
weaker sense. In our bounds for geometrically ergodic Markov chains, we have similar
flexibility in the perturbation due to the Lyapunov-type stability condition, and require
only a control on the errors of one-step transition kernels.
Mitrophanov, in [33], considers uniformly ergodic Markov chains and provides the
best estimates in those settings. In the geometrically ergodic case, there are further
related results, see [13] and the references therein. Compared to [13], our focus is on non-
asymptotic estimates with explicit constants, while their main focus is on qualitative
results such as inheritance of geometric ergodicity by the perturbation. Earlier related
results on perturbations induced by floating-point roundoff errors are shown in [7, 38].
Finally, let us point out that our paper is complementary to the work of Pillai and
Smith [35] who also present Wasserstein perturbation bounds for Markov chains. When
moving beyond the uniformly ergodic Markov chain case, an important challenge is to
handle the issue that in many applications suprema of relevant quantities over the whole
state space are infinite. The authors of [35] guarantee finiteness of supremum norms
by restricting attention to subsets of the state space. Their bounds thus involve exit
probabilities from these subsets. Our approach circumvents these issues by relying on
Lyapunov-type stability conditions for the approximate algorithm.

2. Wasserstein ergodicity
Let G be a Polish space and B(G) be the corresponding Borel σ-algebra. Let d be a
metric, possibly different from the one which makes the space Polish, which is assumed
to be lower semi-continuous with respect to the product topology of G. Let P be the set
of all Borel probability measures on (G, B(G)). Then, we define the Wasserstein distance
of ν, µ ∈ P by
\[
W(\nu, \mu) = \inf_{\xi \in M(\nu,\mu)} \int_G \int_G d(x, y) \, \mathrm{d}\xi(x, y),
\]

where M (ν, µ) is the set of all couplings of ν and µ, that is, all probability measures ξ
on G × G with marginals ν and µ. Indeed, on P the Wasserstein distance satisfies the
properties of a metric but is not necessarily finite, see [46, Chapter 6]. For a measurable
function f : G → R we define

\[
\|f\|_{\mathrm{Lip}} = \sup_{x,y \in G,\, x \neq y} \frac{|f(x) - f(y)|}{d(x, y)},
\]

which leads to the well-known duality formula


\[
W(\nu, \mu) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \left| \int_G f(x) \, (\mathrm{d}\nu(x) - \mathrm{d}\mu(x)) \right|. \tag{2.1}
\]

For details we refer to [45, Chapter 1.2]. By δx we denote the probability measure con-
centrated at x. Hence W (δx , δy ) = d(x, y) is finite for x, y ∈ G.

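Let P be a transition kernel on (G, B(G)) which defines a linear operator on the set of
Borel probability measures, µ ↦ µP, given by
\[
\mu P(A) = \int_G P(x, A) \, \mathrm{d}\mu(x), \qquad \mu \in \mathcal{P},\ A \in \mathcal{B}(G).
\]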

With this notation we have δx P (A) = P (x, A). Further, for a measurable function
f : G → R and µ ∈ P we have
\[
\int_G f(x) \, \mathrm{d}(\mu P)(x) = \int_G P f(x) \, \mathrm{d}\mu(x),
\]
with Pf(x) = ∫_G f(y) P(x, dy), whenever one of the integrals exists, see for example [40,
Lemma 3.6]. Now, by
\[
\tau(P) := \sup_{x,y \in G,\, x \neq y} \frac{W(\delta_x P, \delta_y P)}{d(x, y)}
\]
we define the generalized ergodicity coefficient of the transition kernel P. This coefficient can
be understood as a generalized Dobrushin ergodicity coefficient, see [8, 9]. Dobrushin
himself called τ (P ) the Kantorovich norm of P , see [10, formula (14.34)]. Finally, τ (P )
also provides a lower bound of the coarse Ricci curvature of P introduced in [34].
Two essential properties of the ergodicity coefficient are submultiplicativity and con-
tractivity, see [10, Proposition 14.3 and Proposition 14.4].

Proposition 2.1. For two transition kernels P and P̃ on (G, B(G)) and µ, ν ∈ P, we have
\[
\tau(P \tilde{P}) \le \tau(P) \, \tau(\tilde{P}) \quad \text{(Submultiplicativity)},
\]
and
\[
W(\nu P, \mu P) \le \tau(P) \, W(\nu, \mu) \quad \text{(Contractivity)}.
\]

As an immediate consequence of this contractivity, we obtain the following corollary.

Corollary 2.1. Let P be a transition kernel with stationary distribution π, i.e. πP = π,
and assume for some (and hence any) x0 ∈ G it holds that ∫_G d(x0, x) dπ(x) < ∞. Then
\[
\sup_{x \in G} \frac{W(\delta_x P, \pi)}{W(\delta_x, \pi)} \le \tau(P). \tag{2.2}
\]
Proof. Because of the assumption ∫_G d(x0, x) dπ(x) < ∞ we have that W(δx, π) is finite
for any x ∈ G. Since π = πP, the contractivity in Proposition 2.1 gives
W(δxP, π) = W(δxP, πP) ≤ τ(P) W(δx, π), which is the assertion.

Remark 2.1. For some special cases one also has an estimate of the form (2.2) in the
other direction. To this end, consider the trivial metric d(x, y) = 2 · 1_{x≠y} with indicator
function
\[
\mathbf{1}_{x \neq y} =
\begin{cases}
1 & x \neq y, \\
0 & x = y.
\end{cases}
\]
Further, let
\[
\|q\|_{tv} := \sup_{\|f\|_\infty \le 1} \left| \int_G f(y) \, \mathrm{d}q(y) \right| = 2 \sup_{A \in \mathcal{B}(G)} |q(A)|
\]
be the total variation norm of a signed measure q on G. In this setting W(µ, ν) =
‖µ − ν‖tv. For x, y ∈ G with x ≠ y we have ‖δx − δy‖tv = d(x, y) = 2 so that
\[
\tau_1(P) = \frac{1}{2} \sup_{x,y \in G,\, x \neq y} \|\delta_x P - \delta_y P\|_{tv}. \tag{2.3}
\]
The “1” in the subscript of τ1(P) indicates that we use the trivial metric. By applying the
triangle inequality of the total variation norm we obtain τ1(P) ≤ sup_{x∈G} ‖δxP − π‖tv. If
additionally π is atom-free, i.e., π({y}) = 0 for all y ∈ G, we have ‖δy − π‖tv = 2. Then,
the previous consideration and (2.2) lead to
\[
\frac{1}{2} \sup_{x \in G} \|\delta_x P - \pi\|_{tv} \le \tau_1(P) \le \sup_{x \in G} \|\delta_x P - \pi\|_{tv}.
\]
For the moment, let us assume that P is uniformly ergodic, that is, there exist numbers
ρ ∈ [0, 1) and C ∈ (0, ∞) such that
\[
\sup_{x \in G} \|\delta_x P^n - \pi\|_{tv} \le C \rho^n, \qquad n \in \mathbb{N}.
\]
An immediate consequence of the uniform ergodicity is that τ1(P^n) ≤ Cρ^n.

Also note that if there is an n0 ∈ N for which τ(P^{n0}) < 1 we have by the submultiplicativity,
see Proposition 2.1, that τ(P^n) converges exponentially to zero. This motivates
imposing the following assumption, which contains the idea of measuring convergence of
δxP^n to π in terms of τ(P^n).

Assumption 2.1 (Wasserstein ergodicity). For the transition kernel P there exist numbers
ρ ∈ [0, 1) and C ∈ (0, ∞) such that
\[
\tau(P^n) = \sup_{x,y \in G,\, x \neq y} \frac{W(P^n(x, \cdot), P^n(y, \cdot))}{d(x, y)} \le C \rho^n, \qquad n \in \mathbb{N}. \tag{2.4}
\]
For any probability measure p0 ∈ P, a transition kernel P with stationary distribution
π and pn = p0P^n we have under the Wasserstein ergodicity condition that
\[
W(p_n, \pi) \le C \rho^n \, W(p_0, \pi).
\]
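
The quantities of this section can be made concrete on a finite state space. The following minimal sketch assumes the trivial metric d(x, y) = 2 · 1_{x≠y} of Remark 2.1, so that the Wasserstein distance is the total variation norm and τ1 is given by (2.3). The two 3-state kernels `P` and `Q` and all function names are hypothetical choices made only for illustration.

```python
def tv_norm(mu, nu):
    """Total variation norm ||mu - nu||_tv = sum_i |mu_i - nu_i| (values in [0, 2])."""
    return sum(abs(a - b) for a, b in zip(mu, nu))


def vec_mat(mu, P):
    """Row vector times matrix: the measure mu P."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu))) for j in range(len(P[0]))]


def mat_mul(A, B):
    """Matrix product A B of two transition matrices."""
    return [vec_mat(row, B) for row in A]


def tau1(P):
    """Ergodicity coefficient (2.3): half the largest TV norm between two rows of P."""
    n = len(P)
    return 0.5 * max(tv_norm(P[x], P[y]) for x in range(n) for y in range(x + 1, n))


# Hypothetical 3-state kernels, chosen only for illustration.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
Q = [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]]

mu, nu = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]

# Submultiplicativity and contractivity from Proposition 2.1.
submult_ok = tau1(mat_mul(P, Q)) <= tau1(P) * tau1(Q) + 1e-12
contract_ok = tv_norm(vec_mat(mu, P), vec_mat(nu, P)) <= tau1(P) * tv_norm(mu, nu) + 1e-12

# tau1(P^n) for n = 1..5; by submultiplicativity it is at most tau1(P)**n.
powers, Pn = [], P
for n in range(1, 6):
    powers.append(tau1(Pn))
    Pn = mat_mul(Pn, P)
```

For this `P`, Assumption 2.1 holds with C = 1 and ρ = τ1(P), because submultiplicativity gives τ1(P^n) ≤ τ1(P)^n.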

3. Perturbation bounds
By N0 = {0, 1, 2, . . . } we denote the non-negative integers and assume that all random
variables are defined on a common probability space (Ω, F, ℙ) mapping to a Polish space
G equipped with a lower semi-continuous metric d. Let the sequence of random variables
(Xn)n∈N0 be a Markov chain with transition kernel P and initial distribution p0, i.e., we
have almost surely
\[
\mathbb{P}(X_n \in A \mid X_0, \dots, X_{n-1}) = \mathbb{P}(X_n \in A \mid X_{n-1}) = P(X_{n-1}, A), \qquad n \in \mathbb{N},
\]
and p0(A) = ℙ(X0 ∈ A) for any measurable set A ⊆ G. Assume that (X̃n)n∈N0 is another
Markov chain with transition kernel P̃ and initial distribution p̃0. We denote by pn the
distribution of Xn and by p̃n the distribution of X̃n. Throughout the paper, (Xn)n∈N0 is
considered to be the ideal, unperturbed Markov chain we would like to simulate while
(X̃n)n∈N0 is the perturbed Markov chain that we actually implement.

3.1. Wasserstein perturbation bound

Similarly to [33, Theorem 3.1], we show quantitative bounds on the difference of pn and
p̃n, but use the Wasserstein distance instead of total variation. Besides Assumption 2.1,
the bounds depend on the difference of the initial distributions and on a suitably weighted
one-step difference between P and P̃.

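Theorem 3.1 (Wasserstein perturbation bound). Let Assumption 2.1 be satisfied with
the numbers C ∈ (0, ∞) and ρ ∈ [0, 1), i.e., τ(P^n) ≤ Cρ^n. Assume that there are numbers
δ ∈ (0, 1) and L ∈ (0, ∞) and a measurable Lyapunov function Ṽ : G → [1, ∞) of P̃ such
that
\[
(\tilde{P} \tilde{V})(x) \le \delta \tilde{V}(x) + L. \tag{3.1}
\]
Let
\[
\gamma = \sup_{x \in G} \frac{W(\delta_x P, \delta_x \tilde{P})}{\tilde{V}(x)}
\quad \text{and} \quad
\kappa = \max\left\{ \tilde{p}_0(\tilde{V}), \frac{L}{1 - \delta} \right\},
\]
with p̃0(Ṽ) = ∫_G Ṽ(x) dp̃0(x). Then
\[
W(p_n, \tilde{p}_n) \le C \left( \rho^n \, W(p_0, \tilde{p}_0) + (1 - \rho^n) \frac{\gamma \kappa}{1 - \rho} \right). \tag{3.2}
\]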

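Proof. By induction one can show that
\[
\tilde{p}_n - p_n = (\tilde{p}_0 - p_0) P^n + \sum_{i=0}^{n-1} \tilde{p}_i (\tilde{P} - P) P^{n-i-1}, \qquad n \in \mathbb{N}. \tag{3.3}
\]
We have
\[
W(\tilde{p}_i P, \tilde{p}_i \tilde{P}) \le \int_G W(\delta_x P, \delta_x \tilde{P}) \, \mathrm{d}\tilde{p}_i(x) \le \gamma \int_G \tilde{V}(x) \, \mathrm{d}\tilde{p}_i(x).
\]
Moreover, for i ≥ 0 we have
\[
\int_G \tilde{V}(x) \, \mathrm{d}\tilde{p}_i(x) = \int_G \tilde{P}^i \tilde{V}(x) \, \mathrm{d}\tilde{p}_0(x)
\le \delta^i \tilde{p}_0(\tilde{V}) + \frac{L(1 - \delta^i)}{1 - \delta}
\le \max\left\{ \tilde{p}_0(\tilde{V}), \frac{L}{1 - \delta} \right\},
\]
so that we obtain W(p̃iP, p̃iP̃) ≤ γκ. By this fact and the contractivity in Proposition 2.1 we have
\[
W(\tilde{p}_i \tilde{P} P^{n-i-1}, \tilde{p}_i P P^{n-i-1}) \le \gamma \kappa \cdot \tau(P^{n-i-1}). \tag{3.4}
\]
Then, by (3.3), (3.4) and the triangle inequality of the Wasserstein distance we have
\[
\begin{aligned}
W(p_n, \tilde{p}_n) &\le W(p_0 P^n, \tilde{p}_0 P^n) + \sum_{i=0}^{n-1} W(\tilde{p}_i \tilde{P} P^{n-i-1}, \tilde{p}_i P P^{n-i-1}) \\
&\le W(p_0, \tilde{p}_0) \, \tau(P^n) + \gamma \kappa \sum_{i=0}^{n-1} \tau(P^i).
\end{aligned}
\]
Finally, by (2.4) we obtain Σ_{i=0}^{n−1} τ(P^i) ≤ C(1 − ρ^n)/(1 − ρ), which allows us to complete the
proof.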
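
The telescoping identity (3.3) can be checked numerically on a finite state space, where measures are row vectors and kernels are stochastic matrices. The sketch below is an illustration under that assumption; the 2-state kernels and all names are hypothetical.

```python
def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def apply_n(v, M, n):
    """Compute v M^n by n successive multiplications."""
    for _ in range(n):
        v = vec_mat(v, M)
    return v


def direct_difference(p0, p0_t, P, P_t, n):
    """Left-hand side of (3.3): p~_n - p_n computed directly."""
    pn, pn_t = apply_n(p0, P, n), apply_n(p0_t, P_t, n)
    return [a - b for a, b in zip(pn_t, pn)]


def telescoped_difference(p0, p0_t, P, P_t, n):
    """Right-hand side of (3.3): (p~_0 - p_0) P^n + sum_i p~_i (P~ - P) P^(n-i-1)."""
    total = apply_n([a - b for a, b in zip(p0_t, p0)], P, n)
    pi_t = list(p0_t)                       # p~_i, starting with i = 0
    for i in range(n):
        one_step = [a - b for a, b in zip(vec_mat(pi_t, P_t), vec_mat(pi_t, P))]
        term = apply_n(one_step, P, n - i - 1)
        total = [t + s for t, s in zip(total, term)]
        pi_t = vec_mat(pi_t, P_t)           # advance the perturbed chain
    return total


# Hypothetical 2-state example.
P = [[0.7, 0.3], [0.4, 0.6]]
P_t = [[0.72, 0.28], [0.37, 0.63]]
p0, p0_t = [1.0, 0.0], [0.8, 0.2]

max_gap = max(
    abs(a - b)
    for n in range(1, 9)
    for a, b in zip(direct_difference(p0, p0_t, P, P_t, n),
                    telescoped_difference(p0, p0_t, P, P_t, n))
)
```

Up to floating-point rounding, `max_gap` vanishes, which reflects that (3.3) is an exact algebraic identity.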

Remark 3.1. The parameter κ is an upper bound on p̃i(Ṽ). It can be interpreted as
a measure for the stability of the perturbed Markov chain. The parameter γ quantifies
with a weighted supremum norm the one-step difference between P and P̃. The use of the
Lyapunov function increases the flexibility of the resulting estimate, since larger values of
Ṽ compensate larger values of the Wasserstein distance between the kernels. Notice that
the existence of a Lyapunov function satisfying (3.1) is weaker than assuming Ṽ-uniform
ergodicity of P̃ since it is not associated with a small set condition. In particular, the
condition is satisfied for any P̃ with the trivial choice Ṽ(x) = 1 for all x ∈ G, see Corollary
3.2. As we will see in Section 4, allowing for non-trivial choices of Ṽ considerably increases
the applicability of our results.
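
To see how the constants interact, the right-hand side of (3.2) can be evaluated directly. The helper names and the numerical values below are hypothetical; the sketch only encodes the formulas for κ and for the bound as stated in Theorem 3.1.

```python
def kappa(p0_tilde_V, L, delta):
    """kappa = max{ p~_0(V~), L / (1 - delta) } from Theorem 3.1."""
    return max(p0_tilde_V, L / (1.0 - delta))


def wasserstein_bound(n, C, rho, gamma, kap, w0):
    """Right-hand side of (3.2); w0 stands for W(p_0, p~_0)."""
    return C * (rho ** n * w0 + (1.0 - rho ** n) * gamma * kap / (1.0 - rho))


# Hypothetical constants: the initial error w0 is forgotten at rate rho,
# while the perturbation contributes at most C * gamma * kappa / (1 - rho).
C, rho, gamma, L, delta = 1.0, 0.5, 0.01, 1.0, 0.5
kap = kappa(1.5, L, delta)
bounds = [wasserstein_bound(n, C, rho, gamma, kap, w0=1.0) for n in (0, 1, 5, 200)]
limit = C * gamma * kap / (1.0 - rho)
```

The limiting value `limit` is the quantity that reappears in the bound on the stationary distributions when the initial term has died out.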

If P̃ has a stationary distribution, say π̃ ∈ P, as a consequence of the previous theorem,
we obtain bounds on the difference between π and π̃.

Corollary 3.1. Let the assumptions of Theorem 3.1 be satisfied. Assume that P̃ has a
stationary distribution π̃ ∈ P and let W(π, π̃) be finite. Then
\[
W(\pi, \tilde{\pi}) \le \frac{\gamma C}{1 - \rho} \cdot \frac{L}{1 - \delta}. \tag{3.5}
\]

Proof. By Theorem 3.1 we obtain with p0 = π, p̃0 = π̃, the stationarity of the
distributions π, π̃ and by letting n → ∞ that
\[
W(\pi, \tilde{\pi}) \le \frac{\gamma \kappa C}{1 - \rho}.
\]
By the Lyapunov condition and [16, Proposition 4.24], it holds that
\[
\tilde{\pi}(\tilde{V}) = \int_G \tilde{V}(x) \, \mathrm{d}\tilde{\pi}(x) \le \frac{L}{1 - \delta},
\]
which leads to κ ≤ L/(1 − δ) and finishes the proof.

Remark 3.2. It may seem artificial to assume W(π, π̃) < ∞ but this is needed for
the limit argument in the proof. This condition is often satisfied a priori. For example,
it holds if the metric is bounded, i.e., sup_{x,y∈G} d(x, y) is finite, or, more generally, if the
distributions π and π̃ possess a first moment in the sense that there exist x0, x̃0 ∈ G such
that
\[
\int_G d(x_0, x) \, \mathrm{d}\pi(x) < \infty, \qquad \int_G d(\tilde{x}_0, x) \, \mathrm{d}\tilde{\pi}(x) < \infty.
\]

As pointed out in Remark 3.1, we do not need to impose condition (3.1) to obtain a
non-trivial perturbation bound:

Corollary 3.2. Assume that Assumption 2.1 holds with the numbers C ∈ (0, ∞) and
ρ ∈ [0, 1), i.e., τ(P^n) ≤ Cρ^n, and let
\[
\gamma := \sup_{x \in G} W(\delta_x P, \delta_x \tilde{P}).
\]
Then
\[
W(p_n, \tilde{p}_n) \le C \left( \rho^n \, W(p_0, \tilde{p}_0) + (1 - \rho^n) \frac{\gamma}{1 - \rho} \right). \tag{3.6}
\]

Proof. The statement follows by Theorem 3.1 with Ṽ(x) = 1 and L = 1 − δ.
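
As an illustration of (3.6), consider a finite state space with the trivial metric of Remark 2.1, so that W is the total variation norm. The sketch assumes C = 1 and ρ = τ1(P) < 1, which is admissible because τ1(P^n) ≤ τ1(P)^n by submultiplicativity. The kernels and function names are hypothetical.

```python
def tv_norm(mu, nu):
    """Total variation norm sum_i |mu_i - nu_i|."""
    return sum(abs(a - b) for a, b in zip(mu, nu))


def vec_mat(mu, P):
    """Row vector times matrix: the measure mu P."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu))) for j in range(len(P[0]))]


def tau1(P):
    """Ergodicity coefficient (2.3) for the trivial metric."""
    n = len(P)
    return 0.5 * max(tv_norm(P[x], P[y]) for x in range(n) for y in range(x + 1, n))


def check_corollary_32(P, P_t, p0, p0_t, n_max):
    """Return (n, ||p_n - p~_n||_tv, right-hand side of (3.6)) for n = 0..n_max."""
    rho = tau1(P)                                   # assumes rho < 1 and C = 1
    gamma = max(tv_norm(r, r_t) for r, r_t in zip(P, P_t))
    w0 = tv_norm(p0, p0_t)
    pn, pn_t, rows = list(p0), list(p0_t), []
    for n in range(n_max + 1):
        rhs = rho ** n * w0 + (1.0 - rho ** n) * gamma / (1.0 - rho)
        rows.append((n, tv_norm(pn, pn_t), rhs))
        pn, pn_t = vec_mat(pn, P), vec_mat(pn_t, P_t)
    return rows


# Hypothetical kernel and a small perturbation of it.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
P_t = [[0.52, 0.48, 0.00],
       [0.25, 0.50, 0.25],
       [0.00, 0.48, 0.52]]

rows = check_corollary_32(P, P_t, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0], n_max=30)
```

Each tuple in `rows` pairs the actual distance with its bound, so the slack of the estimate can be inspected step by step.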

Remark 3.3. For the trivial metric d(x, y) = 2 · 1_{x≠y} the last corollary states essentially
the result of [33, Theorem 3.1], where instead of the general Wasserstein distance the
total variation distance is used. There, the bound’s dependence on C and ρ can be further
improved by using the a priori bound τ1(P^n) ≤ 1 in addition to uniform ergodicity. For
another metric d such an a priori bound is in general not available.

Remark 3.4. Table 1 provides a detailed comparison between our Theorem 3.1 and
the related Wasserstein perturbation result of Pillai and Smith, [35, Lemma 3.3]. An
important ingredient in their estimate is a set Ĝ ⊆ G which can be interpreted as
the part of G where both Markov chains remain with high probability. When a good
uniform upper bound on W(δxP, δxP̃) for all x ∈ G is available, we can choose Ĝ = G
in [35, Lemma 3.3] and Ṽ(x) = 1 in Theorem 3.1. In that case, both results essentially
simplify to Corollary 3.2. The results become entirely different when such a bound is not
available or too rough. In our estimate, one then needs a non-trivial Lyapunov function
for P̃ and a uniform upper bound on W(δxP, δxP̃)/Ṽ(x). To apply their estimate, one
needs a uniform bound on W(δxP, δxP̃) for all x ∈ Ĝ. In addition, a bound on π(G \ Ĝ),
Lyapunov functions and estimates of the exit probabilities from Ĝ of both Markov chains
need to be available. Finally, while [35, Lemma 3.3] requires slightly more regularity on
the Lyapunov function, contractivity of the unperturbed transition kernel P (with C = 1)
is not needed on the whole state space but only on Ĝ.

Table 1. Comparison of the Wasserstein perturbation bound of [35, Lemma 3.3] and Theorem 3.1.
Here ρ, δ ∈ [0, 1), L, c_p, C, D ∈ (0, ∞), V : G → [0, ∞), Ṽ : G → [1, ∞) and E(x) = ∫_G d(x, y) dπ(y).

| | Assumptions of [35, Lemma 3.3] | Assumptions of Theorem 3.1 |
|---|---|---|
| Convergence property | ∃ Ĝ ⊆ G s.t. sup_{x,y∈Ĝ, x≠y} W(δxP, δyP)/d(x, y) ≤ ρ | τ(P^n) ≤ Cρ^n |
| Lyapunov function | PV(x) ≤ δV(x) + L and P̃V(x) ≤ δV(x) + L | P̃Ṽ(x) ≤ δṼ(x) + L |
| Drift regularity | E[V(X_{n+1}) \| X_n = x, X_{n+1} ∉ Ĝ] ≤ C; E[V(X̃_{n+1}) \| X̃_n = x, X̃_{n+1} ∉ Ĝ] ≤ C; ∃ p ∈ Ĝ s.t. d(x, p) ≤ V(x) + c_p | - |
| Perturbation error | γ̂ := sup_{x∈Ĝ} W(δxP, δxP̃) | γ := sup_{x∈G} W(δxP, δxP̃)/Ṽ(x) |
| Regularity of π | ∫_{G\Ĝ} V(x) dπ(x) ≤ D; π(G \ Ĝ) small | - |
| Conclusion: upper bound of W(δxP̃^n, π) | ρ^n E(x) + […] + δ^n (V(x) + D) + (2L/(1−δ)) c_p π(G \ Ĝ) + 2(1 − ℙ[{X_j}_{j=1}^{n−1} ∪ {X̃_j}_{j=1}^{n−1} ⊆ Ĝ])(C + L/(1−δ) + c_p) | Cρ^n E(x) + (Cγ/(1−ρ)) max{Ṽ(x), L/(1−δ)} |

3.2. Perturbation bounds for geometrically ergodic Markov chains

In this section, we derive general perturbation bounds for geometrically ergodic Markov
chains. First, we recall some results from [17], [26] and [36] which are helpful to apply
our Wasserstein perturbation bounds in the geometrically ergodic case. Then we present
the new estimates:
• Corollary 3.3 is an application of Theorem 3.1 with Wasserstein distances replaced
by V -norms of differences between measures.

• In Corollary 3.4, we show that having a Lyapunov function V for P is sufficient
for our bounds if the transition kernels P and P̃ are sufficiently close (in a suitable
sense).
• In Theorem 3.2, we provide a quantitative perturbation bound which still applies
if we can only control the total variation distance between P(x, ·) and P̃(x, ·). To
measure the perturbation in such a weak sense is new for geometrically ergodic
Markov chains.
A transition kernel P with stationary distribution π is called geometrically ergodic if
there are a constant ρ ∈ [0, 1) and a measurable function C : G → (0, ∞) such that for
π-a.e. x ∈ G we have
\[
\|P^n(x, \cdot) - \pi\|_{tv} \le C(x) \rho^n.
\]
For φ-irreducible and aperiodic Markov chains, it is well known that geometric ergodicity
is equivalent to V-uniform ergodicity, see [36, Proposition 2.1]. Namely, if P is geometrically
ergodic, then there exists a π-a.e. finite measurable function V : G → [1, ∞] with
finite moments with respect to π and there are constants ρ ∈ [0, 1) and C ∈ (0, ∞) such
that
\[
\|P^n(x, \cdot) - \pi\|_V := \sup_{|f| \le V} \left| \int_G f(y) \, (P^n(x, \mathrm{d}y) - \pi(\mathrm{d}y)) \right| \le C V(x) \rho^n, \qquad x \in G,\ n \in \mathbb{N},
\]
that is,
\[
\sup_{x \in G} \frac{\|P^n(x, \cdot) - \pi\|_V}{V(x)} \le C \rho^n. \tag{3.7}
\]
The following result establishes the connection between V -norms and certain Wasserstein
distances. It is basically due to Hairer and Mattingly [17], see also [26].

Lemma 3.1. Assume that V is lower semi-continuous on G. For x, y ∈ G, let us define
the metric
\[
d_V(x, y) = (V(x) + V(y)) \, \mathbf{1}_{x \neq y} =
\begin{cases}
V(x) + V(y) & x \neq y, \\
0 & x = y.
\end{cases}
\]
Then, for any µ, ν ∈ P we have
\[
\|\mu - \nu\|_V = W_{d_V}(\mu, \nu), \tag{3.8}
\]
where W_{dV} denotes the Wasserstein distance based on the metric dV.

Lower semi-continuity of V implies lower semi-continuity of dV, which leads to the
duality formula (2.1) by [45, Theorem 1.14]. We thus impose the standing assumption of
lower semi-continuity of V whenever we speak of V-uniform ergodicity in the following.
In principle, this requirement can be removed and (3.8) remains true, but we do not go
into further detail in that direction. In applications, this is typically not restrictive since
V is continuous anyway.
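
On a finite state space, identity (3.8) can be made tangible: the V-norm equals Σ_x V(x)|µ(x) − ν(x)|, and a coupling that keeps the common mass min(µ, ν) on the diagonal attains exactly this cost under dV. The sketch below only exhibits such a coupling and its cost; it does not prove that no coupling is cheaper, which is the content of Lemma 3.1. The function names and the numbers are hypothetical.

```python
def v_norm(mu, nu, V):
    """||mu - nu||_V on a finite state space: sum_x V(x) |mu(x) - nu(x)|."""
    return sum(v * abs(a - b) for v, a, b in zip(V, mu, nu))


def coupling_cost(mu, nu, V):
    """Cost under d_V of one specific coupling of the probability vectors mu and nu.

    The common mass min(mu, nu) is placed on the diagonal (cost 0); the
    residuals (mu - nu)^+ and (nu - mu)^+ are coupled independently.
    """
    pos = [max(a - b, 0.0) for a, b in zip(mu, nu)]   # excess mass of mu
    neg = [max(b - a, 0.0) for a, b in zip(mu, nu)]   # excess mass of nu
    mass = sum(pos)                                   # equals sum(neg) for probability vectors
    if mass == 0.0:
        return 0.0
    cost = 0.0
    for x, px in enumerate(pos):
        for y, ny in enumerate(neg):
            # pos and neg have disjoint supports, so x != y whenever px * ny > 0
            cost += (V[x] + V[y]) * px * ny / mass
    return cost


# Hypothetical example with V(x) = 1 + x on the states {0, 1, 2, 3}.
V = [1.0, 2.0, 3.0, 4.0]
mu = [0.4, 0.3, 0.2, 0.1]
nu = [0.1, 0.2, 0.3, 0.4]

norm_value = v_norm(mu, nu, V)
cost_value = coupling_cost(mu, nu, V)
```

Both quantities agree up to rounding, because every off-diagonal pair (x, y) contributes V(x) to the mass leaving x and V(y) to the mass arriving at y.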

By similar arguments as in the proof of [26, Theorem 1.1] we observe that (3.7) implies
a suitable upper bound on
\[
\tau_V(P) = \sup_{x,y \in G,\, x \neq y} \frac{W_{d_V}(\delta_x P, \delta_y P)}{d_V(x, y)}
= \sup_{x,y \in G,\, x \neq y} \frac{\|P(x, \cdot) - P(y, \cdot)\|_V}{V(x) + V(y)}.
\]

Lemma 3.2. If (3.7) is satisfied for the transition kernel P, then τV(P^n) ≤ Cρ^n.

Proof. For any positive real numbers a1, a2, b1, b2 we have the following elementary
inequality
\[
\frac{a_1 + a_2}{b_1 + b_2} \le \max\left\{ \frac{a_1}{b_1}, \frac{a_2}{b_2} \right\}. \tag{3.9}
\]
By the triangle inequality for the V-norm and (3.9) we obtain
\[
\begin{aligned}
\tau_V(P^n) &= \sup_{x,y \in G,\, x \neq y} \frac{W_{d_V}(\delta_x P^n, \delta_y P^n)}{d_V(x, y)}
\le \sup_{x,y \in G,\, x \neq y} \frac{\|P^n(x, \cdot) - \pi\|_V + \|P^n(y, \cdot) - \pi\|_V}{V(x) + V(y)} \\
&\le \sup_{x,y \in G} \max\left\{ \frac{\|P^n(x, \cdot) - \pi\|_V}{V(x)}, \frac{\|P^n(y, \cdot) - \pi\|_V}{V(y)} \right\}
= \sup_{x \in G} \frac{\|P^n(x, \cdot) - \pi\|_V}{V(x)}.
\end{aligned}
\]
Now, by using (3.7) we obtain the assertion.

The lemmas above and Theorem 3.1 lead to the following new perturbation bound for
geometrically ergodic Markov chains.

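Corollary 3.3. Let P be V-uniformly ergodic, i.e., there are constants ρ ∈ [0, 1) and
C ∈ (0, ∞) such that
\[
\|P^n(x, \cdot) - \pi\|_V \le C V(x) \rho^n, \qquad x \in G,\ n \in \mathbb{N}.
\]
We also assume that there are numbers δ ∈ (0, 1) and L ∈ (0, ∞) and a measurable
Lyapunov function Ṽ : G → [1, ∞) of P̃ such that
\[
(\tilde{P} \tilde{V})(x) \le \delta \tilde{V}(x) + L. \tag{3.10}
\]
Let
\[
\gamma = \sup_{x \in G} \frac{\|P(x, \cdot) - \tilde{P}(x, \cdot)\|_V}{\tilde{V}(x)}
\quad \text{and} \quad
\kappa = \max\left\{ \tilde{p}_0(\tilde{V}), \frac{L}{1 - \delta} \right\},
\]
with p̃0(Ṽ) = ∫_G Ṽ(x) dp̃0(x). Then
\[
\|p_n - \tilde{p}_n\|_V \le C \left( \rho^n \|p_0 - \tilde{p}_0\|_V + (1 - \rho^n) \frac{\gamma \kappa}{1 - \rho} \right). \tag{3.11}
\]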

Remark 3.5. In [41, Theorem 3.1], a related perturbation bound is proven. The convergence
property of the unperturbed transition kernel is slightly weaker than our V-uniform
ergodicity, but also based on a kind of Lyapunov function. More restrictively, there it is
assumed that the difference of P^n and P̃^n for all n > 0 can be controlled. In addition, the
perturbation error is measured with a weight given by the same Lyapunov function as in
the convergence property of P, but by taking a supremum over a subset of test functions.
With our approach we can take the supremum over all test functions and obtain similar
estimates by setting p0 = π.

The next corollary demonstrates how the Lyapunov function of P̃ can be replaced
by a Lyapunov function of P, provided that the distance between the transition kernels
is sufficiently small. Notice that assuming the existence of a Lyapunov function of P in
addition to the V-uniform ergodicity is a definition of constants rather than an additional
requirement, see, e.g., [5].

Corollary 3.4. Let P be V-uniformly ergodic, i.e., there are constants ρ ∈ [0, 1) and
C ∈ (0, ∞) such that
\[
\|P^n(x, \cdot) - \pi\|_V \le C V(x) \rho^n, \qquad x \in G,\ n \in \mathbb{N}.
\]
Moreover, V : G → [1, ∞) is a measurable Lyapunov function of P, such that
\[
(P V)(x) \le \delta V(x) + L \tag{3.12}
\]
with constants δ ∈ (0, 1) and L ∈ (0, ∞). Let
\[
\gamma = \sup_{x \in G} \frac{\|P(x, \cdot) - \tilde{P}(x, \cdot)\|_V}{V(x)}
\quad \text{and} \quad
\kappa = \max\left\{ \tilde{p}_0(V), \frac{L}{1 - \delta - \gamma} \right\},
\]
with p̃0(V) = ∫_G V(x) dp̃0(x). If γ + δ < 1, then
\[
\|p_n - \tilde{p}_n\|_V \le C \left( \rho^n \|p_0 - \tilde{p}_0\|_V + (1 - \rho^n) \frac{\gamma \kappa}{1 - \rho} \right). \tag{3.13}
\]

Proof. It suffices to show that
\[
(\tilde{P} V)(x) \le (\delta + \gamma) V(x) + L \tag{3.14}
\]
and then to apply Corollary 3.3. We have
\[
((\tilde{P} - P) V)(x) \le |((\tilde{P} - P) V)(x)| \le \|\tilde{P}(x, \cdot) - P(x, \cdot)\|_V \le \gamma V(x),
\]
which together with (3.12) implies (3.14). The assertion follows by the assumption that
δ + γ < 1 and an application of Corollary 3.3.

Remark 3.6. For discrete state spaces and under the requirement p0 = p̃0, a result
similar to the previous corollary is obtained in [21, Theorem 3, Corollary 3]. The authors
of [21] replace our constant κ by max_{0≤i≤n} p̃i(V). This we could do as well, see the proof
of Theorem 3.1.

In the perturbation bound of Corollary 3.3, the function V plays two roles. In its
first role, V appears in the V-uniform ergodicity condition and thus is used to quantify
convergence of P. In its second role, V appears in the constant γ, with which we compare
P and P̃, as well as in the definition of the distance between pn and p̃n. We can interpret
γ of Corollary 3.3 as an operator norm of P − P̃. To this end, let BV be the set of all
measurable functions f : G → R with finite
\[
|f|_V := \sup_{x \in G} \frac{|f(x)|}{V(x)}, \tag{3.15}
\]
which means
\[
B_V = \{ f : G \to \mathbb{R} \mid |f|_V < \infty \}.
\]
It is easily seen that (BV, |·|V) is a normed linear space. In the setting of Corollary 3.3,
we have
\[
\|P - \tilde{P}\|_{B_V \to B_{\tilde{V}}} := \sup_{|f|_V \le 1} \left| (P - \tilde{P}) f \right|_{\tilde{V}} = \gamma. \tag{3.16}
\]
In Corollary 3.4, the more restrictive case V = Ṽ is considered. The corresponding
operator norm ‖P − P̃‖_{BV→BV} appears in classical perturbation theory for Markov
chains, see [20, 21]. But as discussed in [41, p. 1126] and [13] it might be too restrictive
to measure the perturbation with this operator norm for V = Ṽ.
By relying, e.g., on [28, Proposition 2] we have some flexibility in the choice of V. There
it is shown that, for r ∈ (0, 1), V-uniform ergodicity implies V^r-uniform ergodicity. This
leads to less favorable constants in the V^r-uniform ergodicity of P, but can relax the
requirements on the similarity of P and P̃. Namely, with a Lyapunov function Ṽ of P̃
we can apply Corollary 3.3 with a V^r-uniformly ergodic P and γ = ‖P − P̃‖_{B_{V^r}→B_Ṽ}.
Unfortunately, this approach breaks down for r = 0. To see this, notice that V^r-uniform
ergodicity with r = 0 is just uniform ergodicity which is not implied by geometric
ergodicity. The next theorem overcomes this limitation by separating the two roles of the
function V in the previous perturbation bounds. Roughly, we set V = 1 in the sense that
we measure the distances between P and P̃ as well as between pn and p̃n in the total
variation distance. At the same time, we set V = Ṽ in the sense that we assume P is
Ṽ-uniformly ergodic with Lyapunov function Ṽ.

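Theorem 3.2. Let P be Ṽ-uniformly ergodic, i.e., there are constants ρ ∈ [0, 1) and
C ∈ (0, ∞) such that
\[
\|P^n(x, \cdot) - \pi\|_{\tilde{V}} \le C \tilde{V}(x) \rho^n, \qquad x \in G,\ n \in \mathbb{N}.
\]
Moreover, Ṽ : G → [1, ∞) is a measurable Lyapunov function of P̃ and P, such that
\[
(\tilde{P} \tilde{V})(x) \le \delta \tilde{V}(x) + L \quad \text{and} \quad (P \tilde{V})(x) \le \tilde{V}(x) + L,
\]
with constants δ ∈ (0, 1) and L ∈ (0, ∞). Let
\[
\gamma = \sup_{x \in G} \frac{\|P(x, \cdot) - \tilde{P}(x, \cdot)\|_{tv}}{\tilde{V}(x)}
\quad \text{and} \quad
\kappa = \max\left\{ \tilde{p}_0(\tilde{V}), \frac{L}{1 - \delta} \right\}, \tag{3.17}
\]
with p̃0(Ṽ) = ∫_G Ṽ(x) dp̃0(x). Then, for γ ∈ (0, exp(−1)) we have
\[
\|p_n - \tilde{p}_n\|_{tv} \le C \rho^n \|p_0 - \tilde{p}_0\|_{\tilde{V}}
+ \frac{\exp(1) \, \kappa}{1 - \rho} \, (2C(L + 1))^{\log(\gamma^{-1})^{-1}} \, \gamma \log(\gamma^{-1}). \tag{3.18}
\]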
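
The bound (3.18) is easy to evaluate numerically, which helps to see that it behaves like γ log(γ^{-1}) for small γ. The sketch below encodes the right-hand side of (3.18) together with the exponents r = 1 + log(γ)^{-1} and s = log(γ^{-1})^{-1} used in the proof; the function names and the constants are hypothetical.

```python
import math


def interpolation_exponents(gamma):
    """Exponents r = 1 + 1/log(gamma) and s = 1/log(1/gamma); they satisfy r + s = 1."""
    s = 1.0 / math.log(1.0 / gamma)
    return 1.0 - s, s


def tv_perturbation_bound(n, C, rho, L, gamma, kap, v_dist0):
    """Right-hand side of (3.18); v_dist0 stands for ||p_0 - p~_0|| in the V~-norm."""
    if not 0.0 < gamma < math.exp(-1.0):
        raise ValueError("(3.18) requires gamma in (0, exp(-1))")
    log_inv = math.log(1.0 / gamma)
    initial_term = C * rho ** n * v_dist0
    perturbation_term = (math.e * kap / (1.0 - rho)
                         * (2.0 * C * (L + 1.0)) ** (1.0 / log_inv)
                         * gamma * log_inv)
    return initial_term + perturbation_term


# Hypothetical constants.
C, rho, L, kap = 2.0, 0.9, 1.0, 3.0
coarse = tv_perturbation_bound(50, C, rho, L, gamma=1e-3, kap=kap, v_dist0=0.0)
fine = tv_perturbation_bound(50, C, rho, L, gamma=1e-4, kap=kap, v_dist0=0.0)

# With this choice of r, gamma**r equals exp(1) * gamma.
r, s = interpolation_exponents(1e-3)
```

With equal initial distributions (`v_dist0 = 0`), only the perturbation term remains, and it shrinks as γ decreases.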

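Proof. From the proof of Theorem 3.1 we know that
\[
\|\tilde{p}_n - p_n\|_{tv} \le \|(\tilde{p}_0 - p_0) P^n\|_{tv} + \sum_{i=0}^{n-1} \left\| \tilde{p}_i (\tilde{P} - P) P^{n-i-1} \right\|_{tv}, \qquad n \in \mathbb{N}.
\]
By Lemma 3.2, we have
\[
\|(\tilde{p}_0 - p_0) P^n\|_{tv} \le \|(\tilde{p}_0 - p_0) P^n\|_{\tilde{V}} \le C \rho^n \|\tilde{p}_0 - p_0\|_{\tilde{V}}.
\]
Fix a real number r ∈ (0, 1) and let s = 1 − r. By considering (2.3) one can see that
τ1(P) ≤ 1. This leads to
\[
\begin{aligned}
\left\| \tilde{p}_i (\tilde{P} - P) P^{n-i-1} \right\|_{tv}
&\le \left\| \tilde{p}_i (\tilde{P} - P) P^{n-i-1} \right\|_{tv}^{r} \left\| \tilde{p}_i (\tilde{P} - P) P^{n-i-1} \right\|_{\tilde{V}}^{s} \\
&\le \left\| \tilde{p}_i (\tilde{P} - P) \right\|_{tv}^{r} \left\| \tilde{p}_i (\tilde{P} - P) \right\|_{\tilde{V}}^{s} \, \tau_{\tilde{V}}(P^{n-i-1})^{s}.
\end{aligned}
\]
We also have
\[
\left\| \tilde{p}_i (\tilde{P} - P) \right\|_{tv} \le \int_G \left\| \delta_x P - \delta_x \tilde{P} \right\|_{tv} \mathrm{d}\tilde{p}_i(x) \le \gamma \int_G \tilde{V}(x) \, \mathrm{d}\tilde{p}_i(x),
\]
\[
\left\| \tilde{p}_i (\tilde{P} - P) \right\|_{\tilde{V}} \le \int_G W_{d_{\tilde{V}}}(\delta_x P, \delta_x \tilde{P}) \, \mathrm{d}\tilde{p}_i(x)
\le \sup_{x \in G} \frac{W_{d_{\tilde{V}}}(\delta_x P, \delta_x \tilde{P})}{\tilde{V}(x)} \int_G \tilde{V}(x) \, \mathrm{d}\tilde{p}_i(x).
\]
Moreover, for i ≥ 0 we obtain
\[
\int_G \tilde{V}(x) \, \mathrm{d}\tilde{p}_i(x) = \int_G \tilde{P}^i \tilde{V}(x) \, \mathrm{d}\tilde{p}_0(x) \le \delta^i \tilde{p}_0(\tilde{V}) + \frac{L(1 - \delta^i)}{1 - \delta} \le \kappa,
\]
and, by
\[
\begin{aligned}
W_{d_{\tilde{V}}}(\delta_x P, \delta_x \tilde{P}) &= \inf_{\xi \in M(\delta_x P, \delta_x \tilde{P})} \int_G \int_G (\tilde{V}(z) + \tilde{V}(y)) \, \mathbf{1}_{z \neq y} \, \mathrm{d}\xi(y, z) \\
&\le P \tilde{V}(x) + \tilde{P} \tilde{V}(x) \le (1 + \delta) \tilde{V}(x) + 2L,
\end{aligned}
\]
we have
\[
\sup_{x \in G} \frac{W_{d_{\tilde{V}}}(\delta_x P, \delta_x \tilde{P})}{\tilde{V}(x)} \le 2(L + 1).
\]
Thus,
\[
\|\tilde{p}_n - p_n\|_{tv} \le C \rho^n \|\tilde{p}_0 - p_0\|_{\tilde{V}} + 2^s (L + 1)^s \gamma^r \kappa \sum_{i=0}^{n-1} \tau_{\tilde{V}}(P^i)^s.
\]
Finally, by Lemma 3.2 we obtain
\[
\sum_{i=0}^{n-1} \tau_{\tilde{V}}(P^i)^s \le \frac{C^s (1 - \rho^{ns})}{1 - \rho^s} \le \frac{C^s}{1 - \rho^s} \le \frac{C^s}{s(1 - \rho)}.
\]
For γ ∈ (0, exp(−1)), we can choose the numbers r = 1 + log(γ)^{-1} and s = log(γ^{-1})^{-1}.
This yields γ^r = exp(1)γ and the proof is complete.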

Remark 3.7. Let π̃ ∈ P be a stationary distribution of P̃. Notice that by the assumption
that Ṽ is a Lyapunov function of P̃ and [16, Proposition 4.24] it follows that
π̃(Ṽ) ≤ L/(1 − δ). Further, by the Ṽ-uniform ergodicity of P we also know that π(Ṽ) is
finite. Thus,
\[
\|\pi - \tilde{\pi}\|_{\tilde{V}} \le \pi(\tilde{V}) + \tilde{\pi}(\tilde{V}) < \infty.
\]
Now, by Theorem 3.2 we can bound ‖π − π̃‖tv with p0 = π, p̃0 = π̃ and by letting
n → ∞. We obtain
\[
\|\pi - \tilde{\pi}\|_{tv} \le \exp(1) \, \frac{L \, (2C(L + 1))^{\log(\gamma^{-1})^{-1}}}{(1 - \delta)(1 - \rho)} \, \gamma \log(\gamma^{-1}). \tag{3.19}
\]

Remark 3.8. Let us comment on the dependence on γ. In Section 4.3, we apply Theorem 3.2
combined with (3.19) in a setting where we have γ ≤ K · log(N)/N for a constant
K ≥ 1 and some parameter N ∈ N of the perturbed transition kernel. For ε ∈ (0, 1) and
any N > (K/ε)^{1/(1−ε)} we have γ < exp(−1). Then, with some simple calculations, we
obtain for p0 = p̃0 and N > 6K^{3/2} the bound
\[
\max\left\{ \|p_n - \tilde{p}_n\|_{tv}, \|\pi - \tilde{\pi}\|_{tv} \right\} \le \frac{3 \kappa \, (2C(L + 1))^{2/\log(N)}}{1 - \rho} \cdot \frac{K \log(N)^2}{N}.
\]

Remark 3.9. In the setting of Theorem 3.2, we can also interpret γ as an operator
norm. Namely,
\[
\|P - \tilde{P}\|_{B_1 \to B_{\tilde{V}}} = \sup_{|f|_1 \le 1} \left| (P - \tilde{P}) f \right|_{\tilde{V}} = \gamma. \tag{3.20}
\]
Here the subscript “1” in |f|1 indicates V(x) = 1 for all x ∈ G, see (3.15). For ε0 > 0
and a family of perturbations (P̃ε)|ε|≤ε0 let γ = ‖P − P̃ε‖_{B1→B_Ṽ} → 0 for ε → 0. This
condition appears in [13, Theorem 1, condition (2)] and is an assumption introduced by
Keller and Liverani, see [22].

4. Applications
We illustrate our perturbation bounds in three different settings. We begin with studying
an autoregressive process also considered in [13]. After this, we show quantitative per-
turbation bounds for approximate versions of two prominent MCMC algorithms, namely
the Metropolis-Hastings and stochastic Langevin algorithms.

4.1. Autoregressive process

Let $G = \mathbb{R}$ and assume that $(X_n)_{n \in \mathbb{N}_0}$ is the autoregressive model defined by
\[
X_n = \alpha X_{n-1} + Z_n, \qquad n \in \mathbb{N}. \tag{4.1}
\]
Here $X_0$ is an $\mathbb{R}$-valued random variable, $\alpha \in (-1,1)$ and $(Z_n)_{n \in \mathbb{N}}$ is an i.i.d. sequence of random variables, independent of $X_0$. We also assume that the distribution of $Z_1$, say $\mu$, admits a first moment. It is easily seen that $(X_n)_{n \in \mathbb{N}_0}$ is a Markov chain with transition kernel
\[
P_\alpha(x, A) = \int_{\mathbb{R}} 1_A(\alpha x + y)\, d\mu(y),
\]
and it is well known that there exists a stationary distribution, say $\pi_\alpha$, of $P_\alpha$.
Now, let the transition kernel $P_{\tilde\alpha}$ with $\tilde\alpha \in (-1,1)$ be an approximation of $P_\alpha$. For $x, y \in G$, let us consider the metric which is given by the absolute difference, i.e., $d(x,y) = |x - y|$. We assume that $|\alpha - \tilde\alpha|$ is small and study the Wasserstein distance, based on $d$, of $p_0 P_\alpha^n$ and $\tilde p_0 P_{\tilde\alpha}^n$ with two probability measures $p_0$ and $\tilde p_0$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
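We intend to apply Theorem 3.1. Notice that for $\tilde V : \mathbb{R} \to [1, \infty)$ with $\tilde V(x) = 1 + |x|$ we have
\[
P_{\tilde\alpha} \tilde V(x) \le |\tilde\alpha|\, \tilde V(x) + 1 - |\tilde\alpha| + E|Z_1|,
\]
which guarantees that condition (3.1) is satisfied with $\delta = |\tilde\alpha|$ and $L = 1 - |\tilde\alpha| + E|Z_1|$. Further, the estimate
\[
W(\delta_x P_\alpha, \delta_y P_\alpha) \le \int_{\mathbb{R}} |\alpha x - z - \alpha y + z|\, d\mu(z) \le |\alpha|\,|x - y| = |\alpha|\, d(x,y)
\]
leads to $\tau(P_\alpha^n) \le |\alpha|^n$. Similarly, one obtains
\[
W(\delta_x P_\alpha, \delta_x P_{\tilde\alpha}) \le \int_{\mathbb{R}} |\alpha x - z - \tilde\alpha x + z|\, d\mu(z) \le |x|\,|\alpha - \tilde\alpha|,
\]
which implies that
\[
\sup_{x \in \mathbb{R}} \frac{W(\delta_x P_\alpha, \delta_x P_{\tilde\alpha})}{\tilde V(x)} \le |\alpha - \tilde\alpha|.
\]
We set
\[
\kappa = 1 + \max\Big\{ \int_{\mathbb{R}} |x|\, d\tilde p_0(x),\ \frac{E|Z_1|}{1 - |\tilde\alpha|} \Big\},
\]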


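and $p_{\alpha,n} = p_0 P_\alpha^n$, $\tilde p_{\tilde\alpha,n} = \tilde p_0 P_{\tilde\alpha}^n$. Then, inequality (3.2) of Theorem 3.1 gives
\[
W(p_{\alpha,n}, \tilde p_{\tilde\alpha,n}) \le |\alpha|^n\, W(p_0, \tilde p_0) + |\alpha - \tilde\alpha|\, \frac{(1 - |\alpha|^n)\, \kappa}{1 - |\alpha|}, \tag{4.2}
\]
and for $p_0 = \tilde p_0$ we have
\[
W(p_{\alpha,n}, \tilde p_{\tilde\alpha,n}) \le |\alpha - \tilde\alpha|\, \frac{(1 - |\alpha|^n)\, \kappa}{1 - |\alpha|}. \tag{4.3}
\]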
From the previous two inequalities one can see that if $\tilde\alpha$ is sufficiently close to $\alpha$, then the distance of the distributions $p_{\alpha,n}$ and $\tilde p_{\tilde\alpha,n}$ is small. Let us emphasize here that we provide an explicit estimate rather than an asymptotic statement.
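
Since the estimate is explicit, it can be evaluated directly. The following is a minimal sketch of the right-hand side of (4.2) together with the constant $\kappa$; the function names and the numbers in the example are hypothetical, and the first absolute moments are assumed to be supplied by the user.

```python
def kappa_ar(mean_abs_x0, mean_abs_z, alpha_tilde):
    """kappa = 1 + max{ E|X_0| under p~_0, E|Z_1| / (1 - |alpha~|) }."""
    return 1.0 + max(mean_abs_x0, mean_abs_z / (1.0 - abs(alpha_tilde)))


def bound_4_2(alpha, alpha_tilde, n, w0, kappa):
    """Right-hand side of (4.2); w0 stands for W(p_0, p~_0)."""
    a = abs(alpha)
    return a ** n * w0 + abs(alpha - alpha_tilde) * (1.0 - a ** n) * kappa / (1.0 - a)


# Example: alpha = 0.5, alpha~ = 0.55, E|Z_1| = 1, both chains started in 0.
k = kappa_ar(mean_abs_x0=0.0, mean_abs_z=1.0, alpha_tilde=0.55)
one_step = bound_4_2(0.5, 0.55, n=1, w0=0.0, kappa=k)
many_steps = bound_4_2(0.5, 0.55, n=200, w0=0.0, kappa=k)
```

With `w0 = 0` the function returns the bound (4.3), which increases in `n` towards $|\alpha - \tilde\alpha|\,\kappa/(1-|\alpha|)$.
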
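Note that by [16, Proposition 4.24] and the fact that $P_\beta g(x) \le |\beta|\, g(x) + E|Z_1|$ with $g(x) = |x|$ and $\beta \in \{\alpha, \tilde\alpha\}$ we obtain $\int_{\mathbb{R}} |x|\, d\pi_\beta(x) < \infty$, which leads to a finite $W(\pi_\alpha, \pi_{\tilde\alpha})$. As a consequence we obtain for the stationary distributions of $P_\alpha$ and $P_{\tilde\alpha}$ by estimate (3.5) that
\[
W(\pi_\alpha, \pi_{\tilde\alpha}) \le |\alpha - \tilde\alpha|\, \frac{1 - |\tilde\alpha| + E|Z_1|}{(1 - |\alpha|)(1 - |\tilde\alpha|)}. \tag{4.4}
\]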
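The dependence on $|\alpha - \tilde\alpha|$ in the previous inequality cannot be improved in general. To see this, let us assume that $X_{0,\alpha}$ and $X_{0,\tilde\alpha}$ are real-valued random variables with distribution $\pi_\alpha$ and $\pi_{\tilde\alpha}$, respectively. Then, because of the stationarity we have that $X_{1,\alpha} = \alpha X_{0,\alpha} + Z_1$ and $X_{1,\tilde\alpha} = \tilde\alpha X_{0,\tilde\alpha} + Z_1$ are also distributed according to $\pi_\alpha$ and $\pi_{\tilde\alpha}$, respectively. Thus
\[
E X_{0,\alpha} = \frac{E Z_1}{1 - \alpha}, \qquad E X_{0,\tilde\alpha} = \frac{E Z_1}{1 - \tilde\alpha}.
\]
Now, for $g : \mathbb{R} \to \mathbb{R}$ with $g(x) = x$, we have $\|g\|_{\mathrm{Lip}} \le 1$ and thus
\[
\begin{aligned}
W(\pi_\alpha, \pi_{\tilde\alpha}) &= \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Big| \int_G f(x)\, (d\pi_\alpha(x) - d\pi_{\tilde\alpha}(x)) \Big| \\
&\ge \Big| \int_G x\, (d\pi_\alpha(x) - d\pi_{\tilde\alpha}(x)) \Big| = |E X_{0,\alpha} - E X_{0,\tilde\alpha}| \\
&= |\alpha - \tilde\alpha|\, \frac{|E Z_1|}{|1 - \alpha|\, |1 - \tilde\alpha|}.
\end{aligned}
\]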
Hence, whenever $E Z_1 \ne 0$ we have a non-trivial lower bound with the same dependence on $|\alpha - \tilde\alpha|$ as in the upper bound of (4.4). This fact shows that we cannot improve the upper bound.
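
The two-sided estimate can be illustrated numerically. The sketch below evaluates the upper bound (4.4) and the lower bound obtained from the test function $g(x) = x$; the function names are hypothetical, and the moments $E Z_1$ and $E|Z_1|$ are assumed to be known.

```python
def upper_bound_4_4(alpha, alpha_tilde, mean_abs_z):
    """Right-hand side of (4.4)."""
    numerator = 1.0 - abs(alpha_tilde) + mean_abs_z
    return (abs(alpha - alpha_tilde) * numerator
            / ((1.0 - abs(alpha)) * (1.0 - abs(alpha_tilde))))


def lower_bound_mean(alpha, alpha_tilde, mean_z):
    """Lower bound |E X_{0,alpha} - E X_{0,alpha~}| on W(pi_alpha, pi_alpha~)."""
    return (abs(alpha - alpha_tilde) * abs(mean_z)
            / (abs(1.0 - alpha) * abs(1.0 - alpha_tilde)))


# Example with E Z_1 = E|Z_1| = 1 (for instance a non-negative Z_1 with mean 1).
lower = lower_bound_mean(0.5, 0.6, mean_z=1.0)
upper = upper_bound_4_4(0.5, 0.6, mean_abs_z=1.0)
```

In this example the two values are approximately 0.5 and 0.7, so both scale linearly in $|\alpha - \tilde\alpha|$ and differ only in the constant. For the degenerate choice $Z_1 \equiv 1$ the stationary distributions are the point masses at $1/(1-\alpha)$ and $1/(1-\tilde\alpha)$, and the lower bound is attained.
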
Let us now discuss the application of Corollary 3.4 and Theorem 3.2. Under the additional assumption that $\mu$, the distribution of $Z_1$, has a Lebesgue density $h$, it is shown in [15, Section 4] that the autoregressive model (4.1) is also $\tilde V$-uniformly ergodic. Precisely, there is a constant $C \ge 1$ such that
\[
\|P_\alpha^n(x, \cdot) - \pi_\alpha\|_{\mathrm{tv}} \le C\, |\alpha|^n\, \tilde V(x).
\]


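Moreover, from [13, Example 1] we know that
\[
\sup_{x \in \mathbb{R}} \frac{\|P_\alpha(x, \cdot) - P_{\tilde\alpha}(x, \cdot)\|_{\tilde V}}{\tilde V(x)}
\]
does not go to 0 when $\tilde\alpha \downarrow \alpha$. Hence, Corollary 3.4 cannot quantify for small $|\alpha - \tilde\alpha|$ whether the $n$th step distributions are close to each other. However, also in [13, Example 1] it is proven that
\[
\sup_{x \in \mathbb{R}} \frac{\|P_\alpha(x, \cdot) - P_{\tilde\alpha}(x, \cdot)\|_{\mathrm{tv}}}{\tilde V(x)} \to 0 \quad \text{if } \tilde\alpha \to \alpha.
\]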
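This indicates that Theorem 3.2 is applicable. By assuming in addition that $h$ is weakly unimodal$^1$ and bounded from above by $h_{\max}$, we can quantify the result. Namely,
\[
\begin{aligned}
\sup_{x \in \mathbb{R}} \frac{\|P_\alpha(x, \cdot) - P_{\tilde\alpha}(x, \cdot)\|_{\mathrm{tv}}}{\tilde V(x)}
&= \sup_{x \in \mathbb{R}} \frac{\|\mu(\cdot - \alpha x) - \mu(\cdot - \tilde\alpha x)\|_{\mathrm{tv}}}{1 + |x|} \\
&= \sup_{x \in \mathbb{R}} \frac{\int_{\mathbb{R}} |h(z - \alpha x) - h(z - \tilde\alpha x)|\, dz}{1 + |x|} \le 2\, |\alpha - \tilde\alpha|\, h_{\max}.
\end{aligned}
\]
To see the final estimate, define $F(a) = \int_{\mathbb{R}} |h(z) - h(z - a)|\, dz$ for $a \in \mathbb{R}$. By unimodality, there exists for any fixed $a \ge 0$ a constant $c$ such that
\[
\int_{\mathbb{R}} |h(z) - h(z - a)|\, dz = \int_{-\infty}^{c} \big(h(z) - h(z - a)\big)\, dz + \int_{c}^{\infty} \big(h(z - a) - h(z)\big)\, dz.
\]
The first summand on the right-hand side we can bound by
\[
\int_{-\infty}^{c} h(z)\, dz - \int_{-\infty}^{c} h(z - a)\, dz = \int_{c-a}^{c} h(z)\, dz \le a\, h_{\max}
\]
and similarly for the second summand. Using that $F(a) = F(-a)$, we obtain $F(a) \le 2|a|\, h_{\max}$. Finally, by substitution we can write
\[
\sup_{x \in \mathbb{R}} \frac{\int_{\mathbb{R}} |h(z - \alpha x) - h(z - \tilde\alpha x)|\, dz}{1 + |x|} = |\alpha - \tilde\alpha| \sup_{a \ge 0} \frac{F(a)}{a + |\alpha - \tilde\alpha|} \le 2\, |\alpha - \tilde\alpha|\, h_{\max}.
\]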

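For simplicity set $p_0 = \tilde p_0$ and assume that $h_{\max} \le 1$ as well as $|\alpha - \tilde\alpha| \in (0, \exp(-1)/2)$. Then, Theorem 3.2 implies
\[
\max\{\|p_{\alpha,n} - \tilde p_{\tilde\alpha,n}\|_{\mathrm{tv}}, \|\pi_\alpha - \pi_{\tilde\alpha}\|_{\mathrm{tv}}\} \le \frac{\kappa \exp(1)}{1 - |\alpha|}\, (2C(E|Z_1| + 2))\, |\alpha - \tilde\alpha|\, \log(|\alpha - \tilde\alpha|^{-1})
\]
which seems to be new.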
$^1$ The function $h : \mathbb{R} \to [0, \infty)$ is called weakly unimodal if there exists $s \in \mathbb{R}$ such that $h(x)$ is nondecreasing for $x \in (-\infty, s)$ and nonincreasing for $x \in (s, \infty)$.


4.2. Approximate Metropolis-Hastings algorithms

We apply our perturbation results to the approximate (or noisy) Metropolis-Hastings
algorithms analyzed in [2, 3, 4, 23, 29, 35]. We assume either that the unperturbed tran-
sition kernel of the Metropolis-Hastings algorithm satisfies the Wasserstein ergodicity
condition stated in Assumption 2.1 or is geometrically ergodic. In particular, we do not
assume that the transition kernel is uniformly ergodic. Let π be a probability distribu-
tion on (G, B(G)) and assume that we are interested in sampling realizations from this
distribution. Let $Q$ be a transition kernel which serves as the proposal for the Metropolis-Hastings algorithm. From [44, Proposition 1] we know that there exists a set $S \subset G \times G$ such that we can define the "acceptance ratio" for $(x, y) \in G \times G$ as
\[
r(x, y) := \begin{cases} \dfrac{\pi(dy)\, Q(y, dx)}{\pi(dx)\, Q(x, dy)} & (x, y) \in S, \\[1ex] 0 & \text{otherwise.} \end{cases} \tag{4.5}
\]

Then, let the acceptance probability be $\alpha(x, y) = \min\{1, r(x, y)\}$. With this notation the Metropolis-Hastings algorithm defines a transition kernel
\[
P_\alpha(x, dy) = Q(x, dy)\, \alpha(x, y) + \delta_x(dy)\, s_\alpha(x), \tag{4.6}
\]
with
\[
s_\alpha(x) = 1 - \int_G \alpha(x, y)\, Q(x, dy).
\]
We provide a step of a Markov chain $(X_n)_{n \in \mathbb{N}_0}$ with transition kernel $P_\alpha$ in algorithmic form.

Algorithm 4.1. A single transition from $X_n$ to $X_{n+1}$ of the Metropolis-Hastings algorithm works as follows:
1. Draw a sample Y ∼ Q(Xn , ·) and U ∼ Unif[0, 1] independently, call the result y
and u;
2. Set r := r(Xn , y), with the ratio r(·, ·) defined in (4.5);
3. If u < r, then accept the proposal, and set Xn+1 := y, else reject the proposal and
set Xn+1 := Xn .
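
The three steps of Algorithm 4.1 can be sketched in code as follows. This is a minimal illustration, not a reference implementation: the names `mh_step`, `propose` and `ratio` are hypothetical, and the example target (a standard normal density with a symmetric uniform random-walk proposal) is an assumption made only to obtain a concrete ratio.

```python
import math
import random


def mh_step(x, propose, ratio, rng):
    """One Metropolis-Hastings transition following Algorithm 4.1."""
    y = propose(x, rng)   # step 1: proposal Y ~ Q(x, .)
    u = rng.random()      # step 1: U ~ Unif[0, 1), drawn independently
    r = ratio(x, y)       # step 2: acceptance ratio r(x, y)
    return y if u < r else x   # step 3: accept or reject


# Assumed example: N(0, 1) target, symmetric random-walk proposal, so the
# ratio reduces to the quotient of the target densities.
def propose(x, rng):
    return x + rng.uniform(-1.0, 1.0)


def ratio(x, y):
    return math.exp(0.5 * (x * x - y * y))


rng = random.Random(0)
chain = [0.0]
for _ in range(1000):
    chain.append(mh_step(chain[-1], propose, ratio, rng))
```

Comparing `u` with `r` instead of with `min(1, r)` gives the same decision, because `u` never exceeds one.
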

Now, suppose we are unable to evaluate $r(x, y)$, so that we are forced to work with an approximation of $\alpha(x, y)$. The key idea behind approximate Metropolis-Hastings algorithms is to replace $r(x, y)$ by a non-negative random variable $R$ with distribution, say $\mu_{x,y,u}$, depending on $x, y \in G$ and $u \in [0, 1]$. For concrete choices of the random variable $R$ we refer to [2, 3, 4, 23]. We present a step of the corresponding Markov chain $(\tilde X_n)_{n \in \mathbb{N}}$ in algorithmic form.

Algorithm 4.2. A single transition from $\tilde X_n$ to $\tilde X_{n+1}$ works as follows:
1. Draw a sample $Y \sim Q(\tilde X_n, \cdot)$ and $U \sim \mathrm{Unif}[0, 1]$ independently, call the result $y$ and $u$;


2. Draw a sample $R \sim \mu_{\tilde X_n, y, u}$, call the result $\tilde r$;
3. If $u < \tilde r$, then accept the proposal, and set $\tilde X_{n+1} := y$, else reject the proposal and set $\tilde X_{n+1} := \tilde X_n$.
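
A transition of the approximate chain differs from the exact one only in how the ratio is obtained. The sketch below mirrors Algorithm 4.2; `noisy_mh_step` and `sample_ratio` are hypothetical names, and the multiplicative log-normal noise on an exact Gaussian ratio is an assumed toy model, not one of the estimators from [2, 3, 4, 23].

```python
import math
import random


def noisy_mh_step(x, propose, sample_ratio, rng):
    """One transition of the approximate chain following Algorithm 4.2."""
    y = propose(x, rng)                    # step 1: Y ~ Q(x, .)
    u = rng.random()                       # step 1: U ~ Unif[0, 1)
    r_tilde = sample_ratio(x, y, u, rng)   # step 2: R ~ mu_{x, y, u}
    return y if u < r_tilde else x         # step 3: accept or reject


# Assumed toy noise model: exact N(0, 1) ratio times log-normal noise.
def sample_ratio(x, y, u, rng, noise_sd=0.1):
    exact = math.exp(0.5 * (x * x - y * y))
    return exact * math.exp(rng.gauss(0.0, noise_sd))


rng = random.Random(0)
state = 0.0
for _ in range(100):
    state = noisy_mh_step(state, lambda x, g: x + g.uniform(-1.0, 1.0),
                          sample_ratio, rng)
```

Passing `u` to `sample_ratio` keeps the interface general enough for estimators whose distribution depends on the uniform draw, as allowed by $\mu_{x,y,u}$.
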

The algorithm has acceptance probability
\[
\tilde\alpha(x, y) = E\, 1_{[0, \min\{1, R\}]}(U) = \int_0^1 \int_0^\infty 1_{[0, \min\{1, \tilde r\}]}(u)\, d\mu_{x,y,u}(\tilde r)\, du
\]
and the transition kernel of such a Markov chain is still of the form (4.6) with $\alpha(x, y)$ substituted by $\tilde\alpha(x, y)$, i.e., it is given by $P_{\tilde\alpha}$. The following results hold in the slightly more general case where $\tilde\alpha(x, y)$ is any approximation of the acceptance probability $\alpha(x, y)$. The next lemma provides an estimate for the Wasserstein distance between transition kernels of the form (4.6) in terms of the acceptance probabilities.

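Lemma 4.1. Let $Q$ be a transition kernel on $(G, \mathcal{B}(G))$ and let $\alpha : G \times G \to [0, 1]$ and $\tilde\alpha : G \times G \to [0, 1]$ be measurable functions. By $P_\alpha$ and $P_{\tilde\alpha}$ we denote the transition kernels of the form (4.6) with acceptance probabilities $\alpha$ and $\tilde\alpha$. Then, for all $x \in G$, we have
\[
W(\delta_x P_\alpha, \delta_x P_{\tilde\alpha}) \le \int_G d(x, y)\, E(x, y)\, Q(x, dy)
\]
with $E(x, y) = |\alpha(x, y) - \tilde\alpha(x, y)|$.

Proof. By the use of the dual representation of the Wasserstein distance it follows that
\[
\begin{aligned}
W(\delta_x P_\alpha, \delta_x P_{\tilde\alpha}) &= \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Big| \int_G f(y)\, \big(P_\alpha(x, dy) - P_{\tilde\alpha}(x, dy)\big) \Big| \\
&= \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Big| \int_G (f(y) - f(x))\, (\alpha(x, y) - \tilde\alpha(x, y))\, Q(x, dy) \Big| \le \int_G d(x, y)\, E(x, y)\, Q(x, dy).
\end{aligned}
\]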
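
The estimate of Lemma 4.1 can be checked numerically on a small state space. The sketch below assumes the hypothetical setting $G = \{0, 1, 2, 3\}$ with $d(x, y) = |x - y|$ and a uniform proposal; the acceptance probabilities are arbitrary illustrative numbers. On equally spaced points of the real line, the Wasserstein distance equals the sum of absolute differences of the distribution functions, which is what `w1_on_integers` computes.

```python
def mh_row(x, q_row, acc_row):
    """Row P(x, .) of a kernel of the form (4.6) on the states 0, ..., m-1."""
    row = [q * a for q, a in zip(q_row, acc_row)]
    row[x] += 1.0 - sum(row)   # rejected mass stays at x
    return row


def w1_on_integers(p, q):
    """Wasserstein-1 distance of two distributions on 0, ..., m-1 (d = |x - y|)."""
    total, cp, cq = 0.0, 0.0, 0.0
    for pk, qk in zip(p[:-1], q[:-1]):
        cp += pk
        cq += qk
        total += abs(cp - cq)
    return total


def lemma_4_1_bound(x, q_row, acc_row, acc_tilde_row):
    """Right-hand side of Lemma 4.1: sum over y of d(x, y) E(x, y) Q(x, {y})."""
    return sum(abs(x - y) * abs(a - b) * q
               for y, (q, a, b) in enumerate(zip(q_row, acc_row, acc_tilde_row)))


x = 1
q_row = [0.25, 0.25, 0.25, 0.25]
acc = [0.2, 1.0, 0.5, 0.1]         # alpha(x, .), illustrative values
acc_tilde = [0.3, 1.0, 0.4, 0.3]   # alpha~(x, .), illustrative values

w = w1_on_integers(mh_row(x, q_row, acc), mh_row(x, q_row, acc_tilde))
bound = lemma_4_1_bound(x, q_row, acc, acc_tilde)
```

In this example the distance is about 0.1 and the bound about 0.15, so the inequality holds but is not attained.
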

By the previous lemma and Theorem 3.1, we obtain the following Wasserstein pertur-
bation bound for the approximate Metropolis-Hastings algorithm.

Corollary 4.1. Let $Q$ be a transition kernel on $(G, \mathcal{B}(G))$ and let $\alpha : G \times G \to [0, 1]$ and $\tilde\alpha : G \times G \to [0, 1]$ be measurable functions. By $P_\alpha$ and $P_{\tilde\alpha}$ we denote the transition kernels of the form (4.6) with acceptance probabilities $\alpha$ and $\tilde\alpha$. Let the following conditions be satisfied:
• Assumption 2.1 holds for the transition kernel $P_\alpha$, i.e., $\tau(P_\alpha^n) \le C\rho^n$ for $\rho \in [0, 1)$ and $C \in (0, \infty)$.


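• There are numbers $\delta \in (0, 1)$, $L \in (0, \infty)$ and a measurable Lyapunov function $\tilde V : G \to [1, \infty)$ of $P_{\tilde\alpha}$, i.e.,
\[
(P_{\tilde\alpha} \tilde V)(x) \le \delta\, \tilde V(x) + L. \tag{4.7}
\]
• Let $E(x, y) = |\alpha(x, y) - \tilde\alpha(x, y)|$ and assume that
\[
\gamma = \sup_{x \in G} \frac{\int_G d(x, y)\, E(x, y)\, Q(x, dy)}{\tilde V(x)} < \infty. \tag{4.8}
\]
Then, for any $p_0 \in \mathcal{P}$ and finite $p_0(\tilde V) = \int_G \tilde V(x)\, dp_0(x)$ we have
\[
W(p_0 P_\alpha^n, p_0 P_{\tilde\alpha}^n) \le \frac{\gamma\, \kappa\, C (1 - \rho^n)}{1 - \rho},
\]
where $\kappa = \max\big\{ p_0(\tilde V), \frac{L}{1 - \delta} \big\}$.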

Let us point out several aspects of condition (4.7). Recall that (4.7) is always satisfied with $\tilde V(x) = 1$ for all $x \in G$. However, in this case it seems more difficult to control $\gamma$. If some additional knowledge in form of a Lyapunov function $V : G \to [1, \infty)$ of $P_\alpha$, i.e., $P_\alpha V(x) \le \delta V(x) + L$ for some $\delta \in (0, 1)$ and $L \in (0, \infty)$, is available, then a non-trivial candidate for $\tilde V$ is $V$. For sufficiently small
\[
\delta_V = \sup_{z \in G} \int_G \Big( \frac{V(y)}{V(z)} + 1 \Big) E(z, y)\, Q(z, dy)
\]
this is indeed true. Namely, we have
\[
|(P_\alpha - P_{\tilde\alpha}) V(x)| \le \int_G V(y)\, E(x, y)\, Q(x, dy) + V(x) \int_G E(x, y)\, Q(x, dy) \le V(x)\, \delta_V.
\]
Then, $P_{\tilde\alpha} V(x) \le (\delta + \delta_V) V(x) + L$ and whenever $\delta + \delta_V < 1$ it is clear that condition (4.7) is verified.
To highlight the usefulness of a non-trivial Lyapunov function, we consider the fol-
lowing scenario which is related to a local perturbation of an independent Metropolis-
Hastings algorithm.

Example 4.1. Let us assume that for $P_\alpha$ Assumption 2.1, as formulated in Corollary 4.1, is satisfied. For some probability measure $\mu$ on $(G, \mathcal{B}(G))$ define $Q(x, \cdot) = \mu$ and $p_0 = \tilde p_0 = \mu$. For $\tilde G \subseteq G$ let
\[
\tilde\alpha(x, y) = \min\{1, \alpha(x, y) + 1_{\tilde G}(x)\}.
\]
Hence, for $x \in \tilde G$ the transition kernel $P_{\tilde\alpha}(x, \cdot)$ accepts any proposed state and for $x \notin \tilde G$ we have $P_{\tilde\alpha}(x, \cdot) = P_\alpha(x, \cdot)$. It is easily seen that $E(x, y) \le 1_{\tilde G}(x)$. For arbitrary $R > 0$ and $r \in (0, 1)$ set $\tilde V(x) = 1 + R\, 1_{\tilde G}(x)$ and note that
\[
P_{\tilde\alpha} \tilde V(x) \le r \tilde V(x) + 1 - r + R\, P_{\tilde\alpha}(x, \tilde G) \le r \tilde V(x) + 1 - r + R\, \mu(\tilde G).
\]


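The last inequality of the previous formula follows by distinguishing the cases $x \in \tilde G$ and $x \notin \tilde G$. Define $D(\tilde G) = \sup_{x \in \tilde G} \int_G d(x, y)\, \mu(dy)$ and observe
\[
\kappa = 1 + \frac{R\, \mu(\tilde G)}{1 - r}, \qquad \text{and} \qquad \gamma \le \frac{D(\tilde G)}{1 + R}.
\]
Then, Corollary 4.1 leads to
\[
W(p_0 P_\alpha^n, p_0 P_{\tilde\alpha}^n) \le \frac{C}{1 - \rho} \Big( 1 + \frac{R\, \mu(\tilde G)}{1 - r} \Big) \frac{D(\tilde G)}{1 + R}
\]
for arbitrary $R \in (0, \infty)$ and $r \in (0, 1)$. Under the assumption that $D(\tilde G)$ is finite and letting $R \to \infty$ as well as $r \downarrow 0$ we obtain
\[
W(p_0 P_\alpha^n, p_0 P_{\tilde\alpha}^n) \le \frac{C\, \mu(\tilde G)\, D(\tilde G)}{1 - \rho},
\]
which tells us that basically $\mu(\tilde G)$ measures the difference of the distributions. A small perturbation set $\tilde G$ with respect to $\mu$ thus implies a small bias. In contrast, with the trivial Lyapunov function $\tilde V = 1$, and if there is $(x, y) \in \tilde G \times G$ such that $\alpha(x, y) = 0$, we only obtain
\[
\gamma \kappa = D(\tilde G) \ge \inf_{x \in G} \int_G d(x, y)\, \mu(dy).
\]
The resulting upper bound on $W(p_0 P_\alpha^n, p_0 P_{\tilde\alpha}^n)$ will typically be bounded away from zero regardless of the set $\tilde G$.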
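
The effect of the free parameters $R$ and $r$ in Example 4.1 can be made visible numerically. The sketch below evaluates the bound before taking limits; the function name and all constants in the example are hypothetical.

```python
def example_4_1_bound(C, rho, mu_G, D_G, R, r):
    """Bound of Example 4.1 for given R >= 0 and r in (0, 1).

    mu_G stands for mu(G~) and D_G for D(G~).
    """
    kappa = 1.0 + R * mu_G / (1.0 - r)
    gamma_bound = D_G / (1.0 + R)
    return C / (1.0 - rho) * kappa * gamma_bound


# Illustrative constants: C = 1, rho = 0.5, mu(G~) = 0.01, D(G~) = 2.
limit = 1.0 * 0.01 * 2.0 / (1.0 - 0.5)   # C mu(G~) D(G~) / (1 - rho)
near_limit = example_4_1_bound(1.0, 0.5, 0.01, 2.0, R=1e8, r=1e-8)
trivial = example_4_1_bound(1.0, 0.5, 0.01, 2.0, R=0.0, r=0.5)
```

Setting `R = 0` corresponds to the trivial Lyapunov function and gives a bound that does not depend on $\mu(\tilde G)$ at all, whereas large `R` and small `r` approach the limit $C\mu(\tilde G)D(\tilde G)/(1-\rho)$.
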

Remark 4.1. The constant $\gamma$ essentially depends on the distance $d(x, y)$ and the difference of the acceptance probabilities $E(x, y)$. By applying the Cauchy-Schwarz inequality to the numerator of $\gamma$, we can separate the two parts, i.e.,
\[
\int_G d(x, y)\, E(x, y)\, Q(x, dy) \le \Big( \int_G d(x, y)^2\, Q(x, dy) \Big)^{1/2} \cdot \Big( \int_G E(x, y)^2\, Q(x, dy) \Big)^{1/2}.
\]
If both integrals remain finite we see that an appropriate control of $E(x, y)$ suffices for making the constant $\gamma$ small.

Remark 4.2. By using a Hoeffding-type bound, in Bardenet et al. [3, Lemma 3.1.] it
is shown that for their version of the approximate Metropolis-Hastings algorithm with
adaptive subsampling the approximation error E(x, y) is bounded uniformly in x and y
by a constant s > 0. Moreover, s can be chosen arbitrarily small for the implementation
of the algorithm.

Now we consider the case where the unperturbed transition kernel Pα is geometrically
ergodic. Motivated by Remark 4.2, we also assume that E(x, y) ≤ s for a sufficiently
small number s > 0. The following corollary generalizes a main result of Bardenet et al.
[3, Proposition 3.2] to the geometrically ergodic case.

imsart-bj ver. 2014/10/16 file: wasserstein_style1601.tex date: February 27, 2017


Corollary 4.2. Let $Q$ be a transition kernel on $(G, \mathcal{B}(G))$ and let $\alpha : G \times G \to [0, 1]$ and $\tilde\alpha : G \times G \to [0, 1]$ be measurable functions. By $P_\alpha$ and $P_{\tilde\alpha}$ we denote the transition kernels of the form (4.6) with acceptance probabilities $\alpha$ and $\tilde\alpha$. Let the following conditions be satisfied:
• The unperturbed transition kernel $P_\alpha$ is $V$-uniformly ergodic, that is,
\[
\|P_\alpha^n(x, \cdot) - \pi\|_V \le C\, V(x)\, \rho^n, \qquad x \in G,\ n \in \mathbb{N},
\]
for numbers $\rho \in [0, 1)$, $C \in (0, \infty)$ and a measurable function $V : G \to [1, \infty)$. Moreover, $V$ is a Lyapunov function of $P_\alpha$, i.e.,
\[
(P_\alpha V)(x) \le \delta V(x) + L, \tag{4.9}
\]
for numbers $\delta \in (0, 1)$ and $L \in (0, \infty)$.
• A uniform bound $s > 0$ on the difference of the acceptance probabilities is given, that is, for all $x, y \in G$, we have
\[
E(x, y) = |\alpha(x, y) - \tilde\alpha(x, y)| \le s.
\]
• The constant $\lambda$ satisfies
\[
\lambda = 1 + \sup_{x \in G} \int_G \frac{V(y)}{V(x)}\, Q(x, dy) < \infty.
\]
If $s < (1 - \delta)/\lambda$, then, for any $p_0 \in \mathcal{P}$ with finite $\kappa = \max\big\{ p_0(V), \frac{L}{1 - \delta - \lambda s} \big\}$ we have
\[
\|p_0 P_\alpha^n - p_0 P_{\tilde\alpha}^n\|_V \le \frac{\lambda\, s\, \kappa\, C\, (1 - \rho^n)}{1 - \rho}.
\]

Proof. We consider the metric $d_V$, defined in Lemma 3.1, set $V = \tilde V$ and use $E(x, y) \le s$ so that it is easily seen that the constant $\gamma$ from Corollary 4.1 satisfies $\gamma \le s\lambda$. From the proof of Corollary 3.4, we know that $V$ is a Lyapunov function of $P_{\tilde\alpha}$ provided that $\gamma + \delta < 1$. Thus, we have
\[
P_{\tilde\alpha} V(x) \le (\delta + \lambda s)\, V(x) + L. \tag{4.10}
\]
Now if $s < (1 - \delta)/\lambda$, then $\delta + \lambda s < 1$ and the assertion follows from Corollary 4.1 by writing the Wasserstein distances in terms of $V$-norms as in Section 3.2.

Remark 4.3. Without V (x) in the denominator, i.e., if we had relied on Corollary 3.2
instead of Theorem 3.1, the constant λ would often be infinite. Consider the following toy
example: Let π be the exponential distribution with density exp(−x) on G = [0, ∞) and
assume that Q(x, dy) is a uniform proposal with support [x−1, x+1]. With V (x) = exp(x)


it is well known that the Metropolis-Hastings algorithm is $V$-uniformly ergodic, see [30] or [37, Example 4]. In this example
\[
\lambda \le 1 + \sup_{x \in [0, \infty)} \int_{x-1}^{x+1} \exp(y - x)\, dy \le 1 + \exp(1),
\]
whereas $\int_{x-1}^{x+1} \exp(y)\, dy$ is unbounded in $x$. Notice that $\lambda$ only depends on the unperturbed Markov chain so that a bound on $\lambda$ can be combined with any approximation.
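
The two integrals in this toy example can be evaluated in closed form. The computation below is a short worked check; like the display in the remark, it ignores the normalization of the uniform proposal and boundary effects near zero.

```latex
\[
\int_{x-1}^{x+1} \exp(y - x)\, dy = \exp(1) - \exp(-1) \approx 2.35 \le \exp(1),
\qquad
\int_{x-1}^{x+1} \exp(y)\, dy = \exp(x)\,\big(\exp(1) - \exp(-1)\big) \xrightarrow[x \to \infty]{} \infty .
\]
```

The first integral does not depend on $x$, which is exactly the effect of dividing by $V(x)$.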

Remark 4.4. Let $P_{\tilde\alpha}$ and $P_\alpha$ be $\varphi$-irreducible and aperiodic. Then, one can prove under the assumptions of Corollary 4.2 that $P_{\tilde\alpha}$ is $V$-uniformly ergodic if $s$ is sufficiently small. To see this, note that by [31, Theorem 16.0.1] the $V$-uniform ergodicity of $P_\alpha$ implies that $P_\alpha$ satisfies their drift condition (V4). By the arguments stated in the proof of Corollary 3.4, one obtains that $P_{\tilde\alpha}$ also satisfies (V4) for sufficiently small $s$ and this implies $V$-uniform ergodicity. In this case, clearly $P_{\tilde\alpha}$ possesses a stationary distribution, say $\tilde\pi$, and
\[
\|\pi - \tilde\pi\|_V \le \frac{\lambda s C}{1 - \rho} \cdot \frac{L}{1 - \delta - \lambda s}.
\]
The previous inequality follows by (3.5) and the fact that
\[
\|\pi - \tilde\pi\|_V \le \pi(V) + \tilde\pi(V) < \infty.
\]
Here the finiteness of $\pi(V)$ follows by the $V$-uniform ergodicity of $P$ and $\tilde\pi(V) \le L/(1 - \delta - \lambda s)$ follows by (4.10) and [16, Proposition 4.24].

4.3. Noisy Langevin algorithm for Gibbs random fields

An alternative to the Metropolis-Hastings algorithm is the Langevin algorithm, see [39].
Unfortunately, in its implementation one needs the gradient of the density of the target
distribution. To overcome this problem, different approximate Langevin algorithms have
been proposed and studied, see [1, 2, 43, 47].
This section is mainly based on Alquier et al. [2, Section 3.4] where a noisy Langevin
algorithm for Gibbs random fields is considered. We provide a quantitative version of
[2, Theorem 3.2]. The setting is as follows. Let $\mathcal{Y}$ be a finite set and with $M \in \mathbb{N}$ let $y = \{y_1, \dots, y_M\} \in \mathcal{Y}^M$ be an observed data set on nodes $\{1, \dots, M\}$ of a certain graph. The likelihood of $y$ with parameter $\theta \in \mathbb{R}$ is defined by
\[
\ell(y \mid \theta) = \frac{\exp(\theta\, s(y))}{\sum_{y \in \mathcal{Y}^M} \exp(\theta\, s(y))},
\]

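where $s : \mathcal{Y}^M \to \mathbb{R}$ is a given statistic. The density of the posterior distribution with respect to the Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ given the data $y \in \mathcal{Y}^M$ is determined by
\[
\pi_y(\theta) := \pi(\theta \mid y) \propto \ell(y \mid \theta)\, p(\theta)
\]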


where the prior density $p(\theta)$ is the Lebesgue density of the normal distribution $N(0, \sigma_p^2)$ with $\sigma_p > 0$.
We consider the Langevin algorithm, a first order Euler discretization of the SDE of the Langevin diffusion, see [39]. It is given by $(X_n)_{n \in \mathbb{N}_0}$ with
\[
X_n = X_{n-1} + \frac{\sigma^2}{2} \nabla \log \pi_y(X_{n-1}) + Z_n, \qquad n \in \mathbb{N}. \tag{4.11}
\]
Here $X_0$ is a real-valued random variable and $(Z_n)_{n \in \mathbb{N}}$ is an i.i.d. sequence of random variables, independent of $X_0$, with $Z_n \sim N(0, \sigma^2)$ for a parameter $\sigma > 0$ which can be interpreted as the step size in the discretization of the diffusion. It is easily seen that $(X_n)_{n \in \mathbb{N}_0}$ is a Markov chain with transition kernel
\[
P_\sigma(\theta, A) = \int_{\mathbb{R}} 1_A\Big( \theta + \frac{\sigma^2}{2} \nabla \log \pi_y(\theta) + z \Big)\, N(0, \sigma^2)(dz), \qquad A \in \mathcal{B}(\mathbb{R}).
\]

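In general $\pi_y$ is not a stationary distribution of $P_\sigma$, but there exists a stationary distribution (see Proposition 4.1 below), say $\pi_\sigma$, which is close to $\pi_y$ depending on $\sigma$. Let $z(\theta) = \sum_{y \in \mathcal{Y}^M} \exp(\theta\, s(y))$; then, by the definition of $\pi_y$ we have
\[
\begin{aligned}
\log \pi_y(\theta) &= \theta\, s(y) - \log z(\theta) + \log p(\theta) - \log\Big( \int_{\mathbb{R}} \ell(y \mid z)\, p(z)\, dz \Big), \\
\nabla \log \pi_y(\theta) &= s(y) - \frac{z'(\theta)}{z(\theta)} + \nabla \log p(\theta) \\
&= s(y) - \frac{\sum_{z \in \mathcal{Y}^M} s(z) \exp(\theta\, s(z))}{\sum_{z \in \mathcal{Y}^M} \exp(\theta\, s(z))} - \frac{\theta}{\sigma_p^2} \\
&= s(y) - E_{\ell(\cdot \mid \theta)}\, s(Y) - \frac{\theta}{\sigma_p^2},
\end{aligned}
\]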

where $Y$ is a random variable on $\mathcal{Y}^M$ distributed according to the likelihood distribution determined by $\ell(\cdot \mid \theta)$. We do not have access to the exact value of the mean $E_{\ell(\cdot \mid \theta)}\, s(Y)$ since in general we do not know the normalizing constant of the likelihood. We assume that we can use a Monte Carlo estimate. For $N \in \mathbb{N}$ let $(Y_i)_{1 \le i \le N}$ be an i.i.d. sequence of random variables with $Y_i \sim \ell(\cdot \mid \theta)$ independent of $(Z_n)_{n \in \mathbb{N}}$ from (4.11). Then, $\frac{1}{N} \sum_{i=1}^N s(Y_i)$ is an approximation of $E_{\ell(\cdot \mid \theta)}\, s(Y)$ which leads to an estimate of $\nabla \log \pi_y(\theta)$ given by
\[
\widehat{\nabla}^N \log \pi_y(\theta) := s(y) - \frac{1}{N} \sum_{i=1}^N s(Y_i) - \frac{\theta}{\sigma_p^2}.
\]

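We substitute $\nabla \log \pi_y(\theta)$ by $\widehat{\nabla}^N \log \pi_y(\theta)$ in (4.11) and obtain a sequence of random variables $(\tilde X_n)_{n \in \mathbb{N}_0}$ defined by
\[
\begin{aligned}
\tilde X_n &= \tilde X_{n-1} + \frac{\sigma^2}{2}\, \widehat{\nabla}^N \log \pi_y(\tilde X_{n-1}) + Z_n \\
&= \Big( 1 - \frac{\sigma^2}{2\sigma_p^2} \Big) \tilde X_{n-1} + \frac{\sigma^2}{2} \Big( s(y) - \frac{1}{N} \sum_{i=1}^N s(Y_i) \Big) + Z_n.
\end{aligned}
\]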

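The sequence $(\tilde X_n)_{n \in \mathbb{N}_0}$ is again a Markov chain with transition kernel
\[
P_{\sigma,N}(\theta, A) = \int_{\mathbb{R}} \sum_{(y_1', \dots, y_N') \in \mathcal{Y}^{MN}} 1_A\Big( \Big( 1 - \frac{\sigma^2}{2\sigma_p^2} \Big) \theta + \frac{\sigma^2}{2} \Big( s(y) - \frac{1}{N} \sum_{i=1}^N s(y_i') \Big) + z \Big) \times \prod_{i=1}^N \ell(y_i' \mid \theta)\, N(0, \sigma^2)(dz)
\]
for $\theta \in \mathbb{R}$ and $A \in \mathcal{B}(\mathbb{R})$. Let us state a transition of this noisy Langevin Markov chain according to $P_{\sigma,N}$ in algorithmic form.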

Algorithm 4.3. A single transition from $\tilde X_n$ to $\tilde X_{n+1}$ works as follows:
1. Draw an i.i.d. sequence $(Y_i)_{1 \le i \le N}$ with $Y_i \sim \ell(\cdot \mid \tilde X_n)$, call the result $(y_1', \dots, y_N')$;
2. Calculate
\[
\widehat{\nabla}^N \log \pi_y(\tilde X_n) := s(y) - \frac{1}{N} \sum_{i=1}^N s(y_i') - \frac{\tilde X_n}{\sigma_p^2};
\]
3. Draw $Z_n \sim N(0, \sigma^2)$, independent from step 1., call the result $z_n$. Set
\[
\tilde X_{n+1} = \tilde X_n + \frac{\sigma^2}{2}\, \widehat{\nabla}^N \log \pi_y(\tilde X_n) + z_n.
\]

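
The transition of Algorithm 4.3 can be sketched in a few lines. This is an illustration under assumptions: the names `noisy_gradient`, `noisy_langevin_step` and `bernoulli_stat` are hypothetical, the drift factor `sigma**2 / 2` is taken from the expanded form of the recursion for the perturbed chain, and the sampler assumes a toy model with $M$ independent binary nodes and $s(y)$ equal to the number of ones, for which $\ell(\cdot \mid \theta)$ factorizes into Bernoulli draws.

```python
import math
import random


def noisy_gradient(theta, s_obs, sampled_stats, sigma_p):
    """Monte Carlo estimate s(y) - mean(s(y'_i)) - theta / sigma_p^2 (step 2)."""
    return s_obs - sum(sampled_stats) / len(sampled_stats) - theta / sigma_p ** 2


def noisy_langevin_step(theta, s_obs, sample_stat, N, sigma, sigma_p, rng):
    """One transition following Algorithm 4.3."""
    stats = [sample_stat(theta, rng) for _ in range(N)]    # step 1
    grad = noisy_gradient(theta, s_obs, stats, sigma_p)    # step 2
    z = rng.gauss(0.0, sigma)                              # step 3: Z ~ N(0, sigma^2)
    return theta + 0.5 * sigma ** 2 * grad + z


# Assumed toy model: M independent binary nodes, s(y) = number of ones, so
# each node equals one with probability exp(theta) / (1 + exp(theta)).
def bernoulli_stat(theta, rng, M=10):
    p = 1.0 / (1.0 + math.exp(-theta))
    return float(sum(rng.random() < p for _ in range(M)))


rng = random.Random(0)
theta = 0.0
for _ in range(50):
    theta = noisy_langevin_step(theta, s_obs=7.0, sample_stat=bernoulli_stat,
                                N=100, sigma=0.5, sigma_p=2.0, rng=rng)
```

In realistic Gibbs random fields, exact sampling from $\ell(\cdot \mid \theta)$ is the expensive part; the toy sampler only stands in for that step.
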
From [2, Lemma 3] and by applying arguments of [39], we obtain the following facts
about the noisy Langevin algorithm.

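Proposition 4.1. Let $\|s\|_\infty = \sup_{z \in \mathcal{Y}^M} |s(z)|$ be finite with $\|s\|_\infty > 0$, let $V : \mathbb{R} \to [1, \infty)$ be given by $V(\theta) = 1 + |\theta|$ and assume that $\sigma^2 < 4\sigma_p^2$. Then
1. the function $V$ is a Lyapunov function for $P_\sigma$ and $P_{\sigma,N}$. We have
\[
P_\sigma V(\theta) \le \delta V(\theta) + L\, 1_I(\theta), \qquad P_{\sigma,N} V(\theta) \le \delta V(\theta) + L\, 1_I(\theta) \tag{4.12}
\]
with $\delta = 1 - \frac{\sigma^2}{4\sigma_p^2}$, $L = \sigma + \sigma^2 \|s\|_\infty + \frac{\sigma^2}{2\sigma_p^2}$ and the interval
\[
I = \Big\{ \theta \in \mathbb{R} : |\theta| \le 1 + 4\sigma_p^2 \|s\|_\infty + \frac{4\sigma_p^2}{\sigma} \Big\}.
\]
2. there are distributions $\pi_\sigma$ and $\pi_{\sigma,N}$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ which are stationary with respect to $P_\sigma$ and $P_{\sigma,N}$, respectively.
3. the transition kernels $P_\sigma$ and $P_{\sigma,N}$ are $V$-uniformly ergodic.
4. for $N > 4 \max\{ \|s\|_\infty^2 \sigma^4, \|s\|_\infty^{-3} \sigma^{-6} \}$ we have
\[
\sup_{\theta \in \mathbb{R}} \|P_\sigma(\theta, \cdot) - P_{\sigma,N}(\theta, \cdot)\|_{\mathrm{tv}} \le 6 \max\{ \|s\|_\infty \sigma^2, \|s\|_\infty^{-2} \sigma^{-4} \}\, \frac{\log(N)}{N}. \tag{4.13}
\]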

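Proof. We use the same arguments as in [39, Section 3.1]. One can easily see that the Markov chains $(X_n)_{n \in \mathbb{N}_0}$ and $(\tilde X_n)_{n \in \mathbb{N}_0}$ are irreducible with respect to the Lebesgue measure and weak Feller. Thus, all compact sets are petite, see [31, Proposition 6.2.8]. Hence, for the existence of stationary distributions, say $\pi_\sigma$ and $\pi_{\sigma,N}$, [31, Theorem 12.3.3], as well as for the $V$-uniform ergodicity [31, Theorem 16.0.1] it is enough to show that $V$ satisfies (4.12). With $Z \sim N(0, \sigma^2)$, we have
\[
\begin{aligned}
P_\sigma V(\theta) &\le \Big( 1 - \frac{\sigma^2}{2\sigma_p^2} \Big) V(\theta) + \frac{\sigma^2}{2\sigma_p^2} + \frac{\sigma^2}{2} \big| s(y) - E_{\ell(\cdot \mid \theta)}\, s(Y) \big| + E|Z| \\
&\le \Big( 1 - \frac{\sigma^2}{2\sigma_p^2} \Big) V(\theta) + \frac{\sigma^2}{2\sigma_p^2} + \sigma^2 \|s\|_\infty + \sigma \\
&\le \Big( 1 - \frac{\sigma^2}{2\sigma_p^2} \Big) V(\theta) + \max\Big\{ \frac{\sigma^2}{4\sigma_p^2} V(\theta),\ \frac{\sigma^2}{2\sigma_p^2} + \sigma^2 \|s\|_\infty + \sigma \Big\} \\
&\le \Big( 1 - \frac{\sigma^2}{4\sigma_p^2} \Big) V(\theta) + \Big( \frac{\sigma^2}{2\sigma_p^2} + \sigma^2 \|s\|_\infty + \sigma \Big) \cdot 1_I(\theta).
\end{aligned}
\]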

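By the fact that
\[
E\Big[ \Big| s(y) - \frac{1}{N} \sum_{i=1}^N s(Y_i) \Big| \;\Big|\; X_n = \theta \Big] \le 2 \|s\|_\infty
\]
we obtain with the same arguments that
\[
P_{\sigma,N} V(\theta) \le \delta V(\theta) + L \cdot 1_I(\theta).
\]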

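Thus, the assertions from 1. to 3. are proven. The statement of 4. is a consequence of [2, Lemma 3]. There it is shown that for $N > 4\|s\|_\infty^2 \sigma^4$ it holds that
\[
\sup_{\theta \in \mathbb{R}} \|P_\sigma(\theta, \cdot) - P_{\sigma,N}(\theta, \cdot)\|_{\mathrm{tv}} \le \exp\Big( \frac{\log(N)}{4N \|s\|_\infty^2 \sigma^4} \Big) - 1 + \frac{4\sqrt{\pi}\, \|s\|_\infty \sigma^2}{N}.
\]
By using $\exp(\theta) - 1 \le \theta \exp(\theta)$ and $N > 4$ we further estimate the right-hand side by
\[
\Big( \frac{K_{N,s,\sigma}}{4\|s\|_\infty^2 \sigma^4} + \frac{4\sqrt{\pi}\, \|s\|_\infty \sigma^2}{\log(5)} \Big) \cdot \frac{\log(N)}{N} \qquad \text{with} \qquad K_{N,s,\sigma} = \exp\Big( \frac{\log(N)}{4N \|s\|_\infty^2 \sigma^4} \Big).
\]
Since $\log(N) \cdot N^{-1/3} < 2$, we have the bound $K_{N,s,\sigma} \le \exp(1)$ provided that $4N^{2/3} \|s\|_\infty^2 \sigma^4 \ge 2$ which follows from $N \ge \|s\|_\infty^{-3} \sigma^{-6}$. The assertion of (4.13) follows now by a simple calculation.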


By using the facts collected in the previous proposition, we can apply the perturba-
tion bound of Theorem 3.2 and obtain a quantitative perturbation bound for the noisy
Langevin algorithm.

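Corollary 4.3. Let $p_0$ be a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and set $p_n = p_0 P_\sigma^n$ as well as $\tilde p_{n,N} = p_0 P_{\sigma,N}^n$. Suppose that $\sigma^2 < 4\sigma_p^2$. Then, there are numbers $\rho \in [0, 1)$ and $C \in (0, \infty)$, independent of $n$, $N$, determining
\[
R := \frac{18 \max\{ \|s\|_\infty \sigma^2, \|s\|_\infty^{-2} \sigma^{-4} \}}{1 - \rho} \cdot \Big( 2 + \max\big\{ E_{p_0}|X|,\ 4\sigma_p^2 (\|s\|_\infty + \sigma^{-1}) \big\} \Big)
\]
with $E_{p_0}|X| = \int_{\mathbb{R}} |\theta|\, dp_0(\theta)$, so that for $N > 90 \max\{ \|s\|_\infty^2 \sigma^4, \|s\|_\infty^{-3} \sigma^{-6} \}$ we have
\[
\max\{ \|p_n - \tilde p_{n,N}\|_{\mathrm{tv}}, \|\pi_\sigma - \pi_{\sigma,N}\|_{\mathrm{tv}} \} \le R \cdot \big( 2C (\sigma + \sigma^2 \|s\|_\infty + 3) \big)^{2/\log(N)}\, \frac{\log(N)^2}{N}.
\]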

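Proof. We have by Proposition 4.1 that $P_\sigma$ is $V$-uniformly ergodic with $V(\theta) = 1 + |\theta|$, i.e., there are numbers $\rho \in [0, 1)$ and $C \in (0, \infty)$ such that
\[
\sup_{\theta \in \mathbb{R}} \frac{\|P_\sigma^n(\theta, \cdot) - \pi_\sigma\|_V}{V(\theta)} \le C \rho^n.
\]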

Now, by combining Theorem 3.2 and Remark 3.8 with the results from Proposition 4.1
we obtain the result.

Remark 4.5. We want to point out that the assumptions imposed are the same as in
[2, Theorem 3.2], but instead of the asymptotic result we provide an explicit estimate.
The numbers ρ ∈ [0, 1) and C ∈ (0, ∞) are not stated in terms of the model parameters.
In principle, these values can be derived from the drift condition (4.12) through [5,
Theorem 1.1].

We thank Alexander Mitrophanov and the referees for their valuable comments which
helped to improve the paper. D.R. was supported by the DFG Research Training Group
