1 Introduction

Bayesian learning and inference have received considerable attention in the literature as a principled way to avoid over-fitting and/or to quantify uncertainties in an integrated fashion, by estimating the posterior distribution of the model parameters rather than a point estimate. However, analytical computation of the exact posterior, or sampling from it, is often impossible due to the intractability of the evidence. Therefore, Bayesian inference often relies on sampling methods broadly known as Markov Chain Monte Carlo (MCMC). To this effect, consider the following computation of a Bayesian posterior across a decentralized data set:

$$\begin{aligned} \pi (\omega )= p(\omega \vert \{X,Y\}) \propto \prod \limits _{i=1}^m p(\{X_i,Y_i\}\vert \omega )\pi _0 (\omega ), \end{aligned}$$
(1)

where \(i\in [m]\) indicates the index from a set of partitioned data, distributed across a set of agents connected on a communication graph. Each agent has access to data samples \(\{X_i,Y_i\}\) that are in general expected to have different distributions; in particular, the labels may be distributed heterogeneously across agents. Nevertheless, we wish for every agent to be able to make inferences on all types of samples.

When the likelihood function \(p(\{X_i,Y_i\}\vert \omega )\) arises from M-estimation for an exponential family and the prior is conjugate, the resulting distribution \(\pi (\omega )\) is known to be of Gibbs form, \(\pi (\omega ) \propto {e^{-U(\omega )}}\), where U is referred to as a potential function for \(\omega\). This makes the process amenable to techniques arising in stochastic differential equations (SDEs), namely Langevin dynamics and Hamiltonian Monte Carlo, which implement a discretization of certain SDEs. The stationary distribution of these processes with the gradient \(\nabla U(\omega )\) as the drift term in the SDE is precisely the Gibbs distribution above. For high dimensional parameter vectors, these approaches have been shown theoretically and numerically to scale better. Ergodicity of these methods was established in Roberts and Tweedie (1996) and Durmus et al. (2017). For this class of methods, asymptotic convergence to the stationary distribution, regardless of its form, has been established, as well as geometric convergence in measure when the target distribution \(\pi (\omega ) \propto {e^{-U(\omega )}}\) is strongly log-concave, i.e., U is strongly convex and smooth. However, a Metropolis acceptance-rejection step is necessary to correct for the discretization bias.

Recently, the so-called “stochastic gradient" Langevin Monte Carlo (SGLD) and Hamiltonian Monte Carlo (SGHMC) approaches have become popular with the seminal work of Welling and Teh (2011) on SGLD; see Chen et al. (2014) for a popular work on SGHMC specifically. This line of work was developed for “big-data" contexts wherein it is preferable to subsample from the dataset at each iteration, making these methods resemble stochastic gradient descent. By incorporating a diminishing stepsize akin to stochastic approximation methods in the implementation of the discretization, the Metropolis adjustment is no longer necessary. One favorable feature of this class of algorithms is their correspondence to simulated annealing, and with a carefully diminishing step-size and injected noise, non-asymptotic guarantees with respect to global optimality can be achieved; see for instance the recent works (Gao et al., 2021; Akyildiz & Sabanis, 2020; Chau & Rásonyi, 2022). With respect to sampling, theoretical guarantees for SGHMC come in the form of 1) convergence to the stationary distribution with a diminishing stepsize and log-concave distributions (or a moderate relaxation thereof to a log Sobolev or Poincaré inequality), e.g., Teh et al. (2016), Zhang et al. (2017), Chen et al. (2020) and Zou and Gu (2021), or 2) with a constant step-size and dissipativity, guarantees of a bound on the bias that increases with the step-size, e.g., Gao et al. (2021) and Akyildiz and Sabanis (2020). This contrasts with the Metropolis-adjusted variant, which guarantees exact convergence to the desired distribution. Note that many contemporary problems of interest, such as Bayesian neural networks, are not expected to satisfy even a log-Sobolev inequality. However, although the geometric (for log Sobolev distributions) and general (otherwise) ergodicity of Metropolis-adjusted HMC is a stronger result, beyond Durmus et al. (2017) the convergence analyses in the literature are scant, suggesting that overall the method is not as readily amenable to thorough analysis. This is due to the heterogeneous multi-step kernel associated with the Metropolis-adjusted HMC operation, as opposed to the simpler SGHMC, which can also draw on extensive convergence theory for stochastic gradient approaches to optimization. In this paper we adopt the proof techniques introduced in Bou-Rabee et al. (2020) for inexact HMC to show ergodicity of the stochastic process generated by our algorithm.

In comparing the two empirically, broadly speaking it has been observed that HMC can perform better than SGHMC in terms of characterizing the uncertainty, which can be seen visually as mapping the posterior more faithfully far from the mode, and in performance metrics by considering statistical quality (log likelihood) as opposed to optimization criteria (RMSE) (Izmailov et al., 2021; Cobb & Jalaian, 2020). An intuitive argument from first principles of HMC as to why subsampling is expected to degrade performance is given in Betancourt (2015). Furthermore, SGHMC, like SG methods in general, suffers from significant hyperparameter sensitivity: the learning rate schedule must often be tuned for each problem in order to observe strong performance. On the other hand, in the high data volume regime it becomes unclear how to scale HMC with Metropolis adjustment, as even the acceptance-rejection step requires evaluation of the posterior on all of the samples, and so guarantees without using the entire dataset at every iteration are scant.

Thus the constant stepsize Metropolis-adjusted Langevin and HMC methods, as compared to their diminishing stepsize stochastic gradient variants, stand as complementary techniques with distinct comparative advantages and disadvantages. Namely, for handling the computational load of large datasets and obtaining accurate point estimates, the stochastic gradient variants are to be preferred; however, when statistical accuracy of the entire posterior is prioritized or the potential exhibits nonconvexities, the constant stepsize Metropolis-adjusted variants are the more appropriate choice. Overall, we wish to emphasize that all of the various schemes for sampling, full-batch or stochastic, adjusted or unadjusted, have devoted literature analyzing their performance in the centralized setting. Practitioners and researchers may prefer one, but there are variants of all that perform sampling at a high level of performance and efficiency, and we make no claim that one is inherently superior to the others. Instead, in order to endow a statistician with the most comprehensive toolbox available, with the most fine-tuned theoretical understanding, all methods should be analyzed in detail in various settings. This work presents a scheme indicating how one can perform full (adjusted) Hamiltonian Monte Carlo on decentralized networks, complementing similar works studying SG schemes.

In this paper, we consider sampling in a decentralized architecture motivated by extensions of federated learning, where the decentralized Bayesian extension (Lalitha et al., 2019), with a variational inference approach, is shown to provide further robustness. We also mention Shoham et al. (2019) as reviewing the difficulties associated with scaling federated learning. Note that we are considering specifically a decentralized architecture, i.e., one in which there is no central node that coordinates and distributes the work, and not the “hub-spoke"/“parameter-server"/“master-worker" architecture of distributed computing. Thus, the recently popular “divide-and-conquer" approach (e.g., Hsieh et al. (2014), and Mesquita et al. (2020) for a contemporary Bayesian example) is incomparable. The agents never aggregate all information centrally and can only communicate with specific neighbors as indicated on the graph, with the ultimate goal of sampling from the same global distribution.

By using a schema incorporating methods amenable to high parameter dimension, we hope to develop an algorithm with which every agent on the network successfully samples from high dimensional distributions. As such, our work is contextualized within a handful of studies focusing on SDE methods for decentralized sampling. In particular, Kungurtsev (2020) provided convergence and consensus guarantees for decentralized Langevin dynamics when the posterior is strongly convex. A diminishing stepsize is needed to ensure asymptotic consensus, placing this work conceptually within the stochastic gradient variants. Independently, Gürbüzbalaban et al. (2020) studied stochastic gradient Langevin dynamics, using different proof techniques, as well as stochastic gradient Hamiltonian dynamics, again for the setting wherein U is strongly convex. An extension to non-log-concave potentials satisfying a Sobolev inequality was given in Parayil et al. (2020). More recently, directed graphs were considered in Kolesov and Kungurtsev (2021). The theoretical guarantees in these works take the optimization-over-measures approach to the analysis, assumptions on the growth of the potentials (either strong convexity or a Sobolev inequality) are required, and in most cases a diminishing stepsize is used. Since a Metropolis adjustment step is not present in any of these methods, this is to be expected.

In this paper, we complete the program of extending SDE based sampling methods to the decentralized context by presenting the first algorithm, together with guarantees, for implementing sampling from the constant stepsize Metropolis-adjusted framework on a decentralized platform. This presents two significant methodological challenges: achieving consensus without a diminishing stepsize and performing a Metropolis acceptance-rejection procedure in a decentralized way. By carefully incorporating techniques inspired by the decentralized optimization literature, along with developing a novel method for approximating the Metropolis acceptance ratio, we present an algorithm that performs HMC for decentralized data. Conceptually, we seek to achieve that each agent asymptotically performs HMC with gradient evaluations on the entire data set, through gradually diffusing information. Theoretically, we prove a bound on the expected L2 error between our algorithm and the chain generated by the classical HMC kernel, quantitatively bounding the discrepancy in terms of the step-size and the total number of sampling iterations. Thus, modulo some adjustable level of error relative to the number of samples, our algorithm approximates the probabilistic convergence properties of HMC, as for instance derived in Durmus et al. (2017), which showed ergodicity for general potentials and geometric ergodicity for strongly convex potentials.

The salient technical features of our proposed solution consist of three methodological novelties. First, we integrate the concept of (gradient) tracking, as appearing in decentralized optimization (Di Lorenzo & Scutari, 2016; Pu & Nedić, 2020), into sampling for the first time, and indicate its analogous benefit of allowing each agent to asymptotically perform, effectively, optimization on the same global function. This is in contrast to a repeated “zig-zag" pattern of local optimization and consensus pushing in disparate directions at each iterate; avoiding this is even more crucial for sampling, since effectively every iterate is important, rather than only certain limit points. Secondly, we obtain guarantees of asymptotic consensus by incorporating multiple mixing rounds in each iteration (Berahas et al., 2018). Finally, we utilize a Metropolis adjustment step to correct for the discretization bias encountered in constant-step-size HMC. In order to compute an approximation to the full posterior in a decentralized context, we introduce a technique of tracking a decentralized second order Taylor approximation of the posterior, using only Hessian-vector products, similarly communicating and aggregating local information across the network via tracking.

1.1 Graph structure

In this section we present the notation for the abstract graph that models the communication structure of the network. The notation is standard in the literature on decentralized learning and distributed optimization, and describes a network of reasonable generality.

The communication network of the agents is modeled as a fixed undirected graph \({\mathcal {G}}\triangleq ({\mathcal {V}},{\mathcal {E}})\) with vertices \({\mathcal {V}}\triangleq \{1,...,m\}\) and edges \({\mathcal {E}}\subseteq \{(i,j)\vert i,j\in {\mathcal {V}}\}\) representing the agents and communication links, respectively. We assume that the graph \({\mathcal {G}}\) is connected. We denote by \({\mathcal {N}}_i\) the neighbors of i, i.e., \({\mathcal {N}}_i = \{j:(i,j)\in {\mathcal {E}}\}\). We define the graph Laplacian matrix \({\textbf{L}}={\textbf{I}}-{\textbf{W}}\), where \({\textbf{W}}={\textbf{A}}\otimes {\textbf{I}}\) with \({\textbf{A}}\) satisfying \({\textbf{A}}_{ij}\ne 0\) if \((i,j)\in {\mathcal {E}}\) and \({\textbf{A}}_{ij}=0\) otherwise. We assume that \({\textbf{W}}\) is doubly stochastic (and symmetric, since the graph is undirected). The eigenvalues of \({\textbf{W}}\) are real and can be sorted in nonincreasing order \(1=\lambda _1({\textbf{W}})> \lambda _2({\textbf{W}})\ge ...\ge \lambda _n({\textbf{W}})\ge 0\). Defining \(\beta := \lambda _2({\textbf{W}})\), we shall make the following assumption.

Assumption 1.1

It holds that,

$$\begin{aligned} \beta <1 \end{aligned}$$

This assumption holds for all but a small class of unusual graphs satisfying the standard requirements of a doubly-stochastic matrix as appearing classically in network consensus schemes. The large class of satisfactory graphs includes the common lattice and ring structures that appear in pre-defined environments. For simplicity, however, we consider static graphs (i.e., the links do not change with time) that are undirected (communication occurs either both ways or not at all along a link between any two agents).

We shall define \({\bar{\beta }}\) to be the smallest nonzero eigenvalue of \({\textbf{L}}\).
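To make the assumption concrete, the following sketch (our own illustration; the function name and weight choice are ours) builds a doubly stochastic, symmetric mixing matrix for a ring of m agents using lazy weights and verifies Assumption 1.1 numerically.

```python
import numpy as np

def ring_mixing_matrix(m):
    """Doubly stochastic, symmetric mixing matrix for a ring of m agents,
    using lazy weights: 1/2 on the self-loop, 1/4 to each neighbor."""
    W = np.zeros((m, m))
    for i in range(m):
        W[i, i] = 0.5
        W[i, (i - 1) % m] += 0.25
        W[i, (i + 1) % m] += 0.25
    return W

W = ring_mixing_matrix(5)
eigs = np.sort(np.linalg.eigvalsh(W))[::-1]  # real, since W is symmetric
beta = eigs[1]                               # second-largest eigenvalue
print(beta < 1)                              # Assumption 1.1 holds (beta ~ 0.65)
```

Running multiple mixing rounds per iteration, as done later in Algorithm 1, corresponds to replacing \({\textbf{W}}\) by \({\textbf{W}}^c\), which drives the effective \(\beta ^c\) toward zero.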

2 Decentralized metropolis-adjusted Hamiltonian Monte Carlo

We are now ready to introduce our approach, which we present formally in Algorithm 1. As part of our approach we introduce two key concepts: gradient tracking, and a Taylor approximation to the Metropolis adjustment. We describe both components in the subsequent sections and then show how they combine in Algorithm 1.

2.1 Gradient tracking

Gradient tracking is a technique that enables every agent to obtain an estimate of the full gradient of the potential in a decentralized setting. While originally introduced and implemented for decentralized nonconvex optimization (Di Lorenzo & Scutari, 2016), here we use gradient tracking to enable gradient-based sampling of a potential, whereby each agent only has access to its own portion of the data. In our method, each agent tracks both first and second order information of the global potential, which we denote as \({\textbf{g}}^t_i\) and \(\aleph ^t_i\) respectively.

We index the individual agents by i and the iteration counter by t. Recall the notation in the introduction, namely (1) and the subsequent discussion of the parameters to be statistically estimated, \(\varvec{\omega }\), the data samples \(\{X_i,Y_i\}\), and the potential function in the Gibbs form of the posterior, \(U(\varvec{\omega })\). To perform gradient tracking, each agent accumulates at every iteration a weighted average of its communicating neighbors' estimates, with the weights determined by the mixing matrix, \(\varvec{W}\), of the network. Each agent computes its local gradient of the potential, \(\nabla U(\varvec{\omega }^t; X_i,Y_i)=\nabla U_i(\varvec{\omega }^t)\), and updates its estimate of the global gradient using its current local gradient, its previous local gradient, and its current estimate of the global gradient. The update equation is as follows:

$$\begin{aligned} {\textbf{g}}_i^{t+1} = \sum \limits _{j\in {\mathcal {N}}_i}{\textbf{W}}_{ij} \left( {\textbf{g}}_j^t+\nabla U(\varvec{\omega }^t_j; X_j,Y_j)-\nabla U(\varvec{\omega }^{(t-1)}_j; X_j,Y_j)\right) . \end{aligned}$$
(2)

As the parameter estimates reach consensus (agreement across agents), we expect \(\varvec{g}\) to also reach consensus, so that in effect every agent samples HMC on the entire dataset. Since in the context of sampling every iterate counts, rather than just the asymptotic limit points, we believe this should improve posterior characterization and avoid the characteristic zig-zagging of decentralized algorithms without gradient tracking. Tracking the Hessian of the potential is necessary to obtain a global estimate of the Hessian of the Hamiltonian with respect to the parameters, which will permit the principled use of a Metropolis step, at least approximately.
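As a minimal sketch of the update (2), with all agents' vectors stacked row-wise (the function name and array layout are our own conventions):

```python
import numpy as np

def gradient_tracking_step(W, g, grad_now, grad_prev):
    """One round of the tracking update (2).

    W         : (m, m) doubly stochastic mixing matrix
    g         : (m, d) current tracked estimates g_i^t of the global gradient
    grad_now  : (m, d) local gradients grad U_i(w_i^t)
    grad_prev : (m, d) local gradients grad U_i(w_i^{t-1})
    """
    # Row i mixes only over the neighbors N_i, since W_ij = 0 otherwise.
    return W @ (g + grad_now - grad_prev)
```

Since \({\textbf{W}}\) is doubly stochastic, this update preserves the network-wide average: provided the estimates are initialized as \({\textbf{g}}_i^0=\nabla U_i(\varvec{\omega }_i^0)\), the mean of the \({\textbf{g}}_i\) always equals the mean of the most recent local gradients, which is the invariant that drives the tracking.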

2.2 Taylor approximated Metropolis adjustment

A key challenge in performing Metropolis adjustment in the decentralized setting is the necessity of evaluating the potential over the entire data set, which is in actuality distributed across the nodes. In our algorithm, we calculate the acceptance probability as

$$\begin{aligned} \log \rho \approx \min \{ 0, -\aleph _i-\epsilon ^2 \Vert \varvec{g}_i\Vert ^2\} \end{aligned}$$
(3)

i.e., using local quantities \(\aleph _i\) and \(\varvec{g}_i\) that stand as surrogates for the global information. In this section we explain and justify this approach.

The second key component of our approach builds on gradient tracking by utilizing the tracked second order term, \(\aleph ^t_i\). One of the challenges in performing decentralized MCMC is the requirement of a decentralized Metropolis-Hastings (MH) step. Since this involves evaluation of the posterior over the whole data set, it is unclear how this can be done with each agent only having access to its local data. Understandably, all previous approaches to decentralized sampling either use diminishing step-sizes (Kungurtsev, 2020) or accept the level of bias according to the discretization error (Gürbüzbalaban et al., 2020).

Recall the generic definition of the HMC Hamiltonian,

$$\begin{aligned} H(\varvec{\omega },\varvec{p}):= U(\varvec{\omega })+\frac{\varvec{p}^T{\textbf{M}}^{-1}\varvec{p}}{2} \end{aligned}$$

In order to perform a decentralized MH step, we first recall the standard Metropolis adjustment, which takes the difference between the Hamiltonian at the current time step, \(H(\varvec{\omega }^t,\varvec{p}^t)\), and the Hamiltonian at the proposed parameters, \(H(\varvec{\omega }^*,\varvec{p}^*)\), and accepts the new parameters according to the acceptance ratio \(\rho\), where \(\log \rho = \min \{0, - H(\varvec{\omega }^*,\varvec{p}^*) + H(\varvec{\omega },\varvec{p})\}\). In the decentralized setting each agent does not have access to the full Hamiltonian; to overcome this problem, we approximate the acceptance ratio using the tracked first and second order terms, via a Taylor expansion of the Hamiltonian.

To derive the approximated acceptance ratio, we first take a Taylor expansion of the Hamiltonian at the proposed time step \(\{\varvec{\omega }^*,\varvec{p}^*\}\) and define both \(\Delta \varvec{\omega } = \varvec{\omega }^* - \varvec{\omega }^t\) and \(\Delta {\textbf{p}} = {\textbf{p}}^* - {\textbf{p}}^t\). Then,

$$\begin{aligned} H(\varvec{\omega }^*,\varvec{p}^*) = H(\varvec{\omega }^t, \varvec{p}^t) + \Delta \varvec{\omega }^T \frac{\partial H }{\partial \varvec{\omega }} + \Delta {\textbf{p}}^T \frac{\partial H }{\partial {\textbf{p}}} + \frac{1}{2} \Delta \varvec{\omega } ^T \frac{\partial ^2 H}{\partial \varvec{\omega }^2} \Delta \varvec{\omega } + \frac{1}{2} \Delta {\textbf{p}}^T \frac{\partial ^2 H}{\partial \varvec{p}^2} \Delta {\textbf{p}}, \end{aligned}$$

where the first and second order derivatives are evaluated at \(\{\varvec{\omega }^t,\varvec{p}^t\}\). As a result we can write the acceptance ratio as:

$$\begin{aligned} \log \rho =&\min \{0, - H(\varvec{\omega }^*,{\textbf{p}}^*) + H(\varvec{\omega },{\textbf{p}})\}\\ \log \rho \approx&\min \Bigg \{ 0, - \Delta \varvec{\omega }^T \frac{\partial H }{\partial \varvec{\omega }} - \Delta {\textbf{p}}^T \frac{\partial H }{\partial {\textbf{p}}} - \frac{1}{2} \Delta \varvec{\omega } ^T \frac{\partial ^2 H}{\partial \varvec{\omega }^2} \Delta \varvec{\omega } - \frac{1}{2} \Delta {\textbf{p}}^T \frac{\partial ^2 H}{\partial \varvec{p}^2} \Delta {\textbf{p}} \Bigg \}. \end{aligned}$$

We perform numerical integration with a single first order Euler update step, which gives \({\textbf{p}}^* = {\textbf{p}}^t + \epsilon {\textbf{g}}^t\) and \(\varvec{\omega }^* = \varvec{\omega }^t + \epsilon {\textbf{p}}^*\). Therefore, \(\Delta \varvec{\omega } = \epsilon {\textbf{p}}^*\) and \(\Delta {\textbf{p}} = \epsilon {\textbf{g}}^t\). As a result the first order terms cancel each other out, leaving the quadratic terms. The second order derivative with respect to the momentum is the identity matrix (assuming the mass matrix is the identity) and the second order derivative with respect to the parameters is the Hessian of the unnormalized potential, \(\frac{\partial ^2 U}{\partial \varvec{\omega }^2}\). Rather than directly computing the full Hessian and using tracking in the same manner as Eq. (2), we can track the quadratic term directly and take advantage of the speed-up associated with the Hessian-vector product. As a result, each agent tracks this second order term using the equation

$$\begin{aligned} \aleph _i^{t+1} = \sum \limits _{j\in {\mathcal {N}}_i}{\textbf{W}}_{ij} \left( \mathbf \aleph _j^t + ({\textbf{p}}_j^t)^T \frac{\partial ^2 U_j}{\partial \varvec{\omega }^2}(\varvec{\omega }^t_j) {\textbf{p}}_j^t-({\textbf{p}}_j^{t-1})^T \frac{\partial ^2 U_j}{\partial \varvec{\omega }^2}(\varvec{\omega }^{t-1}_j) {\textbf{p}}_j^{t-1}\right) \end{aligned}$$

Thus we can set our Metropolis acceptance to

$$\begin{aligned} \log \rho \approx \min \{ 0, -\aleph _i-\epsilon ^2 \Vert \varvec{g}_i\Vert ^2\} \end{aligned}$$

As with the gradient tracking, we expect this to maintain an asymptotic upper bound on the expected discrepancy between the second derivative term of the potential evaluated at the average of the parameters across the agents and the agents’ \(\aleph\) estimates thereof.
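The following sketch illustrates the two local ingredients of (3) on a toy quadratic potential: the tracked quadratic term via a Hessian-vector product, and the acceptance test itself. The potential and helper names are our own assumptions; for a neural potential the Hessian-vector product would instead come from automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy local potential U_i(w) = 0.5 * w^T A w, whose Hessian is A, so the
# Hessian-vector product is A @ p, without ever forming a full Hessian.
A = np.array([[2.0, 0.3],
              [0.3, 1.0]])
hvp = lambda w, p: A @ p

def quadratic_term(w, p):
    """p^T (d^2 U / dw^2) p -- the quantity tracked by aleph in the text."""
    return p @ hvp(w, p)

def taylor_mh_accept(aleph_i, g_i, eps):
    """Approximate Metropolis test (3), built from local tracked surrogates."""
    log_rho = min(0.0, -aleph_i - eps**2 * g_i @ g_i)
    return np.log(rng.uniform()) <= log_rho
```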

2.3 The algorithm

Algorithm 1 describes the decentralized Metropolis-adjusted Hamiltonian Monte Carlo algorithm, which combines the gradient tracking of first and second order terms with the Taylor-approximated Metropolis step. Each agent starts by sampling its own momentum and then computes its own local gradient and quadratic second order term. Then each agent updates its estimated global gradient and quadratic terms using gradient tracking. Finally, each agent makes a single step in the augmented parameter space (including both the model parameters and the momentum) and accepts with probability according to the approximated Metropolis acceptance ratio.

(Algorithm 1 pseudocode: figures a and b)
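For concreteness, here is a compact sketch of one iteration of Algorithm 1 in the stacked notation of Sect. 3. It is our own vectorized restatement, with `grads` and `quad` standing in for oracles returning the stacked local gradients \(\nabla U_i(\varvec{\omega }_i)\) and quadratic terms \({\textbf{p}}_i^T \nabla ^2 U_i(\varvec{\omega }_i) {\textbf{p}}_i\).

```python
import numpy as np

def decentralized_hmc_step(W, w, w_prev, p_prev, g, aleph, grads, quad, eps, rng):
    """One iteration per agent, vectorized across the m agents.

    W : (m, m) mixing matrix; w, w_prev : (m, d) parameter stacks;
    g : (m, d) tracked gradients; aleph : (m,) tracked quadratic terms.
    """
    m, d = w.shape
    p = rng.standard_normal((m, d))                  # fresh momenta (M = I)
    g_new = W @ (g + grads(w) - grads(w_prev))       # first-order tracking (2)
    aleph_new = W @ (aleph + quad(w, p) - quad(w_prev, p_prev))  # second order
    proposal = w - eps * (p + eps * g_new)           # single Euler step
    log_rho = np.minimum(0.0, -aleph_new - eps**2 * np.sum(g_new**2, axis=1))
    accept = np.log(rng.uniform(size=m)) <= log_rho  # approximate MH test (3)
    w_new = np.where(accept[:, None], proposal, w)
    return W @ w_new, p, g_new, aleph_new            # consensus mixing round
```

Multiple mixing rounds per iteration, as required by Theorem 3.2, amount to applying \({\textbf{W}}\) repeatedly (or replacing it by \({\textbf{W}}^c\)) in the lines above.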

3 Convergence analysis

To analyze the convergence of Algorithm 1, we first write the vectorized expression for the iterate sequence generated by the Algorithm as follows. We shall denote \(\varvec{\omega }^t = \begin{pmatrix} \varvec{\omega }_1^T&...&\varvec{\omega }^T_m \end{pmatrix}^T\) as the set of parameter vectors \(\{\varvec{\omega }_i\}\) stacked together, and similarly for \(\varvec{p}^t\), \(\aleph ^t\) and \(\varvec{g}^t\).

$$\begin{aligned} \begin{array}{l} G(\varvec{\omega }^t) = \begin{pmatrix} \nabla U(\varvec{\omega }^t_1; X_1,Y_1) &{} \nabla U(\varvec{\omega }^t_2; X_2,Y_2) &{}... &{} \nabla U(\varvec{\omega }^t_m; X_m,Y_m) \end{pmatrix}^T \\ H(\varvec{\omega }^t) = \begin{pmatrix} ({\textbf{p}}_1^t)^T \frac{\partial ^2 U_1}{\partial \varvec{\omega }^2}(\varvec{\omega }^t_1) {\textbf{p}}_1^t &{} ({\textbf{p}}_2^t)^T \frac{\partial ^2 U_2}{\partial \varvec{\omega }^2}(\varvec{\omega }^t_2) {\textbf{p}}_2^t &{}... &{} ({\textbf{p}}_m^t)^T \frac{\partial ^2 U_m}{\partial \varvec{\omega }^2}(\varvec{\omega }^t_m) {\textbf{p}}_m^t \end{pmatrix}^T \\ {\textbf{g}}^{t+1} = {\textbf{W}} \left( {\textbf{g}}^t+G(\varvec{\omega }^t)-G(\varvec{\omega }^{t-1})\right) \\ \aleph ^{t+1} = {\textbf{W}} \left( \mathbf \aleph ^t+H(\varvec{\omega }^t)-H(\varvec{\omega }^{t-1})\right) \\ \varvec{\omega }^{t+1} = {\textbf{W}}\left( {\mathcal {M}}(\varvec{\omega }^t-\epsilon ({\textbf{p}}^t+\epsilon \varvec{g}^{t+1}),\varvec{\omega }^t,\aleph ^{t+1},u^t)\right) \\ \end{array} \end{aligned}$$
(4)

where we recall that \({\textbf{p}}^t\) is a normal random variable and \(u^t\) is uniformly distributed, and by \({\mathcal {M}}\) we denote the approximate Metropolis acceptance operator, defined to be,

$$\begin{aligned} {\mathcal {M}}(\varvec{\omega },\varvec{\omega }',\aleph ,u)_i:= \left\{ \begin{array}{lr} \varvec{\omega }_i &{} \text {if }\log \rho (\aleph _i,\varvec{g}_i)\ge \log u\\ \varvec{\omega }'_i &{} \text {otherwise} \end{array} \right. \end{aligned}$$

3.1 Coupling to centralized HMC

We prepare a coupling argument for the convergence (see, e.g., Durmus and Moulines (2019)). We bound the distance in probability measure between the chain generated by (4) and the chain generated by m parallel centralized HMC chains, all with access to the entire dataset, whose ergodicity properties are well established, for instance in Durmus et al. (2017). We show this by coupling (4) to centralized HMC, using the triangle inequality and intermediate quantities. Accordingly, we consider that the exogenous random variables \({\textbf{p}}^t\) and \(u^t\) are the same across all of the intermediate constructed chains for all t.

We introduce the notation \({\bar{a}}\) to indicate the vector \(\{a_i\}\) averaged across agents and copied m times, e.g.,

$$\begin{aligned} \bar{\varvec{\omega }}^t:= \begin{pmatrix} \frac{1}{m}\sum _i \left( \varvec{\omega }^t_i\right) ^T&\frac{1}{m}\sum _i \left( \varvec{\omega }^t_i\right) ^T&...&\frac{1}{m}\sum _i \left( \varvec{\omega }^t_i\right) ^T \end{pmatrix}^T \end{aligned}$$

and similarly for \(\bar{\varvec{g}}^t\) and \({{\bar{\aleph }}}^t\). The vector recursion for the averaged iterate satisfies,

$$\begin{aligned} \begin{array}{l} \bar{{\textbf{g}}}^{t+1} = \frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\left( \bar{\varvec{g}}^t+G(\varvec{\omega }^t)-G(\varvec{\omega }^{t-1})\right) \\ \bar{\mathbf {\aleph }}^{t+1} = \frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\left( {{\bar{\aleph }}}^t+H( \varvec{\omega }^t)-H(\varvec{\omega }^{t-1})\right) \\ \bar{\varvec{\omega }}^{t+1} =\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T) \left( {\mathcal {M}}(\varvec{\omega }^t-\epsilon (\varvec{p}^t+\epsilon \varvec{g}^{t+1}),\varvec{\omega }^t,\aleph ^{t+1},u^t)\right) \\ \end{array} \end{aligned}$$
(5)

Now we consider a hypothetical chain wherein a stack of m vectors undergoes the decentralized HMC iteration and is subsequently averaged and dispersed across the stack of m. In effect this is the chain wherein the evaluations are performed at the average parameter; it corresponds to exact HMC, however, with the approximate Metropolis operator using the second order approximation. We refer to this as Approximate HMC.

$$\begin{aligned} \begin{array}{l} \tilde{{\textbf{g}}}^{t+1} = \frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\left( \tilde{\varvec{g}}^t+G(\tilde{ \varvec{\omega }}^t)-G(\tilde{\varvec{\omega }}^{t-1})\right) \\ {\tilde{\aleph }}^{t+1} = \frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\left( {\tilde{\aleph }}^t+H(\tilde{ \varvec{\omega }}^t)-H(\tilde{\varvec{\omega }}^{t-1})\right) \\ \tilde{\varvec{\omega }}^{t+1} =\frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T) \left( {\mathcal {M}}(\tilde{\varvec{\omega }}^t-\epsilon (\varvec{p}^t+\epsilon \tilde{\varvec{g}}^{t+1}),\tilde{\varvec{\omega }}^t,{\tilde{\aleph }}^{t+1},u^t)\right) \\ \end{array} \end{aligned}$$
(6)

Finally, we write the centralized HMC iteration. Note that technically it is a stack of m copies of centralized HMC runs, averaged together for formal simplicity (in effect reducing the variance),

$$\begin{aligned} \hat{\varvec{\omega }}^{t+1}:= \frac{1}{m}(\varvec{I}\otimes \varvec{1} \varvec{1}^T)\hat{{\mathcal {M}}}(\hat{\varvec{\omega }}^t-\epsilon (\varvec{p}^t+\epsilon {\bar{G}}(\hat{\varvec{\omega }}^t)),\hat{\varvec{\omega }}^t,u^t) \end{aligned}$$
(7)

where now the Metropolis operator \(\hat{{\mathcal {M}}}\) uses the exact values of the density \(p(\varvec{\omega }^t)\) to calculate acceptance and rejection, as opposed to the approximation associated with \(\aleph ^t\), and \({\bar{G}}\) is defined as,

$$\begin{aligned} {\bar{G}}(\varvec{\omega }):= \begin{pmatrix} \sum _i \nabla _{\varvec{\omega }} U_i(\varvec{\omega })^T&\sum _i \nabla _{\varvec{\omega }} U_i(\varvec{\omega })^T&...&\sum _i \nabla _{\varvec{\omega }} U_i(\varvec{\omega })^T \end{pmatrix}^T \end{aligned}$$

Recall that \(\hat{\varvec{\omega }}^t\), being the update for standard centralized HMC (7), is an ergodic process as shown in the literature (Durmus et al., 2017). This implies that \({\mathbb {E}}\Vert \hat{\varvec{\omega }}^t\Vert\) and \({\mathbb {E}}\Vert {\bar{G}}(\hat{\varvec{\omega }}^t)\Vert\) are bounded. We denote these bounds by \(U_{\omega }\) and \(U_g\), respectively.

We make the following assumption on the potential function \(U(\varvec{\omega })\).

Assumption 3.1

It holds that \(U(\varvec{\omega })\) as a function of \(\varvec{\omega }\) has Lipschitz continuous first and second derivatives, with constants \(L_2=\sup _{\varvec{\omega }}\Vert \nabla ^2_{\varvec{\omega }\varvec{\omega }} U(\varvec{\omega })\Vert\) and \(L_3=\sup _{\varvec{\omega }}\Vert \nabla ^3_{\varvec{\omega }\varvec{\omega }\varvec{\omega }} U(\varvec{\omega })\Vert\).

For brevity we shall denote by \(M^{(k)}\) the k’th moment of \(\Vert {\textbf{p}}\Vert\), i.e., \({\mathbb {E}}\Vert {\textbf{p}}\Vert =M^{(1)}\), \({\mathbb {E}}\Vert {\textbf{p}}\Vert ^2=M^{(2)}\), etc.

The bound between the approximate and exact HMC is stated first, with the error accumulating from the Taylor remainder error of the Metropolis acceptance step. Note that the scale of the error is of order \(O(\epsilon ^{5})\) in the step-size.

Theorem 3.1

For any \(\epsilon\), if \(T_1(\epsilon )\) is sufficiently small such that,

$$\begin{aligned} \begin{array}{l} A(\epsilon )^{T_1(\epsilon )} B(\epsilon ) \le 1,\\ A(\epsilon ):= 1+\epsilon ^2 L_2+2e\epsilon ^3L_2 (M^{(1)}+\epsilon U_g)+e\epsilon ^5 L_2L_3 M^{(2)}+2e\epsilon ^4 L_2^2,\\ B(\epsilon ):= e\epsilon ^4 L_3 M^{(2)}(M^{(1)}+\epsilon U_g) \end{array} \end{aligned}$$

we have that for \(t\le T_1(\epsilon )\), the expected L2 distance between \(\tilde{\varvec{\omega }}^t\) and \(\hat{\varvec{\omega }}^t\) is bounded by the following expression,

$$\begin{aligned} {\mathbb {E}}\left\| \tilde{\varvec{\omega }}^{t+1}-\hat{\varvec{\omega }}^{t+1}\right\| \le K_i(\epsilon ,t):=\sum \limits _{s=0}^t A(\epsilon )^s B(\epsilon ) \end{aligned}$$
(8)

Next we indicate the consensus error, the standard measure for the discrepancy between the parameter estimates across the agents. The consensus error does not accumulate, but rather stays upper bounded, with the mean-squared-error scaling with \(O(\epsilon ^2)\).

Theorem 3.2

Assume that \(\beta\) is small enough such that \(2\beta (3+2L_2+2 L_3)<1\). The consensus error is asymptotically bounded, specifically,

$$\begin{aligned} {\mathbb {E}}\Vert \bar{\varvec{\omega }}^{t}-\varvec{\omega }^{t}\Vert + {\mathbb {E}}\Vert \bar{\varvec{g}}^{t}-\varvec{g}^{t}\Vert + {\mathbb {E}}\Vert {{\bar{\aleph }}}^{t}-\aleph ^{t}\Vert \le K_c:= \frac{\epsilon M^{(1)}}{1-2\beta (3+2L_2+2 L_3)} \end{aligned}$$
(9)

We note that there is a strict requirement on \(\beta\): it must be small enough to encourage swift consensus and thereby mitigate the drift from the Lipschitzianity of the updates. If the structure of the communication links in the graph is such that the bound is not satisfied, then one can simply run multiple consensus updates \(c\in {\mathbb {N}}\) in each round until \(\beta ^c\) satisfies the required bound.

Finally, the most involved derivation is the probabilistic mass error between approximate HMC and the evolution of the average of the iterates, with the discrepancy accumulating because the gradient vectors are evaluated at the individual agents’ parameter estimates rather than at the average. Formally,

Theorem 3.3

Let,

$$\begin{aligned} \begin{array}{l} A_2(\epsilon ,t):= (1-\epsilon ^2)^{-1}\left[ 1+4L_2+e\epsilon ^3 {\bar{K}}(\epsilon ,t)\right] \\ B_2(\epsilon ,t):= (1-\epsilon ^2)^{-1}\left[ 3L_2 K_c+{\bar{K}}(\epsilon ,t) +e\epsilon ^3(3K_c+K_i(\epsilon ,t)+2U_g)\right] \\ T_2(\epsilon ):= \max \left\{ T\in {\mathbb {N}}:\, \sum \limits _{t=0}^T \prod _{s=t+1}^T A_2(\epsilon ,s) B_2(\epsilon ,t) \le 1\right\} \end{array} \end{aligned}$$

The L2 expected error accumulates as,

$$\begin{aligned} {\mathbb {E}}\Vert \tilde{\varvec{\omega }}^t-\bar{\varvec{\omega }}^t\Vert + {\mathbb {E}}\Vert \tilde{\varvec{g}}^t-\bar{\varvec{g}}^t\Vert +{\mathbb {E}}\Vert \tilde{\aleph }^t-{\bar{\aleph }}^t\Vert \le \sum \limits _{t=0}^{T_2(\epsilon )} \prod _{s=t+1}^{T_2(\epsilon )} A_2(\epsilon ,s) B_2(\epsilon ,t) \end{aligned}$$
(10)

Considering Theorems 3.1, 3.2 and 3.3, we obtain an understanding of the degree to which the decentralized HMC algorithm as presented in Algorithm 1 manages to recreate the behavior of centralized HMC up to some error. Given that HMC exhibits ergodicity towards the stationary distribution in the general case, and geometric ergodicity for strongly log-concave distributions (Durmus et al., 2017), our approach generates a sequence of samples with controlled approximate accuracy relative to one with these properties. This error, however, grows with the number of samples; given the nested form of the guarantees, it can potentially grow quickly. Still, decreasing the step-size \(\epsilon\) decreases all of the terms, and one can run an arbitrary number of iterations with accuracy guarantees by taking ever smaller \(\epsilon\), however at the expense of proportionally shorter trajectory times.

In considering the implications of the theorems, at first glance it could appear that there is an exponentially increasing error, suggesting increasing bias asymptotically. However, this result only shows the correspondence of the scheme to centralized HMC; as with, for instance, an ODE integrator, the best we can expect is that the error grows exponentially with time. In order to obtain mixing results for the algorithm, it is necessary to consider it standalone and introduce some additional assumptions on the landscape of the potential, which we do in the next section.

Finally, to give a clearer picture of the results in this subsection, let us consider the simple case of all problem constants being equal to one. Then the terms in Theorem 3.1 become,

$$\begin{aligned} \begin{array}{l} A(\epsilon ) = 1+\epsilon ^2+2e\epsilon ^3+2e\epsilon ^4+e\epsilon ^5 \\ B(\epsilon ) = e\epsilon ^4+e\epsilon ^5 \end{array} \end{aligned}$$

and thus clearly the required condition is satisfied for T if,

$$\begin{aligned} (1+2\epsilon ^2)^T 2e\epsilon ^4 \le 1,\,\Longrightarrow T \le -4 \log \epsilon / \log (1+2\epsilon ^2) \approx O(\epsilon ^{-2} \log (1/\epsilon )) \end{aligned}$$

which yields a reasonably large number of samples. Subsequently, for small enough \(\epsilon\) we have \(A_2(\epsilon ,t) \le 6\) and \(B_2(\epsilon ,t)\le 4 \epsilon\), and so the condition for Theorem 3.3 is satisfied if,

$$\begin{aligned} 4\sum ^T_{i=1} 6^{T-i} \epsilon \le 1 \end{aligned}$$

which holds for T roughly \(O(-\log \epsilon )\), indicating that the error is bounded by a small constant O(1) for a number of samples proportional to \(O(-\log \epsilon )\).
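These magnitudes are easy to check numerically; the following snippet (our own illustration) evaluates the largest \(T\) with \(A(\epsilon )^T B(\epsilon )\le 1\) for a few step-sizes under the constants-equal-one simplification.

```python
import numpy as np

e = np.e
for eps in (0.1, 0.05, 0.01):
    A = 1 + eps**2 + 2*e*eps**3 + 2*e*eps**4 + e*eps**5
    B = e*eps**4 + e*eps**5
    T = int(-np.log(B) / np.log(A))   # largest T with A**T * B <= 1
    print(f"eps={eps:5.2f}  A={A:.5f}  B={B:.2e}  T={T}")
```

For instance, at \(\epsilon =0.1\) this gives on the order of five hundred samples before the bound in Theorem 3.1 is exhausted.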

3.2 Inexact HMC under function growth assumptions

The results in the previous section indicate that the parameters sampled by the procedure follow a trajectory similar to centralized HMC, gradually diverging as the bias grows. Of course, in practice, we expect the chain to converge, in some probabilistic sense, asymptotically. For that, however, we need some assumptions about the potential function U, as is standard in the analysis of gradient-based sampling methods. We present the following, along the lines of Bou-Rabee et al. (2020), and derive correspondingly similar results:

Assumption 3.2

The potential U satisfies,

1. \(U\in C^4({\mathbb {R}}^d)\).

2. U has a local minimum at \(\varvec{\omega }=0\) and \(U(0)=0\).

3. U has bounded fourth derivatives; let \(L_4=\sup \limits _{\varvec{\omega }\in {\mathbb {R}}^d} \Vert \nabla ^4_{\varvec{\omega }^4} U(\varvec{\omega })\Vert\).

4. U is strongly convex outside a Euclidean ball, i.e., \(\exists {\mathcal {R}},K>0\) such that for all \(\varvec{\omega },\varvec{\omega }'\in {\mathbb {R}}^d\) with \(\Vert \varvec{\omega }-\varvec{\omega }'\Vert \ge {\mathcal {R}}\),

   $$\begin{aligned} \langle \varvec{\omega }-\varvec{\omega }',\nabla U(\varvec{\omega })-\nabla U(\varvec{\omega }')\rangle \ge K\Vert \varvec{\omega }-\varvec{\omega }'\Vert ^2 \end{aligned}$$

Remark 3.1

We suspect that the last assumption could be replaced by the following condition, appearing in Bou-Rabee et al. (2020): there exists a measurable function \(\Psi :{\mathbb {R}}^d\rightarrow [0,\infty )\) and constants \(\lambda ,\alpha \in (0,\infty )\) and \(R_2>0\) such that the level set \(\{\varvec{\omega }\in {\mathbb {R}}^d:\Psi (\varvec{\omega })\le 4\alpha /\lambda \}\) is compact and for all \(\varvec{\omega }\in {\mathbb {R}}^d\) with \(\Vert \varvec{\omega }\Vert \le R_2\),

$$\begin{aligned} (\pi \Psi )(\varvec{\omega }) \le (1-\lambda )\Psi (\varvec{\omega })+\alpha \end{aligned}$$

We do not investigate this point in this paper, however.

We now develop a contraction argument based on the coupling appearing in Bou-Rabee et al. (2020) for \(\bar{\varvec{\omega }}\). For that, we denote a perturbed Metropolis operator,

$$\begin{aligned} \bar{{\mathcal {M}}}(\bar{\varvec{\omega }}^t-\epsilon (\bar{\varvec{p}}^t+\epsilon \nabla U({\bar{\omega }}^t)),\bar{\varvec{\omega }}^t,u^t,q^t) = \bar{\varvec{\omega }}^{t+1} \end{aligned}$$

with \(q^t\) a random variable representing \(\Vert \bar{\varvec{\aleph }}^{t+1}-\varvec{\aleph }^{t+1}\Vert\), \(\Vert \varvec{g}^{t+1}-\nabla U(\bar{\varvec{\omega }^t})\Vert\) and \(\Vert \nabla ^2_{\varvec{\omega }} U(\bar{\varvec{\omega }}^t)-\aleph ^{t+1}\Vert\), and thus bounded in expectation by \(K_T(\epsilon )=C_T K_c(\epsilon )\), with \(K_c(\epsilon )\) as defined in Theorem 3.2. This permits the transition to be characterized by a Markov kernel.

Now we are ready to present the coupling representing one transition of \(\bar{\varvec{\omega }}^t\). Set a constant \(\gamma >0\) to satisfy \(\gamma {\mathcal {R}}\le 1/4\). We denote the other copy of the averaged parameter produced by the procedure as \(\bar{\varvec{\nu }}^t\) and the coupling transition as follows:

$$\begin{aligned} \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})&= \bar{{\mathcal {M}}}(\bar{\varvec{\omega }}-\epsilon (\bar{\varvec{p}}+\epsilon \nabla U({\bar{\omega }})),\bar{\varvec{\omega }},u,q_{\omega }) \end{aligned}$$
(11)
$$\begin{aligned} \varvec{{\mathcal {V}}} (\bar{\varvec{\omega }},\bar{\varvec{\nu }})&= \left\{ \begin{array}{lr} \bar{{\mathcal {M}}}(\bar{\varvec{\nu }}-\epsilon (\bar{\varvec{p}}+\epsilon \nabla U({\bar{\nu }})),\bar{\varvec{\nu }},u, q_{\nu }) &{} \text { if }\Vert \varvec{\omega }-\varvec{\nu }\Vert \ge {\mathcal {R}} \\ \bar{{\mathcal {M}}}(\bar{\varvec{\nu }}-\epsilon (\bar{\varvec{r}}+\epsilon \nabla U({\bar{\nu }})),\bar{\varvec{\nu }},u, q_{\nu }) &{} \text { if }\Vert \varvec{\omega }-\varvec{\nu }\Vert < {\mathcal {R}} \end{array}\right. \end{aligned}$$
(12)

where,

$$\begin{aligned} \varvec{r}:= \left\{ \begin{array}{lr} \bar{\varvec{p}}+\gamma (\varvec{\omega }-\varvec{\nu }),&{}\text { if }{\bar{u}}\le \frac{p_{{\mathcal {N}}}((\varvec{\omega }-\varvec{\nu })\cdot \varvec{p}/\Vert \varvec{\omega }-\varvec{\nu }\Vert +\gamma \Vert \varvec{\omega }-\varvec{\nu }\Vert )}{p_{{\mathcal {N}}}((\varvec{\omega }-\varvec{\nu })\cdot \varvec{p}/\Vert \varvec{\omega }-\varvec{\nu }\Vert )} \\ \bar{\varvec{p}}-\frac{2((\varvec{\omega }-\varvec{\nu })\cdot \varvec{p})(\varvec{\omega }-\varvec{\nu })}{\Vert \varvec{\omega }-\varvec{\nu }\Vert ^2} &{}\text { otherwise}\end{array}\right. \end{aligned}$$
(13)

where \({\bar{u}}\sim \text {Unif}(0,1)\) and \(p_{{\mathcal {N}}}\) is the pdf of a unit normal r.v.
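A sketch of the coupled momentum draw (13), following the reflection-type coupling of Bou-Rabee et al. (2020); variable names are ours, and \(p_{{\mathcal {N}}}\) is implemented up to its normalizing constant, which cancels in the ratio:

```python
import numpy as np

def coupled_momentum(p, w, nu, gamma, rng):
    """Draw the coupled momentum r of (13), given the shared draw p."""
    diff = w - nu
    dist = np.linalg.norm(diff)
    e = diff / dist                           # unit vector along w - nu
    phi = lambda s: np.exp(-0.5 * s**2)       # unit normal pdf, up to a constant
    if rng.uniform() <= phi(e @ p + gamma * dist) / phi(e @ p):
        return p + gamma * diff               # translated branch
    return p - 2.0 * (e @ p) * e              # reflected branch
```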

Define the map \(\varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q})\rightarrow (\bar{\varvec{\omega }}+\epsilon (\varvec{p}+\epsilon \nabla U(\bar{\varvec{\omega }})+\epsilon \varvec{q}),\varvec{p}+\epsilon \nabla U(\bar{\varvec{\omega }})+\epsilon \varvec{q})\), where \(\varvec{q}\) is a random variable representing the tracking error. Noting the relative simplicity of the Euler integrator compared to the Verlet integrator in Bou-Rabee et al. (2020), some of the bounds derived in Bou-Rabee et al. (2020, Lemmas 3.1 and 3.2) are immediate. We state them below without derivation:

$$\begin{aligned} \begin{array}{l} \Vert \varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q})_1-\bar{\varvec{\omega }}-\epsilon \varvec{p}\Vert \le L \epsilon ^2 \Vert \bar{\varvec{\omega }}\Vert +\epsilon ^2 K_T(\epsilon ), \\ \Vert \varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q})_2-\varvec{p}\Vert \le L\epsilon \Vert \varvec{\omega }\Vert +\epsilon K_T(\epsilon ),\\ \Vert \varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q})_1\Vert \le 2\max \{\Vert \bar{\varvec{\omega }}\Vert ,\Vert \bar{\varvec{\omega }}+\epsilon \varvec{p}\Vert ,\epsilon ^2 K_T(\epsilon ) \},\\ \Vert \varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q})_2\Vert \le \Vert \varvec{p}\Vert +L\epsilon \Vert \varvec{\omega }\Vert +\epsilon K_T(\epsilon ),\\ \Vert \varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q})_1-\varvec{L}(\bar{\varvec{\nu }},\varvec{r},\varvec{s})_1-(\bar{\varvec{\omega }}-\bar{\varvec{\nu }})-\epsilon (\varvec{p}-\varvec{r})\Vert \le L\epsilon ^2 \Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert +2\epsilon K_T(\epsilon ), \\ \Vert \varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q})_2-\varvec{L}(\varvec{\nu },\varvec{r},\varvec{s})_2-(\varvec{p}-\varvec{r})\Vert \le L\epsilon \Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert +2\epsilon K_T(\epsilon ) \end{array} \end{aligned}$$
(14)

Next we obtain a contraction for the case of the initial parameter values being outside a Euclidean ball. This result is similar in statement and proof technique to Bou-Rabee et al. (2020, Lemma 3.4 and Theorem 2.2).

Lemma 3.1

There exists a \(C>0\) depending only on \(L_2\) and \(L_3\), such that for \(\bar{\varvec{\omega }},\bar{\varvec{\nu }}\in {\mathbb {R}}^d\) satisfying \(\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert \ge 2{\mathcal {R}}\) and \(\epsilon\) satisfying \(\epsilon ^2 L^2 < K\) and \((1+\Vert \bar{\varvec{\omega }}\Vert +\Vert \bar{\varvec{\nu }}\Vert )\epsilon \le K/C\) and,

$$\begin{aligned} 2K/C (1+L_3\epsilon ^2) K_T(\epsilon )+2 \epsilon K_T(\epsilon )^2 \le K{\mathcal {R}}^2 \end{aligned}$$

it holds that,

$$\begin{aligned} \Vert \varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q})_1-\varvec{L}(\bar{\varvec{\nu }},\varvec{p},\varvec{s})_1\Vert ^2 \le (1-K\epsilon ^2/4)\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert ^2 \end{aligned}$$


Proof

Indeed,

$$\begin{aligned} \begin{array}{l} \Vert \varvec{L}(\bar{\varvec{\omega }},\varvec{p},\varvec{q})_1-\varvec{L}(\bar{\varvec{\nu }},\varvec{p},\varvec{s})_1\Vert ^2 \\ \quad = \Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }} +\epsilon ^2(\nabla U(\bar{\varvec{\omega }})+\varvec{q}-\nabla U(\bar{\varvec{\nu }})-\varvec{s})\Vert ^2 \\ \quad \le (1-\epsilon ^2 K +\epsilon ^4 L)\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert ^2\\ \qquad +2K/C \epsilon ^2(1+L_3\epsilon ^2) K_T(\epsilon )+2 \epsilon ^4 K_T(\epsilon )^2 \end{array} \end{aligned}$$

\(\square\)

Theorem 3.4

Fix some \(R_2>0\). There exists \({\bar{\epsilon }}>0\) depending only on K, \(L_2\), \(L_3\), \(L_4\), \(R_2\) and d such that for any \(\epsilon <{\bar{\epsilon }}\), and any \(\bar{\varvec{\omega }},\bar{\varvec{\nu }}\in {\mathbb {R}}^d\) with \(\Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert \ge 2 {\mathcal {R}}\) and \(\max (\Vert \bar{\varvec{\omega }}\Vert ,\Vert \bar{\varvec{\nu }}\Vert )\le R_2\),

$$\begin{aligned} {\mathbb {E}}\Vert \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\Vert \le \left( 1-\frac{1}{8}K\epsilon ^2\right) \Vert \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\Vert \end{aligned}$$

Define a metric of the form,

$$\begin{aligned} \rho (\Vert \varvec{\omega }-\varvec{\nu }\Vert ):= \int _0^{\Vert \varvec{\omega }-\varvec{\nu }\Vert } \exp (-a\min (s,R_1))ds \end{aligned}$$

with,

$$\begin{aligned} a:=\frac{1}{\epsilon },\,R_1:=\frac{5}{2}({\mathcal {R}}+\epsilon ) \end{aligned}$$

Furthermore, we require,

$$\begin{aligned} 4L\epsilon ^2 \le \min \left\{ \frac{K}{L_2},\frac{1}{4},\frac{1}{256L_2 {\mathcal {R}}^2}\right\} \end{aligned}$$
(15)

and set \(\gamma\) as appearing in (11) to be,

$$\begin{aligned} \gamma :=\min \left\{ \epsilon ^{-1},{\mathcal {R}}^{-1}/4\right\} \end{aligned}$$

Finally, we derive the ultimate result, using the coupling introduced in this section to obtain contractivity in the modified metric.

Theorem 3.5

For the constants satisfying the above equations, and for sufficiently small \(\epsilon\) depending only on K, \(L_2\), \(L_3\), \(M^{(2)}\), \({\mathcal {R}}\), \(R_2\) and d, with \(\max (\Vert \bar{\varvec{\omega }}\Vert ,\Vert \bar{\varvec{\nu }}\Vert )\le R_2\), it holds that if \(\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| > R_3(\epsilon )\), with \(R_3(\epsilon )\) depending linearly on \(\epsilon\) and otherwise on the same set of constants, we have,

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\rho \left( \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| \right) \le \left( 1-c\right) \rho \left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) \end{array} \end{aligned}$$

with c given by,

$$\begin{aligned} c=\frac{1}{20}\min \left( 1,\frac{1}{2}K\epsilon ^2(1+\frac{{\mathcal {R}}}{\epsilon })e^{-{\mathcal {R}}/(2\epsilon )}\right) e^{-2{\mathcal {R}}/\epsilon } \end{aligned}$$

Otherwise, for general \(\left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\|\)

$$\begin{aligned} \begin{array}{l} {\mathbb {E}}\rho \left( \left\| \varvec{\Omega }(\bar{\varvec{\omega }},\bar{\varvec{\nu }})-\varvec{{\mathcal {V}}}(\bar{\varvec{\omega }},\bar{\varvec{\nu }})\right\| \right) \le \left( 1-2c+K_e(\epsilon )\right) \rho \left( \left\| \bar{\varvec{\omega }}-\bar{\varvec{\nu }}\right\| \right) \end{array} \end{aligned}$$

with \(K_e(\epsilon ) = O(\epsilon ^3)\).

The previous theorem indicates that, modulo a ball corresponding to the inexactness associated with the consensus and tracking errors, proportional in size to \(\epsilon\), the stochastic process defined by the algorithm exhibits quantitative contraction, and thus is almost ergodic. Inside such a ball, contractivity depends on the relative values of the constants and the associated consensus error.

4 Examples

In this section we illustrate the performance of Algorithm 1, which we shall refer to as Decentralized HMC, comparing it to two baselines: Centralized HMC, which performs HMC on the entire data set, and Decentralized ULA, which we implement as in Parayil et al. (2020). We leave the discussion of the computing platform, problem dimensions, and hyperparameter selection to the appendix.

4.1 Gaussian mixture model

The purpose of this example is to illustrate pictorially the importance of the Metropolis adjustment step. Here we sample from a Gaussian mixture given by

$$\begin{aligned} \theta _1 \sim {\mathcal {N}}(0,\sigma _1^2);\ \ \theta _2 \sim {\mathcal {N}}(0,\sigma _2^2) \text { and } x_i \sim \left( \frac{1}{2} {\mathcal {N}}(\theta _1,\sigma _x^2) + \frac{1}{2} {\mathcal {N}}(\theta _1 + \theta _2,\sigma _x^2)\right) \end{aligned}$$

where \(\sigma _1^2 = 10, \sigma _2^2 = 1, \sigma _x^2 = 2\). For the centralized setting, 100 data samples are drawn from the model with \(\theta _1 = 0\) and \(\theta _2 = 1\). These are then split into 5 sets of 20 samples that are made available to each agent in the decentralized network. This is a similar setting to that of Parayil et al. (2020). Figure 1 compares sampling from the posterior distribution in both the centralized and decentralized settings. The contour plot corresponds to the true value of the log potential. The plot in the first column displays the samples from the centralized approach and can be thought of as the “ground truth". Columns two and three display the materialized samples for the decentralized setting, where column two applies the Taylor approximated Metropolis adjustment and column three does not. The discrepancy between these two schemes can be seen qualitatively via the spread of the samples. The Metropolis adjustment prevents the collection of samples in low probability regions and ensures that the samples stay in the high probability region in a similar manner to the centralized approach. Leaving out the Metropolis step means that samples that have low log probability are never rejected.
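For reference, a minimal reconstruction (our own) of the stated data-generating process and its split across the five agents:

```python
import numpy as np

rng = np.random.default_rng(0)
theta1, theta2, sigma_x = 0.0, 1.0, np.sqrt(2.0)   # true parameters, sigma_x^2 = 2
comp = rng.integers(0, 2, size=100)                # mixture component per point
x = rng.normal(theta1 + comp * theta2, sigma_x)    # draws from the two modes
agents = np.split(rng.permutation(x), 5)           # 20 samples per agent
```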

Fig. 1  Gaussian mixture model

4.2 Linear regression

Fig. 2  Mean squared error by samples for linear regression. The Decentralized HMC

For our first example with real data, we investigate Bayesian linear regression with four agents applied to the Boston Housing data set (Harrison & Rubinfeld, 1978). In the decentralized setting, each agent is only given access to separate parts of the 13-dimensional feature space: agents 1-3 each have access to a different set of 3 input features, and agent 4 sees the remaining 4 features. We use a simple normal prior for each parameter of the model and compare the results in Fig. 2, which displays the cumulative mean squared error over samples. Our approach (in blue) converges and outperforms the baseline decentralized approach. The centralized approach, with access to all features, achieves the best performance.

4.3 Logistic regression

For logistic regression we work with the MNIST data set (LeCun et al., 1998). We define two scenarios for decentralized learning. In scenario one, we only allow each agent partial observation of the total training set. In scenario two, we distribute the agents in a ring network such that each agent can only communicate with its two neighbors. Furthermore, for the ring network, each agent only has access to two classes of the training data.

Partial observation For this experiment there are four agents, each of which only sees one quarter of the MNIST digit image. The rest of the feature space is set to zeros. Figure 3 shows the statistical performance of the three schemes. The plots show cumulative performance averaged over all agents. The confidence intervals are one standard deviation, calculated over 9 random seeds for all approaches.

Fig. 3  Partial MNIST set-up, logistic regression. The two decentralized algorithms perform similarly, each obtaining an overall worse estimate than the centralized baseline

Ring network For this experiment there are five agents connected in a ring formation, such that each agent can only communicate with two other agents. Each agent only has access to two classes of digits in its training data, e.g. agent 0’s training data consists of digits 0–1. Figure 4 shows that in this setting, Decentralized HMC achieves similar performance to Centralized HMC, both of which in turn outperform Decentralized ULA.

Fig. 4  Ring set-up, logistic regression. Decentralized HMC performs just as well as its centralized counterpart, while outperforming the Langevin-based approach

4.4 Bayesian neural networks

A unique contribution of our proposed algorithm is its treatment of nonconvex potentials, which has not been addressed in earlier work on decentralized sampling (Kungurtsev, 2020; Gürbüzbalaban et al., 2020; Parayil et al., 2020). To study this class of potentials, we ran our decentralized sampling scheme over two agents, each with its own neural network model and data. We provide each agent with half the classes available in the MNIST data set, i.e. one with access to digits 0–4 and the other with access to digits 5–9. Each network is fully connected with a single hidden layer of 100 units. Figure 5 displays the cumulative performance over the number of samples, averaged across the two agents for the test data. As is evident from the figure, our approach is able to learn in a decentralized fashion over nonconvex models. Without explicitly passing data between the two agents, each agent achieves good performance over classes it has never seen before.

Fig. 5  BNNs: two agents, one with digits 0–4 and the other with digits 5–9. Test data contains all classes and performance is averaged over both agents

5 Perspectives and conclusion

We presented an algorithm that performs Metropolis-adjusted HMC sampling in a decentralized environment. Theoretically, it behaves close in probability to exact HMC sampling, and numerically it performs well on standard datasets. It would be interesting to establish ergodicity and convergence in probability of the chain generated by the algorithm itself, rather than just a quantitative distance to an ergodic chain; whether this is possible remains an open question.