Search | arXiv e-print repository

Denoising Lévy Probabilistic Models

Authors: Dario Shariatian, Umut Simsekli, Alain Durmus

Abstract: Investigating noise distribution beyond Gaussian in diffusion generative models is an open problem. The Gaussian case has seen success experimentally and theoretically, fitting a unified SDE framework for score-based and denoising formulations. Recent studies suggest heavy-tailed noise distributions can address mode collapse and manage datasets with class imbalance, heavy tails, or outliers. Yoon… ▽ More Investigating noise distribution beyond Gaussian in diffusion generative models is an open problem. The Gaussian case has seen success experimentally and theoretically, fitting a unified SDE framework for score-based and denoising formulations. Recent studies suggest heavy-tailed noise distributions can address mode collapse and manage datasets with class imbalance, heavy tails, or outliers. Yoon et al. (NeurIPS 2023) introduced the Lévy-Ito model (LIM), extending the SDE framework to heavy-tailed SDEs with $α$-stable noise. Despite its theoretical elegance and performance gains, LIM's complex mathematics may limit its accessibility and broader adoption. This study takes a simpler approach by extending the denoising diffusion probabilistic model (DDPM) with $α$-stable noise, creating the denoising Lévy probabilistic model (DLPM). Using elementary proof techniques, we show DLPM reduces to running vanilla DDPM with minimal changes, allowing the use of existing implementations with minimal changes. DLPM and LIM have different training algorithms and, unlike the Gaussian case, they admit different backward processes and sampling algorithms. Our experiments demonstrate that DLPM achieves better coverage of data distribution tail, improved generation of unbalanced datasets, and faster computation times with fewer backward steps. △ Less

Submitted 26 July, 2024; originally announced July 2024.

arXiv:2407.14332 [pdf, ps, other]

Unravelling in Collaborative Learning

Authors: Aymeric Capitaine, Etienne Boursier, Antoine Scheid, Eric Moulines, Michael I. Jordan, El-Mahdi El-Mhamdi, Alain Durmus

Abstract: Collaborative learning offers a promising avenue for leveraging decentralized data. However, collaboration in groups of strategic learners is not a given. In this work, we consider strategic agents who wish to train a model together but have sampling distributions of different quality. The collaboration is organized by a benevolent aggregator who gathers samples so as to maximize total welfare, bu… ▽ More Collaborative learning offers a promising avenue for leveraging decentralized data. However, collaboration in groups of strategic learners is not a given. In this work, we consider strategic agents who wish to train a model together but have sampling distributions of different quality. The collaboration is organized by a benevolent aggregator who gathers samples so as to maximize total welfare, but is unaware of data quality. This setting allows us to shed light on the deleterious effect of adverse selection in collaborative learning. More precisely, we demonstrate that when data quality indices are private, the coalition may undergo a phenomenon known as unravelling, wherein it shrinks up to the point that it becomes empty or solely comprised of the worst agent. We show how this issue can be addressed without making use of external transfers, by proposing a novel method inspired by probabilistic verification. This approach makes the grand coalition a Nash equilibrium with high probability despite information asymmetry, thereby breaking unravelling. △ Less

Submitted 19 July, 2024; originally announced July 2024.

arXiv:2406.19824 [pdf, ps, other]

Learning to Mitigate Externalities: the Coase Theorem with Hindsight Rationality

Authors: Antoine Scheid, Aymeric Capitaine, Etienne Boursier, Eric Moulines, Michael I Jordan, Alain Durmus

Abstract: In economic theory, the concept of externality refers to any indirect effect resulting from an interaction between players that affects the social welfare. Most of the models within which externality has been studied assume that agents have perfect knowledge of their environment and preferences. This is a major hindrance to the practical implementation of many proposed solutions. To address this i… ▽ More In economic theory, the concept of externality refers to any indirect effect resulting from an interaction between players that affects the social welfare. Most of the models within which externality has been studied assume that agents have perfect knowledge of their environment and preferences. This is a major hindrance to the practical implementation of many proposed solutions. To address this issue, we consider a two-player bandit setting where the actions of one of the players affect the other player and we extend the Coase theorem [Coase, 1960]. This result shows that the optimal approach for maximizing the social welfare in the presence of externality is to establish property rights, i.e., enable transfers and bargaining between the players. Our work removes the classical assumption that bargainers possess perfect knowledge of the underlying game. We first demonstrate that in the absence of property rights, the social welfare breaks down. We then design a policy for the players which allows them to learn a bargaining strategy which maximizes the total welfare, recovering the Coase theorem under uncertainty. △ Less

Submitted 3 July, 2024; v1 submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.04012 [pdf, other]

Theoretical Guarantees for Variational Inference with Fixed-Variance Mixture of Gaussians

Authors: Tom Huix, Anna Korba, Alain Durmus, Eric Moulines

Abstract: Variational inference (VI) is a popular approach in Bayesian inference, that looks for the best approximation of the posterior distribution within a parametric family, minimizing a loss that is typically the (reverse) Kullback-Leibler (KL) divergence. Despite its empirical success, the theoretical properties of VI have only received attention recently, and mostly when the parametric family is the… ▽ More Variational inference (VI) is a popular approach in Bayesian inference, that looks for the best approximation of the posterior distribution within a parametric family, minimizing a loss that is typically the (reverse) Kullback-Leibler (KL) divergence. Despite its empirical success, the theoretical properties of VI have only received attention recently, and mostly when the parametric family is the one of Gaussians. This work aims to contribute to the theoretical study of VI in the non-Gaussian case by investigating the setting of Mixture of Gaussians with fixed covariance and constant weights. In this view, VI over this specific family can be casted as the minimization of a Mollified relative entropy, i.e. the KL between the convolution (with respect to a Gaussian kernel) of an atomic measure supported on Diracs, and the target distribution. The support of the atomic measure corresponds to the localization of the Gaussian components. Hence, solving variational inference becomes equivalent to optimizing the positions of the Diracs (the particles), which can be done through gradient descent and takes the form of an interacting particle system. We study two sources of error of variational inference in this context when optimizing the mollified relative entropy. The first one is an optimization result, that is a descent lemma establishing that the algorithm decreases the objective at each iteration. The second one is an approximation error, that upper bounds the objective between an optimal finite mixture and the target distribution. △ Less

Submitted 10 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

arXiv:2405.20636 [pdf]

Photoluminescence enhancement at the vertical van der Waals semiconductor-metal heterostructures

Authors: Hafiz Muhammad Shakir, Abdulsalam Aji Suleiman, Kübra Nur Kalkan, Amir Parsi, Uğur Başçı, Mehmet Atıf Durmuş, Ahmet Osman Ölçer, Hilal Korkut, Cem Sevik, İbrahim Sarpkaya, Talip Serkan Kasırga

Abstract: Excitons in monolayer transition metal dichalcogenides (TMDCs) offer intriguing new possibilities for optoelectronics with no analogues in bulk semiconductors. Yet, intrinsic defects in TMDCs limit the radiative exciton recombination pathways. As a result, the photoluminescence (PL) quantum yield (QY) is limited. Methods like superacid treatment, electrical doping, and plasmonic engineering can in… ▽ More Excitons in monolayer transition metal dichalcogenides (TMDCs) offer intriguing new possibilities for optoelectronics with no analogues in bulk semiconductors. Yet, intrinsic defects in TMDCs limit the radiative exciton recombination pathways. As a result, the photoluminescence (PL) quantum yield (QY) is limited. Methods like superacid treatment, electrical doping, and plasmonic engineering can inhibit nonradiative decay channels and enhance PL. Here, we show a more straightforward approach that allows PL enhancement. An engineered vertical van der Waals (vdW) metal-monolayer semiconductor junction (MSJ) results in PL enhancement of more than an order of magnitude at technologically relevant excitation powers. Such MSJ can be constructed by vertically stacking metals with suitable work function either above or below a monolayer semiconducting TMDC. Our experiments reveal that the underlying PL enhancement mechanism is to be the suppressed exciton quenching due to the absence of metal-induced gap states and weak Fermi level pinning, thanks to the vdW gapped interface between the metal and the TMDC. Our time-resolved PL measurements further indicate that reduced exciton-exciton annihilation, even at high generation rates, contributes to the observed PL enhancement. The PL intensity is further increased by the proximity of surface plasmons in the metal with the TMDC layer. Our findings shed light on the interaction at vdW metal-semiconductor interfaces and offer a path to improving the optoelectronic performance of semiconducting TMDCs. △ Less

Submitted 31 May, 2024; originally announced May 2024.

arXiv:2403.11407 [pdf, other]

Divide-and-Conquer Posterior Sampling for Denoising Diffusion Priors

Authors: Yazid Janati, Alain Durmus, Eric Moulines, Jimmy Olsson

Abstract: Interest in the use of Denoising Diffusion Models (DDM) as priors for solving inverse Bayesian problems has recently increased significantly. However, sampling from the resulting posterior distribution poses a challenge. To solve this problem, previous works have proposed approximations to bias the drift term of the diffusion. In this work, we take a different approach and utilize the specific str… ▽ More Interest in the use of Denoising Diffusion Models (DDM) as priors for solving inverse Bayesian problems has recently increased significantly. However, sampling from the resulting posterior distribution poses a challenge. To solve this problem, previous works have proposed approximations to bias the drift term of the diffusion. In this work, we take a different approach and utilize the specific structure of the DDM prior to define a set of intermediate and simpler posterior sampling problems, resulting in a lower approximation error compared to previous methods. We empirically demonstrate the reconstruction capability of our method for general linear inverse problems using synthetic examples and various image restoration tasks. △ Less

Submitted 17 March, 2024; originally announced March 2024.

Comments: preprint

arXiv:2403.03811 [pdf, other]

Incentivized Learning in Principal-Agent Bandit Games

Authors: Antoine Scheid, Daniil Tiapkin, Etienne Boursier, Aymeric Capitaine, El Mahdi El Mhamdi, Eric Moulines, Michael I. Jordan, Alain Durmus

Abstract: This work considers a repeated principal-agent bandit game, where the principal can only interact with her environment through the agent. The principal and the agent have misaligned objectives and the choice of action is only left to the agent. However, the principal can influence the agent's decisions by offering incentives which add up to his rewards. The principal aims to iteratively learn an i… ▽ More This work considers a repeated principal-agent bandit game, where the principal can only interact with her environment through the agent. The principal and the agent have misaligned objectives and the choice of action is only left to the agent. However, the principal can influence the agent's decisions by offering incentives which add up to his rewards. The principal aims to iteratively learn an incentive policy to maximize her own total utility. This framework extends usual bandit problems and is motivated by several practical applications, such as healthcare or ecological taxation, where traditionally used mechanism design theories often overlook the learning aspect of the problem. We present nearly optimal (with respect to a horizon $T$) learning algorithms for the principal's regret in both multi-armed and linear contextual settings. Finally, we support our theoretical guarantees through numerical experiments. △ Less

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2403.02506 [pdf, other]

Differentially Private Representation Learning via Image Captioning

Authors: Tom Sander, Yaodong Yu, Maziar Sanjabi, Alain Durmus, Yi Ma, Kamalika Chaudhuri, Chuan Guo

Abstract: Differentially private (DP) machine learning is considered the gold-standard solution for training a model from sensitive data while still preserving privacy. However, a major barrier to achieving this ideal is its sub-optimal privacy-accuracy trade-off, which is particularly visible in DP representation learning. Specifically, it has been shown that under modest privacy budgets, most models learn… ▽ More Differentially private (DP) machine learning is considered the gold-standard solution for training a model from sensitive data while still preserving privacy. However, a major barrier to achieving this ideal is its sub-optimal privacy-accuracy trade-off, which is particularly visible in DP representation learning. Specifically, it has been shown that under modest privacy budgets, most models learn representations that are not significantly better than hand-crafted features. In this work, we show that effective DP representation learning can be done via image captioning and scaling up to internet-scale multimodal datasets. Through a series of engineering tricks, we successfully train a DP image captioner (DP-Cap) on a 233M subset of LAION-2B from scratch using a reasonable amount of computation, and obtaining unprecedented high-quality image features that can be used in a variety of downstream vision and vision-language tasks. For example, under a privacy budget of $\varepsilon=8$, a linear classifier trained on top of learned DP-Cap features attains 65.8% accuracy on ImageNet-1K, considerably improving the previous SOTA of 56.5%. Our work challenges the prevailing sentiment that high-utility DP representation learning cannot be achieved by training from scratch. △ Less

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.17870 [pdf, other]

Stochastic Approximation with Biased MCMC for Expectation Maximization

Authors: Samuel Gruffaz, Kyurae Kim, Alain Oliviero Durmus, Jacob R. Gardner

Abstract: The expectation maximization (EM) algorithm is a widespread method for empirical Bayesian inference, but its expectation step (E-step) is often intractable. Employing a stochastic approximation scheme with Markov chain Monte Carlo (MCMC) can circumvent this issue, resulting in an algorithm known as MCMC-SAEM. While theoretical guarantees for MCMC-SAEM have previously been established, these result… ▽ More The expectation maximization (EM) algorithm is a widespread method for empirical Bayesian inference, but its expectation step (E-step) is often intractable. Employing a stochastic approximation scheme with Markov chain Monte Carlo (MCMC) can circumvent this issue, resulting in an algorithm known as MCMC-SAEM. While theoretical guarantees for MCMC-SAEM have previously been established, these results are restricted to the case where asymptotically unbiased MCMC algorithms are used. In practice, MCMC-SAEM is often run with asymptotically biased MCMC, for which the consequences are theoretically less understood. In this work, we fill this gap by analyzing the asymptotics and non-asymptotics of SAEM with biased MCMC steps, particularly the effect of bias. We also provide numerical experiments comparing the Metropolis-adjusted Langevin algorithm (MALA), which is asymptotically unbiased, and the unadjusted Langevin algorithm (ULA), which is asymptotically biased, on synthetic and real datasets. Experimental results show that ULA is more stable with respect to the choice of Langevin stepsize and can sometimes result in faster convergence. △ Less

Submitted 27 February, 2024; originally announced February 2024.

Comments: Accepted to AISTATS'24

arXiv:2402.14904 [pdf, other]

Watermarking Makes Language Models Radioactive

Authors: Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, Teddy Furon

Abstract: This paper investigates the radioactivity of LLM-generated texts, i.e. whether it is possible to detect that such input was used as training data. Conventional methods like membership inference can carry out this detection with some level of accuracy. We show that watermarked training data leaves traces easier to detect and much more reliable than membership inference. We link the contamination le… ▽ More This paper investigates the radioactivity of LLM-generated texts, i.e. whether it is possible to detect that such input was used as training data. Conventional methods like membership inference can carry out this detection with some level of accuracy. We show that watermarked training data leaves traces easier to detect and much more reliable than membership inference. We link the contamination level to the watermark robustness, its proportion in the training set, and the fine-tuning process. We notably demonstrate that training on watermarked synthetic instructions can be detected with high confidence (p-value < 1e-5) even when as little as 5% of training text is watermarked. Thus, LLM watermarking, originally designed for detecting machine-generated text, gives the ability to easily identify if the outputs of a watermarked LLM were used to fine-tune another LLM. △ Less

Submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.10758 [pdf, other]

Stochastic Localization via Iterative Posterior Sampling

Authors: Louis Grenioux, Maxence Noble, Marylou Gabrié, Alain Oliviero Durmus

Abstract: Building upon score-based learning, new interest in stochastic localization techniques has recently emerged. In these models, one seeks to noise a sample from the data distribution through a stochastic process, called observation process, and progressively learns a denoiser associated to this dynamics. Apart from specific applications, the use of stochastic localization for the problem of sampling… ▽ More Building upon score-based learning, new interest in stochastic localization techniques has recently emerged. In these models, one seeks to noise a sample from the data distribution through a stochastic process, called observation process, and progressively learns a denoiser associated to this dynamics. Apart from specific applications, the use of stochastic localization for the problem of sampling from an unnormalized target density has not been explored extensively. This work contributes to fill this gap. We consider a general stochastic localization framework and introduce an explicit class of observation processes, associated with flexible denoising schedules. We provide a complete methodology, $\textit{Stochastic Localization via Iterative Posterior Sampling}$ (SLIPS), to obtain approximate samples of this dynamics, and as a by-product, samples from the target distribution. Our scheme is based on a Markov chain Monte Carlo estimation of the denoiser and comes with detailed practical guidelines. We illustrate the benefits and applicability of SLIPS on several benchmarks of multi-modal distributions, including Gaussian mixtures in increasing dimensions, Bayesian logistic regression and a high-dimensional field system from statistical-mechanics. △ Less

Submitted 28 May, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

Comments: Accepted at ICML 2024

arXiv:2402.08344 [pdf, other]

Implicit Bias in Noisy-SGD: With Applications to Differentially Private Training

Authors: Tom Sander, Maxime Sylvestre, Alain Durmus

Abstract: Training Deep Neural Networks (DNNs) with small batches using Stochastic Gradient Descent (SGD) yields superior test performance compared to larger batches. The specific noise structure inherent to SGD is known to be responsible for this implicit bias. DP-SGD, used to ensure differential privacy (DP) in DNNs' training, adds Gaussian noise to the clipped gradients. Surprisingly, large-batch trainin… ▽ More Training Deep Neural Networks (DNNs) with small batches using Stochastic Gradient Descent (SGD) yields superior test performance compared to larger batches. The specific noise structure inherent to SGD is known to be responsible for this implicit bias. DP-SGD, used to ensure differential privacy (DP) in DNNs' training, adds Gaussian noise to the clipped gradients. Surprisingly, large-batch training still results in a significant decrease in performance, which poses an important challenge because strong DP guarantees necessitate the use of massive batches. We first show that the phenomenon extends to Noisy-SGD (DP-SGD without clipping), suggesting that the stochasticity (and not the clipping) is the cause of this implicit bias, even with additional isotropic Gaussian noise. We theoretically analyse the solutions obtained with continuous versions of Noisy-SGD for the Linear Least Square and Diagonal Linear Network settings, and reveal that the implicit bias is indeed amplified by the additional noise. Thus, the performance issues of large-batch DP-SGD training are rooted in the same underlying principles as SGD, offering hope for potential improvements in large batch training strategies. △ Less

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2402.06447 [pdf, other]

On the irreducibility and convergence of a class of nonsmooth nonlinear state-space models on manifolds

Authors: Armand Gissler, Alain Durmus, Anne Auger

Abstract: In this paper, we analyze a large class of general nonlinear state-space models on a state-space X, defined by the recursion $φ_{k+1} = F(φ_k,α(φ_k,U_{k+1}))$, $k \in\bN$, where $F,α$ are some functions and $\{U_{k+1}\}_{k\in\bN}$ is a sequence of i.i.d. random variables. More precisely, we extend conditions under which this class of Markov chains is irreducible, aperiodic and satisfies important… ▽ More In this paper, we analyze a large class of general nonlinear state-space models on a state-space X, defined by the recursion $φ_{k+1} = F(φ_k,α(φ_k,U_{k+1}))$, $k \in\bN$, where $F,α$ are some functions and $\{U_{k+1}\}_{k\in\bN}$ is a sequence of i.i.d. random variables. More precisely, we extend conditions under which this class of Markov chains is irreducible, aperiodic and satisfies important continuity properties, relaxing two key assumptions from prior works. First, the state-space X is supposed to be a smooth manifold instead of an open subset of a Euclidean space. Second, we only suppose that $F is locally Lipschitz continuous. We demonstrate the significance of our results through their application to Markov chains underlying optimization algorithms. These schemes belong to the class of evolution strategies with covariance matrix adaptation and step-size adaptation. △ Less

Submitted 9 February, 2024; originally announced February 2024.

arXiv:2312.00417 [pdf, other]

Geodesic slice sampling on Riemannian manifolds

Authors: Alain Durmus, Samuel Gruffaz, Mareike Hasenpflug, Daniel Rudolf

Abstract: We propose a theoretically justified and practically applicable slice sampling based Markov chain Monte Carlo (MCMC) method for approximate sampling from probability measures on Riemannian manifolds. The latter naturally arise as posterior distributions in Bayesian inference of matrix-valued parameters, for example belonging to either the Stiefel or the Grassmann manifold. Our method, called geode… ▽ More We propose a theoretically justified and practically applicable slice sampling based Markov chain Monte Carlo (MCMC) method for approximate sampling from probability measures on Riemannian manifolds. The latter naturally arise as posterior distributions in Bayesian inference of matrix-valued parameters, for example belonging to either the Stiefel or the Grassmann manifold. Our method, called geodesic slice sampling, is reversible with respect to the distribution of interest, and generalizes Hit-and-run slice sampling on $\mathbb{R}^{d}$ to Riemannian manifolds by using geodesics instead of straight lines. We demonstrate the robustness of our sampler's performance compared to other MCMC methods dealing with manifold valued distributions through extensive numerical experiments, on both synthetic and real data. In particular, we illustrate its remarkable ability to cope with anisotropic target densities, without using gradient information and preconditioning. △ Less

Submitted 19 April, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

Comments: Journal paper of 51 pages with appendix

MSC Class: 60-08 ACM Class: G.3

arXiv:2310.18455 [pdf, other]

Approximate Heavy Tails in Offline (Multi-Pass) Stochastic Gradient Descent

Authors: Krunoslav Lehman Pavasovic, Alain Durmus, Umut Simsekli

Abstract: A recent line of empirical studies has demonstrated that SGD might exhibit a heavy-tailed behavior in practical settings, and the heaviness of the tails might correlate with the overall performance. In this paper, we investigate the emergence of such heavy tails. Previous works on this problem only considered, up to our knowledge, online (also called single-pass) SGD, in which the emergence of hea… ▽ More A recent line of empirical studies has demonstrated that SGD might exhibit a heavy-tailed behavior in practical settings, and the heaviness of the tails might correlate with the overall performance. In this paper, we investigate the emergence of such heavy tails. Previous works on this problem only considered, up to our knowledge, online (also called single-pass) SGD, in which the emergence of heavy tails in theoretical findings is contingent upon access to an infinite amount of data. Hence, the underlying mechanism generating the reported heavy-tailed behavior in practical settings, where the amount of training data is finite, is still not well-understood. Our contribution aims to fill this gap. In particular, we show that the stationary distribution of offline (also called multi-pass) SGD exhibits 'approximate' power-law tails and the approximation error is controlled by how fast the empirical distribution of the training data converges to the true underlying data distribution in the Wasserstein metric. Our main takeaway is that, as the number of data points increases, offline SGD will behave increasingly 'power-law-like'. To achieve this result, we first prove nonasymptotic Wasserstein convergence bounds for offline SGD to online SGD as the number of data points increases, which can be interesting on their own. Finally, we illustrate our theory on various experiments conducted on synthetic data and neural networks. △ Less

Submitted 27 October, 2023; originally announced October 2023.

Comments: In Neural Information Processing Systems (NeurIPS), Spotlight Presentation, 2023

arXiv:2308.12240 [pdf, ps, other]

Score diffusion models without early stopping: finite Fisher information is all you need

Authors: Giovanni Conforti, Alain Durmus, Marta Gentiloni Silveri

Abstract: Diffusion models are a new class of generative models that revolve around the estimation of the score function associated with a stochastic differential equation. Subsequent to its acquisition, the approximated score function is then harnessed to simulate the corresponding time-reversal process, ultimately enabling the generation of approximate data samples. Despite their evident practical signifi… ▽ More Diffusion models are a new class of generative models that revolve around the estimation of the score function associated with a stochastic differential equation. Subsequent to its acquisition, the approximated score function is then harnessed to simulate the corresponding time-reversal process, ultimately enabling the generation of approximate data samples. Despite their evident practical significance these models carry, a notable challenge persists in the form of a lack of comprehensive quantitative results, especially in scenarios involving non-regular scores and estimators. In almost all reported bounds in Kullback Leibler (KL) divergence, it is assumed that either the score function or its approximation is Lipschitz uniformly in time. However, this condition is very restrictive in practice or appears to be difficult to establish. To circumvent this issue, previous works mainly focused on establishing convergence bounds in KL for an early stopped version of the diffusion model and a smoothed version of the data distribution, or assuming that the data distribution is supported on a compact manifold. These explorations have lead to interesting bounds in either Wasserstein or Fortet-Mourier metrics. However, the question remains about the relevance of such early-stopping procedure or compactness conditions. In particular, if there exist a natural and mild condition ensuring explicit and sharp convergence bounds in KL. In this article, we tackle the aforementioned limitations by focusing on score diffusion models with fixed step size stemming from the Ornstein-Ulhenbeck semigroup and its kinetic counterpart. Our study provides a rigorous analysis, yielding simple, improved and sharp convergence bounds in KL applicable to any data distribution with finite Fisher information with respect to the standard Gaussian distribution. △ Less

Submitted 23 August, 2023; originally announced August 2023.

arXiv:2307.10167 [pdf, other]

VITS : Variational Inference Thompson Sampling for contextual bandits

Authors: Pierre Clavier, Tom Huix, Alain Durmus

Abstract: In this paper, we introduce and analyze a variant of the Thompson sampling (TS) algorithm for contextual bandits. At each round, traditional TS requires samples from the current posterior distribution, which is usually intractable. To circumvent this issue, approximate inference techniques can be used and provide samples with distribution close to the posteriors. However, current approximate techn… ▽ More In this paper, we introduce and analyze a variant of the Thompson sampling (TS) algorithm for contextual bandits. At each round, traditional TS requires samples from the current posterior distribution, which is usually intractable. To circumvent this issue, approximate inference techniques can be used and provide samples with distribution close to the posteriors. However, current approximate techniques yield to either poor estimation (Laplace approximation) or can be computationally expensive (MCMC methods, Ensemble sampling...). In this paper, we propose a new algorithm, Varational Inference Thompson sampling VITS, based on Gaussian Variational Inference. This scheme provides powerful posterior approximations which are easy to sample from, and is computationally efficient, making it an ideal choice for TS. In addition, we show that VITS achieves a sub-linear regret bound of the same order in the dimension and number of round as traditional TS for linear contextual bandit. Finally, we demonstrate experimentally the effectiveness of VITS on both synthetic and real world datasets. △ Less

Submitted 20 July, 2024; v1 submitted 19 July, 2023; originally announced July 2023.

arXiv:2307.03460 [pdf, other]

On the convergence of dynamic implementations of Hamiltonian Monte Carlo and No U-Turn Samplers

Authors: Alain Durmus, Samuel Gruffaz, Miika Kailas, Eero Saksman, Matti Vihola

Abstract: There is substantial empirical evidence about the success of dynamic implementations of Hamiltonian Monte Carlo (HMC), such as the No U-Turn Sampler (NUTS), in many challenging inference problems but theoretical results about their behavior are scarce. The aim of this paper is to fill this gap. More precisely, we consider a general class of MCMC algorithms we call dynamic HMC. We show that this ge… ▽ More There is substantial empirical evidence about the success of dynamic implementations of Hamiltonian Monte Carlo (HMC), such as the No U-Turn Sampler (NUTS), in many challenging inference problems but theoretical results about their behavior are scarce. The aim of this paper is to fill this gap. More precisely, we consider a general class of MCMC algorithms we call dynamic HMC. We show that this general framework encompasses NUTS as a particular case, implying the invariance of the target distribution as a by-product. Second, we establish conditions under which NUTS is irreducible and aperiodic and as a corrolary ergodic. Under conditions similar to the ones existing for HMC, we also show that NUTS is geometrically ergodic. Finally, we improve existing convergence results for HMC showing that this method is ergodic without any boundedness condition on the stepsize and the number of leapfrog steps, in the case where the target is a perturbation of a Gaussian distribution. △ Less

Submitted 7 July, 2023; originally announced July 2023.

Comments: 24 pages without appendix and references, 2 figures, a future journal paper

MSC Class: 62

arXiv:2306.09513 [pdf, ps, other]

Second order quantitative bounds for unadjusted generalized Hamiltonian Monte Carlo

Authors: Evan Camrud, Alain Durmus, Pierre Monmarché, Gabriel Stoltz

Abstract: This paper provides a convergence analysis for generalized Hamiltonian Monte Carlo samplers, a family of Markov Chain Monte Carlo methods based on leapfrog integration of Hamiltonian dynamics and kinetic Langevin diffusion, that encompasses the unadjusted Hamiltonian Monte Carlo method. Assuming that the target distribution $π$ satisfies a log-Sobolev inequality and mild conditions on the correspo… ▽ More This paper provides a convergence analysis for generalized Hamiltonian Monte Carlo samplers, a family of Markov Chain Monte Carlo methods based on leapfrog integration of Hamiltonian dynamics and kinetic Langevin diffusion, that encompasses the unadjusted Hamiltonian Monte Carlo method. Assuming that the target distribution $π$ satisfies a log-Sobolev inequality and mild conditions on the corresponding potential function, we establish quantitative bounds on the relative entropy of the iterates defined by the algorithm, with respect to $π$. Our approach is based on a perturbative and discrete version of the modified entropy method developed to establish hypocoercivity for the continuous-time kinetic Langevin process. As a corollary of our main result, we are able to derive complexity bounds for the class of algorithms at hand. In particular, we show that the total number of iterations to achieve a target accuracy $\varepsilon >0$ is of order $d/\varepsilon^{1/4}$, where $d$ is the dimension of the problem. This result can be further improved in the case of weakly interacting mean field potentials, for which we find a total number of iterations of order $(d/\varepsilon)^{1/4}$. △ Less

Submitted 13 May, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

arXiv:2305.16557 [pdf, other]

Tree-Based Diffusion Schrödinger Bridge with Applications to Wasserstein Barycenters

Authors: Maxence Noble, Valentin De Bortoli, Arnaud Doucet, Alain Durmus

Abstract: Multi-marginal Optimal Transport (mOT), a generalization of OT, aims at minimizing the integral of a cost function with respect to a distribution with some prescribed marginals. In this paper, we consider an entropic version of mOT with a tree-structured quadratic cost, i.e., a function that can be written as a sum of pairwise cost functions between the nodes of a tree. To address this problem, we… ▽ More Multi-marginal Optimal Transport (mOT), a generalization of OT, aims at minimizing the integral of a cost function with respect to a distribution with some prescribed marginals. In this paper, we consider an entropic version of mOT with a tree-structured quadratic cost, i.e., a function that can be written as a sum of pairwise cost functions between the nodes of a tree. To address this problem, we develop Tree-based Diffusion Schrödinger Bridge (TreeDSB), an extension of the Diffusion Schrödinger Bridge (DSB) algorithm. TreeDSB corresponds to a dynamic and continuous state-space counterpart of the multimarginal Sinkhorn algorithm. A notable use case of our methodology is to compute Wasserstein barycenters which can be recast as the solution of a mOT problem on a star-shaped tree. We demonstrate that our methodology can be applied in high-dimensional settings such as image interpolation and Bayesian fusion. △ Less

Submitted 28 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

arXiv:2304.06549 [pdf, ps, other]

Non-asymptotic convergence bounds for Sinkhorn iterates and their gradients: a coupling approach

Authors: Giacomo Greco, Maxence Noble, Giovanni Conforti, Alain Durmus

Abstract: Computational optimal transport (OT) has recently emerged as a powerful framework with applications in various fields. In this paper we focus on a relaxation of the original OT problem, the entropic OT problem, which allows to implement efficient and practical algorithmic solutions, even in high dimensional settings. This formulation, also known as the Schrödinger Bridge problem, notably connects… ▽ More Computational optimal transport (OT) has recently emerged as a powerful framework with applications in various fields. In this paper we focus on a relaxation of the original OT problem, the entropic OT problem, which allows to implement efficient and practical algorithmic solutions, even in high dimensional settings. This formulation, also known as the Schrödinger Bridge problem, notably connects with Stochastic Optimal Control (SOC) and can be solved with the popular Sinkhorn algorithm. In the case of discrete-state spaces, this algorithm is known to have exponential convergence; however, achieving a similar rate of convergence in a more general setting is still an active area of research. In this work, we analyze the convergence of the Sinkhorn algorithm for probability measures defined on the $d$-dimensional torus $\mathbb{T}_L^d$, that admit densities with respect to the Haar measure of $\mathbb{T}_L^d$. In particular, we prove pointwise exponential convergence of Sinkhorn iterates and their gradient. Our proof relies on the connection between these iterates and the evolution along the Hamilton-Jacobi-Bellman equations of value functions obtained from SOC-problems. Our approach is novel in that it is purely probabilistic and relies on coupling by reflection techniques for controlled diffusions on the torus. △ Less

Submitted 26 June, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

Comments: version accepted to COLT 2023

MSC Class: 49Q22; 93E20 (Primary) 49N05; 90C25; 47D07

arXiv:2304.04451 [pdf, ps, other]

Quantitative contraction rates for Sinkhorn algorithm: beyond bounded costs and compact marginals

Authors: Giovanni Conforti, Alain Durmus, Giacomo Greco

Abstract: We show non-asymptotic exponential convergence of Sinkhorn iterates to the Schrödinger potentials, solutions of the quadratic Entropic Optimal Transport problem on $\mathbb{R}^d$. Our results hold under mild assumptions on the marginal inputs: in particular, we only assume that they admit an asymptotically positive log-concavity profile, covering as special cases log-concave distributions and boun… ▽ More We show non-asymptotic exponential convergence of Sinkhorn iterates to the Schrödinger potentials, solutions of the quadratic Entropic Optimal Transport problem on $\mathbb{R}^d$. Our results hold under mild assumptions on the marginal inputs: in particular, we only assume that they admit an asymptotically positive log-concavity profile, covering as special cases log-concave distributions and bounded smooth perturbations of quadratic potentials. Up to the authors' knowledge, these are the first results which establish exponential convergence of Sinkhorn's algorithm in a general setting without assuming bounded cost functions or compactly supported marginals. △ Less

Submitted 20 June, 2024; v1 submitted 10 April, 2023; originally announced April 2023.

Comments: 34 pages, simplified presentation of main results, added explicit expression for the exponential convergence rates and added stronger results in the log-concave setting

MSC Class: 49Q22; 90C25 (Primary) 49N05; 93E20; 47D07 (Secondary)

arXiv:2303.05838 [pdf, ps, other]

Rosenthal-type inequalities for linear statistics of Markov chains

Authors: Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov, Marina Sheshukova

Abstract: In this paper, we establish novel deviation bounds for additive functionals of geometrically ergodic Markov chains similar to Rosenthal and Bernstein inequalities for sums of independent random variables. We pay special attention to the dependence of our bounds on the mixing time of the corresponding chain. More precisely, we establish explicit bounds that are linked to the constants from the mart… ▽ More In this paper, we establish novel deviation bounds for additive functionals of geometrically ergodic Markov chains similar to Rosenthal and Bernstein inequalities for sums of independent random variables. We pay special attention to the dependence of our bounds on the mixing time of the corresponding chain. More precisely, we establish explicit bounds that are linked to the constants from the martingale version of the Rosenthal inequality, as well as the constants that characterize the mixing properties of the underlying Markov kernel. Finally, our proof technique is, up to our knowledge, new and based on a recurrent application of the Poisson decomposition. △ Less

Submitted 28 June, 2023; v1 submitted 10 March, 2023; originally announced March 2023.

MSC Class: 60E15; 60J20; 65C40

arXiv:2302.04763 [pdf, other]

On Sampling with Approximate Transport Maps

Authors: Louis Grenioux, Alain Durmus, Éric Moulines, Marylou Gabrié

Abstract: Transport maps can ease the sampling of distributions with non-trivial geometries by transforming them into distributions that are easier to handle. The potential of this approach has risen with the development of Normalizing Flows (NF) which are maps parameterized with deep neural networks trained to push a reference distribution towards a target. NF-enhanced samplers recently proposed blend (Mar… ▽ More Transport maps can ease the sampling of distributions with non-trivial geometries by transforming them into distributions that are easier to handle. The potential of this approach has risen with the development of Normalizing Flows (NF) which are maps parameterized with deep neural networks trained to push a reference distribution towards a target. NF-enhanced samplers recently proposed blend (Markov chain) Monte Carlo methods with either (i) proposal draws from the flow or (ii) a flow-based reparametrization. In both cases, the quality of the learned transport conditions performance. The present work clarifies for the first time the relative strengths and weaknesses of these two approaches. Our study concludes that multimodal targets can be reliably handled with flow-based proposals up to moderately high dimensions. In contrast, methods relying on reparametrization struggle with multimodality but are more robust otherwise in high-dimensional settings and under poor training. To further illustrate the influence of target-proposal adequacy, we also derive a new quantitative bound for the mixing time of the Independent Metropolis-Hastings sampler. △ Less

Submitted 18 February, 2024; v1 submitted 9 February, 2023; originally announced February 2023.

arXiv:2301.02446 [pdf, other]

Optimal Scaling Results for Moreau-Yosida Metropolis-adjusted Langevin Algorithms

Authors: Francesca R. Crucinio, Alain Durmus, Pablo Jiménez, Gareth O. Roberts

Abstract: We consider a recently proposed class of MCMC methods which uses proximity maps instead of gradients to build proposal mechanisms which can be employed for both differentiable and non-differentiable targets. These methods have been shown to be stable for a wide class of targets, making them a valuable alternative to Metropolis-adjusted Langevin algorithms (MALA); and have found wide application in… ▽ More We consider a recently proposed class of MCMC methods which uses proximity maps instead of gradients to build proposal mechanisms which can be employed for both differentiable and non-differentiable targets. These methods have been shown to be stable for a wide class of targets, making them a valuable alternative to Metropolis-adjusted Langevin algorithms (MALA); and have found wide application in imaging contexts. The wider stability properties are obtained by building the Moreau-Yosida envelope for the target of interest, which depends on a parameter $λ$. In this work, we investigate the optimal scaling problem for this class of algorithms, which encompasses MALA, and provide practical guidelines for the implementation of these methods. △ Less

Submitted 19 June, 2024; v1 submitted 6 January, 2023; originally announced January 2023.

MSC Class: 65C05; 60F05

arXiv:2211.00100 [pdf, other]

Federated Averaging Langevin Dynamics: Toward a unified theory and new algorithms

Authors: Vincent Plassier, Alain Durmus, Eric Moulines

Abstract: This paper focuses on Bayesian inference in a federated learning context (FL). While several distributed MCMC algorithms have been proposed, few consider the specific limitations of FL such as communication bottlenecks and statistical heterogeneity. Recently, Federated Averaging Langevin Dynamics (FALD) was introduced, which extends the Federated Averaging algorithm to Bayesian inference. We obtai… ▽ More This paper focuses on Bayesian inference in a federated learning context (FL). While several distributed MCMC algorithms have been proposed, few consider the specific limitations of FL such as communication bottlenecks and statistical heterogeneity. Recently, Federated Averaging Langevin Dynamics (FALD) was introduced, which extends the Federated Averaging algorithm to Bayesian inference. We obtain a novel tight non-asymptotic upper bound on the Wasserstein distance to the global posterior for FALD. This bound highlights the effects of statistical heterogeneity, which causes a drift in the local updates that negatively impacts convergence. We propose a new algorithm VR-FALD* that uses control variates to correct the client drift. We establish non-asymptotic bounds showing that VR-FALD* is not affected by statistical heterogeneity. Finally, we illustrate our results on several FL benchmarks for Bayesian inference. △ Less

Submitted 31 October, 2022; originally announced November 2022.

Comments: 58 pages

arXiv:2210.11925 [pdf, other]

Unbiased constrained sampling with Self-Concordant Barrier Hamiltonian Monte Carlo

Authors: Maxence Noble, Valentin De Bortoli, Alain Durmus

Abstract: In this paper, we propose Barrier Hamiltonian Monte Carlo (BHMC), a version of the HMC algorithm which aims at sampling from a Gibbs distribution $π$ on a manifold $\mathrm{M}$, endowed with a Hessian metric $\mathfrak{g}$ derived from a self-concordant barrier. Our method relies on Hamiltonian dynamics which comprises $\mathfrak{g}$. Therefore, it incorporates the constraints defining… ▽ More In this paper, we propose Barrier Hamiltonian Monte Carlo (BHMC), a version of the HMC algorithm which aims at sampling from a Gibbs distribution $π$ on a manifold $\mathrm{M}$, endowed with a Hessian metric $\mathfrak{g}$ derived from a self-concordant barrier. Our method relies on Hamiltonian dynamics which comprises $\mathfrak{g}$. Therefore, it incorporates the constraints defining $\mathrm{M}$ and is able to exploit its underlying geometry. However, the corresponding Hamiltonian dynamics is defined via non separable Ordinary Differential Equations (ODEs) in contrast to the Euclidean case. It implies unavoidable bias in existing generalization of HMC to Riemannian manifolds. In this paper, we propose a new filter step, called "involution checking step", to address this problem. This step is implemented in two versions of BHMC, coined continuous BHMC (c-BHMC) and numerical BHMC (n-BHMC) respectively. Our main results establish that these two new algorithms generate reversible Markov chains with respect to $π$ and do not suffer from any bias in comparison to previous implementations. Our conclusions are supported by numerical experiments where we consider target distributions defined on polytopes. △ Less

Submitted 28 October, 2023; v1 submitted 21 October, 2022; originally announced October 2022.

arXiv:2207.04475 [pdf, ps, other]

Finite-time High-probability Bounds for Polyak-Ruppert Averaged Iterates of Linear Stochastic Approximation

Authors: Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov

Abstract: This paper provides a finite-time analysis of linear stochastic approximation (LSA) algorithms with fixed step size, a core method in statistics and machine learning. LSA is used to compute approximate solutions of a $d$-dimensional linear system $\bar{\mathbf{A}} θ= \bar{\mathbf{b}}$ for which $(\bar{\mathbf{A}}, \bar{\mathbf{b}})$ can only be estimated by (asymptotically) unbiased observations… ▽ More This paper provides a finite-time analysis of linear stochastic approximation (LSA) algorithms with fixed step size, a core method in statistics and machine learning. LSA is used to compute approximate solutions of a $d$-dimensional linear system $\bar{\mathbf{A}} θ= \bar{\mathbf{b}}$ for which $(\bar{\mathbf{A}}, \bar{\mathbf{b}})$ can only be estimated by (asymptotically) unbiased observations $\{(\mathbf{A}(Z_n),\mathbf{b}(Z_n))\}_{n \in \mathbb{N}}$. We consider here the case where $\{Z_n\}_{n \in \mathbb{N}}$ is an i.i.d. sequence or a uniformly geometrically ergodic Markov chain. We derive $p$-th moment and high-probability deviation bounds for the iterates defined by LSA and its Polyak-Ruppert-averaged version. Our finite-time instance-dependent bounds for the averaged LSA iterates are sharp in the sense that the leading term we obtain coincides with the local asymptotic minimax limit. Moreover, the remainder terms of our bounds admit a tight dependence on the mixing time $t_{\operatorname{mix}}$ of the underlying chain and the norm of the noise variables. We emphasize that our result requires the SA step size to scale only with logarithm of the problem dimension $d$. △ Less

Submitted 29 March, 2023; v1 submitted 10 July, 2022; originally announced July 2022.

MSC Class: 62L20; 60J20

arXiv:2207.03859 [pdf, other]

Variational Inference of overparameterized Bayesian Neural Networks: a theoretical and empirical study

Authors: Tom Huix, Szymon Majewski, Alain Durmus, Eric Moulines, Anna Korba

Abstract: This paper studies the Variational Inference (VI) used for training Bayesian Neural Networks (BNN) in the overparameterized regime, i.e., when the number of neurons tends to infinity. More specifically, we consider overparameterized two-layer BNN and point out a critical issue in the mean-field VI training. This problem arises from the decomposition of the lower bound on the evidence (ELBO) into t… ▽ More This paper studies the Variational Inference (VI) used for training Bayesian Neural Networks (BNN) in the overparameterized regime, i.e., when the number of neurons tends to infinity. More specifically, we consider overparameterized two-layer BNN and point out a critical issue in the mean-field VI training. This problem arises from the decomposition of the lower bound on the evidence (ELBO) into two terms: one corresponding to the likelihood function of the model and the second to the Kullback-Leibler (KL) divergence between the prior distribution and the variational posterior. In particular, we show both theoretically and empirically that there is a trade-off between these two terms in the overparameterized regime only when the KL is appropriately re-scaled with respect to the ratio between the the number of observations and neurons. We also illustrate our theoretical results with numerical experiments that highlight the critical choice of this ratio. △ Less

Submitted 8 July, 2022; originally announced July 2022.

arXiv:2206.03611 [pdf, other]

FedPop: A Bayesian Approach for Personalised Federated Learning

Authors: Nikita Kotelevskii, Maxime Vono, Eric Moulines, Alain Durmus

Abstract: Personalised federated learning (FL) aims at collaboratively learning a machine learning model taylored for each client. Albeit promising advances have been made in this direction, most of existing approaches works do not allow for uncertainty quantification which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially f… ▽ More Personalised federated learning (FL) aims at collaboratively learning a machine learning model taylored for each client. Albeit promising advances have been made in this direction, most of existing approaches works do not allow for uncertainty quantification which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially for new clients or those having small number of observations. This paper aims at filling these gaps. To this end, we propose a novel methodology coined FedPop by recasting personalised FL into the population modeling paradigm where clients' models involve fixed common population parameters and random effects, aiming at explaining data heterogeneity. To derive convergence guarantees for our scheme, we introduce a new class of federated stochastic optimisation algorithms which relies on Markov chain Monte Carlo methods. Compared to existing personalised FL methods, the proposed methodology has important benefits: it is robust to client drift, practical for inference on new clients, and above all, enables uncertainty quantification under mild computational and memory overheads. We provide non-asymptotic convergence guarantees for the proposed algorithms and illustrate their performances on various personalised federated learning tasks. △ Less

Submitted 26 January, 2023; v1 submitted 7 June, 2022; originally announced June 2022.

arXiv:2201.07652 [pdf, ps, other]

Sticky nonlinear SDEs and convergence of McKean-Vlasov equations without confinement

Authors: Alain Durmus, Andreas Eberle, Arnaud Guillin, Katharina Schuh

Abstract: We develop a new approach to study the long time behaviour of solutions to nonlinear stochastic differential equations in the sense of McKean, as well as propagation of chaos for the corresponding mean-field particle system approximations. Our approach is based on a sticky coupling between two solutions to the equation. We show that the distance process between the two copies is dominated by a sol… ▽ More We develop a new approach to study the long time behaviour of solutions to nonlinear stochastic differential equations in the sense of McKean, as well as propagation of chaos for the corresponding mean-field particle system approximations. Our approach is based on a sticky coupling between two solutions to the equation. We show that the distance process between the two copies is dominated by a solution to a one-dimensional nonlinear stochastic differential equation with a sticky boundary at zero. This new class of equations is then analyzed carefully. In particular, we show that the dominating equation has a phase transition. In the regime where the Dirac measure at zero is the only invariant probability measure, we prove exponential convergence to equilibrium both for the one-dimensional equation, and for the original nonlinear SDE. Similarly, propagation of chaos is shown by a componentwise sticky coupling and comparison with a system of one dimensional nonlinear SDEs with sticky boundaries at zero. The approach applies to equations without confinement potential and to interaction terms that are not of gradient type. △ Less

Submitted 13 November, 2022; v1 submitted 19 January, 2022; originally announced January 2022.

Comments: 46 pages

MSC Class: 60H10 (Primary); 60J60; 82C31 (Secondary)

arXiv:2201.06133 [pdf, other]

On Maximum-a-Posteriori estimation with Plug & Play priors and stochastic gradient descent

Authors: Rémi Laumont, Valentin de Bortoli, Andrés Almansa, Julie Delon, Alain Durmus, Marcelo Pereyra

Abstract: Bayesian methods to solve imaging inverse problems usually combine an explicit data likelihood function with a prior distribution that explicitly models expected properties of the solution. Many kinds of priors have been explored in the literature, from simple ones expressing local properties to more involved ones exploiting image redundancy at a non-local scale. In a departure from explicit model… ▽ More Bayesian methods to solve imaging inverse problems usually combine an explicit data likelihood function with a prior distribution that explicitly models expected properties of the solution. Many kinds of priors have been explored in the literature, from simple ones expressing local properties to more involved ones exploiting image redundancy at a non-local scale. In a departure from explicit modelling, several recent works have proposed and studied the use of implicit priors defined by an image denoising algorithm. This approach, commonly known as Plug & Play (PnP) regularisation, can deliver remarkably accurate results, particularly when combined with state-of-the-art denoisers based on convolutional neural networks. However, the theoretical analysis of PnP Bayesian models and algorithms is difficult and works on the topic often rely on unrealistic assumptions on the properties of the image denoiser. This papers studies maximum-a-posteriori (MAP) estimation for Bayesian models with PnP priors. We first consider questions related to existence, stability and well-posedness, and then present a convergence proof for MAP computation by PnP stochastic gradient descent (PnP-SGD) under realistic assumptions on the denoiser used. We report a range of imaging experiments demonstrating PnP-SGD as well as comparisons with other PnP schemes. △ Less

Submitted 16 January, 2022; originally announced January 2022.

MSC Class: 65K10 (Primary) 65K05; 62F15; 62C10; 68Q25; 68U10; 90C26 (Secondary) 65K10; 65K05; 62F15; 62C10; 68Q25; 68U10; 90C26

arXiv:2201.05002 [pdf, other]

Boost your favorite Markov Chain Monte Carlo sampler using Kac's theorem: the Kick-Kac teleportation algorithm

Authors: Randal Douc, Alain Durmus, Aurélien Enfroy, Jimmy Olsson

Abstract: The present paper focuses on the problem of sampling from a given target distribution $π$ defined on some general state space. To this end, we introduce a novel class of non-reversible Markov chains, each chain being defined on an extended state space and having an invariant probability measure admitting $π$ as a marginal distribution. The proposed methodology is inspired by a new formulation of K… ▽ More The present paper focuses on the problem of sampling from a given target distribution $π$ defined on some general state space. To this end, we introduce a novel class of non-reversible Markov chains, each chain being defined on an extended state space and having an invariant probability measure admitting $π$ as a marginal distribution. The proposed methodology is inspired by a new formulation of Kac's theorem and allows global and local dynamics to be smoothly combined. Under mild conditions, the corresponding Markov transition kernel can be shown to be irreducible and Harris recurrent. In addition, we establish that geometric ergodicity holds under appropriate conditions on the global and local dynamics. Finally, we illustrate numerically the use of the proposed method and its potential benefits in comparison to existing Markov chain Monte Carlo (MCMC) algorithms. △ Less

Submitted 13 May, 2023; v1 submitted 13 January, 2022; originally announced January 2022.

MSC Class: 62-08; 60J20; 65C05 (Primary); 60F15 (Secondary)

arXiv:2201.01951 [pdf, ps, other]

On the geometric convergence for MALA under verifiable conditions

Authors: Alain Durmus, Éric Moulines

Abstract: While the Metropolis Adjusted Langevin Algorithm (MALA) is a popular and widely used Markov chain Monte Carlo method, very few papers derive conditions that ensure its convergence. In particular, to the authors' knowledge, assumptions that are both easy to verify and guarantee geometric convergence, are still missing. In this work, we establish $V$-uniformly geometric convergence for MALA under mi… ▽ More While the Metropolis Adjusted Langevin Algorithm (MALA) is a popular and widely used Markov chain Monte Carlo method, very few papers derive conditions that ensure its convergence. In particular, to the authors' knowledge, assumptions that are both easy to verify and guarantee geometric convergence, are still missing. In this work, we establish $V$-uniformly geometric convergence for MALA under mild assumptions about the target distribution. Unlike previous work, we only consider tail and smoothness conditions for the potential associated with the target distribution. These conditions are quite common in the MCMC literature and are easy to verify in practice. Finally, we pay special attention to the dependence of the bounds we derive on the step size of the Euler-Maruyama discretization, which corresponds to the proposal Markov kernel of MALA. △ Less

Submitted 6 January, 2022; originally announced January 2022.

arXiv:2111.02702 [pdf, other]

Local-Global MCMC kernels: the best of both worlds

Authors: Sergey Samsonov, Evgeny Lagutin, Marylou Gabrié, Alain Durmus, Alexey Naumov, Eric Moulines

Abstract: Recent works leveraging learning to enhance sampling have shown promising results, in particular by designing effective non-local moves and global proposals. However, learning accuracy is inevitably limited in regions where little data is available such as in the tails of distributions as well as in high-dimensional problems. In the present paper we study an Explore-Exploit Markov chain Monte Carl… ▽ More Recent works leveraging learning to enhance sampling have shown promising results, in particular by designing effective non-local moves and global proposals. However, learning accuracy is inevitably limited in regions where little data is available such as in the tails of distributions as well as in high-dimensional problems. In the present paper we study an Explore-Exploit Markov chain Monte Carlo strategy ($Ex^2MCMC$) that combines local and global samplers showing that it enjoys the advantages of both approaches. We prove $V$-uniform geometric ergodicity of $Ex^2MCMC$ without requiring a uniform adaptation of the global sampler to the target distribution. We also compute explicit bounds on the mixing rate of the Explore-Exploit strategy under realistic conditions. Moreover, we also analyze an adaptive version of the strategy ($FlEx^2MCMC$) where a normalizing flow is trained while sampling to serve as a proposal for global moves. We illustrate the efficiency of $Ex^2MCMC$ and its adaptive version on classical sampling benchmarks as well as in sampling high-dimensional distributions defined by Generative Adversarial Networks seen as Energy Based Models. We provide the code to reproduce the experiments at the link: https://github.com/svsamsonov/ex2mcmc_new. △ Less

Submitted 4 October, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

Comments: arXiv admin note: text overlap with arXiv:1111.5421 by other authors

arXiv:2109.00331 [pdf, ps, other]

Probability and moment inequalities for additive functionals of geometrically ergodic Markov chains

Authors: Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov

Abstract: In this paper, we establish moment and Bernstein-type inequalities for additive functionals of geometrically ergodic Markov chains. These inequalities extend the corresponding inequalities for independent random variables. Our conditions cover Markov chains converging geometrically to the stationary distribution either in $V$-norms or in weighted Wasserstein distances. Our inequalities apply to un… ▽ More In this paper, we establish moment and Bernstein-type inequalities for additive functionals of geometrically ergodic Markov chains. These inequalities extend the corresponding inequalities for independent random variables. Our conditions cover Markov chains converging geometrically to the stationary distribution either in $V$-norms or in weighted Wasserstein distances. Our inequalities apply to unbounded functions and depend explicitly on constants appearing in the conditions that we consider. △ Less

Submitted 15 June, 2023; v1 submitted 1 September, 2021; originally announced September 2021.

MSC Class: 60E15; 60J20; 65C40

arXiv:2108.00682 [pdf, other]

Asymptotic bias of inexact Markov Chain Monte Carlo methods in high dimension

Authors: Alain Oliviero Durmus, Andreas Eberle

Abstract: Inexact Markov Chain Monte Carlo methods rely on Markov chains that do not exactly preserve the target distribution. Examples include the unadjusted Langevin algorithm (ULA) and unadjusted Hamiltonian Monte Carlo (uHMC). This paper establishes bounds on Wasserstein distances between the invariant probability measures of inexact MCMC methods and their target distributions with a focus on understand… ▽ More Inexact Markov Chain Monte Carlo methods rely on Markov chains that do not exactly preserve the target distribution. Examples include the unadjusted Langevin algorithm (ULA) and unadjusted Hamiltonian Monte Carlo (uHMC). This paper establishes bounds on Wasserstein distances between the invariant probability measures of inexact MCMC methods and their target distributions with a focus on understanding the precise dependence of this asymptotic bias on both dimension and discretization step size. Assuming Wasserstein bounds on the convergence to equilibrium of either the exact or the approximate dynamics, we show that for both ULA and uHMC, the asymptotic bias depends on key quantities related to the target distribution or the stationary probability measure of the scheme. As a corollary, we conclude that for models with a limited amount of interactions such as mean-field models, finite range graphical models, and perturbations thereof, the asymptotic bias has a similar dependence on the step size and the dimension as for product measures. △ Less

Submitted 12 April, 2023; v1 submitted 2 August, 2021; originally announced August 2021.

arXiv:2107.14542 [pdf, ps, other]

Uniform minorization condition and convergence bounds for discretizations of kinetic Langevin dynamics

Authors: Alain Durmus, Aurélien Enfroy, Éric Moulines, Gabriel Stoltz

Abstract: We study the convergence in total variation and $V$-norm of discretization schemes of the underdamped Langevin dynamics. Such algorithms are very popular and commonly used in molecular dynamics and computational statistics to approximatively sample from a target distribution of interest. We show first that, for a very large class of schemes, a minorization condition uniform in the stepsize holds.… ▽ More We study the convergence in total variation and $V$-norm of discretization schemes of the underdamped Langevin dynamics. Such algorithms are very popular and commonly used in molecular dynamics and computational statistics to approximatively sample from a target distribution of interest. We show first that, for a very large class of schemes, a minorization condition uniform in the stepsize holds. This class encompasses popular methods such as the Euler-Maruyama scheme and the schemes based on splitting strategies. Second, we provide mild conditions ensuring that the class of schemes that we consider satisfies a geometric Foster--Lyapunov drift condition, again uniform in the stepsize. This allows us to derive geometric convergence bounds, with a convergence rate scaling linearly with the stepsize. This kind of result is of prime interest to obtain estimates on norms of solutions to Poisson equations associated with a given numerical method. △ Less

Submitted 21 April, 2023; v1 submitted 30 July, 2021; originally announced July 2021.

arXiv:2106.15921 [pdf, other]

Monte Carlo Variational Auto-Encoders

Authors: Achille Thin, Nikita Kotelevskii, Arnaud Doucet, Alain Durmus, Eric Moulines, Maxim Panov

Abstract: Variational auto-encoders (VAE) are popular deep latent variable models which are trained by maximizing an Evidence Lower Bound (ELBO). To obtain tighter ELBO and hence better variational approximations, it has been proposed to use importance sampling to get a lower variance estimate of the evidence. However, importance sampling is known to perform poorly in high dimensions. While it has been sugg… ▽ More Variational auto-encoders (VAE) are popular deep latent variable models which are trained by maximizing an Evidence Lower Bound (ELBO). To obtain tighter ELBO and hence better variational approximations, it has been proposed to use importance sampling to get a lower variance estimate of the evidence. However, importance sampling is known to perform poorly in high dimensions. While it has been suggested many times in the literature to use more sophisticated algorithms such as Annealed Importance Sampling (AIS) and its Sequential Importance Sampling (SIS) extensions, the potential benefits brought by these advanced techniques have never been realized for VAE: the AIS estimate cannot be easily differentiated, while SIS requires the specification of carefully chosen backward Markov kernels. In this paper, we address both issues and demonstrate the performance of the resulting Monte Carlo VAEs on a variety of applications. △ Less

Submitted 30 June, 2021; originally announced June 2021.

arXiv:2106.15427 [pdf, other]

Fast Approximation of the Sliced-Wasserstein Distance Using Concentration of Random Projections

Authors: Kimia Nadjahi, Alain Durmus, Pierre E. Jacob, Roland Badeau, Umut Şimşekli

Abstract: The Sliced-Wasserstein distance (SW) is being increasingly used in machine learning applications as an alternative to the Wasserstein distance and offers significant computational and statistical benefits. Since it is defined as an expectation over random projections, SW is commonly approximated by Monte Carlo. We adopt a new perspective to approximate SW by making use of the concentration of meas… ▽ More The Sliced-Wasserstein distance (SW) is being increasingly used in machine learning applications as an alternative to the Wasserstein distance and offers significant computational and statistical benefits. Since it is defined as an expectation over random projections, SW is commonly approximated by Monte Carlo. We adopt a new perspective to approximate SW by making use of the concentration of measure phenomenon: under mild assumptions, one-dimensional projections of a high-dimensional random vector are approximately Gaussian. Based on this observation, we develop a simple deterministic approximation for SW. Our method does not require sampling a number of random projections, and is therefore both accurate and easy to use compared to the usual Monte Carlo approximation. We derive nonasymptotical guarantees for our approach, and show that the approximation error goes to zero as the dimension increases, under a weak dependence condition on the data distribution. We validate our theoretical findings on synthetic datasets, and illustrate the proposed approximation on a generative modeling problem. △ Less

Submitted 4 January, 2022; v1 submitted 29 June, 2021; originally announced June 2021.

Comments: Published at NeurIPS 2021

arXiv:2106.06300 [pdf, other]

DG-LMC: A Turn-key and Scalable Synchronous Distributed MCMC Algorithm via Langevin Monte Carlo within Gibbs

Authors: Vincent Plassier, Maxime Vono, Alain Durmus, Eric Moulines

Abstract: Performing reliable Bayesian inference on a big data scale is becoming a keystone in the modern era of machine learning. A workhorse class of methods to achieve this task are Markov chain Monte Carlo (MCMC) algorithms and their design to handle distributed datasets has been the subject of many works. However, existing methods are not completely either reliable or computationally efficient. In this… ▽ More Performing reliable Bayesian inference on a big data scale is becoming a keystone in the modern era of machine learning. A workhorse class of methods to achieve this task are Markov chain Monte Carlo (MCMC) algorithms and their design to handle distributed datasets has been the subject of many works. However, existing methods are not completely either reliable or computationally efficient. In this paper, we propose to fill this gap in the case where the dataset is partitioned and stored on computing nodes within a cluster under a master/slaves architecture. We derive a user-friendly centralised distributed MCMC algorithm with provable scaling in high-dimensional settings. We illustrate the relevance of the proposed methodology on both synthetic and real data experiments. △ Less

Submitted 18 June, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

Comments: 77 pages. Accepted for publication at ICML 2021, to appear

arXiv:2106.01257 [pdf, ps, other]

Tight High Probability Bounds for Linear Stochastic Approximation with Fixed Stepsize

Authors: Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov, Kevin Scaman, Hoi-To Wai

Abstract: This paper provides a non-asymptotic analysis of linear stochastic approximation (LSA) algorithms with fixed stepsize. This family of methods arises in many machine learning tasks and is used to obtain approximate solutions of a linear system $\bar{A}θ= \bar{b}$ for which $\bar{A}$ and $\bar{b}$ can only be accessed through random estimates $\{({\bf A}_n, {\bf b}_n): n \in \mathbb{N}^*\}$. Our ana… ▽ More This paper provides a non-asymptotic analysis of linear stochastic approximation (LSA) algorithms with fixed stepsize. This family of methods arises in many machine learning tasks and is used to obtain approximate solutions of a linear system $\bar{A}θ= \bar{b}$ for which $\bar{A}$ and $\bar{b}$ can only be accessed through random estimates $\{({\bf A}_n, {\bf b}_n): n \in \mathbb{N}^*\}$. Our analysis is based on new results regarding moments and high probability bounds for products of matrices which are shown to be tight. We derive high probability bounds on the performance of LSA under weaker conditions on the sequence $\{({\bf A}_n, {\bf b}_n): n \in \mathbb{N}^*\}$ than previous works. However, in contrast, we establish polynomial concentration bounds with order depending on the stepsize. We show that our conclusions cannot be improved without additional assumptions on the sequence of random matrices $\{{\bf A}_n: n \in \mathbb{N}^*\}$, and in particular that no Gaussian or exponential high probability bounds can hold. Finally, we pay a particular attention to establishing bounds with sharp order with respect to the number of iterations and the stepsize and whose leading terms contain the covariance matrices appearing in the central limit theorems. △ Less

Submitted 2 June, 2021; originally announced June 2021.

Comments: 21 pages

arXiv:2106.00797 [pdf, other]

QLSD: Quantised Langevin stochastic dynamics for Bayesian federated learning

Authors: Maxime Vono, Vincent Plassier, Alain Durmus, Aymeric Dieuleveut, Eric Moulines

Abstract: The objective of Federated Learning (FL) is to perform statistical inference for data which are decentralised and stored locally on networked clients. FL raises many constraints which include privacy and data ownership, communication overhead, statistical heterogeneity, and partial client participation. In this paper, we address these problems in the framework of the Bayesian paradigm. To this end… ▽ More The objective of Federated Learning (FL) is to perform statistical inference for data which are decentralised and stored locally on networked clients. FL raises many constraints which include privacy and data ownership, communication overhead, statistical heterogeneity, and partial client participation. In this paper, we address these problems in the framework of the Bayesian paradigm. To this end, we propose a novel federated Markov Chain Monte Carlo algorithm, referred to as Quantised Langevin Stochastic Dynamics which may be seen as an extension to the FL setting of Stochastic Gradient Langevin Dynamics, which handles the communication bottleneck using gradient compression. To improve performance, we then introduce variance reduction techniques, which lead to two improved versions coined \texttt{QLSD}$^{\star}$ and \texttt{QLSD}$^{++}$. We give both non-asymptotic and asymptotic convergence guarantees for the proposed algorithms. We illustrate their performances using various Bayesian Federated Learning benchmarks. △ Less

Submitted 31 May, 2022; v1 submitted 1 June, 2021; originally announced June 2021.

arXiv:2104.06771 [pdf, other]

Discrete sticky couplings of functional autoregressive processes

Authors: Alain Durmus, Andreas Eberle, Aurélien Enfroy, Arnaud Guillin, Pierre Monmarché

Abstract: In this paper, we provide bounds in Wasserstein and total variation distances between the distributions of the successive iterates of two functional autoregressive processes with isotropic Gaussian noise of the form $Y_{k+1} = \mathrm{T}_γ(Y_k) + \sqrt{γσ^2} Z_{k+1}$ and $\tilde{Y}_{k+1} = \tilde{\mathrm{T}}_γ(\tilde{Y}_k) + \sqrt{γσ^2} \tilde{Z}_{k+1}$. More precisely, we give non-asymptotic boun… ▽ More In this paper, we provide bounds in Wasserstein and total variation distances between the distributions of the successive iterates of two functional autoregressive processes with isotropic Gaussian noise of the form $Y_{k+1} = \mathrm{T}_γ(Y_k) + \sqrt{γσ^2} Z_{k+1}$ and $\tilde{Y}_{k+1} = \tilde{\mathrm{T}}_γ(\tilde{Y}_k) + \sqrt{γσ^2} \tilde{Z}_{k+1}$. More precisely, we give non-asymptotic bounds on $ρ(\mathcal{L}(Y_{k}),\mathcal{L}(\tilde{Y}_k))$, where $ρ$ is an appropriate weighted Wasserstein distance or a $V$-distance, uniformly in the parameter $γ$, and on $ρ(π_γ,\tildeπ_γ)$, where $π_γ$ and $\tildeπ_γ$ are the respective stationary measures of the two processes. The class of considered processes encompasses the Euler-Maruyama discretization of Langevin diffusions and its variants. The bounds we derive are of order $γ$ as $γ\to 0$. To obtain our results, we rely on the construction of a discrete sticky Markov chain $(W_k^{(γ)})_{k \in \mathbb{N}}$ which bounds the distance between an appropriate coupling of the two processes. We then establish stability and quantitative convergence results for this process uniformly on $γ$. In addition, we show that it converges in distribution to the continuous sticky process studied in previous work. Finally, we apply our result to Bayesian inference of ODE parameters and numerically illustrate them on two particular problems. △ Less

Submitted 28 November, 2023; v1 submitted 14 April, 2021; originally announced April 2021.

arXiv:2103.10943 [pdf, other]

NEO: Non Equilibrium Sampling on the Orbit of a Deterministic Transform

Authors: Achille Thin, Yazid Janati, Sylvain Le Corff, Charles Ollion, Arnaud Doucet, Alain Durmus, Eric Moulines, Christian Robert

Abstract: Sampling from a complex distribution $π$ and approximating its intractable normalizing constant Z are challenging problems. In this paper, a novel family of importance samplers (IS) and Markov chain Monte Carlo (MCMC) samplers is derived. Given an invertible map T, these schemes combine (with weights) elements from the forward and backward Orbits through points sampled from a proposal distributi… ▽ More Sampling from a complex distribution $π$ and approximating its intractable normalizing constant Z are challenging problems. In this paper, a novel family of importance samplers (IS) and Markov chain Monte Carlo (MCMC) samplers is derived. Given an invertible map T, these schemes combine (with weights) elements from the forward and backward Orbits through points sampled from a proposal distribution $ρ$. The map T does not leave the target $π$ invariant, hence the name NEO, standing for Non-Equilibrium Orbits. NEO-IS provides unbiased estimators of the normalizing constant and self-normalized IS estimators of expectations under $π$ while NEO-MCMC combines multiple NEO-IS estimates of the normalizing constant and an iterated sampling-importance resampling mechanism to sample from $π$. For T chosen as a discrete-time integrator of a conformal Hamiltonian system, NEO-IS achieves state-of-the art performance on difficult benchmarks and NEO-MCMC is able to explore highly multimodal targets. Additionally, we provide detailed theoretical results for both methods. In particular, we show that NEO-MCMC is uniformly geometrically ergodic and establish explicit mixing time estimates under mild conditions. △ Less

Submitted 23 August, 2021; v1 submitted 17 March, 2021; originally announced March 2021.

arXiv:2103.04715 [pdf, other]

doi 10.1137/21M1406349

Bayesian imaging using Plug & Play priors: when Langevin meets Tweedie

Authors: Rémi Laumont, Valentin de Bortoli, Andrés Almansa, Julie Delon, Alain Durmus, Marcelo Pereyra

Abstract: Since the seminal work of Venkatakrishnan et al. in 2013, Plug & Play (PnP) methods have become ubiquitous in Bayesian imaging. These methods derive Minimum Mean Square Error (MMSE) or Maximum A Posteriori (MAP) estimators for inverse problems in imaging by combining an explicit likelihood function with a prior that is implicitly defined by an image denoising algorithm. The PnP algorithms proposed… ▽ More Since the seminal work of Venkatakrishnan et al. in 2013, Plug & Play (PnP) methods have become ubiquitous in Bayesian imaging. These methods derive Minimum Mean Square Error (MMSE) or Maximum A Posteriori (MAP) estimators for inverse problems in imaging by combining an explicit likelihood function with a prior that is implicitly defined by an image denoising algorithm. The PnP algorithms proposed in the literature mainly differ in the iterative schemes they use for optimisation or for sampling. In the case of optimisation schemes, some recent works guarantee the convergence to a fixed point, albeit not necessarily a MAP estimate. In the case of sampling schemes, to the best of our knowledge, there is no known proof of convergence. There also remain important open questions regarding whether the underlying Bayesian models and estimators are well defined, well-posed, and have the basic regularity properties required to support these numerical schemes. To address these limitations, this paper develops theory, methods, and provably convergent algorithms for performing Bayesian inference with PnP priors. We introduce two algorithms: 1) PnP-ULA (Unadjusted Langevin Algorithm) for Monte Carlo sampling and MMSE inference; and 2) PnP-SGD (Stochastic Gradient Descent) for MAP inference. Using recent results on the quantitative convergence of Markov chains, we establish detailed convergence guarantees for these two algorithms under realistic assumptions on the denoising operators used, with special attention to denoisers based on deep neural networks. We also show that these algorithms approximately target a decision-theoretically optimal Bayesian model that is well-posed. The proposed algorithms are demonstrated on several canonical problems such as image deblurring, inpainting, and denoising, where they are used for point estimation as well as for uncertainty visualisation and quantification. △ Less

Submitted 12 January, 2022; v1 submitted 8 March, 2021; originally announced March 2021.

MSC Class: 65K10; 65K05; 65D18; 62F15; 62C10; 68Q25; 68U10; 90C26

arXiv:2102.07586 [pdf, other]

On Riemannian Stochastic Approximation Schemes with Fixed Step-Size

Authors: Alain Durmus, Pablo Jiménez, Éric Moulines, Salem Said

Abstract: This paper studies fixed step-size stochastic approximation (SA) schemes, including stochastic gradient schemes, in a Riemannian framework. It is motivated by several applications, where geodesics can be computed explicitly, and their use accelerates crude Euclidean methods. A fixed step-size scheme defines a family of time-homogeneous Markov chains, parametrized by the step-size. Here, using this… ▽ More This paper studies fixed step-size stochastic approximation (SA) schemes, including stochastic gradient schemes, in a Riemannian framework. It is motivated by several applications, where geodesics can be computed explicitly, and their use accelerates crude Euclidean methods. A fixed step-size scheme defines a family of time-homogeneous Markov chains, parametrized by the step-size. Here, using this formulation, non-asymptotic performance bounds are derived, under Lyapunov conditions. Then, for any step-size, the corresponding Markov chain is proved to admit a unique stationary distribution, and to be geometrically ergodic. This result gives rise to a family of stationary distributions indexed by the step-size, which is further shown to converge to a Dirac measure, concentrated at the solution of the problem at hand, as the step-size goes to 0. Finally, the asymptotic rate of this convergence is established, through an asymptotic expansion of the bias, and a central limit theorem. △ Less

Submitted 19 February, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

Comments: 37 pages, 4 figures, to appear in AISTAT21

MSC Class: 60F05

arXiv:2102.00185 [pdf, ps, other]

On the Stability of Random Matrix Product with Markovian Noise: Application to Linear Stochastic Approximation and TD Learning

Authors: Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov, Hoi-To Wai

Abstract: This paper studies the exponential stability of random matrix products driven by a general (possibly unbounded) state space Markov chain. It is a cornerstone in the analysis of stochastic algorithms in machine learning (e.g. for parameter tracking in online learning or reinforcement learning). The existing results impose strong conditions such as uniform boundedness of the matrix-valued functions… ▽ More This paper studies the exponential stability of random matrix products driven by a general (possibly unbounded) state space Markov chain. It is a cornerstone in the analysis of stochastic algorithms in machine learning (e.g. for parameter tracking in online learning or reinforcement learning). The existing results impose strong conditions such as uniform boundedness of the matrix-valued functions and uniform ergodicity of the Markov chains. Our main contribution is an exponential stability result for the $p$-th moment of random matrix product, provided that (i) the underlying Markov chain satisfies a super-Lyapunov drift condition, (ii) the growth of the matrix-valued functions is controlled by an appropriately defined function (related to the drift condition). Using this result, we give finite-time $p$-th moment bounds for constant and decreasing stepsize linear stochastic approximation schemes with Markovian noise on general state space. We illustrate these findings for linear value-function estimation in reinforcement learning. We provide finite-time $p$-th moment bound for various members of temporal difference (TD) family of algorithms. △ Less

Submitted 30 January, 2021; originally announced February 2021.

arXiv:2012.15550 [pdf, ps, other]

Nonreversible MCMC from conditional invertible transforms: a complete recipe with convergence guarantees

Authors: Achille Thin, Nikita Kotelevskii, Christophe Andrieu, Alain Durmus, Eric Moulines, Maxim Panov

Abstract: Markov Chain Monte Carlo (MCMC) is a class of algorithms to sample complex and high-dimensional probability distributions. The Metropolis-Hastings (MH) algorithm, the workhorse of MCMC, provides a simple recipe to construct reversible Markov kernels. Reversibility is a tractable property that implies a less tractable but essential property here, invariance. Reversibility is however not necessarily… ▽ More Markov Chain Monte Carlo (MCMC) is a class of algorithms to sample complex and high-dimensional probability distributions. The Metropolis-Hastings (MH) algorithm, the workhorse of MCMC, provides a simple recipe to construct reversible Markov kernels. Reversibility is a tractable property that implies a less tractable but essential property here, invariance. Reversibility is however not necessarily desirable when considering performance. This has prompted recent interest in designing kernels breaking this property. At the same time, an active stream of research has focused on the design of novel versions of the MH kernel, some nonreversible, relying on the use of complex invertible deterministic transforms. While standard implementations of the MH kernel are well understood, the aforementioned developments have not received the same systematic treatment to ensure their validity. This paper fills the gap by developing general tools to ensure that a class of nonreversible Markov kernels, possibly relying on complex transforms, has the desired invariance property and leads to convergent algorithms. This leads to a set of simple and practically verifiable conditions. △ Less

Submitted 29 March, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

arXiv:2008.05793 [pdf, ps, other]

Maximum likelihood estimation of regularisation parameters in high-dimensional inverse problems: an empirical Bayesian approach. Part II: Theoretical Analysis

Authors: Valentin De Bortoli, Alain Durmus, Ana F. Vidal, Marcelo Pereyra

Abstract: This paper presents a detailed theoretical analysis of the three stochastic approximation proximal gradient algorithms proposed in our companion paper [49] to set regularization parameters by marginal maximum likelihood estimation. We prove the convergence of a more general stochastic approximation scheme that includes the three algorithms of [49] as special cases. This includes asymptotic and non… ▽ More This paper presents a detailed theoretical analysis of the three stochastic approximation proximal gradient algorithms proposed in our companion paper [49] to set regularization parameters by marginal maximum likelihood estimation. We prove the convergence of a more general stochastic approximation scheme that includes the three algorithms of [49] as special cases. This includes asymptotic and non-asymptotic convergence results with natural and easily verifiable conditions, as well as explicit bounds on the convergence rates. Importantly, the theory is also general in that it can be applied to other intractable optimisation problems. A main novelty of the work is that the stochastic gradient estimates of our scheme are constructed from inexact proximal Markov chain Monte Carlo samplers. This allows the use of samplers that scale efficiently to large problems and for which we have precise theoretical guarantees. △ Less

Submitted 13 August, 2020; originally announced August 2020.

Comments: SIIMS 2020 - 30 pages

Showing 1–50 of 86 results for author: Durmus, A