
Convergence of Langevin-simulated annealing algorithms with multiplicative noise II: Total variation

Pierre Bras and Gilles Pagès

Abstract

We study the convergence of Langevin-simulated annealing type algorithms with multiplicative noise, i.e. for $V : \mathbb{R}^d \to \mathbb{R}$ a potential function to minimize, we consider the stochastic differential equation $dY_t = -\sigma\sigma^{\top}\nabla V(Y_t)\,dt + a(t)\sigma(Y_t)\,dW_t + a(t)^2\Upsilon(Y_t)\,dt$, where $(W_t)$ is a Brownian motion, $\sigma : \mathbb{R}^d \to \mathcal{M}_d(\mathbb{R})$ is an adaptive (multiplicative) noise, $a : \mathbb{R}_+ \to \mathbb{R}_+$ is a function decreasing to 0 and where $\Upsilon$ is a correction term. Allowing $\sigma$ to depend on the position brings faster convergence in comparison with the classical Langevin equation $dY_t = -\nabla V(Y_t)\,dt + \sigma\,dW_t$. In a previous paper, we established the convergence in $L^1$-Wasserstein distance of $Y_t$ and of its associated Euler scheme $\bar{Y}_t$ to $\operatorname{argmin}(V)$ with the classical schedule $a(t) = A\log^{-1/2}(t)$. In the present paper, we prove the convergence in total variation distance. The total variation case appears more demanding to deal with and requires regularization lemmas.
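To fix ideas, here is a minimal Euler–Maruyama sketch of the above dynamics with the schedule $a(t) = A\log^{-1/2}(t)$. It is an illustration only, not code from the paper: the callables grad_V, sigma and upsilon and the constants A, h, n_steps and t0 are placeholders to be supplied by the user.

import numpy as np

def langevin_annealing_euler(grad_V, sigma, upsilon, y0, A=1.0, h=1e-2, n_steps=100_000, t0=np.e, seed=0):
    # Euler-Maruyama sketch of
    #   dY_t = -sigma sigma^T grad V(Y_t) dt + a(t) sigma(Y_t) dW_t + a(t)^2 Upsilon(Y_t) dt
    # with the decreasing schedule a(t) = A / sqrt(log t).
    rng = np.random.default_rng(seed)
    y = np.array(y0, dtype=float)
    t = t0  # start at some t0 > 1 so that log(t) > 0
    for _ in range(n_steps):
        a_t = A / np.sqrt(np.log(t))                    # schedule a(t), decreasing to 0
        S = sigma(y)                                    # d x d multiplicative noise matrix
        drift = -S @ S.T @ grad_V(y) + a_t**2 * upsilon(y)
        dW = rng.normal(scale=np.sqrt(h), size=y.size)  # Brownian increment over a step of length h
        y = y + h * drift + a_t * (S @ dW)
        t += h
    return y

With sigma(y) = np.eye(y.size) and upsilon(y) = np.zeros_like(y), this reduces to the additive-noise scheme associated with the classical Langevin equation recalled above.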

MSC 2010: 62L20; 65C30; 60H35

A Appendix

A.1 Proof of Proposition 4.2

Proof

We use the characterization of the total variation distance as the $L^1$-distance between the densities, which reads

$$
\begin{aligned}
d_{\mathrm{TV}}(\nu_{a_n},\nu_{a_{n+1}})
&= \int_{\mathbb{R}^d} \Big| Z_{a_n}^{-1} e^{-2(V(x)-V^\star)/a_n^2} - Z_{a_{n+1}}^{-1} e^{-2(V(x)-V^\star)/a_{n+1}^2} \Big| \, dx \\
&\le Z_{a_{n+1}}^{-1} \int_{\mathbb{R}^d} \Big| e^{-2(V(x)-V^\star)/a_n^2} - e^{-2(V(x)-V^\star)/a_{n+1}^2} \Big| \, dx
 + \big| Z_{a_n}^{-1} - Z_{a_{n+1}}^{-1} \big| \int_{\mathbb{R}^d} e^{-2(V(x)-V^\star)/a_n^2} \, dx \\
&= Z_{a_{n+1}}^{-1} a_{n+1}^d \int_{\mathbb{R}^d} \Big| e^{-2(V(a_{n+1}x)-V^\star)/a_n^2} - e^{-2(V(a_{n+1}x)-V^\star)/a_{n+1}^2} \Big| \, dx
 + \big| 1 - Z_{a_n}^{-1} Z_{a_{n+1}} \big| \, Z_{a_{n+1}}^{-1} a_n^d \int_{\mathbb{R}^d} e^{-2(V(a_n x)-V^\star)/a_n^2} \, dx .
\end{aligned}
$$
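(The last equality spelled out, as an elementary aside added here and not quoted from [2]: it combines the linear change of variables
$$\int_{\mathbb{R}^d} g(x)\,dx = a^d \int_{\mathbb{R}^d} g(ax)\,dx, \qquad a > 0,$$
applied with $a = a_{n+1}$ in the first integral and $a = a_n$ in the second, with the identity $\big| Z_{a_n}^{-1} - Z_{a_{n+1}}^{-1} \big| = \big| 1 - Z_{a_n}^{-1} Z_{a_{n+1}} \big| \, Z_{a_{n+1}}^{-1}$.)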

Using [2, (B.3)] and [2, (B.5)], the first term is bounded by

$$
C \, \frac{a_n - a_{n+1}}{a_n} \int_{\mathbb{R}^d} e^{-2(V(a_{n+1}y)-V^\star)/a_n^2} \, \frac{V(a_{n+1}y) - V^\star}{a_n^2} \, dy \;\le\; C \, \frac{a_n - a_{n+1}}{a_n}
$$

because the integral converges by dominated convergence, as in the proof of [2, (B.3)]. Using [2, (B.3)] and [2, (B.4)], the second term is bounded by $C (n \log(n))^{-1}$. ∎
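For the reader's convenience, here is the elementary estimate behind the bound on the integrand of the first term, sketched under the assumptions $a_{n+1} \le a_n$ and $a_{n+1} \ge c\, a_n$ for some $c > 0$ (both hold for a schedule of the type $a(t) = A\log^{-1/2}(t)$); it is added here for readability and is not quoted from [2]. For $0 \le u \le v$,
$$\big| e^{-u} - e^{-v} \big| = e^{-u}\big(1 - e^{-(v-u)}\big) \le e^{-u}(v - u),$$
applied with $u = 2(V(a_{n+1}x) - V^\star)/a_n^2$ and $v = 2(V(a_{n+1}x) - V^\star)/a_{n+1}^2$, together with
$$v - u = 2\big(V(a_{n+1}x) - V^\star\big)\,\frac{(a_n - a_{n+1})(a_n + a_{n+1})}{a_n^2\, a_{n+1}^2} \le C \, \frac{a_n - a_{n+1}}{a_n} \cdot \frac{V(a_{n+1}x) - V^\star}{a_n^2},$$
where $C$ depends only on $c$.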

References

[1] P. Bras, Convergence rates of Gibbs measures with degenerate minimum, Bernoulli 28 (2022), no. 4, 2431–2458, doi:10.3150/21-BEJ1424.

[2] P. Bras and G. Pagès, Convergence of Langevin-simulated annealing algorithms with multiplicative noise, preprint (2021), https://arxiv.org/abs/2109.11669.

[3] P. Bras, G. Pagès and F. Panloup, Total variation distance between two diffusions in small time with unbounded drift: Application to the Euler–Maruyama scheme, Electron. J. Probab. 27 (2022), Paper No. 153, doi:10.1214/22-EJP881.

[4] A. S. Dalalyan, Theoretical guarantees for approximate sampling from smooth and log-concave densities, J. R. Stat. Soc. Ser. B. Stat. Methodol. 79 (2017), no. 3, 651–676, doi:10.1111/rssb.12183.

[5] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli and Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (NIPS’14), MIT, Cambridge (2014), 2933–2941.

[6] L. Devroye, A. Mehrabian and T. Reddad, The total variation distance between high-dimensional Gaussians, preprint (2018), https://arxiv.org/abs/1810.08693.

[7] A. Durmus and E. Moulines, Nonasymptotic convergence analysis for the unadjusted Langevin algorithm, Ann. Appl. Probab. 27 (2017), no. 3, 1551–1587, doi:10.1214/16-AAP1238.

[8] A. Durmus and E. Moulines, High-dimensional Bayesian inference via the unadjusted Langevin algorithm, Bernoulli 25 (2019), no. 4A, 2854–2882, doi:10.3150/18-BEJ1073.

[9] A. Friedman, Partial Differential Equations of Parabolic Type, Prentice-Hall, Englewood Cliffs, 1964.

[10] K. He, X. Zhang, S. Ren and J. Sun, Deep residual learning for image recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Press, Piscataway (2016), 770–778, doi:10.1109/CVPR.2016.90.

[11] C.-R. Hwang, Laplace’s method revisited: weak convergence of probability measures, Ann. Probab. 8 (1980), no. 6, 1177–1182, doi:10.1214/aop/1176994579.

[12] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, preprint (2014), https://arxiv.org/abs/1412.6980.

[13] A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images, Technical Report, University of Toronto, Toronto, 2009.

[14] D. Lamberton and G. Pagès, Recursive computation of the invariant distribution of a diffusion, Bernoulli 8 (2002), no. 3, 367–405.

[15] V. A. Lazarev, Convergence of stochastic approximation procedures in the case of several roots of a regression equation, Problemy Peredachi Informatsii 28 (1992), no. 1, 75–88.

[16] C. Li, C. Chen, D. Carlson and L. Carin, Preconditioned stochastic gradient Langevin dynamics for deep neural networks, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI’16), AAAI Press, Washington (2016), 1788–1794, doi:10.1609/aaai.v30i1.10200.

[17] S. Menozzi, A. Pesce and X. Zhang, Density and gradient estimates for non degenerate Brownian SDEs with unbounded measurable drift, J. Differential Equations 272 (2021), 330–369, doi:10.1016/j.jde.2020.09.004.

[18] G. Pagès and F. Panloup, Unadjusted Langevin algorithm with multiplicative noise: Total variation and Wasserstein bounds, Ann. Appl. Probab. 33 (2023), no. 1, 726–779, doi:10.1214/22-AAP1828.

[19] Z. Qian and W. Zheng, A representation formula for transition probability densities of diffusions and applications, Stochastic Process. Appl. 111 (2004), no. 1, 57–76, doi:10.1016/j.spa.2003.12.004.

[20] T. Tieleman and G. E. Hinton, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, Coursera Neural Netw. Mach. Learn. 4 (2012), 26–31.

[21] C. Villani, Optimal Transport. Old and New, Grundlehren Math. Wiss. 338, Springer, Berlin, 2009, doi:10.1007/978-3-540-71050-9.

[22] M. Welling and Y. W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, Proceedings of the 28th International Conference on Machine Learning (ICML’11), Omnipress, Stuttgart (2011), 681–688.

Received: 2022-05-31
Revised: 2023-05-18
Accepted: 2023-05-19
Published Online: 2023-07-04
Published in Print: 2023-09-01

© 2023 Walter de Gruyter GmbH, Berlin/Boston
