
Gaussian Approximation for Penalized Wasserstein Barycenters


Abstract

In this work we consider regularized Wasserstein barycenters (averages with respect to the Wasserstein distance) represented in a Fourier basis. We prove that the random Fourier parameters of the barycenter converge in distribution to a Gaussian random vector. The convergence rate is derived in the finite-sample case, with explicit dependence on the number of measures (\(n\)) and the dimension of the parameters (\(p\)).


REFERENCES

  1. M. Agueh and G. Carlier, ‘‘Barycenters in the Wasserstein space,’’ SIAM Journal on Mathematical Analysis 43 (2), 904–924 (2011).

  2. V. Avanesov and N. Buzun, ‘‘Change-point detection in high-dimensional covariance structure,’’ Electronic Journal of Statistics 12 (2), 3254–3294 (2018).

  3. H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 1st ed. (Springer, 2011).

  4. V. Bentkus, ‘‘A new method for approximations in probability and operator theories,’’ Lithuanian Mathematical Journal 43 (4), 367–388 (2003).

  5. V. Bentkus, ‘‘On the dependence of the Berry–Esseen bound on dimension,’’ Journal of Statistical Planning and Inference (2003).

  6. I. Bespalov, N. Buzun, and D. V. Dylov, ‘‘BRULÉ: Barycenter-regularized unsupervised landmark extraction’’ (2020).

  7. J. Bigot, E. Cazelles, and N. Papadakis, ‘‘Penalization of barycenters in the Wasserstein space,’’ SIAM Journal on Mathematical Analysis 51 (3), 2261–2285 (2019).

  8. N. Bonneel, G. Peyré, and M. Cuturi, ‘‘Wasserstein barycentric coordinates: Histogram regression using optimal transport,’’ ACM Transactions on Graphics 35 (4), 71:1–71:10 (2016).

  9. S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence (Oxford University Press, 2013).

  10. V. Chernozhukov, D. Chetverikov, and K. Kato, ‘‘Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors,’’ Annals of Statistics 41, 2786–2819 (2013).

  11. C. Clason, D. A. Lorenz, H. Mahler, and B. Wirth, ‘‘Entropic regularization of continuous optimal transport problems,’’ Journal of Mathematical Analysis and Applications 494 (1), 124432 (2021).

  12. D. Edwards, ‘‘On the Kantorovich–Rubinstein theorem,’’ Expositiones Mathematicae 29 (4), 387–398 (2011).

  13. F. Götze, A. Naumov, V. Spokoiny, and V. Ulyanov, ‘‘Large ball probabilities, Gaussian comparison, and anti-concentration,’’ Bernoulli 25 (4A), 2538–2563 (2019).

  14. A. Kroshnin, A. Suvorikova, and V. Spokoiny, ‘‘Statistical inference for Bures–Wasserstein barycenters,’’ arXiv:1901.00226 (2019).

  15. L. Li, A. Genevay, M. Yurochkin, and J. Solomon, ‘‘Continuous regularized Wasserstein barycenters,’’ arXiv:2008.12534 (2020).

  16. D. A. Lorenz, P. Manns, and C. Meyer, ‘‘Quadratically regularized optimal transport,’’ Applied Mathematics and Optimization (2021).

  17. E. S. Meckes, ‘‘On Stein's method for multivariate normal approximation,’’ in High Dimensional Probability V: The Luminy Volume (2009).

  18. T. Rippl, A. Munk, and A. Sturm, ‘‘Limit laws of the empirical Wasserstein distance: Gaussian distributions,’’ Journal of Multivariate Analysis 151, 90–109 (2016).

  19. N. Shvetsov, N. Buzun, and D. V. Dylov, ‘‘Unsupervised non-parametric change point detection in quasi-periodic signals,’’ arXiv:2002.02717 (2020).

  20. M. Sommerfeld and A. Munk, ‘‘Inference for empirical Wasserstein distances on finite spaces,’’ Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 (1), 219–238 (2018).

  21. V. Spokoiny, ‘‘Penalized maximum likelihood estimation and effective dimension,’’ Annales de l’Institut Henri Poincaré, Probabilités et Statistiques 53 (1), 389–429 (2017).

  22. V. Spokoiny and M. Zhilova, ‘‘Bootstrap confidence sets under a model misspecification,’’ Preprint no. 1992, WIAS (2014).

  23. S. Steinerberger, ‘‘Wasserstein distance, Fourier series and applications,’’ Monatshefte für Mathematik 194 (2021).


ACKNOWLEDGEMENTS

The author thanks Prof. Roman Karasev, Prof. Vladimir Spokoiny, and Prof. Dmitry Dylov for discussions and contributions to this paper.

Author information

Correspondence to Nazar Buzun.

Appendices

Appendix A

PROOF OF THEOREM 3.1

From \(-\nabla^{2}L(\theta)\geq 0\) and \(L(\hat{\theta})>L(\theta^{*})\) it follows that the local region \(\Omega(\textbf{r})\) containing \(\hat{\theta}\) must cover the following region

$$\Omega(\textbf{r})\supset\{\theta:L(\theta)\geq L(\theta^{*})\}.$$

Use notation

$$D^{2}(\theta)=-\nabla^{2}\mathbb{E}L(\theta).$$

We estimate the minimal possible radius \(\textbf{r}_{0}\) that satisfies the previous condition. Let \(\theta_{0}\) be a point between \(\theta\) and \(\theta^{*}\) used in the Taylor expansion centered at \(\theta^{*}\):

$$0\geq L(\theta^{*})-L(\theta)$$
$${}=-(\theta-\theta^{*})^{T}\nabla L(\theta^{*})-\frac{1}{2}(\theta-\theta^{*})^{T}\nabla^{2}\zeta(\theta_{0})(\theta-\theta^{*})+\frac{1}{2}||D(\theta_{0})(\theta-\theta^{*})||^{2}.$$

Assumption 1 provides

$$||D(\theta_{0})(\theta-\theta^{*})||^{2}=||D(\theta-\theta^{*})||^{2}+||(D(\theta_{0})-D)(\theta-\theta^{*})||^{2}$$
$${}\geq\textbf{r}^{2}-\delta(\textbf{r})\textbf{r}^{2}.$$

Assumption 2 provides, with probability \(1-e^{-t}\),

$$(\theta-\theta^{*})^{T}\nabla^{2}\zeta(\theta_{0})(\theta-\theta^{*})\leq\mathfrak{z}(t)\textbf{r}^{2}.$$

Substituting these two properties into the initial inequality yields

$$0\geq-||D^{-1}\nabla L(\theta^{*})||\textbf{r}-\frac{\mathfrak{z}(t)}{2}\textbf{r}^{2}+\frac{1-\delta(\textbf{r})}{2}\textbf{r}^{2}$$
$$\textbf{r}(1-\delta(\textbf{r})-\mathfrak{z}(t))\leq 2||D^{-1}\nabla L(\theta^{*})||.$$

Thus, under the assumption \(\delta(\textbf{r})+\mathfrak{z}(t)\leq 1/2\), one may set

$$\textbf{r}_{0}=4||D^{-1}\nabla L(\theta^{*})||.$$

From Assumptions 1 and 2 it also follows that

$$||D(\hat{\theta}-\theta^{*})+D^{-1}\{\nabla L(\hat{\theta})-\nabla L(\theta^{*})\}||\leq\lozenge(\textbf{r},t).$$

Since \(\nabla L(\hat{\theta})=0\) we have

$$||D(\hat{\theta}-\theta^{*})-D^{-1}\nabla L(\theta^{*})||\leq\lozenge(\textbf{r},t).$$

Note that under the coordinate transform \(S\) the following invariance holds:

$$\left|\left|\begin{pmatrix}\breve{D}&0\\ 0&D_{\vartheta}\end{pmatrix}\begin{pmatrix}u-u^{*}\\ \vartheta-\vartheta^{*}\end{pmatrix}+\begin{pmatrix}\breve{D}^{-1}&0\\ 0&D_{\vartheta}^{-1}\end{pmatrix}\begin{pmatrix}\breve{\nabla}L(u,\vartheta)-\breve{\nabla}L(u^{*},\vartheta^{*})\\ \nabla_{\vartheta}L(u,\vartheta)+\nabla_{\vartheta}L(u^{*},\vartheta^{*})\end{pmatrix}\right|\right|$$
$${}=||D(\theta-\theta^{*})+D^{-1}\{\nabla L(\theta)-\nabla L(\theta^{*})\}||.$$

Since

$$\left|\left|\begin{pmatrix}\breve{D}&0\\ 0&D_{\vartheta}\end{pmatrix}\begin{pmatrix}u\\ \vartheta\end{pmatrix}\right|\right|^{2}=\theta^{T}S^{T}[(S^{-1})^{T}D^{2}(S^{-1})]S\theta=||D\theta||^{2},$$
$$\left|\left|\begin{pmatrix}\breve{D}^{-1}&0\\ 0&D_{\vartheta}^{-1}\end{pmatrix}\begin{pmatrix}\breve{\nabla}\\ \nabla_{\vartheta}\end{pmatrix}\right|\right|^{2}=\nabla^{T}S^{-1}[(S^{-1})^{T}D^{2}(S^{-1})]^{-1}(S^{-1})^{T}\nabla=\nabla^{T}D^{-2}\nabla,$$
$$\begin{pmatrix}\breve{\nabla}\\ \nabla_{\vartheta}\end{pmatrix}{}^{T}\begin{pmatrix}u\\ \vartheta\end{pmatrix}=\nabla^{T}S^{-1}S\theta=\nabla^{T}\theta.$$

Consequently, based on this invariant we obtain the bound for the projection \(u\):

$$||\breve{D}(\widehat{u}-u^{*})-\breve{D}^{-1}\breve{\nabla}L(\theta^{*})||$$
$$\leq||D(\hat{\theta}-\theta^{*})-D^{-1}\nabla L(\theta^{*})||\leq\lozenge(\textbf{r},t).$$

Appendix B

ELLIPSOID ENTROPY

The upper bound \(\mathfrak{z}(t)\) for the random process in Assumption 2 with parameter \(\theta\in\Omega(\textbf{r})\) requires computing the entropy of the ellipsoid \(\Omega(\textbf{r})\). This will be needed in the next section, and below we provide a short excerpt on the topic. The general formula for the covering number \(N(\varepsilon,\Omega)\) of a convex set \(\Omega\) in \(\mathbb{R}^{p}\) with the Euclidean metric is

$$N(\varepsilon,\Omega)\leq\frac{\textrm{vol}(\Omega+(\varepsilon/2)B_{1})}{\textrm{vol}(B_{1})}\left(\frac{2}{\varepsilon}\right)^{p},$$

where \(B_{1}\) is the unit ball. Recall that \(N(\varepsilon,\Omega)\) equals the minimal number of balls of radius \(\varepsilon\) sufficient to cover \(\Omega\). We will need two components of the ellipsoid entropy:

$$H_{1}=\int\limits_{0}^{1/2}\sqrt{\log N(\varepsilon\textbf{r},\Omega(\textbf{r}))}d\varepsilon,\quad H_{2}=\int\limits_{0}^{1/2}\log N(\varepsilon\textbf{r},\Omega(\textbf{r}))d\varepsilon.$$
(4.5)

Lemma 4.6. Let \(||D^{-2}||=1\). Then for the entropy components (4.5) of the ellipsoid \(\Omega(\textbf{r})\) with the matrix \(D^{2}\) defined in expression (3.2) it holds that

$$H_{1}\leq C({\alpha-1})^{-1/2}\sqrt{\sum_{i}\frac{\log^{\alpha}(\lambda_{i}^{2}(D))}{\lambda_{i}^{2}(D)}}$$

and

$$H_{2}\leq C\sum_{i}\frac{1}{\lambda_{i}(D)},$$

where \(C\) is some absolute constant and \(\alpha>1\) .

Proof. The function \(\log N(\varepsilon\textbf{r},\Omega(\textbf{r}))\) is monotone decreasing in \(\varepsilon\). One may split the integration interval of (4.5) into the following parts:

$$\left[\frac{1}{2},\frac{1}{4}\right],\quad\left[\frac{1}{4},\frac{1}{8}\right],\quad\left[\frac{1}{8},\frac{1}{16}\right],\quad\ldots.$$

Take the corresponding values \(N(\textbf{r}/4,\Omega(\textbf{r}))\), \(N(\textbf{r}/8,\Omega(\textbf{r}))\), \(N(\textbf{r}/16,\Omega(\textbf{r}))\), \(\ldots\) and bound the integrals from above by the histogram sums

$$H_{1}\leq\sum_{k=2}^{\infty}\frac{1}{2^{k}}\sqrt{\log N\left(\frac{\textbf{r}}{2^{k}},\,\Omega(\textbf{r})\right)}$$

and

$$H_{2}\leq\sum_{k=2}^{\infty}\frac{1}{2^{k}}\log N\left(\frac{\textbf{r}}{2^{k}},\,\Omega(\textbf{r})\right).$$

Theorem H.7.1 in [21] provides upper bounds for the right-hand sides of the previous expressions and completes the proof. \(\Box\)
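The dyadic bound used in this proof is easy to check numerically. The sketch below is an illustration only: it replaces the exact covering number of the ellipsoid by a crude volumetric proxy \(\log N(\varepsilon\textbf{r},\Omega(\textbf{r}))\approx\sum_{i}\log(1+1/(\varepsilon\lambda_{i}(D)))\) with hypothetical eigenvalues \(\lambda_{i}(D)\), and compares the entropy integrals from (4.5) with the dyadic sums from the proof.

```python
# Illustration only: "log_N" is a crude volumetric proxy for the ellipsoid
# covering number and the eigenvalues "lam" are hypothetical; the point is to
# check that the dyadic (histogram) sums dominate the integrals H1 and H2.
import numpy as np
from scipy.integrate import quad

lam = 1.0 + np.arange(50)                    # hypothetical eigenvalues of D

def log_N(eps):
    # proxy for log N(eps*r, Omega(r)); monotone decreasing in eps
    return float(np.sum(np.log1p(1.0 / (eps * lam))))

# Entropy components H1, H2 from (4.5), by numerical quadrature.
H1 = quad(lambda e: np.sqrt(log_N(e)), 1e-8, 0.5, limit=200)[0]
H2 = quad(log_N, 1e-8, 0.5, limit=200)[0]

# Dyadic sums from the proof of Lemma 4.6: on [2^-(k+1), 2^-k] the decreasing
# function log_N is dominated by its value at the left endpoint.
eps_k = 2.0 ** -np.arange(2, 45)             # r/4, r/8, r/16, ...
H1_dyadic = np.sum(eps_k * np.sqrt([log_N(e) for e in eps_k]))
H2_dyadic = np.sum(eps_k * np.array([log_N(e) for e in eps_k]))

print(f"H1 = {H1:.3f} <= dyadic bound {H1_dyadic:.3f}")
print(f"H2 = {H2:.3f} <= dyadic bound {H2_dyadic:.3f}")
```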

Appendix C

SUPPORT FUNCTIONS

Bounds for the first and second derivatives of the likelihood of the barycenter model (2.1) involve additional theory from convex analysis.

Definition 4.7 (*). The Legendre–Fenchel transform of a function \(f:X\to\overline{\mathbb{R}}\), also called the convex conjugate, is

$$f^{*}(y)=\sup_{x\in X}(\langle x,y\rangle-f(x)).$$

Definition 4.8 (s). The support function of a convex body \(E\) is

$$s_{E}(\theta)=\sup_{\eta\in E}\theta^{T}\eta.$$

Note that for the indicator function \(\delta_{E}(\eta)\) of a convex set \(E\) the conjugate function is the support function of \(E\), i.e.,

$$\delta^{*}_{E}(\theta)=s_{E}(\theta).$$

Definition 4.9 (\(\oplus\)). Let \(f_{1},f_{2}:E\to\overline{\mathbb{R}}\) be convex functions. Their infimal convolution is

$$(f_{1}\oplus f_{2})(x)=\inf_{x_{1}+x_{2}=x}(f_{1}(x_{1})+f_{2}(x_{2})).$$

Lemma 4.10 (Proposition 13.21 [3]). Let \(f_{1},f_{2}:E\to\overline{\mathbb{R}}\) be convex lower-semi-continuous functions. Then

$$(f_{1}\oplus f_{2})^{*}=f_{1}^{*}+f_{2}^{*},$$
$$(f_{1}+f_{2})^{*}=\overline{f_{1}^{*}\oplus f_{2}^{*}}.$$
Fig. 2. Optimization related to support function.

Lemma 4.11. The support function of the intersection \(E=E_{1}\cap E_{2}\) is the infimal convolution of the support functions of \(E_{1}\) and \(E_{2}\):

$$s_{E}(\theta)=\inf_{\theta_{1}+\theta_{2}=\theta}(s_{E_{1}}(\theta_{1})+s_{E_{2}}(\theta_{2})).$$

Proof. According to the previous lemma

$$\delta_{E_{1}\cap E_{2}}(\eta)=\delta_{E_{1}}(\eta)+\delta_{E_{2}}(\eta),$$
$$(\delta_{E_{1}}+\delta_{E_{2}})^{*}=\overline{\delta^{*}_{E_{1}}\oplus\delta^{*}_{E_{2}}}.$$

With the additional property

$$\text{int}\,\text{dom}\,\delta_{E_{1}}\cap\text{dom}\,\delta_{E_{2}}=\text{int}\,E_{1}\cap E_{2}\neq\emptyset$$

we have

$$(\delta_{E_{1}}+\delta_{E_{2}})^{*}=\delta^{*}_{E_{1}}\oplus\delta^{*}_{E_{2}}.$$

\(\Box\)
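Lemma 4.11 admits a simple numerical check. The toy sketch below is illustrative only: the two Euclidean balls \(E_{i}=\{\eta:||\eta-c_{i}||\leq r_{i}\}\), their centers and radii, the direction \(\theta\), and the use of scipy as an off-the-shelf optimizer are all arbitrary choices; the support functions of the balls are \(s_{E_{i}}(\theta)=c_{i}^{T}\theta+r_{i}||\theta||\).

```python
# Toy check of Lemma 4.11 for two Euclidean balls E_i = {eta : ||eta - c_i|| <= r_i}.
# All specific numbers are arbitrary; both printed values should agree approximately.
import numpy as np
from scipy.optimize import minimize

c1, r1 = np.array([0.0, 0.0]), 1.0
c2, r2 = np.array([0.5, 0.0]), 1.0
theta = np.array([0.0, 1.0])

def s_ball(c, r, t):                           # support function of a ball
    return c @ t + r * np.linalg.norm(t)

# Left side: s_{E1 ∩ E2}(theta) = max theta^T eta over eta lying in both balls.
cons = [{"type": "ineq", "fun": lambda e, c=c, r=r: r ** 2 - np.sum((e - c) ** 2)}
        for c, r in [(c1, r1), (c2, r2)]]
res = minimize(lambda e: -(theta @ e), x0=np.zeros(2), constraints=cons, method="SLSQP")
s_intersection = -res.fun

# Right side: infimal convolution, minimized over the split theta = t1 + t2.
conv = min(minimize(lambda t1: s_ball(c1, r1, t1) + s_ball(c2, r2, theta - t1),
                    x0=x0, method="Nelder-Mead").fun
           for x0 in [np.zeros(2), theta / 2, theta])

print(f"s_(E1 ∩ E2)(theta)           = {s_intersection:.4f}")
print(f"(s_E1 ⊕ s_E2)(theta)         = {conv:.4f}")
```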

Lemma 4.12. Let a support function \(s_{E}(\theta)\) be differentiable. Then its gradient belongs to the boundary of the corresponding convex set \(E\):

$$\nabla s_{E}(\theta)=\eta^{*}(\theta)\in\partial E,$$

where

$$\eta^{*}(\theta)=\mathop{\textrm{argmax}}\limits_{\eta\in E}\eta^{T}\theta.$$

Proof. This follows from the convexity of \(E\) and the linearity of the objective functional:

$$\frac{\partial\eta^{*}(\theta)}{\partial||\theta||}=0\Rightarrow\frac{\partial\eta^{*}(\theta)}{\partial\theta}^{T}\theta=0,$$
$$\nabla s_{E}(\theta)=\frac{\partial\eta^{*}(\theta)}{\partial\theta}^{T}\theta+\eta^{*}(\theta)=\eta^{*}(\theta).$$

\(\Box\)
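Lemma 4.12 is also easy to verify numerically for an ellipsoid \(E=\{\eta:\eta^{T}A\eta\leq 1\}\), for which \(s_{E}(\theta)=\sqrt{\theta^{T}A^{-1}\theta}\) and \(\eta^{*}(\theta)=A^{-1}\theta/s_{E}(\theta)\) are available in closed form. In the sketch below the matrix \(A\) and the direction \(\theta\) are arbitrary illustrative choices.

```python
# Finite-difference check of Lemma 4.12 for the ellipsoid E = {eta : eta^T A eta <= 1}:
# the gradient of s_E should coincide with eta*(theta), and eta*(theta) should lie
# on the boundary of E (eta*^T A eta* = 1).  A and theta are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
A = M @ M.T + 3.0 * np.eye(3)                 # positive-definite matrix
A_inv = np.linalg.inv(A)

def s_E(theta):                               # support function of the ellipsoid
    return np.sqrt(theta @ A_inv @ theta)

def eta_star(theta):                          # maximizer of eta^T theta over E
    return A_inv @ theta / s_E(theta)

theta = rng.normal(size=3)
h = 1e-6
grad_fd = np.array([(s_E(theta + h * e) - s_E(theta - h * e)) / (2 * h)
                    for e in np.eye(3)])

print("finite-difference gradient of s_E:", np.round(grad_fd, 6))
print("eta*(theta):                      ", np.round(eta_star(theta), 6))
print("eta*^T A eta* (should be ~ 1):    ", eta_star(theta) @ A @ eta_star(theta))
```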

Lemma 4.13 (Proposition 16.48 [3]). Let \(f_{1},f_{2}:E\to\overline{\mathbb{R}}\) be convex continuous functions. Then the subdifferential of their infimal convolution can be computed by the formula

$$\partial(f_{1}\oplus f_{2})(x)=\bigcup_{x=x_{1}+x_{2}}\partial f_{1}(x_{1})\cap\partial f_{2}(x_{2}).$$

Corollary 4.14. If in addition \(f_{1},f_{2}\) are differentiable, then their infimal convolution is differentiable and there exist \(x_{1},x_{2}\) with \(x=x_{1}+x_{2}\) such that

$$\nabla(f_{1}\oplus f_{2})(x)=\nabla f_{1}(x_{1})=\nabla f_{2}(x_{2}).$$

Lemma 4.15. Let \(f_{1},\ldots,f_{m}:E\to\overline{\mathbb{R}}\) be convex and twice differentiable functions. There is an upper bound for the second derivative of their infimal convolution: \(\forall t:\sum_{i=1}^{m}t_{i}=1\)

$$\partial\nabla^{T}(f_{1}\oplus\ldots\oplus f_{m})(x)\preceq\sum_{i=1}^{m}t_{i}^{2}\nabla^{2}f_{i}(x_{i}),$$

where \(\sum_{i=1}^{m}x_{i}=x\).

Proof. Use the notation \(f=f_{1}\oplus\ldots\oplus f_{m}\) and let \(x_{1},\ldots,x_{m}\) with \(\sum_{i}x_{i}=x\) be such that

$$f(x)=\sum_{i}f_{i}(x_{i}).$$

According to Lemma 4.13 if all the functions are differentiable then

$$\nabla f(x)=\sum_{i}t_{i}\nabla f_{i}(x_{i}).$$

From the definition of \(\oplus\) it also follows that

$$f(x+z)\leq\sum_{i}f_{i}(x_{i}+t_{i}z).$$

Apply the Taylor expansion to the left- and right-hand sides and take into account the equality of the first derivatives:

$$z^{T}\partial\nabla^{T}f(x+\theta z)z\leq\sum_{i}t_{i}^{2}z^{T}\nabla^{2}f_{i}(x_{i}+\theta_{i}z)z.$$

Since the direction \(z\) was chosen arbitrarily, dividing both sides of the previous inequality by \(||z||^{2}\) and letting \(||z||\to 0\), we arrive at the inequality

$$\partial\nabla^{T}f(x)\preceq\sum_{i}t_{i}^{2}\nabla^{2}f_{i}(x_{i}).$$

\(\Box\)

Remark 4.16. One can find another proof of a similar theorem in ([3], Theorem 18.15).

Theorem 4.17. Let \(f_{1},\ldots,f_{m}:E\to\overline{\mathbb{R}}\) be convex and twice differentiable functions. There are upper bounds for the derivatives of the infimal convolution \(f=f_{1}\oplus\ldots\oplus f_{m}\): \(\forall\gamma\) \(\exists x_{1},\ldots,x_{m}\) such that

$$\gamma^{T}\partial\nabla^{T}f(x)\gamma\leq\max_{i}\gamma^{T}\nabla^{2}f_{i}(x_{i})\gamma\frac{f_{i}(x_{i})}{f(x)}$$

and

$$\gamma^{T}\partial\nabla^{T}f^{2}(x)\gamma\leq 2(\gamma^{T}\nabla f(x))^{2}+2\max_{i}\gamma^{T}\nabla^{2}f_{i}(x_{i})\gamma f_{i}(x_{i}).$$

Proof. Choosing appropriate \(\{t_{i}\}\) in Lemma 4.15, one gets the required upper bounds. Set

$$t_{i}=\frac{f_{i}(x_{i})}{\sum_{j=1}^{m}f_{j}(x_{j})}$$

and since

$$\sum_{j=1}^{m}f_{j}(x_{j})=f(x),$$
$$\sum_{i}t_{i}^{2}\gamma^{T}\nabla^{2}f_{i}(x_{i})\gamma\leq\max_{i}t_{i}\gamma^{T}\nabla^{2}f_{i}(x_{i})\gamma=\max_{i}\gamma^{T}\nabla^{2}f_{i}(x_{i})\gamma\frac{f_{i}(x_{i})}{f(x)}.$$

In order to prove the second formula apply this inequality in

$$\partial\nabla^{T}f^{2}=2\nabla f\nabla^{T}f+2f\partial\nabla f.$$

\(\Box\)

Corollary 4.18. Let \(s_{1},\ldots,s_{m}:E^{*}\to\overline{\mathbb{R}}\) be support functions of the bounded convex smooth sets \(E_{1},\ldots,E_{m}\). There are upper bounds for the derivatives of the support function \(s\) of the intersection \(E_{1}\cap\ldots\cap E_{m}\), such that \(\forall i\)

$$\gamma^{T}\partial\nabla^{T}s(\theta)\gamma\leq\frac{\max_{i}\gamma^{T}\partial\eta^{*}_{i}/\partial\theta_{i}\gamma s_{i}(\theta_{i})}{s(\theta)},$$
$$\gamma^{T}\partial\nabla^{T}s^{2}(\theta)\gamma\leq 2(\gamma^{T}\eta^{*}_{i})^{2}+2\max_{i}\gamma^{T}\partial\eta^{*}_{i}/\partial\theta_{i}\gamma s_{i}(\theta_{i}).$$

Proof. It follows from Theorem 4.17 and Lemma 4.12. \(\Box\)

Appendix D

GAUSSIAN APPROXIMATION

Multivariate analogues of the Berry–Esseen theorem have many modifications depending on the dimension of the random vectors and on the set of functions used to compare the measures. Bentkus [4, 5] presented excellent results on this topic. Namely, for a sequence of i.i.d. random vectors \(\{X_{i}\}_{i=1}^{n}\) in \(\mathbb{R}^{p}\) with identity covariance matrix, any convex set \(A\), and a Gaussian vector \(Z\sim\mathcal{N}(0,I)\), it holds that

$$\left|\mathbb{P}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}X_{i}\in A\right)-\mathbb{P}\left(Z\in A\right)\right|\leq\frac{400p^{1/4}\mathbb{E}||X_{1}||^{3}}{\sqrt{n}}$$

and

$$W_{1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}X_{i},Z\right)\leq O\left(\frac{\mathbb{E}||X_{1}||^{3}}{\sqrt{n}}\right).$$

We extend these two statements to independent random vectors with a general (non-identity) covariance \(\Sigma\). Additionally, we remove the factor \(p^{1/4}\), replacing it with an anti-concentration constant defined below.
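A one-dimensional Monte Carlo sketch gives a feeling for the \(W_{1}\) rate above (illustration only; the standardized-exponential summands and the sample sizes are arbitrary choices). In dimension one the \(W_{1}\) distance between two equal-size samples equals the average absolute difference of their order statistics.

```python
# Monte Carlo illustration of the W1 rate O(E||X_1||^3 / sqrt(n)) in dimension p = 1.
# The summand law (centered exponential) and sample sizes are arbitrary; we expect
# sqrt(n) * W1 to stay roughly constant as n grows.
import numpy as np

rng = np.random.default_rng(1)
n_mc = 100_000                                   # Monte Carlo sample size

for n in [4, 16, 64]:
    X = (rng.exponential(size=(n_mc, n)) - 1.0).sum(axis=1) / np.sqrt(n)
    Z = rng.normal(size=n_mc)
    w1 = np.mean(np.abs(np.sort(X) - np.sort(Z)))    # empirical W1 in 1D
    print(f"n = {n:3d}:  W1 ~ {w1:.4f},  sqrt(n)*W1 ~ {np.sqrt(n) * w1:.3f}")
```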

Definition 4.19 (\(H_{k}\)). The multivariate Hermite polynomial is

$$H_{k}(x)=(-1)^{|k|}e^{x^{T}\Sigma^{-1}x/2}\frac{\partial^{|k|}}{\partial^{k_{1}}\ldots\partial^{k_{p}}}e^{-x^{T}\Sigma^{-1}x/2},$$

where \(x\in\mathbb{R}^{p}\) and \(|k|=k_{1}+\ldots+k_{p}\).

Lemma 4.20 [17]. Consider a Gaussian vector \(Z\sim\mathcal{N}(0,\Sigma)\) and two functions \(h\in C^{1}\) and \(f_{h}\) such that

$$f_{h}(x)=-\int\limits_{0}^{1}\frac{1}{2t}\mathbb{E}\overline{h}(Z(x,t))dt,$$

where for \(t\in[0,1]\)

$$\overline{h}(Z(x,t))=h(\sqrt{t}x+\sqrt{1-t}Z)-\mathbb{E}h(Z).$$

Then \(f_{h}\) is a solution of Stein’s equation

$$\overline{h}(x)=(\textrm{tr}\{\nabla^{2}\Sigma\}-x^{T}\nabla)f_{h}(x)$$

and

$$\frac{\partial^{|k|}}{\partial^{k_{1}}\ldots\partial^{k_{p}}}f_{h}(x)=-\int\limits_{0}^{1}\frac{1}{2}\frac{t^{\frac{|k|}{2}-1}}{(1-t)^{\frac{|k|}{2}}}\mathbb{E}H_{k}(Z)\overline{h}(Z(x,t))dt.$$

Proof. One may verify this statement by substituting the solution \(f_{h}\) into Stein's equation; this has been done in Lemma 1 of [17]. \(\Box\)

In the following discussion we will need the difference between the second derivatives of the function \(f_{h}\).

Corollary 4.21. Let \(f_{h}\) be the solution of Stein’s equation, then

$$\nabla^{2}f_{h}(x)-\nabla^{2}f_{h}(y)=-\int\limits_{0}^{1}\frac{1}{2(1-t)}\mathbb{E}H_{2}(Z)\{h(Z(x,t))-h(Z(y,t))\}dt,$$

where

$$H_{2}(Z)=\Sigma^{-1/2}\{(\Sigma^{-1/2}Z)(\Sigma^{-1/2}Z)^{T}-I\}\Sigma^{-1/2}.$$

For each \(i\), denote by \(X_{-i}\) the sum of \(\{X_{j}\}_{j=1}^{n}\) without the \(i\)th element and by \(X^{\prime}_{i}\) an independent copy of \(X_{i}\). We also use the following notation for the conditional expectation:

$$\mathbb{E}_{-i}=\mathbb{E}(\cdot|X_{i},X^{\prime}_{i}),$$
$$X_{-i}=\sum_{j=1,j\neq i}^{n}X_{j}.$$

Lemma 4.22. Consider a Gaussian vector \(Z\sim\mathcal{N}(0,\Sigma)\) and a sum \(X=\sum_{i=1}^{n}X_{i}\) of independent zero-mean random vectors in \(\mathbb{R}^{p}\) with a non-singular covariance matrix

$$\mathbb{E}XX^{T}=\Sigma.$$

Then for any function \(h\in C^{1}(\mathbb{R}^{p})\) with bounded first derivative

$$\mathbb{E}h(X)-\mathbb{E}h(Z)\leq A\log\left(\frac{6B}{A}\right),$$

where

$$A=\sum_{i=1}^{n}\mathbb{E}X_{i}^{T}\Sigma^{-1}X_{i}\,A_{i},\quad B=\sum_{i=1}^{n}\mathbb{E}X_{i}^{T}\Sigma^{-1}X_{i}\,B_{i},$$

and \(\forall\alpha>0\) on interval \(t\in[0,1-\alpha]\)

$$A_{i}\geq\frac{1}{\sqrt{t}}\sup_{||\gamma||=1,\,\,\theta\in[0,1]}\mathbb{E}_{-i}J_{t}(\gamma,\theta,X_{i},X_{i}^{\prime}),$$

and on interval \(t\in[1-\alpha,1]\) for the same \(\alpha\)

$$B_{i}\geq\frac{1}{2\sqrt{1-t}}\sup_{||\gamma||=1,\,\,\theta\in[0,1]}\mathbb{E}_{-i}J_{t}(\gamma,\theta,X_{i},X_{i}^{\prime}),$$

and

$$J_{t}(\gamma,\theta,X_{i},X_{i}^{\prime})=\{(\gamma^{T}\Sigma^{-1/2}Z)^{2}-1\}\{h(Z(X_{-i}+\theta X_{i},t))-h(Z(X_{-i}+X^{\prime}_{i},t))\},$$

where

$$Z(x,t)=\sqrt{t}x+\sqrt{1-t}Z.$$

Proof. From Lemma 4.20 it follows that for any function \(h\) with bounded first derivative

$$\mathbb{E}h(X)-\mathbb{E}h(Z)=\mathbb{E}\textrm{tr}\{\nabla^{2}\Sigma\}f_{h}(X)-\mathbb{E}\sum_{i=1}^{n}X_{i}^{T}\nabla f_{h}(X).$$

Let \(\theta\) be some value in \([0,1]\). Decompose \(\nabla f_{h}(X)\) by the Taylor formula:

$$\nabla f_{h}(X)=\nabla f_{h}(X_{-i})+\nabla^{2}f_{h}(X_{-i}+\theta X_{i})X_{i}.$$

Note that

$$\mathbb{E}\textrm{tr}\{\nabla^{2}\Sigma\}f_{h}(X)=\mathbb{E}\sum_{i=1}^{n}X_{i}^{T}\nabla^{2}f_{h}(X_{-i}+X_{i}^{\prime})X_{i}.$$

Substitute them into the first expression

$$\mathbb{E}h(X)-\mathbb{E}h(Z)=\sum_{i=1}^{n}\mathbb{E}X_{i}^{T}\left\{\nabla^{2}f_{h}(X_{-i}+X^{\prime}_{i})-\nabla^{2}f_{h}(X_{-i}+\theta X_{i})\right\}X_{i}$$
$${}=\sum_{i=1}^{n}\mathbb{E}(\Sigma^{-1/2}X_{i})^{T}\Sigma^{1/2}\left\{\nabla^{2}f_{h}(X_{-i}+X^{\prime}_{i})-\nabla^{2}f_{h}(X_{-i}+\theta X_{i})\right\}\Sigma^{1/2}\Sigma^{-1/2}X_{i}.$$

Use the equality for the difference of the second derivatives from Corollary 4.21 (the consequence of Lemma 4.20). For a unit vector \(||\gamma||=1\) and the conditional expectation \(\mathbb{E}_{-i}=\mathbb{E}(\cdot|X_{i},X^{\prime}_{i})\)

$$\gamma^{T}\mathbb{E}_{-i}\Sigma^{1/2}\left\{\nabla^{2}f_{h}(X_{-i}+X^{\prime}_{i})-\nabla^{2}f_{h}(X_{-i}+\theta X_{i})\right\}\Sigma^{1/2}\gamma$$
$${}=\int\limits_{0}^{1}\frac{1}{2(1-t)}\mathbb{E}_{-i}\{((\Sigma^{-1/2}Z)^{T}\gamma)^{2}-1\}\{h(Z(X_{-i}+\theta X_{i},t))-h(Z(X_{-i}+X^{\prime}_{i},t))\}dt$$
$${}\leq\int\limits_{0}^{1-\alpha}\frac{t^{1/2}}{2(1-t)}A_{i}dt+\int\limits_{1-\alpha}^{1}\frac{1}{(1-t)^{1/2}}B_{i}dt\leq-\frac{A_{i}}{2}\log(\alpha)+2B_{i}\sqrt{\alpha}.$$

Summing over \(i\) with the weights \(X^{T}_{i}\Sigma^{-1}X_{i}\) and choosing \(\alpha=A^{2}/(4B^{2})\), we finally obtain

$$\mathbb{E}h(X)-\mathbb{E}h(Z)\leq-\frac{A}{2}\log(\alpha)+2B\sqrt{\alpha}$$
$${}\leq A\left(1+\log\left(\frac{2B}{A}\right)\right)\leq A\log\left(\frac{6B}{A}\right).$$

\(\Box\)

Theorem 4.23 (Multivariate Berry–Esseen with Wasserstein distance). Consider a sum \(X=\sum_{i=1}^{n}X_{i}\) of independent zero-mean random vectors in \(\mathbb{R}^{p}\) with a covariance matrix

$$\mathbb{E}XX^{T}=\Sigma.$$

Then the 1-Wasserstein distance between \(X\) and the Gaussian vector \(Z\sim\mathcal{N}(0,\Sigma)\) has the following upper bound:

$$W_{1}(X,Z)\leq\sqrt{2}\mu_{3}\log\left(6p\sqrt{\textrm{tr}(\Sigma)}/\mu_{3}\right),$$

where

$$\mu_{3}=\sum_{i=1}^{n}\mathbb{E}X_{i}^{T}\Sigma^{-1}X_{i}||X_{i}-X^{\prime}_{i}||,$$

and each \(X_{i}^{\prime}\) is an independent copy of \(X_{i}\).

Remark 4.24. In the i.i.d. case with \(\Sigma=I_{p}\)

$$W_{1}(X,Z)=O\left(\frac{p^{3/2}\log(n)}{\sqrt{n}}\right).$$

The same theorem with a different proof can be found in [4].

Proof. Based on Lemma 4.22, we consider \(h\) with the property \(||\nabla h(\cdot)||\leq 1\) (which comes from the dual definition of \(W_{1}\), Section 2) and invoke the definitions of \(A_{i}\) and \(B_{i}\). We bound \(A_{i}\) by extracting \(\sqrt{t}(X_{i}-X^{\prime}_{i})\) and \(B_{i}\) by extracting \(\sqrt{1-t}Z\):

$$A_{i}=||X_{i}-X^{\prime}_{i}||\,\mathbb{E}_{-i}|(\gamma^{T}\Sigma^{-1/2}Z)^{2}-1|,$$
$$B_{i}=\mathbb{E}_{-i}|(\gamma^{T}\Sigma^{-1/2}Z)^{2}-1|\,||Z||.$$

Note that \((\gamma^{T}\Sigma^{-1/2}Z)^{2}\) has the chi-square distribution \(\chi^{2}_{1}\), and its variance equals \(2\). So by means of the Cauchy–Bunyakovsky inequality we obtain

$$\mathbb{E}_{-i}|(\gamma^{T}\Sigma^{-1/2}Z)^{2}-1|\leq\sqrt{2}$$

and

$$\mathbb{E}_{-i}|(\gamma^{T}\Sigma^{-1/2}Z)^{2}-1|\,||Z||\leq\sqrt{\mathbb{E}|(\gamma^{T}\Sigma^{-1/2}Z)^{2}-1|^{2}}\sqrt{\mathbb{E}||Z||^{2}}=\sqrt{2\textrm{tr}(\Sigma)}.$$

Subsequently

$$A\leq\sqrt{2}\sum_{i=1}^{n}\mathbb{E}X_{i}^{T}\Sigma^{-1}X_{i}||X_{i}-X^{\prime}_{i}||=\sqrt{2}\mu_{3}$$

and

$$B\leq\sqrt{2\textrm{tr}(\Sigma)}\sum_{i=1}^{n}\mathbb{E}X_{i}^{T}\Sigma^{-1}X_{i}=p\sqrt{2\textrm{tr}(\Sigma)}.$$

\(\Box\)

For the next result we will need the following technical lemma.

Lemma 4.25. Let a random variable \(\varepsilon\) have a tail bound \(\forall\textbf{x}\geq\textbf{x}_{0}\)

$$\mathbb{P}(\varepsilon>h(\textbf{x}))\leq e^{-\textbf{x}}.$$

Then for a function \(g:\mathbb{R}_{+}\to\mathbb{R}_{+}\) with derivative \(g^{\prime}:\mathbb{R}_{+}\to\mathbb{R}_{+}\)

$$\mathbb{E}\mathbb{1}[\varepsilon>h(\textbf{x}_{0})]g(\varepsilon)\leq g(h(\textbf{x}_{0}))e^{-\textbf{x}_{0}}+\int\limits_{\textbf{x}_{0}}^{\infty}e^{-\textbf{x}}g^{\prime}(h(\textbf{x}))h^{\prime}(\textbf{x})d\textbf{x}.$$

In particular

$$\mathbb{E}\mathbb{1}[\varepsilon>h(\textbf{x}_{0})]\varepsilon\leq h(\textbf{x}_{0})e^{-\textbf{x}_{0}}+\int\limits_{\textbf{x}_{0}}^{\infty}e^{-\textbf{x}}h^{\prime}(\textbf{x})d\textbf{x},$$
$$\mathbb{E}\mathbb{1}[\varepsilon>h(\textbf{x}_{0})]\varepsilon^{r}\leq h(\textbf{x}_{0})^{r}e^{-\textbf{x}_{0}}+r\int\limits_{\textbf{x}_{0}}^{\infty}e^{-\textbf{x}}h(\textbf{x})^{r-1}h^{\prime}(\textbf{x})d\textbf{x}.$$
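As a quick sanity check of Lemma 4.25 (illustration only; the Gaussian tail, the choice \(h(\textbf{x})=\sqrt{2\textbf{x}}\), \(g(\varepsilon)=\varepsilon\), and the value of \(\textbf{x}_{0}\) are arbitrary), note that for a standard Gaussian \(\varepsilon\) the left-hand side has the closed form \(\mathbb{E}\mathbb{1}[\varepsilon>a]\varepsilon=\phi(a)\), the standard normal density at \(a=h(\textbf{x}_{0})\).

```python
# Sanity check of Lemma 4.25 for eps ~ N(0,1), using P(eps > sqrt(2x)) <= exp(-x)
# and g(eps) = eps.  All concrete choices here are arbitrary illustrations.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x0 = 1.0
h = lambda x: np.sqrt(2.0 * x)
h_prime = lambda x: 1.0 / np.sqrt(2.0 * x)

# Left-hand side: E 1[eps > h(x0)] * eps = phi(h(x0)) for a standard Gaussian.
lhs = norm.pdf(h(x0))

# Right-hand side of the lemma: h(x0) e^{-x0} + int_{x0}^inf e^{-x} h'(x) dx.
tail = quad(lambda x: np.exp(-x) * h_prime(x), x0, np.inf)[0]
rhs = h(x0) * np.exp(-x0) + tail

print(f"LHS = {lhs:.4f}  <=  RHS = {rhs:.4f}")
```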

Theorem 4.26 (Multivariate Berry–Esseen). Consider a sum \(X=\sum_{i=1}^{n}X_{i}\) of independent zero-mean random vectors in \(\mathbb{R}^{p}\) with a covariance matrix

$$\mathbb{E}XX^{T}=\Sigma.$$

Let \(\varphi:\mathbb{R}^{p}\to\mathbb{R}_{+}\) be some norm-like function (sub-additive and homogeneous) that together with the Gaussian vector \(Z\sim\mathcal{N}(0,\Sigma)\) fulfills the anti-concentration property: \(\forall x\in\mathbb{R}_{+}\)

$$\mathbb{P}(\varphi(Z)>x)-\mathbb{P}(\varphi(Z)>x+\Delta)\leq C_{A}\Delta.$$

Then the difference between the distributions of \(X\) and the Gaussian vector \(Z\) has the following upper bound \(\forall x\):

$$\left|\mathbb{P}(\varphi(X)>x)-\mathbb{P}(\varphi(Z)>x)\right|\leq 22C_{A}\mu_{3}\log\left(\frac{3p}{C_{A}\mu_{3}}\right)\log\left(\frac{\sqrt{2\mathbb{E}\varphi^{2}(Z)}p}{10C_{A}\mu^{2}_{3}}\right)$$
$${}\leq C_{A}\mu_{3}\,O(\log^{2}n),$$

where

$$\mu_{3}=\sum_{i=1}^{n}\mathbb{E}X_{i}^{T}\Sigma^{-1}X_{i}2\varphi(X_{i}).$$

Proof. We start with some preliminary computations. Define a smoothed indicator function

$$g_{x,\Delta}(t)=\begin{cases}0,&t<x-\Delta\\ (t-x+\Delta)/\Delta,&t\in[x-\Delta,x]\\ 1,&t>x.\end{cases}$$

Set \(h=g_{x,\Delta}\circ\varphi\). Denote the required bound by \(\delta\):

$$\sup_{x\in\mathbb{R}_{+}}\left|\mathbb{P}(\varphi(X)>x)-\mathbb{P}(\varphi(Z)>x)\right|\leq\delta.$$

Note that from the sub-additivity of the function \(\varphi\) it follows that

$$g_{x,\Delta}\big{(}\varphi(X+dX)\big{)}\leq g_{x,\Delta}\big{(}\varphi(X)+\varphi(dX)\big{)}$$

and

$$g^{\prime}_{x,\Delta}(t)=\frac{1}{\Delta}\mathbb{1}[x-\Delta\leq t\leq x].$$

By the anti-concentration property

$$\mathbb{E}g^{\prime}_{x,\Delta}(\varphi(Z))=\frac{1}{\Delta}\big{(}\mathbb{P}(\varphi(Z)>x-\Delta)-\mathbb{P}(\varphi(Z)>x)\big{)}\leq C_{A}.$$

Using the definition of \(\delta\),

$$\mathbb{E}g^{\prime}_{x,\Delta}(\varphi(Z(X,t)))\leq\frac{1}{\Delta}\big{(}\mathbb{P}(\varphi(Z)>x-\Delta)-\mathbb{P}(\varphi(Z)>x)\big{)}+\frac{2\delta}{\Delta}$$
(4.6)
$$\leq C_{A}+\frac{2\delta}{\Delta}.$$
(4.7)

Now we bound \(\mathbb{E}_{-i}J_{t}(\gamma,\theta,X_{i},X_{i}^{\prime})\), which is required in Lemma 4.22. Recall that by definition

$$J_{t}(\gamma,\theta,X_{i},X_{i}^{\prime})=\{(\gamma^{T}\Sigma^{-1/2}Z)^{2}-1\}\{h(Z(X_{-i}+\theta X_{i},t))-h(Z(X_{-i}+X^{\prime}_{i},t))\},$$

where

$$Z(x,t)=\sqrt{t}x+\sqrt{1-t}Z.$$

For some \(\theta^{\prime}\in[0,1]\), using the sub-additivity of \(\varphi\) and the Taylor formula, we get

$$h(Z(X_{-i}+\theta X_{i},t))-h(Z(X_{-i}+X^{\prime}_{i},t))$$
$${}\leq g_{x,\Delta}\big{(}\varphi(Z(X_{-i}+X^{\prime}_{i},t))+\varphi(\sqrt{t}(\theta X_{i}-X^{\prime}_{i}))\big{)}-g_{x,\Delta}\big{(}\varphi(Z(X_{-i}+X^{\prime}_{i},t)\big{)}$$
$${}\leq g^{\prime}_{x,\Delta}\big{(}\varphi(Z(X_{-i}+X^{\prime}_{i},t))+\theta^{\prime}\varphi(\sqrt{t}(\theta X_{i}-X^{\prime}_{i}))\big{)}\varphi(\sqrt{t}(\theta X_{i}-X^{\prime}_{i})).$$

Together with (4.7)

$$\mathbb{E}_{-i}h(Z(X_{-i}+\theta X_{i},t))-\mathbb{E}_{-i}h(Z(X_{-i}+X^{\prime}_{i},t))\leq\sqrt{t}\left(C_{A}+\frac{2\delta}{\Delta}\right)\varphi(\theta X_{i}-X^{\prime}_{i}).$$

Analogously one can obtain the same inequality with the opposite sign, and consequently we get the inequality with absolute value

$$\left|\mathbb{E}_{-i}h(Z(X_{-i}+X^{\prime}_{i},t))-\mathbb{E}_{-i}h(Z(X_{-i}+\theta X_{i},t))\right|\leq\sqrt{t}\left(C_{A}+\frac{2\delta}{\Delta}\right)\varphi(\theta X_{i}-X^{\prime}_{i}).$$

Using the previous expression and the notation

$$\varepsilon=\gamma^{T}\Sigma^{-1/2}Z\sim\mathcal{N}(0,1),$$

estimate the upper bound of \(J_{t}\)

$$\frac{1}{\sqrt{t}}\mathbb{E}_{-i}J_{t}(\gamma,\theta,X_{i},X_{i}^{\prime})\leq|\tau-1|\left(C_{A}+\frac{2\delta}{\Delta}\right)\varphi(\theta X_{i}-X^{\prime}_{i})+\mathbb{E}\mathbb{1}[\varepsilon^{2}>\tau]|\varepsilon^{2}-1|.$$

For \(\varepsilon\sim\mathcal{N}(0,1)\) we have

$$\mathbb{P}(\varepsilon>\sqrt{2\textbf{x}})\leq e^{-\textbf{x}}\quad\text{and}\quad\mathbb{P}(\varepsilon^{2}>2\textbf{x}+2\log 2)\leq e^{-\textbf{x}},$$

and by means of Lemma 4.25 we get for all \(\tau\geq 1\)

$$\mathbb{E}\mathbb{1}[\varepsilon^{2}>\tau]|\varepsilon^{2}-1|\leq 2(\tau+1)e^{-\tau/2}$$

and

$$A_{i}=\sup_{||\gamma||=1,\,\theta\in[0,1]}\frac{1}{\sqrt{t}}\mathbb{E}_{-i}J_{t}(\gamma,\theta,X_{i},X_{i}^{\prime})$$
$${}\leq|\tau-1|\left(C_{A}+\frac{2\delta}{\Delta}\right)\varphi(\theta X_{i}-X^{\prime}_{i})+2(\tau+1)e^{-\tau/2}.$$

One should find optimal values for the arbitrary parameters \(\Delta>0\) and \(\tau\geq 1\). Setting \(\Delta=\delta/(2C_{A})\) and \(\tau=2\log(3p/(C_{A}\mu_{3}))\) we obtain that

$$A=\sum_{i=1}^{n}\mathbb{E}X_{i}^{T}\Sigma^{-1}X_{i}\,A_{i}\leq 5|\tau-1|C_{A}\mu_{3}+2(\tau+1)e^{-\tau/2}p$$
$${}\leq 11C_{A}\mu_{3}\log\left(\frac{3p}{C_{A}\mu_{3}}\right).$$

We also need another upper bound for \(B_{i}\) when \(t\) is close to \(1\):

$$\mathbb{E}_{-i}\{\varepsilon^{2}-1\}h(Z(X_{-i}+X^{\prime}_{i},t))$$
$${}=\mathbb{E}_{-i}\{\varepsilon^{2}-1\}\{h(\sqrt{t}(X_{-i}+X^{\prime}_{i})+\sqrt{1-t}Z)-h(\sqrt{t}(X_{-i}+X^{\prime}_{i}))\}$$
$${}\leq\sqrt{1-t}\mathbb{E}_{-i}|\varepsilon^{2}-1|\,|g^{\prime}_{x,\Delta}|\,\varphi(Z)\leq\frac{\sqrt{1-t}}{\Delta}\sqrt{2\mathbb{E}\varphi^{2}(Z)}.$$

In the last expression we applied the Cauchy–Bunyakovsky inequality and the upper bound \(|g^{\prime}_{x,\Delta}|\leq 1/\Delta\). Accounting for the condition \(\Delta=\delta/(2C_{A})\), one may derive that

$$B_{i}=\frac{2C_{A}}{\delta}\sqrt{2\mathbb{E}\varphi^{2}(Z)}$$

and furthermore

$$B=\sum_{i=1}^{n}\mathbb{E}X_{i}^{T}\Sigma^{-1}X_{i}\,B_{i}\leq\frac{2C_{A}}{\delta}\sqrt{2\mathbb{E}\varphi^{2}(Z)}\,p.$$

In order to pass from the difference of expectations of \(h\) to the difference of probabilities, we use the following inequality:

$$\mathbb{P}(\varphi(X)>x)\leq\mathbb{E}h(X)=\mathbb{E}h(Z)+\mathbb{E}h(X)-\mathbb{E}h(Z)$$
$${}\leq\mathbb{P}(\varphi(Z)>x-\Delta)+\mathbb{E}h(X)-\mathbb{E}h(Z)\leq\mathbb{P}(\varphi(Z)>x)+\mathbb{E}h(X)-\mathbb{E}h(Z)+C_{A}\Delta,$$

which gives

$$\delta\leq|\mathbb{E}h(X)-\mathbb{E}h(Z)|+C_{A}\Delta\leq|\mathbb{E}h(X)-\mathbb{E}h(Z)|+\frac{\delta}{2},$$
$$\delta\leq 2|\mathbb{E}h(X)-\mathbb{E}h(Z)|.$$

Based on Lemma 4.22, we consider \(h=g_{x,\Delta}\circ\varphi\), whose corresponding values of \(A\) and \(B\) have already been estimated above. Assuming \(\delta>A\), we obtain

$$\delta\leq 2A(\log(6B)-\log(A))\leq 2A\left(\log(6B\delta)-\log(\delta)-\log(A)\right)$$
$${}\leq 2A\left(\log(6B\delta)-2\log(A)\right)\leq 22C_{A}\mu_{3}\log\left(\frac{3p}{C_{A}\mu_{3}}\right)\log\left(\frac{p\sqrt{2\mathbb{E}\varphi^{2}(Z)}}{10C_{A}\mu^{2}_{3}}\right).$$

\(\Box\)

Remark 4.27. In the i.i.d. case with \(\Sigma=I_{p}\) and \(\varphi(x)=O(||x||)\)

$$\left|\mathbb{P}(\varphi(X)>x)-\mathbb{P}(\varphi(Z)>x)\right|=O\left(\frac{p\log^{2}(n)}{\sqrt{n}}\right).$$

Note that Theorem 4.26 improves the classical multivariate Berry–Esseen theorem [5] in the case of norm functions \(\varphi(x)=O(||x||)\): instead of the \(p^{7/4}\) dependence on the dimension, we obtain a linear one.
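To get a feeling for this behaviour, the following Monte Carlo sketch (illustration only; the dimension, the standardized-exponential design, and the sample sizes are arbitrary, and the estimate is limited by Monte Carlo noise of order \(1/\sqrt{n_{mc}}\)) estimates \(\sup_{x}|\mathbb{P}(||X||>x)-\mathbb{P}(||Z||>x)|\) empirically for normalized sums of i.i.d. vectors.

```python
# Monte Carlo illustration of Theorem 4.26 with phi(x) = ||x||: estimate the
# Kolmogorov-type distance between ||X|| and ||Z|| for normalized sums of i.i.d.
# vectors.  The dimension, summand law, and sample sizes are arbitrary choices.
import numpy as np

rng = np.random.default_rng(2)
p, n_mc = 5, 20_000

def sup_distance(a, b):
    # empirical sup_x |P(a > x) - P(b > x)| evaluated on the pooled sample
    grid = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    Fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(Fa - Fb))

for n in [8, 32, 128]:
    # centered standardized exponential coordinates, so that E X_i X_i^T = I_p
    X = (rng.exponential(size=(n_mc, n, p)) - 1.0).sum(axis=1) / np.sqrt(n)
    Z = rng.normal(size=(n_mc, p))
    d = sup_distance(np.linalg.norm(X, axis=1), np.linalg.norm(Z, axis=1))
    print(f"n = {n:3d}:  empirical sup-distance ~ {d:.3f}")
```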

About this article

Cite this article

Buzun, N. Gaussian Approximation for Penalized Wasserstein Barycenters. Math. Meth. Stat. 32, 1–26 (2023). https://doi.org/10.3103/S1066530723010039
