Abstract
In this work we consider regularized Wasserstein barycenters (averages in the Wasserstein distance) in the Fourier basis. We prove that the random Fourier parameters of the barycenter converge in distribution to a Gaussian random vector. The convergence rate is derived in the finite-sample case, with explicit dependence on the number of measures (\(n\)) and the dimension of the parameters (\(p\)).
REFERENCES
M. Agueh and G. Carlier, ‘‘Barycenters in the Wasserstein space,’’ SIAM Journal on Mathematical Analysis 43 (2), 904–924 (2011).
V. Avanesov and N. Buzun, ‘‘Change-point detection in high-dimensional covariance structure,’’ Electronic Journal of Statistics 12 (2), 3254–3294 (2018).
H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 1st ed. (Springer Publishing Company, Incorporated, 2011).
V. Bentkus, ‘‘A new method for approximations in probability and operator theories,’’ Lithuanian Mathematical Journal 43 (4), 367–388 (2003).
V. Bentkus, ‘‘On the dependence of the Berry–Esseen bound on dimension,’’ Journal of Statistical Planning and Inference (2003).
I. Bespalov, N. Buzun, and D. V. Dylov, Brulé: Barycenter-regularized unsupervised landmark extraction (2020).
J. Bigot, E. Cazelles, and N. Papadakis, ‘‘Penalization of Barycenters in the Wasserstein Space,’’ SIAM Journal on Mathematical Analysis 51 (3), 2261–2285 (2019).
N. Bonneel, G. Peyré, and M. Cuturi, ‘‘Wasserstein barycentric coordinates: Histogram regression using optimal transport,’’ ACM Transactions on Graphics 35 (4), 71:1–71:10 (2016).
S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence (Oxford University Press, 2013).
V. Chernozhukov, D. Chetverikov, and K. Kato, ‘‘Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors,’’ Ann. of Stat. 41, 2786–2819 (2013).
C. Clason, D. A. Lorenz, H. Mahler, and B. Wirth, ‘‘Entropic regularization of continuous optimal transport problems,’’ Journal of Mathematical Analysis and Applications 494 (1), 124432 (2021).
D. Edwards, ‘‘On the Kantorovich–Rubinstein theorem,’’ Expositiones Mathematicae 29 (4), 387–398 (2011).
F. Götze, A. Naumov, V. Spokoiny, and V. Ulyanov, ‘‘Large ball probabilities, Gaussian comparison, and anti-concentration,’’ Bernoulli 25 (4A), 2538–2563 (2019).
A. Kroshnin, A. Suvorikova, and V. Spokoiny, Statistical inference for Bures–Wasserstein barycenters, arXiv:1901.00226 (2019).
L. Li, A. Genevay, M. Yurochkin, and J. Solomon, Continuous regularized Wasserstein barycenters, arXiv:2008.12534 (2020).
D. A. Lorenz, P. Manns, and C. Meyer, ‘‘Quadratically regularized optimal transport,’’ Applied Mathematics and Optimization (2021).
E. S. Meckes, On Stein’s method for multivariate normal approximation, in High Dimensional Probability V: The Luminy Volume (2009).
T. Rippl, A. Munk, and A. Sturm, ‘‘Limit laws of the empirical Wasserstein distance: Gaussian distributions,’’ Journal of Multivariate Analysis 151, 90–109 (2016).
N. Shvetsov, N. Buzun, and D. V. Dylov, Unsupervised non-parametric change point detection in quasi-periodic signals, arXiv:2002.02717 (2020).
M. Sommerfeld and A. Munk, ‘‘Inference for empirical Wasserstein distances on finite spaces,’’ Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 (1), 219–238 (2018).
V. Spokoiny, ‘‘Penalized maximum likelihood estimation and effective dimension,’’ Annales de l’Institut Henri Poincaré, Probabilités et Statistiques 53 (1), 389–429 (2017).
V. Spokoiny and M. Zhilova, Bootstrap confidence sets under a model misspecification. Preprint no. 1992, WIAS (2014).
S. Steinerberger, ‘‘Wasserstein distance, Fourier series and applications,’’ Monatshefte für Mathematik 194 (2021).
ACKNOWLEDGEMENTS
The author thanks Prof. Roman Karasev, Prof. Vladimir Spokoiny, and Prof. Dmitry Dylov for discussions of and contributions to this paper.
Appendices
Appendix A
PROOF OF THEOREM 3.1
From \((-\nabla^{2}L(\theta)\geq 0)\) and (\(L(\hat{\theta})>L(\theta^{*})\)) follows that the local region \(\Omega(\textbf{r})\) that includes \(\hat{\theta}\) should cover the next region
Use the notation
Estimate the minimal possible radius \(\textbf{r}_{0}\) that satisfies the previous condition. Let \(\theta_{0}\) be some point between \(\theta\) and \(\theta^{*}\), used in the Taylor expansion centered at \(\theta^{*}\).
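In explicit form, this is the standard second-order expansion with the Lagrange remainder at \(\theta_{0}\):
\[
L(\theta)-L(\theta^{*})=\nabla L(\theta^{*})^{T}(\theta-\theta^{*})+\frac{1}{2}(\theta-\theta^{*})^{T}\nabla^{2}L(\theta_{0})(\theta-\theta^{*}).
\]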
Assumption 1 provides
Assumption 2 provides, with probability \(1-e^{-t}\),
Put these two properties into the initial inequality
Thus, under the assumption \(\delta(\textbf{r})+\mathfrak{z}(t)\leq 1/2\), one may set
From Assumptions 1 and 2 it also follows that
Since \(\nabla L(\hat{\theta})=0\) we have
Note that for the coordinate transform \(S\) there exists the following invariant:
Since
Consequently, based on this invariant, we obtain the bound for the projection \(u\)
Appendix B
ELLIPSOID ENTROPY
The upper bound \(\mathfrak{z}(t)\) of the random process in Assumption 2 with parameter \(\theta\in\Omega(\textbf{r})\) requires an entropy computation for the ellipsoid \(\Omega(\textbf{r})\). It will be needed in the next section, and below we provide a short overview of this topic. The general formula for the covering number \(N(\varepsilon,\Omega)\) of a convex set \(\Omega\) in \(\mathbb{R}^{p}\) with the Euclidean metric is the volumetric bound
\[
N(\varepsilon,\Omega)\leq\frac{\operatorname{vol}\bigl(\Omega+(\varepsilon/2)B_{1}\bigr)}{\operatorname{vol}\bigl((\varepsilon/2)B_{1}\bigr)},
\]
where \(B_{1}\) is the unit ball. Recall that \(N(\varepsilon,\Omega)\) equals the minimal number of balls of radius \(\varepsilon\) needed to cover \(\Omega\).
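As an illustration (a standard computation, recorded here for the reader's convenience), for an ellipsoid \(E\) with semi-axes \(a_{1},\ldots,a_{p}\) the set \(E+(\varepsilon/2)B_{1}\) is contained in the ellipsoid with semi-axes \(a_{j}+\varepsilon/2\), so the volumetric bound yields
\[
\log N(\varepsilon,E)\leq\sum_{j=1}^{p}\log\Bigl(1+\frac{2a_{j}}{\varepsilon}\Bigr).
\]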
We will need the two components of the ellipsoid entropy defined in (3.4).

Lemma 4.6. Let \(||D^{-2}||=1\). Then for the entropy components (3.4) of the ellipsoid \(\Omega(\textbf{r})\) with matrix \(D^{2}\) defined in expression (3.2) it holds
and
where \(C\) is some absolute constant and \(\alpha>1\).
Proof. The function \(\log N(\varepsilon\textbf{r},\Omega(\textbf{r}))\) is monotone decreasing in \(\varepsilon\). One may split the integration interval of (3.4) into the following parts:
Taking the corresponding values \(N(\textbf{r}/4,\Omega(\textbf{r}))\), \(N(\textbf{r}/8,\Omega(\textbf{r}))\), \(N(\textbf{r}/16,\Omega(\textbf{r}))\), \(\ldots\), we obtain a histogram approximation of the integral
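Schematically, assuming the integral in (3.4) runs over \(\varepsilon\in(0,1/2]\) (we state this form only as a sketch), monotonicity on the dyadic intervals \([2^{-k-1},2^{-k}]\) gives
\[
\int_{0}^{1/2}\log N(\varepsilon\textbf{r},\Omega(\textbf{r}))\,d\varepsilon\leq\sum_{k\geq 1}2^{-k-1}\log N\bigl(2^{-k-1}\textbf{r},\Omega(\textbf{r})\bigr).
\]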
and
Theorem H.7.1 in [21] provides upper bounds for the right-hand sides of the previous expressions and completes the proof. \(\Box\)
Appendix C
SUPPORT FUNCTIONS
Bounds for the first and second derivatives of the likelihood of the barycenter model (2.1) involve additional tools from convex analysis.
Definition 4.7 (*). The Legendre–Fenchel transform (or convex conjugate) of a function \(f:X\to\overline{\mathbb{R}}\) is
\[
f^{*}(\theta)=\sup_{x\in X}\bigl\{\langle\theta,x\rangle-f(x)\bigr\}.
\]
Definition 4.8 (s). The support function of a convex body \(E\) is
\[
s_{E}(\theta)=\sup_{x\in E}\langle\theta,x\rangle.
\]
Note that for the indicator function \(\delta_{E}(\eta)\) of a convex set \(E\) (equal to \(0\) on \(E\) and \(+\infty\) outside) the conjugate function is the support function of \(E\), i.e., \(\delta_{E}^{*}=s_{E}\).
Definition 4.9 (\(\oplus\)). Let \(f_{1},f_{2}:E\to\overline{\mathbb{R}}\) be convex functions. Their infimal convolution is
\[
(f_{1}\oplus f_{2})(x)=\inf_{x_{1}+x_{2}=x}\bigl\{f_{1}(x_{1})+f_{2}(x_{2})\bigr\}.
\]
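As a simple illustration (a standard computation, recorded here for convenience), for the quadratics \(f_{1}(x)=\frac{a}{2}||x||^{2}\) and \(f_{2}(x)=\frac{b}{2}||x||^{2}\) with \(a,b>0\), optimizing over the split \(x=x_{1}+x_{2}\) gives
\[
(f_{1}\oplus f_{2})(x)=\frac{1}{2}\,\frac{ab}{a+b}\,||x||^{2},
\]
so the infimal convolution averages curvatures in a harmonic-mean fashion; Lemma 4.15 below quantifies this effect for general second derivatives.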
Lemma 4.10 (Proposition 13.21 [3]). Let \(f_{1},f_{2}:E\to\overline{\mathbb{R}}\) be convex lower semi-continuous functions. Then
\[
(f_{1}+f_{2})^{*}=f_{1}^{*}\oplus f_{2}^{*}.
\]
Lemma 4.11. The support function of the intersection \(E=E_{1}\cap E_{2}\) is the infimal convolution of the support functions of \(E_{1}\) and \(E_{2}\):
\[
s_{E_{1}\cap E_{2}}=s_{E_{1}}\oplus s_{E_{2}}.
\]
Proof. According to the previous lemma
\[
s_{E_{1}}\oplus s_{E_{2}}=\delta_{E_{1}}^{*}\oplus\delta_{E_{2}}^{*}=(\delta_{E_{1}}+\delta_{E_{2}})^{*}.
\]
With the additional property
\[
\delta_{E_{1}}+\delta_{E_{2}}=\delta_{E_{1}\cap E_{2}},
\]
one has
\[
s_{E_{1}}\oplus s_{E_{2}}=\delta_{E_{1}\cap E_{2}}^{*}=s_{E_{1}\cap E_{2}}.
\]
\(\Box\)
Lemma 4.12. Let a support function \(s_{E}(\theta)\) be differentiable; then its gradient belongs to the boundary of the corresponding convex set \(E\):
\[
\nabla s_{E}(\theta)=\operatorname{argmax}_{x\in E}\langle\theta,x\rangle\in\partial E.
\]
Proof. It follows from the convexity of \(E\) and linearity of the optimization functional.
\(\Box\)
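For example (a standard illustration of Lemma 4.12), for the ball \(E=\{x:||x||\leq r\}\) one has \(s_{E}(\theta)=r||\theta||\), and for \(\theta\neq 0\)
\[
\nabla s_{E}(\theta)=r\,\frac{\theta}{||\theta||}\in\partial E.
\]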
Lemma 4.13 (Proposition 16.48 [3]). Let \(f_{1},f_{2}:E\to\overline{\mathbb{R}}\) be convex continuous functions. Then the subdifferential of their infimal convolution can be computed by the formula
\[
\partial(f_{1}\oplus f_{2})(x)=\partial f_{1}(x_{1})\cap\partial f_{2}(x_{2}),
\]
where \(x_{1}+x_{2}=x\) attains the infimum in Definition 4.9.
Corollary 4.14. If in addition \(f_{1},f_{2}\) are differentiable, then their infimal convolution is differentiable and \(\exists x_{1},x_{2}:x=x_{1}+x_{2}\) and
\[
\nabla(f_{1}\oplus f_{2})(x)=\nabla f_{1}(x_{1})=\nabla f_{2}(x_{2}).
\]
Lemma 4.15. Let \(f_{1},\ldots,f_{m}:E\to\overline{\mathbb{R}}\) be convex and twice differentiable functions. Then the second derivative of the infimal convolution admits the upper bound \(\forall t:\sum_{i=1}^{m}t_{i}=1\)
\[
\nabla^{2}(f_{1}\oplus\ldots\oplus f_{m})(x)\leq\sum_{i=1}^{m}t_{i}^{2}\,\nabla^{2}f_{i}(x_{i}),
\]
where \(\sum_{i=1}^{m}x_{i}=x\) attains the infimum.
Proof. Use the notation \(f=f_{1}\oplus\ldots\oplus f_{m}\). Let
According to Lemma 4.13, if all the functions are differentiable, then
From the definition of \(\oplus\) it also follows that
Make a Taylor expansion of the left- and right-hand sides and account for the equality of the first derivatives
Since the direction \(z\) was chosen arbitrarily, dividing both sides of the previous relation by \(||z||^{2}\) and letting \(||z||\to 0\), we arrive at the inequality
\(\Box\)
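For the choice of weights in the proof of Theorem 4.17 below, the following elementary computation is useful (we record it for the reader's convenience): minimizing \(\sum_{i=1}^{m}t_{i}^{2}a_{i}\) with \(a_{i}>0\) over \(\sum_{i=1}^{m}t_{i}=1\) gives
\[
t_{i}=\frac{a_{i}^{-1}}{\sum_{j=1}^{m}a_{j}^{-1}},\qquad\min_{t}\sum_{i=1}^{m}t_{i}^{2}a_{i}=\Bigl(\sum_{j=1}^{m}a_{j}^{-1}\Bigr)^{-1}\leq\min_{i}a_{i}.
\]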
Remark 4.16. Another proof of a similar theorem can be found in [3], Theorem 18.15.
Theorem 4.17. Let \(f_{1},\ldots,f_{m}:E\to\overline{\mathbb{R}}\) be convex and twice differentiable functions. Then the derivatives of the infimal convolution \(f=f_{1}\oplus\ldots\oplus f_{m}\) admit the following upper bounds: \(\forall\gamma\) \(\exists x_{1},\ldots,x_{m}\):
and
Proof. Choosing appropriate \(\{t_{i}\}\) in Lemma 4.15, one gets the required upper bounds. Set
and since
In order to prove the second formula, apply this inequality in
\(\Box\)
Corollary 4.18. Let \(s_{1},\ldots,s_{m}:E^{*}\to\overline{\mathbb{R}}\) be support functions of the bounded convex smooth sets \(E_{1},\ldots,E_{m}\). Then the derivatives of the support function \(s\) of the intersection \(E_{1}\cap\ldots\cap E_{m}\) admit upper bounds such that \(\forall i\)
Proof. It follows from Theorem 4.17 and Lemma 4.12. \(\Box\)
Appendix D
GAUSSIAN APPROXIMATION
Multivariate analogues of the Berry–Esseen theorem exist in many modifications, depending on the dimension of the random vectors and on the set of functions used for comparing the measures. Bentkus [4, 5] presented excellent results on this topic. Namely, for a sequence of i.i.d. random vectors \(\{X_{i}\}_{i=1}^{n}\) in \(\mathbb{R}^{p}\) with identity covariance matrix, any convex set \(A\), and a Gaussian vector \(Z\sim\mathcal{N}(0,I)\) it holds
and
We extend these two statements to independent random vectors with a non-identity covariance \(\Sigma\). Additionally, we remove the factor \(p^{1/4}\), replacing it with the anti-concentration constant defined below.
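For reference, the convex-set bound of [5] in the i.i.d. standardized case has the schematic form (we quote it up to an absolute constant \(C\); the precise constants can be found in [4, 5])
\[
\sup_{A\ \mathrm{convex}}\bigl|\mathbb{P}(X\in A)-\mathbb{P}(Z\in A)\bigr|\leq C\,p^{1/4}\,\frac{\mathbb{E}||X_{1}||^{3}}{\sqrt{n}},\qquad X=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}X_{i}.
\]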
Definition 4.19 (\(H_{k}\)). The multivariate Hermite polynomial is
\[
H_{k}(x)=(-1)^{|k|}e^{||x||^{2}/2}\,\partial_{x_{1}}^{k_{1}}\ldots\partial_{x_{p}}^{k_{p}}\,e^{-||x||^{2}/2},
\]
where \(x\in\mathbb{R}^{p}\) and \(|k|=k_{1}+\ldots+k_{p}\).
Lemma 4.20 [17]. Consider a Gaussian vector \(Z\sim\mathcal{N}(0,\Sigma)\) and two functions \(h\in C^{1}\) and \(f_{h}\) such that
where for \(t\in[0,1]\)
Then \(f_{h}\) is a solution of Stein’s equation
and
Proof. One may verify this statement by substituting the solution \(f_{h}\) into Stein’s equation. This is done in Lemma 1 of [17]. \(\Box\)
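For completeness we recall the standard form of these objects (sign and normalization conventions may differ): with \(h_{t}(x)=h(\sqrt{t}x+\sqrt{1-t}Z)\) for \(t\in[0,1]\), the function
\[
f_{h}(x)=\int_{0}^{1}\frac{1}{2t}\Bigl(\mathbb{E}h(Z)-\mathbb{E}h_{t}(x)\Bigr)dt
\]
solves Stein’s equation for \(\mathcal{N}(0,\Sigma)\),
\[
\operatorname{tr}\bigl(\Sigma\nabla^{2}f_{h}(x)\bigr)-\langle x,\nabla f_{h}(x)\rangle=h(x)-\mathbb{E}h(Z).
\]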
In the following discussion we will need the difference between the second derivatives of the function \(f_{h}\).
Corollary 4.21. Let \(f_{h}\) be the solution of Stein’s equation, then
where
For each \(i\), write \(X_{-i}\) for the sum of \(\{X_{j}\}_{j=1}^{n}\) without the \(i\)th element, and \(X^{\prime}_{i}\) for an independent copy of \(X_{i}\). We use the following notation for the conditional expectation
Lemma 4.22. Consider a Gaussian vector \(Z\sim\mathcal{N}(0,\Sigma)\) and a sum \(X=\sum_{i=1}^{n}X_{i}\) of independent zero-mean random vectors in \(\mathbb{R}^{p}\) with the same non-singular variance matrix
Then for any function \(h\in C^{1}(\mathbb{R}^{p})\) with bounded first derivative
where
and \(\forall\alpha>0\) on the interval \(t\in[0,1-\alpha]\)
and on the interval \(t\in[1-\alpha,1]\) for the same \(\alpha\)
and
where
Proof. From Lemma 4.20 it follows that for any function \(h\) with bounded first derivative
Let \(\theta\) be some value in \([0,1]\). Decompose \(\nabla f_{h}(X)\) by the Taylor formula
Note that
Substitute them into the first expression
From Corollary 4.21 of Lemma 4.20, use the equality for the difference of second derivatives. For a unit vector \(||\gamma||=1\) and the conditional expectation \(\mathbb{E}_{-i}=\mathbb{E}(\cdot|X_{i},X^{\prime}_{i})\)
Summing it with \(X^{T}_{i}\Sigma^{-1}X_{i}\), we finally obtain
\(\Box\)
Theorem 4.23 (Multivariate Berry–Esseen with Wasserstein distance). Consider a sum \(X=\sum_{i=1}^{n}X_{i}\) of independent zero-mean random vectors in \(\mathbb{R}^{p}\) with a variance matrix
Then the 1-Wasserstein distance between \(X\) and the Gaussian vector \(Z\sim\mathcal{N}(0,\Sigma)\) has the following upper bound
where
and each \(X_{i}^{\prime}\) is an independent copy of \(X_{i}\).
Remark 4.24. In the i.i.d. case with \(\Sigma=I_{p}\)
The same theorem, with a different proof, appears in [4].
Proof. Based on Lemma 4.22, we consider \(h\) with the property \(||\nabla h(\cdot)||\leq 1\), which comes from the dual definition of \(W_{1}\) (Section 2), and use the definitions of \(A_{i}\) and \(B_{i}\). We decompose \(A_{i}\) by extracting \(\sqrt{t}(X_{i}-X^{\prime}_{i})\) and \(B_{i}\) by extracting \(\sqrt{1-t}Z\)
Note that \((\gamma^{T}\Sigma^{-1/2}Z)^{2}\) has the chi-square distribution \(\chi^{2}_{1}\), whose variance equals \(2\). So, by means of the Cauchy–Bunyakovsky inequality, we obtain
and
Consequently
and
\(\Box\)
For the next result we will need the following technical lemma.
Lemma 4.25. Let a random variable \(\varepsilon\) have a tail bound \(\forall\textbf{x}\geq\textbf{x}_{0}\)
Then for a function \(g:\mathbb{R}_{+}\to\mathbb{R}_{+}\) with derivative \(g^{\prime}:\mathbb{R}_{+}\to\mathbb{R}_{+}\)
In particular
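The underlying identity is the standard tail-integration formula (we state it as a sketch, assuming \(\varepsilon\geq 0\), \(g\) nondecreasing, and \(g(0)=0\)):
\[
\mathbb{E}g(\varepsilon)=\int_{0}^{\infty}g^{\prime}(x)\,\mathbb{P}(\varepsilon\geq x)\,dx\leq g(\textbf{x}_{0})+\int_{\textbf{x}_{0}}^{\infty}g^{\prime}(x)\,\mathbb{P}(\varepsilon\geq x)\,dx,
\]
where the inequality follows by bounding the probability by one on \([0,\textbf{x}_{0}]\).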
Theorem 4.26 (Multivariate Berry–Esseen). Consider a sum \(X=\sum_{i=1}^{n}X_{i}\) of independent zero-mean random vectors in \(\mathbb{R}^{p}\) with a variance matrix
Let \(\varphi:\mathbb{R}^{p}\to\mathbb{R}_{+}\) be a norm-like function (sub-additive and homogeneous) that, together with the Gaussian vector \(Z\sim\mathcal{N}(0,\Sigma)\), fulfills the anti-concentration property \(\forall x\in\mathbb{R}_{+}\)
Then the difference of the measures of \(X\) and the Gaussian vector \(Z\) has the following upper bound \(\forall x\)
where
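A typical form of such an anti-concentration property, consistent with how the constant \(C_{A}\) is used in the proof below (cf. the Gaussian anti-concentration bounds in [16]), is
\[
\forall x\in\mathbb{R}_{+},\ \forall\Delta>0:\quad\mathbb{P}\bigl(x\leq\varphi(Z)\leq x+\Delta\bigr)\leq C_{A}\Delta;
\]
we record it here only as an assumed shape of the condition.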
Proof. We begin with some preliminary computations. Define a smooth indicator function
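One standard choice, consistent with the derivative bound \(|g^{\prime}_{x,\Delta}|\leq 1/\Delta\) used later in the proof and given here only for illustration, is the piecewise-linear indicator
\[
g_{x,\Delta}(y)=\begin{cases}1,& y\leq x,\\ 1-(y-x)/\Delta,& x<y\leq x+\Delta,\\ 0,& y>x+\Delta.\end{cases}
\]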
Set \(h=g_{x,\Delta}\circ\varphi\). Denote the required bound by \(\delta\):
Note that from the sub-additivity of the function \(\varphi\) it follows that
and
By the anti-concentration property
And using the definition of \(\delta\)
Now we bound \(\mathbb{E}_{-i}J_{t}(\gamma,\theta,X_{i},X_{i}^{\prime})\), which is required in Lemma 4.22. Recall that by definition
where
For some \(\theta^{\prime}\in[0,1]\), using the sub-additivity of \(\varphi\) and the Taylor formula, we get
Together with (4.7)
Analogously, one can obtain the same inequality for the opposite sign, and consequently we get the inequality with absolute value
Using the previous expression and the notation
we estimate the upper bound of \(J_{t}\)
For \(\varepsilon\sim\mathcal{N}(0,1)\) we have
and by means of Lemma 4.25 we get for all \(\tau\geq 1\)
and
One should find optimal values for the arbitrary parameters \(\Delta>0\) and \(\tau\geq 1\). Setting \(\Delta=\delta/(2C_{A})\) and \(\tau=2\log(3p/(C_{A}\mu_{3}))\) we obtain that
We also need another upper bound for \(B_{i}\) when \(t\) is close to \(1\).
In the last expression we have applied the Cauchy–Bunyakovsky inequality and the upper bound \(|g^{\prime}_{x,\Delta}|\leq 1/\Delta\). Taking into account the condition \(\Delta=\delta/(2C_{A})\), one may derive that
and furthermore
In order to pass from the difference of expectations of \(h\) to the difference of probabilities, we use the following inequality:
which gives
Based on Lemma 4.22, we consider \(h=g_{x,\Delta}\circ\varphi\); the main quantities \(A\) and \(B\) have already been estimated. Assuming \(\delta>A\), we obtain
\(\Box\)
Remark 4.27. In the i.i.d. case with \(\Sigma=I_{p}\) and \(\varphi(x)=O(||x||)\)
Note that Theorem 4.26 improves the classical multivariate Berry–Esseen theorem [5] for the case of norm functions \(\varphi(x)=O(||x||)\): instead of the \(p^{7/4}\) dependence on the dimension, we obtain a linear one.