Abstract
In this paper, we study the differentially private empirical risk minimization problem where the parameter is constrained to a Riemannian manifold. We introduce a framework for performing differentially private Riemannian optimization by adding noise to the Riemannian gradient on the tangent space. The noise follows a Gaussian distribution intrinsically defined with respect to the Riemannian metric on the tangent space. We adapt the Gaussian mechanism from the Euclidean space to the tangent space so that it is compatible with this generalized Gaussian distribution. This approach presents a novel analysis as compared to directly adding noise on the manifold. We further prove privacy guarantees of the proposed differentially private Riemannian (stochastic) gradient descent using an extension of the moments accountant technique. Overall, we provide utility guarantees under geodesic (strongly) convex and general nonconvex objectives, as well as under the Riemannian Polyak-Łojasiewicz condition. Empirical results illustrate the versatility and efficacy of the proposed framework in several applications.
1 Introduction
With the ever-increasing complexity of statistics and machine learning models, data privacy has become a primary concern, as it is increasingly difficult to safeguard against the disclosure of private information during model training. Differential privacy (Dwork et al. , 2006; Dwork and Roth , 2014) provides a framework for quantifying the privacy loss as well as for designing algorithms with privacy-preserving guarantees.
Many problems in machine learning fall under the paradigm of empirical risk minimization (ERM), where the loss is expressed as \(\frac{1}{n}\sum _{i=1}^n f(w; z_i)\) with independent and identically distributed (i.i.d) samples \(z_1,\ldots , z_n\) drawn from a data distribution \({\mathcal {D}}\). Differentially private ERM, originally studied in Chaudhuri et al. (2011), aims to safeguard the privacy of the samples in the solution \(w^*\). There exist many approaches to achieve this goal. The first class of methods perturbs the output of a non-differentially private algorithm by adding Laplace or Gaussian noise (Chaudhuri and Monteleoni , 2008; Chaudhuri et al. , 2011; Zhang et al. , 2017). Another approach considers adding a linear random perturbation term to the objective and is known as objective perturbation (Chaudhuri and Monteleoni , 2008; Chaudhuri et al. , 2011; Kifer et al. , 2012; Iyengar et al. , 2019; Bassily et al. , 2021). The third type of approach injects noise into gradient-based algorithms at each iteration (Bassily et al. , 2014; Wang et al. , 2017; Wang et al. , 2019a; Abadi et al. , 2016; Bassily et al. , 2019). In addition, there also exist various specialized methods for problems such as linear regression and statistics estimation (Dwork and Lei , 2009; Nissim et al. , 2007; Amin et al. , 2019; Kamath et al. , 2019; Biswas et al. , 2020).
Among all the aforementioned approaches, gradient perturbation receives the most attention due to its generality for arbitrary loss functions and scalability to large datasets. Furthermore, it only requires bounding the sensitivity of the gradients computed at each iteration, rather than that of the entire process. There is an ongoing line of work that aims to improve the utility of gradient perturbed algorithms while maintaining the same privacy budget. Such improvements have been seen under (strongly) convex losses (Wang et al. , 2017; Iyengar et al. , 2019; Yu et al. , 2021; Bassily et al. , 2019; Kuru et al. , 2022; Asi et al. , 2021), nonconvex losses (Wang et al. , 2017; Zhang et al. , 2017; Wang et al. , 2019a; Wang et al. , 2019b), and also structured losses, such as those satisfying the Polyak-Łojasiewicz condition (Wang et al. , 2017).
In this paper, we consider the following ERM problem in a differentially private setting where the parameter is constrained to lie on a Riemannian manifold, i.e.,

\(\min _{w \in {\mathcal {M}}} F(w):= \frac{1}{n}\sum _{i=1}^n f(w; z_i), \qquad \qquad (1)\)
where \({\mathcal {M}}\) is a d-dimensional Riemannian manifold and \(f: {\mathcal {M}}\times {\mathcal {D}}\xrightarrow {} {\mathbb {R}}\) is a loss function over samples. Riemannian manifolds commonly occur in statistics and machine learning where the constraint space naturally possesses additional nonlinear structure, such as orthogonality (Absil et al. , 2009), positive definiteness (Bhatia , 2009), unit norm, and hyperbolic structure (Boumal , 2020), among others. Popular applications involving manifold structures include matrix completion (Boumal and Absil , 2011b; Han and Gao , 2021a), metric learning, covariance estimation (Han et al. , 2021b), principal component analysis (Absil et al. , 2007), online learning (Amari , 1998), and taxonomy embedding (Nickel and Kiela , 2018), to name a few.
While a few recent works address specific Riemannian optimization problems under differential privacy, such as the private Fréchet mean computation (Reimherr et al. , 2021), there exists no systematic study of general purpose strategies to guarantee differential privacy for (1) on Riemannian manifolds. The main challenges are in terms of privatizing operations on manifolds intrinsically and bounding the sensitivity within a unified framework. On the other hand, differentially private (non-Riemannian) approaches have been studied for ERM problems with specific constraints (Chaudhuri et al. , 2013; Bassily et al. , 2014; Jiang et al. , 2016; Abadi et al. , 2016; Bassily et al. , 2019; Maunu et al. , 2022). Such approaches typically employ the projected gradient algorithm, i.e., taking the gradient step and adding noise in the Euclidean space, and then projecting onto the constraint set. Extending such a strategy to Riemannian manifolds may result in looser sensitivity and utility bounds, scaling poorly with the dimension of the ambient space, which can be much larger than the intrinsic manifold dimension (Reimherr et al. , 2021).
Contributions. In this work, we propose a general framework via Riemannian optimization to achieve differential privacy for (1) by Riemannian gradient perturbation adhering to the Riemannian geometry. To the best of our knowledge, this is the first work that provides systematic gradient perturbation techniques for privatizing on Riemannian manifolds. We generalize the Gaussian mechanism to the tangent space of Riemannian manifolds and also adapt the moments accountant technique to trace the privacy loss. We study the privacy guarantees of the differentially private Riemannian (stochastic) gradient descent method and show its utility guarantees for various function classes of interest on Riemannian manifolds, including geodesic (strongly) convex, general nonconvex functions, and functions satisfying the Riemannian Polyak-Łojasiewicz (PL) conditions. A summary of utility bounds is in Table 1. This allows tasks on manifolds to be privatized and analyzed in a unified framework and often leads to tighter utility guarantees compared to directly privatizing the output (Reimherr et al. , 2021) where more conservative noise is often added (as discussed in Sect. 5). We show that the projected gradient methods for (1) ignore the intrinsic geometry when taking the update, thereby hampering the utility under structured loss functions on manifolds (as discussed in Sects. 4 and 5). Finally, we also explore various interesting applications on Riemannian manifolds where privacy is concerned, including the principal eigenvector computation, Fréchet mean, Wasserstein barycenter, hyperbolic structured prediction, and low-rank matrix completion problems.
2 Preliminaries
Riemannian geometry. A Riemannian manifold \({\mathcal {M}}\) of dimension d is a smooth manifold with an inner product structure \(\langle \cdot , \cdot \rangle _w\) (i.e., a Riemannian metric) on every tangent space \(T_w{\mathcal {M}}\). Given an orthonormal basis \((\partial _1, \ldots , \partial _d)\) for \(T_w{\mathcal {M}}\), the metric can be expressed with a symmetric positive definite matrix \(G_w\) and the inner product can be written as \(\langle \xi , \zeta \rangle _w = \vec {\xi }^\top G_w \vec {\zeta }\), where \(\vec {\xi }, \vec {\zeta } \in {\mathbb {R}}^d\) are the vectorizations of tangent vectors \(\xi , \zeta \in T_w{\mathcal {M}}\) in the normal coordinate system. The induced norm is defined as \(\Vert \xi \Vert _w = \sqrt{\langle \xi , \xi \rangle _w}\) for any \(\xi \in T_w{\mathcal {M}}\). A geodesic \(c: [0,1] \xrightarrow {} {\mathcal {M}}\) is a locally distance minimizing curve on the manifold with zero acceleration (intrinsic derivative of the velocity). For any \(\xi \in T_w{\mathcal {M}}\), the exponential map is defined as \(\textrm{Exp}_w(\xi ) = c(1)\), where \(c(0) = w\) and \(c'(0) = \xi\). If there exists a unique geodesic connecting two points \(w, w'\in {\mathcal {M}}\), the exponential map has a smooth inverse and the Riemannian distance is given by \(\textrm{dist}(w, w') = \Vert \textrm{Exp}_w^{-1}(w') \Vert _w = \Vert \textrm{Exp}_{w'}^{-1}(w) \Vert _{w'}\). We call a neighbourhood \({\mathcal {W}}\) totally normal if the exponential map is invertible between any two points in \({\mathcal {W}}\). The Riemannian gradient of a real-valued function F, denoted as \(\textrm{grad}F(w)\), is the tangent vector that satisfies \(\langle \textrm{grad}F(w), \xi \rangle _w = \textrm{D}_\xi F(w) = \langle \nabla F(w), \xi \rangle _2\) for any \(\xi \in T_w{\mathcal {M}}\), where \(\textrm{D}_\xi F(w)\) is the directional derivative of F(w) along \(\xi\) and \(\nabla F(w)\) is the Euclidean gradient.
Riemannian optimization. Under non-private settings, Riemannian optimization (Absil et al. , 2009; Boumal , 2020) provides a class of methods to efficiently solve problem (1). The Riemannian steepest descent method (Udriste , 2013; Zhang and Sra , 2016) takes a gradient update via the exponential map so that the iterates stay on the manifold, i.e., \(\textrm{Exp}_{w}(- \eta \, \textrm{grad}F(w))\) for some stepsize \(\eta\). Other more advanced solvers include the Riemannian conjugate gradient (Ring and Wirth , 2012), Riemannian trust region methods (Absil et al. , 2007; Absil et al. , 2009), as well as many recent stochastic optimizers (Bonnabel , 2013; Zhang et al. , 2016; Sato et al. , 2019; Kasai et al. , 2018; Han and Gao , 2021b; Han and Gao , 2021a).
Function classes on Riemannian manifolds. The notion of Lipschitz continuity has been generalized to Riemannian manifolds (Boumal , 2020; Zhang and Sra , 2016). A differentiable function \(F:{\mathcal {M}}\xrightarrow {} {\mathbb {R}}\) is geodesic \(L_0\)-Lipschitz if for any \(w \in {\mathcal {M}}\), \(\Vert \textrm{grad}F(w)\Vert _w \le L_0\). The function F is called geodesic \(L_1\)-smooth if for any \(w \in {\mathcal {M}}\) and \(w' = \textrm{Exp}_w(\xi )\), we have \(|F(w') - F(w) - \langle \textrm{grad}F(w), \xi \rangle _w | \le \frac{L_1}{2}\Vert \xi \Vert _w^2\) and equivalently \(\Vert \textrm{Hess}F(w) [u] \Vert _w \le L_1 \Vert u \Vert _w\) for any \(u \in T_w{\mathcal {M}}\) where \(\textrm{Hess}F(w): T_w{\mathcal {M}}\xrightarrow {} T_w{\mathcal {M}}\) is the Riemannian Hessian.
Geodesic convexity (Zhang and Sra , 2016) is an extension of convexity in the Euclidean space. A set \({\mathcal {W}}\subseteq {\mathcal {M}}\) is geodesic convex if for any two points in the set, there exists a geodesic in the set joining them. A function \(F: {\mathcal {W}}\xrightarrow {} {\mathbb {R}}\) is called geodesic convex if for any \(w, w' \in {\mathcal {W}}\), it satisfies \(F(c(t)) \le (1-t) F(w) + t F(w')\) for all \(t \in [0,1]\), where c is the geodesic such that \(c(0) = w, c(1) = w'\). In addition, a function F is called geodesic \(\beta\)-strongly convex if for any \(w, w' = \textrm{Exp}_w(\xi ) \in {\mathcal {W}}\), it satisfies \(F(w') \ge F(w) + \langle \textrm{grad}F(w), \xi \rangle _w + \frac{\beta }{2}\textrm{dist}^2(w, w')\) for some \(\beta > 0\). We also introduce the Riemannian Polyak-Łojasiewicz (PL) condition (Zhang et al. , 2016), which is weaker than geodesic strong convexity. A function \(F:{\mathcal {M}}\xrightarrow {} {\mathbb {R}}\) is said to satisfy the Riemannian PL condition if there exists \(\tau > 0\) such that for any \(w \in {\mathcal {M}}\), \(F(w) - F(w^*) \le \tau \Vert \textrm{grad}F(w) \Vert ^2_w\), where \(w^*\) is a global minimizer of F on \({\mathcal {M}}\).
Finally, we recall a trigonometric distance bound for Riemannian manifolds with lower bounded sectional curvature, which is crucial in convergence analysis for geodesic (strongly) convex optimization.
Lemma 1
(Trigonometric distance bound (Bonnabel , 2013; Zhang and Sra , 2016; Zhang et al. , 2016; Han et al. , 2021b)) Let \(w_0, w_1, w_2 \in {\mathcal {W}}\subseteq {\mathcal {M}}\) lie in a totally normal neighbourhood of a Riemannian manifold with curvature lower bounded by \(\kappa _\textrm{min}\), and \(l_0 = \textrm{dist}(w_0, w_1), l_1 = \textrm{dist}(w_1, w_2)\) and \(l_2 = \textrm{dist}(w_0, w_2)\). Denote \(\theta\) as the angle on \(T_{w_0}{\mathcal {M}}\) such that \(\cos (\theta ) = \frac{1}{l_0 l_2}\langle \textrm{Exp}_{w_0}^{-1}(w_1), \textrm{Exp}_{w_0}^{-1}(w_2) \rangle _{w_0}\). Let \(D_{\mathcal {W}}\) be the diameter of \({\mathcal {W}}\), i.e., \(D_{\mathcal {W}}:= \max _{w, w' \in {\mathcal {W}}} \textrm{dist}(w, w')\). Define the curvature constant \(\varsigma = \frac{\sqrt{|\kappa _\textrm{min}|}D_{\mathcal {W}}}{\tanh (\sqrt{|\kappa _\textrm{min}|} D_{\mathcal {W}})}\) if \(\kappa _\textrm{min} < 0\) and \(\varsigma = 1\) if \(\kappa _\textrm{min} \ge 0\). Then, we have \(l_1^2 \le \varsigma l_0^2 + l_2^2 - 2 l_0 l_2 \cos (\theta )\).
Differential privacy. Let \(D = \{z_1, \ldots , z_n\} \subset {\mathcal {D}}^n\) be a dataset. A neighbouring dataset of D, denoted as \(D'\), is a dataset that differs in only one sample from D. The neighbouring relation is denoted as \(D \sim D'\). We first recall the definition of \((\epsilon , \delta )\)-differential privacy (DP) (Dwork and Roth , 2014), which is defined on an arbitrary measurable space \({\mathcal {M}}\) (not necessarily a Riemannian manifold).
Definition 1
(\((\epsilon , \delta )\)-differential privacy) A randomized mechanism \({\mathcal {R}}: {\mathcal {D}}^n \xrightarrow {} {\mathcal {M}}\) is called \((\epsilon , \delta )\)-differentially private on \({\mathcal {M}}\) if for any neighbouring datasets \(D, D' \subset {\mathcal {D}}^n\) and any measurable set \({\mathcal {A}}\subseteq {\mathcal {M}}\), we have \({\mathbb {P}}({\mathcal {R}}(D) \in {\mathcal {A}}) \le \exp (\epsilon ) \, {\mathbb {P}}({\mathcal {R}}(D') \in {\mathcal {A}}) + \delta\).
We also make use of Rényi differential privacy (RDP) (Mironov , 2017), which enjoys a tighter privacy bound under composition and subsampling (Wang et al. , 2019c). Given \(D \sim D'\), we define the cumulant generating function of a mechanism \({\mathcal {R}}\) as \(K_{{\mathcal {R}}, (D, D')}(\lambda ):= \log {\mathbb {E}}_{o \sim {\mathcal {R}}(D)} [\exp (\lambda \, {\mathcal {L}}_{{\mathcal {R}}(D) \Vert {\mathcal {R}}(D')}) ] = \log {\mathbb {E}}_{o \sim {\mathcal {R}}(D)} \left[ \left( \frac{p({\mathcal {R}}(D) = o)}{p({\mathcal {R}}(D') = o)} \right) ^{\lambda } \right]\), where \({\mathcal {L}}_{{\mathcal {R}}(D) \Vert {\mathcal {R}}(D')} = \log \left( \frac{p({\mathcal {R}}(D) = o)}{p({\mathcal {R}}(D') = o)} \right)\) is known as the privacy loss random variable at \(o \sim {\mathcal {R}}(D)\) (Dwork and Roth , 2014). When maximized over all the neighbouring datasets, \(K_{{\mathcal {R}}}(\lambda ):= \sup _{D \sim D'} K_{{\mathcal {R}},(D, D')}(\lambda )\) is called the \(\lambda\)-th moment of the mechanism (Abadi et al. , 2016).
Definition 2
(\((\alpha ,\rho )\)-Rényi differential privacy (Mironov , 2017)) For \(\alpha \ge 1\) and \(\rho > 0\), a randomized mechanism \({\mathcal {R}}: {\mathcal {D}}^n \xrightarrow {} {\mathcal {M}}\) is called \((\alpha , \rho )\)-Rényi differentially private if \(\frac{1}{\alpha - 1} K_{{\mathcal {R}}}(\alpha -1) \le \rho\).
Proposition 1
(Relationship between RDP and \((\epsilon , \delta )\)-DP (Mironov , 2017)) If a mechanism \({\mathcal {R}}\) satisfies \((\alpha , \rho )\)-Rényi differential privacy, then it satisfies \((\rho + \log (1/\delta )/(\alpha - 1), \delta )\)-differential privacy.
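For concreteness, Proposition 1 amounts to the following one-line conversion; a minimal sketch in Python (the function name is ours):

```python
import numpy as np

def rdp_to_dp(alpha, rho, delta):
    """Convert an (alpha, rho)-RDP guarantee into an (epsilon, delta)-DP one
    via Proposition 1: epsilon = rho + log(1/delta) / (alpha - 1)."""
    assert alpha > 1 and delta > 0
    return rho + np.log(1.0 / delta) / (alpha - 1.0)

# Example: a mechanism that is (10, 0.05)-RDP is (approximately 1.33, 1e-5)-DP.
print(rdp_to_dp(alpha=10, rho=0.05, delta=1e-5))
```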
The notions of differential privacy introduced above are well-defined on Riemannian manifolds, which are measurable spaces under the Borel sigma algebra (Pennec , 2006). However, a systematic approach for preserving differential privacy when the parameters of interest lie on Riemannian manifolds has not been studied. A recent work (Reimherr et al. , 2021) proposes differentially private Fréchet mean computation over general Riemannian manifolds by output perturbation. The subsequent work (Soto et al. , 2022) also studies the private Fréchet mean by generalizing the K-norm gradient mechanism to manifolds. Nevertheless, computing the Fréchet mean is a special problem instance of (1), which may also be solved via the proposed framework under differential privacy. Please refer to Sect. 5 for more comparative discussions.
3 Differential privacy on manifolds
This section develops various tools for preserving and analyzing differential privacy on Riemannian manifolds. Proofs for the results in this section are deferred to the appendix.
First, we generalize the Gaussian mechanism (Dwork and Roth , 2014) from the Euclidean space to Riemannian manifolds. One approach is to directly add noise on the manifold following an intrinsic Gaussian distribution (Pennec , 2006). However, this strategy faces two challenges. First, it is required to bound the sensitivity in terms of the Riemannian distance, which could be difficult particularly for negatively curved manifolds, such as the symmetric positive definite manifold. Second, the generalization suffers from metric distortion by curvature and it requires a nontrivial adaptation of the proof strategy in the Euclidean space (Dwork and Roth , 2014). Nevertheless, it is worth mentioning that the Laplace mechanism (Dwork et al. , 2006) can be generalized with the triangle inequality of the Riemannian distance as has been done recently in Reimherr et al. (2021).
Instead, we consider directly adding noise to the tangent space of the manifold following an isotropic Gaussian distribution with respect to the Riemannian metric. In this case, we can measure the sensitivity on the tangent space. We highlight that although the tangent space can be identified as a Euclidean space, the proposed strategy differs from the classic Gaussian mechanism, which adds isotropic noise to each coordinate in the Euclidean space (where the metric tensor is identity). To this end, we define the tangent space Gaussian distribution as follows.
Definition 3
(Tangent space Gaussian distribution) For any \(w \in {\mathcal {M}}\), a tangent vector \(\xi \in T_w{\mathcal {M}}\) follows a tangent space Gaussian distribution at w, denoted as \(\xi \sim {\mathcal {N}}_{w}(\mu , \sigma ^2)\) with mean \(\mu \in T_w{\mathcal {M}}\) and standard deviation \(\sigma > 0\) if its density is given by \(p_w(\xi ) = C^{-1}_{w, \sigma } \exp (- \frac{\Vert \xi - \mu \Vert ^2_w}{2\sigma ^2})\), where \(C_{w, \sigma }\) is the normalizing constant.
Remark 1
In a normal coordinate system of the tangent space, we remark that \({\mathcal {N}}_w(\mu , \sigma ^2)\) is equivalent to a multivariate Gaussian whose covariance is a function of the metric tensor of the tangent space. Denote \(\vec {\xi } \in {\mathbb {R}}^{d}\) as the vectorization of the tangent vector \(\xi \in T_w{\mathcal {M}}\simeq {\mathbb {R}}^d\) in the normal coordinates. The density can then be written as \(p_w(\xi ) = C^{-1}_{w,\sigma } \exp (- \frac{1}{{2\sigma ^2}} (\vec {\xi } - \vec {\mu })^\top G_w ( \vec {\xi } - \vec {\mu }))\), where \(G_w\) is the (symmetric positive definite) metric tensor at w. This is a multivariate Gaussian distribution with mean \(\vec {\mu }\) and covariance \(\sigma ^2 G_w^{-1}\), i.e., \(\vec {\xi } \sim {\mathcal {N}}(\vec {\mu }, \sigma ^2 G_w^{-1})\). The normalizing constant is given by \(C_{w, \sigma } = \sqrt{(2\pi \sigma ^2)^d |G_w^{-1}|}\).
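Remark 1 also suggests a direct sampling route whenever the metric tensor \(G_w\) in a chosen basis is available in closed form: draw coordinates from \({\mathcal {N}}(\vec {\mu }, \sigma ^2 G_w^{-1})\) and map back to the tangent space. A minimal sketch (the helper name is ours); on manifolds where \(G_w\) is not tractable, we instead resort to MCMC sampling as described in Sect. 6.1:

```python
import numpy as np

def sample_tangent_gaussian(G_w, mu_coord, sigma, rng=np.random.default_rng()):
    """Sample coordinates of xi ~ N_w(mu, sigma^2), i.e. a Gaussian with
    covariance sigma^2 * G_w^{-1} in the chosen basis of T_w M."""
    d = G_w.shape[0]
    # If G_w = L L^T (Cholesky), then xi = mu + sigma * L^{-T} z with
    # z ~ N(0, I_d) has covariance sigma^2 (L L^T)^{-1} = sigma^2 G_w^{-1}.
    L = np.linalg.cholesky(G_w)
    z = rng.standard_normal(d)
    return mu_coord + sigma * np.linalg.solve(L.T, z)
```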
We next introduce a generalization of the Gaussian mechanism on the tangent space. We stress that the following Gaussian mechanism depends on the sensitivity measured in the Riemannian metric.
Proposition 2
(Tangent space Gaussian mechanism) Given a query function \(H: {\mathcal {D}}^n \xrightarrow {} T_w{\mathcal {M}}\), let \(\varDelta _H:= \sup _{D, D' \subset {\mathcal {D}}^n: D \sim D'} \Vert H(D) - H(D') \Vert _w\) be the global sensitivity of H with respect to the Riemannian metric. Define \({\mathcal {R}}(D) = H(D) + \xi\) where \(\xi \sim {\mathcal {N}}_w(0, \sigma ^2)\) with \(\sigma ^2 \ge {2\log (1.25/\delta )} \varDelta _H^2/\epsilon ^2\). Then, \({\mathcal {R}}\) is \((\epsilon , \delta )\)-differentially private.
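A minimal sketch of the resulting mechanism, assuming the query value and its global sensitivity are given in the same tangent-space coordinates, and reusing the sampler sketched after Remark 1:

```python
import numpy as np

def tangent_gaussian_mechanism(H_coord, G_w, sensitivity, eps, delta, rng=None):
    """(eps, delta)-DP release of a tangent-space query H(D) in T_w M, following
    Proposition 2: sigma^2 >= 2 log(1.25/delta) * Delta_H^2 / eps^2."""
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / eps
    rng = rng or np.random.default_rng()
    return sample_tangent_gaussian(G_w, H_coord, sigma, rng)
```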
To show privacy guarantees in the subsequent sections, we adapt the moments accountant technique in the Euclidean space (Abadi et al. , 2016) to Riemannian manifolds, which results in a tighter bound compared to the advanced composition (Dwork and Roth , 2014). To achieve this, we first provide results that bound the moments of a tangent space Gaussian mechanism (Proposition 2). The next two lemmas show upper bounds on \(K_{\mathcal {R}}(\lambda )\) under full datasets (Lemma 2) as well as under subsampling (Lemma 3).
Lemma 2
(Moments bound) Consider a query function \(H: {\mathcal {D}}^n \xrightarrow {} T_w{\mathcal {M}}\) for some \(w \in {\mathcal {M}}\). Given a dataset \(D = \{z_1, \ldots , z_n\} \subset {\mathcal {D}}^n\), suppose \(H(D) = \frac{1}{n} \sum _{i=1}^n h(z_i)\) where h is geodesic \(L_0\)-Lipschitz. Let \({\mathcal {R}}(D) = H(D) + \xi\), where \(\xi \sim {\mathcal {N}}_w(0, \sigma ^2)\). Then, the \(\lambda\)-th moment of \({\mathcal {R}}\) satisfies \(K_{\mathcal {R}}(\lambda ) \le \frac{2 \lambda (\lambda + 1) L_0^2}{n^2 \sigma ^2}\).
Lemma 3
(Moments bound under subsampling) Under the same settings as in Lemma 2, consider \(D_\textrm{sub}\) to be a subset of size b where samples are selected from D without replacement. Let \({\mathcal {R}}(D) = \frac{1}{b} \sum _{z \in D_\textrm{sub}} h(z) + \xi\), where \(\xi \sim {\mathcal {N}}_w(0, \sigma ^2)\). Suppose \(\sigma ^2 \ge \frac{4L_0^2}{b^2}, b < n, \lambda \le 2 \sigma ^2 \log (n/(b(\lambda + 1)(1 + \frac{b^2 \sigma ^2}{4 L_0^2})))/3\). Then, the \(\lambda\)-th moment of \({\mathcal {R}}\) satisfies \(K_{\mathcal {R}}(\lambda ) \le \frac{15 (\lambda +1) L_0^2}{n^2 \sigma ^2}\).
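To illustrate how such moments bounds translate into an overall privacy guarantee, the sketch below composes the bound of Lemma 2 over T iterations and converts it to \(\epsilon\) for a target \(\delta\) via the standard moments accountant tail bound \(\delta \le \exp (K(\lambda ) - \lambda \epsilon )\) (Abadi et al. , 2016). This is only an illustrative calculation under the stated bound, not a reproduction of the proof of Theorem 1:

```python
import numpy as np

def epsilon_from_moments(T, L0, n, sigma2, delta, lambdas=range(1, 64)):
    """Compose the per-step moments bound K(lam) <= 2 lam (lam+1) L0^2 / (n^2 sigma^2)
    (Lemma 2) over T steps and convert to epsilon for a target delta.
    The admissible range of lam from the lemmas is not enforced here."""
    best = np.inf
    for lam in lambdas:
        K_total = T * 2.0 * lam * (lam + 1) * L0**2 / (n**2 * sigma2)
        eps = (K_total + np.log(1.0 / delta)) / lam  # tail bound delta <= exp(K - lam*eps)
        best = min(best, eps)
    return best

# Example: n = 10_000 samples, L0 = 1, T = n iterations, sigma^2 = 1e-2.
print(epsilon_from_moments(T=10_000, L0=1.0, n=10_000, sigma2=1e-2, delta=1e-3))
```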
4 Differentially private Riemannian (stochastic) gradient descent
In this section, we introduce the differentially private Riemannian (stochastic) gradient descent method (Algorithm 1), where we add noise following the tangent space Gaussian distribution \({\mathcal {N}}_w(0, \sigma ^2)\). We show that, under a proper choice of parameters, the algorithm enjoys a privacy guarantee as well as utility guarantees under various function classes on Riemannian manifolds. It should be noted that when the Riemannian manifold is the Euclidean space, i.e., \({\mathcal {M}}\equiv {\mathbb {R}}^d\), Algorithm 1 reduces to the standard DP-GD and DP-SGD (Wang et al. , 2017; Abadi et al. , 2016; Bassily et al. , 2019).
In Algorithm 1, the samples are selected without replacement following Wang et al. (2019c), and thus, when \(b = n\), we recover the full gradient descent. The noise variance is chosen as \(\sigma ^2 = c \frac{T\log (1/\delta ) L_0^2}{n^2\epsilon ^2}\) to ensure \((\epsilon , \delta )\)-differential privacy (Theorem 1) for some constant c. We remark that the choice of \(\sigma\) matches the standard results in the Euclidean space for gradient descent (Wang et al. , 2017) and for stochastic gradient descent (Abadi et al. , 2016; Bassily et al. , 2019; Wang et al. , 2019b) up to some constants that may depend on the manifold of interest. The output of the algorithm depends on the function class of the objective, as discussed below.
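The sketch below illustrates the update just described, with the exponential map (or a retraction), per-sample Riemannian gradients, and a tangent-space noise sampler assumed to be given as callables. It is an illustrative rendering rather than the exact pseudocode of Algorithm 1, and the private output \(w^\textrm{priv}\) is formed from the returned iterates according to the function class (Sect. 4.1):

```python
import numpy as np

def dp_riemannian_sgd(data, w0, grad_f, exp_map, sample_noise,
                      T, batch_size, eta, sigma, rng=np.random.default_rng()):
    """Sketch of differentially private Riemannian (stochastic) gradient descent.

    grad_f(w, z)      : Riemannian gradient of the loss at w for sample z (tangent vector)
    exp_map(w, xi)    : exponential map (or retraction) at w applied to tangent vector xi
    sample_noise(w, s): draw from the tangent-space Gaussian N_w(0, s^2) at w
    """
    n = len(data)
    w = w0
    iterates = [w0]
    for _ in range(T):
        # Subsample a batch without replacement, as in Wang et al. (2019c).
        idx = rng.choice(n, size=batch_size, replace=False)
        g = sum(grad_f(w, data[i]) for i in idx) / batch_size
        # Perturb the Riemannian gradient with tangent-space Gaussian noise.
        zeta = g + sample_noise(w, sigma)
        # Gradient step along the manifold.
        w = exp_map(w, -eta * zeta)
        iterates.append(w)
    return iterates  # w_priv is formed from the iterates per Sect. 4.1
```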
4.1 Privacy and utility guarantees
Theorem 1
Algorithm 1 is \((\epsilon , \delta )\)-differentially private.
The proof relies on Lemma 2 and Lemma 3 to bound the moments of the randomized mapping at each iteration.
For utility analysis, we start by assuming that all the iterates \(w_0, \ldots , w_T\) stay within a compact set \({\mathcal {W}}\subseteq {\mathcal {M}}\) that contains a stationary point \(w^*\) (i.e., \(\textrm{grad}F(w^*) = 0\)). Let \(c_l = \inf _{w \in {\mathcal {W}}} \lambda _{\min }(G_w)\) where \(\lambda _{\min }(G_w)\) is the minimum eigenvalue of the metric tensor at w. We first show that the expected norm of the noise-injected gradient can be bounded as follows.
Lemma 4
Suppose f(w; z) is geodesic \(L_0\)-Lipschitz for any z. Consider a batch \({\mathcal {B}}\) of size b and \(w \in {\mathcal {W}}\). Let \(\zeta = \frac{1}{b} \sum _{z \in {\mathcal {B}}} \textrm{grad}f(w; z) + \epsilon\) where \(\epsilon \sim {\mathcal {N}}_w(0, \sigma ^2)\). Then, we have \({\mathbb {E}}[\zeta ] = \textrm{grad}F(w)\) and \({\mathbb {E}}\Vert \zeta \Vert _w^2 \le L_0^2 + d\sigma ^2 c_l^{-1}\), where the expectation is over randomness in both \({\mathcal {B}}\) and \(\epsilon\).
Next we show utility guarantees under various function classes on Riemannian manifolds, including geodesic (strongly) convex, general nonconvex and functions that satisfy Riemannian PL condition. It has been shown that many nonconvex problems in the Euclidean space are in fact geodesic (strongly) convex or satisfy the Riemannian PL condition on the manifold. This allows tighter utility bounds compared to differentially private projected gradient methods. Some examples of such problems are given in Sect. 5.
Geodesic convex optimization. When f(w; z) is geodesic convex over \({\mathcal {W}}\) for any \(z \in {\mathcal {D}}\), \(w^*\) is a global minimum of F(w). The utility of Algorithm 1 is measured as the expected empirical excess risk \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*)\).
Theorem 2
Suppose f(w; z) is geodesic convex, geodesic \(L_0\)-Lipschitz over \({\mathcal {W}}\). Assume \({\mathcal {W}}\) to be a totally normal neighbourhood with diameter \(D_{\mathcal {W}}\). Let \(\varsigma\) be the curvature constant of \({\mathcal {W}}\) defined in Lemma 1. Consider Algorithm 1 with output 3 where \(w^\textrm{priv} = \bar{w}_{T-1}\) is computed by geodesic averaging as follows: set \(\bar{w}_1 = w_1\) and \(\bar{w}_{t+1} = \textrm{Exp}_{\bar{w}_t}\left( \frac{1}{t+1} \textrm{Exp}^{-1}_{\bar{w}_t}(w_{t+1}) \right)\). Set \(\eta _t = \frac{D_{\mathcal {W}}}{\sqrt{(L_0^2 + d \sigma ^2 c_l^{-1}) \varsigma T}}\). Then, \(w^\textrm{priv}\) satisfies \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*) \le D_{\mathcal {W}}\sqrt{\frac{(L_0^2 + d \sigma ^2 c_l^{-1}) \varsigma }{T}} = O\left( \frac{\sqrt{d \log (1/\delta ) c_l^{-1} \varsigma } L_0 D_{\mathcal {W}}}{n\epsilon } \right)\), for the choice of \(\sigma ^2\) in Algorithm 1 and \(T = n^2\).
Geodesic strongly convex optimization. Under geodesic strong convexity, the global optimizer \(w^*\) is unique and we use the same measure for the utility.
Theorem 3
Suppose f(w; z) is geodesic \(\beta\)-strongly convex, geodesic \(L_0\)-Lipschitz over \({\mathcal {W}}\). Assume \({\mathcal {W}}\) to be a totally normal neighbourhood with the curvature constant \(\varsigma\). Consider Algorithm 1 with output 3 where \(w^\textrm{priv} = \bar{w}_{T-1}\) is the geodesic averaging by setting \(\bar{w}_1 = w_1\), \(\bar{w}_{t+1} = \textrm{Exp}_{\bar{w}_t}\left( \frac{2}{t+1} \textrm{Exp}^{-1}_{\bar{w}_t} (w_{t+1}) \right)\). Set \(\eta _t = \frac{1}{\beta (t+1)}\). Then, \(w^\textrm{priv}\) satisfies \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*) \le \frac{2 \varsigma (L_0^2 + d \sigma ^2 c_l^{-1})}{\beta T} = O \left( \frac{\beta ^{-1} d \log (1/\delta ) c_l^{-1} \varsigma L_0^2}{ n^2 \epsilon ^2} \right)\), with the choice of \(\sigma ^2\) in Algorithm 1 and \(T = n^2\).
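The private output in Theorems 2 and 3 is a running geodesic average of the iterates. A minimal sketch, assuming the exponential map and its inverse are available as callables returning array-like tangent vectors:

```python
def geodesic_running_average(iterates, exp_map, log_map, strongly_convex=False):
    """Running geodesic average: w_bar_{t+1} = Exp_{w_bar_t}(c_t * Log_{w_bar_t}(w_{t+1})),
    with c_t = 1/(t+1) (geodesic convex, Theorem 2) or 2/(t+1) (geodesic strongly
    convex, Theorem 3)."""
    w_bar = iterates[0]
    for t, w_next in enumerate(iterates[1:], start=1):
        c_t = (2.0 if strongly_convex else 1.0) / (t + 1)
        # log_map(w_bar, w_next) is the tangent vector Exp^{-1}_{w_bar}(w_next);
        # scaling it by c_t and mapping back gives the new running average.
        w_bar = exp_map(w_bar, c_t * log_map(w_bar, w_next))
    return w_bar
```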
Optimization under Riemannian PL condition. We now consider the case when the objective F satisfies the Riemannian Polyak-Łojasiewicz (PL) condition. It is known that the Polyak-Łojasiewicz condition is sufficient to establish linear convergence to the global minimum (Karimi et al. , 2016). In addition, the Riemannian PL condition includes geodesic strong convexity as a special case (Zhang et al. , 2016).
Theorem 4
Suppose f(w; z) is geodesic \(L_0\)-Lipschitz and \(L_1\)-smooth over \({\mathcal {W}}\) and F(w) satisfies the Riemannian PL condition with parameter \(\tau\), i.e., \(F(w) - F(w^*) \le \tau \Vert \textrm{grad}F(w) \Vert ^2_w\). Consider Algorithm 1 with output 1. Set \(\eta _t = \eta < \min \{1/L_1, 1/\tau \}\) and \(T = \log (\frac{n^2\epsilon ^2 c_l}{dL_0^2 \log (1/\delta )})\). Then, \(w^\textrm{priv}\) satisfies \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*) \le O \left( \frac{\tau ^{-1} d \log (n) \log (1/\delta ) c_l^{-1} L_0^2 \varDelta _0}{n^2\epsilon ^2} \right)\), where we denote \(\varDelta _0:= F(w_0) - F(w^*)\).
General nonconvex optimization. Under general nonconvex objectives, we show utility bound with respect to the expected gradient norm squared, i.e., \({\mathbb {E}}\Vert \textrm{grad}F(w^\textrm{priv}) \Vert ^2_{w^\textrm{priv}}\), which has also been considered in the Euclidean space (Zhang et al. , 2017; Wang et al. , 2017).
Theorem 5
Suppose f(w; z) is geodesic \(L_0\)-Lipschitz and \(L_1\)-smooth. Consider Algorithm 1 with output 2. Set \(\eta _t = \eta = 1/L_1\) and \(T = \frac{\sqrt{L_1} n \epsilon }{\sqrt{d \log (1/\delta ) c_l^{-1}}L_0}\). Then, \(w^\textrm{priv}\) satisfies \({\mathbb {E}}\Vert \textrm{grad}F(w^\textrm{priv}) \Vert ^2_{w^\textrm{priv}} \le O \left( \frac{\sqrt{d L_1 \log (1/\delta ) c_l^{-1}} L_0}{n\epsilon } \right)\).
Remark 2
Given that the Euclidean space is a special case of a Riemannian manifold, we observe that the utility bounds obtained in Theorems 2 to 5 match the standard results in the Euclidean space (Bassily et al. , 2014; Wang et al. , 2017) up to some constants that depend on the curvature and diameter bound of the manifold considered. In addition, we highlight that our bounds depend on the intrinsic dimension of the manifold, rather than that of the ambient space. This suggests a potentially significant improvement in utility when preserving privacy under the proposed Riemannian framework as opposed to in the ambient Euclidean space.
Extension to general retraction. So far we have focused on the use of the exponential map in Algorithm 1. However, for some manifolds, the exponential map can be computationally costly to evaluate, e.g., on the symmetric positive definite (SPD) manifold. In such cases, we can replace the exponential map for the gradient update with its first-order approximation, known as the retraction (Absil et al. , 2009). Given that the proposed privatization in Algorithm 1 is on the Riemannian gradient, the privacy guarantees remain unchanged with the use of retraction based on the composition theorem. Similarly, the utility guarantees are also preserved under retraction, primarily because the retraction is a first-order approximation to the exponential map. In particular, the same utility bounds hold up to some constants that upper bound the deviation between the exponential map and the retraction. See Han and Gao (2021a), Sato et al. (2019), Kasai et al. (2018) for more detailed discussions on convergence analysis with retraction.
5 Applications
We explore various applications to demonstrate the versatility and efficacy of the proposed framework of differentially private Riemannian optimization.
5.1 Private principal eigenvector computation over sphere manifold
We consider the problem of computing the principal eigenvector of a sample covariance matrix (Shamir , 2015; Zhang et al. , 2016) by solving

\(\min _{w \in {\mathcal {S}}^{d}} F(w):= -\frac{1}{n} \sum _{i=1}^n (w^\top z_i)^2, \qquad \qquad (2)\)
where \({\mathcal {S}}^{d} = \{ w \in {\mathbb {R}}^{d+1}: \Vert w \Vert _2 = w^\top w = 1 \}\) is the sphere manifold of intrinsic dimension d and \(z_1, \ldots , z_n \in {\mathbb {R}}^{d+1}\) are zero-centered samples. Sphere manifold is an embedded submanifold of \({\mathbb {R}}^{d+1}\) with the tangent space \(T_w{\mathcal {S}}^{d} = \{ \xi \in {\mathbb {R}}^{d+1}: w^\top \xi = 0 \}\). A common Riemannian metric used is the Euclidean metric, i.e., \(\langle \xi , \zeta \rangle _w = \xi ^\top \zeta\) for any \(\xi , \zeta \in T_w{\mathcal {S}}^{d}\). The Riemannian gradient on \(T_w{\mathcal {S}}^{d}\) is \(\textrm{grad}F(w) = (I_{d+1} - ww^\top ) \nabla F(w)\), where \(I_{d+1}\) is the identity matrix of size \((d+1)\times (d+1)\) and \(\nabla F(w)\) is the Euclidean gradient.
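For concreteness, a minimal sketch of these sphere operations, assuming the formulation in (2) (the function names are ours):

```python
import numpy as np

def sphere_rgrad(w, egrad):
    """Riemannian gradient on the sphere: project the Euclidean gradient
    onto the tangent space T_w S^d = {xi : w^T xi = 0}."""
    return egrad - np.dot(w, egrad) * w          # (I - w w^T) egrad

def sphere_exp(w, xi):
    """Exponential map on the sphere along the tangent vector xi."""
    norm = np.linalg.norm(xi)
    if norm < 1e-12:
        return w
    return np.cos(norm) * w + np.sin(norm) * (xi / norm)

def pca_rgrad(w, z):
    """Per-sample Riemannian gradient of f(w; z) = -(w^T z)^2 in (2):
    project the Euclidean gradient -2 (z^T w) z onto T_w S^d."""
    return sphere_rgrad(w, -2.0 * np.dot(z, w) * z)
```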
Utility. It can be shown from Theorem 6 (in the appendix) that problem (2), although being nonconvex in the ambient Euclidean space, satisfies the Riemannian PL condition on \({\mathcal {S}}^{d}\) locally with \(\tau = O(d/q^2 \nu )\) with probability \(1- q\) where \(\nu\) is the difference between the largest two eigenvalues of \(\frac{1}{n} \sum _{i=1}^n z_iz_i^\top\). Also, we see that the metric tensor \(G_w = I_{d}\), and hence, \(c_l = 1\). This is because given an orthonormal basis on the tangent space \(T_w{\mathcal {S}}^{d}\), denoted as \(B \in {\mathbb {R}}^{(d+1) \times d}\), we have \(\langle \xi , \zeta \rangle _w = \langle {\xi }, \zeta \rangle _2 = \langle B \vec {\xi }, B \vec {\zeta } \rangle _2 = \langle \vec {\xi }, \vec {\zeta } \rangle _2\), where \(\vec {\xi }, \vec {\zeta } \in {\mathbb {R}}^{d}\) is the vectorization under the coordinate transformation given by B. In addition, the geodesic Lipschitz constant \(L_0\) is bounded as \(\Vert \textrm{grad}f(w; z_i) \Vert _w = \Vert 2(I_{d+1} - ww^\top ) (z_i z_i^\top ) w \Vert _2 \le 2\Vert (z_i z_i^\top ) w \Vert _2 \le \vartheta := 2 \max _{i} z_i^\top z_i\). Applying Theorem 4 for utility under the Riemannian PL condition, we can show if properly initialized, the expected empirical excess risk of Algorithm 1 is bounded as \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*) \le \widetilde{O} \left( \frac{ d q^2 \nu \vartheta ^2 \varDelta _0}{n^2\epsilon ^2} \right)\), where we have \(c_l = 1, L_0 = \vartheta\) and we use \(\widetilde{O}\) to hide the logarithmic factors.
Remark 3
(Comparison with perturbed projected gradient descent) Considering the alternative approach of using projected gradient descent with gradient perturbation in the ambient space for solving the problem (2), one can only guarantee a utility in the gradient norm, i.e., \({\mathbb {E}}[\Vert \nabla F(w^\textrm{priv}) \Vert _2^2] \le \widetilde{O} \left( \frac{ \vartheta }{n\epsilon } \right)\) due to the nonconvexity of the problem (Wang et al. , 2017). This results in a looser bound compared to our obtained utility guarantee above, which is measured in terms of the excess risk and scales with \(n^{-2}\) rather than \(n^{-1}\).
5.2 Private Fréchet mean computation over SPD manifold
The second application we consider is the problem of Fréchet mean computation (Bhatia , 2009; Jeuris et al. , 2012) over the manifold of \(r \times r\) symmetric positive definite (SPD) matrices, denoted as \({\mathbb {S}}_{++}^r\). Given a set of SPD matrices \(X_1, \ldots , X_n \in {\mathbb {S}}_{++}^r\), the goal is to find a center \(W \in {\mathbb {S}}_{++}^r\) by solving

\(\min _{W \in {\mathbb {S}}_{++}^r} F(W):= \frac{1}{n} \sum _{i=1}^n \Vert \textrm{logm}(W^{-1/2} X_i W^{-1/2}) \Vert _F^2, \qquad \qquad (3)\)
where \(\textrm{logm}(\cdot )\) represents the principal matrix logarithm. The tangent space of the set of SPD matrices is given as \(T_W{\mathbb {S}}_{++}^r = \{ U \in {\mathbb {R}}^{r \times r}: U^\top = U \}\), i.e., the set of symmetric matrices. It can be shown that the function \(f(W; X_i)\) in (3) is the Riemannian distance squared \(\textrm{dist}^2(W, X_i)\) associated with the affine-invariant Riemannian metric (Skovgaard , 1984; Bhatia , 2009), defined as \(\langle U, V \rangle _W = \textrm{tr}(U W^{-1} V W^{-1})\).
Utility. Problem (3) is known to be nonconvex in the Euclidean space while being 2-geodesic strongly convex on the SPD manifold with the affine-invariant metric (Zhang et al. , 2016; Zhang and Sra , 2016). In addition, the Lipschitz constant can be bounded as follows. First, the Riemannian gradient of problem (3) is derived as \(\textrm{grad}f(W; X_i) = W \textrm{logm}( W^{-1} X_i )\) and \(\Vert \textrm{grad}f(W; X_i) \Vert _W = \Vert \textrm{logm}( W^{-1} X_i ) \Vert _F = \textrm{dist}(W, X_i) \le D_{\mathcal {W}}= L_0\). Also, it is known that the SPD manifold with the affine-invariant metric is negatively curved with the curvature constant \(\varsigma \le 2 \sqrt{|\kappa _{\min }|} D_{\mathcal {W}}+1\) according to Lemma 1. Applying Theorem 3, we see that the utility of Algorithm 1 is \({\mathbb {E}}[F(W^\textrm{priv})] - F(W^*) \le \widetilde{O} \left( \frac{d c_l^{-1} \sqrt{|\kappa _\textrm{min}|} D_{\mathcal {W}}^3}{ n^2 \epsilon ^2} \right)\), where the intrinsic dimension of the SPD manifold \({\mathbb {S}}_{++}^r\) is \(d = r(r+1)/2\).
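For concreteness, a minimal sketch of the gradient and metric norm stated above, using SciPy's matrix functions (the real-part cast and symmetrization are numerical safeguards we add):

```python
import numpy as np
from scipy.linalg import logm, solve

def spd_affine_invariant_grad(W, X):
    """Riemannian gradient of f(W; X) = dist^2(W, X) under the affine-invariant
    metric, as stated in Sect. 5.2: grad f(W; X) = W logm(W^{-1} X)."""
    G = W @ np.real(logm(solve(W, X)))   # solve(W, X) computes W^{-1} X
    return (G + G.T) / 2                 # symmetrize to stay in T_W S++ numerically

def affine_invariant_norm(W, U):
    """Metric norm ||U||_W = sqrt(tr(U W^{-1} U W^{-1}))."""
    Winv_U = solve(W, U)
    return np.sqrt(np.trace(Winv_U @ Winv_U))
```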
Remark 4
(Comparison to Reimherr et al. (2021) and Soto et al. (2022)) Recent works (Reimherr et al. , 2021; Soto et al. , 2022) have considered \(\epsilon\)-pure differential privacy (with \(\delta = 0\)) for privatizing the Fréchet mean. Reimherr et al. (2021) perturbs the output on the manifold with Laplace noise, with the utility given by \(\textrm{dist}^2(W^\textrm{priv}, W^*) \le O\left( \frac{d^2 D_{\mathcal {W}}^2}{n^2 \epsilon ^2} \right)\). In Soto et al. (2022), the Fréchet mean is privatized by sampling from the density \(g(W; \{ X_i\}_{i=1}^n) \propto \exp \left( - \frac{1}{\sigma n} \Vert \sum _{i=1}^n \textrm{grad}f(W; X_i) \Vert _W \right)\), known as the K-norm gradient mechanism (Reimherr and Awan , 2019). Their utility is given by \(O(\frac{d^2}{n^2\epsilon ^2})\). In contrast, our Algorithm 1 preserves \((\epsilon , \delta )\)-differential privacy with a utility of \(\textrm{dist}^2(W^\textrm{priv}, W^*) \le \widetilde{O}\left( \frac{d D_{\mathcal {W}}^3}{n^2\epsilon ^2} \right)\), obtained by applying the triangle inequality to the expected empirical excess risk and ignoring other factors. We see that our utility matches the state-of-the-art results in Reimherr et al. (2021), Soto et al. (2022).
In terms of the noise level, Reimherr et al. (2021) requires the noise rate \(\sigma = \frac{2\varDelta _\textrm{FM}}{\epsilon } = \frac{4D_{\mathcal {W}}}{n \epsilon }\), since the sensitivity is \(\varDelta _\textrm{FM} = \frac{2D_{\mathcal {W}}}{n}\) for non-positively curved manifolds (Reimherr et al. 2021, Theorem 2). The noise level required in Soto et al. (2022) is \(\sigma = \frac{4D_{\mathcal {W}}}{n\epsilon }\), which is the same as in Reimherr et al. (2021). The proposed Algorithm 1 requires \(\sigma = O( \frac{\sqrt{T\log (1/\delta )} D_{\mathcal {W}}}{n\epsilon })\). Nevertheless, it should be emphasized that the noise distributions in Reimherr et al. (2021) and Soto et al. (2022) are different from ours, which is the tangent space Gaussian distribution (Definition 3).
5.3 Private Wasserstein barycenter over SPD manifold
When endowing the SPD matrices with Bures-Wasserstein metric (Malagò et al. , 2018; Bhatia et al. , 2019), i.e., \(\langle U, V \rangle _W = \frac{1}{2} \textrm{tr}( {\mathcal {L}}_W[U] V)\) where \({\mathcal {L}}_W[U]\) is the solution to the equation \(W {\mathcal {L}}_W[U] + {\mathcal {L}}_W[U] W = U\), one can show that the induced Riemannian distance between \(\varSigma _1, \varSigma _2 \in {\mathbb {S}}_{++}^d\) recovers the 2-Wasserstein distance between zero-centered Gaussian distributions with covariance \(\varSigma _1, \varSigma _2\), i.e., \(\textrm{dist}_\textrm{bw}^2(\varSigma _1, \varSigma _2) = \textrm{tr}(\varSigma _1) + \textrm{tr}(\varSigma _2) -2 \textrm{tr}( \varSigma _1^{1/2} \varSigma _2 \varSigma _1^{1/2})^{1/2}\) where \(\textrm{tr}(X)^{1/2}\) represents the trace of matrix square root. Here we consider solving the Fréchet mean problem (3) with the Bures-Wasserstein metric, i.e., \(f(W; X_i) = \textrm{dist}_\textrm{bw}^2(W, X_i)\), which amounts to finding the Wasserstein barycenter of the zero-centered Gaussian distributions with covariances \(X_i\), \(i = 1,...,n\).
Utility. From Han et al. (2021a), we see that the curvature \(\kappa\) of the SPD manifold with Bures-Wasserstein metric is non-negative (as opposed to non-positive with the affine-invariant metric in Sect. 5.2). The curvature is bounded by \(\kappa _{\min } = 0\) and \(\kappa _{\max } = \sup _{W \in {\mathcal {W}}} \frac{3}{\lambda _r(W) + \lambda _{r-1}(W)}\) where \(\lambda _i(W)\) is the i-th largest eigenvalue of \(W \in {\mathbb {S}}_{++}^r\). Applying the results in Alimisis et al. (2020), problem (3) is geodesic strongly convex when the diameter of the domain \(D_{\mathcal {W}}\) satisfies \(D_{\mathcal {W}}< \frac{\pi }{2 \sqrt{\kappa _{\max }}}\). Its Riemannian gradient can be derived as \(\textrm{grad}f(W; X_i) = - (WX_i)^{1/2} - (X_i W)^{1/2} + 2 W\), and it can be verified that \(\Vert \textrm{grad}f(W; X_i) \Vert _W = \textrm{dist}_\textrm{bw}(W, X_i) \le D_{\mathcal {W}}< \frac{\pi }{2\sqrt{\kappa _\textrm{max}}}\). The curvature constant in Lemma 1 is given by \(\varsigma = 1\) and according to Theorem 3, the utility is \({\mathbb {E}}[F(W^\textrm{priv})] - F(W^*) \le \widetilde{O} \left( \frac{d c_l^{-1} h_{\mathcal {W}}^{-1} D_{\mathcal {W}}^2 }{ n^2 \epsilon ^2} \right)\) where \(h_{\mathcal {W}}= \sqrt{\kappa _{\max }} D_{\mathcal {W}}\cot (\sqrt{\kappa _{\max }} D_{\mathcal {W}})\).
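Similarly, a minimal sketch of the Bures-Wasserstein quantities used above (the real-part casts are numerical safeguards, since sqrtm of the nonsymmetric products WX and XW can pick up negligible imaginary parts):

```python
import numpy as np
from scipy.linalg import sqrtm

def bw_dist2(S1, S2):
    """Squared Bures-Wasserstein distance between SPD matrices S1, S2."""
    S1_half = np.real(sqrtm(S1))
    cross = np.real(sqrtm(S1_half @ S2 @ S1_half))
    return np.trace(S1) + np.trace(S2) - 2.0 * np.trace(cross)

def bw_grad(W, X):
    """Riemannian gradient of f(W; X) = dist_bw^2(W, X) as stated in Sect. 5.3:
    grad f(W; X) = -(W X)^{1/2} - (X W)^{1/2} + 2 W."""
    return -np.real(sqrtm(W @ X)) - np.real(sqrtm(X @ W)) + 2.0 * W
```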
Remark 5
(Comparison to Reimherr et al. (2021) and Soto et al. (2022) for the Bures-Wasserstein metric) In Reimherr et al. (2021), the utility is again \(O\left( \frac{d^2 D_{\mathcal {W}}^2}{n^2 \epsilon ^2} \right)\) with a required noise level \(\sigma = \frac{\varDelta _\textrm{FM}}{\epsilon } = \frac{2 D_{\mathcal {W}}(2- h_{\mathcal {W}}) h_{\mathcal {W}}^{-1}}{n \epsilon }\), when the manifold is positively curved, as per (Reimherr et al. 2021, Theorem 2). In addition, Soto et al. (2022) requires \(\sigma = \frac{2 D_{\mathcal {W}}(2-h_{\mathcal {W}})}{n}\), which is strictly smaller than that of Reimherr et al. (2021) given \(h_{\mathcal {W}}< 1\). The utility, however, remains the same at \(O(\frac{d^2}{n^2\epsilon ^2})\). In contrast, Algorithm 1 presents a utility of \(\widetilde{O} \left( \frac{d h_{\mathcal {W}}^{-1} D_{\mathcal {W}}^2 }{ n^2 \epsilon ^2} \right)\) with a noise level \(\sigma = O( \frac{\sqrt{T\log (1/\delta )} D_{\mathcal {W}}}{n\epsilon })\).
5.4 Private hyperbolic structured prediction
The hyperbolic manifold has shown great promise in representation learning of complex structured data, including word semantics (Nickel and Kiela , 2017; Tifrea et al. , 2018), graphs (Chami et al. , 2019), images (Khrulkov et al. , 2020), among others. The hyperbolic manifold can be equivalently represented under multiple models. The Lorentz hyperboloid model is given by \({\mathcal {H}}^d:= \{ w \in {\mathbb {R}}^{d+1}: \langle w, w \rangle _{{\mathcal {L}}} = -1 \}\) where \(\langle x, y \rangle _{{\mathcal {L}}} = - x_0 y_0 + \sum _{i=1}^d x_i y_i\). The tangent space is given by \(T_w{\mathcal {H}}^d = \{ u \in {\mathbb {R}}^{d+1}: \langle w, u\rangle _{\mathcal {L}}= 0 \}\) with the Riemannian metric as \(\langle \cdot , \cdot \rangle _w = \langle \cdot , \cdot \rangle _{\mathcal {L}}\). The Riemannian distance is \(\textrm{dist}_{\mathcal {L}}(x, y) = \textrm{arccosh}(- \langle x, y\rangle _{\mathcal {L}})\) for \(x, y \in {\mathcal {H}}^d\).
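For concreteness, a minimal sketch of the Lorentz model operations just described; the tangent-space projection is not stated above and is included here as a standard helper:

```python
import numpy as np

def lorentz_inner(x, y):
    """Minkowski inner product <x, y>_L = -x_0 y_0 + sum_i x_i y_i."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_dist(x, y):
    """Riemannian distance on H^d: arccosh(-<x, y>_L)."""
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

def lorentz_tangent_proj(w, u):
    """Project an ambient vector u onto T_w H^d = {u : <w, u>_L = 0},
    using <w, w>_L = -1."""
    return u + lorentz_inner(w, u) * w
```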
In Marconi et al. (2020), hyperbolic structured prediction (also known as hyperbolic regression) has been considered to infer the taxonomy embeddings for unseen words. Consider a set of training pairs \(\{ x_i, y_i \}_{i=1}^n\) where \(x_i \in {\mathbb {R}}^r\) are the features and \(y_i \in {\mathcal {H}}^d\) are the hyperbolic embeddings of the class of \(x_i\), which are learned following (Nickel and Kiela , 2017). For a test sample x, the goal is to predict its hyperbolic embedding by solving

\(\min _{w \in {\mathcal {H}}^d} F(w):= \frac{1}{n} \sum _{i=1}^n \alpha _i(x) \, \textrm{dist}_{\mathcal {L}}^2(w, y_i), \qquad \qquad (4)\)
where \(\alpha (x) = (\alpha _1(x),\ldots , \alpha _n(x))^\top \in {\mathbb {R}}^n\) are weights learned via a variant of kernel ridge regression (Rudi et al. , 2018), i.e., \(\alpha (x) = (K + \gamma I)^{-1} K_x\) where \(\gamma > 0\) is the regularization parameter and \(K \in {\mathbb {R}}^{n \times n}, K_x \in {\mathbb {R}}^{n}\) are computed with \(K_{ij} = k(x_i, x_j)\) and \((K_x)_i = k(x_i, x)\) for some kernel function \(k(\cdot , \cdot )\). Problem (4) can be solved via Riemannian (stochastic) gradient descent. Here, we wish to privatize the prediction of the hyperbolic embedding against the training data pairs \(\{ x_i, y_i\}\), which may contain sensitive information.
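A minimal sketch of computing the weights \(\alpha (x)\), using the RBF kernel adopted later in Sect. 6.4 (the function names are ours):

```python
import numpy as np

def rbf_kernel(A, B, upsilon):
    """k(a, b) = exp(-||a - b||^2 / (2 upsilon^2)) for rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * upsilon ** 2))

def structured_prediction_weights(X_train, x_test, gamma, upsilon):
    """alpha(x) = (K + gamma I)^{-1} K_x, as used in problem (4)."""
    K = rbf_kernel(X_train, X_train, upsilon)            # (n, n)
    K_x = rbf_kernel(X_train, x_test[None, :], upsilon)  # (n, 1)
    n = K.shape[0]
    return np.linalg.solve(K + gamma * np.eye(n), K_x).ravel()
```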
Utility. It is worth mentioning that problem (4) can be seen as computing the weighted Fréchet mean on the hyperbolic manifold. However, it is not guaranteed that the weights \(\alpha _i(x)\) are all positive. When \(\alpha _i(x) \ge 0\) for all i, problem (4) can be easily verified to be \(\left( 2/n \sum _{i = 1}^n \alpha _i(x) \right)\)-geodesic strongly convex given that \(\textrm{dist}_{{\mathcal {L}}}^2(w, y_i)\) is 2-geodesic strongly convex for negatively curved space (Alimisis et al. , 2020). In addition, following similar derivation as before, we can show \(\Vert \textrm{grad}f_i(w)\Vert _w = \alpha _i(x) \Vert \textrm{Exp}_{w}^{-1}(y_i) \Vert _w \le \alpha _i(x) \textrm{dist}_{\mathcal {L}}(w, y_i) \le {\alpha }_{i}(x) D_{\mathcal {W}}\). From Lemma 1, we have \(\varsigma \le 2 D_{\mathcal {W}}+ 1\) given \(\kappa _\textrm{min} = \kappa = -1\) for the hyperbolic space. Applying Theorem 3, the utility of Algorithm 1 on problem (4) is \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*) \le \widetilde{O} \left( \frac{dD_{\mathcal {W}}^3}{ n_\alpha n\epsilon ^2} \right)\) where we let \(n_\alpha := \max _i \frac{\alpha _i(x)}{\sum _{i=1}^n \alpha _i(x)}\) and we use the fact that \(c_l = 1\).
On the other hand, when \(\alpha _i(x) < 0\) for some i, the problem is in general nonconvex, where we apply Theorem 5 for the utility guarantee. Based on (Alimisis et al. 2020, Lemma 2), one can show the Riemannian Hessian of the Riemannian distance satisfies \(\Vert \textrm{Hess}\, \textrm{dist}_{\mathcal {L}}^2(w, y_i) [u] \Vert _w \le L_1 \Vert u \Vert _w\) for any \(u \in T_w{\mathcal {M}}\) where \(L_1 = \frac{\sqrt{|\kappa _\textrm{min}|} D_{\mathcal {W}}}{\tanh (\sqrt{|\kappa _\textrm{min}|} D_{\mathcal {W}})} =\frac{D_{\mathcal {W}}}{\tanh (D_{\mathcal {W}})}\). The Lipschitz constant \(L_0 = \max _{i} |\alpha _i(x)| D_{\mathcal {W}}\). Hence from Theorem 5, the utility bound is given by \(O\left( \frac{\sqrt{d D_{\mathcal {W}}^3\log (1/\delta )} \max _i |\alpha _i(x)|}{ \sqrt{\tanh (D_{\mathcal {W}})} n \epsilon } \right)\).
5.5 Private low rank matrix completion over Grassmann manifold
Low rank matrix completion (LRMC) (Candès and Recht , 2009; Jain et al. , 2013) aims to recover a matrix from partially observed entries by assuming a low-rank structure, and has been studied via Riemannian optimization (Mishra et al. , 2012; Vandereycken , 2013; Boumal and Absil , 2011a; Boumal and Absil , 2015). In particular, we follow Boumal and Absil (2011a, 2015) to cast the problem as optimization over the Grassmann manifold \({\mathcal {G}}(d, r)\), i.e., the set of r-dimensional subspaces in \({\mathbb {R}}^d\) \((d \ge r)\). Please refer to the appendix for details on the Grassmann manifold. The LRMC problem is defined as

\(\min _{W \in {\mathcal {G}}(d, r)} F(W):= \frac{1}{n} \sum _{i=1}^n \min _{v_i \in {\mathbb {R}}^r} \Vert \mathbbm {1}_{\varOmega _i} \odot (W v_i - x_i) \Vert _2^2, \qquad \qquad (5)\)
where \(\varOmega = \{(i,j): X_{ij} \; \textrm{observed} \}\) and \(\mathbbm {1}_{\varOmega } = A \in {\mathbb {R}}^{d \times n}\) with \(A_{ij} = 1\) if \((i,j) \in \varOmega\) and \(A_{ij} = 0\) otherwise. \(\odot\) represents the elementwise product, and \(x_i, v_i, \mathbbm {1}_{\varOmega _i}\) are the i-th columns of the matrices \(X, V, \mathbbm {1}_{\varOmega }\), respectively. We highlight that given \(W \in {\mathcal {G}}(d,r)\), V has a closed-form solution by least squares. The problem is nonconvex both in the Euclidean space and on the manifold, with the utility given by Theorem 5, which matches that of the Euclidean solvers up to some constants.
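As an illustration of the closed-form least-squares step mentioned above, a minimal sketch assuming W is stored as a \(d \times r\) basis matrix of the subspace (the function name is ours):

```python
import numpy as np

def lrmc_coefficients(W, x_col, mask_col):
    """Closed-form least squares for v_i in problem (5):
    v_i = argmin_v || 1_{Omega_i} .* (W v - x_i) ||_2^2,
    using only the observed entries of the column."""
    rows = np.flatnonzero(mask_col)   # observed entries of this column
    W_obs = W[rows, :]                # restrict the basis to observed rows
    v, *_ = np.linalg.lstsq(W_obs, x_col[rows], rcond=None)
    return v
```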
6 Experiments
This section presents experimental results on the applications introduced in Sect. 5. The experiments are implemented using the Manopt toolbox (Boumal et al. , 2014) in Matlab, except for the hyperbolic application where we use PyTorch. The code is available at https://github.com/andyjm3/DP-RiemOpt.
6.1 Results on private Fréchet mean on SPD manifold
We follow the steps in Reimherr et al. (2021) to generate synthetic samples on the SPD manifold \({\mathbb {S}}_{++}^r\) following the Wishart distribution \(W(I_{r}/r, r)\) with a diameter bound \(D_{\mathcal {W}}\). The optimal solution \(W^*\) is obtained by running the Riemannian gradient descent (RGD) method on problem (3) until the gradient norm falls below \(10^{-14}\). For this example, we choose \(r = 2\) and set \(T = n\), which we find empirically better than \(n^2\) (as suggested in Theorem 3). We choose \(\sigma ^2 = T \log (1/\delta ) D_{\mathcal {W}}^2/(n^2\epsilon ^2)\) where \(\epsilon = 0.1, \delta = 10^{-3}, D_{\mathcal {W}}= 1\), \(\eta = 0.01\). To sample noise from the tangent Gaussian distribution, i.e., \(\epsilon \sim {\mathcal {N}}_W(0, \sigma ^2) \propto \exp ( - \frac{1}{2\sigma ^2} \Vert \epsilon \Vert ^2_W )\), we implement the standard random walk Metropolis-Hastings algorithm (Robert et al. , 1999) with a burn-in period of \(10^4\) steps.
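A minimal sketch of such a random-walk Metropolis-Hastings sampler for the tangent space Gaussian on the SPD manifold, targeting the unnormalized density \(\exp (-\Vert \xi \Vert ^2_W/(2\sigma ^2))\) over symmetric matrices; the proposal scale is an illustrative choice rather than the exact setting used in our experiments:

```python
import numpy as np
from scipy.linalg import solve

def tangent_norm2_spd(W, U):
    """||U||_W^2 = tr(U W^{-1} U W^{-1}) under the affine-invariant metric."""
    WU = solve(W, U)
    return np.trace(WU @ WU)

def sample_tangent_gaussian_spd(W, sigma, n_burnin=10_000, step=0.1,
                                rng=np.random.default_rng()):
    """Random-walk Metropolis-Hastings targeting p(U) proportional to
    exp(-||U||_W^2 / (2 sigma^2)) on T_W S++ (symmetric matrices)."""
    r = W.shape[0]
    U = np.zeros((r, r))
    log_p = -tangent_norm2_spd(W, U) / (2 * sigma ** 2)
    for _ in range(n_burnin):
        Z = rng.standard_normal((r, r))
        proposal = U + step * sigma * (Z + Z.T) / 2   # symmetric random-walk proposal
        log_p_new = -tangent_norm2_spd(W, proposal) / (2 * sigma ** 2)
        if np.log(rng.uniform()) < log_p_new - log_p:
            U, log_p = proposal, log_p_new
    return U
```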
We compare the proposed algorithm DP-RGD with the differentially private Fréchet mean (FM-LAP) algorithm based on the Laplace mechanism in Reimherr et al. (2021) and the K-norm gradient mechanism (FM-KNG) in Soto et al. (2022). For FM-LAP, we follow Reimherr et al. (2021) to first obtain a non-privatized Fréchet mean \(\widehat{W}\) by running RGD. Then, we sample from an intrinsic Laplace distribution on \({\mathbb {S}}_{++}^r\) (following the steps in Hajri et al. (2016)) with footprint \(\widehat{W}\) and noise rate \(\sigma = \frac{4D_{\mathcal {W}}}{n\epsilon } = \frac{4}{n\epsilon }\), which is derived from the sensitivity of the Fréchet mean on the SPD manifold (Reimherr et al. 2021, Theorem 2). For FM-KNG, we use Metropolis-Hastings as suggested in Soto et al. (2022) with the proposal generated by \(\textrm{Exp}_W(t \sigma V)\) for the same \(\sigma\), where \(V \in T_W{\mathcal {M}}\) is sampled uniformly and \(t \in (0,1]\) controls the step. The complete sampling procedures are included in the appendix, where we set \(t = 0.1\) to avoid ill-conditioning. We consider a burn-in period of 5000 steps. It is worth highlighting that the sampling for FM-KNG is computationally costly, given that each step in Metropolis-Hastings requires the evaluation of the full gradient. We plot in Fig. 1a the expected empirical excess risk against the sample size, averaged over 20 runs. We observe better utility of the proposed DP-RGD, particularly when the sample size is small. Under large-sample regimes, our mechanism performs competitively against the baselines.
We also consider two real datasets: the Extended Yale-B dataset (YaleB) (Wright et al. , 2008) and the Kylberg dynamic texture dataset (Dynt) (Kylberg , 2011). We crop the images to \(32 \times 32\) and construct the Gabor-based region covariance descriptors (Pang et al. , 2008), each of which is an SPD matrix of size \(8\times 8\). We then use the geometry-aware PCA proposed in Han et al. (2021a) to reduce the dimensionality of each SPD covariance to a size of \(2 \times 2\). For each dataset, we randomly extract 50 samples and compute the Fréchet mean. We again follow the same parameter choices. Since the diameter bound \(D_{\mathcal {W}}\) is unknown in this case, we set \(D_{\mathcal {W}}= 1\). The stepsize is set to 0.1 for DP-RGD. We show in Figs. 1b and 1c the expected empirical excess risk with varying privacy parameter \(\epsilon\), where we see that DP-RGD enjoys improved utility, particularly for smaller \(\epsilon\). Nevertheless, we notice that the performance gap narrows as \(\epsilon\) increases.
6.2 Results on private Bures-Wasserstein barycenter
For the barycenter problem, we follow the same steps as in Sect. 6.1 to generate synthetic SPD matrices, while controlling the spectrum to ensure the condition \(D_{\mathcal {W}}< \frac{\pi }{2\sqrt{\kappa _\textrm{max}}}\) is satisfied. We choose \(D_{\mathcal {W}}= 0.3, \epsilon = 0.1, \delta = 10^{-3}, \eta = 0.01\) and the same setting of \(\sigma ^2 = T\log (1/\delta ) D^2_{\mathcal {W}}/(n^2\epsilon ^2)\) and \(T = n\) for the proposed DP-RGD. For FM-LAP (Reimherr et al. , 2021), we recall the sensitivity is given by \(\varDelta _\textrm{FM} = \frac{2 D_{\mathcal {W}}(2 - h_{\mathcal {W}})}{n h_{\mathcal {W}}}\) where \(h_{\mathcal {W}}= 2 D_{\mathcal {W}}\sqrt{\kappa _\textrm{max}} \cot (2 D_{\mathcal {W}}\sqrt{\kappa _\textrm{max}})\) based on (Reimherr et al. 2021, Theorem 2) and then we set \(\sigma = 2\varDelta _\textrm{FM}/\epsilon\). For FM-KNG, we set the same \(\sigma\) while \(\varDelta _\textrm{FM} = \frac{2 D_{\mathcal {W}}(2-h_{\mathcal {W}})}{n}\) as discussed in Sect. 6.1. The sampling is performed the same way as in Sect. 6.1. Figure 2a shows the empirical excess risk averaged over 20 runs. We similarly observe that the proposed DP-RGD obtains an improved utility compared to the baselines under small-sample regime while performing competitively when sample size increases. The better performance of FM-KNG over FM-LAP is expected given its sensitivity is strictly smaller when the curvature is positive.
6.3 Results on private principal eigenvector computation
For the problem of principal eigenvector computation in Sect. 5.1, we consider two real datasets, ijcnn1 and satimage from the LibSVM library (Chang and Lin , 2011), with \((n, d) = (49\,990, 14)\) and \((6\,430, 36)\) respectively. We process the datasets by taking the z-score for each feature and dividing by the spectral norm of the data matrix.
To sample from the tangent Gaussian distribution on the sphere, we first sample from \({\mathcal {N}}(0, \sigma ^2 I_{d})\) and then transform using a basis matrix. This is because the metric tensor is \(G_w = I_d\). We then set the parameters \(T =\log (n^2\epsilon ^2/((d+1)L_0^2\log (1/\delta )))\) and \(\sigma ^2 = T \log (1/\delta )L_0^2/(n^2\epsilon ^2)\) according to Theorem 4 under the Riemannian PL condition. We set \(\delta = 10^{-3}\) and vary the privacy parameter \(\epsilon\). We estimate \(L_0\) from the samples.
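A minimal sketch of this sampling step, where the basis matrix is taken as an orthonormal basis of the null space of \(w^\top\):

```python
import numpy as np
from scipy.linalg import null_space

def sample_sphere_tangent_noise(w, sigma, rng=np.random.default_rng()):
    """Sample from N_w(0, sigma^2) on T_w S^d: since G_w = I_d in an orthonormal
    basis, draw N(0, sigma^2 I_d) coordinates and map them through a basis matrix."""
    B = null_space(w[None, :])        # (d+1) x d orthonormal basis of {xi : w^T xi = 0}
    coords = sigma * rng.standard_normal(B.shape[1])
    return B @ coords
```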
We compare Algorithm 1 with full gradient \((b = n)\), denoted as DP-RGD, against projected gradient descent (with noise added in the ambient space), denoted as DP-PGD. In fact, projected gradient descent on the sphere approximates Riemannian gradient descent because the former updates by \(w_{t+1} = \frac{w_t - \eta \zeta _t}{\Vert w_t - \eta \zeta _t \Vert _2}\), where \(\zeta _t = \nabla F(w_t) + \epsilon _t\). This approximates the exponential map to the first order and is known as the retraction (Absil and Malick , 2012; Boumal , 2020). For this reason, and for comparability, we set the same noise variance and maximum number of iterations for both DP-RGD and DP-PGD. The stepsizes are tuned and fixed as 0.9 for DP-RGD and 10 for DP-PGD. We also include another baseline, input perturbation via the Gaussian mechanism (GMIP) on the covariance matrix \(\frac{1}{n} \sum _{i=1}^n z_i z_i^\top\) (Dwork et al. , 2014b). It can be derived that the \(L_2\) sensitivity of \(g(\{ z_i \}_{i=1}^n) = \frac{1}{n} \sum _{i=1}^n z_i z_i^\top\) is \(\varDelta g \le L_0/n = 2/n \max _{i} z_i^\top z_i\), as shown before. By the Gaussian mechanism, we add symmetric Gaussian noise to the covariance with \(\sigma = \varDelta g \sqrt{2\log (1.25/\delta )} /\epsilon\). Once the covariance matrix is perturbed, we use SVD to obtain the private output. From Figs. 2b and 2c, we see that DP-RGD maintains superior utility across different levels of privacy guarantees for both datasets.
6.4 Results on private hyperbolic structured prediction
For the hyperbolic structured prediction discussed in Sect. 5.4, we consider the task of inferring hyperbolic embeddings of the mammals subtree of the WordNet dataset (Miller , 1998). We use the pretrained hyperbolic embeddings on \({\mathcal {H}}^2\) following Nickel and Kiela (2017), with the transitive closure consisting of \(n = 1\,181\) nodes (words) and \(6\,541\) edges (hierarchies). The features are obtained by applying the Laplacian eigenmap (Belkin and Niyogi , 2003) to the (symmetric) adjacency matrix formed by the edges, reducing the dimension to \(r = 3\). Hence we obtain \((x_i, y_i) \in {\mathbb {R}}^3 \times {\mathcal {H}}^2\), for \(i =1,...,n\). We choose ‘horse’ as the test point with the rest being the training samples. We use the RBF kernel, i.e., \(k(x, x') = \exp (- \Vert x - x' \Vert _2^2/(2 \upsilon ^2))\) with \(\upsilon = 0.2\). The regularization parameter \(\gamma\) is set as \(\gamma = 10^{-5}\).
Here, we compare only Riemannian gradient descent (RGD) and the private Riemannian gradient descent of Algorithm 1 (DP-RGD), since no baseline is available in the private setting. We follow the nonconvex setting in Theorem 5 and the derivation in Sect. 5.4 to set \(T = \frac{\sqrt{L_1} n \epsilon }{\sqrt{d\log (1/\delta )} L_0}\), where \(L_0 = \max _i |\alpha _i(x)| D_{\mathcal {W}}\) with \(D_{\mathcal {W}}= 10\) (estimated from the samples) and \(L_1 = D_{\mathcal {W}}/\tanh (D_{\mathcal {W}})\). We set \(\sigma ^2 = T \log (1/\delta ) L_0^2/(n^2\epsilon ^2)\). The stepsizes are tuned and fixed as 0.003 for RGD and 0.001 for DP-RGD. For RGD, we set the maximum number of epochs to 1500.
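The parameter choices above amount to the following small computation (the function name is illustrative; `alpha_max` denotes an estimate of \(\max _i |\alpha _i(x)|\) and is an input here).

```python
import numpy as np

def hyperbolic_dp_rgd_params(n, d, eps, delta, D_W, alpha_max):
    # T and sigma^2 from the nonconvex setting of Theorem 5:
    # L_0 = max_i |alpha_i(x)| D_W and L_1 = D_W / tanh(D_W).
    L0 = alpha_max * D_W
    L1 = D_W / np.tanh(D_W)
    T = int(np.sqrt(L1) * n * eps / (np.sqrt(d * np.log(1.0 / delta)) * L0))
    sigma2 = T * np.log(1.0 / delta) * L0**2 / (n**2 * eps**2)
    return T, sigma2
```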
Figure 3a shows the Riemannian distance between the prediction and the true embedding on \({\mathcal {H}}^2\), averaged over 20 runs. We observe that DP-RGD converges reasonably close to the pretrained embedding in around \(T= 1\,000\) iterations while ensuring the guaranteed \((\epsilon , \delta )\) privacy.
To further compare the utility of the two methods, we plot in Fig. 4 the hyperbolic embeddings on the Poincaré disk (an isometric representation of the hyperbolic space). Due to the randomness in the noise, we sample three predicted embeddings from DP-RGD. We see in Fig. 4b that the predictions from DP-RGD are close to the ground truth while preserving the hierarchical structure.
6.5 Results on private low rank matrix completion
We finally consider the recommender system on the MovieLens-1M dataset (Maxwell Harper and Konstan , 2015) as an application of low rank matrix completion. We follow the same procedure as in Han and Gao (2021a) to generate the train and test data. We compare the proposed DP Riemannian stochastic gradient descent (DP-RSGD) with non-privatized RSGD for matrix completion (Balzano et al. , 2010). To sample noise from the tangent Gaussian distribution, we follow the same steps as for the sphere manifold, via the basis construction given by Huang et al. (2017). We choose a batch size \(b = 10\) and a fixed, tuned stepsize of \(\eta = 10^{-5}\) for RSGD and \(5\times 10^{-6}\) for DP-RSGD. We choose \(T = 10^3\) and \(\sigma ^2 = c \frac{T \log (1/\delta )}{n^2 \epsilon ^2}\) with \(c = 10^4\), which ensures \(\sigma\) is sufficiently large to satisfy the condition under subsampling (Lemma 3). The results are averaged over 5 random initializations. From Fig. 3b, we see that the proposed DP-RSGD achieves utility comparable to RSGD, particularly on the training set, while preserving privacy.
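A single DP-RSGD iteration, in the spirit of Algorithm 1, can be sketched as follows; `grad_f`, `exp_map` and `sample_tangent_noise` are manifold-specific callables (illustrative names), which for matrix completion would implement the Grassmann geometry recalled in Appendix A.4.

```python
def dp_rsgd_step(w, batch, grad_f, exp_map, sample_tangent_noise, eta, sigma, rng):
    # Minibatch Riemannian gradient, tangent-space Gaussian noise, and an
    # exponential-map (or retraction) update.
    g = sum(grad_f(w, z) for z in batch) / len(batch)
    noise = sample_tangent_noise(w, sigma, rng)
    return exp_map(w, -eta * (g + noise))
```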
7 Concluding remarks
We propose a general framework to ensure differential privacy for ERM problems on Riemannian manifolds. Since the tangent space of a manifold is a vector space, we propose a Gaussian mechanism that adheres to the metric structure in the tangent space. We also prove privacy as well as utility guarantees for our differentially private Riemannian (stochastic) gradient descent method. Our work also provides the first unified framework to perform differentially private optimization on Riemannian manifolds. We have explored various interesting applications on manifolds where privacy is of concern. We expect the present work to motivate the development of other classes of differentially private algorithms on manifolds, which is an increasingly important topic.
We lastly comment on the complexity of the proposed framework. The extra computational effort compared to the baselines mostly comes from the need for sampling and depends on the manifold considered. For the sphere, hyperbolic, and Grassmann manifolds, sampling can be performed efficiently by directly sampling standard Gaussian noise and transforming it via a basis matrix. For the SPD manifold, we have implemented a sampling procedure based on the Metropolis-Hastings algorithm, which may be inefficient for large SPD matrices. In our follow-up work (Utpala et al. , 2023), we have addressed this computational limitation of the proposed framework and proposed faster sampling procedures for all the manifolds considered.
References
Abadi, Martin, Chu, Andy, Goodfellow, Ian, Brendan McMahan, H., Mironov, Ilya, Talwar, Kunal, & Zhang, Li. (2016). Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, pages 308–318.
Absil, P.-A., & Malick, Jérôme. (2012). Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1), 135–158.
Absil, P.-A., Baker, Christopher G., & Gallivan, Kyle A. (2007). Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3), 303–330.
Absil, P.-A., Mahony, Robert, & Sepulchre, Rodolphe. (2009). Optimization algorithms on matrix manifolds. Princeton University Press.
Alimisis, Foivos, Orvieto, Antonio, Bécigneul, Gary, & Lucchi, Aurelien. (2020). A continuous-time perspective for modeling acceleration in Riemannian optimization. In International Conference on Artificial Intelligence and Statistics, pages 1297–1307. PMLR.
Amari, Shun-Ichi. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Amin, Kareem, Dick, Travis, Kulesza, Alex, Munoz, Andres, & Vassilvitskii, Sergei. (2019). Differentially private covariance estimation. In Advances in Neural Information Processing Systems, volume 32.
Asi, Hilal, Duchi, John, Fallah, Alireza, Javidbakht, Omid, & Talwar, Kunal. (2021). Private adaptive gradient methods for convex optimization. In International Conference on Machine Learning, pages 383–392. PMLR.
Balzano, Laura, Nowak, Robert, & Recht, Benjamin. (2010). Online identification and tracking of subspaces from highly incomplete information. In Annual Allerton Conference on Communication, Control, and Computing, pages 704–711. IEEE.
Bassily, Raef, Smith, Adam, & Thakurta, Abhradeep. (2014). Private empirical risk minimization: Efficient algorithms and tight error bounds. In IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE.
Bassily, Raef, Feldman, Vitaly, Talwar, Kunal, & Thakurta, Abhradeep Guha. (2019). Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, volume 32.
Bassily, Raef, Guzmán, Cristóbal, & Menart, Michael. (2021). Differentially private stochastic optimization: New results in convex and non-convex settings. In Advances in Neural Information Processing Systems, volume 34.
Belkin, Mikhail, & Niyogi, Partha. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396.
Bhatia, Rajendra. (2009). Positive definite matrices. Princeton University Press.
Bhatia, Rajendra, Jain, Tanvi, & Lim, Yongdo. (2019). On the Bures-Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 37(2), 165–191.
Biswas, Sourav, Dong, Yihe, Kamath, Gautam, & Ullman, Jonathan. (2020). Coinpress: Practical private mean and covariance estimation. In Advances in Neural Information Processing Systems, 33, 14475–14485.
Bonnabel, Silvere. (2013). Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9), 2217–2229.
Boumal, Nicolas. (2020). An introduction to optimization on smooth manifolds. Available online, May 3.
Boumal, Nicolas, & Absil, P.-A. (2015). Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra and its Applications, 475, 200–239.
Boumal, Nicolas, & Absil, Pierre-antoine. (2011a). RTRMC: A Riemannian trust-region method for low-rank matrix completion. In Advances in Neural Information Processing Systems, volume 24.
Boumal, Nicolas, & Absil, Pierre-antoine. (2011b). RTRMC: A Riemannian trust-region method for low-rank matrix completion. In Advances in Neural Information Processing Systems, volume 24.
Boumal, Nicolas, Mishra, Bamdev, Absil, P.-A., & Sepulchre, Rodolphe. (2014). Manopt, a matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research, 15(1), 1455–1459.
Candès, Emmanuel J., & Recht, Benjamin. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772.
Chami, Ines, Ying, Zhitao, Ré, Christopher, & Leskovec, Jure. (2019). Hyperbolic graph convolutional neural networks. Advances in Neural Information Processing Systems, 32.
Chang, Chih-Chung., & Lin, Chih-Jen. (2011). Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 1–27.
Chaudhuri, Kamalika, & Monteleoni, Claire. (2008). Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems, volume 21.
Chaudhuri, Kamalika, Monteleoni, Claire, & Sarwate, Anand D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3).
Chaudhuri, Kamalika, Sarwate, Anand D., & Sinha, Kaushik. (2013). A near-optimal algorithm for differentially-private principal components. Journal of Machine Learning Research, 14.
Dwork, Cynthia, & Lei, Jing. (2009). Differential privacy and robust statistics. In Annual ACM Symposium on Theory of Computing, pages 371–380.
Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, & Smith, Adam. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer.
Dwork, Cynthia, Rothblum, Guy N., & Vadhan, Salil. (2010). Boosting and differential privacy. In IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51–60. IEEE.
Dwork, Cynthia, & Roth, Aaron. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4), 211–407.
Dwork, Cynthia, Talwar, Kunal, Thakurta, Abhradeep, & Zhang, Li. (2014b). Analyze Gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the Annual ACM Symposium on Theory of Computing, pages 11–20.
Hajri, Hatem, Ilea, Ioana, Said, Salem, Bombrun, Lionel, & Berthoumieu, Yannick. (2016). Riemannian Laplace distribution on the space of symmetric positive definite matrices. Entropy, 18(3), 98.
Han, Andi, & Gao, Junbin. (2021a). Improved variance reduction methods for Riemannian non-convex optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Han, Andi, & Gao, Junbin. (2021b). Riemannian stochastic recursive momentum method for non-convex optimization. In International Joint Conference on Artificial Intelligence, pages 2505–2511.
Han, Andi, Mishra, Bamdev, Jawanpuria, Pratik, & Gao, Junbin. (2021a). Generalized Bures-Wasserstein geometry for positive definite matrices. arXiv:2110.10464.
Han, Andi, Mishra, Bamdev, Jawanpuria, Pratik Kumar, & Gao, Junbin. (2021b). On Riemannian optimization over positive definite matrices with the Bures-Wasserstein geometry. In Advances in Neural Information Processing Systems, volume 34.
Maxwell Harper, F., & Konstan, Joseph A. (2015). The movielens datasets: History and context. Acm Transactions on Interactive Intelligent Systems, 5(4), 1–19.
Huang, Wen, Absil, P.-A., & Gallivan, Kyle A. (2017). Intrinsic representation of tangent vectors and vector transports on matrix manifolds. Numerische Mathematik, 136(2), 523–543.
Iyengar, Roger, Near, Joseph P., Song, Dawn, Thakkar, Om, Thakurta, Abhradeep, & Wang, Lun. (2019). Towards practical differentially private convex optimization. In IEEE Symposium on Security and Privacy (SP), pages 299–316. IEEE.
Jain, Prateek, Netrapalli, Praneeth, & Sanghavi, Sujay. (2013). Low-rank matrix completion using alternating minimization. In Annual ACM Symposium on Theory of Computing, pages 665–674.
Jeuris, Ben, Vandebril, Raf, & Vandereycken, Bart. (2012). A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 39, 379–402.
Jiang, Wuxuan, Xie, Cong, & Zhang, Zhihua. (2016). Wishart mechanism for differentially private principal components analysis. In AAAI Conference on Artificial Intelligence, volume 30.
Kamath, Gautam, Li, Jerry, Singhal, Vikrant, & Ullman, Jonathan. (2019). Privately learning high-dimensional distributions. In Conference on Learning Theory, pages 1853–1902. PMLR.
Karimi, Hamed, Nutini, Julie, & Schmidt, Mark. (2016). Linear convergence of gradient and proximal-gradient methods under the Polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer.
Kasai, Hiroyuki, Sato, Hiroyuki, & Mishra, Bamdev. (2018). Riemannian stochastic recursive gradient algorithm. In International Conference on Machine Learning, pages 2516–2524. PMLR.
Khrulkov, Valentin, Mirvakhabova, Leyla, Ustinova, Evgeniya, Oseledets, Ivan, & Lempitsky, Victor. (2020). Hyperbolic image embeddings. In Conference on Computer Vision and Pattern Recognition, pages 6418–6428.
Kifer, Daniel, Smith, Adam, & Thakurta, Abhradeep. (2012). Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25–1. JMLR Workshop and Conference Proceedings.
Kuru, Nurdan, İlker Birbil, Ş, Gürbüzbalaban, Mert, & Yıldırım, Sinan. (2022). Differentially private accelerated optimization algorithms. SIAM Journal on Optimization, 32(2), 795–821.
Kylberg, Gustaf. (2011). Kylberg texture dataset v. 1.0. Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University.
Malagò, Luigi, Montrucchio, Luigi, & Pistone, Giovanni. (2018). Wasserstein Riemannian geometry of Gaussian densities. Information Geometry, 1, 137–179.
Marconi, Gian, Ciliberto, Carlo, & Rosasco, Lorenzo. (2020). Hyperbolic manifold regression. In International Conference on Artificial Intelligence and Statistics, pages 2570–2580. PMLR.
Maunu, Tyler, Yu, Chenyu, & Lerman, Gilad. (2022). Stochastic and private nonconvex outlier-robust PCA. arXiv:2203.09276.
Miller, George A. (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Mironov, Ilya. (2017). Rényi differential privacy. In IEEE 30th Computer Security Foundations Symposium, pages 263–275. IEEE.
Mishra, Bamdev, Adithya Apuroop, K., & Sepulchre, Rodolphe. (2012). A Riemannian geometry for low-rank matrix completion. arXiv:1211.1550.
Nickel, Maximillian, & Kiela, Douwe. (2017). Poincaré embeddings for learning hierarchical representations. Advances in Neural Information Processing Systems, 30.
Nickel, Maximillian, & Kiela, Douwe. (2018). Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In International Conference on Machine Learning, pages 3779–3788. PMLR.
Nissim, Kobbi, Raskhodnikova, Sofya, & Smith, Adam. (2007). Smooth sensitivity and sampling in private data analysis. In Annual ACM Symposium on Theory of Computing, pages 75–84.
Pang, Yanwei, Yuan, Yuan, & Li, Xuelong. (2008). Gabor-based region covariance matrices for face recognition. IEEE Transactions on Circuits and Systems for Video Technology, 18(7), 989–993.
Pardo, Leandro. (2018). Statistical inference based on divergence measures. Spain: Chapman and Hall/CRC.
Pennec, Xavier. (2006). Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1), 127–154.
Reimherr, Matthew, & Awan, Jordan. (2019). KNG: The k-norm gradient mechanism. Advances in Neural Information Processing Systems, 32.
Reimherr, Matthew, Bharath, Karthik, & Soto, Carlos. (2021). Differential privacy over Riemannian manifolds. In Advances in Neural Information Processing Systems, volume 34.
Rényi, Alfréd. (1961). On measures of entropy and information. In Berkeley Symposium on Mathematical Statistics and Probability, volume 4, pages 547–562. University of California Press.
Ring, Wolfgang, & Wirth, Benedikt. (2012). Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2), 596–627.
Robert, Christian P., & Casella, George. (1999). Monte Carlo statistical methods (Vol. 2). New York: Springer.
Rudi, Alessandro, Ciliberto, Carlo, Marconi, GianMaria, & Rosasco, Lorenzo. (2018). Manifold structured prediction. Advances in Neural Information Processing Systems, 31.
Sato, Hiroyuki, Kasai, Hiroyuki, & Mishra, Bamdev. (2019). Riemannian stochastic variance reduced gradient algorithm with retraction and vector transport. SIAM Journal on Optimization, 29(2), 1444–1472.
Shamir, Ohad. (2015). A stochastic PCA and SVD algorithm with an exponential convergence rate. In International Conference on Machine Learning, pages 144–152. PMLR.
Skovgaard, Lene Theil. (1984). A Riemannian geometry of the multivariate normal model. Scandinavian Journal of Statistics, pages 211–223.
Soto, Carlos, Bharath, Karthik, Reimherr, Matthew, & Slavkovic, Aleksandra. (2022). Shape and structure preserving differential privacy. arXiv:2209.12667.
Tifrea, Alexandru, Becigneul, Gary, & Ganea, Octavian-Eugen. (2018). Poincare glove: Hyperbolic word embeddings. In International Conference on Learning Representations.
Udriste, Constantin. (2013). Convex functions and optimization methods on Riemannian manifolds (Vol. 297). Bucharest, Romania: Springer Science & Business Media.
Utpala, Saiteja, Han, Andi, Jawanpuria, Pratik, & Mishra, Bamdev. (2023). Improved differentially private Riemannian optimization: Fast sampling and variance reduction. Transactions on Machine Learning Research, ISSN 2835-8856. URL https://openreview.net/forum?id=paguBNtqiO.
Vandereycken, Bart. (2013). Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2), 1214–1236.
Wang, Di, Ye, Minwei, & Xu, Jinhui. (2017). Differentially private empirical risk minimization revisited: Faster and more general. In Advances in Neural Information Processing Systems, volume 30.
Wang, Di, Chen, Changyou, & Xu, Jinhui. (2019a). Differentially private empirical risk minimization with non-convex loss functions. In International Conference on Machine Learning, pages 6526–6535. PMLR.
Wang, Lingxiao, Jayaraman, Bargav, Evans, David, & Gu, Quanquan. (2019b). Efficient privacy-preserving stochastic nonconvex optimization. arXiv:1910.13659.
Wang, Yu-Xiang, Balle, Borja, & Kasiviswanathan, Shiva Prasad. (2019c). Subsampled Rényi differential privacy and analytical moments accountant. In International Conference on Artificial Intelligence and Statistics, pages 1226–1235. PMLR.
Wright, John, Yang, Allen Y., Ganesh, Arvind, Shankar Sastry, S., & Ma, Yi. (2008). Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 210–227.
Yu, Da, Zhang, Huishuai, Chen, Wei, Yin, Jian, & Liu, Tie-Yan. (2021). Gradient perturbation is underrated for differentially private convex optimization. In International Joint Conferences on Artificial Intelligence, pages 3117–3123.
Zhang, Hongyi, & Sra, Suvrit. (2016). First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617–1638. PMLR.
Zhang, Hongyi, Reddi, Sashank J., & Sra, Suvrit. (2016). Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, volume 29.
Zhang, Jiaqi, Zheng, Kai, Mou, Wenlong, & Wang, Liwei. (2017). Efficient private ERM for smooth objectives. In International Joint Conference on Artificial Intelligence, pages 3922–3928.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions.
Author information
Contributions
AH, BM, PJ, and JG contributed to the analysis and developed the theoretical parts. AH, BM, and PJ conceived and designed the experiments. AH, BM, and PJ performed the experiments. AH, BM, PJ, and JG wrote the paper.
Ethics declarations
Conflict of interest
The University of Sydney and Microsoft India
Ethical approval
Not Applicable
Consent to participate
Not Applicable
Consent for publication
Not Applicable
Availability of data and material
Link to data and code is included in the manuscript (Sect. 6)
Code availability
Link to data and code is included in the manuscript (Sect. 6)
Additional information
Editors: Fabio Vitale, Tania Cerquitelli, Marcello Restelli, Charalampos Tsourakakis.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Missing details for the experiments
A.1 Principal eigenvector computation over sphere manifold
We first present the following theorem, which shows that the problem of leading eigenvector computation satisfies the Riemannian PL condition locally around the optimal solution.
Theorem 6
(From Zhang et al. (2016)) Let \(A = \frac{1}{n} \sum _{i=1}^n z_iz_i^\top\) and denote by \(\lambda _{i}\) the i-th largest eigenvalue of A. Assume \(\nu = \lambda _1 - \lambda _2 > 0\). Then, problem (2) locally satisfies the Riemannian PL condition with parameter \(\tau = O(d/(q^2 \nu ))\) with probability \(1-q\).
For DP-RGD, we use the exponential map of the sphere manifold, given by \(\textrm{Exp}_{w}(\xi ) = \cos (\Vert \xi \Vert _2) w + \sin (\Vert \xi \Vert _2) \xi /\Vert \xi \Vert _2\).
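A direct implementation of this exponential map reads as follows (the function name is illustrative).

```python
import numpy as np

def exp_sphere(w, xi, tol=1e-12):
    # Exp_w(xi) = cos(||xi||_2) w + sin(||xi||_2) xi / ||xi||_2, with Exp_w(0) = w.
    nrm = np.linalg.norm(xi)
    if nrm < tol:
        return w
    return np.cos(nrm) * w + np.sin(nrm) * xi / nrm
```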
A.2 Fréchet mean computation over symmetric positive definite manifold
Affine-invariant metric. The affine-invariant Riemannian metric is given by \(\langle U, V\rangle _W = \textrm{tr}(W^{-1}U W^{-1}V)\). The exponential map is \(\textrm{Exp}_W(U) = W^{1/2} \textrm{expm}(W^{-1/2} U W^{-1/2}) W^{1/2}\), where \(\textrm{expm}(\cdot )\) denotes the principal matrix exponential. The Riemannian gradient is computed by \(\textrm{grad}F(W) = W \nabla F(W) W\), where \(\nabla F(W)\) is the Euclidean gradient.
Bures-Wasserstein metric. The Bures-Wasserstein metric is given by \(\langle U, V \rangle _W = \frac{1}{2} \textrm{tr}( {\mathcal {L}}_{W}[U] V )\), where \({\mathcal {L}}_W[U]\) is the solution to the matrix linear equation \(W {\mathcal {L}}_W[U] + {\mathcal {L}}_W[U] W = U\). The exponential map is \(\textrm{Exp}_{W}(U) = W + U + {\mathcal {L}}_W[U] W {\mathcal {L}}_W[U]\). The Riemannian gradient is computed as \(\textrm{grad}F(W) = 2 W \nabla F(W) + 2 \nabla F(W) W\).
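A minimal NumPy/SciPy sketch of these two geometries is given below (function names are illustrative; `W` is assumed symmetric positive definite and `U`, `egrad` symmetric).

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_lyapunov

def _spd_sqrt_invsqrt(W):
    # Matrix square root and inverse square root of an SPD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(W)
    s = np.sqrt(vals)
    return (vecs * s) @ vecs.T, (vecs / s) @ vecs.T

def exp_affine_invariant(W, U):
    # Exp_W(U) = W^{1/2} expm(W^{-1/2} U W^{-1/2}) W^{1/2}
    Wh, Wih = _spd_sqrt_invsqrt(W)
    return Wh @ expm(Wih @ U @ Wih) @ Wh

def rgrad_affine_invariant(W, egrad):
    # grad F(W) = W (Euclidean gradient) W
    return W @ egrad @ W

def exp_bures_wasserstein(W, U):
    # L_W[U] solves the Lyapunov equation W L + L W = U;
    # Exp_W(U) = W + U + L_W[U] W L_W[U]
    L = solve_continuous_lyapunov(W, U)
    return W + U + L @ W @ L

def rgrad_bures_wasserstein(W, egrad):
    # grad F(W) = 2 (W egrad + egrad W)
    return 2.0 * (W @ egrad + egrad @ W)
```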
Sampling from FM-KNG on SPD manifold. Here we describe the sampling procedure for FM-KNG on \({\mathbb {S}}_{++}^r\) suggested in Soto et al. (2022), which requires the Metropolis-Hastings algorithm. Let \(\textrm{vech}(\cdot )\) denote the vectorization of a symmetric matrix along the columns, and recall that the density we wish to sample from is \(g(W) \propto \exp \{ - \frac{1}{n\sigma }\Vert \sum _{i=1}^n \textrm{grad}f(W; X_i) \Vert _W \}\). The procedure is as follows (a code sketch is given after the steps).
1. Initialize \(W_0 = \widehat{W}\) (the computed Fréchet mean).
2. At the i-th iteration, draw \({v} \in {\mathbb {R}}^{r(r+1)/2}\) uniformly on \([-0.5, 0.5]\) and set \(V = \textrm{vech}^{-1}(v)\), which is a symmetric matrix.
3. Generate a proposal \(W' = \textrm{Exp}_{W_i}(t \sigma V)\).
4. Accept the proposal and set \(W_{i+1} = W'\) with probability \(\min \{1, g(W')/g(W_i)\}\). Otherwise, reject \(W'\) and return to Step 2.
5. Repeat from Step 2 until a sufficient number of samples has been generated.
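A compact sketch of the above procedure reads as follows; `exp_map(W, V)` and `log_g(W)` are problem-specific callables (illustrative names), and the half-vectorization convention is one possible choice.

```python
import numpy as np

def vech_inv(v, r):
    # Map v in R^{r(r+1)/2} to a symmetric r x r matrix (one vech^{-1} convention).
    V = np.zeros((r, r))
    V[np.tril_indices(r)] = v
    return V + V.T - np.diag(np.diag(V))

def fm_kng_mh_sampler(W_hat, exp_map, log_g, t, sigma, r, n_samples, rng):
    # Metropolis-Hastings sampler targeting the density g defined above.
    samples, W = [], W_hat
    while len(samples) < n_samples:
        v = rng.uniform(-0.5, 0.5, size=r * (r + 1) // 2)   # Step 2: uniform draw
        W_prop = exp_map(W, t * sigma * vech_inv(v, r))      # Step 3: proposal on the manifold
        # Step 4: accept with probability min{1, g(W')/g(W)} (compared in log space).
        if np.log(rng.uniform()) < log_g(W_prop) - log_g(W):
            W = W_prop
            samples.append(W)
    return samples
```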
A.3 Hyperbolic structured prediction
The exponential map for the Lorentz model of hyperbolic space is given by \(\textrm{Exp}_w(u) = \cosh ( \Vert u \Vert _{\mathcal {L}}) w + u \frac{\sinh (\Vert u \Vert _{\mathcal {L}})}{\Vert u\Vert _{\mathcal {L}}}\). Another model of hyperbolic space is the Poincaré disk \({\mathcal {D}}^d = \{ x \in {\mathbb {R}}^d: \Vert x \Vert _2 < 1\}\), which is isometric to the Lorentz model and is particularly useful for visualization.
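The following sketch implements the exponential map and the Lorentz-to-Poincaré conversion (function names are illustrative; we assume the convention that the first coordinate is the time-like one).

```python
import numpy as np

def minkowski_norm(u):
    # ||u||_L = sqrt(<u, u>_L) with <u, v>_L = -u_0 v_0 + sum_{i>=1} u_i v_i;
    # real and positive for nonzero tangent vectors.
    return np.sqrt(-u[0] ** 2 + np.sum(u[1:] ** 2))

def exp_lorentz(w, u, tol=1e-12):
    # Exp_w(u) = cosh(||u||_L) w + sinh(||u||_L) u / ||u||_L, with Exp_w(0) = w.
    nrm = minkowski_norm(u)
    if nrm < tol:
        return w
    return np.cosh(nrm) * w + np.sinh(nrm) * u / nrm

def lorentz_to_poincare(x):
    # Isometric map to the Poincaré ball used for visualization: x_{1:} / (1 + x_0).
    return x[1:] / (1.0 + x[0])
```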
A.4 Low rank matrix completion over Grassmann manifold
Grassmann manifold geometry. Recall that the Stiefel manifold \(\textrm{St}(d, r):= \{ W \in {\mathbb {R}}^{d \times r} :W^\top W = I_r \}\) is the set of column orthonormal matrices. The Grassmann manifold \({\mathcal {G}}(d, r)\) is the quotient manifold of the Stiefel manifold \(\textrm{St}(d, r)\) under the action of the orthogonal group \({\mathcal {O}}(r)\), the set of orthogonal matrices of size \(r\times r\). Thus, each point on the Grassmann manifold \({\mathcal {G}}(d, r)\) can be identified with an equivalence class of a column orthonormal matrix from \(\textrm{St}(d, r)\), representing a basis of the subspace. The tangent space of \({\mathcal {G}}(d, r)\) is an abstract space that can be identified with the horizontal subspace of the tangent space of \(\textrm{St}(d,r)\). For \(W \in \textrm{St}(d,r)\), we denote by [W] the equivalence class of W under the action of \({\mathcal {O}}(r)\), so that \([W] \in {\mathcal {G}}(d,r)\). By identifying [W] with W, we can informally write the tangent space as \(T_{[W]} {\mathcal {G}}(d,r) = \{ \xi \in {\mathbb {R}}^{d \times r}: W^\top \xi = 0 \}\). The canonical Riemannian metric coincides with the Euclidean metric, i.e., \(\langle U, V \rangle _W = \textrm{tr}(U^\top V)\). The orthogonal projection of an ambient matrix \(U \in {\mathbb {R}}^{d \times r}\) onto \(T_{[W]}{\mathcal {G}}(d,r)\) is given by \(\textrm{Proj}_{W}(U) = (I - WW^\top ) U\).
Sampling. To sample from the tangent space Gaussian distribution, we construct an orthonormal basis for the tangent space as \(\{ W_{\perp } \tilde{e}_i e_j^\top , i = 1,...,d-r, j =1,..., r \}\), where \(W_{\perp } \in {\mathbb {R}}^{d \times (d-r)}\) has columns forming an orthonormal basis for the orthogonal complement of the column space of W. We then follow the same procedure as for the sphere manifold: we first sample in the Euclidean space and then transform via the constructed basis.
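This sampling step can be sketched as follows (the function name is illustrative).

```python
import numpy as np
from scipy.linalg import null_space

def sample_tangent_gaussian_grassmann(W, sigma, rng):
    # The basis {W_perp e_i e_j^T} is orthonormal under the canonical metric, so
    # sampling i.i.d. N(0, sigma^2) coefficients and mapping through the basis
    # amounts to xi = W_perp S with S a (d-r) x r Gaussian coefficient matrix.
    d, r = W.shape
    W_perp = null_space(W.T)                      # d x (d-r) orthonormal complement of span(W)
    S = rng.normal(scale=sigma, size=(d - r, r))
    return W_perp @ S                             # satisfies W^T xi = 0
```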
B Proofs
B.1 Proofs from Sect. 3
Proof of Proposition 2
For any \(\xi \sim {\mathcal {N}}_w(0, \sigma ^2)\), we have \({\mathcal {R}}(D) \sim {\mathcal {N}}_w(H(D), \sigma ^2)\). Set \(H(D) - H(D') = \zeta\) and let the privacy loss random variable (Dwork et al. , 2010; Dwork and Roth , 2014) be defined as
By (Dwork and Roth 2014, Lemma 3.17), it suffices to show \(|{\mathcal {L}}_{{\mathcal {R}}(D) \Vert {\mathcal {R}}(D')}| \le \epsilon\) with probability \(1-\delta\). To see this, we first notice that \({\mathcal {L}}_{{\mathcal {R}}(D)\Vert {\mathcal {R}}(D')} \sim {\mathcal {N}}\left( \frac{\Vert \zeta \Vert _w^2}{2\sigma ^2}, \frac{\Vert \zeta \Vert ^2_w}{\sigma ^2} \right)\) (by vectorization as in Remark 1). Then we can write \({\mathcal {L}}_{{\mathcal {R}}(D) \Vert {\mathcal {R}}(D')} = \frac{\Vert \zeta \Vert _w}{\sigma ^2} v + \frac{\Vert \zeta \Vert ^2_w}{2\sigma ^2}\), where \(v \sim {\mathcal {N}}(0, \sigma ^2)\). This leads to \(|{\mathcal {L}}_{{\mathcal {R}}(D) \Vert {\mathcal {R}}(D')}| \le | \frac{1}{2\sigma ^2} ({2\varDelta _h} v - {\varDelta _h^2} ) |\). The rest of the proof directly follows from (Dwork and Roth 2014, Theorem A.1) by applying the tail bound of Gaussian distribution. \(\square\)
Proof of Lemma 2
First, recall that the Rényi divergence (Rényi , 1961) between two distributions P, Q with parameter \(\beta \ge 2\) is defined as \(\textrm{Div}_\beta (P \Vert Q):= \frac{1}{\beta - 1} \log {\mathbb {E}}_{o \sim P} \left[ (\frac{P(o)}{Q(o)})^{\beta -1} \right]\). Since \({\mathcal {R}}(D) \sim {\mathcal {N}}_w(H(D), \sigma ^2)\), \({\mathcal {R}}(D') \sim {\mathcal {N}}_w (H(D'), \sigma ^2)\), we obtain
By standard results on Rényi divergence between multivariate Gaussian distributions, such as Pardo (2018), we have
where we assume, without loss of generality, that \(D, D'\) differ only in sample j, and the last inequality follows from the Lipschitzness of h. \(\square\)
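For the reader's reference, the standard closed form invoked above, for two Gaussian distributions sharing a covariance \(\Sigma\), is
\[
\textrm{Div}_\beta \big( {\mathcal {N}}(\mu _1, \Sigma ) \,\Vert \, {\mathcal {N}}(\mu _2, \Sigma ) \big) = \frac{\beta }{2} (\mu _1 - \mu _2)^\top \Sigma ^{-1} (\mu _1 - \mu _2).
\]
Under the convention that the vectorized \({\mathcal {N}}_w(\mu , \sigma ^2)\) has coordinate covariance \(\Sigma = \sigma ^2 G_w^{-1}\), this specializes to \(\textrm{Div}_\beta ({\mathcal {R}}(D) \Vert {\mathcal {R}}(D')) = \frac{\beta \Vert H(D) - H(D') \Vert _w^2}{2\sigma ^2} \le \frac{2\beta L_0^2}{n^2 \sigma ^2}\), where the inequality uses the per-sample Lipschitz bound so that \(\Vert H(D) - H(D') \Vert _w \le 2L_0/n\).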
Proof of Lemma 3
The proof is a combination of Lemma 2 and the subsampling theorem (Wang et al. 2019c, Theorem 9) for Rényi differential privacy. From Lemma 2 and the definition of Rényi differential privacy, we see that the mechanism \({\mathcal {R}}\) applied to \(D_\textrm{sub}\) is \((\lambda + 1, \frac{2\lambda L_0^2}{b^2 \sigma ^2})\)-RDP. Let \(\rho (\lambda ) = \frac{2\lambda L_0^2}{b^2 \sigma ^2}\). Then, by the subsampling theorem (Wang et al. 2019c, Theorem 9), we have
By choosing the parameters \(\sigma ^2, b\) and ensuring \(\lambda\) is small, it is possible to show \(K_{\mathcal {R}}(\lambda ) \le \frac{15 L_0^2 (\lambda +1)}{n^2 \sigma ^2}\) as in (Wang et al. 2019b, Lemma 3.7), where we use the fact that the squared sensitivity without subsampling is \(\frac{4L_0^2}{b^2}\). \(\square\)
B.2 Proofs from Sect. 4
Proof of Theorem 1
Let \({{\mathcal {R}}}_t(D) = w_{t+1} = \textrm{Exp}_{w_t}(-\eta _t \tilde{{\mathcal {R}}}_t(D))\), where \(\tilde{{\mathcal {R}}}_t(D) =\zeta _t = \frac{1}{b} \sum _{z \in {\mathcal {B}}} \textrm{grad}f(w_t; z) + \epsilon _t\). The final output \({\mathcal {R}}(D)\) is a composition of \({\mathcal {R}}_T,..., {\mathcal {R}}_1\). First, from the composition lemma (Dwork and Roth , 2014), the differential privacy properties of \({\mathcal {R}}_t\) are equivalent to those of \(\tilde{{\mathcal {R}}}_t\). To show privacy guarantees of \(\tilde{{\mathcal {R}}}_t\), we consider gradient descent and stochastic gradient descent separately.
Under the setting of gradient descent, i.e., \(b = n\), we have \(K_{{{\mathcal {R}}}_t}(\lambda ) \le \frac{2\lambda (\lambda +1) L_0^2}{n^2\sigma ^2}\) by Lemma 2. Then from the composability theorem (Abadi et al. 2016, Theorem 2.1), we have after T iterations, \(K_{{\mathcal {R}}}(\lambda ) \le \sum _{t=1}^T K_{{\mathcal {R}}_t}(\lambda ) = \frac{2\lambda (\lambda +1)TL_0^2}{n^2\sigma ^2}\). To show \((\epsilon , \delta )\)-differential privacy of \({\mathcal {R}}\), based on Proposition 1, it is sufficient to show
Similar to Abadi et al. (2016) and Wang et al. (2017), when \(\sigma ^2 \ge c_1 \frac{T\log (1/\delta ) L_0^2}{n^2 \epsilon ^2}\) for some constant \(c_1\), we can guarantee these conditions, and hence the algorithm satisfies \((\epsilon , \delta )\)-differential privacy.
Under the setting of stochastic gradient descent, i.e., \(b < n\), we can bound \(K_{{\mathcal {R}}_t}(\lambda )\) as in Lemma 3. After T iterations, the cumulative bound is \(\frac{15T(\lambda +1)L_0^2}{n^2\sigma ^2}\) under sufficiently large \(\sigma ^2\). Based on similar reasoning as in the full gradient case, we see when \(\sigma ^2 \ge c_2 \frac{ T\log (1/\delta )L_0^2}{n^2\epsilon ^2}\), the algorithm is \((\epsilon , \delta )\)-differentially private. \(\square\)
Proof of Lemma 4
First, using unbiasedness of \(\textrm{grad}f(w;z)\), we have \({\mathbb {E}}[\zeta ] = \textrm{grad}F(w)\). Consider the vectorization \(\vec {\epsilon }\) and the tangent space Gaussian distribution expressed as a standard multivariate Gaussian. Thus, \({\mathbb {E}}\Vert \epsilon \Vert _w^2 = {\mathbb {E}}\Vert \vec {\epsilon } \Vert ^2_2 = \sigma ^2 \textrm{tr}(G_w^{-1}) \le d \sigma ^2 c_l^{-1}\). Finally, we have
where we notice \({\mathbb {E}}[\langle \textrm{grad}f(w, z), \epsilon \rangle _w] = 0\) as the noise is independent. \(\square\)
Proof of Theorem 2
The proof is adapted from the proof of (Zhang and Sra 2016, Theorem 10). Since each f(w; z) is geodesic convex, \(F(w) = \frac{1}{n} \sum _{i=1}^n f(w; z_i)\) is also geodesic convex. From the first-order characterization of geodesic convex functions (Boumal , 2020), we have \(F(w_t) - F(w^*) \le - \langle \textrm{grad}F(w_t), \textrm{Exp}_{w_t}^{-1}(w^*) \rangle _{w_t}\). Further, applying Lemma 1 to the geodesic triangle formed by \(w_t, w_{t+1}, w^*\), we obtain
where we use Lemma 4 and notice that \(\textrm{Exp}^{-1}_{w_t} (w_{t+1}) = - \eta _t \zeta _t\). The expectation is over both the randomness in the noise and the subsampling (if \(b < n\)) at iteration t. Combining (5) and the first order characterization yields
where the expectation is over the randomness of both the noise and subsampling (if \(b < n\)). Telescoping this inequality from \(t = 0,..., T-1\) and setting \(\eta _t = \frac{D_{\mathcal {W}}}{\sqrt{(L_0^2 + d \sigma ^2 c_l^{-1}) \varsigma T}}\) gives
where we take expectation over all iterations. Finally, it can be shown from the property of geodesic averaging that \(F(w^\textrm{priv}) \le \frac{1}{T} \sum _{t=0}^{T-1} F(w_t)\). \(\square\)
Proof of Theorem 3
The proof adapts from (Zhang and Sra 2016, Theorem 12) and follows similarly as in Theorem 2. \(\square\)
Proof of Theorem 4
First, consider gradient descent with \(\zeta _t = \textrm{grad}F(w_t) + \epsilon _t\). Then, by geodesic \(L_1\)-smoothness, we have
where the last inequality uses \(\eta _t \le 1/L_1\). From the PL condition, we have
Applying this result recursively by choosing \(\eta _t = \eta < \min \{ 1/L_1, 1/\tau \}\) and taking full expectation yields
where the last inequality follows from the limit of a geometric series and \(\eta < \frac{1}{L_1}\). Finally, choosing \(T = \log (\frac{n^2\epsilon ^2}{d L_0^2 \log (1/\delta ) c_l^{-1}})\) gives the desired result.
Now we consider stochastic gradient descent with \(\zeta _t = \frac{1}{b}\sum _{z \in {\mathcal {B}}_t} \textrm{grad}f(w_t; z) + \epsilon _t\). Similarly, by geodesic \(L_1\)-smoothness, we obtain
Let \(\eta _t = \eta < \min \{ 1/L_1, 1/\tau \}\), and apply the result recursively. Then we have
under the same choice of T. This matches the bound under full gradient setting. \(\square\)
Proof of Theorem 5
First, consider gradient descent with \(\zeta _t = \textrm{grad}F(w_t) + \epsilon _t\). Then, by geodesic \(L_1\)-smoothness, we obtain
where we choose \(\eta _t= \eta = 1/L_1\). Telescoping this inequality for \(t = 0,..., T-1\) and taking expectation yields
by the choice of \(\sigma ^2\) and \(T = \frac{\sqrt{L_1} n \epsilon }{\sqrt{d \log (1/\delta ) c_l^{-1}}L_0}\). Given the output \(w^\textrm{priv}\) is uniformly selected from \(\{ w_0,..., w_{T-1}\}\), we have \({\mathbb {E}}\Vert \textrm{grad}F(w^\textrm{priv}) \Vert ^2 = \frac{1}{T} \sum _{t=0}^{T-1} {\mathbb {E}}\Vert \textrm{grad}F(w_t) \Vert ^2_{w_t}\).
Similarly, under the stochastic setting, \(\zeta _t = \frac{1}{b}\sum _{z \in {\mathcal {B}}_t} \textrm{grad}f(w_t; z) + \epsilon _t\) and
where we use the fact that \(\eta _t = \frac{1}{L_1}\). Following the same argument and the same choice of \(\sigma ^2\) and T, we achieve the same bound as in the full gradient case. \(\square\)
C Sensitivity to the choice of stepsize
To assess the sensitivity of the utility with respect to the stepsize \(\eta\), we conduct an additional experiment on computing the private Fréchet mean on the SPD manifold while varying \(\eta\). In Fig. 5, we plot the utility of DP-RGD against the baselines, averaged over five runs, for a range of \(\eta\) values from 0.005 to 0.015. We observe that the improvement over the baselines (in the small-sample regime) is robust across the specified range of \(\eta\).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.