
Privacy-Preserving Non-Negative Matrix Factorization with Outliers

Published: 12 January 2024

Abstract

Non-negative matrix factorization is a popular unsupervised machine learning algorithm for extracting meaningful features from inherently non-negative data. Such data often contain privacy-sensitive user information. Additionally, the dataset can contain outliers, which may lead to extracting sub-optimal features from the data. It is, therefore, necessary to address these two issues while analyzing privacy-sensitive data that may contain outliers. In this work, we develop a non-negative matrix factorization algorithm in the privacy-preserving framework that (i) considers the presence of outliers in the data, and (ii) can achieve results comparable to those of the non-private algorithm. We design our method in such a way that the user can control the degree of privacy guarantee based on the tolerable utility gap. We demonstrate the effectiveness of our proposed algorithm on six real and diverse datasets. The experimental results show that our proposed method can achieve a performance that closely approximates the performance of the non-private algorithm under some parameter choices, while ensuring strict privacy guarantees.

1 Introduction

Non-negative matrix factorization (NMF) is an unsupervised machine learning (ML) technique for discovering the part-based representation of inherently non-negative data [29]. Consider a data matrix \(\mathbf {V}\in \mathcal {R}^{D \times N}\) with entries \(v_{ij} \geqslant 0\ \forall i\in \lbrace 1, 2, \ldots , D\rbrace\) and \(\forall j\in \lbrace 1, 2, \ldots , N\rbrace\) , where N is the number of data samples, and D is the data dimension. The NMF objective is to (approximately) decompose the data matrix \(\mathbf {V}\) into two non-negative matrices as follows:
\begin{equation} \mathbf {V} \approx \mathbf {WH}, \end{equation}
where \(\mathbf {W} \in \mathcal {R}^{D\times K}\) is the basis matrix, \(\mathbf {H} \in \mathcal {R}^{K\times N}\) is the coefficient matrix, and K is the latent dimension. The decomposition is performed, usually through minimizing some divergence between the data matrix and the factor matrices, such that each entry of \(\mathbf {W}\) satisfies \(w_{ik} \geqslant 0\) , and each entry of \(\mathbf {H}\) satisfies \(h_{kj}\geqslant 0,\forall k\in \lbrace 1, 2, \ldots , K\rbrace\) . In short, NMF performs dimension reduction by mapping the ambient data dimension D into latent dimension K for N data samples. Unlike other dimension reduction methods, such as Principal Component Analysis (PCA) or Independent Component Analysis (ICA), NMF is well-suited for inherently non-negative data because it finds non-subtractive and interpretable basis and coefficients.
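For illustration, the decomposition above can be reproduced with an off-the-shelf NMF implementation; a minimal sketch using scikit-learn on toy data of our own (not the algorithm proposed in this article):

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    V = np.abs(rng.standard_normal((100, 40)))   # toy non-negative data: D = 100, N = 40

    model = NMF(n_components=5, init="nndsvd", max_iter=500)
    W = model.fit_transform(V)                   # D x K basis matrix
    H = model.components_                        # K x N coefficient matrix
    print(np.linalg.norm(V - W @ H, "fro"))      # reconstruction error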
As mentioned before, NMF approximates the jth column of \(\mathbf {V}\) as \(\mathbf {v}_j \approx \mathbf {W} \mathbf {h}_j\) , where \(\mathbf {h}_j\) is the jth column of \(\mathbf {H}\) . Essentially, the jth column of \(\mathbf {V}\) is represented as a linear combination of the columns of \(\mathbf {W}\) , with the coefficients being the corresponding entries from \(\mathbf {h}_j\) . Therefore, the dictionary matrix \(\mathbf {W}\) can be interpreted as storing the “parts” of the data matrix \(\mathbf {V}\) . This part-based decomposition, and considering only the non-negative values, make NMF popular in many practical applications including, but not limited to:
dimension reduction
topic modeling in text mining
representation learning
extracting local facial features from human faces
unsupervised image segmentation
speech denoising
community detection in social networks.
However, the acquired data samples may be contaminated by localized outliers [66], such as salt and pepper noise, occlusions, shadows, moving objects in videos, and so on. The presence of such outliers can affect the accurate estimation of basis and thus, the latent low-rank structure of the data [70, 71].
NMF and Privacy. Many services of the modern era operate on sensitive user data to provide better suggestions and experiences. These services are often powered by ML algorithms, such as the ones mentioned above, which are trained on users’ sensitive data. As shown in both theoretical and applied works, users are rightfully concerned about their privacy being compromised by the outputs of ML algorithms. For example, the seminal work of Homer et al. [28] showed that the presence of an individual in a genome dataset can be identified from simple summary statistics about the dataset. In the ML setting, the model parameters (as a matrix \(\mathbf {W}\) or vector \(\mathbf {w}\) ) are typically learned by training on the users’ data. To that end, Membership Inference Attacks [22] are discussed in detail by Hu et al. [30] and Shokri et al. [60]: given the learned parameters \(\mathbf {W}\) , an adversary can identify users in the training set. In essence, the trained weights of the model can be used to extract sensitive information, and the tendency of the weights to memorize training examples can be exploited to regenerate those examples. Several other works [37, 47, 62] also showed how personal data leakage occurs in modern ML and signal processing tasks. Additionally, it has been shown that simple anonymization of data does not provide any privacy in the presence of auxiliary information available from other/public sources [62]. One prime example is the Netflix Challenge: an anonymized dataset containing movie ratings from Netflix subscribers was released in 2007, yet researchers were able to recover \(99\%\) of the removed personal data [47] using the publicly available IMDb datasets. In addition, privacy leakage can happen through gradient sharing [72]: in distributed ML systems, several nodes exchange gradient values, and the authors in [72] showed how one can extract private training information from such shared gradients. These discoveries led to widespread concern over using private data for ML algorithms. Evidently, personal information leakage is the main hindrance to collecting and analyzing sensitive data for training ML and signal processing algorithms. It is, therefore, necessary to employ a framework in which one can share private data without disclosing their participation or identity. Differential privacy (DP) is a rigorous mathematical framework that can provide protection against such information leakage [18]. The definition of differential privacy is motivated by work in cryptography, and it has gained significant attention in the ML and data mining communities [17]. DP introduces randomness into the computation pipeline to preserve privacy. This, however, degrades the computation accuracy, and the user needs to quantitatively choose the optimum privacy budget considering the required privacy-utility tradeoff.
Our Contributions. In this work, we consider applications of non-negative matrix factorization involving sensitive data, such as dimension reduction/feature extraction, topic modeling/text mining, image segmentation, denoising, and community detection. We intend to perform NMF decomposition on inherently non-negative and privacy-sensitive data (that may contain sparse outliers) to discover novel structure/features of the dataset. That is, (i) the estimated basis matrix \(\mathbf {W}\) should closely approximate the “true” basis matrix in the presence of the outliers, and (ii) since the data matrix \(\mathbf {V}\) contains user-specific privacy-sensitive information, the estimated \(\mathbf {W}\) should ensure a formal privacy guarantee such that an adversary cannot extract sensitive information regarding the users. To address this, we propose a differentially-private [18] non-negative matrix factorization algorithm to compute the basis matrix \(\mathbf {W}\) considering the presence of outliers in the dataset. We adopt the outlier model from Reference [71]. To the best of our knowledge, our proposed method is the first differentially-private NMF algorithm considering outliers. Employing differential privacy, a mathematically rigorous privacy framework, allows us to compute \(\mathbf {W}\) such that it reveals very little about any particular user’s data, and is relatively unaffected by the presence of outliers. Our algorithm design ensures that the learned \(\mathbf {W}\) closely approximates the “true” dictionary matrix, while ensuring strict privacy guarantees. Our major contributions are summarized below:
We develop a novel privacy-preserving non-negative matrix factorization algorithm capable of operating on sensitive data, while closely approximating the results of the non-private algorithm.
We consider the effect of outliers by specifically modeling them as in Reference [71], such that the presence of outliers has very little effect on our estimated differentially-private basis matrix \(\mathbf {W}\) .
Considering the multi-round nature of the proposed algorithm, we analyze our algorithm using Rényi Differential Privacy (RDP) [46] to provide a tighter characterization of privacy under composition. We obtain a much better accounting of the overall privacy loss, compared with the conventional strong composition theorem [18].
We perform extensive experimentation on six diverse real datasets to show the effectiveness of our proposed algorithm. We compare the results with those of an existing algorithm [57] and the non-private algorithm. We observe that the basis matrix \(\mathbf {W}\) , which our proposed algorithm estimates, provides a close approximation to the non-private results for certain parameter choices, easily outperforming the existing algorithm.
We present the results in a way that makes the privacy-utility tradeoff easy to grasp, so that the user can choose between the overall privacy budget and the required “closeness” to non-private results.
Related Works. In the literature, non-negative matrix factorization is attained by minimizing the following objective function:
\begin{equation} \displaystyle \min _{\mathbf {W} \in \mathcal {C}, h_{kj}\geqslant 0, \forall k,j} \left\Vert \mathbf {V} - \mathbf {WH}\right\Vert _F^2, \end{equation}
where \(\mathcal {C} \subseteq \mathcal {R}^{D\times K}\) is the constraint set for \(\mathbf {W}\) . Several algorithms have been proposed to obtain the optimal point of this objective function, such as the multiplicative updates [38], alternating direction method of multipliers (ADMM) [67], block principal pivoting [35], the active set method [33], and projected gradient descent (PGD) [43]. Most of these algorithms are based on alternately updating \(\mathbf {W}\) and \(\mathbf {H}\) . This essentially divides the optimization problem into two sub-problems, each of which can be optimized using standard optimization techniques, such as the projected gradient or the interior point method. A detailed survey of these optimization techniques can be found in References [34, 64].
Additionally, some algorithms have been proposed to solve the NMF problem considering the outliers [59, 66, 70, 71]. Zhao and Tan [71] proposed an online NMF algorithm with two different solvers: a PGD-based solver and an ADMM-based solver. However, they proposed to use a fixed step size for the iterative updates, and their stopping condition does not necessarily indicate that the solution is close to a stationary point. Moreover, due to the online nature of the algorithm, it can be very slow for high-dimensional datasets with large numbers of samples. Zhang et al. [70] proposed another algorithm for robust NMF, which uses multiplicative updates for the basis and coefficient matrices and a soft-thresholding function for updating the outlier matrix. However, it has been shown in multiple works (see, e.g., Gonzalez and Zhang [26] and Lin [43]) that the multiplicative update method lacks convergence guarantees and desirable optimization properties. The robust NMF proposed by Shen et al. [59] also uses a multiplicative update rule for the basis matrix. Our work is based on the robust NMF algorithm using PGD [71].
Extensive works and surveys exist in the literature on differential privacy. In particular, Dwork and Smith’s survey [19] contains the initial theoretical work. We refer the reader to References [1, 5, 12, 13, 32, 40, 41, 61, 63] for the most relevant works in differentially private machine learning, deep learning, optimization, gradient descent, and empirical risk minimization. Adding randomness in the gradient calculation is one of the most common approaches for implementing differential privacy [5, 61]. Other common approaches are employing the output [13] and objective perturbations [51], the exponential mechanism [45], the Laplace mechanism [17], and Smooth sensitivity [50].
Several researchers have attempted to employ differential privacy in the context of general matrix factorization, focusing on recommendation systems [44] without explicitly handling the non-negativity constraint. For example, Nikolaenko et al. [49] proposed a privacy-preserving matrix factorization for recommendation systems using partially homomorphic encryption with Yao’s garbled circuits. Zhang et al. [69], Hua et al. [31], and Ran et al. [56] used the objective perturbation [13] (or its modification) for matrix factorization. Berlioz et al. [7] proposed differentially private matrix factorization in three different ways: input perturbation, perturbation of the gradients, and output perturbation. In the distributed-data setting, Ermis et al. [20] applied local differential privacy to the collective matrix factorization problem. The works most relevant to ours are the work of Alsulaimawi [3], which introduced a privacy filter with federated learning for NMF; the work of Fu et al. [23], which proposed privacy-preserving NMF for dimension reduction using the Paillier Cryptosystem; and the work on privacy-preserving data mining [2] using combined NMF and Singular Value Decomposition (SVD) methods. Last but not least, the authors in Reference [57] proposed differentially private NMF for recommender systems; we show comparison analyses in Section 4.5. However, to the best of our knowledge, no work has introduced differential privacy in the general NMF decomposition framework considering outliers, while accounting for the overall privacy spent across the multi-stage implementation to achieve the optimal privacy budget.

2 Problem Formulation

2.1 Notations

We use bold lower-case letters \(({\bf v})\) , bold upper-case letters \(({\bf V})\) , and unbolded letters \((M)\) to denote vectors, matrices, and scalars, respectively. To indicate the iteration instant, we use the subscript t. For example, \(\mathbf {W}_t\) denotes the dictionary matrix after t iterations. The superscript \(^{\prime }+^{\prime }\) indicates a single update. We denote indices with lowercase unbolded letters. For example, the nth column of matrix \({\bf V}\) is denoted as \({\bf v}_n\) , and \(v_{ij}\) indicates the entry at the ith row and jth column of the matrix \(\mathbf {V}\) . Inequalities \(\mathbf {x} \geqslant 0\) and \(\mathbf {X} \geqslant 0\) apply entry-wise. For element-wise matrix multiplication, we use the notation \(\odot\) . We denote the \(\mathcal {L}_2\) norm (Euclidean norm) with \(\left\Vert .\right\Vert _2\) , the \(\mathcal {L}_{1,1}\) norm with \(\left\Vert .\right\Vert _{1,1}\) , and the Frobenius norm with \(\left\Vert .\right\Vert _F\) . \(\mathcal {R}\) and \(\mathcal {R}_{+}\) denote the set of real numbers and the set of positive real numbers, respectively. \(\mathcal {P}_+\) denotes the Euclidean projector onto the non-negative orthant, and \(\mathcal {P}_{\tilde{v}}\) denotes the normalization projection that bounds the norms of column units. We denote the inner product of two matrices \(\mathbf {A}\) and \(\mathbf {B}\) as \(\langle \mathbf {A}, \mathbf {B}\rangle = \operatorname{tr}(\mathbf {A}^\top \mathbf {B})\) . Finally, the probability density function of a zero-mean, unit-variance Gaussian random variable is given by \(f(x)=\frac{1}{\sqrt {2\pi }}\exp (\frac{-x^2}{2})\) .

2.2 Definitions and Preliminaries

In this section, we review some definitions and propositions that are necessary for our problem formulation.
Definition 2.1 (( \(\epsilon ,\delta\) )-DP [17])
Let \(\mathcal {D}\) be the domain of the datasets consisting of N records, and \(\mathbb {D},\ \mathbb {D}^\prime\) be neighbouring datasets that differ by only one record. Then, an algorithm f : \(\mathcal {D} \mapsto \mathcal {T}\) provides ( \(\epsilon ,\delta\) )-differential privacy (( \(\epsilon ,\delta\) )-DP) if \(\Pr (f(\mathbb {D})\in \mathcal {S}) \leqslant \delta + e^\epsilon \Pr (f(\mathbb {D}^\prime)\in \mathcal {S})\) for all measurable \(\mathcal {S} \subseteq \mathcal {T}\) and for all neighbouring datasets \(\mathbb {D},\ \mathbb {D}^\prime \in \mathcal {D}\) .
Here, \(\epsilon ,\delta \geqslant 0\) are the privacy parameters and determine how the algorithm balances providing utility and preserving privacy. The parameter \(\epsilon\) indicates how much the algorithm’s output can deviate in probability when we replace one single person’s data with another. The parameter \(\delta\) indicates the probability that the privacy mechanism fails to give the guarantee of \(\epsilon\) . Intuitively, higher privacy results in poorer utility. That is, smaller \(\epsilon\) and \(\delta\) guarantee more privacy, but lower utility. There are several mechanisms to implement differential privacy: the Gaussian mechanism [17], the Laplace mechanism [18], random sampling, and the exponential mechanism [45] are well-known. For the additive noise mechanisms, the noise standard deviation is scaled by the privacy budget and the sensitivity of the function.
Definition 2.2 ( \(\mathcal {L}_2\) Sensitivity [17])
If \(\mathbb {D}\) and \(\mathbb {D}^\prime\) are neighbouring datasets differing by only one record, then the \(\mathcal {L}_2\) - sensitivity of vector valued function \(f(\mathbb {D})\) is
\begin{align*} \Delta := \max _{\mathbb {D},\ \mathbb {D}^\prime } \left\Vert f(\mathbb {D})-f(\mathbb {D}^\prime)\right\Vert _{2}. \end{align*}
The \(\mathcal {L}_2\) sensitivity of a function upper-bounds how much the function can change if one sample at the input is changed. Consequently, it dictates the amount of randomness/perturbation needed at the function’s output to guarantee differential privacy. In other words, it captures the worst-case change in the output caused by changing any one user’s data.
Definition 2.3 (Gaussian Mechanism [18]).
Let \(f : \mathcal {D} \mapsto \mathcal {R}^D\) be an arbitrary function with \(\mathcal {L}_{2}\) sensitivity \(\Delta\) . The Gaussian mechanism with parameter \(\tau\) adds noise from \(\mathcal {N}(0,\tau ^2)\) to each of the D entries of the output and satisfies \((\epsilon ,\delta)\) differential privacy for \(\epsilon \in (0,1)\) and \(\delta \in (0,1)\) , if \(\tau \ge \frac{\Delta}{\epsilon } \sqrt {2\log \frac{1.25}{\delta }}\) .
Here, \((\epsilon ,\delta)\) -differential privacy is guaranteed by adding noise drawn from the \(\mathcal {N}(0,\tau ^2)\) distribution. Note that there are an infinite number of combinations of \((\epsilon ,\delta)\) for a given \(\tau ^2\) .
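For concreteness, a minimal sketch of calibrating and applying this mechanism (function and variable names are our own):

    import numpy as np

    def gaussian_mechanism(value, sensitivity, eps, delta, rng=np.random.default_rng()):
        # Smallest tau satisfying Definition 2.3 (valid for 0 < eps < 1).
        tau = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
        return value + rng.normal(0.0, tau, size=np.shape(value))

    # e.g., releasing the mean of N records in [0, 1]: the L2 sensitivity is 1/N
    x = np.random.default_rng(1).uniform(0.0, 1.0, 1000)
    print(gaussian_mechanism(x.mean(), 1.0 / x.size, eps=0.5, delta=1e-5))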
Definition 2.4 (RDP [46]).
A randomized algorithm f : \(\mathcal {D} \mapsto \mathcal {T}\) is \((\alpha ,\epsilon _{r})\) -Rényi differentially private if, for any adjacent \(\mathbb {D},\ \mathbb {D}^\prime \in \mathcal {D}\) , the following holds: \(D_{\alpha }(f(\mathbb {D})\ || f(\mathbb {D}^\prime)) \le \epsilon _{r}\) . Here, \(D_{\alpha }(P(x)||Q(x))=\frac{1}{\alpha -1}\log \mathbb{E}_{x \sim Q} (\frac{P(x)}{Q(x)})^{\alpha }\) , and \(P(x)\) and \(Q(x)\) are probability density functions defined on \(\mathcal {T}\) .
We use RDP for calculating the total privacy budget spent in our multi-stage algorithm. RDP provides a much simpler rule for calculating overall privacy risk \(\epsilon\) that is shown to be tight [46].
Proposition 1 (From RDP to DP [46]).
If f is an \((\alpha ,\epsilon _r)\) -RDP mechanism, it also satisfies \((\epsilon _r+\frac{\log 1/\delta }{\alpha -1},\delta)\) -differential privacy for any \(0\lt \delta \lt 1\) .
Proposition 2 (Composition of RDP [46]).
Let \(f_1:\mathcal {D}\rightarrow \mathcal {R}_1\) be \((\alpha ,\epsilon _1)\) -RDP and \(f_2:{\mathcal {R}_1} \times \mathcal {D} \rightarrow \mathcal {R}_2\) be \((\alpha ,\epsilon _2)\) -RDP, then the mechanism defined as \((X_1,X_2)\) , where \(X_1 \sim f_1(\mathcal {D})\) and \(X_2 \sim f_2(X_1,\mathcal {D})\) satisfies \((\alpha ,\epsilon _1+\epsilon _2)\) -RDP.
Proposition 3 (RDP and Gaussian Mechanism [46]).
If f has \(\mathcal {L}_2\) sensitivity 1, then the Gaussian mechanism \(\mathcal {G}_{\sigma }f(\mathcal {D})=f(\mathcal {D})+e\) where \(e \sim \mathcal {N}(0,\sigma ^2)\) satisfies \((\alpha ,\frac{\alpha }{2\sigma ^2})\) -RDP. Additionally, a composition of T such Gaussian mechanisms satisfies \((\alpha ,\frac{\alpha T}{2 \sigma ^2})\) -RDP.
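Taken together, Propositions 1-3 give a simple recipe for tracking the privacy of T composed Gaussian mechanisms: accumulate the RDP budget and convert to \((\epsilon ,\delta)\) -DP at the best order \(\alpha\) . A minimal numerical sketch (the grid over \(\alpha\) is our own choice):

    import numpy as np

    def eps_after_T_rounds(sigma, T, delta):
        # sigma is the noise std normalized by the L2 sensitivity (tau / Delta).
        alphas = np.linspace(1.01, 500.0, 5000)
        rdp = alphas * T / (2.0 * sigma ** 2)             # Propositions 2 and 3
        eps = rdp + np.log(1.0 / delta) / (alphas - 1.0)  # Proposition 1
        return eps.min()                                  # best conversion over alpha

    print(eps_after_T_rounds(sigma=10.0, T=100, delta=1e-5))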

2.3 NMF Problem Formulation

As mentioned before, we adopt the outlier model presented in Reference [71]. We first reformulate the robust NMF model [71] for an offline setting. More specifically, our problem formulation and optimization technique involve updating the coefficient matrix \(\mathbf {H}\) , the outlier matrix \(\mathbf {R}\) , and the basis/dictionary matrix \(\mathbf {W}\) based on the entire data matrix \(\mathbf {V}\) , rather than a single user’s data entry. Since we want to guarantee differential privacy for the estimated basis matrix \(\mathbf {W}\) , which involves the addition of Gaussian noise scaled to the sensitivity of some function, it is desirable to have a small sensitivity to attain better “utility,” or closeness to the true basis matrix. As shown in Section 3.2, operating on the entire dataset provides a significant advantage in this regard.
Most of the existing algorithms discussed in Section 1 follow the two-block coordinate descent framework shown in Algorithm 1. First, the dictionary matrix \(\mathbf {W}\) is updated, while keeping the coefficient \(\mathbf {H}\) fixed. Then, the updated dictionary matrix \(\mathbf {W}\) is used to update the coefficient \(\mathbf {H}\) . The process continues until some convergence criteria are met. As mentioned before, we follow the robust NMF approach based on PGD that considers the presence of outliers [71]. We intend to decompose the data matrix \(\mathbf {V}\) as \(\mathbf {V} \approx \mathbf {WH} + \mathbf {R}\) , and release the differentially-private basis matrix \(\mathbf {W}\) , where \(\mathbf {W}\) and \(\mathbf {H}\) are defined as before. The matrix \(\mathbf {R} = [\mathbf {r}_{1},\mathbf {r}_{2},\ldots ,\mathbf {r}_{N}] \in \mathcal {R}^{D\times N}\) is a matrix containing the outliers of the data. Thus, the NMF optimization problem is reformulated as
\begin{equation} \min _{\mathbf {W} \in \mathcal {C}, \mathbf {H}\geqslant 0,\mathbf {R} \in \mathcal {Q}} \frac{1}{N} \left(\frac{1}{2} \left\Vert \mathbf {V}-\mathbf {W} \mathbf {H}-\mathbf {R}\right\Vert _F^2 + \lambda \left\Vert \mathbf {R}\right\Vert _{1,1} \right), \end{equation}
(1)
where \(\mathcal {C} \subseteq \mathcal {R}^{D\times K}\) is the constraint set for updating \(\mathbf {W}\) , \(\mathcal {Q}\) is the feasible set for \(\mathbf {R}\) , and \(\lambda \geqslant 0\) is the regularization parameter. Intuitively, the outlier matrix \(\mathbf {R}\) is sparse in nature and contains smaller values compared to the noise-free original data entries. The sparsity of the outlier matrix \(\mathbf {R}\) is enforced by the choice of \(\mathcal {L}_{1,1}\) -norm regularization [71]. The level of sparsity is controlled by the hyper-parameter \(\lambda\) . Evidently, a grid search can be performed to select optimal hyper-parameters for our proposed differentially-private NMF. In Section 4, we discuss hyper-parameter tuning and matrix initialization methods for better optimization. We note that several robust NMF algorithms exist in the literature that consider the effect of outliers in the dataset, and utilize different norms in the objective function (for example, \(\mathcal {L}_2\) , \(\mathcal {L}_{2,1}\) , or \(\mathcal {L}_{1,2}\) norms instead of the conventional \(\mathcal {L}_2\) norm) to reduce the effect of the outliers on \(\mathbf {W}\) and \(\mathbf {H}\) [24, 36, 59]. In this article, we adopted the robust NMF model as in Reference [71]. Evidently, exploring other robust NMF settings in the context of differential privacy, and comparing their performances, would be an interesting direction of research.
Note that the robust NMF algorithm cannot guarantee exact recovery of the original data matrix \(\mathbf {V}\) , as the loss function (1) is not convex in nature [71]. However, it can be shown empirically that the estimated basis matrix can be meaningful, and the difference between the data matrix \(\mathbf {V}\) and the reconstructed matrix is negligible and sparse [21, 59, 71]. Now, following the two-block coordinate descent method mentioned in Algorithm 1, we reformulate our optimization steps as follows:
Update the coefficient matrix \(\mathbf {H}_t\) and outlier matrix \(\mathbf {R}_t\) based on the previous dictionary matrix \(\mathbf {W}_{t-1}\) . Here, the optimization can be done as
\begin{equation} (\mathbf {H}_t,\mathbf {R}_t)= \mathop{\arg\!\min} _{\mathbf {H}\geqslant 0,\mathbf {R} \in \mathcal {Q}} \hspace{5.0pt} L(\mathbf {V},\mathbf {W}_{t-1},\mathbf {H},\mathbf {R}), \end{equation}
(2)
where the loss function L is
\begin{equation} L(\mathbf {V},\mathbf {W},\mathbf {H},\mathbf {R}) \triangleq \frac{1}{N} \left(\frac{1}{2} \left\Vert \mathbf {V}- \mathbf {WH}- \mathbf {R}\right\Vert _F^2 +\lambda \left\Vert \mathbf {R}\right\Vert _{1,1}\right). \end{equation}
(3)
Here, the constraint set \(\mathcal {Q} \triangleq \lbrace \mathbf {r} \in \mathcal {R}^D : \left\Vert \mathbf {r}\right\Vert _{\infty } \leqslant M\rbrace\) keeps the entries of the outlier matrix \(\mathbf {R}\) uniformly bounded. The value of M depends on the dataset and the noise distribution. For example, for grayscale image data with pixel values in \(\lbrace 0, 1, \ldots , 2^b-1\rbrace\) , we can set \(M=2^b-1\) , where b is the number of bits used to represent the pixel value.
After optimization with respect to \(\mathbf {H}_t\) and \(\mathbf {R_t}\) , compute \(\mathbf {W}_t\) by minimizing the same loss function (3):
\begin{equation} \mathbf {W}_t= \mathop{\arg\!\min} _{\mathbf {W} \in \mathcal {C}} \hspace{5.0pt} L(\mathbf {V},\mathbf {W},\mathbf {H}_{t},\mathbf {R}_{t}). \end{equation}
(4)
Here, the set \(\mathcal {C}\) constrains the columns of dictionary matrix \(\mathbf {W}\) into a unit (non-negative) \(\mathcal {L}_2\) ball to keep the matrix entries bounded [42, 43].
PGD Solver for (2). The optimization problem in (2) can be solved by alternating between the following two steps for a given \(\mathbf {W}\) [71]:
\begin{equation} \mathbf {H}^+ := \mathop{\arg\!\min} _{\mathbf {H^{\prime }} \geqslant 0}Q (\mathbf {H^{\prime }} | \mathbf {H}), \end{equation}
(5)
\begin{equation} \mathbf {R}^+= \mathop{\arg\!\min} _{\mathbf {R^{\prime }} \in \mathcal {Q}} \frac{1}{N}\Big (\frac{1}{2} \left\Vert \mathbf {V}- \mathbf {WH}- \mathbf {R^{\prime }}\right\Vert _F^2 + \lambda \left\Vert \mathbf {R^{\prime }}\right\Vert _{1,1} \Big), \end{equation}
(6)
where
\begin{align} &Q(\mathbf {H^{\prime }}|\mathbf {H}) \triangleq q(\mathbf {H}) + \langle \bigtriangledown q (\mathbf {H}), \mathbf {H^{\prime }}-\mathbf {H}\rangle + \frac{1}{2\eta N} \left\Vert \mathbf {H^{\prime }}-\mathbf {H}\right\Vert _2^2, \\ &q(\mathbf {H}) \triangleq \frac{1}{2N} \left\Vert \mathbf {V}-\mathbf {WH}-\mathbf {R}\right\Vert _F^2. \end{align}
(7)
Here, \(\eta\) is the fixed step size. Both (5) and (6) have closed-form solutions [71]. For (5), the solution can be expressed as follows:
\begin{equation} \mathbf {H}^+ := \mathcal {P}_{+}(\mathbf {H}-\eta _{H} \bigtriangledown q(\mathbf {H})). \end{equation}
(8)
Here, we replace the step size \(\eta\) with \(\eta _{H}\) to distinguish it from the step size in the dictionary matrix \(\mathbf {W}\) update. We use a fixed step size to ease the hyper-parameter setting throughout the iteration process. \(\bigtriangledown q(\mathbf {H})\) is the gradient obtained by taking the partial derivative of (7) with respect to \(\mathbf {H}\) :
\begin{equation} \triangledown q(\mathbf {H})=\frac{1}{N} \big (\mathbf {W}^{\top }\mathbf {WH}-\mathbf {W}^{\top }(\mathbf {V}- \mathbf {R})\big). \end{equation}
(9)
For (6), the solution is straightforward [71]:
\begin{equation} \mathbf {R}^+=S_{\lambda ,M} (\mathbf {V}-\mathbf {WH^+}). \end{equation}
(10)
Here, \(S_{\lambda ,M}(\mathbf {X})\) performs element-wise thresholding as
\begin{equation} S_{\lambda ,M}(\mathbf {X})_{ij} := {\left\lbrace \begin{array}{ll} 0, & |x_{ij}| \lt \lambda \\ x_{ij}-\textrm {sgn}(x_{ij})\lambda , & \lambda \leqslant |x_{ij}| \leqslant \lambda +M \\ \textrm {sgn}(x_{ij})M, & |x_{ij}| \gt \lambda +M. \end{array}\right.} \end{equation}
In the tth iteration, we update the matrices \(\mathbf {H}\) and \(\mathbf {R}\) according to (8) and (10) until some stopping criterion is met [71].
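A minimal sketch of these two update steps, transcribing (8)-(10) directly (variable names are our own):

    import numpy as np

    def update_H(H, W, V, R, eta_H):
        # One PGD step for H: gradient from (9), projection P_+ as in (8).
        N = V.shape[1]
        grad = (W.T @ (W @ H) - W.T @ (V - R)) / N
        return np.maximum(H - eta_H * grad, 0.0)

    def update_R(W, H, V, lam, M):
        # Closed-form R update (10): element-wise soft-thresholding S_{lambda, M}.
        X = V - W @ H
        return np.sign(X) * np.minimum(np.maximum(np.abs(X) - lam, 0.0), M)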
PGD Solver for (4). We can rewrite (4) as follows
\begin{equation} \mathbf {W}_t= \mathop{\arg\!\min} _{\mathbf {W} \in \mathcal {C}} \frac{1}{2}\operatorname{tr}(\mathbf {W}^\top \mathbf {W}\mathbf {A}_t)-\operatorname{tr}(\mathbf {W}^\top \mathbf {B}_{t}), \end{equation}
where \(\mathbf {A}_t \triangleq \frac{1}{N}\mathbf {H}_{t}\mathbf {H}_{t}^\top\) , and \(\mathbf {B}_t \triangleq \frac{1}{N} (\mathbf {V-R_{t}})\mathbf {H}_{t}^\top\) . To calculate the gradient value, we define a new function \(f_{W}(\mathbf {W})\) :
\begin{equation} f_{W}(\mathbf {W})= \frac{1}{2}\operatorname{tr}(\mathbf {W}^\top \mathbf {W}\mathbf {A}_t)-\operatorname{tr}(\mathbf {W}^\top \mathbf {B}_{t}). \end{equation}
(11)
Taking the partial derivative of (11) with respect to \(\mathbf {W}\) , we find the following expression:
\begin{equation} \triangledown f_{W}(\mathbf {W})=\mathbf {WA}_t-\mathbf {B}_t. \end{equation}
(12)
We use some standard matrix properties and lemmas [53] to derive the expressions in (9) and (12). We write the update equation ensuring the constraints on \(\mathbf {W}\) :
\begin{equation} \mathbf {W}^+ = \mathcal {P}_{\mathcal {C}}(\mathbf {W}-\eta _{W}\triangledown f_{W}(\mathbf {W})). \end{equation}
(13)
Here, \(\eta _{W}\) is the step size to update \(\mathbf {W}\) . As in (8), we use a fixed step size. The constraint projection function keeps the columns of \(\mathbf {W}\) in the unit \(\mathcal {L}_2\) ball. In (13), each column is updated as follows:
\begin{equation} \mathbf {w}_k^{+}:=\frac{\mathcal {P}_{+}\big (\mathbf {w}_k-\eta _{W}\triangledown f_{W}(\mathbf {w}_k)\big)}{\max \Big (1,\left\Vert \mathcal {P}_{+}\big (\mathbf {w}_k-\eta _{W}\triangledown f_{W}(\mathbf {w}_k)\big)_{}\right\Vert _{2}\Big)}, \forall k \in [K]. \end{equation}
As with the PGD solver for (2), we use (13) to update the dictionary matrix \(\mathbf {W}\) . The steps stated above are repeated until we reach the optimum loss point of (1). Evidently, estimating \(\mathbf {W}\) depends on the potentially sensitive data. As we intend to release the estimated \(\mathbf {W}\) while guaranteeing formal privacy, we need to modify the aforementioned steps. In the following section, we discuss our proposed differentially-private NMF algorithm in detail and show how to preserve and control privacy leakage for NMF.
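For reference, the non-private update (13) can be sketched as follows (variable names are our own):

    import numpy as np

    def project_C(W):
        # Project each column onto the non-negative unit L2 ball, as in (13).
        W = np.maximum(W, 0.0)
        return W / np.maximum(np.linalg.norm(W, axis=0), 1.0)

    def update_W(W, A, B, eta_W):
        # One PGD step for W using the gradient from (12).
        return project_C(W - eta_W * (W @ A - B))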

3 Proposed Differentially-private NMF

In this section, we describe our proposed algorithm to estimate \(\mathbf {W}_{private}\) in detail. We first specify our privacy model, containing a data curator and a data analyst. We then analytically show how to incorporate differential privacy for estimating the basis matrix \(\mathbf {W}_{private}\) , such that it provides a close approximation to the true basis matrix \(\mathbf {W}\) . Next, we analyze our multi-round gradient descent algorithm with RDP [46] to attain a tight characterization of the privacy loss encountered over the course of the gradient descent. Finally, we discuss the convergence of the algorithm.

3.1 Separate Private and Non-private Training Nodes

As mentioned earlier, we are interested in estimating the dictionary matrix \(\mathbf {W}\) , which captures the meaningful population features, and then releasing it publicly. More specifically, our goal is to release a differentially-private estimate of the true dictionary matrix \(\mathbf {W}\) , considering the presence of the outliers. Intuitively, we need to process the user-sensitive data and the population feature data separately. Figure 1 shows the basic system diagram, which serves these purposes. Here, we have two data processing centers. One is the “Data Curator”, which holds the sensitive data: the data matrix \(\mathbf {V}\) , the coefficient matrix \(\mathbf {H}\) , and the outlier matrix \(\mathbf {R}\) . It updates the matrices \(\mathbf {H}\) and \(\mathbf {R}\) as described in Section 2.3. It also calculates the gradient \(\bigtriangledown f_{W}(\mathbf {W})\) of the loss function \(f_{W}(\mathbf {W})\) and then follows the DP protocol to add noise to the gradient before sending it to the other data processing center. Note that the variance of the noise depends on the privacy budget and the \(\mathcal {L}_2\) sensitivity of the gradient \(\bigtriangledown f_{W}(\mathbf {W})\) .
Fig. 1. Schematic diagram of privacy-preserving NMF.
The other data processing center is the “Data Analyst”, to which the noisy gradient is sent. This center updates the dictionary matrix \(\mathbf {W}\) with the received differentially-private estimate of the gradient and sends the updated \(\mathbf {W}\) back to the Data Curator. This cycle continues until some stopping criterion is met. At the end, we get a differentially-private dictionary matrix \(\mathbf {W}_\textrm {private}\) at the Data Analyst.
Note that the Data Curator can choose from several mechanisms to ensure DP: the Laplace mechanism, the Gaussian mechanism, and the Exponential mechanism, among others. Although the Laplace mechanism offers pure \(\epsilon\) -DP, it adds much more noise than the Gaussian mechanism because the \(\mathcal {L}_1\) sensitivity depends on the data dimension. Additionally, the Exponential mechanism is more suitable for discrete-valued algorithms, and oftentimes it involves a probability density that is very difficult to sample from [14]. Therefore, we employ the Gaussian mechanism for this work. Now, the Data Curator has several options for employing the Gaussian mechanism to ensure the privacy of the gradient \(\bigtriangledown f_{W}(\mathbf {W})\) : objective perturbation, input perturbation, and output perturbation, to name a few [13]. Objective perturbation involves adding certain noise to the objective function. However, it requires the objective function to satisfy some stringent conditions that are difficult to ensure in our NMF problem. Output perturbation involves adding certain noise to the output of the algorithm (the basis matrix \(\mathbf {W}\) in our case). However, in this approach, analyzing the effect of changing one data point on \(\mathbf {W}\) is non-trivial due to the intricate relation of \(\mathbf {W}\) , \(\mathbf {H}\) , and \(\mathbf {R}\) with the loss function. Therefore, we choose a variant of input perturbation: noisy gradient descent, where we estimate the gradient of the loss function satisfying DP [5, 61].

3.2 Estimating Differentially-Private \(\mathbf {W}_\textrm {private}\)

In this section, we show the necessary proofs and derivations related to estimating \(\mathbf {W}_\textrm {private}\) . According to Gaussian Mechanism [18], we need to calculate the \(\mathcal {L}_2\) sensitivity of the gradient \(\bigtriangledown f_{W}(\mathbf {W})\) of the loss function \(f_{W}(\mathbf {W})\) with respect to \(\mathbf {W}\) for estimating the dictionary matrix \(\mathbf {W}\) satisfying DP. We have, \(\triangledown f_{W}(\mathbf {W})=\mathbf {WA}-\mathbf {B}\) . As \(\triangledown f_{W}(\mathbf {W})\) depends on the statistics matrices \(\mathbf {A}\) and \(\mathbf {B}\) , we can calculate their \(\mathcal {L}_2\) sensitivities separately.
In the following, we first calculate the \(\mathcal {L}_2\) sensitivity of the matrix \(\mathbf {A}=\frac{1}{N} \mathbf {H}\mathbf {H}^\top\) . Consider two neighboring data matrices \(\mathbf {V}\) and \(\mathbf {V}^{\prime }\) and their corresponding coefficient matrices \(\mathbf {H}\) and \(\mathbf {H}^{\prime }\) . By definition, they differ in only one user’s data (e.g., the Nth column). We calculate the \(\mathcal {L}_2\) sensitivity \(\Delta_{A}\) as
\begin{align} \Delta_{A} &= \max \frac{1}{N}\left\Vert \mathbf {H} \mathbf {H}^\top -\mathbf {H}^{\prime } \mathbf {H}^{\prime \top }\right\Vert _F \\ &=\max \frac{1}{N}\left\Vert \mathbf {h}_N \mathbf {h}^\top _N -\mathbf {h}^{\prime }_N \mathbf {h}^{\prime \top }_N\right\Vert _F \\ &\leqslant \max \frac{1}{N}\big (\left\Vert \mathbf {h}_N \mathbf {h}_N^\top \right\Vert _F+\left\Vert \mathbf {h}^{\prime }_N \mathbf {h}^{\prime \top }_N\right\Vert _F\big) \\ &\leqslant \max \frac{1}{N} \big (\left\Vert \mathbf {h}_N\right\Vert _2\left\Vert \mathbf {h}_N^\top \right\Vert _2 + \left\Vert \mathbf {h}^{\prime }_N\right\Vert _2\left\Vert \mathbf {h}^{\prime \top }_N\right\Vert _2 \big)\\ &=\max \frac{1}{N} \big (\left\Vert \mathbf {h}_N\right\Vert _2^2+\left\Vert \mathbf {h}^{\prime }_N\right\Vert _2^2\big)\\ &=\frac{2}{N}\times {\big (\text{max}\ \mathcal {L}_2\ \text{norm of the columns of } \mathbf {H}\big)}^2 , \end{align}
(14)
where we have used the triangle inequality and \(\left\Vert \mathbf {a}\mathbf {b}^\top \right\Vert _F \leqslant \left\Vert \mathbf {a}\right\Vert _2 \left\Vert \mathbf {b}\right\Vert _2\) . To get a bounded value in (14), we need the maximum \(\mathcal {L}_2\) norm of the columns of \(\mathbf {H}\) . One way to achieve this is by normalizing each column of \(\mathbf {H}\) to unit \(\mathcal {L}_2\) norm during the updates, which gives \(\Delta_{A} = \frac{2}{N}\) . Next, we calculate the \(\mathcal {L}_2\) sensitivity \(\Delta_{B}\) of \(\mathbf {B} = \frac{1}{N}(\mathbf {V-R})\mathbf {H}^\top\) as follows:
\begin{align} \Delta_{B} &= \max \frac{1}{N}\left\Vert (\mathbf {V}-\mathbf {R}) \mathbf {H}^\top -(\mathbf {V}^{\prime }-\mathbf {R}) \mathbf {H}^{\prime \top }\right\Vert _F\\ &=\max \frac{1}{N}\left\Vert (\mathbf {v}_N-\mathbf {r}_N) \mathbf {h}^\top _N-(\mathbf {v}_N^{\prime }-\mathbf {r}_N) \mathbf {h}^{\prime \top }_N\right\Vert _F\\ &\leqslant \max \frac{1}{N}\big (\left\Vert (\mathbf {v}_N-\mathbf {r}_N) \mathbf {h}^\top _N\right\Vert _F+\left\Vert (\mathbf {v}_N^{\prime }-\mathbf {r}_N)\mathbf {h}^{\prime \top }_N\right\Vert _F\big)\\ &\leqslant \max \frac{1}{N} \big (\left\Vert \mathbf {v}_N-\mathbf {r}_N\right\Vert _2\left\Vert \mathbf {h}^\top _N\right\Vert _2 +\left\Vert \mathbf {v}_N^{\prime }-\mathbf {r}_N\right\Vert _2\left\Vert \mathbf {h}^{\prime \top }_N\right\Vert _2 \big)\\ &=\max \frac{1}{N} \big (\left\Vert \mathbf {v}_N-\mathbf {r}_N\right\Vert _2+\left\Vert \mathbf {v}_N^{\prime }-\mathbf {r}_N\right\Vert _2\big)\\ &=\max \frac{2}{N} \left\Vert \mathbf {v}_N-\mathbf {r}_N\right\Vert _2\\ &\leqslant \max \frac{2}{N} \big (\left\Vert \mathbf {v}_N\right\Vert _2+\left\Vert \mathbf {r}_N\right\Vert _2\big), \end{align}
(15)
where the second-last equality follows from \(\max _{\forall n}\left\Vert \mathbf {h}_n\right\Vert _2=1\) , and the last inequality follows from \(\left\Vert \mathbf {a}-\mathbf {b}\right\Vert \leqslant \left\Vert \mathbf {a}\right\Vert +\left\Vert \mathbf {b}\right\Vert\) . To get a constant \(\mathcal {L}_2\) sensitivity value in (15), we can normalize the columns of \(\mathbf {V}\) and \(\mathbf {R}\) to have unit \(\mathcal {L}_2\) -norm. Thus, we have \(\Delta_{B} = \frac{4}{N}\) . Note that, if we do not model the outliers explicitly, we would have \(\Delta_{B}=\frac{2}{N}\) .
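As a sanity check, the bound \(\Delta_{A} \leqslant \frac{2}{N}\) is easy to verify numerically on random column-normalized coefficient matrices; a small sketch with toy dimensions of our own:

    import numpy as np

    rng = np.random.default_rng(0)
    K, N = 10, 500
    H = np.abs(rng.standard_normal((K, N)))
    H /= np.linalg.norm(H, axis=0, keepdims=True)     # unit L2-norm columns
    Hp = H.copy()
    h_new = np.abs(rng.standard_normal(K))
    Hp[:, -1] = h_new / np.linalg.norm(h_new)         # replace one user's column
    emp = np.linalg.norm(H @ H.T - Hp @ Hp.T, "fro") / N
    print(emp, 2.0 / N)                               # empirical value vs. the bound 2/N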
Now that we have computed the \(\mathcal {L}_2\) sensitivities \(\Delta_{A}\) and \(\Delta_{B}\) , we can generate the noise-perturbed statistics \(\overline{\mathbf {A}}, \overline{\mathbf {B}}\) following the Gaussian mechanism [18]. Using these values, we can compute the differentially-private estimate of the true gradient as \(\overline{\triangledown f_{W}(\mathbf {W})} = \mathbf {W}\overline{\mathbf {A}} - \overline{\mathbf {B}}\) and update our dictionary matrix \(\mathbf {W}\) . At the end of the optimization, we obtain the differentially-private dictionary matrix \(\mathbf {W}_\textrm {private}\) . The detailed step-by-step description of our proposed method is summarized in Algorithm 2.
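A minimal sketch of one such noisy update, assuming the columns of \(\mathbf {H}\) , \(\mathbf {V}\) , and \(\mathbf {R}\) are normalized as discussed above (variable names are our own, and the full bookkeeping of Algorithm 2 is not reproduced):

    import numpy as np

    def dp_update_W(W, H, V, R, eps, delta, eta_W, rng=np.random.default_rng()):
        # One noisy-gradient step for W: the core of Algorithm 2.
        N = V.shape[1]
        dA, dB = 2.0 / N, 4.0 / N                      # sensitivities from (14) and (15)
        tauA = dA * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
        tauB = dB * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
        A_bar = H @ H.T / N + rng.normal(0.0, tauA, (W.shape[1], W.shape[1]))
        B_bar = (V - R) @ H.T / N + rng.normal(0.0, tauB, W.shape)
        W = np.maximum(W - eta_W * (W @ A_bar - B_bar), 0.0)   # gradient step + P_+
        return W / np.maximum(np.linalg.norm(W, axis=0), 1.0)  # unit L2 ball projection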
Note that, one can avoid analytically computing the sensitivities of \(\mathbf {A}_t\) and \(\mathbf {B}_t\) , and compute \(\overline{\triangledown f_{W}(\mathbf {W})}\) by employing some form of norm clipping [1, 4]. Norm clipping can be performed after noise addition to achieve a bounded gradient, as well. However, limiting the gradient norm has two opposing effects [1]: clipping destroys the unbiasedness of the gradient estimate, and the average clipped gradient may point in a sub-optimal direction. On the other hand, increasing the norm clipping bound results in addition of more noise to the gradients. To remedy this, the authors in Reference [4] proposed an adaptive norm clipping approach in the context of user-level DP guarantee in the federated learning setting for neural network training. We refer the reader to Reference [4] for further details.
Theorem 3.1 (Privacy of Algorithm 2).
Consider Algorithm 2 in the setting of Section 2.3. Then Algorithm 2 releases \((\frac{T\alpha _\mathrm{opt}}{2}(\frac{\Delta^2_A}{\tau ^2_A}+\frac{\Delta^2_B}{\tau ^2_B})+\frac{\log \frac{1}{\delta }}{\alpha _\mathrm{opt}-1},\delta)\) differentially-private basis matrix \(\mathbf {W}_\textrm {private}\) for any \(0\lt \delta \lt 1\) after T iterations, where \(\alpha _{ \mathrm{opt}}=1+\sqrt {\frac{2}{T(\frac{\Delta^2_A}{\tau ^2_A}+\frac{\Delta^2_B}{\tau ^2_B}) } \log \frac{1}{\delta }}\) , \(\tau _A = \frac{\Delta_A}{\epsilon } \sqrt {2\log \frac{1.25}{\delta }}\) and \(\tau _B = \frac{\Delta_B}{\epsilon } \sqrt {2\log \frac{1.25}{\delta }}\) .
Proof.
For the proof of Theorem 3.1, we analyze Algorithm 2 using the Gaussian mechanism [18] and RDP [46]. Recall that, at each iteration t, we compute the DP estimate of the gradient \(\overline{\triangledown f_{W}(\mathbf {W})}\) using two differentially-private matrices \(\overline{\mathbf {A}_t}\) and \(\overline{\mathbf {B}_t}\) . According to Proposition 3, the computations of these matrices satisfy \((\alpha , \frac{\alpha }{2(\frac{\tau _A}{\Delta_A})^2})\) -RDP and \((\alpha , \frac{\alpha }{2(\frac{\tau _B}{\Delta_B})^2})\) -RDP, respectively. According to Proposition 2, each step of Algorithm 2 is \((\alpha ,\frac{\alpha }{2}(\frac{\Delta^2_A}{\tau ^2_A}+\frac{\Delta^2_B}{\tau ^2_B}))\) -RDP. If the number of required iterations for reaching convergence is T, then under T-fold composition of RDP, the overall algorithm is \((\alpha ,\frac{T\alpha }{2}(\frac{\Delta^2_A}{\tau ^2_A}+\frac{\Delta^2_B}{\tau ^2_B}))\) -RDP. From Proposition 1, we have that the algorithm satisfies \((\frac{T\alpha }{2}(\frac{\Delta^2_A}{\tau ^2_A}+\frac{\Delta^2_B}{\tau ^2_B}) + \frac{\log \frac{1}{\delta }}{\alpha -1},\delta)\) -DP for any \(0\lt \delta \lt 1\) . For a given \(\delta\) , we can compute the optimal \(\alpha\) (the one that provides the smallest overall \(\epsilon\) ) as \(\alpha _{ \mathrm{opt}} = 1+\sqrt {\frac{2}{T(\frac{\Delta^2_A}{\tau ^2_A}+\frac{\Delta^2_B}{\tau ^2_B}) } \log \frac{1}{\delta }}\) . This \(\alpha _{ \mathrm{opt}}\) provides the lowest privacy risk. Therefore, Algorithm 2 releases a \((\frac{T\alpha _ \mathrm{opt}}{2}(\frac{\Delta^2_A}{\tau ^2_A}+\frac{\Delta^2_B}{\tau ^2_B}) + \frac{\log \frac{1}{\delta }}{\alpha _ \mathrm{opt}-1},\delta)\) differentially-private basis matrix \(\mathbf {W}_\textrm {private}\) for any \(0\lt \delta \lt 1\) . □
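Since \(\tau _A\) and \(\tau _B\) are both calibrated with the same per-iteration \(\epsilon\) and \(\delta\) , the expression in Theorem 3.1 reduces to a simple closed form; a minimal sketch of computing the overall budget (this is how curves like those in Figures 3 and 6 can be generated):

    import numpy as np

    def eps_overall(T, eps, delta):
        # With tau = Delta * sqrt(2 log(1.25/delta)) / eps for both A and B,
        # Delta_A^2/tau_A^2 + Delta_B^2/tau_B^2 = eps^2 / log(1.25/delta).
        c = eps ** 2 / np.log(1.25 / delta)
        alpha_opt = 1.0 + np.sqrt(2.0 * np.log(1.0 / delta) / (T * c))
        return T * alpha_opt * c / 2.0 + np.log(1.0 / delta) / (alpha_opt - 1.0)

    print(eps_overall(T=200, eps=0.5, delta=1e-5))    # overall budget after 200 iterations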
Convergence of Algorithm 2. We note that the objective function is non-increasing under the two update steps (i.e., steps 3 and 6 of Algorithm 2), and the objective function is bounded below. Additionally, the noisy gradient estimate \(\overline{\triangledown f_{W}(\mathbf {W})}\) contains zero-mean noise. Although this does not provide guarantees on the excess error, the estimate of the gradient converges in expectation to the true gradient [9]. Therefore, the optimization \(\mathbf {W}_t := \text{argmin}_{\mathbf {W} \in \mathcal {C}} \frac{1}{2}\operatorname{tr}(\mathbf {W}^\top \mathbf {W}\overline{\mathbf {A}_t})-\operatorname{tr}(\mathbf {W}^\top \overline{\mathbf {B}_{t}})\) should converge given sufficient steps and a small-enough step size. We show this in detail in Section 4. Note that if the batch size is too small, the noise can be too high for the algorithm to converge [61]. Since the total additive noise variance is quite small in our setting, the convergence rate is fast. We defer a theoretical analysis of the intricate relation between the excess error and the privacy parameters to future work. We refer the reader to Bassily et al. [5] for further details.

4 Experimental Results

In this section, we show the effectiveness of our proposed algorithm by comparing its utility with those of an existing algorithm and the non-private algorithm. We define the Objective Value in (16) as the performance index, as it quantifies how well the algorithm can decompose the data matrix. The Objective Value is calculated using the following formula:
\begin{equation} \text{Objective Value} = \frac{1}{2N} \left\Vert \mathbf {V}^o-\mathbf {WH}\right\Vert ^2_F . \end{equation}
(16)
Here, \(\mathbf {V}^o\) is the noise-free clean dataset. We choose this performance index because some NMF algorithms do not model the outliers explicitly as we do, and we intend to evaluate how well the estimated basis and coefficient matrices explain the clean data \(\mathbf {V}^o\) . Another reason is that we cannot use Mean Absolute Error (MAE) or Root-Mean-Squared Error (RMSE) on the estimated basis matrix: oftentimes we do not know the true basis matrix (although we can assume the basis matrix estimated by a non-private NMF algorithm to be the true one), and, more importantly, the order of the columns of the basis matrix is undefined. Therefore, we use the Objective Value defined above as the performance index.
To evaluate our proposed method, we use six real datasets: four text datasets and two face image datasets. The sizes of the datasets and the corresponding latent dimensions K are listed in Table 1. For the selection of the latent dimension K, we describe one procedure in Appendix A, where we calculate the topic coherence score of the Guardian News Articles dataset to select the optimum topic number. Similar procedures can be followed for the rest of the datasets. For all of the experiments, we used a fixed \(\delta =10^{-5}\) .
Table 1. Summary of Datasets

Dataset                   N      D      K
Guardian News Articles    4551   10285  8
UCI-news-aggregator       3000   93     7
RCV1                      9625   2999   4
TDT2                      9394   3677   30
YaleB                     2414   1024   38
CBCL                      2429   361    50

4.1 Hyper-Parameter Selection and Initialization

We followed the hyper-parameter settings of Reference [71], except for the learning rate. The authors in Reference [71] suggested using the same learning rate for updating both the dictionary matrix \(\mathbf {W}\) and the coefficient matrix \(\mathbf {H}\) for ease of parameter tuning. The optimization process requires a different configuration in our privacy-preserving implementation. As discussed in Section 3, we add Gaussian noise in the gradient calculation \(\triangledown f_{W}(\mathbf {W})\) for updating \(\mathbf {W}\) , whereas the matrix \(\mathbf {H}\) is updated with its unperturbed gradient \(\triangledown q(\mathbf {H})\) . This procedure for updating \(\mathbf {W}\) and \(\mathbf {H}\) warrants different learning rates. Note that, to get faster convergence, we need to choose proper learning rates for updating \(\mathbf {W}\) and \(\mathbf {H}\) . To that end, we can employ a grid search to find the optimum learning rates for \(\mathbf {H}\) and \(\mathbf {W}\) . The time required for such a search depends on the dataset and the search space. With sub-optimal learning rates, the convergence may be delayed, but the excess error of the proposed algorithm should be approximately the same, given that sufficient time is allowed and a small-enough learning rate is used. In our experiments, we performed a grid search and found that the learning rate for updating \(\mathbf {W}\) should be much lower (about \(1/10000\) to \(1/20000\) times, depending on the dataset) than that for updating \(\mathbf {H}\) . To initialize \(\mathbf {W}_0\) and \(\mathbf {H}_0\) , we followed the Non-negative Double Singular Value Decomposition (NNDSVD) [10] approach, which performs much better than random initialization in minimizing the objective value in (16). For sparseness, we initialized \(\mathbf {R}_0\) with all zeros. Lastly, we normalized each data sample of the data matrix \(\mathbf {V}\) so that it has unit maximum entry.
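For reference, a minimal sketch of the NNDSVD initialization [10] (our own transcription of the standard algorithm), together with the per-sample normalization described above:

    import numpy as np

    def nndsvd_init(V, K, tol=1e-12):
        # NNDSVD [10]: split each singular pair into its non-negative parts
        # and keep the dominant part.
        U, S, Vt = np.linalg.svd(V, full_matrices=False)
        W = np.zeros((V.shape[0], K))
        H = np.zeros((K, V.shape[1]))
        W[:, 0] = np.sqrt(S[0]) * np.abs(U[:, 0])
        H[0, :] = np.sqrt(S[0]) * np.abs(Vt[0, :])
        for j in range(1, K):
            u, v = U[:, j], Vt[j, :]
            up, un = np.maximum(u, 0), np.maximum(-u, 0)
            vp, vn = np.maximum(v, 0), np.maximum(-v, 0)
            p = np.linalg.norm(up) * np.linalg.norm(vp)
            n = np.linalg.norm(un) * np.linalg.norm(vn)
            x, y, s = (up, vp, p) if p >= n else (un, vn, n)
            W[:, j] = np.sqrt(S[j] * s) * x / max(np.linalg.norm(x), tol)
            H[j, :] = np.sqrt(S[j] * s) * y / max(np.linalg.norm(y), tol)
        return W, H

    # Normalize each data sample to unit maximum entry; R_0 starts as all zeros.
    # V = V / np.maximum(V.max(axis=0, keepdims=True), 1e-12)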

4.2 Text Datasets

With the text datasets, we consider the topic modeling problem. Topic modeling identifies the abstract topics present in a set of documents. A document generally contains multiple topics in different proportions. Suppose a document is “80% about religion and 20% about politics”; this implies that about 80 percent of its words are related to religion, and the remaining 20 percent are related to politics. Topic modeling tries to find the unique “cluster of words” that indicates one single topic in the document collection. We discuss this further, along with its implementation, in Appendix A.
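Concretely, once the basis matrix \(\mathbf {W}\) is estimated, the “cluster of words” for each topic can be read off its columns; a small sketch (here vocab is a hypothetical list mapping rows of \(\mathbf {V}\) to terms):

    import numpy as np

    def top_words(W, vocab, n=10):
        # Each column of W is one topic; take its n highest-scoring terms.
        order = np.argsort(W, axis=0)[::-1]
        return [[vocab[i] for i in order[:n, k]] for k in range(W.shape[1])]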
We show the performance of our proposed method with the learning curve plotted against a variable privacy budget \(\epsilon\) in each iteration. We demonstrate the topic word distribution for different \(\epsilon\) . We also calculate the overall privacy budget, \(\epsilon _\textrm {overall}\) , using RDP and compare the objective value with that of the non-private mechanism to select the best \(\epsilon\) for each iteration. For two datasets, we compared the topic word distribution and average coherence score between the non-private NMF algorithm and our proposed algorithm. Short descriptions of each of the text datasets are given below:
Guardian News Articles. This dataset consists of 4551 news articles collected from the Guardian News API in 2006. The detailed mechanism of collecting the articles is described in [52]. Here, we extract eight distinct topics ( \(K=8\) ) from the dataset and show the high-scoring word distributions corresponding to the topics.
UCI News Aggregator Dataset. This dataset [16] is formed by collecting news from a web aggregator from 10-March-2014 to 10-August-2014. There is a total of 422,937 news articles in the dataset. The topics covered in the news articles are entertainment, science and technology, business, and health. We take 750 news articles from each category and apply the NMF algorithm.
RCV1. The Reuters Corpus Volume I (RCV1) [39] archive consists of over 800,000 manually categorized news wires. For our experiment, we use a subset containing 9,625 documents and randomly select approximately \(\frac{1}{10}\) th of the features.
TDT2. The TDT2 [15] text database contains 9,394 documents, forming a \(9394 \times 36771\) data matrix. Here, we randomly select \(\frac{1}{10}\) th of the features.
Utility Comparison on Text Datasets. In Figure 2, we show the utility gap between the private and non-private mechanisms’ outputs for the text datasets. Except for the Guardian News Articles dataset, there exists only a very small utility gap, and this gap decreases further for a higher privacy budget \(\epsilon\) . Comparing the convergence speed, our proposed algorithm needs more epochs to reach the optimal point. This is because we have to keep the learning rate lower in private learning. Additionally, the noisy gradient descent somewhat perturbs the true gradient structure, which results in comparatively slower convergence.
Fig. 2. Utility comparison on text datasets.
Average Topic Coherence Score. In Table 2, we show the average topic coherence score comparison for the Guardian News Articles and UCI-news-aggregator datasets. The topic coherence score quantitatively measures how well the topic modeling algorithm performs. In short, topic coherence attempts to capture human interpretability in a mathematical framework by measuring the semantic link between high-scoring words. The greater the coherence score, the more human-interpretable the “cluster of words” is. It is also used to tune the hyper-parameter K. We discuss topic coherence in more detail in Appendix A.
Table 2. Comparison of Average Coherence Scores

\(\epsilon\)   Guardian News Articles   UCI-news-aggregator
0.5           0.4511                   0.9996975
0.6           0.4529                   0.9996895
0.7           0.4535                   0.9996849
0.8           0.4547                   0.9996836
0.9           0.4543                   0.9996832
0.999         0.4540                   0.9996839
Non-Private   0.4658                   0.9996816
As Table 2 shows, the topic coherence score increases with increasing privacy budget \(\epsilon\) for the Guardian News Articles dataset. Recall that a bigger \(\epsilon\) means less noise and a bigger privacy risk. For the UCI-news-aggregator dataset, all the scores are very close to each other across privacy budgets. This is also evident from Figure 2(b), as the objective values of the non-private and our proposed algorithm are very close.
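The exact coherence computation we use is detailed in Appendix A. Purely as an illustration, topics can be scored with gensim's built-in coherence model (assumption: gensim's "c_v" measure, which may differ from the variant used in this article):

    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel

    # tokenized_docs: list of token lists; topics: top words per topic (e.g., from W)
    dictionary = Dictionary(tokenized_docs)
    cm = CoherenceModel(topics=topics, texts=tokenized_docs,
                        dictionary=dictionary, coherence="c_v")
    print(cm.get_coherence())        # average coherence across topics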
\(\epsilon _\textrm {overall}\) on Text Datasets. In Figure 3, we show \(\epsilon _\textrm {overall}\) and the utility gap between the private and non-private mechanisms after reaching the optimum solution. These plots can be read as follows: if the user has a specific privacy budget \(\epsilon _\textrm {overall}\) , they can use the plot to see the best utility level (objective value) attainable for a particular dataset, and then choose the corresponding per-iteration \(\epsilon\) from the x-axis to use in our Algorithm 2. On the other hand, if the user has to reach a particular utility level (objective value), they can use the plot to see the overall privacy budget they need to spend, and read the required \(\epsilon\) for each iteration of Algorithm 2 from the x-axis.
Fig. 3. \(\epsilon _\textrm {overall}\) and objective values on text datasets.
Topic Word Comparison. In Figure 4, we show the topic word comparison between the non-private algorithm and our proposed algorithm. We choose two datasets for this comparison: the UCI-news-aggregator and the Guardian News Articles. The topic word distributions look very similar for our proposed algorithm and the non-private algorithm, which is consistent with the high similarity of the coherence scores (Table 2).
Fig. 4. Topic word comparison.

4.3 Face Image Datasets

With the face image datasets, we learn the fundamental facial features that explain all the face images of the dataset. The details of the implementation are discussed in Appendix B. As with the text datasets, we show the \(\epsilon _\textrm {overall}\) and utility comparison to select \(\epsilon\) for each iteration. Additionally, as the effect of outliers is more visible and more common in practice for image data, we also conduct experiments on a dataset with added outliers. A short description of each face image dataset is given below:
YaleB. There are 2,414 face images of size \(32 \times 32\) in YaleB [68]. The sample images are captured under different lighting conditions. There are 38 subjects (male and female) in the dataset.
CBCL. The CBCL [65] database contains 2,429 hand-aligned frontal face images of size \(19\times 19\) . Each face image is preprocessed: the grayscale intensities of each image are first linearly adjusted so that the pixel mean and standard deviation equal 0.25, and then clipped to the range [0, 1].
Utility Comparison on Face Image Datasets. In Figure 5, we show the learning curves of the non-private algorithm and our proposed algorithm for the face image datasets. The characteristics of the resulting plots are quite similar to those of the text datasets. We note that the utility gaps are small, and become even smaller for higher privacy budgets \(\epsilon\) .
Fig. 5. Utility comparison on face image datasets.
\(\epsilon _\textrm {overall}\) on Face Image Datasets. Figure 6 shows \(\epsilon _\textrm {overall}\) and the utility gap after reaching the optimum loss point. Based on the privacy requirements and the tolerable utility gap, one can select how much privacy budget \(\epsilon\) to spend in each iteration.
Fig. 6. \(\epsilon _\textrm {overall}\) and objective value on face image datasets.
Basic Representation Comparison. Figure 7 shows how the algorithm learns the fundamental representation of a face image under DP and non-private conditions. In the case of (\(\epsilon =0.5,\delta =10^{-5}\))-DP, the facial features are quite noisy compared to the non-private method. However, they can still produce interpretable human facial features. Moreover, increasing the privacy budget \(\epsilon\) produces representative images much closer to those from the non-private algorithm.
Datasets with Outliers. We also performed experiments to demonstrate the effect of outliers. We contaminated the YaleB dataset with outliers following [71]: we randomly chose 70% of the user data (columns) from the dataset, and then contaminated 10% of the pixels of each chosen image with noise drawn from the uniform distribution \(\mathcal {U}[-1,1]\); a code sketch of this procedure follows the figures below. The simulation results are shown in Figure 8. In Section 3, we showed mathematically that the \(\mathcal {L}_2\) sensitivity of \(\mathbf {B}\) is doubled when we allow updating the outlier matrix \(\mathbf {R}\). A higher \(\mathcal {L}_2\) sensitivity requires more noise to ensure (\(\epsilon ,\delta\))-DP. Thus, the basic representation in Figure 8(d) is noisier than that in Figure 7(b).
Fig. 7. Basic representation comparison.
Fig. 8. YaleB dataset with outliers.
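For concreteness, the contamination procedure used for the noisy YaleB experiment can be sketched as follows, under our reading of [71]; the function name and default arguments are illustrative:

```python
import numpy as np

def contaminate(V, col_frac=0.7, pix_frac=0.1, seed=0):
    """Return a copy of the D x N data matrix V in which pix_frac of the
    pixels in a random col_frac of the columns (images) receive additive
    noise drawn from U[-1, 1]."""
    rng = np.random.default_rng(seed)
    V = V.copy()
    D, N = V.shape
    cols = rng.choice(N, size=int(col_frac * N), replace=False)
    for j in cols:
        pix = rng.choice(D, size=int(pix_frac * D), replace=False)
        V[pix, j] += rng.uniform(-1.0, 1.0, size=pix.size)
    return V
```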

4.4 Effect of Outliers on Estimating \(\mathbf {W}\)

In this section, we investigate the effect of outliers present in the data on the estimated basis \(\mathbf {W}\). This also justifies our choice of modeling the outliers explicitly in \(\mathbf {R}\). We show two experiments: (i) comparing the topic coherence scores (see Appendix A.4) on the Guardian News Articles dataset; and (ii) comparing the objective value (16) on a noisy synthetic dataset. In both cases, the results show that modeling \(\mathbf {R}\) improves the performance of the optimization.
Guardian News Articles Dataset with Outliers. We followed a procedure similar to the one used to create the noisy YaleB dataset, except that this time the noise was added to the normalized word embeddings. We then applied our proposed DP NMF algorithm to the noisy text dataset and compared the topic coherence scores (see Appendix A.4). Note that the topic coherence scores are measured based only on the basis matrix \(\mathbf {W}\): the better \(\mathbf {W}\) represents the data, the higher the scores. The experimental results are shown in Table 3. Our proposed DP NMF algorithm was employed with privacy parameters \((\epsilon =0.5,\delta =10^{-5})\) in three experimental setups. We observe that when we do not explicitly model the outliers, the coherence score drops by 3.17% on the dataset with outliers. Modeling the outlier matrix yields a 1.51% better coherence score than not modeling it.
Outliers in data | R modelled | Topic Coherence
No | — | 0.4511
Yes | No | 0.4368
Yes | Yes | 0.4434
Table 3. Experiment on the Guardian News Articles Dataset with Outliers
Synthetic Dataset with Outliers. We followed the steps in Reference [71] to create a synthetic dataset with outliers: we first created a clean synthetic dataset, and then added outliers and observation noise to it. The clean dataset is created as follows: we generate the matrices \(\mathbf {W}^o\) and \(\mathbf {H}^o\) with each entry drawn from the half-normal distribution \(\mathcal {HN}(0, \frac{1}{\sqrt {K}})\). Then, we set \(\mathbf {V}^o = \mathcal {P}_{\tilde{V}}(\mathbf {W}^o \mathbf {H}^o)\), where the normalization projection \(\mathcal {P}_{\tilde{V}}\) keeps the norms of the columns of \(\mathbf {V}^o\) bounded. To create the outlier matrix \(\mathbf {R}^o \in \mathcal {R}^{D \times N}\), we choose 70% of its columns at random, set 10% of the elements of the selected columns to values drawn from the uniform distribution \(\mathcal {U}[-1,1]\), and set the remaining entries to 0. With that, the noisy synthetic dataset becomes \(\mathbf {V} = \mathcal {P}_{\tilde{V}}(\mathbf {V}^o+\mathbf {R}^o+\mathbf {N})\), where \(\mathbf {N} \in \mathcal {R}^{D \times N}\) is the observation noise with entries drawn from the standard normal distribution \(\mathcal {N}(0,1)\). We applied our proposed DP NMF algorithm with privacy parameters \((\epsilon =0.8,\delta =10^{-5})\); a generation sketch follows Table 4. We show the results of this experiment in Table 4. We observe a 5.67% improvement in the objective value (16) when we explicitly model the outlier matrix \(\mathbf {R}\), compared with no outlier modeling.
Outliers in data | R modelled | Objective Value
Yes | No | 0.3381
Yes | Yes | 0.3189
Table 4. Experiment on the Synthetic Dataset with Outliers
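The synthetic data generation above can be sketched as follows; our reading of the half-normal sampling and of the normalization projection \(\mathcal {P}_{\tilde{V}}\) (as column-norm clipping) is an assumption, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 100, 500, 8

# Half-normal factors: absolute values of N(0, 1/sqrt(K)) entries.
W0 = np.abs(rng.normal(0.0, 1.0 / np.sqrt(K), size=(D, K)))
H0 = np.abs(rng.normal(0.0, 1.0 / np.sqrt(K), size=(K, N)))

def project_columns(V, max_norm=1.0):
    # One plausible reading of the normalization projection: rescale any
    # column whose L2 norm exceeds max_norm back onto the norm ball.
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    return V * np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))

V0 = project_columns(W0 @ H0)

# Outliers: U[-1, 1] entries in 10% of the pixels of 70% of the columns.
R0 = np.zeros((D, N))
cols = rng.choice(N, size=int(0.7 * N), replace=False)
for j in cols:
    pix = rng.choice(D, size=int(0.1 * D), replace=False)
    R0[pix, j] = rng.uniform(-1.0, 1.0, size=pix.size)

# Observation noise and the final projection.
Noise = rng.normal(0.0, 1.0, size=(D, N))
V = project_columns(V0 + R0 + Noise)
```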

4.5 Comparison with Existing Differentially-Private NMF Algorithm [57]

The only existing differentially private NMF (DPNMF) algorithm [57] is proposed for recommender systems using the Laplace mechanism [18], and does not consider the presence of outliers. Our proposed algorithm, on the other hand, works for any part-based NMF learning task. Another crucial distinction of our proposed method from DPNMF [57] is the choice of mechanism: DPNMF employs the Laplace mechanism, which adds noise from the Laplace distribution scaled to the \(\mathcal {L}_1\) sensitivity of the associated function. For any D-dimensional vector \(\mathbf {x}\), we always have \(\left\Vert \mathbf {x}\right\Vert _2 \le \left\Vert \mathbf {x}\right\Vert _1 \le \sqrt {D}\left\Vert \mathbf {x}\right\Vert _2\) (the upper bound follows from the Cauchy–Schwarz inequality). Therefore, the \(\mathcal {L}_1\) sensitivity can be much larger than the \(\mathcal {L}_2\) sensitivity of the same function in most practical scenarios, especially for datasets with large ambient dimensions. Since the Gaussian mechanism [18] adds noise from the Gaussian distribution scaled to the \(\mathcal {L}_2\) sensitivity of the associated function, it typically performs much better than the Laplace mechanism in terms of utility [12, 13, 48, 61]. As mentioned before, we use the Gaussian mechanism in our proposed algorithm, and show that it performs very well on several real datasets. Additionally, the authors in Reference [57] did not analytically calculate the \(\mathcal {L}_1\) sensitivity of the desired objective function, whereas we analytically derived the \(\mathcal {L}_2\) sensitivity of the gradient needed for the differentially private gradient estimate. This paves the way for further analysis and performance enhancement through adaptive norm-clipping [4]. Finally, the existing DPNMF [57] used the alternating non-negative least squares (ANLS) algorithm [25], which has no mechanism to account for the effect of outliers. Nevertheless, we performed experiments comparing the performance of DPNMF [57] with our proposed algorithm on the MovieLens 1M dataset [27]. For this experiment, we used the RMSE as the performance measure, as in Reference [57]:
\begin{equation} \text{RMSE} = \frac{1}{\sqrt {N}} \left\Vert \mathbf {V}-\hat{\mathbf {V}} \odot \mathbf {X}\right\Vert _2, \end{equation}
where \(\mathbf {V} \in \mathcal {R}_{+}^{U \times I}\) is the user-item matrix; \(\hat{\mathbf {V}}\) and \(\mathbf {X}\), of the same shape as \(\mathbf {V}\), are the predicted user-item matrix and the observation mask, respectively; and N is the number of observed user-item pairs. Each entry \(v_{ui}\) of the user-item matrix denotes the rating user \(u \in U\) gives to item \(i \in I\). Each entry \(x_{ui}\) of the observation mask is set to 1 if user u has rated item i, and to 0 otherwise. The simulation results are provided in Table 5, which shows that our proposed method comfortably outperforms the existing DPNMF. We believe the primary reason is our use of the Gaussian mechanism (with tight privacy accounting via the RDP analysis) as opposed to the Laplace mechanism. Essentially, we propose using the relaxed \((\epsilon , \delta)\)-DP definition instead of pure \(\epsilon\)-DP in exchange for much better utility.
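Reading the matrix norm above as the entrywise (Frobenius) norm, as is usual for RMSE, a minimal sketch of this metric is:

```python
import numpy as np

def masked_rmse(V, V_hat, X):
    """RMSE over observed user-item pairs: V is the rating matrix,
    V_hat the prediction, X the 0/1 observation mask, and N the number
    of observed pairs."""
    N = X.sum()
    return np.linalg.norm((V - V_hat) * X) / np.sqrt(N)
```

Masking the residual coincides with the equation above when the unobserved entries of \(\mathbf {V}\) are stored as zeros.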
Privacy Budget | Proposed Method | DPNMF [57]
Non-private | 0.0624 | 0.8568
\(\epsilon =0.3\) | 0.0675 | 1.1953
\(\epsilon =0.5\) | 0.0648 | 1.0785
\(\epsilon =0.7\) | 0.0640 | 1.0155
Table 5. RMSE Comparison on MovieLens 1M Dataset

5 Conclusion and Future Works

We proposed a novel privacy-preserving NMF algorithm that learns a differentially-private basis matrix \(\mathbf {W}_\textrm {private}\) from the data matrix \(\mathbf {V}\) while explicitly modeling the outliers in the data, and offers utility close to that of the non-private setting. To ensure differential privacy, we used the Gaussian mechanism: we introduced randomness into the computation pipeline in the form of carefully tuned noise. To this end, we presented a novel analysis for calculating the \(\mathcal {L}_2\) sensitivity of the functions involved in estimating the gradient under DP. Recognizing the multi-shot nature of the proposed algorithm, we analyzed it with RDP to achieve a tight characterization of the privacy spent. The proposed algorithm offers excellent results, closely approximating those of the non-private algorithm. The overall privacy budget \(\epsilon _\textrm {overall}\), along with the utility plots, provides a means to select the privacy budget per iteration, given the required utility level or privacy constraints. We empirically showed the effectiveness of our proposed method on six real datasets. All the results show that our proposed algorithm can provide utility close to that of the non-private algorithm for some parameter choices. For the text datasets, we quantitatively measured the performance using the topic coherence score. For the face image datasets, we compared facial feature construction, and observed that the learned features capture the fundamental parts of the human face as well as those of the non-private method. An interesting direction for future work is further reducing the noise effect on the face image datasets. Another is extending the method to federated and decentralized learning, where NMF must be performed on private, decentralized data matrices.

Acknowledgments

The authors would like to express their sincere gratitude towards the authorities of the Department of Electrical and Electronic Engineering and Bangladesh University of Engineering and Technology (BUET) for providing constant support throughout this research.


A Topic Modeling and Its Implementation

A.1 Topic Modeling

Topic modeling is a statistical model used in statistics and natural language processing to discover the abstract “topics” that occur in a collection of documents. Topic modeling is a common text-mining technique to uncover hidden semantic structures within a text body. Given that a document is about a specific topic, one would expect certain words to appear more or less frequently: “dog” and “bone” will appear more frequently in documents about dogs, “cat” and “meow” will appear more frequently in documents about cats, and “the” and “is” will appear roughly equally in both. A document typically addresses multiple topics in varying proportions; therefore, a document that is 10% about cats and 90% about dogs would likely contain nine times as many dog words as cat words. The “topics” generated by topic modeling techniques are word clusters. A topic model encapsulates this intuition in a mathematical framework, enabling the examination of a set of documents and the identification of their potential topics and balance of topics based on the statistics of their words.
Topic models are also known as probabilistic topic models, which refer to statistical algorithms for identifying the latent semantic structures of a large text body. In this information age, the amount of written material we encounter daily exceeds our capacity to process it. Extensive collections of unstructured text bodies can be organized and comprehended better with topic models. Originally developed as a tool for text mining, topic models have been used to detect instructive structures in data, including genetic information, images, and networks. They have applications in fields such as computer vision [11] and bioinformatics [8].
A text document consists of one or more topics. Mathematically, each text document is formed as a linear combination of topics. Each topic conveys its semantic meaning through a representative "cluster of words". In topic modeling, we find these representative word clusters from the corpus, along with coefficient weights that indicate how strongly each topic is present in each document. In the context of NMF decomposition, the data matrix \(\mathbf {V}\) contains the text documents, the dictionary matrix \(\mathbf {W}\) contains the topic words, and the coefficient matrix \(\mathbf {H}\) contains the coefficient weights.

A.2 Text Pre-Processing

The first step before applying any topic modeling algorithm is text preprocessing. Raw documents contain textual words, which need to be converted into numerical form. To do so, we split each document into words and assign a unique token to each word.
Suppose we have five documents in our corpus, with a total of 100 unique words across all documents. After tokenizing the corpus, we form a matrix \(\mathbf {A}\) of size \(100 \times 5\), where the column index indicates the document number and the row index indicates the specific term word. If \(a_{ij}=3\) with \(i=50, j=4\), it means that the 50th term word is used 3 times in the 4th document.
However, further preprocessing is needed for effective topic modeling. Intuitively, not all words in a document contribute equally to determining its topic category. Moreover, some high-frequency words (such as articles and auxiliary verbs) and very low-frequency words do not indicate any specific topic. We therefore remove these uninformative words and up-weight the important topic words. The former is done easily by simple text preprocessing, such as maximum-frequency filtering, minimum-frequency filtering, and stop-word filtering (using a predefined list of high-frequency English words). To give extra weight to important topic words, we use term frequency-inverse document frequency (TF-IDF) [6, 58].
TF-IDF quantifies how "important" a term word is for determining a specific document's topic category. The calculation involves two factors. The first is the frequency of the term word in the specific document. The second is the term word's frequency across all documents, which penalizes term words that are common to all documents. The TF-IDF weight is computed as follows [55]:
\begin{equation} w_d=f_{w,d} \times \log \left(\frac{|D|}{f_{w,D}}\right), \tag{17} \end{equation}
where \(f_{w,d}\) is the frequency of word \(w\) in document \(d\), \(|D|\) is the total number of documents in the corpus, and \(f_{w,D}\) is the number of documents that contain \(w\).
In our implementation, we use scikit-learn's TfidfVectorizer() with default settings to produce the TF-IDF-normalized document-term matrix. In our notation, this yields a matrix of size \(D \times N\), where D is the number of term words after processing the text data and N is the total number of documents in the corpus. The corpus of raw documents is then ready for the NMF algorithm; a minimal sketch is given below.
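The following sketch illustrates this preprocessing step (the toy corpus and filtering parameters are illustrative; note that TfidfVectorizer() returns a documents-by-terms matrix, so we transpose to match our \(D \times N\) convention):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs chase cats", "the dog barked"]

# stop_words, max_df, and min_df implement the stop-word and
# frequency filtering described above.
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.95, min_df=1)
A = vectorizer.fit_transform(docs)     # shape: N documents x D terms

V = A.T                                # shape: D x N, non-negative
terms = vectorizer.get_feature_names_out()
```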

A.3 Implementation through NMF

If we apply the NMF algorithm to the TF-IDF-normalized document-term matrix, we get two matrices: the dictionary matrix \(\mathbf {W}\) (\(D \times K\)) and the coefficient matrix \(\mathbf {H}\) (\(K \times N\)), where K is the number of topics present in the corpus. The \(\mathbf {W}\) matrix gives the topic word distributions: we can identify each topic category by observing its highest entry values, as sketched below.
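A minimal, non-private illustration of this step with scikit-learn's NMF (our proposed private algorithm would replace the factorization); it assumes V and terms from the sketch in A.2, computed on a real corpus so that \(K \le \min (D, N)\):

```python
import numpy as np
from sklearn.decomposition import NMF

K = 8  # number of topics; must not exceed min(D, N)
model = NMF(n_components=K, init="nndsvd", random_state=0)
Ht = model.fit_transform(V.T)   # (N x K): the transpose of our H
W = model.components_.T         # (D x K): our dictionary matrix

def top_words(W, terms, n_top=10):
    # The largest entries of each column of W identify that topic's words.
    for k in range(W.shape[1]):
        idx = np.argsort(W[:, k])[::-1][:n_top]
        print(f"Topic {k}: " + ", ".join(terms[i] for i in idx))

top_words(W, terms)
```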
Let us revisit the experimental implementation of the Guardian News Articles dataset discussed in Section 4. We get the topic word distribution shown in Figure 9, where the eight distinct topic word distributions indicate eight distinct topics. Applying NMF to topic modeling requires one important hyper-parameter selection: the topic number K. Although we assume the topic number \(K=8\) before applying NMF here, there is a systematic way to tune this hyper-parameter: measuring the topic coherence score.
Fig. 9. Topic word distribution of Guardian News Articles.

A.4 Topic Coherence

Topic Coherence measures the semantic similarity between high-scoring words. These measurements help to differentiate between semantically interpretable topics and those that are statistical artifacts of inference. There are numerous methods for measuring coherence, such as NPMI, UMass, TC-W2V, and so on [52]. Our study uses the TC-W2V method to measure the coherence score.
Figure 10 shows the comparison of mean coherence scores with respect to the number of topics. This figure suggests selecting \(K=8\) as the topic number to get the optimum human interpretability from the topic word distribution.
Fig. 10. Mean coherence vs. number of topics.
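One common formulation of TC-W2V [52] averages the pairwise cosine similarity of word2vec embeddings of each topic's top words; the following is a sketch under that assumption, where the embeddings (e.g., from a trained gensim model) are passed in as a word-to-vector mapping:

```python
import numpy as np
from itertools import combinations

def tc_w2v(topics, embed):
    """Mean pairwise cosine similarity of each topic's top words,
    averaged over topics. `topics` is a list of word lists and `embed`
    maps a word to its embedding vector (all words assumed in-vocabulary)."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    scores = []
    for words in topics:
        pairs = [cos(embed[w1], embed[w2]) for w1, w2 in combinations(words, 2)]
        scores.append(np.mean(pairs))
    return float(np.mean(scores))
```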

B Extracting Local Facial Feature by NMF

B.1 Interpret the Decomposition of Face Image

The extraction of local facial features is one of the most elegant and practical applications of NMF. The basic idea behind this decomposition is to extract fundamental local facial features so that any image in the dataset can be reconstructed using appropriate weights. To extract these fundamental features, one first constructs the data matrix \(\mathbf {V}\), where each column of \(\mathbf {V}\) contains the pixel values of an individual image. Applying the NMF algorithm to \(\mathbf {V}\) yields two matrices: \(\mathbf {W}\), which stores the facial features, and \(\mathbf {H}\), which stores the coefficients. Figure 11 shows a visual representation of the result.
Fig. 11. Face image decomposition.
Suppose we want to reconstruct an image of the dataset, say the 100th column image in matrix \(\mathbf {V}\). We take all the facial features from matrix \(\mathbf {W}\) and the 100th column vector of matrix \(\mathbf {H}\) as coefficients, multiply the features by the coefficients, and sum the results. This reconstructs the 100th column image of matrix \(\mathbf {V}\) with little loss; a minimal sketch follows.
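The reconstruction can be sketched as follows, assuming \(\mathbf {W}\) (\(D \times K\)) and \(\mathbf {H}\) (\(K \times N\)) were obtained by factorizing the face data matrix, and that the images are \(32 \times 32\) as in YaleB:

```python
import numpy as np

# Reconstruct the 100th column image (index j = 99) as a linear
# combination of the facial features in W weighted by H[:, j].
j = 99
v_rec = W @ H[:, j]            # shape: (D,)
face = v_rec.reshape(32, 32)   # reshape to the original image size
```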

References

[1]
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 308–318.
[2]
Afsana Afrin, Mahit Kumar Paul, and A. H. M. Sarowar Sattar. 2019. Privacy preserving data mining using non-negative matrix factorization and singular value decomposition. In Proceedings of the 2019 4th International Conference on Electrical Information and Communication Technology (EICT). IEEE, 1–6.
[3]
Zahir Alsulaimawi. 2020. A non-negative matrix factorization framework for privacy-preserving and federated learning. In Proceedings of the 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP). IEEE, 1–6.
[4]
Galen Andrew, Om Thakkar, Brendan McMahan, and Swaroop Ramaswamy. 2021. Differentially private learning with adaptive clipping. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan (Eds.). Vol. 34, 17455–17466. https://proceedings.neurips.cc/paper_files/paper/2021/file/91cff01af640a24e7f9f7a5ab407889f-Paper.pdf
[5]
Raef Bassily, Adam Smith, and Abhradeep Thakurta. 2014. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. IEEE, 464–473.
[6]
Adam Berger, Rich Caruana, David Cohn, Dayne Freitag, and Vibhu Mittal. 2000. Bridging the lexical chasm: Statistical approaches to answer-finding. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 192–199.
[7]
Arnaud Berlioz, Arik Friedman, Mohamed Ali Kaafar, Roksana Boreli, and Shlomo Berkovsky. 2015. Applying differential privacy to matrix factorization. In Proceedings of the 9th ACM Conference on Recommender Systems. 107–114.
[8]
David M. Blei. 2012. Probabilistic topic models. Communications of the ACM 55, 4 (2012), 77–84.
[9]
Léon Bottou. 1998. On-line learning in neural networks. In On-line Learning in Neural Networks, David Saad (Ed.). Cambridge University Press, New York, NY, USA, Chapter On-line Learning and Stochastic Approximations, 9–42. Retrieved from http://dl.acm.org/citation.cfm?id=304710.304720
[10]
Christos Boutsidis and Efstratios Gallopoulos. 2008. SVD based initialization: A head start for nonnegative matrix factorization. Pattern Recognition 41, 4 (2008), 1350–1362.
[11]
Liangliang Cao and Li Fei-Fei. 2007. Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision. IEEE, 1–8.
[12]
Kamalika Chaudhuri and Claire Monteleoni. 2008. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (Eds.). Vol. 21, Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2008/file/8065d07da4a77621450aa84fee5656d9-Paper.pdf
[13]
Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. 2011. Differentially private empirical risk minimization. Journal of Machine Learning Research 12, 3 (2011), 1069–1109.
[14]
Kamalika Chaudhuri, Anand D. Sarwate, and Kaushik Sinha. 2013. A near-optimal algorithm for differentially-private principal components. Journal of Machine Learning Research 14, 1 (2013), 2905–2943.
[15]
David Graff, Chris Cieri, Stephanie Strassel, and Nii Martey. 1999. The TDT-2 text and speech corpus. In Proceedings of the DARPA Broadcast News Workshop. 57–60.
[16]
Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml
[17]
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Theory of Cryptography Conference. Springer, 265–284.
[18]
Cynthia Dwork and Aaron Roth. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 3-4 (2014), 211–407.
[19]
Cynthia Dwork and Adam Smith. 2010. Differential privacy for statistics: What we know and what we want to learn. Journal of Privacy and Confidentiality 1, 2 (2010), 135–154.
[20]
Beyza Ermiş and A. Taylan Cemgil. 2020. Data sharing via differentially private coupled matrix factorization. ACM Transactions on Knowledge Discovery from Data (TKDD) 14, 3 (2020), 1–27.
[21]
Cédric Févotte and Nicolas Dobigeon. 2015. Nonlinear hyperspectral unmixing with robust nonnegative matrix factorization. IEEE Transactions on Image Processing 24, 12 (2015), 4810–4819.
[22]
Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 2015. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 1322–1333.
[23]
Anmin Fu, Zhenzhu Chen, Yi Mu, Willy Susilo, Yinxia Sun, and Jie Wu. 2019. Cloud-based outsourcing for enabling privacy-preserving large-scale non-negative matrix factorization. IEEE Transactions on Services Computing 15, 1 (2019), 266–278.
[24]
Hongchang Gao, Feiping Nie, Weidong Cai, and Heng Huang. 2015. Robust capped norm nonnegative matrix factorization: Capped norm nmf. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 871–880.
[25]
Nicolas Gillis. 2011. Nonnegative matrix factorization: Complexity, algorithms and applications. Unpublished Doctoral Dissertation, Université Catholique de Louvain. Louvain-La-Neuve: CORE (2011).
[26]
Edward F. Gonzalez and Yin Zhang. 2005. Accelerating the lee-seung algorithm for non-negative matrix factorization. Dept. Comput. & Appl. Math., Rice Univ., Houston, TX, Tech. Rep. TR-05-02 (2005).
[27]
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems 5, 4 (2015), Article 19.
[28]
Nils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V. Pearson, Dietrich A. Stephan, Stanley F. Nelson, and David W. Craig. 2008. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics 4, 8 (2008), e1000167.
[29]
Patrik O. Hoyer. 2004. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5, 9 (2004), 1457–1469.
[30]
Hongsheng Hu, Zoran Salcic, Lichao Sun, Gillian Dobbie, Philip S. Yu, and Xuyun Zhang. 2022. Membership inference attacks on machine learning: A survey. ACM Computing Surveys (CSUR) 54, 11s (2022), 1–37.
[31]
Jingyu Hua, Chang Xia, and Sheng Zhong. 2015. Differentially private matrix factorization. In Proceedings of the International Joint Conference on Artificial Intelligence.
[32]
Zhanglong Ji, Zachary C. Lipton, and Charles Elkan. 2014. Differential privacy and machine learning: A survey and review. arXiv:1412.7584. Retrieved from https://arxiv.org/abs/1412.7584
[33]
Hyunsoo Kim and Haesun Park. 2008. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications 30, 2 (2008), 713–730.
[34]
Jingu Kim, Yunlong He, and Haesun Park. 2014. Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework. Journal of Global Optimization 58, 2 (2014), 285–319.
[35]
Jingu Kim and Haesun Park. 2008. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In Proceedings of the 2008 8th IEEE International Conference on Data Mining. IEEE, 353–362.
[36]
Deguang Kong, Chris Ding, and Heng Huang. 2011. Robust nonnegative matrix factorization using l21-norm. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. 673–682.
[37]
Jerome Le Ny and George J. Pappas. 2013. Differentially private filtering. IEEE Transactions on Automatic Control 59, 2 (2013), 341–354.
[38]
Daniel Lee and H. Sebastian Seung. 2001. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems. T. Leen, T. Dietterich, and V. Tresp (Eds.), Vol. 13. MIT Press. Retrieved from https://proceedings.neurips.cc/paper/2000/file/f9d1152547c0bde01830b7e8bd60024c-Paper.pdf
[39]
David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research 5, Apr. (Dec. 2004), 361–397. Retrieved from http://dl.acm.org/citation.cfm?id=1005332.1005345
[40]
Chencheng Li, Pan Zhou, Li Xiong, Qian Wang, and Ting Wang. 2018. Differentially private distributed online learning. IEEE Transactions on Knowledge and Data Engineering 30, 8 (2018), 1440–1453.
[41]
Katrina Ligett, Seth Neel, Aaron Roth, Bo Waggoner, and Steven Z. Wu. 2017. Accuracy first: Selecting a differential privacy level for accuracy constrained ERM. In Advances in Neural Information Processing Systems. I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Vol. 30, Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
[42]
Chih-Jen Lin. 2007. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Transactions on Neural Networks 18, 6 (2007), 1589–1596.
[43]
Chih-Jen Lin. 2007. Projected gradient methods for nonnegative matrix factorization. Neural Computation 19, 10 (2007), 2756–2779.
[44]
Ziqi Liu, Yu-Xiang Wang, and Alexander Smola. 2015. Fast differentially private matrix factorization. In Proceedings of the 9th ACM Conference on Recommender Systems. 171–178.
[45]
Frank McSherry and Kunal Talwar. 2007. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07). IEEE, 94–103.
[46]
Ilya Mironov. 2017. Rényi differential privacy. In Proceedings of the 2017 IEEE 30th Computer Security Foundations Symposium (CSF). IEEE, 263–275.
[47]
Arvind Narayanan and Vitaly Shmatikov. 2006. How to break anonymity of the netflix prize dataset. arXiv:cs/0610105. Retrieved from https://arxiv.org/abs/cs/0610105
[48]
Joseph P. Near and Chiké Abuah. 2021. Programming Differential Privacy. Vol. 1. Retrieved from https://uvm-plaid.github.io/programming-dp/
[49]
Valeria Nikolaenko, Stratis Ioannidis, Udi Weinsberg, Marc Joye, Nina Taft, and Dan Boneh. 2013. Privacy-preserving matrix factorization. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. 801–812.
[50]
Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2007. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing. 75–84.
[51]
Erfan Nozari, Pavankumar Tallapragada, and Jorge Cortés. 2016. Differentially private distributed convex optimization via functional perturbation. IEEE Transactions on Control of Network Systems 5, 1 (2016), 395–408.
[52]
Derek O’callaghan, Derek Greene, Joe Carthy, and Pádraig Cunningham. 2015. An analysis of the coherence of descriptors in topic modeling. Expert Systems with Applications 42, 13 (2015), 5645–5657.
[53]
Kaare Brandt Petersen and Michael Syskind Pedersen. 2008. The matrix cookbook. Technical University of Denmark 7, 15 (2008), 510.
[54]
Yuqiu Qian, Conghui Tan, Danhao Ding, Hui Li, and Nikos Mamoulis. 2020. Fast and secure distributed nonnegative matrix factorization. IEEE Transactions on Knowledge and Data Engineering 34, 2 (2020), 653–666.
[55]
Juan Ramos. 2003. Using tf-idf to determine word relevance in document queries. In Proceedings of the 1st Instructional Conference on Machine Learning, Vol. 242. Citeseer, 29–48.
[56]
Xun Ran, Yong Wang, Leo Yu Zhang, and Jun Ma. 2022. A differentially private matrix factorization based on vector perturbation for recommender system. Neurocomputing 483, 0925-2312 (2022), 32–41.
[57]
Xun Ran, Yong Wang, Leo Yu Zhang, and Jun Ma. 2022. A differentially private nonnegative matrix factorization for recommender system. Information Sciences 592, 0020-0255 (2022), 21–35.
[58]
Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24, 5 (1988), 513–523.
[59]
Bin Shen, Bao-Di Liu, Qifan Wang, and Rongrong Ji. 2014. Robust nonnegative matrix factorization via L 1 norm regularization by multiplicative updating rules. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 5282–5286.
[60]
Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 3–18.
[61]
Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. 2013. Stochastic gradient descent with differentially private updates. In Proceedings of the 2013 IEEE Global Conference on Signal and Information Processing. IEEE, 245–248.
[62]
Latanya Sweeney. 2015. Only you, your doctor, and many others may know. Technology Science 2015092903, 9 (2015), 29.
[63]
Di Wang, Minwei Ye, and Jinhui Xu. 2017. Differentially private empirical risk minimization revisited: Faster and more general. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Vol. 30, Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/f337d999d9ad116a7b4f3d409fcc6480-Paper.pdf
[64]
Yu-Xiong Wang and Yu-Jin Zhang. 2012. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering 25, 6 (2012), 1336–1353.
[65]
B. Weyrauch, B. Heisele, J. Huang, and V. Blanz. 2004. Component-based face recognition with 3D morphable models. In Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW’04), Volume 5. IEEE Computer Society, 85–. Retrieved from http://dl.acm.org/citation.cfm?id=1032636.1032976
[66]
Hyenkyun Woo and Haesun Park. 2014. Robust asymmetric nonnegative matrix factorization. Computational and Applied Mathematics Reports, University of California, USA 2 (2014).
[67]
Yangyang Xu, Wotao Yin, Zaiwen Wen, and Yin Zhang. 2012. An alternating direction algorithm for matrix completion with nonnegative factors. Frontiers of Mathematics in China 7, 2 (2012), 365–384.
[68]
Yale. 2001. The extended yale face database B. (2001). Retrieved from http://vision.ucsd.edu/iskwak/ExtYaleDatabase/ExtYaleB.html
[69]
Feng Zhang, Victor E. Lee, and Kim-Kwang Raymond Choo. 2018. Jo-DPMF: Differentially private matrix factorization learning through joint optimization. Information Sciences 467 (2018), 271–281.
[70]
Lijun Zhang, Zhengguang Chen, Miao Zheng, and Xiaofei He. 2011. Robust non-negative matrix factorization. Frontiers of Electrical and Electronic Engineering in China 6, 2 (2011), 192–200.
[71]
Renbo Zhao and Vincent Y. F. Tan. 2016. Online nonnegative matrix factorization with outliers. IEEE Transactions on Signal Processing 65, 3 (2016), 555–570.
[72]
Ligeng Zhu, Zhijian Liu, and Song Han. 2019. Deep leakage from gradients. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Vol. 32, Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/60a6c4002cc7b29142def8871531281a-Paper.pdf
