1 Introduction
Non-negative matrix factorization (NMF) is an unsupervised machine learning (ML) technique for discovering part-based representations of inherently non-negative data [29]. Consider a data matrix \(\mathbf {V}\in \mathcal {R}^{D \times N}\) with entries \(v_{ij} \geqslant 0\ \forall i\in \lbrace 1, 2, \ldots , D\rbrace\) and \(\forall j\in \lbrace 1, 2, \ldots , N\rbrace\), where N is the number of data samples and D is the data dimension. The NMF objective is to (approximately) decompose the data matrix \(\mathbf {V}\) into two non-negative matrices as follows:
\[
\mathbf {V} \approx \mathbf {W}\mathbf {H},
\]
where \(\mathbf {W} \in \mathcal {R}^{D\times K}\) is the basis matrix, \(\mathbf {H} \in \mathcal {R}^{K\times N}\) is the coefficient matrix, and K is the latent dimension. The decomposition is usually performed by minimizing some divergence between the data matrix and the product of the factor matrices, such that each entry of \(\mathbf {W}\) satisfies \(w_{ik} \geqslant 0\) and each entry of \(\mathbf {H}\) satisfies \(h_{kj}\geqslant 0, \forall k\in \lbrace 1, 2, \ldots , K\rbrace\). In short, NMF performs dimension reduction by mapping the ambient data dimension D onto the latent dimension K for the N data samples. Unlike other dimension reduction methods, such as Principal Component Analysis (PCA) or Independent Component Analysis (ICA), NMF is well-suited for inherently non-negative data because it finds non-subtractive, interpretable bases and coefficients.
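For illustration, the following minimal Python sketch (using numpy; all names are ours, and the classical multiplicative updates of Lee and Seung [38] serve only as one convenient solver) computes such a factorization:

```python
import numpy as np

def nmf_multiplicative(V, K, n_iter=200, eps=1e-10, seed=0):
    """Minimal NMF sketch: find non-negative W (D x K) and H (K x N)
    such that V ~ W H, via multiplicative updates for the
    Frobenius-norm objective. Non-negativity is preserved because
    every update multiplies by a non-negative ratio."""
    rng = np.random.default_rng(seed)
    D, N = V.shape
    W = rng.random((D, K))
    H = rng.random((K, N))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update coefficients
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update basis
    return W, H

# Example: a random non-negative 50 x 200 data matrix, K = 5.
V = np.abs(np.random.default_rng(1).standard_normal((50, 200)))
W, H = nmf_multiplicative(V, K=5)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative error
```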
As mentioned before, NMF approximates the jth column of \(\mathbf {V}\) as \(\mathbf {v}_j \approx \mathbf {W} \mathbf {h}_j\), where \(\mathbf {h}_j\) is the jth column of \(\mathbf {H}\). That is, the jth column of \(\mathbf {V}\) is represented as a linear combination of the columns of \(\mathbf {W}\), with the coefficients being the corresponding entries of \(\mathbf {h}_j\). The dictionary matrix \(\mathbf {W}\) can therefore be interpreted as storing the “parts” of the data matrix \(\mathbf {V}\). This part-based decomposition, together with the restriction to non-negative values, makes NMF popular in many practical applications (a short sketch illustrating the column-wise view follows the list below), including, but not limited to:
— topic modeling in text mining,
— extracting local facial features from human faces,
— unsupervised image segmentation, and
— community detection in social networks.
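Continuing the small sketch above (so `np`, `W`, and `H` are already defined), the column-wise reading of the factorization can be checked directly; `j`, `v_hat`, and `v_parts` are illustrative names:

```python
# Column j of V is approximated by a non-negative mixture of the
# columns ("parts") of W, weighted by the entries of h_j.
j = 0
v_hat = W @ H[:, j]                                 # matrix form: W h_j
v_parts = sum(H[k, j] * W[:, k] for k in range(5))  # explicit mixture, K = 5
assert np.allclose(v_hat, v_parts)                  # the two forms agree
```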
However, the acquired data samples may be contaminated by localized outliers [66], such as salt-and-pepper noise, occlusions, shadows, moving objects in videos, and so on. The presence of such outliers can impede the accurate estimation of the basis and, thus, obscure the latent low-rank structure of the data [70, 71].
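Concretely, the outlier model adopted later in this paper (our paraphrase of the model from Reference [71]) augments the factorization with an explicit outlier term:
\[
\mathbf {V} \approx \mathbf {W}\mathbf {H} + \mathbf {R},
\]
where \(\mathbf {R} \in \mathcal {R}^{D\times N}\) is a sparse matrix that absorbs the localized outliers, so that \(\mathbf {W}\) and \(\mathbf {H}\) can capture the underlying low-rank structure.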
NMF and Privacy. Many modern services operate on sensitive user data to provide better suggestions and experiences. They often do so via ML algorithms, such as the ones mentioned above, trained on users’ sensitive data. As both theoretical and applied works have shown, users are rightfully concerned about their privacy being compromised by the outputs of ML algorithms. For example, the seminal work of Homer et al. [28] showed that the presence of an individual in a genome dataset can be identified from simple summary statistics about the dataset. In the ML setting, the model parameters (a matrix \(\mathbf {W}\) or a vector \(\mathbf {w}\)) are typically learned by training on the users’ data. To that end, Membership Inference Attacks [22] are discussed in detail by Hu et al. [30] and Shokri et al. [60], who showed that, given the learned parameters \(\mathbf {W}\), an adversary can identify users in the training set. In other words, the trained weights of a model can be used to extract sensitive information, and their tendency to memorize training examples can be exploited to regenerate those examples. Several other works [37, 47, 62] also showed how personal data leak from modern ML and signal processing tasks. Additionally, it has been shown that simple anonymization of data does not provide any privacy in the presence of auxiliary information available from other/public sources [62]. One prime example is the Netflix Challenge: an anonymized dataset containing movie ratings from Netflix subscribers was released in 2007, yet researchers recovered \(99\%\) of the removed personal data [47] by using publicly available IMDb data. Privacy leakage can also happen through gradient sharing [72]: in distributed ML systems, several nodes exchange gradient values, and the authors of Reference [72] showed how one can extract private training information from such shared gradients. These discoveries led to widespread concern over using private data for ML algorithms. Evidently, personal information leakage is the main hindrance to collecting and analyzing sensitive data for training ML and signal processing algorithms. It is, therefore, necessary to employ a framework in which one can share private data without disclosing one’s participation or identity.
Differential privacy (DP) is a rigorous mathematical framework that can provide protection against such information leakage [18]. The definition of differential privacy is motivated by work in cryptography and has gained significant attention in the ML and data mining communities [17]. DP introduces randomness into the computation pipeline to preserve privacy. This, however, degrades the computation accuracy, and the user needs to quantitatively choose the optimal privacy budget in view of the required privacy-utility tradeoff.
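For reference, a randomized mechanism \(\mathcal {M}\) satisfies the standard \((\epsilon , \delta)\)-differential privacy definition [18] if, for all measurable output sets \(\mathcal {S}\) and all neighboring datasets \(\mathcal {D}\) and \(\mathcal {D}^{\prime }\) differing in one individual’s data,
\[
\Pr [\mathcal {M}(\mathcal {D}) \in \mathcal {S}] \leqslant e^{\epsilon } \Pr [\mathcal {M}(\mathcal {D}^{\prime }) \in \mathcal {S}] + \delta .
\]
Smaller \(\epsilon\) (the privacy budget) and \(\delta\) imply stronger privacy but, typically, lower utility.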
Our Contributions. In this work, we consider applications of non-negative matrix factorization involving sensitive data, such as dimension reduction/feature extraction, topic modeling/text mining, image segmentation, denoising, and community detection. We intend to perform NMF decomposition on inherently non-negative and privacy-sensitive data (that may contain sparse outliers) to discover novel structure/features of the dataset. That is, (i) the estimated basis matrix \(\mathbf {W}\) should closely approximate the “true” basis matrix in the presence of the outliers, and (ii) since the data matrix \(\mathbf {V}\) contains user-specific privacy-sensitive information, the estimated \(\mathbf {W}\) should come with a formal privacy guarantee such that an adversary cannot extract sensitive information about the users. To address this, we propose a differentially-private [18] non-negative matrix factorization algorithm that computes the basis matrix \(\mathbf {W}\) in the presence of outliers in the dataset. We adopt the outlier model from Reference [71]. To the best of our knowledge, our proposed method is the first differentially-private NMF algorithm that accounts for outliers. Employing the mathematically rigorous privacy framework of differential privacy allows us to compute a \(\mathbf {W}\) that reveals very little about any particular user’s data and is relatively unaffected by the presence of outliers. Our algorithm design ensures that the learned \(\mathbf {W}\) closely approximates the “true” dictionary matrix, while satisfying strict privacy guarantees. Our major contributions are summarized below:
— We develop a novel privacy-preserving non-negative matrix factorization algorithm capable of operating on sensitive data, while closely approximating the results of the non-private algorithm.
— We account for the effect of outliers by explicitly modeling them as in Reference [71], such that their presence has very little effect on our estimated differentially-private basis matrix \(\mathbf {W}\).
— Considering the multi-round nature of the proposed algorithm, we analyze it using Rényi Differential Privacy (RDP) [46] to provide a tighter characterization of privacy under composition (a small accounting sketch follows this list). We obtain a much better accounting of the overall privacy loss than the conventional strong composition theorem [18] provides.
— We perform extensive experiments on six diverse real datasets to show the effectiveness of our proposed algorithm. We compare the results with those of an existing algorithm [57] and of the non-private algorithm, and observe that the basis matrix \(\mathbf {W}\) estimated by our proposed algorithm closely approximates the non-private results for certain parameter choices, easily outperforming the existing algorithm.
— We present the results in a way that makes the privacy-utility tradeoff comprehensible, so that the user can choose between the overall privacy budget and the required “closeness” to the non-private results.
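The RDP-based accounting mentioned in the third contribution can be sketched as follows for the common Gaussian mechanism (a generic illustration with our own names, not the exact analysis of our algorithm): a single release with noise multiplier \(\sigma\) satisfies RDP of order \(\alpha\) with \(\epsilon (\alpha) = \alpha /(2\sigma ^2)\), RDP composes additively over \(T\) rounds, and the result converts to \((\epsilon , \delta)\)-DP [46].

```python
import numpy as np

def gaussian_rdp_to_dp(sigma, T, delta, alphas=np.arange(2, 128)):
    """Sketch of RDP accounting for T Gaussian-mechanism releases:
    compose the per-round RDP curve additively, then convert to
    (eps, delta)-DP and keep the best order alpha."""
    rdp = T * alphas / (2.0 * sigma ** 2)             # composed RDP curve
    eps = rdp + np.log(1.0 / delta) / (alphas - 1.0)  # RDP -> DP conversion
    return eps.min()

print(gaussian_rdp_to_dp(sigma=2.0, T=100, delta=1e-5))
```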
Related Works. Non-negative matrix factorization is attained in the literature by minimizing the following objective function:
\[
\min _{\mathbf {W}\in \mathcal {C},\ \mathbf {H}\geqslant 0}\ \frac{1}{2}\Vert \mathbf {V}-\mathbf {W}\mathbf {H}\Vert _F^2,
\]
where \(\mathcal {C} \subseteq \mathcal {R}^{D\times K}\) is the constraint set for \(\mathbf {W}\). Several algorithms have been proposed to find the optimal point of this objective function, such as multiplicative updates [38], the alternating direction method of multipliers (ADMM) [67], block principal pivoting [35], the active set method [33], and projected gradient descent (PGD) [43]. Most of these algorithms alternately update \(\mathbf {W}\) and \(\mathbf {H}\). This splits the optimization problem into two sub-problems, each of which can be solved using standard optimization techniques, such as projected gradient or interior point methods. A detailed survey of these optimization techniques can be found in References [34, 64].
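As an illustration of the alternating scheme, a minimal PGD-based sketch is given below (the fixed step size and all names are ours, chosen for brevity; practical solvers select the step size by line search and test stationarity):

```python
import numpy as np

def nmf_pgd(V, K, n_iter=500, step=1e-3, seed=0):
    """Alternating projected gradient descent for
    min 0.5 * ||V - W H||_F^2 subject to W >= 0, H >= 0."""
    rng = np.random.default_rng(seed)
    D, N = V.shape
    W = rng.random((D, K))
    H = rng.random((K, N))
    for _ in range(n_iter):
        # Gradient step on each factor, then project onto the
        # non-negative orthant (entrywise maximum with zero).
        H = np.maximum(H - step * (W.T @ (W @ H - V)), 0.0)
        W = np.maximum(W - step * ((W @ H - V) @ H.T), 0.0)
    return W, H
```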
Additionally, some algorithms have been proposed to solve the NMF problem in the presence of outliers [59, 66, 70, 71]. Zhao and Tan [71] proposed an online NMF algorithm with two different solvers: a PGD-based solver and an ADMM-based solver. However, they use a fixed step size for the iterative updates, and their stopping condition does not necessarily indicate that the solution is close to a stationary point. Moreover, due to the online nature of the algorithm, it can be very slow for datasets with high dimensions and large numbers of samples. Zhang et al. [70] proposed another algorithm for robust NMF, which uses multiplicative updates for the basis and coefficient matrices and a soft-thresholding function for updating the outlier matrix (see the sketch following this paragraph). However, it has been shown in multiple works (see, e.g., Gonzalez and Zhang [26] and Lin [43]) that the multiplicative update method lacks convergence guarantees and desirable optimization properties. The robust NMF proposed by Shen et al. [59] also uses a multiplicative update rule for the basis matrix. Our work is based on the robust NMF algorithm using PGD [71].
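For completeness, the soft-thresholding operator used for such outlier updates has a simple closed form (a generic sketch; `lam` denotes the sparsity-inducing threshold):

```python
import numpy as np

def soft_threshold(X, lam):
    """Entrywise soft-thresholding: shrink each entry toward zero by
    lam; entries smaller than lam in magnitude become exactly zero,
    which keeps the estimated outlier matrix sparse."""
    return np.sign(X) * np.maximum(np.abs(X) - lam, 0.0)

# A typical outlier update given the current factors (illustrative):
# R = soft_threshold(V - W @ H, lam)
```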
Extensive works and surveys exist in the literature on differential privacy. In particular, Dwork and Smith’s survey [19] covers the initial theoretical work. We refer the reader to References [1, 5, 12, 13, 32, 40, 41, 61, 63] for the most relevant works in differentially private machine learning, deep learning, optimization, gradient descent, and empirical risk minimization. Adding randomness to the gradient computation is one of the most common approaches for implementing differential privacy [5, 61] (a generic sketch follows). Other common approaches are output perturbation [13], objective perturbation [51], the exponential mechanism [45], the Laplace mechanism [17], and smooth sensitivity [50].
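For instance, gradient perturbation amounts to bounding each gradient’s sensitivity by clipping and then adding calibrated noise before the gradient is used (a generic Gaussian-mechanism sketch, not our exact mechanism; `clip_norm` and `sigma` are illustrative parameters):

```python
import numpy as np

def noisy_gradient(grad, clip_norm, sigma, rng=None):
    """Gaussian-mechanism sketch: bound the gradient's L2 norm (its
    sensitivity) by clipping, then add noise scaled to that bound."""
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(grad)
    grad = grad * min(1.0, clip_norm / (norm + 1e-12))  # clip to clip_norm
    return grad + rng.normal(0.0, sigma * clip_norm, size=grad.shape)
```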
Several researchers have attempted to employ differential privacy in the context of general matrix factorization, focusing on recommendation systems [44] without explicitly handling the non-negativity constraint. For example, Nikolaenko et al. [49] proposed a privacy-preserving matrix factorization for recommendation systems using partially homomorphic encryption with Yao’s garbled circuits. Zhang et al. [69], Hua et al. [31], and Ran et al. [56] used objective perturbation [13] (or modifications of it) for matrix factorization. Berlioz et al. [7] applied differential privacy to matrix factorization in three different ways: input perturbation, perturbation of the gradients, and output perturbation. In the distributed-data setting, Ermis et al. [20] applied local differential privacy to the collective matrix factorization problem. The works most closely related to ours are those of Alsulaimawi [3], which introduced a privacy filter with federated learning for NMF; of Fu et al. [23], which proposed privacy-preserving NMF for dimension reduction using the Paillier cryptosystem; and the work on privacy-preserving data mining [2] using combined NMF and Singular Value Decomposition (SVD) methods. Last but not least, the authors of Reference [57] proposed differentially private NMF for recommender systems; we show comparative analyses in Section 4.5. However, to the best of our knowledge, no work has introduced differential privacy into the general NMF decomposition framework while accounting for outliers and for the overall privacy budget spent across a multi-stage implementation.