2.2 Federated Learning
FL is a distributed learning scheme, originally proposed by Konečný et al. [58], in which a set of client devices \(C\) jointly train an ML model \(M_{Global}\) on their private datasets \(\mathcal{D}_i\). Usually, FL is performed under the supervision of a central coordinating server. In traditional ML, the local client datasets would be accumulated into a central dataset \(\mathcal{D}_{Central} = \bigcup_{i=1}^{\left|C\right|} \mathcal{D}_i\), on which a central model \(M_{Central}\) is trained. In FL, the local datasets are never disclosed by the clients. Instead, the central server initializes a global model \(M_{Global}\), parameterized by a vector \(\boldsymbol{\theta} \in \mathbb{R}^d\), which is sent to all clients \(c_i\). Each client trains the global model on its local dataset \(\mathcal{D}_i\), effectively optimizing its local objective \(\mathcal{L}_i(\mathcal{D}_i; \boldsymbol{\theta})\), which results in a local update \(U_i\).
The local update is then sent back to the central server, which uses an aggregation operator to combine the updates into an updated global model \(M^{\prime}_{Global} = \text{Agg} \left\lbrace U_i \mid i \in \left\lbrace 1, 2, \dots, |C|\right\rbrace \right\rbrace\). This process is repeated until a suitable convergence metric is met. The objective of FL can therefore be stated as the following minimization problem [83]:
\[\min_{\boldsymbol{\theta} \in \mathbb{R}^d} \sum_{i=1}^{|C|} \frac{|\mathcal{D}_i|}{|\mathcal{D}_{Central}|} \mathcal{L}_i(\mathcal{D}_i; \boldsymbol{\theta}), \qquad \mathcal{L}_i(\mathcal{D}_i; \boldsymbol{\theta}) = \frac{1}{|\mathcal{D}_i|} \sum_{(x, y) \in \mathcal{D}_i} \ell(x, y; \boldsymbol{\theta}),\]
with \(\ell(x, y; \boldsymbol{\theta})\) denoting the loss of the client model on input \(x\) with ground-truth \(y\), given the model parametrization \(\boldsymbol{\theta}\). FL allows the global model \(M_{Global}\) to train on significantly more data than if each client had only trained on its private data. Thus, under ideal conditions, given a performance metric \(P\), the performance of the global model \(P_{Global}\) should be better than that of each individual client: \(\forall i \in \left\lbrace 1, 2, \dots, |C|\right\rbrace : P_{Global} > P_i\). FL permits a certain degree of deviation from the performance of an equivalent centrally trained model but provides data security and privacy protection in return. Still, the goal is to minimize the deviation \(|P_{Central} - P_{Global}|\).
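To make one round of this procedure concrete, the following minimal sketch simulates FL on a toy linear model, assuming an in-memory setting in which each update \(U_i\) is the locally trained parameter vector and the aggregation operator is the dataset-size-weighted average popularized by FedAvg [83] (discussed below); the model, learning rate, and epoch counts are illustrative assumptions, not a rendering of any cited system.

```python
import numpy as np

def local_gradient(theta, X, y):
    """Gradient of the local MSE loss, standing in for an arbitrary l(x, y; theta)."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def local_update(theta, X, y, lr=0.1, epochs=5):
    """Client side: start from the global parameters and train on the private D_i."""
    theta = theta.copy()
    for _ in range(epochs):
        theta -= lr * local_gradient(theta, X, y)
    return theta

def aggregate(updates, sizes):
    """Agg: combine the local updates U_i, weighted by the dataset sizes |D_i|."""
    weights = np.asarray(sizes, dtype=float) / np.sum(sizes)
    return np.sum([w * u for w, u in zip(weights, updates)], axis=0)

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(n, 3)), rng.normal(size=n)) for n in (20, 50, 80)]
theta_global = np.zeros(3)
for _ in range(10):  # repeated until a convergence metric is met
    updates = [local_update(theta_global, X, y) for X, y in clients]
    theta_global = aggregate(updates, [len(y) for _, y in clients])
```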
In the original FL scheme, federated stochastic gradient descent (FedSGD), proposed by Konečný et al. [58], the clients perform a single training step and send the computed gradient back to the central server, which averages the gradients across all clients and applies the result to the global model. Since then, several other methods have been proposed in the literature. McMahan et al. proposed federated averaging (FedAvg), where the clients train for multiple local epochs and send their updated local model to the central server instead of the gradient. The updated parameters are weighted proportionally to the number of local training samples available to each client and then averaged by the central server [83]. Furthermore, they employ client sub-sampling, a technique where only a random subset of clients is selected for each communication round [14, 28]. FedAvg can be seen as a generalization of FedSGD, which only executes a single iteration of gradient descent in each round of communication [83, 105]. Although theoretical guarantees for the convergence of FedAvg on heterogeneous data exist, they require impractical assumptions, such as strong convexity or smoothness of the objective function [69]. Chai et al. showed experimentally that FedAvg can lose up to 9% accuracy in comparison to FedSGD when dealing with non-i.i.d. data [10]. Li et al. tackled this problem and presented a generalization of FedAvg: they introduced a surrogate objective that constrains the locally updated parameters to be close to the current global model (see the sketch after this paragraph). This helped to stabilize convergence behavior, resulting in a significant increase in test accuracy of 22% on average [67]. Li et al. proposed to share only the trainable parameters of batch normalization (BatchNorm) layers with the central server, without communicating the running averages of the batch statistics. Aggregating the trainable parameters from all clients while keeping the running averages local helps to alleviate the problem of feature shift in non-i.i.d. training scenarios [70]. Karimireddy et al. utilize control variates as a variance reduction technique to approximate the update directions of the server model and each client model. The client drift, which naturally arises from training on different local data distributions, can be estimated by the difference between these update directions and is corrected for during the local training of each client [52]. Cao et al. rely on clustering the clients according to the classes of data they possess. They only average parameters from the same group when updating the central server model, guaranteeing that parameters are only averaged over a set of clients with a comparable data distribution [8]. Seol and Kim propose a two-step approach: first, they use data oversampling to eliminate class imbalances among clients; second, the clients are selected in such a way that their data distribution is nearly uniform. Furthermore, the central server constantly adjusts the amount of data used for local training, the batch size, and the learning rate of the clients to avoid performance degradation [101]. We also address data heterogeneity and introduce our own generalization of FedAvg, named federated learning with client queuing (FedQ).
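The surrogate objective of Li et al. [67] can be sketched as follows, reusing the toy linear model from above: the plain local loss is augmented with a proximal term \(\frac{\mu}{2} \lVert \boldsymbol{\theta} - \boldsymbol{\theta}_{Global} \rVert^2\) that penalizes drifting away from the current global parameters. The coefficient `mu` and the other hyperparameters are hypothetical choices.

```python
import numpy as np

def proximal_local_update(theta_global, X, y, mu=0.01, lr=0.1, epochs=5):
    """Local training on a surrogate objective: the plain local loss plus the
    proximal term (mu / 2) * ||theta - theta_global||^2, which keeps the
    locally updated parameters close to the current global model."""
    theta = theta_global.copy()
    for _ in range(epochs):
        grad_loss = 2.0 * X.T @ (X @ theta - y) / len(y)  # local MSE gradient
        grad_prox = mu * (theta - theta_global)           # pull toward the global model
        theta -= lr * (grad_loss + grad_prox)
    return theta
```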
Although FL operates in a decentralized environment, the participating clients’ privacy may be compromised by merely transmitting the training updates. Geiping et al. reconstructed high-resolution images by examining the gradients communicated by each client [29]. Dimitrov et al. were also able to extract sensitive information contained in the weights obtained by the FedAvg procedure. Therefore, the concept of differential privacy [23] is often applied in the setting of FL. When working with aggregated data, differential privacy can be utilized to protect the private information contained in individual data points. It achieves this protection by perturbing the data points with random noise, exploiting the fact that a single data point has relatively little impact on the aggregated data as a whole, while adding random noise alters the individual data points to a degree that no useful information can be extracted from them [22]. Wei et al. proposed to add specific noise to the parameters of each client before aggregation by the central server [116]. This ensures a decent training accuracy while a certain level of privacy is maintained, provided that a sufficiently large number of clients is involved [116]. Phong et al. [91] proposed the use of homomorphic encryption in the more general setting of distributed training, and Fang and Quan [25] suggested its use in the setting of FL. Homomorphic encryption is a specialized encryption scheme that allows performing certain mathematical operations on data without decrypting it.
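As an illustration of the noise-based approach, the following minimal sketch perturbs a client’s parameter update before it is sent to the server, in the spirit of Wei et al. [116]; the clipping bound and noise scale are hypothetical choices, and calibrating them to a formal \((\epsilon, \delta)\) privacy guarantee is beyond the scope of this sketch.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip the update to bound any single client's influence, then add
    Gaussian noise so that little can be inferred about the local data."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(scale=noise_std, size=update.shape)
```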
2.3 Communication-Efficient Federated Learning
When dealing with mobile clients, internet connections may be inconsistent and potentially have high latency. Even when FL clients are connected via reliable network connections, mobile connections are usually still bandwidth-constrained and, in many cases, even metered. Over the course of FL, training updates must be exchanged many times, so a central goal in FL is communication minimization. When communicating model parametrizations, several size reduction techniques can serve as possible solutions:
Sparsification/Pruning excludes single neurons (unstructured) or entire layers of neurons (structured) from an NN. While sparsification only sets excluded neurons to 0, pruning actually removes them [65]. Sparsified models are more amenable to compression but still have their original size when uncompressed. Pruned models, on the other hand, are reduced in size even without compression. The disadvantage of pruned networks is that they may require specialized software and/or hardware, while sparsified models can run on regular software and hardware.
Distillation is a technique for transferring the knowledge of a teacher model into a smaller student model. This is done by minimizing the difference between the output of the student model and the output of the teacher model (also known as soft labels) on data points from a separate dataset [43].
In quantization, the weights of an NN are constrained to a discrete set of values so that they can be represented with fewer bits [30] (see the sketch below).
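A minimal sketch of such quantization, assuming a simple uniform scheme over the weight range (practical schemes are typically more elaborate, e.g., non-uniform levels or trained codebooks):

```python
import numpy as np

def quantize(weights, bits=8):
    """Map each weight to the nearest of 2**bits evenly spaced levels, so the
    tensor can be stored as small integers plus the offset and scale."""
    lo, hi = float(weights.min()), float(weights.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((weights - lo) / scale).astype(np.uint16)
    return q, lo, scale

def dequantize(q, lo, scale):
    """Recover an approximation of the original weights."""
    return lo + q.astype(np.float64) * scale
```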
Lossless compression techniques encode the NN data in a way that removes redundancy and thus reduces its size [37].
There are many works that have developed communication-efficient FL solutions using the above-mentioned techniques or combinations thereof [57, 96, 97], and even some with specialized techniques, such as federated dropout [6]. Konečný et al. propose employing quantization, random rotations, and sub-sampling to compress the updated model parameters of the clients before sending them to the central server [57]. Wu et al. adopt an orthogonal strategy: the clients train a teacher model on their local data and distill it into a smaller student model. Instead of communicating the gradients of the teacher models, the clients compress and send the gradients of the smaller student models [121]. Sattler et al. introduce a compression framework combining communication delay methods, gradient sparsification, binarization, and optimal weight update encoding to reduce the upstream communication cost in distributed learning scenarios [96]. To adapt it to the FL setting, Sattler et al. enhance this approach, taking the compression of the downstream communication and the non-i.i.d. local data distributions of the clients into account. They construct a framework combining a novel top-\(k\) gradient sparsification method (sketched below) with ternarization and optimal Golomb encoding of the updated client model parameters [97]. Another emerging field of research considers combinations of differential privacy and quantization methods in order to reduce communication costs. Lang and Shlezinger demonstrated that, within their framework, it is possible to quantize data at a given bit rate without sacrificing a specified level of privacy or degrading model performance [63]. They enhanced methods proposed by Reisizadeh et al. and Konečný et al., which solely use quantization and do not include privacy-related considerations.
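To illustrate the sparsification building block used in such frameworks, here is a minimal top-\(k\) sketch that keeps only the \(k\) entries of largest magnitude in an update and zeroes the rest; the ternarization and Golomb encoding stages of Sattler et al. [97] are omitted.

```python
import numpy as np

def top_k_sparsify(update, k):
    """Keep the k entries of largest magnitude, set all others to 0; only the
    (index, value) pairs of the survivors need to be transmitted."""
    flat = update.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k entries
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(update.shape)
```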
2.4 Federated Recommender Systems
The current public discussion of RecSyss (often just referred to as “the algorithm” or AI personalization) focuses, among other topics, on their invasive behavior concerning personal data collection [32, 42, 61, 62]. This might create a negative relationship between user and RecSys, potentially resulting in anything from user discontent to “algorithmic hate” [106]. RecSyss are arguably a vital part of the user experience on the internet since, without them, the flood of content would be barely manageable. Therefore, FL may be part of the solution to the privacy problem of RecSyss: by training the recommender models directly on user devices, the need for gathering private information is entirely circumvented.
FL has already been proven to work well in many other domains, e.g., cancer research [95], natural language processing [72], graph NNs [40], image classification [81], transfer learning [77], language models [5], mobile keyboard prediction [38], and keyword spotting [66], so it is reasonable to anticipate that it is likewise effective in the domain of RecSyss. In fact, there are numerous methods in the literature for incorporating current RecSys frameworks into FL. They can be classified as focusing on learning algorithms [3], security [93], or optimization models [86], depending on the task’s objective [2]. Matrix factorization is a commonly utilized approach in the first scenario. Ammad-ud-din et al. were among the pioneers in this emerging field by introducing this model to address collaborative filtering tasks in the context of FL. They constructed a RecSys that gives personalized recommendations based on users’ implicit feedback [3]. Lin et al. designed a new federated rating prediction mechanism for explicit responses. They employed user averaging and hybrid filling in order to keep the system computationally efficient and the communication costs moderately low [74].
To increase the model capabilities for each client, Jia and Lei incorporated a bias term for the input signals. In addition, weights on the local devices were adjusted so that unreasonable user ratings are removed [49]. Flanagan et al. employed a similar strategy, enhancing the model’s capacity by incorporating input from other data sources [27]. Wang et al. introduced a new algorithmic approach by combining matrix factorization with FedAvg. They demonstrated that the cost of communication with the central server for non-i.i.d. data can be decreased by limiting the number of local training iterations [114].
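To make the federated matrix factorization setting concrete, the following minimal sketch keeps each user’s embedding strictly on the device and only exchanges item-embedding gradients with the server; the plain SGD update rule and the learning rate are simplifying assumptions rather than a rendering of any one of the cited systems.

```python
import numpy as np

def client_mf_step(user_vec, item_mat, ratings, lr=0.05):
    """One local step. `ratings` maps item index -> observed rating; the
    private user embedding is updated in place and never leaves the device,
    while the item-embedding gradient is returned for upload."""
    item_grad = np.zeros_like(item_mat)
    for j, r in ratings.items():
        err = user_vec @ item_mat[j] - r      # prediction error on item j
        item_grad[j] += err * user_vec        # gradient w.r.t. item row j
        user_vec -= lr * err * item_mat[j]    # private user update, kept local
    return user_vec, item_grad

def server_step(item_mat, item_grads, lr=0.05):
    """Server side: average the uploaded item gradients and update."""
    return item_mat - lr * np.mean(item_grads, axis=0)
```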
As previously shown, private information can be reconstructed from the clients’ transmitted parameters. In order to remedy this, a variety of privacy-preserving techniques based on encryption, obfuscation, or masking can be utilized [4]. Homomorphic encryption makes the communication of encrypted data between the central server and its clients possible, allowing for intermediate calculations without the need to first decrypt the data. As a result, the central server is unable to infer the data it is working with [54]. For this reason, Chai et al. propose a secure matrix factorization framework to handle data leakage. They showed how privacy can be compromised by intercepting the gradient updates sent by the clients to the central server in two consecutive communication rounds. To address this problem, they encrypt the clients’ gradients before sending them to the central server [11]. Zhang and Jiang enhanced the approach by clustering the encrypted user embeddings to reduce the dimension of the user-item matrix, improving the recommendation accuracy [131]. Lin et al. utilized a different cryptographic technique: they applied secret sharing, wherein a group of clients can only reconstruct sensitive information if they collaborate by combining their shares [103]. By applying this concept to the clients’ locally computed gradients, the authors managed to construct a FedRec framework that provides strong privacy guarantees on the clients’ individual data [75]. Another technique is secure multi-party computation, which refers to a protocol for computing a function based on the data of a group of clients without disclosing private information to one another [20]. Perifanis and Efraimidis utilized this approach in the setting of federated neural collaborative filtering (NCF). They demonstrated that employing a secure multi-party computation protocol for FedAvg protects privacy against an honest-but-curious entity without compromising the quality of the RecSys [90].
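A minimal sketch of the secret sharing idea, using additive shares over real-valued gradients (production protocols typically share over a finite field): each share individually reveals nothing, but all shares together sum back to the secret.

```python
import numpy as np

def make_shares(secret, n_shares, rng=None):
    """Split a gradient vector into n additive shares: the first n - 1 are
    random, and the last is chosen so that all shares sum to the secret."""
    rng = rng or np.random.default_rng()
    shares = [rng.normal(size=secret.shape) for _ in range(n_shares - 1)]
    shares.append(secret - np.sum(shares, axis=0))
    return shares

secret = np.array([0.5, -1.2, 3.0])   # e.g., a client's gradient
shares = make_shares(secret, n_shares=3)
assert np.allclose(np.sum(shares, axis=0), secret)  # reconstruction needs all shares
```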
Differential privacy falls into the category of privacy-preservation techniques that use obfuscation. Ribero et al. added differential privacy to FL utilizing a matrix factorization technique. They succeeded in limiting the privacy loss posed by the repetitive nature of the FL process by only requiring a few rounds of communication [93]. Yang et al. designed a matrix factorization-based RecSys that adds Laplacian random noise to the users’ encrypted item embeddings, ensuring a high level of security [125]. Minto et al. proposed a system combining differential privacy and implicit user feedback. They constrain the number of local gradient updates sent by the users according to the level of privacy each user tries to maintain [84]. We also address the problem of privacy preservation by obfuscation: instead of applying random noise to the weight updates that are sent to the central server, the weights are quantized, which is conducive both to privacy preservation and to reducing the communication overhead. We later provide a detailed attack analysis of the exchanged model parameters, which are potentially susceptible to leaking information about the underlying datasets of the participating clients. We present specific attacks applicable to our scenario and examine how their requirements and assumptions do not apply to our approach to privacy preservation, thus rendering them ineffective.
Another method of achieving data security is the introduction of pseudo interactions in order to mask user behavior in FedRecs. This protection mechanism is implemented by adding artificial interactions with randomly selected items to users. The central server is thus unable to determine the real set of items a user has interacted with, as the uploaded gradient is computed with respect to both real and artificial interactions [74] (a minimal sketch follows this paragraph). Since this method produces noisy gradients, degrading the model performance, Liang et al. introduced denoising clients into the training process [71]. Another approach that hits the same mark, but entirely foregoes FL, was presented by Wainakh et al. [112]. They employ a random walk-based approach to decentralized optimization, where a randomly chosen client trains its local model for one or multiple epochs before sending its updated parameters to a randomly selected neighboring client according to the underlying graph structure [108, 111]. Wainakh et al. adapt this approach to account for privacy by introducing the anonymous random walk technique, where clients, instead of training a model, can choose to add their own data to an existing dataset that was sent by a neighboring client in a prior round. The accumulated data can then be uploaded to the central server for centralized training. Due to the nature of the random walk, neither the clients nor the central server know where the individual samples of the accumulated dataset originate from, thus effectively masking the users’ identities.
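As announced above, a minimal sketch of the pseudo-interaction masking: before computing its update, a client pads its real item set with randomly sampled items so that the server cannot tell which interactions are genuine; the number of artificial items is a hypothetical choice.

```python
import numpy as np

def mask_interactions(real_items, n_items, n_fake, rng=None):
    """Return the union of the real items and randomly sampled artificial
    items; gradients computed over this set hide the true interaction set."""
    rng = rng or np.random.default_rng()
    candidates = np.setdiff1d(np.arange(n_items), real_items)
    fake = rng.choice(candidates, size=n_fake, replace=False)
    return np.union1d(real_items, fake)
```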
Dealing with the statistical heterogeneity of the clients’ local data in the context of FedRecs is a different area of research. There are various proposed strategies for addressing this issue, which primarily include clustering and meta-learning [109]. Jie et al. designed a FedRec utilizing a clustering approach based on historical parameters to form homogeneous groups of clients, in which a personalized model can be trained. These parameters are retrieved by averaging the model parameters from the clients’ last communication rounds with the central server [50]. Chen et al. proposed a different method based on model-agnostic meta-learning, which is a training paradigm where a meta-learner is employed to rapidly train models on new tasks. The meta-learner itself is a trainable algorithm that trains a model on a task, which consists of a support set and a query set. The model is trained using the support set and then evaluated on the query set. Based on this evaluation, a loss is computed, which reflects the ability of the meta-learner to train the model. The meta-learner is then updated to minimize this loss. For example, the meta-learner in the model-agnostic meta-learning (MAML) [26] algorithm provides an initial set of parameters for the model that is trained on the task (see the sketch below). Meta-learning algorithms are known to generalize effectively to new tasks, which makes them well-suited for tackling the non-i.i.d. problem in FL. For this reason, Chen et al. adapted MAML, as well as another meta-learning algorithm called Meta-SGD, to the FL setting, which enabled them to reach higher model performance than the FedAvg baseline [12]. Our FedRec was not only affected by heterogeneous client data but also by exceedingly small local datasets. Our approach to non-i.i.d.-ness, FedQ, therefore differs greatly from the two above-mentioned approaches, as neither clustering nor meta-learning is capable of handling truly small local datasets.
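The MAML-style meta-update mentioned above can be sketched on a toy regression problem; for brevity, this uses the first-order approximation of MAML (ignoring second derivatives), and the task distribution and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A task: fit y = a * x for a random slope a; support and query sets."""
    a = rng.uniform(-2.0, 2.0)
    x_s, x_q = rng.normal(size=10), rng.normal(size=10)
    return (x_s, a * x_s), (x_q, a * x_q)

def grad(theta, x, y):
    """Gradient of the MSE of the linear model y_hat = theta * x."""
    return 2.0 * np.mean((theta * x - y) * x)

theta_meta, inner_lr, outer_lr = 0.0, 0.1, 0.01
for _ in range(1000):
    (x_s, y_s), (x_q, y_q) = sample_task()
    # Inner loop: adapt the meta-initialization to the task's support set.
    theta_task = theta_meta - inner_lr * grad(theta_meta, x_s, y_s)
    # Outer loop: evaluate the adapted model on the query set and update the
    # initialization (first-order: the query gradient is taken at theta_task).
    theta_meta -= outer_lr * grad(theta_task, x_q, y_q)
```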
The clients’ potentially constrained resources are the subject of another line of research. Muhammad et al. utilized a simple DNN with small embedding sizes to balance the number of learnable parameters and the accuracy of the resulting recommendations. In addition, they presented a new sampling technique coupled with an active aggregation method, which reduces communication costs and produces more accurate models even at an early stage of training [86]. Zhang et al. addressed related problems and developed a new framework that effectively integrates a novel matrix factorization technique with privacy via a federated discrete optimization algorithm. Although the model’s RAM, storage, and communication bandwidth requirements are modest, performance is not affected and is even superior to related state-of-the-art techniques [130]. Our suggested approach combines all three of the aforementioned types of objectives: we balance model complexity and capacity by opting for a simple, yet scalable DNN architecture. This keeps the client side resource-efficient while maintaining the possibility of scaling up. In addition, we anticipate that applying quantization will provide a certain amount of privacy while also lowering the burden associated with exchanging parameters with the central server via potentially bandwidth-constrained network connections.