2.2 Federated Learning
FL is a distributed learning scheme, originally proposed by Konečný et al. [58], in which a set of client devices \(C\) jointly train an ML model \(M_{Global}\) on their private datasets \(\mathcal{D}_i\). Usually, FL is performed under the supervision of a central coordinating server. In traditional ML, the local client datasets would be accumulated into a central dataset \(\mathcal{D}_{Central} = \bigcup_{i=1}^{\left|C\right|} \mathcal{D}_i\), on which a central model \(M_{Central}\) is trained. In FL, the local datasets are never disclosed by the clients. Instead, the central server initializes a global model \(M_{Global}\), parameterized by a vector \(\boldsymbol{\theta} \in \mathbb{R}^d\), which is sent to all clients \(c_i\). Each client trains the global model on its local dataset \(\mathcal{D}_i\), effectively optimizing its local objective \(\mathcal{L}_i(\mathcal{D}_i; \boldsymbol{\theta})\), which results in a local update \(U_i\).
The local update is then sent back to the central server, which uses an aggregation operator to combine the updates into an updated global model \(M^{\prime}_{Global} = \text{Agg} \left\lbrace U_i \mid i \in \left\lbrace 1, 2, \dots, |C|\right\rbrace \right\rbrace\). This process is repeated until a suitable convergence metric is met. The objective of FL can therefore be stated as the following minimization problem [83]:
\[\min_{\boldsymbol{\theta} \in \mathbb{R}^d} \sum_{i=1}^{|C|} \frac{|\mathcal{D}_i|}{|\mathcal{D}_{Central}|} \mathcal{L}_i(\mathcal{D}_i; \boldsymbol{\theta}), \qquad \mathcal{L}_i(\mathcal{D}_i; \boldsymbol{\theta}) = \frac{1}{|\mathcal{D}_i|} \sum_{(x, y) \in \mathcal{D}_i} \ell(x, y; \boldsymbol{\theta}),\]
with \(\ell(x, y; \boldsymbol{\theta})\) denoting the loss of the client model on input \(x\) with ground-truth \(y\), given the model parametrization \(\boldsymbol{\theta}\). FL allows the global model \(M_{Global}\) to train on significantly more data than if each client had only trained on its private data. Thus, under ideal conditions, given a performance metric \(P\), the performance of the global model \(P_{Global}\) should be better than that of each individual client: \(\forall i \in \left\lbrace 1, 2, \dots, |C|\right\rbrace : P_{Global} > P_i\). FL permits a certain degree of deviation from the performance of an equivalent centrally trained model but provides data security and privacy protection in return. Still, the goal is to minimize the deviation \(|P_{Central} - P_{Global}|\).
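To make one round of this procedure concrete, the following minimal sketch simulates FL on a toy linear model, assuming an in-memory setting in which each update \(U_i\) is the locally trained parameter vector and the aggregation operator is the dataset-size-weighted average popularized by FedAvg [83] (discussed below); the model, learning rate, and epoch counts are illustrative assumptions, not a rendering of any cited system.

```python
import numpy as np

def local_gradient(theta, X, y):
    """Gradient of the local MSE loss, standing in for an arbitrary l(x, y; theta)."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def local_update(theta, X, y, lr=0.1, epochs=5):
    """Client side: start from the global parameters and train on the private D_i."""
    theta = theta.copy()
    for _ in range(epochs):
        theta -= lr * local_gradient(theta, X, y)
    return theta

def aggregate(updates, sizes):
    """Agg: combine the local updates U_i, weighted by the dataset sizes |D_i|."""
    weights = np.asarray(sizes, dtype=float) / np.sum(sizes)
    return np.sum([w * u for w, u in zip(weights, updates)], axis=0)

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(n, 3)), rng.normal(size=n)) for n in (20, 50, 80)]
theta_global = np.zeros(3)
for _ in range(10):  # repeated until a convergence metric is met
    updates = [local_update(theta_global, X, y) for X, y in clients]
    theta_global = aggregate(updates, [len(y) for _, y in clients])
```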
In the original FL scheme, federated stochastic gradient descent (FedSGD), proposed by Konečný et al. [58], the clients perform a single training step and send the computed gradient back to the central server, which averages the gradients across all clients and applies the result to the global model. Since then, several other methods have been proposed in the literature. McMahan et al. proposed federated averaging (FedAvg), where the clients train for multiple local epochs and send their updated local model to the central server instead of the gradient. The updated parameters are weighted proportionally to the number of local training samples available to each client and then averaged by the central server [83]. Furthermore, they employ client sub-sampling, a technique where only a random subset of clients is selected for each communication round [14, 28]. FedAvg can be seen as a generalization of FedSGD, which only executes a single iteration of gradient descent in each round of communication [83, 105]. Although theoretical guarantees for the convergence of FedAvg on heterogeneous data exist, they require impractical assumptions, such as strong convexity or smoothness of the objective function [69]. Chai et al. showed experimentally that FedAvg can lose up to 9% accuracy in comparison to FedSGD when dealing with non-i.i.d. data [10]. Li et al. tackled this problem and presented a generalization of FedAvg: they introduced a surrogate objective that constrains the locally updated parameters to be close to the current global model (see the sketch after this paragraph). This helped to stabilize convergence behavior, resulting in a significant increase in test accuracy of 22% on average [67]. Li et al. proposed to share only the trainable parameters of batch normalization (BatchNorm) layers with the central server, without communicating the running averages of the batch statistics. Aggregating the trainable parameters from all clients while keeping the running averages local helps to alleviate the problem of feature shift in non-i.i.d. training scenarios [70]. Karimireddy et al. utilize control variates as a variance reduction technique to approximate the update directions of the server model and each client model. The client drift, which naturally arises from training on different local data distributions, can be estimated by the difference between these update directions and is corrected for during the local training of each client [52]. Cao et al. rely on clustering the clients according to the classes of data they possess. They only average parameters from the same group when updating the central server model, guaranteeing that parameters are only averaged over a set of clients with a comparable data distribution [8]. Seol and Kim propose a two-step approach: first, they use data oversampling to eliminate class imbalances among clients; second, the clients are selected in such a way that their data distribution is nearly uniform. Furthermore, the central server constantly adjusts the amount of data used for local training, the batch size, and the learning rate of the clients to avoid performance degradation [101]. We also address data heterogeneity and introduce our own generalization of FedAvg, named federated learning with client queuing (FedQ).
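The surrogate objective of Li et al. [67] can be sketched as follows, reusing the toy linear model from above: the plain local loss is augmented with a proximal term \(\frac{\mu}{2} \lVert \boldsymbol{\theta} - \boldsymbol{\theta}_{Global} \rVert^2\) that penalizes drifting away from the current global parameters. The coefficient `mu` and the other hyperparameters are hypothetical choices.

```python
import numpy as np

def proximal_local_update(theta_global, X, y, mu=0.01, lr=0.1, epochs=5):
    """Local training on a surrogate objective: the plain local loss plus the
    proximal term (mu / 2) * ||theta - theta_global||^2, which keeps the
    locally updated parameters close to the current global model."""
    theta = theta_global.copy()
    for _ in range(epochs):
        grad_loss = 2.0 * X.T @ (X @ theta - y) / len(y)  # local MSE gradient
        grad_prox = mu * (theta - theta_global)           # pull toward the global model
        theta -= lr * (grad_loss + grad_prox)
    return theta
```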
Although FL operates in a decentralized environment, the participating clients’ privacy may be compromised by merely transmitting the training updates. Geiping et al. reconstructed high-resolution images by examining the gradients communicated by each client [29]. Dimitrov et al. were also able to extract sensitive information contained in the weights obtained by the FedAvg procedure. Therefore, the concept of differential privacy [23] is often applied in the setting of FL. When working with aggregated data, differential privacy can be utilized to protect the private information contained in individual data points. It achieves this protection by perturbing the data points with random noise, exploiting the fact that a single data point has relatively little impact on the aggregated data as a whole, while adding random noise alters the individual data points to a degree that no useful information can be extracted from them [22]. Wei et al. proposed to add specific noise to the parameters of each client before aggregation by the central server [116]. This ensures a decent training accuracy while a certain level of privacy is maintained, provided that a sufficiently large number of clients is involved [116]. Phong et al. [91] proposed the use of homomorphic encryption in the more general setting of distributed training, and Fang and Quan [25] suggested its use in the setting of FL. Homomorphic encryption is a specialized encryption scheme that allows performing certain mathematical operations on data without decrypting it.
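As an illustration of the noise-based approach, the following minimal sketch perturbs a client’s parameter update before it is sent to the server, in the spirit of Wei et al. [116]; the clipping bound and noise scale are hypothetical choices, and calibrating them to a formal \((\epsilon, \delta)\) privacy guarantee is beyond the scope of this sketch.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip the update to bound any single client's influence, then add
    Gaussian noise so that little can be inferred about the local data."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(scale=noise_std, size=update.shape)
```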
2.3 Communication-Efficient Federated Learning
When dealing with mobile clients, internet connections may be inconsistent and potentially have high latency. Even when FL clients are connected via reliable network connections, mobile connections are usually still bandwidth-constrained and, in many cases, even metered. Over the course of FL, training updates must be exchanged many times, so a central goal in FL is communication minimization. When communicating model parametrizations, several size reduction techniques can serve as possible solutions:
Sparsification/Pruning excludes single neurons (unstructured) or entire layers of neurons (structured) from an NN. While sparsification only sets excluded neurons to 0, pruning actually removes them [65]. Sparsified models are more amenable to compression but still have their original size when uncompressed. Pruned models, on the other hand, are reduced in size even without compression. The disadvantage of pruned networks is that they may require specialized software and/or hardware, while sparsified models can run on regular software and hardware.
Distillation is a technique for transferring the knowledge of a teacher model into a smaller student model. This is done by minimizing the difference between the output of the student model and the output of the teacher model (also known as soft labels) on data points from a separate dataset [43].
In quantization, the weights of an NN are constrained to a discrete set of values so that they can be represented with fewer bits [30] (see the sketch below).
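A minimal sketch of such quantization, assuming a simple uniform scheme over the weight range (practical schemes are typically more elaborate, e.g., non-uniform levels or trained codebooks):

```python
import numpy as np

def quantize(weights, bits=8):
    """Map each weight to the nearest of 2**bits evenly spaced levels, so the
    tensor can be stored as small integers plus the offset and scale."""
    lo, hi = float(weights.min()), float(weights.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((weights - lo) / scale).astype(np.uint16)
    return q, lo, scale

def dequantize(q, lo, scale):
    """Recover an approximation of the original weights."""
    return lo + q.astype(np.float64) * scale
```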
Lossless compression techniques encode the NN data in a way that removes redundancy and thus reduces its size [37].
There are many works that have developed communication-efficient FL solutions using the above-mentioned techniques or combinations thereof [57, 96, 97], and even some with specialized techniques, such as federated dropout [6]. Konečný et al. propose employing quantization, random rotations, and sub-sampling to compress the updated model parameters of the clients before sending them to the central server [57]. Wu et al. adopt an orthogonal strategy: the clients train a teacher model on their local data and distill it into a smaller student model. Instead of communicating the gradients of the teacher models, the clients compress and send the gradients of the smaller student models [121]. Sattler et al. introduce a compression framework combining communication delay methods, gradient sparsification, binarization, and optimal weight update encoding to reduce the upstream communication cost in distributed learning scenarios [96]. To adapt it to the FL setting, Sattler et al. enhance this approach, taking the compression of the downstream communication and the non-i.i.d. local data distributions of the clients into account. They construct a framework combining a novel top-\(k\) gradient sparsification method (sketched below) with ternarization and optimal Golomb encoding of the updated client model parameters [97]. Another emerging field of research considers combinations of differential privacy and quantization methods in order to reduce communication costs. Lang and Shlezinger demonstrated that, within their framework, it is possible to quantize data at a given bit rate without sacrificing a specified level of privacy or degrading model performance [63]. They enhanced methods proposed by Reisizadeh et al. and Konečný et al., which solely use quantization and do not include privacy-related considerations.
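To illustrate the sparsification building block used in such frameworks, here is a minimal top-\(k\) sketch that keeps only the \(k\) entries of largest magnitude in an update and zeroes the rest; the ternarization and Golomb encoding stages of Sattler et al. [97] are omitted.

```python
import numpy as np

def top_k_sparsify(update, k):
    """Keep the k entries of largest magnitude, set all others to 0; only the
    (index, value) pairs of the survivors need to be transmitted."""
    flat = update.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k entries
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(update.shape)
```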
2.4 Federated Recommender Systems
The current public discussion of RecSyss (often just referred to as “the algorithm” or AI personalization) focuses, among other topics, on their invasive behavior concerning personal data collection [32, 42, 61, 62]. This might create a negative relationship between user and RecSys, potentially resulting in anything from user discontent to “algorithmic hate” [106]. RecSyss are arguably a vital part of the user experience on the internet since, without them, the flood of content would be barely manageable. Therefore, FL may be part of the solution to the privacy problem of RecSyss: by training the recommender models directly on user devices, the need for gathering private information is entirely circumvented.
FL has already been proven to work well in many other domains, e.g., cancer research [95], natural language processing [72], graph NNs [40], image classification [81], transfer learning [77], language models [5], mobile keyboard prediction [38], and keyword spotting [66], so it is reasonable to anticipate that it is likewise effective in the domain of RecSyss. In fact, there are numerous methods in the literature for incorporating current RecSys frameworks into FL. They can be classified as focusing on learning algorithms [3], security [93], or optimization models [86], depending on the task’s objective [2]. Matrix factorization is a commonly utilized approach in the first scenario. Ammad-ud-din et al. were among the pioneers in this emerging field by introducing this model to address collaborative filtering tasks in the context of FL. They constructed a RecSys that gives personalized recommendations based on users’ implicit feedback [3]. Lin et al. designed a new federated rating prediction mechanism for explicit responses. They employed user averaging and hybrid filling in order to keep the system computationally efficient and the communication costs moderately low [74].
To increase the model capabilities for each client, Jia and Lei incorporated a bias term for the input signals. In addition, weights on the local devices were adjusted so that unreasonable user ratings are removed [49]. Flanagan et al. employed a similar strategy, enhancing the model’s capacity by incorporating input from other data sources [27]. Wang et al. introduced a new algorithmic approach by combining matrix factorization with FedAvg. They demonstrated that the cost of communication with the central server for non-i.i.d. data can be decreased by limiting the number of local training iterations [114].
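To make the federated matrix factorization setting concrete, the following minimal sketch keeps each user’s embedding strictly on the device and only exchanges item-embedding gradients with the server; the plain SGD update rule and the learning rate are simplifying assumptions rather than a rendering of any one of the cited systems.

```python
import numpy as np

def client_mf_step(user_vec, item_mat, ratings, lr=0.05):
    """One local step. `ratings` maps item index -> observed rating; the
    private user embedding is updated in place and never leaves the device,
    while the item-embedding gradient is returned for upload."""
    item_grad = np.zeros_like(item_mat)
    for j, r in ratings.items():
        err = user_vec @ item_mat[j] - r      # prediction error on item j
        item_grad[j] += err * user_vec        # gradient w.r.t. item row j
        user_vec -= lr * err * item_mat[j]    # private user update, kept local
    return user_vec, item_grad

def server_step(item_mat, item_grads, lr=0.05):
    """Server side: average the uploaded item gradients and update."""
    return item_mat - lr * np.mean(item_grads, axis=0)
```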
As previously shown, private information can be reconstructed from the clients’ transmitted parameters. In order to remedy this, a variety of privacy-preserving techniques based on encryption, obfuscation, or masking can be utilized [4]. Homomorphic encryption makes the communication of encrypted data between the central server and its clients possible, allowing for intermediate calculations without the need to first decrypt the data. As a result, the central server is unable to infer the data it is working with [54]. For this reason, Chai et al. propose a secure matrix factorization framework to handle data leakage. They showed how privacy can be compromised by intercepting the gradient updates sent by the clients to the central server in two consecutive communication rounds. To address this problem, they encrypt the clients’ gradients before sending them to the central server [11]. Zhang and Jiang enhanced the approach by clustering the encrypted user embeddings to reduce the dimension of the user-item matrix, improving the recommendation accuracy [131]. Lin et al. utilized a different cryptographic technique: they applied secret sharing, wherein a group of clients can only reconstruct sensitive information if they collaborate by combining their shares [103]. By applying this concept to the clients’ locally computed gradients, the authors managed to construct a FedRec framework that provides strong privacy guarantees on the clients’ individual data [75]. Another technique is secure multi-party computation, which refers to a protocol for computing a function based on the data of a group of clients without disclosing private information to one another [20]. Perifanis and Efraimidis utilized this approach in the setting of federated neural collaborative filtering (NCF). They demonstrated that employing a secure multi-party computation protocol for FedAvg protects privacy against an honest-but-curious entity without compromising the quality of the RecSys [90].
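A minimal sketch of the secret sharing idea, using additive shares over real-valued gradients (production protocols typically share over a finite field): each share individually reveals nothing, but all shares together sum back to the secret.

```python
import numpy as np

def make_shares(secret, n_shares, rng=None):
    """Split a gradient vector into n additive shares: the first n - 1 are
    random, and the last is chosen so that all shares sum to the secret."""
    rng = rng or np.random.default_rng()
    shares = [rng.normal(size=secret.shape) for _ in range(n_shares - 1)]
    shares.append(secret - np.sum(shares, axis=0))
    return shares

secret = np.array([0.5, -1.2, 3.0])   # e.g., a client's gradient
shares = make_shares(secret, n_shares=3)
assert np.allclose(np.sum(shares, axis=0), secret)  # reconstruction needs all shares
```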
Differential privacy falls into the category of privacy-preservation techniques that use obfuscation. Ribero et al. added differential privacy to FL utilizing a matrix factorization technique. They succeeded in limiting the privacy loss posed by the repetitive nature of the FL process by only requiring a few rounds of communication [93]. Yang et al. designed a matrix factorization-based RecSys that adds Laplacian random noise to the users’ encrypted item embeddings, ensuring a high level of security [125]. Minto et al. proposed a system combining differential privacy and implicit user feedback. They constrain the number of local gradient updates sent by the users according to the level of privacy each user tries to maintain [84]. We also address the problem of privacy preservation by obfuscation: instead of applying random noise to the weight updates that are sent to the central server, the weights are quantized, which is conducive both to privacy preservation and to reducing the communication overhead. We later provide a detailed attack analysis of the exchanged model parameters, which are potentially susceptible to leaking information about the underlying datasets of the participating clients. We present specific attacks applicable to our scenario and examine how their requirements and assumptions do not apply to our approach to privacy preservation, thus rendering them ineffective.
Another method of achieving data security is the introduction of pseudo interactions in order to mask user behavior in FedRecs. This protection mechanism is implemented by adding artificial interactions with randomly selected items to users. The central server is thus unable to determine the real set of items a user has interacted with, as the uploaded gradient is computed with respect to both real and artificial interactions [74] (a minimal sketch follows this paragraph). Since this method produces noisy gradients, degrading the model performance, Liang et al. introduced denoising clients into the training process [71]. Another approach that hits the same mark, but entirely foregoes FL, was presented by Wainakh et al. [112]. They employ a random walk-based approach to decentralized optimization, where a randomly chosen client trains its local model for one or multiple epochs before sending its updated parameters to a randomly selected neighboring client according to the underlying graph structure [108, 111]. Wainakh et al. adapt this approach to account for privacy by introducing the anonymous random walk technique, where clients, instead of training a model, can choose to add their own data to an existing dataset that was sent by a neighboring client in a prior round. The accumulated data can then be uploaded to the central server for centralized training. Due to the nature of the random walk, neither the clients nor the central server know where the individual samples of the accumulated dataset originate from, thus effectively masking the users’ identities.
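As announced above, a minimal sketch of the pseudo-interaction masking: before computing its update, a client pads its real item set with randomly sampled items so that the server cannot tell which interactions are genuine; the number of artificial items is a hypothetical choice.

```python
import numpy as np

def mask_interactions(real_items, n_items, n_fake, rng=None):
    """Return the union of the real items and randomly sampled artificial
    items; gradients computed over this set hide the true interaction set."""
    rng = rng or np.random.default_rng()
    candidates = np.setdiff1d(np.arange(n_items), real_items)
    fake = rng.choice(candidates, size=n_fake, replace=False)
    return np.union1d(real_items, fake)
```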
Dealing with the statistical heterogeneity of the clients’ local data in the context of FedRecs is a different area of research. There are various proposed strategies for addressing this issue, which primarily include clustering and meta-learning [109]. Jie et al. designed a FedRec utilizing a clustering approach based on historical parameters to form homogeneous groups of clients, in which a personalized model can be trained. These parameters are retrieved by averaging the model parameters from the clients’ last communication rounds with the central server [50]. Chen et al. proposed a different method based on model-agnostic meta-learning, which is a training paradigm where a meta-learner is employed to rapidly train models on new tasks. The meta-learner itself is a trainable algorithm that trains a model on a task, which consists of a support set and a query set. The model is trained using the support set and then evaluated on the query set. Based on this evaluation, a loss is computed, which reflects the ability of the meta-learner to train the model. The meta-learner is then updated to minimize this loss. For example, the meta-learner in the model-agnostic meta-learning (MAML) [26] algorithm provides an initial set of parameters for the model that is trained on the task (see the sketch below). Meta-learning algorithms are known to generalize effectively to new tasks, which makes them well-suited for tackling the non-i.i.d. problem in FL. For this reason, Chen et al. adapted MAML, as well as another meta-learning algorithm called Meta-SGD, to the FL setting, which enabled them to reach higher model performance than the FedAvg baseline [12]. Our FedRec was not only affected by heterogeneous client data but also by exceedingly small local datasets. Our approach to non-i.i.d.-ness, FedQ, therefore differs greatly from the two above-mentioned approaches, as neither clustering nor meta-learning is capable of handling truly small local datasets.
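The MAML-style meta-update mentioned above can be sketched on a toy regression problem; for brevity, this uses the first-order approximation of MAML (ignoring second derivatives), and the task distribution and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A task: fit y = a * x for a random slope a; support and query sets."""
    a = rng.uniform(-2.0, 2.0)
    x_s, x_q = rng.normal(size=10), rng.normal(size=10)
    return (x_s, a * x_s), (x_q, a * x_q)

def grad(theta, x, y):
    """Gradient of the MSE of the linear model y_hat = theta * x."""
    return 2.0 * np.mean((theta * x - y) * x)

theta_meta, inner_lr, outer_lr = 0.0, 0.1, 0.01
for _ in range(1000):
    (x_s, y_s), (x_q, y_q) = sample_task()
    # Inner loop: adapt the meta-initialization to the task's support set.
    theta_task = theta_meta - inner_lr * grad(theta_meta, x_s, y_s)
    # Outer loop: evaluate the adapted model on the query set and update the
    # initialization (first-order: the query gradient is taken at theta_task).
    theta_meta -= outer_lr * grad(theta_task, x_q, y_q)
```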
The clients’ potentially constrained resources are the subject of another line of research. Muhammad et al. utilized a simple DNN with small embedding sizes to balance the number of learnable parameters and the accuracy of the resulting recommendations. In addition, they presented a new sampling technique coupled with an active aggregation method, which reduces communication costs and produces more accurate models even at an early stage of training [86]. Zhang et al. addressed related problems and developed a new framework that effectively integrates a novel matrix factorization technique with privacy via a federated discrete optimization algorithm. Although the model’s RAM, storage, and communication bandwidth requirements are modest, performance is not affected and is even superior to related state-of-the-art techniques [130]. Our suggested approach combines all three of the aforementioned types of objectives: we balance model complexity and capacity by opting for a simple, yet scalable DNN architecture. This keeps the client side resource-efficient while maintaining the possibility of scaling up. In addition, we anticipate that applying quantization will provide a certain amount of privacy while also lowering the burden associated with exchanging parameters with the central server via potentially bandwidth-constrained network connections.