
CLC: A Consensus-based Label Correction Approach in Federated Learning

Published: 21 June 2022
Abstract

Federated learning (FL) is a novel distributed learning framework where multiple participants collaboratively train a global model without sharing any raw data, so as to preserve privacy. However, data quality may vary among the participants, and the most typical issue is label noise. Incorrect labels can significantly damage the performance of the global model, and the inaccessibility of raw data in FL makes this issue more challenging. Previously published studies are limited to using a model trained on a task-specific benchmark dataset held by the server to evaluate the relevance between that benchmark and each participant's local data. However, such approaches fail to exploit the cooperative nature of FL itself and are not practical. This paper proposes a Consensus-based Label Correction approach (CLC) in FL, which tries to correct the noisy labels using a consensus method developed among the FL participants. The consensus-defined class-wise information is used to identify the noisy labels and correct them with pseudo-labels. Extensive experiments are conducted on several public datasets in various settings. The experimental results demonstrate the advantage over state-of-the-art methods. The link to the source code is https://github.com/bixiao-zeng/CLC.git.

    1 Introduction

Federated learning (FL) is a machine learning framework where multiple participants (e.g., hospitals or banks) collaboratively learn a global model without explicitly exchanging the local raw data [22]. Only the intermediate model parameters are shared between the server and each participant in the federation. In this way, FL provides a privacy-preserving way to integrate knowledge from distributed data. More participants may bring more knowledge but also more risk of label noise. Label noise is a common issue in real-world applications, owing to the lack of annotation expertise or to malicious attacks. For example, accurate diagnosis in the healthcare domain usually requires a clinical expert or additional costly spectroscopic measurements.
Over the past two decades, some works have studied the problem of training a deep neural network (DNN) for a specific classification task with noisy labels. There are two types of solutions at a macro level: (1) directly training noise-robust models on unclean data, and (2) detecting and cleansing noisy labels before model training. The noise-robust solutions typically focus on designing new loss functions to prevent the model from overfitting label noise [27]. Such approaches, however, do not distinguish clean data from noisy data, so they essentially cannot enhance the label quality further. In other words, the participants can hardly obtain any new information from the existing dataset. These approaches inevitably allow all the label noise to be involved in model training, making it difficult for them to adapt to situations with more label noise. The cleansing solutions, which conduct data pre-selection or label correction, solve this dilemma in a more direct way. In more detail, label correction methods correct noisy labels to their true labels via a label inference step using complex noise models characterized by directed graphical models, conditional random fields, neural networks, or knowledge graphs [28]. These approaches further mitigate the loss of information caused by data pre-selection methods. However, much of this research relies on an extra clean dataset (benchmark dataset).
There has been increasing interest in dealing with label noise in federated learning under a strict benchmark-dataset condition in recent years. Some works try to train a noise-robust model by lowering the aggregation weight of the unclean participant [7], and some try to detect and clean the noisy labels before local training [26]. However, these studies' generalisability is problematic because they depend heavily on a task-related benchmark, whereas federated learning is created precisely because collecting high-quality centralized data is difficult. More importantly, the methods relying on an extra dataset ignore the cooperative characteristic of federated learning itself [17]. Moreover, directly transferring traditional noise detection methods to FL is not a wise choice, since participants who own low-quality or small amounts of data have a poor ability to fix their own problems.
As the saying goes, two heads are better than one. Long before the rise of federated learning, a considerable amount of literature recognized the critical role of consensus. Consensus is a vital inference method that examines multiple noisy labels to find a single label. Zhang [30] points out that the goal of a consensus algorithm is to infer the ground truth from the wisdom of the crowd. Consensus makes it possible for annotators to collectively contribute to very accurate labeling of instances. One of the well-known consensus algorithms is majority voting (MV) [13], the simplest method of inferring an integrated label: it selects the most frequent label among the group of noisy labels as the integrated label. This naive method can be improved significantly. In [24], a consensus framework called GLAD models both the levels of expertise and the difficulties of examples, treating the probability of an object being positive as a latent variable. In most cases, such consensus algorithms use an EM procedure to infer the estimated labels of examples.
Although previous consensus methods can often produce clean datasets from a macro view, the fact that participants' data is invisible to the server makes these methods challenging to implement in federated learning.
To fully utilize the cooperative characteristic of federated learning [17], we developed a Consensus-based Label Correction approach in FL. This framework aims to benefit not only the participants but also the noise-robust model. The importance and originality of this study are that it explores label correction in FL and focuses on the power of 'federation'. Firstly, we define class-wise global thresholds based on the consensus method to estimate a temporary latent label. Secondly, we combine the latent label and a margin tool to dynamically decide which samples should be set aside for the current round. The remaining data then contributes to model training. Once the global model converges on the participants' validation datasets, the latent label for the next round is adopted as the true one, which further improves the global model.
CLC, presented in Figure 1, is oriented to a real-world federated learning scenario. Each participant often holds a dataset of different quality and quantity, and each dataset's distribution is not necessarily consistent with the overall distribution. In addition, the setting of this method meets the basic requirements of federated learning regarding data privacy. The server cannot directly obtain the original data of the participants. In the information exchange stage (Steps 2~3, 7~8), the server and the participants only need to exchange model parameters and class-wise information. Unlike most noisy-label research in the field of federated learning, the server does not need to provide a task-related benchmark dataset. More details on the framework are given in the following sections. To evaluate the effectiveness of CLC, we conduct extensive experiments on public datasets with both synthetic and real-world noise settings. Experimental results show that CLC can detect and correct noisy labels to improve classification performance. The contributions of this paper are threefold: (1) CLC mitigates the impact of noisy labels in FL by correcting them to alleviate the loss of knowledge; (2) CLC brings the cooperative nature of FL into full play by using a consensus-based method, which means it does not need an extra benchmark dataset; and (3) CLC is implemented by communicating non-sensitive parameters to satisfy the data privacy policy in FL.
    Fig. 1.
    Fig. 1. The proposed CLC framework.
    The rest of the article has been organized in the following way: Section 2 discusses the related works. Section 3 presents the methodology of CLC and corresponding derivation, and Section 4 shows the empirical evaluation on both synthetic and real-world label noise. Section 5 concludes the article.

    2 Related Work

    2.1 An Insight Into Label Noise

Label noise generally occurs when human experts are involved. With this claim, possible causes of label noise include imperfect patterns and perceptual errors. Potential sources of label noise can be classified into the following categories [5]: (1) insufficient information: the attributes provided to the expert may be insufficient to conclude a reliable label; (2) subjective labeling: there may be significant differences between experts if the labeling task is subjective, such as in medical applications or image data analysis; and (3) recording problems: label noise can also be caused by a communication issue or wrong logging, such as an accidental click. All of these causes make label noise a problem not to be underestimated. The existing body of research on machine learning with label noise suggests three typical categories: (1) loss correction methods, (2) data pre-selection methods, and (3) label correction methods. From a macro point of view, loss correction methods correspond to directly training a robust model, and the latter two methods correspond to detecting and cleaning noisy labels. These approaches improve the robustness of deep neural networks to label noise to some extent, but each remains limited in its own way. An example of loss correction methods is the noise adaptation layer of Goldberger and Ben-Reuven [9], in which an additional softmax layer explicitly models the noise to connect the correct labels to the noisy ones. In a follow-up study, Wang [27] proposes a symmetric cross-entropy with a noise-robust term called Reverse Cross Entropy (RCE). However, these approaches inevitably acquiesce in the participation of all label noise in model training, making it difficult for them to adapt to situations with more label noise. To date, several studies have investigated data pre-selection methods, which select clean data before the start of model training; in other words, data with noisy labels is no longer involved in training. INCV [6] finds clean data with multiple iterations of cross-validation, then trains on the clean set. Northcutt [19] explores the use of the label transmission matrix to filter noisy labels. However, since the estimate at the initial moment is not always accurate, over-selection or insufficient selection can often happen. Another severe weakness of these approaches is that dropping data with noisy labels wastes the samples' available features. To alleviate the loss of information, more recent attention has focused on label correction methods. A significant analysis of label correction was presented by Xiao [28], where directed graphical models characterize the complex noise models. Likewise, undirected graphs have been adopted to model label noise. However, much of the research up to now requires support from extra clean data, and the quality of the extra data largely determines whether the inferred noise model matches the actual situation.

    2.2 Federated Learning with Label Noise

Since multiple participants may bring low-quality labeled datasets, with the corresponding negative effects on model training, the label noise problem in federated learning has attracted increasing academic interest in recent years. Although the main ideas for solving label noise in federated learning are inseparable from the above three mainstream methods, the unique data privacy policy of federated learning keeps label noise an open problem. While the raw data is not visible to the server, the noise is no longer easily identifiable. In federated learning, generating intermediate parameters for measuring label noise can relieve this issue. FOCUS [7] reduces the weight of low-quality participants to alleviate the negative effect of label noise. Tuor [26] conducts data pre-selection before federated training, implemented by estimating the similarity between participants' training data and the server's benchmark data. These explorations are unlikely to be satisfactory because they discard available data features. Also, these methods require a task-related benchmark dataset that must be completely clean. However, most real-world cases can hardly satisfy such requirements; in fact, federated learning was created to solve the problem that high-quality data resources are difficult to upload and share. In our work, noisy labels are corrected via a consensus-based method without a benchmark dataset, which means the knowledge hidden in participants' datasets can be fully utilized.

    2.3 Consensus Mechanism

The simplest efficient consensus algorithm is majority voting (MV) [13], which leads to high-quality labels as long as the provided labels have a certain degree of accuracy. However, real-world scenarios are so complicated that MV does not always work well, especially in situations where biases occur in the labeling process. For a classification task, the adopted class depends entirely on the number of annotators holding the corresponding opinion, regardless of factors such as the annotator's expertise. In recent years, several more sophisticated agnostic consensus algorithms, such as GLAD and RY, have been investigated in crowdsourcing learning [30], highlighting the capability of correctly inferring the minority. GLAD models both the levels of expertise and the difficulties of examples, treating the probability of an object being positive as a latent variable [24]. Raykar proposed the Bayesian approach RY to add specific priors for each class [21]. Unfortunately, these consensus mechanisms are not suitable for the federated learning environment, because they become invalid when all the data cannot be accessed centrally.

    2.4 Confident Learning

Confident learning (CL) [19] is a novel approach that estimates label errors by examining the predictive probability of each given label. By drawing on the concept of self-confidence (the given label's predictive probability), the authors are able to estimate the joint distribution between noisy (given) labels and true labels. Unfortunately, CL provides several noise estimation methods that are highly dependent on a benchmark pre-trained model. Besides, CL works by considering all the data, which is not directly feasible in federated learning. However, if class-wise statistics are aggregated across all participants' data without the raw data itself being exchanged, the true latent label can be identified in a privacy-preserved fashion. We acknowledge that this implementation is a heuristic adaptation of the authors' original algorithm, yet it is justified because dealing with label noise in FL is by no means the same thing as in traditional machine learning settings.

    2.5 Forgetting Events

In [25], the authors investigated the learning dynamics of neural networks as they train on single classification tasks. This work defines a forgetting event as occurring when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Detailed examination of forgetting events showed that data with noisy labels is more likely to experience them. Since counting forgetting events is inconvenient in most cases, a margin tool is designed to estimate them. This margin represents the difference between the largest probability among all the classes and the probability of the given class. An interesting pattern observed by the authors is that frequently forgotten samples usually have a large margin. In other words, this margin tool helps estimate label noise during the learning course. With that in mind, we take inspiration from this valuable study and deploy a Train-Noise Splitting strategy (TNS). TNS supports CLC in better adapting to the dynamic federated learning process.

    3 Methodology

    3.1 Problem Statement and Notations

    In the context of multi-class data with possibly noisy labels, let \( [m] \) denote \( \lbrace 1,2, \ldots , m\rbrace , \) the set of \( m \) unique class labels, and \( \mathcal {D}^{k}:=(x, y)^{n_{k}} \in (\mathbb {R}^{d},[m])^{n_{k}} \) denote the \( k^{th} \) participant’s dataset of \( n_{k} \) examples \( x \in \mathbb {R}^{d} \) with associated observed noisy labels \( y \in [m] \) . While a number of relevant works address the setting where extra annotator labels are available [4], this paper addresses the general setting where no annotation information is available except the observed noisy labels.
    We assume there exists, for every example, a latent, true label \( y^{*} \) . Before observing \( y \) , a class-conditional classification noise process (CNP) [3] maps \( y^{*} \rightarrow {y} \) such that every label in class \( j \in [m] \) may be independently mislabeled as class \( i \in [m] \) with probability \( p(y=i \mid y^{*}=j) \) . In other words, \( p(y=i \mid y^{*}=j) \) describes a label transmission matrix between the given label and true label. CNP is reasonable and currently one of the most popular assumptions for estimating label noise.
In a real-world FL scenario, local datasets unavoidably contain label noise, and the number of noisy labels in participants' data differs. The global model may overfit noisy data when aggregating model parameters from the participants' side in federated learning. In the following, we focus on estimating the label transmission matrix over the overall data \( \mathcal {D} = \bigcup _{k} \mathcal {D}^{k} \) . In traditional distributed learning, since data are distributed in different nodes and are highly controlled by the server, estimation over the overall data is feasible. However, in the federated setting, the server can neither access any training data nor easily obtain an extra task-related benchmark dataset. Therefore, we utilize federated learning's collaborative nature to offer a consensus-based method for identifying label noise in FL.

    3.2 Estimation in Traditional Distributed Learning

According to CNP, our goal is to estimate the probability that a sample with a given label \( y \) can be determined to its true label. This probability serves as a lower threshold on the probability of assigning the true label to a sample with a given raw label. In the literature [19], this threshold tends to be represented as:
    \( \begin{equation} c(i)=\sum _{j \in [m]} p\left(y=i \mid y^{*}=j\right) p\left(y^{*}=j \mid y=i\right)\!. \end{equation} \)
    (1)
Intuitively, when \( i \ne j \) , the contribution to the sum represents the probability of being determined correctly ( \( p(y^{*}=j \mid y=i) \) ) when the given label is wrong ( \( p(y=i \mid y^{*}=j) \) ); when \( i=j \) , it represents the probability of being determined correctly when the given label is correct. In traditional machine learning, the overall dataset \( \mathcal {D} = \bigcup _{k} \mathcal {D}^{k} \) can be used to estimate the threshold directly. Assuming that the model \( \boldsymbol {\theta } \) achieves a certain predictive ability (reliable softmax outputs) at round \( t \) , we can obtain:
    \( \begin{equation} \begin{aligned}c_{t}(i)&=\sum _{j \in [m]} p\left(y=i \mid y^{*}=j\right) \mathbb {E}_{x \in \mathcal {D}_{y=i}} \hat{p}\left(y^{*}=j ; x, \boldsymbol {\theta }_{t}\right)\\ &=\mathbb {E}_{x \in \mathcal {D}_{y=i}} \sum _{j \in [m]} p\left(y=i \mid y^{*}=j\right) \hat{p}\left(y^{*}=j ; x, \boldsymbol {\theta }_{t}\right)\!, \end{aligned} \end{equation} \)
    (2)
    where the term \( \mathcal {D}_{y=i} \) is the set of all samples with a given label \( i \) .
    According to CNP assumption, we have:
    \( \begin{equation} \begin{aligned}c_{t}(i)&=\mathbb {E}_{x \in \mathcal {D}_{y=i}} \sum _{j \in [m]} p\left(y=i \mid y^{*}=j ; x, \boldsymbol {\theta }_{t}\right)\hat{p}\left(y^{*}=j ; x, \boldsymbol {\theta }_{t}\right)\!. \end{aligned} \end{equation} \)
    (3)
    Using the law of total probability and Bayes, we have:
    \( \begin{equation} c_{t}(i)=\frac{1}{|\mathcal {D}_{y=i}|} \sum _{x \in \mathcal {D}_{y=i}} \hat{p}\left(y=i ; x, \boldsymbol {\theta }_{t}\right)\!. \end{equation} \)
    (4)
    For traditional distributed learning, Equation (4) is computable as the server can access all the data easily.
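As an illustration, the following NumPy sketch (function and variable names are ours, not taken from the released code) computes the class-wise thresholds of Equation (4) from a model's softmax outputs and the given labels:

```python
import numpy as np

def class_thresholds(probs, given_labels, num_classes):
    """Estimate c_t(i) as in Eq. (4): the mean predicted probability of
    class i over all samples whose given (possibly noisy) label is i."""
    c = np.zeros(num_classes)
    for i in range(num_classes):
        mask = given_labels == i            # samples carrying the given label i
        if mask.any():
            c[i] = probs[mask, i].mean()    # average self-confidence for class i
    return c

# toy usage: 4 samples, 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
given = np.array([0, 1, 1, 0])
print(class_thresholds(probs, given, 3))    # -> [0.65, 0.55, 0.]
```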

    3.3 Consensus Based Estimation in Federated Learning

Consensus mechanism. Though the threshold \( c_{t}(i) \) can be computed in traditional distributed learning, it encounters data barriers in federated learning. Thus, we propose a new consensus method for the FL setting, in which the local class-wise thresholds are weighted-averaged to compute the global ones. Following Equation (4), it is necessary here to clarify exactly what is meant by the local threshold on each participant's side, that is, \( c_{t}^{k}(i)=\tfrac{1}{|\mathcal {D}_{y=i}^{k}|} \sum _{x \in \mathcal {D}_{y=i}^{k}} \hat{p}(y=i ; x, {\boldsymbol {\theta }}_{t}^{G}) \) . Similarly, the term that will be used to describe the threshold from the FL server's view is called the global threshold:
    \( \begin{equation} \begin{aligned}c_{t}^{G}(i)&=\frac{1}{|\mathcal {D}_{y=i}|} \sum _{x \in \mathcal {D}_{y=i}} \hat{p}\left(y=i ; x, {\boldsymbol {\theta }}_{t}^{G}\right)\!. \end{aligned} \end{equation} \)
    (5)
For estimating the threshold precisely in FL, we adopt the Weighted Average Consensus method to compute the global threshold in a privacy-preserving way. The derivation process is as follows:
\( \begin{equation} \begin{aligned}c_{t}^{G}(i)&=\frac{1}{|\mathcal {D}_{y=i}|} \sum _{x \in \mathcal {D}_{y=i}} \hat{p}\left(y=i ; x, \boldsymbol {\theta }_{t}^{G}\right)\\ &=\frac{1}{\sum _{k=1}^{N}\left|\mathcal {D}_{y=i}^{k}\right|} \sum _{k=1}^{N} \sum _{x \in \mathcal {D}_{y=i}^{k}} \hat{p}\left(y=i ; x, \boldsymbol {\theta }_{t}^{G}\right)\\ &=\sum _{k=1}^{N} \frac{\left|\mathcal {D}_{y=i}^{k}\right|}{\sum _{k=1}^{N}\left|\mathcal {D}_{y=i}^{k}\right|} \frac{1}{\left|\mathcal {D}_{y=i}^{k}\right|} \sum _{x \in \mathcal {D}_{y=i}^{k}} \hat{p}\left(y=i ; x, \boldsymbol {\theta }_{t}^{G}\right)\\ &=\sum _{k=1}^{N} \frac{\left|\mathcal {D}_{y=i}^{k}\right|}{\sum _{k=1}^{N}\left|\mathcal {D}_{y=i}^{k}\right|} c_{t}^{k}(i). \end{aligned} \end{equation} \)
    (6)
The corresponding process of Generating Global Thresholds (GGT) is shown in Algorithm 1. Thus far, the global threshold provides a reference floor for a label's predicted probability. A latent label \( y^{*} \) can then be derived as:
    \( \begin{equation} y^{*}=\underset{i \in [m] : \hat{p}\left(y=i ; x, \boldsymbol {\theta }_{t}^{G}\right)\gt c_{t}^{G}(i)}{\arg \max } \hat{p}\left(y=i ; x, \boldsymbol {\theta }_{t}^{G}\right)\!. \end{equation} \)
    (7)
By using the class-wise global thresholds, the latent label of a sample can be inferred by two principles: its predicted probability should be larger than (1) the corresponding global threshold, and (2) those of the other classes. If no class satisfies the first principle, the latent label is determined by the second principle alone. We illustrate the advantages of the latent label \( y^{*} \) in CLC with actual cases on the MNIST dataset [15] in Table 1, and the corresponding digit examples are shown in Figure 2: (1) Effectiveness: the handwritten digit '5' is easily mislabeled as '6' because of visual error. In this case, the model is more likely to learn such instances as '6' instead of '5'. However, with the consensus guidance, the model's average prediction probability for the label '6' is so high that the prediction probability on '6' cannot easily pass the threshold. (2) Reliability: for the handwritten digit '2' with its true label, even if the predicted probabilities of multiple classes exceed their global thresholds \( c_{t}^{G} \) , this instance will be easier to identify due to its correct given label. Furthermore, for the handwritten digit '8' with its true label, our algorithm keeps the model's actual output if no class surpasses the consensus. With this effectiveness and reliability, the latent label \( y^{*} \) contributes to the next 'correcting' stage as an important reference.
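A minimal NumPy sketch of the consensus step (Equation (6)) and the latent-label rule (Equation (7)), with the fallback-to-argmax behaviour described above; the array layouts and names are our assumptions, not those of the released code:

```python
import numpy as np

def global_thresholds(local_thresholds, local_counts):
    """Eq. (6): weighted-average consensus over the participants' local
    class-wise thresholds; local_counts[k, i] = |D^k_{y=i}|."""
    totals = np.maximum(local_counts.sum(axis=0, keepdims=True), 1)
    weights = local_counts / totals                    # per-class weights summing to 1
    return (weights * local_thresholds).sum(axis=0)    # c_t^G(i) for every class i

def latent_label(prob, c_global):
    """Eq. (7): highest-probability class among those exceeding their global
    threshold; if no class passes, fall back to the plain argmax."""
    passing = np.where(prob > c_global)[0]
    if passing.size > 0:
        return int(passing[np.argmax(prob[passing])])
    return int(np.argmax(prob))
```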
    Fig. 2.
    Fig. 2. Handwritten digits from left to right: the digits with true label ‘5’, ‘2’, and ‘8’.
    Table 1.
| class | true | given | true | given | true | given |
| --- | --- | --- | --- | --- | --- | --- |
|  | 5 | 6 | 2 | 2 | 8 | 8 |
| \( \hat{p}(y=i ; x, {\boldsymbol {\theta }}_{t}^{G}) \) | 0.242 | 0.490 | 0.433 | 0.401 | 0.345 | 0.215 |
| \( c_{t}^{G} \) | 0.230 | 0.500 | 0.428 | 0.389 | 0.390 | 0.369 |
| compare( \( \hat{p} \) , \( c \) ) | > | < | > | > | < | < |
| argmax | 5 |  | 2 |  |  |  |
| \( y^{*} \) | 5 |  | 2 |  | 8 |  |
    Table 1. The Advantages of Latent Label \( y^{*} \)
Correcting Noisy Labels. The Correcting Noisy Labels process (CNL) is presented in Algorithm 2. Before conducting correction, it is important to improve the global model's predictive ability, which suffers from label noise. To prevent label noise from interfering with model training, we adopt the Train-Noise Splitting (TNS) strategy. Since the global model's early predictive ability is relatively poor, especially as there is not always an extra benchmark dataset for pre-training, CLC performs splitting dynamically, that is, before each global round of training. On the participant side, let \( \mathcal {H}_{t}^{k} \) denote the Holdout set for round \( t \) , and let \( {\mathcal {T}}_t^{k}={\mathcal {D}}^{k}\backslash {\mathcal {H}}_t^{k} \) be the Train set. In the early stage of training, since the global model's performance is still growing, we adopt a margin tool to alleviate mistaken holdout. In our setting, whether an instance should be split into \( \mathcal {H}_{t} \) depends on two conditions as follows:
\( \begin{equation} \mathbb {1}_{\mathcal {H}_{t}}(x,y)=\left\lbrace \!\!\begin{array}{ll} 1, & y \ne y^{*} \;\&\; m(x) \gt \tau \\ 0, & y=y^{*} \mid m(x) \le \tau . \end{array}\right. \end{equation} \)
    (8)
    Only when a sample satisfies \( y \ne y^{*} \) and \( m(x)\gt \tau \) can it be set aside. \( y^{*} \) recalls the temporary latent label shown in Equation (7), and \( m(x) \) can be represented as:
\( \begin{equation} m(x)=\max _{j \in [m]} \hat{p}\left(y=j ; x, \boldsymbol {\theta }_{t}^{G}\right)-\hat{p}\left(y ; x, \boldsymbol {\theta }_{t}^{G}\right)\!. \end{equation} \)
    (9)
The effectiveness of such a margin has been demonstrated in [25], where it was observed that data with noisy labels is more likely to have a large margin. By computing the difference between the largest probability among all the classes and the probability of the given class, the noisy sample can be further determined, as shown in Figure 3. The objective of TNS is to reserve as much clean data as possible. Figure 4 explains how this objective is achieved by using the margin tool \( m(x) \) . \( \tau \) is a hyperparameter, which is empirically set to 0.1 in the experiments.
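A minimal NumPy sketch of the TNS rule in Equations (8) and (9); variable names are illustrative and the released code may organize this differently:

```python
import numpy as np

def tns_split(probs, given_labels, latent_labels, tau=0.1):
    """Train-Noise Splitting (Eqs. (8)-(9)). A sample is held out only if its
    latent label disagrees with the given label and its margin m(x), the gap
    between the largest predicted probability and the given label's
    probability, exceeds tau. Returns (train_mask, holdout_mask)."""
    n = probs.shape[0]
    margins = probs.max(axis=1) - probs[np.arange(n), given_labels]   # m(x)
    holdout = (latent_labels != given_labels) & (margins > tau)
    return ~holdout, holdout
```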
    Fig. 3.
    Fig. 3. The noise determination process in each round.
    Fig. 4.
    Fig. 4. Estimating label noise by using the margin tool m(x), where TP (True Positive) is the noisy sample determined to holdout, FP (False Positive) is the clean sample determined to holdout, FN (False Negative) is the noisy sample determined to not holdout, and TN(True Negative) is the clean sample determined to not holdout.
Since the holdout datasets \( \mathcal {H} \) for each round may differ, the latent label \( y^{*} \) should not be adopted until the global model converges. In our method, the validation loss is used to determine convergence. In the FL setting, where the local data is inaccessible, the convergence of the model is determined by the average loss on all participants' validation datasets. For each participant, the validation set is randomly sampled as 20% of the training set. Each participant uploads the validation loss to the cloud server. Once the averaged validation loss is no longer decreasing, the model is believed to have reached convergence. Moreover, a maximum number of iterations, set as an empirical value per dataset, is used to save time and to prevent the iterative process from running indefinitely. A similar two-stage training method was also used in [12], and the convergence determination of CLC is consistent with it. Suppose the global model converges at round \( T-1 \) ; we then introduce the pseudo-label as follows:
    \( \begin{equation} \hat{y}=\underset{i \in [m] : \hat{p}\left(y=i ; x, \boldsymbol {\theta }_{T}^{G}\right)\gt c_{T}^{G}(i)}{\arg \max } \hat{p}\left(y=i ; x, \boldsymbol {\theta }_{T}^{G}\right)\!. \end{equation} \)
    (10)
    Note that if no class exceeds its global threshold, the label will remain as it is.
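For illustration, the correction step of Equation (10) and the server-side convergence test described above can be sketched as follows; the round budget and function names are our assumptions, not taken from the released code:

```python
import numpy as np

def correct_label(prob, given_label, c_global):
    """Eq. (10): at the convergence round T, correct to the best class that
    exceeds its global threshold; otherwise keep the given label."""
    passing = np.where(prob > c_global)[0]
    if passing.size > 0:
        return int(passing[np.argmax(prob[passing])])
    return int(given_label)

def should_correct(avg_val_losses, max_rounds=200):
    """Server-side trigger: correct once the participant-averaged validation
    loss stops decreasing, or when the round budget is exhausted."""
    if len(avg_val_losses) >= max_rounds:
        return True
    return len(avg_val_losses) >= 2 and avg_val_losses[-1] >= avg_val_losses[-2]
```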

    3.4 Updating Model Parameters

For a federated learning task, let \( {\mathcal {T}}_t={{\mathcal {D}}}\backslash {\mathcal {H}}_t \) denote the set of samples retained before the start of round \( t \) of training. This set is identified by finding the subset \( \mathcal {T}_{t}^{k} \subseteq \mathcal {D}^{k} \) on each participant. In the first round, the train set \( \mathcal {T}_{t}^{k} \) defaults to \( \mathcal {D}^{k} \) . The overall loss of retained data at participant \( k \) is represented as:
    \( \begin{equation} L_{t}^{k}(\boldsymbol {\theta })=\frac{1}{\left|\mathcal {T}_{t}^{k}\right|} \sum _{(x, y) \in \mathcal {T}_{t}^{k}} l(f(x, \boldsymbol {\theta }), y), \end{equation} \)
    (11)
    where \( |\cdot | \) denotes the cardinality of the set, based on which the global loss across all participants is represented as:
    \( \begin{equation} L_{t}^{G}(\boldsymbol {\theta })=\sum _{k=1}^{N} \frac{\left|\mathcal {T}_{t}^{k}\right| L_{t}^{k}(\boldsymbol {\theta })}{\sum _{k=1}^{N}\left|\mathcal {T}_{t}^{k}\right|}. \end{equation} \)
    (12)
    The objective of federated learning on the retained data subset \( \mathcal {T} \) is to find the model parameters \( \hat{\boldsymbol {\theta }} \) that minimizes \( L^{G}(\boldsymbol {\theta }) \) :
    \( \begin{equation} \hat{\boldsymbol {\theta }}^{G}=\underset{\boldsymbol {\theta }}{\arg \min } L^{G}(\boldsymbol {\theta }). \end{equation} \)
    (13)
    The minimization process described in Equation (13) is solved in a distributed manner using a standard federated learning procedure that includes the following steps:
    (1)
Each participant \( k \) performs stochastic gradient descent (SGD) in parallel on its local model parameters \( \boldsymbol {\theta }^{k} \) :
\( \begin{equation} \boldsymbol {\theta }_{t^{\prime }}^{k}=\boldsymbol {\theta }_{t^{\prime }-1}^{k}-\eta \nabla L_{t}^{k}\left(\boldsymbol {\theta }_{t^{\prime }-1}^{k}\right), \end{equation} \)
(14)
where \( t^{\prime } \) denotes the index of the mini-batch, and \( \nabla L_{t}^{k}(\boldsymbol {\theta }_{t^{\prime }-1}^{k}) \) is the stochastic gradient of the loss computed on a mini-batch of data randomly sampled from \( \mathcal {T}_{t}^{k} \) .
    (2)
    Each participant \( k \) sends its new parameter \( \boldsymbol {\theta }_{t}^{k} \) to the server.
    (3)
    The server aggregates the parameters received from each participant, according to
\( \begin{equation} \boldsymbol {\theta }_{t}^{G}=\sum _{k=1}^{N} \frac{\left|\mathcal {T}_{t}^{k}\right| \boldsymbol {\theta }_{t}^{k}}{\sum _{k=1}^{N}\left|\mathcal {T}_{t}^{k}\right|}. \end{equation} \)
(15)
    (4)
    The server sends the global model parameters \( \boldsymbol {\theta }_{t}^{G} \) computed in Equation (15) to each participant.
    After receiving the global model parameters, each participant \( k \) uses \( \boldsymbol {\theta }_{t}^{G} \) to compute the local thresholds.
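A compact sketch of the local SGD update (Equation (14)) and the server-side aggregation (Equation (15)), treating model parameters as flat NumPy arrays; `grad_fn` is a hypothetical stand-in for the gradient of the local loss:

```python
import numpy as np

def local_sgd(theta, grad_fn, batches, lr=0.01):
    """One participant's local update (Eq. (14)): plain SGD over mini-batches
    drawn from its retained set T_t^k; grad_fn(theta, batch) returns the
    stochastic gradient of the local loss."""
    theta = theta.copy()
    for batch in batches:
        theta = theta - lr * grad_fn(theta, batch)
    return theta

def aggregate(local_params, retained_sizes):
    """Server-side aggregation (Eq. (15)): average the participants' parameters
    weighted by the sizes of their retained training sets."""
    weights = np.asarray(retained_sizes, dtype=float)
    weights = weights / weights.sum()
    return sum(w * p for w, p in zip(weights, local_params))
```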

    4 Experiments and Analysis

    4.1 Datasets and Preprocessing

We evaluate CLC on four public datasets, as listed in Table 2. MNIST [15] is a popular dataset of handwritten digits, frequently used in the literature of learning with noisy labels [26]. USC-HAD [31] is a widespread human activity dataset in the area of ubiquitous computing, which has recently attracted the attention of federated learning research [8]. CIFAR-10 [32] consists of 60,000 colour images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. It is a well-recognized dataset in noise learning, often used to verify the generalization of methods robust to label noise [19]. Clothing1M [28] is a large clothing dataset comprising 1M images with real noisy labels. In recent years, many settings based on real-world label noise have been evaluated on Clothing1M [16].
    Table 2.
| Dataset | Train # | Test # | Class # | Sample Size |
| --- | --- | --- | --- | --- |
| MNIST | 3000 | 500 | 10 | 28 \( \times \) 28 |
| USC-HAD | 2400 | 400 | 12 | 300 \( \times \) 6 |
| Clothing1M | 37497 | 5000 | 14 | 224 \( \times \) 224 |
| CIFAR-10 | 50000 | 10000 | 10 | 32 \( \times \) 32 |
    Table 2. Datasets
For MNIST and USC-HAD, we partition the data into four participants randomly as a simulation of federated learning scenarios. Especially in USC-HAD, a participant's data is extracted from a specific subject's sensor data to better simulate privacy issues in federated learning. As these two datasets are clean and free of noisy labels, we follow the common setting in the literature and add synthetic label noise to the training sets [10]. For CIFAR-10, we randomly divided the training set into 20 parts to verify CLC's insensitivity to the number of participants: 19 of them were distributed to the participants, and one was used as the benchmark. Note that the participating user settings always meet the requirements of cross-silo federated learning (e.g., clients are medical or financial organizations), that is, no fewer than 2 and no more than 100 participants [14].
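For illustration, the random partition used in these simulations can be sketched as follows (a simple uniform split; the seed and function name are ours):

```python
import numpy as np

def partition_indices(num_samples, num_shards, seed=0):
    """Randomly split sample indices into roughly equal shards, one per
    participant (plus an optional benchmark shard)."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(num_samples), num_shards)

# e.g., the CIFAR-10 training set split into 20 shards:
# 19 participants and 1 shard reserved as the benchmark
shards = partition_indices(50000, 20)
```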
Following prior works [19], [23], [9], we verify CLC's performance on the commonly used asymmetric label noise, where the labels of error-free, clean data are randomly flipped, for its resemblance to real-world noise. Specifically, the noisy labels are added according to a randomly generated noise transition matrix \( \boldsymbol {Q}_{y \mid y^{*}} \) presented in Figure 5. We generate noisy data from clean data by randomly switching some labels of training examples to different classes non-uniformly according to \( \boldsymbol {Q}_{y \mid y^{*}} \) . The noise rate is controlled via the trace of the matrix, i.e., the higher the noise rate, the smaller the trace.
    Fig. 5.
    Fig. 5. The randomly generated noise transition matrix on synthetic dataset.
    The matrix is generated with different traces to run experiments for different noise levels. To make the synthetic process more explicit, we take CIFAR-10 as an example. For the noise rate of \( \gamma \) , we can derive the trace as follows:
    \( \begin{equation} trace=(1-\gamma)*10. \end{equation} \)
    (16)
    Obviously, we can conclude that the following formula is also valid:
    \( \begin{equation} trace*\frac{1}{10}*N=(1-\gamma)*N. \end{equation} \)
    (17)
    Both sides of the equation represent the amount of clean data, and \( N \) denotes the scale of the whole dataset. During the random generation, the element \( q_{i,j} \) of \( \boldsymbol {Q}_{y \mid y^{*}} \) needs to meet the following constraints:
    \( \begin{equation} \left\lbrace \!\!\begin{array}{l} \sum _{i=1}^{m} q_{i, i}=\text{ trace } \\ \sum _{j=1}^{m} q_{i, j}=1. \end{array}\right. \end{equation} \)
    (18)
In this way, the generated noise transition matrix \( \boldsymbol {Q}_{y \mid y^{*}} \) can convert the raw label \( y^{*} \) to the noisy one \( y \) :
\( \begin{equation} y(\boldsymbol {Q}_{y \mid y^{*}})=\operatorname{random\_choice}\left((1, \ldots , m), p=\left(q_{y^{*}, 1},\ldots ,q_{y^{*}, m}\right)\right)\!. \end{equation} \)
    (19)
    Here, the element \( q_{i,j} \) of \( \boldsymbol {Q}_{y \mid y^{*}} \) represents the probability of the raw class \( i \) becoming the noisy class \( j \) . The implementation of noise generation has been posted in [19], where the interface can be called directly to generate the required noise matrix.
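A simplified sketch of the noise-injection procedure in Equations (16)-(19): it fixes every diagonal entry of \( \boldsymbol {Q}_{y \mid y^{*}} \) to \( 1-\gamma \) (which satisfies the trace constraint) rather than drawing the diagonal randomly, uses 0-indexed classes, and samples each noisy label from the row of its true class; the released implementation in [19] may differ.

```python
import numpy as np

def make_transition_matrix(num_classes, noise_rate, seed=0):
    """Build a row-stochastic Q with trace (1 - noise_rate) * m (Eqs. (16)-(18)).
    Simplification: every diagonal entry is fixed to 1 - noise_rate and the
    remaining row mass is spread randomly over the off-diagonal entries."""
    rng = np.random.default_rng(seed)
    q = np.zeros((num_classes, num_classes))
    for i in range(num_classes):
        off = rng.random(num_classes - 1)
        off = noise_rate * off / off.sum()        # off-diagonal mass equals noise_rate
        q[i, np.arange(num_classes) != i] = off
        q[i, i] = 1.0 - noise_rate                # diagonal entries sum to the trace
    return q

def corrupt_labels(clean_labels, q, seed=0):
    """Eq. (19): draw each noisy label from the row of Q indexed by the true class."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(q), p=q[y]) for y in clean_labels])

q = make_transition_matrix(10, 0.4)               # e.g., CIFAR-10 at noise rate 0.4, trace = 6
noisy = corrupt_labels(np.array([0, 3, 7]), q)
```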
    From the left to right of the heatmap shown in Figure 5, it is apparent that the color on the diagonal line gradually becomes lighter, while the color on the non-diagonal line gradually darkens.
For Clothing1M, there are 37,497 clothing images labeled with both a noisy and a clean class. Here, the 'noisy class' refers to the label extracted from the image's surrounding text, and the 'clean class' refers to the manually annotated label. We partition these data into four participants randomly, and the amount of real noisy data ('clean class' not equal to 'noisy class') accounts for around 38% of the total data. In addition, we generate a benchmark from the set in which all data only has a clean label, to provide a benchmark-model option.

    4.2 Configurations

    For experiments on MNIST, we use the network CNN-MNIST as previous works do [2]. By common practice, we use a batch size of 10, a learning rate of 0.01, and update \( \boldsymbol {\theta } \) using SGD with a momentum of 0.5. We use a margin threshold \( \tau \) of 0.1 for the label estimation. In the ablation study, we will show the effect of this hyperparameter.
For experiments on USC-HAD, the input of 300*6 sensor data differs from the shape of 32*32 images. Inspired by prior works [29], [18], we adopt a one-dimensional CNN for this Human Activity Recognition (HAR) task. Due to the weak correlation between the axes, a 10*1 convolution kernel is used in the CNN architecture to learn the temporal features of the sensor data. The successful application of CNNs to HAR is due to their capability to apply convolutions across 1-D temporal sequences so as to capture local dependencies between nearby input samples.
We set the learning rate to 0.001 and use a batch size of 10. The SGD parameters and the margin threshold are set the same as for MNIST. For experiments on CIFAR-10, we adopt the 18-layer ResNet, using stochastic gradient descent with a momentum of 0.9, weight decay of 0.0005, and batch size of 64. Here, the hyperparameter \( \tau \) is also set to 0.1. For all experiments, the number of local epochs is 20, and the number of global epochs is 200.
For experiments on Clothing1M, we follow previous work [20] and use the ResNet-50 pre-trained on ImageNet. We use a batch size of 64, a learning rate of \( 8\times 10^{-4} \) , and update \( \boldsymbol {\theta } \) using SGD with a momentum of 0.9 and a weight decay of \( 9\times 10^{-3} \) . The hyperparameter \( \tau \) is set to 0.1.

    4.3 Baselines and Our Methods

To the best of our knowledge, there are few works that correct noisy labels in federated learning, and few works that deal with noisy labels without requiring a task-related benchmark dataset. We compare CLC's performance against state-of-the-art approaches under two premises: with a benchmark model (BM) and without a benchmark model (No BM). Note that all these baselines, including traditional machine learning (ML) approaches, have been transferred to the context of federated learning:
    DT: Direct training (DT) is the most fundamental Federated learning in which the model is directly trained on the original dataset with noisy labels.
    \( C_{y, y^{*}} \) : \( C_{y, y^{*}} \) estimates the label transmission matrix to determine noisy labels by comparing the correct labels and noisy ones [19].
    INCV: INCV finds clean data with multiple iterations of cross-validation, then trains on the clean set [6].
SL: Symmetric Learning boosts cross-entropy symmetrically with a noise-robust term, Reverse Cross Entropy (RCE) [27]. Instead of blindly fitting the noisy labels, it also fits the prediction distribution, thereby alleviating the tendency of the network to overfit the noisy labels.
DS: Data Selection (DS) is a method for distributively selecting relevant data. A benchmark model is used to evaluate the relevance between the benchmark dataset and participants' datasets, selecting the data with sufficiently high relevance. By the principle of this method, it cannot be implemented without a benchmark dataset [26].
    FedDyn: FedDyn is a novel federated learning method, where the server orchestrates cooperation between participants based on dynamic regularization in each round [1].
CLC: our proposed approach, which aggregates both model parameters and class-wise thresholds to correct noisy labels.
From Table 3, it can be seen that, so far, few works claim that they can address label noise without a benchmark model. Besides, the outstanding FL approach DS is implemented by exchanging sample-wise intermediate parameters, exposing a lot of potentially sensitive information. Our method CLC only exchanges class-wise information to preserve data privacy. Moreover, CLC deploys a consensus-based label correction approach in FL to utilize the cooperative nature of FL and alleviate knowledge loss. In addition, CLC does not require complex hyperparameters; in general, a margin threshold of 0.1 can handle most cases.
    Table 3.
| Method | BM free | Intermediate Parameters | Detection | Correction | Hyperparameters |
| --- | --- | --- | --- | --- | --- |
| \( \boldsymbol {C_{y, y^{*}}} \) |  |  |  |  |  |
| SCE-loss |  |  |  |  | \( \alpha \) , \( \beta \) , A |
| INCV |  |  |  |  | Iterations#, epochs#, r |
| DS |  | Sample-wise |  |  |  |
| CLC |  | Class-wise |  |  | \( \tau \) |
    Table 3. Comparison of the Characteristics of Baselines

    4.4 Experiment Results

    We report the performance of CLC from the following two aspects:
Classification: We compare the final classifiers' test accuracy to present the classification performance under synthetic-noise and real-world-noise settings, respectively. As can be seen from Table 4, CLC significantly improves the test accuracy on the three public datasets containing synthetic noise. With noise levels increasing from 0.3 to 0.5, CLC's classification performance shows its noise-robustness compared to the other baselines. For example, from the data on MNIST (No BM), it can be seen that under a noise rate of 0.5, the test accuracy of CLC is \( 33\% \) higher than that of Direct Training (DT) and \( 3.67\% \) higher than that of SCE loss (SL). What stands out in the table is that having no benchmark model does not have a significant impact on CLC. Figures 7, 8, and 9 further illustrate this advantage by showing the test accuracy on USC-HAD for each noise rate. These graphs show that other baselines like SCE loss (SL) and INCV fail under the No BM condition for methodological reasons. Note that DT's test accuracy with BM on USC-HAD is slightly lower than without BM. This kind of result may be explained by the fact that the benchmark dataset does not contain complicated information such as the real class distribution; once the model faces extensive examples, it suffers degradation. Another possible explanation is that the benchmark dataset is relatively small compared to all the data, so the benchmark is not enough to boost a higher performance [11].
    Table 4.
| Dataset | Method | BM 0.3 | BM 0.4 | BM 0.5 | BM avg | No BM 0.3 | No BM 0.4 | No BM 0.5 | No BM avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MNIST | DT | 86.00 | 75.67 | 72.30 | 77.66 | 80.00 | 67.76 | 59.00 | 68.92 |
| MNIST | \( \boldsymbol {C_{y, y^{*}}} \) | 91.67 | 93.33 | 90.67 | 91.89 | 22.00 | 15.00 | 15.33 | 17.44 |
| MNIST | SL | 92.67 | 91.33 | 90.33 | 91.44 | 91.00 | 90.33 | 88.33 | 89.89 |
| MNIST | INCV | 82.00 | 78.33 | 70.33 | 76.89 | 68.00 | 62.67 | 56.67 | 62.45 |
| MNIST | DS | 91.00 | 89.67 | 90.00 | 90.22 | - | - | - | - |
| MNIST | FedDyn | 86.66 | 81.51 | 61.37 | 76.51 | 87.16 | 82.62 | 59.48 | 76.42 |
| MNIST | CLC | 95.67 | 95.00 | 94.67 | 95.11 | 93.67 | 92.33 | 92.00 | 92.67 |
| USC-HAD | DT | 65.00 | 57.75 | 51.50 | 58.08 | 68.25 | 60.25 | 54.75 | 61.08 |
| USC-HAD | \( \boldsymbol {C_{y, y^{*}}} \) | 64.75 | 67.75 | 64.75 | 65.75 | 15.25 | 14.25 | 14.50 | 14.67 |
| USC-HAD | SL | 67.75 | 60.50 | 54.25 | 60.83 | 69.50 | 55.25 | 53.75 | 59.50 |
| USC-HAD | INCV | 63.75 | 55.75 | 52.75 | 57.41 | 59.00 | 45.75 | 39.50 | 48.08 |
| USC-HAD | DS | 72.00 | 69.75 | 71.00 | 70.92 | - | - | - | - |
| USC-HAD | FedDyn | 69.25 | 61.25 | 54.50 | 61.67 | 69.50 | 63.75 | 50.50 | 61.25 |
| USC-HAD | CLC | 75.25 | 70.75 | 71.00 | 72.33 | 76.25 | 70.75 | 65.00 | 70.67 |
| CIFAR-10 | DT | 79.48 | 69.54 | 56.44 | 68.49 | 75.22 | 63.34 | 51.35 | 63.30 |
| CIFAR-10 | \( \boldsymbol {C_{y, y^{*}}} \) | 79.43 | 76.19 | 74.50 | 76.71 | 23.44 | 23.38 | 15.75 | 20.85 |
| CIFAR-10 | SL | 87.89 | 82.29 | 72.75 | 80.79 | 70.71 | 61.26 | 36.95 | 56.31 |
| CIFAR-10 | INCV | 66.08 | 56.57 | 45.23 | 55.96 | 54.72 | 49.59 | 38.84 | 47.71 |
| CIFAR-10 | DS | 72.43 | 64.15 | 50.12 | 62.23 | - | - | - | - |
| CIFAR-10 | FedDyn | 47.26 | 45.31 | 40.15 | 44.24 | 47.25 | 44.92 | 38.97 | 43.71 |
| CIFAR-10 | CLC | 90.05 | 86.91 | 79.27 | 85.41 | 89.22 | 86.10 | 80.71 | 85.34 |
    Table 4. The Classification Performance on Synthetic Noise
The experiments on CIFAR-10 provide a different participating-user setting. With 19 participants, CLC shows its noise-robustness compared to the other baselines. On average, the performance of CLC with the BM is improved by \( 4.62\% \) compared to the second place (SL), and without the BM it is improved by \( 22.04\% \) compared to the second place (DT). In addition, Figure 6 shows the detailed training process of CLC and FedDyn. Averaged over all noise settings, the performance is improved by 0.42 and 0.41 over FedDyn with and without a benchmark model, respectively. It can be seen that FedDyn is not very robust to data containing noisy labels, though it can achieve good performance on unbalanced and non-IID data. However, a comparison with a non-IID-robust method may not be fair; to this end, we mainly focus on noise-robust methods in our experiments. In Table 5, the efficacy of CLC on real-world noisy labels is demonstrated on the Clothing1M dataset. The proposed CLC achieves better performance than state-of-the-art methods. The test accuracy of CLC without the BM achieves an improvement of \( +2.10\% \) over the best baseline method, SL.
Fig. 6. FedDyn vs. CLC on CIFAR-10 with various noise rates.
Fig. 7. The test accuracy on USC-HAD with noise rate 0.3.
Fig. 8. The test accuracy on USC-HAD with noise rate 0.4.
Fig. 9. The test accuracy on USC-HAD with noise rate 0.5.
Fig. 10. Comparison of different versions of CLC.
    Table 5.
|  | DT | \( \boldsymbol {C_{y, y^{*}}} \) | SL | INCV | DS | FedDyn | CLC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Benchmark Model | 72.40 | 28.20 | 73.00 | 69.20 | 73.30 | 39.35 | 74.10 |
| No Benchmark Model | 68.60 | 25.60 | 70.10 | 67.00 | - | 36.14 | 72.20 |
    Table 5. The Classification Performance on Real Noise (Clothing1M)
Noise estimation: Since CLC aims to correct noisy labels, we verify three critical indexes of label noise estimation (recall, precision, F1-score). The recall is computed as the number of correctly detected/corrected noisy labels over the total number of noisy labels. The precision is computed as the number of correctly detected/corrected noisy labels over the total number of labels that have been corrected. The F1-score is reported as the average of precision and recall. Table 6 shows the noise detection performance of CLC on the MNIST dataset, the CIFAR-10 dataset (synthetic noise), and Clothing1M (real noise), compared with \( C_{y, y^{*}} \) 's and DS's performance. These two reference methods represent the existing traditional methods and FL methods on label noise, respectively. It can be observed that CLC performs well regardless of whether there is a benchmark model, while \( C_{y, y^{*}} \) and DS both show a decline when there is no benchmark model. The detection performance of CLC on both MNIST and CIFAR-10 reaches an F1-score of around 90%. We also note that CLC's performance does not decrease with increasing noise rate, proving that CLC can detect and correct noise even at a high noise level. Interestingly, the F1-scores of all three methods become better as the noise level rises, which can be interpreted as 'a high noise level makes the difference between the clean and noise distributions more obvious'. After detecting the noise, CLC also needs to correct the noisy labels for the participants to produce a clean dataset. It is worth mentioning that label correction is challenging in a multi-classification task: the noise must be detected and then corrected to the right class. From the data in Table 7, it is apparent that CLC's ability to correct labels is significant. The correction performance of CLC on MNIST and CIFAR-10 reaches an F1-score of around 90% and 80%, respectively, in either case. Note that the noise estimation task conducted on CIFAR-10 involved 19 participants, which also highlights the stability of CLC under an increasing number of participants. Furthermore, CLC manages to correct around 70% of the noisy labels on Clothing1M, as shown in Table 7, which demonstrates the advantages of CLC over the state-of-the-art works.
    Table 6.
| Setting | Method | Metric | MNIST 0.3 | MNIST 0.4 | MNIST 0.5 | CIFAR-10 0.3 | CIFAR-10 0.4 | CIFAR-10 0.5 | Clothing1M (R) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM | \( \boldsymbol {C_{y, y^{*}}} \) | Precision | .816 | .851 | .885 | .512 | .618 | .710 | .369 |
| BM | \( \boldsymbol {C_{y, y^{*}}} \) | Recall | .933 | .956 | .974 | .925 | .940 | .950 | .890 |
| BM | \( \boldsymbol {C_{y, y^{*}}} \) | F1-score | .875 | .903 | .929 | .718 | .779 | .830 | .629 |
| BM | DS | Precision | .957 | .968 | .972 | .302 | .403 | .500 | .668 |
| BM | DS | Recall | .856 | .881 | .908 | .536 | .576 | .621 | .805 |
| BM | DS | F1-score | .906 | .924 | .940 | .419 | .489 | .560 | .736 |
| BM | CLC | Precision | .983 | .982 | .978 | .918 | .946 | .925 | .738 |
| BM | CLC | Recall | .898 | .917 | .950 | .865 | .879 | .883 | .701 |
| BM | CLC | F1-score | .941 | .950 | .964 | .892 | .912 | .904 | .739 |
| No BM | \( \boldsymbol {C_{y, y^{*}}} \) | Precision | .303 | .400 | .491 | .346 | .489 | .580 | .385 |
| No BM | \( \boldsymbol {C_{y, y^{*}}} \) | Recall | .933 | .932 | .918 | .838 | .911 | .922 | .534 |
| No BM | \( \boldsymbol {C_{y, y^{*}}} \) | F1-score | .618 | .666 | .705 | .592 | .700 | .751 | .460 |
| No BM | CLC | Precision | .983 | .988 | .989 | .917 | .945 | .924 | .777 |
| No BM | CLC | Recall | .872 | .871 | .888 | .865 | .859 | .882 | .643 |
| No BM | CLC | F1-score | .928 | .929 | .939 | .891 | .902 | .903 | .710 |
    Table 6. The Comparison of Detection Performance
    Table 7.
| Setting | Metric | MNIST 0.3 | MNIST 0.4 | MNIST 0.5 | CIFAR-10 0.3 | CIFAR-10 0.4 | CIFAR-10 0.5 | Clothing1M (R) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM | Precision | .964 | .958 | .945 | .864 | .904 | .900 | .702 |
| BM | Recall | .881 | .894 | .917 | .829 | .803 | .701 | .565 |
| BM | F1-score | .922 | .926 | .931 | .847 | .854 | .800 | .703 |
| No BM | Precision | .964 | .968 | .959 | .863 | .906 | .899 | .693 |
| No BM | Recall | .856 | .853 | .861 | .828 | .780 | .703 | .623 |
| No BM | F1-score | .910 | .910 | .910 | .846 | .842 | .801 | .658 |
    Table 7. The Correction Performance of CLC

    4.5 Ablation Study

Versions of CLC. We conducted several ablation experiments with noise rates of 0.3, 0.4, and 0.5 on CIFAR-10 to make the consensus mechanism more persuasive. The ablation study does not use the benchmark model, and the number of participants is set to 19. We analyzed CLC with different crucial parts, i.e., the iterative holdout (H), the label correction (C), and the class-wise thresholds aggregation part (agg). The experimental results are shown in Figure 10. Without the agg part, the labels are corrected based on the local thresholds only. Since CLC deals with label noise, it outperforms FedAvg under all settings. As the noise rate increases, the advantages of 'CLC: H+C+agg' become significant. The aggregated global thresholds, the product of the consensus, can better mine the noise distribution from a global perspective. Moreover, the label correction further augments the available training data. Overall, the H part helps filter the noisy labels, the C part reuses them, and the agg part helps learn a better distribution of noise and avoid negative corrections.
Hyperparameters. We conduct an ablation study to examine the effect of the hyperparameter \( \tau \) . \( \tau \) is the lower threshold of the margin \( m(x) \) used to further determine the size of the holdout dataset \( \mathcal {H} \) . We experiment with \( \tau =0, 0.1, 0.3, 0.5 \) . Figure 11 shows the performance of CLC using different \( \tau \) trained on USC-HAD with a noise rate of 0.4. Closer inspection of Figure 11 shows that the optimal \( \tau \) is generally 0.1. The corresponding history of margin distributions is presented in Figures 12, 13, 14, and 15, respectively. The margin distribution here reflects the margin of those data that are considered to contain label noise. Note that \( \tau =0 \) means that noise detection is conducted without the margin tool. As shown in Figure 12, during the period before noisy labels are corrected, although the amount of detected noise (true positives) gradually increases, a lot of clean data (false positives) is mixed in. Interestingly, a large amount of noisy data and a certain amount of clean data gradually move toward larger margin values. Our goal is to find as much noisy data as possible while minimizing the mixing of clean data. We study \( \tau \) from 0.1 to 0.5, taking a step of 0.2. Figure 13 illustrates that when \( \tau =0.1 \) , the number of false-positive samples declines sharply while the number of true positives stays relatively stable; note that CLC aims to revise noisy labels as much and as accurately as possible. Unfortunately, with \( \tau =0.3 \) , the numbers of true positives and false positives both drastically decrease. Furthermore, the margin threshold of 0.5 in Figure 15 performs even worse than 0.3. In summary, these results show that a margin threshold of 0.1 guarantees both recall and precision.
    Fig. 11.
    Fig. 11. Comparison of CLC using different \( \tau \) .
    Fig. 12.
    Fig. 12. The margin distribution with threshold 0 before label correction.
    Fig. 13.
    Fig. 13. The margin distribution with threshold 0.1 before label correction.
    Fig. 14.
    Fig. 14. The margin distribution with threshold 0.3 before label correction.
    Fig. 15.
    Fig. 15. The margin distribution with threshold 0.5 before label correction.

    5 Conclusions

Federated learning provides a distributed learning framework for 'joining' all participants to collaboratively learn a robust model. However, the model trained on each participant's data suffers from different levels of label noise. Considering the inaccessibility of participants' data, in this paper we propose a novel Consensus-based Label Correction approach in federated learning, namely CLC. The advantages of CLC are threefold: (1) CLC corrects noisy labels rather than dropping noisy samples along with their available attributes, alleviating the loss of knowledge in federated learning. (2) CLC fully utilizes federated learning's cooperative nature so that it does not require a task-related benchmark dataset, adapting to diverse scenarios in industry. (3) CLC exchanges intermediate parameters in a non-sensitive form, satisfying the privacy policy in federated learning. To evaluate the accuracy and precision/recall of CLC, we conduct extensive experiments on public datasets with synthetic and real-world noise settings, respectively. The results prove the superiority of CLC in various cases.

