
Meta-learning Advisor Networks for Long-tail and Noisy Labels in Social Image Classification

Published: 07 June 2023

Abstract

Deep neural networks (DNNs) for social image classification are prone to performance reduction and overfitting when trained on datasets plagued by noisy or imbalanced labels. Loss weighting methods tend to ignore the influence of noisy or frequent-category examples during training, resulting in a reduction of final accuracy and, in the presence of extreme noise, even a failure of the learning process. A new advisor network is introduced to address both the imbalance and noise problems; it is able to pilot the learning of a main network by adjusting its visual features and its gradient with a meta-learning strategy. In a curriculum learning fashion, the impact of redundant data is reduced, while recognizable noisy label images are downplayed or redirected. Meta Feature Re-Weighting (MFRW) and Meta Equalization Softmax (MES) methods are introduced to let the main network focus only on the information in an image deemed relevant by the advisor network and to adjust the training gradient so as to reduce the adverse effects of frequent or noisy categories. The proposed method is first tested on synthetic versions of CIFAR10 and CIFAR100, and then on the more realistic ImageNet-LT, Places-LT, and Clothing1M datasets, reporting state-of-the-art results.

1 Introduction

Training deep neural networks on large numbers of labeled images is critical for social multimedia retrieval [30]. Several recent applications depend on correctly retrieving many different concepts in images, such as micro-video recommendation [29] or predicting image popularity [37]. Covering many concepts is very challenging, especially for rare ones, since a large number of images is required for training classifiers. Therefore, automatic methods for label generation have recently been investigated by researchers, for instance in the form of semi-supervised learning. These methods exploit images labeled by non-experts, as is typical of multimedia sources (e.g., social networks, textual descriptions of products, video captions, etc.), or even unlabeled ones that are available in very large quantities at no cost. Due to their nature, these data contain mislabeled or imbalanced samples [1], which can follow a long-tail distribution [25]. The great adaptability of deep neural networks, provided by their large number of parameters, leads to highly discriminative models if the training data are balanced and correctly labeled. When this assumption does not hold and the data are imbalanced or their annotations are noisy, the result is reduced performance and possible overfitting [9, 19]. Recent methods have attempted to address the label noise problem by measuring network confidence during training through curriculum learning [26], employing another co-trained network [14], or directly estimating the noise in the set [16]. The general idea for dealing with this dataset problem is to find samples outside the correct distribution and reduce their impact on training. The imbalanced (long-tail) distribution of concepts, instead, is addressed through feature augmentation techniques [7], but above all with the design of ad-hoc loss functions for this type of problem [9, 43, 46]. In contrast, a meta-learning approach is proposed here to address the long-tailed and noisy label problems; it is based on an advisor network trained to help the main classifier model (Figure 1) perform better at the image classification task. During the training of a standard classification model, the advisor network adjusts the feature activations and gradients of the main model by observing its feature activations and training loss. At test time, the advisor is discarded, keeping only the main network as the final model. Compared to the teacher-learner paradigm, our advisor network is trained to help another model instead of being trained to do image classification. Our contributions are:
Fig. 1. General overview of our system. An advisor network assists an image classifier by exploiting an auxiliary meta-set to reduce noise and imbalance problems in the annotations of the training set.
Following the principles of our advisor network, we present a new meta-model to solve concurrently both the imbalance and label noise problems for the image classification task.
We introduce a Meta Feature Re-Weighting (MFRW) method that automatically generates an attention mask on the visual features of the classifier so that it focuses only on the information in an image deemed relevant by the advisor network.
The Meta Equalization Softmax (MES) activation function has been formulated to automatically adjust the classifier's gradients so that its learning is not adversely affected by images belonging to frequent categories or with noisy labels.
The effectiveness of our method is shown numerically and qualitatively through experiments conducted on synthetic (long-tailed and noisy label corruption) and real-world datasets. We achieve state-of-the-art results on Clothing1M.
The code will be released upon the acceptance of this paper.

2 Related Works

Noisy training labels
In the literature, the problem of noisy labels in training data is well studied, because machine learning systems are prone to performance degradation when noise is present in the training labels [38, 41]. Loss correction is a well-studied technique for mitigating the effect of noisy samples on the classifier network. Works like Reed [42], F-correction [40], GLC [16], M-correction [2], and S-adaptation [13] adjusted the loss based on an estimated corruption probability matrix, changing the wrong labels to the correct ones. In [35, 47, 54] the noise distribution was modeled by linearly combining the noisy label with the output of the network. Different approaches assigned a weight to each example, limiting the contribution of a noisy sample to the training by giving it a lower weight value. MentorNet [20] and MentorMix [19] found the latent weights through data-driven curriculum learning. Some works used augmentation strategies that encourage the main model to behave linearly between training examples, like Mixup [56] and AdvAug [6]. DivideMix [27] dynamically separated the training data into clean and noisy sets to optimize two diverged networks with a semi-supervised strategy. In contrast, our method takes advantage of an advisor network that alters the activations and gradients of the main classifier and can increase its performance without isolating noisy label samples from the clean ones.
Imbalanced training labels
Training on imbalanced (or long-tailed) datasets is an active research field in computer vision [3, 7, 18, 43, 46, 57, 59]. A common solution in the literature is re-sampling. While [4, 15, 36] sampled more training data from the minority classes (over-sampling) to balance the distribution of all classes, [12] removed data from frequent classes (under-sampling) to make the data distribution more uniform. Under-sampling is infeasible in extremely long-tailed datasets, where the imbalance ratio between the head and the tail class is high, because most of the examples would be excluded from training. Another solution is re-weighting, where a weight is assigned to each training sample according to its importance. [17, 49] used the inverse of class frequency to determine the weight value. Re-weighting can also be done at the sample level. In [31] a modulating loss factor was introduced to make the neural network cost-sensitive, reducing the loss contribution from easy examples. Instead, [3, 9, 23, 57] manipulated the loss based on the category distribution. In [43] an unbiased softmax function was derived to explicitly model the class distribution shift and minimize the generalization error bound. [46] introduced a new loss that avoids discouraging gradients for the rare classes. [7, 55] exploited feature augmentation methods to transfer the feature variance of common classes to the rare ones. The solution proposed by [33] adopted a memory module to augment the rare categories with semantic feature representations obtained from common ones. Instead, our method exploits a new meta-attention layer to direct the classifier's attention much more to the rare categories, while still not forgetting the common ones. Moreover, our advisor network automatically modifies the classifier's gradients to avoid the negative impact of the common classes on the rare ones. Self-supervised learning approaches [22] handle severe class imbalance effectively, and it has been shown that self-learned representations are also robust to label noise if adjusted with an imbalance- and noise-resistant loss function.
Meta-learning
Meta-learning has been used to assist the training and optimization of learning models. The noisy label problem was addressed in [28, 44, 50] with this approach. For example, L2R [44] weighed each sample, giving less importance to the noisy ones. MLNT [28] imitated regular training with fabricated noisy labels. MLC [50] estimated the corruption probability matrix to adjust the training loss values. Meta-learning was also applied to the long-tailed classification task. In [18, 28] the meta-model learned to assign higher weights to the examples of the rare classes. MW-Net [45] automatically determines an explicit weighting function that can easily fit different types of tasks and works on both the noisy and imbalanced training problems. GDW [5] introduced class-level meta-weights for several gradient flows and adjusted them to make better use of class-level information. All these methods take advantage of a small clean validation dataset to apply the meta-learning scheme. Differently from them, our advisor network modifies the network activations using a meta-attention layer and simultaneously learns to weigh training gradients to increase the performance of the main classifier during its training. Our method addresses both imbalanced and noisy training concurrently.

3 Method

3.1 Task

We developed a new advisor network that helps a deep neural network (DNN) address both the noisy label and long-tail image classification problems. Our method is composed of two main parts that can work jointly: Meta Feature Re-Weighting (MFRW) and Meta Equalization Softmax (MES). The first component, MFRW, makes use of an auxiliary advisor network that automatically learns how to weigh the features extracted from a DNN during its training. Our idea is to exploit an attention mechanism to enhance the useful parts of the visual information and attenuate the rest. If a network can concentrate only on the convenient parts of an image, that information can contribute to increasing the overall generalization capacity of the network even if the annotation is wrong. This also holds for the long-tailed distribution of data, where information from common classes can be leveraged to improve performance on the rare ones. In the second component, MES, the advisor network automatically learns how to reduce the discouraging gradients of some images with respect to others. In long-tailed distributions, the discouraging gradients of frequent-category samples can worsen the learning of the rare ones [46]. This can happen even with noisy labels, because the discouraging gradients of an example with a wrong label can affect the correct learning of the entire classifier. These two methods can be used simultaneously to help the learning of an image classifier, reducing the negative effect that imbalanced or noisy annotations in the training dataset can produce. Our advisor network is trained with the meta-learning paradigm, so it knows the current state of the classifier and learns how to help it at that moment.
We first introduce a basic meta-learning formulation for methods that learn robust deep neural networks from noisy and long-tailed category distributions. We then present each part of our method in detail: Meta Feature Re-Weighting is specified in Section 3.3 and Meta Equalization Softmax in Section 3.4. Finally, the joint learning process of the classifier and the advisor network is described in Section 3.6.

3.2 Background Meta-learning

In general, meta-learning (ML) refers to the process of improving a learning algorithm over multiple learning episodes; it is also called learning to learn. ML is divided into two algorithms: an inner (or base) algorithm and an outer (or upper/meta) algorithm. The inner one solves the main task by minimizing an objective function; for image classification, these are a convolutional neural network and the cross-entropy loss, respectively. The outer algorithm updates the inner one such that it also improves on an outer objective function. When the objective functions are the same for both algorithms, the outer algorithm can help the inner one work well on a new data distribution. If the new distribution is a smaller version of the training data, but free of errors and balanced, it is possible to train the outer algorithm to solve the problem of noisy or imbalanced labels inside the main training data. We refer to this distribution as the meta-set. As in [45], the outer algorithm can be a multilayer perceptron, called the meta-model, that automatically learns how to address these problems, helping the main image classifier during its learning. We now introduce the notation needed to understand ML in this particular setting and how the entire learning process is divided, describing the algorithm of [45] for simplicity.
Let \(D^{train} = \lbrace x_i^{tra}, y_i^{tra}\rbrace ^N_{i=1}\) be the training set with noisy or imbalanced annotations, where \(N\) is the total number of samples, each composed of an image \(x_i\) and the corresponding one-hot label \(y_i\) over \(C\) classes. The main DNN model is defined as \(\Phi (\cdot ; w)\), where \(w\) are its parameters. The prediction on an input image \(x\) is \(\hat{y} = \Phi (x; w)\), and the optimal parameters \(w^*\) are obtained by minimizing the softmax cross-entropy loss \(\ell (\hat{y}, y)\) on the training set. Let \(D^{meta} = \lbrace x_j^{meta}, y_j^{meta}\rbrace ^M_{j=1}\) be the meta-set, a well-verified and balanced, but much smaller, version of the training set, \(M \ll N\). The meta-model is defined as \(\Psi (\cdot ;\theta)\), parameterized by \(\theta\). In [45] the optimal parameters \(w^*\) are derived using a loss weighted with a value predicted by the meta-model. The meta-model is trained by minimizing the softmax cross-entropy loss of the previously updated \(\Phi (\cdot ;w^*(\theta))\) on the meta dataset \(D^{meta}\).
Both \(\Phi (\cdot ; w)\) and \(\Psi (\cdot ;\theta)\) can be updated by alternating optimization through gradient descent. An online strategy, divided into three main steps, can be adopted to update \(\theta\) and \(w\) within a single optimization loop. This guarantees the efficiency of the algorithm and its convergence [45]. In the first step, called Virtual-Train, the original DNN is not updated; the optimization is carried out on a virtual model that is a clone of the original one. In the subsequent step, called Meta-Train, the meta-model is updated, taking into account the virtual update previously carried out. Actual-Train is the last step, where the base DNN model is optimized taking into account the already updated meta-model.
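To make these three steps concrete, below is a minimal sketch of the online strategy in the style of [45], written with PyTorch's torch.func. The tiny linear classifier, the per-sample weighting meta-net, the learning rates, and all names are illustrative assumptions, not released code:

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch.func import functional_call, grad

# Illustrative models: a tiny classifier (inner algorithm) and a meta-net
# (outer algorithm) mapping a per-sample loss to a weight in (0, 1), as in [45].
model = nn.Linear(64, 10)
meta_net = nn.Sequential(nn.Linear(1, 100), nn.ReLU(), nn.Linear(100, 1), nn.Sigmoid())
inner_lr, meta_lr = 0.1, 1e-4

def weighted_loss(w, theta, x, y):
    # Per-sample cross-entropy, re-weighted by the meta-net's output.
    losses = F.cross_entropy(functional_call(model, w, (x,)), y, reduction="none")
    v = functional_call(meta_net, theta, (losses.detach().unsqueeze(1),)).squeeze(1)
    return (v * losses).mean()

def meta_objective(theta, w, x, y, x_m, y_m):
    # Virtual-Train: one virtual SGD step of the classifier on the weighted loss.
    g = grad(weighted_loss)(w, theta, x, y)
    w_virtual = {k: w[k] - inner_lr * g[k] for k in w}
    # Plain cross-entropy of the virtual model on a batch from the meta-set.
    return F.cross_entropy(functional_call(model, w_virtual, (x_m,)), y_m)

def train_step(w, theta, x, y, x_m, y_m):
    # Meta-Train: update the meta-net by differentiating through the virtual step.
    g_theta = grad(meta_objective)(theta, w, x, y, x_m, y_m)
    theta = {k: theta[k] - meta_lr * g_theta[k] for k in theta}
    # Actual-Train: update the real classifier with the refreshed meta-net.
    g_w = grad(weighted_loss)(w, theta, x, y)
    w = {k: w[k] - inner_lr * g_w[k] for k in w}
    return w, theta

w = dict(model.named_parameters())
theta = dict(meta_net.named_parameters())
x, y = torch.randn(128, 64), torch.randint(0, 10, (128,))    # noisy training batch
x_m, y_m = torch.randn(32, 64), torch.randint(0, 10, (32,))  # clean meta batch
w, theta = train_step(w, theta, x, y, x_m, y_m)
```

Note how grad(meta_objective) differentiates through the virtual SGD step; this second-order term is what makes Meta-Train the expensive phase (see the overhead discussion at the end of Section 3.6).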

3.3 Meta Feature Re-Weighting (MFRW)

Human attention is the ability of the brain to selectively concentrate on one aspect of the environment while ignoring other information. Attention for a DNN is a mechanism that tries to mimic the cognitive attention of the human brain, calculating a soft (or hard) mask which is then multiplied with the visual features of the network. The mask \(W\) is usually the output of a function \(g\) of some input \(x\)
\begin{equation} W = g(x) \end{equation}
(1)
and \(W\) is element-wise multiplied with a feature \(f\) of the network
\begin{equation} f_{att} = W \odot f \end{equation}
(2)
where \(\odot\) is the symbol for element-wise multiplication.
This intensifies the important parts of the feature and attenuates the rest. We propose a meta-attention mechanism, called Meta Feature Re-Weighting (MFRW), that can be used to mitigate the noisy or imbalanced label problems in the training data. In the first case, a mismatch between the content of an image and its associated annotation can degrade the classifier's performance on that annotated class. With our method, the main network can use only parts of the erroneous visual information to improve performance on that class. In the case of imbalance, MFRW can instead assign smaller importance to the visual information of common classes than to that of the rare ones. Finding a handcrafted function \(g\) that generates the right masks for each of the two cases is challenging. We used a meta-model to automatically infer the correct \(g\). This gives two properties to \(g\): it can change during the training of the main network, and it can adapt automatically to the problem present in the training data. The element-wise product is computed between the feature \(f\) extracted from a DNN and a vector of weights \(W_f\) learned by a meta-model
\begin{equation} f_{att} = W_f \odot f \end{equation}
(3)
The meta-model can take into consideration important aspects of each training sample, so it can generate the proper activation weights based on them. Attention must be differentiated between the various categories, as each may have a different number of examples and noise level. This is done by giving the meta-model the visual features extracted from the classifier's backbone as input. In the case of imbalanced labels, examples of the more common categories, since they are presented many more times during training, are easier to learn than those of the rarer ones. In addition, mislabeled images have a different difficulty than cleanly labeled images, as they lie outside the correct distribution of each category, and they are usually fewer than the correctly labeled ones. The attention should be adjusted according to the difficulty that a training example represents for the classifier. This way, it can focus differently on the information in the data, learning a better representation. The loss value, typically used to express the difficulty of classification samples [26], is given to the meta-model in combination with the visual features.
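As a minimal sketch of where the re-weighting sits in the forward pass (module sizes are illustrative, and the advisor here simply concatenates its two inputs; the actual meta-model architecture is given in Section 3.5):

```python
import torch
import torch.nn.functional as F
from torch import nn

# Illustrative stand-ins for the backbone Phi_b, the head Phi_c, and the advisor Psi.
backbone, head = nn.Linear(32, 64), nn.Linear(64, 10)
advisor = nn.Sequential(nn.Linear(64 + 1, 100), nn.ReLU(),
                        nn.Linear(100, 64), nn.Sigmoid())

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
f = backbone(x)                                             # visual features (N, d)
loss_pre = F.cross_entropy(head(f), y, reduction="none")    # per-sample difficulty
w_f = advisor(torch.cat([f, loss_pre.detach().unsqueeze(1)], dim=1))  # mask in (0,1)
f_att = w_f * f                                             # Equation (3)
logits = head(f_att)                                        # classify attended feature
```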

3.4 Meta Equalization Softmax (MES)

The conventional loss function for image classification is the softmax cross-entropy. A multinomial distribution \(p\) over \(C\) categories is obtained from the network output scores \(z\) with the softmax activation function. Then the cross-entropy is calculated between \(p\) and the target distribution \(y\). The softmax cross-entropy loss can be formulated as:
\begin{equation} \mathcal{L}_{SCE} = - \sum _{j=1}^{C} y_j \log (p_j) \end{equation}
(4)
where the distribution \(p_j\) is described as follows:
\begin{equation} p_j = Softmax(z_j) = \frac{e^{z_j}}{\sum _{k=1}^{C}e^{z_k}} \end{equation}
(5)
When the distribution of categories in the training dataset is imbalanced, the softmax cross-entropy loss makes the learning of rare categories easily suppressed by the common ones. In [46] a softmax equalization loss is proposed to avoid discouraging gradients from samples of frequent categories harming the rare ones. The difference from the softmax cross-entropy is the weighting of a term within the softmax activation function. The new distribution \(p_j\) is calculated as:
\begin{equation} p_j = EQ_{Softmax}(z_j, \tilde{w}_k) = \frac{e^{z_j}}{\sum _{k=1}^{C} \tilde{w}_k e^{z_k}} \end{equation}
(6)
where
\begin{equation} \tilde{w}_k = 1 -\beta T_\lambda (f_k)(1-y_k) \end{equation}
(7)
The element \(T_\lambda (f_k)\) is a handcrafted threshold function that outputs a value in \(\lbrace 0,1\rbrace\) based on the category frequency \(f_k\). When the output of \(T_\lambda\) is 1 the gradient is ignored; otherwise it is taken into account. \(\beta\) is instead a Bernoulli random variable with probability \(\rho\) of being 1 and \(1 - \rho\) of being 0.
The strategy of avoiding discouraging gradients can be useful in problems other than imbalanced training, for example image classification with noisy labels. The discouraging gradients of a mislabeled image can be scaled so as not to harm the correct learning of the classifier model. This behavior can be obtained by modifying the weights \(\tilde{w}_k\) passed to the \(EQ_{Softmax}\) function in Equation (6). The element that determines each category's weight is \(T_\lambda (f_k)\), but it works only for the imbalanced training problem. Writing a new function for the noisy annotation problem is hard, because the noise can be very complex or completely unknown, for example when data is collected automatically [52].
Inspired by this, we propose a meta-learned equalization loss (MES) that can adapt the weights to the task that needs to be solved. The new formulation of the weights \(\tilde{w}_k\) passed to \(EQ_{Softmax}\) is:
\begin{equation} \tilde{w}_k = 1 -\beta s_k (1-y_k) \end{equation}
(8)
where \(s_k \in (0,1)\) are the components of a vector output by a meta-model trained to help the main model handle the noise and imbalance problems present in the training data. The visual feature and the loss value of each training sample are given as input to the meta-model. This allows the model to generate output vectors \(s_k\) that are differentiated between classes and between “easy” and “hard” examples.
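A minimal sketch of the weighted-denominator loss of Equation (6) with the meta-learned weights of Equation (8); here a random \(s\) stands in for the advisor's output, and the function and variable names are our assumptions:

```python
import torch
import torch.nn.functional as F

def eq_softmax_ce(z, y, w_tilde):
    """Equalization softmax cross-entropy, Equation (6).

    z: logits (N, C); y: integer labels (N,); w_tilde: per-class denominator
    weights (N, C), from Equation (7) (handcrafted) or Equation (8) (meta-learned).
    """
    z = z - z.max(dim=1, keepdim=True).values           # numerical stability
    num = z.gather(1, y.unsqueeze(1)).squeeze(1)        # z_j of the target class
    den = torch.log((w_tilde * z.exp()).sum(dim=1))     # log sum_k w_k * e^{z_k}
    return (den - num).mean()                           # -log of weighted softmax

# Building Equation (8) for a batch: the target class always keeps weight 1,
# since the (1 - y_k) factor zeroes the correction on the ground-truth position.
N, C = 128, 100
z, y = torch.randn(N, C), torch.randint(0, C, (N,))
s = torch.rand(N, C)                                    # advisor output s_k (illustrative)
beta = torch.bernoulli(torch.full((N, C), 0.9))         # Bernoulli with rho = 0.9
y_onehot = F.one_hot(y, C).float()
w_tilde = 1.0 - beta * s * (1.0 - y_onehot)
loss = eq_softmax_ce(z, y, w_tilde)
```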

3.5 Meta-model Architecture

Since both MFRW and MES require the same input data, it was possible to combine the two methods in a single meta-model. Our meta-model \(\Psi\) is a neural network composed only of fully connected layers. The inputs are a feature \(f\) and a loss value \(\mathcal {L}_x\). Each input is projected into a fixed-size embedding space through a separate fully connected layer followed by a ReLU function. These embeddings are concatenated to form a larger common space, whose size is the sum of the dimensions of the previous embeddings. The MFRW method requires a weight vector \(W_f\) with values in \((0,1)\) and the same size as the input feature \(f\). MES instead needs a weight vector with values in \((0,1)\) but with length equal to the number of classes \(C\). From the last embedding space, each output is obtained through a fully connected layer followed by a sigmoid activation function. In this way, MFRW and MES can learn a common internal representation from the inputs, obtaining their respective benefits.
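The following is a sketch of this architecture; it is our reading of the description above, and the layout of the actual implementation may differ:

```python
import torch
from torch import nn

class Advisor(nn.Module):
    """Meta-model Psi: embeds a feature f and a loss value into a common
    space, then emits the MFRW mask W_f and the MES weights s_k."""

    def __init__(self, feat_dim, num_classes, emb_dim=100):
        super().__init__()
        self.embed_f = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())
        self.embed_l = nn.Sequential(nn.Linear(1, emb_dim), nn.ReLU())
        # One sigmoid head per output, both reading the concatenated embedding.
        self.head_wf = nn.Sequential(nn.Linear(2 * emb_dim, feat_dim), nn.Sigmoid())
        self.head_sk = nn.Sequential(nn.Linear(2 * emb_dim, num_classes), nn.Sigmoid())

    def forward(self, f, loss_value):
        h = torch.cat([self.embed_f(f), self.embed_l(loss_value)], dim=1)
        return self.head_wf(h), self.head_sk(h)

advisor = Advisor(feat_dim=512, num_classes=10)
w_f, s_k = advisor(torch.randn(4, 512), torch.rand(4, 1))   # example shapes
```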

3.6 Algorithm

In this section, we describe how the base classifier \(\Phi\) and our meta-model \(\Psi\) are trained jointly. Because the meta-model needs the visual feature as input, we separate the main model \(\Phi (\cdot ; w)\) into two parts: the backbone \(\Phi _b(\cdot ; w_b)\) and the category predictor \(\Phi _c(\cdot ; w_c)\). The first takes an image \(x\) as input and outputs a feature vector \(f\). The second takes \(f\) as input and outputs a score vector \(z\). In this way, it is possible to manipulate the feature \(f\) directly with our meta-model \(\Psi\). The meta-model takes two inputs, \(\Psi (f,\mathcal {L})\), and gives back two vectors of weights, \(W_f\) and \(s_k\). Our algorithm is divided into four main phases, shown in Figure 2 and summarized in Algorithm 1. We describe our method in detail starting from the \(t\)-th iteration and moving forward through each step until we reach the \((t+1)\)-th. Unlike the meta-learning optimization strategy described in Section 3.2, we need an additional initial phase, called Loss Pre-Calculation (Figure 2(a)). The loss value \(\mathcal {L}^{pre}\) for the training batch \(X^{train}\) must be calculated at the beginning. This loss value must depend on the original feature \(f^{train}\) and not on the weighted one \(f^{att}\). In the second step, Virtual-Train (Figure 2(b)), \(\Phi _b^t\) and \(\Phi _c^t\) are the virtual clones of the backbone \(\Phi _b(\cdot ; w_b)\) and the category predictor \(\Phi _c(\cdot ; w_c)\) at the beginning of the \(t\)-th iteration. We obtain the features \(f^{train}\) by passing the batch \(X^{train}\) through \(\Phi _b^t\). Then the pre-calculated loss values \(\mathcal {L}^{pre}\) and the corresponding features \(f^{train}\) are given to \(\Psi ^t\) (the meta-model at time \(t\)) to obtain the two vectors of weights \(W_f\) and \(s_k\). The feature \(f^{train}\) is multiplied element-wise with \(W_f\) to get a new feature vector with attention, \(f^{att}\), as in Section 3.3. The modified feature is passed to the predictor \(\Phi _c^t\), obtaining the scores \(z^{train}\). We then calculate \(\mathcal {L}^{train}\) with the equalization loss described in Section 3.4, using the vector \(s_k\) in Equation (8). Then the parameters of \(\Phi _b^t\) and \(\Phi _c^t\) are virtually updated to minimize \(\mathcal {L}^{train}\), excluding those of \(\Psi ^t\). For the third step, Meta-Train (Figure 2(c)), we need a clean and balanced meta-dataset that is used to train the meta-model \(\Psi\). We pass a meta batch \(X^{meta}\) through the virtually updated \(\Phi _b^{t+1}\) and \(\Phi _c^{t+1}\) to get a validation loss \(\mathcal {L}^{meta}\). In this step, the feature is not modified and the loss is the classic softmax cross-entropy. Then only \(\Psi ^t\) is updated, minimizing \(\mathcal {L}^{meta}\). In this way, the meta-model is optimized to help the main model minimize its error on clean and balanced data. Here the optimization also takes into consideration the previous Virtual-Train. In the last phase, Actual-Train (Figure 2(d)), the original \(\Phi _b^t\) and \(\Phi _c^t\) are optimized taking into account the updated meta-model \(\Psi ^{t+1}\). Our meta-model is used only during the training of the main network \(\Phi\), when external help is needed to solve the noisy or imbalanced label problems. It is discarded at test time, when only the main network is retained as the final model.
Fig. 2. Full training scheme of our method, divided by steps, going from the \(t\)-th iteration to the \((t+1)\)-th.
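Putting the four phases together, below is a compact sketch of one training iteration, reusing the Advisor module and eq_softmax_ce from the previous sketches. Plain SGD updates stand in for the actual optimizers, and all sizes and names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch.func import functional_call, grad

backbone, head = nn.Linear(32, 64), nn.Linear(64, 10)   # Phi_b, Phi_c (illustrative)
advisor = Advisor(feat_dim=64, num_classes=10)           # Psi, from the Section 3.5 sketch
inner_lr, meta_lr, rho = 0.1, 1e-4, 0.9

def train_loss(w_b, w_c, theta, x, y, loss_pre):
    # Virtual-Train forward pass: MFRW masks the features, MES weights the softmax.
    f = functional_call(backbone, w_b, (x,))
    w_f, s_k = functional_call(advisor, theta, (f, loss_pre.unsqueeze(1)))
    z = functional_call(head, w_c, (w_f * f,))           # f_att = W_f * f
    beta = torch.bernoulli(torch.full_like(s_k, rho))
    w_tilde = 1.0 - beta * s_k * (1.0 - F.one_hot(y, z.shape[1]).float())
    return eq_softmax_ce(z, y, w_tilde)                  # Equations (6) and (8)

def meta_loss(theta, w_b, w_c, x, y, loss_pre, x_m, y_m):
    g_b, g_c = grad(train_loss, argnums=(0, 1))(w_b, w_c, theta, x, y, loss_pre)
    wb_v = {k: w_b[k] - inner_lr * g_b[k] for k in w_b}  # virtually updated backbone
    wc_v = {k: w_c[k] - inner_lr * g_c[k] for k in w_c}  # virtually updated head
    z_m = functional_call(head, wc_v, (functional_call(backbone, wb_v, (x_m,)),))
    return F.cross_entropy(z_m, y_m)                     # plain CE on the meta-batch

def train_step(w_b, w_c, theta, x, y, x_m, y_m):
    # (a) Loss Pre-Calculation: per-sample CE on the unmodified features.
    with torch.no_grad():
        z = functional_call(head, w_c, (functional_call(backbone, w_b, (x,)),))
        loss_pre = F.cross_entropy(z, y, reduction="none")
    # (b)+(c) Virtual-Train inside Meta-Train: update only the advisor.
    g_t = grad(meta_loss)(theta, w_b, w_c, x, y, loss_pre, x_m, y_m)
    theta = {k: theta[k] - meta_lr * g_t[k] for k in theta}
    # (d) Actual-Train: update the real classifier with the refreshed advisor.
    g_b, g_c = grad(train_loss, argnums=(0, 1))(w_b, w_c, theta, x, y, loss_pre)
    w_b = {k: w_b[k] - inner_lr * g_b[k] for k in w_b}
    w_c = {k: w_c[k] - inner_lr * g_c[k] for k in w_c}
    return w_b, w_c, theta

w_b, w_c = dict(backbone.named_parameters()), dict(head.named_parameters())
theta = dict(advisor.named_parameters())
```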
Computation and memory overhead
Excluding the Loss Pre-Calculation phase, the Virtual-Train, Meta-Train, and Actual-Train steps each need a backward operation in addition to a forward one. The Meta-Train backward step, in which the meta-gradient is computed from the loss on the meta-set, takes more than \(80\%\) of the total computation [53]. In this step, to update the meta-model parameters, the meta-gradient is back-propagated through each layer of the main network. Since normal training does not involve this step, this additional cost quickly becomes significant as the number of layers in deep networks increases. In addition, the amount of GPU memory required is doubled compared to traditional training: the gradients obtained in the Virtual-Train step must be kept in memory so that the meta-gradient can be calculated during the Meta-Train step. These computation and memory problems are typical of many meta-learning approaches. However, a method like [53], which computes the meta-gradient with a faster layer-wise approximation, provides strategies to overcome them. These overhead costs are present only during training. In the test phase, the meta-model is not used and there is only a forward pass on the classifier to get a prediction on the input.

4 Experiments

To demonstrate the effectiveness of our method, we conducted several experiments on synthetically generated datasets with controlled levels of noise and imbalance. We also tested it on real-world datasets to prove its ability to adapt to any context.

4.1 Datasets

CIFAR10 and CIFAR100 synthetic datasets
Following previous works [20, 44, 45], we used CIFAR-10 and CIFAR-100 as bases to generate synthetic datasets. They are composed of \(50,\!000\) training images and \(10,\!000\) test images of size 32 \(\times\) 32. From the training set, we randomly selected 100 images per class for CIFAR-10 and 10 images per class for CIFAR-100 to create the meta-set for meta-training. The long-tailed versions of the datasets, CIFAR-LT-10 and CIFAR-LT-100, are created by randomly removing training examples [3]. Following the standard evaluation protocol for the long-tailed problem, we studied five different imbalance factors (IFs) of 200, 100, 50, 20, and 10, where IF = 1 coincides with the original datasets. These IFs are controlled by a parameter \(\mu \in (0, 1)\): the number of examples retained for the \(y\)-th class is \(n_y\mu ^y\), where \(n_y\) is the original number of training examples for that class.
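For concreteness, a small sketch of this subsampling profile (the function name and defaults are illustrative; with \(C\) classes, \(\mu\) is chosen so that \(\mu^{C-1} = 1/IF\), making the head/tail ratio equal to the IF):

```python
# Per-class sample counts for the exponential long-tail profile of [3]:
# class y keeps n_y * mu**y of its examples.
def long_tail_counts(n_per_class=5000, num_classes=10, imbalance_factor=100):
    mu = imbalance_factor ** (-1.0 / (num_classes - 1))
    return [round(n_per_class * mu ** y) for y in range(num_classes)]

# CIFAR-10 with IF = 100: the head class keeps 5000 images, the tail class 50.
print(long_tail_counts())
```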
We also tested our method on noisy label versions of CIFAR-10 and CIFAR-100, namely Flip CIFAR-10 and Flip CIFAR-100. In these variants, we chose the standard Flip (or asymmetric) noise because it is designed to mimic the structure where labels are replaced only by similar classes, e.g., dog \(\leftrightarrow\) cat. This type of noise usually happens when there is ambiguity between categories or visual similarity between images [52]. The noise ratio is controlled with a parameter \(p\), which represents the probability that a correct label is flipped to the corresponding similar one. In this way, we could test our method on different levels of noise, from \(p = 0.0\) (no noise) to \(p=0.6\) (heavy noise).
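A minimal sketch of this corruption (the CIFAR-10 class-pair mapping below is the one commonly used for asymmetric noise in the loss-correction literature; treat the exact pairs as an illustrative assumption):

```python
import random

# CIFAR-10 indices: 0 airplane, 1 automobile, 2 bird, 3 cat, 4 deer,
# 5 dog, 6 frog, 7 horse, 8 ship, 9 truck.
FLIP_MAP = {9: 1,   # truck -> automobile
            2: 0,   # bird  -> airplane
            4: 7,   # deer  -> horse
            3: 5,   # cat  <-> dog
            5: 3}

def flip_labels(labels, p=0.4, seed=0):
    # Each label with a designated similar class is flipped to it with probability p.
    rng = random.Random(seed)
    return [FLIP_MAP[y] if y in FLIP_MAP and rng.random() < p else y
            for y in labels]
```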
Merging the two strategies to inject data issues, we also introduced a new synthetic version of each dataset, named respectively LT Flip CIFAR-10 and LT Flip CIFAR-100, as a new evaluation protocol for the case of training data with simultaneously imbalanced and noisy labels.
ImageNet-LT
In [33] a long-tailed version of ImageNet-2012 [10], called ImageNet-LT, was introduced as a standard evaluation protocol for the long-tailed problem. Each class size is sampled from a Pareto distribution with shape value \(\alpha = 6\) to obtain the corresponding number of images. ImageNet-LT has \(115,\!800\) training images in \(1,\!000\) classes with an imbalance factor of \(1280/5\). We randomly selected 10 images per class from the provided validation set to create our meta-set for meta-training. The test set is the original balanced ImageNet-2012 validation set with 50 images per class.
Places-LT
The Places-LT dataset is created by sampling from the Places-2 dataset [58] with the same strategy used for ImageNet-LT. The training set is composed of \(62{,}500\) images from 365 classes with an imbalance factor of \(4980/5\). The test set has 100 images per class. Our meta-set is created by randomly selecting 10 examples per class from a validation set of 20 images per class.
Clothing1M
Clothing1M [52] is a dataset composed of 1 million images of clothing taken from online shopping websites. There are 14 categories, like T-shirt, Shirt, Knitwear, and so on. The labels are obtained from the text accompanying the images, provided by the sellers and not by expert annotators. This process introduces real-world noise into the labels, which cannot be predicted in advance. A validation set of 14,313 manually well-annotated images is provided, and it was used as the meta dataset in our experiments.

4.2 Meta-model Implementation Details

In every experiment, the meta-model was optimized with Adam [24] and a learning rate of 1e-4. The size of each embedding space was always set to 100. The probability \(\rho\) of the Bernoulli variable \(\beta\) for MES was set to 0.9.

4.3 Long-Tail Label Distribution Results

We conducted several experiments on the imbalanced training problem for image classification, trying different settings and datasets to compare our method with others in the literature. We tested MFRW and MES both separately and together (MFRW-MES). As baseline, we consider the direct training of the classifier with a standard softmax cross-entropy loss.
CIFAR-LT-10 and CIFAR-LT-100
The first set of experiments on CIFAR-LT-10 and CIFAR-LT-100 was conducted with a ResNet-32 network trained through SGD with a momentum of 0.9, weight decay of 5e-4, batch size of 128, and a starting learning rate of 0.1. The learning rate was decreased by a factor of 10 at epochs 160 and 180, and training stopped at 200 epochs. The results of our methods and related works are shown in Table 1.
Table 1.
Dataset               |         Long-Tailed CIFAR-10          |         Long-Tailed CIFAR-100
IF                    | 200    100    50     20     10        | 200    100    50     20     10
Baseline (CE) [45]    | 65.68  70.36  74.81  82.23  86.39     | 34.84  38.32  43.85  51.14  55.71
Focal Loss [31]       | 65.29  70.38  76.71  82.76  86.66     | 35.62  38.41  44.32  51.95  55.78
Fine-tuning [45]      | 66.08  71.33  77.42  83.37  86.42     | 38.22  41.83  46.40  52.11  57.44
CB Loss [9]           | 68.89  74.57  79.27  84.36  87.49     | 36.23  39.60  45.32  52.59  57.99
L2RW [44]             | 66.51  74.16  78.93  82.12  85.19     | 33.38  40.23  44.44  51.64  53.73
MW-Net [45]           | 68.91  75.21  80.06  84.94  87.84     | 37.91  42.09  46.74  54.37  58.46
LDAM-DRW [3]          | -      77.03  -      -      88.16     | -      42.04  -      -      58.71
LDAM [18]             | -      80.00  82.34  84.37  87.40     | -      44.08  49.16  52.38  58.00
FaMUS CE [53]         | -      79.30  83.15  87.15  89.39     | -      45.60  49.56  56.22  60.42
FaMUS LDAM [53]       | -      80.96  83.32  86.24  87.90     | -      46.03  49.93  55.95  59.03
GDW [5]               | -      72.34  -      -      87.32     | -      39.52  -      -      57.30
BALMS \(^\dagger\) [43] | 74.76  80.42  83.56  87.33  89.19     | 42.91  47.21  51.85  57.43  61.61
MFRW                  | 78.07  80.43  84.08  87.43  88.76     | 40.77  44.85  49.65  56.46  60.25
MES                   | 72.23  78.35  81.84  86.71  88.95     | 40.56  44.68  50.81  57.07  61.35
MFRW-MES              | 75.91  81.19  83.87  86.84  88.83     | 43.33  46.80  52.02  56.95  60.60
Table 1. Test Accuracy ( \(\%\) ) of ResNet-32 Architecture on CIFAR-LT-10 and CIFAR-LT-100 under Different Imbalance Factors (IFs)
The results for the cited methods are reported directly from their original papers, while \(^\dagger\) indicates results obtained by our implementation. The best results are marked in bold and the second-best are underlined.
On CIFAR-10-LT our methods MFRW and MFRW-MES obtained the best and second-best accuracy values, especially at higher values of IF. On CIFAR-100-LT the highest accuracies are instead shared between BALMS [43] and our MFRW-MES.
Increasing the number of categories to be classified from CIFAR-10-LT to CIFAR-100-LT, while maintaining the same ResNet-32 backbone, the gain in performance of MFRW was less pronounced than that of MES. Conversely, with the few classes of CIFAR-10, MES showed only a modest improvement when compared to MFRW. These behaviors might depend on the number of examples per class or on the classifier backbone.
To investigate this further, we conducted a second phase of experiments in which stronger preprocessing (increasing the variety of training samples) and a different learning rate scheduler were applied. A ResNet-32 classifier was trained for \(13,\!000\) iterations with a batch size of 512, applying AutoAugment [8] and Cutout [11]. The initial learning rate was 0.1, then decreased to zero with a Cosine Annealing scheduler [34]. The optimizer was SGD with a momentum of 0.9 and a weight decay of 5e-4.
The results of this experiment are reported in Table 2. It shows an overall performance improvement over the results in Table 1 due to the additional preprocessing applied to the classifier inputs. In this setting our method MFRW-MES, which benefits jointly from the MFRW and MES strategies, obtained results comparable with BALMS [43]. Compared to Table 1, MES benefits more from the augmentation than MFRW on both CIFAR-10-LT and CIFAR-100-LT, mainly at the highest imbalance factor IF = 200. This suggests that the few classes of CIFAR-10 and the very small number of training samples, as in Table 1 (where there is no data augmentation), led MES to a confused estimation of the vector \(s_k\) used in Equation (8). Because MFRW modifies the classifier's visual features, the small number of parameters of the ResNet-32 architecture could be a limitation for this method. For this reason, we also tested our methods on a ResNet-18 backbone, which has more parameters (11.17M trainable parameters) and a bigger visual feature size than ResNet-32 (0.48M trainable parameters), with the same settings as the experiments in Table 1. The visual feature size went from 64 in ResNet-32 to 512 in ResNet-18.
Table 2.
Dataset                  |  Long-Tailed CIFAR-10   |  Long-Tailed CIFAR-100
IF                       | 200    100    10        | 200    100    10
Baseline (CE)            | 71.2   77.4   90.0      | 41.0   45.3   61.9
CBW                      | 72.5   78.6   90.1      | 36.7   42.3   61.4
CBS                      | 68.3   77.8   90.2      | 37.8   42.6   61.2
Focal Loss [31]          | 71.8   77.1   90.3      | 40.2   43.8   60.0
CB Loss [9]              | 72.6   78.2   89.9      | 39.9   44.6   59.8
LDAM Loss [3]            | 73.6   78.9   90.3      | 41.3   46.1   62.1
Equalization Loss [46]   | 74.6   78.5   90.2      | 43.3   47.4   60.5
cRT [21]                 | 76.6   82.0   91.0      | 44.5   50.0   63.3
LWS [21]                 | 78.1   83.7   91.1      | 45.3   50.5   63.4
BALMS [43]               | 81.5   84.9   91.3      | 45.5   50.8   63.0
MFRW                     | 78.32  82.49  90.22     | 42.90  47.05  62.77
MES                      | 77.12  81.19  91.03     | 43.52  48.44  63.63
MFRW-MES                 | 81.38  84.97  90.99     | 46.52  50.44  64.06
Table 2. Test Accuracy ( \(\%\) ) of ResNet-32 Architecture on CIFAR-LT-10 and CIFAR-LT-100 under Different Imbalance Factors (IFs)
AutoAugment and Cutout are additionally applied as preprocessing on the data. The results of the cited methods are reported directly from their original papers. The best results are marked in bold and the second-best are underlined.
The results in Table 3 confirm our intuition that MFRW benefits from a bigger classifier backbone. MES instead achieved only a small improvement with the more complex backbone, showing that its performance is related to the number of examples in the training set and to their preprocessing. With ResNet-18 as backbone, our methods MFRW-MES and MFRW obtained the best and second-best accuracy results at almost every IF value. Figure 3 shows how the accuracy of MFRW-MES and BALMS [43] varies with the number of parameters of the classifier's backbone on CIFAR-10-LT (3(a)) and CIFAR-100-LT (3(b)) with IF = 100.
Fig. 3. Accuracy results of MFRW-MES and BALMS [43] with different classifier backbones on CIFAR-10-LT (3(a)) and CIFAR-100-LT (3(b)) at the same IF of 100. The number of parameters is reported under each backbone's name.
Table 3.
Dataset               |         Long-Tailed CIFAR-10          |         Long-Tailed CIFAR-100
IF                    | 200    100    50     20     10        | 200    100    50     20     10
Baseline (CE)         | 70.22  75.16  82.32  87.24  90.73     | 38.87  43.65  48.55  57.09  62.59
CB Loss [9]           | 69.16  75.16  81.90  86.61  90.79     | 38.58  43.51  48.15  57.02  63.10
FSA [7]               | 77.06  80.57  84.51  88.54  91.75     | 42.84  46.57  51.90  58.69  65.08
BALMS \(^\dagger\) [43] | 76.86  81.78  85.28  89.27  90.86     | 42.19  48.07  53.83  59.87  64.13
MFRW                  | 79.49  81.35  86.32  89.23  91.44     | 43.19  47.51  53.15  60.38  65.36
MES                   | 73.11  77.96  83.33  88.69  90.63     | 40.28  44.49  50.10  58.24  63.99
MFRW-MES              | 79.94  83.43  86.80  89.13  91.02     | 43.85  50.04  54.12  61.37  65.28
Table 3. Classification Accuracy ( \(\%\) ) of the ResNet-18 Architecture, Trained with the Same Settings as Table 1
The best and second-best results are marked in bold and underlined, respectively. \(^\dagger\) indicates the results obtained by our implementation.
From Tables 1–3 it is possible to notice how our method exceeds, or is in line with, the results obtained by state-of-the-art algorithms for long-tailed training. We obtained the best accuracy values at almost all IFs, especially when the dataset is extremely imbalanced (IF = 200, 100, 50). Table 3 shows the effectiveness of our method with the bigger ResNet-18 backbone. Both MFRW and MES obtained good results, even individually, and we observed that they can be used simultaneously without compromising the final performance of the classifier. We designed MFRW-MES to address the even more complex case of imbalance combined with noisy labels.
The embedding space of the meta-model employed by MFRW-MES is learned by taking into account the collaboration between MFRW and MES. During the training of MFRW-MES, the embedding space is influenced first by MES, which acts directly on the classifier's loss, and then by MFRW, which operates on the visual features. In Table 1, for the case of CIFAR-10-LT where MES performs poorly, the application of confusedly predicted vectors \(s_k\) to the loss caused MFRW to receive a diminished gradient on the visual features, rendering it incapable of learning the \(W_f\) masks correctly. In some cases, this makes MFRW-MES less accurate than MFRW applied alone. In Table 3, instead, where the more capable ResNet-18 backbone (11.17M trainable parameters, with a bigger visual feature size) was used in place of ResNet-32 (0.48M trainable parameters), MFRW-MES obtained better accuracy than each individual method. In this case, when the two methods were applied jointly, MFRW could act upon the visual features even while receiving a reduced gradient (produced by the application of MES), exploiting the larger number of learnable parameters of the backbone.
ImageNet-LT and Places-LT
Following the experimental setup of [43], we employed ResNet-10 and ResNet-152 networks for ImageNet-LT and Places-LT, respectively. For ImageNet-LT, we adopted an initial learning rate of 0.2, decayed with a Cosine Annealing scheduler over 180 training epochs. For Places-LT, the learning rate started at 0.005 and was reduced as for ImageNet-LT; we trained ResNet-152 for a total of 60 epochs with a batch size of 64. In both cases, our method started from a baseline pre-trained on the entire dataset. We did not freeze the feature extractor of the pre-trained network as the decoupled training strategy of [21] does. We pre-trained the backbone to reduce the total training time and to make our method start from a reasonably good feature extractor.
Table 4 shows the results of MFRW-MES on ImageNet-LT and Places-LT. On the first dataset, our method achieved a Top-1 accuracy comparable to other methods that target only this task. On Places-LT, we obtained the second-best result.
Table 4.
Dataset                  |      ImageNet-LT        |       Places-LT
Method                   | Top-1  Top-3  Top-5     | Top-1  Top-3  Top-5
Baseline (CE)            | 25.26  38.65  47.88     | 27.00  47.95  58.56
RCB [18]                 | 29.90  46.71  54.82     | 30.80  52.05  62.00
OLTR [33]                | 35.60  -      -         | 35.90  -      -
Equalization Loss [46]   | 36.44  -      61.19     | -      -      -
cRT [21]                 | 41.80  -      -         | 36.70  -      -
LWS [21]                 | 41.40  -      -         | 37.60  -      -
BALMS [43]               | 41.80  -      -         | 38.70  -      -
MFRW-MES                 | 41.78  59.87  67.25     | 38.34  61.20  71.09
Table 4. Top-1, Top-3, and Top-5 Accuracy ( \(\%\) ) on ImageNet-LT (ResNet-10 Classifier) and Places-LT (ResNet-152 Classifier)
We report the results of the cited methods directly from their original papers. The best results are marked in bold and the second-best are underlined.
With these experiments, we showed how our algorithm can solve the long-tail data problem via a simple advisor network trained with the meta-learning paradigm.

4.4 Flip Label Noise Results

We trained our model under Flip (or asymmetric) label noise at various levels. To assess the performance of the advisor network, we compared it to other works that studied this type of noise. We trained a ResNet-32 through SGD with a starting learning rate of 0.1 and a batch size of 128. We multiplied the learning rate by 0.1 at epochs 50 and 70, and stopped training after 100 epochs. We also reproduced the algorithm of [43] under this experimental setting to observe how an ad-hoc long-tailed distribution method behaves under Flip noise. The baseline method is the direct training of the classifier with a standard softmax cross-entropy loss.
From Table 5 we can notice that our method obtained the best results for Flip noise on CIFAR10 and CIFAR100. The use of our advisor network avoided the drastic accuracy drop suffered by the other methods, especially when the noise was very strong ( \(p = 0.6\) ). When there is no noise ( \(p = 0.0\) ), our method obtained slightly worse accuracy than normal training with the classic softmax cross-entropy loss on both CIFAR10 and CIFAR100. This happens because the advisor network, in trying to help the classifier, introduces a bias toward the example distribution contained in the meta-set. If the training distribution already reflects the test distribution better than the meta-set does, introducing this meta bias makes accuracy slightly worse. In some experiments, MES may slightly outperform MFRW-MES. This behavior is reasonable because MFRW-MES, which addresses the even more complex case of imbalance combined with noisy labels, attempts to give the classifier a well-balanced training.
Table 5.
Dataset                    |        Flip CIFAR-10            |        Flip CIFAR-100
Noise \(p\)                | 0.0    0.2    0.4    0.6        | 0.0    0.2    0.4    0.6
Baseline (CE) [45]         | 92.89  76.83  70.77  -          | 70.50  50.86  43.01  -
Reed-Hard [42]             | 92.31  88.28  81.06  -          | 69.02  60.27  50.40  -
S-Model [13]               | 83.61  79.25  75.73  -          | 51.46  45.45  43.80  -
Self-paced [26]            | 88.52  87.03  81.63  -          | 67.55  63.63  53.51  -
Focal Loss [31]            | 93.03  86.45  80.45  -          | 70.02  61.87  54.13  -
Co-teaching [14]           | 89.87  82.83  75.41  -          | 63.31  54.13  44.85  -
D2L [35]                   | 92.02  87.66  83.89  -          | 68.11  63.48  51.83  -
Fine-tuning [45]           | 93.23  82.47  74.07  -          | 70.72  56.98  46.37  -
MentorNet [20]             | 92.13  86.30  81.76  -          | 70.24  61.97  52.66  -
L2RW [44]                  | 89.25  87.86  85.66  -          | 64.11  57.47  50.98  -
GLC [16]                   | 91.02  89.68  88.92  -          | 65.42  63.07  62.22  -
MW-Net [45]                | 92.04  90.33  87.54  -          | 70.11  64.22  58.64  -
GDW [5]                    | 92.94  91.05  87.70  -          | 70.65  65.41  52.44  -
Baseline (CE) \(^\dagger\) | 92.33  90.56  86.25  26.67      | 70.18  65.02  50.25  18.67
MW-Net \(^\dagger\) [45]   | 92.19  90.74  87.63  42.41      | 70.57  64.13  51.23  19.89
BALMS \(^\dagger\) [43]    | 92.86  90.99  83.51  51.76      | 69.66  65.61  56.83  39.16
MFRW                       | 91.87  91.09  90.26  89.34      | 68.93  63.54  59.07  56.13
MES                        | 93.03  91.25  90.76  90.58      | 69.74  65.36  62.96  60.82
MFRW-MES                   | 92.46  91.44  90.70  90.21      | 68.33  65.17  62.26  58.43
Table 5. Test Accuracy on CIFAR10 and CIFAR100 Dataset with Flip (Asymmetric) Label Noise
The backbone used is a ResNet-32. \(p\) denotes the different levels of noise. The results for the cited methods are reported directly from their original papers, while \(^\dagger\) indicates results obtained by our implementation. The best and second-best results are marked in bold and underlined, respectively.

4.5 Long-Tail & Flip Label Noise Results

We introduce a new synthetic dataset setting in which the imbalanced and noisy label problems are both present. We chose three values of IF (200, 100, 10) and two of \(p\) (0.4, 0.6), and generated all possible combinations for both CIFAR10 and CIFAR100. All experiments were performed by training a ResNet-32 with the same settings and hyperparameters as those used to obtain the results in Table 1. This experiment is important to establish the ability of an algorithm to handle different types of dataset conditions at the same time.
We compared our method with BALMS [43], which is designed for long-tailed distributions, and with MW-Net [45] and GDW [5], which, like ours, can deal with any type of problem in the data. The results in Table 6 indicate that our advisor network can manage noisy labels and long-tailed distributions at the same time better than the other methods. MFRW-MES can exploit the two different properties of MFRW and MES simultaneously, achieving better results than each method used separately.
Table 6.
Dataset                    |              LT Flip CIFAR-10               |              LT Flip CIFAR-100
Noise \(p\)                |        0.4          |        0.6            |        0.4          |        0.6
IF                         | 200    100    10    | 200    100    10      | 200    100    10    | 200    100    10
Baseline (CE) \(^\dagger\) | 49.64  56.98  76.58 | 31.78  31.96  31.78   | 22.03  23.81  39.48 | 12.08  13.65  19.60
MW-Net \(^\dagger\) [45]   | 45.74  52.43  82.22 | 32.06  33.22  46.50   | 24.34  25.24  39.05 | 12.96  14.96  20.01
GDW \(^\dagger\) [5]       | 36.28  49.73  81.30 | 27.90  34.39  60.83   | 21.98  23.62  34.92 | 13.47  14.12  20.06
BALMS \(^\dagger\) [43]    | 53.73  59.24  70.40 | 48.53  52.55  57.39   | 27.02  29.18  44.44 | 18.80  22.14  32.02
MFRW                       | 55.90  63.52  83.76 | 44.41  52.62  69.16   | 23.45  25.26  38.08 | 17.65  18.48  29.58
MES                        | 53.33  64.51  85.80 | 44.45  52.46  83.17   | 25.41  26.76  47.54 | 17.12  18.79  38.75
MFRW-MES                   | 55.13  61.67  85.34 | 50.19  59.38  90.26   | 31.73  34.20  53.89 | 21.79  24.30  39.99
Table 6. Test Accuracy on CIFAR10 and CIFAR100 Dataset with Two Levels of Flip Label Noise p (0.4, 0.6) and Three Different Imbalance Factors IFs (200, 100, 10)
The backbone used is a ResNet-32. \(^\dagger\) indicates results obtained by our implementation. The best results are marked in bold and the second-best are underlined.

4.6 Real-world Label Noise Results

To test real-world noise, we used Clothing1M with a ResNet-50 backbone pre-trained on ImageNet, trained through SGD with a momentum of 0.9, weight decay of 1e-3, and a starting learning rate of 0.01. The batch size was 32, and inputs were preprocessed by resizing the image to 256 \(\times\) 256, randomly cropping a 224 \(\times\) 224 patch, and finally normalizing. The total training process consisted of 20 epochs, with the learning rate multiplied by 0.1 after 10 and 15 epochs.
The results reported in Table 7 show that our method obtains state-of-the-art accuracy on the Clothing1M dataset, a relative improvement of \(3.10\%\) over the best previously published algorithm [39] and of \(12.33\%\) over the baseline accuracy.
Table 7.
Method                 | Accuracy (%)
Baseline (CE) [45]     | 68.94
F-correction [40]      | 69.84
JoCoR [51]             | 70.30
S-adaptation [13]      | 70.36
M-correction [2]       | 71.00
MLC [50]               | 71.06
Joint-Optim [47]       | 72.16
MLNT [28]              | 73.47
P-correction [54]      | 73.49
MW-Net [45]            | 73.72
MentorMix [19]         | 74.30
FaMUS [53]             | 74.43
DivideMix [27]         | 74.76
AugDesc [39]           | 75.11
MFRW                   | 75.35
MES                    | 76.43
MFRW-MES               | 77.44
Table 7. Comparison with State-of-the-Art Methods in Test Accuracy \((\%)\) on Clothing1M Dataset with Real-world Noise
Results for cited methods were copied from original papers.

4.7 Real-world Label Noise with Long-Tail Imbalance Results

To verify the effectiveness of our method in a mixed setting of real-world label noise and long-tail distribution, we applied three different levels of IF (1, 50, 100) to Clothing1M as described in [22]. We trained a ResNet-18, pre-trained on ImageNet, with the SGD optimizer, a momentum of 0.9, weight decay of 1e-4, and a starting learning rate of 0.01. We optimized the classifier for a total of 20 epochs, multiplying the learning rate by 0.1 after 10 and 15 epochs. We used a batch size of 64, with inputs preprocessed by resizing the image to 256 \(\times\) 256, randomly cropping a 224 \(\times\) 224 patch, and finally normalizing.
From the results in Table 8, we can observe that our method obtains state-of-the-art accuracy values on the Clothing1M dataset both with and without an artificially applied long-tail distribution.
Table 8.
                        |   Imbalance Factor (IF)
Method                  | 1      50     100
DivideMix [27]          | 73.90  67.10  64.90
ELR [32]                | 74.20  63.90  59.60
SimSiam+L+SL [22]       | 71.10  69.30  68.20
GDW \(^\dagger\) [5]    | 72.97  72.82  69.81
MFRW-MES                | 77.15  75.37  74.24
Table 8. Accuracy \((\%)\) Values of State-of-the-Art Methods on the Clothing1M Dataset with Real-world Noise and Three IFs (1, 50, 100)
Results for cited methods were copied from original papers. \(^\dagger\) indicates the results obtained by our implementation of other methods.

4.8 Variation of Meta-set Size

We verified how the size of the meta-set affects the performance of our method MFRW-MES. We increased the number of samples in the meta-set from 0, which corresponds to the baseline method, to a maximum of 1,000 for the CIFAR-10 and CIFAR-100 datasets and to the full validation set ( \(14k\) images) for Clothing1M. We chose two specific settings for the CIFAR-10/100 datasets, one with Flip noise of intensity \(p=0.6\) and the other with a long-tail distribution generated with \(IF=100\). The results of each experiment are shown in Figure 4. Even with few examples per class, starting from a meta-set size that is \(0.2 \%\) of the entire training dataset, our method performed well in the two artificial settings of CIFAR-10 and CIFAR-100. From the plot in Figure 4(c), instead, it is possible to notice that the meta-set size is relevant for reaching the state-of-the-art result on Clothing1M. This was to be expected, since the noise structure in the annotations of Clothing1M is much harder to model than artificially generated noise, so having more examples in the meta-set allows our method to learn a much more complex function to help the classifier. However, even with a meta-set that is only \(1.38\%\) of all the training data, our method obtained a \(12.33\%\) improvement over the baseline accuracy.
Fig. 4. Plot of accuracy results of the MFRW-MES method at varying meta-set sizes.

4.9 Qualitative Advisor Network Results

This section provides a qualitative analysis of various aspects of our advisor network. To better understand how our method is helping the main classifier, it is important to look at what and how the meta-model learns.
Distribution of learned attention weights
First, we checked how the predicted weight masks that the meta-model learns for the meta-attention part (MFRW) are distributed across the training examples. We extracted the first two principal components of a PCA on the weights \(W_f\) predicted by the meta-model after the classifier's training on Flip-noised CIFAR10. The two PCA components are plotted in Figure 5 for four different values of Flip noise strength, from \(p=0.0\) to \(p=0.6\). For every \(p\) value, except \(p=0\) where there is no noise, the predicted \(W_f\) are separated into two large clusters, which indicates that the meta-model learned to weigh the examples that contain label errors differently from those with correct labels. This is the effect of giving the advisor network the loss value of each training sample: an out-of-distribution example has a larger loss than a good one.
Fig. 5. Plot of the first two principal components of a PCA on the weights \(W_f\) obtained from training on CIFAR10 with four different values of Flip noise: \(p=0.0\) (5(a)), \(0.2\) (5(b)), \(0.4\) (5(c)), \(0.6\) (5(d)). Pink dots indicate examples with the correct label; light blue ones indicate examples with a noisy label. The clear separation between noisy and correct examples indicates a different way of generating weights for these two groups.
Next, we applied T-SNE [48] to the predicted \(W_f\) to see if there is also a per-class separation between them. From the T-SNE plot in Figure 6, it is possible to deduce that the \(W_f\) have an additional per-class separation beyond the noisy/correct one. This is due to the contribution of the visual features given as input to the meta-model, which allows it to predict different weights not only based on the loss value but also depending on the image content. In Figure 7(a), the weights \(W_f\) relative to the first 24 examples of the original class “airplane”, affected by Flip noise with \(p=0.6\), are shown. The information from noisy label examples (light blue border) is manipulated differently from that of the correct examples (pink border).
Fig. 6. T-SNE of the predicted weight vectors \(W_f\) learned on CIFAR-10 with four different values of Flip noise: \(p=0.0\) (6(a)), \(0.2\) (6(b)), \(0.4\) (6(c)), \(0.6\) (6(d)). Pink dots indicate examples with the correct label; light blue ones indicate examples with a noisy label. Each category is denoted by a colored border. Besides the separation between noisy and correct examples, there is also one at the category level, indicating distinct predicted weight vectors \(W_f\) for features belonging to different classes.
Fig. 7. Attention weights \(W_f\) relative to the first 24 examples of the class “airplane” (7(a)) learned by our meta-model at the end of training on Flip ( \(p=0.6\) ) noised CIFAR10; pink indicates examples with the correct label, light blue the noisy ones. Attention weights \(W_f\) relative to examples of the common class “apples” (7(b)) and of the rare class “roses” (7(c)) of CIFAR-LT-100 with IF 200.
We analyzed the \(W_f\) learned on CIFAR-LT-100 with IF 200, the most difficult setting. As shown in Figures 7(b) and 7(c), the predicted weights \(W_f\) differ both between different classes and within the same class. The weights of the frequent class “apples” (Figure 7(b)) have more values close to zero (black), whereas those of the rare class “roses” (Figure 7(c)) have many values close to one (white). To better visualize the \(W_f\) distribution in the imbalanced case, we applied PCA and T-SNE to the attention masks learned on CIFAR-LT-10 with IF 200 and 100. We plotted the results of these two operations in Figure 8.
Fig. 8. PCA (top) and T-SNE (bottom) of the predicted weight vectors \(W_f\) learned on CIFAR-10-LT with two different values of imbalance, \(IF=200\) (8(a)) and \(IF=100\) (8(b)). Each category is denoted by a different color described in the legend, with a bar under the legend indicating each class size. The bar does not express the exact number but only serves to show which categories have fewer examples than others.
This means that the information from common-category examples is ignored much more than that belonging to the rare classes. Moreover, examples belonging to the same class are not all weighted equally: as shown in Figure 7(b), some weight vectors have more values close to one (white) than others. This happens because those examples contain information that is still useful to the main classifier.
Softmax weights learned with MES
We investigated how our softmax weights \(s_k\), learned with MES, differ from the hand-crafted solution proposed in [46]. We measured the effectiveness of each solution by computing the Mean Absolute Error (MAE) between the distribution of the class sizes, normalized between 0 and 1, and the vector of weights passed to Equation (6). Figure 9 shows the MAE values obtained during training of the main network on CIFAR-LT-100 with different IF values. MES fits the target distribution better than the simple threshold function of [46] for all IF values and does not require any extra hyperparameter tuning.
Fig. 9.
Fig. 9. Comparison of MES with the two functions defined in [46] on CIFAR-LT-100 with different values of IF: 10 (9(a)), 50 (9(b)), 100 (9(c)), 200 (9(d)). The graph reports the Mean Absolute Error (MAE) between the distribution of the class sizes (normalized between 0 and 1) and the vector of weights given to Equation (6); lower MAE indicates a better fit of the target distribution. The graph also details the predicted weight vectors \(s_k\) at various learning steps.
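As a concrete illustration, the sketch below computes this MAE, assuming one weight per class and min-max normalization of the class sizes; threshold_weights is only a simplified stand-in for the hard-threshold functions of [46], not their exact formulation, and all names are illustrative.

    import numpy as np

    def weight_mae(class_sizes, weights):
        """MAE between the normalized class-size distribution and a weight vector."""
        sizes = np.asarray(class_sizes, dtype=float)
        target = (sizes - sizes.min()) / (sizes.max() - sizes.min())  # scale to [0, 1]
        return float(np.abs(target - np.asarray(weights, dtype=float)).mean())

    def threshold_weights(class_sizes, lam=5e-3):
        """Simplified hard threshold: weight 1 for classes rarer than frequency lam."""
        freq = np.asarray(class_sizes, dtype=float) / np.sum(class_sizes)
        return (freq < lam).astype(float)

    # Lower MAE means the weight vector tracks the class-size distribution better:
    # weight_mae(class_sizes, s_k) vs. weight_mae(class_sizes, threshold_weights(class_sizes))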

5 Conclusions

We introduced two new methods, Meta Feature Re-Weighting (MFRW) and Meta Equalization Softmax (MES), that use a novel advisor network to mitigate the problems of training DNNs on noisy labels and long-tailed class distributions. We empirically showed the effectiveness of our method on synthetically generated and real-world classification datasets. Experimental results demonstrate that the advisor strategy helps the main classifier achieve better generalization performance under both training data problems. We introduced a new synthetic dataset setting in which the long-tailed distribution is mixed with the noisy label problem, and showed that our method solves both problems simultaneously, unlike other similar works. We achieved state-of-the-art performance on the Clothing1M dataset, which contains real-world label noise. Future research may include adapting the advisor network to tasks more complex than classification, such as object detection or image segmentation.

References

[1] Görkem Algan and Ilkay Ulusoy. 2021. Image classification with deep learning in the presence of noisy labels: A survey. Knowledge-Based Systems 215 (2021), 106771.
[2] Eric Arazo, Diego Ortego, Paul Albert, Noel O’Connor, and Kevin McGuinness. 2019. Unsupervised label noise modeling and loss correction. In International Conference on Machine Learning. PMLR, 312–321.
[3] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. 2019. Learning imbalanced datasets with label-distribution-aware margin loss. arXiv preprint arXiv:1906.07413 (2019).
[4] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
[5] Can Chen, Shuhao Zheng, Xi Chen, Erqun Dong, Xue Steve Liu, Hao Liu, and Dejing Dou. 2021. Generalized DataWeighting via class-level gradient manipulation. Advances in Neural Information Processing Systems 34 (2021), 14097–14109.
[6] Yong Cheng, Lu Jiang, Wolfgang Macherey, and Jacob Eisenstein. 2020. AdvAug: Robust adversarial augmentation for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 5961–5970.
[7] Peng Chu, Xiao Bian, Shaopeng Liu, and Haibin Ling. 2020. Feature space augmentation for long-tailed data. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX. Springer, 694–710.
[8] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. 2019. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 113–123.
[9] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. 2019. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9268–9277.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[11] Terrance DeVries and Graham W. Taylor. 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017).
[12] Chris Drummond. 2003. Class imbalance and cost sensitivity: Why undersampling beats oversampling. In ICML-KDD 2003 Workshop: Learning from Imbalanced Datasets, Vol. 3.
[13] Jacob Goldberger and Ehud Ben-Reuven. 2016. Training deep neural-networks using a noise adaptation layer. (2016).
[14] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. arXiv preprint arXiv:1804.06872 (2018).
[15] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. 2005. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing. Springer, 878–887.
[16] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. 2018. Using trusted data to train deep networks on labels corrupted by severe noise. Advances in Neural Information Processing Systems 31 (2018), 10456–10465.
[17] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. 2016. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5375–5384.
[18] Muhammad Abdullah Jamal, Matthew Brown, Ming-Hsuan Yang, Liqiang Wang, and Boqing Gong. 2020. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7610–7619.
[19] Lu Jiang, Di Huang, Mason Liu, and Weilong Yang. 2020. Beyond synthetic noise: Deep learning on controlled noisy labels. In International Conference on Machine Learning. PMLR, 4804–4815.
[20] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2018. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning. PMLR, 2304–2313.
[21] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. 2019. Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217 (2019).
[22] Shyamgopal Karthik, Jérôme Revaud, and Boris Chidlovskii. 2021. Learning from long-tailed data with noisy labels. arXiv preprint arXiv:2108.11096 (2021).
[23] Salman H. Khan, Munawar Hayat, Mohammed Bennamoun, Ferdous A. Sohel, and Roberto Togneri. 2017. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems 29, 8 (2017), 3573–3587.
[24] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[25] Svetlana Kordumova, Jan van Gemert, and Cees G. M. Snoek. 2016. Exploring the long tail of social media tags. In International Conference on Multimedia Modeling. Springer, 51–62.
[26] M. Kumar, Benjamin Packer, and Daphne Koller. 2010. Self-paced learning for latent variable models. Advances in Neural Information Processing Systems 23 (2010), 1189–1197.
[27] Junnan Li, Richard Socher, and Steven C. H. Hoi. 2020. DivideMix: Learning with noisy labels as semi-supervised learning. In International Conference on Learning Representations. https://openreview.net/forum?id=HJgExaVtwr.
[28] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S. Kankanhalli. 2019. Learning to learn from noisy labeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5051–5059.
[29] Mengmeng Li, Tian Gan, Meng Liu, Zhiyong Cheng, Jianhua Yin, and Liqiang Nie. 2019. Long-tail hashtag recommendation for micro-videos with graph convolutional network. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 509–518.
[30] Xirong Li, Tiberio Uricchio, Lamberto Ballan, Marco Bertini, Cees G. M. Snoek, and Alberto Del Bimbo. 2016. Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval. ACM Computing Surveys (CSUR) 49, 1 (2016), 1–39.
[31] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[32] Sheng Liu, Jonathan Niles-Weed, Narges Razavian, and Carlos Fernandez-Granda. 2020. Early-learning regularization prevents memorization of noisy labels. Advances in Neural Information Processing Systems 33 (2020), 20331–20342.
[33] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X. Yu. 2019. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2537–2546.
[34] Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016).
[35] Xingjun Ma, Yisen Wang, Michael E. Houle, Shuo Zhou, Sarah Erfani, Shutao Xia, Sudanthi Wijewickrema, and James Bailey. 2018. Dimensionality-driven learning with noisy labels. In International Conference on Machine Learning. PMLR, 3355–3364.
[36] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. 2018. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV’18). 181–196.
[37] Mayank Meghawat, Satyendra Yadav, Debanjan Mahata, Yifang Yin, Rajiv Ratn Shah, and Roger Zimmermann. 2018. A multimodal approach to predict social media popularity. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR’18). IEEE, 190–195.
[38] David F. Nettleton, Albert Orriols-Puig, and Albert Fornells. 2010. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review 33, 4 (2010), 275–306.
[39] Kento Nishi, Yi Ding, Alex Rich, and Tobias Hollerer. 2021. Augmentation strategies for learning with noisy labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8022–8031.
[40] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. 2017. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1944–1952.
[41] Mykola Pechenizkiy, Alexey Tsymbal, Seppo Puuronen, and Oleksandr Pechenizkiy. 2006. Class noise and supervised learning in medical domains: The effect of feature extraction. In 19th IEEE Symposium on Computer-based Medical Systems (CBMS’06). IEEE, 708–713.
[42] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. 2014. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596 (2014).
[43] Jiawei Ren, Cunjun Yu, Shunan Sheng, Xiao Ma, Haiyu Zhao, Shuai Yi, and Hongsheng Li. 2020. Balanced meta-softmax for long-tailed visual recognition. arXiv preprint arXiv:2007.10740 (2020).
[44] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. 2018. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning. PMLR, 4334–4343.
[45] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. 2019. Meta-weight-net: Learning an explicit mapping for sample weighting. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. 1919–1930.
[46] Jingru Tan, Changbao Wang, Buyu Li, Quanquan Li, Wanli Ouyang, Changqing Yin, and Junjie Yan. 2020. Equalization loss for long-tailed object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11662–11671.
[47] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. 2018. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5552–5560.
[48] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11 (2008).
[49] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. 2017. Learning to model the tail. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 7032–7042.
[50] Zhen Wang, Guosheng Hu, and Qinghua Hu. 2020. Training noise-robust deep neural networks via meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4524–4533.
[51] Hongxin Wei, Lei Feng, Xiangyu Chen, and Bo An. 2020. Combating noisy labels by agreement: A joint training method with co-regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13726–13735.
[52] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. 2015. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2691–2699.
[53] Youjiang Xu, Linchao Zhu, Lu Jiang, and Yi Yang. 2021. Faster meta update strategy for noise-robust deep learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 144–153.
[54] Kun Yi and Jianxin Wu. 2019. Probabilistic end-to-end noise correction for learning with noisy labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7017–7025.
[55] Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, and Manmohan Chandraker. 2019. Feature transfer learning for face recognition with under-represented data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5704–5713.
[56] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2018. Mixup: Beyond empirical risk minimization. In International Conference on Learning Representations.
[57] Boyan Zhou, Quan Cui, Xiu-Shen Wei, and Zhao-Min Chen. 2020. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9719–9728.
[58] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2017), 1452–1464.
[59] Linchao Zhu and Yi Yang. 2020. Inflated episodic memory with region self-attention for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4344–4353.

Author Tags

  1. Meta-learning
  2. neural networks
  3. long-tail
  4. noisy labels

Funding Sources

  • European Commission under European Horizon 2020 Programme
