A Framework Using Contrastive Learning for Classification with Noisy Labels
Abstract
1. Introduction
- A framework that increases the robustness of any loss function to noisy labels by adding a contrastive pre-training task.
- An adaptation of the supervised contrastive loss that uses per-sample weights representing the probability of correctness of each sample in the training set (a minimal sketch is given after this list).
- An extensive empirical study identifying and benchmarking additional state-of-the-art strategies to boost the performance of pre-trained models: pseudo-labeling, sample selection with a Gaussian Mixture Model (GMM), weighted supervised contrastive learning, and mixup with bootstrapping.
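As an illustration of the second contribution, the sketch below shows one way a per-sample-weighted supervised contrastive loss [39] could be implemented in PyTorch. It is a minimal single-view sketch under our own assumptions (the function name, temperature value, and product weighting of positive pairs are illustrative), not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def weighted_supcon_loss(features, labels, weights, temperature=0.1):
    """Supervised contrastive loss with per-sample weights.

    features: (N, D) embeddings of one augmented view per sample.
    labels:   (N,) possibly noisy class labels.
    weights:  (N,) estimated probability that each label is correct, in [0, 1].
    """
    features = F.normalize(features, dim=1)  # cosine similarities
    n = features.shape[0]

    # Pairwise similarities scaled by the temperature.
    logits = features @ features.t() / temperature

    # Exclude self-comparisons with a large negative value (keeps gradients finite).
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
    logits = logits.masked_fill(self_mask, -1e9)

    # Positive pairs share the same (noisy) label; weight each pair by the
    # product of the clean probabilities of the two samples involved.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pair_weights = pos_mask.float() * weights.unsqueeze(0) * weights.unsqueeze(1)

    # log p(j | i) over all non-self candidates.
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Weighted mean log-likelihood of positives, for anchors that have any.
    denom = pair_weights.sum(dim=1)
    valid = denom > 0
    loss = -(pair_weights * log_prob).sum(dim=1)[valid] / denom[valid]
    return loss.mean()
```

In the framework below, the weights would typically come from the sample-selection step (Section 4.1), for example the GMM posterior probability that a label is clean.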
2. Related Works
2.1. Noise Tolerant Classification
2.2. Contrastive Learning for Vision Data
3. Preliminaries
3.1. Classification with Robust Loss Functions
3.2. Contrastive Learning
- Data augmentation: Data augmentation is used to decouple the pretext task from the network architecture. Chen et al. [14] broadly study the impact of data augmentation. We follow their suggestion and combine random crop (and flip), color distortion, Gaussian blur, and gray-scaling.
- Encoding: The encoder extracts features (or representations) from the augmented data samples. A classical choice of encoder for image data is the ResNet model [37]. The ultimate goal of the contrastive approach is to learn good weights for the encoder.
- Loss function: The loss function usually contrasts positive and negative pairs. Noise Contrastive Estimation (NCE) and its variants are popular choices. The general formulation of such a loss for the i-th pair is given in [38]; a standard form is reproduced after this list.
- Projection head: This step is not used in all frameworks. The projection head maps the representation to a lower-dimensional space and acts as an intermediate layer between the representation and the embedding pairs. Chen et al. [14,31] show that the projection head helps to improve the representation quality.
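The equation referenced in the loss-function item did not survive extraction; a standard InfoNCE-style formulation, which we assume matches the form intended in [38], is

$$
\ell_i = -\log \frac{\exp\left(\operatorname{sim}(z_i, z_i^{+}) / \tau\right)}{\sum_{k=1}^{K} \exp\left(\operatorname{sim}(z_i, z_k) / \tau\right)},
$$

where $z_i^{+}$ is the embedding of the positive sample paired with $z_i$, the sum in the denominator runs over one positive and $K-1$ negative candidates, $\operatorname{sim}(\cdot,\cdot)$ is typically the cosine similarity, and $\tau$ is a temperature hyperparameter.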
4. A Framework Coupling Contrastive Learning and Noisy Labels
4.1. Sample Selection and Correction with Pseudo-Labels
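The selection step described in this section is commonly realized by fitting a two-component GMM on the per-sample training losses, as in Arazo et al. [18]: the posterior of the low-loss component is used as the probability that a label is clean. The sketch below illustrates that generic recipe under these assumptions (the helper name and threshold are ours), not necessarily the exact procedure of this paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def clean_probabilities(per_sample_losses):
    """Estimate P(label is clean) from per-sample training losses.

    Fits a 2-component GMM on the min-max-normalized losses; the component
    with the lower mean is assumed to model correctly labeled samples.
    """
    losses = np.asarray(per_sample_losses, dtype=np.float64).reshape(-1, 1)
    losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-12)

    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)

    clean_component = int(np.argmin(gmm.means_.ravel()))
    return gmm.predict_proba(losses)[:, clean_component]


# Example: samples with small losses receive high clean probabilities.
probs = clean_probabilities([0.05, 0.10, 2.3, 0.07, 1.9])
clean_mask = probs > 0.5  # hard selection of likely-clean samples
weights = probs           # or soft per-sample weights for the loss
```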
4.2. Weighted Supervised Contrastive Learning
5. Experiments
5.1. Datasets
5.2. Settings
6. Results
6.1. Impact of Contrastive Pre-Training
6.2. Sensitivity to the Hyperparameters
6.3. Impact of the Fine-Tuning Phase
7. Discussion and Limits of the Framework
8. Conclusions
Supplementary Materials
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Sample Availability
Abbreviations
Abbreviation | Meaning
---|---
NLL | Noisy-label learning
CE | Cross Entropy
NFL | Normalized Focal Loss
RCE | Reversed Cross Entropy
ELR | Early Learning Regularization
GMM | Gaussian Mixture Model
Appendix A. Description of the Datasets
Data Set | Train | Test | Size | # Classes |
---|---|---|---|---|
CIFAR10 | 50 K | 10 K | 32 × 32 | 10 |
CIFAR100 | 50 K | 10 K | 32 × 32 | 100 |
Clothing1M | 56 K | 5 K | 128 × 128 | 14 |
mini Webvision | 66 K | 2.5 K | 128 × 128 | 50 |
Appendix B. Detailed Settings of the Experiments
Phase | Parameter | C10/C100 | Webvision | Clothing1M
---|---|---|---|---
Repre. | Batch | 512 | 512 | 512
 | Opti. | Adam | Adam | Adam
 | l.r. | | |
 | w.d. | | |
 | Epochs | 500 | 500 | 500
 | Proj. dim. | 128 | 512 | 512
Classi. | Batch | 256 | 256 | 256
 | Opti. | SGD | SGD | SGD
 | l.r. | 0.01/0.1 | 0.4 | 0.01
 | w.d. | | |
 | Epochs | 200 | 200 | 200
Appendix C. Ablation Study
Appendix C.1. Contrastive Learning with a Momentum Encoder
Loss | SimCLR | MoCo | MoCo fine-tune
---|---|---|---
CE | 12.4 | 12.0 | 49.0
ELR | 45.3 | 38.8 | 42.3
NFL+RCE | 50.2 | 26.3 | 47.0
Appendix C.2. Sensitivity to the Learning Rate
Appendix C.3. Impact of the Classifier Architecture
Type | Ratio | Loss | CIFAR10 L | CIFAR10 M | CIFAR100 L | CIFAR100 M
---|---|---|---|---|---|---
Sym | 0.2 | ce | 91.7 | 87.7 | 58.6 | 56.5
 | | elr | 92.9 | 93.0 | 66.4 | 67.4
 | | nfl_rce | 93.2 | 92.7 | 69.7 | 68.8
 | 0.4 | ce | 90.6 | 78.0 | 44.2 | 41.9
 | | elr | 92.1 | 92.0 | 60.8 | 62.0
 | | nfl_rce | 92.1 | 91.4 | 67.0 | 66.3
 | 0.6 | ce | 88.1 | 59.2 | 28.9 | 26.8
 | | elr | 89.7 | 90.4 | 54.0 | 55.7
 | | nfl_rce | 90.2 | 88.1 | 63.7 | 61.8
 | 0.8 | ce | 72.6 | 27.3 | 14.1 | 12.4
 | | elr | 82.0 | 84.8 | 41.6 | 45.3
 | | nfl_rce | 78.9 | 59.9 | 54.2 | 50.2
Asym | 0.2 | ce | 91.6 | 87.9 | 60.1 | 57.8
 | | elr | 92.7 | 92.4 | 69.3 | 70.2
 | | nfl_rce | 92.5 | 91.5 | 69.1 | 68.4
 | 0.3 | ce | 90.2 | 83.9 | 52.3 | 50.4
 | | elr | 90.6 | 91.7 | 68.5 | 69.3
 | | nfl_rce | 91.2 | 89.9 | 68.0 | 63.5
 | 0.4 | ce | 84.7 | 77.8 | 43.7 | 42.4
 | | elr | 68.4 | 89.5 | 65.5 | 67.6
 | | nfl_rce | 62.6 | 82.4 | 63.0 | 47.8
Appendix D. Dynamic Bootstrapping with Mixup
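For reference, dynamic bootstrapping with mixup in the spirit of Arazo et al. [18] combines mixup [51] with targets softly corrected by the model's own predictions. The following is a hedged sketch under our own assumptions (the function name, Beta(α, α) mixing, and the use of per-sample clean probabilities as bootstrapping weights); it is not claimed to be the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def mixup_bootstrap_loss(model, x, noisy_targets, clean_probs, alpha=1.0):
    """Mixup combined with a soft bootstrapping of the targets.

    noisy_targets: (N, C) one-hot (possibly noisy) labels.
    clean_probs:   (N,) estimated probability that each label is correct;
                   unreliable labels lean more on the model's own prediction.
    """
    # Bootstrapped targets: convex combination of the given label and the
    # model's current prediction, weighted by the per-sample clean probability.
    with torch.no_grad():
        pred = F.softmax(model(x), dim=1)
    w = clean_probs.unsqueeze(1)
    targets = w * noisy_targets + (1.0 - w) * pred

    # Mixup: interpolate inputs and (bootstrapped) targets with the same lambda.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0), device=x.device)
    x_mix = lam * x + (1.0 - lam) * x[perm]
    t_mix = lam * targets + (1.0 - lam) * targets[perm]

    # Soft-target cross entropy on the mixed batch.
    log_probs = F.log_softmax(model(x_mix), dim=1)
    return -(t_mix * log_probs).sum(dim=1).mean()
```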
Appendix E. Classification Warmup
Appendix F. Execution Time Analysis
Method | C10 80% S | C10 40% A | C100 80% S | C100 40% A
---|---|---|---|---
Ours (Pre-t) | 2.36 | 2.53 | 2.40 | 2.32
Ours (Fine-tune) | 3.42 | 3.63 | 4.31 | 4.36
Taks | 0.53 | 1.04 | 0.52 | 0.98
Co-teach+ | 2.00 | 2.00 | 2.00 | 2.01
JoCoR | 1.73 | 1.74 | 1.72 | 1.74
Appendix G. An Attempt to Prevent Overfitting with Early Stopping
References
1. Mahajan, D.; Girshick, R.; Ramanathan, V.; He, K.; Paluri, M.; Li, Y.; Bharambe, A.; van der Maaten, L. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 181–196.
2. Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv 2016, arXiv:1611.03530.
3. Patrini, G.; Rozza, A.; Krishna Menon, A.; Nock, R.; Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1944–1952.
4. Goldberger, J.; Ben-Reuven, E. Training Deep Neural-Networks Using a Noise Adaptation Layer. ICLR 2017. Available online: https://openreview.net/forum?id=H12GRgcxg (accessed on 15 June 2020).
5. Xia, X.; Liu, T.; Wang, N.; Han, B.; Gong, C.; Niu, G.; Sugiyama, M. Are anchor points really indispensable in label-noise learning? In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 6838–6849.
6. Hendrycks, D.; Mazeika, M.; Wilson, D.; Gimpel, K. Using trusted data to train deep networks on labels corrupted by severe noise. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 10456–10465.
7. Jiang, L.; Zhou, Z.; Leung, T.; Li, L.J.; Fei-Fei, L. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 2304–2313.
8. Han, B.; Yao, Q.; Yu, X.; Niu, G.; Xu, M.; Hu, W.; Tsang, I.; Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 8527–8537.
9. Li, J.; Socher, R.; Hoi, S.C. DivideMix: Learning with Noisy Labels as Semi-supervised Learning. In Proceedings of the International Conference on Learning Representations, Virtual Event, 26 April–1 May 2020.
10. Zhang, Z.; Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv. Neural Inf. Process. Syst. 2018, 31, 8778–8788.
11. Wang, Y.; Ma, X.; Chen, Z.; Luo, Y.; Yi, J.; Bailey, J. Symmetric cross entropy for robust learning with noisy labels. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 322–330.
12. Ma, X.; Huang, H.; Wang, Y.; Romano, S.; Erfani, S.; Bailey, J. Normalized loss functions for deep learning with noisy labels. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 6543–6553.
13. Liu, S.; Niles-Weed, J.; Razavian, N.; Fernandez-Granda, C. Early-Learning Regularization Prevents Memorization of Noisy Labels. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Volume 33.
14. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709.
15. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event, 14–19 June 2020; pp. 9729–9738.
16. Song, H.; Kim, M.; Park, D.; Lee, J.G. Learning from noisy labels with deep neural networks: A survey. arXiv 2020, arXiv:2007.08199.
17. Le-Khac, P.H.; Healy, G.; Smeaton, A.F. Contrastive representation learning: A framework and review. IEEE Access 2020, 8, 193907–193934.
18. Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N.; McGuinness, K. Unsupervised label noise modeling and loss correction. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 312–321.
19. Song, H.; Kim, M.; Lee, J.G. Selfie: Refurbishing unclean samples for robust deep learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 5907–5915.
20. Arpit, D.; Jastrzebski, S.; Ballas, N.; Krueger, D.; Bengio, E.; Kanwal, M.S.; Maharaj, T.; Fischer, A.; Courville, A.; Bengio, Y.; et al. A closer look at memorization in deep networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 233–242.
21. Nguyen, D.T.; Mummadi, C.K.; Ngo, T.P.N.; Nguyen, T.H.P.; Beggel, L.; Brox, T. SELF: Learning to Filter Noisy Labels with Self-Ensembling. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
22. Wang, Z.; Jiang, J.; Han, B.; Feng, L.; An, B.; Niu, G.; Long, G. SemiNLL: A Framework of Noisy-Label Learning by Semi-Supervised Learning. arXiv 2020, arXiv:2012.00925.
23. Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C.A. Mixmatch: A holistic approach to semi-supervised learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 5049–5059.
24. Ortego, D.; Arazo, E.; Albert, P.; O’Connor, N.E.; McGuinness, K. Multi-Objective Interpolation Training for Robustness to Label Noise. arXiv 2020, arXiv:2012.04462.
25. Zhang, H.; Yao, Q. Decoupling Representation and Classifier for Noisy Label Learning. arXiv 2020, arXiv:2011.08145.
26. Li, J.; Xiong, C.; Hoi, S.C. MoPro: Webly Supervised Learning with Momentum Prototypes. arXiv 2020, arXiv:2009.07995.
27. Henaff, O. Data-efficient image recognition with contrastive predictive coding. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 4182–4192.
28. Misra, I.; Maaten, L.V.D. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event, 14–19 June 2020; pp. 6707–6717.
29. Kalantidis, Y.; Sariyildiz, M.B.; Pion, N.; Weinzaepfel, P.; Larlus, D. Hard negative mixing for contrastive learning. arXiv 2020, arXiv:2010.01028.
30. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap Your Own Latent—A New Approach to Self-Supervised Learning. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 21271–21284.
31. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved baselines with momentum contrastive learning. arXiv 2020, arXiv:2003.04297.
32. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. In Proceedings of the Thirty-Fourth Conference on Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020.
33. Ghosh, A.; Kumar, H.; Sastry, P. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
34. Ouahabi, A.; Taleb-Ahmed, A. Deep learning for real-time semantic segmentation: Application in ultrasound imaging. Pattern Recognit. Lett. 2021, 144, 27–34.
35. Zadeh, S.G.; Schmid, M. Bias in cross-entropy-based training of deep survival networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020.
36. Falcon, W.; Cho, K. A framework for contrastive self-supervised learning and designing a new approach. arXiv 2020, arXiv:2009.00104.
37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 770–778.
38. Wu, Z.; Xiong, Y.; Yu, S.X.; Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 3733–3742.
39. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Volume 33.
40. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, University of Toronto, Toronto, ON, Canada, 2009.
41. Li, W.; Wang, L.; Li, W.; Agustsson, E.; Van Gool, L. Webvision database: Visual learning and understanding from web data. arXiv 2017, arXiv:1708.02862.
42. Xiao, T.; Xia, T.; Yang, Y.; Huang, C.; Wang, X. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 2691–2699.
43. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035.
44. Song, H.; Mitsuo, N.; Uchida, S.; Suehiro, D. No Regret Sample Selection with Noisy Labels. arXiv 2020, arXiv:2003.03179.
45. Yu, X.; Han, B.; Yao, J.; Niu, G.; Tsang, I.; Sugiyama, M. How does disagreement help generalization against label corruption? In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7164–7173.
46. Wei, H.; Feng, L.; Chen, X.; An, B. Combating noisy labels by agreement: A joint training method with co-regularization. arXiv 2020, arXiv:2003.02752.
47. Bianco, S.; Cadene, R.; Celona, L.; Napoletano, P. Benchmark analysis of representative deep neural network architectures. IEEE Access 2018, 6, 64270–64277.
48. Kamabattula, S.R.; Devarajan, V.; Namazi, B.; Sankaranarayanan, G. Identifying Training Stop Point with Noisy Labeled Data. arXiv 2020, arXiv:2012.13435.
49. Kornblith, S.; Norouzi, M.; Lee, H.; Hinton, G. Similarity of neural network representations revisited. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3519–3529.
50. Mitrovic, J.; McWilliams, B.; Rey, M. Less can be more in contrastive learning. In Proceedings of the “I Can’t Believe It’s Not Better!” NeurIPS 2020 Workshop, Virtual Event, 6–14 December 2020.
51. Zhang, H.; Cissé, M.; Dauphin, Y.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
Type | Ratio | Loss | CIFAR10 Base | CIFAR10 Pre-t. | CIFAR100 Base | CIFAR100 Pre-t.
---|---|---|---|---|---|---
Sym | 0.2 | ce | 77.2 | 87.7 | 55.6 | 56.5
 | | elr | 90.3 | 93.0 | 64.1 | 67.4
 | | nfl+rce | 91.0 | 92.7 | 66.6 | 68.8
 | 0.4 | ce | 58.2 | 78.0 | 39.9 | 41.9
 | | elr | 82.3 | 92.0 | 56.9 | 62.0
 | | nfl+rce | 87.0 | 91.4 | 60.2 | 66.3
 | 0.6 | ce | 35.2 | 59.2 | 21.8 | 26.8
 | | elr | 64.2 | 90.4 | 40.6 | 55.7
 | | nfl+rce | 80.2 | 88.1 | 47.0 | 61.8
 | 0.8 | ce | 17.0 | 27.3 | 7.80 | 12.4
 | | elr | 18.3 | 84.8 | 16.2 | 45.3
 | | nfl+rce | 42.8 | 59.9 | 20.1 | 50.2
Asym | 0.2 | ce | 84.0 | 87.9 | 59.0 | 57.8
 | | elr | 91.8 | 92.4 | 70.3 | 70.2
 | | nfl+rce | 90.2 | 91.5 | 63.9 | 68.4
 | 0.3 | ce | 79.2 | 83.9 | 50.6 | 50.4
 | | elr | 89.6 | 91.7 | 69.8 | 69.3
 | | nfl+rce | 86.7 | 89.9 | 53.5 | 63.5
 | 0.4 | ce | 75.3 | 77.8 | 41.8 | 42.4
 | | elr | 72.3 | 89.5 | 67.6 | 67.6
 | | nfl+rce | 80.0 | 82.4 | 40.6 | 47.8
Method | C10 80% S | C10 40% A | C100 80% S | C100 40% A
---|---|---|---|---
Ours (ELR) | 84.8 | 89.5 | 45.3 | 67.6
ELR [13] | 73.9 | 91.1 | 29.7 | 73.2
Taks [44] | 40.2 | 73.4 | 16.0 | 35.2
Co-teach+ [45] | 23.5 | 68.5 | 14.0 | 34.3
DivideMix [9] | 92.9 | 93.4 | 59.6 | 72.1
SELF [21] | 69.9 | 89.1 | 42.1 | 53.8
JoCoR [46] | 25.5 | 76.1 | 12.9 | 32.3
Loss | Webvision Base | Webvision Pre-t. | Webvision Fine-tune | Clothing1M Base | Clothing1M Pre-t. | Clothing1M Fine-tune
---|---|---|---|---|---|---
ce | 51.8 | 57.1 | 58.4 | 54.8 | 59.1 | 61.5
elr | 53.0 | 58.1 | 59.0 | 57.4 | 60.8 | 60.4
nfl+rce | 49.9 | 54.8 | 58.2 | 57.4 | 59.4 | 60.1