DOI: 10.5555/3294771.3294867

Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples

Published: 04 December 2017

Abstract

Self-paced learning and hard example mining re-weight training instances to improve learning accuracy. This paper presents two improved alternatives based on lightweight estimates of sample uncertainty in stochastic gradient descent (SGD): the variance in predicted probability of the correct class across iterations of mini-batch SGD, and the proximity of the correct class probability to the decision threshold. Extensive experimental results on six datasets show that our methods reliably improve accuracy in various network architectures, including additional gains on top of other popular training techniques, such as residual learning, momentum, ADAM, batch normalization, dropout, and distillation.
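
The abstract describes two lightweight per-sample uncertainty signals used to re-weight training instances during SGD. The sketch below is only illustrative, assuming a simple sliding-window tracker and linear weighting; the class name UncertaintyTracker, the smoothing constant, and the normalization are hypothetical and not taken from the paper. It tracks each sample's correct-class probability across recent mini-batch iterations and turns either the spread of that history or its distance from the decision threshold into a per-sample loss weight.

# Minimal, illustrative sketch (assumed names and constants, not the paper's
# exact formulation): per-sample weights derived from (a) the variance of the
# correct-class probability over recent SGD iterations, or (b) the proximity
# of that probability to the decision threshold.
import numpy as np
from collections import defaultdict, deque


class UncertaintyTracker:
    """Tracks a sliding window of correct-class probabilities per sample."""

    def __init__(self, window=10, smoothing=0.05):
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.smoothing = smoothing  # keeps every sample at a nonzero weight

    def update(self, sample_ids, correct_class_probs):
        for i, p in zip(sample_ids, correct_class_probs):
            self.history[i].append(float(p))

    def variance_weights(self, sample_ids):
        # Variant (a): emphasize samples whose prediction fluctuates across iterations.
        w = np.array([np.std(self.history[i]) if len(self.history[i]) > 1 else 1.0
                      for i in sample_ids]) + self.smoothing
        return w * len(sample_ids) / w.sum()  # normalize to mean weight 1

    def threshold_weights(self, sample_ids, threshold=0.5):
        # Variant (b): emphasize samples whose mean correct-class probability
        # lies close to the decision threshold.
        w = np.array([1.0 - abs(np.mean(self.history[i]) - threshold)
                      if len(self.history[i]) > 0 else 1.0
                      for i in sample_ids]) + self.smoothing
        return w * len(sample_ids) / w.sum()


# Hypothetical usage inside a mini-batch training loop:
#   probs = model_probs[np.arange(len(y_batch)), y_batch]  # p(correct class)
#   tracker.update(batch_ids, probs)
#   weights = tracker.variance_weights(batch_ids)
#   loss = np.mean(weights * per_sample_losses)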

    Published In

    NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems, December 2017, 7104 pages

    Publisher

    Curran Associates Inc., Red Hook, NY, United States

    Publication History

    Published: 04 December 2017

    Cited By

    • (2023) FedCSS: Joint Client-and-Sample Selection for Hard Sample-Aware Noise-Robust Federated Learning. Proceedings of the ACM on Management of Data, 1(3):1-24. DOI: 10.1145/3617332
    • (2020) Early-learning regularization prevents memorization of noisy labels. Proceedings of the 34th International Conference on Neural Information Processing Systems, 20331-20342. DOI: 10.5555/3495724.3497431
    • (2019) AutoAssist. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 5998-6008. DOI: 10.5555/3454287.3454826
    • (2019) Abstract reasoning with distracting features. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 5842-5853. DOI: 10.5555/3454287.3454812
    • (2019) Submodular batch selection for training deep neural networks. Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2677-2683. DOI: 10.5555/3367243.3367412
    • (2018) Training deep models faster with robust, approximate importance sampling. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 7276-7286. DOI: 10.5555/3327757.3327829
    • (2018) Uncertainty sampling is preconditioned stochastic gradient descent on zero-one loss. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 6955-6964. DOI: 10.5555/3327757.3327799
