DOI: 10.5555/3294771.3294867

Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples

Published: 04 December 2017

Abstract

Self-paced learning and hard example mining re-weight training instances to improve learning accuracy. This paper presents two improved alternatives based on lightweight estimates of sample uncertainty in stochastic gradient descent (SGD): the variance in predicted probability of the correct class across iterations of mini-batch SGD, and the proximity of the correct class probability to the decision threshold. Extensive experimental results on six datasets show that our methods reliably improve accuracy in various network architectures, including additional gains on top of other popular training techniques, such as residual learning, momentum, ADAM, batch normalization, dropout, and distillation.
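
The abstract describes two lightweight per-sample uncertainty signals used to re-weight training instances during SGD. The sketch below is only illustrative, assuming a simple sliding-window tracker and linear weighting; the class name UncertaintyTracker, the smoothing constant, and the normalization are hypothetical and not taken from the paper. It tracks each sample's correct-class probability across recent mini-batch iterations and turns either the spread of that history or its distance from the decision threshold into a per-sample loss weight.

# Minimal, illustrative sketch (assumed names and constants, not the paper's
# exact formulation): per-sample weights derived from (a) the variance of the
# correct-class probability over recent SGD iterations, or (b) the proximity
# of that probability to the decision threshold.
import numpy as np
from collections import defaultdict, deque


class UncertaintyTracker:
    """Tracks a sliding window of correct-class probabilities per sample."""

    def __init__(self, window=10, smoothing=0.05):
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.smoothing = smoothing  # keeps every sample at a nonzero weight

    def update(self, sample_ids, correct_class_probs):
        for i, p in zip(sample_ids, correct_class_probs):
            self.history[i].append(float(p))

    def variance_weights(self, sample_ids):
        # Variant (a): emphasize samples whose prediction fluctuates across iterations.
        w = np.array([np.std(self.history[i]) if len(self.history[i]) > 1 else 1.0
                      for i in sample_ids]) + self.smoothing
        return w * len(sample_ids) / w.sum()  # normalize to mean weight 1

    def threshold_weights(self, sample_ids, threshold=0.5):
        # Variant (b): emphasize samples whose mean correct-class probability
        # lies close to the decision threshold.
        w = np.array([1.0 - abs(np.mean(self.history[i]) - threshold)
                      if len(self.history[i]) > 0 else 1.0
                      for i in sample_ids]) + self.smoothing
        return w * len(sample_ids) / w.sum()


# Hypothetical usage inside a mini-batch training loop:
#   probs = model_probs[np.arange(len(y_batch)), y_batch]  # p(correct class)
#   tracker.update(batch_ids, probs)
#   weights = tracker.variance_weights(batch_ids)
#   loss = np.mean(weights * per_sample_losses)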

    Published In

    NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems, December 2017, 7104 pages

    Publisher

    Curran Associates Inc., Red Hook, NY, United States

    Publication History

    Published: 04 December 2017

    Cited By

    • (2023) FedCSS: Joint Client-and-Sample Selection for Hard Sample-Aware Noise-Robust Federated Learning. Proceedings of the ACM on Management of Data, 1(3):1-24. DOI: 10.1145/3617332
    • (2020) Early-learning regularization prevents memorization of noisy labels. Proceedings of the 34th International Conference on Neural Information Processing Systems, 20331-20342. DOI: 10.5555/3495724.3497431
    • (2019) AutoAssist. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 5998-6008. DOI: 10.5555/3454287.3454826
    • (2019) Abstract reasoning with distracting features. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 5842-5853. DOI: 10.5555/3454287.3454812
    • (2019) Submodular batch selection for training deep neural networks. Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2677-2683. DOI: 10.5555/3367243.3367412
    • (2018) Training deep models faster with robust, approximate importance sampling. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 7276-7286. DOI: 10.5555/3327757.3327829
    • (2018) Uncertainty sampling is preconditioned stochastic gradient descent on zero-one loss. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 6955-6964. DOI: 10.5555/3327757.3327799
