DOI: 10.1145/3295500.3356137

Large-batch training for LSTM and beyond

Published: 17 November 2019

Abstract

Large-batch training approaches have enabled researchers to utilize distributed processing and greatly accelerate deep neural network training. However, there are three problems in current large-batch research: (1) Although RNN approaches like LSTM have been widely used in many applications, current large-batch research is principally focused on CNNs. (2) Even for CNNs, there is no automated technique for extending the batch size beyond 8K. (3) To keep the variance in the gradient expectation constant, theory suggests that a Sqrt Scaling scheme should be used in large-batch training; unfortunately, there have been few successful applications of it in practice. In this paper, we propose the Dynamic Adaptive-Tuning Engine (DATE) for better large-batch training. DATE achieves a 5.3x average speedup over the baselines for four LSTM-based applications on the same hardware. We finish ImageNet training with ResNet-50 in two minutes on 1024 v3 TPUs (76.7% top-1 accuracy), the fastest result as of June 2019.
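
As a point of reference for item (3), the Sqrt Scaling scheme sets the learning rate proportional to the square root of the batch-size ratio, so that the variance of each parameter update stays roughly constant as the batch grows. Below is a minimal Python sketch of that rule; the function name and the baseline values (learning rate 0.1 at batch size 256) are illustrative assumptions, not settings taken from the paper.

    # Sqrt Scaling sketch: illustrative only; the name and baseline values are
    # assumptions, not the paper's settings.
    import math

    def sqrt_scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
        # Averaging over B samples shrinks gradient variance by a factor of B,
        # so scaling the step by sqrt(batch / base_batch) keeps the variance of
        # each parameter update roughly constant as the mini-batch grows.
        return base_lr * math.sqrt(batch / base_batch)

    # Example: a baseline tuned with lr = 0.1 at batch size 256, scaled to 8K.
    print(sqrt_scaled_lr(0.1, 256, 8192))  # 0.1 * sqrt(32) ~= 0.566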

    Published In

    SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2019
    1921 pages
    ISBN: 9781450362290
    DOI: 10.1145/3295500

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. distributed computing
    2. large-batch training
    3. neural networks

    Qualifiers

    • Research-article

    Conference

    SC '19

    Acceptance Rates

    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


    Cited By

    • (2024) Semi-Supervised Training for (Pre-Stack) Seismic Data Analysis. Applied Sciences 14:10 (4175). DOI: 10.3390/app14104175. Online publication date: 15-May-2024.
    • (2023) The innovative model based on artificial intelligence algorithms to predict recurrence risk of patients with postoperative breast cancer. Frontiers in Oncology 13. DOI: 10.3389/fonc.2023.1117420. Online publication date: 7-Mar-2023.
    • (2023) Communication Optimization Algorithms for Distributed Deep Learning Systems: A Survey. IEEE Transactions on Parallel and Distributed Systems 34:12 (3294-3308). DOI: 10.1109/TPDS.2023.3323282. Online publication date: Dec-2023.
    • (2023) Metaheuristic optimization of data preparation and machine learning hyperparameters for prediction of dynamic methane production. Bioresource Technology (128604). DOI: 10.1016/j.biortech.2023.128604. Online publication date: Jan-2023.
    • (2023) Early-Stage Detection and Classification of Breast Neoplasm Stages Using OGRU-LSTM-BiRNN and Multivariate Data Analysis. Journal of The Institution of Engineers (India): Series B 104:3 (659-678). DOI: 10.1007/s40031-023-00882-3. Online publication date: 3-May-2023.
    • (2022) Scalable training of graph convolutional neural networks for fast and accurate predictions of HOMO-LUMO gap in molecules. Journal of Cheminformatics 14:1. DOI: 10.1186/s13321-022-00652-1. Online publication date: 17-Oct-2022.
    • (2022) Trust Region Method Using K-FAC in Multi-Agent Reinforcement Learning. Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence (1-7). DOI: 10.1145/3579654.3579702. Online publication date: 23-Dec-2022.
    • (2022) Exploring learning rate scaling rules for distributed ML training on transient resources. Proceedings of the 3rd International Workshop on Distributed Machine Learning (1-8). DOI: 10.1145/3565010.3569067. Online publication date: 9-Dec-2022.
    • (2022) Near-optimal sparse allreduce for distributed deep learning. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (135-149). DOI: 10.1145/3503221.3508399. Online publication date: 2-Apr-2022.
    • (2022) Not All Layers Are Equal: A Layer-Wise Adaptive Approach Toward Large-Scale DNN Training. Proceedings of the ACM Web Conference 2022 (1851-1859). DOI: 10.1145/3485447.3511989. Online publication date: 25-Apr-2022.
