DOI: 10.1145/3485447.3511989
Research Article · Public Access

Not All Layers Are Equal: A Layer-Wise Adaptive Approach Toward Large-Scale DNN Training

Published: 25 April 2022

Abstract

Large-batch training with data parallelism is a widely adopted approach to efficiently training a large deep neural network (DNN) model. Large-batch training, however, often suffers from model quality degradation because it performs fewer iterations. To alleviate this problem, learning rate (lr) scaling methods are generally applied; they increase the learning rate so that each iteration makes a larger update. Unfortunately, we observe that large-batch training with state-of-the-art lr scaling methods still often degrades the model quality once the batch size crosses a specific limit, rendering such lr methods less useful. Regarding this phenomenon, we hypothesize that existing lr scaling methods overlook the subtle but important differences across “layers” in training, which results in the degradation of the overall model quality. Based on this hypothesis, we propose a novel approach (LENA) to learning rate scaling for large-scale DNN training, employing: (1) layer-wise adaptive lr scaling, which adjusts the lr of each layer individually, and (2) layer-wise state-aware warm-up, which tracks the training state of each layer and finishes its warm-up automatically. A comprehensive evaluation across a range of batch sizes demonstrates that LENA achieves the target accuracy (i.e., the accuracy of single-worker training): (1) within the fewest iterations across different batch sizes (up to 45.2% fewer iterations and 44.7% shorter time than the existing state-of-the-art method), and (2) for very large batch sizes, surpassing the limits of all baselines.
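To make the layer-wise idea concrete, the sketch below illustrates the general notion of layer-wise adaptive learning-rate scaling using a LARS-style trust ratio (per-layer weight norm over gradient norm). This is a minimal illustration of the concept under those assumptions, not the paper's LENA algorithm; the names (layer_wise_lr, base_lr, eps, max_scale) and the clipping threshold are introduced only for this example. The paper's second component, the layer-wise state-aware warm-up that ends each layer's warm-up automatically, is omitted here.

import numpy as np

def layer_wise_lr(weights, grads, base_lr=0.1, eps=1e-8, max_scale=10.0):
    """Hypothetical sketch: compute a separate learning rate per layer.

    Each layer's base learning rate is scaled by a trust ratio ||w|| / ||g||
    (in the spirit of LARS-style methods), so layers whose gradients are large
    relative to their weights take smaller steps, and vice versa.
    """
    lrs = []
    for w, g in zip(weights, grads):
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        trust_ratio = w_norm / (g_norm + eps) if w_norm > 0 else 1.0
        lrs.append(base_lr * min(trust_ratio, max_scale))  # clip extreme scales
    return lrs

# Toy usage: two layers with very different weight/gradient magnitudes
# receive very different effective learning rates.
weights = [np.ones(100), 0.01 * np.ones(10)]
grads = [0.001 * np.ones(100), np.ones(10)]
print(layer_wise_lr(weights, grads))  # approx. [1.0 (clipped), 0.001]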




      Published In

      WWW '22: Proceedings of the ACM Web Conference 2022
      April 2022
      3764 pages
ISBN: 9781450390965
DOI: 10.1145/3485447
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 April 2022


      Author Tags

      1. large batch training
      2. layer-wise approach
      3. learning rate scaling

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

WWW '22: The ACM Web Conference 2022
      April 25 - 29, 2022
      Virtual Event, Lyon, France

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

      Article Metrics

• Downloads (Last 12 months): 298
• Downloads (Last 6 weeks): 35
      Reflects downloads up to 30 Aug 2024

Cited By
• (2024) Gsyn: Reducing Staleness and Communication Waiting via Grouping-based Synchronization for Distributed Deep Learning. IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, 1731-1740. DOI: 10.1109/INFOCOM52122.2024.10621250. Online publication date: 20-May-2024.
• (2024) Over-the-Air Federated Learning with Phase Noise: Analysis and Countermeasures. 2024 58th Annual Conference on Information Sciences and Systems (CISS), 1-6. DOI: 10.1109/CISS59072.2024.10480215. Online publication date: 13-Mar-2024.
• (2024) Forward layer-wise learning of convolutional neural networks through separation index maximizing. Scientific Reports 14:1. DOI: 10.1038/s41598-024-59176-3. Online publication date: 13-Apr-2024.
• (2023) KHAN: Knowledge-Aware Hierarchical Attention Networks for Accurate Political Stance Prediction. Proceedings of the ACM Web Conference 2023, 1572-1583. DOI: 10.1145/3543507.3583300. Online publication date: 30-Apr-2023.
