SHAT: A Novel Asynchronous Training Algorithm That Provides Fast Model Convergence in Distributed Deep Learning
Abstract
1. Introduction
- Identifying the limitations of existing asynchronous distributed training in updating the local models of workers, i.e., failing to consider both the scale of distributed training and the heterogeneity among workers.
- Proposing a novel asynchronous PS-based training algorithm, named SHAT, that addresses the limitations of existing asynchronous distributed training.
- Conducting a comprehensive evaluation that verifies the effectiveness of SHAT in terms of convergence rate and robustness under heterogeneous environments.
2. Related Work
2.1. PS-Based Distributed Training
2.2. P2P-Based Distributed Training
2.3. Other Techniques for Distributed Training
3. The Proposed Method: SHAT
3.1. Asynchronous Distributed Training
3.2. Update Strategy for Model Convergence
3.3. Algorithm and Performance Consideration
Algorithm 1: Training processes of a worker and the PS in SHAT
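Since the body of Algorithm 1 is not reproduced here, the following is a minimal, illustrative sketch of a generic asynchronous PS worker/server loop in Python, not the paper's exact procedure. The class and function names (ParameterServer, pull, push, merge_ratio) and the toy least-squares objective are assumptions made for illustration only.

```python
import threading
import numpy as np

# Toy asynchronous parameter-server (PS) loop: one thread-safe PS store and
# several worker threads minimizing a least-squares objective on local data.
# This is an illustrative sketch, not the paper's Algorithm 1.

class ParameterServer:
    def __init__(self, dim):
        self.w = np.zeros(dim)
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad, lr):
        # Apply a worker's gradient as soon as it arrives (asynchronous update).
        with self.lock:
            self.w -= lr * grad

def worker(ps, X, y, steps, lr, merge_ratio=0.5):
    w_local = ps.pull()
    for _ in range(steps):
        grad = X.T @ (X @ w_local - y) / len(y)   # local gradient
        ps.push(grad, lr)                          # send update to the PS
        w_global = ps.pull()                       # receive the current global model
        # Blend the global and local models instead of overwriting the local one
        # (a placeholder for a smarter update strategy such as the one SHAT proposes).
        w_local = merge_ratio * w_global + (1 - merge_ratio) * w_local

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=5)
    ps = ParameterServer(dim=5)
    threads = []
    for _ in range(4):                             # four concurrent workers
        X = rng.normal(size=(256, 5))
        y = X @ w_true
        t = threading.Thread(target=worker, args=(ps, X, y, 200, 0.05))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    print("distance to optimum:", np.linalg.norm(ps.pull() - w_true))
```

The point the sketch captures is that each worker pushes updates and pulls the global model at its own pace, and that how the pulled global model is folded into the local model (here a simple blend) is exactly the design choice that SHAT's update strategy targets.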
4. Experimental Validation
- EQ1. Does the model trained by SHAT achieve higher accuracy than those trained by the state-of-the-art methods?
- EQ2. How robust is SHAT to various heterogeneous environments in terms of model convergence?
- EQ3. How effective are parameter sharding techniques on SHAT in terms of training performance? (See the sketch following this list.)
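For EQ3, the sketch below illustrates the general idea behind parameter sharding, i.e., partitioning the model's parameters across several PS shards so that workers can push and pull each shard independently (and, in a real system, in parallel over separate connections). The shard layout and names (ShardedPS, make_shards) are illustrative assumptions, not the implementation evaluated in the paper.

```python
import numpy as np

# Illustrative parameter sharding: a flat parameter vector is partitioned
# across several PS shards, each of which can be pushed/pulled independently.

def make_shards(num_params, num_shards):
    """Split indices [0, num_params) into num_shards contiguous ranges."""
    bounds = np.linspace(0, num_params, num_shards + 1, dtype=int)
    return [(bounds[i], bounds[i + 1]) for i in range(num_shards)]

class ShardedPS:
    def __init__(self, num_params, num_shards):
        self.shards = make_shards(num_params, num_shards)
        self.store = [np.zeros(hi - lo) for lo, hi in self.shards]

    def push(self, shard_id, grad, lr=0.1):
        self.store[shard_id] -= lr * grad          # update one shard only

    def pull(self, shard_id):
        return self.store[shard_id].copy()

    def full_model(self):
        return np.concatenate(self.store)

if __name__ == "__main__":
    ps = ShardedPS(num_params=10, num_shards=3)
    grad = np.ones(10)
    for sid, (lo, hi) in enumerate(ps.shards):
        ps.push(sid, grad[lo:hi])                  # each shard updated independently
    print(ps.full_model())                          # all entries equal to -0.1
```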
4.1. Experimental Setup
4.1.1. Datasets and Models
4.1.2. Competing Methods
- ASP (i.e., HOGWILD) [6]: this baseline is a widely recognized data-parallel distributed training method, where each worker updates its local model by replacing it with the global model.
- ENSEMBLE: this baseline is an ensemble learning method, where each worker trains its local model purely locally, without receiving the global model.
- : this baseline is a variation of SHAT with for each worker.
- SHAT: this method is the original version of SHAT, i.e., asynchronous PS-based training with the proposed update strategy applied at each worker (the update rules implied by these baselines are sketched below).
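The schematic below contrasts the local-model update rules implied by the baseline descriptions above: replacement with the global model (ASP), purely local training (ENSEMBLE), and a blend of local and global models. The blending coefficient alpha is a placeholder for illustration, not the actual coefficient used by SHAT or its variant.

```python
import numpy as np

# Schematic local-update rules implied by the baseline descriptions above.
# w_local / w_global are a worker's and the PS's parameters; `alpha` in the
# blended rule is a placeholder, not the formula used by SHAT.

def update_asp(w_local, w_global):
    """ASP/HOGWILD-style: the local model is simply replaced by the global model."""
    return w_global.copy()

def update_ensemble(w_local, w_global):
    """ENSEMBLE-style: the global model is ignored; training stays purely local."""
    return w_local

def update_blend(w_local, w_global, alpha=0.5):
    """A generic blend of local and global models (placeholder for SHAT's rule)."""
    return alpha * w_global + (1.0 - alpha) * w_local

if __name__ == "__main__":
    w_local = np.array([1.0, 1.0])
    w_global = np.array([0.0, 2.0])
    print(update_asp(w_local, w_global))       # [0. 2.]
    print(update_ensemble(w_local, w_global))  # [1. 1.]
    print(update_blend(w_local, w_global))     # [0.5 1.5]
```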
4.1.3. Hyperparameter Settings
4.1.4. System Configuration
4.2. EQ1. Model Accuracy and Convergence
4.3. EQ2. Robustness to Heterogeneous Environments
4.4. EQ3. Effects of Parameter Sharding
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
- Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
- Recht, B.; Re, C.; Wright, S.; Niu, F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Granada, Spain, 12–14 December 2011; pp. 693–701. [Google Scholar]
- Ho, Q.; Cipar, J.; Cui, H.; Lee, S.; Kim, J.K.; Gibbons, P.B.; Gibson, G.A.; Ganger, G.; Xing, E.P. More effective distributed ml via a stale synchronous parallel parameter server. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 1223–1231. [Google Scholar]
- Zhang, S.; Choromanska, A.E.; LeCun, Y. Deep learning with elastic averaging SGD. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 685–693. [Google Scholar]
- Zhao, X.; An, A.; Liu, J.; Chen, B.X. Dynamic stale synchronous parallel distributed training for deep learning. In Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS), Dallas, TX, USA, 7–9 July 2019; pp. 1507–1517. [Google Scholar]
- Zhang, H.; Zheng, Z.; Xu, S.; Dai, W.; Ho, Q.; Liang, X.; Hu, Z.; Wei, J.; Xie, P.; Xing, E.P. Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters. In Proceedings of the USENIX Annual Technical Conference (ATC), Santa Clara, CA, USA, 12–14 July 2017; pp. 181–193. [Google Scholar]
- Ko, Y.; Choi, K.; Jei, H.; Lee, D.; Kim, S.W. ALADDIN: Asymmetric Centralized Training for Distributed Deep Learning. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), Gold Coast, QLD, Australia, 1–5 November 2021. [Google Scholar]
- Gerbessiotis, A.V.; Valiant, L.G. Direct bulk-synchronous parallel algorithms. J. Parallel Distrib. Comput. 1994, 22, 251–267. [Google Scholar] [CrossRef]
- Zhao, X.; Papagelis, M.; An, A.; Chen, B.X.; Liu, J.; Hu, Y. Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning. In Proceedings of the IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 1504–1509. [Google Scholar]
- Ko, Y.; Choi, K.; Seo, J.; Kim, S.W. An In-Depth Analysis of Distributed Training of Deep Neural Networks. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), Portland, OR, USA, 17–21 May 2021; pp. 994–1003. [Google Scholar]
- Narayanan, D.; Harlap, A.; Phanishayee, A.; Seshadri, V.; Devanur, N.R.; Ganger, G.R.; Gibbons, P.B.; Zaharia, M. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), Huntsville, ON, Canada, 27–30 October 2019; pp. 1–15. [Google Scholar]
- Yu, C.; Tang, H.; Renggli, C.; Kassing, S.; Singla, A.; Alistarh, D.; Zhang, C.; Liu, J. Distributed learning over unreliable networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 7202–7212. [Google Scholar]
- Eshraghi, N.; Liang, B. Distributed Online Optimization over a Heterogeneous Network with Any-Batch Mirror Descent. In Proceedings of the International Conference on Machine Learning (ICML), Online, 13–18 July 2020; pp. 2933–2942. [Google Scholar]
- Zhou, Z.; Mertikopoulos, P.; Bambos, N.; Glynn, P.; Ye, Y.; Li, L.J.; Fei-Fei, L. Distributed Asynchronous Optimization with Unbounded Delays: How Slow Can You Go? In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 5970–5979. [Google Scholar]
- Wang, S.; Pi, A.; Zhou, X. Scalable distributed dl training: Batching communication and computation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 5289–5296. [Google Scholar]
- Blot, M.; Picard, D.; Cord, M.; Thome, N. Gossip training for deep learning. In Proceedings of the Advances in Neural Information Processing Systems Workshop on Optimization for Machine Learning, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
- Lian, X.; Zhang, C.; Zhang, H.; Hsieh, C.J.; Zhang, W.; Liu, J. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5330–5340. [Google Scholar]
- Lian, X.; Zhang, W.; Zhang, C.; Liu, J. Asynchronous decentralized parallel stochastic gradient descent. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 3049–3058. [Google Scholar]
- Assran, M.; Loizou, N.; Ballas, N.; Rabbat, M. Stochastic gradient push for distributed deep learning. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 344–353. [Google Scholar]
- Li, Y.; Yu, M.; Li, S.; Avestimehr, S.; Kim, N.S.; Schwing, A. Pipe-SGD: A decentralized pipelined SGD framework for distributed deep net training. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 3–8 December 2018; pp. 8056–8067. [Google Scholar]
- Huang, Y.; Cheng, Y.; Bapna, A.; Firat, O.; Chen, D.; Chen, M.; Lee, H.; Ngiam, J.; Le, Q.V.; Wu, Y.; et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 103–112. [Google Scholar]
- Johnson, T.; Agrawal, P.; Gu, H.; Guestrin, C. AdaScale SGD: A User-Friendly Algorithm for Distributed Training. In Proceedings of the International Conference on Machine Learning (ICML), Online, 13–18 July 2020; pp. 4911–4920. [Google Scholar]
- Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv 2017, arXiv:1706.02677. [Google Scholar]
- Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Mao, M.; Senior, A.; Tucker, P.; Yang, K.; Le, Q.V.; et al. Large scale distributed deep networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1223–1231. [Google Scholar]
- Jiang, J.; Cui, B.; Zhang, C.; Yu, L. Heterogeneity-aware distributed parameter servers. In Proceedings of the ACM International Conference on Management of Data, (SIGMOD), Chicago, IL, USA, 14–19 May 2017; pp. 463–478. [Google Scholar]
- Sergeev, A.; Balso, M.D. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv 2018, arXiv:1802.05799. [Google Scholar]
- Kempe, D.; Dobra, A.; Gehrke, J. Gossip-based computation of aggregate information. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), Cambridge, MA, USA, 11–14 October 2003; pp. 482–491. [Google Scholar]
- Wang, J.; Joshi, G. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv 2018, arXiv:1808.07576. [Google Scholar]
- Nadiradze, G.; Markov, I.; Chatterjee, B.; Kungurtsev, V.; Alistarh, D. Elastic Consistency: A Practical Consistency Model for Distributed Stochastic Gradient Descent. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Conference, 2–9 February 2021. [Google Scholar]
- Chilimbi, T.; Suzue, Y.; Apacible, J.; Kalyanaraman, K. Project adam: Building an efficient and scalable deep learning training system. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), Broomfield, CO, USA, 6–8 October 2014; pp. 571–582. [Google Scholar]
- You, Y.; Gitman, I.; Ginsburg, B. Large batch training of convolutional networks. arXiv 2017, arXiv:1708.03888. [Google Scholar]
- You, Y.; Li, J.; Reddi, S.; Hseu, J.; Kumar, S.; Bhojanapalli, S.; Song, X.; Demmel, J.; Keutzer, K.; Hsieh, C.J. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv 2019, arXiv:1904.00962. [Google Scholar]
- You, Y.; Hseu, J.; Ying, C.; Demmel, J.; Keutzer, K.; Hsieh, C.J. Large-batch training for LSTM and beyond. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Denver, CO, USA, 17–19 November 2019; pp. 1–16. [Google Scholar]
- Huo, Z.; Gu, B.; Huang, H. Large batch training does not need warmup. arXiv 2020, arXiv:2002.01576. [Google Scholar]
- Smith, S.L.; Kindermans, P.J.; Ying, C.; Le, Q.V. Don’t Decay the Learning Rate, Increase the Batch Size. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Cui, H.; Zhang, H.; Ganger, G.R.; Gibbons, P.B.; Xing, E.P. Geeps: Scalable deep learning on distributed gpus with a gpu-specialized parameter server. In Proceedings of the European Conference on Computer Systems (EUROSYS), London, UK, 18–21 April 2016; p. 4. [Google Scholar]
| | PS-Based | P2P-Based |
|---|---|---|
| Synchronous | Bulk Synchronous Parallel (BSP) [12] | AllReduce-SGD (AR-SGD) [27] |
| Asynchronous | Asynchronous Parallel (ASP) [6] | Gossip-SGD (GoSGD) [20] |
| | Stale Synchronous Parallel (SSP) [7] | Decentralized Parallel SGD (D-PSGD) [21] |
| | Elastic Averaging SGD (EASGD) [8] | Asynchronous D-PSGD (AD-PSGD) [22] |
| | Dynamic SSP (DSSP) [9] | Stochastic Gradient Push (SGP) [23] |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).