Adaptive synchronous strategy for distributed machine learning

Published: 29 December 2022
    Abstract

    In distributed machine learning training, bulk synchronous parallel (BSP) and asynchronous parallel (ASP) are the two main synchronization methods for gradient aggregation. However, BSP suffers from longer training time due to the "stragglers" problem, while ASP sacrifices accuracy due to the "gradient staleness" problem. In this article, we propose a distributed training paradigm on the parameter server framework, called adaptive synchronous strategy (A2S), which improves on the BSP and ASP paradigms by adaptively adopting different parallel training schemes for workers with different training speeds. Based on the staleness between the fastest and slowest workers, A2S adaptively adds a relaxed synchronous barrier for fast workers to alleviate gradient staleness, using a differentiated weighting gradient aggregation method to reduce the impact of gradients from slower workers. Simultaneously, A2S adopts ASP training for slow workers to eliminate stragglers. Hence, A2S not only mitigates the "gradient staleness" and "stragglers" problems, but also obtains convergence stability and synchronization gain through synchronous and asynchronous parallelism, respectively. In particular, we theoretically prove the convergence of A2S by deriving its regret bound. Moreover, experimental results show that A2S improves accuracy by up to 2.64% and accelerates training by up to 41% compared with the state-of-the-art synchronization methods BSP, ASP, stale synchronous parallel (SSP), dynamic SSP, and Sync-switch.
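
    To make the adaptive mechanism concrete, the following is a minimal sketch of the routing logic an A2S-style parameter server could use. Everything in it is an illustrative assumption rather than the paper's implementation: the ParameterServer class, the staleness_threshold and barrier_size values, the 1/(1 + staleness) weighting, and the toy least-squares workload are all placeholders.

# A2S-style adaptive synchronization sketch (illustrative only).
# Assumed names: ParameterServer, staleness_threshold, barrier_size, the
# least-squares workload; none of these come from the paper itself.
import numpy as np


class ParameterServer:
    def __init__(self, dim, lr=0.05, staleness_threshold=3, barrier_size=2):
        self.w = np.zeros(dim)        # global model parameters
        self.lr = lr
        self.threshold = staleness_threshold
        self.barrier_size = barrier_size
        self.clock = {}               # worker id -> number of pushes so far
        self.sync_buffer = []         # (staleness, gradient) held at the barrier

    def _staleness(self, wid):
        # How far this worker has run ahead of the slowest worker.
        return self.clock[wid] - min(self.clock.values())

    def push(self, wid, grad):
        """Receive one gradient and route it to the sync or async path."""
        self.clock[wid] = self.clock.get(wid, 0) + 1
        s = self._staleness(wid)
        if s <= self.threshold:
            # Slow or on-pace worker: apply immediately, ASP-style,
            # so stragglers never block anyone.
            self.w -= self.lr * grad
        else:
            # Fast worker: hold its gradient at a relaxed synchronous barrier.
            self.sync_buffer.append((s, grad))
            if len(self.sync_buffer) >= self.barrier_size:
                self._sync_aggregate()

    def _sync_aggregate(self):
        # Differentiated weighting: staler gradients get smaller weights.
        weights = np.array([1.0 / (1.0 + s) for s, _ in self.sync_buffer])
        weights /= weights.sum()
        aggregated = sum(wt * g for wt, (_, g) in zip(weights, self.sync_buffer))
        self.w -= self.lr * aggregated
        self.sync_buffer.clear()

    def pull(self):
        return self.w.copy()


def local_gradient(w, X, y):
    # Least-squares gradient on one worker's data shard.
    return 2.0 * X.T @ (X @ w - y) / len(y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    ps = ParameterServer(dim=2)

    # Three workers with different speeds (pushes per round) and data shards.
    shards = []
    for _ in range(3):
        X = rng.normal(size=(64, 2))
        shards.append((X, X @ true_w + 0.01 * rng.normal(size=64)))
    speeds = [3, 2, 1]

    for _ in range(200):
        for wid, ((X, y), speed) in enumerate(zip(shards, speeds)):
            for _ in range(speed):
                ps.push(wid, local_gradient(ps.pull(), X, y))

    print("learned:", np.round(ps.pull(), 2), "true:", true_w)

    The sketch keeps the two paths described in the abstract visible: gradients from workers that have run far ahead of the slowest one wait at a small weighted barrier, while gradients from slow or on-pace workers are applied immediately so stragglers never hold anyone back. The actual weighting function and barrier condition used by A2S are defined in the full text.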

    Published In

    International Journal of Intelligent Systems, Volume 37, Issue 12
    December 2022
    2488 pages
    ISSN: 0884-8173
    DOI: 10.1002/int.v37.12

    Publisher

    John Wiley and Sons Ltd.

    United Kingdom

    Author Tags

    1. ASP
    2. BSP
    3. distributed training
    4. parameter server
    5. synchronous strategy

    Qualifiers

    • Research-article
