Adaptive synchronous strategy for distributed machine learning

Published: 29 December 2022
    Abstract

    In distributed machine learning training, bulk synchronous parallel (BSP) and asynchronous parallel (ASP) are the two main synchronization methods for gradient aggregation. However, BSP suffers from longer training time due to the "stragglers" problem, while ASP sacrifices accuracy due to the "gradient staleness" problem. In this article, we propose a distributed training paradigm on the parameter server framework, called adaptive synchronous strategy (A2S), which improves on the BSP and ASP paradigms by adaptively adopting different parallel training schemes for workers with different training speeds. Based on the staleness between the fastest and slowest workers, A2S adaptively adds a relaxed synchronous barrier for fast workers to alleviate gradient staleness, using a differentiated weighting gradient aggregation method to reduce the impact of gradients from slower workers. Simultaneously, A2S adopts ASP training for slow workers to eliminate stragglers. Hence, A2S not only mitigates the "gradient staleness" and "stragglers" problems, but also obtains convergence stability and synchronization gain through synchronous and asynchronous parallelism, respectively. In particular, we theoretically prove the convergence of A2S by deriving its regret bound. Moreover, experimental results show that A2S improves accuracy by up to 2.64% and accelerates training by up to 41% compared with the state-of-the-art synchronization methods BSP, ASP, stale synchronous parallel (SSP), dynamic SSP, and Sync-switch.
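
    To make the adaptive mechanism concrete, the following is a minimal sketch of the routing logic an A2S-style parameter server could use. Everything in it is an illustrative assumption rather than the paper's implementation: the ParameterServer class, the staleness_threshold and barrier_size values, the 1/(1 + staleness) weighting, and the toy least-squares workload are all placeholders.

# A2S-style adaptive synchronization sketch (illustrative only).
# Assumed names: ParameterServer, staleness_threshold, barrier_size, the
# least-squares workload; none of these come from the paper itself.
import numpy as np


class ParameterServer:
    def __init__(self, dim, lr=0.05, staleness_threshold=3, barrier_size=2):
        self.w = np.zeros(dim)        # global model parameters
        self.lr = lr
        self.threshold = staleness_threshold
        self.barrier_size = barrier_size
        self.clock = {}               # worker id -> number of pushes so far
        self.sync_buffer = []         # (staleness, gradient) held at the barrier

    def _staleness(self, wid):
        # How far this worker has run ahead of the slowest worker.
        return self.clock[wid] - min(self.clock.values())

    def push(self, wid, grad):
        """Receive one gradient and route it to the sync or async path."""
        self.clock[wid] = self.clock.get(wid, 0) + 1
        s = self._staleness(wid)
        if s <= self.threshold:
            # Slow or on-pace worker: apply immediately, ASP-style,
            # so stragglers never block anyone.
            self.w -= self.lr * grad
        else:
            # Fast worker: hold its gradient at a relaxed synchronous barrier.
            self.sync_buffer.append((s, grad))
            if len(self.sync_buffer) >= self.barrier_size:
                self._sync_aggregate()

    def _sync_aggregate(self):
        # Differentiated weighting: staler gradients get smaller weights.
        weights = np.array([1.0 / (1.0 + s) for s, _ in self.sync_buffer])
        weights /= weights.sum()
        aggregated = sum(wt * g for wt, (_, g) in zip(weights, self.sync_buffer))
        self.w -= self.lr * aggregated
        self.sync_buffer.clear()

    def pull(self):
        return self.w.copy()


def local_gradient(w, X, y):
    # Least-squares gradient on one worker's data shard.
    return 2.0 * X.T @ (X @ w - y) / len(y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    ps = ParameterServer(dim=2)

    # Three workers with different speeds (pushes per round) and data shards.
    shards = []
    for _ in range(3):
        X = rng.normal(size=(64, 2))
        shards.append((X, X @ true_w + 0.01 * rng.normal(size=64)))
    speeds = [3, 2, 1]

    for _ in range(200):
        for wid, ((X, y), speed) in enumerate(zip(shards, speeds)):
            for _ in range(speed):
                ps.push(wid, local_gradient(ps.pull(), X, y))

    print("learned:", np.round(ps.pull(), 2), "true:", true_w)

    The sketch keeps the two paths described in the abstract visible: gradients from workers that have run far ahead of the slowest one wait at a small weighted barrier, while gradients from slow or on-pace workers are applied immediately so stragglers never hold anyone back. The actual weighting function and barrier condition used by A2S are defined in the full text.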

    Published In

    International Journal of Intelligent Systems, Volume 37, Issue 12
    December 2022
    2488 pages
    ISSN: 0884-8173
    DOI: 10.1002/int.v37.12

    Publisher

    John Wiley and Sons Ltd.

    United Kingdom

    Author Tags

    1. ASP
    2. BSP
    3. distributed training
    4. parameter server
    5. synchronous strategy

    Qualifiers

    • Research-article
