DOI: 10.1145/3419111.3421307

Elastic parameter server load distribution in deep learning clusters

Published: 12 October 2020

Abstract

In distributed DNN training, parameter servers (PS) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention, or computation interference. Few existing studies have investigated efficient parameter (i.e., load) distribution among PSs. We observe significant training inefficiency with the current parameter assignment in representative machine learning frameworks (e.g., MXNet, TensorFlow), and substantial potential for training acceleration with better PS load distribution. We design PSLD, a dynamic parameter server load distribution scheme, to mitigate PS straggler issues and accelerate distributed model training in the PS architecture. A carefully designed exploitation-exploration method scales parameter servers in and out and adjusts parameter distribution among PSs on the fly. We also design an elastic PS scaling module that carries out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to a 2.86x speed-up in model training with PSLD for different ML models under various straggler settings.
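
The abstract's exploitation-exploration idea can be read as a periodic control loop: measure each PS's serving load, shift parameter shards away from the straggler, and occasionally probe an alternative placement to keep load estimates fresh. The Python sketch below illustrates a single rebalancing step under assumed data structures; the function and variable names are hypothetical (not PSLD's or ps-lite's interface), and the scale-in/scale-out decision that PSLD also makes is omitted.

import random

# Minimal sketch of an epsilon-greedy (exploitation-exploration) rebalancing
# step. All names here are illustrative assumptions, not PSLD's actual API.
def rebalance(shard_assignment, serve_time, epsilon=0.1):
    """Move one parameter shard off the straggler PS.

    shard_assignment: dict mapping PS id -> list of (shard_id, shard_bytes)
    serve_time:       dict mapping PS id -> measured push/pull serving time (s)
                      in the last monitoring interval
    """
    servers = list(shard_assignment)
    straggler = max(servers, key=lambda s: serve_time[s])
    others = [s for s in servers if s != straggler]

    if random.random() < epsilon:
        # Exploration: pick a random destination to refresh load estimates.
        target = random.choice(others)
    else:
        # Exploitation: send load to the currently fastest server.
        target = min(others, key=lambda s: serve_time[s])

    if shard_assignment[straggler]:
        # Move the largest shard; a real system would bound the bytes moved
        # per step to limit interruption to training.
        shard = max(shard_assignment[straggler], key=lambda x: x[1])
        shard_assignment[straggler].remove(shard)
        shard_assignment[target].append(shard)
    return shard_assignment

# Example: three parameter servers, "b" is the straggler.
assignment = {"a": [(0, 40)], "b": [(1, 60), (2, 30)], "c": [(3, 20)]}
times = {"a": 0.8, "b": 2.1, "c": 0.6}
print(rebalance(assignment, times))

In PSLD, such redistribution decisions are carried out by the elastic scaling module, so that moving parameters (or adding and removing servers) causes little interruption to the ongoing training process.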

Supplementary Material

MP4 File (p507-chen-presentation.mp4)

    Information

    Published In

    SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing
    October 2020
    535 pages
    ISBN:9781450381376
    DOI:10.1145/3419111
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 October 2020


    Qualifiers

    • Research-article

    Funding Sources

    • Hong Kong RGC

    Conference

    SoCC '20: ACM Symposium on Cloud Computing
    October 19 - 21, 2020
    Virtual Event, USA

    Acceptance Rates

    SoCC '20 paper acceptance rate: 35 of 143 submissions (24%)
    Overall acceptance rate: 169 of 722 submissions (23%)


    Article Metrics

    • Downloads (last 12 months): 76
    • Downloads (last 6 weeks): 5
    Reflects downloads up to 17 Feb 2025

    Cited By

    • (2025) A Survey on Parameter Server Architecture: Approaches for Optimizing Distributed Centralized Learning. IEEE Access, 13, 30993-31015. DOI: 10.1109/ACCESS.2025.3535085. Online publication date: 2025.
    • (2024) Crux: GPU-Efficient Communication Scheduling for Deep Learning Training. Proceedings of the ACM SIGCOMM 2024 Conference, 1-15. DOI: 10.1145/3651890.3672239. Online publication date: 4-Aug-2024.
    • (2024) SoCFlow: Efficient and Scalable DNN Training on SoC-Clustered Edge Servers. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 368-385. DOI: 10.1145/3617232.3624847. Online publication date: 27-Apr-2024.
    • (2024) Efficient Inter-Datacenter AllReduce With Multiple Trees. IEEE Transactions on Network Science and Engineering, 11(5), 4793-4806. DOI: 10.1109/TNSE.2024.3419030. Online publication date: Sep-2024.
    • (2024) Efficient, Scalable, and Sustainable DNN Training on SoC-Clustered Edge Servers. IEEE Transactions on Mobile Computing, 23(12), 14344-14360. DOI: 10.1109/TMC.2024.3442430. Online publication date: Dec-2024.
    • (2024) ASHL: An Adaptive Multi-Stage Distributed Deep Learning Training Scheme for Heterogeneous Environments. IEEE Transactions on Computers, 73(1), 30-43. DOI: 10.1109/TC.2023.3315847. Online publication date: Jan-2024.
    • (2024) Accelerating Containerized Machine Learning Workloads. NOMS 2024-2024 IEEE Network Operations and Management Symposium, 1-10. DOI: 10.1109/NOMS59830.2024.10575188. Online publication date: 6-May-2024.
    • (2024) AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 5238-5251. DOI: 10.1109/ICDE60146.2024.00394. Online publication date: 13-May-2024.
    • (2023) Interference-Aware Opportunistic Job Placement for Shared Distributed Deep Learning Clusters. SSRN Electronic Journal. DOI: 10.2139/ssrn.4351162. Online publication date: 2023.
    • (2023) Elastic deep learning through resilient collective operations. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 44-50. DOI: 10.1145/3624062.3626080. Online publication date: 12-Nov-2023.
    • Show More Cited By
