DOI: 10.1145/3419111.3421307

Elastic parameter server load distribution in deep learning clusters

Published: 12 October 2020

Abstract

In distributed DNN training, parameter servers (PS) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention, or computation interference. Few existing studies have investigated efficient parameter (i.e., load) distribution among PSs. We observe significant training inefficiency with the current parameter assignment in representative machine learning frameworks (e.g., MXNet, TensorFlow), and substantial potential for training acceleration with better PS load distribution. We design PSLD, a dynamic parameter server load distribution scheme, to mitigate PS straggler issues and accelerate distributed model training in the PS architecture. A carefully designed exploitation-exploration method scales parameter servers in and out and adjusts parameter distribution among PSs on the fly. We also design an elastic PS scaling module that carries out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to a 2.86x speed-up in model training with PSLD for different ML models under various straggler settings.
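
The abstract's exploitation-exploration idea can be read as a periodic control loop: measure each PS's serving load, shift parameter shards away from the straggler, and occasionally probe an alternative placement to keep load estimates fresh. The Python sketch below illustrates a single rebalancing step under assumed data structures; the function and variable names are hypothetical (not PSLD's or ps-lite's interface), and the scale-in/scale-out decision that PSLD also makes is omitted.

import random

# Minimal sketch of an epsilon-greedy (exploitation-exploration) rebalancing
# step. All names here are illustrative assumptions, not PSLD's actual API.
def rebalance(shard_assignment, serve_time, epsilon=0.1):
    """Move one parameter shard off the straggler PS.

    shard_assignment: dict mapping PS id -> list of (shard_id, shard_bytes)
    serve_time:       dict mapping PS id -> measured push/pull serving time (s)
                      in the last monitoring interval
    """
    servers = list(shard_assignment)
    straggler = max(servers, key=lambda s: serve_time[s])
    others = [s for s in servers if s != straggler]

    if random.random() < epsilon:
        # Exploration: pick a random destination to refresh load estimates.
        target = random.choice(others)
    else:
        # Exploitation: send load to the currently fastest server.
        target = min(others, key=lambda s: serve_time[s])

    if shard_assignment[straggler]:
        # Move the largest shard; a real system would bound the bytes moved
        # per step to limit interruption to training.
        shard = max(shard_assignment[straggler], key=lambda x: x[1])
        shard_assignment[straggler].remove(shard)
        shard_assignment[target].append(shard)
    return shard_assignment

# Example: three parameter servers, "b" is the straggler.
assignment = {"a": [(0, 40)], "b": [(1, 60), (2, 30)], "c": [(3, 20)]}
times = {"a": 0.8, "b": 2.1, "c": 0.6}
print(rebalance(assignment, times))

In PSLD, such redistribution decisions are carried out by the elastic scaling module, so that moving parameters (or adding and removing servers) causes little interruption to the ongoing training process.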

Supplementary Material

MP4 File (p507-chen-presentation.mp4)

    Information

    Published In

    SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing
    October 2020
    535 pages
    ISBN:9781450381376
    DOI:10.1145/3419111
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 October 2020


    Qualifiers

    • Research-article

    Funding Sources

    • Hong Kong RGC

    Conference

    SoCC '20: ACM Symposium on Cloud Computing
    October 19 - 21, 2020
    Virtual Event, USA

    Acceptance Rates

    SoCC '20 paper acceptance rate: 35 of 143 submissions (24%)
    Overall acceptance rate: 169 of 722 submissions (23%)


    Article Metrics

    • Downloads (last 12 months): 76
    • Downloads (last 6 weeks): 5
    Reflects downloads up to 17 Feb 2025

    Cited By

    • (2025) A Survey on Parameter Server Architecture: Approaches for Optimizing Distributed Centralized Learning. IEEE Access, 13, 30993-31015. DOI: 10.1109/ACCESS.2025.3535085. Online publication date: 2025.
    • (2024) Crux: GPU-Efficient Communication Scheduling for Deep Learning Training. Proceedings of the ACM SIGCOMM 2024 Conference, 1-15. DOI: 10.1145/3651890.3672239. Online publication date: 4-Aug-2024.
    • (2024) SoCFlow: Efficient and Scalable DNN Training on SoC-Clustered Edge Servers. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 368-385. DOI: 10.1145/3617232.3624847. Online publication date: 27-Apr-2024.
    • (2024) Efficient Inter-Datacenter AllReduce With Multiple Trees. IEEE Transactions on Network Science and Engineering, 11(5), 4793-4806. DOI: 10.1109/TNSE.2024.3419030. Online publication date: Sep-2024.
    • (2024) Efficient, Scalable, and Sustainable DNN Training on SoC-Clustered Edge Servers. IEEE Transactions on Mobile Computing, 23(12), 14344-14360. DOI: 10.1109/TMC.2024.3442430. Online publication date: Dec-2024.
    • (2024) ASHL: An Adaptive Multi-Stage Distributed Deep Learning Training Scheme for Heterogeneous Environments. IEEE Transactions on Computers, 73(1), 30-43. DOI: 10.1109/TC.2023.3315847. Online publication date: Jan-2024.
    • (2024) Accelerating Containerized Machine Learning Workloads. NOMS 2024-2024 IEEE Network Operations and Management Symposium, 1-10. DOI: 10.1109/NOMS59830.2024.10575188. Online publication date: 6-May-2024.
    • (2024) AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 5238-5251. DOI: 10.1109/ICDE60146.2024.00394. Online publication date: 13-May-2024.
    • (2023) Interference-Aware Opportunistic Job Placement for Shared Distributed Deep Learning Clusters. SSRN Electronic Journal. DOI: 10.2139/ssrn.4351162. Online publication date: 2023.
    • (2023) Elastic deep learning through resilient collective operations. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 44-50. DOI: 10.1145/3624062.3626080. Online publication date: 12-Nov-2023.
    • Show More Cited By
