DOI: 10.1145/3341301.3359642

A generic communication scheduler for distributed DNN training acceleration

Published: 27 October 2019
Abstract

We present ByteScheduler, a generic communication scheduler for distributed DNN training acceleration. ByteScheduler is based on our principled analysis that partitioning and rearranging tensor transmissions can achieve optimal results in theory and good performance in real-world settings, even with scheduling overhead. To make ByteScheduler work generally across DNN training frameworks, we introduce a unified abstraction and a Dependency Proxy mechanism that enable communication scheduling without breaking the original dependencies in framework engines. We further introduce a Bayesian Optimization approach to auto-tune the tensor partition size and other parameters for different training models under various networking conditions. ByteScheduler currently supports TensorFlow, PyTorch, and MXNet without modifying their source code, and works well with both Parameter Server (PS) and all-reduce architectures for gradient synchronization, using either TCP or RDMA. Our experiments show that ByteScheduler accelerates training across all evaluated system configurations and DNN models, by up to 196% (i.e., 2.96x the original speed).
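The abstract describes the scheduling idea only at a high level. As a rough illustration of the underlying principle, partitioning each gradient tensor into equal-sized chunks and transmitting the chunks in priority order (parameters needed earliest by the next forward pass go first), the Python sketch below uses a simple priority queue. All names here (Chunk, partition, schedule, PARTITION_SIZE) are hypothetical and not ByteScheduler's API; in the real system the partition size is auto-tuned (e.g., via Bayesian Optimization) rather than fixed, and each "send" would be a PS push/pull or all-reduce issued through the framework engine behind a Dependency Proxy.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical fixed partition size (in elements). ByteScheduler reportedly
# auto-tunes this knob per model and network instead of hard-coding it.
PARTITION_SIZE = 4_000_000


@dataclass(order=True)
class Chunk:
    priority: int                      # smaller value = transmitted earlier
    name: str = field(compare=False)   # owning tensor
    start: int = field(compare=False)  # chunk start offset (elements)
    end: int = field(compare=False)    # chunk end offset (elements)


def partition(tensor_name, num_elems, priority, size=PARTITION_SIZE):
    """Split one gradient tensor into equal-sized chunks sharing its priority."""
    return [Chunk(priority, tensor_name, s, min(s + size, num_elems))
            for s in range(0, num_elems, size)]


def schedule(gradients):
    """Transmit chunks strictly in priority order.

    `gradients` is a list of (name, num_elems, layer_index). Layers closer to
    the input get a smaller index, hence higher priority, because the next
    iteration's forward pass consumes their updated parameters first.
    """
    queue = []
    for name, num_elems, layer_idx in gradients:
        for chunk in partition(name, num_elems, priority=layer_idx):
            heapq.heappush(queue, chunk)
    while queue:
        chunk = heapq.heappop(queue)
        # Placeholder for the real transport (PS push/pull or all-reduce).
        print(f"send {chunk.name}[{chunk.start}:{chunk.end}] (prio={chunk.priority})")


if __name__ == "__main__":
    # Gradients become available back-to-front during backpropagation, but
    # partition + priority scheduling lets front layers' chunks jump ahead of
    # back layers' chunks that are still queued.
    schedule([("fc.weight", 10_000_000, 2),
              ("conv2.weight", 6_000_000, 1),
              ("conv1.weight", 3_000_000, 0)])
```

The point of the sketch is the rearrangement itself: without partitioning, a single large back-layer tensor can monopolize the link and delay the front-layer parameters that the next iteration needs first; with partitioning, higher-priority chunks can overtake lower-priority ones at chunk granularity.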




        Published In

        SOSP '19: Proceedings of the 27th ACM Symposium on Operating Systems Principles
        October 2019
        615 pages
ISBN: 9781450368735
DOI: 10.1145/3341301

In-Cooperation

• USENIX Assoc

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. ML frameworks
        2. communication scheduling

        Qualifiers

        • Research-article

        Conference

SOSP '19: ACM SIGOPS 27th Symposium on Operating Systems Principles
October 27 - 30, 2019
Huntsville, Ontario, Canada

        Acceptance Rates

        Overall Acceptance Rate 131 of 716 submissions, 18%



        Cited By

• (2024) Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning. Proceedings of the 53rd International Conference on Parallel Processing, pp. 148-157. DOI: 10.1145/3673038.3673140. Online publication date: 12-Aug-2024
• (2024) AutoPipe: Automatic Configuration of Pipeline Parallelism in Shared GPU Cluster. Proceedings of the 53rd International Conference on Parallel Processing, pp. 443-452. DOI: 10.1145/3673038.3673047. Online publication date: 12-Aug-2024
• (2024) A Fast Machine Learning Framework with Distributed Packet Loss Tolerance. Proceedings of the 2024 2nd Asia Conference on Computer Vision, Image Processing and Pattern Recognition, pp. 1-6. DOI: 10.1145/3663976.3664033. Online publication date: 26-Apr-2024
• (2024) Understanding Communication Characteristics of Distributed Training. Proceedings of the 8th Asia-Pacific Workshop on Networking, pp. 1-8. DOI: 10.1145/3663408.3663409. Online publication date: 3-Aug-2024
• (2024) R-Pingmesh: A Service-Aware RoCE Network Monitoring and Diagnostic System. Proceedings of the ACM SIGCOMM 2024 Conference, pp. 554-567. DOI: 10.1145/3651890.3672264. Online publication date: 4-Aug-2024
• (2024) Crux: GPU-Efficient Communication Scheduling for Deep Learning Training. Proceedings of the ACM SIGCOMM 2024 Conference, pp. 1-15. DOI: 10.1145/3651890.3672239. Online publication date: 4-Aug-2024
• (2024) Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs. Proceedings of the ACM SIGCOMM 2024 Conference, pp. 707-720. DOI: 10.1145/3651890.3672228. Online publication date: 4-Aug-2024
• (2024) ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling. Proceedings of the Nineteenth European Conference on Computer Systems, pp. 236-249. DOI: 10.1145/3627703.3650083. Online publication date: 22-Apr-2024
• (2024) Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications. Proceedings of the Nineteenth European Conference on Computer Systems, pp. 1075-1092. DOI: 10.1145/3627703.3629578. Online publication date: 22-Apr-2024
• (2024) Accelerating Privacy-Preserving Machine Learning With GeniBatch. Proceedings of the Nineteenth European Conference on Computer Systems, pp. 489-504. DOI: 10.1145/3627703.3629563. Online publication date: 22-Apr-2024
