DOI: 10.1145/3341301.3359642

A generic communication scheduler for distributed DNN training acceleration

Published: 27 October 2019
Abstract

We present ByteScheduler, a generic communication scheduler for distributed DNN training acceleration. ByteScheduler is based on our principled analysis that partitioning and rearranging tensor transmissions can achieve optimal results in theory and good performance in real-world settings, even with scheduling overhead. To make ByteScheduler work generally across DNN training frameworks, we introduce a unified abstraction and a Dependency Proxy mechanism that enable communication scheduling without breaking the original dependencies in framework engines. We further introduce a Bayesian Optimization approach to auto-tune the tensor partition size and other parameters for different training models under various networking conditions. ByteScheduler currently supports TensorFlow, PyTorch, and MXNet without modifying their source code, and works well with both Parameter Server (PS) and all-reduce architectures for gradient synchronization, using either TCP or RDMA. Our experiments show that ByteScheduler accelerates training across all evaluated system configurations and DNN models, by up to 196% (i.e., 2.96x the original speed).
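The abstract describes the scheduling idea only at a high level. As a rough illustration of the underlying principle, partitioning each gradient tensor into equal-sized chunks and transmitting the chunks in priority order (parameters needed earliest by the next forward pass go first), the Python sketch below uses a simple priority queue. All names here (Chunk, partition, schedule, PARTITION_SIZE) are hypothetical and not ByteScheduler's API; in the real system the partition size is auto-tuned (e.g., via Bayesian Optimization) rather than fixed, and each "send" would be a PS push/pull or all-reduce issued through the framework engine behind a Dependency Proxy.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical fixed partition size (in elements). ByteScheduler reportedly
# auto-tunes this knob per model and network instead of hard-coding it.
PARTITION_SIZE = 4_000_000


@dataclass(order=True)
class Chunk:
    priority: int                      # smaller value = transmitted earlier
    name: str = field(compare=False)   # owning tensor
    start: int = field(compare=False)  # chunk start offset (elements)
    end: int = field(compare=False)    # chunk end offset (elements)


def partition(tensor_name, num_elems, priority, size=PARTITION_SIZE):
    """Split one gradient tensor into equal-sized chunks sharing its priority."""
    return [Chunk(priority, tensor_name, s, min(s + size, num_elems))
            for s in range(0, num_elems, size)]


def schedule(gradients):
    """Transmit chunks strictly in priority order.

    `gradients` is a list of (name, num_elems, layer_index). Layers closer to
    the input get a smaller index, hence higher priority, because the next
    iteration's forward pass consumes their updated parameters first.
    """
    queue = []
    for name, num_elems, layer_idx in gradients:
        for chunk in partition(name, num_elems, priority=layer_idx):
            heapq.heappush(queue, chunk)
    while queue:
        chunk = heapq.heappop(queue)
        # Placeholder for the real transport (PS push/pull or all-reduce).
        print(f"send {chunk.name}[{chunk.start}:{chunk.end}] (prio={chunk.priority})")


if __name__ == "__main__":
    # Gradients become available back-to-front during backpropagation, but
    # partition + priority scheduling lets front layers' chunks jump ahead of
    # back layers' chunks that are still queued.
    schedule([("fc.weight", 10_000_000, 2),
              ("conv2.weight", 6_000_000, 1),
              ("conv1.weight", 3_000_000, 0)])
```

The point of the sketch is the rearrangement itself: without partitioning, a single large back-layer tensor can monopolize the link and delay the front-layer parameters that the next iteration needs first; with partitioning, higher-priority chunks can overtake lower-priority ones at chunk granularity.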




        Published In

        SOSP '19: Proceedings of the 27th ACM Symposium on Operating Systems Principles
        October 2019
        615 pages
ISBN: 9781450368735
DOI: 10.1145/3341301

In-Cooperation

• USENIX Assoc

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. ML frameworks
        2. communication scheduling

        Qualifiers

        • Research-article

        Conference

SOSP '19: ACM SIGOPS 27th Symposium on Operating Systems Principles
October 27 - 30, 2019
Huntsville, Ontario, Canada

        Acceptance Rates

        Overall Acceptance Rate 131 of 716 submissions, 18%



        Cited By

• (2024) Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning. Proceedings of the 53rd International Conference on Parallel Processing, pp. 148-157. DOI: 10.1145/3673038.3673140. Online publication date: 12-Aug-2024
• (2024) AutoPipe: Automatic Configuration of Pipeline Parallelism in Shared GPU Cluster. Proceedings of the 53rd International Conference on Parallel Processing, pp. 443-452. DOI: 10.1145/3673038.3673047. Online publication date: 12-Aug-2024
• (2024) A Fast Machine Learning Framework with Distributed Packet Loss Tolerance. Proceedings of the 2024 2nd Asia Conference on Computer Vision, Image Processing and Pattern Recognition, pp. 1-6. DOI: 10.1145/3663976.3664033. Online publication date: 26-Apr-2024
• (2024) Understanding Communication Characteristics of Distributed Training. Proceedings of the 8th Asia-Pacific Workshop on Networking, pp. 1-8. DOI: 10.1145/3663408.3663409. Online publication date: 3-Aug-2024
• (2024) R-Pingmesh: A Service-Aware RoCE Network Monitoring and Diagnostic System. Proceedings of the ACM SIGCOMM 2024 Conference, pp. 554-567. DOI: 10.1145/3651890.3672264. Online publication date: 4-Aug-2024
• (2024) Crux: GPU-Efficient Communication Scheduling for Deep Learning Training. Proceedings of the ACM SIGCOMM 2024 Conference, pp. 1-15. DOI: 10.1145/3651890.3672239. Online publication date: 4-Aug-2024
• (2024) Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs. Proceedings of the ACM SIGCOMM 2024 Conference, pp. 707-720. DOI: 10.1145/3651890.3672228. Online publication date: 4-Aug-2024
• (2024) ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling. Proceedings of the Nineteenth European Conference on Computer Systems, pp. 236-249. DOI: 10.1145/3627703.3650083. Online publication date: 22-Apr-2024
• (2024) Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications. Proceedings of the Nineteenth European Conference on Computer Systems, pp. 1075-1092. DOI: 10.1145/3627703.3629578. Online publication date: 22-Apr-2024
• (2024) Accelerating Privacy-Preserving Machine Learning With GeniBatch. Proceedings of the Nineteenth European Conference on Computer Systems, pp. 489-504. DOI: 10.1145/3627703.3629563. Online publication date: 22-Apr-2024
