DOI: 10.1145/3126686.3126749

Efficient Communications in Training Large Scale Neural Networks

Published: 23 October 2017
Abstract

We consider the problem of reducing the cost of the communication required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, such as broadcasts of parameters and reductions for partial gradient aggregation, whose cost for large messages quickly dominates overall execution time and limits parallel scalability. To address this problem, we develop a new technique for collective operations, referred to as Linear Pipelining (LP). It is tuned to the message sizes that arise in BSP-SGD, and works effectively on multi-GPU systems. Theoretically, the cost of LP is invariant to P, where P is the number of GPUs, while the cost of the more conventional Minimum Spanning Tree (MST) grows as O(log P). LP also demonstrates up to 2x higher bandwidth than the Bidirectional Exchange (BE) techniques widely adopted by current MPI implementations. We apply these collectives to BSP-SGD, showing that the proposed implementations reduce communication bottlenecks in practice while preserving the attractive convergence properties of BSP-SGD.
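
To make the pipelining argument concrete, below is a minimal, hypothetical Python sketch (not the paper's implementation; the function name, chunking model, and step counting are assumptions made for illustration). It simulates a linear-pipeline broadcast over a chain of P devices, where the message is split into chunks and each device forwards a chunk to its right neighbour one step after receiving it, and counts the steps until the last device holds the full message.

```python
# Hypothetical sketch (not the authors' code): simulate a linear-pipeline
# broadcast of a message split into chunks across P devices in a chain.
# Device 0 is the root; each device forwards a chunk to its right neighbour
# one step after receiving it, so chunks flow down the chain concurrently
# and the total step count is num_chunks + P - 2.

def linear_pipeline_broadcast_steps(num_chunks: int, num_devices: int) -> int:
    """Count pipeline steps until the last device holds every chunk."""
    # received[d] = set of chunk indices currently held by device d
    received = [set() for _ in range(num_devices)]
    received[0] = set(range(num_chunks))  # root starts with the full message

    steps = 0
    while len(received[-1]) < num_chunks:
        # In one step, every link (d -> d+1) carries at most one chunk.
        transfers = []
        for d in range(num_devices - 1):
            missing = sorted(received[d] - received[d + 1])
            if missing:
                transfers.append((d + 1, missing[0]))
        for dst, chunk in transfers:
            received[dst].add(chunk)
        steps += 1
    return steps

if __name__ == "__main__":
    for p in (2, 4, 8, 16):
        # With 64 chunks, each extra device adds only one step, unlike a
        # tree-based scheme whose per-message latency grows with its depth.
        print(p, linear_pipeline_broadcast_steps(num_chunks=64, num_devices=p))
```

Running the sketch with 64 chunks for P = 2, 4, 8, 16 yields 64, 66, 70, and 78 steps: the marginal cost of an added device is a single chunk transfer, which matches the intuition behind LP's near-invariance to P for large messages, in contrast to MST-style collectives whose latency scales with the O(log P) tree depth.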




    Published In

    Thematic Workshops '17: Proceedings of the on Thematic Workshops of ACM Multimedia 2017
    October 2017
    558 pages
    ISBN:9781450354165
    DOI:10.1145/3126686

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 October 2017


    Author Tags

    1. deep learning system
    2. mpi collectives
    3. neural networks

    Qualifiers

    • Research-article

    Conference

    MM '17
    Sponsor:
    MM '17: ACM Multimedia Conference
    October 23 - 27, 2017
Mountain View, California, USA



    Cited By

    • (2024) SUARA: A scalable universal allreduce communication algorithm for acceleration of parallel deep learning applications. Journal of Parallel and Distributed Computing, 183:104767, January 2024. DOI: 10.1016/j.jpdc.2023.104767
    • (2022) CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning. IEEE/ACM Transactions on Networking, 30(1):148-161, February 2022. DOI: 10.1109/TNET.2021.3109097
    • (2021) Accelerating distributed deep neural network training with pipelined MPI allreduce. Cluster Computing, online 7 August 2021. DOI: 10.1007/s10586-021-03370-9
    • (2020) FFT-based Gradient Sparsification for the Distributed Training of Deep Neural Networks. Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, 113-124, 23 June 2020. DOI: 10.1145/3369583.3392681
    • (2020) Structured pruning of recurrent neural networks through neuron selection. Neural Networks, 123:134-141, March 2020. DOI: 10.1016/j.neunet.2019.11.018
