DOI: 10.1145/3126686.3126749

Efficient Communications in Training Large Scale Neural Networks

Published: 23 October 2017
Abstract

We consider the problem of reducing the cost of the communication required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, such as broadcasts of parameters and reductions for partial gradient aggregation, whose cost for large messages quickly dominates overall execution time and limits parallel scalability. To address this problem, we develop a new technique for collective operations, referred to as Linear Pipelining (LP). It is tuned to the message sizes that arise in BSP-SGD, and works effectively on multi-GPU systems. Theoretically, the cost of LP is invariant to P, where P is the number of GPUs, while the cost of the more conventional Minimum Spanning Tree (MST) grows as O(log P). LP also demonstrates up to 2x higher bandwidth than the Bidirectional Exchange (BE) techniques widely adopted by current MPI implementations. We apply these collectives to BSP-SGD, showing that the proposed implementations reduce communication bottlenecks in practice while preserving the attractive convergence properties of BSP-SGD.
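
To make the pipelining argument concrete, below is a minimal, hypothetical Python sketch (not the paper's implementation; the function name, chunking model, and step counting are assumptions made for illustration). It simulates a linear-pipeline broadcast over a chain of P devices, where the message is split into chunks and each device forwards a chunk to its right neighbour one step after receiving it, and counts the steps until the last device holds the full message.

```python
# Hypothetical sketch (not the authors' code): simulate a linear-pipeline
# broadcast of a message split into chunks across P devices in a chain.
# Device 0 is the root; each device forwards a chunk to its right neighbour
# one step after receiving it, so chunks flow down the chain concurrently
# and the total step count is num_chunks + P - 2.

def linear_pipeline_broadcast_steps(num_chunks: int, num_devices: int) -> int:
    """Count pipeline steps until the last device holds every chunk."""
    # received[d] = set of chunk indices currently held by device d
    received = [set() for _ in range(num_devices)]
    received[0] = set(range(num_chunks))  # root starts with the full message

    steps = 0
    while len(received[-1]) < num_chunks:
        # In one step, every link (d -> d+1) carries at most one chunk.
        transfers = []
        for d in range(num_devices - 1):
            missing = sorted(received[d] - received[d + 1])
            if missing:
                transfers.append((d + 1, missing[0]))
        for dst, chunk in transfers:
            received[dst].add(chunk)
        steps += 1
    return steps

if __name__ == "__main__":
    for p in (2, 4, 8, 16):
        # With 64 chunks, each extra device adds only one step, unlike a
        # tree-based scheme whose per-message latency grows with its depth.
        print(p, linear_pipeline_broadcast_steps(num_chunks=64, num_devices=p))
```

Running the sketch with 64 chunks for P = 2, 4, 8, 16 yields 64, 66, 70, and 78 steps: the marginal cost of an added device is a single chunk transfer, which matches the intuition behind LP's near-invariance to P for large messages, in contrast to MST-style collectives whose latency scales with the O(log P) tree depth.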




    Published In

    Thematic Workshops '17: Proceedings of the on Thematic Workshops of ACM Multimedia 2017
    October 2017
    558 pages
    ISBN:9781450354165
    DOI:10.1145/3126686

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 October 2017


    Author Tags

    1. deep learning system
    2. mpi collectives
    3. neural networks

    Qualifiers

    • Research-article

    Conference

    MM '17
    Sponsor:
    MM '17: ACM Multimedia Conference
    October 23 - 27, 2017
Mountain View, California, USA



    Cited By

    • (2024) SUARA: A scalable universal allreduce communication algorithm for acceleration of parallel deep learning applications. Journal of Parallel and Distributed Computing, 183:104767, January 2024. DOI: 10.1016/j.jpdc.2023.104767
    • (2022) CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning. IEEE/ACM Transactions on Networking, 30(1):148-161, February 2022. DOI: 10.1109/TNET.2021.3109097
    • (2021) Accelerating distributed deep neural network training with pipelined MPI allreduce. Cluster Computing, online 7 August 2021. DOI: 10.1007/s10586-021-03370-9
    • (2020) FFT-based Gradient Sparsification for the Distributed Training of Deep Neural Networks. Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, 113-124, 23 June 2020. DOI: 10.1145/3369583.3392681
    • (2020) Structured pruning of recurrent neural networks through neuron selection. Neural Networks, 123:134-141, March 2020. DOI: 10.1016/j.neunet.2019.11.018
