DOI: 10.1145/3332466.3374528

Taming unbalanced training workloads in deep learning with partial collective operations

Published: 19 February 2020

Abstract

Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experiments are conducted on a variety of neural networks and datasets. The results on load-imbalanced environments show that eager-SGD achieves 2.64× speedup (ResNet-50 on ImageNet) over the asynchronous centralized SGD, and achieves 1.29× speedup (ResNet-50 on ImageNet) and 1.27× speedup (LSTM on UCF101) over the state-of-the-art synchronous decentralized SGDs, without losing accuracy.
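The abstract describes two partial collectives: solo allreduce, where the fastest process triggers the reduction without waiting, and majority allreduce, where at least half of the participants must contribute before the reduction completes. As a rough mental model only (a minimal, single-process Python sketch, not the paper's distributed implementation; the names partial_allreduce, NUM_RANKS, and QUORUM are illustrative assumptions), the code below simulates ranks as threads and completes the reduction once a configurable quorum of gradients has arrived:

    import random
    import threading
    import time

    NUM_RANKS = 4
    QUORUM = (NUM_RANKS + 1) // 2      # "majority": at least half of the ranks must arrive

    _cond = threading.Condition()
    _contributions = []                # gradients that arrived before the quorum was met
    _result = None                     # reduced value, eventually visible to every rank

    def partial_allreduce(grad):
        """Contribute a local gradient; return the reduction once the quorum is met."""
        global _result
        with _cond:
            if _result is None:
                _contributions.append(grad)
                if len(_contributions) >= QUORUM:
                    # Quorum reached: reduce whatever has arrived so far.
                    _result = sum(_contributions)
                    _cond.notify_all()
                else:
                    # Too few contributions yet: wait for another rank to complete the reduction.
                    _cond.wait_for(lambda: _result is not None)
            # Ranks arriving after the reduction has completed simply read the result.
            return _result

    def worker(rank):
        time.sleep(random.uniform(0.0, 0.5))   # simulated load imbalance
        grad = float(rank + 1)                 # stand-in for this rank's local gradient
        print(f"rank {rank}: partial sum = {partial_allreduce(grad)}")

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_RANKS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Setting QUORUM = 1 loosely corresponds to the solo collective, while (NUM_RANKS + 1) // 2 corresponds to the majority collective. Note that eager-SGD does not simply discard the stragglers' gradients; the sketch omits how late contributions are folded into subsequent training steps.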



Published In

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2020
454 pages
ISBN:9781450368186
DOI:10.1145/3332466


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 February 2020


Author Tags

  1. collective operations
  2. distributed deep learning
  3. eager-SGD
  4. stochastic gradient descent
  5. workload imbalance

Qualifiers

  • Research-article

Funding Sources

  • European Research Council (ERC) under the European Union's Horizon 2020 programme, grant agreement DAPP

Conference

PPoPP '20

Acceptance Rates

PPoPP '20 Paper Acceptance Rate: 28 of 121 submissions (23%)
Overall Acceptance Rate: 230 of 1,014 submissions (23%)


Article Metrics

  • Downloads (Last 12 months): 78
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 04 Oct 2024


Cited By

  • (2024) Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates. Proceedings of the Twenty-fifth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, 434-439. DOI: 10.1145/3641512.3690036. Online publication date: 14-Oct-2024.
  • (2024) Efficient Cross-Cloud Partial Reduce With CREW. IEEE Transactions on Parallel and Distributed Systems 35(11), 2224-2238. DOI: 10.1109/TPDS.2024.3460185. Online publication date: Nov-2024.
  • (2024) AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost. IEEE Transactions on Parallel and Distributed Systems 35(8), 1331-1344. DOI: 10.1109/TPDS.2024.3397800. Online publication date: Aug-2024.
  • (2024) Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1-20. DOI: 10.1109/TPAMI.2023.3303431. Online publication date: 2024.
  • (2024) Canary. Future Generation Computer Systems 152(C), 70-82. DOI: 10.1016/j.future.2023.10.010. Online publication date: 4-Mar-2024.
  • (2024) LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster. The Journal of Supercomputing 80(9), 12247-12272. DOI: 10.1007/s11227-023-05886-w. Online publication date: 1-Jun-2024.
  • (2023) ADA-GP: Accelerating DNN Training By Adaptive Gradient Prediction. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 1092-1105. DOI: 10.1145/3613424.3623779. Online publication date: 28-Oct-2023.
  • (2023) Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. DOI: 10.1145/3581784.3607073. Online publication date: 12-Nov-2023.
  • (2023) HPC²lusterScape: Increasing Transparency and Efficiency of Shared High-Performance Computing Clusters for Large-scale AI Models. 2023 IEEE Visualization in Data Science (VDS), 21-29. DOI: 10.1109/VDS60365.2023.00008. Online publication date: 15-Oct-2023.
  • (2023) Communication Optimization Algorithms for Distributed Deep Learning Systems: A Survey. IEEE Transactions on Parallel and Distributed Systems 34(12), 3294-3308. DOI: 10.1109/TPDS.2023.3323282. Online publication date: Dec-2023.
