DOI: 10.1145/3423211.3425675

Fast Training of Deep Learning Models over Multiple GPUs

Published: 11 December 2020

Abstract

This paper proposes FastT, a transparent module that works with the TensorFlow framework to automatically identify a satisfactory deployment and execution order of the operations in a DNN model over multiple GPUs, expediting model training. We propose white-box algorithms that compute these strategies quickly and with little computing resource consumption. Recent studies have instead optimized device placement using reinforcement learning; whereas those approaches learn placements over several hours using large amounts of computing resources, ours finds excellent device placements and execution orders within minutes on the same computing node used for training. We design a set of scheduling algorithms that compute the device placement and execution order for each operation, and an algorithm that splits operations on the critical path to support fine-grained (mixed) data and model parallelism, further improving the training speed of each iteration. Based on extensive testbed experiments, we compare FastT with representative strategies and derive insights into the best strategies for training different types of DNN models.

Published In

Middleware '20: Proceedings of the 21st International Middleware Conference
December 2020
455 pages
ISBN: 9781450381536
DOI: 10.1145/3423211
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Distributed training
  2. Data parallel
  3. Model parallel

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

Middleware '20: 21st International Middleware Conference
December 7-11, 2020
Delft, Netherlands

Acceptance Rates

Overall acceptance rate: 203 of 948 submissions, 21%

Cited By

  • (2024) Cloud-Native Computing: A Survey From the Perspective of Services. Proceedings of the IEEE, 112(1), 12-46. DOI: 10.1109/JPROC.2024.3353855. Online publication date: Jan 2024.
  • (2024) Adaptive partitioning and efficient scheduling for distributed DNN training in heterogeneous IoT environment. Computer Communications, 215, 169-179. DOI: 10.1016/j.comcom.2023.12.034. Online publication date: Feb 2024.
  • (2023) Research on medical applications of large artificial intelligence models (人工智能大模型医学应用研究). SCIENTIA SINICA Vitae. DOI: 10.1360/SSV-2022-0298. Online publication date: 17 Jul 2023.
  • (2023) Mercury: Fast and Optimal Device Placement for Large Deep Learning Models. Proceedings of the 52nd International Conference on Parallel Processing, 412-422. DOI: 10.1145/3605573.3605603. Online publication date: 7 Aug 2023.
  • (2023) An Auto-Parallel Method for Deep Learning Models Based on Genetic Algorithm. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 230-235. DOI: 10.1109/ICPADS60453.2023.00042. Online publication date: 17 Dec 2023.
  • (2022) Adaptive Distributed Parallel Training Method for a Deep Learning Model Based on Dynamic Critical Paths of DAG. Mathematics, 10(24), 4788. DOI: 10.3390/math10244788. Online publication date: 16 Dec 2022.
  • (2022) Trinity: Neural Network Adaptive Distributed Parallel Training Method Based on Reinforcement Learning. Algorithms, 15(4), 108. DOI: 10.3390/a15040108. Online publication date: 24 Mar 2022.
  • (2022) CGX. Proceedings of the 23rd ACM/IFIP International Middleware Conference, 241-254. DOI: 10.1145/3528535.3565248. Online publication date: 7 Nov 2022.
  • (2022) EOP: efficient operator partition for deep learning inference over edge servers. Proceedings of the 18th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 45-57. DOI: 10.1145/3516807.3516820. Online publication date: 25 Feb 2022.
  • (2022) Optimizing DNN Compilation for Distributed Training With Joint OP and Tensor Fusion. IEEE Transactions on Parallel and Distributed Systems, 33(12), 4694-4706. DOI: 10.1109/TPDS.2022.3201531. Online publication date: 1 Dec 2022.