DOI: 10.1145/3423211.3425675

Fast Training of Deep Learning Models over Multiple GPUs

Published: 11 December 2020

Abstract

This paper proposes FastT, a transparent module that works with the TensorFlow framework to automatically identify a satisfactory deployment and execution order of the operations in a DNN model over multiple GPUs, expediting model training. We propose white-box algorithms that compute these strategies quickly and with little computing resource consumption. Recent studies have instead optimized device placement using reinforcement learning; whereas those approaches learn placements over several hours using large amounts of computing resources, ours finds excellent device placements and execution orders within minutes on the same computing node used for training. We design a set of scheduling algorithms that compute the device placement and execution order for each operation, and an algorithm that splits operations on the critical path to support fine-grained (mixed) data and model parallelism, further improving the training speed of each iteration. Based on extensive testbed experiments, we compare FastT with representative strategies and derive insights into the best strategies for training different types of DNN models.

Published In

Middleware '20: Proceedings of the 21st International Middleware Conference
December 2020
455 pages
ISBN: 9781450381536
DOI: 10.1145/3423211
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Distributed training
  2. Data parallel
  3. Model parallel

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

Middleware '20: 21st International Middleware Conference
December 7-11, 2020
Delft, Netherlands

Acceptance Rates

Overall acceptance rate: 203 of 948 submissions, 21%

Cited By

  • (2024) Cloud-Native Computing: A Survey From the Perspective of Services. Proceedings of the IEEE, 112(1), 12-46. DOI: 10.1109/JPROC.2024.3353855. Online publication date: Jan 2024.
  • (2024) Adaptive partitioning and efficient scheduling for distributed DNN training in heterogeneous IoT environment. Computer Communications, 215, 169-179. DOI: 10.1016/j.comcom.2023.12.034. Online publication date: Feb 2024.
  • (2023) Research on medical applications of large artificial intelligence models (人工智能大模型医学应用研究). SCIENTIA SINICA Vitae. DOI: 10.1360/SSV-2022-0298. Online publication date: 17 Jul 2023.
  • (2023) Mercury: Fast and Optimal Device Placement for Large Deep Learning Models. Proceedings of the 52nd International Conference on Parallel Processing, 412-422. DOI: 10.1145/3605573.3605603. Online publication date: 7 Aug 2023.
  • (2023) An Auto-Parallel Method for Deep Learning Models Based on Genetic Algorithm. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 230-235. DOI: 10.1109/ICPADS60453.2023.00042. Online publication date: 17 Dec 2023.
  • (2022) Adaptive Distributed Parallel Training Method for a Deep Learning Model Based on Dynamic Critical Paths of DAG. Mathematics, 10(24), 4788. DOI: 10.3390/math10244788. Online publication date: 16 Dec 2022.
  • (2022) Trinity: Neural Network Adaptive Distributed Parallel Training Method Based on Reinforcement Learning. Algorithms, 15(4), 108. DOI: 10.3390/a15040108. Online publication date: 24 Mar 2022.
  • (2022) CGX. Proceedings of the 23rd ACM/IFIP International Middleware Conference, 241-254. DOI: 10.1145/3528535.3565248. Online publication date: 7 Nov 2022.
  • (2022) EOP: efficient operator partition for deep learning inference over edge servers. Proceedings of the 18th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 45-57. DOI: 10.1145/3516807.3516820. Online publication date: 25 Feb 2022.
  • (2022) Optimizing DNN Compilation for Distributed Training With Joint OP and Tensor Fusion. IEEE Transactions on Parallel and Distributed Systems, 33(12), 4694-4706. DOI: 10.1109/TPDS.2022.3201531. Online publication date: 1 Dec 2022.