Abstract
Since its release, the Tensorflow framework has been widely used in various fields due to its advantages for deep learning. However, it is still at an early stage: its native distributed implementation has difficulty scaling to large models because of low utilization of multiple GPUs and because distributed training can be slower than training on a single machine. Reducing training time through parallelization is therefore of great significance. In view of this, we first provide an in-depth analysis of the implementation principles of Tensorflow and identify the bottlenecks of its native distributed parallel modes. Two optimal algorithms are then designed and implemented, based on the data parallelism and model parallelism modes of Tensorflow. For data parallelism, the proposed algorithm replaces the native linear execution mode with a pipeline execution mode. For model parallelism, the native random partitioning mode is replaced by our proposed novel greedy algorithm. Finally, we build a homogeneous distributed cluster and a heterogeneous distributed cluster to verify the effectiveness of the proposed algorithms. Through a number of comparative experiments, we show that the proposed optimal parallel algorithms effectively reduce model training time by an average of 26.5% (an average 1.5x speedup over the native distributed algorithms) and improve the utilization of the cluster while keeping the same accuracy level as native Tensorflow.
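To make the data-parallel optimization concrete, the sketch below illustrates the general idea of pipeline execution: preparation of the next batch is overlapped with computation on the current batch, instead of running input loading and training linearly. This is only a minimal illustration built on standard TensorFlow 2.x APIs (tf.data prefetching and MirroredStrategy, assuming TensorFlow 2.4 or later); the model, dataset, and batch size are placeholders, and it is not the pipeline algorithm proposed in the paper.

```python
# Minimal sketch of pipelined data-parallel training (not the paper's
# implementation). Assumes TensorFlow >= 2.4; the model and data are toy
# placeholders chosen only to keep the example self-contained.
import tensorflow as tf

def make_pipelined_dataset(features, labels, batch_size=64):
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    ds = ds.shuffle(10_000).batch(batch_size)
    # prefetch() lets the host prepare the next batch while the devices
    # are still computing on the current one (pipeline execution),
    # instead of the linear load-then-train pattern.
    return ds.prefetch(tf.data.AUTOTUNE)

# Data parallelism across the locally visible GPUs (falls back to CPU).
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

x = tf.random.normal([1024, 32])
y = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
model.fit(make_pipelined_dataset(x, y), epochs=1)
```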
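For the model-parallel side, the following sketch shows a generic greedy placement heuristic of the kind the abstract contrasts with random partitioning: operations are assigned, largest estimated cost first, to the device with the smallest accumulated load. The operation list, cost values, and device names are illustrative assumptions, and the greedy algorithm actually proposed in the paper may differ in its cost model and constraints.

```python
# Minimal sketch of a greedy load-balancing partitioner for model
# parallelism (illustrative only; not the paper's algorithm).
import heapq

def greedy_partition(op_costs, devices):
    """Assign each op to the currently least-loaded device.

    op_costs: list of (op_name, estimated_cost) pairs.
    devices:  list of device strings such as "/GPU:0".
    Returns a dict mapping op_name -> device.
    """
    heap = [(0.0, d) for d in devices]  # (accumulated_load, device)
    heapq.heapify(heap)
    placement = {}
    # Placing the most expensive ops first tends to balance load better.
    for name, cost in sorted(op_costs, key=lambda oc: -oc[1]):
        load, device = heapq.heappop(heap)
        placement[name] = device
        heapq.heappush(heap, (load + cost, device))
    return placement

ops = [("conv1", 4.0), ("conv2", 3.5), ("fc1", 1.2), ("fc2", 0.8)]
print(greedy_partition(ops, ["/GPU:0", "/GPU:1"]))
```

A placement computed this way could then be applied when building the graph, for example by wrapping each operation in a `tf.device(...)` context.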
Acknowledgements
This research is partially supported by the National Key Research and Development Program of China under Grant No. 2018AAA0103203.
Ethics declarations
Conflict of Interests
Author Yuanlun Xie declares that he has no conflict of interest. Author Majun He declares that he has no conflict of interest. Author Tingsong Ma declares that he has no conflict of interest. Author Wenhong Tian declares that he has no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Xie, Y., He, M., Ma, T. et al. Optimal distributed parallel algorithms for deep learning framework Tensorflow. Appl Intell 52, 3880–3900 (2022). https://doi.org/10.1007/s10489-021-02588-9