Research article | Open access

SketchDLC: A Sketch on Distributed Deep Learning Communication via Trace Capturing

Published: 18 April 2019

Abstract

With the rapid development of deep learning (DL), communication is increasingly a bottleneck for distributed training workloads, and a series of optimizations has been proposed to scale out successfully. Nevertheless, the network behavior itself has not been investigated in much depth. We intend to analyze this behavior and then carry out further research through network simulation. In this setting, accurate communication measurement is essential: it is an effective way to study the network behavior and the basis for accurate simulation. We therefore propose to capture the deep learning communication (DLC) trace to obtain such measurements.
To the best of our knowledge, this is the first attempt to capture the communication trace for DL training. In this article, we first provide a detailed analysis of the communication mechanism of MXNet, a representative framework for distributed DL. Second, we define the DLC trace format to describe and record communication behaviors. Third, we present the implementation of our trace-capturing method. Finally, we report statistics and analyses of distributed DL training, including the communication pattern, the overlap ratio between computation and communication, and the computation, synchronization, and update overheads. Both the statistics and the analyses are based on trace files captured on a six-machine cluster. On the one hand, our trace files provide a sketch of the DLC, which helps in understanding the communication details. On the other hand, because they record the communication behaviors of each node, the captured trace files can be used to quantify the various overheads.
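
The article defines the DLC trace format and these overhead metrics in full; purely as an informal illustration, the Python sketch below shows one plausible shape for a per-event trace record in a parameter-server setting and how an overlap ratio between communication and computation could be derived from timestamped intervals. All names and fields here (TraceRecord, overlap_ratio, and so on) are hypothetical assumptions made for this sketch, not the format specified in the paper.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class TraceRecord:
        # One communication event; field names are illustrative, not the paper's format.
        timestamp: float   # event start time in seconds
        duration: float    # transfer duration in seconds
        src: str           # sending node (worker or server)
        dst: str           # receiving node
        op: str            # "push" or "pull" in a parameter-server setting
        key: int           # key of the parameter (e.g., layer) being transferred
        size_bytes: int    # payload size in bytes

    def merge(intervals: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
        # Merge overlapping [start, end) intervals into a disjoint, sorted list.
        merged: List[Tuple[float, float]] = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    def overlap_ratio(comm: List[Tuple[float, float]],
                      comp: List[Tuple[float, float]]) -> float:
        # Fraction of total communication time that overlaps computation,
        # i.e., communication hidden behind computation.
        comm, comp = merge(comm), merge(comp)
        total_comm = sum(end - start for start, end in comm)
        hidden = sum(max(0.0, min(ce, pe) - max(cs, ps))
                     for cs, ce in comm for ps, pe in comp)
        return hidden / total_comm if total_comm > 0 else 0.0

Given a list of such records, the communication intervals for one node would simply be (r.timestamp, r.timestamp + r.duration); the paper's own definitions and measurement method should be taken from the full text.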

      Published In

      ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 2
      June 2019, 317 pages
      ISSN: 1544-3566
      EISSN: 1544-3973
      DOI: 10.1145/3325131

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 18 April 2019
      Accepted: 01 February 2019
      Revised: 01 November 2018
      Received: 01 May 2018
      Published in TACO Volume 16, Issue 2

      Author Tags

      1. Distributed system
      2. communication trace
      3. deep learning
      4. parameter server

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • National Science and Technology Major Projects on Core Electronic Devices, High-End Generic Chips and Basic Software

      Cited By

      • (2023) A survey on sliding window sketch for network measurement. Computer Networks 226, 109696. DOI: 10.1016/j.comnet.2023.109696. Online publication date: May 2023.
      • (2022) CP-SGD: Distributed stochastic gradient descent with compression and periodic compensation. Journal of Parallel and Distributed Computing 169, 42-57. DOI: 10.1016/j.jpdc.2022.05.014. Online publication date: November 2022.
      • (2021) CD-SGD: Distributed Stochastic Gradient Descent with Compression and Delay Compensation. Proceedings of the 50th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3472456.3472508. Online publication date: 9 August 2021.
      • (2021) X-NEST: A Scalable, Flexible, and High-Performance Network Architecture for Distributed Machine Learning. Journal of Lightwave Technology 39, 13, 4247-4254. DOI: 10.1109/JLT.2021.3073277. Online publication date: July 2021.
      • (2021) FastHorovod: Expediting Parallel Message-Passing Schedule for Distributed DNN Training. 2021 IEEE Symposium on Computers and Communications (ISCC), 1-7. DOI: 10.1109/ISCC53001.2021.9631443. Online publication date: 5 September 2021.
      • (2021) vSketchDLC: A Sketch on Distributed Deep Learning Communication via Fine-grained Tracing Visualization. Network and Parallel Computing, 28-39. DOI: 10.1007/978-3-030-93571-9_3. Online publication date: 3 November 2021.
      • (2020) Communication optimization strategies for distributed deep neural network training: A survey. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2020.11.005. Online publication date: November 2020.
