Research article | Open access

SketchDLC: A Sketch on Distributed Deep Learning Communication via Trace Capturing

Published: 18 April 2019

Abstract

With the rapid development of deep learning (DL), communication is increasingly a bottleneck for distributed training workloads, and a series of optimizations has been proposed to scale out successfully. Nevertheless, the network behavior itself has not been investigated in much depth. We intend to analyze this behavior and then carry out further research through network simulation. In this setting, accurate communication measurement is essential: it is an effective way to study the network behavior and the basis for accurate simulation. We therefore propose to capture the deep learning communication (DLC) trace to obtain such measurements.
To the best of our knowledge, this is the first attempt to capture the communication trace for DL training. In this article, we first provide a detailed analysis of the communication mechanism of MXNet, a representative framework for distributed DL. Second, we define the DLC trace format to describe and record communication behaviors. Third, we present the implementation of our trace-capturing method. Finally, we report statistics and analyses of distributed DL training, including the communication pattern, the overlap ratio between computation and communication, and the computation, synchronization, and update overheads. Both the statistics and the analyses are based on trace files captured on a six-machine cluster. On the one hand, our trace files provide a sketch of the DLC, which helps in understanding the communication details. On the other hand, because they record the communication behaviors of each node, the captured trace files can be used to quantify the various overheads.
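
The article defines the DLC trace format and these overhead metrics in full; purely as an informal illustration, the Python sketch below shows one plausible shape for a per-event trace record in a parameter-server setting and how an overlap ratio between communication and computation could be derived from timestamped intervals. All names and fields here (TraceRecord, overlap_ratio, and so on) are hypothetical assumptions made for this sketch, not the format specified in the paper.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class TraceRecord:
        # One communication event; field names are illustrative, not the paper's format.
        timestamp: float   # event start time in seconds
        duration: float    # transfer duration in seconds
        src: str           # sending node (worker or server)
        dst: str           # receiving node
        op: str            # "push" or "pull" in a parameter-server setting
        key: int           # key of the parameter (e.g., layer) being transferred
        size_bytes: int    # payload size in bytes

    def merge(intervals: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
        # Merge overlapping [start, end) intervals into a disjoint, sorted list.
        merged: List[Tuple[float, float]] = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    def overlap_ratio(comm: List[Tuple[float, float]],
                      comp: List[Tuple[float, float]]) -> float:
        # Fraction of total communication time that overlaps computation,
        # i.e., communication hidden behind computation.
        comm, comp = merge(comm), merge(comp)
        total_comm = sum(end - start for start, end in comm)
        hidden = sum(max(0.0, min(ce, pe) - max(cs, ps))
                     for cs, ce in comm for ps, pe in comp)
        return hidden / total_comm if total_comm > 0 else 0.0

Given a list of such records, the communication intervals for one node would simply be (r.timestamp, r.timestamp + r.duration); the paper's own definitions and measurement method should be taken from the full text.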

      Published In

      ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 2
      June 2019, 317 pages
      ISSN: 1544-3566
      EISSN: 1544-3973
      DOI: 10.1145/3325131

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 18 April 2019
      Accepted: 01 February 2019
      Revised: 01 November 2018
      Received: 01 May 2018
      Published in TACO Volume 16, Issue 2

      Author Tags

      1. Distributed system
      2. communication trace
      3. deep learning
      4. parameter server

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • National Science and Technology Major Projects on Core Electronic Devices, High-End Generic Chips and Basic Software

      Cited By

      • (2023) A survey on sliding window sketch for network measurement. Computer Networks 226, 109696. DOI: 10.1016/j.comnet.2023.109696. Online publication date: May 2023.
      • (2022) CP-SGD: Distributed stochastic gradient descent with compression and periodic compensation. Journal of Parallel and Distributed Computing 169, 42-57. DOI: 10.1016/j.jpdc.2022.05.014. Online publication date: November 2022.
      • (2021) CD-SGD: Distributed Stochastic Gradient Descent with Compression and Delay Compensation. Proceedings of the 50th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3472456.3472508. Online publication date: 9 August 2021.
      • (2021) X-NEST: A Scalable, Flexible, and High-Performance Network Architecture for Distributed Machine Learning. Journal of Lightwave Technology 39, 13, 4247-4254. DOI: 10.1109/JLT.2021.3073277. Online publication date: July 2021.
      • (2021) FastHorovod: Expediting Parallel Message-Passing Schedule for Distributed DNN Training. 2021 IEEE Symposium on Computers and Communications (ISCC), 1-7. DOI: 10.1109/ISCC53001.2021.9631443. Online publication date: 5 September 2021.
      • (2021) vSketchDLC: A Sketch on Distributed Deep Learning Communication via Fine-grained Tracing Visualization. Network and Parallel Computing, 28-39. DOI: 10.1007/978-3-030-93571-9_3. Online publication date: 3 November 2021.
      • (2020) Communication optimization strategies for distributed deep neural network training: A survey. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2020.11.005. Online publication date: November 2020.
