
Scalable Deep Learning via I/O Analysis and Optimization

Published: 01 July 2019

Abstract

Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize the computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly with respect to data I/O. This situation is especially true for training models where the computation can be effectively parallelized, leaving I/O as the major bottleneck. In fact, our analysis shows that I/O can take up to 90% of the total training time. Thus, in this article, we first analyze LMDB, the most widely used I/O subsystem of deep learning frameworks, to understand the causes of this I/O inefficiency. Based on our analysis, we propose LMDBIO—an optimized I/O plugin for scalable deep learning. LMDBIO includes six novel optimizations that together address the various shortcomings in existing I/O for deep learning. Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on a 9,216-core system.
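For context, the sketch below is not taken from the article; it illustrates the kind of sequential record scan that frameworks such as Caffe perform against an LMDB training database, which is the access pattern whose scaling behavior the article analyzes. It assumes the lmdb Python binding; the database path and the decoding step are illustrative placeholders.

import lmdb

# Hypothetical path to an LMDB database of serialized training samples.
DB_PATH = "ilsvrc12_train_lmdb"

# Open read-only; lock=False reflects shared, read-only access during training.
env = lmdb.open(DB_PATH, readonly=True, lock=False)
print("records in database:", env.stat()["entries"])

with env.begin() as txn:
    cursor = txn.cursor()
    # Sequentially iterate key/value pairs; each value is a serialized sample
    # (e.g., a Caffe Datum protobuf) that would be decoded and batched for training.
    for key, value in cursor:
        sample_bytes = value
        break  # stop after one record in this sketch

env.close()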


Published In

ACM Transactions on Parallel Computing, Volume 6, Issue 2
June 2019
109 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3343018
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 2019
Accepted: 01 April 2019
Revised: 01 April 2019
Received: 01 September 2018
Published in TOPC Volume 6, Issue 2

Author Tags

  1. Caffe
  2. I/O bottleneck
  3. I/O in deep learning
  4. LMDB
  5. LMDBIO
  6. Scalable deep learning
  7. parallel I/O

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (SC-21)
  • NSF XPS


View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media