
Scalable Deep Learning via I/O Analysis and Optimization

Published: 01 July 2019

Abstract

Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize the computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly with respect to data I/O. This situation is especially true for training models where the computation can be effectively parallelized, leaving I/O as the major bottleneck. In fact, our analysis shows that I/O can take up to 90% of the total training time. Thus, in this article, we first analyze LMDB, the most widely used I/O subsystem of deep learning frameworks, to understand the causes of this I/O inefficiency. Based on our analysis, we propose LMDBIO—an optimized I/O plugin for scalable deep learning. LMDBIO includes six novel optimizations that together address the various shortcomings in existing I/O for deep learning. Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on a 9,216-core system.
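For context, the sketch below is not taken from the article; it illustrates the kind of sequential record scan that frameworks such as Caffe perform against an LMDB training database, which is the access pattern whose scaling behavior the article analyzes. It assumes the lmdb Python binding; the database path and the decoding step are illustrative placeholders.

import lmdb

# Hypothetical path to an LMDB database of serialized training samples.
DB_PATH = "ilsvrc12_train_lmdb"

# Open read-only; lock=False reflects shared, read-only access during training.
env = lmdb.open(DB_PATH, readonly=True, lock=False)
print("records in database:", env.stat()["entries"])

with env.begin() as txn:
    cursor = txn.cursor()
    # Sequentially iterate key/value pairs; each value is a serialized sample
    # (e.g., a Caffe Datum protobuf) that would be decoded and batched for training.
    for key, value in cursor:
        sample_bytes = value
        break  # stop after one record in this sketch

env.close()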


Published In

ACM Transactions on Parallel Computing, Volume 6, Issue 2
June 2019
109 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3343018
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 2019
Accepted: 01 April 2019
Revised: 01 April 2019
Received: 01 September 2018
Published in TOPC Volume 6, Issue 2

Author Tags

  1. Caffe
  2. I/O bottleneck
  3. I/O in deep learning
  4. LMDB
  5. LMDBIO
  6. Scalable deep learning
  7. parallel I/O

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (SC-21)
  • NSF XPS


View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media