DOI: 10.1145/3545008.3545090
Research article, Open access

Lobster: Load Balance-Aware I/O for Distributed DNN Training

Published: 13 January 2023

Abstract

The resource-hungry and time-consuming process of training Deep Neural Networks (DNNs) can be accelerated by optimizing and/or scaling computations on accelerators such as GPUs. However, the loading and pre-processing of training samples then often emerges as a new bottleneck. This data loading process engages a complex pipeline that extends from the sampling of training data on external storage to the delivery of those data to GPUs, and that comprises not only expensive I/O operations but also decoding, shuffling, batching, augmentation, and other operations. We propose in this paper a new holistic approach to data loading that addresses three challenges not sufficiently addressed by other methods: I/O load imbalances among the GPUs on a node; rigid resource allocations to data loading and data preprocessing steps, which lead to idle resources and bottlenecks; and limited efficiency of prefetching-based caching strategies, which evict training samples that will be needed soon in favor of samples needed only later. We first present a study of key bottlenecks observed as training samples flow through the data loading and preprocessing pipeline. Then, we describe Lobster, a data loading runtime that uses performance modeling and advanced heuristics to combine flexible thread management with optimized eviction for distributed caching, in order to mitigate I/O overheads and load imbalances. Experiments with a range of models and datasets show that the Lobster approach reduces both I/O overheads and end-to-end training times by up to 1.5× compared with state-of-the-art approaches.
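
The abstract describes Lobster only at a high level; its actual performance models and eviction heuristics are given in the paper itself. As a rough illustration of the eviction idea it mentions (keep samples needed soon resident, evict ones needed only later), the sketch below shows a Belady-style lookahead cache that exploits the fact that, once an epoch's shuffle is fixed, the full sample access order is known in advance. All names here (LookaheadCache, epoch_order, load_fn) are hypothetical and do not correspond to Lobster's API.

# Illustrative sketch only: a lookahead-aware sample cache, assuming the
# shuffled per-epoch access order is known when the epoch starts.
from collections import defaultdict, deque


class LookaheadCache:
    """In-memory sample cache that evicts the entry whose next use is furthest away."""

    def __init__(self, capacity, epoch_order):
        self.capacity = capacity
        self.cache = {}                      # sample_id -> decoded sample
        # For every sample, pre-compute the queue of positions at which it
        # will be requested during the epoch (known once shuffling is fixed).
        self.next_use = defaultdict(deque)
        for pos, sample_id in enumerate(epoch_order):
            self.next_use[sample_id].append(pos)

    def get(self, sample_id, load_fn):
        # Consume the current access position so lookups below see only future uses.
        if self.next_use[sample_id]:
            self.next_use[sample_id].popleft()

        if sample_id in self.cache:
            return self.cache[sample_id]     # hit: no storage I/O

        sample = load_fn(sample_id)          # miss: read and decode from storage
        if len(self.cache) >= self.capacity:
            self._evict()
        self.cache[sample_id] = sample
        return sample

    def _evict(self):
        # Evict the cached sample that is reused furthest in the future
        # (or never again this epoch), so samples needed soon stay resident.
        def next_access(sid):
            uses = self.next_use[sid]
            return uses[0] if uses else float("inf")

        victim = max(self.cache, key=next_access)
        del self.cache[victim]


# Hypothetical usage with a fixed shuffled order and a dummy loader.
order = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
cache = LookaheadCache(capacity=3, epoch_order=order)
for sid in order:
    item = cache.get(sid, load_fn=lambda s: f"decoded-sample-{s}")

Compared with a least-recently-used policy, this kind of lookahead eviction avoids discarding a sample right before its next access, which is the failure mode the abstract attributes to prefetching-based caching strategies.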


Cited By

  • (2024) Efficient distributed continual learning for steering experiments in real-time. Future Generation Computer Systems. DOI: 10.1016/j.future.2024.07.016. Online publication date: Jul-2024.
  • (2023) GPU-Enabled Asynchronous Multi-level Checkpoint Caching and Prefetching. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 73-85. DOI: 10.1145/3588195.3592987. Online publication date: 7-Aug-2023.
  • (2023) Optimizing the Training of Co-Located Deep Learning Models Using Cache-Aware Staggering. 2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC), 246-255. DOI: 10.1109/HiPC58850.2023.00042. Online publication date: 18-Dec-2023.
Information

Published In

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
August 2022
976 pages
ISBN:9781450397339
DOI:10.1145/3545008
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 January 2023


Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '22: 51st International Conference on Parallel Processing
August 29 - September 1, 2022
Bordeaux, France

Acceptance Rates

Overall acceptance rate: 91 of 313 submissions, 29%

Article Metrics

  • Downloads (last 12 months): 541
  • Downloads (last 6 weeks): 50
Reflects downloads up to 18 Aug 2024

