DOI: 10.1145/3588195.3592987
research-article
Public Access

GPU-Enabled Asynchronous Multi-level Checkpoint Caching and Prefetching

Published: 07 August 2023

Abstract

Checkpointing is an I/O intensive operation increasingly used by High-Performance Computing (HPC) applications to revisit previous intermediate datasets at scale. Unlike the resilience case, where only the last checkpoint is needed for application restart and is rarely read to recover from failures, in this scenario it is important to optimize frequent reads and writes of an entire history of checkpoints. State-of-the-art checkpointing approaches often rely on asynchronous multi-level techniques to hide I/O overheads by writing to fast local tiers (e.g., an SSD) and asynchronously flushing to slower, potentially remote tiers (e.g., a parallel file system) in the background, while the application keeps running. However, such approaches have two limitations. First, despite the fact that HPC infrastructures routinely rely on accelerators (e.g., GPUs), and therefore a majority of checkpoints involve GPU memory, efficient asynchronous data movement between GPU memory and host memory is lagging behind. Second, revisiting previous data often involves predictable access patterns, which are not exploited to accelerate read operations. In this paper, we address these limitations by proposing a scalable and asynchronous multi-level checkpointing approach optimized for both reading and writing an arbitrarily long history of checkpoints. Our approach exploits GPU memory as a first-class citizen in the multi-level storage hierarchy to enable informed caching and prefetching of checkpoints by leveraging foreknowledge about the access order passed by the application as hints. Our evaluation using a variety of scenarios under I/O concurrency shows up to 74× faster checkpoint and restore throughput compared to state-of-the-art runtime and optimized unified virtual memory (UVM) based prefetching strategies, and at least 2× shorter I/O wait time for the application across various workloads and configurations.
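
To make the hint-driven design concrete, the following is a minimal illustrative sketch, not the authors' implementation: the checkpoint size, the access_order hint, the host-side cache, and the double-buffering scheme are all assumptions. It shows the core idea of using application-supplied access-order hints to overlap the host-to-GPU transfer of the next checkpoint version with the consumption of the current one on a dedicated CUDA stream.

// Hypothetical sketch of hint-driven checkpoint prefetching into GPU memory.
// Not the paper's actual runtime: names, sizes, and the double-buffering
// scheme are illustrative assumptions.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t ckpt_bytes = 64 << 20;              // assumed size of one checkpoint version
    const std::vector<int> access_order = {3, 1, 2}; // hint: versions in the order they will be restored

    // Host-side cache of pinned buffers, one per checkpoint version (assume
    // earlier tiers already staged these versions into host memory).
    std::vector<char*> host_cache(4, nullptr);
    for (auto& buf : host_cache)
        cudaMallocHost(reinterpret_cast<void**>(&buf), ckpt_bytes);

    // Two GPU staging buffers used in a double-buffered fashion.
    char* dev_buf[2] = {nullptr, nullptr};
    cudaMalloc(reinterpret_cast<void**>(&dev_buf[0]), ckpt_bytes);
    cudaMalloc(reinterpret_cast<void**>(&dev_buf[1]), ckpt_bytes);

    cudaStream_t prefetch_stream;
    cudaStreamCreate(&prefetch_stream);

    // Prefetch the first requested version before the restore loop starts.
    cudaMemcpyAsync(dev_buf[0], host_cache[access_order[0]], ckpt_bytes,
                    cudaMemcpyHostToDevice, prefetch_stream);

    for (size_t i = 0; i < access_order.size(); ++i) {
        cudaStreamSynchronize(prefetch_stream);      // version i is now resident in GPU memory
        char* current = dev_buf[i % 2];

        // Use the next hint to start moving version i+1 while version i is consumed.
        if (i + 1 < access_order.size())
            cudaMemcpyAsync(dev_buf[(i + 1) % 2], host_cache[access_order[i + 1]],
                            ckpt_bytes, cudaMemcpyHostToDevice, prefetch_stream);

        // ... application kernels would read `current` here ...
        printf("restored checkpoint version %d at device address %p\n",
               access_order[i], static_cast<void*>(current));
    }

    cudaStreamDestroy(prefetch_stream);
    cudaFree(dev_buf[0]);
    cudaFree(dev_buf[1]);
    for (auto buf : host_cache)
        cudaFreeHost(buf);
    return 0;
}

In the paper's approach, such hints are passed to the checkpointing runtime rather than managed directly by the application; the sketch only illustrates why foreknowledge of the access order allows transfers to be overlapped instead of issued on demand.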


Published In

HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
August 2023
350 pages
ISBN:9798400701559
DOI:10.1145/3588195
  • General Chair: Ali R. Butt
  • Program Chairs: Ningfang Mi, Kyle Chard
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 August 2023


Author Tags

  1. asynchronous multi-level checkpointing
  2. graphics processing unit (GPU)
  3. hierarchical cache management
  4. high-performance computing (HPC)
  5. prefetching

Qualifiers

  • Research-article

Conference

HPDC '23

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%


Article Metrics

  • Downloads (last 12 months): 274
  • Downloads (last 6 weeks): 75
Reflects downloads up to 17 Oct 2024

Cited By

  • (2024) AdapCK: Optimizing I/O for Checkpointing on Large-Scale High Performance Computing Systems. Euro-Par 2024: Parallel Processing, 342-355. https://doi.org/10.1007/978-3-031-69583-4_24. Online publication date: 26-Aug-2024.
  • (2024) Combining Compression and Prefetching to Improve Checkpointing for Inverse Seismic Problems in GPUs. Euro-Par 2024: Parallel Processing, 167-181. https://doi.org/10.1007/978-3-031-69583-4_12. Online publication date: 26-Aug-2024.
  • (2023) Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 1748-1756. https://doi.org/10.1145/3624062.3624256. Online publication date: 12-Nov-2023.
  • (2023) Towards Efficient I/O Pipelines Using Accumulated Compression. 2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC), 256-265. https://doi.org/10.1109/HiPC58850.2023.00043. Online publication date: 18-Dec-2023.
