DOI: 10.1145/3624062.3624255
Research Article • Open Access

Implementation-Oblivious Transparent Checkpoint-Restart for MPI

Published: 12 November 2023

Abstract

This work presents experience with traditional use cases of checkpointing on a novel platform. A single codebase (MANA) transparently checkpoints production workloads for the major available MPI implementations: “develop once, run everywhere”. The new platform enables application developers to compile their application against any standards-compliant MPI implementation and to evaluate each implementation for performance or other desired features.
Since its original academic prototype, MANA has been under development for three of the past four years, and is planned to enter full production at NERSC in early Fall of 2023. To the best of the authors’ knowledge, MANA is currently the only production-capable, system-level checkpointing package running on a large supercomputer (Perlmutter at NERSC) using a major MPI implementation (HPE Cray MPI). Experiments are presented on large production workloads, demonstrating low runtime overhead with one codebase supporting four MPI implementations: HPE Cray MPI, MPICH, Open MPI, and ExaMPI.
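
Because the checkpointing described above is transparent, the application itself contains no checkpoint-specific API calls. The sketch below is an illustrative example (not taken from the paper): a plain MPI program in C that could be compiled unmodified against HPE Cray MPI, MPICH, Open MPI, or ExaMPI, with checkpoint and restart driven entirely from outside the process.

/* Illustrative sketch only: a plain MPI program with no checkpoint-specific
 * code. Under transparent checkpointing, the checkpoint/restart logic lives
 * outside the application, so this source compiles unmodified against any
 * standards-compliant MPI implementation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* A long-running iterative loop: the kind of workload that benefits
     * from periodic, externally driven checkpoints. */
    double local = (double)rank, global = 0.0;
    const int steps = 1000;
    for (int step = 0; step < steps; step++) {
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        local = global / size;   /* trivial stand-in for real computation */
    }

    if (rank == 0)
        printf("final value after %d steps: %f\n", steps, global);

    MPI_Finalize();
    return 0;
}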

Supplemental Material

MP4 File: Recording of "Implementation-Oblivious Transparent Checkpoint-Restart for MPI" presentation at SuperCheck-SC23.


    Published In

    SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. November 2023, 2180 pages. ISBN: 9798400707858. DOI: 10.1145/3624062.

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. ExaMPI
    2. MANA
    3. MPI
    4. MPICH
    5. Open MPI
    6. transparent checkpointing


    Conference

    SC-W 2023
