
Transparent checkpoint-restart over infiniband

Published: 23 June 2014

Abstract

Transparently saving the state of the InfiniBand network as part of distributed checkpointing has been a long-standing challenge for researchers. The lack of a solution has forced typical MPI implementations to include custom checkpoint-restart services that "tear down" the network, checkpoint each node in isolation, and then reconnect the network. This work presents the first example of transparent, system-initiated checkpoint-restart that directly supports InfiniBand. The new approach simplifies current practice by avoiding the need for a privileged kernel module. The generality of this approach is demonstrated by applying it both to MPI and to Berkeley UPC (Unified Parallel C), in its native mode (without MPI). Scalability is shown by checkpointing 2,048 MPI processes across 128 nodes (with 16 cores per node). The run-time overhead varies between 0.8% and 1.7%. While checkpoint times dominate, the network-only portion of the implementation is shown to require less than 100 milliseconds (not including the time to locally write application memory to stable storage).



Published In

HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing
June 2014
334 pages
ISBN: 9781450327497
DOI: 10.1145/2600212

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. checkpoint/restart
  2. infiniband
  3. mpi
  4. upc

Qualifiers

  • Research-article

Conference

HPDC'14
Acceptance Rates

HPDC '14 Paper Acceptance Rate 21 of 130 submissions, 16%;
Overall Acceptance Rate 166 of 966 submissions, 17%

Article Metrics

  • Downloads (Last 12 months): 14
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 15 Oct 2024

Cited By

  • (2023) Implementation-Oblivious Transparent Checkpoint-Restart for MPI. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 1738-1747. DOI: 10.1145/3624062.3624255. Online publication date: 12-Nov-2023
  • (2022) Debugging MPI Implementations via Reduction-to-Primitives. 2022 IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck), pp. 1-9. DOI: 10.1109/SuperCheck56652.2022.00007. Online publication date: Nov-2022
  • (2022) Research Progress and Trend of Coflow Time-Optimal Scheduling in Data Center Network. Artificial Intelligence and Security, pp. 560-572. DOI: 10.1007/978-3-031-06788-4_47. Online publication date: 15-Jul-2022
  • (2021) Checkpointing Tools in a Supercomputer Center. Lobachevskii Journal of Mathematics, 41:12, pp. 2603-2613. DOI: 10.1134/S1995080220120355. Online publication date: 4-Feb-2021
  • (2021) MANA-2.0: A Future-Proof Design for Transparent Checkpointing of MPI at Scale. 2021 SC Workshops Supplementary Proceedings (SCWS), pp. 68-78. DOI: 10.1109/SCWS55283.2021.00019. Online publication date: Nov-2021
  • (2021) An Innovative Approach for Cloud-Based Web Dev App Migration. ICT with Intelligent Applications, pp. 807-817. DOI: 10.1007/978-981-16-4177-0_79. Online publication date: 6-Dec-2021
  • (2019) MANA for MPI. Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, pp. 49-60. DOI: 10.1145/3307681.3325962. Online publication date: 17-Jun-2019
  • (2019) Fast Coflow Scheduling via Traffic Compression and Stage Pipelining in Datacenter Networks. IEEE Transactions on Computers, 68:12, pp. 1755-1771. DOI: 10.1109/TC.2019.2931716. Online publication date: 1-Dec-2019
  • (2019) Checkpoint/restart approaches for a thread-based MPI runtime. Parallel Computing, 85:C, pp. 204-219. DOI: 10.1016/j.parco.2019.02.006. Online publication date: 1-Jul-2019
  • (2018) Transparent High-Speed Network Checkpoint/Restart in MPI. Proceedings of the 25th European MPI Users' Group Meeting, pp. 1-11. DOI: 10.1145/3236367.3236383. Online publication date: 23-Sep-2018
