Application-level checkpointing for shared memory programs

Authors:

Greg Bronevetsky,

Daniel Marques,

Keshav Pingali,

Martin Schulz

ACM SIGPLAN Notices, Volume 39, Issue 11

Pages 235 - 247

https://doi.org/10.1145/1037187.1024421

Published: 07 October 2004

Abstract

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR.

Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs.

In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks.

One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
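To make the idea of application-level CPR among threads concrete, the following is a minimal Python sketch, not the authors' pre-compiler or runtime protocol. It assumes a simple blocking scheme in which all threads meet at a barrier every `interval` steps and one of them writes the shared state to disk. The names `save_checkpoint`, `load_checkpoint`, `run`, and the `fail_at` parameter are hypothetical and exist only for illustration.

```python
import os
import pickle
import tempfile
import threading


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path, num_threads):
    """Return the last saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "partial": [0] * num_threads}


def run(path, num_threads, total_steps, interval, fail_at=None):
    """Run (or resume) the computation; return the result, or None on a simulated fault."""
    state = load_checkpoint(path, num_threads)
    start = state["step"]  # first step that still has to be executed

    # The barrier action runs in exactly one thread, after all threads have
    # arrived and before any is released, so the saved state is consistent.
    barrier = threading.Barrier(
        num_threads, action=lambda: save_checkpoint(path, state)
    )

    def worker(tid):
        for step in range(start, total_steps):
            if step == fail_at:
                return  # simulated fault: progress since the last checkpoint is lost
            state["partial"][tid] += step * num_threads + tid
            if (step + 1) % interval == 0:
                state["step"] = step + 1  # every thread writes the same value
                barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    crashed = fail_at is not None and start <= fail_at < total_steps
    return None if crashed else sum(state["partial"])
```

Calling `run` once with `fail_at` set and then again without it on the same checkpoint path resumes from the last saved step rather than from step 0. A blocking barrier is the simplest way to obtain a consistent snapshot; a real system would also have to save and restore locks, stack, and heap state, which this sketch leaves out.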

Index Terms

Application-level checkpointing for shared memory programs
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Published In

ACM SIGPLAN Notices Volume 39, Issue 11

ASPLOS '04

November 2004

283 pages

ISSN: 0362-1340

EISSN: 1558-1160

DOI: 10.1145/1037187

ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
October 2004
296 pages
ISBN: 1581138040
DOI: 10.1145/1024393
General Chair: Shubu Mukherjee (Intel Corporation)
Program Chair: Kathryn S. McKinley (University of Texas at Austin)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
