Application-level checkpointing for shared memory programs

Published: 07 October 2004

Abstract

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR.

Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs.

In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks.

One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
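To make the idea of application-level CPR among threads concrete, the following is a minimal Python sketch. It is not the paper's pre-compiler output or its coordination protocol; it only illustrates the general pattern under simple assumptions: all threads stop at a barrier, one of them writes the shared state to disk, and a restarted run resumes from the last saved state. The names `run`, `ckpt_path`, `ckpt_every`, and `fail_at` are hypothetical and chosen for illustration.

```python
import json
import os
import tempfile
import threading


def run(ckpt_path, n_threads=4, n_steps=10, ckpt_every=3, fail_at=None):
    """Run a toy parallel computation with coordinated checkpoints.

    Returns the final sum, or None if a failure was simulated at step fail_at.
    """
    # Restart path: resume from the last saved state if a checkpoint exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "partial": [0] * n_threads}

    start = state["step"]
    barrier = threading.Barrier(n_threads)
    crashed = threading.Event()

    def save():
        # Write to a temporary file, then rename atomically, so that a crash
        # during the write never corrupts the previous checkpoint.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(ckpt_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, ckpt_path)

    def worker(tid):
        for step in range(start, n_steps):
            if fail_at is not None and step == fail_at:
                crashed.set()  # simulated hardware fault: every thread stops here
                return
            state["partial"][tid] += (step + 1) * (tid + 1)
            if (step + 1) % ckpt_every == 0:
                # First barrier: every thread has finished this step.
                if barrier.wait() == 0:
                    state["step"] = step + 1
                    save()
                # Second barrier: nobody continues until the state is on disk.
                barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return None if crashed.is_set() else sum(state["partial"])
```

Calling `run(path, fail_at=7)` and then `run(path)` with the same path should produce the same result as a single uninterrupted run, because the second call discards the in-memory work done after the last checkpoint and recomputes it from the saved step.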

Published In

ACM SIGPLAN Notices  Volume 39, Issue 11
ASPLOS '04
November 2004
283 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1037187
  • ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
    October 2004
    296 pages
    ISBN:1581138040
    DOI:10.1145/1024393
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Published in SIGPLAN Volume 39, Issue 11

Author Tags

  1. checkpointing
  2. fault-tolerance
  3. OpenMP
  4. shared-memory programs

Qualifiers

  • Article
