Design and modeling of a non-blocking checkpointing system

Authors:

Naoya Maruyama,

Kathryn Mohror,

Bronis R. de Supinski,

Satoshi Matsuoka

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Article No.: 19, Pages 1 - 10

Published: 10 November 2012

Abstract

As the capability and component count of systems increase, the mean time between failures (MTBF) decreases. Typically, applications tolerate failures by checkpointing to, and restarting from, a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, although multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale-class machines, which are predicted to have memory sizes and failure rates that are orders of magnitude larger. Our solution combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and model its performance. Our experiments show that our system can improve efficiency by 1.1x to 2.0x on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
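
To make the combination of non-blocking and multi-level checkpointing concrete, here is a minimal Python sketch. It is an illustration only and not the authors' system: the function names (`local_checkpoint`, `flush_to_pfs`, `nonblocking_checkpoint`, `run_demo`), the JSON file format, and the temporary directory standing in for the PFS are all assumptions made for this example. The application blocks only for a fast in-memory snapshot, and a background thread then drains that snapshot to the slower storage level while computation continues.

```python
import copy
import json
import os
import tempfile
import threading


def local_checkpoint(state):
    """Level 1 (hypothetical): fast, blocking snapshot held in local memory."""
    return copy.deepcopy(state)


def flush_to_pfs(snapshot, path):
    """Level 2 (hypothetical): write the snapshot to a simulated PFS path."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(snapshot, f)
    # Rename last so a reader never sees a partially written checkpoint.
    os.replace(tmp, path)


def nonblocking_checkpoint(state, path):
    """Snapshot locally, then flush to the PFS in a background thread."""
    snapshot = local_checkpoint(state)
    worker = threading.Thread(target=flush_to_pfs, args=(snapshot, path))
    worker.start()
    return worker


def run_demo():
    # A temporary directory stands in for the parallel file system.
    pfs_dir = tempfile.mkdtemp()
    path = os.path.join(pfs_dir, "ckpt.json")

    state = {"step": 0, "values": [0.0, 0.0]}
    for step in range(1, 4):
        state["step"] = step
        state["values"] = [v + 1.0 for v in state["values"]]

    # Checkpoint after step 3; the flush overlaps with the next step.
    worker = nonblocking_checkpoint(state, path)

    # Computation continues and mutates the live state.
    state["step"] = 4
    state["values"] = [v + 1.0 for v in state["values"]]

    # Wait for the flush before reading the checkpoint back.
    worker.join()
    with open(path) as f:
        restored = json.load(f)
    return state, restored
```

Calling `run_demo()` returns the live state (at step 4) and the restored checkpoint (from step 3). The checkpoint is unaffected by later computation because the flush works on a private copy. A real system would also have to bound the memory used by pending snapshots and handle failures that occur during the flush, which this sketch ignores.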



Information & Contributors

Information

Published In

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

November 2012

1161 pages

ISBN:9781467308045

General Chair:
Jeffrey K. Hollingsworth
University of Maryland

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 10 November 2012

Qualifiers

Research-article

Conference

SC '12

Sponsor:

SIGHPC
SIGARCH
IEEE-CS

SC '12: International Conference for High Performance Computing, Networking, Storage and Analysis

November 10 - 16, 2012

Salt Lake City, Utah

Acceptance Rates

SC '12 Paper Acceptance Rate 100 of 461 submissions, 22%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 23
Total Downloads: 558
Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 0

Reflects downloads up to 09 Aug 2024

Citations

Cited By

Wang J, He X, Hori A, Yoshinaga K, Herault T, Bouteiller A, Bosilca G, Ishikawa Y (2020). Overhead of using spare nodes. International Journal of High Performance Computing Applications, 34:2 (208-226). DOI: 10.1177/1094342020901885. Online publication date: 1-Mar-2020
https://dl.acm.org/doi/10.1177/1094342020901885
Behera S, Wan L, Mueller F, Wolf M, Klasky S, Parashar M, Vlassov V, Irwin D, Mohror K (2020). Orchestrating Fault Prediction with Live Migration and Checkpointing. Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (167-171). DOI: 10.1145/3369583.3392672. Online publication date: 23-Jun-2020
https://dl.acm.org/doi/10.1145/3369583.3392672
Grove D, Hamouda S, Herta B, Iyengar A, Kawachiya K, Milthorpe J, Saraswat V, Shinnar A, Takeuchi M, Tardieu O (2019). Failure Recovery in Resilient X10. ACM Transactions on Programming Languages and Systems, 41:3 (1-30). DOI: 10.1145/3332372. Online publication date: 2-Jul-2019
https://dl.acm.org/doi/10.1145/3332372
Lee K, Sullivan M, Hari S, Tsai T, Keckler S, Erez M, Eigenmann R, Ding C, McKee S (2019). GPU snapshot. Proceedings of the ACM International Conference on Supercomputing (171-183). DOI: 10.1145/3330345.3330361. Online publication date: 26-Jun-2019
https://dl.acm.org/doi/10.1145/3330345.3330361
Cantwell C, Nielsen A (2019). A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers. Journal of Scientific Computing, 78:1 (565-581). DOI: 10.1007/s10915-018-0778-7. Online publication date: 1-Jan-2019
https://dl.acm.org/doi/10.1007/s10915-018-0778-7
Shahzad F, Kreutzer M, Zeiser T, Machado R, Pieper A, Hager G, Wellein G (2018). Building and utilizing fault tolerance support tools for the GASPI applications. International Journal of High Performance Computing Applications, 32:5 (613-626). DOI: 10.1177/1094342016677085. Online publication date: 1-Sep-2018
https://dl.acm.org/doi/10.1177/1094342016677085
Liu J, Agrawal G (2017). Supporting Fault-Tolerance in Presence of In-Situ Analytics. Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (304-313). DOI: 10.5555/3101112.3101155. Online publication date: 14-May-2017
https://dl.acm.org/doi/10.5555/3101112.3101155
Poke M, Hoefler T, Glass C, Huang H, Weissman J, Iamnitchi A, Iosup A (2017). AllConcur. Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (205-218). DOI: 10.1145/3078597.3078598. Online publication date: 26-Jun-2017
https://dl.acm.org/doi/10.1145/3078597.3078598
Dubey A, Fujita H, Graves D, Chien A, Tiwari D, West J (2016). Granularity and the cost of error recovery in resilient AMR scientific applications. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-10). DOI: 10.5555/3014904.3014961. Online publication date: 13-Nov-2016
https://dl.acm.org/doi/10.5555/3014904.3014961
Mansouri H, Badache N, Aliouat M, Pathan A (2015). A Non-Blocking Coordinated Checkpointing Algorithm for Message-Passing Systems. Proceedings of the International Conference on Intelligent Information Processing, Security and Advanced Communication (1-5). DOI: 10.1145/2816839.2816885. Online publication date: 23-Nov-2015
https://dl.acm.org/doi/10.1145/2816839.2816885
