DOI: 10.5555/2388996.2389022
research-article

Design and modeling of a non-blocking checkpointing system

Published: 10 November 2012
    Abstract

    As the capability and component count of systems increase, the mean time between failures (MTBF) decreases. Typically, applications tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, while multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale-class machines, which are predicted to have memory sizes and failure rates that are orders of magnitude larger. Our solution combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and model its performance. Our experiments show that our system can improve efficiency by 1.1x to 2.0x on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
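
    The link between component count, MTBF, and the cost of blocking checkpoints to a PFS can be illustrated with a rough back-of-the-envelope sketch. This is not the paper's model (the paper is tagged as using a Markov model); it assumes independent, exponentially distributed node failures and uses Young's first-order approximation for the checkpoint interval. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def system_mtbf(node_mtbf_hours: float, node_count: int) -> float:
    """System MTBF, assuming independent nodes with exponentially distributed failures."""
    return node_mtbf_hours / node_count


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order optimum compute time between checkpoints: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def blocking_overhead(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """First-order fraction of time lost to checkpoint writes plus expected rework.

    Uses C/T + T/(2M) evaluated at Young's interval T; only meaningful when C << M.
    """
    t = young_interval(checkpoint_cost_hours, mtbf_hours)
    return checkpoint_cost_hours / t + t / (2.0 * mtbf_hours)


if __name__ == "__main__":
    node_mtbf = 100_000.0  # hypothetical per-node MTBF in hours
    cost = 0.25            # hypothetical blocking PFS checkpoint cost in hours
    for nodes in (1_000, 10_000, 100_000):
        m = system_mtbf(node_mtbf, nodes)
        print(f"nodes={nodes:>7} MTBF={m:7.2f} h "
              f"interval={young_interval(cost, m):5.2f} h "
              f"overhead={blocking_overhead(cost, m):.1%}")
```

    Under these assumptions the estimated overhead grows as the node count rises and the system MTBF shrinks, which is the pressure that motivates cheaper (multi-level) and overlapped (non-blocking) checkpoints. The estimate degrades once the checkpoint cost is no longer small relative to the MTBF.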

    Published In

    SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
    November 2012
    1161 pages
    ISBN:9781467308045

    Publisher

    IEEE Computer Society Press

    Washington, DC, United States

    Author Tags

    1. Markov model
    2. checkpoint/restart
    3. fault tolerance

    Qualifiers

    • Research-article

    Conference

    SC '12
    Acceptance Rates

    SC '12 Paper Acceptance Rate 100 of 461 submissions, 22%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Article Metrics

    • Downloads (Last 12 months): 8
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 09 Aug 2024

    Cited By

    • (2020) Overhead of using spare nodes, International Journal of High Performance Computing Applications, 10.1177/1094342020901885, 34:2, (208-226), Online publication date: 1-Mar-2020
    • (2020) Orchestrating Fault Prediction with Live Migration and Checkpointing, Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, 10.1145/3369583.3392672, (167-171), Online publication date: 23-Jun-2020
    • (2019) Failure Recovery in Resilient X10, ACM Transactions on Programming Languages and Systems, 10.1145/3332372, 41:3, (1-30), Online publication date: 2-Jul-2019
    • (2019) GPU snapshot, Proceedings of the ACM International Conference on Supercomputing, 10.1145/3330345.3330361, (171-183), Online publication date: 26-Jun-2019
    • (2019) A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers, Journal of Scientific Computing, 10.1007/s10915-018-0778-7, 78:1, (565-581), Online publication date: 1-Jan-2019
    • (2018) Building and utilizing fault tolerance support tools for the GASPI applications, International Journal of High Performance Computing Applications, 10.1177/1094342016677085, 32:5, (613-626), Online publication date: 1-Sep-2018
    • (2017) Supporting Fault-Tolerance in Presence of In-Situ Analytics, Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 10.5555/3101112.3101155, (304-313), Online publication date: 14-May-2017
    • (2017) AllConcur, Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, 10.1145/3078597.3078598, (205-218), Online publication date: 26-Jun-2017
    • (2016) Granularity and the cost of error recovery in resilient AMR scientific applications, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 10.5555/3014904.3014961, (1-10), Online publication date: 13-Nov-2016
    • (2015) A Non-Blocking Coordinated Checkpointing Algorithm for Message-Passing Systems, Proceedings of the International Conference on Intelligent Information Processing, Security and Advanced Communication, 10.1145/2816839.2816885, (1-5), Online publication date: 23-Nov-2015