Keeping checkpointing viable for exascale systems

January 2011

Author: Kurt B. Ferreira, The University of New Mexico
Adviser: Patrick G. Bridges, The University of New Mexico

Publisher: University of New Mexico, Albuquerque, NM, United States

ISBN: 978-1-267-28351-1

Order Number: AAI3504292

Pages: 180

Abstract

Next-generation exascale systems, those capable of performing a quintillion (10^18) operations per second, are expected to be delivered in the next 8–10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques for ensuring progress across faults, such as checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches over a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
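
To make the two quantities named in the abstract concrete, the following is a minimal sketch of how the checkpoint interval and the checkpoint commit time interact. It uses Young's well-known first-order approximation of the optimal checkpoint interval, which is a standard textbook model and not necessarily the model used in the thesis. The function names, the assumption of independent exponentially distributed node failures, and all numeric inputs are illustrative assumptions, not values taken from this work.

```python
import math


def system_mttf(node_mttf_s: float, node_count: int) -> float:
    """System mean time to failure, assuming independent node failures
    with exponentially distributed inter-arrival times (an assumption)."""
    return node_mttf_s / node_count


def young_interval(commit_time_s: float, system_mttf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint
    interval: sqrt(2 * commit_time * MTTF)."""
    return math.sqrt(2.0 * commit_time_s * system_mttf_s)


def checkpoint_overhead(commit_time_s: float, interval_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints.
    Ignores lost work and restart time, so it is a lower bound."""
    return commit_time_s / (interval_s + commit_time_s)


if __name__ == "__main__":
    # Hypothetical inputs chosen only for illustration.
    node_mttf = 5 * 365 * 24 * 3600.0  # 5-year per-node MTTF, in seconds
    commit = 600.0                      # 10-minute checkpoint commit time

    for nodes in (10_000, 100_000):
        mttf = system_mttf(node_mttf, nodes)
        tau = young_interval(commit, mttf)
        print(f"nodes={nodes}: system MTTF={mttf:.0f}s, "
              f"interval={tau:.0f}s, "
              f"overhead>={checkpoint_overhead(commit, tau):.1%}")

    # Cutting the commit time (for example through incremental
    # checkpointing) lowers the overhead at a fixed node count.
    mttf = system_mttf(node_mttf, 100_000)
    faster = commit / 4
    tau = young_interval(faster, mttf)
    print(f"commit={faster:.0f}s: interval={tau:.0f}s, "
          f"overhead>={checkpoint_overhead(faster, tau):.1%}")
```

Under this simplified model, raising the effective system MTTF (the role replication plays in the abstract) lengthens the optimal interval, while shrinking the commit time (the role incremental checkpointing plays) reduces the share of time spent saving state.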

Recommendations

Checkpointing Exascale Memory Systems with Existing Memory Technologies
MEMSYS '16: Proceedings of the Second International Symposium on Memory Systems

Building exascale supercomputers requires resilience to failing components such as processor, memory, storage, and network devices. Checkpoint/restart is a key ingredient in attaining resilience, but providing fast and reliable checkpointing is becoming ...

Hybrid checkpointing using emerging nonvolatile memories for future exascale systems

The scalability of future Massively Parallel Processing (MPP) systems is being severely challenged by high failure rates. Current centralized Hard Disk Drive (HDD) checkpointing results in overhead of 25% or more at petascale. Since systems become more ...

Multilevel Diskless Checkpointing

Extreme scale systems available before the end of this decade are expected to have 100 million to 1 billion CPU cores. The probability that a failure occurs during an application execution is expected to be much higher than today's systems. ...
