Parallel scientific applications cope with machine unreliability through periodic checkpointing, in which all processes coordinate to dump their memory to stable storage simultaneously. However, on systems comprising tens of thousands of nodes, the total data volume can overwhelm the network and storage farm, creating an I/O bottleneck. Furthermore, a very large class of scientific applications fails on these systems if even one of the processes dies.
Poor checkpointing performance limits how frequently checkpoints can be taken and increases the time-to-solution of applications. Moreover, because large systems tend to fail often, applications spend more time in recovery and restart.
Diskless checkpointing is a viable approach that provides high-performance and reliable storage for intermediate or temporary data, such as checkpoint files. First, the data is stored in memory instead of on disk. Second, reliability and recoverability are guaranteed by redundancy codes (parity bits or Reed-Solomon codes), which are stored on spare nodes. Third, I/O is made scalable by partitioning the nodes and spares into small groups; each group handles its own redundancy-code generation and node failure recovery.
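As a rough illustration of the group-based redundancy scheme described above, the sketch below simulates a single group in plain Python. The function names (`take_group_checkpoint`, `recover_member`), the group size, and the choice of simple XOR parity rather than Reed-Solomon codes are illustrative assumptions, not the paper's implementation, which targets parallel applications with in-memory checkpoints and spares.

```python
# Minimal sketch of group-based parity checkpointing (assumption: the real
# system keeps checkpoints in node memory and parity on a spare node; here
# one group is simulated locally with byte strings standing in for memory).
import os
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)


def take_group_checkpoint(member_states):
    """Return the parity block the group's spare would hold."""
    return xor_blocks(member_states)


def recover_member(surviving_states, parity):
    """Rebuild a single failed member's checkpoint from survivors + parity."""
    return xor_blocks(surviving_states + [parity])


if __name__ == "__main__":
    group_size, ckpt_bytes = 4, 1 << 10              # hypothetical sizes
    states = [os.urandom(ckpt_bytes) for _ in range(group_size)]

    parity = take_group_checkpoint(states)            # stored on the spare

    failed = 2                                        # pretend node 2 dies
    survivors = [s for i, s in enumerate(states) if i != failed]
    restored = recover_member(survivors, parity)

    assert restored == states[failed]
    print("recovered checkpoint of failed node matches original")
```

Because each group computes and stores its own parity independently, the redundancy work stays local and the scheme scales with the number of groups; tolerating more than one simultaneous failure per group would require Reed-Solomon codes and additional spares.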
We have implemented a diskless checkpointing and recovery system and assessed its performance with both I/O benchmarks and real scientific applications. The results show much greater I/O scalability and higher throughput than disk-based parallel file systems for a large number of clients.
As a technology projection, we have also developed an analytical model to investigate the performability of diskless checkpointing. Our model evaluation shows that the overhead of checkpoint/recovery is small on systems with thousands of nodes, and that, with appropriate partitioning of the nodes, the user application can survive several times longer.
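The following back-of-the-envelope sketch is not the paper's analytical model; it only illustrates the kind of performability estimate involved, using the classic first-order (Young-style) approximation for the checkpoint interval and made-up parameter values.

```python
# Rough performability estimate (assumption: Young's first-order model with
# hypothetical parameters, not the paper's own analytical model).
import math


def expected_runtime(work_hours, n_nodes, node_mtbf_hours,
                     ckpt_hours, restart_hours):
    """Estimate wall-clock time for `work_hours` of useful computation on
    `n_nodes` nodes, checkpointing at the approximately optimal interval."""
    system_mtbf = node_mtbf_hours / n_nodes               # failures compound
    interval = math.sqrt(2.0 * ckpt_hours * system_mtbf)  # Young's interval
    # Fraction of each checkpoint period spent writing the checkpoint.
    ckpt_overhead = ckpt_hours / (interval + ckpt_hours)
    eff_time = work_hours / (1.0 - ckpt_overhead)         # time incl. overhead
    # Each failure costs a restart plus, on average, half an interval of rework.
    per_failure_loss = restart_hours + interval / 2.0
    failures = eff_time / system_mtbf
    return eff_time + failures * per_failure_loss


if __name__ == "__main__":
    # Hypothetical numbers: 4096 nodes, 10-year node MTBF,
    # 30-second diskless checkpoint, 2-minute restart, 100 h of work.
    t = expected_runtime(work_hours=100, n_nodes=4096,
                         node_mtbf_hours=10 * 365 * 24,
                         ckpt_hours=30 / 3600, restart_hours=2 / 60)
    print(f"expected wall-clock time: {t:.1f} h")
```

Even this crude model shows why fast in-memory checkpointing matters at scale: as the checkpoint cost shrinks, the optimal interval shortens and both the overhead and the rework lost per failure drop.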
Cited By
- Dou W and Li Y (2018). A fault-tolerant computing method for Xdraw parallel algorithm, The Journal of Supercomputing, 74:6, (2776-2800), Online publication date: 1-Jun-2018.
- Bouteiller A, Herault T, Bosilca G, Du P and Dongarra J (2015). Algorithm-Based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy, ACM Transactions on Parallel Computing, 1:2, (1-28), Online publication date: 18-Feb-2015.
- Jia Y, Bosilca G, Luszczek P and Dongarra J Parallel reduction to Hessenberg form with algorithm-based fault tolerance Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, (1-11)
- Du P, Bouteiller A, Bosilca G, Herault T and Dongarra J Algorithm-based fault tolerance for dense matrix factorizations Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, (225-234)
- Du P, Bouteiller A, Bosilca G, Herault T and Dongarra J (2012). Algorithm-based fault tolerance for dense matrix factorizations, ACM SIGPLAN Notices, 47:8, (225-234), Online publication date: 11-Sep-2012.
- Gomez L, Nicolae B, Maruyama N, Cappello F and Matsuoka S Scalable Reed-Solomon-based reliable local storage for HPC applications on iaas clouds Proceedings of the 18th international conference on Parallel Processing, (313-324)
- Davies T, Karlsson C, Liu H, Ding C and Chen Z High performance linpack benchmark Proceedings of the international conference on Supercomputing, (162-171)
- Bautista-Gomez L, Tsuboi S, Komatitsch D, Cappello F, Maruyama N and Matsuoka S FTI Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, (1-32)
- Salah K, Al-Shaikh R and Sindi M Towards green computing using diskless high performance clusters Proceedings of the 7th International Conference on Network and Services Management, (456-459)
- Gomez L, Maruyama N, Cappello F and Matsuoka S Distributed Diskless Checkpoint for Large Scale Systems Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, (63-72)
- Jin H, Sun X, Zheng Z, Lan Z and Xie B Performance under Failures of DAG-based Parallel Computing Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, (236-243)
- Bosilca G, Delmas R, Dongarra J and Langou J (2009). Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, 69:4, (410-416), Online publication date: 1-Apr-2009.
- Cappello F (2009). Fault Tolerance in Petascale/Exascale Systems, International Journal of High Performance Computing Applications, 23:3, (212-226), Online publication date: 1-Aug-2009.
- Tabatabaee V and Hollingsworth J Automatic software interference detection in parallel applications Proceedings of the 2007 ACM/IEEE conference on Supercomputing, (1-12)
- Wu M, Sun X and Jin H Performance under failures of high-end computing Proceedings of the 2007 ACM/IEEE conference on Supercomputing, (1-11)