Quasi-synchronous checkpointing and failure recovery in distributed systems

March 1998

Author:
D. Manivannan

Publisher:

Ohio State University
Computer and Information Science Dept. 2036 Neil Avenue Columbus, OH
United States

Order Number: UMI Order No. GAX98-01742

Abstract

Checkpointing and rollback recovery are widely used for achieving fault tolerance in distributed systems. When the state of a process is saved periodically, the saved states are called checkpoints of the process. A set of checkpoints, one from each process, is called a consistent global checkpoint if none of them causally happened before any other checkpoint in the set. In rollback recovery, processes roll back to a consistent global checkpoint when a failure occurs. Consistent global checkpoints of a distributed computation have applications not only in failure recovery but also in debugging distributed programs, output commit, monitoring distributed events, protocol specification and verification, and others.
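
The consistency condition above can be made concrete with a small sketch. It assumes each checkpoint is tagged with a vector clock, which is one common way to test the happened-before relation and is not necessarily how the thesis does it; the names `happened_before` and `is_consistent` are hypothetical.

```python
from itertools import permutations
from typing import Sequence, Tuple

# Assumption: each checkpoint is represented only by its vector clock.
VectorClock = Tuple[int, ...]


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if the event stamped `a` causally precedes the event stamped `b`."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def is_consistent(checkpoints: Sequence[VectorClock]) -> bool:
    """A set with one checkpoint per process is consistent if no member
    happened before any other member."""
    return not any(happened_before(a, b) for a, b in permutations(checkpoints, 2))


# Two processes. P0 checkpoints at (1, 0) and then sends a message;
# P1 checkpoints at (0, 1), receives that message, and checkpoints again at (2, 2).
print(is_consistent([(1, 0), (0, 1)]))  # concurrent checkpoints
print(is_consistent([(1, 0), (2, 2)]))  # P0's checkpoint precedes P1's second one
```

The second pair is rejected because P1's checkpoint records the receipt of a message whose send is not recorded in P0's checkpoint.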

When processes take checkpoints independently, some of the checkpoints may not be part of any consistent global checkpoint. In this thesis, we present a theoretical framework for identifying the checkpoints that can be used to construct consistent global checkpoints containing a target set of checkpoints. We illustrate the application of our results by presenting a simple and elegant algorithm for enumerating all consistent global checkpoints containing a target set of checkpoints.
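
To illustrate the enumeration problem itself, the following is a naive brute-force sketch, not the algorithm presented in the thesis. It assumes checkpoints are tagged with vector clocks and simply tests every combination that contains the target checkpoints; `enumerate_consistent` and its parameters are hypothetical names.

```python
from itertools import combinations, product
from typing import Dict, List, Sequence, Tuple

VectorClock = Tuple[int, ...]


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if the event stamped `a` causally precedes the event stamped `b`."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def enumerate_consistent(
    per_process: Sequence[Sequence[VectorClock]],
    target: Dict[int, VectorClock],
) -> List[Tuple[VectorClock, ...]]:
    """Return every consistent global checkpoint that contains `target`.

    `per_process[i]` lists the checkpoints of process i; `target` maps a
    process index to the checkpoint that must be used for that process.
    """
    # Processes named in the target contribute exactly one candidate.
    choices = [
        [target[i]] if i in target else list(cps)
        for i, cps in enumerate(per_process)
    ]
    result = []
    for combo in product(*choices):
        ordered = any(
            happened_before(a, b) or happened_before(b, a)
            for a, b in combinations(combo, 2)
        )
        if not ordered:
            result.append(combo)
    return result


# P0 checkpoints at (1, 0), sends a message at (2, 0), checkpoints at (3, 0).
# P1 checkpoints at (0, 1), receives the message at (2, 2), checkpoints at (2, 3).
history = [[(1, 0), (3, 0)], [(0, 1), (2, 3)]]
print(enumerate_consistent(history, {0: (1, 0)}))
print(enumerate_consistent(history, {1: (2, 3)}))
```

The search space grows as the product of the per-process checkpoint counts, which is why a framework that prunes unusable checkpoints up front matters in practice.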

We also present a characterization and classification of quasi-synchronous checkpointing algorithms, i.e., checkpointing algorithms that allow processes to take checkpoints independently and also force processes to take communication-induced checkpoints. The classification helps analyze the properties and limitations of such algorithms and also provides guidelines for designing and evaluating new checkpointing algorithms. This classification also sheds light on some important open problems.
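
The split between independent and communication-induced checkpoints can be sketched as follows. The abstract does not specify a mechanism, so this example assumes one familiar scheme in which each message piggybacks the sender's checkpoint sequence number and a receiver that is behind takes a forced checkpoint before delivering the message. The class `Process` and its fields are hypothetical.

```python
from typing import Dict, List, Tuple


class Process:
    """Toy process that takes basic and forced checkpoints (illustrative only)."""

    def __init__(self, pid: int) -> None:
        self.pid = pid
        self.sn = 0  # sequence number of the latest checkpoint
        self.checkpoints: List[Tuple[int, str]] = [(0, "initial")]
        self.delivered: List[str] = []

    def basic_checkpoint(self) -> None:
        """Checkpoint taken independently, e.g. on a local timer."""
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self, payload: str) -> Dict[str, object]:
        """Piggyback the current sequence number on every outgoing message."""
        return {"sn": self.sn, "payload": payload}

    def receive(self, message: Dict[str, object]) -> None:
        """Take a forced checkpoint first if the sender is ahead of us."""
        if message["sn"] > self.sn:
            self.sn = message["sn"]
            self.checkpoints.append((self.sn, "forced"))
        self.delivered.append(message["payload"])


p, q = Process(0), Process(1)
p.basic_checkpoint()          # p advances to sequence number 1 on its own
q.receive(p.send("hello"))    # q is behind, so it checkpoints before delivery
q.receive(p.send("again"))    # q has caught up, so no new checkpoint
print(q.checkpoints)
```

Processes stay autonomous for basic checkpoints, and the only coordination cost is one integer per message.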

Our classification of quasi-synchronous checkpointing algorithms helped us design a new low-overhead quasi-synchronous checkpointing algorithm that makes every checkpoint useful, in the sense that every checkpoint is part of a consistent global checkpoint. This property of the checkpointing algorithm is especially helpful for minimizing the rollback distance during failure recovery, because a failed process needs to roll back only to its latest checkpoint.

Based on the checkpointing algorithm, we also present an asynchronous recovery algorithm that can handle concurrent failures of multiple processes. Unlike existing algorithms, our recovery algorithm does not use vector timestamps to track dependencies. Moreover, it uses selective message logging to cope with the messages lost due to rollback.

Contributors

D. Manivannan
University of Kentucky

Recommendations

A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes checkpoint almost simultaneously. This causes ...

A low-overhead recovery technique using quasi-synchronous checkpointing
ICDCS '96: Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)

In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing them to take checkpoints asynchronously and uses communication-...

A fully informed model-based checkpointing protocol for preventing useless checkpoints

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, ...
