Quasi-synchronous checkpointing and failure recovery in distributed systems
Publisher:
  • Ohio State University
  • Computer and Information Science Dept. 2036 Neil Avenue Columbus, OH
  • United States
Order Number: UMI Order No. GAX98-01742
Abstract

Checkpointing and rollback recovery are widely used for achieving fault tolerance in distributed systems. When the state of a process is saved periodically, the saved states are called checkpoints of the process. A set of checkpoints, one from each process, is called a consistent global checkpoint if none of them causally happened before any other checkpoint in the set. In rollback recovery, processes roll back to a consistent global checkpoint when a failure occurs. Consistent global checkpoints of a distributed computation have applications not only in failure recovery but also in debugging distributed programs, output commit, monitoring distributed events, protocol specification and verification, and others.
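
The consistency condition above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each checkpoint is tagged with a vector timestamp, so that one checkpoint causally happened before another exactly when its vector is componentwise less than or equal to the other's and the two differ. The function names `happened_before` and `is_consistent` are hypothetical and are not taken from the thesis.

```python
from itertools import permutations

def happened_before(a, b):
    """True if vector timestamp a causally precedes vector timestamp b."""
    return a != b and all(x <= y for x, y in zip(a, b))

def is_consistent(checkpoints):
    """True if no checkpoint in the set happened before any other.

    checkpoints: one vector timestamp (a tuple) per process.
    """
    return not any(happened_before(a, b)
                   for a, b in permutations(checkpoints, 2))

# Illustrative two-process run (assumed, not from the thesis):
# P0 checkpoints at (1, 0) and then sends a message at (2, 0).
# P1 checkpoints at (0, 1), receives the message at (2, 2),
# and checkpoints again at (2, 3).
print(is_consistent([(1, 0), (0, 1)]))  # neither checkpoint precedes the other
print(is_consistent([(1, 0), (2, 3)]))  # P0's checkpoint precedes P1's
```

In the second pair, P1's checkpoint records the receipt of a message that P0's checkpoint does not record as sent, which is why the pair cannot belong to the same consistent global checkpoint.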

When processes take checkpoints independently, some of the checkpoints may not be part of any consistent global checkpoint. In this thesis, we present a theoretical framework for identifying the checkpoints that can be used to construct consistent global checkpoints containing a target set of checkpoints. We illustrate the application of our results by presenting a simple and elegant algorithm for enumerating all consistent global checkpoints containing a target set of checkpoints.
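
To make the enumeration problem concrete, the following is a brute-force sketch, not the algorithm developed in the thesis. It assumes each checkpoint carries a vector timestamp and simply tries every combination of one checkpoint per process, keeping the target checkpoints fixed. The names `enumerate_consistent`, `checkpoints`, and `target` are hypothetical.

```python
from itertools import combinations, product

def happened_before(a, b):
    """True if vector timestamp a causally precedes vector timestamp b."""
    return a != b and all(x <= y for x, y in zip(a, b))

def consistent(cut):
    """True if no two checkpoints in the cut are causally ordered."""
    return not any(happened_before(a, b) or happened_before(b, a)
                   for a, b in combinations(cut, 2))

def enumerate_consistent(checkpoints, target):
    """Yield every consistent global checkpoint containing the target set.

    checkpoints: list indexed by process; each entry is that process's
                 list of checkpoint vector timestamps.
    target: dict mapping process index -> required checkpoint index.
    Each result is a tuple with one checkpoint index per process.
    """
    choices = [[target[p]] if p in target else range(len(cps))
               for p, cps in enumerate(checkpoints)]
    for picks in product(*choices):
        cut = [checkpoints[p][i] for p, i in enumerate(picks)]
        if consistent(cut):
            yield picks

# Illustrative run (assumed): P0 sends a message between its two
# checkpoints; P1 receives it between its two checkpoints.
checkpoints = [[(1, 0), (3, 0)],   # P0
               [(0, 1), (2, 3)]]   # P1
print(list(enumerate_consistent(checkpoints, {1: 1})))
```

This exhaustive search is exponential in the number of processes and is only meant to show what is being enumerated; the framework described in the abstract is aimed at identifying the usable checkpoints without such a search.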

We also present a characterization and classification of quasi-synchronous checkpointing algorithms, i.e., checkpointing algorithms that allow processes to take checkpoints independently but also force processes to take communication-induced checkpoints. The classification helps analyze the properties and limitations of such algorithms and provides guidelines for designing and evaluating new checkpointing algorithms. It also sheds light on some important open problems.

Our classification of quasi-synchronous checkpointing algorithms helped us design a new low-overhead quasi-synchronous checkpointing algorithm that makes every checkpoint useful, in the sense that every checkpoint is part of a consistent global checkpoint. This property is especially helpful for minimizing the rollback distance during failure recovery, because a failed process needs to roll back only to its latest checkpoint.

Based on this checkpointing algorithm, we also present an asynchronous recovery algorithm that can handle concurrent failures of multiple processes. Unlike existing algorithms, our recovery algorithm does not use vector timestamps to track dependencies. Moreover, it uses selective message logging to cope with messages lost due to rollback.

Contributors
  • University of Kentucky
