Checkpointing and rollback recovery are widely used to achieve fault tolerance in distributed systems. When the state of a process is saved periodically, the saved states are called checkpoints of the process. A set of checkpoints, one from each process, is called a consistent global checkpoint if none of them causally happened before any other checkpoint in the set. In rollback recovery, processes roll back to a consistent global checkpoint when a failure occurs. Consistent global checkpoints of a distributed computation have applications not only in failure recovery but also in debugging distributed programs, output commit, monitoring distributed events, and protocol specification and verification, among others.
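As an illustration of the definition above, the following minimal sketch tests whether a set of checkpoints is consistent. It assumes, purely for illustration, that each checkpoint carries a vector clock, so that causal precedence can be checked component-wise; the names `happened_before` and `is_consistent` are hypothetical and are not taken from the thesis.

```python
from itertools import combinations
from typing import Sequence

VectorClock = Sequence[int]  # assumption: one vector clock per checkpoint


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if a causally precedes b: a <= b component-wise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and tuple(a) != tuple(b)


def is_consistent(checkpoints: Sequence[VectorClock]) -> bool:
    """True if no checkpoint in the set happened before another.

    `checkpoints` holds one checkpoint per process (illustrative layout).
    """
    return not any(
        happened_before(a, b) or happened_before(b, a)
        for a, b in combinations(checkpoints, 2)
    )


# Two processes. P0 checkpoints at (1, 0), sends a message at (2, 0),
# and checkpoints again at (3, 0). P1 receives the message at (2, 2)
# and checkpoints at (2, 3).
print(is_consistent([(3, 0), (2, 3)]))  # both checkpoints record the message
print(is_consistent([(1, 0), (2, 3)]))  # receive recorded, send not recorded
```

In the second call the message is recorded as received but not as sent, so P0's checkpoint causally precedes P1's and the pair cannot belong to the same consistent global checkpoint.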
When processes take checkpoints independently, some of the checkpoints may not be part of any consistent global checkpoint. In this thesis, we present a theoretical framework for identifying the checkpoints that can be used to construct consistent global checkpoints containing a target set of checkpoints. We illustrate the application of our results by presenting a simple and elegant algorithm for enumerating all consistent global checkpoints containing a target set of checkpoints.
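To make the enumeration problem concrete, the sketch below lists every consistent global checkpoint that contains a given target set by brute force. It is not the algorithm presented in the thesis: it simply tries every combination, and it assumes for illustration that checkpoints are tagged with vector clocks. The function name `enumerate_consistent` and the data layout are hypothetical.

```python
from itertools import combinations, product


def happened_before(a, b):
    """True if vector clock a causally precedes vector clock b."""
    return all(x <= y for x, y in zip(a, b)) and tuple(a) != tuple(b)


def enumerate_consistent(checkpoints_by_process, target):
    """Brute-force enumeration (illustrative only, exponential in general).

    checkpoints_by_process: list indexed by process id; each entry is that
        process's list of checkpoint vector clocks.
    target: dict mapping a process id to the index of the checkpoint that
        must appear in every result.
    Returns a list of tuples, one checkpoint index per process.
    """
    # A pinned process contributes exactly one choice; others contribute all.
    choices = [
        [target[p]] if p in target else range(len(cps))
        for p, cps in enumerate(checkpoints_by_process)
    ]
    results = []
    for combo in product(*choices):
        clocks = [checkpoints_by_process[p][k] for p, k in enumerate(combo)]
        if not any(
            happened_before(a, b) or happened_before(b, a)
            for a, b in combinations(clocks, 2)
        ):
            results.append(combo)
    return results


# P0 checkpoints at (1, 0) and (3, 0), with a send in between.
# P1 checkpoints at (0, 1) and, after receiving the message, at (2, 3).
history = [[(1, 0), (3, 0)], [(0, 1), (2, 3)]]
print(enumerate_consistent(history, {}))      # no target: all consistent sets
print(enumerate_consistent(history, {1: 1}))  # must contain P1's 2nd checkpoint
```

With P1's second checkpoint as the target, only P0's second checkpoint can complete the set, because P0's first checkpoint causally precedes the target.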
We also present a characterization and classification of quasi-synchronous checkpointing algorithms, i.e., checkpointing algorithms that allow processes to take checkpoints independently but also force them to take communication-induced checkpoints. The classification helps analyze the properties and limitations of such algorithms and provides guidelines for designing and evaluating new checkpointing algorithms. It also sheds light on some important open problems.
Our classification of quasi-synchronous checkpointing algorithms helped us design a new low-overhead quasi-synchronous checkpointing algorithm that makes every checkpoint useful, in the sense that every checkpoint is part of a consistent global checkpoint. This property of the checkpointing algorithm is especially helpful in minimizing the rollback distance during failure recovery, because a failed process needs to roll back only to its latest checkpoint.
Based on the checkpointing algorithm, we also present an asynchronous recovery algorithm that can handle concurrent failures of multiple processes. Unlike existing algorithms, our recovery algorithm does not use vector timestamps to track dependencies. Moreover, it uses selective message logging to cope with messages lost due to rollback.
Index Terms
- Quasi-synchronous checkpointing and failure recovery in distributed systems