This thesis deals with fault-tolerant schemes that combine checkpointing, to shorten recovery time after failures, with task duplication for fault detection. Until now there was no known analytical method for analyzing these schemes, and their performance was evaluated by simulation. The thesis introduces a new analysis technique for checkpointing schemes with task duplication, which gives an easy-to-use method for studying the performance of the schemes. Several applications of the analysis tool are given, such as finding the optimal interval between checkpoints and comparing different aspects of the performance of existing schemes.
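To make the idea of an optimal checkpoint interval concrete, the following is a minimal sketch under a deliberately simplified model; it is not the analysis technique developed in the thesis. It assumes a task of length `task_len` split into `n` equal intervals, two processors running the same task, independent faults arriving at rate `fault_rate` per processor, a fixed cost `ckpt_cost` per checkpoint, and a retry of the whole interval whenever the states compared at a checkpoint disagree. All function and parameter names are hypothetical.

```python
import math


def expected_time(task_len, n, ckpt_cost, fault_rate):
    """Expected completion time with n equal checkpoint intervals.

    Simplifying assumptions (not taken from the thesis): faults hit each of
    the two processors independently at `fault_rate`, no fault occurs during
    the checkpoint itself, and a mismatch at the checkpoint forces the whole
    interval to be re-executed.
    """
    interval = task_len / n
    # Probability that neither processor is hit during one interval.
    p_ok = math.exp(-2.0 * fault_rate * interval)
    # Each attempt costs interval + ckpt_cost; attempts are geometric in p_ok.
    return n * (interval + ckpt_cost) / p_ok


def best_num_intervals(task_len, ckpt_cost, fault_rate, max_n=1000):
    """Brute-force search for the interval count with the lowest expected time."""
    return min(
        range(1, max_n + 1),
        key=lambda n: expected_time(task_len, n, ckpt_cost, fault_rate),
    )


if __name__ == "__main__":
    n = best_num_intervals(task_len=100.0, ckpt_cost=1.0, fault_rate=0.01)
    print(n, expected_time(100.0, n, 1.0, 0.01))
```

Under these assumptions the trade-off is visible directly: more checkpoints add overhead of `n * ckpt_cost`, while fewer checkpoints make each re-execution longer and more likely.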
One of the conclusions we reached from studying the performance of existing schemes is that the system on which a scheme is implemented can have a major effect on the scheme's performance. The thesis describes new checkpointing schemes that use two types of checkpoints: compare checkpoints and store checkpoints. The two types of checkpoints can be used to tune the schemes to the system they run on and enable efficient use of the system's resources. Analysis results show that using two types of checkpoints can lead to a significant improvement in the performance of checkpointing schemes. Experimental results, obtained on the Intel Paragon parallel computer and on a cluster of workstations, confirm that tuning checkpointing schemes to the specific systems they run on can significantly improve their performance.
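The effect of separating the two checkpoint types can be illustrated with a small, hedged model; it is an assumption made for illustration and not one of the schemes defined in the thesis. Here a store checkpoint (cost `store_cost`) is taken once per store interval of length `store_len`, and the interval is divided into `m` sub-intervals, each ending with a compare checkpoint (cost `compare_cost`). A mismatch at any compare checkpoint rolls execution back to the last store checkpoint. All names are hypothetical.

```python
import math


def expected_store_interval_time(store_len, m, compare_cost, store_cost, fault_rate):
    """Expected time to finish one store interval with m compare checkpoints.

    Assumed model (not from the thesis): two processors, independent faults at
    `fault_rate` per processor, faults only during computation, detection at
    the next compare checkpoint, rollback to the start of the store interval.
    """
    sub_len = store_len / m
    # Probability that one sub-interval is fault-free on both processors.
    p_ok = math.exp(-2.0 * fault_rate * sub_len)
    # Expected number of sub-intervals executed per attempt: 1 + p + ... + p^(m-1).
    runs_per_attempt = sum(p_ok ** j for j in range(m))
    # An attempt succeeds with probability p_ok^m, so 1 / p_ok^m attempts are
    # expected; the store checkpoint is paid once, after the successful attempt.
    return (sub_len + compare_cost) * runs_per_attempt / p_ok ** m + store_cost


if __name__ == "__main__":
    for m in (1, 2, 4, 8):
        print(m, expected_store_interval_time(40.0, m, 0.1, 2.0, 0.01))
```

In this model, when compare checkpoints are cheap relative to store checkpoints, adding compare checkpoints between stores shortens the expected time because faults are detected earlier; on a system where comparison is the expensive operation, the balance would shift the other way.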
Another way to improve the performance of checkpointing schemes is to exploit changes in the checkpointing cost when deciding where to place checkpoints. A new on-line algorithm that uses past and present knowledge to decide whether or not to place a checkpoint is presented. Analysis of the new scheme shows that the total execution-time overhead with the proposed algorithm is significantly smaller than the overhead with fixed checkpoint intervals. Although the proposed on-line algorithm uses only knowledge about the past and present, its behavior is close to that of the optimal off-line algorithm, which has complete knowledge of the checkpointing cost at all possible locations.
Cited By
- Kim J and Kim B (2004). Probabilistic Schedulability Analysis of Harmonic Multi-Task Systems with Dual-Modular Temporal Redundancy, Real-Time Systems, 26:2, (199-222), Online publication date: 1-Mar-2004.
- Ziv A and Bruck J (1998). Analysis of Checkpointing Schemes with Task Duplication, IEEE Transactions on Computers, 47:2, (222-227), Online publication date: 1-Feb-1998.
- Ziv A and Bruck J (1997). Performance Optimization of Checkpointing Schemes with Task Duplication, IEEE Transactions on Computers, 46:12, (1381-1386), Online publication date: 1-Dec-1997.
- Ziv A and Bruck J (1997). An On-Line Algorithm for Checkpoint Placement, IEEE Transactions on Computers, 46:9, (976-985), Online publication date: 1-Sep-1997.
Index Terms
- Analysis and performance optimization of checkpointing schemes with task duplication