6.1.1 Simulation Setting.
The generated processes with dependencies are shown in Figure 8(a). According to Section 3, we partition these processes into a DAG of tasks, shown in Figure 8(b). The fault-free execution time of each task is marked beside its vertex. According to Section 4, we extract the critical path from the DAG and mark it in orange. The critical path consists of four tasks with fault-free times of 400, 300, 200, and 200. The overall deadline for completing all tasks is 3300, i.e., three times the total fault-free computation time (1100) of the tasks on the critical path.
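As a concrete illustration of the critical-path extraction referenced here, the following Python sketch computes the longest path of a task DAG by fault-free execution time. It assumes the DAG is given as a dictionary of task times and a list of precedence edges; the representation and the helper name critical_path are illustrative, and this generic longest-path computation is not necessarily the exact procedure of Section 4.

from collections import defaultdict

def critical_path(times, edges):
    # times: {task: fault-free execution time}; edges: (u, v) pairs meaning u precedes v
    succ = defaultdict(list)
    indeg = {t: 0 for t in times}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Kahn's algorithm for a topological order
    order, queue = [], [t for t in times if indeg[t] == 0]
    while queue:
        u = queue.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    # longest-path dynamic programming over the topological order
    dist = dict(times)                 # best total time of a path ending at each task
    prev = {t: None for t in times}
    for u in order:
        for v in succ[u]:
            if dist[u] + times[v] > dist[v]:
                dist[v] = dist[u] + times[v]
                prev[v] = u
    end = max(dist, key=dist.get)
    path, length = [], dist[end]
    while end is not None:
        path.append(end)
        end = prev[end]
    return path[::-1], length

Applied to Figure 8(b), such a computation would return the four orange tasks with a total fault-free time of 1100.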
For simplicity of the experimental setup and to directly illustrate the advantage of our model, the parameters introduced in Section 2.3.3 are set as follows. Some values reflect real-world experience, such as the checkpoint placement and recovery overheads, while others can be set arbitrarily without affecting the fairness of the comparison among strategies. We choose the same fault rate for every task, i.e., \(\lambda _0=\lambda _1=\lambda _2=\lambda _3=0.01\), which means one fault is expected every 100 units of time. The checkpoint placement overhead is \(t_c=4\). When a fault arrives, the system rolls back to an optional checkpoint with probability \(p=0.8\), and to a compulsory checkpoint (the initial state for \(J_1\)) with probability \(q=0.2\). The recovery overhead from a checkpoint is \(r=12\), and the recovery overhead from the initial state is \(s=20\).
To simulate the time interval between two faults, we use the inverse-transform equation in [27] for the exponential distribution, \(T = -\ln(U)/\lambda\), where \(U\) is a random value uniformly distributed between 0 and 1.
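A minimal Python sketch of this sampling step, assuming the standard inverse-transform form \(T = -\ln(U)/\lambda\); the constant and function names are illustrative.

import math
import random

LAMBDA = 0.01  # fault rate of every task in this setting

def next_fault_interval(lam=LAMBDA):
    # Inverse-transform sampling of an exponential inter-fault time:
    # T = -ln(U) / lambda, with U uniform on (0, 1).
    u = random.random()
    return -math.log(1.0 - u) / lam   # 1 - u avoids log(0); same distribution

With \(\lambda = 0.01\), the sampled intervals average 100 units of time, matching one expected fault every 100 units.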
According to Section 5, we obtain the optimal number of checkpoints for each task on the critical path, namely 13, 9, 6, and 6 checkpoints, respectively. According to Equations (14) and (24), we also obtain the corresponding intervals between consecutive checkpoints.
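To make the simulation procedure concrete, the sketch below runs one task of the critical path under a simplified reading of the model in Section 2.3.3: optional checkpoints are evenly spaced, a fault rolls the task back to its latest optional checkpoint with probability \(p\) (overhead \(r\)) or to the compulsory checkpoint at the task start with probability \(q\) (overhead \(s\)), and faults arriving during checkpointing or recovery are ignored for brevity. It is a sketch of the mechanism, not the exact simulator used for the tables below.

import math
import random

LAM, T_C, P, R, S = 0.01, 4, 0.8, 12, 20   # parameters of the simulation setting

def simulate_task(work, num_ckpts, lam=LAM):
    seg = work / (num_ckpts + 1)
    ends = [seg * (i + 1) for i in range(num_ckpts + 1)]   # segment end points
    elapsed, i, restart = 0.0, 0, 0          # restart: segment of the latest checkpoint
    next_fault = -math.log(1.0 - random.random()) / lam
    while i < len(ends):
        start = ends[i - 1] if i > 0 else 0.0
        step = ends[i] - start
        if elapsed + step <= next_fault:     # segment finishes before the next fault
            elapsed += step
            i += 1
            if i < len(ends):                # place an optional checkpoint, overhead t_c
                elapsed += T_C
                restart = i
        else:                                # a fault strikes mid-segment
            elapsed = next_fault
            if random.random() < P:          # roll back to the latest optional checkpoint
                i = restart
                elapsed += R
            else:                            # roll back to the compulsory checkpoint
                i = restart = 0
                elapsed += S
            next_fault = elapsed - math.log(1.0 - random.random()) / lam
    return elapsed

For example, simulate_task(400, 13) mirrors the first critical-path task with its 13 optional checkpoints; averaging over many runs approximates its expected execution time.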
6.1.2 Simulation Result.
We perform four simulations to evaluate our checkpoint placement strategy. The first simulation determines whether our model can prevent the domino effect and thus reduce the execution time. The second and third simulations show that our model optimizes the checkpoint interval and the number of checkpoints, respectively. The fourth simulation shows that our model performs well across a wide range of scales.
Simulation 1: Domino Effect Prevention. The processes with dependencies in Figure 9(a) suffer from the domino effect if checkpoints are set arbitrarily, for example, the blue checkpoints. In contrast, setting checkpoints based on our strategy prevents the system from rolling back to the initial state whenever a fault happens, so the system's execution time is shortened. We simulate the critical path 100,000 times with four strategies: (a) no checkpoints: no checkpoints are placed in the system; (b) only compulsory checkpoints: no optional checkpoints are placed; (c) only optional checkpoints: no compulsory checkpoints are placed; and (d) optimal checkpoints: both compulsory and optional checkpoints are placed (the proposed strategy). Among them, (a) and (c) are affected by the domino effect.
The results in Table 2 indicate that the domino effect leads to a large execution time. There are two observations: (i) the average execution time of strategy (a) is about 750 times longer than that of strategy (b), and (ii) the average execution time of strategy (b) is about 4 times longer than that of strategy (d). The reason behind this noteworthy difference is the compulsory checkpoints, which free the system from the domino effect. Under (a), the system rolls back to the initial state when a fault occurs, whereas under (b), the system only rolls back to the nearest compulsory checkpoint, so a large amount of useful work is saved. Although the system can roll back to an optional checkpoint when a fault occurs under strategy (c), there is also a probability \(q\) that the optional checkpoint is not valid; in that case, the system has to roll back to the initial state, which leads to a larger execution time for strategy (c). The domino effect is also avoided in strategy (d). This analysis highlights the importance of compulsory checkpoints, which prevent the domino effect, reduce the execution time, and increase the percentage of processes finishing on time.
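The difference among the four strategies can be summarized by where a fault forces the system to restart. The following sketch captures that choice under our reading of the discussion above; the state variables (the nearest compulsory checkpoint and the latest optional checkpoint, both measured along the critical path) and the strategy labels are illustrative.

import random

def rollback_target(strategy, task_start, opt_ckpt, p=0.8):
    # task_start: nearest compulsory checkpoint (a task boundary);
    # opt_ckpt: latest optional checkpoint, or None if none has been placed yet.
    if strategy == "none":                # (a) no checkpoints: domino back to the start
        return 0.0
    if strategy == "compulsory_only":     # (b) roll back to the nearest task boundary
        return task_start
    if strategy == "optional_only":       # (c) optional checkpoint valid with prob. p,
        return opt_ckpt if opt_ckpt is not None and random.random() < p else 0.0
    if strategy == "optimal":             # (d) optional with prob. p, else the task boundary
        return opt_ckpt if opt_ckpt is not None and random.random() < p else task_start
    raise ValueError(strategy)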
Simulation 2: Performance Regarding Checkpoint Interval. The second simulation compares our checkpoint placement strategy with five other strategies on the critical path with respect to the checkpoint interval. We consider six checkpoint placement strategies that use the same number of checkpoints but differ in the checkpoint intervals: (a) the optimal placement strategy obtained from Section 5; (b) the two-state strategy [36], a strong prior work that sets checkpoints in two stages, where the first stage delays checkpoints as much as possible to avoid checkpointing overhead; (c) uniformI (I stands for intervals), which places checkpoints based on a uniform distribution; (d) the Gauss placement strategy, which draws intervals from a Gaussian distribution with mean \(I/2\) and standard deviation \(I/4\); (e) the narrowing placement strategy, which gradually narrows the interval between two checkpoints; and (f) the widening placement strategy, which gradually widens the interval between two checkpoints. Strategy (e) follows a simple rule: the \((i+1)\)-th checkpoint in a task is placed at the first third of the interval between the \(i\)-th checkpoint and the end of the task; strategy (f) is the reverse of strategy (e). We simulate the critical path process 100,000 times and list the results in Table 3.
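The baseline interval patterns (c) through (f) can be generated as follows. This is a hedged sketch: it reads uniformI as evenly spaced checkpoints, rescales the Gaussian gaps so that all checkpoints fit inside the task, and mirrors the narrowing rule to obtain the widening one; these details are assumptions where the text leaves them open.

import random

def uniform_positions(T, m):
    # (c) uniformI: m checkpoints splitting a task of length T into m + 1 equal segments
    return [T * (i + 1) / (m + 1) for i in range(m)]

def gauss_positions(T, m):
    # (d) gaps drawn from a Gaussian with mean I/2 and std I/4, where I = T/(m+1);
    # gaps are clipped at zero and rescaled so the checkpoints span the task
    I = T / (m + 1)
    gaps = [max(random.gauss(I / 2, I / 4), 0.0) for _ in range(m + 1)]
    scale = T / sum(gaps)
    positions, acc = [], 0.0
    for g in gaps[:-1]:
        acc += g * scale
        positions.append(acc)
    return positions

def narrowing_positions(T, m):
    # (e) the (i+1)-th checkpoint sits at the first third of the span between the
    # i-th checkpoint and the end of the task, so intervals shrink toward the end
    positions, here = [], 0.0
    for _ in range(m):
        here += (T - here) / 3.0
        positions.append(here)
    return positions

def widening_positions(T, m):
    # (f) mirror image of (e): intervals between checkpoints widen instead
    return [T - x for x in reversed(narrowing_positions(T, m))]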
The results show that our model optimizes the interval between checkpoints. Our optimal strategy (a) has the shortest average execution time and the highest percentage of processes finishing on time. Strategies (b), (c), and (d) have shorter average execution times than strategies (e) and (f), but still longer than strategy (a). The prior work (b) yields a rate of meeting deadlines competitive with the proposed strategy, but it behaves worse than the proposed strategy and the baseline strategies (c) and (d) in terms of average and minimum execution time. This is because of its concentrated distribution of execution times: the strategy reduces the maximum execution time and increases the percentage of meeting deadlines, but it also increases the minimum execution time and thus the average execution time. Strategy (c) places the checkpoints uniformly on each task, making its result worse than, but close to, our model's. The performance of strategy (d) depends on the chosen mean and standard deviation, and in this case it performs better than strategy (c). Note that the maximum execution time of strategy (d) being lower than those of the other strategies is due to randomness, i.e., fewer faults happen during some runs of strategy (d). Our model cannot guarantee the best result in every single run, but it promises a better average result as runs accumulate.
Simulation 3: Performance Regarding Checkpoint Number. The third simulation compares our checkpoint placement strategy with four other strategies on the critical path with respect to the number of checkpoints. We consider five checkpoint placement strategies in this simulation: (a) the optimal placement strategy obtained from Section 5; (b) MelhemInt, the algorithm for determining the number of checkpoints used in many prior works [4, 30]; (c) uniformM (M stands for the number of checkpoints \(m\)), which places the same number of checkpoints in each task such that the total number of checkpoints is close to that in (a); (d) the light-weight placement strategy, which places fewer checkpoints than (a), with the number of checkpoints in each task proportional to the task's computation time; and (e) the heavy-weight placement strategy, which places more checkpoints than (a), again with the number of checkpoints in each task proportional to the task's computation time. All strategies determine the checkpoint intervals in the same way, i.e., using our model. We simulate the critical path process 100,000 times, and the results are listed in Table 4.
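The count allocations used by the uniformM, light-weight, and heavy-weight baselines can be sketched as follows. The 0.75 and 1.25 scaling factors for fewer/more checkpoints and the largest-remainder rounding are illustrative assumptions; the text only fixes the counts as proportional to task computation time.

def proportional_counts(task_times, total_ckpts):
    # Split total_ckpts among tasks in proportion to their fault-free times,
    # rounding with a largest-remainder rule so the requested total is preserved.
    total_time = sum(task_times)
    raw = [total_ckpts * t / total_time for t in task_times]
    counts = [int(x) for x in raw]
    for i in sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True):
        if sum(counts) == total_ckpts:
            break
        counts[i] += 1
    return counts

task_times = [400, 300, 200, 200]          # critical path of Figure 8(b)
optimal    = [13, 9, 6, 6]                 # counts from Section 5
uniformM   = [round(sum(optimal) / len(task_times))] * len(task_times)
light      = proportional_counts(task_times, int(sum(optimal) * 0.75))  # fewer than (a)
heavy      = proportional_counts(task_times, int(sum(optimal) * 1.25))  # more than (a)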
The results show that our model optimizes the number of checkpoints. The other four strategies slightly change the number of checkpoints per task, and none of them performs better than our model. Our strategy reduces the average execution time and increases the percentage of processes completed on time. The prior work (b) and the light-weight baseline (d) place too few checkpoints, wasting useful work and leading to longer execution times. On the other hand, the heavy-weight placement strategy places too many checkpoints, whose overheads add to the execution time. Only the proposed strategy balances the trade-off between wasted useful work and checkpointing overhead. Note that the light-weight strategy's minimum execution time is smaller than our model's because it places fewer checkpoints. The scale, i.e., the size of the DAG and the length of the critical path, is small in this simulation, which is why the improvement of the proposed model over the other strategies is modest; the improvement grows in Simulation 4.
Simulation 4: Performance Regarding Scalability. The fourth simulation shows the scalability of our checkpoint placement strategy. First, we gradually increase the number of processes and their lengths; the execution time of each task is chosen uniformly at random from 50 to 650 units of time. Second, we randomly generate dependencies between tasks: for each task, there is a 0.4 probability that no message is sent out, a 0.5 probability that one message is sent to another process, and a 0.1 probability that two messages are sent to other processes. By selecting and adjusting the number and length of processes, the scale of the DAG can be controlled within the expected range. Then, according to Section 3, we partition these processes into a DAG of tasks. Finally, according to Section 4, we extract the critical path of the DAG. The parameters \(\lambda, t_c, p, q, r, s\) take the same values as above, and the deadline for finishing all tasks is again three times the fault-free computation time of the tasks on the critical path. The scale and critical path details are shown in Tables 5 and 6, respectively.
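A sketch of the random instance generation described above. It assumes each message goes to a uniformly chosen later task so that the resulting graph stays acyclic; that choice, and the function name, are illustrative.

import random

def generate_instance(num_tasks, t_min=50, t_max=650):
    # Task lengths are uniform in [t_min, t_max]; each task sends 0 messages with
    # probability 0.4, 1 with probability 0.5, and 2 with probability 0.1.
    times = [random.randint(t_min, t_max) for _ in range(num_tasks)]
    edges = []
    for u in range(num_tasks - 1):
        roll = random.random()
        n_msgs = 0 if roll < 0.4 else (1 if roll < 0.9 else 2)
        later = range(u + 1, num_tasks)
        for v in random.sample(later, min(n_msgs, len(later))):
            edges.append((u, v))
    return times, edges

The resulting (times, edges) pair can then be fed to a critical-path computation such as the earlier sketch, e.g., critical_path(dict(enumerate(times)), edges), before applying the checkpoint model.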
The optimal checkpoint numbers are calculated by the proposed model. We simulate the critical path process 100,000 times and list the results in Table 7. We also simulate the two best baselines from the previous simulations for comparison: the TwoState strategy from Simulation 2 and the light-weight strategy from Simulation 3. In addition, we simulate the strategy that uses compulsory checkpoints only.
The results show that our model remains effective across scales and again outperforms the other strategies. We notice that as the critical path becomes longer, the average execution time of all four strategies increases. The strategy that places only compulsory checkpoints costs the system over 10 times more execution time than the other three strategies, and its percentage of meeting the deadline remains 0 at all scales. This unacceptable result is due to repeated work without proper checkpointing. The TwoState strategy, which was competitive with our approach in Simulation 2, performs unacceptably in both execution time and deadline-meeting rate at all scales. As the scale grows, its performance degrades significantly because delaying the first checkpoint leads to a long rollback on the first fault. Moreover, despite its more concentrated distribution of execution times, the TwoState strategy has longer average, minimum, and maximum execution times, and the gap widens as the scale grows. The light-weight placement strategy achieves a competitive result but still fails to surpass our model in either average execution time or the percentage of meeting the deadline; given the relatively low fault rate, the checkpointing overhead accounts for the gap between the two. Another noteworthy observation is that the uniform distribution strategy's percentage of meeting the deadline decreases as the scale becomes large, while our model and the light-weight strategy show the opposite trend. This is because (i) our model performs better as the scale grows and the random data become stable, (ii) the light-weight strategy's performance tracks our model's, since it places a fixed proportion of fewer checkpoints, and (iii) the absolute extra execution time of the uniform strategy grows as the scale expands, so its percentage of meeting the deadline drops.