ARTICLE IN PRESS
J. Parallel Distrib. Comput. 64 (2004) 649–661
Concurrent checkpoint initiation and recovery algorithms on
asynchronous ring networks
Partha Sarathi Mandal and Krishnendu Mukhopadhyaya*
Advanced Computing and Microelectronics Unit, Indian Statistical Institute, 203 B. T. Road, Kolkata 700108, India
Received 10 February 2003; revised 9 February 2004
Abstract
Checkpointing with rollback recovery is a well-known method for achieving fault-tolerance in distributed systems. In this work,
we introduce algorithms for checkpointing and rollback recovery on asynchronous unidirectional and bi-directional ring networks.
The proposed checkpointing algorithms can handle multiple concurrent initiations by different processes. While taking checkpoints,
processes do not have to take into consideration any application message dependency. The synchronization is achieved by passing
control messages among the processes. Application messages are acknowledged. Each process maintains a list of unacknowledged
messages. Here we use a logical checkpoint, which is a standard checkpoint (i.e., snapshot of the process) plus a list of messages that
have been sent by this process but are unacknowledged at the time of taking the checkpoint. The worst case message complexity of
the proposed checkpointing algorithm is O(kn) when k initiators initiate concurrently. The time complexity is O(n). For the recovery
algorithm, time and message complexities are both O(n).
© 2004 Elsevier Inc. All rights reserved.
Keywords: Distributed system; Ring network; Coordinated checkpointing; Rollback recovery; Logical checkpoint
1. Introduction
Checkpointing is an important feature in distributed
computing. It gives fault tolerance without requiring
additional efforts from the programmer. A checkpoint is
a snapshot of the current state of a process. It saves
enough information in non-volatile stable storage such
that, if the contents of the volatile storage are lost due to
process failure, one can reconstruct the process state
from the information saved in the non-volatile stable
storage. If the processes communicate with each other
through messages, rolling back a process may cause
some inconsistencies. In the time since its last checkpoint, a process may have sent some messages. If it is
rolled back and restarted from the point of its last
checkpoint, it may create orphan messages, i.e., messages
whose receive events are recorded in the states of the
destination processes but the send events are lost.
*Corresponding author.
E-mail addresses: partha_r@isical.ac.in (P.S. Mandal),
krishnendu@isical.ac.in (K. Mukhopadhyaya).
0743-7315/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.jpdc.2004.03.013
Similarly, messages received during the rolled back
period may also cause problems. Their sending processes
will have no idea that these messages are to be sent
again. Such messages, whose send events are recorded in
the state of the sender process but the receive events are
lost, are called missing messages. In Fig. 1, if process P3 is
rolled back to its last checkpoint, then m3 would be an
orphan message. Similarly, if P0 is rolled back to its last
checkpoint then m1 would be a missing message.
A set of checkpoints, with one checkpoint for every
process, is said to be a consistent global checkpointing state
(CGS) if it does not contain any orphan message or
missing message. However, generation of missing
messages may be acceptable if messages are logged by
the sender.
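The definitions of orphan and missing messages can be made concrete with a small sketch. The Python fragment below is illustrative only: the Message record, the classify and is_cgs helpers, and the use of local event times are assumptions introduced here and are not part of the proposed algorithms. A message is an orphan when its receive event is recorded but its send event is not, and missing in the opposite case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    name: str
    sender: int
    receiver: int
    send_time: float  # local time of the send event at the sender
    recv_time: float  # local time of the receive event at the receiver


def classify(msg, ckpt_time):
    """ckpt_time[p] is the local time at which process p took its checkpoint."""
    send_recorded = msg.send_time <= ckpt_time[msg.sender]
    recv_recorded = msg.recv_time <= ckpt_time[msg.receiver]
    if recv_recorded and not send_recorded:
        return "orphan"
    if send_recorded and not recv_recorded:
        return "missing"
    return "consistent"


def is_cgs(messages, ckpt_time):
    """A set of checkpoints is a CGS if no message is orphan or missing."""
    return all(classify(m, ckpt_time) == "consistent" for m in messages)
```

With hypothetical checkpoint times {0: 5, 1: 5}, a message sent at local time 6 and received at local time 4 is classified as an orphan, so the set is not a CGS.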
In a distributed system, each process has to take
checkpoints periodically on non-volatile stable storage.
In case of a failure, the system rolls back to a consistent
set of checkpoints. If all the processes take checkpoints
at the same time instant, the set of checkpoints would be
consistent. But since globally synchronized clocks are
very difficult to implement, processes may take
checkpoints within an interval. In order to achieve
synchronization, sometimes processes take temporary
checkpoints. They are made permanent, when all
processes agree. The time from the checkpointing
initiation to the time of the last process taking its
checkpoint (may be temporary) is called the total
checkpointing time (TCT) (Fig. 2). The time interval
from the initiation to completion of the checkpointing
process, when all checkpoints are taken and made
permanent, is called checkpointing latency (Fig. 2).
Fig. 1. An example showing orphan message (m3) and missing
message (m1).
Fig. 2. An example showing the checkpointing for a single initiator
(P3).
Checkpointing algorithms may be classified into two
broad categories: (a) coordinated and (b) uncoordinated.
In uncoordinated checkpointing [1,11,17] each process
takes checkpoints independently, without bothering
about other processes. In case of a failure, after
recovery, a CGS is found among the existing checkpoints
and the system restarts from there. Here, finding
a CGS is quite tricky. The choice of checkpoints for the
different processes is influenced by their mutual causal
dependencies. The common approach is to use rollback-dependent
graph or checkpoint graph [1,2,4,13,20]. In
coordinated checkpointing [2,3,7–9,11,13,19], all processes
have to synchronize through control messages
before taking checkpoints. These synchronization messages
contribute to extra overhead. On the other hand,
in uncoordinated checkpointing some of the checkpoints
taken may not lie on any CGS. Such checkpoints are
called useless checkpoints. Useless checkpoints degrade
system performance. Unlike uncoordinated checkpointing,
coordinated checkpointing does not generate useless
checkpoints.
Coordinated checkpointing algorithms are of two
types: (a) blocking [6,7] and (b) non-blocking [2,3,13,14].
Blocking algorithms force all relevant processes in the
system to block their computation during checkpointing
latency and hence degrade system performance. In non-blocking
algorithms application processes are not
blocked when checkpoints are being taken. Some of
the main issues addressed in the literature are reducing
the number of checkpoints, minimizing the complexity
for synchronization, minimizing roll back, etc.
2. Earlier works
Several earlier works [2,3] on snapshot collection
algorithms assume that at any point of time only one
snapshot collection process is active. Koo and Toueg [7],
Spezialetti and Kearns [16] and Prakash and Singhal [12]
have proposed methods for handling concurrent initiations of snapshot collection.
According to Koo and Toueg's (K–T) algorithm,
once a process takes a local checkpoint, either as an
initiator or on request from another process, it becomes
unwilling to take a checkpoint in response to another
initiator's request. The process sends an 'unwilling'
response to all subsequent requests until the checkpoint it has taken is made permanent or the checkpoint collection is aborted. This algorithm is blocking.
Prakash and Singhal have shown that in this algorithm
all the initiations may end up aborting, leading to a
wastage of effort [12].
The Spezialetti and Kearns (S–K) algorithm [16] forces all
processes to take local checkpoints, similar to the Chandy–
Lamport checkpoint collection algorithm [3]. A process
takes a local checkpoint for the first request and forwards
that request to its neighbors. All subsequent requests are
collected in a list called the border list. Once a process has
received requests along all its incident edges, its
checkpointing phase is complete. Then the process sends
its border list to the process from which it received the
first checkpoint request message. In this way, mutually
disjoint sets of processes take their local checkpoints in
response to requests from different initiators. Finally,
initiators communicate with each other and one
checkpoint for each process is selected to build a CGS,
which is minimal [10].
On the other hand, the Prakash and Singhal (P–S)
algorithm [12] generates a CGS, which is maximal [10].
Unlike the S–K algorithm, the P–S algorithm permits
full propagation of checkpoint requests generated by
all the concurrent checkpoint initiations. Thus the P–S
algorithm outputs a CGS with more recent checkpoints
than the S–K algorithm. The S–K algorithm requires the
transmission of O(n^2) messages to take the local
checkpoints and O(m^2 n) messages for the information
dissemination phase with message size O(n/m). For m
concurrent initiators the P–S algorithm requires O(mn^2)
messages to take tentative checkpoints. Another O(m^2 n)
messages of size O(n) are exchanged for establishing a
CGS. Although the number of messages required by the
P–S algorithm is higher, as they are sent concurrently, the
time to collect tentative checkpoints is comparable with
the S–K algorithm.
Suppose n processes are running together. After one
or more processes fail, the system recovers by rolling
back to a CGS. Many recovery algorithms on distributed computing systems have been proposed in the
literature [4,5,15,17,18]. The worst case message complexity
of Strom and Yemini's algorithm [17] is O(2^n) for a
single process failure. Sistla and Welch have proposed
an algorithm [15] which requires O(n^2) messages to be
exchanged with O(n) information appended to each
application message and O(n^3) message exchanges when
O(1) extra information is appended to each message.
Juang and Venkatesan proposed an algorithm [18]
which appends O(1) extra information to each message.
When an arbitrary number of processes fail, O(n^2) messages are sufficient for general networks and for ring
networks O(n) messages are necessary and sufficient for
recovery. They also proposed an algorithm [18] on
general networks where O(n) additional information is
appended to each application message; O(kn) messages
are required for rolling back all of the processes to a
CGS when k processes fail.
In our proposed asynchronous coordinated checkpointing algorithm, processes take checkpoints independent of the message pattern. It allows any set of processes in
the system to initiate checkpointing. A process need not
consider causal dependency generated by the application
messages. Due to the special nature of the ring network,
the scheme does not need to trace dependence at the
time of roll back to get a CGS.
3. System model
We consider a distributed system consisting of n
processes on a ring network. Processes are numbered
P0, P1, P2, ..., Pn−1 sequentially, in the clockwise direction. In case of the unidirectional ring, the direction of
the ring is also assumed to be clockwise. There is no
common clock or common memory. Message passing is
the only mode for communication between any pair of
processes. Two types of messages are generated by the
processes: application messages for the underlying distributed application and control messages to facilitate
checkpointing and roll back of the system. In the
unidirectional ring the ith process can directly send a
message to the jth process if and only if j = (i + 1) mod n.
In the bi-directional ring the ith process can directly send a
message to the jth process if and only if j = (i ± 1) mod n.
The communication channel is assumed to be FIFO. In
a link as well as in an intermediate node, a message arriving
later does not leave before an earlier message. We
assume that there is no link failure; only processes may
fail. The computation is asynchronous, i.e., each process
runs with its own speed; messages are exchanged with
finite but arbitrary delays. Application messages are
acknowledgement based, i.e., for each message an
acknowledgement is required (this is to let the sender
know that the receiver is alive).
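The neighbor relation of the two ring models can be written down directly. The helper names below (neighbours, can_send_directly) are hypothetical and serve only to illustrate the conditions j = (i + 1) mod n and j = (i ± 1) mod n.

```python
def neighbours(i, n, bidirectional=False):
    """Processes that P_i can send to directly on a ring of n processes."""
    clockwise = (i + 1) % n
    if not bidirectional:
        return [clockwise]
    return [clockwise, (i - 1) % n]


def can_send_directly(i, j, n, bidirectional=False):
    return j in neighbours(i, n, bidirectional)
```

For example, on a ring of five processes P0 can reach P4 directly only in the bi-directional model.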
In this paper, we consider logical checkpoints
[11,19,21], which are slightly different from standard
checkpoints. A logical checkpoint is a standard checkpoint (i.e., snapshot of the process) plus a list of
messages which have been sent by this process but are
unacknowledged at the time of taking the checkpoint.
Message lists are updated continuously. After getting
an acknowledgement for a message, the message is deleted from the
list. Our algorithm allows the generation of missing
messages in case the system has to roll back to its last
checkpoint. At the time of restart after a failure,
processes retransmit their unacknowledged messages
(not all of which may be missing messages). So there
may be duplicate messages after recovery from a failure,
and these have to be handled using message identifiers.
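A minimal sketch of a logical checkpoint follows, assuming an in-memory process object. The class Process, its fields and the tuple layout of a message are hypothetical; the paper does not prescribe a data structure. The sketch shows the three ingredients described above: the list of unacknowledged messages, its inclusion in the checkpoint, and duplicate suppression by message identifier at the receiver.

```python
import copy


class Process:
    def __init__(self, pid):
        self.pid = pid
        self.state = {}         # application state (placeholder)
        self.unacked = {}       # msg_id -> (dest, payload), in send order
        self.next_id = 0
        self.delivered = set()  # (sender, msg_id) pairs already accepted

    def send(self, dest, payload):
        msg_id = self.next_id
        self.next_id += 1
        self.unacked[msg_id] = (dest, payload)
        return (self.pid, msg_id, dest, payload)

    def on_ack(self, msg_id):
        # An acknowledged message leaves the list.
        self.unacked.pop(msg_id, None)

    def logical_checkpoint(self):
        # Snapshot of the process plus its unacknowledged messages.
        return copy.deepcopy({"state": self.state, "unacked": self.unacked,
                              "next_id": self.next_id,
                              "delivered": self.delivered})

    def restart_from(self, ckpt):
        saved = copy.deepcopy(ckpt)
        self.state = saved["state"]
        self.unacked = saved["unacked"]
        self.next_id = saved["next_id"]
        self.delivered = saved["delivered"]
        # Retransmit, in the original order, everything that was unacknowledged.
        return [(self.pid, msg_id, dest, payload)
                for msg_id, (dest, payload) in self.unacked.items()]

    def deliver(self, sender, msg_id, payload):
        key = (sender, msg_id)
        if key in self.delivered:
            return False        # duplicate after recovery, ignored
        self.delivered.add(key)
        self.state[key] = payload
        return True
```

After a restart the sender may retransmit a message that the receiver has already processed; the identifier check in deliver makes the retransmission harmless.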
In our algorithm, for each process, at most two
checkpoints may have to be stored in the stable storage
when the checkpointing procedure is running; otherwise one
checkpoint per process is enough to make a system
consistent. Checkpoints have a one-bit version number
(v_no). In the beginning all processes start by taking a
permanent checkpoint with v_no = 0.
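The storage discipline described above (one permanent checkpoint, at most one temporary checkpoint, and a one-bit v_no that alternates) might be sketched as follows. The class CheckpointStore and its method names are assumptions made for illustration.

```python
class CheckpointStore:
    def __init__(self, initial_snapshot):
        # Every process starts with a permanent checkpoint with v_no = 0.
        self.permanent = (0, initial_snapshot)
        self.temporary = None

    def take_temporary(self, snapshot):
        if self.temporary is not None:
            raise RuntimeError("a temporary checkpoint already exists")
        # The new v_no is the complement of the existing one.
        self.temporary = (1 - self.permanent[0], snapshot)

    def make_permanent(self):
        # The old permanent checkpoint is dropped when the new one is promoted.
        if self.temporary is not None:
            self.permanent, self.temporary = self.temporary, None

    def discard_temporary(self):
        self.temporary = None

    def stored(self):
        """Number of checkpoints currently held in stable storage (1 or 2)."""
        return 1 + (self.temporary is not None)
```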
4. Checkpointing algorithm for unidirectional ring
network
A process has complete freedom to take a decision
about checkpointing initiation, provided it does not
have any temporary checkpoint. Any subset of the n
processes may initiate checkpointing independently.
A process initiates checkpointing after a certain time period. It takes a logical checkpoint itself and sends a
checkpoint request message (ckpt_req) to the next
process, with its own id as the initiator. The new
checkpoint is marked temporary and stored in the stable
storage, and the process sets its initiator_flag_i = True (initially it was
False). If the v_no of the existing checkpoint is 0 (1) then
the v_no of the new checkpoint is 1 (0). On getting a
ckpt_req message, a process checks whether it has taken
any temporary checkpoint or not; if not, then it takes a
logical checkpoint with the received initiator id as the
initiator_id of the checkpoint and then forwards ckpt_req
to the next process. Within a TCT, each non-initiator
process takes at most one temporary checkpoint. The
checkpoint is taken in response to the first ckpt_req
message; other ckpt_req messages are either forwarded or
discarded depending on the id value of the initiator of the
request with respect to the id of the receiver. Each
forwarded ckpt_req message always contains the initiator_id. If
the receiver id is less than the initiator_id and the receiver has
already taken a temporary checkpoint, then that ckpt_req
is discarded provided the receiver's initiator_flag_j is True.
When a process knows that all other n − 1 processes
have already taken their temporary checkpoints to the
stable storage, it takes a permanent
logical checkpoint directly or, if it holds a temporary
checkpoint, changes the state of that checkpoint to
permanent. Then it deletes its old permanent checkpoint
and sends ack_message to the next process, with its own id
as the ack_message generator. On receiving ack_message,
a process changes its temporary checkpoint to
permanent, deletes its previous permanent checkpoint
and discards or forwards ack_message to the next
process depending on whether the receiver (id = i) is
the immediate predecessor of the ack_message initiator
(ack_initiator_id = j), i.e., whether i = (j − 1) mod n or not,
respectively. When all processes have made their
checkpoints with the new v_no permanent, that set
of checkpoints is a CGS. So in our algorithm
consistent global checkpointing means each process has
a permanent checkpoint in its own stable storage with the
same v_no over all processes.
Algorithm. Unidirectional Checkpointing Initiator_i
/* This algorithm is executed by process Pi when Pi
decides to initiate a checkpointing */
begin
  if there is no temporary checkpoint then
    take a new temporary checkpoint with new v_no
      /* v_no = 0/1 if the existing checkpoint v_no = 1/0 */
    set initiator_flag_i ← True
      /* process Pi is initiating a checkpoint; initially it was False */
    set initiator_id ← i
      /* initiator_id is attached to the checkpoint request;
         it denotes the id of the initiator */
    send a ckpt_req to the next process with initiator_id
  end if
end
Algorithm. ckpt_req receiver_j
/* This algorithm is executed by process Pj when it
receives a ckpt_req */
begin
  if temporary checkpoint exists then
    if j is the immediate predecessor of the rec_initiator_id then
      /* rec_initiator_id is the initiator_id of the ckpt_req generator */
      delete the existing permanent checkpoint
      make the existing temporary checkpoint permanent
      set ack_initiator_id ← j
      generate and send ack_message to the next process with ack_initiator_id
      /* ack_initiator_id is the id of the ack_message generator */
    else if j > rec_initiator_id then
      forward the ckpt_req to the next process
    else /* j < rec_initiator_id */
      if initiator_flag_j is False then /* process has not initiated any checkpointing */
        forward the ckpt_req to the next process
      else /* initiator_flag_j is True */
        discard the ckpt_req
      end if
    end if
  else /* permanent checkpoint exists, but no temporary checkpoint */
    if j is the immediate predecessor of the rec_initiator_id then
      delete old permanent checkpoint
      take a new permanent checkpoint with new v_no
      set ack_initiator_id ← j
      generate and send ack_message to the next process with ack_initiator_id
    else /* process j is not the immediate predecessor of the rec_initiator_id */
      take a new temporary checkpoint with new v_no
      forward the ckpt_req to the next process
    end if
  end if
end
Algorithm. ack_message receiver_j
/* This algorithm is executed by process Pj when it
receives an ack_message */
begin
  if j is not the immediate predecessor of the initiator of
  this acknowledgement message then
    if temporary checkpoint exists then
      delete the permanent checkpoint
      make the existing temporary checkpoint permanent
    end if
    forward ack_message to the next process
  else /* process j is the immediate predecessor of the ack_initiator_id */
    if temporary checkpoint exists then
      delete the permanent checkpoint
      make the existing temporary checkpoint permanent
    end if
    discard ack_message
  end if
end
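The three procedures above can be exercised with a small simulation. The following Python sketch is not the authors' implementation: the function name simulate_unidirectional, the representation of checkpoints by their v_no alone, and the random choice of which FIFO link delivers next are assumptions made for illustration. It also assumes n ≥ 3, distinct initiator ids, and that all initiators start before any message is delivered. Each link is a FIFO queue, as required by the system model.

```python
import random
from collections import deque


def simulate_unidirectional(n, initiators, seed=0):
    """Run one checkpointing round; return (perm, temp, messages_sent)."""
    rng = random.Random(seed)
    perm = [0] * n                      # v_no of the permanent checkpoint
    temp = [None] * n                   # v_no of the temporary checkpoint, if any
    flag = [False] * n                  # initiator_flag of each process
    chan = [deque() for _ in range(n)]  # chan[i]: FIFO link P_i -> P_(i+1) mod n
    sent = 0

    def send(i, msg):
        nonlocal sent
        chan[i].append(msg)
        sent += 1

    def commit(j):                      # temporary becomes permanent
        if temp[j] is not None:
            perm[j], temp[j] = temp[j], None

    for i in initiators:                # Unidirectional Checkpointing Initiator_i
        temp[i] = 1 - perm[i]
        flag[i] = True
        send(i, ("ckpt_req", i))

    while any(chan):
        src = rng.choice([i for i in range(n) if chan[i]])
        kind, origin = chan[src].popleft()
        j = (src + 1) % n
        is_pred = j == (origin - 1) % n
        if kind == "ckpt_req":
            if is_pred:                 # every other process already has a checkpoint
                if temp[j] is None:
                    perm[j] = 1 - perm[j]   # take the permanent checkpoint directly
                else:
                    commit(j)
                send(j, ("ack_message", j))
            elif temp[j] is None:
                temp[j] = 1 - perm[j]
                send(j, (kind, origin))
            elif j > origin or not flag[j]:
                send(j, (kind, origin))
            # otherwise j < origin and j is itself an initiator: discard
        else:                           # ack_message
            commit(j)
            if not is_pred:
                send(j, (kind, origin))
    return perm, temp, sent
```

A call such as simulate_unidirectional(5, [1, 3]) models two concurrent initiators; in this sketch every process ends the round with a permanent checkpoint carrying the new v_no and no temporary checkpoint. The initiator flags are not reset because only a single round is simulated.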
5. Checkpointing algorithm for bi-directional ring
network
Here also, any process (say Pi) that has not taken a
temporary checkpoint may decide to initiate
checkpointing. In such a case, the process itself takes a
new temporary logical checkpoint. The version
number of the new checkpoint is the complement of
the version number of the existing permanent
checkpoint. Then the initiator sends ckpt_req to both
of its neighboring processes (= (i ± 1) mod n). On
receiving a ckpt_req message, the conditions for a
process taking a new temporary checkpoint or
forwarding or discarding the ckpt_req are the same as those
of the previous algorithm. The forwarding is done to the
neighbor other than the one from which the message is
received. When a process receives ckpt_req initiated by
the same initiator twice, the process changes the
existing ckpt_state from temporary (T) to permanent (P),
deletes its old permanent checkpoint and forwards the
ckpt_req message. No acknowledgement message is
required here. The final initiator (the one with minimum
id among all initiators who have initiated checkpointing
within the TCT) receives two forwarded ckpt_req
messages. When it receives the first request, it changes
the existing ckpt_state from T to P, deletes its old
permanent checkpoint and stops the ckpt_req message
propagation. When it receives the other ckpt_req, it just
discards the message and the checkpointing algorithm
terminates.
5.1. Checkpointing algorithm for bi-directional
Algorithm. Bi-directional Checkpointing Initiator_i
/* This algorithm is executed by process Pi when Pi
decides to initiate a checkpointing */
begin
  if there is no temporary checkpoint then
    take a new temporary checkpoint with new v_no
      /* v_no = 0/1 if the existing checkpoint v_no = 1/0 */
    set initiator_id ← i
    send ckpt_req to both adjacent processes with initiator_id
  end if
end
Algorithm. ckpt_req receiver_j
/* This algorithm is executed by process Pj when it
receives a ckpt_req */
begin
  if there is no temporary checkpoint then
    take a new temporary checkpoint with new v_no
    set initiator_id ← rec_initiator_id
      /* rec_initiator_id is the initiator_id of the ckpt_req */
    forward the ckpt_req to the other adjacent process
      /* other than the process from which the ckpt_req was received */
  else /* temporary checkpoint exists */
    if initiator_id < rec_initiator_id then
      discard the ckpt_req message
    else if initiator_id > rec_initiator_id then
      set initiator_id ← rec_initiator_id
      forward the ckpt_req to the other adjacent process
    else /* initiator_id is equal to rec_initiator_id */
      if j = rec_initiator_id then
        delete the existing permanent checkpoint
        make the existing temporary checkpoint permanent
        discard the ckpt_req message
      else /* j ≠ rec_initiator_id */
        delete the existing permanent checkpoint
        make the existing temporary checkpoint permanent
        forward the ckpt_req to the other adjacent process
      end if
    end if
  end if
end
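A corresponding sketch for the bi-directional ring is given below, under the same simplifications as before: checkpoints are reduced to their v_no, n ≥ 3, initiator ids are distinct, all initiators start before any delivery, and the next link to deliver is picked at random. The name simulate_bidirectional and the done flag are assumptions of this sketch. In particular, the pseudocode does not spell out what a process should do with a ckpt_req that arrives after it has made its checkpoint permanent; the prose states only that the final initiator discards the second copy. The sketch generalizes this and lets every process that has already committed in the current round ignore late requests.

```python
import random
from collections import deque


def simulate_bidirectional(n, initiators, seed=0):
    """Run one checkpointing round; return (perm, temp, messages_sent)."""
    rng = random.Random(seed)
    perm = [0] * n
    temp = [None] * n
    init_id = [None] * n     # initiator_id stored with the temporary checkpoint
    done = [False] * n       # assumption: committed in this round, ignore late requests
    # chan[(i, d)]: FIFO link from P_i to P_(i+d) mod n, with d = +1 or -1
    chan = {(i, d): deque() for i in range(n) for d in (1, -1)}
    sent = 0

    def send(i, d, origin):
        nonlocal sent
        chan[(i, d)].append(origin)
        sent += 1

    for i in initiators:     # Bi-directional Checkpointing Initiator_i
        temp[i] = 1 - perm[i]
        init_id[i] = i
        send(i, 1, i)
        send(i, -1, i)

    while True:
        busy = [key for key, queue in chan.items() if queue]
        if not busy:
            break
        src, d = rng.choice(busy)
        origin = chan[(src, d)].popleft()
        j = (src + d) % n
        if done[j]:
            continue                      # late request after commit: discard
        if temp[j] is None:
            temp[j] = 1 - perm[j]
            init_id[j] = origin
            send(j, d, origin)
        elif init_id[j] < origin:
            continue                      # request of a larger-id initiator: discard
        elif init_id[j] > origin:
            init_id[j] = origin
            send(j, d, origin)
        else:                             # second request of the same initiator
            perm[j], temp[j] = temp[j], None
            done[j] = True
            if j != origin:
                send(j, d, origin)
    return perm, temp, sent
```

In this sketch only the requests of the initiator with the minimum id travel around the whole ring in both directions; requests of other initiators are dropped as soon as they meet a process that has already seen a smaller id.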
6. Recovery algorithm for unidirectional ring network
We assume that, when the faulty process is restored, it
initiates the recovery process by sending recovery messages
to the other processes. The recovery algorithm finds out
a v_no for which checkpoints exist in all the processes.
Processes may fail when the distributed application is
running or when the checkpointing process is in execution.
But a checkpoint is made permanent only when
checkpoints for that v_no have already been taken in all
other processes. Thus the existence of even one permanent
checkpoint indicates that checkpoints for this v_no are
present in all other processes. Note that some of them may
be temporary, while the rest are permanent. In such a
case, the temporary checkpoints are made permanent by
the recovery algorithm. Processes resume computation
from these checkpoints. At the time of restart, processes
resend their messages (in the same order as sent before)
which were unacknowledged at the moment of taking
the checkpoint. There might be duplicate messages after
re-sending, and these have to be
resolved using message identifiers at the receiver end.
Our recovery algorithm generates two types of
messages:
(1) Recovery message, for synchronization over the same
checkpointing version number.
(2) Resume message, which the initiator sends to all
processes after synchronization. After receiving
this message a process resumes computation from
the latest checkpoint. If the checkpoint corresponding to this v_no happens to be temporary, it is made
permanent, deleting the old permanent checkpoint.
When a process initiates the recovery process by sending
a recovery message to its neighbor, it sends its own id as the
recovery initiator and its latest checkpoint v_no (whether it was
permanent or temporary). On receiving a recovery
message a process compares its own v_no with the rec_v_no.
If they are identical, it sets its own initiator_id to
rec_initiator_id and forwards the recovery message, as
it is, to the next process. When this forwarded message
reaches its initiator, the initiator generates a resume message
with its own id. If v_no is not equal to the received v_no, then
the process checks its ckpt_state. If ckpt_state = T then it
deletes its temporary checkpoint, keeping the permanent
checkpoint. Then it forwards the recovery message to the
next process. If its ckpt_state = P then this process
takes over the role of the initiator. In this case this
process sends a recovery message to the next process with its
own id as the recovery_initiator_id and its own v_no.
When a process receives a resume message, if its
ckpt_state = T it makes ckpt_state = P and deletes its
old permanent checkpoint. If its ckpt_state was P then
no changes are made. In both cases it forwards the
resume message to the next process unless it knows that
all previous n − 1 processes already know about
this resume message.
6.1. Recovery algorithm for unidirectional
Algorithm. Unidirectional Recovery Initiator_i
/* This algorithm is executed by process Pi when the
faulty process Pi is restored */
begin
  set recovery_initiator_id ← i
  send recovery message to the next process
  with latest checkpoint v_no and recovery_initiator_id
  /* latest checkpoint may be temporary or permanent */
end
Algorithm. Recovery message receiver_j
/* This algorithm is executed by process Pj when it
receives a recovery message */
begin
  if process id j and v_no of the latest checkpoint match
  the initiator_id and v_no of the recovery message then
    if latest checkpoint is temporary then
      delete the permanent checkpoint
      make the temporary checkpoint permanent
    end if
    generate resume message
    set resume_initiator_id ← j
    send resume message to the next process with resume_initiator_id
  else if v_no of the latest checkpoint does not match
  v_no of the recovery message but j ≠ initiator_id then
    if temporary checkpoint exists then
      delete the temporary checkpoint
      roll back to its previous permanent checkpoint
      forward recovery message to the next process
    else /* temporary checkpoint does not exist */
      discard recovery message
      Unidirectional Recovery Initiator_j /* set process j as a new recovery initiator */
    end if
  else /* v_no of the latest checkpoint matches v_no of
  the recovery message but j ≠ initiator_id */
    forward recovery message to the next process
  end if
end
Algorithm. Resume message receiver_j
/* This algorithm is executed by process Pj when it
receives a resume message */
begin
  if process j is not the immediate predecessor of the
  initiator of this resume message then
    if temporary checkpoint exists then
      delete the existing permanent checkpoint
      make the existing temporary checkpoint permanent
    end if
    forward resume message to the next process
  else /* process j is the immediate predecessor of
  the initiator of this resume message */
    if temporary checkpoint exists then
      delete the existing permanent checkpoint
      make the existing temporary checkpoint permanent
    end if
    terminate resume message
  end if
end
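With a single restored process only one control message is in flight at a time, so the unidirectional recovery can be traced with a simple loop. The sketch below is an illustration under that assumption; the function name recover_unidirectional, the hop counter and the safety bound on the number of hops are not part of the paper. Checkpoints are again represented by their v_no, and the input is expected to satisfy Theorem 1 (some v_no is held by every process).

```python
def recover_unidirectional(perm, temp, start):
    """Return (perm, temp, hops) after recovery initiated by process `start`."""
    n = len(perm)
    perm, temp = list(perm), list(temp)

    def latest(j):
        return perm[j] if temp[j] is None else temp[j]

    def commit(j):
        if temp[j] is not None:
            perm[j], temp[j] = temp[j], None

    hops = 0
    initiator, v_no = start, latest(start)
    j = (start + 1) % n
    # Phase 1: the recovery message travels until it returns to its initiator.
    while True:
        hops += 1
        if hops > 3 * n:
            raise RuntimeError("input violates the one-version invariant")
        if j == initiator and latest(j) == v_no:
            commit(j)
            break
        if latest(j) != v_no:
            if temp[j] is not None:
                temp[j] = None                # roll back to the permanent checkpoint
            else:
                initiator, v_no = j, perm[j]  # j takes over as recovery initiator
        j = (j + 1) % n
    # Phase 2: the resume message makes remaining temporary checkpoints permanent.
    j = (initiator + 1) % n
    while True:
        hops += 1
        commit(j)
        if j == (initiator - 1) % n:
            break
        j = (j + 1) % n
    return perm, temp, hops
```

For a failure after the TCT, for example perm = [1, 0, 0, 1, 1] and temp = [None, 1, 1, None, None], the sketch ends with every process holding a permanent checkpoint with v_no = 1. For a failure within the TCT, where some process has not yet taken its temporary checkpoint, all temporary checkpoints are deleted and the processes fall back to v_no = 0.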
7. Recovery algorithm for bi-directional ring network
In this algorithm we use two flags, flag_visit and
flag_resume, for every process Pi. When Pi comes to
know about a fault, it sets both its flags to False.
As in the unidirectional recovery algorithm, when the
faulty process Pi is restored, it initiates the recovery process.
It sends a recovery message (reco_message) along with its
latest checkpoint v_no (irrespective of whether it is
permanent or temporary) to its two neighbors
(P(i+1) mod n and P(i−1) mod n) and sets flag_visit = True.
The algorithm finds out a v_no for which checkpoints
exist in all processes. The resume message is not required
here.
When a process Pi receives a reco_message it
compares rec_v_no with its own v_no. If they are equal then
Pi checks flag_visit. If flag_visit = False, Pi sets
flag_visit = True and forwards the message. If
flag_visit = True and Pi's ckpt_state is T, it deletes
its old permanent checkpoint and changes the value of
ckpt_state to P. If its ckpt_state is P then the ckpt_state
remains unchanged. In both cases the message is
forwarded to the next process, in the direction of travel
of the reco_message, and the process resumes computation from the permanent checkpoint. When a process
resumes computation it sets flag_resume = True. The
reco_message is forwarded till it reaches a process whose
flag_resume = True.
In case rec_v_no is not equal to v_no, if ckpt_state is T
then Pi deletes its current checkpoint (T) and forwards
the message; otherwise it sends a reco_message to the next
process with its own checkpoint v_no. In both cases Pi sets
flag_visit = True.
Algorithm. Bi-directional Recovery Initiator_i
/* This algorithm is executed by process Pi when the
faulty process Pi is restored */
begin
  set flag_visit ← True
  send recovery message to both adjacent processes
  with latest checkpoint v_no
end
Algorithm. Recovery message receiver_j
/* This algorithm is executed by process Pj when it
receives a recovery message */
begin
  if flag_resume is True then /* recovery for process j is complete */
    discard the recovery message
  else /* flag_resume is False, implies recovery for
  the process is not yet complete */
    if v_no of process j matches the v_no of the recovery message then
      if flag_visit is True then /* process j already received a recovery message */
        if temporary checkpoint exists then
          delete the existing permanent checkpoint
          make the temporary checkpoint permanent
        end if
        set flag_resume ← True
      else /* flag_visit is False, this is the first
      recovery message received by process j */
        set flag_visit ← True
      end if
      forward the recovery message to the other adjacent process
    else /* v_no of process j does not match the
    v_no of the recovery message */
      if temporary checkpoint exists then
        set flag_visit ← True
        delete the existing temporary checkpoint
        roll back to its previous permanent checkpoint
        forward recovery message to the other adjacent process
      else /* permanent checkpoint exists but no temporary checkpoint */
        set flag_visit ← True
        send recovery message to the other adjacent
        process with its own checkpoint v_no
      end if
    end if
  end if
end
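The flag-based recovery above can be traced with the following sketch. As in the earlier sketches, the function name recover_bidirectional, the reduction of checkpoints to their v_no and the random choice of the next link to deliver are assumptions made for illustration, and a single restored process is assumed to initiate the recovery on a ring of n ≥ 3 processes.

```python
import random
from collections import deque


def recover_bidirectional(perm, temp, start, seed=0):
    """Return (perm, temp, flag_resume) after recovery initiated by `start`."""
    rng = random.Random(seed)
    n = len(perm)
    perm, temp = list(perm), list(temp)
    visit = [False] * n
    resume = [False] * n

    def latest(j):
        return perm[j] if temp[j] is None else temp[j]

    # chan[(i, d)]: FIFO link from P_i to P_(i+d) mod n, with d = +1 or -1
    chan = {(i, d): deque() for i in range(n) for d in (1, -1)}
    visit[start] = True                      # Bi-directional Recovery Initiator
    for d in (1, -1):
        chan[(start, d)].append(latest(start))

    while True:
        busy = [key for key, queue in chan.items() if queue]
        if not busy:
            break
        src, d = rng.choice(busy)
        v_no = chan[(src, d)].popleft()
        j = (src + d) % n
        if resume[j]:
            continue                         # recovery already complete at j
        if latest(j) == v_no:
            if visit[j]:                     # second matching visit: resume
                if temp[j] is not None:
                    perm[j], temp[j] = temp[j], None
                resume[j] = True
            visit[j] = True
        elif temp[j] is not None:
            visit[j] = True
            temp[j] = None                   # roll back to the permanent checkpoint
        else:
            visit[j] = True
            v_no = perm[j]                   # continue with j's own v_no
        chan[(j, d)].append(v_no)            # same direction of travel
    return perm, temp, resume
```

The two scenarios of Figs. 5 and 6 can be modelled by choosing temp so that either the restored process has no temporary checkpoint (all processes fall back to the old v_no) or every process holds a checkpoint with the new v_no (all temporary checkpoints are made permanent).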
8. Correctness of the proposed algorithms
For unidirectional as well as bi-directional rings, in
order to show that the proposed algorithm is correct, we
first show that, at any point of time, there exists a value
of v_no for which each process has a checkpoint. Then
we show that the set thus obtained is indeed a CGS, i.e.,
it does not contain any orphan message.
Theorem 1. In the proposed checkpointing algorithm for
unidirectional ring, at any point of time, there exists
exactly one value of v_no for which each process has a
checkpoint.
Proof. If no checkpointing process is in execution, then
the checkpoints corresponding to the v_no of the last
checkpointing process will be available in every process.
Now we consider a point of time, say t, within the
checkpointing latency. If t is within the TCT, since the
new checkpoint has not been made permanent, the
permanent checkpoints in all the processes correspond
to the previous checkpointing latency and hence have the
same v_no. For example, in Fig. 3, process P2 fails before
taking its new checkpoint. In this case the system has to roll
each process back to its permanent checkpoint to
get the previously existing CGS(1). If t is after the TCT,
at least one process has taken a new permanent
checkpoint. Since permanent checkpoints are taken only
when all other processes have taken temporary or
permanent checkpoints corresponding to that v_no, a
checkpoint (temporary or permanent) corresponding to
the new v_no is available with every process. For
example, in Fig. 4, process P2 has failed after taking its
permanent checkpoint. In this case, the system will not roll back to CGS(1). Instead, the recovery algorithm goes to
CGS(0) using the currently existing checkpoints (temporary or permanent) with v_no = 0 for all processes.
Now note that before taking the first permanent
checkpoint corresponding to a new checkpointing
process, the existing permanent checkpoint is deleted.
Thus at any point of time there is only one complete set
of checkpoints. □
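The property used by the recovery algorithms can be checked mechanically on a snapshot of the stable storage. The helper below is a hypothetical illustration: each process is described by the pair (v_no of its permanent checkpoint, v_no of its temporary checkpoint or None), and the function lists the values of v_no for which every process holds a checkpoint.

```python
def complete_versions(procs):
    """procs: list of (perm_v_no, temp_v_no_or_None), one pair per process."""
    return [v for v in (0, 1)
            if all(v == perm or v == temp for perm, temp in procs)]
```

Within the TCT, for example [(0, 1), (0, None), (0, 1)], only v_no = 0 is complete; once some process has made its new checkpoint permanent, for example [(1, None), (0, 1), (0, 1)], only v_no = 1 is complete. In the intermediate situation where every process holds both a permanent and a temporary checkpoint, the helper reports both values; the recovery algorithms of Sections 6 and 7 then proceed with the latest v_no, since the recovery message carries the v_no of the latest checkpoint.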
Our recovery algorithm finds a set of checkpoints
corresponding to the same v_no. It only remains to show
that the set thus obtained is consistent.
Fig. 4. An example showing the checkpointing and recovery when a
failure occurs after TCT but within the checkpointing latency.
Theorem 2. The set of checkpoints corresponding to the
v no in Theorem 1, is consistent.
Proof. Suppose that, for an application message, the send
event is not recorded in our set of checkpoints. Then, in
the sender process, the checkpoint was taken before
sending this application message. As the checkpoint was
taken earlier, the ckpt_req message following the
checkpoint also precedes the application
message. As we assume that the channel is FIFO and
unidirectional, the ckpt_req message is always
received before the application message. Hence all processes
take their checkpoints before receiving the message, and
no checkpoint will show this message being
received. □
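The consistency condition used in this proof (no checkpoint records the receipt of a message whose send is not recorded) can be checked mechanically. The following is a minimal sketch, assuming each checkpoint exposes the sets of message identifiers it records as sent and as received; the names `orphan_messages` and `is_consistent` and the message ids are hypothetical. Messages recorded as sent but not yet received are not flagged, since the logical checkpoint keeps the unacknowledged messages.

```python
from typing import Dict, Set


def orphan_messages(sent: Dict[str, Set[str]],
                    received: Dict[str, Set[str]]) -> Set[str]:
    """Messages recorded as received in some checkpoint but as sent in none."""
    all_sent = set().union(*sent.values())
    all_received = set().union(*received.values())
    return all_received - all_sent


def is_consistent(sent: Dict[str, Set[str]],
                  received: Dict[str, Set[str]]) -> bool:
    """A set of checkpoints is consistent if it contains no orphan message."""
    return not orphan_messages(sent, received)


# P0's checkpoint records the send of m1; P1's checkpoint records its receipt.
sent = {"P0": {"m1"}, "P1": set()}
received_ok = {"P0": set(), "P1": {"m1"}}
# Here P1's checkpoint also records m2, whose send appears in no checkpoint.
received_bad = {"P0": set(), "P1": {"m1", "m2"}}

print(is_consistent(sent, received_ok))      # True
print(orphan_messages(sent, received_bad))   # {'m2'}
```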
Theorem 3. In the proposed checkpointing algorithm for
bi-directional ring, at any point of time, there exists
exactly one value of v_no for which each process has a
checkpoint.
Proof. If no checkpointing process is in execution, then
the checkpoints corresponding to the v_no of the last
checkpointing process are available in every process.
Now consider a point of time, say t, within the
checkpoint latency. If t is within the TCT (Fig. 5),
not all processes have taken their temporary checkpoints
for the ongoing checkpointing process, so the permanent
checkpoints in all the processes correspond to the
previous checkpoint latency and hence have the same
v_no. If t is after the TCT but within the checkpoint
latency (Fig. 6), every process has taken a new
temporary checkpoint. Some processes may also have made
these checkpoints permanent. So a checkpoint
corresponding to the new v_no is available with every
process. □
Fig. 3. An example showing the checkpointing and recovery when a
failure occurs within TCT.
Theorem 4. The set of checkpoints corresponding to the
v_no in Theorem 3 is consistent.
Fig. 5. An example showing the checkpointing and recovery when a
failure occurs within TCT.

Fig. 6. An example showing the checkpointing and recovery when a
failure occurs after TCT but within the checkpointing latency.
Proof. Suppose that, for an application message, the send
event is not recorded in our set of checkpoints. Then the
checkpoint was taken before sending this application
message. Since the checkpoint was taken earlier, the
ckpt_req message following the checkpoint also
precedes the application message. As we assume that the
channel is FIFO, the ckpt_req message is always
received before the application message. Hence all
processes take their checkpoints before receiving the message,
and no checkpoint will show this message being
received. □
9. Complexity analysis
For both the unidirectional and bi-directional checkpointing
algorithms, a process takes exactly one checkpoint in a single
checkpointing latency. However, several
checkpointing request messages may be generated
because of multiple concurrent initiations. Among those,
only the request message with the minimum (initiator) id
survives and goes round the ring once. All other request
messages are dominated by that message and are
discarded before completing a round. By the time the
surviving message completes the round, all other
requests have been discarded. One more round may be
necessary for the confirmation in the case of a unidirectional
ring. In the case of a bi-directional ring, the two checkpointing
request messages go round the ring just once each,
along different directions, and no separate confirmation
is required. Thus, the checkpointing time is $O(n)$ in
both cases.
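The domination of concurrent requests by the minimum initiator id can be sketched as a small simulation on a unidirectional ring. This is only an illustration under simplifying assumptions that are not part of the algorithm's specification here: all initiators start at the same instant, every hop takes one time unit, and a process discards a request if it already knows of a smaller initiator id. The function name `run_concurrent_initiation` is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def run_concurrent_initiation(n: int, initiators: Iterable[int]) -> Tuple[Optional[int], int]:
    """Return (surviving initiator id, time units until its request returns)."""
    initiators = set(initiators)
    if not initiators or not all(0 <= i < n for i in initiators):
        raise ValueError("initiators must be a non-empty subset of range(n)")
    # Smallest initiator id each process knows of so far (None = none yet).
    smallest_seen = {p: (p if p in initiators else None) for p in range(n)}
    in_flight = [(i, i) for i in sorted(initiators)]  # (request id, position)
    survivor, elapsed = None, 0
    while in_flight:
        elapsed += 1
        still_travelling = []
        for req_id, pos in in_flight:
            nxt = (pos + 1) % n  # one hop along the ring direction
            if nxt == req_id:
                survivor = req_id  # the request came back to its initiator
            elif smallest_seen[nxt] is not None and smallest_seen[nxt] < req_id:
                continue  # dominated by a smaller id: discard
            else:
                smallest_seen[nxt] = req_id
                still_travelling.append((req_id, nxt))
        in_flight = still_travelling
    return survivor, elapsed


print(run_concurrent_initiation(5, {3, 1, 4}))  # (1, 5)
```

Under these assumptions the request of the smallest initiator is the only one that completes a round, and it does so after n time units regardless of the number of initiators.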
With respect to message complexity (i.e., the number
of control messages), the worst case occurs when all the
processes initiate at the same point of time. If this
happens in a unidirectional ring, the ckpt_req
message from $P_0$ goes to all other processes. For
$P_i$ ($i \neq 0$), it goes up to $P_0$ and is discarded. Thus a total
of $(n-1) + ((n-1) + (n-2) + \cdots + 2 + 1) = (n-1)(n+2)/2$
messages are generated. Also, $(n-1)$ acknowledgement
messages are generated. Thus, a total of
$(n-1)(n+4)/2$ $(= O(n^2))$ control messages are
generated. For the bi-directional ring, the ckpt_req
message from $P_0$ goes to all other processes
along both directions and comes back. For all other
processes, packets going in the clockwise direction
go up to $P_0$, and those going in the counter-clockwise
direction go just one hop each and are
discarded. Thus a total of $(2n) + (n + (n-1) + \cdots + 2) =
(2n-1) + n(n+1)/2$ $(= O(n^2))$ control messages are
generated.
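As a quick check of the arithmetic, the per-process hop counts stated above can be summed term by term and compared with the closed forms. The sketch below only re-adds the counts given in the text and does not simulate the protocol; the function names are hypothetical.

```python
def unidirectional_worst_case(n: int) -> int:
    """Control messages when all n processes initiate (unidirectional ring)."""
    requests_from_p0 = n - 1                                # P0's request reaches all others
    requests_from_others = sum(n - i for i in range(1, n))  # Pi's request travels up to P0
    acknowledgements = n - 1
    return requests_from_p0 + requests_from_others + acknowledgements


def bidirectional_worst_case(n: int) -> int:
    """Control messages when all n processes initiate (bi-directional ring)."""
    requests_from_p0 = 2 * n                                # both directions, full round
    # Pi: (n - i) hops clockwise up to P0, plus one counter-clockwise hop.
    requests_from_others = sum((n - i) + 1 for i in range(1, n))
    return requests_from_p0 + requests_from_others


for n in range(2, 50):
    assert unidirectional_worst_case(n) == (n - 1) * (n + 4) // 2
    assert bidirectional_worst_case(n) == (2 * n - 1) + n * (n + 1) // 2

print(unidirectional_worst_case(5), bidirectional_worst_case(5))  # 18 24
```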
For the rollback recovery algorithms in unidirectional
and bi-directional rings, the worst case time and
message complexities are all $O(n)$.
10. Comparison with existing algorithms
The K–T algorithm does not work on a unidirectional ring
network when multiple processes initiate checkpoints
concurrently. In such a case, all the checkpointing
processes end up aborting [12]. Like the S–K algorithm,
our algorithm takes n temporary checkpoints,
one for each process, and this does not depend on the
number of concurrent initiations. In the P–S algorithm,
if all processes are dependent on each other and k
processes initiate checkpointing concurrently, each
process takes k temporary checkpoints, i.e., a total of
kn checkpoints for the system. Both the S–K and P–S
algorithms are designed for general network topologies.
Their worst case message complexities are $O(n^3)$, and
this worst case is attained even on a simple
unidirectional ring. For the proposed algorithm, the message
complexity is $O(n^2)$.
Table 1 compares the proposed algorithms with the
S–K algorithm and the P–S algorithm.
Table 1
Performance of the proposed algorithms and other existing algorithms

| | S–K algorithm | P–S algorithm | Proposed checkpointing algorithms |
|---|---|---|---|
| Network topology for which applicable | General | General | Ring (unidirectional/bi-directional) |
| Worst case message complexity (Ring) | $O(n^3)$ | $O(n^3)$ | $O(n^2)$ |
| Time complexity (Ring) | $O(n)$ | $O(n)$ | $O(n)$ |
| Control message size | $O(n/k)$ | $O(n)$ | $O(1)$ |
| Number of checkpoints stored for k concurrent initiations | One permanent and one temporary checkpoint for each process | One permanent and k temporary checkpoints for each process | One permanent and one temporary checkpoint for each process |
| Number of checkpoints rolled back after a failure | At most one temporary checkpoint | At most k temporary checkpoints | At most one temporary checkpoint |
Fig. 7. Simulation results showing rollback recovery, checkpointing, and total overhead costs for the proposed algorithm in a unidirectional ring.
[Four panels, for $1/\lambda_{\mathrm{fault}}$ = 0.03, 0.05, 0.08 and 0.10 (in millions). Horizontal axis: $1/\lambda_{\mathrm{ckpt}}$ (in thousands), from 0.5 to 300.5. Vertical axis: overhead (in millions), from 0 to 60. Curves: Recovery Overhead, Checkpointing Overhead, Total Overhead.]
11. Simulation results
Simulation studies were conducted on the behavior of
the proposed algorithms when implemented on unidirectional as well as
bi-directional ring networks. We assume
that the inter-fault time, inter-checkpoint time and
inter-message send time for a process follow exponential
distributions with parameters $\lambda_{\mathrm{fault}}$, $\lambda_{\mathrm{ckpt}}$ and
$\lambda_{\mathrm{send}}$, respectively. The simulation program takes
$\lambda_{\mathrm{fault}}$, $\lambda_{\mathrm{ckpt}}$ and $\lambda_{\mathrm{send}}$ as input parameters.
It is assumed that a message
takes one unit of time to travel across one link. The time
for taking a checkpoint is assumed to be 5000 units. We
varied $1/\lambda_{\mathrm{ckpt}}$ between 500 and 300,000 with increments of
100 for a fixed value of $1/\lambda_{\mathrm{fault}}$ (between 30,000 and
100,000), whereas $\lambda_{\mathrm{send}}$ remains fixed. The simulation was
carried out for 20,000,000 units of time. For each
set of values of the input parameters, the program was
run 20 times and the average of the 20
runs was taken. Figs. 7 and 8 show the simulated values of
checkpointing overhead, rollback recovery overhead
and the total overhead for our proposed unidirectional
and bi-directional algorithms, respectively. Total overhead
is the sum of checkpointing overhead and rollback
recovery overhead. As the checkpointing rate ($\lambda_{\mathrm{ckpt}}$)
decreases, checkpointing overhead decreases while recovery
overhead goes up. Initially the total overhead
decreases with decreasing checkpointing rate. At this
stage the checkpointing overhead is the dominant cost.
After it reaches a minimum value, the rollback cost
starts to dominate and the total overhead starts
increasing again. In each case we show the optimum
value of the checkpointing rate that minimizes the total
overhead. As the fault rate ($\lambda_{\mathrm{fault}}$) goes down (in
different graphs), the recovery cost also goes down
and the optimum checkpointing rate goes down too.
The number of control messages was affected strongly
by the number of concurrent initiations. Concurrent
initiations abort many control messages without letting
them complete the cycle. This explains the variation in
the curves.
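The trade-off between checkpointing and recovery overhead can be reproduced qualitatively with a much simpler model than the ring simulation described above. The sketch below is not the authors' simulator: it assumes a single process, charges 5000 time units per checkpoint, and (as a further assumption) counts the time elapsed since the last checkpoint as the recovery overhead of each fault. The function names and the default seed are hypothetical.

```python
import random
from typing import List, Tuple


def event_times(rate: float, horizon: float, rng: random.Random) -> List[float]:
    """Event instants in [0, horizon) with exponential gaps of mean 1/rate."""
    times, t = [], rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def overheads(mean_ckpt: float, mean_fault: float,
              horizon: float = 20_000_000, ckpt_cost: float = 5000,
              seed: int = 0) -> Tuple[float, float, float]:
    """Return (checkpointing, recovery, total) overhead for one simulated run."""
    rng = random.Random(seed)
    ckpts = event_times(1.0 / mean_ckpt, horizon, rng)
    faults = event_times(1.0 / mean_fault, horizon, rng)
    ckpt_overhead = ckpt_cost * len(ckpts)
    recovery_overhead = 0.0
    j = -1  # index of the latest checkpoint taken no later than the fault
    for f in faults:
        while j + 1 < len(ckpts) and ckpts[j + 1] <= f:
            j += 1
        last = ckpts[j] if j >= 0 else 0.0
        recovery_overhead += f - last  # assumed cost: work since last checkpoint
    return ckpt_overhead, recovery_overhead, ckpt_overhead + recovery_overhead


for mean_ckpt in (500, 50_000, 300_000):
    print(mean_ckpt, overheads(mean_ckpt, mean_fault=50_000))
```

Sweeping `mean_ckpt` for a fixed `mean_fault` and averaging over several seeds gives curves of the same general shape as those discussed above, although the absolute values depend entirely on the assumed cost model.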
Fig. 8. Simulation results showing rollback recovery, checkpointing, and total overhead costs for the proposed algorithm in a bi-directional ring.
[Four panels, labelled $1/\lambda_{\mathrm{fault}}$ = 0.03, 0.50, 0.08 and 0.10 (in millions). Horizontal axis: $1/\lambda_{\mathrm{ckpt}}$ (in thousands), from 0.5 to 300.5. Vertical axis: overhead (in millions), from 0 to 60. Curves: Recovery Overhead, Checkpointing Overhead, Total Overhead.]

Table 2 compares the proposed unidirectional (U)
and bi-directional (B) algorithms with the S–K and P–S
algorithms in terms of the total number of control messages.
Simulated runs of the four algorithms were carried out.
We assume that whenever one process initiates checkpointing,
all the processes take checkpoints. We have
simulated systems with 4, 6 and 10 processes. The
value of $1/\lambda_{\mathrm{ckpt}}$ was taken as 200, 400 or 600. For each
algorithm, the total number of control messages was computed as the
average over 200 different runs. The results clearly reflect
the superiority of the proposed algorithms. The difference
is more pronounced for larger systems with a higher
number of processes.

Table 2
Simulation results comparing the number of control messages for the proposed Unidirectional (U) and Bi-directional (B) algorithms with the S–K and P–S algorithms

| No. of processes | Algorithm | $1/\lambda_{\mathrm{ckpt}}$ = 200 | $1/\lambda_{\mathrm{ckpt}}$ = 400 | $1/\lambda_{\mathrm{ckpt}}$ = 600 |
|---|---|---|---|---|
| 4 | U | 16 | 18 | 19 |
| 4 | B | 24 | 24 | 25 |
| 4 | S–K | 58 | 38 | 27 |
| 4 | P–S | 188 | 174 | 158 |
| 6 | U | 33 | 36 | 39 |
| 6 | B | 41 | 45 | 45 |
| 6 | S–K | 239 | 155 | 104 |
| 6 | P–S | 841 | 830 | 787 |
| 10 | U | 82 | 88 | 94 |
| 10 | B | 80 | 91 | 98 |
| 10 | S–K | 1502 | 958 | 663 |
| 10 | P–S | 28,720 | 43,442 | 38,899 |

12. Conclusion
In this work, we have proposed checkpointing and
recovery algorithms for unidirectional as well as
bi-directional ring networks. In our model, processes take
logical checkpoints, i.e., a snapshot of the process plus the
unacknowledged messages. Our algorithm can handle
multiple initiations of checkpointing. During recovery,
each process has to roll back at most one checkpoint.
For each process, at most two checkpoints (one
permanent and one temporary) may be saved in
stable storage. For the checkpointing as well as the
recovery algorithms, the control message makes two
rounds along the unidirectional ring and one round for
the bi-directional ring.
Though checkpointing schemes for general network
topologies are available in the literature, there is scope
for improvement for particular classes of topologies. A
more general approach showing the effect of the
topologies on the complexities of the checkpointing
algorithms may also be considered.

Acknowledgments
The first author is thankful to the Council of Scientific
and Industrial Research (CSIR), India, for financial
support during this work. The authors thank the
anonymous reviewers for their constructive criticism
and helpful suggestions. The authors are also grateful to
Professor Bhabani P. Sinha of the ACM Unit, Indian
Statistical Institute, Kolkata, for his patient hearing and
many suggestions, which have improved the organization
of the paper.

References
[1] B. Bhargava, S.R. Lian, Independent checkpointing and concurrent rollback for recovery in distributed systems-an optimistic approach, in: Proceedings of the Seventh IEEE Symposium on Reliable Distributed System, 1988, pp. 3–12.
[2] G. Cao, M. Singhal, On coordinated checkpointing in distributed systems, IEEE Trans. Parallel Distrib. Systems 9 (12) (1998) 1213–1225.
[3] K.M. Chandy, L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Trans. Comput. Systems 3 (1) (1985) 63–75.
[4] E.N. Elnozahy, Lorenzo Alvisi, Yi-Min Wang, D.B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Comput. Surveys 34 (3) (2002) 375–408.
[5] D. Johnson, W. Zwaenepoel, Recovery in distributed systems using optimistic message logging and checkpointing, J. Algorithms 3 (11) (1990) 462–491.
[6] J.L. Kin, T. Park, An efficient protocol for checkpointing recovery in distributed system, IEEE Trans. Parallel Distrib. Systems 5 (8) (1998) 955–960.
[7] R. Koo, S. Toueg, Checkpointing and rollback-recovery for distributed system, IEEE Trans. Software Eng. 13 (1) (1987) 23–31.
[8] D. Manivannan, M. Singhal, A low-overhead recovery technique using quasi-synchronous checkpointing, in: Proceedings of the IEEE Sixth International Conference on Distributed Computer Systems, May 1996, pp. 100–107.
[9] D. Manivannan, M. Singhal, Quasi-synchronous checkpointing: models, characterization, and classification, IEEE Trans. Parallel Distrib. Systems 10 (7) (1999) 703–713.
[10] F. Mattern, Virtual time and global states of distributed systems, in: M. Cosnard et al. (Ed.), Proceedings of the Workshop on Parallel and Distributed Algorithm, Elsevier Science Publishers B. V., North-Holland, Amsterdam, 1989, pp. 215–226.
[11] K.Z. Meth, W.G. Tuel, Parallel checkpoint/restart without message logging, in: Proceedings of the IEEE 28th International Conference on Parallel Processing (ICPP ’00), August 2000, pp. 253–258.
[12] R. Prakash, M. Singhal, Maximal global snapshot with concurrent initiators, in: Proceedings of the Sixth IEEE Symposium of Parallel and Distributed Processing, October 1994, pp. 334–351.
[13] R. Prakash, M. Singhal, Low-cost checkpointing and failure recovery in mobile computing systems, IEEE Trans. Parallel Distrib. Systems 7 (10) (1996) 1035–1048.
[14] L.M. Silva, J.G. Silva, Global checkpointing for distributed programs, in: Proceedings of the 11th Symposium on Reliable Distributed Systems, 1992, pp. 115–162.
[15] A.P. Sistla, J. Welch, Efficient distributed recovery using message logging, in: Proceedings of the ACM Symposium on Principle of Distributed Computing, 1989, pp. 223–238.
[16] M. Spezialetti, P. Kearns, Efficient distributed snapshots, in: Proceedings of the Sixth ICDCS, 1986, pp. 382–388.
[17] R.E. Strom, S. Yemini, Optimistic recovery in distributed systems, ACM Trans. Comput. Systems 3 (3) (1985) 204–226.
[18] T.T-Y. Juang, S. Venkatesan, Efficient algorithms for crash recovery in distributed systems, in: Proceedings of the 10th Conference on FSTTCS, Springer, Berlin, December 1990, pp. 349–361.
[19] N.H. Vidya, Staggered consistent checkpointing, IEEE Trans. Parallel Distrib. Systems 10 (7) (1999) 694–702.
[20] Y.M. Wang, Consistent global checkpoints that contain a given set of local checkpoints, IEEE Trans. Comput. 46 (4) Apr. (1997) 456–468.
[21] Y.M. Wang, Y. Huang, W.K. Fuchs, Progressive retry for software error recovery in distributed systems, in: Proceedings of the IEEE Fault-Tolerant Computing Symposium (FTCS-23), June 1993, pp. 138–144.
Partha Sarathi Mandal received a Bachelor
of Science (Hons.) degree in Mathematics
from the University of Calcutta, India, and a
Master of Science degree in Mathematics
from Jadavpur University, India, in 1995
and 1997, respectively. He has been awarded Junior
and Senior Research Fellowships by the
Council of Scientific & Industrial Research
(CSIR), India. He is currently working towards his Ph.D. degree in
Computer Science at the Advanced Computing and Microelectronics
Unit of the Indian Statistical Institute, Kolkata. His current research
interests include parallel and distributed computing, fault tolerance,
mobile agents, performance analysis, etc.
Krishnendu Mukhopadhyaya received a Bachelor of Statistics (Hons.), Master of
Statistics, Master of Technology in Computer Science, and Ph.D. in Computer Science,
all from the Indian Statistical Institute,
Kolkata, in 1985, 1987, 1989 and 1994,
respectively. From 1993 to 1999 he worked
as a Lecturer in the Department of Mathematics, Jadavpur University. Since 1999, he
has been working at the Indian Statistical Institute,
Kolkata, as an Associate Professor. He was
a recipient of the Young Scientist Award of the Indian Science
Congress Association and the BOYSCAST Fellowship of the
Department of Science and Technology, Government of India. His
current research interests include mobile computing, parallel and
distributed computing, sensor networks, etc.