Concurrent checkpoint initiation and recovery algorithms on asynchronous ring network

Krishnendu Mukhopadhyaya

Journal of Parallel and Distributed Computing, 2004

Checkpointing with rollback recovery is a well-known method for achieving fault-tolerance in distributed systems. In this work, we introduce algorithms for checkpointing and rollback recovery on asynchronous unidirectional and bi-directional ring networks. The proposed checkpointing algorithms can handle multiple concurrent initiations by different processes. While taking checkpoints, processes do not have to take into consideration any application message dependency. The synchronization is achieved by passing control messages among the processes. Application messages are acknowledged. Each process maintains a list of unacknowledged messages. Here we use a logical checkpoint, which is a standard checkpoint (i.e., snapshot of the process) plus a list of messages that have been sent by this process but are unacknowledged at the time of taking the checkpoint. The worst case message complexity of the proposed checkpointing algorithm is O(kn) when k initiators initiate concurrently. The time complexity is O(n). For the recovery algorithm, time and message complexities are both O(n).

J. Parallel Distrib. Comput. 64 (2004) 649–661

Partha Sarathi Mandal and Krishnendu Mukhopadhyaya*
Advanced Computing and Microelectronics Unit, Indian Statistical Institute, 203 B. T. Road, Kolkata 700108, India
*Corresponding author. E-mail addresses: partha_r@isical.ac.in (P.S. Mandal), krishnendu@isical.ac.in (K. Mukhopadhyaya).
Received 10 February 2003; revised 9 February 2004
0743-7315/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.jpdc.2004.03.013

Keywords: Distributed system; Ring network; Coordinated checkpointing; Rollback recovery; Logical checkpoint

1. Introduction

Checkpointing is an important feature in distributed computing. It gives fault tolerance without requiring additional effort from the programmer. A checkpoint is a snapshot of the current state of a process. It saves enough information in non-volatile stable storage such that, if the contents of the volatile storage are lost due to process failure, one can reconstruct the process state from the information saved in the non-volatile stable storage. If the processes communicate with each other through messages, rolling back a process may cause some inconsistencies. In the time since its last checkpoint, a process may have sent some messages. If it is rolled back and restarted from the point of its last checkpoint, it may create orphan messages, i.e., messages whose receive events are recorded in the states of the destination processes but whose send events are lost. Similarly, messages received during the rolled-back period may also cause problems. Their sending processes will have no idea that these messages are to be sent again. Such messages, whose send events are recorded in the state of the sender process but whose receive events are lost, are called missing messages. In Fig. 1, if process P3 is rolled back to its last checkpoint, then m3 would be an orphan message. Similarly, if P0 is rolled back to its last checkpoint, then m1 would be a missing message. A set of checkpoints, with one checkpoint for every process, is said to be a consistent global checkpointing state (CGS) if it does not contain any orphan message or missing message. However, generation of missing messages may be acceptable if messages are logged by the sender.

In a distributed system, each process has to take checkpoints periodically on non-volatile stable storage. In case of a failure, the system rolls back to a consistent set of checkpoints. If all the processes take checkpoints at the same time instant, the set of checkpoints would be consistent. But since globally synchronized clocks are very difficult to implement, processes may take checkpoints within an interval. In order to achieve synchronization, sometimes processes take temporary checkpoints. They are made permanent when all processes agree. The time from the checkpointing initiation to the time of the last process taking its checkpoint (which may be temporary) is called the total checkpointing time (TCT) (Fig. 2). The time interval from the initiation to the completion of the checkpointing process, when all checkpoints are taken and made permanent, is called the checkpointing latency (Fig. 2).

Checkpointing algorithms may be classified into two broad categories: (a) coordinated and (b) uncoordinated. In uncoordinated checkpointing [1,11,17] each process takes checkpoints independently, without regard to other processes. In case of a failure, after recovery, a CGS is found among the existing checkpoints and the system restarts from there. Here, finding a CGS is quite tricky. The choice of checkpoints for the different processes is influenced by their mutual causal dependencies. The common approach is to use a rollback-dependent graph or checkpoint graph [1,2,4,13,20]. In coordinated checkpointing [2,3,7–9,11,13,19], all processes have to synchronize through control messages before taking checkpoints. These synchronization messages contribute extra overhead. On the other hand, in uncoordinated checkpointing some of the checkpoints taken may not lie on any CGS. Such checkpoints are called useless checkpoints. Useless checkpoints degrade system performance. Unlike uncoordinated checkpointing, coordinated checkpointing does not generate useless checkpoints.

Coordinated checkpointing algorithms are of two types: (a) blocking [6,7] and (b) non-blocking [2,3,13,14]. Blocking algorithms force all relevant processes in the system to block their computation during the checkpointing latency and hence degrade system performance. In non-blocking algorithms, application processes are not blocked when checkpoints are being taken.
Some of the main issues addressed in the literature are reducing the number of checkpoints, minimizing the complexity of synchronization, minimizing rollback, etc.

2. Earlier works

Several earlier works [2,3] on snapshot collection algorithms assume that at any point of time only one snapshot collection process is active. Koo and Toueg [7], Spezialetti and Kearns [16] and Prakash and Singhal [12] have proposed methods for handling concurrent initiations of snapshot collection.

According to Koo and Toueg's (K–T) algorithm, once a process takes a local checkpoint, either as an initiator or on request from another process, it becomes unwilling to take a checkpoint in response to another initiator's request. The process sends an 'unwilling' response to all subsequent requests until the checkpoint it has taken is made permanent or the checkpoint collection is aborted. This algorithm is blocking. Prakash and Singhal have shown that in this algorithm all the initiations may end up aborting, leading to wasted effort [12].

The Spezialetti and Kearns (S–K) algorithm [16] forces all processes to take local checkpoints, similar to the Chandy–Lamport checkpoint collection algorithm [3]. A process takes a local checkpoint for the first request and forwards that request to its neighbors. All subsequent requests are collected in a list called the border list. Once a process has received requests along all its incident edges, its checkpointing phase is complete. Then the process sends its border list to the process from which it received the first checkpoint request message. In this way, mutually disjoint sets of processes take their local checkpoints in response to requests from different initiators. Finally, the initiators communicate with each other and one checkpoint for each process is selected to build a CGS, which is minimal [10].

On the other hand, the Prakash and Singhal (P–S) algorithm [12] generates a CGS which is maximal [10]. Unlike the S–K algorithm, the P–S algorithm permits full propagation of checkpoint requests generated by all the concurrent checkpoint initiations.

Fig. 1. An example showing orphan message (m3) and missing message (m1). (Figure: processes P0 to P3 exchanging messages m1 to m4, with checkpoints marked.)

Fig. 2. An example showing the checkpointing for a single initiator (P3). (Figure: processes P0 to P4; legend: Ckpt_req, Ack_message, temporary checkpoint, permanent checkpoint; the figure marks TCT, checkpoint latency and CGS (1) with v_no = 1.)

Thus the P–S algorithm outputs a CGS with more recent checkpoints than the S–K algorithm. The S–K algorithm requires the transmission of O(n^2) messages to take the local checkpoints and O(m^2 n) messages for the information dissemination phase, with message size O(n/m). For m concurrent initiators, the P–S algorithm requires O(mn^2) messages to take tentative checkpoints. Another O(m^2 n) messages of size O(n) are exchanged for establishing a CGS. Although the number of messages required by the P–S algorithm is higher, they are sent concurrently, so the time to collect tentative checkpoints is comparable with that of the S–K algorithm.

Suppose n processes are running together. After one or more processes fail, the system recovers by rolling back to a CGS. Many recovery algorithms for distributed computing systems have been proposed in the literature [4,5,15,17,18]. The worst case message complexity of Strom and Yemini's algorithm [17] is O(2^n) for a single process failure.
Sistla and Welch have proposed an algorithm [15] which requires O(n^2) messages to be exchanged when O(n) information is appended to each application message, and O(n^3) message exchanges when O(1) extra information is appended to each message. Juang and Venkatesan proposed an algorithm [18] which appends O(1) extra information to each message. When an arbitrary number of processes fail, O(n^2) messages are sufficient for general networks, and for ring networks O(n) messages are necessary and sufficient for recovery. They also proposed an algorithm [18] for general networks where O(n) additional information is appended to each application message; O(kn) messages are required for rolling back all of the processes to a CGS when k processes fail.

In our proposed asynchronous coordinated checkpointing algorithm, processes take checkpoints independent of the message pattern. It allows any set of processes in the system to initiate checkpointing. A process need not consider the causal dependencies generated by the application messages. Due to the special nature of the ring network, the scheme does not need to trace dependencies at the time of rollback to get a CGS.

3. System model

We consider a distributed system consisting of n processes on a ring network. Processes are numbered P0, P1, P2, ..., P(n-1) sequentially, in the clockwise direction. In the case of the unidirectional ring, the direction of the ring is also assumed to be clockwise. There is no common clock or common memory. Message passing is the only mode of communication between any pair of processes. Two types of messages are generated by the processes: application messages for the underlying distributed application, and control messages to facilitate checkpointing and rollback of the system.
In the unidirectional ring, the ith process can directly send a message to the jth process if and only if j = (i + 1) mod n. In the bi-directional ring, the ith process can directly send a message to the jth process if and only if j = (i ± 1) mod n. The communication channel is assumed to be FIFO. In a link as well as in an intermediate node, a message arriving later does not leave before an earlier message. We assume that there is no link failure; only processes may fail. The computation is asynchronous, i.e., each process runs at its own speed and messages are exchanged with finite but arbitrary delays. Application messages are acknowledgement based, i.e., for each message an acknowledgement is required (this is to let the sender know that the receiver is alive).

In this paper, we consider logical checkpoints [11,19,21], which are slightly different from standard checkpoints. A logical checkpoint is a standard checkpoint (i.e., snapshot of the process) plus a list of messages which have been sent by this process but are unacknowledged at the time of taking the checkpoint. Message lists are updated continuously. After the acknowledgement for a message is received, the message is deleted from the list. Our algorithm allows the generation of missing messages in case the system has to roll back to its last checkpoint. At the time of restart after a failure, processes retransmit their unacknowledged messages (not all of which may be missing messages). So there may be duplicate messages after recovery from a failure, and these have to be handled using message identifiers.

In our algorithm, for each process, at most two checkpoints may have to be stored in the stable storage while the checkpointing procedure is running; otherwise one checkpoint per process is enough to make the system consistent. Checkpoints have a one-bit version number (v_no). In the beginning all processes start by taking a permanent checkpoint with v_no = 0.

4. Checkpointing algorithm for unidirectional ring network

A process has complete freedom to decide about checkpointing initiation, provided it does not have any temporary checkpoint. Any subset of the n processes may initiate checkpointing independently. Processes initiate checkpointing after certain time periods. An initiating process P_i itself takes a logical checkpoint and sends a checkpoint request message (ckpt_req) to the next process, with its own id as the initiator. The new checkpoint is marked temporary and stored in the stable storage, and the process sets initiator_flag_i = True (initially it was False). If the v_no of the existing checkpoint is 0 (1) then the v_no of the new checkpoint is 1 (0).

On getting a ckpt_req message, a process checks whether it has taken any temporary checkpoint or not; if not, then it takes a logical checkpoint with the received initiator id as the initiator id of the checkpoint and then forwards the ckpt_req to the next process. Within a TCT, each non-initiator process takes at most one temporary checkpoint. The checkpoint is taken in response to the first ckpt_req message; other ckpt_req messages are either forwarded or discarded depending on the id of the initiator of the request relative to the id of the receiver. Each forwarded ckpt_req message always contains the initiator id. If the receiver id is less than the initiator id and the receiver P_j has already taken a temporary checkpoint, then that ckpt_req is discarded provided initiator_flag_j is True.

When a process knows that all other n - 1 processes have already taken their temporary checkpoints to the stable storage, that process takes a permanent logical checkpoint directly, or changes the state of its existing checkpoint to permanent if its state was temporary. Then it deletes its old permanent checkpoint and sends an ack_message to the next process, with its own id as the ack_message generator.
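The request-handling rules of this section, together with the ack_message rule given by the ack_message receiver algorithm of this section, can be exercised in a small simulation. The following Python sketch is not from the paper: the function name run_unidirectional_checkpointing, the tuple encoding of messages, the random scheduler standing in for asynchrony, and the assumption that all initiators start at the same instant before any message is delivered are illustrative choices. It only tracks version numbers, not process snapshots or unacknowledged-message lists.

```python
import random
from collections import deque

def run_unidirectional_checkpointing(n, initiators, seed=0):
    """Simulate one checkpointing round on a clockwise ring of n >= 2 processes."""
    rng = random.Random(seed)
    perm = [0] * n                       # v_no of each permanent checkpoint
    temp = [None] * n                    # v_no of the temporary checkpoint, if any
    init_flag = [False] * n              # initiator_flag of each process
    links = [deque() for _ in range(n)]  # links[i]: FIFO channel P_i -> P_(i+1)
    sent = 0                             # control messages put on any link

    def send(i, msg):
        nonlocal sent
        links[i].append(msg)
        sent += 1

    def commit(j):
        # temporary -> permanent; the old permanent checkpoint is dropped
        if temp[j] is not None:
            perm[j], temp[j] = temp[j], None

    for i in initiators:                 # assumed: concurrent initiations
        temp[i] = 1 - perm[i]
        init_flag[i] = True
        send(i, ("ckpt_req", i))

    while any(links):
        # asynchrony: deliver the head of an arbitrary non-empty link
        i = rng.choice([k for k in range(n) if links[k]])
        kind, r = links[i].popleft()
        j = (i + 1) % n                  # receiver
        is_pred = j == (r - 1) % n       # j immediately precedes the message's originator
        if kind == "ckpt_req":
            if temp[j] is not None:
                if is_pred:
                    commit(j)
                    send(j, ("ack", j))
                elif j > r or not init_flag[j]:
                    send(j, (kind, r))   # forward
                # otherwise j < r and j is an initiator itself: discard
            elif is_pred:
                perm[j] = 1 - perm[j]    # new checkpoint taken as permanent directly
                send(j, ("ack", j))
            else:
                temp[j] = 1 - perm[j]
                send(j, (kind, r))
        else:                            # ack message
            commit(j)
            if not is_pred:
                send(j, (kind, r))
    return perm, temp, sent
```

Calling, for example, `run_unidirectional_checkpointing(5, [1, 3], seed=7)` returns the final permanent version of every process, the remaining temporary checkpoints, and the number of control messages sent; varying `seed` changes the delivery order while each link stays FIFO.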
On receiving an ack_message, a process changes its temporary checkpoint to permanent, deletes its previous permanent checkpoint, and discards or forwards the ack_message to the next process depending on whether the receiver (id = i) is the immediate predecessor of the ack_message initiator (ack_initiator_id = j), i.e., whether i = (j - 1) mod n or not, respectively. When all processes have made their checkpoints with the new v_no permanent, that set of checkpoints is a CGS. So in our algorithm consistent global checkpointing means that each process has a permanent checkpoint in its own stable storage with the same v_no over all processes.

Algorithm. Unidirectional Checkpointing Initiator_i
/* This algorithm is executed by process P_i when P_i decides to initiate a checkpointing */
begin
    if there is no temporary checkpoint then
        take a new temporary checkpoint with new v_no   /* v_no = 0/1, if the existing checkpoint v_no = 1/0 */
        set initiator_flag_i ← True   /* process P_i initiating checkpoint, initially it was False */
        set initiator_id ← i   /* initiator_id is attached to the checkpoint request; it denotes the id of the initiator */
        send a ckpt_req to the next process with initiator_id
    end if
end

Algorithm. ckpt_req receiver_j
/* This algorithm is executed by process P_j when it receives a ckpt_req */
begin
    if temporary checkpoint exists then
        if j is the immediate predecessor of the rec_initiator_id then   /* rec_initiator_id is the initiator id of the ckpt_req generator */
            delete the existing permanent checkpoint
            make the existing temporary checkpoint permanent
            set ack_initiator_id ← j
            generate and send ack_message to the next process with ack_initiator_id   /* ack_initiator_id is the id of the ack_message generator */
        else if j > rec_initiator_id then
            forward the ckpt_req to the next process
        else   /* j < rec_initiator_id */
            if initiator_flag_j is False then   /* process has not initiated any checkpointing */
                forward the ckpt_req to the next process
            else   /* initiator_flag_j is True */
                discard the ckpt_req
            end if
        end if
    else   /* permanent checkpoint exists, but no temporary checkpoint */
        if j is the immediate predecessor of the rec_initiator_id then
            delete old permanent checkpoint
            take a new permanent checkpoint with new v_no
            set ack_initiator_id ← j
            generate and send ack_message to the next process with ack_initiator_id
        else   /* process j is not the immediate predecessor of the rec_initiator_id */
            take a new temporary checkpoint with new v_no
            forward the ckpt_req to the next process
        end if
    end if
end

Algorithm. ack_message receiver_j
/* This algorithm is executed by process P_j when it receives an ack_message */
begin
    if j is not the immediate predecessor of the initiator of this acknowledgement message then
        if temporary checkpoint exists then
            delete the permanent checkpoint
            make the existing temporary checkpoint permanent
        end if
        forward ack_message to the next process
    else   /* process j is the immediate predecessor of the ack_initiator_id */
        if temporary checkpoint exists then
            delete the permanent checkpoint
            make the existing temporary checkpoint permanent
        end if
        discard ack_message
    end if
end

5. Checkpointing algorithm for bi-directional ring network

Here also, any process (say P_i) that has not taken a temporary checkpoint may decide to initiate checkpointing. In such a case, the process itself takes a new temporary logical checkpoint. The version number of the new checkpoint is the complement of the version number of the existing permanent checkpoint. Then the initiator sends a ckpt_req to both of its neighboring processes (= (i ± 1) mod n). On receiving a ckpt_req message, the conditions for a process taking a new temporary checkpoint, or forwarding or discarding the ckpt_req, are the same as those of the previous algorithm. The forwarding is done to the neighbor other than the one from which the request was received. When a process receives a ckpt_req initiated by the same initiator twice, the process changes the existing ckpt_state from temporary (T) to permanent (P), deletes its old permanent checkpoint and forwards the ckpt_req message. No acknowledgement message is required here. The final initiator (the one with minimum id among all initiators who have initiated checkpointing within the TCT) receives two forwarded ckpt_req messages. When it receives the first request, it changes the existing ckpt_state from T to P, deletes its old permanent checkpoint and stops the propagation of that ckpt_req message. When it receives the other ckpt_req, it just discards the message and the checkpointing algorithm terminates.

5.1. Checkpointing algorithm for bi-directional

Algorithm. Bi-directional Checkpointing Initiator_i
/* This algorithm is executed by process P_i when P_i decides to initiate a checkpointing */
begin
    if there is no temporary checkpoint then
        take a new temporary checkpoint with new v_no   /* v_no = 0/1, if the existing checkpoint v_no = 1/0 */
        set initiator_id ← i
        send ckpt_req to both adjacent processes with initiator_id
    end if
end

Algorithm. ckpt_req receiver_j
/* This algorithm is executed by process P_j when it receives a ckpt_req */
begin
    if there is no temporary checkpoint then
        take a new temporary checkpoint with new v_no
        set initiator_id ← rec_initiator_id   /* rec_initiator_id is the initiator id of the ckpt_req */
        forward the ckpt_req to the other adjacent process   /* other than the process from which the ckpt_req was received */
    else   /* temporary checkpoint exists */
        if initiator_id < rec_initiator_id then
            discard the ckpt_req message
        else if initiator_id > rec_initiator_id then
            set initiator_id ← rec_initiator_id
            forward the ckpt_req to the other adjacent process
        else   /* initiator_id is equal to rec_initiator_id */
            if j = rec_initiator_id then
                delete the existing permanent checkpoint
                make the existing temporary checkpoint permanent
                discard the ckpt_req message
            else   /* j ≠ rec_initiator_id */
                delete the existing permanent checkpoint
                make the existing temporary checkpoint permanent
                forward the ckpt_req to the other adjacent process
            end if
        end if
    end if
end

6. Recovery algorithm for unidirectional ring network

We assume that, when the faulty process is restored, it initiates the recovery process by sending recovery messages to the other processes. The recovery algorithm finds a v_no for which checkpoints exist in all the processes. Processes may fail while the distributed application is running or while the checkpointing process is in execution. But a checkpoint is made permanent only when checkpoints for that v_no have already been taken in all other processes. Thus the existence of even one permanent checkpoint indicates that checkpoints for this v_no are present in all other processes. Note that some of them may be temporary, while the rest are permanent. In such a case, the temporary checkpoints are made permanent by the recovery algorithm. Processes resume computation from these checkpoints.
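The invariant stated above (a checkpoint becomes permanent only after every process holds a checkpoint with that v_no) guarantees that some version is held by all processes. As a hedged, centralized illustration only (the paper reaches the same decision with messages travelling around the ring), the Python sketch below picks that version from a global view. The `(permanent v_no, temporary v_no or None)` encoding and the names `recovery_version` and `apply_recovery` are assumptions made here, not part of the paper.

```python
from typing import List, Optional, Tuple

# Per-process checkpoint state: (permanent v_no, temporary v_no or None)
Ckpts = Tuple[int, Optional[int]]

def recovery_version(states: List[Ckpts]) -> int:
    """Return a v_no for which every process holds a checkpoint."""
    common = {0, 1}                      # versions are one bit wide
    for perm, temp in states:
        held = {perm} if temp is None else {perm, temp}
        common &= held
    if not common:
        raise ValueError("no v_no is held by every process")
    if len(common) == 1:
        return common.pop()
    # Every process holds both versions, so each one has a temporary
    # checkpoint; prefer that newer (temporary) version.
    return states[0][1]

def apply_recovery(states: List[Ckpts], v: int) -> List[Ckpts]:
    """Make version v permanent everywhere and drop other temporaries."""
    result = []
    for perm, temp in states:
        if temp == v:
            result.append((v, None))     # temporary made permanent, old permanent deleted
        elif perm == v:
            result.append((perm, None))  # any temporary checkpoint is deleted
        else:
            raise ValueError("process holds no checkpoint with this v_no")
    return result
```

For instance, if one process has already committed version 1 while the others still hold it as temporary, `recovery_version` selects 1 and `apply_recovery` promotes the temporaries; if some process never took the new checkpoint, the old version is selected and the temporaries are discarded.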
At the time of restart, processes resend the messages (in the same order as sent before) which were unacknowledged at the moment of taking the checkpoint. There may be duplicate messages after this re-sending; they have to be filtered out using message identifiers at the receiver end.

Our recovery algorithm generates two types of messages: (1) the recovery message, for synchronization on the same checkpoint version number; (2) the resume message, which the initiator sends to all processes after synchronization. After receiving the resume message a process resumes computation from the latest checkpoint. If the checkpoint corresponding to this v_no happens to be temporary, it is made permanent and the old permanent checkpoint is deleted.

When a process initiates the recovery process by sending a recovery message to its neighbor, it sends its own id as the recovery initiator together with its latest checkpoint v_no (whether that checkpoint is permanent or temporary). On receiving a recovery message a process compares its own v_no with rec_v_no. If they are identical, it sets its own initiator_id to rec_initiator_id and forwards the recovery message, as it is, to the next process. When this forwarded message reaches its initiator, the initiator generates a resume message with its own id. If v_no is not equal to the received v_no, then the process checks its ckpt_state. If ckpt_state = T then it deletes its temporary checkpoint, keeping the permanent checkpoint, and forwards the recovery message to the next process. If its ckpt_state = P then this process takes over the role of the initiator. In this case the process sends a recovery message to the next process with its own id as the recovery_initiator_id and its own v_no. When a process receives a resume message, if its ckpt_state = T it sets ckpt_state = P and deletes its old permanent checkpoint. If its ckpt_state was P then no changes are made.
Whether or not its checkpoint state changed, the process forwards the resume message to the next process, unless it knows that all the previous n − 1 processes already know about this resume message.

6.1. Recovery algorithm for unidirectional

Algorithm. Unidirectional Recovery Initiator_i
/* This algorithm is executed by process P_i when the faulty process P_i is restored */
begin
    set recovery_initiator_id ← i
    send recovery message to the next process with latest checkpoint v_no and recovery_initiator_id
    /* latest checkpoint may be temporary or permanent */
end

Algorithm. Recovery message receiver_j
/* This algorithm is executed by process P_j when it receives a recovery message */
begin
    if process id j and v_no of the latest checkpoint match the initiator id and v_no of the recovery message then
        if latest checkpoint is temporary then
            delete the permanent checkpoint
            make the temporary checkpoint permanent
        end if
        generate resume message
        set resume_initiator_id ← j
        send resume message to the next process with resume_initiator_id
    else if v_no of the latest checkpoint does not match v_no of the recovery message but j ≠ initiator id then
        if temporary checkpoint exists then
            delete the temporary checkpoint
            rollback to its previous permanent checkpoint
            forward recovery message to the next process
        else   /* temporary checkpoint does not exist */
            discard recovery message
            Unidirectional Recovery Initiator_j   /* set process j as a new recovery initiator */
        end if
    else   /* v_no of the latest checkpoint matches v_no of the recovery message but j ≠ initiator id */
        forward recovery message to the next process
    end if
end

Algorithm. Resume message receiver_j
/* This algorithm is executed by process P_j when it receives a resume message */
begin
    if process j is not the immediate predecessor of the initiator of this resume message then
        if temporary checkpoint exists then
            delete the existing permanent checkpoint
            make the existing temporary checkpoint permanent
        end if
        forward resume message to the next process
    else   /* process j is the immediate predecessor of the initiator of this resume message */
        if temporary checkpoint exists then
            delete the existing permanent checkpoint
            make the existing temporary checkpoint permanent
        end if
        terminate resume message
    end if
end

7. Recovery algorithm for bi-directional ring network

In this algorithm we use two flags, flag_visit and flag_resume, for every process P_i. When P_i comes to know about a fault, it sets both its flags to False. As in the unidirectional recovery algorithm, when the faulty process P_i is restored, it initiates the recovery process. It sends a recovery message (reco message) along with its latest checkpoint v_no (irrespective of whether it is permanent or temporary) to its two neighbors, P_((i+1) mod n) and P_((i−1) mod n), and sets flag_visit = True. The algorithm finds a v_no for which checkpoints exist in all processes. The resume message is not required here.

When a process P_i receives a reco message, it compares rec_v_no with its own v_no. If they are equal, P_i checks flag_visit. If flag_visit = False, P_i sets flag_visit = True and forwards the message. If flag_visit = True and P_i's ckpt_state is T, it deletes its old permanent checkpoint and changes the value of ckpt_state to P; if its ckpt_state is P, the ckpt_state remains unchanged. In both cases the message is forwarded to the next process in the direction of travel of the reco message, and the process resumes computation from the permanent checkpoint. When a process resumes computation it sets flag_resume = True. The reco message is forwarded until it reaches a process whose flag_resume = True.

If rec_v_no is not equal to v_no and ckpt_state is T, P_i deletes its current temporary checkpoint and forwards the message; otherwise it sends a reco message to the next process with its own checkpoint v_no. In both cases P_i sets flag_visit = True.

Algorithm.
Bi-directional Recovery Initiator_i
/* This algorithm is executed by process P_i when the faulty process P_i is restored */
begin
    set flag_visit ← True
    send recovery message to both adjacent processes with latest checkpoint v_no
end

Algorithm. Recovery message receiver_j
/* This algorithm is executed by process P_j when it receives a recovery message */
begin
    if flag_resume is True then   /* recovery for process j is complete */
        discard the recovery message
    else   /* flag_resume is False: recovery for the process is not yet complete */
        if v_no of process j matches the v_no of the recovery message then
            if flag_visit is True then   /* process j already received a recovery message */
                if temporary checkpoint exists then
                    delete the existing permanent checkpoint
                    make the temporary checkpoint permanent
                end if
                set flag_resume ← True
            else   /* flag_visit is False: this is the first recovery message received by process j */
                set flag_visit ← True
            end if
            forward the recovery message to the other adjacent process
        else   /* v_no of process j does not match the v_no of the recovery message */
            if temporary checkpoint exists then
                set flag_visit ← True
                delete the existing temporary checkpoint
                rollback to its previous permanent checkpoint
                forward recovery message to the other adjacent process
            else   /* permanent checkpoint exists but no temporary checkpoint */
                set flag_visit ← True
                send recovery message to the other adjacent process with its own checkpoint v_no
            end if
        end if
    end if
end

8. Correctness of the proposed algorithms

For unidirectional as well as bi-directional rings, in order to show that the proposed algorithm is correct, we first show that, at any point of time, there exists a value of v_no for which each process has a checkpoint. Then we show that the set thus obtained is indeed a CGS, i.e., it does not contain any orphan message.

Theorem 1.
In the proposed checkpointing algorithm for unidirectional ring, at any point of time, there exists exactly one value of v_no for which each process has a checkpoint.

Proof. If no checkpointing process is in execution, then the checkpoints corresponding to the v_no of the last checkpointing process are available in every process. Now consider a point of time, say t, within the checkpointing latency.

If t is within the TCT, the new checkpoint has not been made permanent anywhere, so the permanent checkpoints in all the processes correspond to the previous checkpointing latency and hence have the same v_no. For example, in Fig. 3, process P_2 fails before taking its new checkpoint. In this case every process has to roll back to its permanent checkpoint, which yields the previously existing CGS(1).

If t is after the TCT, at least one process has taken a new permanent checkpoint. Since a checkpoint is made permanent only when all other processes have taken temporary or permanent checkpoints corresponding to that v_no, a checkpoint (temporary or permanent) corresponding to the new v_no is available with every process. For example, in Fig. 4, process P_2 has failed after taking its permanent checkpoint. In this case the system does not roll back to CGS(1). Instead, the recovery algorithm goes to CGS(0), using the currently existing checkpoints (temporary or permanent) with v_no = 0 for all processes.

Now note that before the first permanent checkpoint corresponding to a new checkpointing process is taken, the existing permanent checkpoint is deleted. Thus at any point of time there is only one complete set of checkpoints. □

Our recovery algorithm finds a set of checkpoints corresponding to the same v_no. It only remains to show that the set thus obtained is consistent.

[Fig. 4 shows processes P0 to P4 with the checkpoint lines CGS(1) (v_no = 1) and CGS(0) (v_no = 0).]
Fig. 4. An example showing the checkpointing and recovery when a failure occurs after TCT but within the checkpointing latency.
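The two situations used in the proof of Theorem 1 can be replayed with a small Python simulation of the unidirectional recovery rules of Section 6. This is a sketch under stated assumptions: a single recovery message circulates at a time, the class `Proc`, the function `recover` and the hop bound are hypothetical names and choices, and the process states below are only plausible readings of Figs. 3 and 4, not data taken from them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proc:
    perm: int                   # v_no of the permanent checkpoint
    temp: Optional[int] = None  # v_no of the temporary checkpoint, if any

    def latest(self) -> int:
        return self.perm if self.temp is None else self.temp

    def commit(self) -> None:
        """Make the temporary checkpoint permanent, dropping the old permanent one."""
        if self.temp is not None:
            self.perm, self.temp = self.temp, None


def recover(ring: List[Proc], restored: int) -> int:
    """Run recovery started by the restored process; return the agreed v_no."""
    n = len(ring)
    initiator, v_no = restored, ring[restored].latest()
    j = (initiator + 1) % n
    for _ in range(3 * n):  # hypothetical safety bound for malformed inputs
        p = ring[j]
        if p.latest() == v_no:
            if j == initiator:   # message is back at its initiator
                p.commit()
                break
            # v_no matches at a non-initiator: forward unchanged
        elif p.temp is not None:
            p.temp = None        # delete temporary checkpoint, forward unchanged
        else:
            initiator, v_no = j, p.perm  # take over as the new recovery initiator
        j = (j + 1) % n
    else:
        raise RuntimeError("recovery message did not return to its initiator")
    # Resume round: every other process makes its temporary checkpoint permanent.
    k = (initiator + 1) % n
    while k != initiator:
        ring[k].commit()
        k = (k + 1) % n
    return v_no


OLD, NEW = 1, 0  # version labels chosen to mirror CGS(1) and CGS(0) in the figures
# Failure within TCT: P2 fails before taking its new checkpoint.
within_tct = [Proc(OLD, NEW), Proc(OLD, NEW), Proc(OLD), Proc(OLD), Proc(OLD)]
agreed = recover(within_tct, restored=2)
```

In the first scenario the temporary checkpoints are discarded and every process ends on the old permanent checkpoint; if P2 had instead already committed the new version, the same function would carry all processes forward to it.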
Theorem 2. The set of checkpoints corresponding to the v_no in Theorem 1 is consistent.

Proof. Suppose that, for an application message, the send event is not recorded in our set of checkpoints. Then, in the sender process, the checkpoint was taken before sending this application message. As the checkpoint was taken earlier, the ckpt_req message following the checkpoint also precedes the application message. As we assume that the channel is FIFO and unidirectional, the ckpt_req message will always be received before the application message. All processes therefore take checkpoints before receiving the message. Hence, no checkpoint will show this message being received. □

Theorem 3. In the proposed checkpointing algorithm for bi-directional ring, at any point of time, there exists exactly one value of v_no for which each process has a checkpoint.

Proof. If no checkpointing process is in execution, then the checkpoints corresponding to the v_no of the last checkpointing process are available in every process. Now consider a point of time, say t, within the checkpointing latency. If t is within the TCT (Fig. 5), not all processes have taken their temporary checkpoint for the ongoing checkpointing process, so the permanent checkpoints in all the processes correspond to the previous checkpointing latency and hence have the same v_no. If t is after the TCT but within the checkpointing latency (Fig. 6), every process has taken a new temporary checkpoint. Some processes may also have made these checkpoints permanent. So a checkpoint corresponding to the new v_no is available with every process. □

Fig. 3. An example showing the checkpointing and recovery when a failure occurs within TCT.

Fig. 5. An example showing the checkpointing and recovery when a failure occurs within TCT.

Fig. 6. An example showing the checkpointing and recovery when a failure occurs after TCT but within the checkpointing latency.

Theorem 4. The set of checkpoints corresponding to the v_no in Theorem 3 is consistent.

Proof. Suppose that, for an application message, the send event is not recorded in our set of checkpoints. Then the checkpoint was taken before sending this application message. Since the checkpoint was taken earlier, the ckpt_req message following the checkpoint also precedes the application message. As we assume that the channel is FIFO, the ckpt_req message will always be received before the application message. All processes therefore take checkpoints before receiving the message. Hence, no checkpoint will show this message being received. □

9. Complexity analysis

For both unidirectional and bi-directional checkpointing algorithms, a process takes exactly one checkpoint in a single checkpointing latency. Several checkpointing request messages may, however, be generated because of multiple concurrent initiations. Among those, only the request message with the minimum initiator id survives and goes round the ring once. All other request messages are dominated by that message and are discarded before completing a round. By the time the surviving message completes the round, all other requests have been discarded. One more round may be necessary for the confirmation in the case of a unidirectional ring. In the case of a bi-directional ring, the two checkpointing request messages go round the ring just once each, along different directions, and no separate confirmation is required. Thus, the checkpointing time is $O(n)$ in both cases.

With respect to message complexity (i.e., the number of control messages), the worst case occurs when all the processes initiate at the same point of time. If this happens in a unidirectional ring, the ckpt_req message from P_0 goes to all other processes. For P_i (i ≠ 0) it goes up to P_0 and is discarded.
Thus a total of $(n-1) + ((n-1) + (n-2) + \cdots + 2 + 1) = (n-1)(n+2)/2$ messages are generated. In addition, $(n-1)$ acknowledgement messages are generated. Thus, a total of $(n-1)(n+4)/2$ $(= O(n^2))$ control messages are generated. For the bi-directional ring, the ckpt_req message from P_0 goes to all other processes along both directions and comes back. For all other processes, packets going in the clockwise direction go up to P_0 and those going in the counter-clockwise direction go just one hop each and are discarded. Thus a total of $2n + (n + (n-1) + \cdots + 2) = (2n-1) + n(n+1)/2$ $(= O(n^2))$ control messages are generated. For the rollback recovery algorithms in unidirectional and bi-directional rings, the worst case time complexities and message complexities are all $O(n)$.

10. Comparison with existing algorithms

The K–T algorithm does not work on a unidirectional ring network when multiple processes initiate checkpoints concurrently. In such a case, all the checkpointing processes end up aborting [12]. Like the S–K algorithm, our algorithm takes n temporary checkpoints, one for each process, and this does not depend on the number of concurrent initiations. In the P–S algorithm, if all processes are dependent on each other and k processes initiate checkpointing concurrently, each process takes k temporary checkpoints, i.e., a total of kn checkpoints for the system. Both the S–K and P–S algorithms are designed for general network topologies. Their worst case message complexities are $O(n^3)$, and this worst case is attained even on a simple unidirectional ring. The message complexity of the proposed algorithm is $O(n^2)$. Table 1 compares the proposed algorithms with the S–K algorithm and the P–S algorithm.

Table 1
Performance of the proposed algorithms and other existing algorithms

Criterion | S–K algorithm | P–S algorithm | Proposed checkpointing algorithms
Network topology for which applicable | General | General | Ring (unidirectional/bi-directional)
Worst case message complexity (Ring) | $O(n^3)$ | $O(n^3)$ | $O(n^2)$
Time complexity (Ring) | $O(n)$ | $O(n)$ | $O(n)$
Control message size | $O(n/k)$ | $O(n)$ | $O(1)$
Number of checkpoints stored for k concurrent initiations | One permanent and one temporary checkpoint for each process | One permanent and k temporary checkpoints for each process | One permanent and one temporary checkpoint for each process
Number of checkpoints rolled back after a failure | At most one temporary checkpoint | At most k temporary checkpoints | At most one temporary checkpoint

[Fig. 7 has four panels, for 1/λ_fault = 0.03, 0.05, 0.08 and 0.10 (in millions). x-axis: 1/λ_ckpt (in thousands); y-axis: overhead (in millions); curves: recovery overhead, checkpointing overhead, total overhead.]
Fig. 7. Simulation results showing rollback recovery, checkpointing, and total overhead costs for the proposed algorithm in a unidirectional ring.

11. Simulation results

Simulation studies were conducted on the behavior of the proposed algorithm when implemented on unidirectional as well as bi-directional ring networks. We assume that the inter-fault time, inter-checkpoint time and inter-message send time for a process follow exponential distributions with parameters λ_fault, λ_ckpt and λ_send, respectively.
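The exponential timing assumption above can be illustrated with a short Python sketch that generates event times for one process. It is only a sketch: the function name `event_times`, the seed, and the particular rate and horizon are assumptions chosen for the example, and `random.expovariate` from the standard library stands in for whatever generator the original simulation program used.

```python
import random
from typing import List


def event_times(rate: float, horizon: float, rng: random.Random) -> List[float]:
    """Return the event times in [0, horizon) of a process whose gaps are exponential with the given rate."""
    t = 0.0
    times: List[float] = []
    while True:
        t += rng.expovariate(rate)  # gap between consecutive events, mean 1/rate
        if t >= horizon:
            return times
        times.append(t)


rng = random.Random(1)          # fixed seed so that a run is repeatable
mean_gap = 500.0                # an assumed value of 1/lambda_ckpt, in time units
horizon = 20_000_000.0          # assumed simulated duration, in time units
ckpt_times = event_times(1.0 / mean_gap, horizon, rng)
observed_mean_gap = ckpt_times[-1] / len(ckpt_times)
```

The same function can be called with λ_fault and λ_send to obtain fault and message-send instants; merging the three streams in time order gives the event sequence a simulator would step through.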
The simulation program takes λ_fault, λ_ckpt and λ_send as input parameters. It is assumed that a message takes one unit of time to travel across one link. The time for taking a checkpoint is assumed to be 5000 units. We varied 1/λ_ckpt between 500 and 300,000 with increments of 100 for a fixed value of 1/λ_fault (between 30,000 and 100,000), whereas λ_send remains fixed. The simulation was carried out for 20,000,000 units of time. For each set of values of the input parameters, the program was run 20 times and the average of the 20 runs was taken.

Figs. 7 and 8 show the simulated values of checkpointing overhead, rollback recovery overhead and total overhead for our proposed unidirectional and bi-directional algorithms, respectively. Total overhead is the sum of checkpointing overhead and rollback recovery overhead. As the checkpointing rate (λ_ckpt) decreases, checkpointing overhead decreases while recovery overhead goes up. Initially the total overhead decreases with decreasing checkpointing rate; at this stage the checkpointing overhead is the dominant cost. After the total overhead reaches a minimum, the rollback cost starts to dominate and the total overhead starts increasing again. In each case we show the optimum value of the checkpointing rate that minimizes the total overhead. As the fault rate (λ_fault) goes down (in different graphs), the recovery cost also goes down and the optimum checkpointing rate goes down too. The number of control messages is affected strongly by the number of concurrent initiations: concurrent initiations abort many control messages without letting them complete the cycle. This explains the variation in the curves.

Table 2 compares the proposed Unidirectional (U) and Bi-directional (B) algorithms with the S–K and P–S algorithms in terms of the total number of control messages. Simulated runs of the four algorithms were carried out. We assume that whenever one process initiates checkpointing, all the processes take checkpoints.
We have simulated systems with 4, 6 and 10 processes each. The value of 1/λ_ckpt was taken as 200, 400 or 600. For each algorithm, the total number of control messages was computed as the average over 200 different runs. The results clearly reflect the superiority of the proposed algorithms. The difference is more pronounced for larger systems with a higher number of processes.

[Fig. 8 has four panels, labelled 1/λ_fault = 0.03, 0.50, 0.08 and 0.10 (in millions). x-axis: 1/λ_ckpt (in thousands); y-axis: overhead (in millions); curves: recovery overhead, checkpointing overhead, total overhead.]
Fig. 8. Simulation results showing rollback recovery, checkpointing, and total overhead costs for the proposed algorithm in a bi-directional ring.

Table 2
Simulation results comparing the number of control messages for the proposed Unidirectional (U) and Bi-directional (B) algorithms with the S–K and P–S algorithms

No. of processes | Algorithm | 1/λ_ckpt = 200 | 1/λ_ckpt = 400 | 1/λ_ckpt = 600
4 | U | 16 | 18 | 19
4 | B | 24 | 24 | 25
4 | S–K | 58 | 38 | 27
4 | P–S | 188 | 174 | 158
6 | U | 33 | 36 | 39
6 | B | 41 | 45 | 45
6 | S–K | 239 | 155 | 104
6 | P–S | 841 | 830 | 787
10 | U | 82 | 88 | 94
10 | B | 80 | 91 | 98
10 | S–K | 1502 | 958 | 663
10 | P–S | 28,720 | 43,442 | 38,899

12. Conclusion

In this work, we have proposed checkpointing and recovery algorithms for unidirectional as well as bi-directional ring networks. In our model, processes take logical checkpoints, i.e., a snapshot of the process plus the unacknowledged messages. Our algorithm can handle multiple initiations of checkpointing. During recovery each process has to roll back at most one checkpoint. For each process at most two checkpoints (one permanent and one temporary) may be saved in stable storage. For the checkpointing as well as the recovery algorithms, the control message makes two rounds along the unidirectional ring and one round along the bi-directional ring. Although checkpointing schemes for general network topologies are available in the literature, there is scope for improvement for particular classes of topologies. A more general approach showing the effect of the topology on the complexities of checkpointing algorithms may also be considered.

Acknowledgments

The first author is thankful to the Council of Scientific and Industrial Research (CSIR), India, for financial support during this work. The authors thank the anonymous reviewers for their constructive criticism and helpful suggestions. The authors are also grateful to Professor Bhabani P. Sinha of ACM Unit, Indian Statistical Institute, Kolkata, for his patient hearing and many suggestions which have improved the organization of the paper.

References

[1] B. Bhargava, S.R. Lian, Independent checkpointing and concurrent rollback for recovery in distributed systems-an optimistic approach, in: Proceedings of the Seventh IEEE Symposium on Reliable Distributed System, 1988, pp. 3–12.
[2] G. Cao, M. Singhal, On coordinated checkpointing in distributed systems, IEEE Trans. Parallel Distrib. Systems 9 (12) (1998) 1213–1225.
[3] K.M. Chandy, L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Trans. Comput. Systems 3 (1) (1985) 63–75.
[4] E.N. Elnozahy, Lorenzo Alvisi, Yi-Min Wang, D.B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Comput. Surveys 34 (3) (2002) 375–408.
[5] D. Johnson, W. Zwaenepoel, Recovery in distributed systems using optimistic message logging and checkpointing, J. Algorithms 3 (11) (1990) 462–491.
[6] J.L. Kin, T. Park, An efficient protocol for checkpointing recovery in distributed system, IEEE Trans. Parallel Distrib. Systems 5 (8) (1998) 955–960.
[7] R. Koo, S. Toueg, Checkpointing and rollback-recovery for distributed system, IEEE Trans. Software Eng. 13 (1) (1987) 23–31.
[8] D. Manivannan, M. Singhal, A low-overhead recovery technique using quasi-synchronous checkpointing, in: Proceedings of the IEEE Sixth International Conference on Distributed Computer Systems, May 1996, pp. 100–107.
[9] D. Manivannan, M. Singhal, Quasi-synchronous checkpointing: models, characterization, and classification, IEEE Trans. Parallel Distrib. Systems 10 (7) (1999) 703–713.
[10] F. Mattern, Virtual time and global states of distributed systems, in: M. Cosnard et al. (Ed.), Proceedings of the Workshop on Parallel and Distributed Algorithm, Elsevier Science Publishers B. V., North-Holland, Amsterdam, 1989, pp. 215–226.
[11] K.Z. Meth, W.G. Tuel, Parallel checkpoint/restart without message logging, in: Proceedings of the IEEE 28th International Conference on Parallel Processing (ICPP '00), August 2000, pp. 253–258.
[12] R. Prakash, M. Singhal, Maximal global snapshot with concurrent initiators, in: Proceedings of the Sixth IEEE Symposium of Parallel and Distributed Processing, October 1994, pp. 334–351.
[13] R. Prakash, M. Singhal, Low-cost checkpointing and failure recovery in mobile computing systems, IEEE Trans. Parallel Distrib. Systems 7 (10) (1996) 1035–1048.
[14] L.M. Silva, J.G. Silva, Global checkpointing for distributed programs, in: Proceedings of the 11th Symposium on Reliable Distributed Systems, 1992, pp. 115–162.
[15] A.P. Sistla, J. Welch, Efficient distributed recovery using message logging, in: Proceedings of the ACM Symposium on Principle of Distributed Computing, 1989, pp. 223–238.
[16] M. Spezialetti, P. Kearns, Efficient distributed snapshots, in: Proceedings of the Sixth ICDCS, 1986, pp. 382–388.
[17] R.E. Strom, S. Yemini, Optimistic recovery in distributed systems, ACM Trans. Comput. Systems 3 (3) (1985) 204–226.
[18] T.T-Y. Juang, S. Venkatesan, Efficient algorithms for crash recovery in distributed systems, in: Proceedings of the 10th Conference on FSTTCS, Springer, Berlin, December 1990, pp. 349–361.
[19] N.H. Vidya, Staggered consistent checkpointing, IEEE Trans. Parallel Distrib. Systems 10 (7) (1999) 694–702.
[20] Y.M. Wang, Consistent global checkpoints that contain a given set of local checkpoints, IEEE Trans. Comput. 46 (4) Apr. (1997) 456–468.
[21] Y.M. Wang, Y. Huang, W.K. Fuchs, Progressive retry for software error recovery in distributed systems, in: Proceedings of the IEEE Fault-Tolerant Computing Symposium (FTCS-23), June 1993, pp. 138–144.

Partha Sarathi Mandal received a Bachelor of Science (Hons.) degree in Mathematics from the University of Calcutta, India, and a Master of Science degree in Mathematics from Jadavpur University, India, in 1995 and 1997, respectively. He was awarded Junior and Senior Research Fellowships by the Council of Scientific & Industrial Research (CSIR), India. He is currently working towards his Ph.D. degree in Computer Science at the Advanced Computing and Microelectronics Unit of the Indian Statistical Institute, Kolkata. His current research interests include parallel and distributed computing, fault tolerance, mobile agents, performance analysis, etc.

Krishnendu Mukhopadhyaya received a Bachelor of Statistics (Hons.), Master of Statistics, Master of Technology in Computer Science, and Ph.D. in Computer Science, all from the Indian Statistical Institute, Kolkata, in 1985, 1987, 1989 and 1994, respectively. From 1993 to 1999 he worked as a Lecturer in the Department of Mathematics, Jadavpur University.
Since 1999, he has been working at the Indian Statistical Institute, Kolkata, as an Associate Professor. He was a recipient of the Young Scientist Award of the Indian Science Congress Association and the BOYSCAST Fellowship of the Department of Science and Technology, Government of India. His current research interests include mobile computing, parallel and distributed computing, sensor networks, etc.
