This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all memory-based. It enables recovery from a wide class of errors, including the permanent loss of an entire node. To maintain high performance, ReVive includes specialized hardware that performs frequent operations, such as log and parity updates, in the background. To keep the cost low, the more complex checkpointing and recovery functions are performed in software, while the hardware modifications are limited to the directory controllers of the machine. Our simulation results on a 16-processor system indicate that the average error-free execution time overhead of using ReVive is only 6.3%, while the achieved availability is better than 99.999% even when errors occur as often as once per day.
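The interplay of memory-based logging and distributed parity described above can be sketched in a few lines of Python. This is an illustrative toy only (all class and function names are ours, and the real mechanism is implemented in the directory-controller hardware): each write logs the pre-image of a location once per checkpoint interval and keeps an XOR parity word consistent across nodes, so a lost node's memory can be rebuilt from the survivors.

```python
class Node:
    """Toy node with memory-based undo logging and parity maintenance."""

    def __init__(self, size):
        self.mem = [0] * size
        self.log = {}                      # addr -> value at the last checkpoint

    def write(self, addr, value, parity):
        if addr not in self.log:           # log the pre-image on first overwrite
            self.log[addr] = self.mem[addr]
        parity[addr] ^= self.mem[addr] ^ value   # keep distributed parity consistent
        self.mem[addr] = value

    def checkpoint(self):
        self.log.clear()                   # memory itself becomes the checkpoint

    def rollback(self, parity):
        # Restore the pre-images and re-adjust parity to match.
        for addr, old in self.log.items():
            parity[addr] ^= self.mem[addr] ^ old
            self.mem[addr] = old
        self.log.clear()


def rebuild_lost_node(parity, survivors):
    """Recover a lost node's memory: XOR parity with all surviving memories."""
    rebuilt = list(parity)
    for node in survivors:
        for addr in range(len(parity)):
            rebuilt[addr] ^= node.mem[addr]
    return rebuilt
```

For example, after two nodes write to the same address, XOR-ing the parity with the surviving nodes' memories reproduces the lost node's contents, and a rollback restores both memory and parity to the checkpointed state.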
A method of generating rollback points that takes program structure into account is proposed. It is shown that the variables needed for program rollback are those that vary in a specific program implementation. A new method of placing rollback points is introduced and compared with traditional approaches.
Algorithms for recovering computations after hardware faults are considered. A modified linear recovery algorithm is proposed that finds the latest undamaged recovery point. The uniqueness of the recovery sequence is proven. The problem is solved without any restriction on the period between the appearance of a hardware fault and its manifestation.
The number of processors in high-performance computers is constantly growing, so their mean time to failure is decreasing significantly compared with the execution time of the message-passing parallel applications that run on them. When a compute node fails, the message-passing parallel application stops (fail-stop) and must be restarted, so all the information already processed is lost. Rollback recovery strategies are among the most widely used to protect the processed information. We propose the design and implementation of a fault-tolerance middleware that provides message-logging functionality transparently to the application and integrates with the RADIC architecture to work with message-passing parallel applications (MPI). The middleware is based on a hybrid message log: it logs both the messages sent and the messages received within each process and performs the necessary protection and recovery management, so that if a failure occurs, only the failed processes need to be re-executed, with their messages ready to be delivered to the application without involving the rest of the application's processes. The interaction of the MPI application with the implemented hybrid message log has been validated, and it has been verified that it works correctly and transparently in both protection and recovery.
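The core idea of a hybrid message log, where each process keeps local logs of both sent and received messages so a restarted process can replay its traffic without involving its peers, can be sketched as follows. This is a minimal illustration with invented names, not the RADIC middleware itself (which intercepts MPI calls transparently):

```python
class LoggedProcess:
    """Toy process that logs the messages it sends and receives."""

    def __init__(self, rank):
        self.rank = rank
        self.sent_log = []         # (dest_rank, payload), kept by the sender
        self.recv_log = []         # payloads in delivery order, kept locally
        self.inbox = []

    def send(self, dest, payload):
        self.sent_log.append((dest.rank, payload))   # log before sending
        dest.receive(payload)

    def receive(self, payload):
        self.recv_log.append(payload)                # log on delivery
        self.inbox.append(payload)

    def restart_from_log(self, saved_recv_log):
        # Re-execution after a fail-stop: logged messages are redelivered
        # in their original order, with no help from the other processes.
        self.recv_log, self.inbox = [], []
        for payload in saved_recv_log:
            self.receive(payload)
```

A failed process restarted via restart_from_log sees exactly the message sequence it saw before the failure, which is what allows recovery to proceed without rolling back or interrupting the other processes.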
Checkpointing with rollback recovery is a well-known method for achieving fault tolerance in distributed systems. In this work, we introduce algorithms for checkpointing and rollback recovery on asynchronous unidirectional and bidirectional ring networks. The proposed checkpointing algorithms can handle multiple concurrent initiations by different processes. While taking checkpoints, processes do not have to take any application message dependency into consideration; synchronization is achieved by passing control messages among the processes. Application messages are acknowledged, and each process maintains a list of unacknowledged messages. Here we use a logical checkpoint, which is a standard checkpoint (i.e., a snapshot of the process) plus a list of messages that have been sent by this process but are unacknowledged at the time of taking the checkpoint. The worst-case message complexity of the proposed checkpointing algorithm is O(kn) when k initiators initiate concurrently. The time complexity is O(n). For the recovery algorithm, the time and message complexities are both O(n).
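The notion of a logical checkpoint, a state snapshot plus the list of unacknowledged sent messages, can be illustrated with a short sketch (names and structure are ours, not the paper's actual data structures):

```python
import copy

class RingProcess:
    """Toy process that takes logical checkpoints."""

    def __init__(self, state):
        self.state = state
        self.unacked = {}          # msg_id -> payload, sent but not yet acked

    def send(self, msg_id, payload):
        self.unacked[msg_id] = payload   # tracked until acknowledged
        return payload                    # (transport layer omitted)

    def on_ack(self, msg_id):
        self.unacked.pop(msg_id, None)

    def logical_checkpoint(self):
        # Standard snapshot plus the unacknowledged-message list, so those
        # messages can be resent after a rollback instead of being lost.
        return (copy.deepcopy(self.state), dict(self.unacked))

    def rollback(self, checkpoint):
        saved_state, saved_unacked = checkpoint
        self.state = copy.deepcopy(saved_state)
        self.unacked = dict(saved_unacked)
```

After a rollback, the restored unacked list tells the process exactly which in-flight messages it must resend, which is why no application message dependencies need to be tracked at checkpoint time.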
This paper presents Sonora, a platform for mobile-cloud computing. Sonora is designed to support the development and execution of continuous mobile-cloud services. To this end, Sonora provides developers with stream-based programming interfaces that coherently integrate a broad range of existing techniques from mobile, database, and distributed systems, ranging from support for disconnected operation to relational and event-driven models. Sonora's execution engine is a fault-tolerant distributed runtime that supports user-facing continuous sensing and processing services in the cloud. Key features of this engine are its dynamic load-balancing mechanisms and a novel failure recovery protocol that performs checkpoint-based partial rollback recovery with selective re-execution. To illustrate the relevance and power of the stream abstraction in describing complex mobile-cloud services, we evaluate Sonora's design in the context of two services. We also validate Sonora's ...
A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving that status information. This paper surveys the algorithms that have been reported in the literature for checkpointing parallel/distributed systems.
If the variables used by a checkpointing algorithm have data faults, existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing data fault detection and correction, checkpointing, and recovery algorithms are proposed for a ring topology. The proposed data fault detection and correction algorithms can handle at most one data fault per process, but in any number of processes. The proposed checkpointing algorithm can deal with concurrent multiple initiations of checkpointing in the presence of data faults. A process can recover from a fault using the proposed recovery algorithm despite multiple data faults being present in the system. All the proposed algorithms converge in O(n) steps, where n is the number of processes, and can be extended to work for general topologies.
Recovery from transient failures is one of the prime issues in the context of distributed systems. These systems demand transparent yet efficient techniques to achieve it. A checkpoint is defined as a designated place in a program where normal processing of a system is interrupted to preserve the status information; checkpointing is the process of saving that status information. Mobile computing systems often suffer from high failure rates that are transient and independent in nature. To add reliability and high availability to such distributed systems, checkpoint-based rollback recovery is one of the most widely used techniques for applications such as scientific computing, databases, telecommunications, and mission-critical applications. This paper surveys the algorithms that have been reported in the literature for checkpointing in mobile computing systems.
A major concern in implementing a checkpoint-based recovery protocol for distributed systems is the performance degradation resulting from process rollbacks. In critical systems, it is highly desirable to contain the rollback distance as well as the number of processes involved in the rollback so that timely recovery is possible. One popular approach to accomplishing these goals is to control the communication of messages, which are the main cause of error propagation. In this paper, we show that watchdog-processor-based concurrent error detection can be merged with recovery so that quick recovery from errors is possible without restricting communication. The low cost and low latency of an m-out-of-n code-based error detection scheme are exploited to develop a novel message validation technique that helps curtail excessive rollback during recovery. A simulation analysis is conducted to demonstrate the benefits of combining detection and recovery, an approach ...
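An m-out-of-n code accepts a word as valid only if exactly m of its n bits are set, so any unidirectional error changes the bit count and is caught before the message is delivered. The sketch below illustrates the idea with 2-out-of-4 as an illustrative parameter choice (not necessarily the paper's); the validation gate shows how dropping an invalid message stops the error from propagating, so the receiver need not roll back:

```python
def is_valid_codeword(word, m=2, n=4):
    """A word is a valid m-out-of-n codeword iff exactly m of its n bits are set."""
    if word >> n:
        return False                      # stray bits outside the n-bit word
    return bin(word).count("1") == m


def deliver_if_valid(inbox, word):
    # Validate before delivery: an invalid (corrupted) message is dropped,
    # so the error never reaches the receiving process.
    if is_valid_codeword(word):
        inbox.append(word)
        return True
    return False
```

Because a single bit flip in either direction changes the number of set bits, every single-bit error is rejected by this check at the cost of one population count per message.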
Problems related to distributed systems fault tolerance are tackled by providing efficient and fault-tolerant procedures for checkpointing and rollback recovery in such systems. The authors propose checkpointing algorithms that can be initiated by any process in the system, or upon failure of one or more component processes as part of a backward recovery procedure. The algorithms return the most recent consistent checkpoints, require less stable storage, and do not interfere with the progress of the distributed application. Obtaining a consistent checkpoint is always guaranteed. Examples illustrating these algorithms are also provided.
In this paper we consider two software-based control-flow error (CFE) recovery methods with a rollback recovery mechanism for use in multithreaded architectures. Previous CFE recovery techniques disregard interactions between threads, which makes them unsuitable for multithreaded architectures. Furthermore, their high memory and performance overheads may be problematic for real-time embedded systems with tight memory and performance budgets. Therefore, given the importance of handling CFEs, the unsuitability of conventional techniques for modern processors, and the high overheads of previous CFE recovery techniques, two low-cost control-flow error recovery techniques, CFE Recovery using Data-flow graph Consideration (CRDC) and CFE Recovery using Macro block-level Checkpointing (CRMC), are presented in this paper. The proposed recovery techniques comprise two phases: control-flow error detection and control-flow error recovery. These phases are achieved by inserting additional instructions into the program at compile time according to a dependency graph. This graph models the control-flow and data dependencies among basic blocks and the interactions between the threads of a program. To evaluate the proposed techniques, five multithreaded benchmarks (Quick Sort, Matrix Multiplication, Bubble Sort, Linked List, and Fast Fourier Transform) were run on a multi-core processor, and a total of 5000 transient faults were injected at several executable points of each program. The fault injection experiments show that the proposed techniques achieve noticeable error recovery coverage with tolerable performance and memory overheads.
Checkpointing and rollback recovery is a very effective technique for tolerating transient faults and preventive shutdowns. In the past, most of the checkpointing schemes published in the literature were meant to be transparent to the application programmer and implemented at the operating-system level. In recent years, there has been some work on higher-level forms of checkpointing, in which the user is responsible for checkpoint placement and is required to specify the checkpoint contents. We compare the two approaches, system-level and user-defined checkpointing, discuss the pros and cons of each, and present an experimental study conducted on a commercial parallel machine.
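The user-defined approach can be made concrete with a small sketch (all names are ours, not from the paper's system): the programmer decides both where to checkpoint and which variables to save, so the checkpoint can be far smaller than the full memory image a system-level scheme would capture.

```python
import os
import pickle

def save_checkpoint(path, **variables):
    """User-defined checkpoint: persist only the named variables."""
    with open(path, "wb") as f:
        pickle.dump(variables, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def run(path, steps=10, every=3):
    # Long-running loop: only `i` and `acc` are needed to resume, so only
    # they are checkpointed, at placement points the programmer chose.
    i, acc = 0, 0
    if os.path.exists(path):
        cp = load_checkpoint(path)        # resume after a failure
        i, acc = cp["i"], cp["acc"]
    while i < steps:
        acc += i
        i += 1
        if i % every == 0:
            save_checkpoint(path, i=i, acc=acc)
    return acc
```

Calling run a second time with the same path resumes from the last saved state and reaches the same result, which is the essence of the user-defined approach: small checkpoints at well-chosen program points, at the price of extra programmer effort.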
If the variables used for a checkpointing algorithm have data faults, the algorithm may fail. In this paper, a self-stabilizing checkpointing algorithm is proposed for handling data faults in a ring network. The proposed algorithm can deal with concurrent initiations of checkpointing and at most one data fault per process. However, several processes may be faulty.