Keyword: rollback recovery : Search

research-article

An Asynchronous Scheme for Rollback Recovery in Message-Passing Concurrent Programming Languages

Rollback recovery strategies are well-known in concurrent and distributed systems. In this context, recovering from unexpected failures is even more relevant given the non-deterministic nature of execution, which means that it is practically impossible ...

Article

From Reversible Computation to Checkpoint-Based Rollback Recovery for Message-Passing Concurrent Programs

Germán Vidal

Formal Aspects of Component Software, Pages 103–123, https://doi.org/10.1007/978-3-031-52183-6_6

Abstract

The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing involves processes to take snapshots of their ...

research-article

New-Sum: A Novel Online ABFT Scheme For General Iterative Methods

HPDC '16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, Pages 43–55, https://doi.org/10.1145/2907294.2907306

Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (...

research-article

High-level synthesis of error detecting cores through low-cost modulo-3 shadow datapaths

DAC '15: Proceedings of the 52nd Annual Design Automation Conference, Article No.: 161, Pages 1–6, https://doi.org/10.1145/2744769.2744851

In this study, we propose a low-cost approach to error detection for arithmetic orientated data paths by performing lightweight shadow computations in modulo-3 space for each main computation. By leveraging the binding and scheduling flexibility of high-...

Article

N Fault-Tolerant Sender-Based Message Logging for Group Communication-Based Message Passing Systems

Jinho Ahn

CSE '14: Proceedings of the 2014 IEEE 17th International Conference on Computational Science and Engineering, Pages 1296–1301, https://doi.org/10.1109/CSE.2014.248

All the existing SBML protocols have the limitations that they cannot tolerate concurrent failures in common. In this paper, we identify the exact reasons why they unavoidably have their incapability with the assumption of reliable FIFO unicast-only ...

Article

A Transparent Rollback Recovery Algorithm for Virtual Machines Using Quasi-synchronous Checkpoint

CLOUDCOM-ASIA '13: Proceedings of the 2013 International Conference on Cloud Computing and Big Data, Pages 260–266, https://doi.org/10.1109/CLOUDCOM-ASIA.2013.31

Considering the characteristic of the virtual machines applied on cloud platform, a quasi-synchronous checkpoint algorithm with selective message logging for virtual Machine is presented. This algorithm keeps the inherent optimized checkpoint interval ...

article

A fully informed model-based checkpointing protocol for preventing useless checkpoints

International Journal of Parallel, Emergent and Distributed Systems (IJPEDS), Volume 28, Issue 6, Pages 485–518, https://doi.org/10.1080/17445760.2012.736508

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, ...

Article

Comparing checkpoint and rollback recovery schemes in a cluster system

ICA3PP'12: Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I, Pages 531–545, https://doi.org/10.1007/978-3-642-33078-0_38

Cluster systems play a central role to realize high performance computing with relatively low cost, and at the same time are necessary the fault-tolerance features for the practical use. In this paper we develop stochastic models to evaluate the ...

Article

A New Approach for a Fault Tolerant Mobile Agent System

SNPD '11: Proceedings of the 2011 12th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, Pages 133–138, https://doi.org/10.1109/SNPD.2011.44

Improving the survivability of mobile agents in the presence of agent server failures with unreliable underlying networks is a challenging issue. In this paper, we address a fault tolerance approach of deploying cooperating agents to detect agent ...

article

A New Diskless Checkpointing Approach for Multiple Processor Failures

IEEE Transactions on Dependable and Secure Computing (TDSC), Volume 8, Issue 4, Pages 481–493, https://doi.org/10.1109/TDSC.2010.76

Diskless checkpointing is an important technique for performing fault tolerance in distributed or parallel computing systems. This study proposes a new approach to enhance neighbor-based diskless checkpointing to tolerate multiple failures using simple ...

research-article

Algorithm-based recovery for iterative methods without checkpointing

Zizhong Chen

HPDC '11: Proceedings of the 20th international symposium on High performance distributed computing, Pages 73–84, https://doi.org/10.1145/1996130.1996142

In today's high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied to a wide range of applications, it often introduces a considerable ...

article

Trading off logging overhead and coordinating overhead to achieve efficient rollback recovery

Concurrency and Computation: Practice & Experience (CCOMP), Volume 21, Issue 6, Pages 819–853

In the rollback recovery of large-scale long-running applications in a distributed environment, pessimistic message logging protocols enable failed processes to recover independently, though at the expense of logging every message synchronously during ...

Article

An Efficient Checkpointing and Rollback Recovery Scheme for Cluster-Based Multi-channel Ad Hoc Wireless Networks

ISPA '08: Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications, Pages 371–378, https://doi.org/10.1109/ISPA.2008.35

Compared to the wired distributed computing system, cluster-based ad-hoc wireless networks have certain new characteristics. The transient failure probability of the computing process increases greatly with the enlarging of system scale. If a failure ...

Article

Checkpointing and rollback recovery in distributed systems: existing solutions, open issues and proposed solutions

D. Manivannan

ICS'08: Proceedings of the 12th WSEAS international conference on Systems, Pages 569–574

Checkpointing and rollback recovery are well-established techniques for dealing with failures in distributed systems. In this paper, we briefly summarize the existing solution approaches to these problems and also discuss the open issues, suggested ...

Article

Communication Aware Recovery Configurations for Networks-on-Chip

IOLTS '08: Proceedings of the 2008 14th IEEE International On-Line Testing Symposium, Pages 201–206, https://doi.org/10.1109/IOLTS.2008.44

In this paper we propose a set of different configurations of failure recovery schemes, developed for Network-on-Chip (NoC) based systems. These configurations exploit the fact that communication in NoCs tends to be partitioned and eventually localized. ...

Article

Novel log management for sender-based message logging

Jinho Ahn

ICAI'08: Proceedings of the 9th WSEAS International Conference on International Conference on Automation and Information, Pages 356–361

Among message logging approaches, volatile logging by sender processes considerably alleviates the normal operation overhead of synchronous logging on stable storage. But, this approach forces each process to maintain log information of its sent ...

article

Lightweight log management algorithm for removing logged messages of sender processes with little overhead

Jinho Ahn

WSEAS Transactions on Computers (WSTOCMP), Volume 7, Issue 6, Pages 804–813

Sender-based message logging allows each message to be logged in the volatile storage of its corresponding sender. This behavior avoids logging messages on the stable storage synchronously and results in lower failure-free overhead than receiver-based ...

Article

Performance Analysis of Rollback Recovery Schemes for the Mobile Computing Environment

ICICSE '08: Proceedings of the 2008 International Conference on Internet Computing in Science and Engineering, Pages 436–443, https://doi.org/10.1109/ICICSE.2008.8

Different from the wired distributed system, the mobile computing system has certain new characteristics, which impact on its checkpointing-rollback recovery schemes. The performance of these schemes primarily depends on the heterogeneous processing ...

Article

An Efficient Handoff Strategy for Mobile Computing Checkpoint System

Embedded and Ubiquitous Computing, Pages 410–421, https://doi.org/10.1007/978-3-540-77092-3_36

Abstract

The Eager, Lazy and Movement-based strategies are used in mobile computing system when handoff. They result in performance loss while moving the whole checkpoint on fault-free or slow recovery while not moving any checkpoint until recovery. In the ...

Article

An efficient handoff strategy for mobile computing checkpoint system

EUC'07: Proceedings of the 2007 international conference on Embedded and ubiquitous computing, Pages 410–421

The Eager, Lazy and Movement-based strategies are used in mobile computing system when handoff. They result in performance loss while moving the whole checkpoint on fault-free or slow recovery while not moving any checkpoint until recovery. In the paper,...
