- research-article, May 2024
An Asynchronous Scheme for Rollback Recovery in Message-Passing Concurrent Programming Languages
SAC '24: Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, Pages 1132–1139, https://doi.org/10.1145/3605098.3636051
Rollback recovery strategies are well-known in concurrent and distributed systems. In this context, recovering from unexpected failures is even more relevant given the non-deterministic nature of execution, which means that it is practically impossible ...
- Article, January 2024
From Reversible Computation to Checkpoint-Based Rollback Recovery for Message-Passing Concurrent Programs
Abstract: The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing involves processes to take snapshots of their ...
- research-article, May 2016
New-Sum: A Novel Online ABFT Scheme For General Iterative Methods
- Dingwen Tao,
- Shuaiwen Leon Song,
- Sriram Krishnamoorthy,
- Panruo Wu,
- Xin Liang,
- Eddy Z. Zhang,
- Darren Kerbyson,
- Zizhong Chen
HPDC '16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, Pages 43–55, https://doi.org/10.1145/2907294.2907306
Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (...
- research-article, June 2015
High-level synthesis of error detecting cores through low-cost modulo-3 shadow datapaths
DAC '15: Proceedings of the 52nd Annual Design Automation Conference, Article No.: 161, Pages 1–6, https://doi.org/10.1145/2744769.2744851
In this study, we propose a low-cost approach to error detection for arithmetic orientated data paths by performing lightweight shadow computations in modulo-3 space for each main computation. By leveraging the binding and scheduling flexibility of high-...
- Article, December 2014
N Fault-Tolerant Sender-Based Message Logging for Group Communication-Based Message Passing Systems
CSE '14: Proceedings of the 2014 IEEE 17th International Conference on Computational Science and Engineering, Pages 1296–1301, https://doi.org/10.1109/CSE.2014.248
All the existing SBML protocols have the limitations that they cannot tolerate concurrent failures in common. In this paper, we identify the exact reasons why they unavoidably have their incapability with the assumption of reliable FIFO unicast-only ...
- Article, December 2013
A Transparent Rollback Recovery Algorithm for Virtual Machines Using Quasi-synchronous Checkpoint
CLOUDCOM-ASIA '13: Proceedings of the 2013 International Conference on Cloud Computing and Big Data, Pages 260–266, https://doi.org/10.1109/CLOUDCOM-ASIA.2013.31
Considering the characteristic of the virtual machines applied on cloud platform, a quasi-synchronous checkpoint algorithm with selective message logging for virtual Machine is presented. This algorithm keeps the inherent optimized checkpoint interval ...
- article, December 2013
A fully informed model-based checkpointing protocol for preventing useless checkpoints
International Journal of Parallel, Emergent and Distributed Systems (IJPEDS), Volume 28, Issue 6, Pages 485–518, https://doi.org/10.1080/17445760.2012.736508
Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, ...
- Article, September 2012
Comparing checkpoint and rollback recovery schemes in a cluster system
ICA3PP'12: Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I, Pages 531–545, https://doi.org/10.1007/978-3-642-33078-0_38
Cluster systems play a central role to realize high performance computing with relatively low cost, and at the same time are necessary the fault-tolerance features for the practical use. In this paper we develop stochastic models to evaluate the ...
- Article, July 2011
A New Approach for a Fault Tolerant Mobile Agent System
SNPD '11: Proceedings of the 2011 12th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, Pages 133–138, https://doi.org/10.1109/SNPD.2011.44
Improving the survivability of mobile agents in the presence of agent server failures with unreliable underlying networks is a challenging issue. In this paper, we address a fault tolerance approach of deploying cooperating agents to detect agent ...
- article, July 2011
A New Diskless Checkpointing Approach for Multiple Processor Failures
IEEE Transactions on Dependable and Secure Computing (TDSC), Volume 8, Issue 4, Pages 481–493, https://doi.org/10.1109/TDSC.2010.76
Diskless checkpointing is an important technique for performing fault tolerance in distributed or parallel computing systems. This study proposes a new approach to enhance neighbor-based diskless checkpointing to tolerate multiple failures using simple ...
- research-article, June 2011
Algorithm-based recovery for iterative methods without checkpointing
HPDC '11: Proceedings of the 20th international symposium on High performance distributed computing, Pages 73–84, https://doi.org/10.1145/1996130.1996142
In today's high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied to a wide range of applications, it often introduces a considerable ...
- article, April 2009
Trading off logging overhead and coordinating overhead to achieve efficient rollback recovery
In the rollback recovery of large-scale long-running applications in a distributed environment, pessimistic message logging protocols enable failed processes to recover independently, though at the expense of logging every message synchronously during ...
- Article, December 2008
An Efficient Checkpointing and Rollback Recovery Scheme for Cluster-Based Multi-channel Ad Hoc Wireless Networks
ISPA '08: Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications, Pages 371–378, https://doi.org/10.1109/ISPA.2008.35
Compared to the wired distributed computing system, cluster-based ad-hoc wireless networks have certain new characteristics. The transient failure probability of the computing process increases greatly with the enlarging of system scale. If a failure ...
- Article, July 2008
Checkpointing and rollback recovery in distributed systems: existing solutions, open issues and proposed solutions
Checkpointing and rollback recovery are well-established techniques for dealing with failures in distributed systems. In this paper, we briefly summarize the existing solution approaches to these problems and also discuss the open issues, suggested ...
- Article, July 2008
Communication Aware Recovery Configurations for Networks-on-Chip
IOLTS '08: Proceedings of the 2008 14th IEEE International On-Line Testing Symposium, Pages 201–206, https://doi.org/10.1109/IOLTS.2008.44
In this paper we propose a set of different configurations of failure recovery schemes, developed for Network-on-Chip (NoC) based systems. These configurations exploit the fact that communication in NoCs tends to be partitioned and eventually localized. ...
- Article, June 2008
Novel log management for sender-based message logging
Among message logging approaches, volatile logging by sender processes considerably alleviates the normal operation overhead of synchronous logging on stable storage. But, this approach forces each process to maintain log information of its sent ...
- article, June 2008
Lightweight log management algorithm for removing logged messages of sender processes with little overhead
Sender-based message logging allows each message to be logged in the volatile storage of its corresponding sender. This behavior avoids logging messages on the stable storage synchronously and results in lower failure-free overhead than receiver-based ...
- Article, January 2008
Performance Analysis of Rollback Recovery Schemes for the Mobile Computing Environment
ICICSE '08: Proceedings of the 2008 International Conference on Internet Computing in Science and Engineering, Pages 436–443, https://doi.org/10.1109/ICICSE.2008.8
Different from the wired distributed system, the mobile computing system has certain new characteristics, which impact on its checkpointing-rollback recovery schemes. The performance of these schemes primarily depends on the heterogeneous processing ...
- Article, March 2023
An Efficient Handoff Strategy for Mobile Computing Checkpoint System
Abstract: The Eager, Lazy and Movement-based strategies are used in mobile computing system when handoff. They result in performance loss while moving the whole checkpoint on fault-free or slow recovery while not moving any checkpoint until recovery. In the ...
- Article, December 2007
An efficient handoff strategy for mobile computing checkpoint system
EUC'07: Proceedings of the 2007 international conference on Embedded and ubiquitous computing, Pages 410–421
The Eager, Lazy and Movement-based strategies are used in mobile computing system when handoff. They result in performance loss while moving the whole checkpoint on fault-free or slow recovery while not moving any checkpoint until recovery. In the paper,...