research-article

A light-weight cache-based fault detection and checkpointing scheme for MPSoCs enabling relaxed execution synchronization

Authors:

Alex Orailoglu

CASES '08: Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems

Pages 11 - 20

https://doi.org/10.1145/1450095.1450100

Published: 19 October 2008

Abstract

While technology advances have made MPSoCs a standard architecture for embedded systems, their applicability is increasingly challenged by a dramatic increase in the number of device failures that may occur during execution. Conventional fault tolerance techniques employ a duplication-and-comparison strategy to detect arbitrary execution faults, together with a checkpointing-and-rollback strategy to recover from the faulty state. Comparison and checkpointing are performed either at task level, which imposes a large overhead in verifying and backing up memory pages, or at instruction level, which necessitates a lock-step execution model that significantly limits the attainable performance. To overcome the shortcomings of both strategies, in this paper we propose a cache-based fault tolerance scheme wherein comparison and checkpointing are performed at the cache-memory interface. By allowing two processors that execute duplicated tasks to share a single data cache, the proposed scheme verifies execution results before writing them back into memory, thus protecting the memory from being polluted by execution faults. This in turn significantly reduces the checkpointing overhead. Meanwhile, since only the data written into memory are compared, the strict instruction-by-instruction synchronization model used in multithreading processors can be relaxed. The simulation results confirm that the proposed scheme imposes a performance overhead of only 1.4% to 10.4%, while effectively attaining both fault detection and execution checkpointing.
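
The idea of comparing results at the cache-memory interface can be illustrated with a small toy model. The sketch below is not the paper's hardware design; the class name `SharedVerifyingCache`, its methods, and the one-value-per-address buffering are all assumptions made for illustration. It shows only the core principle stated in the abstract: a value reaches memory only after both redundant cores have produced the same value, so memory always holds verified state and can serve as the rollback point.

```python
class SharedVerifyingCache:
    """Toy model (hypothetical): two redundant cores share one write buffer."""

    def __init__(self, memory):
        self.memory = memory   # verified state; doubles as the checkpoint
        self.pending = {}      # addr -> [value from core 0, value from core 1]

    def write(self, core, addr, value):
        # Each core records its own copy; the cores need not run in lock-step.
        self.pending.setdefault(addr, [None, None])[core] = value

    def write_back(self, addr):
        a, b = self.pending[addr]
        if a is None or b is None:
            return "wait"      # the slower core has not written this address yet
        if a != b:
            # Mismatch: drop all unverified data; memory is left untouched.
            self.pending.clear()
            return "fault"
        self.memory[addr] = a  # both copies agree, so commit to memory
        del self.pending[addr]
        return "committed"


memory = {0x10: 0}
cache = SharedVerifyingCache(memory)
cache.write(0, 0x10, 7)        # core 0 runs ahead of core 1
cache.write(1, 0x10, 7)        # core 1 catches up with the same result
cache.write_back(0x10)         # values match, so memory[0x10] is updated
```

Because the comparison happens only when data leaves the cache, the two cores may drift apart between write-backs, which is the relaxation of instruction-by-instruction synchronization that the abstract describes. A real design would also have to handle repeated writes to the same address and cache capacity limits, which this sketch ignores.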

Index Terms

1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks

Published In

CASES '08: Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems

October 2008

274 pages

ISBN:9781605584690

DOI:10.1145/1450095

Program Chair:
Erik Altman
IBM

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ESWEEK 08: Fourth Embedded Systems Week

October 19 - 24, 2008

Atlanta, GA, USA

Acceptance Rates

Overall Acceptance Rate 52 of 230 submissions, 23%
