DOI: 10.1145/1450095.1450100
research-article

A light-weight cache-based fault detection and checkpointing scheme for MPSoCs enabling relaxed execution synchronization

Published: 19 October 2008

Abstract

While technology advances have made MPSoCs a standard architecture for embedded systems, their applicability is increasingly challenged by a dramatic rise in the number of device failures that may occur during execution. Conventional fault tolerance techniques employ a duplication-and-comparison strategy to detect arbitrary execution faults and a checkpointing-and-rollback strategy to recover from the faulty state. Comparison and checkpointing are performed either at the task level, which imposes substantial overhead in verifying and backing up memory pages, or at the instruction level, which requires a lock-step execution model that significantly limits attainable performance. To overcome the shortcomings of both approaches, in this paper we propose a cache-based fault tolerance scheme in which comparison and checkpointing are performed at the cache-memory interface. By allowing two processors that execute duplicated tasks to share a single data cache, the proposed scheme verifies execution results before writing them back to memory, thus protecting memory from being polluted by execution faults. This in turn significantly reduces the checkpointing overhead. Moreover, since only the data written to memory are compared, the strict instruction-by-instruction synchronization model used in multithreading processors can be relaxed. Simulation results confirm that the proposed scheme imposes a performance overhead of only 1.4% to 10.4% while effectively achieving both fault detection and execution checkpointing.
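The idea of comparing at the cache-memory interface can be illustrated with a small software model. The sketch below is only a toy under stated assumptions, not the hardware design from the paper: the class `SharedCheckedCache`, its methods, and the dictionary-based memory are all hypothetical names introduced for illustration. It shows a store being committed to memory only after both duplicated tasks have produced the same value, with memory itself serving as the checkpoint on rollback.

```python
class FaultDetected(Exception):
    """Raised when the two duplicated tasks disagree on a value to write back."""


class SharedCheckedCache:
    """Toy model (hypothetical): two cores run the same task and share one data cache.

    A store reaches memory only after both cores have produced the same
    value for that address, so memory always holds verified data.
    """

    def __init__(self, memory):
        self.memory = memory            # verified state, doubles as the checkpoint
        self.pending = {0: {}, 1: {}}   # per-core unverified stores: addr -> value

    def store(self, core, addr, value):
        # Buffer the store in the cache; nothing reaches memory yet.
        self.pending[core][addr] = value

    def load(self, core, addr):
        # A core sees its own unverified stores first, then verified memory.
        return self.pending[core].get(addr, self.memory.get(addr, 0))

    def write_back(self, addr):
        """Compare both cores' values for addr and commit only if they match."""
        a = self.pending[0].get(addr)
        b = self.pending[1].get(addr)
        if a is None or b is None:
            return False                # the slower core has not stored yet
        if a != b:
            raise FaultDetected(f"mismatch at {addr:#x}: {a} != {b}")
        self.memory[addr] = a
        del self.pending[0][addr]
        del self.pending[1][addr]
        return True

    def rollback(self):
        """Discard unverified stores; memory is untouched and still valid."""
        self.pending = {0: {}, 1: {}}


memory = {0x10: 5}
cache = SharedCheckedCache(memory)

cache.store(0, 0x10, 6)
early = cache.write_back(0x10)      # core 1 lags behind, so nothing is committed
cache.store(1, 0x10, 6)
committed = cache.write_back(0x10)  # values agree, memory is updated

cache.store(0, 0x20, 7)
cache.store(1, 0x20, 9)             # simulated fault on core 1
try:
    cache.write_back(0x20)
    fault_seen = False
except FaultDetected:
    fault_seen = True
    cache.rollback()                # recovery: drop unverified cache state
```

In this model the two cores never have to advance in lock-step: a write-back simply waits until both copies of a value are present. Because a mismatching value is never committed, recovery only needs to discard the unverified cache contents, which mirrors the reduced checkpointing overhead described in the abstract.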




    Published In

    CASES '08: Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
    October 2008
    274 pages
    ISBN:9781605584690
    DOI:10.1145/1450095

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. checkpointing
    2. fault detection
    3. fault recovery

    Qualifiers

    • Research-article

    Conference

    ESWEEK 08: Fourth Embedded Systems Week
    October 19 - 24, 2008
    Atlanta, GA, USA

    Acceptance Rates

    Overall Acceptance Rate 52 of 230 submissions, 23%

    Cited By

    • (2019) Comprehensive Evaluation of Program Reliability with ComFIDet: An Integrated Fault Injection and Detection Framework for Embedded Systems. 2019 IEEE International Conference on Embedded Software and Systems (ICESS), pp. 1-8. DOI: 10.1109/ICESS.2019.8782444. Online publication date: Jun-2019
    • (2017) Leveraging Compiler Optimizations to Reduce Runtime Fault Recovery Overhead. Proceedings of the 54th Annual Design Automation Conference 2017, pp. 1-6. DOI: 10.1145/3061639.3062273. Online publication date: 18-Jun-2017
    • (2016) Towards a Scalable and Write-Free Multi-version Checkpointing Scheme in Solid State Drives. 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 37-48. DOI: 10.1109/DSN.2016.13. Online publication date: Jun-2016
    • (2014) Exploiting heterogeneity in MPSoCs to prevent potential trojan propagation across malicious IPs. Proceedings of the 24th edition of the great lakes symposium on VLSI, pp. 335-340. DOI: 10.1145/2591513.2591595. Online publication date: 20-May-2014
    • (2013) Fault detection and recovery efficiency co-optimization through compile-time analysis and runtime adaptation. Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp. 1-10. DOI: 10.5555/2555729.2555751. Online publication date: 29-Sep-2013
    • (2013) Boosting efficiency of fault detection and recovery through application-specific comparison and checkpointing. ACM SIGPLAN Notices 48:5, pp. 13-20. DOI: 10.1145/2499369.2465562. Online publication date: 20-Jun-2013
    • (2013) Boosting efficiency of fault detection and recovery through application-specific comparison and checkpointing. Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems, pp. 13-20. DOI: 10.1145/2491899.2465562. Online publication date: 20-Jun-2013
    • (2013) Boosting efficiency of fault detection and recovery through application-specific comparison and checkpointing. Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems, pp. 13-20. DOI: 10.1145/2465554.2465562. Online publication date: 20-Jun-2013
    • (2013) Fault detection and recovery efficiency co-optimization through compile-time analysis and runtime adaptation. 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), pp. 1-10. DOI: 10.1109/CASES.2013.6662528. Online publication date: Sep-2013
    • (2013) Effective code discovery for ARM/Thumb mixed ISA binaries in a static binary translator. 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), pp. 1-10. DOI: 10.1109/CASES.2013.6662525. Online publication date: Sep-2013
