DOI: 10.1145/1736020.1736063
research-article

Shoestring: probabilistic soft error reliability on the cheap

Published: 13 March 2010

Abstract

Aggressive technology scaling provides designers with an ever increasing budget of cheaper and faster transistors. Unfortunately, this trend is accompanied by a decline in individual device reliability as transistors become increasingly susceptible to soft errors. We are quickly approaching a new era where resilience to soft errors is no longer a luxury that can be reserved for just processors in high-reliability, mission-critical domains. Even processors used in mainstream computing will soon require protection. However, due to tighter profit margins, reliable operation for these devices must come at little or no cost. This paper presents Shoestring, a minimally invasive software solution that provides high soft error coverage with very little overhead, enabling its deployment even in commodity processors with "shoestring" reliability budgets. Leveraging intelligent analysis at compile time, and exploiting low-cost, symptom-based error detection, Shoestring is able to focus its efforts on protecting statistically-vulnerable portions of program code. Shoestring effectively applies instruction duplication to protect only those segments of code that, when subjected to a soft error, are likely to result in user-visible faults without first exhibiting symptomatic behavior. Shoestring is able to recover from an additional 33.9% of soft errors that are undetected by a symptom-only approach, achieving an overall user-visible failure rate of 1.6%. This reliability improvement comes at a modest performance overhead of 15.8%.
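The idea of duplicating only the instructions that symptom-based detection is unlikely to cover can be illustrated with a toy example. The sketch below is not the paper's analysis: the `Instr` type, the `SYMPTOM_OPS` set, the `_dup` register naming, and the `check_eq` pseudo-instruction are all hypothetical, and the coverage rule (a result counts as covered if a later symptom-prone instruction reads it before it is overwritten) is a deliberately crude stand-in for a real compile-time vulnerability analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str          # register written ("" if the instruction writes none)
    op: str            # opcode name
    srcs: tuple = ()   # registers read


# Assumption: opcodes for which a corrupted operand tends to produce a visible
# symptom (for example, a bad address raising an exception). Hypothetical set.
SYMPTOM_OPS = {"load", "store_addr", "div"}


def is_covered_by_symptoms(index, program):
    """Return True if a later symptom-prone instruction reads this result
    before the destination register is overwritten."""
    dest = program[index].dest
    for later in program[index + 1:]:
        if dest in later.srcs and later.op in SYMPTOM_OPS:
            return True
        if later.dest == dest:
            return False  # value overwritten; no further readers
    return False


def protect(program):
    """Duplicate every instruction whose result is not covered by symptoms
    and follow it with a comparison of the original and shadow results."""
    out = []
    for i, ins in enumerate(program):
        out.append(ins)
        if ins.dest and not is_covered_by_symptoms(i, program):
            shadow = ins.dest + "_dup"
            # Simplification: the copy reuses the original source registers
            # instead of a separate shadow dataflow chain.
            out.append(Instr(shadow, ins.op, ins.srcs))
            out.append(Instr("", "check_eq", (ins.dest, shadow)))
    return out


program = [
    Instr("r1", "add", ("r0", "r0")),  # r1 feeds a load address
    Instr("r2", "load", ("r1",)),      # r2 only feeds arithmetic
    Instr("r3", "mul", ("r2", "r2")),  # r3 only feeds the output
    Instr("", "print", ("r3",)),
]
protected = protect(program)
for ins in protected:
    print(ins.op, ins.dest, ins.srcs)
```

In this toy program the `add` is left unprotected because its result is used as a load address, while the `load` and `mul` results are duplicated and checked because nothing symptom-prone consumes them. The point of the design is that the duplication cost is paid only where a fault would otherwise propagate silently to the output.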

        Published In

        ASPLOS XV: Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems
        March 2010
        422 pages
        ISBN: 9781605588391
        DOI: 10.1145/1736020
        General Chair: James C. Hoe
        Program Chair: Vikram S. Adve

        • ACM SIGARCH Computer Architecture News, Volume 38, Issue 1 (ASPLOS '10), March 2010, 399 pages. ISSN: 0163-5964. DOI: 10.1145/1735970
        • ACM SIGPLAN Notices, Volume 45, Issue 3 (ASPLOS '10), March 2010, 399 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/1735971

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. compiler analysis
        2. error detection
        3. fault injection

        Qualifiers

        • Research-article

        Conference

        ASPLOS '10

        Acceptance Rates

        ASPLOS XV Paper Acceptance Rate 32 of 181 submissions, 18%;
        Overall Acceptance Rate 535 of 2,713 submissions, 20%

        Article Metrics

        • Downloads (last 12 months): 66
        • Downloads (last 6 weeks): 4
        Reflects downloads up to 25 Feb 2025

        Cited By

        • (2025) HeterogeneousRTOS: A CPU-FPGA Real-Time OS for Fault Tolerance on COTS at Near-Zero Timing Cost. ACM Transactions on Embedded Computing Systems 24:2 (1-50). DOI: 10.1145/3712062. Online publication date: 17-Jan-2025
        • (2025) A Deep Technical Review of nZDC Fault Tolerance. Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction (104-116). DOI: 10.1145/3708493.3712688. Online publication date: 25-Feb-2025
        • (2025) FastFlip: Compositional SDC Resiliency Analysis. Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization (362-376). DOI: 10.1145/3696443.3708938. Online publication date: 1-Mar-2025
        • (2024) Enhanced Compiler Technology for Software-based Hardware Fault Detection. ACM Transactions on Design Automation of Electronic Systems 29:5 (1-23). DOI: 10.1145/3660524. Online publication date: 22-Apr-2024
        • (2023) Hybrid Modular Redundancy: Exploring Modular Redundancy Approaches in RISC-V Multi-core Computing Clusters for Reliable Processing in Space. ACM Transactions on Cyber-Physical Systems 9:1 (1-29). DOI: 10.1145/3635161. Online publication date: 30-Nov-2023
        • (2023) Demystifying and Mitigating Cross-Layer Deficiencies of Soft Error Protection in Instruction Duplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). DOI: 10.1145/3581784.3607078. Online publication date: 12-Nov-2023
        • (2022) Survey of Software-Implemented Soft Error Protection. Electronics 11:3 (456). DOI: 10.3390/electronics11030456. Online publication date: 3-Feb-2022
        • (2022) Trace-and-brace (TAB): bespoke software countermeasures against soft errors. Proceedings of the 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (73-85). DOI: 10.1145/3519941.3535070. Online publication date: 14-Jun-2022
        • (2022) An Empirical Study of the Impact of Single and Multiple Bit-Flip Errors in Programs. IEEE Transactions on Dependable and Secure Computing 19:3 (1988-2006). DOI: 10.1109/TDSC.2020.3043023. Online publication date: 1-May-2022
        • (2022) Silent Data Corruption Estimation and Mitigation Without Fault Injection. IEEE Canadian Journal of Electrical and Computer Engineering 45:3 (318-327). DOI: 10.1109/ICJECE.2022.3189043. Online publication date: Oct-2023