Research article
DOI: 10.1145/2463209.2488859

Quantitative evaluation of soft error injection techniques for robust system design

Published: 29 May 2013

Abstract

Choosing the correct error injection technique is of primary importance in the simulation-based design and evaluation of robust systems that are resilient to soft errors. Low-level (e.g., flip-flop-level) error injection techniques are generally restricted to small systems because of their long execution times and significant memory requirements. High-level error injection at the architecture or memory level is generally fast but can be inaccurate. Unfortunately, very little research literature quantifies the inaccuracies associated with high-level error injection techniques. In this paper, we use simulation and emulation results to understand the accuracy trade-offs associated with a variety of high-level error injection techniques. A detailed analysis of error propagation explains the causes of the high degree of inaccuracy associated with error injection techniques at higher levels of abstraction.
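As a rough illustration of what a high-level (architecture-register-level) injection can look like, the Python sketch below flips a single bit in a software-visible register value and compares the result against an error-free run. It is not the paper's methodology: `flip_bit`, `toy_program`, and `inject_and_classify` are hypothetical names, and the workload is a made-up stand-in. A flip-flop-level campaign would instead target state inside the hardware design, which this sketch does not model.

```python
def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with one bit inverted, truncated to `width` bits."""
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


def toy_program(regs):
    """Hypothetical workload: add r0 and r1, keep only the low byte."""
    return (regs[0] + regs[1]) & 0xFF


def inject_and_classify(regs, reg_index, bit, width=32):
    """Compare an error-free run with a run after a single-bit register flip."""
    golden = toy_program(regs)
    faulty_regs = list(regs)  # copy so the caller's registers stay untouched
    faulty_regs[reg_index] = flip_bit(faulty_regs[reg_index], bit, width)
    faulty = toy_program(faulty_regs)
    return "masked" if faulty == golden else "output mismatch"
```

In this toy setting, a flip in a bit that the workload later discards (for example bit 16 of `r0`) is classified as masked, while a flip in bit 0 changes the output. Whether such register-level outcomes match those of a lower-level injection is exactly the accuracy question the abstract raises.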

        Published In

        DAC '13: Proceedings of the 50th Annual Design Automation Conference
        May 2013
        1285 pages
        ISBN:9781450320719
        DOI:10.1145/2463209

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

        Cited By

        • (2024) Assessing the Impact of Compiler Optimizations on GPUs Reliability. ACM Transactions on Architecture and Code Optimization 21:2 (1-22). DOI: 10.1145/3638249. Online publication date: 12-Jan-2024
        • (2024) Multiscale, Multiphysics Modeling and Simulation of Single-Event Effects in Digital Electronics: From Particles to Systems. IEEE Transactions on Nuclear Science 71:1 (31-66). DOI: 10.1109/TNS.2023.3337288. Online publication date: Jan-2024
        • (2024) Probing Weaknesses in GPU Reliability Assessment: A Cross-Layer Approach. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (331-333). DOI: 10.1109/ISPASS61541.2024.00048. Online publication date: 5-May-2024
        • (2024) Gem5-MARVEL: Microarchitecture-Level Resilience Analysis of Heterogeneous SoC Architectures. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (543-559). DOI: 10.1109/HPCA57654.2024.00047. Online publication date: 2-Mar-2024
        • (2023) vRTLmod. Proceedings of the 20th ACM International Conference on Computing Frontiers (387-388). DOI: 10.1145/3587135.3591435. Online publication date: 9-May-2023
        • (2023) Understanding and Mitigating Hardware Failures in Deep Learning Training Systems. Proceedings of the 50th Annual International Symposium on Computer Architecture (1-16). DOI: 10.1145/3579371.3589105. Online publication date: 17-Jun-2023
        • (2023) HPC Hardware Design Reliability Benchmarking With HDFIT. IEEE Transactions on Parallel and Distributed Systems 34:3 (995-1006). DOI: 10.1109/TPDS.2023.3237777. Online publication date: 1-Mar-2023
        • (2023) Characterizing a Neutron-Induced Fault Model for Deep Neural Networks. IEEE Transactions on Nuclear Science 70:4 (370-380). DOI: 10.1109/TNS.2022.3224538. Online publication date: Apr-2023
        • (2023) Anatomy of On-Chip Memory Hardware Fault Effects Across the Layers. IEEE Transactions on Emerging Topics in Computing 11:2 (420-431). DOI: 10.1109/TETC.2022.3205808. Online publication date: 1-Apr-2023
        • (2023) Silent Data Corruptions: Microarchitectural Perspectives. IEEE Transactions on Computers 72:11 (3072-3085). DOI: 10.1109/TC.2023.3285094. Online publication date: Nov-2023
