DOI: 10.1145/2749469.2750372
research-article

FaultHound: value-locality-based soft-fault tolerance

Published: 13 June 2015

Abstract

Soft-error susceptibility is a growing concern with continued CMOS scaling. Previous work explores full- and partial-redundancy schemes in hardware and software for soft-fault tolerance. However, full-redundancy schemes incur high performance and energy overheads, whereas partial-redundancy schemes achieve low coverage. An initial study, called Perturbation-Based Fault Screening (PBFS), explores exploiting value locality to provide hints of soft faults whenever a value falls outside its neighborhood. PBFS employs bit-mask filters to capture value neighborhoods. However, PBFS achieves low coverage; straightforwardly improving the coverage results in high false-positive rates and high performance and energy overheads. We propose FaultHound, a value-locality-based soft-fault tolerance scheme, which employs five mechanisms to address PBFS's limitations: (1) a scheme to cluster the filters via an inverted organization of the filter tables to reinforce learning and reduce the false-positive rates; (2) a learning scheme for ignoring the delinquent bit positions that raise repeated false alarms, to further reduce the false-positive rate; (3) a lightweight predecessor replay scheme instead of a full rollback to reduce the performance and energy penalty of the remaining false positives; (4) a simple scheme to distinguish rename faults, which require rollback instead of replay for recovery, from false positives to avoid unnecessary rollback penalties; and (5) a detection scheme, which avoids rollback, for the load-store queue, which is not covered by our replay. Using simulations, we show that while PBFS achieves either low coverage (30%), or high false-positive rates (8%) with high performance overheads (97%), FaultHound achieves higher coverage (75%) and lower false-positive rates (3%) with lower performance and energy overheads (10% and 25%, respectively).
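The idea of flagging a value that falls outside its neighborhood can be illustrated with a toy bit-mask filter. The sketch below is a simplified illustration under stated assumptions, not the filter design used by PBFS or FaultHound: the class name `BitMaskFilter`, the 32-bit width, and the learning rule (any bit position that has ever changed is treated as part of the neighborhood) are all hypothetical choices made for the example.

```python
class BitMaskFilter:
    """Toy value-neighborhood filter (hypothetical, for illustration only).

    It remembers the last value seen and a mask of bit positions that have
    been observed to change. A new value raises an alarm if it flips a bit
    position that has never changed before.
    """

    def __init__(self, width=32):
        self.width = width
        self.last = None      # most recent value observed
        self.changing = 0     # bit positions known to vary (the "neighborhood")

    def check(self, value):
        """Return a mask of suspicious bit positions (0 means no alarm)."""
        mask = (1 << self.width) - 1
        value &= mask
        if self.last is None:
            # First observation: nothing to compare against yet.
            self.last = value
            return 0
        diff = value ^ self.last
        # Alarm only on bits that differ AND were never seen changing.
        alarm = diff & ~self.changing & mask
        # Assumption: the filter always learns, so a repeated pattern
        # stops raising alarms (a false positive is raised at most once
        # per bit position in this toy model).
        self.changing |= diff
        self.last = value
        return alarm


f = BitMaskFilter()
first = f.check(0x1000)               # first value, no alarm
learn = f.check(0x1004)               # bit 2 flips for the first time: alarm
quiet = f.check(0x1000)               # bit 2 is now in the neighborhood
fault = f.check(0x1000 ^ (1 << 20))   # a high bit flips unexpectedly: alarm
```

In this toy model the second check is a false positive caused by legitimate program behavior, while the last check resembles a single-bit upset in a normally stable bit position. The trade-off the abstract describes shows up directly: a filter that learns quickly raises fewer false alarms but also admits more faulty values into its neighborhood.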

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015
768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469

Publisher

Association for Computing Machinery
New York, NY, United States

Funding Sources

  • National Science Foundation

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%
