DOI: 10.5555/2616606.2616731
research-article

Reliability-aware exceptions: tolerating intermittent faults in microprocessor array structures

Published: 24 March 2014

Abstract

In future technology nodes, reliability is expected to become a first-order design constraint. Faults encountered in a chip can be classified into three categories: transient, intermittent, and permanent. Fault classification allows a chip to take the appropriate corrective action. Mechanisms have been proposed to distinguish transient from non-transient faults where all non-transient faults are handled as permanent. Intermittent faults induced by wearout phenomena have become the dominant reliability concern in nanoscale technology, yet there is no mechanism that provides finer classification of non-transient faults into intermittent and permanent faults. In this paper, we present a new class of exceptions called Reliability-Aware Exceptions (RAEs) which provide the ability to distinguish intermittent faults in microprocessor array structures. The RAE handlers have the ability to manipulate microprocessor array structures to recover from all three categories of faults. Using RAEs, we demonstrate that the reliability of two representative microarchitecture structures, load/store queue and reorder buffer in an out-of-order processor, is improved by average factors of 1.3 and 1.95, respectively.
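
The three-way classification described above can be illustrated with a small sketch. This is not the RAE mechanism itself, whose implementation the abstract does not describe. It is a generic threshold-based counter, assuming one hypothetical monitor per array entry: the score rises by one on each detected error and decays on each fault-free access. The names `EntryMonitor`, `decay`, `intermittent_at`, and `permanent_at`, the threshold values, and the recovery actions in `ACTIONS` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class FaultClass(Enum):
    NONE = "none"
    TRANSIENT = "transient"
    INTERMITTENT = "intermittent"
    PERMANENT = "permanent"


@dataclass
class EntryMonitor:
    """Error history for one array entry (hypothetical model, not from the paper)."""
    decay: float = 0.9            # assumed multiplier applied on a fault-free access
    intermittent_at: float = 2.0  # assumed score at which errors count as recurring
    permanent_at: float = 5.0     # assumed score at which the entry is treated as dead
    score: float = 0.0

    def observe(self, error_detected: bool) -> FaultClass:
        """Update the score for one access and classify the event."""
        if not error_detected:
            # Fault-free accesses let the score fade, so isolated errors
            # that are far apart keep being classified as transient.
            self.score *= self.decay
            return FaultClass.NONE
        self.score += 1.0
        if self.score >= self.permanent_at:
            return FaultClass.PERMANENT
        if self.score >= self.intermittent_at:
            return FaultClass.INTERMITTENT
        return FaultClass.TRANSIENT


# Illustrative recovery actions only; the real handlers are not specified here.
ACTIONS = {
    FaultClass.NONE: "continue",
    FaultClass.TRANSIENT: "flush and re-execute",
    FaultClass.INTERMITTENT: "de-configure entry temporarily",
    FaultClass.PERMANENT: "de-configure entry permanently",
}


def classify_trace(trace):
    """Classify each access in a sequence of booleans (True = error detected)."""
    monitor = EntryMonitor()
    return [monitor.observe(err) for err in trace]


if __name__ == "__main__":
    for err, cls in zip([True, False, True, True], classify_trace([True, False, True, True])):
        print(err, cls.value, "->", ACTIONS[cls])
```

With these assumed thresholds, an isolated error is reported as transient, errors that recur before the score has decayed are reported as intermittent, and a sustained run of errors is eventually reported as permanent. Separating the last two cases is what allows an entry to be disabled temporarily rather than for good.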


Cited By

  • (2019) Minotaur. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 1087-1103. DOI: 10.1145/3297858.3304050. Online publication date: 4-Apr-2019.
  • (2016) Design and evaluation of reliability-oriented task re-mapping in MPSoCs using time-series analysis of intermittent faults. Proceedings of the 2016 Conference on Design, Automation & Test in Europe, pp. 798-803. DOI: 10.5555/2971808.2971991. Online publication date: 14-Mar-2016.


          Published In

          DATE '14: Proceedings of the conference on Design, Automation & Test in Europe
          March 2014
          1959 pages
          ISBN:9783981537024

          Sponsors

          • EDAA: European Design Automation Association
          • ECSI
          • EDAC: Electronic Design Automation Consortium
          • IEEE Council on Electronic Design Automation (CEDA)
          • The Russian Academy of Sciences

          Publisher

          European Design and Automation Association

          Leuven, Belgium

          Author Tags

          1. array structure
          2. de-configuration
          3. fault injection
          4. intermittent fault

          Conference

          DATE '14
          DATE '14: Design, Automation and Test in Europe
          March 24 - 28, 2014
          Dresden, Germany

          Acceptance Rates

          Overall Acceptance Rate 518 of 1,794 submissions, 29%
