Adapting to intermittent faults in multicore systems

Published: 01 March 2008

Abstract

Future multicore processors will be more susceptible to a variety of hardware failures. In particular, intermittent faults, caused in part by manufacturing, thermal, and voltage variations, can cause bursts of frequent faults that last from several cycles to several seconds or more. Due to practical limitations of circuit techniques, cost-effective reliability will likely require the ability to temporarily suspend execution on a core during periods of intermittent faults.
We investigate three of the most obvious techniques for adapting to the dynamically changing resource availability caused by intermittent faults, and demonstrate their different system-level implications. We show that system software reconfiguration has very high overhead, that temporarily pausing execution on a faulty core can lead to cascading livelock, and that using spare cores has high fault-free cost. To remedy these and other drawbacks of the three baseline techniques, we propose using a thin hardware/firmware layer to manage an overcommitted system -- one where the OS is configured to use more virtual processors than the number of currently available physical cores. We show that this proposed technique can gracefully degrade performance during intermittent faults of various duration with low overhead, without involving system software, and without requiring spare cores.
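The idea of an overcommitted system can be illustrated with a toy model. The sketch below is not the paper's hardware/firmware mechanism; it only assumes a simple round-robin policy, and the names `run_overcommitted` and `faulty` are hypothetical. It multiplexes a fixed set of virtual processors over whichever physical cores are usable in each time slice, so a suspended core reduces every virtual processor's share instead of stalling one of them.

```python
from collections import deque


def run_overcommitted(num_vcpus, num_cores, faulty, num_slices):
    """Toy model: round-robin virtual CPUs over the usable physical cores.

    faulty is a set of (core, time_slice) pairs during which that core is
    suspended (an assumed fault schedule, not measured data).
    Returns how many time slices each virtual CPU received.
    """
    ready = deque(range(num_vcpus))      # virtual CPUs waiting for a core
    slices_run = [0] * num_vcpus
    for t in range(num_slices):
        # Cores that are not suspended during this time slice.
        usable = [c for c in range(num_cores) if (c, t) not in faulty]
        # Give each usable core the next waiting virtual CPU.
        count = min(len(usable), len(ready))
        running = [ready.popleft() for _ in range(count)]
        for v in running:
            slices_run[v] += 1
        # Round-robin: virtual CPUs that just ran go to the back of the queue.
        ready.extend(running)
    return slices_run


if __name__ == "__main__":
    # Fault-free: four virtual CPUs on four cores.
    print(run_overcommitted(4, 4, set(), 4))
    # Core 3 suspended for all four slices: four virtual CPUs share three cores.
    print(run_overcommitted(4, 4, {(3, t) for t in range(4)}, 4))
```

In this model the fault-free run gives each virtual processor all four slices, while suspending one core for the whole run gives each of them three of the four slices. The loss is spread evenly and the software-visible processor count never changes.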

Supplementary Material

JPG File (1346314.jpg)
index.html (index.html)
Slides from the presentation
ZIP File (p255-wells-slides.zip)
Supplemental material for Adapting to intermittent faults in multicore systems
Audio only (1346314.mp3)
Video (1346314.mp4)

Cited By

  • (2017) ANMR: Aging-aware adaptive N-modular redundancy for homogeneous multicore embedded processors. Journal of Parallel and Distributed Computing 109 (29-41). DOI: 10.1016/j.jpdc.2017.04.013. Online publication date: Nov-2017
  • (2016) Taking the Blame Game out of Data Centers Operations with NetPoirot. Proceedings of the 2016 ACM SIGCOMM Conference (440-453). DOI: 10.1145/2934872.2934884. Online publication date: 22-Aug-2016
  • (2012) Analysis of intermittent timing fault vulnerability. Microelectronics Reliability 52:7 (1515-1522). DOI: 10.1016/j.microrel.2012.03.003. Online publication date: Jul-2012

    Published In

    ACM SIGPLAN Notices, Volume 43, Issue 3
    ASPLOS '08
    March 2008
    339 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/1353536

    ASPLOS XIII: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2008
    352 pages
    ISBN: 9781595939586
    DOI: 10.1145/1346281
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. intermittent faults
    2. overcommitted system

    Qualifiers

    • Research-article
