DOI: 10.1145/2540708.2540721
research-article

Use it or lose it: wear-out and lifetime in future chip multiprocessors

Published: 07 December 2013

Abstract

Moore's Law scaling continues to yield ever higher transistor density with each process generation, leading to today's Chip Multi-Processors (CMPs) with tens or even hundreds of interconnected cores or tiles. Unfortunately, deep sub-micron CMOS process technology is marred by increasing susceptibility to wearout. Prolonged operational stress accelerates wearout and failure through several physical failure mechanisms, including Hot Carrier Injection (HCI) and Negative Bias Temperature Instability (NBTI). Each failure mechanism correlates with different usage-based stresses, all of which can eventually generate permanent faults. While the wearout of an individual core in a many-core CMP may not necessarily be catastrophic for the system, a single fault in the inter-processor Network-on-Chip (NoC) fabric could render the entire chip useless, as it could lead to protocol-level deadlocks or even partition away vital components such as the memory controller or other critical I/O. In this paper, we develop critical path models for HCI- and NBTI-induced wear due to the actual stresses caused by real workloads, applied to the interconnect microarchitecture. A key finding from this modeling is that, counter to prevailing wisdom, wearout in the CMP on-chip interconnect is correlated with a lack of load observed in the NoC routers, rather than with high load. We then develop a novel wearout-decelerating scheme in which routers under low load have their wearout-sensitive components exercised, without significantly impacting the cycle time, pipeline depth, area, or power consumption of the overall router. We show that the proposed design yields a 13.8x to 65x increase in CMP lifetime.
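
The intuition behind exercising idle routers can be illustrated with a toy simulation. This is a minimal sketch and not the paper's critical path model: the function `stress_fraction`, the parameter `idle_threshold`, and the assumption that a node is under static stress on every cycle it is neither carrying traffic nor being exercised are all hypothetical simplifications introduced here for illustration.

```python
import random


def stress_fraction(load, cycles=20_000, exercise=False, idle_threshold=8, seed=0):
    """Return the fraction of cycles a hypothetical node spends under static stress.

    Toy assumption (not from the paper): a cycle with traffic toggles the node
    and causes no static stress; an idle cycle leaves it held at a stressing value.
    """
    rng = random.Random(seed)
    stressed = 0
    idle_run = 0  # length of the current run of consecutive idle cycles
    for _ in range(cycles):
        if rng.random() < load:
            idle_run = 0
            continue  # real traffic exercises the node this cycle
        idle_run += 1
        # Hypothetical policy: once the router has been idle for longer than
        # idle_threshold cycles, exercise the node on every other idle cycle.
        if exercise and idle_run > idle_threshold and idle_run % 2 == 0:
            continue
        stressed += 1
    return stressed / cycles


if __name__ == "__main__":
    for load in (0.1, 0.5, 0.9):
        base = stress_fraction(load)
        relieved = stress_fraction(load, exercise=True)
        print(f"load={load:.1f}  baseline={base:.3f}  with exercise={relieved:.3f}")
```

Under these assumptions, a lightly loaded router accumulates more static stress than a heavily loaded one, and exercising it during long idle runs lowers that stress without touching cycles that carry real traffic.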

        Published In

        MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
        December 2013
        498 pages
        ISBN:9781450326384
        DOI:10.1145/2540708

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. hot carrier injection (HCI)
        2. lifetime
        3. negative bias temperature instability (NBTI)
        4. network-on-chip
        5. reliability
        6. wearout

        Qualifiers

        • Research-article

        Funding Sources

        • Republic of Cyprus
        • European Regional Development Fund

        Conference

        MICRO-46
        Acceptance Rates

        MICRO-46 Paper Acceptance Rate 39 of 239 submissions, 16%;
        Overall Acceptance Rate 484 of 2,242 submissions, 22%

        Article Metrics

        • Downloads (last 12 months): 20
        • Downloads (last 6 weeks): 9
        Reflects downloads up to 25 Dec 2024

        Cited By

        • (2024) BZSim: Fast, Large-Scale Microarchitectural Simulation with Detailed Interconnect Modeling. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 167-178. DOI: 10.1109/ISPASS61541.2024.00025. Online publication date: 5-May-2024
        • (2023) SRAM Process and Debug Sensor. NAECON 2023 - IEEE National Aerospace and Electronics Conference, pages 304-308. DOI: 10.1109/NAECON58068.2023.10365775. Online publication date: 28-Aug-2023
        • (2022) HREN: A Hybrid Reliable and Energy-Efficient Network-on-Chip Architecture. IEEE Transactions on Emerging Topics in Computing, pages 1-1. DOI: 10.1109/TETC.2022.3147407. Online publication date: 2022
        • (2022) FastTrackNoC: A NoC with FastTrack Router Datapaths. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 971-985. DOI: 10.1109/HPCA53966.2022.00075. Online publication date: Apr-2022
        • (2022) Pre-Silicon NBTI Delay-Aware Modeling of Network-on-Chip Router Microarchitecture. Microprocessors & Microsystems, 91:C. DOI: 10.1016/j.micpro.2022.104526. Online publication date: 1-Jun-2022
        • (2022) Hardware Security in Emerging Photonic Network-on-Chip Architectures. Emerging Computing: From Devices to Systems, pages 291-313. DOI: 10.1007/978-981-16-7487-7_9. Online publication date: 9-Jul-2022
        • (2021) Securing Silicon Photonic NoCs Against Hardware Attacks. Network-on-Chip Security and Privacy, pages 399-421. DOI: 10.1007/978-3-030-69131-8_15. Online publication date: 4-May-2021
        • (2020) CURE: A High-Performance, Low-Power, and Reliable Network-on-Chip Design Using Reinforcement Learning. IEEE Transactions on Parallel and Distributed Systems, 31:9, pages 2125-2138. DOI: 10.1109/TPDS.2020.2986297. Online publication date: 1-Sep-2020
        • (2020) Gate Level NBTI and Leakage Co-Optimization in Combinational Circuits with Input Vector Cycling. IEEE Transactions on Emerging Topics in Computing, 8:3, pages 738-749. DOI: 10.1109/TETC.2018.2799739. Online publication date: 1-Jul-2020
        • (2020) System management recovery in NoC-based many-core systems. Analog Integrated Circuits and Signal Processing. DOI: 10.1007/s10470-020-01631-y. Online publication date: 12-Mar-2020
