research-article
Open access

An Online and Real-Time Fault Detection and Localization Mechanism for Network-on-Chip Architectures

Published: 14 June 2016

Abstract

Networks-on-Chip (NoCs) are becoming increasingly susceptible to emerging reliability threats, and the need to detect and localize faults at runtime is becoming imperative. In this work, we propose NoCAlert, a comprehensive online and real-time fault detection and localization mechanism that demonstrates 0% false negatives within the interconnect for the fault models and stimulus set used in this study. Based on the concept of invariance checking, NoCAlert employs a group of lightweight microchecker modules that collectively implement real-time hardware assertions. The checkers operate concurrently with normal NoC operation, thus eliminating the need for periodic or trigger-based self-testing. Based on the pattern (signature) of asserted checkers, NoCAlert can pinpoint the location of the fault at various granularity levels. Most importantly, 97% of the transient and 90% of the permanent faults are detected instantaneously, within a single clock cycle of fault manifestation. The fault localization accuracy ranges from 90% to 100%, depending on the desired localization granularity. Extensive cycle-accurate simulations of a 64-node CMP and analysis at the RTL netlist level demonstrate the efficacy of the proposed technique.
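
The idea of invariance checking with signature-based localization can be illustrated with a small software sketch. This is not NoCAlert's actual checker set or hardware: the three invariants, the signal names (`sa_req`, `sa_grant`, `rc_out_port`, `buf_read`, `buf_count`), the module names, and the five-port router are all assumptions chosen for illustration. Each checker tests one condition that a fault-free router should never violate, and the set of asserted checkers is then mapped back to the router and module they guard.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

NUM_PORTS = 5  # assumed: N, E, S, W and local port of a mesh router


def arbiter_grant_is_valid(requests: int, grants: int) -> bool:
    """Invariant: at most one grant, and only to a port that requested."""
    at_most_one = (grants & (grants - 1)) == 0
    only_requesters = (grants & ~requests) == 0
    return at_most_one and only_requesters


@dataclass(frozen=True)
class Checker:
    name: str      # hypothetical checker name
    module: str    # hypothetical router module the checker guards
    holds: Callable[[Dict[str, int]], bool]  # True while the invariant holds


# Illustrative invariants only; the abstract does not list the real ones.
CHECKERS: List[Checker] = [
    Checker("sa_grant_valid", "switch_allocator",
            lambda s: arbiter_grant_is_valid(s["sa_req"], s["sa_grant"])),
    Checker("rc_port_in_range", "routing_computation",
            lambda s: 0 <= s["rc_out_port"] < NUM_PORTS),
    Checker("buf_no_read_when_empty", "input_buffer",
            lambda s: not (s["buf_read"] and s["buf_count"] == 0)),
]


def asserted(signals: Dict[str, int]) -> List[Checker]:
    """Evaluate every checker on one cycle's signals; return the violated ones."""
    return [c for c in CHECKERS if not c.holds(signals)]


def localize(per_router: Dict[int, Dict[str, int]]) -> Dict[int, List[str]]:
    """Map the pattern of asserted checkers to suspect (router, module) pairs."""
    suspects: Dict[int, List[str]] = {}
    for router_id, signals in per_router.items():
        modules = sorted({c.module for c in asserted(signals)})
        if modules:
            suspects[router_id] = modules
    return suspects


healthy = {"sa_req": 0b0110, "sa_grant": 0b0100,
           "rc_out_port": 2, "buf_read": 1, "buf_count": 3}
# Inject a fault: the allocator grants two ports at once.
faulty = dict(healthy, sa_grant=0b0101)

print(localize({0: healthy, 5: faulty}))
```

In this sketch the checks run on every sampled cycle alongside normal operation, which mirrors the concurrent, test-free detection described above. In hardware, such conditions would be small combinational assertions rather than Python predicates.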

Supplementary Material

a22-chrysanthou-apndx.pdf (chrysanthou.zip)
Supplemental movie, appendix, image, and software files for "An Online and Real-Time Fault Detection and Localization Mechanism for Network-on-Chip Architectures"


Cited By

  • (2024) ICLB: intelligent controllers load balancing for software-defined based optical data center networks. The Journal of Supercomputing 80:13 (19031-19061). DOI: 10.1007/s11227-024-06165-y. Online publication date: 1-Sep-2024
  • (2023) Sixer: A low-overhead, fully-distributed test scheme with guaranteed delivery of packets in networks-on-chip. Microelectronics Reliability 142 (114908). DOI: 10.1016/j.microrel.2023.114908. Online publication date: Mar-2023
  • (2020) A new architecture for online error detection and isolation in network on chip. Journal of High Speed Networks 26:4 (307-323). DOI: 10.3233/JHS-200646. Online publication date: 1-Jan-2020

        Published In

        ACM Transactions on Architecture and Code Optimization  Volume 13, Issue 2
        June 2016
        200 pages
        ISSN:1544-3566
        EISSN:1544-3973
        DOI:10.1145/2952301
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 14 June 2016
        Accepted: 01 April 2016
        Revised: 01 March 2016
        Received: 01 December 2015
        Published in TACO Volume 13, Issue 2

        Author Tags

        1. Networks-on-chip
        2. NoC
        3. fault detection/diagnosis
        4. fault localization

        Qualifiers

        • Research-article
        • Research
        • Refereed

        Funding Sources

        • Harpa, Project
        • European Commission 7th RTD Framework Program - Information and Communication Technologies: Computing Systems

        Cited By

        • (2020) On Hardware-Trojan-Assisted Power Budgeting System Attack Targeting Many Core Systems. Journal of Systems Architecture (101757). DOI: 10.1016/j.sysarc.2020.101757. Online publication date: Feb-2020
        • (2019) Speed up dynamic time warping of multivariate time series. Journal of Intelligent & Fuzzy Systems 36:3 (2593-2603). DOI: 10.3233/JIFS-181736. Online publication date: 26-Mar-2019
        • (2018) On a New Hardware Trojan Attack on Power Budgeting of Many Core Systems. 2018 31st IEEE International System-on-Chip Conference (SOCC) (1-6). DOI: 10.1109/SOCC.2018.8618565. Online publication date: Sep-2018
        • (2018) Testable Error Detection Logic Design Applied to an Asynchronous Timing Resilient Template. 2018 31st Symposium on Integrated Circuits and Systems Design (SBCCI) (1-6). DOI: 10.1109/SBCCI.2018.8533263. Online publication date: Aug-2018
        • (2018) A Hierarchical Approach for Devising Area Efficient Concurrent Online Checkers. 2018 IEEE International Test Conference in Asia (ITC-Asia) (139-144). DOI: 10.1109/ITC-Asia.2018.00034. Online publication date: Aug-2018
        • (2018) Monitor and Knob Techniques in Network-on-Chip Architectures. Harnessing Performance Variability in Embedded and High-performance Many/Multi-core Platforms (187-213). DOI: 10.1007/978-3-319-91962-1_9. Online publication date: 24-Oct-2018
        • (2017) HARPA. Proceedings of the Conference on Design, Automation & Test in Europe (97-102). DOI: 10.5555/3130379.3130402. Online publication date: 27-Mar-2017
