In the space electronics industry, memory cells are one of the main concerns, especially in terms of reliability, since radiation particles may hit cell nodes and disturb the state of the cell, possibly causing fatal errors. In this paper we propose the Nwise SRAM cell, an area-efficient and highly reliable radiation-hardened memory cell for use in high-density memories for space applications. Simulations confirm that the proposed Nwise cell is fully tolerant to single event upsets (SEU) in any one of its nodes regardless of upset polarity. Compared with the RHBD-10T cell, the latest area-efficient radiation-hardened memory cell, it is more robust: the minimum critical charge of Nwise is 4.1× higher than that of the RHBD-10T cell. It also shows 23% and 12% improvements in read and write static noise margin (SNM). Furthermore, compared with RHBD-10T, up to 18.4% and 7.0% power savings are obtainable during write and read operations, respectively. Nwise is about 2.28× faster than RHBD-10T during the more frequent read operation, with a similar penalty in write time. Finally, Nwise is the first high-density, reliable radiation-hardened memory cell designed in the 28 nm FD-SOI technology node.
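For context, a cell's critical charge is the smallest charge a particle strike must deposit on a storage node to flip the stored value; a common first-order textbook approximation (not a formula or value taken from this paper) ties it to the node capacitance and the supply voltage:

```latex
Q_{\mathrm{crit}} \approx C_{\mathrm{node}} \cdot V_{DD}
```

More detailed models add a term for the restoring current supplied by the cell's pull-up or pull-down transistor during the flip, which is one reason hardened cells with extra feedback nodes exhibit higher critical charge.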
Increased clock frequencies, higher transistor counts, lower voltage levels, and reduced noise margins have greatly raised the performance of modern microprocessors but have also made them more vulnerable to soft errors. To detect soft errors, designers frequently introduce redundant software and hardware; software techniques, however, have shown their capability to protect against soft errors without any hardware overhead, and are attractive for their flexibility, low cost, and ease of deployment. Software fault-tolerance techniques protect an application by building redundancy into the compiled code. A new methodology is proposed for tolerating soft errors through triple modular redundancy. Experimental studies show that this method increases reliability and offers efficient protection via software modulation in comparison to existing duplication and triplication methods.
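As a rough source-level illustration of the triple modular redundancy idea (the paper builds the redundancy into the compiled code; the function and names below are hypothetical), a software TMR scheme executes the protected computation three times and majority-votes on the results, so a soft error in any single replica is masked:

```python
from collections import Counter

def tmr_execute(computation, *args):
    """Run `computation` three times and return the majority result.

    A mismatch among the three copies indicates a soft error in one
    replica; the majority vote masks a single faulty result.
    """
    results = [computation(*args) for _ in range(3)]
    (value, count), = Counter(results).most_common(1)
    if count < 2:
        # All three copies disagree: the error cannot be masked.
        raise RuntimeError("TMR vote failed: no majority result")
    return value

# Hypothetical usage: protect a single computation against one bit flip.
protected_sum = tmr_execute(lambda a, b: a + b, 40, 2)
```

Duplication, by contrast, can only detect a disagreement; triplication with voting is what allows the error to be corrected on the fly.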
Recent trends of aggressive technology scaling have greatly exacerbated the occurrence and impact of faults in computing systems. This has made reliability a first-order design constraint. To address the challenges of reliability, several techniques have been proposed. This paper provides a survey of architectural techniques for improving the resilience of computing systems. We especially focus on techniques proposed for microarchitectural components, such as processor registers, functional units, caches, and main memory. In addition, we discuss techniques proposed for non-volatile memory (NVM), GPUs, and 3D-stacked processors. To underscore the similarities and differences of the techniques, we classify them based on their key characteristics. We also review the metrics proposed to quantify the vulnerability of processor structures. We believe that this survey will help researchers, system architects, and processor designers gain insights into techniques for improving the reliability of computing systems.
Soft errors are a significant reliability concern for nanometer technologies. Shrinking feature sizes, lower voltage levels, reduced noise margins, and increased clock frequencies improve the performance and lower the power consumption of integrated circuits, but they also make integrated circuits more susceptible to soft errors that can corrupt data and leave systems vulnerable. This 'device shrinking' reduces the soft error tolerance of VLSI circuits, as very little energy is needed to change their states. In digital systems where reliability is a great concern, the impact of soft errors can be catastrophic. Safety-critical systems are especially sensitive to soft errors: a bit flip due to a soft error can change the value of a critical variable and, consequently, completely alter the system control flow, which may lead to system failure. To minimize soft error risks, critical blocks are identified by a criticality analysis that ranks the blocks; the highest-ranked blocks are considered critical. Refactoring is then applied to reduce the criticality of the critical blocks. Finally, a novel methodology is proposed to detect and recover from soft errors.
Deep Neural Networks (DNNs) used in safety-critical systems cannot compromise their performance due to reliability issues, and soft errors are among the most critical of these issues. Selective software-based protection solutions are among the best techniques to improve the reliability of DNNs efficiently. However, their most significant challenge is precisely choosing which portions of the DNN model to harden in order to avoid performance degradation. In this work, we propose a comprehensive methodology to analyze the reliability of object detection and classification algorithms run on GPUs from the lowest (instruction) evaluation level. The ultimate goal is to avoid the performance penalty of full instruction duplication by confidently identifying the vulnerable instructions. For this purpose, we propose the Instruction Vulnerability Factor (IVF). By applying our methodology to ResNet and YOLO models, we demonstrate that the most vulnerable instructions of both models can be precisely determined. Moreover, we show that YOLO is more sensitive to the changes caused by soft errors than ResNet. Also, ResNet's reliability depends on the input image, while YOLO's tends to be input-independent.
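As a sketch of how an instruction-level vulnerability factor of this kind could be computed from fault-injection campaigns (the paper's exact IVF definition may differ; the field names below are assumptions), one can tally, per static instruction, the fraction of injected faults that corrupt the program output:

```python
from collections import defaultdict

def instruction_vulnerability(injections):
    """Estimate a per-instruction vulnerability factor from injection logs.

    `injections` is an iterable of (instruction_id, corrupted_output) pairs,
    one entry per injected fault.  The returned factor is the fraction of
    faults injected into each instruction that led to a corrupted output.
    """
    totals = defaultdict(int)
    corrupted = defaultdict(int)
    for instr, bad in injections:
        totals[instr] += 1
        corrupted[instr] += int(bad)
    return {instr: corrupted[instr] / totals[instr] for instr in totals}

# Hypothetical log: instructions whose factor exceeds a threshold would be
# selected for duplication, instead of duplicating the whole kernel.
log = [("FFMA_12", True), ("FFMA_12", False), ("LD_7", False), ("LD_7", False)]
vulnerable = {i for i, f in instruction_vulnerability(log).items() if f > 0.25}
```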
Soft errors are a significant reliability concern for nanometer technologies. Shrinking feature sizes, lower voltage levels, reduced noise margins, and increased clock frequencies improve the performance and lower the power consumption of integrated circuits, but they also make integrated circuits more susceptible to soft errors that can corrupt data and leave systems vulnerable. In computer systems where reliability is a great concern, the impact of soft errors can be catastrophic. This paper proposes a new approach to detect soft errors through variable dependency analysis. The proposed method has lower time overhead than the existing dominant approach.
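A minimal sketch of what a variable dependency analysis might look like (the paper's actual algorithm is not shown; this only illustrates restricting checks to the variables a critical output depends on, over a hypothetical dependency graph):

```python
def backward_slice(dependencies, critical_outputs):
    """Return the set of variables that critical outputs transitively depend on.

    `dependencies` maps a variable to the variables it is computed from.
    Only variables in the returned set need soft-error checks, which is
    where the time-overhead saving comes from.
    """
    worklist = list(critical_outputs)
    reachable = set(critical_outputs)
    while worklist:
        var = worklist.pop()
        for dep in dependencies.get(var, ()):
            if dep not in reachable:
                reachable.add(dep)
                worklist.append(dep)
    return reachable

# Hypothetical dependency graph: result <- (speed, limit), speed <- (raw,)
deps = {"result": ["speed", "limit"], "speed": ["raw"], "log_msg": ["raw"]}
print(sorted(backward_slice(deps, ["result"])))  # ['limit', 'raw', 'result', 'speed']
```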
Radiation-induced soft errors have posed an increasing reliability challenge to combinational and sequential circuits in advanced CMOS technologies. Therefore, it is imperative to devise fast, accurate, and scalable soft error rate (SER) estimation methods as part of cost-effective robust circuit design. This paper presents an efficient SER estimation framework for combinational and sequential circuits, which considers single-event transients (SETs) in combinational logic and multiple cell upsets (MCUs) in sequential elements. A novel top-down memoization algorithm is proposed to accelerate the propagation of SETs, and a general schematic and layout co-simulation approach is proposed to model MCUs for redundant sequential storage structures. The feedback in sequential logic is analyzed with an efficient time-frame expansion method. Experimental results on various ISCAS85 combinational benchmark circuits demonstrate that the proposed approach achieves up to 560.2× speedup with less than 3% difference in SER results compared with the baseline algorithm. The average runtime of the proposed framework on a variety of ISCAS89 benchmark circuits is 7.20 s, and the runtime is 119.23 s for the largest benchmark circuit, with more than 3,000 flip-flops and 17,000 gates. Keywords: Soft error, single-event upset, multiple cell upset, hardened flip-flop, algorithm.
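To give a feel for the memoization idea (a generic illustration, not the paper's algorithm): when propagating an SET from a struck gate toward the primary outputs, the probability of reaching an output from a given gate can be cached so that shared fan-out cones are traversed only once:

```python
from functools import lru_cache

# Hypothetical netlist: gate -> list of (fanout_gate, propagation_probability).
FANOUT = {
    "g1": [("g2", 0.6), ("g3", 0.8)],
    "g2": [("out", 0.5)],
    "g3": [("out", 0.9)],
    "out": [],
}

@lru_cache(maxsize=None)
def reach_output(gate):
    """Probability that a transient at `gate` reaches a primary output.

    Memoization (lru_cache) ensures each gate's fan-out cone is evaluated
    once, which is what makes a top-down traversal scale to large circuits.
    """
    if gate == "out":
        return 1.0
    # Probability of NOT reaching the output through any fanout branch,
    # assuming independent masking on each branch (a simplification).
    miss = 1.0
    for nxt, p in FANOUT[gate]:
        miss *= 1.0 - p * reach_output(nxt)
    return 1.0 - miss

print(round(reach_output("g1"), 3))  # 0.804 for this toy netlist
```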
The processor register file (RF) is an important microarchitectural component used for storing the operands and results of instructions. The design and operation of the RF have a crucial impact on the performance, energy efficiency, and reliability of the processor; hence, several techniques have recently been proposed to manage the RF in modern processors. In this paper, we present a survey of techniques for architecting and managing the CPU register file. We classify the techniques across several parameters to underscore their similarities and differences. We hope that this paper will provide researchers with insights into the working of the RF and inspire further efforts towards optimizing the RF in next-generation computing systems.
Soft errors due to cosmic radiation are one of the major challenges for reliable VLSI designs. In this paper, we present a symbolic framework to model soft errors in both synchronous and asynchronous designs. The proposed methodology utilizes Multiway Decision Graphs (MDGs) and glitch-propagation sets (GP sets) to obtain soft error rate (SER) estimates at the gate level. This work helps mitigate design-for-testability (DFT) issues related to identifying the controllable and observable circuit nodes when the circuit is subject to soft errors. It also allows designers to apply radiation tolerance techniques on reduced sets of internal nodes. To demonstrate the effectiveness of our technique, several ISCAS89 sequential and combinational benchmark circuits and multiple asynchronous handshake circuits have been analyzed. Results indicate that the proposed technique is on average 4.29 times faster than the best contemporary state-of-the-art techniques. The proposed technique is capable of exhaustively identifying soft error glitch propagation paths, which are then used to estimate the SER. To the best of our knowledge, this is the first time that a decision-diagram-based soft error identification approach has been proposed for asynchronous circuits.
Advances in technology have reduced transistor sizes, which makes devices more vulnerable to noise and radiation effects that cause soft errors. Soft errors, or transient errors, occur when radioactive atoms decay and release alpha particles into the chip. These alpha particles hit a memory cell and change its state, which affects memory reliability. Soft errors in memories can cause multiple cell upsets (MCUs) around the location of the strike. Memory cells can be protected against MCUs using various Error Correction Codes (ECCs). Decimal Matrix Code (DMC) is a type of ECC recently proposed for memory protection; its main drawback is that it requires more redundant bits and offers limited error correction capability compared with the proposed work. This paper proposes the Hybrid Matrix Code (HMC), a combination of Matrix Code and Hamming code together with an Encoder Reuse Technique (ERT), to assure reliability in the presence of MCUs while reducing the number of redundant bits and correcting more errors than the existing method. ERT reuses the HMC encoder as part of the decoder, minimizing the area overhead of extra circuits without disturbing the encoding and decoding process. The proposed algorithm is coded in VHDL and simulated using ModelSim and the Xilinx ISE 8.1i simulator. The results show reduced area usage, delay overhead, and redundant bits compared with the existing method.
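For reference, the Hamming component of such a scheme corrects any single bit flip in a codeword. A minimal Hamming(7,4) encoder/decoder (the generic textbook construction, not the paper's HMC layout) looks like this:

```python
def hamming74_encode(d):
    """Encode 4 data bits d = [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # parity over codeword positions 2,3,6,7
    p4 = d2 ^ d3 ^ d4          # parity over codeword positions 4,5,6,7
    return [p1, p2, d1, p4, d2, d3, d4]   # codeword positions 1..7

def hamming74_decode(c):
    """Correct a single bit flip in codeword `c` and return the 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4       # 0 means no error, else bit position
    if syndrome:
        c = list(c)
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[5] ^= 1                               # inject a single-bit upset
assert hamming74_decode(code) == word      # the upset is corrected
```

Matrix-code schemes arrange data in rows and columns and combine per-row Hamming protection with column parity so that multiple upsets clustered around a strike can still be corrected.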
Glitches due to soft errors can act as a severe deterrent to asynchronous circuit operation. To mitigate soft errors in quasi delay insensitive (QDI) asynchronous circuits, built-in soft error correction in NULL convention logic (NCL) has been introduced [9]. However, this technique cannot detect errors during the NULL phase of the NCL pipeline, and it cannot prevent an error from propagating into the pipeline after its detection. This paper provides a modified approach to overcome these limitations with, on average, comparable power and latency costs. This work also analyzes the effects of temperature variation on the latency and power consumption of the proposed design. The modified NCL pipeline is implemented in IHP 90nm CMOS technology and analyzed under various operating temperatures. It is found that the proposed design survives the worst-case operating temperatures and does not propagate soft errors.
Recently, deep neural networks (DNNs) have been increasingly deployed in various healthcare applications, which are considered safety-critical. The reliability of these DNN models must therefore be remarkably high, because even a small error in healthcare applications can lead to injury or death. Due to the high computational demands of DNN models, DNNs are often executed on graphics processing units (GPUs). However, GPUs have reportedly been affected by soft errors, which are an extremely serious issue for healthcare applications. In this paper, we show how fault injection can provide a deeper understanding of the vulnerability of DenseNet201 model instructions on the GPU, and we analyze the vulnerable instructions of DenseNet201 on the GPU. Our results show that the faults affecting the most vulnerable instruction types against soft errors (PR, STORE, FADD, FFMA, SETP, and LD) can be reduced from 4.42% to 0.14% of injected faults after applying our mitigation strategy.
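One common mitigation of this kind (whether it matches the paper's exact strategy is not stated in the abstract) is to duplicate only the operations identified as vulnerable and compare the copies, detecting a soft error before it reaches the model output:

```python
def checked(op, *args, retries=1):
    """Execute `op` twice and compare; re-execute on mismatch.

    Duplicating only operations identified as vulnerable keeps the
    overhead far below full duplication of the network.
    """
    for _ in range(retries + 1):
        a, b = op(*args), op(*args)
        if a == b:
            return a
    raise RuntimeError("persistent mismatch: possible hard fault")

# Hypothetical use on a fused multiply-add that fault injection flagged
# as vulnerable (an FFMA-like operation).
result = checked(lambda x, y, acc: x * y + acc, 3.0, 2.0, 1.0)
```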
With drastic device shrinking, low operating voltages, increasing complexity, and high-speed operation, radiation-induced soft errors have posed an ever-increasing reliability challenge to both combinational and sequential circuits in advanced CMOS technologies. Therefore, it is imperative to devise efficient soft error rate (SER) estimation methods in order to evaluate soft error vulnerabilities for cost-effective robust circuit design. Previous works either analyze only the SER in combinational circuits or evaluate soft error vulnerabilities only in sequential elements. In this paper, a joint SER estimation framework is proposed, which considers single-event transients (SETs) in combinational logic and multiple cell upsets (MCUs) in sequential components. Various masking effects are considered in the combinational SER estimation process, and several typical radiation-hardened and non-hardened flip-flop structures are analyzed and compared as the sequential elements. A schematic and layout co-simulation approach is proposed to model the MCUs for redundant sequential storage structures. Experimental results on a variety of ISCAS benchmark circuits using the Nangate 45nm CMOS standard cell library demonstrate the difference in soft error resilience among designs using different sequential elements and the importance of modeling MCUs in redundant structures. Keywords: Soft error, hardened flip-flop, single-event upset, multiple cell upset.
Soft errors are a significant reliability concern for nanometer technologies. Shrinking feature sizes, lower voltage levels, reduced noise margins, and increased clock frequencies improve the performance and lower the power consumption of integrated circuits, but they also make integrated circuits more susceptible to soft errors that can corrupt data and leave systems vulnerable. In computer systems where reliability is a great concern, the impact of soft errors can be catastrophic; in safety-critical systems in particular, a momentary failure may cause human death. Hence, soft errors must be detected quickly. This paper proposes a new approach that reduces the time overhead of detecting soft errors through variable dependency analysis.
Advances in deep submicron (DSM) technologies have led to miniaturization. However, they have also increased vulnerability to electrical and device non-idealities, including soft errors, which are a significant threat to the reliable functionality of digital circuits. Several techniques for detecting and deterring soft errors (to improve reliability) have been proposed in both the synchronous and asynchronous domains. In this paper we propose a low-power and soft-error-tolerant solution for synchronous systems that leverages an asynchronous pipeline within a synchronous framework; we name this technique the macro synchronous micro asynchronous (MSMA) pipeline. We provide a framework along with a timing analysis of the MSMA technique. MSMA is implemented using a macro synchronous system and a soft-error-tolerant, low-power version of a null convention logic (NCL) asynchronous circuit. We find that this solution can easily replace the intermediate stages of synchronous and asynchronous pipelines without changing their interface protocols. Such NCL asynchronous circuits can be used as standard cells in the synchronous ASIC design flow. Power and performance analysis using electrical simulations shows that this technique consumes at least 22% less power and 45% less energy-delay product (EDP) compared to state-of-the-art solutions.
Increased vulnerability to soft errors has affected the reliability of both synchronous and asynchronous circuits implemented in modern deep sub-micron technologies. Hence, in such circuits there is a growing need to identify possible soft error glitch propagation at an early stage in the design flow. This paper proposes a new methodology to obtain soft error glitch propagation paths in digital designs, both synchronous and asynchronous. To compute these paths, Multiway Decision Graphs (MDGs) and glitch-propagation sets (GP sets) are utilized in conjunction with a Boolean Satisfiability solver (MiniSat). The applicability of the proposed method is illustrated on ISCAS89 benchmark sequential circuits, 8-bit adders, multipliers, and self-timed multiple-group pipeline asynchronous handshake circuits. The proposed SAT-based methodology is on average 13 times faster than the best contemporary state-of-the-art techniques while exhaustively analyzing possible soft error glitch-propagation paths.
The progressive shrinking of device size in advanced technologies leads to miniaturization and performance improvements. However, ultra-deep sub-micron technologies are more vulnerable to soft errors, and error analysis of a complex system with a sufficiently large sample of vulnerable nodes takes a large amount of time. In this paper we propose RASVAS, a hierarchical statistical method to model, analyze, and estimate the behavior of a system in the presence of Single Event Transients (SETs) modeled at different abstraction levels. Gate-level propagation tables are developed to abstract SET propagation conditions and probabilities from gate-level models. At RTL, these tables are utilized to model the underlying probabilistic behavior as Markov Decision Process (MDP) models. Experimental results demonstrate that RASVAS is orders of magnitude faster than contemporary techniques and handles designs as large as 256-bit adders while maintaining accuracy.
This paper presents an FPGA fault injection system, a methodology for soft processor fault injection, and fault injection experimental results for MicroBlaze and LEON3 soft processor designs. The Xilinx Radiation Test Consortium Virtex-5 Fault Injector (XRTC-V5FI) was built to evaluate the configuration memory sensitivity of soft processor designs. To overcome some of the challenges of soft processor fault injection, we designed the XRTC-V5FI to be fast, flexible, and to fully cover all configuration memory bits. The minimum time to inject a full bitstream is 28 minutes, and an individual fault injection can be as fast as 49 µs. The LEON3 has 81.3% more sensitive bits than the MicroBlaze, yet when normalized by the number of used slices, the MicroBlaze is 26.2% more sensitive than the LEON3.
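The two timing figures are mutually consistent if the 28-minute full-bitstream campaign consists of back-to-back individual injections at the 49 µs minimum (an assumption; the paper may include additional per-injection overhead), which would put the campaign size at roughly

```latex
\frac{28 \times 60\ \text{s}}{49\ \mu\text{s}} \approx 3.4 \times 10^{7}\ \text{injections,}
```

i.e., on the order of tens of millions of injected configuration bits.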
Smaller feature sizes, higher clock frequencies, and lower power consumption are core concerns of today's nanotechnology, resulting from the continuous downscaling of CMOS technologies. The resultant 'device shrinking' reduces the soft error tolerance of VLSI circuits, as very little energy is needed to change their states. Safety-critical systems are very sensitive to soft errors: a bit flip due to a soft error can change the value of a critical variable and, consequently, completely alter the system control flow, leading to system failure. To minimize soft error risks, a novel methodology is proposed to detect and recover from soft errors by considering only 'critical code blocks' and 'critical variables' rather than all variables and/or blocks in the whole program. The proposed method reduces space and time overhead in comparison to existing dominant approaches.
Glitches due to soft errors have become a major concern in circuits designed in ultra-deep sub-micron technologies. Most soft error mitigation techniques require redundancy and are power hungry. Recently, low-power quasi delay insensitive (QDI) null convention logic based asynchronous circuits have been proposed, but these work for pure asynchronous designs only. This paper extends the low-power soft-error-tolerant asynchronous technique to conventional synchronous circuits. The main idea is to accommodate asynchronous standard cells within the synchronous pipeline, giving rise to a macro synchronous micro asynchronous (MSMA) pipeline. An important application of this design is in detecting hardware Trojans. State-of-the-art signature-based hardware Trojan detection relies on clock reference signals for timing signatures; however, an intruder can tamper with the clock distribution network itself, which may lead to many false positives or even false negatives. Asynchronous handshake signals, on the other hand, give the digital system an event-triggered nature, so the timing analysis is tied to the data path alone and is unaffected by the clock distribution network. This paper provides a proof-of-concept soft-error-tolerant MSMA design. A time-delay-based signature obtained without using the clock distribution network is used to detect hardware Trojan insertion in MSMA.
Strategies to improve application resilience require the ability to distinguish vulnerability differences across application components and to selectively apply protection. Hence, quantitatively modeling application vulnerability, as a method to capture vulnerability variance within the application, is critical to evaluating and improving system resilience. Traditional methods cannot effectively quantify vulnerability, because they lack a holistic view of system resilience and come with prohibitive evaluation costs. In this paper, we introduce a data-driven methodology to analyze application vulnerability based on a novel resilience metric, the data vulnerability factor (DVF). DVF integrates both the application and the specific hardware into the resilience analysis. To calculate DVF, we extend a performance modeling language to provide a fast modeling solution. Furthermore, we measure six representative computational kernels; we demonstrate the value of DVF by quantifying the impact of algorithm optimization on vulnerability and by quantifying the effectiveness of a hardware protection mechanism.
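As a rough sketch of how a data-structure-level vulnerability factor of this kind could be estimated empirically (the paper's actual DVF is defined analytically through its modeling language; the quantities and names below are illustrative assumptions), one can attribute injected faults to data structures and score each structure by how often a fault in it corrupts the result:

```python
def data_vulnerability(injections):
    """Estimate a per-data-structure vulnerability score from injection logs.

    `injections` maps a data-structure name to a list of booleans, one per
    injected fault, recording whether that fault corrupted the final output.
    The score is the fraction of faults in the structure that mattered.
    """
    return {name: sum(outcomes) / len(outcomes)
            for name, outcomes in injections.items() if outcomes}

# Hypothetical campaign over two structures of a stencil kernel.
campaign = {"grid": [True, False, True, False], "halo": [False, False]}
print(data_vulnerability(campaign))   # {'grid': 0.5, 'halo': 0.0}
```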
Due to shrinking feature sizes and significant reductions in noise margins as CMOS technologies evolve toward ultra-deep sub-micron, digital circuits have become more susceptible to soft errors. Researchers have therefore recently reported several approaches to model Single Event Transient (SET) propagation at gate or higher abstraction levels. However, contemporary techniques model only the possibility that an SET pulse may be masked electrically, logically, or by time windowing. In this paper, the propagation induced pulse broadening (PIPB) phenomenon is further investigated and a new model that abstracts this phenomenon is proposed. This paper also investigates and abstracts the impact of input patterns and propagation paths on SET pulse width. Through electrical simulations, we validate our analysis.