Overcoming early-life failure and aging for robust systems

Y Li, YM Kim, E Mintarno, DS Gardner, S Mitra
IEEE Design & Test of Computers, 2009 - ieeexplore.ieee.org
The prospect of system failure has increased because of device and chip-level effects in the late CMOS era. In this article, the authors present novel system-level architecture and design innovations to cope with these lifetime reliability challenges. At nanometer-scale geometries, several hardware failure mechanisms, which were largely benign in the past, are becoming visible at the system level. Moreover, recent studies indicate that, depending on the application, hardware failures can be significant contributors to overall system failure rates. Designing robust systems that ensure the required hardware reliability, although nontrivial, is achievable, but at high cost. Concurrent error detection during system operation is an extremely important aspect of such systems. Hardware reliability challenges arise from three major sources: early-life failures (ELF, also called infant mortality), radiation-induced soft errors, and circuit aging. Several techniques, such as Built-in Soft-Error Resilience (BISER), can be used effectively to correct radiation-induced transient (soft) errors. This article focuses on early-life failures and circuit aging. The techniques presented exploit specific characteristics of these reliability mechanisms without incurring the high costs of traditional concurrent error detection.