An architectural framework for detecting process hangs/crashes

N Nakka, GP Saggese, Z Kalbarczyk, RK Iyer
Dependable Computing-EDCC 5: 5th European Dependable Computing Conference …, 2005, Springer
Abstract
This paper addresses the challenges faced in the practical implementation of heartbeat-based process crash and hang detection. We propose an in-processor hardware module to reduce error detection latency and instrumentation overhead. Three hardware techniques integrated into the main pipeline of a superscalar processor are presented: (i) Instruction Count Heartbeat (ICH), which detects process crashes and a class of hangs where the process exists but is not executing any instructions; (ii) Infinite Loop Hang Detector (ILHD), which captures process hangs in infinite execution of legitimate loops; and (iii) Sequential Code Hang Detector (SCHD), which detects process hangs in illegal loops. The proposed design has the following unique features: 1) operating-system-neutral detection techniques, 2) elimination of any instrumentation for detection of all application crashes and OS hangs, and 3) an automated and lightweight compile-time instrumentation methodology to detect all process hangs (including infinite loops), the detection being performed in the hardware module at runtime. The proposed techniques can support heartbeat protocols to detect operating system/process crashes and hangs in distributed systems. Evaluation of the hang detection techniques shows a low 1.6% performance overhead and 6% memory overhead for the instrumentation. The crash detection technique does not incur any performance overhead and has a latency of a few instructions.
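To make the ICH idea concrete, the following is a minimal software sketch of an instruction-count heartbeat. It is not the paper's hardware design: the class name `InstructionCountHeartbeat`, the `timeout_cycles` threshold, and the per-cycle trace of retired-instruction counts are all assumptions introduced here for illustration. The monitor simply raises an alarm when the count of retired instructions stops advancing for a set number of cycles.

```python
from dataclasses import dataclass


@dataclass
class InstructionCountHeartbeat:
    """Toy software model of an instruction-count heartbeat (hypothetical names)."""
    timeout_cycles: int   # assumed threshold; not a value taken from the paper
    last_count: int = 0   # retired-instruction count seen on the previous cycle
    idle_cycles: int = 0  # consecutive cycles with no newly retired instruction

    def tick(self, retired_count: int) -> bool:
        """Advance one cycle; return True if a crash or hang is suspected."""
        if retired_count != self.last_count:
            # Progress was made: remember the new count and reset the idle timer.
            self.last_count = retired_count
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
        return self.idle_cycles >= self.timeout_cycles


def first_alarm(trace, timeout_cycles):
    """Return the index of the first cycle at which the monitor fires, or None."""
    monitor = InstructionCountHeartbeat(timeout_cycles)
    for cycle, retired in enumerate(trace):
        if monitor.tick(retired):
            return cycle
    return None


if __name__ == "__main__":
    stalled = [1, 2, 3, 3, 3, 3, 3]  # count stops advancing after cycle 2
    healthy = [1, 2, 3, 4, 5, 6, 7]  # count advances every cycle
    print(first_alarm(stalled, timeout_cycles=3))
    print(first_alarm(healthy, timeout_cycles=3))
```

A monitor of this kind only catches the case where no instructions are executing. A process stuck in an infinite loop keeps retiring instructions, so it would never trip this check, which is why the abstract lists separate detectors (ILHD and SCHD) for loop-related hangs.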