research-article

Open access

Hardware Fault Recovery for I/O Intensive Applications

Authors:

Pradeep Ramachandran,

Siva Kumar Sastry Hari,

Sarita V. Adve

ACM Transactions on Architecture and Code Optimization (TACO), Volume 11, Issue 3

Article No.: 33, Pages 1 - 25

https://doi.org/10.1145/2656342

Published: 27 October 2014

Abstract

With continued process scaling, the rate of hardware failures in commodity systems is increasing. Because these commodity systems are highly sensitive to cost, traditional solutions that employ heavy redundancy to handle such failures are no longer acceptable owing to their high associated costs.

Detecting such faults by identifying anomalous software execution and recovering through checkpoint-and-replay is emerging as a viable low-cost alternative for future commodity systems. An important but commonly ignored aspect of such solutions is ensuring that external outputs from the system are fault-free. These outputs must be delayed until the detectors guarantee that they are fault-free, and this delay affects performance even in fault-free execution. The overheads for resiliency must thus be evaluated while taking these delays into consideration; prior work has largely ignored this relationship.

This article concerns recovery from in-core faults for I/O intensive applications. We present a strategy to buffer external outputs using dedicated hardware and show that checkpoint intervals previously considered acceptable incur exorbitant overheads when hardware buffering is taken into account. We then present two techniques to reduce the checkpoint interval and demonstrate a practical solution that provides high resiliency while incurring low overheads.
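The output-delay requirement described in the abstract can be illustrated with a small software model. The sketch below is not the authors' hardware design; it is a toy example that assumes outputs are tagged with the checkpoint interval in which they were produced, that detectors confirm whole intervals at a time, and that a detected fault discards every output from the faulty interval onward. The names `OutputBuffer`, `commit`, `rollback`, and `demo` are hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class OutputBuffer:
    """Toy model: holds external outputs until their checkpoint interval is verified."""
    released: list = field(default_factory=list)   # outputs that reached the outside world
    pending: deque = field(default_factory=deque)  # (interval_id, payload) still held back
    current_interval: int = 0

    def write(self, payload):
        # An external output is never emitted directly; it is tagged and held.
        self.pending.append((self.current_interval, payload))

    def take_checkpoint(self):
        # Close the current interval and start the next one.
        self.current_interval += 1

    def commit(self, interval_id):
        # Detectors report no fault up to and including interval_id:
        # release the outputs produced in those intervals, in order.
        while self.pending and self.pending[0][0] <= interval_id:
            self.released.append(self.pending.popleft()[1])

    def rollback(self, interval_id):
        # Fault detected: drop outputs produced in interval_id or later
        # and resume execution from the start of that interval.
        self.pending = deque(p for p in self.pending if p[0] < interval_id)
        self.current_interval = interval_id


def demo():
    buf = OutputBuffer()
    buf.write("pkt0")            # produced in interval 0
    buf.take_checkpoint()
    buf.write("pkt1")            # produced in interval 1, possibly corrupted
    buf.commit(0)                # interval 0 verified: pkt0 is released
    buf.rollback(1)              # fault found in interval 1: pkt1 is discarded
    buf.write("pkt1-replayed")   # replay regenerates the output
    buf.take_checkpoint()
    buf.commit(1)                # interval 1 verified after replay
    return buf.released
```

In this model, an output waits in `pending` for at least the remainder of its checkpoint interval plus the detection latency, so longer intervals mean longer output delays and more buffered data. This is the relationship the abstract argues must be included when evaluating resiliency overheads.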

Published In

ACM Transactions on Architecture and Code Optimization Volume 11, Issue 3

October 2014

298 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2658949

Editor:
Koen De Bosschere
Ghent University

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2014

Accepted: 01 July 2014

Revised: 01 July 2014

Received: 01 July 2013

Published in TACO Volume 11, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation
OpenSPARC Center of Excellence at Illinois supported by Sun Microsystems
Intel Corporation
International Business Machines Corporation
Gigascale Systems Research Center (funded under FCRP, an SRC program)
Division of Computing and Communication Foundations
