Configurable Detection of SDC-causing Errors in Programs

Authors:

Karthik Pattabiraman,

Meeta S. Gupta,

Jude A. Rivers

ACM Transactions on Embedded Computing Systems (TECS), Volume 16, Issue 3

Article No.: 88, Pages 1 - 25

https://doi.org/10.1145/3014586

Published: 28 March 2017

Abstract

Silent Data Corruption (SDC) is a serious reliability issue in many domains, including embedded systems. However, current protection techniques are brittle and do not allow programmers to trade off performance for SDC coverage. Further, many require tens of thousands of fault-injection experiments, which are highly time- and resource-intensive. In this article, we propose two empirical models, SDCTune and SDCAuto, to predict the SDC proneness of a program’s data. Both models are based on static and dynamic features of the program alone and do not require fault injections to be performed. The main difference between them is that SDCTune requires manual tuning while SDCAuto is completely automated, using machine-learning algorithms.

We then develop an algorithm using both models to selectively protect the most SDC-prone data in the program subject to a given performance overhead bound. Our results show that both models are accurate at predicting the relative SDC rate of an application compared to fault injection, for a fraction of the time taken. Further, in terms of efficiency of detection (i.e., the ratio of SDC coverage provided to performance overhead), our technique outperforms full duplication by a factor of 0.78x to 1.65x with the SDCTune model and 0.62x to 0.96x with the SDCAuto model.
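The selection step described above resembles a 0/1 knapsack problem: pick the data whose protection yields the most predicted SDC coverage without exceeding the overhead bound. The following is a minimal sketch of that idea using a greedy ratio heuristic. It is not the article's algorithm; the `Candidate` type, its fields, and the sample values are hypothetical, and the scores are assumed to come from a model such as SDCTune or SDCAuto.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One protectable item (hypothetical; e.g. an instruction or variable)."""
    name: str
    sdc_score: float  # assumed model output: predicted share of the program's SDCs
    overhead: float   # assumed estimate: fractional slowdown if this item is protected


def select_for_protection(candidates, overhead_bound):
    """Greedy knapsack heuristic: take items in descending score/overhead order
    while the accumulated overhead stays within the bound."""
    def ratio(c):
        return c.sdc_score / c.overhead if c.overhead > 0 else float("inf")

    chosen, used = [], 0.0
    for c in sorted(candidates, key=ratio, reverse=True):
        if used + c.overhead <= overhead_bound:
            chosen.append(c)
            used += c.overhead
    return chosen, used


def detection_efficiency(coverage, overhead):
    """Efficiency of detection as defined in the abstract:
    SDC coverage provided divided by performance overhead."""
    return coverage / overhead


# Illustrative values only; they are not taken from the article.
items = [
    Candidate("a", sdc_score=0.5, overhead=0.25),
    Candidate("b", sdc_score=0.25, overhead=0.25),
    Candidate("c", sdc_score=0.125, overhead=0.5),
]
chosen, used = select_for_protection(items, overhead_bound=0.5)
coverage = sum(c.sdc_score for c in chosen)
print([c.name for c in chosen], used, detection_efficiency(coverage, used))
```

A greedy ratio ordering is only a heuristic for the knapsack problem; an exact dynamic-programming solver could be substituted without changing the interface.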

Index Terms

Configurable Detection of SDC-causing Errors in Programs
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Reliability
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance

Published In

ACM Transactions on Embedded Computing Systems Volume 16, Issue 3

Special Issue on Embedded Computing for IoT, Special Issue on Big Data and Regular Papers

August 2017

610 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/3072970

Editor:
Sandeep K. Shukla
Indian Institute of Technology, India

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 28 March 2017

Accepted: 01 November 2016

Revised: 01 August 2016

Received: 01 May 2016

Published in TECS Volume 16, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

Defense Advanced Research Projects Agency (DARPA)
Microsystems Technology Office (MTO)
Natural Science and Engineering Research Council of Canada (NSERC)
