Configurable Detection of SDC-causing Errors in Programs

Published: 28 March 2017

Abstract

Silent Data Corruption (SDC) is a serious reliability issue in many domains, including embedded systems. However, current protection techniques are brittle and do not allow programmers to trade off performance for SDC coverage. Further, many require tens of thousands of fault-injection experiments, which are highly time- and resource-intensive. In this article, we propose two empirical models, SDCTune and SDCAuto, to predict the SDC proneness of a program’s data. Both models are based on static and dynamic features of the program alone and do not require fault injections to be performed. The main difference between them is that SDCTune requires manual tuning while SDCAuto is completely automated, using machine-learning algorithms.
We then develop an algorithm using both models to selectively protect the most SDC-prone data in the program, subject to a given performance overhead bound. Our results show that both models accurately predict the relative SDC rate of an application compared to fault injection, in a fraction of the time. Further, in terms of efficiency of detection (i.e., the ratio of SDC coverage provided to performance overhead), our technique outperforms full duplication by a factor of 0.78x to 1.65x with the SDCTune model and 0.62x to 0.96x with the SDCAuto model.
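As a rough illustration of selecting data to protect under an overhead bound, the sketch below uses a greedy knapsack-style heuristic. It is not the article's algorithm: the `Candidate` type, its fields, the function names, and all numbers are hypothetical, and the per-candidate scores simply stand in for whatever a model such as SDCTune or SDCAuto would predict. The second function computes the efficiency metric as the abstract defines it (SDC coverage divided by performance overhead).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # All fields are illustrative assumptions, not taken from the article.
    name: str         # hypothetical label for a protectable instruction or variable
    sdc_score: float  # predicted SDC proneness (stand-in for a model's output)
    overhead: float   # estimated cost of protecting it, as a fraction of runtime (> 0)


def select_for_protection(candidates, overhead_bound):
    """Greedy heuristic: rank by score per unit overhead, take what still fits."""
    ranked = sorted(candidates, key=lambda c: c.sdc_score / c.overhead, reverse=True)
    chosen, spent = [], 0.0
    for c in ranked:
        if spent + c.overhead <= overhead_bound:
            chosen.append(c)
            spent += c.overhead
    return chosen, spent


def detection_efficiency(sdc_coverage, overhead):
    """Efficiency of detection: SDC coverage provided per unit of performance overhead."""
    if overhead <= 0:
        raise ValueError("overhead must be positive")
    return sdc_coverage / overhead


# Made-up example values, chosen only to exercise the functions.
example = [
    Candidate("a", sdc_score=0.5, overhead=0.25),
    Candidate("b", sdc_score=0.3, overhead=0.5),
    Candidate("c", sdc_score=0.125, overhead=0.125),
]
chosen, spent = select_for_protection(example, overhead_bound=0.5)
print([c.name for c in chosen], spent)
```

A greedy ranking like this is only an approximation to an exact knapsack solution, but it makes the trade-off visible: raising `overhead_bound` admits more candidates, while candidates with a low score-to-overhead ratio are the first to be left unprotected.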

Published In

ACM Transactions on Embedded Computing Systems  Volume 16, Issue 3
Special Issue on Embedded Computing for IoT, Special Issue on Big Data and Regular Papers
August 2017
610 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3072970
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

Publication History

Published: 28 March 2017
Accepted: 01 November 2016
Revised: 01 August 2016
Received: 01 May 2016
Published in TECS Volume 16, Issue 3

Author Tags

  1. Fault tolerance
  2. compiler
  3. error detection
  4. modeling
  5. reliability

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Defense Advanced Research Projects Agency (DARPA)
  • Microsystems Technology Office (MTO)
  • Natural Sciences and Engineering Research Council of Canada (NSERC)

Cited By

  • (2024) A Visual Comparison of Silent Error Propagation. IEEE Transactions on Visualization and Computer Graphics 30:7 (3268-3282). DOI: 10.1109/TVCG.2022.3230636. Online publication date: 1-Jul-2024
  • (2024) Silent Data Corruption in Robot Operating System: A Case for End-to-End System-Level Fault Analysis Using Autonomous UAVs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:4 (1037-1050). DOI: 10.1109/TCAD.2023.3332293. Online publication date: 1-Apr-2024
  • (2024) ApproxDup: Developing an Approximate Instruction Duplication Mechanism for Efficient SDC Detection in GPGPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:4 (1051-1064). DOI: 10.1109/TCAD.2023.3330821. Online publication date: 1-Apr-2024
  • (2024) An automated framework for selectively tolerating SDC errors based on rigorous instruction-level vulnerability assessment. Future Generation Computer Systems 157:C (392-407). DOI: 10.1016/j.future.2024.04.006. Online publication date: 18-Jul-2024
  • (2023) Demystifying and Mitigating Cross-Layer Deficiencies of Soft Error Protection in Instruction Duplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). DOI: 10.1145/3581784.3607078. Online publication date: 12-Nov-2023
  • (2023) Reliability Analysis for Programs with Redundancy Computation for Soft Errors. Journal of Physics: Conference Series 2522:1 (012022). DOI: 10.1088/1742-6596/2522/1/012022. Online publication date: 1-Jun-2023
  • (2022) Trace-and-brace (TAB): bespoke software countermeasures against soft errors. Proceedings of the 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (73-85). DOI: 10.1145/3519941.3535070. Online publication date: 14-Jun-2022
  • (2022) Silent Data Corruption Estimation and Mitigation Without Fault Injection. IEEE Canadian Journal of Electrical and Computer Engineering 45:3 (318-327). DOI: 10.1109/ICJECE.2022.3189043. Online publication date: Oct-2023
  • (2021) SDC Error Detection by Exploring the Importance of Instruction Features. Wireless Algorithms, Systems, and Applications (351-363). DOI: 10.1007/978-3-030-85928-2_28. Online publication date: 2-Sep-2021
  • (2020) Improving the Accuracy of IR-level Fault Injection. IEEE Transactions on Dependable and Secure Computing (1-1). DOI: 10.1109/TDSC.2020.2980273. Online publication date: 2020
