DOI: 10.1145/3437801.3441589
research-article

Understanding a program's resiliency through error propagation

Published: 17 February 2021

Abstract

Aggressive technology scaling trends have worsened the transient fault problem in high-performance computing (HPC) systems. Some faults are benign, but others can lead to silent data corruption (SDC), a serious problem in which a fault introduces an error into an HPC simulation that is not readily detected. Due to the insidious nature of SDCs, researchers have worked to understand their impact on applications. Previous studies have relied on expensive fault injection campaigns with uniform sampling to provide overall SDC rates, but this approach provides no feedback on code regions that receive no samples.
In this research, we develop a method to systematically analyze all fault injection sites in an application with a low number of fault injection experiments. We use fault propagation data from a fault injection experiment to predict the resiliency of other untested fault sites and obtain an approximate fault tolerance threshold value for each site, which represents the largest error that can be introduced at the site without incurring incorrect simulation results. We define the collection of threshold values over all fault sites in the program as a fault tolerance boundary and propose a simple but efficient method to approximate the boundary. In our experiments, we show our method reduces the number of fault injection samples required to understand a program's resiliency by several orders of magnitude when compared with a traditional fault injection study.
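The idea of a per-site threshold and a boundary over all sites can be illustrated with a small sketch. This is not the authors' implementation: the toy "program" (a chain of floating-point steps), the names `run` and `update_boundary`, and the additive error model are all hypothetical. The sketch also assumes that an error smaller than one already observed to be harmless at a site is itself harmless, which may not hold for real applications. Under those assumptions, one fault injection run whose output stays within tolerance yields a lower bound on the threshold of every site the error propagated through.

```python
from typing import Callable, List, Optional

# Hypothetical toy model: a "program" is a chain of floating-point steps,
# and the result of each step is one fault injection site.
Step = Callable[[float], float]


def run(steps: List[Step], x0: float,
        site: Optional[int] = None, error: float = 0.0) -> List[float]:
    """Execute the chain and return the value produced at every site.

    If `site` is given, `error` is added to that site's result to mimic a fault.
    """
    values = []
    x = x0
    for i, step in enumerate(steps):
        x = step(x)
        if i == site:
            x += error
        values.append(x)
    return values


def update_boundary(boundary: List[float], steps: List[Step], x0: float,
                    site: int, error: float, tolerance: float) -> bool:
    """Run one injection and, if the output is acceptable, raise thresholds.

    Assumption (not from the paper): a smaller error at a site is never worse
    than a larger one, so every deviation seen in an acceptable run is a
    lower bound on that site's threshold.
    """
    golden = run(steps, x0)                # fault-free reference run
    faulty = run(steps, x0, site, error)   # run with one injected error
    if abs(faulty[-1] - golden[-1]) > tolerance:
        return False                       # corrupted output: learn nothing here
    for j in range(site, len(steps)):
        boundary[j] = max(boundary[j], abs(faulty[j] - golden[j]))
    return True


steps = [lambda x: x * 0.5, lambda x: x + 1.0, lambda x: x * 0.5]
boundary = [0.0] * len(steps)
ok = update_boundary(boundary, steps, 4.0, site=0, error=0.5, tolerance=0.25)
print(ok, boundary)  # True [0.5, 0.5, 0.25]
```

In this toy run, a single injection at site 0 also produces threshold estimates for sites 1 and 2, which were never injected directly. That is the sense in which propagation data can stand in for additional experiments.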

Published In

PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2021
507 pages
ISBN: 9781450382946
DOI: 10.1145/3437801
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. error propagation
  2. fault tolerance
  3. natural resilience
  4. silent data corruption

Qualifiers

  • Research-article

Conference

PPoPP '21

Acceptance Rates

PPoPP '21 Paper Acceptance Rate: 31 of 150 submissions, 21%
Overall Acceptance Rate: 230 of 1,014 submissions, 23%

Cited By

  • (2025) Prediction of instruction SDC vulnerability in routing algorithms based on graph convolutional network. Fourth International Conference on Network Communication and Information Security (ICNCIS 2024). DOI: 10.1117/12.3052148 (40). Online publication date: 15-Jan-2025
  • (2024) A Visual Comparison of Silent Error Propagation. IEEE Transactions on Visualization and Computer Graphics. DOI: 10.1109/TVCG.2022.3230636. 30:7 (3268-3282). Online publication date: Jul-2024
  • (2024) HAppA: A Modular Platform for HPC Application Resilience Analysis with LLMs Embedded. 2024 43rd International Symposium on Reliable Distributed Systems (SRDS). DOI: 10.1109/SRDS64841.2024.00015 (40-51). Online publication date: 30-Sep-2024
  • (2024) Versatile Datapath Soft Error Detection on the Cheap for HPC Applications. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. DOI: 10.1109/SC41406.2024.00061 (1-15). Online publication date: 17-Nov-2024
  • (2024) An automated framework for selectively tolerating SDC errors based on rigorous instruction-level vulnerability assessment. Future Generation Computer Systems. DOI: 10.1016/j.future.2024.04.006. 157:C (392-407). Online publication date: 18-Jul-2024
  • (2023) Demystifying and Mitigating Cross-Layer Deficiencies of Soft Error Protection in Instruction Duplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. DOI: 10.1145/3581784.3607078 (1-13). Online publication date: 12-Nov-2023
  • (2023) Visilience: An Interactive Visualization Framework for Resilience Analysis using Control-Flow Graph. 2023 IEEE 28th Pacific Rim International Symposium on Dependable Computing (PRDC). DOI: 10.1109/PRDC59308.2023.00041 (250-256). Online publication date: 24-Oct-2023
  • (2023) Characterizing Runtime Performance Variation in Error Detection by Duplicating Instructions. 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE). DOI: 10.1109/ISSRE59848.2023.00043 (730-741). Online publication date: 9-Oct-2023
  • (2022) Mitigating Silent Data Corruptions in HPC Applications across Multiple Program Inputs. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. DOI: 10.1109/SC41404.2022.00022 (1-14). Online publication date: Nov-2022
  • (2022) GCFI: A High Accurate Compiler-based Fault Injection for Transient Hardware Faults. 2022 CPSSI 4th International Symposium on Real-Time and Embedded Systems and Technologies (RTEST). DOI: 10.1109/RTEST56034.2022.9850187 (1-8). Online publication date: 30-May-2022
