DOI: 10.1145/3126908.3126972
research-article

REFINE: realistic fault injection via compiler-based instrumentation for accuracy, portability and speed

Published: 12 November 2017

Abstract

Compiler-based fault injection (FI) has become a popular technique in resilience studies for understanding the impact of soft errors in supercomputing systems. Compiler-based FI frameworks inject faults at a high-level intermediate representation. However, they are less accurate than machine-code, binary-level FI because they lack access to all dynamic instructions and therefore fail to mimic certain fault manifestations. In this paper, we study the limitations of current practices in compiler-based FI and how they affect the interpretation of results in resilience studies.
We propose REFINE, a novel framework that addresses these limitations by performing FI in a compiler backend. Our approach provides the portability and efficiency of compiler-based FI while keeping accuracy comparable to binary-level FI methods. We demonstrate our approach on 14 HPC programs and show that, owing to its design, its runtime overhead is significantly smaller than that of state-of-the-art compiler-based FI frameworks, reducing the time needed for large FI experiments.
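
The abstract does not spell out a fault model. As a rough illustration only, the sketch below assumes a model that is common in soft-error studies: a single bit flip in the output of one randomly chosen dynamic instruction. The names `flip_bit`, `pick_fault`, and `run_with_fault`, and the 64-bit word width, are hypothetical and are not taken from REFINE.

```python
import random

WORD_BITS = 64  # assumed register width; not specified in the abstract


def flip_bit(value: int, bit: int, width: int = WORD_BITS) -> int:
    """Return `value` with one bit inverted, truncated to `width` bits."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


def pick_fault(num_dynamic_instructions: int, rng: random.Random,
               width: int = WORD_BITS) -> tuple[int, int]:
    """Choose uniformly which dynamic instruction and which bit to corrupt."""
    return rng.randrange(num_dynamic_instructions), rng.randrange(width)


def run_with_fault(trace: list[int], target_index: int, bit: int) -> list[int]:
    """Replay a list of instruction outputs, corrupting exactly one of them."""
    return [flip_bit(v, bit) if i == target_index else v
            for i, v in enumerate(trace)]


if __name__ == "__main__":
    rng = random.Random(0)      # fixed seed so an experiment is repeatable
    trace = [10, 20, 30, 40]    # stand-in for per-instruction output values
    index, bit = pick_fault(len(trace), rng)
    print(index, bit, run_with_fault(trace, index, bit))
```

In this toy setting the "trace" is just a list of integers. The abstract's point is that where the instrumentation sits (high-level intermediate representation versus compiler backend) determines which dynamic instructions are available as injection targets at all.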

Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2017
801 pages
ISBN:9781450351140
DOI:10.1145/3126908
  • General Chair: Bernd Mohr
  • Program Chair: Padma Raghavan
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. compiler-based instrumentation
  2. fault injection
  3. high-performance computing
  4. resilience

Qualifiers

  • Research-article

Funding Sources

  • Engineering and Physical Sciences Research Council (UK)
  • European Commission (H2020-EU)

Conference

SC '17

Acceptance Rates

SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 17
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 06 Oct 2024

Cited By

  • (2024) Gem5-MARVEL: Microarchitecture-Level Resilience Analysis of Heterogeneous SoC Architectures. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 543-559. DOI: 10.1109/HPCA57654.2024.00047. Online publication date: 2-Mar-2024
  • (2023) Optimization-Aware Compiler-Level Event Profiling. ACM Transactions on Programming Languages and Systems 45:2, 1-50. DOI: 10.1145/3591473. Online publication date: 26-Jun-2023
  • (2023) Understanding System Resilience for Converged Computing of Cloud, Edge, and HPC. High Performance Computing, 221-233. DOI: 10.1007/978-3-031-40843-4_17. Online publication date: 25-Aug-2023
  • (2022) Resiliency in numerical algorithm design for extreme scale simulations. International Journal of High Performance Computing Applications 36:2, 251-285. DOI: 10.1177/10943420211055188. Online publication date: 1-Mar-2022
  • (2022) Near-Zero Downtime Recovery From Transient-Error-Induced Crashes. IEEE Transactions on Parallel and Distributed Systems 33:4, 765-778. DOI: 10.1109/TPDS.2021.3096055. Online publication date: 1-Apr-2022
  • (2022) Mitigating Silent Data Corruptions in HPC Applications across Multiple Program Inputs. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1109/SC41404.2022.00022. Online publication date: Nov-2022
  • (2022) GCFI: A High Accurate Compiler-based Fault Injection for Transient Hardware Faults. 2022 CPSSI 4th International Symposium on Real-Time and Embedded Systems and Technologies (RTEST), 1-8. DOI: 10.1109/RTEST56034.2022.9850187. Online publication date: 30-May-2022
  • (2022) Instruction-aware Learning-based Timing Error Models through Significance-driven Approximations. 2022 IEEE 40th International Conference on Computer Design (ICCD), 455-462. DOI: 10.1109/ICCD56317.2022.00074. Online publication date: Oct-2022
  • (2022) Error resilience of three GMRES implementations under fault injection. The Journal of Supercomputing 78:5, 7158-7185. DOI: 10.1007/s11227-021-04148-x. Online publication date: 1-Apr-2022
  • (2021) PEPPA-X. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. DOI: 10.1145/3458817.3476147. Online publication date: 14-Nov-2021
