DOI: 10.1145/2854038.2854059
research-article

IPAS: intelligent protection against silent output corruption in scientific applications

Published: 29 February 2016

Abstract

This paper presents IPAS, an instruction duplication technique that protects scientific applications from silent data corruption (SDC) in their output. The motivation for IPAS is that, due to natural error masking, only a subset of SDC errors actually affects the output of scientific codes—we call these errors silent output corruption (SOC) errors. Thus applications require duplication only on code that, when affected by a fault, yields SOC. We use machine learning to learn code instructions that must be protected to avoid SOC, and, using a compiler, we protect only those vulnerable instructions by duplication, thus significantly reducing the overhead that is introduced by instruction duplication. In our experiments with five workloads, IPAS reduces the percentage of SOC by up to 90% with a slowdown that ranges between 1.04x and 1.35x, which corresponds to as much as 47% less slowdown than state-of-the-art instruction duplication techniques.
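The distinction the abstract draws between masked faults and SOC, and the idea of duplicating only selected instructions, can be illustrated with a toy sketch. This is not the IPAS implementation, which works inside a compiler: the three-step `OPS` pipeline, the `run` and `flip_bit` helpers, and the hand-written `protected` set (standing in for the output of the learned classifier) are all assumptions made for illustration.

```python
import struct


class CorruptionDetected(Exception):
    """Raised when a duplicated operation disagrees with the original."""


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE-754 double encoding flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


# Hypothetical three-instruction "program": (name, operation) pairs.
OPS = [
    ("scale", lambda v: v * 2.0),
    ("shift", lambda v: v + 1.0),
    ("square", lambda v: v * v),
]


def run(ops, x, protected, fault_at=None, bit=0):
    """Execute ops in order and return the output at 6-digit precision.

    protected: names of operations to duplicate and compare (a stand-in
               for the set a trained classifier would select).
    fault_at:  name of the operation whose result receives a simulated
               single-bit transient fault.
    """
    for name, fn in ops:
        result = fn(x)
        if name == fault_at:
            result = flip_bit(result, bit)  # simulated soft error
        if name in protected:
            shadow = fn(x)  # duplicated computation
            if shadow != result:
                raise CorruptionDetected(name)
        x = result
    # The application only reports 6 decimal digits, so tiny errors vanish.
    return round(x, 6)


golden = run(OPS, 3.0, protected=set())
masked = run(OPS, 3.0, protected=set(), fault_at="shift", bit=0)
soc = run(OPS, 3.0, protected=set(), fault_at="shift", bit=52)
print(golden, masked, soc)
```

With no protection, a flip in the lowest mantissa bit is masked by the output precision, while a flip in an exponent bit changes the output silently. Passing `protected={"shift"}` makes the second case raise `CorruptionDetected`, at the cost of executing that one operation twice rather than duplicating every operation.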

        Published In

        CGO '16: Proceedings of the 2016 International Symposium on Code Generation and Optimization
        February 2016
        283 pages
        ISBN: 9781450337786
        DOI: 10.1145/2854038

        In-Cooperation

        • IEEE-CS: Computer Society

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Resilience
        2. compiler analysis
        3. high-performance computing
        4. machine learning

        Qualifiers

        • Research-article

        Conference

        CGO '16

        Acceptance Rates

        CGO '16 Paper Acceptance Rate 25 of 108 submissions, 23%;
        Overall Acceptance Rate 312 of 1,061 submissions, 29%

