DOI: 10.1145/2544137.2544140

Automated Algorithmic Error Resilience for Structured Grid Problems Based on Outlier Detection

Published: 16 October 2018

Abstract

In this paper, we propose automated algorithmic error resilience based on outlier detection. Our approach exploits the characteristic behavior of a class of applications to create metric functions that, in error-free operation, produce values conforming to a designed distribution or behavior, and that produce outlier values (i.e., values that do not conform to the designed distribution or behavior) when computations are affected by errors. For a robust algorithm that employs such an approach, error detection becomes equivalent to outlier detection. We can therefore use well-established, statistically rigorous outlier detection techniques to detect errors effectively and efficiently, and subsequently correct them. Our error-resilient algorithms incur significantly lower overhead than traditional hardware and software error resilience techniques. In addition, unlike previous approaches to application-based error resilience, our approach parameterizes the robustification process, so that large classes of applications can be transformed into robust applications automatically, using parser-based tools and minimal programmer effort. We demonstrate automated error resilience based on outlier detection for structured grid problems, leveraging the flexibility of algorithmic error resilience to achieve better application robustness and lower overhead than previous error resilience approaches. For non-iterative structured grid problems, we demonstrate a 2× to 3× improvement in output quality over the original algorithm with only 22% overhead, on average. For error-resilient iterative structured grid algorithms that tolerate error rates up to 10E-3 and achieve the same output quality as their error-free counterparts, average overhead is as low as 4.5%.
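To make the idea concrete, the following is a minimal sketch of error detection as outlier detection on a structured grid. It is not the implementation from the paper: the averaging stencil, the choice of metric (the per-cell update magnitude between sweeps), the three-sigma threshold, and the function names `stencil_step`, `find_outliers`, and `repair` are all assumptions made for illustration.

```python
import statistics


def average_neighbors(grid, i, j):
    """Four-point average of the neighbors of interior cell (i, j)."""
    return 0.25 * (grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1])


def stencil_step(grid):
    """One averaging sweep over the interior of a 2D grid; boundaries are kept."""
    new = [row[:] for row in grid]
    for i in range(1, len(grid) - 1):
        for j in range(1, len(grid[0]) - 1):
            new[i][j] = average_neighbors(grid, i, j)
    return new


def find_outliers(old, new, k=3.0):
    """Return interior cells whose update magnitude lies more than k standard
    deviations from the mean update magnitude (hypothetical metric function)."""
    deltas = {
        (i, j): abs(new[i][j] - old[i][j])
        for i in range(1, len(old) - 1)
        for j in range(1, len(old[0]) - 1)
    }
    values = list(deltas.values())
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0.0:
        return []  # all updates identical: nothing stands out
    return [cell for cell, d in deltas.items() if abs(d - mean) > k * sigma]


def repair(old, new, cells):
    """Correct flagged cells by recomputing them from the previous grid."""
    for i, j in cells:
        new[i][j] = average_neighbors(old, i, j)


# Usage: a smooth 10x10 grid, one sweep, and one injected error.
grid = [[0.01 * (i * i + j * j) for j in range(10)] for i in range(10)]
result = stencil_step(grid)
result[4][5] = 1e6  # simulated corruption of a single cell
flagged = find_outliers(grid, result)
repair(grid, result, flagged)
print(flagged)
```

Only the flagged cells are recomputed, which is where the low overhead of this style of detection would come from in the sketch. A mean-and-standard-deviation threshold is itself sensitive to large outliers, so a production version would likely prefer a more robust statistic.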


Cited By

  • (2016) In-Situ Mitigation of Silent Data Corruption in PDE Solvers. Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, pp. 43-48. DOI: 10.1145/2909428.2909433. Online publication date: 31 May 2016.

    Published In

    CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
    February 2014
    328 pages
    ISBN: 9781450326704
    DOI: 10.1145/2581122

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. algorithmic error resilience
    2. application robustification
    3. outlier detection
    4. structured grids

    Qualifiers

    • Tutorial
    • Refereed limited

    Conference

    CGO '14

    Acceptance Rates

    CGO '14 Paper Acceptance Rate 29 of 100 submissions, 29%;
    Overall Acceptance Rate 312 of 1,061 submissions, 29%
