research-article

Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications

Published: 01 October 2016

Abstract

For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous problems because nothing indicates that errors occurred during execution. We propose an adaptive impact-driven method that can detect SDCs dynamically. The key contributions are threefold. (1) We carefully characterize 18 HPC applications/benchmarks and discuss the runtime data features, as well as the impact of SDCs on their execution results. (2) We propose an impact-driven detection model that does not blindly improve the prediction accuracy, but instead detects only influential SDCs to guarantee user-acceptable execution results. (3) Our solution can adapt to dynamic prediction errors based on local runtime data and can automatically tune detection ranges to guarantee low false alarm rates. Experiments show that our detector can detect 80 to 99.99 percent of SDCs with a false alarm rate of less than 1 percent of iterations in most cases. The memory cost and detection overhead are reduced to 15 and 6.3 percent, respectively, for a large majority of applications.
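
The general idea of prediction-based detection with an adaptive range can be illustrated with a minimal sketch. This is not the paper's algorithm: the class name `AdaptiveRangeDetector`, the linear extrapolation predictor, and the parameters `impact_bound`, `window`, and `scale` are all assumptions made for illustration. The sketch predicts the next value of one variable from its last two values, widens the detection range according to recently observed prediction errors, and never lets the range shrink below a user-chosen bound, so deviations too small to matter are not reported.

```python
from collections import deque


class AdaptiveRangeDetector:
    """Hypothetical single-variable detector: linear extrapolation plus an adaptive range."""

    def __init__(self, impact_bound, window=8, scale=2.0):
        # impact_bound: smallest deviation the user considers influential (assumed parameter).
        self.impact_bound = impact_bound
        # scale: safety factor applied to the largest recent prediction error (assumed).
        self.scale = scale
        self.history = deque(maxlen=2)      # last two accepted values
        self.errors = deque(maxlen=window)  # recent prediction errors of accepted values

    def predict(self):
        """Linearly extrapolate the next value; None until two values are known."""
        if len(self.history) < 2:
            return None
        older, newer = self.history
        return 2 * newer - older

    def radius(self):
        """Detection radius: tracks recent prediction error, floored at impact_bound."""
        recent = max(self.errors, default=0.0)
        return max(self.impact_bound, self.scale * recent)

    def check(self, value):
        """Return True if value falls outside the detection range around the prediction."""
        predicted = self.predict()
        suspicious = False
        if predicted is not None:
            error = abs(value - predicted)
            suspicious = error > self.radius()
            if not suspicious:
                self.errors.append(error)
        if not suspicious:
            # Only accepted values update the history, so a flagged value
            # does not pollute later predictions.
            self.history.append(value)
        return suspicious


detector = AdaptiveRangeDetector(impact_bound=0.05)
smooth_flags = [detector.check(i * 0.1) for i in range(20)]  # smooth data: 0.0 .. 1.9
corrupted_flag = detector.check(2.0 + 1024.0)  # large deviation, e.g. a high-order bit flip
recomputed_flag = detector.check(2.0)          # the value a recomputation would yield
minor_flag = detector.check(2.1 + 0.01)        # deviation below impact_bound
```

In this sketch a flagged value is simply discarded; a real application would more likely roll back or recompute the affected iteration. The floor on the radius is what makes the check impact-driven in spirit: a deviation below `impact_bound` is accepted even though it differs from the prediction.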

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 27, Issue 10
October 2016
307 pages

Publisher

IEEE Press

Cited By

  • (2023) Recovering Detectable Uncorrectable Errors via Spatial Data Prediction. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 507-515. DOI: 10.1145/3624062.3624120. Online publication date: 12-Nov-2023
  • (2023) Evaluating the Resiliency of Posits for Scientific Computing. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 477-487. DOI: 10.1145/3624062.3624116. Online publication date: 12-Nov-2023
  • (2023) Anomaly Detection in Scientific Datasets using Sparse Representation. Proceedings of the First Workshop on AI for Systems, pp. 13-18. DOI: 10.1145/3588982.3603610. Online publication date: 10-Aug-2023
  • (2023) Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs. Proceedings of the 37th International Conference on Supercomputing, pp. 360-372. DOI: 10.1145/3577193.3593715. Online publication date: 21-Jun-2023
  • (2023) FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs. IEEE Transactions on Parallel and Distributed Systems 34:12, pp. 3207-3223. DOI: 10.1109/TPDS.2023.3316011. Online publication date: 1-Dec-2023
  • (2022) Resiliency in numerical algorithm design for extreme scale simulations. International Journal of High Performance Computing Applications 36:2, pp. 251-285. DOI: 10.1177/10943420211055188. Online publication date: 1-Mar-2022
  • (2022) Efficient detection of silent data corruption in HPC applications with synchronization-free message verification. The Journal of Supercomputing 78:1, pp. 1381-1408. DOI: 10.1007/s11227-021-03892-4. Online publication date: 1-Jan-2022
  • (2021) FT-BLAS. Proceedings of the 35th ACM International Conference on Supercomputing, pp. 127-138. DOI: 10.1145/3447818.3460364. Online publication date: 3-Jun-2021
  • (2021) SDC Error Detection by Exploring the Importance of Instruction Features. Wireless Algorithms, Systems, and Applications, pp. 351-363. DOI: 10.1007/978-3-030-85928-2_28. Online publication date: 25-Jun-2021
  • (2020) Predictive Reliability and Fault Management in Exascale Systems. ACM Computing Surveys 53:5, pp. 1-32. DOI: 10.1145/3403956. Online publication date: 28-Sep-2020
