DOI: 10.1145/3295500.3356158
Research article · Public Access

Addressing data resiliency for staging based scientific workflows

Published: 17 November 2019

Abstract

As applications move towards extreme scales, data-related challenges are becoming significant concerns, and in-situ workflows based on data staging and in-situ/in-transit data processing have been proposed to address these challenges. Increasing scale is also expected to increase the rate of silent data corruption errors, which impact both the correctness and performance of applications. This impact is amplified for in-situ workflows because of the dataflow between the component applications of the workflow. While existing research has explored silent error detection at the application level, silent error detection for workflows remains an open challenge. This paper addresses silent error detection for extreme-scale in-situ workflows. The presented approach leverages idle computation resources in data staging to enable timely detection of and recovery from silent data corruption, effectively reducing both the propagation of corrupted data and the end-to-end workflow execution time in the presence of silent errors. As an illustration of this approach, we use spatial outlier detection in staging to detect errors introduced during data transfer and storage. We also provide a CPU-GPU hybrid staging framework for error detection to achieve faster error identification. We have implemented our approach within the DataSpaces staging service and evaluated it using both synthetic and real workflows on a Cray XK7 system (Titan) at different scales. We demonstrate that, in the presence of silent errors, enabling error detection on staged data alongside a checkpoint/restart scheme reduces the total in-situ workflow execution time by up to 22% compared with using checkpoint/restart alone.
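To make the staging-side check concrete, the following is a minimal, illustrative sketch of the kind of spatial outlier test the abstract describes: each cell of a staged 2D field is compared against the median of its immediate neighbors, and cells that deviate by more than a robust threshold are flagged as suspected silent data corruption. The function name flag_spatial_outliers, the 3x3 neighborhood, and the MAD-based threshold k are assumptions made for illustration; they are not taken from the paper's DataSpaces implementation.

import numpy as np

def flag_spatial_outliers(field, k=5.0):
    """Return a boolean mask marking cells that look like silent data corruption.

    Illustrative sketch only: neighborhood size and threshold k are assumptions,
    not the paper's actual detector.
    """
    # Pad by one cell so every cell has a full 3x3 neighborhood.
    padded = np.pad(field, 1, mode="edge")
    rows, cols = field.shape
    # Gather the 8 neighbors of every cell (the center, i == j == 1, is skipped).
    neighbors = np.stack([
        padded[i:i + rows, j:j + cols]
        for i in range(3) for j in range(3)
        if not (i == 1 and j == 1)
    ])
    residual = field - np.median(neighbors, axis=0)
    # Robust spread estimate: median absolute deviation scaled to approximate sigma.
    mad = 1.4826 * np.median(np.abs(residual)) + 1e-30
    return np.abs(residual) > k * mad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 4.0, 64)
    # A smooth synthetic field with mild noise, standing in for staged simulation data.
    field = np.sin(x)[:, None] * np.cos(x)[None, :] + rng.normal(scale=0.01, size=(64, 64))
    field[10, 20] += 1.0e3   # inject a bit-flip-like corruption into the staged data
    print(np.argwhere(flag_spatial_outliers(field)))   # expected to report [10, 20]

In the setting the abstract describes, a check of this kind would run on otherwise idle staging resources (or on GPUs in the hybrid framework) as data arrives, so that corrupted regions can be caught before downstream components of the workflow consume them.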


    Published In

    SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2019, 1921 pages
    ISBN: 9781450362290
    DOI: 10.1145/3295500

    In-Cooperation: IEEE CS

    Publisher: Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. data staging
    2. error detection
    3. fault tolerance
    4. in-situ workflows
    5. silent data corruption



    Cited By

    • (2023) Dynamic Data-Driven Application Systems for Reservoir Simulation-Based Optimization: Lessons Learned and Future Trends. Handbook of Dynamic Data Driven Applications Systems, pages 287-330. DOI: 10.1007/978-3-031-27986-7_11. Online publication date: 6-Sep-2023.
    • (2021) Bootstrapping in-situ workflow auto-tuning via combining performance models of component applications. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. DOI: 10.1145/3458817.3476197. Online publication date: 14-Nov-2021.
    • (2021) BAASH. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-18. DOI: 10.1145/3458817.3476155. Online publication date: 14-Nov-2021.
    • (2021) RISE: Reducing I/O Contention in Staging-based Extreme-Scale In-situ Workflows. 2021 IEEE International Conference on Cluster Computing (CLUSTER), pages 146-156. DOI: 10.1109/Cluster48925.2021.00021. Online publication date: Sep-2021.
    • (2020) Scalable Crash Consistency for Staging-based In-situ Scientific Workflows. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 340-348. DOI: 10.1109/IPDPSW50202.2020.00068. Online publication date: May-2020.
