research-article

Fault-Aware Runtime Strategies for High-Performance Computing

Authors:

Yawei Li,

Zhiling Lan,

Prashasta Gujrati,

Xian-He Sun

IEEE Transactions on Parallel and Distributed Systems, Volume 20, Issue 4

Pages 460 - 473

https://doi.org/10.1109/TPDS.2008.128

Published: 01 April 2009

Abstract

As parallel systems continue to grow in scale, fault management is becoming a critical challenge. While existing research mainly focuses on developing or improving fault tolerance techniques, a number of key issues remain open. In this paper, we propose runtime strategies for spare node allocation and job rescheduling in response to failure prediction. Together with a failure predictor and fault tolerance techniques, these strategies form a runtime system called FARS (Fault-Aware Runtime System). In particular, we propose a 0-1 knapsack model and demonstrate its flexibility and effectiveness for reallocating running jobs to avoid failures. Experiments using synthetic data and real traces from production systems show that FARS has the potential to significantly improve system productivity (i.e., performance and reliability).
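The abstract names a 0-1 knapsack model but does not give its formulation here. As a rough illustration of the general technique only, the sketch below solves a textbook 0-1 knapsack by dynamic programming. The mapping is an assumption made for this example and is not taken from the paper: each job is an item, its node count is the weight, a hypothetical per-job gain is the value, and the number of spare nodes is the capacity. The function name `select_jobs` and all parameter names are likewise hypothetical.

```python
from typing import List, Tuple


def select_jobs(sizes: List[int], gains: List[float],
                spare_nodes: int) -> Tuple[float, List[int]]:
    """Textbook 0-1 knapsack via dynamic programming.

    sizes[i]     nodes job i would occupy if moved (assumed weight)
    gains[i]     benefit of moving job i (assumed value)
    spare_nodes  nodes available for reallocation (assumed capacity)
    Returns the best total gain and the indices of the chosen jobs.
    """
    n = len(sizes)
    # best[i][c] = maximum gain using only the first i jobs with capacity c
    best = [[0.0] * (spare_nodes + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = sizes[i - 1], gains[i - 1]
        for c in range(spare_nodes + 1):
            best[i][c] = best[i - 1][c]          # skip job i-1
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v  # take job i-1

    # Walk back through the table to recover which jobs were taken.
    chosen: List[int] = []
    c = spare_nodes
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= sizes[i - 1]
    chosen.reverse()
    return best[n][spare_nodes], chosen


# Three jobs needing 4, 2, and 3 nodes; 5 spare nodes available.
total, chosen = select_jobs([4, 2, 3], [10.0, 4.0, 7.0], 5)
print(total, chosen)
```

With these made-up numbers, moving the two smaller jobs together yields more gain than moving the single large job, which is the kind of trade-off a knapsack formulation captures. How FARS actually defines weights, values, and capacity is specified in the full paper, not on this page.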

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 20, Issue 4

April 2009

160 pages

ISSN: 1045-9219

Publisher

IEEE Press

Cited By

Rebbah M, Slimani Y, Benyettou A, Brunie L (2018). Reliable fault tolerant model for grid computing environments. Multiagent and Grid Systems 10:4 (213-232). Online publication date: 16-Dec-2018.
https://dl.acm.org/doi/10.3233/MGS-140224
Saadatfar H, Deldari H (2018). A job submission manager for large-scale distributed systems based on job futurity predictor. International Journal of Grid and Utility Computing 5:1 (50-59). Online publication date: 16-Dec-2018.
https://dl.acm.org/doi/10.1504/IJGUC.2014.058252
Zhu L, Gu J, Wang Y, Zhao T, Cai Z (2018). Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions. The Journal of Supercomputing 71:10 (3668-3694). Online publication date: 31-Dec-2018.
https://dl.acm.org/doi/10.1007/s11227-015-1458-0
Aupy G, Robert Y, Vivien F, Zaidouni D (2014). Checkpointing algorithms and fault prediction. Journal of Parallel and Distributed Computing 74:2 (2048-2064). Online publication date: 1-Feb-2014.
https://dl.acm.org/doi/10.1016/j.jpdc.2013.10.010
