
Fault-Aware Runtime Strategies for High-Performance Computing

Published: 01 April 2009

Abstract

As the scale of parallel systems continues to grow, fault management of these systems is becoming a critical challenge. While existing research mainly focuses on developing or improving fault tolerance techniques, a number of key issues remain open. In this paper, we propose runtime strategies for spare node allocation and job rescheduling in response to failure prediction. These strategies, together with a failure predictor and fault tolerance techniques, form a runtime system called FARS (Fault-Aware Runtime System). In particular, we propose a 0-1 knapsack model and demonstrate its flexibility and effectiveness for reallocating running jobs to avoid failures. Experiments using synthetic data and real traces from production systems show that FARS has the potential to significantly improve system productivity (i.e., performance and reliability).
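
The abstract names a 0-1 knapsack model for reallocating running jobs but does not spell out its formulation. The following is only a minimal sketch of the generic 0-1 knapsack technique under stated assumptions: each job threatened by a predicted failure needs a whole number of spare nodes (the weight) and carries some benefit if it is moved (the value), and the capacity is the number of spare nodes available. The function name `select_jobs`, the tuple layout, and the meaning of `value` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Each job is (name, nodes_needed, value). All three fields are
# hypothetical stand-ins, not the paper's actual formulation.
Job = Tuple[str, int, float]


def select_jobs(jobs: List[Job], spare_nodes: int) -> Tuple[float, List[str]]:
    """Pick the subset of jobs to reallocate with a classic 0-1 knapsack DP.

    Returns the best total value and the names of the chosen jobs,
    in their original order.
    """
    n = len(jobs)
    # best[i][c] = maximum value using only the first i jobs with c spare nodes.
    best = [[0.0] * (spare_nodes + 1) for _ in range(n + 1)]
    for i, (_, need, value) in enumerate(jobs, start=1):
        for c in range(spare_nodes + 1):
            best[i][c] = best[i - 1][c]  # option 1: leave job i in place
            if need <= c:                # option 2: reallocate job i
                best[i][c] = max(best[i][c], best[i - 1][c - need] + value)

    # Walk the table backwards to recover which jobs were selected.
    chosen: List[str] = []
    c = spare_nodes
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, need, _ = jobs[i - 1]
            chosen.append(name)
            c -= need
    chosen.reverse()
    return best[n][spare_nodes], chosen
```

For example, with the made-up jobs `[("A", 4, 10.0), ("B", 3, 7.0), ("C", 2, 6.0)]` and 5 spare nodes, the sketch selects B and C (total value 13.0) rather than A alone (10.0), because each job is either moved in full or not at all. The table costs O(n × spare_nodes) time and memory, which is one reason an exact formulation like this can be practical at runtime when the spare pool is small.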

        Published In

        IEEE Transactions on Parallel and Distributed Systems, Volume 20, Issue 4
        April 2009
        160 pages

        Publisher

        IEEE Press

        Author Tags

        1. Fault-tolerance
        2. Parallel systems
        3. Performance
        4. Scheduling

        Qualifiers

        • Research-article

        Cited By

        • (2018) Reliable fault tolerant model for grid computing environments. Multiagent and Grid Systems 10:4 (213-232). DOI: 10.3233/MGS-140224. Online publication date: 16-Dec-2018
        • (2018) A job submission manager for large-scale distributed systems based on job futurity predictor. International Journal of Grid and Utility Computing 5:1 (50-59). DOI: 10.1504/IJGUC.2014.058252. Online publication date: 16-Dec-2018
        • (2018) Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions. The Journal of Supercomputing 71:10 (3668-3694). DOI: 10.1007/s11227-015-1458-0. Online publication date: 31-Dec-2018
        • (2014) Checkpointing algorithms and fault prediction. Journal of Parallel and Distributed Computing 74:2 (2048-2064). DOI: 10.1016/j.jpdc.2013.10.010. Online publication date: 1-Feb-2014
