DOI: 10.1145/1383422.1383426

Performability modeling for scheduling and fault tolerance strategies for scientific workflows

Published: 23 June 2008

    Abstract

    Scientific applications have diverse characteristics and resource requirements. When combined with the complexity of the underlying distributed resources on which they execute (e.g., Grid, cloud computing), these applications can experience significant performance fluctuations as machine reliability varies. Although the performance and reliability of cluster and Grid systems have been studied separately, there has been little analysis of the Quality of Service (QoS) lost as availability levels vary. To enable a dynamic environment that can account for such changes while providing the required QoS, next-generation tools will need extensible application interfaces that allow users to qualitatively express performance and reliability requirements for the underlying systems. In this paper, we use the concept of performability to capture the degraded performance that might result from varying resource availability. We apply the resulting model to workflow planning and fault tolerance strategies. We present experimental data to validate our model and use simulation results driven by failure data from real HPC systems to demonstrate how the proposed scheme better accounts for resource availability.
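To make the idea of performability concrete, the sketch below uses one common textbook formulation: expected steady-state reward, where each availability state of a resource pool has a probability and a delivered performance level. This is not necessarily the model developed in the paper. The `ResourceState` type, the `performability` function, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ResourceState:
    """One availability level of a resource pool (hypothetical example type)."""
    probability: float  # assumed steady-state probability of this state
    reward: float       # assumed performance in this state, e.g. tasks/hour


def performability(states: Iterable[ResourceState]) -> float:
    """Expected performance: sum of probability * reward over all states."""
    states = list(states)
    total = sum(s.probability for s in states)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(s.probability * s.reward for s in states)


# Illustrative numbers only; they are not taken from the paper.
states = [
    ResourceState(probability=0.90, reward=100.0),  # all nodes up
    ResourceState(probability=0.08, reward=60.0),   # degraded
    ResourceState(probability=0.02, reward=0.0),    # unavailable
]
expected = performability(states)
print(f"expected performance: {expected:.1f} tasks/hour")
```

Under these assumed numbers the pool delivers about 94.8 tasks/hour on average, below its 100 tasks/hour peak. A planner could compare candidate resource sets on this single figure rather than on peak performance or availability alone.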

        Published In

        HPDC '08: Proceedings of the 17th international symposium on High performance distributed computing
        June 2008
        252 pages
        ISBN:9781595939975
        DOI:10.1145/1383422
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. fault tolerance
        2. grid/cloud resource management
        3. workflow scheduling

        Qualifiers

        • Research-article

        Conference

        HPDC '08
        Acceptance Rates

        Overall Acceptance Rate 166 of 966 submissions, 17%

        Article Metrics

        • Downloads (Last 12 months)4
        • Downloads (Last 6 weeks)1
        Reflects downloads up to 26 Jul 2024

        Cited By

        • (2020) Harnessing the Computing Continuum for Programming Our World. Fog Computing, pp. 215-230. DOI: 10.1002/9781119551713.ch7. Online publication date: 25-Apr-2020
        • (2019) Performability Evaluation and Optimization of Workflow Applications in Cloud Environments. Journal of Grid Computing. DOI: 10.1007/s10723-019-09476-0. Online publication date: 17-Jan-2019
        • (2018) Fault tolerance aware scheduling technique for cloud computing environment using dynamic clustering algorithm. Neural Computing and Applications 29:1, pp. 279-293. DOI: 10.5555/3184485.3184496. Online publication date: 1-Jan-2018
        • (2018) GA-ETI: An enhanced genetic algorithm for the scheduling of scientific workflows in cloud environments. Journal of Computational Science 26, pp. 318-331. DOI: 10.1016/j.jocs.2016.08.007. Online publication date: May-2018
        • (2017) A balanced scheduler with data reuse and replication for scientific workflows in cloud computing systems. Future Generation Computer Systems 74:C, pp. 168-178. DOI: 10.1016/j.future.2015.12.005. Online publication date: 1-Sep-2017
        • (2017) A checkpointed league championship algorithm-based cloud scheduling scheme with secure fault tolerance responsiveness. Applied Soft Computing 61, pp. 670-680. DOI: 10.1016/j.asoc.2017.08.048. Online publication date: Dec-2017
        • (2016) Fault tolerance aware scheduling technique for cloud computing environment using dynamic clustering algorithm. Neural Computing and Applications 29:1, pp. 279-293. DOI: 10.1007/s00521-016-2448-8. Online publication date: 16-Jul-2016
        • (2013) A Survey of Scheduling and Management Techniques for Data-Intensive Application Workflows. Enterprise Resource Planning, pp. 1170-1190. DOI: 10.4018/978-1-4666-4153-2.ch066. Online publication date: 2013
        • (2012) A Survey of Scheduling and Management Techniques for Data-Intensive Application Workflows. Data Intensive Distributed Computing, pp. 156-176. DOI: 10.4018/978-1-61520-971-2.ch007. Online publication date: 2012
        • (2012) A bi-criteria scheduling process with CoS support on grids and clouds. Concurrency and Computation: Practice & Experience 24:13, pp. 1443-1460. DOI: 10.1002/cpe.1868. Online publication date: 1-Sep-2012
