Performability modeling for scheduling and fault tolerance strategies for scientific workflows

Authors:

Lavanya Ramakrishnan,

Daniel A. Reed

HPDC '08: Proceedings of the 17th international symposium on High performance distributed computing

Pages 23 - 34

https://doi.org/10.1145/1383422.1383426

Published: 23 June 2008

Abstract

Scientific applications have diverse characteristics and resource requirements. Combined with the complexity of the distributed resources on which they execute (e.g., Grid and cloud computing), these applications can experience significant performance fluctuations as machine reliability varies. Although the performance and reliability of cluster and Grid systems have been studied separately, there has been little analysis of the Quality of Service (QoS) lost at varying availability levels. To enable a dynamic environment that can account for such changes while providing the required QoS, next-generation tools will need extensible application interfaces that allow users to qualitatively express performance and reliability requirements for the underlying systems. In this paper, we use the concept of performability to capture the degraded performance that might result from varying resource availability. We apply the resulting model to workflow planning and fault tolerance strategies. We present experimental data to validate our model, and we use simulation results driven by failure data from real HPC systems to demonstrate how the proposed scheme better accounts for resource availability.
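To make the notion of performability concrete, the sketch below computes an expected performance level over a set of availability states, in the spirit of the classic reward-based formulation. It is an illustration only and not the model developed in the paper: the `ResourceState` type, the three states, their probabilities, and the throughput figures are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceState:
    """One availability state of a resource (hypothetical illustration)."""
    name: str
    probability: float  # assumed long-run probability of being in this state
    throughput: float   # assumed tasks/hour delivered while in this state


def expected_performability(states):
    """Return the probability-weighted performance across all states."""
    total = sum(s.probability for s in states)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(s.probability * s.throughput for s in states)


# Assumed example values, not measurements from the paper.
states = [
    ResourceState("all nodes up", 0.90, 100.0),
    ResourceState("degraded", 0.08, 60.0),
    ResourceState("down", 0.02, 0.0),
]
perf = expected_performability(states)
print(f"expected throughput: {perf:.1f} tasks/hour")
```

Under these assumed numbers the resource delivers about 94.8 tasks per hour on average, against a nominal 100 when fully available. A planner could compare candidate resources on this combined figure rather than on peak performance or availability alone.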

Index Terms

1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. General and reference
  1. Cross-computing tools and techniques
    1. Performance

Information

Published In

HPDC '08: Proceedings of the 17th international symposium on High performance distributed computing

June 2008

252 pages

ISBN:9781595939975

DOI:10.1145/1383422

General Chairs: Manish Parashar (Rutgers University, USA), Karsten Schwan (Georgia Institute of Technology, USA)

Program Chairs: Jon Weissman (National e-Science Center, Edinburgh; University of Minnesota, USA), Domenico Laforenza (Information Science and Technology Institute, CNR, Italy)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

HPDC '08

HPDC '08: International Symposium on High Performance Distributed Computing

June 23 - 27, 2008

Boston, MA, USA

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%

Bibliometrics & Citations

Article Metrics

Total Citations: 19

Total Downloads: 908

Downloads (last 12 months): 4

Downloads (last 6 weeks): 1

Reflects downloads up to 26 Jul 2024

Cited By

Beckman P, Dongarra J, Ferrier N, Fox G, Moore T, Reed D, Beck M (2020). Harnessing the Computing Continuum for Programming Our World. Fog Computing, pp. 215-230. Online publication date: 25-Apr-2020.
https://doi.org/10.1002/9781119551713.ch7
Oliveira D, Brinkmann A, Rosa N, Maciel P (2019). Performability Evaluation and Optimization of Workflow Applications in Cloud Environments. Journal of Grid Computing. Online publication date: 17-Jan-2019.
https://doi.org/10.1007/s10723-019-09476-0
Abdulhamid S, Abd Latiff M, Madni S, Abdullahi M (2018). Fault tolerance aware scheduling technique for cloud computing environment using dynamic clustering algorithm. Neural Computing and Applications 29:1, pp. 279-293. Online publication date: 1-Jan-2018.
https://dl.acm.org/doi/10.5555/3184485.3184496
Casas I, Taheri J, Ranjan R, Wang L, Zomaya A (2018). GA-ETI: An enhanced genetic algorithm for the scheduling of scientific workflows in cloud environments. Journal of Computational Science 26, pp. 318-331. Online publication date: May-2018.
https://doi.org/10.1016/j.jocs.2016.08.007
Casas I, Taheri J, Ranjan R, Wang L, Zomaya A (2017). A balanced scheduler with data reuse and replication for scientific workflows in cloud computing systems. Future Generation Computer Systems 74:C, pp. 168-178. Online publication date: 1-Sep-2017.
https://dl.acm.org/doi/10.1016/j.future.2015.12.005
Abdulhamid S, Latiff M (2017). A checkpointed league championship algorithm-based cloud scheduling scheme with secure fault tolerance responsiveness. Applied Soft Computing 61, pp. 670-680. Online publication date: Dec-2017.
https://doi.org/10.1016/j.asoc.2017.08.048
Abdulhamid S, Abd Latiff M, Madni S, Abdullahi M (2016). Fault tolerance aware scheduling technique for cloud computing environment using dynamic clustering algorithm. Neural Computing and Applications 29:1, pp. 279-293. Online publication date: 16-Jul-2016.
https://doi.org/10.1007/s00521-016-2448-8
Pandey S, Buyya R (2013). A Survey of Scheduling and Management Techniques for Data-Intensive Application Workflows. Enterprise Resource Planning, pp. 1170-1190. Online publication date: 2013.
https://doi.org/10.4018/978-1-4666-4153-2.ch066
Pandey S, Buyya R (2012). A Survey of Scheduling and Management Techniques for Data-Intensive Application Workflows. Data Intensive Distributed Computing, pp. 156-176. Online publication date: 2012.
https://doi.org/10.4018/978-1-61520-971-2.ch007
Kloh H, Schulze B, Pinto R, Mury A (2012). A bi-criteria scheduling process with CoS support on grids and clouds. Concurrency and Computation: Practice & Experience 24:13, pp. 1443-1460. Online publication date: 1-Sep-2012.
https://dl.acm.org/doi/10.1002/cpe.1868