research-article

VGrADS: enabling e-Science workflows on grids and clouds with fault tolerance

Authors: Lavanya Ramakrishnan, Charles Koelbel, Graziano Obertelli, Anirban Mandal, Kiran Thyagaraja, Dmitrii Zagorodnov

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

Article No.: 47, Pages 1 - 12

https://doi.org/10.1145/1654059.1654107

Published: 14 November 2009

Abstract

Today's scientific workflows use distributed heterogeneous resources through diverse grid and cloud interfaces that are often hard to program. In addition, especially for time-sensitive critical applications, predictable quality of service is necessary across these distributed resources. VGrADS' virtual grid execution system (vgES) provides a uniform qualitative resource abstraction over grid and cloud systems. We apply vgES to schedule a set of deadline-sensitive weather forecasting workflows. Specifically, this paper reports on our experiences with (1) virtualized reservations for batch queue systems, (2) coordinated usage of TeraGrid (batch queue), Amazon EC2 (cloud), our own clusters (batch queue) and Eucalyptus (cloud) resources, and (3) fault tolerance through automated task replication. The combined effect of these techniques was to enable a new workflow planning method that balances performance, reliability and cost considerations. The results point toward improved resource selection and execution management support for a variety of e-Science applications over grids and cloud systems.
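As a rough illustration of the reliability and cost trade-off behind task replication, the sketch below computes the chance that at least one replica of a task succeeds. It is not the vgES planner: it assumes replicas fail independently with the same probability, and the function names, the per-run success probability and the cost model (one cost unit per replica) are all hypothetical.

```python
def success_probability(p_single: float, replicas: int) -> float:
    """Chance that at least one of `replicas` independent copies succeeds.

    Assumes independent, identically distributed failures (an
    illustrative simplification, not a claim about real resources).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return 1.0 - (1.0 - p_single) ** replicas


def min_replicas(p_single: float, target: float, max_replicas: int = 64) -> int:
    """Smallest replica count whose success probability reaches `target`."""
    for k in range(1, max_replicas + 1):
        if success_probability(p_single, k) >= target:
            return k
    raise ValueError("target not reachable within max_replicas")


if __name__ == "__main__":
    p = 0.90  # hypothetical per-run success probability
    for k in (1, 2, 3):
        # Cost is modeled as one unit per replica (an assumption).
        print(k, round(success_probability(p, k), 6), "cost units:", k)
```

Under these assumptions each extra replica adds a full unit of cost but a shrinking gain in reliability, which is why a planner has to weigh the two against each other.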

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Performance
2. Software and its engineering
  1. Software organization and properties

Information

Published In

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

November 2009

778 pages

ISBN: 9781605587448

DOI: 10.1145/1654059

Conference Chair:
Wilfred Pinfold

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

SC '09

Sponsor:

SIGARCH
IEEE-CS

SC '09: International Conference for High Performance Computing, Networking, Storage and Analysis

November 14 - 20, 2009

Portland, Oregon

Acceptance Rates

SC '09 Paper Acceptance Rate 59 of 261 submissions, 23%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

Total Citations: 54
Total Downloads: 43
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 0

Reflects downloads up to 26 Jul 2024

Cited By

Zabasta A, Kazymyr V, Drozd O, Verslype S, Espeel L, Bruzgiene R (2024) Development of Shared Modeling and Simulation Environment for Sustainable e-Learning in the STEM Field. Sustainability 16:5 (2197). Online publication date: 6-Mar-2024. https://doi.org/10.3390/su16052197

Kazymyr V, Horval D, Drozd O, Zabašta A (2023) Shared Modeling and Simulation Environment for Online Learning with Moodle and Jupyter. Mathematical Modeling and Simulation of Systems (131-142). Online publication date: 3-Jun-2023. https://doi.org/10.1007/978-3-031-30251-0_11

Varshney P, Ramesh S, Chhabra S, Khochare A, Simmhan Y (2022) Resilient Execution of Data-triggered Applications on Edge, Fog and Cloud Resources. 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (473-483). Online publication date: May-2022. https://doi.org/10.1109/CCGrid54584.2022.00057

Masoumi M, Motallebi H (2022) A structure-aware algorithm for fault-tolerant scheduling of scientific workflows. The Journal of Supercomputing 78:15 (17348-17377). Online publication date: 18-May-2022. https://doi.org/10.1007/s11227-022-04529-w

Papadimitriou G, Lyons E, Wang C, Thareja K, Tanaka R, Ruth P, Rodero I, Deelman E, Zink M, Mandal A (2021) Fair sharing of network resources among workflow ensembles. Cluster Computing 25:4 (2873-2891). Online publication date: 22-Nov-2021. https://doi.org/10.1007/s10586-021-03457-3

Papadimitriou G, Lyons E, Wang C, Thareja K, Tanaka R, Ruth P, Villalobos J, Rodero I, Deelman E, Zink M, Mandal A (2020) Application Aware Software Defined Flows of Workflow Ensembles. 2020 IEEE/ACM Innovating the Network for Data-Intensive Science (INDIS) (10-21). Online publication date: Dec-2020. https://doi.org/10.1109/INDIS51933.2020.00007

Vakilinia S, Cheriet M (2018) Preemptive cloud resource allocation modeling of processing jobs. The Journal of Supercomputing 74:5 (2116-2150). Online publication date: 1-May-2018. https://dl.acm.org/doi/10.5555/3211601.3211667

Visheratin A, Melnik M, Nasonov D, Butakov N, Boukhanovsky A (2018) Hybrid scheduling algorithm in early warning systems. Future Generation Computer Systems 79:P2 (630-642). Online publication date: 1-Feb-2018. https://dl.acm.org/doi/10.1016/j.future.2017.04.002

Vakilinia S, Cheriet M (2018) Preemptive cloud resource allocation modeling of processing jobs. The Journal of Supercomputing 74:5 (2116-2150). Online publication date: 11-Jan-2018. https://doi.org/10.1007/s11227-017-2226-0

Haidri R, Katti C, Saxena P (2018) Cost‐effective deadline‐aware stochastic scheduling strategy for workflow applications on virtual machines in cloud computing. Concurrency and Computation: Practice and Experience 31:7. Online publication date: 4-Oct-2018. https://doi.org/10.1002/cpe.5006
