DOI: 10.1145/1654059.1654107
research-article

VGrADS: enabling e-Science workflows on grids and clouds with fault tolerance

Published: 14 November 2009
Abstract

    Today's scientific workflows use distributed heterogeneous resources through diverse grid and cloud interfaces that are often hard to program. In addition, time-sensitive critical applications in particular need predictable quality of service across these distributed resources. VGrADS' virtual grid execution system (vgES) provides a uniform qualitative resource abstraction over grid and cloud systems. We apply vgES to schedule a set of deadline-sensitive weather forecasting workflows. Specifically, this paper reports on our experiences with (1) virtualized reservations for batch-queue systems, (2) coordinated usage of TeraGrid (batch queue), Amazon EC2 (cloud), our own clusters (batch queue), and Eucalyptus (cloud) resources, and (3) fault tolerance through automated task replication. Together, these techniques enabled a new workflow planning method that balances performance, reliability, and cost considerations. The results point toward improved resource selection and execution management support for a variety of e-Science applications over grid and cloud systems.
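
    As a rough illustration of the reliability-versus-cost trade-off behind task replication in item (3), the sketch below estimates how many replicas of a task are needed to reach a target success probability, and what that would cost. It is not taken from the paper: the assumption that replicas fail independently, the function names, and all numbers are hypothetical, and this is not vgES's planning method.

```python
from typing import Optional


def success_probability(p_fail: float, replicas: int) -> float:
    """Chance that at least one replica finishes, assuming each replica
    fails independently with probability p_fail (an assumption)."""
    return 1.0 - p_fail ** replicas


def replicas_needed(p_fail: float, target: float,
                    max_replicas: int = 32) -> Optional[int]:
    """Smallest replica count that meets the target success probability,
    or None if max_replicas is not enough."""
    for k in range(1, max_replicas + 1):
        if success_probability(p_fail, k) >= target:
            return k
    return None


if __name__ == "__main__":
    # Hypothetical inputs: a 20% per-replica failure chance, a 99% target,
    # and a made-up cost of 0.40 units per replica.
    k = replicas_needed(p_fail=0.2, target=0.99)
    if k is not None:
        print(f"replicas: {k}, cost: {k * 0.40:.2f}, "
              f"success: {success_probability(0.2, k):.3f}")
```

    Each extra replica raises the chance of meeting the deadline but adds cost, which is the kind of balance the abstract describes between performance, reliability, and cost.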


    Published In

    SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
    November 2009
    778 pages
    ISBN:9781605587448
    DOI:10.1145/1654059

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Conference

    SC '09

    Acceptance Rates

    SC '09 Paper Acceptance Rate 59 of 261 submissions, 23%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
