research-article

An algebraic approach for data-centric scientific workflows

Published: 01 August 2011

Abstract

Scientific workflows have emerged as a basic abstraction for structuring and executing scientific experiments in computational environments. In many situations, these workflows are computationally and data intensive, thus requiring execution in large-scale parallel computers. However, parallelization of scientific workflows remains low-level, ad-hoc and labor-intensive, which makes it hard to exploit optimization opportunities. To address this problem, we propose an algebraic approach (inspired by relational algebra) and a parallel execution model that enable automatic optimization of scientific workflows. We conducted a thorough validation of our approach using both a real oil exploitation application and synthetic data scenarios. The experiments were run in Chiron, a data-centric scientific workflow engine implemented to support our algebraic approach. Our experiments demonstrate performance improvements of up to 226% compared to an ad-hoc workflow implementation.

    Published In

    Proceedings of the VLDB Endowment, Volume 4, Issue 12
    August 2011
    303 pages

    Publisher

    VLDB Endowment

    Qualifiers

    • Research-article

    Article Metrics

    • Downloads (Last 12 months): 5
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 30 Aug 2024

    Cited By

    • (2022) A Provenance-based Execution Strategy for Variant GPU-accelerated Scientific Workflows in Clouds. Journal of Grid Computing, 20:4. DOI: 10.1007/s10723-022-09625-y. Online publication date: 1-Dec-2022
    • (2021) Executing cyclic scientific workflows in the cloud. Journal of Cloud Computing: Advances, Systems and Applications, 10:1. DOI: 10.1186/s13677-021-00229-7. Online publication date: 6-Apr-2021
    • (2019) Adaptive Caching for Data-Intensive Scientific Workflows in the Cloud. Database and Expert Systems Applications, pp. 452-466. DOI: 10.1007/978-3-030-27618-8_33. Online publication date: 26-Aug-2019
    • (2017) Optimization of Complex Dataflows with User-Defined Functions. ACM Computing Surveys, 50:3, pp. 1-39. DOI: 10.1145/3078752. Online publication date: 26-May-2017
    • (2016) SQLShare. Proceedings of the 2016 International Conference on Management of Data, pp. 281-293. DOI: 10.1145/2882903.2882957. Online publication date: 26-Jun-2016
    • (2014) Optimization of Data-intensive Flows. Proceedings of the 17th International Workshop on Data Warehousing and OLAP, pp. 95-98. DOI: 10.1145/2666158.2666174. Online publication date: 7-Nov-2014
    • (2012) Evaluating parameter sweep workflows in high performance computing. Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies, pp. 1-10. DOI: 10.1145/2443416.2443418. Online publication date: 20-May-2012
    • (2012) Using domain-specific data to enhance scientific workflow steering queries. Proceedings of the 4th international conference on Provenance and Annotation of Data and Processes, pp. 152-167. DOI: 10.1007/978-3-642-34222-6_12. Online publication date: 19-Jun-2012
