Scientific Workflows and Provenance: Introduction and Research Opportunities

  • Focus article (Schwerpunktbeitrag)
  • Published in: Datenbank-Spektrum

Abstract

Scientific workflows are becoming increasingly popular for compute-intensive and data-intensive scientific applications. The vision and promise of scientific workflows include rapid, easy workflow design, reuse, and scalable execution, along with other advantages such as facilitating “reproducible science” through provenance (e.g., data lineage) support. However, as described in the paper, important research challenges remain. While the database community has studied (business) workflow technologies extensively in the past, most current work on scientific workflows seems to be done outside of the database community, e.g., by practitioners and researchers in the computational sciences and eScience. We provide a brief introduction to scientific workflows and provenance, and identify areas and problems that suggest new opportunities for database research.

Notes

  1. Here we ignore a number of details, e.g., actor ports, subworkflows “hidden” within so-called composite actors, etc.

  2. Similarly, in business process modeling, more abstract models, e.g., BPMN, and simple, structured models (e.g., series-parallel graphs) can be easier to understand and reuse than unstructured or lower-level models, e.g., Petri nets.

  3. A physical shim is a thin strip of metal for aligning pipes.

  4. This shim actor turns a data array token into a sequence of individual data tokens.

  5. As of July 2012; see http://www.myexperiment.org.

  6. http://www.dataone.org/.

  7. See, for example, Amazon’s Simple Storage Service (S3) and Simple Workflow Service (SWF) at http://aws.amazon.com.
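
The shim actor described in note 4 can be illustrated with a minimal sketch. It assumes tokens are plain Python values and that an actor can be modeled as a generator; the name array_to_sequence is hypothetical and is not taken from Kepler or any other workflow system.

```python
from typing import Any, Iterable, Iterator, List


def array_to_sequence(array_tokens: Iterable[List[Any]]) -> Iterator[Any]:
    """Hypothetical shim actor: read array tokens from an input stream and
    emit each array element as an individual token, preserving order."""
    for array_token in array_tokens:
        for element in array_token:
            yield element


# Two array tokens arrive from an upstream actor (illustrative data only).
upstream = [[1, 2, 3], [4, 5]]
downstream = list(array_to_sequence(upstream))
print(downstream)
```

Because the shim is a generator, a downstream actor could start consuming individual tokens before the upstream actor has produced all of its arrays, which loosely mirrors pipelined dataflow execution.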

Acknowledgements

Work supported in part by NSF awards OCI-0830944, OCI-0722079, DGE-0841297, and DBI-0960535.

Author information

Corresponding author

Correspondence to Bertram Ludäscher.

About this article

Cite this article

Cuevas-Vicenttín, V., Dey, S., Köhler, S. et al. Scientific Workflows and Provenance: Introduction and Research Opportunities. Datenbank Spektrum 12, 193–203 (2012). https://doi.org/10.1007/s13222-012-0100-z

  • DOI: https://doi.org/10.1007/s13222-012-0100-z