
Fine-grained and efficient lineage querying of collection-based workflow provenance

Published: 22 March 2010 Publication History

Abstract

The management and querying of workflow provenance data underpins a range of activities, including the analysis of workflow results and the debugging of workflows or services. Such activities require efficient evaluation of lineage queries over potentially complex and voluminous provenance logs. Naïve implementations of lineage queries navigate provenance logs by joining tables that represent the flow of data between connected processors invoked from workflows. In this paper we provide an approach to provenance querying that: (i) avoids joins over provenance logs by using information about the workflow definition to inform the construction of queries that directly target relevant lineage results; (ii) provides fine-grained provenance querying, even for workflows that create and consume collections; and (iii) scales effectively to address complex workflows, workflows with large intermediate data sets, and queries over multiple workflows.
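
To make the contrast in point (i) concrete, the following is a minimal Python sketch, not the paper's implementation. It compares a hop-by-hop walk over a provenance log (standing in for repeated joins) with a lookup that first consults the workflow definition and then addresses the relevant log record directly. The workflow, the log records, the function names, and the assumption that every processor handles collection elements one-to-one (so an element keeps its index along the pipeline) are all hypothetical and chosen only for illustration.

```python
# Hypothetical workflow definition: processor -> list of upstream processors.
WORKFLOW = {
    "fetch": [],
    "align": ["fetch"],
    "score": ["align"],
    "report": ["score"],
}

# Hypothetical provenance log, one record per processed collection element:
# (processor, element index, input value, output value).
LOG = [
    ("fetch", 0, "id0", "seqA"), ("fetch", 1, "id1", "seqB"),
    ("align", 0, "seqA", "alnA"), ("align", 1, "seqB", "alnB"),
    ("score", 0, "alnA", 0.9), ("score", 1, "alnB", 0.4),
    ("report", 0, 0.9, "pass"), ("report", 1, 0.4, "fail"),
]


def naive_lineage(processor, index, target):
    """Walk the log backwards one hop at a time.

    Each hop scans the log for the upstream record whose output matches the
    current input, which plays the role of one join in a relational store.
    Returns the input consumed by `target` and the number of hops taken.
    """
    rec = next(r for r in LOG if r[0] == processor and r[1] == index)
    hops = 0
    while rec[0] != target:
        upstream = WORKFLOW[rec[0]]
        rec = next(r for r in LOG if r[0] in upstream and r[3] == rec[2])
        hops += 1
    return rec[2], hops


def is_upstream(target, processor):
    """Traverse only the (small) workflow definition, never the log."""
    stack, seen = [processor], set()
    while stack:
        p = stack.pop()
        if p == target:
            return True
        if p not in seen:
            seen.add(p)
            stack.extend(WORKFLOW[p])
    return False


# Index over the log keyed by (processor, element index).
INDEX = {(r[0], r[1]): r for r in LOG}


def informed_lineage(processor, index, target):
    """Use the workflow definition to target the relevant record directly.

    Assumes element-wise processing, so the element index is preserved
    between `target` and `processor`; this is an illustrative simplification.
    """
    if not is_upstream(target, processor):
        return None
    return INDEX[(target, index)][2]


if __name__ == "__main__":
    print(naive_lineage("report", 1, "fetch"))
    print(informed_lineage("report", 1, "fetch"))
```

In this sketch the naive walk costs one log scan per processor on the path, whereas the informed version does its graph traversal over the workflow definition alone and touches the log once. Workflows whose processors restructure collections would need a richer mapping between element indices than the identity mapping assumed here.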

Published In

EDBT '10: Proceedings of the 13th International Conference on Extending Database Technology
March 2010
741 pages
ISBN:9781605589459
DOI:10.1145/1739041
Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

EDBT/ICDT '10
EDBT/ICDT '10: EDBT/ICDT '10 joint conference
March 22 - 26, 2010
Lausanne, Switzerland

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%
