Understanding provenance black boxes

Authors: Adriane Chapman, H. V. Jagadish

Distributed and Parallel Databases, Volume 27, Issue 2

Pages 139 - 167

https://doi.org/10.1007/s10619-009-7058-3

Published: 01 April 2010

Abstract

Current provenance stores associated with workflow management systems (WfMSs) capture enough coarse-grained information to describe which datasets were used and which processes were run. While this information is enough to rebuild a workflow run, it is not enough to facilitate user understanding. Because the data is manipulated via a series of black boxes, it is often impossible for a human to understand what happened to the data. In this work, we highlight the missing information that can assist user understanding. Unfortunately, provenance information is already very complex and difficult for a user to comprehend, a problem that can be exacerbated by adding the extra information needed for deeper black-box understanding. To alleviate this, we develop a model of provenance answers that follows a "roll up", "drill down" strategy. We evaluate these techniques to determine whether users gain a better understanding of provenance information. We show how this information can be captured by workflow management systems, and that the structures and information needed for this model are a negligible addition to standard provenance stores. Finally, we implement these techniques in a real provenance system and evaluate implementation feasibility.
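
To make the "roll up", "drill down" idea concrete, the following is a minimal sketch, not the authors' model. It assumes a provenance record can be represented as a tree in which each coarse-grained workflow step (a black box) may hold finer-grained sub-steps. The `ProvNode` class, the `roll_up` and `drill_down` functions, and the example step and file names are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProvNode:
    """One step in a provenance record (hypothetical structure)."""
    name: str
    inputs: List[str]
    outputs: List[str]
    children: List["ProvNode"] = field(default_factory=list)


def roll_up(node: ProvNode, depth: int = 0) -> List[str]:
    """Summarize a step, hiding everything more than `depth` levels below it."""
    line = f"{node.name}: {', '.join(node.inputs)} -> {', '.join(node.outputs)}"
    if depth == 0 or not node.children:
        return [line]
    lines = [line]
    for child in node.children:
        # Indent each sub-step under its parent.
        lines.extend("  " + s for s in roll_up(child, depth - 1))
    return lines


def drill_down(node: ProvNode, name: str) -> List[str]:
    """Expand the step called `name` by one level; return [] if it is absent."""
    if node.name == name:
        return roll_up(node, depth=1)
    for child in node.children:
        found = drill_down(child, name)
        if found:
            return found
    return []


# Hypothetical workflow run: one coarse step hiding two sub-steps.
run = ProvNode(
    "align_sequences", ["reads.fa"], ["hits.tsv"],
    children=[
        ProvNode("filter_low_quality", ["reads.fa"], ["clean.fa"]),
        ProvNode("blast_search", ["clean.fa"], ["hits.tsv"]),
    ],
)

summary = roll_up(run)                       # coarse, black-box view
detail = drill_down(run, "align_sequences")  # same step, one level deeper
```

Here `summary` holds a single line describing the black box, while `detail` adds one indented line per sub-step. A user would start from the rolled-up view and expand only the steps they need to understand.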

Published In

Distributed and Parallel Databases, Volume 27, Issue 2

April 2010

114 pages

ISSN: 0926-8782

Copyright © 2010 Springer Science+Business Media, LLC.

Publisher

Kluwer Academic Publishers

United States

Cited By

Pimentel, J., Freire, J., Murta, L., Braganholo, V.: A Survey on Collecting, Managing, and Analyzing Provenance from Scripts. ACM Computing Surveys 52(3), 1-38 (2019). https://dl.acm.org/doi/10.1145/3311955

Narock, T., Yoon, V.: An agent-based approach for capturing and linking provenance in geoscience workflows. Computers & Geosciences 79(C), 58-68 (2019). https://dl.acm.org/doi/10.1016/j.cageo.2015.03.004

Narock, T., Yoon, V., March, S.: A provenance-based approach to semantic web service description and discovery. Decision Support Systems 64(C), 90-99 (2018). https://dl.acm.org/doi/10.1016/j.dss.2014.04.007

Das Sarma, A., Jain, A., Bohannon, P.: Building a generic debugger for information extraction pipelines. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 2229-2232 (2011). https://dl.acm.org/doi/10.1145/2063576.2063933
