Abstract
Scientific experiments are becoming increasingly large and complex, with a commensurate increase in the amount and complexity of the data they generate. Data, both intermediate and final results, are derived by chaining and nesting together multiple database searches and analytical tools. In many cases, the means by which the data are produced is not known, making the data difficult to interpret and the experiment impossible to reproduce. Provenance in scientific workflows is thus of paramount importance.
In this paper, we provide a formal model of provenance for scientific workflows which is general (i.e., can be used with existing workflow systems, such as Kepler, myGrid and Chimera) and sufficiently expressive to answer the provenance queries we encountered in a number of case studies. Interestingly, our model not only takes into account the chained and nested structure of scientific workflows, but also allows users to ask for provenance at different levels of abstraction (user views).
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cohen, S., Cohen-Boulakia, S., Davidson, S. (2006). Towards a Model of Provenance and User Views in Scientific Workflows. In: Leser, U., Naumann, F., Eckman, B. (eds) Data Integration in the Life Sciences. DILS 2006. Lecture Notes in Computer Science, vol 4075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11799511_24
DOI: https://doi.org/10.1007/11799511_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36593-8
Online ISBN: 978-3-540-36595-2
eBook Packages: Computer Science, Computer Science (R0)