Provenance and data differencing for workflow reproducibility analysis

Published: 25 March 2016

Abstract

One of the foundations of science is that researchers must publish the methodology used to achieve their results so that others can attempt to reproduce them. This has the added benefit of allowing methods to be adopted and adapted for other purposes. In the field of e-Science, services, often choreographed through workflows, process data to generate results. Reproducing those results is often not straightforward, because the computational objects may not be made available or may have been updated since the results were generated. For example, services are often updated to fix bugs or improve algorithms. This paper addresses these problems in three ways. Firstly, it introduces a new framework to clarify the range of meanings of 'reproducibility'. Secondly, it describes a new algorithm, PDIFF, that compares workflow provenance traces to determine whether an experiment has been reproduced; the main innovation is that, if it has not, the specific points of divergence are identified through graph analysis, assisting any researcher wishing to understand those differences. One key feature is support for user-defined, semantic data comparison operators. Finally, the paper describes an implementation of PDIFF that leverages the e-Science Central platform, which enacts workflows in the cloud. As well as automatically generating a provenance trace for consumption by PDIFF, the platform supports the storage and reuse of old versions of workflows, data and services; the paper shows how this can be exploited to achieve reproduction and reuse. Copyright © 2013 John Wiley & Sons, Ltd.
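
The idea of comparing two provenance traces and locating where they diverge can be illustrated with a minimal sketch. This is not the PDIFF algorithm itself, whose details are not given in the abstract; it is a simplified illustration that assumes each trace is a tree of data items linked to the items they were derived from. The `Node` class, the `diff_traces` function and the `comparators` mapping are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    """One data item in a hypothetical provenance trace."""
    name: str            # logical name of the data item
    kind: str            # type tag used to select a comparison operator
    value: object        # the recorded data value
    inputs: List["Node"] = field(default_factory=list)  # items it was derived from


Comparator = Callable[[object, object], bool]
Path = Tuple[str, ...]


def diff_traces(a: Node, b: Node,
                comparators: Dict[str, Comparator],
                path: Path = ()) -> List[Path]:
    """Walk two traces backwards from their outputs and return the
    earliest points at which they diverge (empty list: equivalent)."""
    here = path + (a.name,)
    if a.name != b.name or a.kind != b.kind:
        return [here]                      # structural mismatch
    # Use a user-defined operator for this kind, else plain equality.
    same = comparators.get(a.kind, lambda x, y: x == y)
    if same(a.value, b.value):
        return []                          # equivalent here, stop searching
    if len(a.inputs) != len(b.inputs):
        return [here]
    upstream: List[Path] = []
    for x, y in zip(a.inputs, b.inputs):
        upstream.extend(diff_traces(x, y, comparators, here))
    # If every input matched, the divergence was introduced at this node.
    return upstream or [here]


# Assumed example data: the same input processed by two versions of a service.
raw_a = Node("raw", "table", [1, 2, 3])
raw_b = Node("raw", "table", [1, 2, 3])
mean_a = Node("mean", "float", 2.0, [raw_a])
mean_b = Node("mean", "float", 2.5, [raw_b])
out_a = Node("report", "text", "mean=2.0", [mean_a])
out_b = Node("report", "text", "mean=2.5", [mean_b])

# A semantic comparison operator: floats match within a tolerance.
comparators = {"float": lambda x, y: abs(x - y) < 1e-6}

divergence = diff_traces(out_a, out_b, comparators)
print(divergence)
```

In this sketch the two reports differ, the inputs to the `mean` step are identical, and so the divergence is attributed to the `mean` step. Supplying the tolerance-based operator for `float` values shows how a user-defined comparison can treat numerically close results as equivalent where exact equality would report a difference.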

Published In

Concurrency and Computation: Practice & Experience, Volume 28, Issue 4
March 2016
451 pages
ISSN:1532-0626
EISSN:1532-0634

Publisher

John Wiley and Sons Ltd.

United Kingdom

Author Tags

  1. e-science
  2. provenance
  3. reproducibility
  4. scientific workflow

Qualifiers

  • Article

Cited By

  • (2022) Sharing and performance optimization of reproducible workflows in the cloud. Future Generation Computer Systems 98:C (487-502). DOI: 10.1016/j.future.2019.03.045. Online publication date: 21-Apr-2022
  • (2020) Efficient provenance alignment in reproduced executions. Proceedings of the 12th USENIX Conference on Theory and Practice of Provenance (6-6). DOI: 10.5555/3488890.3488896. Online publication date: 22-Jun-2020
  • (2018) Dac-Man. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-13). DOI: 10.5555/3291656.3291753. Online publication date: 11-Nov-2018
  • (2018) Provenance Analytics for Workflow-Based Computational Experiments. ACM Computing Surveys 51:3 (1-25). DOI: 10.1145/3184900. Online publication date: 23-May-2018
  • (2018) Dac-Man. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-13). DOI: 10.1109/SC.2018.00075. Online publication date: 11-Nov-2018
  • (2018) Reproducibility of scientific workflows execution using cloud-aware provenance (ReCAP). Computing 100:12 (1299-1333). DOI: 10.1007/s00607-018-0617-6. Online publication date: 1-Dec-2018
  • (2016) Cloud computing for data-driven science and engineering. Concurrency and Computation: Practice & Experience 28:4 (947-949). DOI: 10.1002/cpe.3668. Online publication date: 25-Mar-2016
  • (2014) noWorkflow. Revised Selected Papers of the 5th International Provenance and Annotation Workshop on Provenance and Annotation of Data and Processes, Volume 8628 (71-83). DOI: 10.1007/978-3-319-16462-5_6. Online publication date: 9-Jun-2014
