A survey of data provenance in e-science

Published: 01 September 2005

    Abstract

    Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources.

    In this paper we create a taxonomy of data provenance characteristics and apply it to current research efforts in e-science, focusing primarily on scientific workflow approaches. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and ways to disseminate it. The survey culminates with an identification of open research problems in the field.

    Published In

    ACM SIGMOD Record  Volume 34, Issue 3
    September 2005
    115 pages
    ISSN:0163-5808
    DOI:10.1145/1084805
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Article
