DOI: 10.1145/1142473.1142534
Article

Provenance management in curated databases

Published: 27 June 2006
Abstract

    Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction and transfer of data from other sources. Provenance information concerning the creation, attribution, or version history of such data is crucial for assessing its integrity and scientific value. General purpose database systems provide little support for tracking provenance, especially when data moves among databases. This paper investigates general-purpose techniques for recording provenance for data that is copied among databases. We describe an approach in which we track the user's actions while browsing source databases and copying data into a curated database, in order to record the user's actions in a convenient, queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management. Our experiments show that although the overhead of a naive approach is fairly high, it can be decreased to an acceptable level using simple optimizations.
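
As a rough illustration of what recording copy actions "in a convenient, queryable form" could look like, here is a minimal Python sketch. The record layout (`txn`, `op`, `target`, `source`), the class names, and the example paths are all hypothetical and are not taken from the paper. It does not reproduce the paper's implementation or its optimizations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProvRecord:
    """One logged user action (hypothetical schema, not the paper's)."""
    txn: int                # identifier of the curation transaction
    op: str                 # "insert", "copy", or "delete"
    target: str             # path in the curated database that was affected
    source: Optional[str]   # path in a source database; set only for copies


class ProvenanceLog:
    """Append-only log of user actions that can be queried afterwards."""

    VALID_OPS = {"insert", "copy", "delete"}

    def __init__(self) -> None:
        self.records: List[ProvRecord] = []

    def log(self, txn: int, op: str, target: str,
            source: Optional[str] = None) -> None:
        """Record one action; a source is required for copies only."""
        if op not in self.VALID_OPS:
            raise ValueError(f"unknown operation: {op!r}")
        if (op == "copy") != (source is not None):
            raise ValueError("a source is required for copies and only for copies")
        self.records.append(ProvRecord(txn, op, target, source))

    def origin(self, target: str) -> Optional[str]:
        """Return where `target` was copied from, judging by its latest action.

        Returns None if the latest action on `target` was not a copy,
        or if `target` never appears in the log.
        """
        for rec in reversed(self.records):
            if rec.target == target:
                return rec.source if rec.op == "copy" else None
        return None


# Example usage with made-up paths.
log = ProvenanceLog()
log.log(1, "copy", "curated/protein/42/name", "uniprot/P12345/name")
log.log(2, "insert", "curated/protein/42/note")
```

Logging one record per user action is the simplest design. The abstract notes that a naive approach has fairly high overhead, so a practical system would likely need to store fewer or more compact records than this sketch does.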

    Published In

    SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data
    June 2006
    830 pages
    ISBN: 1595934340
    DOI: 10.1145/1142473
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. curation
    2. provenance
    3. storage

    Conference

    SIGMOD/PODS06
    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
