Evaluation of a Hybrid Approach for Efficient Provenance Storage

Published: 01 November 2013

Abstract

    Provenance is the metadata that describes the history of objects. Provenance provides new functionality in a variety of areas, including experimental documentation, debugging, search, and security. As a result, a number of groups have built systems to capture provenance. Most of these systems focus on provenance collection, and a few focus on building applications that use the provenance, but all of them ignore an important aspect: efficient long-term storage of provenance.
    In this article, we first analyze the provenance collected from multiple workloads and characterize the properties of provenance with respect to long-term storage. We then propose a hybrid scheme that takes advantage of the graph structure of provenance data and the inherent duplication in provenance data. Our evaluation indicates that our hybrid scheme, a combination of Web graph compression (adapted for provenance) and dictionary encoding, provides the best trade-off in terms of compression ratio, compression time, and query performance when compared to other compression schemes.
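
The two ingredients named in the abstract can be illustrated with a small Python sketch. This is not the article's algorithm: it only shows the generic ideas of dictionary encoding (replacing repeated strings with integer codes) and gap encoding of sorted adjacency lists, which is one basic building block of Web graph compression. All function names and the sample data are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with an index into a table of unique strings."""
    table: List[str] = []        # unique strings, in first-seen order
    index: Dict[str, int] = {}   # string -> position in table
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(table)
            table.append(value)
        codes.append(index[value])
    return table, codes


def dictionary_decode(table: List[str], codes: List[int]) -> List[str]:
    """Invert dictionary_encode."""
    return [table[code] for code in codes]


def gap_encode(neighbors: List[int]) -> List[int]:
    """Store a sorted adjacency list as its first id plus successive gaps."""
    ordered = sorted(neighbors)
    if not ordered:
        return []
    gaps = [ordered[0]]
    for previous, current in zip(ordered, ordered[1:]):
        gaps.append(current - previous)
    return gaps


def gap_decode(gaps: List[int]) -> List[int]:
    """Invert gap_encode by accumulating the gaps."""
    ids: List[int] = []
    total = 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


# Hypothetical provenance attributes with repeated values.
attributes = ["/bin/ls", "/usr/lib/libc.so", "/bin/ls"]
table, codes = dictionary_encode(attributes)

# Hypothetical ancestor ids of one node in a provenance graph.
ancestors = [12, 10, 15]
encoded = gap_encode(ancestors)
```

Small gaps and small dictionary codes need fewer bits than raw identifiers or repeated strings, which is why schemes of this kind usually pair them with a variable-length integer code; that step is omitted here.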

    Published In

    ACM Transactions on Storage, Volume 9, Issue 4
    November 2013
    117 pages
    ISSN:1553-3077
    EISSN:1553-3093
    DOI:10.1145/2555948
    Editor: Darrell Long

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 November 2013
    Accepted: 01 June 2013
    Revised: 01 April 2013
    Received: 01 October 2012
    Published in TOS Volume 9, Issue 4

    Author Tags

    1. Provenance graphs
    2. Web compression
    3. dictionary encoding
    4. storage

    Qualifiers

    • Research-article
    • Research
    • Refereed
