Effective Data Versioning for Collaborative Data Analytics
Pages 1925 - 1938
Abstract
With the massive proliferation of datasets in a variety of sectors, data science teams in these sectors spend vast amounts of time collaboratively constructing, curating, and analyzing these datasets. Versions of datasets are routinely generated during this data science process via data processing operations such as data transformation and cleaning, feature engineering and normalization, among others. However, no existing systems enable us to effectively store, track, and query these versioned datasets, leading to massive redundancy in versioned data storage and making true collaboration and sharing impossible. In my PhD thesis, we develop solutions for versioned data management for collaborative data analytics. In the first part of my dissertation, we extend a relational database to support versioning of structured data. Specifically, we build a system, OrpheusDB, on top of a relational database with a carefully designed data representation and an intelligent partitioning algorithm for fast version control operations. OrpheusDB inherits many of the benefits of relational databases, while also compactly storing, keeping track of, and recreating versions on demand. However, OrpheusDB implicitly makes a few assumptions, namely: (a) the SQL assumption: a SQL-like language is the best fit for querying data and versioning information; (b) the structural assumption: the data is in a relational format with a regular structure; (c) the from-scratch assumption: users adopt OrpheusDB from the very beginning of their project and register each data version along with full metadata in the system.
In the second part of my dissertation, we remove each of these assumptions, one at a time. First, we remove the SQL assumption and propose a generalized query language for querying data along with versioning and provenance information. Second, we remove the structural assumption and develop solutions for compact storage and fast retrieval of arbitrary data representations [4]. Finally, we remove the from-scratch assumption by developing techniques to infer lineage relationships among versions residing in an existing data repository.
References
[1] Silu Huang, Liqi Xu, Jialin Liu, Aaron J. Elmore, and Aditya Parameswaran. OrpheusDB: Bolt-on versioning for relational databases. Proceedings of the VLDB Endowment, 10(10), 2017.
[2] Silu Huang, Liqi Xu, Jialin Liu, Aaron J. Elmore, and Aditya Parameswaran. OrpheusDB: Bolt-on versioning for relational databases (extended version). The VLDB Journal, 29(1):509-538, 2020.
[3] Amit Chavan, Silu Huang, Amol Deshpande, Aaron Elmore, Samuel Madden, and Aditya Parameswaran. Towards a unified query language for provenance and versioning. In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15), 2015.
[4] Souvik Bhattacherjee, Amit Chavan, Silu Huang, Amol Deshpande, and Aditya Parameswaran. Principles of dataset versioning: Exploring the recreation/storage tradeoff. Proceedings of the VLDB Endowment, 8(12), 2015.
Published In
June 2020
2925 pages
ISBN:9781450367356
DOI:10.1145/3318464
- General Chairs:
- David Maier,
- Rachel Pottinger,
- Program Chairs:
- AnHai Doan,
- Wang-Chiew Tan,
- Publications Chairs:
- Abdussalam Alawini,
- Hung Q. Ngo
Copyright © 2020 Owner/Author.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
Published: 31 May 2020
Conference
SIGMOD/PODS '20
Acceptance Rates
Overall Acceptance Rate 785 of 4,003 submissions, 20%