Effective Data Versioning for Collaborative Data Analytics

Author: Silu Huang

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 1925 - 1938

https://doi.org/10.1145/3318464.3394027

Published: 31 May 2020

Abstract

With the massive proliferation of datasets in a variety of sectors, data science teams in these sectors spend vast amounts of time collaboratively constructing, curating, and analyzing these datasets. Versions of datasets are routinely generated during this data science process through data processing operations such as data transformation and cleaning, feature engineering, and normalization. However, no existing systems enable us to effectively store, track, and query these versioned datasets, which leads to massive redundancy in versioned data storage and makes true collaboration and sharing impossible. In my PhD thesis, we develop solutions for versioned data management for collaborative data analytics. In the first part of my dissertation, we extend a relational database to support versioning of structured data. Specifically, we build a system, OrpheusDB, on top of a relational database, with a carefully designed data representation and an intelligent partitioning algorithm for fast version control operations. OrpheusDB inherits many of the benefits of relational databases, while also compactly storing, keeping track of, and recreating versions on demand. However, OrpheusDB implicitly makes a few assumptions, namely: (a) the SQL assumption: a SQL-like language is the best fit for querying data and versioning information; (b) the structural assumption: the data is in a relational format with a regular structure; (c) the from-scratch assumption: users adopt OrpheusDB from the very beginning of their project and register each data version along with full metadata in the system. In the second part of my dissertation, we remove each of these assumptions, one at a time. First, we remove the SQL assumption and propose a generalized query language for querying data along with versioning and provenance information. Second, we remove the structural assumption and develop solutions for compact storage and fast retrieval of arbitrary data representations [4]. Finally, we remove the from-scratch assumption by developing techniques to infer lineage relationships among versions residing in an existing data repository.
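To make the idea of compactly storing versions and recreating them on demand concrete, the following is a minimal sketch and not OrpheusDB's actual schema or API. It assumes a hypothetical two-table layout in SQLite: a shared `records` table that holds each distinct row once, and a `membership` table that maps a version id to the rows it contains. The names `open_store`, `commit`, and `checkout`, and the `(name, value)` columns, are illustrative assumptions.

```python
import sqlite3


def open_store():
    """Create an in-memory store: one shared record table plus a version-to-record map."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE records (
            rid   INTEGER PRIMARY KEY,
            name  TEXT NOT NULL,
            value INTEGER NOT NULL,
            UNIQUE (name, value)
        );
        CREATE TABLE membership (
            vid INTEGER NOT NULL,
            rid INTEGER NOT NULL,
            PRIMARY KEY (vid, rid)
        );
        """
    )
    return conn


def commit(conn, vid, rows):
    """Register version `vid`; rows already stored are referenced again, not copied."""
    for name, value in rows:
        # Insert the row only if no identical row exists yet.
        conn.execute(
            "INSERT OR IGNORE INTO records (name, value) VALUES (?, ?)",
            (name, value),
        )
        (rid,) = conn.execute(
            "SELECT rid FROM records WHERE name = ? AND value = ?",
            (name, value),
        ).fetchone()
        conn.execute("INSERT OR IGNORE INTO membership VALUES (?, ?)", (vid, rid))


def checkout(conn, vid):
    """Recreate version `vid` on demand by joining the map with the record table."""
    return conn.execute(
        "SELECT r.name, r.value FROM membership m "
        "JOIN records r ON r.rid = m.rid "
        "WHERE m.vid = ? ORDER BY r.name, r.value",
        (vid,),
    ).fetchall()


conn = open_store()
commit(conn, 1, [("a", 1), ("b", 2)])
commit(conn, 2, [("a", 1), ("b", 3)])  # version 2 changes a single row
stored_rows = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

In this sketch the two versions together contain four rows, but only three distinct rows are stored, because the unchanged row is shared. A real system would also have to address the partitioning and performance concerns that the abstract mentions, which this sketch ignores.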

References

[1] Silu Huang, Liqi Xu, Jialin Liu, Aaron J. Elmore, and Aditya Parameswaran. OrpheusDB: Bolt-on versioning for relational databases. Proceedings of the VLDB Endowment, 10(10), 2017.

[2] Silu Huang, Liqi Xu, Jialin Liu, Aaron J. Elmore, and Aditya Parameswaran. OrpheusDB: Bolt-on versioning for relational databases (extended version). The VLDB Journal, 29(1):509–538, 2020.

[3] Amit Chavan, Silu Huang, Amol Deshpande, Aaron Elmore, Samuel Madden, and Aditya Parameswaran. Towards a unified query language for provenance and versioning. In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15), 2015.

[4] Souvik Bhattacherjee, Amit Chavan, Silu Huang, Amol Deshpande, and Aditya Parameswaran. Principles of dataset versioning: Exploring the recreation/storage tradeoff. Proceedings of the VLDB Endowment, 8(12), 2015.

Index Terms

1. Information systems
  1. Data management systems

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN: 9781450367356

DOI: 10.1145/3318464

General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)

Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)

Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Abstract

Funding Sources

National Institutes of Health (NIH)
Microsoft
3M
National Science Foundation (NSF)

Conference

SIGMOD/PODS '20

Sponsor: SIGMOD

SIGMOD/PODS '20: International Conference on Management of Data

June 14 - 19, 2020

Portland, OR, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
