Effective Data Versioning for Collaborative Data Analytics

January 2019

Author:
Silu Huang, University of Illinois at Urbana-Champaign

Advisor:
Aditya Parameswaran, University of Illinois at Urbana-Champaign

Committee Members:
Jiawei Han, University of Illinois at Urbana-Champaign
Saurabh Sinha, University of Illinois at Urbana-Champaign
Aaron Elmore, University of Illinois at Urbana-Champaign

Publisher:

University of Illinois at Urbana-Champaign
Champaign, IL
United States

ISBN: 979-8-7806-0096-1

Order Number: AAI29023989

Abstract

With the massive proliferation of datasets in a variety of sectors, data science teams in these sectors spend vast amounts of time collaboratively constructing, curating, and analyzing these datasets. Versions of datasets are routinely generated during this data science process through data processing operations such as data transformation and cleaning, feature engineering, and normalization. However, no existing system enables us to effectively store, track, and query these versioned datasets, leading to massive redundancy in versioned data storage and making true collaboration and sharing impossible. In this thesis, we develop solutions for versioned data management for collaborative data analytics.

In the first part of this thesis, we extend a relational database to support versioning of structured data. Specifically, we build a system, OrpheusDB, on top of a relational database with a carefully designed data representation and an intelligent partitioning algorithm for fast version control operations. OrpheusDB inherits many of the benefits of relational databases, while also compactly storing, keeping track of, and recreating versions on demand. However, OrpheusDB implicitly makes a few assumptions, namely: (a) the SQL assumption: a SQL-like language is the best fit for querying data and versioning information; (b) the structural assumption: the data is in a relational format with a regular structure; (c) the from-scratch assumption: users adopt OrpheusDB from the very beginning of their project and register each data version along with full metadata in the system.

In the second part of this thesis, we remove each of these assumptions, one at a time. First, we remove the SQL assumption and propose a generalized query language for querying data along with versioning and provenance information. Second, we remove the structural assumption and develop solutions for compact storage and fast retrieval of arbitrary data representations. Finally, we remove the from-scratch assumption by developing techniques to infer lineage relationships among versions residing in an existing data repository.


