Modern companies and institutions rely on data to guide every single decision. Missing or incorrect information seriously compromises any decision process. We demonstrate "Deequ", an Apache Spark-based library for automating the verification of data quality at scale. This library provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables "unit tests for data". Deequ is available as open source, meets the requirements of production use cases at Amazon, and scales to datasets with billions of records if the constraints to evaluate are chosen carefully. Our demonstration walks attendees through a fictitious business use case of validating daily product reviews from a public dataset, and is executed in a proprietary interactive notebook environment. We show attendees how to define data unit tests from automatically suggested constraints and how to create customized tests. Additionally, we demonstrate how to apply Deequ to validate incrementally growing datasets, and give examples of how to configure anomaly detection algorithms on time series of data quality metrics to further automate the data validation.
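The idea of "unit tests for data" described above can be illustrated with a minimal, self-contained sketch. This is not Deequ's API: Deequ runs on Apache Spark, while the sketch below uses plain Python lists of dictionaries. The `Check` class, its method names, and the sample review records are all hypothetical, chosen only to show how built-in constraints and user-defined validation code might be combined declaratively.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]


@dataclass
class Constraint:
    description: str
    predicate: Callable[[List[Row]], bool]


class Check:
    """Hypothetical declarative check builder (illustrative only, not Deequ's API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.constraints: List[Constraint] = []

    def is_complete(self, column: str) -> "Check":
        # Built-in constraint: the column must not contain missing values.
        self.constraints.append(Constraint(
            f"{column} has no missing values",
            lambda rows: all(r.get(column) is not None for r in rows),
        ))
        return self

    def is_unique(self, column: str) -> "Check":
        # Built-in constraint: the column must not contain duplicates.
        def no_duplicates(rows: List[Row]) -> bool:
            values = [r.get(column) for r in rows]
            return len(values) == len(set(values))

        self.constraints.append(Constraint(f"{column} has no duplicates", no_duplicates))
        return self

    def satisfies(self, description: str, predicate: Callable[[List[Row]], bool]) -> "Check":
        # User-defined validation code, registered alongside the built-in constraints.
        self.constraints.append(Constraint(description, predicate))
        return self

    def run(self, rows: List[Row]) -> Dict[str, bool]:
        # Evaluate every constraint and report pass/fail per description.
        return {c.description: c.predicate(rows) for c in self.constraints}


# Made-up sample of daily product reviews: one duplicate id, one missing rating.
reviews: List[Row] = [
    {"review_id": "r1", "star_rating": 5},
    {"review_id": "r2", "star_rating": 3},
    {"review_id": "r2", "star_rating": None},
]

check = (
    Check("daily product reviews")
    .is_complete("star_rating")
    .is_unique("review_id")
    .satisfies(
        "star_rating between 1 and 5",
        lambda rows: all(
            r["star_rating"] is None or 1 <= r["star_rating"] <= 5 for r in rows
        ),
    )
)

results = check.run(reviews)
failed = [description for description, ok in results.items() if not ok]
print(failed)
```

In this sketch a failing constraint is reported by its description, much as a failing unit test is reported by name. A production system would compute the underlying metrics in a distributed fashion rather than over an in-memory list.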

Index Terms

Unit Testing Data with Deequ
1. Information systems
  1. Data management systems
    1. Database administration
      1. Database performance evaluation
    2. Database management system engines
      1. Database query processing
        1. Query planning

Published In

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

June 2019

2106 pages

ISBN:9781450356435

DOI:10.1145/3299869

General Chairs: Peter Boncz (CWI & Vrije Universiteit Amsterdam, The Netherlands), Stefan Manegold (CWI & Universiteit Leiden, The Netherlands)

Program Chairs: Anastasia Ailamaki (EPFL, Switzerland), Amol Deshpande (University of Maryland, USA), Tim Kraska (MIT, USA)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 June 2019

Qualifiers

Research-article

Conference

SIGMOD/PODS '19

Sponsor:

SIGMOD

SIGMOD/PODS '19: International Conference on Management of Data

June 30 - July 5, 2019

Amsterdam, Netherlands

Acceptance Rates

SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
