DOI: 10.1145/3299869.3320210
research-article

Unit Testing Data with Deequ

Published: 25 June 2019

    Abstract

    Modern companies and institutions rely on data to guide every single decision. Missing or incorrect information seriously compromises any decision process. We demonstrate "Deequ", an Apache Spark-based library for automating the verification of data quality at scale. This library provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables "unit tests for data". Deequ is available as open source, meets the requirements of production use cases at Amazon, and scales to datasets with billions of records if the constraints to evaluate are chosen carefully. Our demonstration walks attendees through a fictitious business use case of validating daily product reviews from a public dataset, and is executed in a proprietary interactive notebook environment. We show attendees how to define data unit tests from automatically suggested constraints and how to create customized tests. Additionally, we demonstrate how to apply Deequ to validate incrementally growing datasets, and give examples of how to configure anomaly detection algorithms on time series of data quality metrics to further automate the data validation.
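
    The idea of "unit tests for data" on incrementally growing datasets can be illustrated with a small, self-contained Python sketch. It is not Deequ's API (Deequ is a Spark-based library), and every name in it (`CompletenessState`, `Constraint`, `run_checks`, `is_anomalous`, the sample review rows) is hypothetical. It assumes a single metric (column completeness), a mergeable per-batch state so that earlier batches need not be rescanned, and a simple relative-change rule as a stand-in for an anomaly detection algorithm on a metric time series.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]


@dataclass(frozen=True)
class CompletenessState:
    """Mergeable state for one column: non-null count and total row count."""
    non_null: int = 0
    total: int = 0

    def merge(self, other: "CompletenessState") -> "CompletenessState":
        # States from separate batches combine by addition, so old data is never rescanned.
        return CompletenessState(self.non_null + other.non_null, self.total + other.total)

    def metric(self) -> float:
        # Fraction of non-null values; an empty dataset counts as fully complete.
        return self.non_null / self.total if self.total else 1.0


def completeness_state(rows: List[Row], column: str) -> CompletenessState:
    """Compute the completeness state of `column` for one batch of rows."""
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return CompletenessState(non_null, len(rows))


@dataclass(frozen=True)
class Constraint:
    """A declarative check: a description, a column, and an assertion on its metric."""
    description: str
    column: str
    assertion: Callable[[float], bool]


def run_checks(states: Dict[str, CompletenessState],
               constraints: List[Constraint]) -> Dict[str, bool]:
    """Evaluate every constraint against the metric derived from its column's state."""
    return {c.description: c.assertion(states[c.column].metric()) for c in constraints}


def is_anomalous(history: List[float], new_value: float,
                 max_rel_change: float = 0.1) -> bool:
    """Flag `new_value` if it differs from the last recorded value by more than
    `max_rel_change` (relative). A deliberately simple, assumed rule."""
    if not history:
        return False
    previous = history[-1]
    if previous == 0:
        return new_value != 0
    return abs(new_value - previous) / abs(previous) > max_rel_change


# Two hypothetical daily batches of product reviews.
day1 = [{"review_id": "r1", "star_rating": 5}, {"review_id": "r2", "star_rating": None}]
day2 = [{"review_id": "r3", "star_rating": 4}, {"review_id": "r4", "star_rating": 3}]

constraints = [
    Constraint("review_id is complete", "review_id", lambda m: m == 1.0),
    Constraint("star_rating is at least 90% complete", "star_rating", lambda m: m >= 0.9),
]

columns = ["review_id", "star_rating"]
states_day1 = {c: completeness_state(day1, c) for c in columns}
states_day2 = {c: completeness_state(day2, c) for c in columns}

# Validate the grown dataset by merging per-day states instead of rescanning day 1.
merged = {c: states_day1[c].merge(states_day2[c]) for c in columns}
results = run_checks(merged, constraints)
print(results)
```

    In this sketch the `review_id` check passes while the `star_rating` check fails, because only three of the four merged rows carry a rating. Keeping the state additive is the design choice that makes incremental validation cheap: each new batch is scanned once and its state is merged into the running total.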

    Published In

    SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
    June 2019
    2106 pages
    ISBN:9781450356435
    DOI:10.1145/3299869
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. anomaly detection
    2. data quality
    3. data validation
    4. integrity constraints
    5. unit tests for data

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '19: International Conference on Management of Data
    June 30 - July 5, 2019
    Amsterdam, Netherlands

    Acceptance Rates

    SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Cited By

    • (2024) Data Validation Utilizing Expert Knowledge and Shape Constraints. Journal of Data and Information Quality 16:2 (1-27). DOI: 10.1145/3661826. Online publication date: 25-Jun-2024
    • (2024) Data Ingestion Validation Through Stable Conditional Metrics with Ranking and Filtering. Information Systems Frontiers. DOI: 10.1007/s10796-024-10504-y. Online publication date: 5-Jul-2024
    • (2023) SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications. Proceedings of the ACM on Management of Data 1:3 (1-26). DOI: 10.1145/3617338. Online publication date: 13-Nov-2023
    • (2023) SEDAR: A Semantic Data Reservoir for Heterogeneous Datasets. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (5056-5060). DOI: 10.1145/3583780.3614753. Online publication date: 21-Oct-2023
    • (2023) Angler: Helping Machine Translation Practitioners Prioritize Model Improvements. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (1-20). DOI: 10.1145/3544548.3580790. Online publication date: 19-Apr-2023
    • (2023) Data Ingestion Validation through Stable Conditional Metrics with Ranking and Filtering. Advances in Databases and Information Systems (210-223). DOI: 10.1007/978-3-031-42914-9_15. Online publication date: 28-Aug-2023
    • (2022) Machine Learning and Data Cleaning: Which Serves the Other? Journal of Data and Information Quality 14:3 (1-11). DOI: 10.1145/3506712. Online publication date: 21-Jul-2022
    • (2021) Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. Machine Learning and Knowledge Extraction 3:2 (392-413). DOI: 10.3390/make3020020. Online publication date: 22-Apr-2021
    • (2021) Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes. Proceedings of the 2021 International Conference on Management of Data (1678-1691). DOI: 10.1145/3448016.3457250. Online publication date: 9-Jun-2021
    • (2021) Picket: guarding against corrupted data in tabular data during learning and inference. The VLDB Journal 31:5 (927-955). DOI: 10.1007/s00778-021-00699-w. Online publication date: 12-Oct-2021
