research-article

Automating large-scale data quality verification

Authors:

Sebastian Schelter,

Philipp Schmidt,

Meltem Celikel,

Felix Biessmann,

Andreas Grafberger

Proceedings of the VLDB Endowment, Volume 11, Issue 12

Pages 1781 - 1794

https://doi.org/10.14778/3229863.3229867

Published: 01 August 2018

Abstract

Modern companies and institutions rely on data to guide every single business process and decision. Missing or incorrect information seriously compromises any decision process downstream. Therefore, a crucial, but tedious task for everyone involved in data processing is to verify the quality of their data. We present a system for automating the verification of data quality at scale, which meets the requirements of production use cases. Our system provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables 'unit tests' for data. We efficiently execute the resulting constraint validation workload by translating it to aggregation queries on Apache Spark. Our platform supports the incremental validation of data quality on growing datasets, and leverages machine learning, e.g., for enhancing constraint suggestions, for estimating the 'predictability' of a column, and for detecting anomalies in historic data quality time series. We discuss our design decisions, describe the resulting system architecture, and present an experimental evaluation on various datasets.
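The idea of declarative 'unit tests' for data that are evaluated through aggregations can be illustrated with a minimal sketch. This is plain Python, not the system's actual API and not Spark code: the names `Check`, `ColumnState`, `scan`, `merge_states`, and `verify` are hypothetical and chosen only for illustration. The sketch assumes that each constraint can be decided from a small, mergeable per-column aggregate, so a growing dataset can be validated by scanning only the new rows and merging the resulting state with the stored one.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


@dataclass
class ColumnState:
    """Mergeable aggregate for one column (hypothetical helper)."""
    rows: int = 0
    non_null: int = 0
    counts: Dict[Any, int] = field(default_factory=dict)

    def update(self, value: Any) -> None:
        self.rows += 1
        if value is not None:
            self.non_null += 1
            self.counts[value] = self.counts.get(value, 0) + 1

    def merge(self, other: "ColumnState") -> "ColumnState":
        merged = ColumnState(self.rows + other.rows,
                             self.non_null + other.non_null,
                             dict(self.counts))
        for value, count in other.counts.items():
            merged.counts[value] = merged.counts.get(value, 0) + count
        return merged

    @property
    def completeness(self) -> float:
        # Fraction of non-null values; an empty column counts as complete.
        return self.non_null / self.rows if self.rows else 1.0


class Check:
    """Declarative list of constraints, each decided from a ColumnState."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.constraints: List[Tuple[str, str, Callable[[ColumnState], bool]]] = []

    def is_complete(self, column: str) -> "Check":
        self.constraints.append(
            (f"is_complete({column})", column, lambda s: s.non_null == s.rows))
        return self

    def is_unique(self, column: str) -> "Check":
        self.constraints.append(
            (f"is_unique({column})", column,
             lambda s: all(c == 1 for c in s.counts.values())))
        return self

    def satisfies(self, column: str, description: str,
                  predicate: Callable[[ColumnState], bool]) -> "Check":
        # User-defined validation code expressed over the aggregate.
        self.constraints.append((description, column, predicate))
        return self

    @property
    def columns(self) -> List[str]:
        return sorted({column for _, column, _ in self.constraints})


def scan(rows: Iterable[Row], columns: List[str]) -> Dict[str, ColumnState]:
    """Single pass over the rows that fills all aggregates at once."""
    states = {column: ColumnState() for column in columns}
    for row in rows:
        for column in columns:
            states[column].update(row.get(column))
    return states


def merge_states(a: Dict[str, ColumnState],
                 b: Dict[str, ColumnState]) -> Dict[str, ColumnState]:
    return {column: a[column].merge(b[column]) for column in a}


def verify(check: Check, states: Dict[str, ColumnState]) -> Dict[str, bool]:
    return {desc: pred(states[col]) for desc, col, pred in check.constraints}


check = (Check("orders")
         .is_complete("id")
         .is_unique("id")
         .satisfies("status", "completeness(status) >= 0.5",
                    lambda s: s.completeness >= 0.5))

day1 = [{"id": 1, "status": "paid"}, {"id": 2, "status": None}]
day2 = [{"id": 3, "status": "paid"}, {"id": 2, "status": "open"}]

state1 = scan(day1, check.columns)
result1 = verify(check, state1)

# Incremental validation: scan only the new rows, then merge the states.
state2 = merge_states(state1, scan(day2, check.columns))
result2 = verify(check, state2)
```

In this toy run, the first batch passes all three constraints, while the merged state fails `is_unique(id)` because the value 2 appears in both batches. The exact value counts kept here would not scale to large columns; they stand in for whatever compact aggregates a production system would compute.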



Published In

Proceedings of the VLDB Endowment, Volume 11, Issue 12

August 2018

426 pages

ISSN: 2150-8097

Editors: Sihem Amer-Yahia (University of Grenoble Alpes, CNRS), Jian Pei (Simon Fraser University)

Publisher

VLDB Endowment
