Automating large-scale data quality verification

Published: 01 August 2018

Abstract

Modern companies and institutions rely on data to guide every single business process and decision. Missing or incorrect information seriously compromises any decision process downstream. Therefore, a crucial, but tedious task for everyone involved in data processing is to verify the quality of their data. We present a system for automating the verification of data quality at scale, which meets the requirements of production use cases. Our system provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables 'unit tests' for data. We efficiently execute the resulting constraint validation workload by translating it to aggregation queries on Apache Spark. Our platform supports the incremental validation of data quality on growing datasets, and leverages machine learning, e.g., for enhancing constraint suggestions, for estimating the 'predictability' of a column, and for detecting anomalies in historic data quality time series. We discuss our design decisions, describe the resulting system architecture, and present an experimental evaluation on various datasets.
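
To make the idea of 'unit tests' for data concrete, the following is a minimal sketch in plain Python, not the system's actual API and not Spark code. It assumes a hypothetical `Check` builder for declaring constraints, a hypothetical `State` object holding running aggregates, and a hypothetical `evaluate` function. Each constraint is reduced to a count of passing rows, which stands in for the aggregation queries the abstract mentions, and calling `update` again on a new batch illustrates incremental validation on a growing dataset.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Any]
Constraint = Tuple[str, str, Optional[Callable[[Any], bool]]]


@dataclass
class Check:
    """Hypothetical declarative check: constraints are only recorded here."""
    constraints: List[Constraint] = field(default_factory=list)

    def is_complete(self, column: str) -> "Check":
        self.constraints.append(("complete", column, None))
        return self

    def is_unique(self, column: str) -> "Check":
        self.constraints.append(("unique", column, None))
        return self

    def satisfies(self, column: str, predicate: Callable[[Any], bool]) -> "Check":
        # User-defined validation code; in this sketch null values count as failing.
        self.constraints.append(("satisfies", column, predicate))
        return self


@dataclass
class State:
    """Running aggregates, keyed by constraint index, reusable across batches."""
    rows: int = 0
    passed: Counter = field(default_factory=Counter)
    seen: Dict[int, set] = field(default_factory=dict)

    def update(self, check: Check, batch: Iterable[Row]) -> None:
        for row in batch:
            self.rows += 1
            for i, (kind, column, predicate) in enumerate(check.constraints):
                value = row.get(column)
                if kind == "complete":
                    self.passed[i] += int(value is not None)
                elif kind == "unique":
                    # Only the first occurrence of a value counts as passing.
                    seen = self.seen.setdefault(i, set())
                    self.passed[i] += int(value not in seen)
                    seen.add(value)
                elif kind == "satisfies":
                    self.passed[i] += int(value is not None and bool(predicate(value)))


def evaluate(check: Check, state: State) -> Dict[str, bool]:
    """A constraint holds when every row seen so far passed it."""
    return {
        f"{kind}({column})": state.passed[i] == state.rows
        for i, (kind, column, _) in enumerate(check.constraints)
    }


check = Check().is_complete("id").is_unique("id").satisfies("price", lambda p: p >= 0)
state = State()

# First batch: all constraints hold.
state.update(check, [{"id": 1, "price": 9.5}, {"id": 2, "price": 0.0}])
first = evaluate(check, state)

# Second batch arrives later; only the new rows are scanned.
state.update(check, [{"id": 2, "price": 3.0}, {"id": None, "price": -1.0}])
second = evaluate(check, state)
```

The design choice worth noting is that the check declaration is separate from the aggregate state, so the same declaration can be re-evaluated after each new batch without rescanning earlier rows. The exact set used for uniqueness is a simplification chosen for this sketch.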

Information & Contributors

Information

Published In

Proceedings of the VLDB Endowment  Volume 11, Issue 12
August 2018
426 pages
ISSN:2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 August 2018
Published in PVLDB Volume 11, Issue 12

Qualifiers

  • Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 268
  • Downloads (Last 6 weeks): 45
Reflects downloads up to 14 Oct 2024

Citations

Cited By

  • (2024) Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines. Proceedings of the VLDB Endowment, 17:11 (3072-3081). DOI: 10.14778/3681954.3681984. Online publication date: 1-Jul-2024
  • (2024) AI Data Readiness Inspector (AIDRIN) for Quantitative Assessment of Data Readiness for AI. Proceedings of the 36th International Conference on Scientific and Statistical Database Management (1-12). DOI: 10.1145/3676288.3676296. Online publication date: 10-Jul-2024
  • (2024) Unsupervised and Supervised Co-learning for Comment-based Codebase Refining and its Application in Code Search. Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (1-12). DOI: 10.1145/3674805.3686664. Online publication date: 24-Oct-2024
  • (2024) Self-tuning Database Systems: A Systematic Literature Review of Automatic Database Schema Design and Tuning. ACM Computing Surveys, 56:11 (1-37). DOI: 10.1145/3665323. Online publication date: 17-May-2024
  • (2024) Data Validation Utilizing Expert Knowledge and Shape Constraints. Journal of Data and Information Quality, 16:2 (1-27). DOI: 10.1145/3661826. Online publication date: 25-Jun-2024
  • (2024) SchemaPile: A Large Collection of Relational Database Schemas. Proceedings of the ACM on Management of Data, 2:3 (1-25). DOI: 10.1145/3654975. Online publication date: 30-May-2024
  • (2024) Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines". Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning (7-11). DOI: 10.1145/3650203.3663327. Online publication date: 9-Jun-2024
  • (2024) An Exploratory Study of V-Model in Building ML-Enabled Software: A Systems Engineering Perspective. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI (30-40). DOI: 10.1145/3644815.3644951. Online publication date: 14-Apr-2024
  • (2024) In-Database Data Imputation. Proceedings of the ACM on Management of Data, 2:1 (1-27). DOI: 10.1145/3639326. Online publication date: 26-Mar-2024
  • (2024) Integrating Data Quality in Industrial Big Data Architectures: An Action Design Research Study. Software Architecture (3-19). DOI: 10.1007/978-3-031-70797-1_1. Online publication date: 1-Sep-2024
