DOI: 10.1145/3448016.3457250

Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes

Published: 18 June 2021 Publication History

Abstract

Complex data pipelines are increasingly common in diverse applications such as BI reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI reports need to be refreshed and ML models need to be retrained. However, it is widely reported that in complex production pipelines, upstream data feeds can change in unexpected ways, causing downstream applications to break silently in ways that are expensive to resolve. Data validation has thus become an important topic, as evidenced by notable recent efforts from Google and Amazon, where the objective is to catch data quality issues early as they arise in the pipelines. Our experience on production data suggests, however, that on string-valued data, these existing approaches yield high false-positive rates and frequently require human intervention. In this work, we develop a corpus-driven approach to auto-validate machine-generated data by inferring suitable data-validation "patterns" that accurately describe the underlying data domain, minimizing false positives while maximizing the number of data quality issues caught. Evaluations using production data from real data lakes suggest that Auto-Validate is substantially more effective than existing methods. Part of this technology ships as an Auto-Tag feature in Microsoft Azure Purview.
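To make the idea of a data-validation "pattern" for string-valued data concrete, the following is a minimal sketch and not the Auto-Validate algorithm itself. It generalizes each string into a coarse token pattern, keeps the patterns seen in historical data, and flags a new batch when too many values fall outside them. The function names (generalize, infer_patterns, validate), the token classes, and both thresholds are hypothetical choices made for illustration. The corpus-driven selection of patterns from a data lake described in the abstract is not modeled here.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of ASCII digits, runs of ASCII letters,
# and any other single character kept as a literal.
_TOKEN = re.compile(r"(?P<digit>[0-9]+)|(?P<letter>[A-Za-z]+)|(?P<other>.)", re.DOTALL)


def generalize(value: str) -> str:
    """Map a string to a coarse pattern, e.g. '2021-06-18' -> '<digit>+-<digit>+-<digit>+'."""
    parts = []
    for m in _TOKEN.finditer(value):
        if m.lastgroup == "other":
            parts.append(m.group())           # keep punctuation and spaces as-is
        else:
            parts.append(f"<{m.lastgroup}>+")  # collapse digit/letter runs
    return "".join(parts)


def infer_patterns(train_values, min_support=0.05):
    """Keep every pattern covering at least min_support of the historical values.

    min_support is an assumed knob, not a parameter taken from the paper.
    """
    counts = Counter(generalize(v) for v in train_values)
    total = sum(counts.values())
    return {p for p, c in counts.items() if c / total >= min_support}


def validate(new_values, patterns, max_violation_rate=0.01):
    """Return (ok, violation_rate) for a new batch of values."""
    values = list(new_values)
    if not values:
        return True, 0.0
    violations = sum(generalize(v) not in patterns for v in values)
    rate = violations / len(values)
    return rate <= max_violation_rate, rate


# Usage: a date column whose upstream format changes in a later batch.
history = ["2021-06-18", "2021-06-19", "2021-07-01"]
patterns = infer_patterns(history)
ok, rate = validate(["2021-08-02", "Aug 3 2021"], patterns)
```

A pattern that is too specific raises false alarms on valid future data, while one that is too general misses real issues. This sketch fixes a single generalization level and so does not address that trade-off, which is the part the paper tackles using corpus statistics.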

Supplementary Material

MP4 File (3448016.3457250.mp4)
Complex data pipelines are increasingly common in applications like ETL, BI, and ML. These pipelines often recur on a regular basis (e.g., daily or weekly), as downstream data products need to be refreshed regularly. However, it is widely reported that in complex production pipelines, upstream data feeds often change in unexpected ways, causing downstream applications to break silently and requiring substantial human effort to resolve. Data validation has thus become an important topic, as evidenced by notable recent efforts from Google and Amazon, where the objective is to catch data quality issues early as they arise in the pipelines. In this work, we develop a corpus-driven approach to automatically infer data-validation "patterns" for string-valued data, where the goal is to maximize the data quality issues that can be caught while minimizing false positives. We evaluate the proposed Auto-Validate using real production data from an enterprise data lake. Results suggest that Auto-Validate is substantially more accurate in catching quality issues compared to existing methods.


    Published In

    SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
    June 2021
    2969 pages
    ISBN:9781450383431
    DOI:10.1145/3448016
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data lake
    2. data pipelines
    3. data quality
    4. data validation

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '21

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months)67
    • Downloads (Last 6 weeks)8
    Reflects downloads up to 12 Jan 2025

    Cited By

    • (2024) Searching Data Lakes for Nested and Joined Data. Proceedings of the VLDB Endowment 17:11 (3346-3359). DOI: 10.14778/3681954.3682005. Online publication date: 30-Aug-2024
    • (2024) Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks. Proceedings of the ACM on Management of Data 2:3 (1-28). DOI: 10.1145/3654979. Online publication date: 30-May-2024
    • (2024) Security for Machine Learning-based Software Systems: A Survey of Threats, Practices, and Challenges. ACM Computing Surveys 56:6 (1-38). DOI: 10.1145/3638531. Online publication date: 23-Feb-2024
    • (2023) Analytical Review of Data Lakes and Perspectives of Application in the Field of Education. Vìsnik Nacìonalʹnogo unìversitetu "Lʹvìvsʹka polìtehnìka". Serìâ Ìnformacìjnì sistemi ta merežì 14 (373-382). DOI: 10.23939/sisn2023.14.373. Online publication date: 29-Dec-2023
    • (2023) DeepJoin: Joinable Table Discovery with Pre-Trained Language Models. Proceedings of the VLDB Endowment 16:10 (2458-2470). DOI: 10.14778/3603581.3603587. Online publication date: 1-Jun-2023
    • (2023) Data Lakes: A Survey of Functions and Systems. IEEE Transactions on Knowledge and Data Engineering 35:12 (12571-12590). DOI: 10.1109/TKDE.2023.3270101. Online publication date: 25-Apr-2023
    • (2021) Semantic programming by example with pre-trained models. Proceedings of the ACM on Programming Languages 5:OOPSLA (1-25). DOI: 10.1145/3485477. Online publication date: 15-Oct-2021
