Uni-Detect: A Unified Approach to Automated Error Detection in Tables

Authors:

Yeye He

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

Pages 811 - 828

https://doi.org/10.1145/3299869.3319855

Published: 25 June 2019

Abstract

Data errors are ubiquitous in tables. Extensive research in this area has produced a rich variety of techniques, each often targeting a specific type of error, e.g., numeric outliers or constraint violations. While these diverse techniques clearly improve data quality, they place a significant burden on humans, who must configure them with suitable rules and parameters for each data set. For example, an expert is expected to define suitable functional dependencies between column pairs, or to tune appropriate thresholds for outlier-detection algorithms, all of which are specific to one individual data set. As a result, users today often hire experts to cleanse only their high-value data sets. We propose Uni-Detect, a unified framework to automatically detect diverse types of errors. Our approach employs a novel "what-if" analysis that performs local data perturbations to reason about data abnormality, leveraging classical hypothesis tests on a large corpus of tables. We test Uni-Detect on a wide variety of tables, including Wikipedia tables, and make surprising discoveries of thousands of FD violations, numeric outliers, spelling mistakes, etc., with better accuracy than existing algorithms specifically designed for each type of error. For example, for spelling mistakes, Uni-Detect outperforms the state-of-the-art spell-checker from a commercial search engine.
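
The "what-if" idea in the abstract can be illustrated for the numeric-outlier case with a rough sketch. This is not the paper's algorithm: the MAD-based score, the rule of dropping the value farthest from the median, the empirical p-value with add-one smoothing, and every name below (`max_mad_score`, `perturb`, `what_if_p_value`, `CORPUS`) are assumptions made purely for illustration. The sketch perturbs a column by removing its most suspicious value, measures how much an abnormality score falls, and then asks how often a comparable fall is seen across a (tiny, made-up) corpus of columns.

```python
from statistics import median


def max_mad_score(values):
    """Largest robust z-score in a column: max |v - median| / MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return 0.0
    return max(abs(v - med) for v in values) / mad


def perturb(values):
    """The 'what-if' step: drop the single value farthest from the median."""
    med = median(values)
    worst = max(range(len(values)), key=lambda i: abs(values[i] - med))
    return values[:worst] + values[worst + 1:]


def score_drop(values):
    """How much the abnormality score falls once the perturbation is applied."""
    return max_mad_score(values) - max_mad_score(perturb(values))


def what_if_p_value(column, corpus):
    """Empirical p-value: share of corpus columns whose score drop is at
    least as large as the drop observed for `column` (add-one smoothing).
    Assumes every column is a list with at least three numbers."""
    observed = score_drop(column)
    as_extreme = sum(1 for c in corpus if score_drop(c) >= observed)
    return (as_extreme + 1) / (len(corpus) + 1)


# Hypothetical stand-in for a large corpus of mostly clean columns.
CORPUS = [
    [1, 2, 3, 4, 5],
    [10, 11, 12, 13, 14],
    [100, 102, 104, 106, 108],
]

suspicious = what_if_p_value([10, 11, 12, 13, 1300], CORPUS)
clean = what_if_p_value([20, 21, 22, 23, 24], CORPUS)
```

With this toy corpus, `suspicious` is much smaller than `clean`, so only the first column would be flagged. A realistic corpus would contain far more columns, and the smallest attainable p-value, 1 / (len(corpus) + 1), shrinks accordingly.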

Index Terms

Information systems → Data management systems → Information integration → Data cleaning

Published In

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

June 2019

2106 pages

ISBN: 9781450356435

DOI: 10.1145/3299869

General Chairs: Peter Boncz (CWI & Vrije Universiteit Amsterdam, The Netherlands), Stefan Manegold (CWI & Universiteit Leiden, The Netherlands)

Program Chairs: Anastasia Ailamaki (EPFL, Switzerland), Amol Deshpande (University of Maryland, USA), Tim Kraska (MIT, USA)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '19

Sponsor:

SIGMOD

SIGMOD/PODS '19: International Conference on Management of Data

June 30 - July 5, 2019

Amsterdam, Netherlands

Acceptance Rates

SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
