DOI: 10.1145/3299869.3319855
research-article

Uni-Detect: A Unified Approach to Automated Error Detection in Tables

Published: 25 June 2019 Publication History

Abstract

Data errors are ubiquitous in tables. Extensive research in this area has resulted in a rich variety of techniques, each often targeting a specific type of error, e.g., numeric outliers, constraint violations, etc. While these diverse techniques clearly improve data quality, they place a significant burden on humans to configure them with suitable rules and parameters for each data set. For example, an expert is expected to define suitable functional dependencies between column pairs, or tune appropriate thresholds for outlier-detection algorithms, all of which are specific to one individual data set. As a result, users today often hire experts to cleanse only their high-value data sets. We propose Uni-Detect, a unified framework to automatically detect diverse types of errors. Our approach employs a novel "what-if" analysis that performs local data perturbations to reason about data abnormality, leveraging classical hypothesis tests on a large corpus of tables. We test Uni-Detect on a wide variety of tables including Wikipedia tables, and make surprising discoveries of thousands of FD violations, numeric outliers, spelling mistakes, etc., with better accuracy than existing algorithms specifically designed for each type of error. For example, for spelling mistakes, Uni-Detect outperforms the state-of-the-art spell-checker from a commercial search engine.
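
The "what-if" idea in the abstract can be pictured with a small sketch for numeric outliers. This is only an illustration under stated assumptions, not the method from the paper: the abnormality metric (a median-absolute-deviation score), the before/after ratio, and the function names `max_mad_score`, `perturbation_ratio`, and `empirical_p_value` are all hypothetical choices made here. The sketch removes the most suspicious value from a column, measures how much the column's abnormality score changes, and asks how often a change that large appears in a reference corpus of columns.

```python
from statistics import median


def max_mad_score(values):
    """Largest robust z-score in a column: |v - median| / MAD.

    Returns 0.0 when the MAD is zero (a simplification for this sketch).
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return 0.0
    return max(abs(v - med) / mad for v in values)


def perturbation_ratio(values):
    """Remove the value farthest from the median and compare scores.

    Returns (suspect_value, score_before / score_after). A large ratio
    means one value alone accounts for most of the column's abnormality.
    """
    med = median(values)
    suspect = max(values, key=lambda v: abs(v - med))
    rest = list(values)
    rest.remove(suspect)  # the local perturbation: drop a single cell
    before = max_mad_score(values)
    after = max_mad_score(rest)
    if after == 0:
        return suspect, float("inf") if before > 0 else 1.0
    return suspect, before / after


def empirical_p_value(observed, corpus_ratios):
    """Share of corpus columns with a ratio at least as large as observed.

    Uses the common (hits + 1) / (n + 1) correction so the estimate is
    never exactly zero.
    """
    hits = sum(1 for r in corpus_ratios if r >= observed)
    return (hits + 1) / (len(corpus_ratios) + 1)
```

A column would be flagged when its empirical p-value falls below a chosen significance level. With a realistic corpus of many thousands of columns the estimate becomes far finer-grained than in a toy example.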

    Published In

    SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
    June 2019
    2106 pages
    ISBN:9781450356435
    DOI:10.1145/3299869
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. constraints
    2. data quality
    3. error detection
    4. outliers

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '19
    Sponsor:
    SIGMOD/PODS '19: International Conference on Management of Data
    June 30 - July 5, 2019
    Amsterdam, Netherlands

    Acceptance Rates

    SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 47
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 13 Jan 2025

    Cited By

    • (2024) GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models. Proceedings of the ACM on Management of Data 2:6 (1-29). DOI: 10.1145/3698811. Online publication date: 20-Dec-2024
    • (2024) Unified Data Framework for Enhanced Data Management, Consumption, Provisioning, Processing and Movement. Proceedings of the 7th International Conference on Networking, Intelligent Systems and Security (1-7). DOI: 10.1145/3659677.3659836. Online publication date: 18-Apr-2024
    • (2024) Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks. Proceedings of the ACM on Management of Data 2:3 (1-28). DOI: 10.1145/3654979. Online publication date: 30-May-2024
    • (2024) Measuring Approximate Functional Dependencies: A Comparative Study. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3505-3518). DOI: 10.1109/ICDE60146.2024.00270. Online publication date: 13-May-2024
    • (2024) RLclean: An Unsupervised Integrated Data Cleaning Framework Based on Deep Reinforcement Learning. Information Sciences (121281). DOI: 10.1016/j.ins.2024.121281. Online publication date: Jul-2024
    • (2024) Relational Data Cleaning Meets Artificial Intelligence: A Survey. Data Science and Engineering. DOI: 10.1007/s41019-024-00266-7. Online publication date: 20-Dec-2024
    • (2023) Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (4991-5003). DOI: 10.1145/3580305.3599776. Online publication date: 6-Aug-2023
    • (2023) A Perceptual Data Cleansing Model (SDCM) for Reducing the Dirty Data. 2023 International Conference on Smart Computing and Application (ICSCA) (1-7). DOI: 10.1109/ICSCA57840.2023.10087605. Online publication date: 5-Feb-2023
    • (2023) Koios: Top-k Semantic Overlap Set Search. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (1531-1543). DOI: 10.1109/ICDE55515.2023.00121. Online publication date: Apr-2023
    • (2023) Data Redundancy Detection Algorithm based on Multidimensional Similarity. 2023 International Conference on Frontiers of Robotics and Software Engineering (FRSE) (180-187). DOI: 10.1109/FRSE58934.2023.00032. Online publication date: Jun-2023
