research-article

Learning Over Dirty Data Without Cleaning

Authors:

Arash Termehchy,

Ga Young Lee

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 1301 - 1316

https://doi.org/10.1145/3318464.3389708

Published: 31 May 2020

Abstract

Real-world datasets are dirty and contain many errors, such as violations of integrity constraints and duplicate entities. Learning over dirty databases may result in inaccurate models. Data scientists spend most of their time preparing data and repairing errors to create clean databases for learning. Moreover, because the information required to repair these errors is often unavailable, a dirty database may have numerous possible clean versions. We propose Dirty Learn (DLearn), a novel learning system that learns directly over dirty databases effectively and efficiently without any preprocessing. DLearn leverages database constraints to learn accurate relational models over inconsistent and heterogeneous data. Its learned models represent patterns over all possible clean versions of the data in a usable form. Our empirical study indicates that DLearn learns accurate models over large real-world databases efficiently.
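
To make the notion of "numerous possible clean versions" concrete, the following is a minimal sketch and not DLearn's algorithm. It assumes a hypothetical toy relation `movies(title, year)` and an assumed constraint that `title` determines `year`, and it enumerates every maximal consistent subset of the rows. The relation, the constraint, and the function name `possible_repairs` are illustrative assumptions that do not appear in the paper.

```python
from itertools import product

# Hypothetical toy relation (title, year); assumed constraint: title -> year.
movies = [
    ("Star Wars", 1977),
    ("Star Wars", 1978),  # conflicts with the row above under title -> year
    ("Alien", 1979),
]


def possible_repairs(rows):
    """Enumerate every clean version that keeps exactly one year per title."""
    candidates = {}
    for title, year in rows:
        candidates.setdefault(title, set()).add(year)
    titles = sorted(candidates)
    # One list of candidate years per title, in the same order as `titles`.
    choices = [sorted(candidates[t]) for t in titles]
    # Each combination picks one year for every title, giving one clean version.
    return [list(zip(titles, combo)) for combo in product(*choices)]


repairs = possible_repairs(movies)
for repair in repairs:
    print(repair)
```

In this sketch the number of clean versions is the product of the number of conflicting values per title, so it can grow exponentially with the number of violations. That growth is one reason why representing patterns over all clean versions, rather than materializing each one, is attractive.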

Supplementary Material

Presentation video: MP4 file (3318464.3389708.mp4), 118.20 MB

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN: 9781450367356

DOI: 10.1145/3318464

General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)

Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)

Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '20

Sponsor:

SIGMOD

SIGMOD/PODS '20: International Conference on Management of Data

June 14 - 19, 2020

Portland, OR, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

Zhen C, Aryal N, Termehchy A, Chabada A. (2024). Certain and Approximately Certain Models for Statistical Learning. Proceedings of the ACM on Management of Data 2:3, 1-25. Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3654929
Perini M, Nikolic M. (2024). In-Database Data Imputation. Proceedings of the ACM on Management of Data 2:1, 1-27. Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639326
Zhen C, Chabada A, Termehchy A, Boehm M, Hulsebos M, Shankar S, Varma P. (2023). When Can We Ignore Missing Data in Model Training? Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning, 1-4. Online publication date: 18-Jun-2023.
https://dl.acm.org/doi/10.1145/3595360.3595854
Sukprasert P, Chan G, Rossi R, Du F, Koh E. (2023). Discovery and Matching Numerical Attributes in Data Lakes. 2023 IEEE International Conference on Big Data (BigData), 423-432. Online publication date: 15-Dec-2023.
https://doi.org/10.1109/BigData59044.2023.10386080
Wu Y, Zhang Z, Wang G. (2021). Correcting Large Knowledge Bases Using Guided Inductive Logic Learning Rules. PRICAI 2021: Trends in Artificial Intelligence, 556-571. Online publication date: 8-Nov-2021.
https://dl.acm.org/doi/10.1007/978-3-030-89188-6_42
