DOI:10.1145/3318464.3389708
research-article

Learning Over Dirty Data Without Cleaning

Published: 31 May 2020
  • Abstract

    Real-world datasets are dirty and contain many errors, such as violations of integrity constraints and duplicate entities. Learning over dirty databases may produce inaccurate models. Data scientists spend most of their time preparing data and repairing errors to create clean databases for learning. Moreover, because the information required to repair these errors is often unavailable, a dirty database may have numerous possible clean versions. We propose Dirty Learn (DLearn), a novel learning system that learns directly over dirty databases effectively and efficiently, without any preprocessing. DLearn leverages database constraints to learn accurate relational models over inconsistent and heterogeneous data. Its learned models represent patterns over all possible clean versions of the data in a usable form. Our empirical study indicates that DLearn efficiently learns accurate models over large real-world databases.
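
As a rough illustration of why a dirty database can have several clean versions, the sketch below enumerates value repairs for a toy table that violates a functional dependency, then checks which patterns hold in every repair. The table, the dependency title -> year, and the helper names `value_repairs` and `holds_in_all` are hypothetical and do not come from the paper. Brute-force enumeration is used only for clarity and is not DLearn's algorithm.

```python
from itertools import product

# Hypothetical toy table of (title, year) rows.
# The assumed functional dependency title -> year is violated by "Star Wars".
movies = [
    ("Star Wars", 1977),
    ("Star Wars", 1997),
    ("Alien", 1979),
]


def value_repairs(rows):
    """Enumerate every clean version obtained by keeping, for each title,
    exactly one of the years observed for that title."""
    candidates = {}
    for title, year in rows:
        candidates.setdefault(title, set()).add(year)
    titles = sorted(candidates)
    options = [sorted(candidates[t]) for t in titles]
    # One repair per combination of choices across all titles.
    return [dict(zip(titles, choice)) for choice in product(*options)]


def holds_in_all(repairs, predicate):
    """Return True only if the predicate is satisfied in every clean version."""
    return all(predicate(repair) for repair in repairs)


repairs = value_repairs(movies)
for repair in repairs:
    print(repair)

# A pattern that survives every repair versus one that depends on the repair chosen.
print(holds_in_all(repairs, lambda r: r["Star Wars"] < 2000))
print(holds_in_all(repairs, lambda r: r["Star Wars"] == 1977))
```

The number of repairs enumerated this way is the product of the number of candidate values per conflicting group, so it grows quickly with the number of violations. That is one reason a compact representation over all clean versions is attractive.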

    Supplementary Material

    MP4 File (3318464.3389708.mp4)
    Presentation Video

    Published In

    SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
    June 2020
    2925 pages
    ISBN:9781450367356
    DOI:10.1145/3318464
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data cleaning
    2. data integration
    3. machine learning
    4. relational learning

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '20
    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Cited By

    • (2024) Certain and Approximately Certain Models for Statistical Learning. Proceedings of the ACM on Management of Data 2:3 (1-25). DOI:10.1145/3654929. Online publication date: 30-May-2024
    • (2024) In-Database Data Imputation. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI:10.1145/3639326. Online publication date: 26-Mar-2024
    • (2023) When Can We Ignore Missing Data in Model Training? Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning (1-4). DOI:10.1145/3595360.3595854. Online publication date: 18-Jun-2023
    • (2023) Discovery and Matching Numerical Attributes in Data Lakes. 2023 IEEE International Conference on Big Data (BigData) (423-432). DOI:10.1109/BigData59044.2023.10386080. Online publication date: 15-Dec-2023
    • (2021) Correcting Large Knowledge Bases Using Guided Inductive Logic Learning Rules. PRICAI 2021: Trends in Artificial Intelligence (556-571). DOI:10.1007/978-3-030-89188-6_42. Online publication date: 8-Nov-2021