research-article

MDedup: duplicate detection with matching dependencies

Published: 01 January 2020 Publication History

Abstract

Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of the same real-world entities in (relational) datasets. Existing duplicate detection approaches are effective, but they are also hard to parameterize or require large amounts of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific, if not dataset-specific, which is a problem whenever a new dataset needs to be cleaned.
For this reason, we propose a novel, rule-based, and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a model that selects MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting step. Our experiments show that this approach reaches up to 94% F-measure and 100% precision on our evaluation datasets, which are strong results considering that the system requires no configuration specific to the domain or the target data.
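
To make the idea of an MD used as a duplicate detection rule concrete, the following is a minimal sketch and not the MDedup implementation. It assumes an MD can be read as "if two records are similar on these attributes above these thresholds, treat them as duplicates". The attribute names, thresholds, sample records, gold standard, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical MD used as a rule: records that are similar on "name"
# (>= 0.8) and on "city" (>= 0.9) are declared duplicates.
MD_RULE = {"name": 0.8, "city": 0.9}


def md_matches(r1: dict, r2: dict, rule: dict) -> bool:
    """True if the record pair satisfies every similarity threshold of the rule."""
    return all(similarity(r1[attr], r2[attr]) >= t for attr, t in rule.items())


def detect_duplicates(records: list, rule: dict) -> set:
    """Compare all record pairs and return the index pairs the rule matches."""
    return {
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(records), 2)
        if md_matches(r1, r2, rule)
    }


def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Pairwise precision, recall, and F-measure against a gold standard."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up sample data; the gold standard lists the true duplicate pairs.
records = [
    {"name": "Jon Smith", "city": "Berlin"},
    {"name": "John Smith", "city": "Berlin"},
    {"name": "Mary Jones", "city": "Potsdam"},
    {"name": "Mary Jones", "city": "Potsdam"},
    {"name": "Peter Miller", "city": "Hamburg"},
    {"name": "P. Miller", "city": "Hamburg"},
]
gold = {(0, 1), (2, 3), (4, 5)}

predicted = detect_duplicates(records, MD_RULE)
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(sorted(predicted), precision, recall, f1)
```

In this toy data the rule finds only correct pairs but misses the abbreviated name, which illustrates the pattern the abstract describes: rule-based matching tends toward high precision and lower recall, which is what the proposed boosting step is meant to address.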

Published In

Proceedings of the VLDB Endowment (PVLDB), Volume 13, Issue 5
January 2020
195 pages
ISSN: 2150-8097

Publisher

VLDB Endowment


