research-article

MDedup: duplicate detection with matching dependencies

Authors:

Ioannis Koumarelas,

Thorsten Papenbrock,

Felix Naumann

Proceedings of the VLDB Endowment, Volume 13, Issue 5

Pages 712 - 725

https://doi.org/10.14778/3377369.3377379

Published: 01 January 2020

Abstract

Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of the same real-world entities in (relational) datasets. Existing duplicate detection approaches are effective, but they are also hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific, if not dataset-specific, which is a problem whenever a new dataset needs to be cleaned.

For this reason, we propose a novel, rule-based, and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a model that selects MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting step. Our experiments show that this approach reaches up to 94% F-measure and 100% precision on our evaluation datasets, which are strong results given that the system requires no domain-specific or target-dataset-specific configuration.
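To make the idea of using MDs as duplicate detection rules concrete, the following is a minimal sketch and not the MDedup implementation. It assumes a hypothetical representation in which an already selected MD is a mapping from attribute name to a minimum similarity threshold, uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure, and runs on invented toy records with an invented gold standard. MD discovery, the model that selects MDs, and the boosting step are all out of scope here.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] (a stand-in for any similarity measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def md_matches(r1: dict, r2: dict, md: dict) -> bool:
    """True if the record pair reaches every attribute threshold of the MD."""
    return all(similarity(r1[attr], r2[attr]) >= thr for attr, thr in md.items())


def detect_duplicates(records: list, mds: list) -> set:
    """Return all index pairs (i, j), i < j, that are matched by at least one MD."""
    return {
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(records), 2)
        if any(md_matches(r1, r2, md) for md in mds)
    }


def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Pairwise precision, recall, and F-measure against a gold standard."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: records 0/1 and 2/3 are assumed to be duplicates.
records = [
    {"name": "Jon Smith", "city": "Berlin"},
    {"name": "John Smith", "city": "Berlin"},
    {"name": "Jane Doe", "city": "Potsdam"},
    {"name": "Jane Doe", "city": "Hamburg"},
]
gold = {(0, 1), (2, 3)}

# Hypothetical selected MD: similar names and identical cities imply a duplicate.
mds = [{"name": 0.8, "city": 1.0}]

predicted = detect_duplicates(records, mds)
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(predicted, precision, recall, round(f1, 3))
```

In this toy setting the single strict rule finds only the pair with identical cities and misses the pair whose city differs, which illustrates why rule-based matching tends to be precise but can have low recall. The pairwise comparison is quadratic in the number of records and is only meant for illustration.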

Index terms: Computing methodologies

Published In

Proceedings of the VLDB Endowment, Volume 13, Issue 5

January 2020

195 pages

ISSN: 2150-8097

Editors: Magdalena Balazinska (University of Washington), Xiaofang Zhou (University of Queensland, Australia)

Publisher: VLDB Endowment

Bibliometrics & Citations

Article Metrics

Total Citations: 7
Total Downloads: 236
Downloads (Last 12 months): 39
Downloads (Last 6 weeks): 10

Reflects downloads up to 25 Oct 2024

Cited By

Kuang S, Yang H, Tan Z, Ma S (2024). Efficient Differential Dependency Discovery. Proceedings of the VLDB Endowment 17:7 (1552-1564). Online publication date: 1-Mar-2024
https://dl.acm.org/doi/10.14778/3654621.3654624
Foxcroft J, Christen P, Antonie L (2024). Class Ratio and Its Implications for Reproducibility and Performance in Record Linkage. Advances in Knowledge Discovery and Data Mining (194-205). Online publication date: 7-May-2024
https://dl.acm.org/doi/10.1007/978-981-97-2242-6_16
Buono F, Faggioli G, Paganelli M, Baraldi A, Guerra F, Ferro N (2022). A Framework to Evaluate the Quality of Integrated Datasets. ACM SIGAPP Applied Computing Review 22:4 (5-23). Online publication date: 1-Dec-2022
https://dl.acm.org/doi/10.1145/3584014.3584015
Yao D, Gu Y, Cong G, Jin H, Lv X, Ives Z, Bonifati A, El Abbadi A (2022). Entity Resolution with Hierarchical Graph Attention Networks. Proceedings of the 2022 International Conference on Management of Data (429-442). Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517872
Nashaat M, Ghosh A, Miller J, Quader S (2021). TabReformer: Unsupervised Representation Learning for Erroneous Data Detection. ACM/IMS Transactions on Data Science 2:3 (1-29). Online publication date: 18-May-2021
https://dl.acm.org/doi/10.1145/3447541
Schirmer P, Papenbrock T, Koumarelas I, Naumann F (2020). Efficient Discovery of Matching Dependencies. ACM Transactions on Database Systems 45:3 (1-33). Online publication date: 26-Aug-2020
https://dl.acm.org/doi/10.1145/3392778
Picado J, Davis J, Termehchy A, Lee G, Maier D, Pottinger R, Doan A, Tan W, Alawini A, Ngo H (2020). Learning Over Dirty Data Without Cleaning. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (1301-1316). Online publication date: 11-Jun-2020
https://dl.acm.org/doi/10.1145/3318464.3389708
