Transforming Pairwise Duplicates to Entity Clusters for High-quality Duplicate Detection

Authors:

Peter Christen,

Felix Naumann

Journal of Data and Information Quality (JDIQ), Volume 12, Issue 1

Article No.: 3, Pages 1 - 30

https://doi.org/10.1145/3352591

Published: 07 December 2019

Abstract

Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: not all records within a cluster are sufficiently similar to be classified as duplicates. Thus, applying one of many subsequent clustering algorithms can further improve the result.
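To make the inconsistency concrete, the following is a minimal sketch, not taken from the article, of how transitive clusters arise from pairwise decisions and how a cluster can contain record pairs that were never classified as duplicates. The function names `transitive_clusters` and `missing_pairs` and the record identifiers are hypothetical, chosen for illustration only.

```python
from itertools import combinations


def transitive_clusters(pairs):
    """Group records into connected components of the duplicate-pair graph."""
    parent = {}

    def find(x):
        # Register unseen records as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return list(clusters.values())


def missing_pairs(cluster, pairs):
    """Return pairs inside a cluster that were never classified as duplicates."""
    declared = {frozenset(p) for p in pairs}
    return [
        (a, b)
        for a, b in combinations(sorted(cluster), 2)
        if frozenset((a, b)) not in declared
    ]


# Hypothetical pairwise results: r1~r2 and r2~r3, but r1 and r3 were not matched.
pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
clusters = transitive_clusters(pairs)
```

Here the transitive closure places r1, r2, and r3 in one cluster, yet `missing_pairs` reports (r1, r3) as a pair the classifier never accepted. A non-empty result marks a cluster as inconsistent in the sense described above.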

We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, like many other clustering algorithms, focuses instead on the edge weights. For evaluation, in contrast to related work, we experiment on true real-world datasets and additionally examine in detail various pair-selection strategies used in practice. While no overall winner emerges, we identify the best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. In particular, EMCC outperforms Markov Clustering in precision and has the additional advantage that it can be used in scenarios where edge weights are not available.
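As an illustration of clustering that uses only the graph structure and no edge weights, the sketch below repeatedly extracts the largest clique from the duplicate-pair graph and treats it as a cluster. This is a simplified assumption about how a clique-based approach can work; it is not the EMCC algorithm from the article, whose details are not given in this abstract. The function names and example records are hypothetical, and the clique enumeration (Bron-Kerbosch without pivoting) has exponential worst-case running time.

```python
def bron_kerbosch(adj, r, p, x):
    """Yield every maximal clique extending r with candidates p, excluding x."""
    if not p and not x:
        yield r
        return
    for v in sorted(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}
        x = x | {v}


def clique_clustering(pairs):
    """Greedy partitioning: take the largest clique, remove it, repeat."""
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    clusters = []
    remaining = set(adj)
    while remaining:
        # Restrict the graph to records not yet assigned to a cluster.
        sub = {v: adj[v] & remaining for v in remaining}
        cliques = bron_kerbosch(sub, frozenset(), frozenset(remaining), frozenset())
        # Largest clique first; ties are broken by sorted member names.
        best = min(cliques, key=lambda c: (-len(c), sorted(c)))
        clusters.append(set(best))
        remaining -= best
    return clusters


# Hypothetical input: a, b, c are mutually matched; d is matched only to c.
pairs = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
result = clique_clustering(pairs)
```

With this input, the transitive closure would merge all four records, whereas the clique-based sketch keeps a, b, and c together and leaves d as a singleton, so every pair inside a cluster was actually classified as a duplicate.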

Index Terms

Information systems

Published In

Journal of Data and Information Quality Volume 12, Issue 1

ON THE HORIZON, CHALLENGE PAPER, REGULAR PAPERS, and EXPERIENCE PAPER

March 2020

110 pages

ISSN: 1936-1955

EISSN: 1936-1963

DOI: 10.1145/3372130

Editor:
Tiziana Catarci
Sapienza University of Rome, Rome, Italy

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 December 2019

Accepted: 01 July 2019

Revised: 01 April 2019

Received: 01 July 2018

Published in JDIQ Volume 12, Issue 1

Qualifiers

Research-article
Research
Refereed
