Transforming Pairwise Duplicates to Entity Clusters for High-quality Duplicate Detection

Published: 07 December 2019

    Abstract

    Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: not all records within a cluster are sufficiently similar to be classified as duplicates. Thus, one of many subsequent clustering algorithms can further improve the result.
    We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, like many other clustering algorithms, focuses instead on the edge weights. For evaluation, in contrast to related work, we experiment on true real-world datasets and, in addition, examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify the best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
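
The inconsistency of transitive clusters mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the article and does not implement EMCC or any of the evaluated algorithms. The function names `transitive_clusters` and `missing_pairs` and the record identifiers are hypothetical. The sketch builds clusters as connected components of the pairwise duplicate graph (the transitive closure) and then lists record pairs within a cluster that were never classified as duplicates.

```python
from itertools import combinations


def transitive_clusters(records, duplicate_pairs):
    """Group records into connected components (transitive closure) via union-find."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())


def missing_pairs(cluster, duplicate_pairs):
    """Return pairs inside a cluster that were never classified as duplicates."""
    known = {frozenset(p) for p in duplicate_pairs}
    return [
        pair
        for pair in combinations(sorted(cluster), 2)
        if frozenset(pair) not in known
    ]


# Hypothetical input: r1~r2 and r2~r3 were classified as duplicates, r1~r3 was not.
records = ["r1", "r2", "r3", "r4"]
pairs = [("r1", "r2"), ("r2", "r3")]

clusters = transitive_clusters(records, pairs)
for cluster in clusters:
    print(sorted(cluster), "missing:", missing_pairs(cluster, pairs))
```

A cluster with an empty list of missing pairs is a clique in the duplicate graph and is consistent in the sense used above. A cluster with missing pairs is a candidate for refinement by a subsequent clustering step.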

    Published In

    Journal of Data and Information Quality, Volume 12, Issue 1
    ON THE HORIZON, CHALLENGE PAPER, REGULAR PAPERS, and EXPERIENCE PAPER
    March 2020
    110 pages
    ISSN: 1936-1955
    EISSN: 1936-1963
    DOI: 10.1145/3372130

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 07 December 2019
    Accepted: 01 July 2019
    Revised: 01 April 2019
    Received: 01 July 2018
    Published in JDIQ Volume 12, Issue 1

    Author Tags

    1. record linkage
    2. clustering
    3. data matching
    4. deduplication
    5. entity resolution

    Qualifiers

    • Research-article
    • Research
    • Refereed
