research-article

Framework for evaluating clustering algorithms in duplicate detection

Authors:

Oktie Hassanzadeh,

Hyun Chul Lee, and

Renée J. MillerAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 2, Issue 1

Pages 1282 - 1293

https://doi.org/10.14778/1687627.1687771

Published: 01 August 2009 Publication History

Abstract

The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution also known as duplication detection or record linkage is used as a part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system that provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general purpose duplication detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by the recent significant advancements that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never been considered for duplicate detection, perform extremely well in terms of both accuracy and scalability.

References

[1]

N. Ailon, M. Charikar, and A. Newman. Aggregating Inconsistent Information: Ranking and Clustering. In ACM Symp. on Theory of Computing (STOC), pages 684--693, 2005.

Digital Library

[2]

P. Andritsos. Scalable Clustering of Categorical Data And Applications. PhD thesis, University of Toronto, Toronto, Canada, September 2004.

Digital Library

[3]

A. Arasu, V. Ganti, and R. Kaushik. Efficient Exact Set-Similarity Joins. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 918--929, 2006.

Digital Library

[4]

J. A. Aslam, E. Pelekhov, and D. Rus. The Star Clustering Algorithm For Static And Dynamic Information Organization. Journal of Graph Algorithms and Applications, 8(1):95--129, 2004.

[5]

N. Bansal, A. Blum, and S. Chawla. Correlation Clustering. Machine Learning, 56(1--3):89--113, 2004.

Digital Library

[6]

N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa. Seeking Stable Clusters In The Blogosphere. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 806--817, Vienna, Austria, 2007.

Digital Library

[7]

R. J. Bayardo, Y. Ma, and R. Srikant. Scaling Up All Pairs Similarity Search. In Int'l World Wide Web Conference (WWW), pages 131--140, Banff, Canada, 2007.

Digital Library

[8]

I. Bhattacharya and L. Getoor. A Latent Dirichlet Model for Unsupervised Entity Resolution. In Proc. of the SIAM International Conference on Data Mining (SDM), pages 47--58, Bethesda, MD, USA, 2006.

[9]

I. Bhattacharya and L. Getoor. Collective Entity Resolution in Relational Data. IEEE Data Engineering Bulletin, 29(2):4--12, 2006.

[10]

U. Brandes, M. Gaertler, and D. Wagner. Experiments on Graph Clustering Algorithms. In The 11th Europ. Symp. Algorithms, pages 568--579. Springer-Verlag, 2003.

[11]

S. Brohee and J. van Helden. Evaluation of Clustering Algorithms for Protein-Protein Interaction Networks. BMC Bioinformatics, 7:488+, 2006.

[12]

M. Charikar, V. Guruswami, and A. Wirth. Clustering with Qualitative Information. J. Comput. Syst. Sci., 71(3):360--383, 2005.

Digital Library

[13]

S. Chaudhuri, V. Ganti, and R. Motwani. Robust Identification of Fuzzy Duplicates. In IEEE Proc. of the Int'l Conf. on Data Eng., pages 865--876, Washington, DC, USA, 2005.

Digital Library

[14]

D. Cheng, R. Kannan, S. Vempala, and G. Wang. On a Recursive Spectral Algorithm for Clustering from Pairwise Similarities. Technical Report MIT-LCS-TR-906, MIT LCS, 2003.

[15]

F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, and E. Upfal. Finding Near Neighbors Through Cluster Pruning. In Proc. of the ACM Symp. on Principles of Database Systems (PODS), pages 103--112, Beijing, China, 2007.

Digital Library

[16]

W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A Comparison of String Distance Metrics for Name-Matching Tasks. In Proc. of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pages 73--78, Acapulco, Mexico, 2003.

[17]

T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw Hill and MIT Press, 1990.

Digital Library

[18]

W. H. Day and H. Edelsbrunner. Efficient Algorithms for Agglomerative Hierarchical Clustering Methods. Journal of Classification, 1(1):7--24, 1984.

[19]

E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation Clustering In General Weighted Graphs. Theor. Comput. Sci., 361(2):172--187, 2006.

Digital Library

[20]

E. A. Dinic. Algorithm for Solution of a Problem Of Maximum Flow in Networks with Power Estimation. Soviet Math. Dokl, 11:1277--1280, 1970.

[21]

J. Edmonds and R. M. Karp. Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems. J. ACM, 19(2):248--264, 1972.

Digital Library

[22]

P. F. Felzenszwalb and D. P. Huttenlocher. Efficient Graph-Based Image Segmentation. Int. J. Comput. Vision, 59(2):167--181, 2004.

Digital Library

[23]

M. Filippone, F. Camastra, F. Masulli, and S. Rovetta. A Survey of Kernel and Spectral Methods for Clustering. Pattern Recognition, 41(1):176--190, 2008.

Digital Library

[24]

G. W. Flake, R. E. Tarjan, and K. Tsioutsiouliklis. Graph Clustering and Minimum Cut Trees. Internet Mathematics, 1(4):385--408, 2004.

[25]

L. Ford and D. Fulkerson. Maximal Flow Through a Network. Canadian J. Math, 8:399--404, 1956.

[26]

D. Gibson, R. Kumar, and A. Tomkins. Discovering Large Dense Subgraphs in Massive Graphs. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 721--732, 2005.

Digital Library

[27]

A. V. Goldberg and R. E. Tarjan. A New Approach to the Maximum-Flow Problem. Journal of the ACM, 35(4):921--940, 1988.

Digital Library

[28]

M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. journal, 17(2--3):107--145, 2001.

Digital Library

[29]

O. Hassanzadeh. Benchmarking Declarative Approximate Selection Predicates. Master's thesis, University of Toronto, February 2007.

[30]

O. Hassanzadeh and R. J. Miller. Creating Probabilistic Databases from Duplicated Data. Technical Report CSRG-568, University of Toronto, To appear in The VLDB Journal, Accepted on 26 June 2009.

Digital Library

[31]

O. Hassanzadeh, M. Sadoghi, and R. J. Miller. Accuracy of Approximate String Joins Using Grams. In Proc. of the International Workshop on Quality in Databases (QDB), pages 11--18, Vienna, Austria, 2007.

[32]

T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. In Proc. of the Int'l Workshop on the Web and Databases (WebDB), pages 129--134, Dallas, Texas, USA, 2000.

[33]

M. A. Hernández and S. J. Stolfo. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998.

Digital Library

[34]

L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Res., 9(11):1106--1115, 1999.

[35]

A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

Digital Library

[36]

A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 31(3):264--323, 1999.

Digital Library

[37]

R. Kannan, S. Vempala, and A. Vetta. On clusterings: Good, bad and spectral. Journal of the ACM, 51(3):497--515, 2004.

Digital Library

[38]

Y. J. Kim and J. M. Patel. A Framework for Protein Structure Classification and Identification of Novel Protein Structures. BMC Bioinformatics, 7:456+, 2006.

[39]

A. D. King. Graph Clustering with Restricted Neighbourhood Search. Master's thesis, University of Toronto, 2004.

[40]

J. Kogan. Introduction to Clustering Large and High-Dimensional Data. Cambridge Univ. Press, 2007.

Digital Library

[41]

C. Li, B. Wang, and X. Yang. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 303--314, Vienna, Austria, 2007.

Digital Library

[42]

A. M. D. Pelleg. X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters. In Proc. of the Int'l Conf. on Machine Learning, pages 727--734, San Francisco, CA, USA, 2000.

Digital Library

[43]

S. Sarawagi and A. Kirpal. Efficient Set Joins On Similarity Predicates. In ACM SIGMOD Int'l Conf. on the Mgmt. of Data, pages 743--754, Paris, France, 2004.

Digital Library

[44]

N. Slonim. The Information Bottleneck: Theory And Applications. PhD thesis, The Hebrew University, 2003.

[45]

C. Swamy. Correlation Clustering: Maximizing Agreements Via Semidefinite Programming. In Proc. of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 526--527, New Orleans, Louisiana, USA, 2004.

Digital Library

[46]

C. Umans. Hardness of Approximating Sigma₂^p Minimization Problems. In Symp. on Foundations of Computer Science (FOCS), pages 465--474, 1999.

Digital Library

[47]

S. van Dongen. Graph Clustering By Flow Simulation. PhD thesis, University of Utrecht, 2000.

[48]

J. A. Whitney. Graph Clustering With Overlap. Master's thesis, University of Toronto, 2006.

[49]

D. T. Wijaya and S. Bressan. Ricochet: A Family of Unconstrained Algorithms for Graph Clustering. In Proc. of the Int'l Conf. on Database Systems for Advanced Applications (DASFAA), pages 153--167, Brisbane, Australia, 2009.

Digital Library

[50]

R. Xu and I. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645--678, 2005.

Digital Library

[51]

J. Zupan. Clustering of Large Data Sets. Research Studies Press, 1982.

Cited By

Nikoletos KIoannou EPapadakis G(2024)The Five Generations of Entity Resolution on Web DataWeb Engineering10.1007/978-3-031-62362-2_46(469-473)Online publication date: 17-Jun-2024
https://dl.acm.org/doi/10.1007/978-3-031-62362-2_46
Dalamagkas CGeorgakis AHrissagis-Chrysagis KPapadakis G(2024)The Open V2X Management PlatformWeb Engineering10.1007/978-3-031-62362-2_10(133-146)Online publication date: 17-Jun-2024
https://dl.acm.org/doi/10.1007/978-3-031-62362-2_10
Charikar MHenzinger MHu LVötsch MWaingarten EOh ANaumann TGloberson ASaenko KHardt MLevine S(2023)Simple, scalable and effective clustering via one-dimensional projectionsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3668942(64618-64649)Online publication date: 10-Dec-2023
https://dl.acm.org/doi/10.5555/3666122.3668942
Show More Cited By

Index Terms

Framework for evaluating clustering algorithms in duplicate detection
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Document representation
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

Transforming Pairwise Duplicates to Entity Clusters for High-quality Duplicate Detection
ON THE HORIZON, CHALLENGE PAPER, REGULAR PAPERS, and EXPERIENCE PAPER

Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records ...
Read More
Duplicate Record Detection: A Survey

Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription ...
Read More
Evaluating indeterministic duplicate detection results
SUM'12: Proceedings of the 6th international conference on Scalable Uncertainty Management

Duplicate detection is an important process for cleaning or integrating data. Since real-life data is often polluted, detecting duplicates usually comes along with uncertainty. To handle duplicate uncertainty in an appropriate way, indeterministic ...
Read More

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 2, Issue 1

August 2009

1293 pages

ISSN:2150-8097

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 August 2009

Published in PVLDB Volume 2, Issue 1

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

71
Total Citations
View Citations
858
Total Downloads

Downloads (Last 12 months)31
Downloads (Last 6 weeks)3

Other Metrics

View Author Metrics

Citations

Cited By

Nikoletos KIoannou EPapadakis G(2024)The Five Generations of Entity Resolution on Web DataWeb Engineering10.1007/978-3-031-62362-2_46(469-473)Online publication date: 17-Jun-2024
https://dl.acm.org/doi/10.1007/978-3-031-62362-2_46
Dalamagkas CGeorgakis AHrissagis-Chrysagis KPapadakis G(2024)The Open V2X Management PlatformWeb Engineering10.1007/978-3-031-62362-2_10(133-146)Online publication date: 17-Jun-2024
https://dl.acm.org/doi/10.1007/978-3-031-62362-2_10
Charikar MHenzinger MHu LVötsch MWaingarten EOh ANaumann TGloberson ASaenko KHardt MLevine S(2023)Simple, scalable and effective clustering via one-dimensional projectionsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3668942(64618-64649)Online publication date: 10-Dec-2023
https://dl.acm.org/doi/10.5555/3666122.3668942
Piché DFont LZouaq AGagnon M(2023)Comparing Heuristic Rules and Masked Language Models for Entity Alignment in the Literature DomainJournal on Computing and Cultural Heritage 10.1145/360669916:3(1-18)Online publication date: 9-Aug-2023
https://dl.acm.org/doi/10.1145/3606699
Wu NVatsalan DKaafar MRamesh S(2023)Privacy-Preserving Record Linkage for Cardinality CountingProceedings of the 2023 ACM Asia Conference on Computer and Communications Security10.1145/3579856.3590338(53-64)Online publication date: 10-Jul-2023
https://dl.acm.org/doi/10.1145/3579856.3590338
Liu KEl-Gohary N(2023)Improved similarity assessment and spectral clustering for unsupervised linking of data extracted from bridge inspection reportsAdvanced Engineering Informatics10.1016/j.aei.2021.10149651:COnline publication date: 15-Mar-2023
https://dl.acm.org/doi/10.1016/j.aei.2021.101496
Papadakis GEfthymiou VThanos EHassanzadeh OChristen P(2023)An analysis of one-to-one matching algorithms for entity resolutionThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00791-332:6(1369-1400)Online publication date: 18-Apr-2023
https://dl.acm.org/doi/10.1007/s00778-023-00791-3
Graf MLaskowski LPapsdorf FSold FGremmelspacher RNaumann FPanse F(2022)FrostProceedings of the VLDB Endowment10.14778/3554821.355482315:12(3292-3305)Online publication date: 1-Aug-2022
https://dl.acm.org/doi/10.14778/3554821.3554823
Sun CShen D(2022)Towards deep entity resolution via soft schema matchingNeurocomputing10.1016/j.neucom.2021.10.106471:C(107-117)Online publication date: 30-Jan-2022
https://dl.acm.org/doi/10.1016/j.neucom.2021.10.106
Niknam MMinaei-Bidgoli BDianat R(2022)The role of transitive closure in evaluating blocking methods for dirty entity resolutionJournal of Intelligent Information Systems10.1007/s10844-021-00676-358:3(561-590)Online publication date: 1-Jun-2022
https://dl.acm.org/doi/10.1007/s10844-021-00676-3
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents