Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Framework for evaluating clustering algorithms in duplicate detection

Published: 01 August 2009 Publication History
  • Get Citation Alerts
  • Abstract

    The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution also known as duplication detection or record linkage is used as a part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system that provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general purpose duplication detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by the recent significant advancements that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never been considered for duplicate detection, perform extremely well in terms of both accuracy and scalability.

    References

    [1]
    N. Ailon, M. Charikar, and A. Newman. Aggregating Inconsistent Information: Ranking and Clustering. In ACM Symp. on Theory of Computing (STOC), pages 684--693, 2005.
    [2]
    P. Andritsos. Scalable Clustering of Categorical Data And Applications. PhD thesis, University of Toronto, Toronto, Canada, September 2004.
    [3]
    A. Arasu, V. Ganti, and R. Kaushik. Efficient Exact Set-Similarity Joins. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 918--929, 2006.
    [4]
    J. A. Aslam, E. Pelekhov, and D. Rus. The Star Clustering Algorithm For Static And Dynamic Information Organization. Journal of Graph Algorithms and Applications, 8(1):95--129, 2004.
    [5]
    N. Bansal, A. Blum, and S. Chawla. Correlation Clustering. Machine Learning, 56(1--3):89--113, 2004.
    [6]
    N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa. Seeking Stable Clusters In The Blogosphere. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 806--817, Vienna, Austria, 2007.
    [7]
    R. J. Bayardo, Y. Ma, and R. Srikant. Scaling Up All Pairs Similarity Search. In Int'l World Wide Web Conference (WWW), pages 131--140, Banff, Canada, 2007.
    [8]
    I. Bhattacharya and L. Getoor. A Latent Dirichlet Model for Unsupervised Entity Resolution. In Proc. of the SIAM International Conference on Data Mining (SDM), pages 47--58, Bethesda, MD, USA, 2006.
    [9]
    I. Bhattacharya and L. Getoor. Collective Entity Resolution in Relational Data. IEEE Data Engineering Bulletin, 29(2):4--12, 2006.
    [10]
    U. Brandes, M. Gaertler, and D. Wagner. Experiments on Graph Clustering Algorithms. In The 11th Europ. Symp. Algorithms, pages 568--579. Springer-Verlag, 2003.
    [11]
    S. Brohee and J. van Helden. Evaluation of Clustering Algorithms for Protein-Protein Interaction Networks. BMC Bioinformatics, 7:488+, 2006.
    [12]
    M. Charikar, V. Guruswami, and A. Wirth. Clustering with Qualitative Information. J. Comput. Syst. Sci., 71(3):360--383, 2005.
    [13]
    S. Chaudhuri, V. Ganti, and R. Motwani. Robust Identification of Fuzzy Duplicates. In IEEE Proc. of the Int'l Conf. on Data Eng., pages 865--876, Washington, DC, USA, 2005.
    [14]
    D. Cheng, R. Kannan, S. Vempala, and G. Wang. On a Recursive Spectral Algorithm for Clustering from Pairwise Similarities. Technical Report MIT-LCS-TR-906, MIT LCS, 2003.
    [15]
    F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, and E. Upfal. Finding Near Neighbors Through Cluster Pruning. In Proc. of the ACM Symp. on Principles of Database Systems (PODS), pages 103--112, Beijing, China, 2007.
    [16]
    W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A Comparison of String Distance Metrics for Name-Matching Tasks. In Proc. of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pages 73--78, Acapulco, Mexico, 2003.
    [17]
    T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw Hill and MIT Press, 1990.
    [18]
    W. H. Day and H. Edelsbrunner. Efficient Algorithms for Agglomerative Hierarchical Clustering Methods. Journal of Classification, 1(1):7--24, 1984.
    [19]
    E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation Clustering In General Weighted Graphs. Theor. Comput. Sci., 361(2):172--187, 2006.
    [20]
    E. A. Dinic. Algorithm for Solution of a Problem Of Maximum Flow in Networks with Power Estimation. Soviet Math. Dokl, 11:1277--1280, 1970.
    [21]
    J. Edmonds and R. M. Karp. Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems. J. ACM, 19(2):248--264, 1972.
    [22]
    P. F. Felzenszwalb and D. P. Huttenlocher. Efficient Graph-Based Image Segmentation. Int. J. Comput. Vision, 59(2):167--181, 2004.
    [23]
    M. Filippone, F. Camastra, F. Masulli, and S. Rovetta. A Survey of Kernel and Spectral Methods for Clustering. Pattern Recognition, 41(1):176--190, 2008.
    [24]
    G. W. Flake, R. E. Tarjan, and K. Tsioutsiouliklis. Graph Clustering and Minimum Cut Trees. Internet Mathematics, 1(4):385--408, 2004.
    [25]
    L. Ford and D. Fulkerson. Maximal Flow Through a Network. Canadian J. Math, 8:399--404, 1956.
    [26]
    D. Gibson, R. Kumar, and A. Tomkins. Discovering Large Dense Subgraphs in Massive Graphs. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 721--732, 2005.
    [27]
    A. V. Goldberg and R. E. Tarjan. A New Approach to the Maximum-Flow Problem. Journal of the ACM, 35(4):921--940, 1988.
    [28]
    M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. journal, 17(2--3):107--145, 2001.
    [29]
    O. Hassanzadeh. Benchmarking Declarative Approximate Selection Predicates. Master's thesis, University of Toronto, February 2007.
    [30]
    O. Hassanzadeh and R. J. Miller. Creating Probabilistic Databases from Duplicated Data. Technical Report CSRG-568, University of Toronto, To appear in The VLDB Journal, Accepted on 26 June 2009.
    [31]
    O. Hassanzadeh, M. Sadoghi, and R. J. Miller. Accuracy of Approximate String Joins Using Grams. In Proc. of the International Workshop on Quality in Databases (QDB), pages 11--18, Vienna, Austria, 2007.
    [32]
    T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. In Proc. of the Int'l Workshop on the Web and Databases (WebDB), pages 129--134, Dallas, Texas, USA, 2000.
    [33]
    M. A. Hernández and S. J. Stolfo. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998.
    [34]
    L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Res., 9(11):1106--1115, 1999.
    [35]
    A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
    [36]
    A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 31(3):264--323, 1999.
    [37]
    R. Kannan, S. Vempala, and A. Vetta. On clusterings: Good, bad and spectral. Journal of the ACM, 51(3):497--515, 2004.
    [38]
    Y. J. Kim and J. M. Patel. A Framework for Protein Structure Classification and Identification of Novel Protein Structures. BMC Bioinformatics, 7:456+, 2006.
    [39]
    A. D. King. Graph Clustering with Restricted Neighbourhood Search. Master's thesis, University of Toronto, 2004.
    [40]
    J. Kogan. Introduction to Clustering Large and High-Dimensional Data. Cambridge Univ. Press, 2007.
    [41]
    C. Li, B. Wang, and X. Yang. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 303--314, Vienna, Austria, 2007.
    [42]
    A. M. D. Pelleg. X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters. In Proc. of the Int'l Conf. on Machine Learning, pages 727--734, San Francisco, CA, USA, 2000.
    [43]
    S. Sarawagi and A. Kirpal. Efficient Set Joins On Similarity Predicates. In ACM SIGMOD Int'l Conf. on the Mgmt. of Data, pages 743--754, Paris, France, 2004.
    [44]
    N. Slonim. The Information Bottleneck: Theory And Applications. PhD thesis, The Hebrew University, 2003.
    [45]
    C. Swamy. Correlation Clustering: Maximizing Agreements Via Semidefinite Programming. In Proc. of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 526--527, New Orleans, Louisiana, USA, 2004.
    [46]
    C. Umans. Hardness of Approximating Sigma2p Minimization Problems. In Symp. on Foundations of Computer Science (FOCS), pages 465--474, 1999.
    [47]
    S. van Dongen. Graph Clustering By Flow Simulation. PhD thesis, University of Utrecht, 2000.
    [48]
    J. A. Whitney. Graph Clustering With Overlap. Master's thesis, University of Toronto, 2006.
    [49]
    D. T. Wijaya and S. Bressan. Ricochet: A Family of Unconstrained Algorithms for Graph Clustering. In Proc. of the Int'l Conf. on Database Systems for Advanced Applications (DASFAA), pages 153--167, Brisbane, Australia, 2009.
    [50]
    R. Xu and I. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645--678, 2005.
    [51]
    J. Zupan. Clustering of Large Data Sets. Research Studies Press, 1982.

    Cited By

    View all
    • (2024)The Five Generations of Entity Resolution on Web DataWeb Engineering10.1007/978-3-031-62362-2_46(469-473)Online publication date: 17-Jun-2024
    • (2024)The Open V2X Management PlatformWeb Engineering10.1007/978-3-031-62362-2_10(133-146)Online publication date: 17-Jun-2024
    • (2023)Simple, scalable and effective clustering via one-dimensional projectionsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3668942(64618-64649)Online publication date: 10-Dec-2023
    • Show More Cited By

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 2, Issue 1
    August 2009
    1293 pages

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 August 2009
    Published in PVLDB Volume 2, Issue 1

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)31
    • Downloads (Last 6 weeks)3

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)The Five Generations of Entity Resolution on Web DataWeb Engineering10.1007/978-3-031-62362-2_46(469-473)Online publication date: 17-Jun-2024
    • (2024)The Open V2X Management PlatformWeb Engineering10.1007/978-3-031-62362-2_10(133-146)Online publication date: 17-Jun-2024
    • (2023)Simple, scalable and effective clustering via one-dimensional projectionsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3668942(64618-64649)Online publication date: 10-Dec-2023
    • (2023)Comparing Heuristic Rules and Masked Language Models for Entity Alignment in the Literature DomainJournal on Computing and Cultural Heritage 10.1145/360669916:3(1-18)Online publication date: 9-Aug-2023
    • (2023)Privacy-Preserving Record Linkage for Cardinality CountingProceedings of the 2023 ACM Asia Conference on Computer and Communications Security10.1145/3579856.3590338(53-64)Online publication date: 10-Jul-2023
    • (2023)Improved similarity assessment and spectral clustering for unsupervised linking of data extracted from bridge inspection reportsAdvanced Engineering Informatics10.1016/j.aei.2021.10149651:COnline publication date: 15-Mar-2023
    • (2023)An analysis of one-to-one matching algorithms for entity resolutionThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00791-332:6(1369-1400)Online publication date: 18-Apr-2023
    • (2022)FrostProceedings of the VLDB Endowment10.14778/3554821.355482315:12(3292-3305)Online publication date: 1-Aug-2022
    • (2022)Towards deep entity resolution via soft schema matchingNeurocomputing10.1016/j.neucom.2021.10.106471:C(107-117)Online publication date: 30-Jan-2022
    • (2022)The role of transitive closure in evaluating blocking methods for dirty entity resolutionJournal of Intelligent Information Systems10.1007/s10844-021-00676-358:3(561-590)Online publication date: 1-Jun-2022
    • Show More Cited By

    View Options

    Get Access

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media