Abstract
When non-unique values are used as identifiers of entities, homonyms can cause confusion. In particular, when names (or parts of names) of entities are used as their identifiers, the problem is often referred to as the name disambiguation problem, where the goal is to sort out entities that are erroneously conflated because of name homonyms (e.g., if only the last name is used as the identifier, one cannot distinguish “Masao Obama” from “Norio Obama”). In this paper, we study the scalability issue of the name disambiguation problem, which arises when (1) a small number of entities with large contents or (2) a large number of entities become indistinguishable due to homonyms. First, we carefully examine two state-of-the-art solutions to the name disambiguation problem and point out their limitations with respect to scalability. Then, we propose two scalable graph partitioning algorithms, multi-level graph partitioning and multi-level graph partitioning and merging, to solve the large-scale name disambiguation problem. Our claim is validated empirically: in our experiments, the proposal shows orders-of-magnitude improvement in performance while maintaining equivalent or reasonable accuracy compared to competing solutions.
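The homonym example in the abstract can be illustrated with a minimal sketch. This is not the paper's method; it only shows how a non-unique identifier (here, the last name) collapses distinct entities into one group. The helper names `group_by_identifier` and `last_name` are hypothetical and introduced purely for illustration.

```python
from collections import defaultdict


def group_by_identifier(names, key):
    """Group full names by the identifier that `key` derives from each name."""
    groups = defaultdict(list)
    for name in names:
        groups[key(name)].append(name)
    return dict(groups)


def last_name(full_name):
    # Assumption for this sketch: the last whitespace-separated token is the last name.
    return full_name.split()[-1]


names = ["Masao Obama", "Norio Obama"]

# Using only the last name as the identifier merges two distinct people.
by_last = group_by_identifier(names, last_name)

# Using the full name keeps them apart in this toy case.
by_full = group_by_identifier(names, lambda n: n)

print(by_last)
print(by_full)
```

With the last name as the key, both people fall into a single group, which is the mixed-entity situation that a name disambiguation method then has to split apart.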
References
Aygun R (2008) S2S: structural-to-syntactic matching similar documents. Knowl Inform Syst 16: 303–329
Banerjee A, Basu S, Merugu S (2007) Multi-way clustering on relation graphs. In: Proceedings of the SIAM data mining, November 2007
Bekkerman R, McCallum A (2005) Disambiguating web appearances of people in a social network. In: Proceedings of international world wide web conference
Cheng D, Kannan R, Vempala S, Wang G (2005) A divide-and-merge methodology for clustering. ACM Trans Database Syst
Cohen W, Ravikumar P, Fienberg S (2003) A comparison of string distance metrics for name-matching tasks. In: Proceedings of the IIWEB workshop
Cui X, Potok T, Palathingal P (2005) Document clustering using particle swarm optimization. In: Proceedings of swarm intelligence symposium
Dhillon I, Guan Y, Kulis B (2005) A fast kernel-based multilevel algorithm for graph clustering. In: Proceedings of ACM SIGKDD conference on knowledge discovery and data mining
Doan A, Lu Y, Lee Y, Han J (2003) Profile-based object matching for information integration. IEEE Intell Syst, September/October, 2–7
Dorneles C, Goncalves R, Mello R (2010) Approximate data instance matching: a survey. Knowl Inform Syst
Frey B, Dueck D (2007) Clustering by passing messages between data points. Science 315
Golub G, Van Loan C (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore
Halbert D (2008) Record linkage. Am J Publ Health 36(12): 1412–1416
Han J, Kamber M, Tung A (2001) Spatial clustering methods in data mining: a survey. In: Geographic data mining and knowledge discovery. Taylor and Francis, London
Han H, Giles C, Zha H (2005) Name disambiguation in author citations using a k-way spectral clustering method. In: Proceedings of ACM/IEEE joint conference on digital libraries, June 2005
Hammouda K, Kamel M (2004) Document similarity using a phrase indexing graph model. Knowl Inform Syst 6: 710–727
Heath M (2002) Scientific computing: an introductory survey. Prentice Hall, Englewood Cliffs
Hendrickson B, Leland R (1992) An improved spectral graph partitioning algorithm for mapping parallel computations. Technical report, SAND92-1460, Sandia National Lab, Albuquerque
Hendrickson B, Leland R (1994) The Chaco user’s guide: version 2.0. Sandia
Hernandez M, Stolfo S (1995) The merge/purge problem for large databases. ACM SIGMOD/PODS conference
Hong Y, On B, Lee D (2004) System support for name authority control problem in digital libraries: OpenDBLP approach. In: Proceedings of European conference on digital libraries, Bath, UK, September 2004
Howard S, Tang H, Berry M, Martin D (2009) GTP: general text parser. http://www.cs.utk.edu/~lsi/
Karypis G, Kumar V (1996) A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. J Parallel Distributed Comput 48(1): 71–95
Lee D, On B, Kang J, Park S (2005) Effective and scalable solutions for mixed and split citation problems in digital libraries. In: Proceedings of the ACM SIGMOD workshop on information quality in information systems, Baltimore, MD, USA, June 2005
Li Z, Ng W, Sun A (2005) Web data extraction based on structural similarity. Knowl Inform Syst 8: 438–461
Lu W, Milios J, Japkowicz M, Zhang Y (2006) Node similarity in the citation graph. Knowl Inform Syst 11: 105–129
Meila M, Shi J (2001) A random walks view of spectral segmentation. In: Proceedings of the international conference on machine learning
Newman M (2004) Detecting community structure in networks. Eur Phys J B(38): 321–330
On B, Elmacioglu E, Lee D, Kang J, Pei J (2006) Improving grouped-entity resolution using quasi-cliques. In: Proceedings of the IEEE international conference on data mining
On B, Koudas N, Lee D, Srivastava D (2007) Group linkage. In: Proceedings of the IEEE international conference on data engineering
On B, Lee D (2007) Scalable name disambiguation using multi-level graph partition. In: Proceedings of the SIAM international conference on data mining
On B, Lee I (2009) Google based name search: resolving mixed entities on the Web. In: Proceedings of the international conference on digital information management
Pasula H, Marthi B, Milch B, Russell S, Shapitser I (2003) Identity uncertainty and citation matching. Advances in neural information processing 15, MIT press, Cambridge
Pothen A, Simon H, Liou K (1990) Partitioning sparse matrices with eigenvectors of graphs. SIAM J Matrix Anal Appl 11(3): 430–452
Pothen A, Simon H, Wang L, Bernard S (1992) Toward a fast implementation of spectral nested dissection. In: Proceedings of the SUPERCOM, pp 42–51
SecondString: open-source java-based package of approximate string-matching. http://secondstring.sourceforge.net/
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8): 888–905
Slonim N, Friedman N, Tishby N (2002) Unsupervised document classification using sequential information maximization. In: Proceedings of the SIGIR
Verma D, Meila M (2003) Spectral clustering toolbox. http://www.ms.washington.edu/~spectral/
Wan X (2008) Beyond topical similarity: a structure similarity measure for retrieving highly similar document. Knowl Inform Syst 15: 55–73
Wu X, Kumar V, Quinlan J, Ghosh J, Yang Q (2008) Top 10 algorithms in data mining. Knowl Inform Syst 14: 1–37
Ye S, Wen J, Ma W (2007) A systematic study on parameter correlations in large-scale duplicate document detection. Knowl Inform Syst 14: 217–232
Yippy. http://search.yippy.com/
Yu S, Shi J (2003) Multiclass spectral clustering. In: Proceedings of the international conference on computer vision
Additional information
This paper was extended from the earlier conference paper that appeared in Ref. [31].
About this article
Cite this article
On, BW., Lee, I. & Lee, D. Scalable clustering methods for the name disambiguation problem. Knowl Inf Syst 31, 129–151 (2012). https://doi.org/10.1007/s10115-011-0397-1
DOI: https://doi.org/10.1007/s10115-011-0397-1