Abstract
Measuring object similarity in a graph is a fundamental data-mining problem in various application domains, including Web linkage mining, social network analysis, information retrieval, and recommender systems. In this paper, we focus on the neighbor-based approach, which rests on the intuition that “similar objects have similar neighbors”, and propose a novel similarity measure called MatchSim. Our method recursively defines the similarity between two objects as the average similarity of the maximum-matched similar neighbor pairs between them. We show that MatchSim conforms to the basic intuition of similarity; therefore, it can overcome the counterintuitive contradiction in SimRank. Moreover, MatchSim can be viewed as an extension of the traditional neighbor-counting scheme that takes the similarities between neighbors into account, leading to higher flexibility. We present the MatchSim score computation process and prove its convergence. We also analyze its time and space complexity and suggest two acceleration techniques: (1) a simple pruning strategy and (2) an approximation algorithm for the maximum matching computation. Experimental results on real-world datasets show that, although our method is computationally less efficient, it outperforms classic methods in terms of accuracy.
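The recursive definition in the abstract can be illustrated with a minimal Python sketch. Several details here are assumptions, since the abstract does not state them: the scores start from the identity (1 for a node with itself, 0 otherwise), a pair in which either node has no neighbors scores 0, and the matching weight is divided by the larger of the two neighbor-set sizes to form the "average". The names `matchsim` and `max_matching_weight` are hypothetical, and the matching is found by brute force rather than by an assignment algorithm, so this is only workable for very small neighbor sets.

```python
from itertools import permutations


def max_matching_weight(left, right, sim):
    """Weight of a maximum-weight matching between two neighbor lists.

    Brute force over all injective assignments (illustration only);
    a real implementation would use an assignment algorithm instead.
    """
    if len(left) > len(right):
        left, right = right, left  # always match the smaller side into the larger
    if not left:
        return 0.0
    best = 0.0
    for perm in permutations(right, len(left)):
        total = sum(sim[(u, v)] for u, v in zip(left, perm))
        best = max(best, total)
    return best


def matchsim(neighbors, iterations=10):
    """Iteratively compute MatchSim-style scores.

    neighbors: dict mapping each node to a list of its neighbors.
    Assumed normalization: matching weight / max(|N(a)|, |N(b)|).
    """
    nodes = list(neighbors)
    # Assumed initialization: identity similarity.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                na, nb = neighbors[a], neighbors[b]
                if a == b:
                    new[(a, b)] = 1.0
                elif not na or not nb:
                    new[(a, b)] = 0.0  # assumed convention for isolated nodes
                else:
                    weight = max_matching_weight(na, nb, sim)
                    new[(a, b)] = weight / max(len(na), len(nb))
        sim = new
    return sim


# Tiny example graph: a and b share both neighbors, e shares one with a.
graph = {"a": ["c", "d"], "b": ["c", "d"], "e": ["c"], "c": [], "d": []}
scores = matchsim(graph)
```

In this toy graph, `scores[("a", "b")]` is 1.0 because both neighbors match perfectly, while `scores[("a", "e")]` is 0.5 because only one of the two neighbors of `a` can be matched.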
About this article
Cite this article
Lin, Z., Lyu, M.R. & King, I. MatchSim: a novel similarity measure based on maximum neighborhood matching. Knowl Inf Syst 32, 141–166 (2012). https://doi.org/10.1007/s10115-011-0427-z
DOI: https://doi.org/10.1007/s10115-011-0427-z