MatchSim: a novel similarity measure based on maximum neighborhood matching

Lin, Zhenjiang; Lyu, Michael R.; King, Irwin

doi:10.1007/s10115-011-0427-z

Regular Paper
Published: 14 June 2011

Volume 32, pages 141–166, (2012)

Knowledge and Information Systems

Zhenjiang Lin¹,
Michael R. Lyu¹ &
Irwin King¹

Abstract

Measuring object similarity in a graph is a fundamental data mining problem in various application domains, including Web linkage mining, social network analysis, information retrieval, and recommender systems. In this paper, we focus on the neighbor-based approach that is based on the intuition that “similar objects have similar neighbors” and propose a novel similarity measure called MatchSim. Our method recursively defines the similarity between two objects by the average similarity of the maximum-matched similar neighbor pairs between them. We show that MatchSim conforms to the basic intuition of similarity; therefore, it can overcome the counterintuitive contradiction in SimRank. Moreover, MatchSim can be viewed as an extension of the traditional neighbor-counting scheme by taking the similarities between neighbors into account, leading to higher flexibility. We present the MatchSim score computation process and prove its convergence. We also analyze its time and space complexity and suggest two accelerating techniques: (1) proposing a simple pruning strategy and (2) adopting an approximation algorithm for maximum matching computation. Experimental results on real-world datasets show that although our method is less efficient computationally, it outperforms classic methods in terms of accuracy.
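
The recursive definition in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made for illustration: the names `matchsim`, `max_matching_weight`, and `in_neighbors` are hypothetical, the matching weight is normalized by the larger of the two neighbor counts, self-similarity is fixed at 1, pairs in which either object has no neighbors score 0, and the iteration starts from the identity matrix. The maximum matching is found by brute force over permutations, which is only practical for tiny neighborhoods; an assignment solver or an approximation algorithm would take its place in practice.

```python
from itertools import permutations


def max_matching_weight(left, right, sim):
    """Brute-force maximum-weight bipartite matching between two neighbor lists.

    Assumes sim is symmetric, so the shorter list can always go on the left.
    """
    if len(left) > len(right):
        left, right = right, left
    best = 0.0
    # Try every way of assigning distinct right-side nodes to the left-side nodes.
    for perm in permutations(right, len(left)):
        weight = sum(sim[u][v] for u, v in zip(left, perm))
        best = max(best, weight)
    return best


def matchsim(in_neighbors, iterations=10):
    """Iterate the neighbor-matching similarity for a fixed number of rounds.

    in_neighbors maps every node to the list of its neighbors; every neighbor
    must also appear as a key.
    """
    nodes = list(in_neighbors)
    # Assumed starting point: each node is similar only to itself.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                na, nb = in_neighbors[a], in_neighbors[b]
                if a == b:
                    new[a][b] = 1.0  # assumed: self-similarity is fixed at 1
                elif not na or not nb:
                    new[a][b] = 0.0  # assumed: no neighbors means no evidence
                else:
                    weight = max_matching_weight(na, nb, sim)
                    new[a][b] = weight / max(len(na), len(nb))
        sim = new
    return sim


# Toy graph: a and b share neighbor c; x and y are linked from a and b.
graph = {
    "a": ["c", "d"],
    "b": ["c", "e"],
    "c": [], "d": [], "e": [],
    "x": ["a"],
    "y": ["b"],
}
scores = matchsim(graph)
print(scores["a"]["b"])  # 0.5: one of two neighbor pairs matches exactly
print(scores["x"]["y"])  # 0.5: inherited from the similarity of a and b
```

In the toy graph, `x` and `y` share no neighbor, so plain neighbor counting would score them 0, while the recursive definition propagates the similarity of `a` and `b` to them.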

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong
Zhenjiang Lin, Michael R. Lyu & Irwin King

Corresponding author

Correspondence to Zhenjiang Lin.

About this article

Cite this article

Lin, Z., Lyu, M.R. & King, I. MatchSim: a novel similarity measure based on maximum neighborhood matching. Knowl Inf Syst 32, 141–166 (2012). https://doi.org/10.1007/s10115-011-0427-z

Received: 25 January 2010
Revised: 01 April 2011
Accepted: 27 May 2011
Published: 14 June 2011
Issue Date: July 2012
DOI: https://doi.org/10.1007/s10115-011-0427-z
