Scalable similarity search for SimRank

Authors: Mitsuru Kusumoto, Takanori Maehara, Ken-ichi Kawarabayashi

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 325 - 336

https://doi.org/10.1145/2588555.2610526

Published: 18 June 2014

Abstract

SimRank, proposed by Jeh and Widom, provides a good similarity score and has been used successfully in many applications. Many algorithms have been proposed to compute SimRank, but unfortunately none of them scales to graphs with billions of edges. Motivated by this fact, we consider the following SimRank-based similarity search problem: given a query vertex u, find the k vertices v with the highest SimRank scores s(u,v) with respect to u.
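For reference, the score s(u,v) in the problem statement can be written out using the original recursive definition of Jeh and Widom. The symbols c (a decay factor with 0 < c < 1) and I(v) (the set of in-neighbors of v) do not appear in the abstract and are introduced here only for illustration:

```latex
s(u,u) = 1, \qquad
s(u,v) = \frac{c}{|I(u)|\,|I(v)|} \sum_{a \in I(u)} \sum_{b \in I(v)} s(a,b)
\quad (u \neq v),
```

with s(u,v) = 0 whenever I(u) or I(v) is empty. This is the classical nonlinear form, not the "linear" recursive formula that the paper introduces.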

We propose a very fast and scalable algorithm for this similarity search problem. Our method consists of the following ingredients: (1) We first introduce a "linear" recursive formula for SimRank. This lets us formulate the problem in a way that admits a very fast algorithm. (2) We establish a Monte-Carlo based algorithm to compute a single-pair SimRank score s(u,v), which is based on the random-walk interpretation of our linear recursive formula. (3) We empirically show that the SimRank score s(u,v) decreases rapidly as the distance d(u,v) increases. Therefore, to compute SimRank scores for a query vertex u in our similarity search problem, we only need to look at a very "local" area around u. (4) We combine two upper bounds for the SimRank score s(u,v) (which can be obtained by Monte-Carlo simulation in our preprocessing step) with an adaptive sampling technique to prune the similarity search procedure. This results in a much faster algorithm.
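Ingredients (2) and (3) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it uses the classical first-meeting random-walk estimator for SimRank (the expected value of c^τ, where τ is the first step at which two reverse walks from u and v meet) rather than the paper's linear formula, and it replaces the upper-bound pruning of ingredient (4) with a plain breadth-first radius filter. The names `simrank_mc` and `top_k_similar`, the `in_nbrs` dictionary layout, and all default parameters are hypothetical.

```python
import heapq
import random
from collections import deque


def simrank_mc(in_nbrs, u, v, c=0.6, walks=1000, max_steps=10, seed=0):
    """Estimate s(u, v) as the mean of c**tau over paired reverse walks.

    in_nbrs maps each vertex to a list of its in-neighbors (assumed layout).
    tau is the first step at which the two walks occupy the same vertex;
    walks that never meet within max_steps contribute 0.
    """
    if u == v:
        return 1.0
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        a, b = u, v
        for step in range(1, max_steps + 1):
            na, nb = in_nbrs.get(a, []), in_nbrs.get(b, [])
            if not na or not nb:
                break  # a walk is stuck, so this pair can no longer meet
            a, b = rng.choice(na), rng.choice(nb)
            if a == b:
                total += c ** step
                break
    return total / walks


def top_k_similar(in_nbrs, u, k=20, radius=2, **mc_args):
    """Score only vertices within `radius` undirected hops of u, keep the top k."""
    # Undirected adjacency, used only to find the "local" candidate set.
    adj = {}
    for x, preds in in_nbrs.items():
        for p in preds:
            adj.setdefault(x, set()).add(p)
            adj.setdefault(p, set()).add(x)
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if dist[x] == radius:
            continue
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    scored = [(simrank_mc(in_nbrs, u, v, **mc_args), v) for v in dist if v != u]
    return heapq.nlargest(k, scored)


# Tiny example: u and v share the single in-neighbor w.
graph = {"u": ["w"], "v": ["w"], "x": ["u"], "w": []}
print(top_k_similar(graph, "u", k=2))
```

In this sketch the cost of a query depends on the number of candidates inside the radius and on the number of walks, not on the size of the whole graph, which is the intuition behind restricting the search to a local area.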

Once our preprocessing is done (which takes only O(n) time), our algorithm finds, given a query vertex u, the top-20 vertices v with the highest SimRank scores s(u,v) in less than a few seconds, even for graphs with billions of edges.

To the best of our knowledge, this is the first SimRank algorithm to scale to graphs with billions of edges (for the single-source case).

Index Terms

Scalable similarity search for SimRank
1. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory

Published In

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

1645 pages

ISBN: 9781450323765

DOI: 10.1145/2588555

General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)

Program Chair: M. Tamer Özsu (University of Waterloo, Canada)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS'14: International Conference on Management of Data

Sponsor: SIGMOD

June 22 - 27, 2014

Snowbird, Utah, USA

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
