
Efficient top-k SimRank-based similarity join

Published: 01 November 2014

Abstract

SimRank is a popular and widely adopted measure of the similarity between nodes in a graph. Computing the SimRank similarities for all pairs of nodes is expensive in both time and space, especially for large graphs. In real-world applications, users are interested only in the most similar pairs. To address this problem, in this paper we study the top-k SimRank-based similarity join problem, which finds the k pairs of nodes with the largest SimRank similarities among all possible pairs. To the best of our knowledge, this is the first attempt to address this problem. We encode each node as a vector by summarizing its neighbors, and transform the calculation of the SimRank similarity between two nodes into computing the dot product of the corresponding vectors. We devise an efficient two-step framework to compute the top-k similar pairs using the vectors. For large graphs, exact algorithms cannot meet the high-performance requirement, so we also devise an approximate algorithm that can efficiently identify the top-k similar pairs under a user-specified accuracy requirement. Experiments on both real and synthetic datasets show that our method achieves high performance and good scalability.
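
To make the problem statement concrete, the following is a minimal sketch of the brute-force baseline that the abstract describes as too expensive: it runs the standard iterative SimRank recurrence (self-similarity 1, otherwise the decayed average similarity of in-neighbor pairs) over all node pairs and then selects the k highest-scoring pairs. It is not the vector encoding or the two-step framework proposed in the paper. The function names `simrank` and `top_k_pairs`, the decay factor of 0.8, the iteration count, and the toy graph are all assumptions made for illustration.

```python
import heapq
from itertools import combinations


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive all-pairs SimRank by fixed-point iteration (illustrative baseline).

    in_neighbors maps each node to the list of nodes that point to it;
    every node must appear as a key.
    """
    nodes = sorted(in_neighbors)
    # Iteration 0: a node is fully similar to itself, 0 to everything else.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # A node without in-neighbors is similar only to itself.
                    new[(a, b)] = 0.0
                    continue
                total = sum(sim[(i, j)] for i in ia for j in ib)
                new[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


def top_k_pairs(sim, k):
    """Return the k unordered pairs of distinct nodes with the largest scores."""
    nodes = sorted({a for a, _ in sim})
    scored = ((sim[(a, b)], a, b) for a, b in combinations(nodes, 2))
    return heapq.nlargest(k, scored)


# Hypothetical toy graph: u points to a and b, a points to c, b points to d.
graph = {"u": [], "a": ["u"], "b": ["u"], "c": ["a"], "d": ["b"]}
for score, a, b in top_k_pairs(simrank(graph), k=2):
    print(a, b, round(score, 4))
```

On this toy graph, a and b share their only in-neighbor, so their score equals the decay factor, and c and d inherit that similarity one step further down, giving the decay factor squared. The quadratic number of pairs, each summing over in-neighbor pairs in every iteration, is what makes this approach impractical for large graphs.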

Published In

Proceedings of the VLDB Endowment, Volume 8, Issue 3
November 2014
144 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 November 2014
Published in PVLDB Volume 8, Issue 3

Qualifiers

  • Research-article

Cited By

  • (2023) Efficient and Accurate SimRank-Based Similarity Joins: Experiments, Analysis, and Improvement. Proceedings of the VLDB Endowment 17:4 (617-629). DOI: 10.14778/3636218.3636219. Online publication date: 1-Dec-2023
  • (2023) Efficient Single-Source SimRank Query by Path Aggregation. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (3342-3352). DOI: 10.1145/3580305.3599328. Online publication date: 6-Aug-2023
  • (2022) Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs. ACM Transactions on Information Systems 40:4 (1-45). DOI: 10.1145/3495209. Online publication date: 11-Jan-2022
  • (2021) ExactSim: benchmarking single-source SimRank algorithms with high-precision ground truths. The VLDB Journal — The International Journal on Very Large Data Bases 30:6 (989-1015). DOI: 10.1007/s00778-021-00672-7. Online publication date: 5-Jun-2021
  • (2020) Realtime index-free single source SimRank processing on web-scale graphs. Proceedings of the VLDB Endowment 13:7 (966-980). DOI: 10.14778/3384345.3384347. Online publication date: 26-Mar-2020
  • (2020) Exact Single-Source SimRank Computation on Large Graphs. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (653-663). DOI: 10.1145/3318464.3389781. Online publication date: 11-Jun-2020
  • (2019) Efficient Pairwise Penetrating-rank Similarity Retrieval. ACM Transactions on the Web 13:4 (1-52). DOI: 10.1145/3368616. Online publication date: 18-Dec-2019
  • (2019) SimRank*. The VLDB Journal — The International Journal on Very Large Data Bases 28:3 (401-426). DOI: 10.1007/s00778-018-0536-3. Online publication date: 1-Jun-2019
  • (2019) Accelerating pairwise SimRank estimation over static and dynamic graphs. The VLDB Journal — The International Journal on Very Large Data Bases 28:1 (99-122). DOI: 10.1007/s00778-018-0521-x. Online publication date: 1-Feb-2019
  • (2018) Dynamical SimRank search on time-varying networks. The VLDB Journal — The International Journal on Very Large Data Bases 27:1 (79-104). DOI: 10.1007/s00778-017-0488-z. Online publication date: 1-Feb-2018
