Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs

Published: 11 January 2022

Abstract

SimRank is an attractive link-based similarity measure used in fertile fields of Web search and sociometry. However, the existing deterministic method by Kusumoto et al. [24] for retrieving SimRank does not always produce high-quality similarity results, as it fails to accurately obtain the diagonal correction matrix D. Moreover, SimRank has a “connectivity trait” problem: increasing the number of paths between a pair of nodes would decrease their similarity score. The best-known remedy, SimRank++ [1], cannot completely fix this problem, since its score would still be zero if there are no common in-neighbors between two nodes.
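The “connectivity trait” problem can be reproduced on a toy example. The sketch below is a naive implementation of the original Jeh and Widom recursion [18], not the method of this article; the function name `simrank`, the decay value 0.8, and the two small graphs are assumptions chosen for illustration only.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive Jeh-Widom SimRank on a graph given as {node: [in-neighbors]}.

    Illustrative only: it stores all node pairs, so it is quadratic in the
    number of nodes and unsuitable for large graphs.
    """
    nodes = list(in_neighbors)
    # Initial scores: a node is fully similar to itself, 0 otherwise.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A node without in-neighbors is similar to no other node.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, then decay.
            total = sum(sim[(i, j)] for i in ia for j in ib)
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Hypothetical graph 1: a and b share one in-neighbor (x -> a, x -> b).
one_common = {"x": [], "a": ["x"], "b": ["x"]}
# Hypothetical graph 2: a and b share two in-neighbors (x and y point to both).
two_common = {"x": [], "y": [], "a": ["x", "y"], "b": ["x", "y"]}

score_one = simrank(one_common)[("a", "b")]
score_two = simrank(two_common)[("a", "b")]
print(round(score_one, 3), round(score_two, 3))
```

With a decay factor of 0.8, the pair (a, b) scores 0.8 when it has one common in-neighbor but only 0.4 when it has two, because the cross terms s(x, y) and s(y, x) are zero and dilute the average. Adding a second connecting path therefore lowers the score, which is the behaviour the abstract describes.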
In this article, we study fast high-quality link-based similarity search on billion-scale graphs. (1) We first devise a “varied-D” method to accurately compute SimRank in linear memory. We also aggregate duplicate computations, which reduces the time of [24] from quadratic to linear in the number of iterations. (2) We propose a novel “cosine-based” SimRank model to circumvent the “connectivity trait” problem. (3) To substantially speed up the partial-pairs “cosine-based” SimRank search on large graphs, we devise an efficient dimensionality reduction algorithm, PSR#, with guaranteed accuracy. (4) We give mathematical insights into the semantic difference between SimRank and its variant, and correct an argument in [24] that “if D is replaced by a scaled identity matrix \((1-\gamma)I\), their top-K rankings will not be affected much”. (5) We propose a novel method that can accurately convert from Li et al.’s SimRank \(\tilde{S}\) to Jeh and Widom’s SimRank S. (6) We propose GSR#, a generalisation of our “cosine-based” SimRank model, to quantify pairwise similarities across two distinct graphs, unlike SimRank, which would assess nodes across two graphs as completely dissimilar. Extensive experiments on various datasets demonstrate the superiority of our proposed approaches in terms of high search quality, computational efficiency, accuracy, and scalability on billion-edge graphs.
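For orientation, the two SimRank forms contrasted in items (4) and (5) are commonly written as below. The notation is an assumption of this sketch rather than taken from the article: \(Q\) is the row-normalised in-neighbor matrix (\(Q_{a,i} = 1/|I(a)|\) if there is an edge \(i \to a\), and 0 otherwise), \(\gamma\) is the decay factor, and the example graph is hypothetical.

```latex
\begin{align*}
S &= \gamma\, Q S Q^{\top} + D,
  \quad D \text{ diagonal, chosen so that } S_{a,a} = 1 \text{ for all } a
  && \text{(Jeh and Widom's form)} \\
\tilde{S} &= \gamma\, Q \tilde{S} Q^{\top} + (1-\gamma)\, I
  && \text{(Li et al.'s form)}
\end{align*}

% Worked example on the graph x -> a, x -> b, y -> a, y -> b
% (node order: x, y, a, b; x and y have no in-neighbors).
\begin{align*}
(Q S Q^{\top})_{a,a} &= \tfrac{1}{4}\,(S_{x,x} + S_{x,y} + S_{y,x} + S_{y,y})
                      = \tfrac{1}{4}\,(1 + 0 + 0 + 1) = \tfrac{1}{2}, \\
D_{a,a} &= 1 - \tfrac{\gamma}{2}, \qquad D_{x,x} = 1, \\
D &= \operatorname{diag}\!\left(1,\; 1,\; 1 - \tfrac{\gamma}{2},\; 1 - \tfrac{\gamma}{2}\right)
   \neq (1-\gamma)\, I .
\end{align*}
```

Even on this four-node graph the exact correction matrix \(D\) has unequal diagonal entries, so it cannot be a scaled identity matrix. For the same graph, Li et al.'s form gives \(\tilde{S}_{a,b} = \gamma(1-\gamma)/2\) and \(\tilde{S}_{a,a} = (1-\gamma)(1+\gamma/2)\), whereas Jeh and Widom's form gives \(S_{a,b} = \gamma/2\) and \(S_{a,a} = 1\); the two score matrices are therefore not related by a single global rescaling.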

References

[1]
Ioannis Antonellis, Hector Garcia-Molina, and Chichao Chang. 2008. SimRank++: Query rewriting through link analysis of the click graph. Proceedings of the VLDB Endowment 1, 1 (2008), 408–421.
[2]
Peter Christen. 2012. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer. DOI:https://doi.org/10.1007/978-3-642-31164-2
[3]
William W. Cohen. 2000. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems 18, 3 (2000), 288–321. DOI:https://doi.org/10.1145/352595.352598
[4]
Nick Craswell and Martin Szummer. 2007. Random walks on the click graph. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 239–246. DOI:https://doi.org/10.1145/1277741.1277784
[5]
Prasenjit Dey, Kunal Goel, and Rahul Agrawal. 2020. P-Simrank: Extending simrank to scale-free bipartite networks. In Proceedings of the Web Conference. ACM, 3084–3090. DOI:https://doi.org/10.1145/3366423.3380081
[6]
Steven Elsworth and Stefan Güttel. 2020. The block rational Arnoldi method. SIAM Journal on Matrix Analysis and Applications 41, 2 (2020), 365–388.
[7]
Dániel Fogaras and Balázs Rácz. 2004. A scalable randomized method to compute link-based similarity rank on the web graph. In International Conference on Extending Database Technology Workshops.
[8]
Dániel Fogaras and Balázs Rácz. 2005. Scaling link-based similarity search. In Proceedings of the 14th International Conference on World Wide Web.
[9]
Steve Fox, Kuldeep Karnawat, Mark Mydland, Susan T. Dumais, and Thomas White. 2005. Evaluating implicit measures to improve web search. ACM Transactions on Information Systems 23, 2 (2005), 147–168. DOI:https://doi.org/10.1145/1059981.1059982
[10]
Yasuhiro Fujiwara, Makoto Nakatsuji, Hiroaki Shiokawa, and Makoto Onizuka. 2013. Efficient search algorithm for SimRank. In Proceedings of the 2013 IEEE 29th International Conference on Data Engineering.
[11]
Prasanna Ganesan, Hector Garcia-Molina, and Jennifer Widom. 2003. Exploiting hierarchical domain structure to compute similarity. ACM Transactions on Information Systems 21, 1 (2003), 64–93. DOI:https://doi.org/10.1145/635484.635487
[12]
Masoud Reyhani Hamedani and Sang-Wook Kim. 2016. SimCC-AT: A method to compute similarity of scientific papers with automatic parameter tuning. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1005–1008. DOI:https://doi.org/10.1145/2911451.2914715
[13]
Masoud Reyhani Hamedani and Sang-Wook Kim. 2019. Pairwise normalization in SimRank variants: Problem, solution, and evaluation. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. ACM, 534–541. DOI:https://doi.org/10.1145/3297280.3297331
[14]
Guoming He, Haijun Feng, Cuiping Li, and Hong Chen. 2010. Parallel SimRank computation on large graphs with iterative aggregation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[15]
Jun He, Hongyan Liu, Jeffrey Xu Yu, Pei Li, Wei He, and Xiaoyong Du. 2014. Assessing single-pair similarity over graphs by aggregating first-meeting probabilities. Information Systems 42, June (2014), 107–122.
[16]
Barbara M. Hill and Anthony Debons. 1972. Bibliographic coupling. Journal of the Association for Information Science and Technology 23, 4 (1972), 286. DOI:https://doi.org/10.1002/asi.4630230413
[17]
Michael E. Houle, Vincent Oria, Shin’ichi Satoh, and Jichao Sun. 2013. Annotation propagation in image databases using similarity graphs. ACM Transactions on Multimedia Computing, Communications, and Applications 10, 1 (2013), 7:1–7:21. DOI:https://doi.org/10.1145/2487736
[18]
Glen Jeh and Jennifer Widom. 2002. SimRank: A measure of structural-context similarity. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[19]
Glen Jeh and Jennifer Widom. 2003. Scaling personalized web search. In Proceedings of the 12th International Conference on World Wide Web. ACM, 271–279. DOI:https://doi.org/10.1145/775152.775191
[20]
Minhao Jiang, Ada Wai-Chee Fu, Raymond Chi-Wing Wong, and Ke Wang. 2017. READS: A random walk approach for efficient and accurate dynamic SimRank. Proceedings of the VLDB Endowment 10, 9 (2017), 937–948. DOI:https://doi.org/10.14778/3099622.3099625
[21]
Ruoming Jin, Victor E. Lee, and Hui Hong. 2011. Axiomatic ranking of network role similarity. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[22]
Isabel M. Kloumann, Johan Ugander, and Jon M. Kleinberg. 2017. Block models and personalized PageRank. Proceedings of the National Academy of Sciences USA 114, 1 (2017), 33–38. DOI:https://doi.org/10.1073/pnas.1611275114
[23]
Oren Kurland and Lillian Lee. 2010. PageRank without hyperlinks: Structural reranking using links induced by language models. ACM Transactions on Information Systems 28, 4 (2010), 18:1–18:38. DOI:https://doi.org/10.1145/1852102.1852104
[24]
Mitsuru Kusumoto, Takanori Maehara, and Ken-ichi Kawarabayashi. 2014. Scalable similarity search for SimRank. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data.
[25]
Pei Lee, Laks V. S. Lakshmanan, and Jeffrey Xu Yu. 2012. On top-\(k\) structural similarity search. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering.
[26]
Ronny Lempel and Shlomo Moran. 2001. SALSA: The stochastic approach for link-structure analysis. ACM Transactions on Information Systems 19, 2 (2001), 131–160. DOI:https://doi.org/10.1145/382979.383041
[27]
Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 631–636. DOI:https://doi.org/10.1145/1150402.1150479
[28]
Cuiping Li, Jiawei Han, Guoming He, Xin Jin, Yizhou Sun, Yintao Yu, and Tianyi Wu. 2010. Fast computation of SimRank for static and dynamic information networks. In Proceedings of the 13th International Conference on Extending Database Technology.
[29]
Pei Li, Hongyan Liu, Jeffrey Xu Yu, Jun He, and Xiaoyong Du. 2010. Fast single-pair SimRank computation. In Proceedings of the 2010 SIAM International Conference on Data Mining.
[30]
Xinyi Li, Yifan Chen, Benjamin Pettit, and Maarten de Rijke. 2019. Personalised reranking of paper recommendations using paper content and user behavior. ACM Transactions on Information Systems 37, 3 (2019), 31:1–31:23. DOI:https://doi.org/10.1145/3312528
[31]
Zhenjiang Lin, Michael R. Lyu, and Irwin King. 2006. PageSim: A novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web. ACM, 1019–1020. DOI:https://doi.org/10.1145/1135777.1135994
[32]
Zhenjiang Lin, Michael R. Lyu, and Irwin King. 2012. MatchSim: A novel similarity measure based on maximum neighborhood matching. Knowledge and Information Systems 32, 1 (2012), 141–166.
[33]
Dmitry Lizorkin, Pavel Velikhov, Maxim N. Grinev, and Denis Turdakov. 2010. Accuracy estimate and optimization techniques for SimRank computation. The VLDB Journal 19, 1 (2010), 45–66.
[34]
Juan Lu, Zhiguo Gong, and Yiyang Yang. 2021. A matrix sampling approach for efficient SimRank computation. Information Sciences 556 (2021), 1–26. DOI:https://doi.org/10.1016/j.ins.2020.12.046
[35]
Davood Rafiei and Fan Deng. 2020. Similarity join and similarity self-join size estimation in a streaming environment. IEEE Transactions on Knowledge and Data Engineering 32, 4 (2020), 768–781. DOI:https://doi.org/10.1109/TKDE.2019.2893175
[36]
Jieming Shi, Tianyuan Jin, Renchi Yang, Xiaokui Xiao, and Yin Yang. 2020. Realtime index-free single source SimRank processing on web-scale graphs. Proceedings of the VLDB Endowment 13, 7 (2020), 966–978. Retrieved from http://www.vldb.org/pvldb/vol13/p966-shi.pdf.
[37]
Henry Small. 1973. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science 24, 4 (1973), 265–269. DOI:https://doi.org/10.1002/asi.4630240406
[38]
Wenbo Tao, Minghe Yu, and Guoliang Li. 2014. Efficient Top-\(K\) SimRank-based similarity join. Proceedings of the VLDB Endowment 8, 3 (2014), 317–328. DOI:https://doi.org/10.14778/2735508.2735520
[39]
Boyu Tian and Xiaokui Xiao. 2016. SLING: A near-optimal index structure for SimRank. In Proceedings of the 2016 International Conference on Management of Data. 1859–1874. DOI:https://doi.org/10.1145/2882903.2915243
[40]
Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. 2006. Fast random walk with restart and its applications. In Proceedings of the 6th International Conference on Data Mining.
[41]
Hanzhi Wang, Zhewei Wei, Ye Yuan, Xiaoyong Du, and Ji-Rong Wen. 2020. Exact single-source SimRank computation on large graphs. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, David Maier, Rachel Pottinger, AnHai Doan, Wang-Chiew Tan, Abdussalam Alawini, and Hung Q. Ngo (Eds.). ACM, 653–663. DOI:https://doi.org/10.1145/3318464.3389781
[42]
Zhewei Wei, Xiaodong He, Xiaokui Xiao, Sibo Wang, Yu Liu, Xiaoyong Du, and Ji-Rong Wen. 2019. PRSim: Sublinear time SimRank computation on large power-law graphs. In Proceedings of the 2019 International Conference on Management of Data. ACM, 1042–1059. DOI:https://doi.org/10.1145/3299869.3319873
[43]
Wensi Xi, Edward A. Fox, Weiguo Fan, Benyu Zhang, Zheng Chen, Jun Yan, and Dong Zhuang. 2005. SimFusion: Measuring similarity using unified relationship matrix. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[44]
Jennifer Jie Xu and Hsinchun Chen. 2005. CrimeNet explorer: A framework for criminal network knowledge discovery. ACM Transactions on Information Systems 23, 2 (2005), 201–226. DOI:https://doi.org/10.1145/1059981.1059984
[45]
Brit Youngmann, Tova Milo, and Amit Somech. 2019. Boosting SimRank with semantics. In Proceedings of the EDBT. 37–48. DOI:https://doi.org/10.5441/002/edbt.2019.05
[46]
Weiren Yu, Sima Iranmanesh, Aparajita Haldar, Maoyin Zhang, and Hakan Ferhatosmanoglu. 2021. RoleSim*: Scaling axiomatic role-based similarity ranking on large graphs. World Wide Web (2021), 1–45. https://www.springerprofessional.de/en/rolesim-scaling-axiomatic-role-based-similarity-ranking-on-large/19563864.
[47]
Weiren Yu, Xuemin Lin, and Wenjie Zhang. 2013. Towards efficient SimRank computation on large networks. In Proceedings of the 2013 IEEE 29th International Conference on Data Engineering.
[48]
Weiren Yu, Xuemin Lin, and Wenjie Zhang. 2014. Fast incremental SimRank on link-evolving graphs. In Proceedings of the 2014 IEEE 30th International Conference on Data Engineering. 304–315. DOI:https://doi.org/10.1109/ICDE.2014.6816660
[49]
Weiren Yu, Xuemin Lin, Wenjie Zhang, Jian Pei, and Julie A. McCann. 2019. SimRank*: Effective and scalable pairwise similarity search based on graph topology. The VLDB Journal 28, 3 (2019), 401–426. DOI:https://doi.org/10.1007/s00778-018-0536-3
[50]
Weiren Yu, Julie McCann, and Chengyuan Zhang. 2021. Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs. Technical Report 4. Retrieved from https://warwick.ac.uk/fac/sci/dcs/people/weiren_yu/tois_2021_tech_rep.pdf.
[51]
Weiren Yu and Julie A. McCann. 2014. Sig-SR: SimRank search over singular graphs. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. DOI:https://doi.org/10.1145/2600428.2609459
[52]
Weiren Yu and Julie A. McCann. 2015. Efficient partial-pairs SimRank search for large networks. Proceedings of the VLDB Endowment 8, 5 (2015), 569–580. Retrieved from http://www.vldb.org/pvldb/vol8/p569-yu.pdf.
[53]
Weiren Yu and Julie Ann McCann. 2015. High quality graph-based similarity search. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 83–92. DOI:https://doi.org/10.1145/2766462.2767720
[54]
Weiren Yu and Julie A. McCann. 2016. Random walk with restart over dynamic graphs. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining. 589–598. DOI:https://doi.org/10.1109/ICDM.2016.0070
[55]
Weiren Yu, Julie A. McCann, and Chengyuan Zhang. 2019. Efficient pairwise penetrating-rank similarity retrieval. ACM Transactions on the Web 13, 4 (2019), 21:1–21:52. DOI:https://doi.org/10.1145/3368616
[56]
Weiren Yu and Fan Wang. 2018. Fast exact CoSimRank search on evolving and static graphs. In Proceedings of the 2018 World Wide Web Conference. 599–608. DOI:https://doi.org/10.1145/3178876.3186126
[57]
Peixiang Zhao, Jiawei Han, and Yizhou Sun. 2009. P-Rank: A comprehensive structural similarity measure over information networks. In Proceedings of the 18th ACM Conference on Information and Knowledge Management.
[58]
Weiguo Zheng, Lei Zou, Lei Chen, and Dongyan Zhao. 2017. Efficient SimRank-based similarity join. ACM Transactions on Database Systems 42, 3 (2017), 16:1–16:37. DOI:https://doi.org/10.1145/3083899
[59]
Weiguo Zheng, Lei Zou, Yansong Feng, Lei Chen, and Dongyan Zhao. 2013. Efficient SimRank-based similarity join over large graphs. Proceedings of the VLDB Endowment 6, 7 (2013), 493–504.

Cited By

  • (2023) A Multi-Type Transferable Method for Missing Link Prediction in Heterogeneous Social Networks. IEEE Transactions on Knowledge and Data Engineering 35:11 (10981-10991). DOI:https://doi.org/10.1109/TKDE.2022.3233481. Online publication date: 2-Jan-2023
  • (2023) SimSky: An Accuracy-Aware Algorithm for Single-Source SimRank Search. Machine Learning and Knowledge Discovery in Databases: Research Track (226-241). DOI:https://doi.org/10.1007/978-3-031-43418-1_14. Online publication date: 18-Sep-2023

    Published In

    ACM Transactions on Information Systems  Volume 40, Issue 4
    October 2022
    812 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/3501285

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 January 2022
    Accepted: 01 November 2021
    Revised: 01 June 2021
    Received: 01 December 2020
    Published in TOIS Volume 40, Issue 4

    Author Tags

    1. Similarity search
    2. link analysis
    3. scalable algorithms

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • Natural Science Foundation of Jiangsu Province
