DOI: 10.1145/2588555.2610526
research-article

Scalable similarity search for SimRank

Published: 18 June 2014
    Abstract

    SimRank, proposed by Jeh and Widom, provides a good similarity score and has been used successfully in many applications. Many algorithms have been proposed to compute SimRank, but unfortunately none of them scales to graphs with billions of edges. Motivated by this fact, we consider the following SimRank-based similarity search problem: given a query vertex u, find the top-k vertices v with the k highest SimRank scores s(u,v) with respect to u.
    We propose a very fast and scalable algorithm for this similarity search problem. Our method consists of the following ingredients: (1) We first introduce a "linear" recursive formula for SimRank. This allows us to formulate a problem for which we can propose a very fast algorithm. (2) We establish a Monte-Carlo based algorithm to compute a single-pair SimRank score s(u,v), which is based on the random-walk interpretation of our linear recursive formula. (3) We empirically show that the SimRank score s(u,v) decreases rapidly as the distance d(u,v) increases. Therefore, in order to compute SimRank scores for a query vertex u in our similarity search problem, we only need to look at a very "local" area. (4) We combine two upper bounds for the SimRank score s(u,v) (which can be obtained by Monte-Carlo simulation in our preprocessing) with an adaptive sampling technique to prune the similarity search procedure. This results in a much faster algorithm.
    Once our preprocessing is done (which takes only O(n) time), our algorithm finds, given a query vertex u, the top-20 vertices v with the 20 highest SimRank scores s(u,v) in less than a few seconds, even for graphs with billions of edges.
    To the best of our knowledge, this is the first time SimRank computation has been scaled to graphs with at least billions of edges (for the single-source case).
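
    As a rough illustration of ingredients (2) and (3), the sketch below estimates a single-pair score by Monte-Carlo simulation and then ranks only the vertices near the query. It is not the paper's algorithm: it uses the classic first-meeting-time interpretation of the original SimRank of Jeh and Widom (s(u,v) is the expected value of c^t, where t is the first step at which two backward random walks started at u and v meet), not the linear recursive formula, and it omits the upper bounds and adaptive sampling of ingredient (4). The function names, the decay factor c = 0.6, the walk count, the step cap, and the search radius are all assumptions chosen for illustration.

```python
import random
from collections import deque


def in_neighbors(graph):
    """Build in-neighbor lists from a dict mapping vertex -> list of out-neighbors."""
    rev = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            rev.setdefault(v, []).append(u)
    return rev


def simrank_pair(rev, u, v, c=0.6, walks=2000, max_steps=10, rng=None):
    """Monte-Carlo estimate of s(u, v) via first-meeting times (truncated walks)."""
    if u == v:
        return 1.0
    rng = rng or random.Random(0)  # fixed seed: an assumption, for repeatability
    total = 0.0
    for _ in range(walks):
        a, b = u, v
        for t in range(1, max_steps + 1):
            ins_a, ins_b = rev.get(a, []), rev.get(b, [])
            if not ins_a or not ins_b:
                break  # a walk is stuck, so this sample contributes 0
            a, b = rng.choice(ins_a), rng.choice(ins_b)
            if a == b:
                total += c ** t  # the walks first meet after t backward steps
                break
    return total / walks


def candidates_within(graph, rev, u, radius):
    """Vertices within `radius` hops of u, ignoring edge direction."""
    seen = {u}
    queue = deque([(u, 0)])
    while queue:
        x, d = queue.popleft()
        if d == radius:
            continue
        for y in list(graph.get(x, [])) + rev.get(x, []):
            if y not in seen:
                seen.add(y)
                queue.append((y, d + 1))
    seen.discard(u)
    return seen


def top_k_similar(graph, u, k=20, radius=2, **kwargs):
    """Rank only the 'local' candidates around u by estimated SimRank score."""
    rev = in_neighbors(graph)
    scored = [(simrank_pair(rev, u, v, **kwargs), v)
              for v in candidates_within(graph, rev, u, radius)]
    scored.sort(key=lambda pair: (-pair[0], str(pair[1])))
    return [(v, s) for s, v in scored[:k]]
```

    For example, with `graph = {"a": ["b", "c"], "b": [], "c": []}`, the call `top_k_similar(graph, "b", k=1)` ranks `"c"` first, since `b` and `c` share their only in-neighbor `a`. Capping the walks at `max_steps` trades a small downward bias for bounded running time per sample.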

    References

    [1]
    Z. Abbassi and V. S. Mirrokni. A recommender system based on local random walks and spectral methods. In WebKDD/SNA-KDD, volume 5439 of Lecture Notes in Computer Science, pages 139--153. Springer, 2007.
    [2]
    R. Albert, H. Jeong, and A.-L. Barabási. Internet: Diameter of the World-Wide Web. Nature, 401(6749):130--131, 1999.
    [3]
    I. Antonellis, H. Garcia-Molina, and C.-C. Chang. Simrank++: query rewriting through link analysis of the click graph. PVLDB, 1(1):408--421, 2008.
    [4]
    A. A. Benczúr, K. Csalogány, and T. Sarlós. Link-based similarity search to fight web spam. In AIRWeb, pages 9--16, 2006.
    [5]
    P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. In WWW, pages 587--596. ACM, 2011.
    [6]
    P. Boldi and S. Vigna. The WebGraph framework I: compression techniques. In WWW, pages 595--602. ACM, 2004.
    [7]
    Y. Cai, G. Cong, X. Jia, H. Liu, J. He, J. Lu, and X. Du. Efficient algorithm for computing link-based similarity in real world networks. In ICDM, pages 734--739. IEEE Computer Society, 2009.
    [8]
    L. Cao, B. Cho, H. D. Kim, Z. Li, M.-H. Tsai, and I. Gupta. Delta-SimRank computing on MapReduce. In BigMine, pages 28--35. ACM, 2012.
    [9]
    D. Fogaras and B. Rácz. Scaling link-based similarity search. In WWW, pages 641--650. ACM, 2005.
    [10]
    Y. Fujiwara, M. Nakatsuji, H. Shiokawa, and M. Onizuka. Efficient search algorithm for SimRank. In ICDE, pages 589--600. IEEE Computer Society, 2013.
    [11]
    Z. Gyöngyi, H. Garcia-Molina, and J. O. Pedersen. Combating web spam with TrustRank. In VLDB, pages 576--587. Morgan Kaufmann, 2004.
    [12]
    G. He, H. Feng, C. Li, and H. Chen. Parallel SimRank computation on large graphs with iterative aggregation. In KDD, pages 543--552. ACM, 2010.
    [13]
    G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD, pages 538--543. ACM, 2002.
    [14]
    X. Jia, Y. Cai, H. Liu, J. He, and X. Du. Calculating similarity efficiently in a small world. In ADMA, volume 5678 of Lecture Notes in Computer Science, pages 175--187. Springer, 2009.
    [15]
    X. Jia, H. Liu, L. Zou, J. He, and X. Du. A fast two-stage algorithm for computing SimRank and its extensions. In WAIM Workshops, volume 6185 of Lecture Notes in Computer Science, pages 61--73. Springer, 2010.
    [16]
    M. M. Kessler. Bibliographic coupling extended in time: Ten case histories. Information Storage and Retrieval, 1(4):169--187, 1963.
    [17]
    A. N. Langville and C. D. Meyer. Google's PageRank and beyond: The science of search engine rankings. Princeton University Press, 2006.
    [18]
    J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29--123, 2009.
    [19]
    C. Li, J. Han, G. He, X. Jin, Y. Sun, Y. Yu, and T. Wu. Fast computation of SimRank for static and dynamic information networks. In EDBT, volume 426 of ACM International Conference Proceeding Series, pages 465--476. ACM, 2010.
    [20]
    P. Li, Y. Cai, H. Liu, J. He, and X. Du. Exploiting the block structure of link graph for efficient similarity computation. In PAKDD, volume 5476 of Lecture Notes in Computer Science, pages 389--400. Springer, 2009.
    [21]
    P. Li, H. Liu, J. X. Yu, J. He, and X. Du. Fast single-pair SimRank computation. In SDM, pages 571--582. SIAM, 2010.
    [22]
    D. Liben-Nowell and J. M. Kleinberg. The link-prediction problem for social networks. JASIST, 58(7):1019--1031, 2007.
    [23]
    Z. Lin, I. King, and M. R. Lyu. PageSim: A novel link-based similarity measure for the World Wide Web. In Web Intelligence, pages 687--693. IEEE Computer Society, 2006.
    [24]
    Z. Lin, M. R. Lyu, and I. King. Extending link-based algorithms for similar web pages with neighborhood structure. In Web Intelligence, pages 263--266. IEEE Computer Society, 2007.
    [25]
    Z. Lin, M. R. Lyu, and I. King. MatchSim: a novel similarity measure based on maximum neighborhood matching. Knowledge and Information Systems, 32(1):141--166, 2012.
    [26]
    D. Lizorkin, P. Velikhov, M. N. Grinev, and D. Turdakov. Accuracy estimate and optimization techniques for SimRank computation. The VLDB Journal, 19(1):45--66, 2010.
    [27]
    A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval Journal, 3(2):127--163, 2000.
    [28]
    A. Mislove, M. Marcon, P. K. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In Internet Measurement Conference, pages 29--42. ACM, 2007.
    [29]
    C. Scheible. Sentiment translation through lexicon induction. In ACL (Student Research Workshop), pages 25--30. The Association for Computer Linguistics, 2010.
    [30]
    H. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4):265--269, 1973.
    [31]
    V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354--356, 1969.
    [32]
    V. V. Williams. Multiplying matrices faster than Coppersmith-Winograd. In STOC, pages 887--898. ACM, 2012.
    [33]
    X. Yin, J. Han, and P. S. Yu. LinkClus: Efficient clustering via heterogeneous semantic links. In VLDB, pages 427--438. ACM, 2006.
    [34]
    L. Yu, Z. Shu, and X. Yang. SimRate: Improve collaborative recommendation based on rating graph for sparsity. In ADMA (2), volume 6441 of Lecture Notes in Computer Science, pages 167--174. Springer, 2010.
    [35]
    W. Yu, X. Lin, and J. Le. Taming computational complexity: Efficient and parallel SimRank optimizations on undirected graphs. In WAIM, volume 6184 of Lecture Notes in Computer Science, pages 280--296. Springer, 2010.
    [36]
    W. Yu, X. Lin, W. Zhang, L. Chang, and J. Pei. More is simpler: Effectively and efficiently assessing node-pair similarities based on hyperlinks. PVLDB, 7(1):13--24, 2013.
    [37]
    W. Yu, W. Zhang, X. Lin, Q. Zhang, and J. Le. A space and time efficient algorithm for SimRank computation. World Wide Web, 15(3):327--353, 2012.
    [38]
    P. Zhao, J. Han, and Y. Sun. P-Rank: a comprehensive structural similarity measure over information networks. In CIKM, pages 553--562. ACM, 2009.
    [39]
    W. Zheng, L. Zou, Y. Feng, L. Chen, and D. Zhao. Efficient SimRank-based similarity join over large graphs. PVLDB, 6(7):493--504, 2013.
    [40]
    Y. Zhou, H. Cheng, and J. X. Yu. Graph clustering based on structural/attribute similarities. PVLDB, 2(1):718--729, 2009.

      Published In

      SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
      June 2014
      1645 pages
      ISBN:9781450323765
      DOI:10.1145/2588555

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. SimRank
      2. scalable

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS'14

      Acceptance Rates

      SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;
      Overall Acceptance Rate 785 of 4,003 submissions, 20%

      Cited By

      • (2024) Linking Entities across Relations and Graphs. ACM Transactions on Database Systems, 49:1 (1-50). DOI: 10.1145/3639363. Online publication date: 3-Jan-2024
      • (2023) ClipSim: A GPU-friendly Parallel Framework for Single-Source SimRank with Accuracy Guarantee. Proceedings of the ACM on Management of Data, 1:1 (1-26). DOI: 10.1145/3588707. Online publication date: 30-May-2023
      • (2023) SimSky: An Accuracy-Aware Algorithm for Single-Source SimRank Search. Machine Learning and Knowledge Discovery in Databases: Research Track (226-241). DOI: 10.1007/978-3-031-43418-1_14. Online publication date: 17-Sep-2023
      • (2022) Big graphs. Proceedings of the VLDB Endowment, 15:12 (3782-3797). DOI: 10.14778/3554821.3554899. Online publication date: 1-Aug-2022
      • (2022) Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs. ACM Transactions on Information Systems, 40:4 (1-45). DOI: 10.1145/3495209. Online publication date: 11-Jan-2022
      • (2022) Linking Entities across Relations and Graphs. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (634-647). DOI: 10.1109/ICDE53745.2022.00052. Online publication date: May-2022
      • (2022) Efficient index-free SimRank similarity search in large graphs by discounting path lengths. Expert Systems with Applications, 206 (117746). DOI: 10.1016/j.eswa.2022.117746. Online publication date: Dec-2022
      • (2021) Disk. Proceedings of the VLDB Endowment, 14:3 (351-363). DOI: 10.14778/3430915.3430925. Online publication date: 9-Dec-2021
      • (2021) On Modeling Influence Maximization in Social Activity Networks under General Settings. ACM Transactions on Knowledge Discovery from Data, 15:6 (1-28). DOI: 10.1145/3451218. Online publication date: 19-May-2021
      • (2021) Search Efficient Binary Network Embedding. ACM Transactions on Knowledge Discovery from Data, 15:4 (1-27). DOI: 10.1145/3436892. Online publication date: 8-May-2021
