Article

Scaling up all pairs similarity search

Authors:

Roberto J. Bayardo,

Ramakrishnan Srikant

WWW '07: Proceedings of the 16th international conference on World Wide Web

Pages 131 - 140

https://doi.org/10.1145/1242572.1242591

Published: 08 May 2007

Abstract

Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine distance) is above a given threshold. We propose a simple algorithm based on novel indexing and optimization strategies that solves this problem without relying on approximation methods or extensive parameter tuning. We show the approach efficiently handles a variety of datasets across a wide setting of similarity thresholds, with large speedups over previous state-of-the-art approaches.
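As a rough illustration of the problem statement only, the sketch below finds all pairs by brute force over an inverted index, with none of the indexing or pruning optimizations the paper proposes. The function names `normalize` and `all_pairs`, the dict-based sparse vector representation, and the use of `>=` for the threshold test are assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def normalize(vec):
    """Scale a sparse vector (dict: dimension -> weight) to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {d: w / norm for d, w in vec.items()} if norm else {}


def all_pairs(vectors, threshold):
    """Return (i, j, score) for every pair i > j with cosine score >= threshold.

    Hypothetical baseline: each vector is first matched against the inverted
    index of all earlier vectors, then added to that index itself.
    """
    index = defaultdict(list)  # dimension -> [(vector id, weight), ...]
    results = []
    for i, raw in enumerate(vectors):
        x = normalize(raw)
        scores = defaultdict(float)  # candidate id -> accumulated dot product
        for d, w in x.items():
            for j, wj in index[d]:
                scores[j] += w * wj
        results.extend((i, j, s) for j, s in scores.items() if s >= threshold)
        for d, w in x.items():
            index[d].append((i, w))
    return results


vectors = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 1.0, "c": 0.1},
    {"x": 1.0},
]
pairs = all_pairs(vectors, threshold=0.9)
```

Because the vectors are normalized to unit length, the accumulated dot product equals the cosine similarity. Only vectors that share at least one nonzero dimension are ever compared, which is what makes an inverted index a natural fit for sparse data.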



Index Terms

Scaling up all pairs similarity search
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

WWW '07: Proceedings of the 16th international conference on World Wide Web

May 2007

1382 pages

ISBN:9781595936547

DOI:10.1145/1242572

General Chairs: Carey Williamson (University of Calgary, Canada), Mary Ellen Zurko (IBM, USA)

Program Chairs: Peter Patel-Schneider (Bell Labs Research, USA), Prashant Shenoy (University of Massachusetts at Amherst, USA)

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WWW'07: 16th International World Wide Web Conference

Sponsor: ACM

May 8 - 12, 2007

Banff, Alberta, Canada

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

