Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3183713.3183748acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Overlap Set Similarity Joins with Theoretical Guarantees

Published: 27 May 2018 Publication History

Abstract

This paper studies the set similarity join problem with overlap constraints which, given two collections of sets and a constant c, finds all the set pairs in the datasets that share at least c common elements. This is a fundamental operation in many fields, such as information retrieval, data mining, and machine learning. The time complexity of all existing methods is O(n2) where n is the total size of all the sets. In this paper, we present a size-aware algorithm with the time complexity of O(n2-over 1 c k1 over 2c)=o(n2)+O(k), where k is the number of results. The size-aware algorithm divides all the sets into small and large ones based on their sizes and processes them separately. We can use existing methods to process the large sets and focus on the small sets in this paper. We develop several optimization heuristics for the small sets to improve the practical performance significantly. As the size boundary between the small sets and the large sets is crucial to the efficiency, we propose an effective size boundary selection algorithm to judiciously choose an appropriate size boundary, which works very well in practice. Experimental results on real-world datasets show that our methods achieve high performance and outperform the state-of-the-art approaches by up to an order of magnitude.

References

[1]
T. D. Ahle, R. Pagh, I. P. Razenshteyn, and F. Silvestri. On the complexity of inner product similarity join. In PODS, pages 151--164, 2016.
[2]
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006.
[3]
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007.
[4]
P. Bouros, S. Ge, and N. Mamoulis. Spatio-textual similarity joins. PVLDB, 6(1):1--12, 2012.
[5]
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8--13):1157--1166, 1997.
[6]
J. A. Bullinaria and J. P. Levy. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510--526, Aug 2007.
[7]
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, pages 5--16, 2006.
[8]
D. Deng, G. Li, and J. Feng. A pivotal prefix based filtering algorithm for string similarity search. In SIGMOD, pages 673--684, 2014.
[9]
D. Deng, G. Li, H. Wen, and J. Feng. An efficient partition based method for exact set similarity joins. PVLDB, 9(4):360--371, 2015.
[10]
A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518--529, 1999.
[11]
S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321--350, 2012.
[12]
Y. Jiang, G. Li, J. Feng, and W.-S. Li. String similarity joins: An experimental evaluation. PVLDB, 7(8):625--636, 2014.
[13]
C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257--266, 2008.
[14]
G. Li, D. Deng, and J. Feng. A partition-based method for string similarity joins with edit-distance constraints. ACM Trans. Database Syst., 38(2):9, 2013.
[15]
G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253--264, 2011.
[16]
K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, &Computers, 28(2):203--208, Jun 1996.
[17]
W. Mann and N. Augsten. PEL: position-enhanced length filter for set similarity joins. In GVD, pages 89--94, 2014.
[18]
W. Mann, N. Augsten, and P. Bouros. An empirical evaluation of set similarity join techniques. PVLDB, 9(9):636--647, 2016.
[19]
S. Melnik and H. Garcia-Molina. Adaptive algorithms for set containment joins. ACM Trans. Database Syst., 28:56--99, 2003.
[20]
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111--3119, 2013.
[21]
B. Neyshabur and N. Srebro. On symmetric and asymmetric lshs for inner product search. In ICML, pages 1926--1934, 2015.
[22]
R. Pagh. Locality-sensitive hashing without false negatives. In SODA, pages 1--9, 2016.
[23]
J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532--1543, 2014.
[24]
K. Ramasamy, J. M. Patel, J. F. Naughton, and R. Kaushik. Set containment joins: The good, the bad and the ugly. In VLDB, pages 351--362, 2000.
[25]
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, pages 743--754, 2004.
[26]
V. Satuluri and S. Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. PVLDB, 5(5):430--441, 2012.
[27]
A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS, pages 2321--2329, 2014.
[28]
A. Shrivastava and P. Li. Asymmetric minwise hashing for indexing binary inner products and set containment. In WWW, pages 981--991, 2015.
[29]
C. Teflioudi, R. Gemulla, and O. Mykytiuk. LEMP: fast retrieval of large entries in a matrix product. In SIGMOD, pages 107--122, 2015.
[30]
J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD, pages 85--96, 2012.
[31]
X. Wang, L. Qin, X. Lin, Y. Zhang, and L. Chang. Leveraging set relations in exact set similarity join. PVLDB, 10(9):925--936, 2017.
[32]
C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k set similarity joins. In ICDE, pages 916--927, 2009.
[33]
C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst., 36(3):15, 2011.
[34]
J. Yang, W. Zhang, S. Yang, Y. Zhang, and X. Lin. Tt-join: Efficient set containment join. In ICDE, pages 509--520, 2017.
[35]
Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, and S. Ma. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In SIGIR, pages 83--92, 2014.

Cited By

View all
  • (2024)Determining the Largest Overlap between TablesProceedings of the ACM on Management of Data10.1145/36393032:1(1-26)Online publication date: 26-Mar-2024
  • (2024)Open benchmark for filtering techniques in entity resolutionThe VLDB Journal10.1007/s00778-024-00868-733:5(1671-1696)Online publication date: 9-Jul-2024
  • (2023)A Two-Level Signature Scheme for Stable Set Similarity JoinsProceedings of the VLDB Endowment10.14778/3611479.361148016:11(2686-2698)Online publication date: 24-Aug-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN:9781450347037
DOI:10.1145/3183713
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 May 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. overlap
  2. scalable
  3. set
  4. similarity join
  5. sub-quadratic
  6. theoretical guarantee

Qualifiers

  • Research-article

Funding Sources

  • TAL education
  • 973 Program of China
  • CUHK
  • NSF of China
  • Google

Conference

SIGMOD/PODS '18
Sponsor:

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)28
  • Downloads (Last 6 weeks)3
Reflects downloads up to 03 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Determining the Largest Overlap between TablesProceedings of the ACM on Management of Data10.1145/36393032:1(1-26)Online publication date: 26-Mar-2024
  • (2024)Open benchmark for filtering techniques in entity resolutionThe VLDB Journal10.1007/s00778-024-00868-733:5(1671-1696)Online publication date: 9-Jul-2024
  • (2023)A Two-Level Signature Scheme for Stable Set Similarity JoinsProceedings of the VLDB Endowment10.14778/3611479.361148016:11(2686-2698)Online publication date: 24-Aug-2023
  • (2023)R2D2: Reducing Redundancy and Duplication in Data LakesProceedings of the ACM on Management of Data10.1145/36267621:4(1-25)Online publication date: 12-Dec-2023
  • (2023)Near-Duplicate Sequence Search at Scale for Large Language Model Memorization EvaluationProceedings of the ACM on Management of Data10.1145/35893241:2(1-18)Online publication date: 20-Jun-2023
  • (2023)Benchmarking Filtering Techniques for Entity Resolution2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00389(653-666)Online publication date: Apr-2023
  • (2023)MetricJoin: Leveraging Metric Properties for Robust Exact Set Similarity Joins2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00085(1045-1058)Online publication date: Apr-2023
  • (2022)VSIMJournal of Parallel and Distributed Computing10.1016/j.jpdc.2021.07.009158:C(29-46)Online publication date: 22-Apr-2022
  • (2021)LES3Proceedings of the VLDB Endowment10.14778/3476249.347626314:11(2073-2086)Online publication date: 27-Oct-2021
  • (2021)Top-k Tree Similarity JoinProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482304(1939-1948)Online publication date: 26-Oct-2021
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media