Overlap Set Similarity Joins with Theoretical Guarantees

Authors:

Guoliang Li

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Pages 905 - 920

https://doi.org/10.1145/3183713.3183748

Published: 27 May 2018

Abstract

This paper studies the set similarity join problem with overlap constraints which, given two collections of sets and a constant c, finds all the set pairs in the datasets that share at least c common elements. This is a fundamental operation in many fields, such as information retrieval, data mining, and machine learning. The time complexity of all existing methods is $O(n^2)$, where n is the total size of all the sets. In this paper, we present a size-aware algorithm with the time complexity of $O(n^{2-\frac{1}{c}} k^{\frac{1}{2c}}) = o(n^2) + O(k)$, where k is the number of results. The size-aware algorithm divides all the sets into small and large ones based on their sizes and processes them separately. We can use existing methods to process the large sets and focus on the small sets in this paper. We develop several optimization heuristics for the small sets to improve the practical performance significantly. As the size boundary between the small sets and the large sets is crucial to the efficiency, we propose an effective size boundary selection algorithm to judiciously choose an appropriate size boundary, which works very well in practice. Experimental results on real-world datasets show that our methods achieve high performance and outperform the state-of-the-art approaches by up to an order of magnitude.
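
The small/large split described in the abstract can be illustrated with a minimal Python sketch. Several parts are assumptions made for illustration and are not stated in the abstract: the function names, the `boundary` parameter (a fixed size threshold rather than the paper's boundary selection algorithm), the direct comparison used for large sets, and the c-subset bucketing used for small sets. The bucketing relies on the fact that two sets share at least c elements exactly when they have a common c-element subset.

```python
from collections import defaultdict
from itertools import combinations


def brute_force_overlap_join(sets, c):
    """Reference answer: test every pair of sets directly."""
    return {
        (i, j)
        for i, j in combinations(range(len(sets)), 2)
        if len(sets[i] & sets[j]) >= c
    }


def size_aware_overlap_join(sets, c, boundary):
    """Return index pairs (i, j), i < j, whose sets share at least c elements.

    `boundary` is a hypothetical size threshold: sets with at least
    `boundary` elements count as large, all others as small.
    """
    results = set()
    large = [i for i, s in enumerate(sets) if len(s) >= boundary]
    small = [i for i, s in enumerate(sets) if len(s) < boundary]

    # Large sets: compare each one against every other set directly.
    for i in large:
        for j in range(len(sets)):
            if i != j and len(sets[i] & sets[j]) >= c:
                results.add((min(i, j), max(i, j)))

    # Small sets: enumerate every c-element subset (assumed strategy).
    # Sets that land in the same bucket share at least c elements.
    buckets = defaultdict(list)
    for i in small:
        for subset in combinations(sorted(sets[i]), c):
            buckets[subset].append(i)
    for members in buckets.values():
        # `members` is in increasing index order, so each pair has i < j.
        results.update(combinations(members, 2))

    return results


if __name__ == "__main__":
    data = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4, 5, 6}, {7, 8}, {3, 4, 5, 6}]
    print(sorted(size_aware_overlap_join(data, c=2, boundary=5)))
```

Any value of `boundary` gives the same result set in this sketch; the threshold only shifts work between the two phases, which is why the abstract treats its selection as an efficiency concern.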

Index Terms

Overlap Set Similarity Joins with Theoretical Guarantees
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        Join algorithms
    2. Information integration

Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

May 2018

1874 pages

ISBN:9781450347037

DOI:10.1145/3183713

General Chairs:
Gautam Das (University of Texas at Arlington, USA),
Christopher Jermaine (Rice University, USA),
Philip Bernstein (Microsoft Research, USA)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

TAL education
973 Program of China
CUHK
NSF of China
Google

Conference

SIGMOD/PODS '18

Sponsor:

SIGMOD

SIGMOD/PODS '18: International Conference on Management of Data

June 10 - 15, 2018

Houston, TX, USA

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Article Metrics

Total Citations: 33
Total Downloads: 503
Downloads (last 12 months): 28
Downloads (last 6 weeks): 3

Reflects downloads up to 03 Oct 2024

Citations

Cited By

Zecchini L, Bleifuß T, Simonini G, Bergamaschi S, Naumann F (2024). Determining the Largest Overlap between Tables. Proceedings of the ACM on Management of Data 2:1 (1-26). Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639303
Neuhof F, Fisichella M, Papadakis G, Nikoletos K, Augsten N, Nejdl W, Koubarakis M (2024). Open benchmark for filtering techniques in entity resolution. The VLDB Journal 33:5 (1671-1696). Online publication date: 9-Jul-2024.
https://doi.org/10.1007/s00778-024-00868-7
Schmitt D, Kocher D, Augsten N, Mann W, Miller A (2023). A Two-Level Signature Scheme for Stable Set Similarity Joins. Proceedings of the VLDB Endowment 16:11 (2686-2698). Online publication date: 24-Aug-2023.
https://dl.acm.org/doi/10.14778/3611479.3611480
Shah R, Mukherjee K, Tyagi A, Karnam S, Joshi D, Bhosale S, Mitra S (2023). R2D2: Reducing Redundancy and Duplication in Data Lakes. Proceedings of the ACM on Management of Data 1:4 (1-25). Online publication date: 12-Dec-2023.
https://dl.acm.org/doi/10.1145/3626762
Peng Z, Wang Z, Deng D (2023). Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation. Proceedings of the ACM on Management of Data 1:2 (1-18). Online publication date: 20-Jun-2023.
https://doi.org/10.1145/3589324
Papadakis G, Fisichella M, Schoger F, Mandilaras G, Augsten N, Nejdl W (2023). Benchmarking Filtering Techniques for Entity Resolution. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (653-666). Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00389
Widmoser M, Kocher D, Augsten N, Mann W (2023). MetricJoin: Leveraging Metric Properties for Robust Exact Set Similarity Joins. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (1045-1058). Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00085
Wang Z, Wang S, Li J, Yuan C, Gu R, Huang Y (2022). VSIM. Journal of Parallel and Distributed Computing 158:C (29-46). Online publication date: 22-Apr-2022.
https://dl.acm.org/doi/10.1016/j.jpdc.2021.07.009
Li Y, Yu X, Koudas N (2021). LES3. Proceedings of the VLDB Endowment 14:11 (2073-2086). Online publication date: 27-Oct-2021.
https://dl.acm.org/doi/10.14778/3476249.3476263
Wang J, Yang J, Zhang W, Demartini G, Zuccon G, Culpepper J, Huang Z, Tong H (2021). Top-k Tree Similarity Join. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (1939-1948). Online publication date: 26-Oct-2021.
https://dl.acm.org/doi/10.1145/3459637.3482304
