
Generalizing prefix filtering to improve set similarity joins

Authors:

Leonardo Andrade Ribeiro,

Theo Härder

Information Systems, Volume 36, Issue 1

Pages 62 - 78

https://doi.org/10.1016/j.is.2010.07.003

Published: 01 March 2011

Abstract

Identifying all pairs of objects in a dataset whose similarity is not less than a specified threshold is of major importance for the management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile enough to represent a variety of similarity notions. At a high level of abstraction, most methods proposed so far consist of two main phases: candidate generation, which produces a set of candidate pairs, and verification, which applies the actual similarity measure to the candidates and returns the correct answer. Previous work has primarily focused on reducing the number of candidates, with candidate generation absorbing most of the effort in order to obtain better pruning results. Here, we propose the opposite approach. We drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects, at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups compared to known algorithms.
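To make the two phases named in the abstract concrete, the following is a minimal sketch of a generic filter-and-verify self-join that uses standard prefix filtering with Jaccard similarity. It is not the authors' algorithm: it does not reduce the number of indexed objects dynamically, and the function names (`similarity_self_join`, `prefix_length`, `jaccard`), the choice of Jaccard, and the rarest-token-first ordering are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def prefix_length(size, threshold):
    """Prefix length that keeps prefix filtering lossless for Jaccard.

    Two sets with Jaccard >= threshold share at least ceil(threshold * size)
    tokens, so their prefixes of this length (under one global token order)
    must intersect. The small epsilon guards against float round-up.
    """
    return size - math.ceil(threshold * size - 1e-9) + 1


def similarity_self_join(records, threshold):
    """Return sorted (i, j, sim) triples with i < j and sim >= threshold.

    Illustrative only: assumes 0 < threshold <= 1 and records given as
    iterables of hashable, mutually comparable tokens.
    """
    # Global token order: rare tokens first, ties broken by token value.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of records indexed under it
    results = []
    for rid, tokens in enumerate(ordered):
        if not tokens:
            continue
        prefix = tokens[:prefix_length(len(tokens), threshold)]

        # Phase 1, candidate generation: probe the inverted index.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Phase 2, verification: apply the actual similarity measure.
        current = set(tokens)
        for cid in candidates:
            sim = jaccard(current, set(ordered[cid]))
            if sim >= threshold:
                results.append((cid, rid, sim))

        # Index the current record so later records can find it.
        for tok in prefix:
            index[tok].append(rid)
    return sorted(results)
```

In this sketch every record is indexed under its full probing prefix. The trade-off described in the abstract concerns indexing fewer objects, which makes candidate generation cheaper and leaves more pairs for verification to reject; the sketch only shows where those two costs arise.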




Published In


Information Systems Volume 36, Issue 1

March, 2011

115 pages

ISSN: 0306-4379


Copyright © 2010 Elsevier B.V.

Publisher

Elsevier Science Ltd.

United Kingdom


Cited By

Mann W, Augsten N, Jensen C, Pawlik M (2025). SWOOP: top-k similarity joins over set streams. The VLDB Journal - The International Journal on Very Large Data Bases, 34:1. Online publication date: 1-Jan-2025
https://dl.acm.org/doi/10.1007/s00778-024-00880-x
Yang C, Chen L, Wang H, Shang S, Mao R, Zhang X (2023). Dynamic Set Similarity Join: An Update Log Based Approach. IEEE Transactions on Knowledge and Data Engineering, 35:4, pp. 3727-3741. Online publication date: 1-Apr-2023
https://dl.acm.org/doi/10.1109/TKDE.2021.3126631
Zeakis A, Skoutas D, Sacharidis D, Papapetrou O, Koubarakis M (2022). TokenJoin. Proceedings of the VLDB Endowment, 16:4, pp. 790-802. Online publication date: 1-Dec-2022
https://dl.acm.org/doi/10.14778/3574245.3574263
Wang Z, Wang S, Li J, Yuan C, Gu R, Huang Y (2022). VSIM. Journal of Parallel and Distributed Computing, 158:C, pp. 29-46. Online publication date: 22-Apr-2022
https://dl.acm.org/doi/10.1016/j.jpdc.2021.07.009
Fier F, Freytag J (2022). Parallelizing filter-and-verification based exact set similarity joins on multicores. Information Systems, 108:C. Online publication date: 1-Sep-2022
https://dl.acm.org/doi/10.1016/j.is.2021.101912
Sandes E, Teodoro G, Melo A (2022). Bitmap filter. Information Systems, 88:C. Online publication date: 21-Apr-2022
https://dl.acm.org/doi/10.1016/j.is.2019.101449
Qin J, Xiao C, Wang Y, Wang W, Lin X, Ishikawa Y, Wang G (2021). Generalizing the Pigeonhole Principle for Similarity Search in Hamming Space. IEEE Transactions on Knowledge and Data Engineering, 33:2, pp. 489-505. Online publication date: 11-Jan-2021
https://dl.acm.org/doi/10.1109/TKDE.2019.2899597
Papadakis G, Skoutas D, Thanos E, Palpanas T (2020). Blocking and Filtering Techniques for Entity Resolution. ACM Computing Surveys, 53:2, pp. 1-42. Online publication date: 20-Mar-2020
https://dl.acm.org/doi/10.1145/3377455
Kim T, Li W, Behm A, Cetindil I, Vernica R, Borkar V, Carey M, Li C (2020). Similarity query support in big data management systems. Information Systems, 88:C. Online publication date: 1-Feb-2020
https://dl.acm.org/doi/10.1016/j.is.2019.101455
Fier F, Wang T, Zhu E, Freytag J (2020). Parallelizing Filter-Verification Based Exact Set Similarity Joins on Multicores. Similarity Search and Applications, pp. 62-75. Online publication date: 30-Sep-2020
https://dl.acm.org/doi/10.1007/978-3-030-60936-8_5
