
Generalizing prefix filtering to improve set similarity joins

Published: 01 March 2011

Abstract

Identifying all pairs of objects in a dataset whose similarity is not less than a specified threshold is of major importance for the management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile enough to represent a variety of similarity notions. At a high level of abstraction, most methods proposed so far consist of two main phases: candidate generation, which produces a set of candidate pairs, and verification, which applies the actual similarity measure to the candidates and returns the correct answer. Previous work has primarily focused on reducing the number of candidates, with most of the effort spent in candidate generation to obtain better pruning. Here, we propose the opposite approach. We drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects, at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups compared to known algorithms.
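As an illustration only, the two-phase scheme the abstract describes can be sketched with a textbook prefix filter for Jaccard similarity. This is not the algorithm proposed in the article, whose generalized prefix filtering is not specified here. The function names (`jaccard`, `prefix_length`, `similarity_join`), the choice of Jaccard as the similarity measure, and the rare-token-first ordering are all assumptions made for the example, and the threshold is assumed to lie in (0, 1].

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_length(size, threshold):
    """Number of leading tokens to inspect for a set of the given size.

    Two sets with Jaccard >= threshold share at least ceil(threshold * size)
    tokens, so their prefixes of this length must have a token in common.
    The small epsilon guards against floating-point round-up.
    """
    return size - math.ceil(threshold * size - 1e-9) + 1


def similarity_join(records, threshold):
    """Return (i, j, sim) for all i < j with jaccard(records[i], records[j]) >= threshold.

    Hypothetical baseline: index only the prefix tokens of each record
    (candidate generation), then compute the exact similarity of every
    candidate pair (verification). Empty records are skipped.
    """
    # Global token order: rare tokens first, ties broken by the token itself.
    freq = defaultdict(int)
    for rec in records:
        for tok in set(rec):
            freq[tok] += 1
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of earlier records with it in their prefix
    results = []
    for rid, tokens in enumerate(ordered):
        if not tokens:
            continue
        prefix = tokens[:prefix_length(len(tokens), threshold)]

        # Phase 1: candidate generation via the inverted index on prefixes.
        candidates = set()
        for tok in prefix:
            candidates.update(index.get(tok, ()))

        # Phase 2: verification with the actual similarity measure.
        for cid in candidates:
            sim = jaccard(tokens, ordered[cid])
            if sim >= threshold:
                results.append((cid, rid, sim))

        # Index the current record so later records can find it.
        for tok in prefix:
            index[tok].append(rid)
    return results


records = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"], ["a", "b", "c", "d"]]
pairs = similarity_join(records, 0.5)
```

In this sketch, a shorter indexed prefix makes the index smaller and candidate generation cheaper, but fewer pairs are pruned before verification. The abstract's trade-off concerns this same balance between the two phases, although the article's own mechanism for it may differ.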

Published In

Information Systems, Volume 36, Issue 1
March 2011
115 pages

Publisher

Elsevier Science Ltd.

United Kingdom

Author Tags

  1. Advanced query processing
  2. Set similarity join
