Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/2723372.2723733acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Efficient Similarity Join and Search on Multi-Attribute Data

Published: 27 May 2015 Publication History

Abstract

In this paper we study similarity join and search on multi- attribute data. Traditional methods on single-attribute data have pruning power only on single attributes and cannot efficiently support multi-attribute data. To address this problem, we propose a prefix tree index which has holis- tic pruning ability on multiple attributes. We propose a cost model to quantify the prefix tree which can guide the prefix tree construction. Based on the prefix tree, we devise a filter-verification framework to support similarity search and join on multi-attribute data. The filter step prunes a large number of dissimilar results and identifies some candi- dates using the prefix tree and the verification step verifies the candidates to generate the final answer. For similar- ity join, we prove that constructing an optimal prefix tree is NP-complete and develop a greedy algorithm to achieve high performance. For similarity search, since one prefix tree cannot support all possible search queries, we extend the cost model to support similarity search and devise a budget-based algorithm to construct multiple high-quality prefix trees. We also devise a hybrid verification algorithm to improve the verification step. Experimental results show our method significantly outperforms baseline approaches.

References

[1]
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007.
[2]
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, pages 5--16, 2006.
[3]
N. N. Dalvi, V. Rastogi, A. Dasgupta, A. D. Sarma, and T. Sarlós. Optimal hashing schemes for entity matching. In WWW, pages 295--306, 2013.
[4]
D. Deng, G. Li, and J. Feng. A pivotal prefix based filtering algorithm for string similarity search. In SIGMOD Conference, pages 673--684, 2014.
[5]
D. Deng, G. Li, J. Feng, and W.-S. Li. Top-k string similarity search with edit-distance constraints. In ICDE, pages 925--936, 2013.
[6]
D. Deng, G. Li, S. Hao, J. Wang, and J. Feng. Massjoin: A mapreduce-based method for scalable string similarity joins. In ICDE, pages 340--351, 2014.
[7]
M. Garey and D. Johnson. A guide to the theory of NP-completeness. WH Freeman and Company, 1979.
[8]
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491--500, 2001.
[9]
M. Hadjieleftheriou, N. Koudas, and D. Srivastava. Incremental maintenance of length normalized indexes for approximate string matching. In SIGMOD Conference, pages 429--440, 2009.
[10]
J. M. Hellerstein and M. Stonebraker. Predicate migration: Optimizing queries with expensive predicates. In SIGMOD Conference, pages 267--276, 1993.
[11]
M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov., 2(1):9--37, 1998.
[12]
Y. Jiang, G. Li, J. Feng, and W. Li. String similarity joins: An experimental evaluation. PVLDB, 7(8):625--636, 2014.
[13]
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In SIGMOD Conference, pages 802--803, 2006.
[14]
C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257--266, 2008.
[15]
C. Li, B. Wang, and X. Yang. Vgram: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, pages 303--314, 2007.
[16]
G. Li, D. Deng, and J. Feng. A partition-based method for string similarity joins with edit-distance constraints. ACM Trans. Database Syst., 38(2):9, 2013.
[17]
G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253--264, 2011.
[18]
A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In SIGKDD, pages 169--178, 2000.
[19]
M. Michelson and C. A. Knoblock. Learning blocking schemes for record linkage. In AAAI, pages 440--445, 2006.
[20]
J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD Conference, pages 1033--1044, 2011.
[21]
A. D. Sarma, A. Jain, A. Machanavajjhala, and P. Bohannon. An automatic blocking mechanism for large-scale de-duplication tasks. In CIKM, pages 1055--1064, 2012.
[22]
J. Wang, G. Li, D. Deng, Y. Zhang, and J. Feng. Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search. In ICDE, 2015.
[23]
J. Wang, G. Li, and J. Feng. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB, 3(1):1219--1230, 2010.
[24]
J. Wang, G. Li, and J. Feng. Fast-join: An efficient method for fuzzy token matching based string similarity join. In ICDE, pages 458--469, 2011.
[25]
J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD Conference, pages 85--96, 2012.
[26]
W. Wang, J. Qin, C. Xiao, X. Lin, and H. T. Shen. Vchunkjoin: An efficient algorithm for edit similarity joins. IEEE Trans. Knowl. Data Eng., 25(8):1916--1929, 2013.
[27]
S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In SIGMOD Conference, pages 219--232, 2009.
[28]
C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933--944, 2008.
[29]
C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, pages 131--140, 2008.
[30]
Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD, pages 915--926, 2010.

Cited By

View all
  • (2024)High-Parallelism and Pipelined Architecture for Accelerating Sort-Merge Join on FPGAIEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences10.1587/transfun.2023EAP1135E107.A:10(1582-1594)Online publication date: 1-Oct-2024
  • (2024)Enhancing Multi-Attribute Similarity Join using Reduced and Adaptive Index TreesKnowledge and Information Systems10.1007/s10115-024-02089-466:7(4251-4281)Online publication date: 9-Apr-2024
  • (2022)MATEProceedings of the VLDB Endowment10.14778/3529337.352935315:8(1684-1696)Online publication date: 22-Jun-2022
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 May 2015

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. multi-attribute data
  2. similarity join
  3. similarity search

Qualifiers

  • Research-article

Funding Sources

  • 973 China
  • 973, China
  • NSFC

Conference

SIGMOD/PODS'15
Sponsor:
SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Victoria, Melbourne, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)42
  • Downloads (Last 6 weeks)2
Reflects downloads up to 25 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)High-Parallelism and Pipelined Architecture for Accelerating Sort-Merge Join on FPGAIEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences10.1587/transfun.2023EAP1135E107.A:10(1582-1594)Online publication date: 1-Oct-2024
  • (2024)Enhancing Multi-Attribute Similarity Join using Reduced and Adaptive Index TreesKnowledge and Information Systems10.1007/s10115-024-02089-466:7(4251-4281)Online publication date: 9-Apr-2024
  • (2022)MATEProceedings of the VLDB Endowment10.14778/3529337.352935315:8(1684-1696)Online publication date: 22-Jun-2022
  • (2021)Learned Cardinality Estimation for Similarity QueriesProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3452790(1745-1757)Online publication date: 9-Jun-2021
  • (2020)Blocking and Filtering Techniques for Entity ResolutionACM Computing Surveys10.1145/337745553:2(1-42)Online publication date: 20-Mar-2020
  • (2020)Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning ApproachProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3380570(1197-1212)Online publication date: 11-Jun-2020
  • (2020)SETJoin: a novel top-k similarity join algorithmSoft Computing - A Fusion of Foundations, Methodologies and Applications10.1007/s00500-020-04807-w24:19(14577-14592)Online publication date: 6-Mar-2020
  • (2019)Balance-aware distributed string similarity-based query processing systemProceedings of the VLDB Endowment10.14778/3329772.332977412:9(961-974)Online publication date: 1-May-2019
  • (2019)Continuous Range Queries Over Multi-attribute Trajectories2019 IEEE 35th International Conference on Data Engineering (ICDE)10.1109/ICDE.2019.00154(1610-1613)Online publication date: Apr-2019
  • (2018)SmurfProceedings of the VLDB Endowment10.14778/3291264.329127212:3(278-291)Online publication date: 1-Nov-2018
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media