Efficient Similarity Join and Search on Multi-Attribute Data

Authors:

Jian Li

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 1137 - 1151

https://doi.org/10.1145/2723372.2723733

Published: 27 May 2015

Abstract

In this paper we study similarity join and search on multi-attribute data. Traditional methods on single-attribute data have pruning power only on single attributes and cannot efficiently support multi-attribute data. To address this problem, we propose a prefix tree index which has holistic pruning ability on multiple attributes. We propose a cost model to quantify the prefix tree which can guide the prefix tree construction. Based on the prefix tree, we devise a filter-verification framework to support similarity search and join on multi-attribute data. The filter step prunes a large number of dissimilar results and identifies some candidates using the prefix tree, and the verification step verifies the candidates to generate the final answer. For similarity join, we prove that constructing an optimal prefix tree is NP-complete and develop a greedy algorithm to achieve high performance. For similarity search, since one prefix tree cannot support all possible search queries, we extend the cost model to support similarity search and devise a budget-based algorithm to construct multiple high-quality prefix trees. We also devise a hybrid verification algorithm to improve the verification step. Experimental results show our method significantly outperforms baseline approaches.
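
The filter-verification idea in the abstract can be illustrated with a minimal sketch. This is not the paper's prefix tree index or its cost model: it swaps in classic single-attribute prefix filtering as the filter step and then verifies every attribute. The record layout (one token set per attribute), the choice of Jaccard similarity, and the names `prefix`, `similarity_join`, and `filter_attr` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def prefix(tokens, tau):
    """Prefix of a token set under a fixed global order (here: sorted order).

    If two sets have Jaccard >= tau, their prefixes share at least one token.
    Assumes 0 < tau <= 1 and a non-empty token set.
    """
    ordered = sorted(tokens)
    # Minimum overlap the set needs with any similar partner; the small
    # epsilon guards against floating-point round-up in tau * len.
    need = math.ceil(tau * len(ordered) - 1e-9)
    return ordered[: len(ordered) - need + 1]


def similarity_join(records, taus, filter_attr=0):
    """Self-join: pairs (i, j), i < j, similar on every attribute.

    records: list of tuples, one token set per attribute (assumed layout).
    taus: one Jaccard threshold per attribute.
    Returns (result pairs, number of candidates that were verified).
    """
    # Filter step: an inverted index over prefix tokens of ONE attribute.
    # (The paper instead prunes on multiple attributes at once.)
    index = defaultdict(list)
    candidates = set()
    for rid, rec in enumerate(records):
        for tok in prefix(rec[filter_attr], taus[filter_attr]):
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].append(rid)

    # Verification step: check every attribute of each candidate pair.
    results = [
        (i, j)
        for i, j in sorted(candidates)
        if all(
            jaccard(records[i][k], records[j][k]) >= taus[k]
            for k in range(len(taus))
        )
    ]
    return results, len(candidates)


records = [
    ({"efficient", "similarity", "join"}, {"sigmod", "2015"}),
    ({"efficient", "similarity", "search"}, {"sigmod", "2015"}),
    ({"efficient", "similarity", "join"}, {"vldb", "2014"}),
    ({"graph", "pattern", "mining"}, {"sigmod", "2015"}),
]
pairs, n_candidates = similarity_join(records, taus=(0.5, 0.5))
print(pairs, n_candidates)
```

In this toy input the filter prunes on the first attribute only, so two of the surviving candidates fail on the second attribute during verification. That wasted verification work is the kind of cost that motivates pruning on several attributes together, as the abstract describes.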

Index Terms

Efficient Similarity Join and Search on Multi-Attribute Data
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Information

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

May 2015

2110 pages

ISBN: 9781450327589

DOI: 10.1145/2723372

General Chair: Timos Sellis (RMIT University, Australia)

Program Chairs: Susan B. Davidson (University of Pennsylvania, USA), Zack Ives (University of Pennsylvania, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

973, China
NSFC

Conference

SIGMOD/PODS'15: International Conference on Management of Data

Sponsor: SIGMOD

May 31 - June 4, 2015

Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate: 106 of 415 submissions, 26%

Overall Acceptance Rate: 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 21
Total Downloads: 888
Downloads (Last 12 months): 42
Downloads (Last 6 weeks): 2

Reflects downloads up to 25 Dec 2024

Cited By

Xue M, Wu W, Luo J, Zhang Y, Zhao B (2024). High-Parallelism and Pipelined Architecture for Accelerating Sort-Merge Join on FPGA. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E107.A:10 (1582-1594). Online publication date: 1-Oct-2024
https://doi.org/10.1587/transfun.2023EAP1135

Silva V, Nascimento D (2024). Enhancing Multi-Attribute Similarity Join using Reduced and Adaptive Index Trees. Knowledge and Information Systems, 66:7 (4251-4281). Online publication date: 9-Apr-2024
https://doi.org/10.1007/s10115-024-02089-4

Esmailoghli M, Quiané-Ruiz J, Abedjan Z (2022). MATE. Proceedings of the VLDB Endowment, 15:8 (1684-1696). Online publication date: 22-Jun-2022
https://dl.acm.org/doi/10.14778/3529337.3529353

Sun J, Li G, Tang N, Li G, Li Z, Idreos S, Srivastava D (2021). Learned Cardinality Estimation for Similarity Queries. Proceedings of the 2021 International Conference on Management of Data (1745-1757). Online publication date: 9-Jun-2021
https://dl.acm.org/doi/10.1145/3448016.3452790

Papadakis G, Skoutas D, Thanos E, Palpanas T (2020). Blocking and Filtering Techniques for Entity Resolution. ACM Computing Surveys, 53:2 (1-42). Online publication date: 20-Mar-2020
https://dl.acm.org/doi/10.1145/3377455

Wang Y, Xiao C, Qin J, Cao X, Sun Y, Wang W, Onizuka M, Maier D, Pottinger R, Doan A, Tan W, Alawini A, Ngo H (2020). Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (1197-1212). Online publication date: 11-Jun-2020
https://dl.acm.org/doi/10.1145/3318464.3380570

Wang H, Yang L, Xiao Y (2020). SETJoin: a novel top-k similarity join algorithm. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 24:19 (14577-14592). Online publication date: 6-Mar-2020
https://dl.acm.org/doi/10.1007/s00500-020-04807-w

Sun J, Shang Z, Li G, Deng D, Bao Z (2019). Balance-aware distributed string similarity-based query processing system. Proceedings of the VLDB Endowment, 12:9 (961-974). Online publication date: 1-May-2019
https://dl.acm.org/doi/10.14778/3329772.3329774

Xu J, Bao Z, Lu H (2019). Continuous Range Queries Over Multi-attribute Trajectories. 2019 IEEE 35th International Conference on Data Engineering (ICDE) (1610-1613). Online publication date: Apr-2019
https://doi.org/10.1109/ICDE.2019.00154

C. P, Ardalan A, Doan A, Akella A (2018). Smurf. Proceedings of the VLDB Endowment, 12:3 (278-291). Online publication date: 1-Nov-2018
https://dl.acm.org/doi/10.14778/3291264.3291272
