
State-of-the-art in string similarity search and join

Authors: Sebastian Wandelt, Stefan Gerdjikov, Shashwat Mishra, Petar Mitankin, Enrico Siragusa, Alexander Tiskin, Ulf Leser

ACM SIGMOD Record, Volume 43, Issue 1

Pages 64 - 76

https://doi.org/10.1145/2627692.2627706

Published: 13 May 2014

Abstract

String similarity search and its variants are fundamental problems with many applications in areas such as data integration, data quality, computational linguistics, and bioinformatics. A plethora of methods have been developed over the last decades. Obtaining an overview of the state of the art in this field is difficult: results are published in various domains without much cross-talk, papers use different data sets and often study subtle variations of the core problems, and the sheer number of proposed methods exceeds the capacity of a single research group. In this paper, we report the results of what is probably the largest benchmark ever performed in this field. To overcome the resource bottleneck, we organized the benchmark as an international competition, held as a workshop at EDBT/ICDT 2013. Teams from different fields and from all over the world developed or tuned programs for two crisply defined problems. All algorithms were evaluated by an external group on two machines. Altogether, we compared 14 different programs on two string matching problems (k-approximate search and k-approximate join) using data sets of increasing sizes and with different characteristics from two different domains. We compare programs primarily by wall clock time, but also provide results on memory usage, indexing time, batch query effects, and scalability in terms of CPU cores. Results were averaged over several runs and confirmed on a second, different hardware platform. A particularly interesting observation is that disciplines can and should learn more from each other: the three best teams came from computational linguistics, databases, and bioinformatics, respectively.
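To make the two benchmark problems concrete, the following is a minimal sketch of naive baselines for k-approximate search and k-approximate join. It assumes unit-cost edit distance as the similarity measure and treats the join as a self-join over one string collection; the abstract states neither detail. The function names (`edit_distance_at_most`, `k_approximate_search`, `k_approximate_join`) are hypothetical and are not taken from any of the benchmarked programs, which rely on far more sophisticated indexing and filtering.

```python
from typing import List, Tuple


def edit_distance_at_most(a: str, b: str, k: int) -> bool:
    """Return True if the unit-cost edit distance between a and b is <= k."""
    # Length filter: the distance is at least the difference in lengths.
    if abs(len(a) - len(b)) > k:
        return False
    # Row-by-row dynamic programming over the edit distance matrix.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # match or substitute
            ))
        # Row minima never decrease, so the final distance cannot drop
        # back to k once every cell in a row exceeds k.
        if min(current) > k:
            return False
        previous = current
    return previous[-1] <= k


def k_approximate_search(query: str, strings: List[str], k: int) -> List[str]:
    """Assumed problem statement: all strings within distance k of the query."""
    return [s for s in strings if edit_distance_at_most(query, s, k)]


def k_approximate_join(strings: List[str], k: int) -> List[Tuple[int, int]]:
    """Assumed problem statement: all index pairs (i, j), i < j, within distance k."""
    pairs = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if edit_distance_at_most(strings[i], strings[j], k):
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    cities = ["berlin", "merlin", "boston", "berlins"]
    print(k_approximate_search("berlin", cities, 1))
    print(k_approximate_join(["abc", "abd", "xyz", "ab"], 1))
```

The search baseline scans every string per query and the join baseline compares all pairs, so both scale poorly with data set size. Baselines of this kind are mainly useful for checking the output of faster methods on small inputs.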





Published In

ACM SIGMOD Record, Volume 43, Issue 1

March 2014

71 pages

ISSN: 0163-5808

DOI: 10.1145/2627692

Editors: Ioana Manolescu (INRIA Saclay), Denilson Barbosa (University of Alberta), Pablo Barceló (Universidad de Chile), Vanessa Braganholo (Universidade Federal Fluminense), Marco Brambilla (Politecnico di Milano), Chee Yong Chan (National University of Singapore), Rada Chirkova (North Carolina State University), Anish Das Sarma (Google Research), Glenn Paulley (Conestoga College), Alkis Simitsis (HP Labs), Nesime Tatbul (ETH Zurich), Marianne Winslett (University of Illinois)


Copyright © 2014 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers: Column
