column

State-of-the-art in string similarity search and join

Published: 13 May 2014
  • Abstract

    String similarity search and its variants are fundamental problems with many applications in areas such as data integration, data quality, computational linguistics, and bioinformatics. A plethora of methods have been developed over the last decades. Obtaining an overview of the state of the art in this field is difficult: results are published in various domains without much cross-talk, papers use different data sets and often study subtle variations of the core problems, and the sheer number of proposed methods exceeds the capacity of a single research group. In this paper, we report on the results of what is probably the largest benchmark ever performed in this field. To overcome the resource bottleneck, we organized the benchmark as an international competition, held as a workshop at EDBT/ICDT 2013. Various teams from different fields and from all over the world developed or tuned programs for two crisply defined problems. All algorithms were evaluated by an external group on two machines. Altogether, we compared 14 different programs on two string matching problems (k-approximate search and k-approximate join) using data sets of increasing sizes and with different characteristics from two different domains. We compare programs primarily by wall clock time, but also provide results on memory usage, indexing time, batch query effects, and scalability in terms of CPU cores. Results were averaged over several runs and confirmed on a second, different hardware platform. A particularly interesting observation is that disciplines can and should learn more from each other: the three best teams have their roots in computational linguistics, databases, and bioinformatics, respectively.
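
The two problems named in the abstract can be illustrated with a naive baseline. The sketch below assumes that "k-approximate" means an edit (Levenshtein) distance of at most k, which the abstract itself does not spell out. The function names `within_k`, `k_approximate_search`, and `k_approximate_join` are hypothetical and are not taken from any of the benchmarked programs, which rely on indexes and filters rather than on exhaustive comparison.

```python
from typing import List, Tuple


def within_k(a: str, b: str, k: int) -> bool:
    """Return True if the Levenshtein distance between a and b is at most k.

    Assumption: unit costs for insertion, deletion, and substitution.
    """
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > k:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        # Row minima never decrease, so the final distance is at least min(cur).
        if min(cur) > k:
            return False
        prev = cur
    return prev[-1] <= k


def k_approximate_search(query: str, strings: List[str], k: int) -> List[str]:
    """Naive search: every string within distance k of the query."""
    return [s for s in strings if within_k(query, s, k)]


def k_approximate_join(strings: List[str], k: int) -> List[Tuple[str, str]]:
    """Naive self-join: every unordered pair of strings within distance k."""
    return [(strings[i], strings[j])
            for i in range(len(strings))
            for j in range(i + 1, len(strings))
            if within_k(strings[i], strings[j], k)]


if __name__ == "__main__":
    cities = ["berlin", "berlim", "bern", "paris"]
    print(k_approximate_search("berlin", cities, 1))
    print(k_approximate_join(cities, 1))
```

The search scans every string once per query and the join compares every pair of strings, so this sketch is useful only as a correctness reference on small inputs, not as a competitive method.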

      Published In

      ACM SIGMOD Record  Volume 43, Issue 1
      March 2014
      71 pages
      ISSN:0163-5808
      DOI:10.1145/2627692

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. comparison
      2. scalability
      3. string join
      4. string search

      Qualifiers

      • Column
