DOI: 10.1145/1557670.1557677
research-article

Efficient top-k algorithms for fuzzy search in string collections

Published: 28 June 2009

Abstract

An approximate search query on a collection of strings finds those strings in the collection that are similar to a given query string, where similarity is defined using a given similarity function such as Jaccard, cosine, and edit distance. Answering approximate queries efficiently is important in many applications such as search engines, data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. In this paper, we study the problem of efficiently computing the best answers to an approximate string query, where the quality of a string is based on both its importance and its similarity to the query string. We first develop a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques. We then develop efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists. We answer a ranking query by traversing the inverted lists, pruning and skipping irrelevant string ids, iteratively increasing the pruning and skipping power, and doing early termination. We have conducted extensive experiments on real datasets to evaluate the proposed algorithms and report our findings.
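The abstract does not give the scoring function or the pruning rules, so the following is only a minimal sketch of the general idea of a ranking query over gram-based inverted lists, not the paper's algorithms. It assumes 2-grams without padding, Jaccard similarity over gram sets, and a hypothetical linear score `alpha * weight + (1 - alpha) * similarity`, and it treats only strings sharing at least one gram with the query as candidates. The names `grams`, `build_index`, `top_k`, and `alpha` are illustrative. The sketch scans every candidate and omits the skipping and early termination the abstract describes.

```python
import heapq
from collections import defaultdict


def grams(s, q=2):
    """Return the overlapping q-grams of s (no padding; an assumption)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def jaccard(a, b, q=2):
    """Jaccard similarity between the q-gram sets of a and b."""
    ga, gb = set(grams(a, q)), set(grams(b, q))
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def build_index(strings, q=2):
    """Map each q-gram to the list of ids of strings that contain it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(grams(s, q)):
            index[g].append(sid)
    return index


def top_k(query, strings, weights, index, k=3, alpha=0.5, q=2):
    """Return the k best (string, score) pairs, best first.

    The score is a hypothetical linear mix of importance and similarity.
    """
    # Candidates: ids that appear on at least one of the query's lists.
    candidates = set()
    for g in set(grams(query, q)):
        candidates.update(index.get(g, ()))

    heap = []  # min-heap of (score, sid) holding the current best k
    for sid in candidates:
        sim = jaccard(query, strings[sid], q)
        score = alpha * weights[sid] + (1 - alpha) * sim
        if len(heap) < k:
            heapq.heappush(heap, (score, sid))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, sid))
    return [(strings[sid], score) for score, sid in sorted(heap, reverse=True)]


# Illustrative data: importance weights are made up for the example.
strings = ["search", "searching", "research", "string", "ranking"]
weights = [0.9, 0.2, 0.5, 0.7, 0.1]
index = build_index(strings)
best = top_k("serch", strings, weights, index, k=2)
```

The bounded min-heap keeps the k best candidates seen so far, and its smallest score is the natural threshold an early-termination rule would compare upper bounds against.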



      Published In

      KEYS '09: Proceedings of the First International Workshop on Keyword Search on Structured Data
      June 2009
      54 pages
      ISBN:9781605585703
      DOI:10.1145/1557670
      General Chair: M. Tamer Özsu
      Program Chairs: Yi Chen, Lei Chen
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Conference

      SIGMOD/PODS '09

      Cited By

      • (2022) Cardinality estimation of approximate substring queries using deep learning. Proceedings of the VLDB Endowment 15:11 (3145-3157). DOI: 10.14778/3551793.3551859. Online publication date: 1-Jul-2022
      • (2021) Novel Few-Shot Learning Neural Network for Predicting Carbohydrate-Active Enzyme Affinity Toward Fructo-Oligosaccharides. Journal of Computational Biology 28:12 (1208-1218). DOI: 10.1089/cmb.2021.0091. Online publication date: 1-Dec-2021
      • (2020) A Transformation-Based Framework for KNN Set Similarity Search. IEEE Transactions on Knowledge and Data Engineering 32:3 (409-423). DOI: 10.1109/TKDE.2018.2886189. Online publication date: 1-Mar-2020
      • (2017) A Privacy-Preserving Multi-Pattern Matching Scheme for Searching Strings in Cloud Database. 2017 15th Annual Conference on Privacy, Security and Trust (PST) (293-29309). DOI: 10.1109/PST.2017.00042. Online publication date: Aug-2017
      • (2016) Learning transformation rule in heterogeneous environment using log based posterior model. 2016 10th International Conference on Intelligent Systems and Control (ISCO) (1-5). DOI: 10.1109/ISCO.2016.7727044. Online publication date: Jan-2016
      • (2015) Efficient and accurate approach for approximate string search in spatial dataset. 2015 IEEE International Advance Computing Conference (IACC) (315-318). DOI: 10.1109/IADCC.2015.7154721. Online publication date: Jun-2015
      • (2015) Enhance Lecture Archive Search with OCR Slide Detection and In-Memory Database Technology. Proceedings of the 2015 IEEE 18th International Conference on Computational Science and Engineering (CSE) (176-183). DOI: 10.1109/CSE.2015.19. Online publication date: 21-Oct-2015
      • (2014) Efficiently Supporting Edit Distance Based String Similarity Search Using B+-Trees. IEEE Transactions on Knowledge and Data Engineering 26:12 (2983-2996). DOI: 10.1109/TKDE.2014.2309131. Online publication date: Dec-2014
      • (2014) A Probabilistic Approach to String Transformation. IEEE Transactions on Knowledge and Data Engineering 26:5 (1063-1075). DOI: 10.1109/TKDE.2013.11. Online publication date: 1-May-2014
      • (2014) Exploration on efficient similar sentences extraction. World Wide Web 17:4 (595-626). DOI: 10.1007/s11280-012-0195-z. Online publication date: 1-Jul-2014
