Efficient top-k algorithms for fuzzy search in string collections

Authors: Rares Vernica, Chen Li

KEYS '09: Proceedings of the First International Workshop on Keyword Search on Structured Data

Pages 9 - 14

https://doi.org/10.1145/1557670.1557677

Published: 28 June 2009

Abstract

An approximate search query on a collection of strings finds those strings in the collection that are similar to a given query string, where similarity is defined using a given similarity function such as Jaccard, cosine, and edit distance. Answering approximate queries efficiently is important in many applications such as search engines, data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. In this paper, we study the problem of efficiently computing the best answers to an approximate string query, where the quality of a string is based on both its importance and its similarity to the query string. We first develop a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques. We then develop efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists. We answer a ranking query by traversing the inverted lists, pruning and skipping irrelevant string ids, iteratively increasing the pruning and skipping power, and doing early termination. We have conducted extensive experiments on real datasets to evaluate the proposed algorithms and report our findings.
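
To make the gram-based ranking idea concrete, here is a minimal sketch in Python. It is not the paper's algorithm. It assumes 2-grams without padding, Jaccard similarity over gram sets, and a hypothetical linear score `alpha * weight + (1 - alpha) * similarity`; the names `grams`, `build_index`, `top_k`, `weights`, and `alpha` are all illustrative. Only strings sharing at least one gram with the query are considered, and the early-termination test is a simple weight-based bound rather than the list-skipping techniques the abstract describes.

```python
import heapq
from collections import defaultdict

def grams(s, q=2):
    """Set of q-grams of s (no padding; an assumption made for brevity)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def build_index(strings, q=2):
    """Map each gram to the list of string ids containing it."""
    index = defaultdict(list)
    sizes = []  # number of distinct grams per string id
    for sid, s in enumerate(strings):
        gs = grams(s, q)
        sizes.append(len(gs))
        for g in gs:
            index[g].append(sid)
    return index, sizes

def top_k(query, weights, index, sizes, k, q=2, alpha=0.5):
    """Return up to k (score, string id) pairs, best first."""
    qg = grams(query, q)
    # Traverse the query's inverted lists, counting shared grams per id.
    shared = defaultdict(int)
    for g in qg:
        for sid in index.get(g, ()):
            shared[sid] += 1
    heap = []  # min-heap holding the best k (score, sid) seen so far
    # Visit candidates by decreasing weight so the score bound only shrinks.
    for sid in sorted(shared, key=lambda i: -weights[i]):
        best_possible = alpha * weights[sid] + (1 - alpha) * 1.0
        if len(heap) == k and best_possible <= heap[0][0]:
            break  # no remaining candidate can enter the top k
        c = shared[sid]
        sim = c / (len(qg) + sizes[sid] - c)  # Jaccard from counts
        score = alpha * weights[sid] + (1 - alpha) * sim
        if len(heap) < k:
            heapq.heappush(heap, (score, sid))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, sid))
    return sorted(heap, reverse=True)

# Hypothetical data: weights stand in for per-string importance.
strings = ["apple", "apply", "ample", "banana"]
weights = [0.9, 0.5, 0.7, 1.0]
index, sizes = build_index(strings)
for score, sid in top_k("appel", weights, index, sizes, k=2):
    print(strings[sid], round(score, 3))
```

Because the number of shared grams and the gram-set sizes are both known after the list traversal, the Jaccard value can be computed without revisiting the strings. Edit distance would instead require a verification step on each surviving candidate.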

Index Terms

Efficient top-k algorithms for fuzzy search in string collections
1. Information systems
  1. Information retrieval
  2. Information systems applications

Published In

KEYS '09: Proceedings of the First International Workshop on Keyword Search on Structured Data

June 2009

54 pages

ISBN:9781605585703

DOI:10.1145/1557670

General Chair: M. Tamer Özsu (University of Waterloo)

Program Chairs: Yi Chen (Arizona State University), Lei Chen (Hong Kong University of Science and Technology)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 June 2009

Qualifiers

Research-article

Conference

SIGMOD/PODS '09

Sponsor:

SIGMOD/PODS '09: International Conference on Management of Data

June 28, 2009

Providence, Rhode Island
