research-article

Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Authors:

Xuemin Lin

Proceedings of the VLDB Endowment, Volume 1, Issue 1

Pages 933 - 944

https://doi.org/10.14778/1453856.1453957

Published: 01 August 2008

Abstract

There has been considerable interest in similarity joins in the research community recently. The similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity joins with edit distance constraints. Existing approaches are mainly based on converting the edit distance constraint into a weaker constraint on the number of matching q-grams between a pair of strings.
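The matching-q-gram constraint mentioned above can be illustrated with the well-known count filter: two strings within edit distance tau must share at least max(|s|, |t|) - q + 1 - q * tau q-grams, because each edit operation can destroy at most q of them. The following is a minimal sketch of that filter, not code from the paper; the function names `qgrams` and `count_filter_passes` are hypothetical, and no padding characters are used.

```python
from collections import Counter


def qgrams(s, q):
    """Return the list of all overlapping q-grams of s (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_filter_passes(s, t, q, tau):
    """Return False only if (s, t) certainly has edit distance > tau.

    A pair that passes is merely a candidate and still needs an exact
    edit distance check.
    """
    # Each edit operation destroys at most q q-grams.
    required = max(len(s), len(t)) - q + 1 - q * tau
    if required <= 0:
        # The bound is vacuous for short strings or large tau.
        return True
    # Multiset intersection counts shared q-grams with multiplicity.
    shared = sum((Counter(qgrams(s, q)) & Counter(qgrams(t, q))).values())
    return shared >= required
```

For example, with q = 2 and tau = 1, the pair ("abcdefgh", "abxdefgh") shares five q-grams and passes, while ("abcdefgh", "zzzzzzzz") shares none and is pruned.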

In this paper, we propose the novel perspective of investigating mismatching q-grams. Technically, we derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, Ed-Join, is proposed that exploits the new mismatch-based filtering methods; it substantially reduces the candidate sizes and hence saves computation time. We demonstrate experimentally that the new algorithm outperforms alternative methods on large-scale real datasets under a wide range of parameter settings.
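To make the idea of a location-based lower bound concrete, here is a simplified sketch under stated assumptions; it is not taken from the paper and may differ from the bounds Ed-Join actually uses. It assumes a q-gram of s "mismatches" when its content occurs nowhere in t. Every such q-gram must be touched by at least one edit operation, and one edit can touch only q-grams that overlap a single position, so the minimum number of positions needed to hit all mismatching q-grams is a lower bound on the edit distance. The name `mismatch_lower_bound` is hypothetical.

```python
def mismatch_lower_bound(s, t, q):
    """Lower-bound ed(s, t) from the locations of mismatching q-grams of s.

    Assumption for this sketch: a q-gram of s is mismatching if its
    content does not occur anywhere in t.
    """
    t_grams = {t[i:i + q] for i in range(len(t) - q + 1)}
    # Start positions of q-grams of s that have no match in t, ascending.
    starts = [i for i in range(len(s) - q + 1) if s[i:i + q] not in t_grams]

    edits = 0
    last_edit_pos = -1  # position of the most recently placed edit
    for p in starts:
        if p > last_edit_pos:
            # This q-gram is not covered yet. Placing the edit at its
            # last character covers as many later q-grams as possible.
            edits += 1
            last_edit_pos = p + q - 1
    return edits
```

A pair (s, t) can be discarded without computing the exact edit distance whenever this bound exceeds the threshold tau. The greedy sweep is the standard solution to covering equal-length intervals with the fewest points.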

Published In

Proceedings of the VLDB Endowment, Volume 1, Issue 1 (August 2008), 1216 pages

ISSN: 2150-8097

Publisher

VLDB Endowment
