research-article

Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Published: 01 August 2008

Abstract

There has been considerable recent interest in similarity joins in the research community. The similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity joins with edit distance constraints. Existing approaches are mainly based on converting the edit distance constraint into a weaker constraint on the number of matching q-grams between a pair of strings.
In this paper, we propose the novel perspective of investigating mismatching q-grams. Technically, we derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. We propose a new algorithm, Ed-Join, that exploits these mismatch-based filtering methods; it substantially reduces the size of the candidate set and hence saves computation time. We demonstrate experimentally that the new algorithm outperforms alternative methods on large-scale real datasets under a wide range of parameter settings.
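
To make the "weaker constraint on the number of matching q-grams" concrete, the sketch below illustrates the classical count filter that the abstract attributes to existing approaches. It is not Ed-Join and does not implement the paper's mismatch-based bounds. It relies on the well-known fact that two strings within edit distance τ share at least max(|s|, |t|) − q + 1 − q·τ q-grams. The function names (`qgrams`, `passes_count_filter`, `similarity_join`) and the naive all-pairs loop are assumptions made for illustration only; a real system would use an inverted index rather than comparing every pair.

```python
from collections import Counter


def qgrams(s, q):
    """Return the list of overlapping substrings of length q (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(s, t):
    """Textbook dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def passes_count_filter(s, t, q, tau):
    """Necessary condition for edit_distance(s, t) <= tau (count filter)."""
    need = max(len(s), len(t)) - q + 1 - q * tau
    if need <= 0:
        return True  # the bound is vacuous, so the pair cannot be pruned
    # Multiset intersection counts q-grams shared by both strings.
    common = sum((Counter(qgrams(s, q)) & Counter(qgrams(t, q))).values())
    return common >= need


def similarity_join(strings, q, tau):
    """Naive self-join: filter candidate pairs, then verify survivors."""
    result = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            s, t = strings[i], strings[j]
            if abs(len(s) - len(t)) > tau:
                continue  # length filter
            if not passes_count_filter(s, t, q, tau):
                continue  # count filter
            if edit_distance(s, t) <= tau:
                result.append((s, t))
    return result


words = ["kitten", "sitten", "sitting", "mitten", "banana"]
pairs = similarity_join(words, q=2, tau=1)
```

Both filters only discard pairs that cannot satisfy the threshold, so every surviving pair still has to be verified with the exact edit distance. Cutting down the number of such verifications is the cost that the abstract says Ed-Join targets.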

Published In

Proceedings of the VLDB Endowment, Volume 1, Issue 1
August 2008
1216 pages

Publisher

VLDB Endowment

Cited By

  • (2024) Near-Duplicate Text Alignment with One Permutation Hashing. Proceedings of the ACM on Management of Data, 2:4 (1-26). DOI: 10.1145/3677136. Online publication date: 30-Sep-2024
  • (2024) CodeExtract: Enhancing Binary Code Similarity Detection with Code Extraction Techniques. Proceedings of the 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (143-154). DOI: 10.1145/3652032.3657572. Online publication date: 20-Jun-2024
  • (2024) MinJoin++: a fast algorithm for string similarity joins under edit distance. The VLDB Journal — The International Journal on Very Large Data Bases, 33:2 (281-299). DOI: 10.1007/s00778-023-00806-z. Online publication date: 1-Mar-2024
  • (2023) Schema Integration on Massive Data Sources. Algorithms and Architectures for Parallel Processing (186-206). DOI: 10.1007/978-981-97-0801-7_11. Online publication date: 20-Oct-2023
  • (2022) TokenJoin. Proceedings of the VLDB Endowment, 16:4 (790-802). DOI: 10.14778/3574245.3574263. Online publication date: 1-Dec-2022
  • (2022) SyncSignature. Proceedings of the VLDB Endowment, 16:2 (330-342). DOI: 10.14778/3565816.3565833. Online publication date: 1-Oct-2022
  • (2022) Serving deep learning models with deduplication from relational databases. Proceedings of the VLDB Endowment, 15:10 (2230-2243). DOI: 10.14778/3547305.3547325. Online publication date: 7-Sep-2022
  • (2022) Hierarchical filtering: improving similar substring matching under edit distance. World Wide Web, 26:4 (1967-2001). DOI: 10.1007/s11280-022-01128-w. Online publication date: 6-Dec-2022
  • (2021) Boosting Graph Similarity Search through Pre-Computation. Proceedings of the 2021 International Conference on Management of Data (951-963). DOI: 10.1145/3448016.3452780. Online publication date: 9-Jun-2021
  • (2021) Clustering Enhanced Error-tolerant Top-k Spatio-textual Search. World Wide Web, 24:4 (1185-1214). DOI: 10.1007/s11280-021-00883-6. Online publication date: 1-Jul-2021
