Article
DOI: 10.1007/11575832_13

N-gram similarity and distance

Published: 02 November 2005

Abstract

In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.
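
The special-case relationship stated in the abstract can be illustrated with a small sketch. The code below is not the paper's formulation: it assumes a simplified scheme in which each string is split into contiguous n-grams, two n-grams either match exactly (score 1, cost 0) or do not (score 0, cost 1), and no prefix or suffix padding is applied. The function names `ngrams`, `ngram_similarity`, and `ngram_distance` are hypothetical. Under these assumptions, setting n = 1 yields the length of the longest common subsequence and the standard edit distance.

```python
def ngrams(s, n):
    """Return the list of contiguous n-grams of s (empty if len(s) < n)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_similarity(x, y, n=1):
    """LCS-style dynamic program over the n-gram sequences of x and y.

    Assumption: two n-grams contribute 1 if identical, otherwise 0.
    With n=1 this is the length of the longest common subsequence.
    """
    a, b = ngrams(x, n), ngrams(y, n)
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            match = 1 if a[i - 1] == b[j - 1] else 0
            table[i][j] = max(
                table[i - 1][j],              # skip an n-gram of x
                table[i][j - 1],              # skip an n-gram of y
                table[i - 1][j - 1] + match,  # align the two n-grams
            )
    return table[len(a)][len(b)]


def ngram_distance(x, y, n=1):
    """Edit-distance-style dynamic program over n-gram sequences.

    Assumption: insertion and deletion cost 1; substitution costs 0 for
    identical n-grams and 1 otherwise. With n=1 this is the standard
    (Levenshtein) edit distance.
    """
    a, b = ngrams(x, n), ngrams(y, n)
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        table[i][0] = i  # delete all remaining n-grams of x
    for j in range(len(b) + 1):
        table[0][j] = j  # insert all remaining n-grams of y
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # match or substitution
            )
    return table[len(a)][len(b)]


if __name__ == "__main__":
    print(ngram_similarity("kitten", "sitting", n=1))  # LCS length
    print(ngram_distance("kitten", "sitting", n=1))    # edit distance
    print(ngram_similarity("kitten", "sitting", n=2))  # bigram variant
```

Both functions fill a table of size (len(a)+1) by (len(b)+1), so the running time is proportional to the product of the two string lengths. A fuller treatment would need graded scores between partially matching n-grams, padding at string boundaries, and normalization by length, all of which this sketch leaves out.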



Published In

SPIRE'05: Proceedings of the 12th international conference on String Processing and Information Retrieval
November 2005
404 pages
ISBN: 3540297405

Publisher

Springer-Verlag, Berlin, Heidelberg
