Similarity measures for sequential data

Author: Konrad Rieck

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Volume 1, Issue 4

Pages 296 - 304

https://doi.org/10.1002/widm.36

Published: 01 July 2011

Abstract

Expressive comparison of strings is a prerequisite for the analysis of sequential data in many areas of computer science. However, comparing strings and assessing their similarity is not a trivial task, and there exist several contrasting approaches for defining similarity measures over sequential data. In this paper, we review three major classes of such similarity measures: edit distances, bag-of-words models, and string kernels. Each of these classes originates from a particular application domain and models the similarity of strings differently. We present these classes and the underlying comparisons in detail, highlight their advantages and differences, and provide basic algorithms supporting practical applications. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1: 296–304. DOI: 10.1002/widm.36
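To make the three classes named in the abstract concrete, the following is a minimal Python sketch with one simple representative of each: the Levenshtein edit distance, a cosine similarity over word counts, and an n-gram (spectrum-style) string kernel. It is an illustration under stated assumptions, not the algorithms presented in the paper; the function names, the whitespace tokenization, and the default n = 3 are all hypothetical choices made here.

```python
from collections import Counter
from math import sqrt


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the textbook dynamic program (two rows)."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def bag_of_words_cosine(x: str, y: str) -> float:
    """Cosine similarity of word-count vectors (assumes whitespace tokens)."""
    a, b = Counter(x.split()), Counter(y.split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def ngram_counts(x: str, n: int) -> Counter:
    """Count all contiguous substrings of length n in x."""
    return Counter(x[i:i + n] for i in range(len(x) - n + 1))


def spectrum_kernel(x: str, y: str, n: int = 3) -> int:
    """Inner product of n-gram count vectors, one simple string kernel."""
    a, b = ngram_counts(x, n), ngram_counts(y, n)
    return sum(a[g] * b[g] for g in a.keys() & b.keys())
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3. The edit distance is a dissimilarity (smaller means more alike), while the other two functions are similarities (larger means more alike), so their values are not directly comparable.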

Index Terms

Similarity measures for sequential data
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis

Index terms have been assigned to the content through auto-classification.

Published In

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Volume 1, Issue 4
July 2011, 100 pages
ISSN: 1942-4787
EISSN: 1942-4795

Publisher

John Wiley & Sons, Inc., United States

Cited By

Paik Y, Kim S, Jung D, Kim M. (2020). Generating Representative Test Sequences from Real Workload for Minimizing DRAM Verification Overhead. ACM Transactions on Design Automation of Electronic Systems 25:4 (1-23). DOI: 10.1145/3391891. Online publication date: 27-May-2020.
https://dl.acm.org/doi/10.1145/3391891
Zarka R, Cordier A, Egyed-Zsigmond E, Lamontagne L, Mille A. (2016). Trace-based contextual recommendations. Expert Systems with Applications: An International Journal 64:C (194-207). DOI: 10.1016/j.eswa.2016.07.035. Online publication date: 1-Dec-2016.
https://dl.acm.org/doi/10.1016/j.eswa.2016.07.035
Loh C, Li I, Sheng Y. (2016). Comparison of similarity measures to differentiate players' actions and decision-making profiles in serious games analytics. Computers in Human Behavior 64:C (562-574). DOI: 10.1016/j.chb.2016.07.024. Online publication date: 1-Nov-2016.
https://dl.acm.org/doi/10.1016/j.chb.2016.07.024
Gong W, Chen X, Qiang S, Jin Y. (2013). Trajectory pattern change analysis in campus WiFi networks. Proceedings of the Second ACM SIGSPATIAL International Workshop on Mobile Geographic Information Systems (1-8). DOI: 10.1145/2534190.2534191. Online publication date: 5-Nov-2013.
https://dl.acm.org/doi/10.1145/2534190.2534191
