Similarity measures for sequential data

Published: 01 July 2011

Abstract

    Expressive comparison of strings is a prerequisite for the analysis of sequential data in many areas of computer science. However, comparing strings and assessing their similarity is not a trivial task, and there exist several contrasting approaches for defining similarity measures over sequential data. In this paper, we review three major classes of such similarity measures: edit distances, bag-of-words models, and string kernels. Each of these classes originates from a particular application domain and models the similarity of strings differently. We present these classes and their underlying comparisons in detail, highlight their advantages and differences, and provide basic algorithms supporting practical applications. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1: 296–304. DOI: 10.1002/widm.36
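
To make the three classes named in the abstract concrete, the following is a minimal sketch with one representative measure per class. It is not taken from the paper: the function names are hypothetical, the edit distance is the standard Levenshtein dynamic program, the bag-of-words measure assumes whitespace tokenization with cosine similarity, and the string kernel is assumed to be a simple n-gram (spectrum) kernel with n = 3 by default.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def bag_of_words_cosine(a: str, b: str) -> float:
    """Bag-of-words model: cosine similarity of word-count vectors.

    Assumption: words are obtained by splitting on whitespace.
    """
    wa, wb = Counter(a.split()), Counter(b.split())
    dot = sum(wa[w] * wb[w] for w in wa.keys() & wb.keys())
    norm = (sqrt(sum(v * v for v in wa.values()))
            * sqrt(sum(v * v for v in wb.values())))
    return dot / norm if norm else 0.0


def spectrum_kernel(a: str, b: str, n: int = 3) -> int:
    """String kernel: inner product of n-gram count vectors.

    Assumption: contiguous character n-grams, no mismatches or gaps.
    """
    ga = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    gb = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    return sum(ga[g] * gb[g] for g in ga.keys() & gb.keys())
```

In this sketch the edit distance is a dissimilarity (0 means identical strings), whereas the cosine and kernel values are similarities that grow with shared content, so the three results are not directly comparable without normalization.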

      Published In

      Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Volume 1, Issue 4
      July 2011
      100 pages
      ISSN: 1942-4787
      EISSN: 1942-4795

      Publisher

      John Wiley & Sons, Inc.

      United States

      Author Tags

      1. Biological Data Mining
      2. Data Concepts
      3. Key Design Issues in Data Mining
      4. Text Mining

      Qualifiers

      • Article

      Cited By

      • (2020) Generating Representative Test Sequences from Real Workload for Minimizing DRAM Verification Overhead. ACM Transactions on Design Automation of Electronic Systems 25:4 (1-23). DOI: 10.1145/3391891. Online publication date: 27-May-2020
      • (2016) Trace-based contextual recommendations. Expert Systems with Applications: An International Journal 64:C (194-207). DOI: 10.1016/j.eswa.2016.07.035. Online publication date: 1-Dec-2016
      • (2016) Comparison of similarity measures to differentiate players' actions and decision-making profiles in serious games analytics. Computers in Human Behavior 64:C (562-574). DOI: 10.1016/j.chb.2016.07.024. Online publication date: 1-Nov-2016
      • (2013) Trajectory pattern change analysis in campus WiFi networks. Proceedings of the Second ACM SIGSPATIAL International Workshop on Mobile Geographic Information Systems (1-8). DOI: 10.1145/2534190.2534191. Online publication date: 5-Nov-2013
