Abstract
We present a method to speed up the dynamic programming algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi’s decoding and training algorithms [21], as well as to the forward-backward and Baum-Welch [4] algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. We describe three algorithms, based on byte pair encoding (BPE) [19], run length encoding (RLE) and Lempel-Ziv (LZ78) parsing [12], respectively. Compared to Viterbi’s algorithm, we achieve a speedup of \(\Omega(r)\) using BPE, a speedup of \(\Omega(\frac{r}{\log r})\) using RLE, and a speedup of \(\Omega(\frac{\log n}{k})\) using LZ78, where k is the number of hidden states, n is the length of the observed sequence and r is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. Furthermore, unlike Viterbi’s algorithm, our algorithms are highly parallelizable.
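The idea of reusing work across repeated substrings can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the Viterbi recurrence is written as a (max, +) matrix-vector product in log space, handles only runs of a repeated symbol (in the spirit of the RLE variant), and returns only the optimal log-probability, without recovering the state path. All function and variable names are hypothetical.

```python
import math
from itertools import groupby

def maxplus_matmul(A, B):
    """(max, +) product: C[i][j] = max over m of A[i][m] + B[m][j]."""
    k = len(A)
    return [[max(A[i][m] + B[m][j] for m in range(k)) for j in range(k)]
            for i in range(k)]

def maxplus_matvec(A, v):
    """(max, +) matrix-vector product."""
    return [max(a + x for a, x in zip(row, v)) for row in A]

def maxplus_power(M, e):
    """e-th (max, +) power of M (e >= 1) by repeated squaring."""
    result, base = None, M
    while e > 0:
        if e & 1:
            result = base if result is None else maxplus_matmul(base, result)
        base = maxplus_matmul(base, base)
        e >>= 1
    return result

def symbol_matrix(log_trans, log_emit, sym):
    """M[i][j] = log P(j -> i) + log P(state i emits sym)."""
    k = len(log_trans)
    return [[log_trans[j][i] + log_emit[i][sym] for j in range(k)]
            for i in range(k)]

def viterbi_score_plain(log_init, log_trans, log_emit, obs):
    """Standard Viterbi score: one matrix-vector step per observed symbol."""
    v = [log_init[i] + log_emit[i][obs[0]] for i in range(len(log_init))]
    for sym in obs[1:]:
        v = maxplus_matvec(symbol_matrix(log_trans, log_emit, sym), v)
    return max(v)

def viterbi_score_rle(log_init, log_trans, log_emit, obs):
    """Same score, with one matrix-vector step per run of equal symbols.

    The matrix for a run is a (max, +) power of the symbol matrix and is
    cached, so a run that repeats later in the sequence costs nothing extra.
    Assumes obs is non-empty.
    """
    v = [log_init[i] + log_emit[i][obs[0]] for i in range(len(log_init))]
    cache = {}
    for sym, group in groupby(obs[1:]):
        length = len(list(group))
        if (sym, length) not in cache:
            cache[(sym, length)] = maxplus_power(
                symbol_matrix(log_trans, log_emit, sym), length)
        v = maxplus_matvec(cache[(sym, length)], v)
    return max(v)

# Toy two-state model (made-up numbers, for illustration only).
log_init = [math.log(0.6), math.log(0.4)]
log_trans = [[math.log(0.7), math.log(0.3)],
             [math.log(0.2), math.log(0.8)]]
log_emit = [[math.log(0.9), math.log(0.1)],
            [math.log(0.3), math.log(0.7)]]
obs = [0] * 9 + [1] * 7 + [0] * 5 + [1] * 7

print(viterbi_score_plain(log_init, log_trans, log_emit, obs))
print(viterbi_score_rle(log_init, log_trans, log_emit, obs))
```

The two functions should agree up to floating-point rounding. In this sketch, building the matrix for a run of length l costs O(k^3 log l) against O(k^2 l) for stepping through the run symbol by symbol, so it only pays off when runs are long relative to k or recur often enough for the cache to be reused.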
References
Benson, G., Amir, A., Farach, M.: Let sleeping files lie: Pattern matching in Z-compressed files. Journal of Computer and System Sciences 52(2), 299–307 (1996)
Agazzi, O., Kuo, S.: HMM based optical character recognition in the presence of deterministic transformations. Pattern Recognition 26, 1813–1826 (1993)
Apostolico, A., Landau, G.M., Skiena, S.: Matching for run length encoded strings. Journal of Complexity 15(1), 4–16 (1999)
Baum, L.E.: An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process. Inequalities 3, 1–8 (1972)
Bird, A.P.: CpG-rich islands as gene markers in the vertebrate nucleus. Trends in Genetics 3, 342–347 (1987)
Buchsbaum, A.L., Giancarlo, R.: Algorithmic aspects in speech recognition: An introduction. ACM Journal of Experimental Algorithms 2(1) (1997)
Bunke, H., Csirik, J.: An improved algorithm for computing the edit distance of run length coded strings. Information Processing Letters 54, 93–96 (1995)
Chan, T.M.: All-pairs shortest paths with real weights in \(O(n^3/\log n)\) time. In: Proc. 9th Workshop on Algorithms and Data Structures, pp. 318–324 (2005)
Churchill, G.A.: Hidden Markov chains and the analysis of genome structure. Computers Chem. 16, 107–115 (1992)
Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetical progressions. Journal of Symbolic Computation 9, 251–280 (1990)
Crochemore, M., Landau, G., Ziv-Ukelson, M.: A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. In: Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 679–688 (2002)
Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis. Cambridge University Press, Cambridge (1998)
Kärkkäinen, J., Navarro, G., Ukkonen, E.: Approximate string matching over Ziv-Lempel compressed text. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 195–209. Springer, Heidelberg (2000)
Kärkkäinen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proc. Third South American Workshop on String Processing (WSP), pp. 141–155 (1996)
Mäkinen, V., Navarro, G., Ukkonen, E.: Approximate matching of run-length compressed strings. In: Proc. 12th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS, vol. 1645, pp. 1–13. Springer, Heidelberg (1999)
Manber, U.: A text compression scheme that allows fast searching directly in the compressed file. In: CPM 2001. LNCS, vol. 2089, pp. 31–49. Springer, Heidelberg (2001)
Manning, C., Schütze, H.: Statistical Natural Language Processing. MIT Press, Cambridge (1999)
Navarro, G., Kida, T., Takeda, M., Shinohara, A., Arikawa, S.: Faster approximate string matching over compressed text. In: Proc. Data Compression Conference (DCC), pp. 459–468 (2001)
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding up pattern matching by text compression. In: Bongiovanni, G., Petreschi, R., Gambosi, G. (eds.) CIAC 2000. LNCS, vol. 1767, pp. 306–315. Springer, Heidelberg (2000)
Strassen, V.: Gaussian elimination is not optimal. Numerische Mathematik 13, 354–356 (1969)
Viterbi, A.: Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory IT-13, 260–269 (1967)
Ziv, J., Lempel, A.: On the complexity of finite sequences. IEEE Transactions on Information Theory 22(1), 75–81 (1976)
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mozes, S., Weimann, O., Ziv-Ukelson, M. (2007). Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions. In: Ma, B., Zhang, K. (eds) Combinatorial Pattern Matching. CPM 2007. Lecture Notes in Computer Science, vol 4580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73437-6_4
DOI: https://doi.org/10.1007/978-3-540-73437-6_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73436-9
Online ISBN: 978-3-540-73437-6
eBook Packages: Computer Science, Computer Science (R0)