Abstract
The article studies different methods for estimating the Viterbi path in the Bayesian framework. The Viterbi path is an estimate of the underlying state path in hidden Markov models (HMMs) that has maximum joint posterior probability; it is therefore also called the maximum a posteriori (MAP) path. For an HMM with given parameters, the Viterbi path can easily be found with the Viterbi algorithm. In the Bayesian framework the Viterbi algorithm is not applicable, and several iterative methods can be used instead. We introduce a new EM-type algorithm for finding the MAP path and compare it with various other methods for finding the MAP path, including the variational Bayes approach and MCMC methods. Examples with simulated data are used to compare the performance of the methods. The main focus is on non-stochastic iterative methods, and our results show that the best of those methods work as well as or better than the best MCMC methods. Our results demonstrate that when the primary goal is segmentation, it is more reasonable to perform segmentation directly, treating the transition and emission parameters as nuisance parameters.
References
Beal, M.: Variational algorithms for approximate Bayesian inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London (2003)
Beal, M., Ghahramani, Z.: The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In: Bayesian Statistics, 7 (Tenerife, 2002), pp. 453–463. Oxford University Press, New York (2003)
Benboudjema, D., Pieczynski, W.: Unsupervised statistical segmentation of nonstationary images using triplet Markov fields. IEEE Trans. Pattern Anal. Mach. Intell. 29(8), 1367–1378 (2007)
Besag, J.: An introduction to Markov chain Monte Carlo methods. In: Mathematical Foundations of Speech and Language Processing, IMA Volumes in Mathematics and Applications, vol. 138, pp. 247–270. Springer, New York (2004)
Bishop, C.: Pattern Recognition and Machine Learning, Information Science and Statistics. Springer, New York (2006)
Cappé, O., Moulines, E., Rydén, T.: Inference in Hidden Markov Models. Springer, New York (2005)
Carvalho, L., Lawrence, C.: Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc. Natl. Acad. Sci. 105(9), 3209–3214 (2008)
Christodoulopoulos, C., Goldwater, S., Steedman, M.: Two decades of unsupervised POS induction: how far have we come? In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2010)
Corander, J., Xiong, J., Cui, Y., Koski, T.: Optimal Viterbi Bayesian predictive classification for data from finite alphabets. J. Stat. Plan. Inference 143(2), 261–275 (2013)
Courbot, J.-B., Monfrini, E., Mazet, V., Collet, C.: Oriented triplet Markov fields. Pattern Recognit. Lett. 103, 16–22 (2018)
Fox, C., Roberts, S.: A tutorial on variational Bayesian inference. Artif. Intell. Rev. 38(2), 85–95 (2012)
Ganchev, K., Taskar, B., Pereira, F., Graça, J.: Posterior vs parameter sparsity in latent variable models. In: Advances in Neural Information Processing Systems 22 (2009)
Gao, J., Johnson, M.: A comparison of Bayesian estimators for unsupervised hidden Markov model POS taggers. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, Stroudsburg, PA, USA, pp. 344–352. Association for Computational Linguistics (2008)
Goldwater, S., Griffiths, T.: A fully Bayesian approach to unsupervised part-of-speech tagging. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic (2007)
Gorynin, I., Gangloff, H., Monfrini, E., Pieczynski, W.: Assessing the segmentation performance of pairwise and triplet Markov models. Signal Process. 145, 183–192 (2018)
Hjort, N.L., Holmes, C., Müller, P., Walker, S.G. (eds.): Bayesian Nonparametrics. Cambridge University Press, New York (2010)
Johnson, M.: Why doesn’t EM find good HMM POS-taggers? In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, June 2007, pp. 296–305. Association for Computational Linguistics (2007)
Jordan, M., Ghahramani, Z., Jaakkola, T., Saul, L.: An introduction to variational methods for graphical models. Mach. Learn. 37, 183–233 (1999)
Koloydenko, A., Käärik, M., Lember, J.: On adjusted Viterbi training. Acta Appl. Math. 96(1–3), 309–326 (2007)
Koloydenko, A., Lember, J.: Bridging Viterbi and posterior decoding: a generalized risk approach to hidden path inference based on hidden Markov models. J. Mach. Learn. Res. 15, 1–58 (2014)
Koski, T.: Hidden Markov Models for Bioinformatics, Computational Biology Series, vol. 2. Kluwer Academic Publishers, Dordrecht (2001)
Lanchantin, P., Lapuyade-Lahorgue, J., Pieczynski, W.: Unsupervised segmentation of triplet Markov chains hidden with long-memory noise. Signal Process. 88(5), 1134–1151 (2008)
Lember, J., Koloydenko, A.: Adjusted Viterbi training: a proof of concept. Probab. Eng. Inf. Sci. 21(3), 451–475 (2007)
Lember, J., Koloydenko, A.: The adjusted Viterbi training for hidden Markov models. Bernoulli 14(1), 180–206 (2008)
Lember, J., Kuljus, K., Koloydenko, A.: Theory of segmentation. In: Dymarsky, P. (ed.) Hidden Markov Models, Theory and Applications, pp. 51–84. InTech, New York (2011)
Marin, J., Robert, C.: Bayesian Core: a Practical Approach to Computational Bayesian Statistics. Springer Texts in Statistics. Springer, New York (2007)
Maruotti, A., Rocci, R.: A mixed non-homogeneous hidden Markov model for categorical data, with application to alcohol consumption. Stat. Med. 31(9), 871–886 (2012)
Maruotti, A., Rydén, T.: A semiparametric approach to hidden Markov models under longitudinal observations. Stat. Comput. 19, 381–393 (2009)
McGrory, C.A., Titterington, D.M.: Variational Bayesian analysis for hidden Markov models. Aust. N. Z. J. Stat. 51(2), 227–244 (2009)
McLachlan, G., Krishnan, T.: The EM Algorithm and Extensions. Wiley, Hoboken (2008)
Murphy, K.: Conjugate Bayesian analysis of the Gaussian distribution. Technical report (2007)
Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
Rydén, T.: EM versus Markov chain Monte Carlo for estimation of hidden Markov models: a computational perspective. Bayesian Anal. 3(4), 659–688 (2008)
Samé, A., Ambroise, C., Govaert, G.: A classification EM algorithm for binned data. Comput. Stat. Data Anal. 51(2), 466–480 (2006)
Samé, A., Ambroise, C., Govaert, G.: An online classification EM algorithm based on the mixture model. Stat. Comput. 17(3), 209–218 (2007)
Scott, S.: Bayesian methods for hidden Markov models: recursive computing in the 21st century. J. Am. Stat. Assoc. 97(457), 337–351 (2002)
Smidl, V., Quinn, A.: The Variational Bayes Method in Signal Processing. Springer, Berlin (2006)
Watanabe, S.: Algebraic Geometry and Statistical Learning Theory, Cambridge Monographs on Applied and Computational Mathematics, vol. 25. Cambridge University Press, Cambridge (2009)
Winkler, G.: Image Analysis, Random Fields and Markov Chain Monte Carlo Methods, Applications of Mathematics, vol. 27. Springer, Berlin (2003)
Yau, C., Holmes, C.: A decision-theoretic approach for segmental classification. Ann. Appl. Stat. 7(3), 1814–1835 (2013)
Acknowledgements
This work was supported by Estonian institutional research funding IUT34-5.
Appendix
1.1 General formulae for the segmentation methods studied
Due to our independence assumption, all emission and transition parameters can be estimated separately. In the formulae of this section we use the same notation for the random parameters \(p_{l,j}\), \(\mu _k\) and \(\sigma _k^2\), \(k,l,j\in \{1,\ldots ,K\}\), and the corresponding estimates. The exact meaning can be understood from the context.
Segmentation MM. In the case of Dirichlet priors the matrix \(\theta _{tr}^{(i+1)}\) can be found row-wise; the l-th row is given by the posterior mode:
Emission parameters can be updated independently:
where \(x_{S_k}\) is the subsample of x corresponding to state k in \(y^{(i)}\). Formally, for every sequence \(y\in S^n\) define \(S_k(y)=\{t\in \{1,\dots ,n\}: y_t=k\}\), then \(x_{S_k}=\{x_t: t\in S_k\}\).
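A minimal Python sketch of this update and of the subsample extraction is given below, assuming the standard Dirichlet posterior-mode formula \((\alpha _{lj}+n_{lj}-1)/(\sum _j(\alpha _{lj}+n_{lj})-K)\) for the l-th row (valid when all \(\alpha _{lj}+n_{lj}\ge 1\)); it is an illustration of the idea, not code from the paper.

```python
import numpy as np

def transition_counts(y, K):
    """n_{lj}(y): number of transitions from state l to state j in the path y (states 0..K-1)."""
    n = np.zeros((K, K))
    for a, b in zip(y[:-1], y[1:]):
        n[a, b] += 1
    return n

def dirichlet_row_mode(alpha_row, counts_row):
    """Mode of Dirichlet(alpha + counts): (alpha_j + n_j - 1) / (sum_j (alpha_j + n_j) - K),
    assuming every alpha_j + n_j >= 1."""
    a = alpha_row + counts_row
    return (a - 1.0) / (a.sum() - a.size)

def update_transition_matrix(y, alpha, K):
    """Row-wise posterior-mode update of the transition matrix for a given path y."""
    n = transition_counts(y, K)
    return np.vstack([dirichlet_row_mode(alpha[l], n[l]) for l in range(K)])

def subsamples(x, y, K):
    """x_{S_k}: the observations assigned to state k by the current path y."""
    x, y = np.asarray(x), np.asarray(y)
    return [x[y == k] for k in range(K)]
```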
Bayesian EM. The emission updates are given by
where
In the case of Dirichlet priors the transition updates are given by
Since one of the studied methods (ICM) starts with an initial sequence, in order for the comparison to be fair we let all the other methods start with a sequence as well. Therefore, for a given initial sequence \(y^{(0)}\), define
Variational Bayes approach. Let us have a closer look at the measure \(q_{Y}^{(i+1)}(y)\). We are going to show that there exists an HMM (Z, X) such that for every sequence y, \(q_{Y}^{(i+1)}(y)={P}(Z=y|X=x)\). By definition,
Applying the notation from (2.2) to the current case:
Since \(\ln p(\theta ,y|x)=\ln \pi (\theta )+\ln p(y,x|\theta )-\ln p(x)\), we obtain
where \({\tilde{u}}_{lj}\) is the normalized quantity, \( {\tilde{u}}_{lj}:={u_{lj}\over \sum _j u_{lj}}\), and \({\tilde{h}}_k(x_t):=(\sum _j u_{kj})h_k(x_t)\), if \(t\le n-1\), \({\tilde{h}}_k(x_n)=h_k(x_n)\). Let now (Z, X) be an HMM, where Z is the underlying Markov chain with transition matrix \(({\tilde{u}}_{lj})\) and emission densities are given by \({\tilde{h}}_k\). From (7.6) it follows that \(q_{Y}^{(i+1)}(y)\propto P(Z=y|X=x)\). Since \(q_{Y}^{(i+1)}\) and \(P(Z\in \cdot |X=x)\) are both probability measures, it follows that they are equal. To stress the dependence on iterations, we will denote \(q_{Y}^{(i+1)}(y)=P^{(i+1)}(Z=y|X=x)\).
Let us now calculate \(q_{\theta }\). Let \(\gamma _t^{(i)}(k)\) denote the marginal of \(q_{Y}^{(i)}(y)\),
Observe that
where \(C_1:=\sum _{k}(\ln p_{0k})\gamma _1^{(i)}(k)\). The sum \(\sum _y n_{lj}(y)q_{Y}^{(i)}(y)\) is the expected number of transitions from l to j, so that using the equality \(q_{Y}^{(i)}(y)=P^{(i)}(Z=y|X=x)\), we have
Therefore,
From (7.7) we can see that under \(q_{\theta }^{(i+1)}\) the parameters \(\theta _{tr},\theta ^{1}_{em},\ldots ,\theta ^{K}_{em}\) are still independent and can therefore be updated separately. In the case of Dirichlet transition priors the rows are independent as well. The transition update for the l-th row and the emission update for the k-th component are given by
The whole VB approach is applicable since \(\xi ^{(i)}(l,j)\) and \(\gamma _t^{(i)}(k)\) can be found by the standard forward-backward formulae using \({\tilde{u}}_{lj}^{(i)}\) and \({\tilde{h}}^{(i)}_k\). In fact, it is not difficult to see that in these formulae the original \(u^{(i)}_{lj}\) and \(h^{(i)}_k\) can be used instead of the normalized ones. To summarize, in our setup the VB approach yields the following algorithm for calculating \({\hat{v}}_\mathrm{{VB}}\). For a given initial sequence \(y^{(0)}\), find the vector \(\gamma _t^{(0)}\) and the matrix \(\xi ^{(0)}\) as in (7.5). Given \(\gamma _t^{(i)}\) and \(\xi ^{(i)}\), update \(u^{(i+1)}_{lj}\) and \(h_k^{(i+1)}\) as follows:
With these parameters, \(\xi ^{(i+1)}\) and \(\gamma _t^{(i+1)}\) can be calculated with the usual forward-backward procedure for HMMs. Then update \(u^{(i+2)}_{lj}\) and \(h_k^{(i+2)}\), and so on. After convergence, say after m steps, apply the Viterbi algorithm with transitions \((u_{lj}^{(m)})\) and emission densities \(h^{(m)}_k\). The obtained path maximizes \(q_{Y}^{(m)}(y)\) over all paths, so it is \({\hat{v}}_\mathrm{{VB}}\).
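For completeness, here is a self-contained sketch of the scaled forward-backward recursions used at each step: the textbook algorithm (see, e.g., Rabiner [32]), written for a generic initial weight vector p0, transition weight matrix u and matrix h of emission values; the function and variable names are ours. Since the sum-product recursions are valid for arbitrary nonnegative factors, the unnormalized \(u^{(i)}_{lj}\) and \(h^{(i)}_k\) can be passed in directly, in line with the remark above.

```python
import numpy as np

def forward_backward(p0, u, h):
    """Scaled forward-backward recursions.  p0: (K,) initial weights, u: (K, K)
    transition weights, h[k, t]: emission value of state k at time t.  Returns
    gamma[t, k] and xi[t, l, j] (t = 0..n-2), the marginals of the distribution
    proportional to the product of these factors, and its log normalizing constant."""
    K, n = h.shape
    alpha = np.zeros((n, K))
    beta = np.zeros((n, K))
    c = np.zeros(n)
    alpha[0] = p0 * h[:, 0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, n):                      # forward pass
        alpha[t] = (alpha[t - 1] @ u) * h[:, t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    beta[n - 1] = 1.0
    for t in range(n - 2, -1, -1):             # backward pass
        beta[t] = u @ (h[:, t + 1] * beta[t + 1]) / c[t + 1]
    gamma = alpha * beta                       # gamma[t] sums to one for every t
    xi = np.zeros((n - 1, K, K))
    for t in range(n - 1):
        xi[t] = alpha[t][:, None] * u * (h[:, t + 1] * beta[t + 1])[None, :] / c[t + 1]
    return gamma, xi, np.log(c).sum()
```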
Simulated annealing. Because of independence of the emission and transition parameters, it holds even for \(\beta >1\) that \(p_{\beta }(\theta |y,x)=p_{\beta }(\theta _{tr}|y)p_{\beta }(\theta _{em}|y,x)\), thus the transition and emission parameters can be sampled separately. When the rows of a transition matrix have independent Dirichlet priors, the l-th row can be generated from the Dirichlet distribution with parameters \(\beta (n_{lk}(s)+\alpha _{lk})+1-\beta \), \(k=1,\ldots ,K\). For given \(\theta \), sampling from \(p(y|\theta ,x)\) can be performed in various ways: we use so-called Markovian Backward Sampling (Algorithm 6.1.1 in [6]). To sample from \(p_{\beta }(y|\theta ,x)\), note that
where \({\tilde{p}}_{lj}:={p^{\beta }_{lj} / \sum _{j}p^{\beta }_{lj}}\), and \({\tilde{f}}_k(x_t):=\big (\sum _{j}p^{\beta }_{kj}\big )f^{\beta }_k(x_t)\), \(t=1,\ldots ,n-1\), \({\tilde{f}}_k(x_n):=\big (\sum _{j}p^{\beta }_{0j}\big )f^{\beta }_k(x_n).\) Although the functions \({{\tilde{f}}}_k\) are not densities, one can still use Markovian Backward Sampling.
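As a small illustration of the transition step, the sketch below draws one row of the transition matrix from the tempered Dirichlet posterior with parameters \(\beta (n_{lk}(s)+\alpha _{lk})+1-\beta \); the emission step and the backward sampling of the path are omitted, and the function name and example values are ours.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_tempered_row(counts_row, alpha_row, beta):
    """Draw one row of the transition matrix from the tempered posterior
    Dirichlet(beta * (n_{lk}(s) + alpha_{lk}) + 1 - beta, k = 1..K)."""
    params = beta * (counts_row + alpha_row) + 1.0 - beta
    return rng.dirichlet(params)

# Example: K = 3, the current path s gives transition counts (5, 2, 1) out of state l.
row = sample_tempered_row(np.array([5.0, 2.0, 1.0]), np.ones(3), beta=2.0)
```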
1.2 Non-stochastic segmentation algorithms for the Dirichlet-NIX case
Suppose the emission distribution corresponding to state k is \({\mathcal {N}}(\mu _k,\sigma _k^2)\), where prior distributions for \(\mu _k\) and \(\sigma _k^2\) are given by a normal and scaled inverse-chi-square distribution, respectively:
Here \(\kappa _0\), \(\nu _{0}\) and \(\tau ^2_{0}\) are hyperparameters that might depend on k, but in our example we assume they are equal. Recall the density of \(\mathrm{Inv-}\chi ^2(\nu ,\tau ^2)\):
If \(X\sim \mathrm{Inv-}\chi ^2(\nu ,\tau ^2)\), then
and the mode of the distribution is given by \(\nu \tau ^2/(\nu +2)\). Therefore, if \(\nu _0\) and \(\kappa _0\) are both very large, then \(\sigma _k^2\approx \tau _0^2\) and \(\mu _k\approx \xi _k\), and we get back to the first example. If \(\nu _0\) is very large, then \(\sigma ^2_k\approx \tau _0^2\), so that emission variances are \(\tau _0^2\), but the variance of the mean is approximately \(\tau _0^2 /\kappa _0\).
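These facts are easy to verify numerically. The sketch below codes the log-density of \(\mathrm{Inv-}\chi ^2(\nu ,\tau ^2)\) in its standard parametrization and checks that its maximizer is close to the mode \(\nu \tau ^2/(\nu +2)\); it is textbook material, not code from the paper.

```python
import numpy as np
from scipy.special import gammaln

def scaled_inv_chi2_logpdf(x, nu, tau_sq):
    """Log-density of Inv-chi^2(nu, tau^2):
    p(x) = (nu*tau^2/2)^(nu/2) / Gamma(nu/2) * x^(-(nu/2 + 1)) * exp(-nu*tau^2 / (2x))."""
    half_nu = nu / 2.0
    return (half_nu * np.log(half_nu * tau_sq) - gammaln(half_nu)
            - (half_nu + 1.0) * np.log(x) - half_nu * tau_sq / x)

# The maximizer of the density is nu * tau^2 / (nu + 2), which tends to tau^2 as nu grows.
nu, tau_sq = 50.0, 2.0
grid = np.linspace(0.5, 5.0, 100_000)
numerical_mode = grid[np.argmax(scaled_inv_chi2_logpdf(grid, nu, tau_sq))]
# numerical_mode is approximately 100/52 = 1.923, matching nu * tau_sq / (nu + 2).
```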
Since emission and transition parameters are independent, the transition parameters can be updated as previously, that is as described in Sect. 1. Because the emission components \((\theta ^1_{em},\ldots ,\theta ^K_{em})\) are independent under prior and posterior, it holds that \(p(\theta _{em}|y,x)=\prod _k p(\theta ^k_{em}|x_{S_k})\), where \(x_{S_k}\) is the subsample of x along y corresponding to state k. Let \(m_k(y)\) be the size of \(x_{S_k}\). Let \(\bar{x}_k\) and \(s^2_k\) be the mean and variance of \(x_{S_k}\). Since NIX-priors are conjugate, for any state k the posterior parameters \(\kappa _{k}\), \(\nu _{k}\), \(\mu _{k}\) and \(\tau _{k}^2\) can be calculated as follows:
see [31]. We also need to calculate for every path y the joint probability \(p(x,y)=p(y)p(x|y)\). Due to the independence of transition and emission parameters, p(y) is still as in (1.6) and p(x|y) depends only on the emission parameters. According to the formula for the marginal likelihood (see, e.g., [31]), we obtain
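A minimal sketch of the conjugate NIX update is given below, following the standard textbook formulas (see [31]): here xi0 denotes the prior mean \(\xi _k\), and the formulas are assumed to coincide with (7.9)–(7.11) up to notation.

```python
import numpy as np

def nix_posterior(x_k, kappa0, xi0, nu0, tau0_sq):
    """Standard conjugate update for normal data under a NIX prior:
    mu | sigma^2 ~ N(xi0, sigma^2 / kappa0),  sigma^2 ~ Inv-chi^2(nu0, tau0^2).
    x_k is the subsample x_{S_k}; returns (kappa_k, mu_k, nu_k, tau_k^2)."""
    x_k = np.asarray(x_k, dtype=float)
    m = x_k.size
    xbar = x_k.mean() if m > 0 else 0.0
    ss = ((x_k - xbar) ** 2).sum()                       # sum of squared deviations
    kappa_k = kappa0 + m
    mu_k = (kappa0 * xi0 + m * xbar) / kappa_k
    nu_k = nu0 + m
    tau_k_sq = (nu0 * tau0_sq + ss
                + (kappa0 * m / kappa_k) * (xbar - xi0) ** 2) / nu_k
    return kappa_k, mu_k, nu_k, tau_k_sq
```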
We will now give a more detailed description of the non-stochastic algorithms for Example 2.
Bayesian EM. Start with an initial state sequence \(y^{(0)}\). With this sequence, find for every state k the parameters \(\kappa _{k}\), \(\mu _{k}\), \(\nu _{k}\), \(\tau ^2_{k}\) as defined in (7.9), (7.10), (7.11) and calculate the posterior modes, that is, update
With these parameters calculate the vectors \(\gamma _t^{(1)}\) and matrix \((\xi ^{(1)}(l,j))\) as in (7.3) and (7.4) using the forward-backward formulae. Given \(\gamma _t^{(i)}\) and \(\xi ^{(i)}(l,j)\), the transition parameters are updated according to (7.4). The emission updates are given by (7.2). Let us calculate \(\theta _{em}^{k,(i+1)}\) for the NIX-model. Suppress k from the notation and observe that \(\theta ^{(i+1)}_{em}=(\mu ^{(i+1)},(\sigma ^2)^{(i+1)})\) maximizes the following function over \(\mu \) and \(\sigma ^2\):
The solutions \(\mu _k^{(i+1)}\) and \((\sigma _k^{2})^{(i+1)}\) are given by:
With \(\kappa _0\rightarrow 0\) (non-informative prior), \(\mu _k^{(i+1)}\) is the same as in the standard EM algorithm. Using the updated parameters, calculate \(\gamma _t^{(i+1)}\) and \(\xi ^{(i+1)}(l,j)\). Keep updating until the change in the log-likelihood is below the stopping criterion.
Segmentation EM. Given sequence \(y^{(i)}\), calculate for every state k the parameters \(\kappa _{k}^{(i)}\), \(\mu ^{(i)}_{k}\), \(\nu ^{(i)}_{k}\) and \((\tau _{k}^2)^{(i)}\) using formulae (7.9)–(7.11). With these parameters, calculate \(h^{(i+1)}_k(x_t)\) as follows:
Compute the matrix \((u_{lj}^{(i+1)})\), where \(\ln u^{(i+1)}_{lj}=\psi (\alpha _{lj}+n_{lj}(y^{(i)}))-\psi (\alpha _{l}+n_{l}(y^{(i)}))\). To find \(y^{(i+1)}\), apply the Viterbi algorithm with \(u_{lj}^{(i+1)}\) and \(h^{(i+1)}_k(x_t)\). Keep doing so until no changes occur in the path estimate.
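For concreteness, a small Python sketch of this step is given below: the digamma-based log transition scores \(\ln u^{(i+1)}_{lj}\) and a standard log-domain Viterbi recursion. The matrix log_h of values \(\ln h^{(i+1)}_k(x_t)\) from (7.13) and the initial-state scores log_p0 are assumed to be supplied by the caller; the helper names are ours.

```python
import numpy as np
from scipy.special import digamma

def log_transition_scores(y, alpha):
    """ln u_{lj} = psi(alpha_{lj} + n_{lj}(y)) - psi(alpha_l + n_l(y)),
    where alpha_l and n_l(y) are the row sums of alpha and of the transition counts."""
    K = alpha.shape[0]
    n_lj = np.zeros((K, K))
    for a, b in zip(y[:-1], y[1:]):
        n_lj[a, b] += 1
    a_post = alpha + n_lj
    return digamma(a_post) - digamma(a_post.sum(axis=1, keepdims=True))

def viterbi_log(log_p0, log_u, log_h):
    """Viterbi recursion in the log domain; log_h[k, t] = ln h_k(x_t)."""
    K, n = log_h.shape
    delta = np.empty((n, K))
    back = np.zeros((n, K), dtype=int)
    delta[0] = log_p0 + log_h[:, 0]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_u      # scores[l, j]: come from l, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_h[:, t]
    path = np.empty(n, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```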
Segmentation MM. Given \(y^{(i)}\), calculate \(\mu ^{(i)}_{k}\), \(\nu ^{(i)}_{k}\) and \((\tau _{k}^2)^{(i)}\) using formulae (7.9)–(7.11) and update the posterior modes as follows:
With these parameters find \(y^{(i+1)}\) by the Viterbi algorithm. Keep doing so until no changes occur in the estimated state path.
VB algorithm. Given an initial state sequence \(y^{(0)}\), find \(h^{(1)}_k(x_t)\) and \(u^{(1)}_{lj}\) as in the segmentation EM algorithm. With these parameters, calculate \(\gamma ^{(1)}_t\) and \(\xi ^{(1)}(l,j)\) using the forward-backward formulae. Given the matrix \(\big (\xi ^{(i)}(l,j)\big )\), update the matrix \((u_{lj}^{(i+1)})\) according to (7.8). Given \(\gamma ^{(i)}_t(k)\), the parameters \(\kappa ^{(i)}_{k}\), \(\mu ^{(i)}_{k}\), \(\nu ^{(i)}_{k}\) and \((\tau ^2_{k})^{(i)}\) can be calculated by (see, e.g., [29])
Compute then \(h^{(i+1)}_k(x_t)\) as in (7.13). With the help of \(h^{(i+1)}_k(x_t)\) and \(u^{(i+1)}_{lj}\), find \(\gamma ^{(i+1)}_t\) and \(\xi ^{(i+1)}(l,j)\) using the forward-backward formulae. Then update \(h^{(i+2)}_k(x_t)\) and \(u^{(i+2)}_{lj}\), and so on. When the VB algorithm has converged, say after m steps, apply the Viterbi algorithm with \(u_{lj}^{(m)}\) as transitions and with \(h^{(m)}_k(x_t)\) as emission values.
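In the standard VB treatment (see, e.g., [29]), the hyperparameter update with responsibilities is the soft-count analogue of the hard-count NIX update: the counts \(m_k\) are replaced by \(N_k=\sum _t\gamma ^{(i)}_t(k)\) and the sample moments by their \(\gamma \)-weighted versions. The sketch below shows this analogue and should be read as an illustration of the general form, not necessarily the paper's exact display.

```python
import numpy as np

def nix_posterior_soft(x, gamma_k, kappa0, xi0, nu0, tau0_sq):
    """Soft-count analogue of the NIX update: hard counts m_k are replaced by
    responsibilities gamma_k[t] = gamma_t(k).  Illustration only."""
    x = np.asarray(x, dtype=float)
    gamma_k = np.asarray(gamma_k, dtype=float)
    N_k = gamma_k.sum()
    xbar = (gamma_k * x).sum() / N_k
    ss = (gamma_k * (x - xbar) ** 2).sum()               # weighted squared deviations
    kappa_k = kappa0 + N_k
    mu_k = (kappa0 * xi0 + N_k * xbar) / kappa_k
    nu_k = nu0 + N_k
    tau_k_sq = (nu0 * tau0_sq + ss
                + (kappa0 * N_k / kappa_k) * (xbar - xi0) ** 2) / nu_k
    return kappa_k, mu_k, nu_k, tau_k_sq
```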
1.3 Clustering formulae under normal emissions with NIX priors
From (7.12) it follows that for any sequence \(y'\), the likelihood ratio is given by
When \(\nu _0\rightarrow \infty \) and \(\tau ^2_0>0\), then due to \(\sum _k m_k=\sum _k m'_k=n\) we have
Write \(\nu _{k}\tau ^{2}_{k}\) as
Then
Therefore, when \(\nu _0\rightarrow \infty \), the likelihood ratio in (7.14) converges to
Thus, maximizing p(x|y) corresponds to the following clustering problem: find clusters \(S_1,\ldots , S_K\) that minimize
which is formula (4.2). Given cluster \(S_k\), it is easy to see that
Since
we obtain (4.3).
To understand the behavior of the segmentation EM and segmentation MM algorithms when \(\nu _0\rightarrow \infty \), recall the segmentation EM iteration formula from (7.13). When \(\nu _0\rightarrow \infty \), then \(\ln (\nu ^{(i)}_{k}/{2})-\psi ({\nu ^{(i)}_{k}}/{2})\rightarrow 0\) and \((\tau _{k}^2)^{(i)}\rightarrow \tau _0^2\). Thus, leaving the superscript (i) out of the notation, we get
where \(\mu _k\) is as in (7.15). The Viterbi alignment is now obtained as