Estimation of Viterbi path in Bayesian hidden Markov models

Abstract

The article studies different methods for estimating the Viterbi path in the Bayesian framework. The Viterbi path is an estimate of the underlying state path in hidden Markov models (HMMs) that has maximum joint posterior probability; hence it is also called the maximum a posteriori (MAP) path. For an HMM with given parameters, the Viterbi path can easily be found with the Viterbi algorithm. In the Bayesian framework the Viterbi algorithm is not applicable and several iterative methods can be used instead. We introduce a new EM-type algorithm for finding the MAP path and compare it with various other methods, including the variational Bayes approach and MCMC methods. Examples with simulated data are used to compare the performance of the methods. The main focus is on non-stochastic iterative methods, and our results show that the best of those methods work as well as or better than the best MCMC methods. Our results demonstrate that when the primary goal is segmentation, it is more reasonable to perform segmentation directly by considering the transition and emission parameters as nuisance parameters.

References

  1. Beal, M.: Variational algorithms for approximate Bayesian inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London (2003)

  2. Beal, M., Ghahramani, Z.: The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In: Bayesian Statistics, 7 (Tenerife, 2002), pp. 453–463. Oxford University Press, New York (2003)

  3. Benboudjema, D., Pieczynski, W.: Unsupervised statistical segmentation of nonstationary images using triplet Markov fields. IEEE Trans. Pattern Anal. Mach. Intell. 29(8), 1367–1378 (2007)

  4. Besag, J.: An introduction to Markov chain Monte Carlo methods. In: Mathematical Foundations of Speech and Language Processing, IMA Volumes in Mathematics and Applications, vol. 138, pp. 247–270. Springer, New York (2004)

  5. Bishop, C.: Pattern Recognition and Machine Learning, Information Science and Statistics. Springer, New York (2006)

  6. Cappé, O., Moulines, E., Rydén, T.: Inference in Hidden Markov Models. Springer, New York (2005)

  7. Carvalho, L., Lawrence, C.: Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc. Natl. Acad. Sci. 105(9), 3209–3214 (2008)

  8. Christodoulopoulos, C., Goldwater, S., Steedman, M.: Two decades of unsupervised POS induction: how far have we come? In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2010)

  9. Corander, J., Xiong, J., Cui, Y., Koski, T.: Optimal Viterbi Bayesian predictive classification for data from finite alphabets. J. Stat. Plan. Inference 143(2), 261–275 (2013)

  10. Courbot, J.-B., Monfrini, E., Mazet, V., Collet, C.: Oriented triplet Markov fields. Pattern Recognit. Lett. 103, 16–22 (2018)

  11. Fox, C., Roberts, S.: A tutorial on variational Bayesian inference. Artif. Intell. Rev. 38(2), 85–95 (2012)

  12. Ganchev, K., Taskar, B., Pereira, F., Graça, J.: Posterior vs parameter sparsity in latent variable models. In: Advances in Neural Information Processing Systems 22 (2009)

  13. Gao, J., Johnson, M.: A comparison of Bayesian estimators for unsupervised hidden Markov model POS taggers. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, Stroudsburg, PA, USA, pp. 344–352. Association for Computational Linguistics (2008)

  14. Goldwater, S., Griffiths, T.: A fully Bayesian approach to unsupervised part-of-speech tagging. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic (2007)

  15. Gorynin, I., Gangloff, H., Monfrini, E., Pieczynski, W.: Assessing the segmentation performance of pairwise and triplet Markov models. Signal Process. 145, 183–192 (2018)

  16. Hjort, N.L., Holmes, C., Müller, P., Walker, S.G. (eds.): Bayesian Nonparametrics. Cambridge University Press, New York (2010)

  17. Johnson, M.: Why doesn’t EM find good HMM POS-taggers? In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, June 2007, pp. 296–305. Association for Computational Linguistics (2007)

  18. Jordan, M., Ghahramani, Z., Jaakkola, T., Saul, L.: An introduction to variational methods for graphical models. Mach. Learn. 37, 183–233 (1999)

  19. Koloydenko, A., Käärik, M., Lember, J.: On adjusted Viterbi training. Acta Appl. Math. 96(1–3), 309–326 (2007)

  20. Koloydenko, A., Lember, J.: Bridging Viterbi and posterior decoding: a generalized risk approach to hidden path inference based on hidden Markov models. J. Mach. Learn. Res. 15, 1–58 (2014)

  21. Koski, T.: Hidden Markov models for Bioinformatics, Computational Biology Series, vol. 2. Kluwer Academic Publishers, Dordrecht (2001)

  22. Lanchantin, P., Lapuyade-Lahorgue, J., Pieczynski, W.: Unsupervised segmentation of triplet Markov chains hidden with long-memory noise. Signal Process. 88(5), 1134–1151 (2008)

  23. Lember, J., Koloydenko, A.: Adjusted Viterbi training: a proof of concept. Probab. Eng. Inf. Sci. 21(3), 451–475 (2007)

  24. Lember, J., Koloydenko, A.: The adjusted Viterbi training for hidden Markov models. Bernoulli 14(1), 180–206 (2008)

  25. Lember, J., Kuljus, K., Koloydenko, A.: Theory of segmentation. In: Dymarsky, P. (ed.) Hidden Markov Models, Theory and Applications, pp. 51–84. InTech, New York (2011)

  26. Marin, J., Robert, C.: Bayesian Core: a Practical Approach to Computational Bayesian Statistics. Springer Texts in Statistics. Springer, New York (2007)

  27. Maruotti, A., Rocci, R.: A mixed non-homogeneous hidden Markov model for categorical data, with application to alcohol consumption. Stat. Med. 31(9), 871–886 (2012)

  28. Maruotti, A., Rydén, T.: A semiparametric approach to hidden Markov models under longitudinal observations. Stat. Comput. 19, 381–393 (2009)

  29. McGrory, C.A., Titterington, D.M.: Variational Bayesian analysis for hidden Markov models. Aust. N. Z. J. Stat. 51(2), 227–244 (2009)

  30. McLachlan, G., Krishnan, T.: The EM Algorithm and Extensions. Wiley, Hoboken (2008)

  31. Murphy, K.: Conjugate Bayesian analysis of the Gaussian distribution. Technical report (2007)

  32. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)

  33. Rydén, T.: EM versus Markov chain Monte Carlo for estimation of hidden Markov models: a computational perspective. Bayesian Anal. 3(4), 659–688 (2008)

  34. Samé, A., Ambroise, C., Govaert, G.: A classification EM algorithm for binned data. Comput. Stat. Data Anal. 51(2), 466–480 (2006)

  35. Samé, A., Ambroise, C., Govaert, G.: An online classification EM algorithm based on the mixture model. Stat. Comput. 17(3), 209–218 (2007)

  36. Scott, S.: Bayesian methods for hidden Markov models: recursive computing in the 21st century. J. Am. Stat. Assoc. 97(457), 337–351 (2002)

  37. Smidl, V., Quinn, A.: The Variational Bayes Method in Signal Processing. Springer, Berlin (2006)

  38. Watanabe, S.: Algebraic Geometry and Statistical Learning Theory, Cambridge Monographs on Applied and Computational Mathematics, vol. 25. Cambridge University Press, Cambridge (2009)

  39. Winkler, G.: Image Analysis, Random Fields and Markov Chain Monte Carlo Methods, Applications of Mathematics, vol. 27. Springer, Berlin (2003)

  40. Yau, C., Holmes, C.: A decision-theoretic approach for segmental classification. Ann. Appl. Stat. 7(3), 1814–1835 (2013)

Acknowledgements

This work is supported by Estonian institutional research funding IUT34-5.

Author information

Corresponding author

Correspondence to Jüri Lember.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 General formulae for the segmentation methods studied

Due to our independence assumption, all emission and transition parameters can be estimated separately. In the formulae of this section we use the same notation for the random parameters \(p_{l,j}\), \(\mu _k\) and \(\sigma _k^2\), \(k,l,j\in \{1,\ldots ,K\}\), and for the corresponding estimates; the exact meaning will be clear from the context.

Segmentation MM. In the case of Dirichlet priors the matrix \(\theta _{tr}^{(i+1)}\) can be found row-wise; the l-th row is the posterior mode:

$$\begin{aligned} p^{(i+1)}_{lj}={\alpha _{lj}+n_{lj}(y^{(i)})-1\over \alpha _l+n_l(y^{(i)})-K}. \end{aligned}$$
(7.1)
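
For illustration, a minimal numpy sketch of update (7.1); the names `alpha` (the Dirichlet hyperparameters \(\alpha _{lj}\)) and `counts` (the transition counts \(n_{lj}(y^{(i)})\) of the current path) are ours, and we assume all \(\alpha _{lj}+n_{lj}\ge 1\) so that the posterior mode is interior:

```python
import numpy as np

def transition_posterior_mode(alpha, counts):
    """Row-wise posterior mode of the transition matrix, Eq. (7.1).

    alpha  : (K, K) array of Dirichlet hyperparameters alpha_{lj}
    counts : (K, K) array of transition counts n_{lj}(y) of the current path
    """
    num = alpha + counts - 1.0                    # alpha_{lj} + n_{lj}(y) - 1
    den = num.sum(axis=1, keepdims=True)          # alpha_l + n_l(y) - K
    return num / den
```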

Emission parameters can be updated independently:

$$\begin{aligned}&\theta _{em}^{k,(i+1)}=\arg \max _{\theta ^k_{em}}p(\theta ^k_{em}|x_{S_k})= \arg \max _{\theta ^k_{em}}\left[ \sum _{t: y^{(i)}_t=k} \ln f_k(x_t|\theta ^{k}_{em})+\ln \pi ^k_{em}(\theta ^{k}_{em})\right] ,\\&\quad k=1,\ldots ,K, \end{aligned}$$

where \(x_{S_k}\) is the subsample of x corresponding to state k in \(y^{(i)}\). Formally, for every sequence \(y\in S^n\) define \(S_k(y)=\{t\in \{1,\dots ,n\}: y_t=k\}\), then \(x_{S_k}=\{x_t: t\in S_k\}\).

Bayesian EM. The emission updates are given by

$$\begin{aligned} \theta _{em}^{k,(i+1)}=\arg \max _{\theta ^k_{em}}\left[ \sum _t \ln f_k(x_t|\theta ^{k}_{em})\gamma _t^{(i)}(k)+\ln \pi ^k_{em}(\theta ^{k}_{em})\right] ,\quad k=1,\ldots ,K, \end{aligned}$$
(7.2)

where

$$\begin{aligned} \gamma _t^{(i)}(k):=P(Y_t=k|X=x,\theta ^{(i)})=\sum _{y: y_t=k}p(y|\theta ^{(i)},x). \end{aligned}$$
(7.3)

In the case of Dirichlet priors the transition updates are given by

$$\begin{aligned} {p}^{(i+1)}_{lj}&={\xi ^{(i)}(l,j)+ (\alpha _{lj}-1)\over \sum _j \xi ^{(i)}(l,j)+(\alpha _l-K)},\quad \text{ where } \xi ^{(i)}(l,j):=\sum _{t=1}^{n-1}P(Y_{t}=l,Y_{t+1}=j|x,\theta ^{(i)}). \end{aligned}$$
(7.4)

Since one of the studied methods (ICM) starts with an initial sequence, in order for the comparison to be fair we let all the other methods start with a sequence as well. Therefore, for a given initial sequence \(y^{(0)}\), define

$$\begin{aligned} \gamma _t^{(0)}(k):=I_{k}(y_t^{(0)}),\quad \xi ^{(0)}(l,j):=n_{lj}(y^{(0)}). \end{aligned}$$
(7.5)
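
A short sketch of initialisation (7.5), with states coded \(0,\ldots ,K-1\) (function and variable names are ours):

```python
import numpy as np

def init_from_path(y0, K):
    """gamma0[t, k] = I_k(y0_t)  and  xi0[l, j] = n_{lj}(y0),  Eq. (7.5)."""
    y0 = np.asarray(y0)
    n = len(y0)
    gamma0 = np.zeros((n, K))
    gamma0[np.arange(n), y0] = 1.0                # one-hot rows
    xi0 = np.zeros((K, K))
    np.add.at(xi0, (y0[:-1], y0[1:]), 1.0)        # count transitions l -> j
    return gamma0, xi0
```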

Variational Bayes approach. Let us have a closer look at the measure \(q_{Y}^{(i+1)}(y)\). We are going to show that there exists an HMM \((Z,X)\) such that for every sequence y, \(q_{Y}^{(i+1)}(y)={P}(Z=y|X=x)\). By definition,

$$\begin{aligned} q_{Y}^{(i+1)}(y)\propto \exp \left[ \int \ln p(\theta ,y|x) q_{\theta }^{(i+1)}(d\theta )\right] . \end{aligned}$$

Apply the notation from (2.2) in the current case:

$$\begin{aligned} u^{(i+1)}_{lj}=\exp \left[ \int \ln p_{lj}(\theta _{tr}) q_{\theta }^{(i+1)}(d\theta )\right] ,\quad h^{(i+1)}_k(x_t)=\exp \left[ \int \ln f_k(x_t|\theta _{em}^k)q_{\theta }^{(i+1)}(d\theta )\right] . \end{aligned}$$

Since \(\ln p(\theta ,y|x)=\ln \pi (\theta )+\ln p(y,x|\theta )-\ln p(x)\), we obtain

$$\begin{aligned}&\int \ln p(\theta ,y|x) q_{\theta }^{(i+1)}(d\theta )\nonumber \\&\quad =\int \ln \pi (\theta )q_{\theta }^{(i+1)}(d\theta )-\ln p(x) + \int \ln p(y,x|\theta )q_{\theta }^{(i+1)}(d\theta ) \nonumber \\&\quad =c(q_{\theta }^{(i+1)},x)+ \ln p_{0 y_1}+\sum _{lj}n_{lj}(y)\ln u^{(i+1)}_{lj}+\sum _{k=1}^K \sum _{t: y_t=k} \ln h^{(i+1)}_k(x_t)\nonumber \\&\quad =c(q_{\theta }^{(i+1)},x)+ \ln p_{0 y_1}+\sum _{lj}n_{lj}(y)\ln {\tilde{u}}^{(i+1)}_{lj}+\sum _{k=1}^K \sum _{t: y_t=k} \ln {\tilde{h}}^{(i+1)}_k(x_t), \end{aligned}$$
(7.6)

where \({\tilde{u}}_{lj}\) is the normalized quantity, \( {\tilde{u}}_{lj}:={u_{lj}\over \sum _j u_{lj}}\), and \({\tilde{h}}_k(x_t):=(\sum _j u_{kj})h_k(x_t)\) if \(t\le n-1\), \({\tilde{h}}_k(x_n)=h_k(x_n)\). Let now \((Z,X)\) be an HMM, where Z is the underlying Markov chain with transition matrix \(({\tilde{u}}_{lj})\) and the emission densities are given by \({\tilde{h}}_k\). From (7.6) it follows that \(q_{Y}^{(i+1)}(y)\propto P(Z=y|X=x)\). Since \(q_{Y}^{(i+1)}\) and \(P(Z\in \cdot |X=x)\) are both probability measures, it follows that they are equal. To stress the dependence on iterations, we will denote \(q_{Y}^{(i+1)}(y)=P^{(i+1)}(Z=y|X=x)\).

Let us now calculate \(q_{\theta }\). Let \(\gamma _t^{(i)}(k)\) denote the marginal of \(q_{Y}^{(i)}(y)\),

$$\begin{aligned} \gamma _t^{(i)}(k):=P^{(i)}(Z_t=k|X=x)=\sum _{y:y_t=k}q_{Y}^{(i)}(y). \end{aligned}$$

Observe that

$$\begin{aligned} \sum _{y}\ln p(y,x|\theta )q_{Y}^{(i)}(y)= & {} C_1 +\sum _{l,j}\ln p_{lj}(\theta _{tr})\big (\sum _y n_{lj}(y)q_{Y}^{(i)}(y)\big )\\&+\sum _{t=1}^n\sum _{k=1}^K\ln f_k(x_t|\theta ^{k}_{em})\gamma ^{(i)}_t(k), \end{aligned}$$

where \(C_1:=\sum _{k}(\ln p_{0k})\gamma _1^{(i)}(k)\). The sum \(\sum _y n_{lj}(y)q_{Y}^{(i)}(y)\) is the expected number of transitions from l to j, so that using the equality \(q_{Y}^{(i)}(y)=P^{(i)}(Z=y|X=x)\), we have

$$\begin{aligned} \sum _y n_{lj}(y)q_{Y}^{(i)}(y)=\sum _{t=1}^{n-1}P^{(i)}(Z_t=l,Z_{t+1}=j|X=x)=:\xi ^{(i)}(l,j). \end{aligned}$$

Therefore,

$$\begin{aligned} \ln q_{\theta }^{(i+1)}(\theta )= & {} C+\ln \pi _{tr}(\theta _{tr})+\ln \pi _{em}(\theta _{em})+ \sum _{l,j}\xi ^{(i)}(l,j)\ln p_{lj}(\theta _{tr})\nonumber \\&+\sum _{t=1}^n\sum _{k=1}^K \ln f_k(x_t|\theta ^{k}_{em})\gamma ^{(i)}_t(k). \end{aligned}$$
(7.7)

From (7.7) we can see that under \(q_{\theta }^{(i+1)}\) the parameters \(\theta _{tr},\theta ^{1}_{em},\ldots ,\theta ^{K}_{em}\) are still independent and can therefore be updated separately. In the case of Dirichlet transition priors the rows are independent as well. The transition update for the l-th row and the emission update for the k-th component are given by

$$\begin{aligned} q_{\theta }^{(i+1)}(p_{l1},\ldots ,p_{lK})\propto \prod _{j=1}^K p_{lj}^{\alpha _{lj}-1+\xi ^{(i)}(l,j)}, \quad q_{\theta }^{(i+1)}(\theta _{em}^k)\propto \pi (\theta _{em}^k)\prod _{t=1}^n \big ( f_k(x_t|\theta _{em}^k)\big )^{\gamma _t^{(i)}(k)}. \end{aligned}$$

The whole VB approach is applicable since \(\xi ^{(i)}(l,j)\) and \(\gamma _t^{(i)}(k)\) can be found by the standard forward-backward formulae using \({\tilde{u}}_{lj}^{(i)}\) and \({\tilde{h}}^{(i)}_k\). Actually, it is not difficult to see that in these formulae the original \(u^{(i)}_{lj}\) and \(h^{(i)}_k\) can be used instead of the normalized ones. To summarize, in our setup the VB approach yields the following algorithm for calculating \({\hat{v}}_\mathrm{{VB}}\). For a given initial sequence \(y^{(0)}\), find the vector \(\gamma _t^{(0)}\) and the matrix \(\xi ^{(0)}\) as in (7.5). Given \(\gamma _t^{(i)}\) and \(\xi ^{(i)}\), update \(u^{(i+1)}_{lj}\) and \(h_k^{(i+1)}\) as follows:

$$\begin{aligned} u^{(i+1)}_{lj}= & {} \exp [\psi (\alpha _{lj}+\xi ^{(i)}(l,j))-\psi (\alpha _{l}+\xi ^{(i)}(l))],\quad \text {where}\quad \xi ^{(i)}(l):=\sum _j \xi ^{(i)}(l,j), \nonumber \\ h^{(i+1)}_k(x_t)= & {} \exp \left[ \int \ln f_k(x_t|\theta _{em}^k)q_{\theta }^{(i+1)}(d\theta )\right] ,\quad \text {where}\quad q_{\theta }^{(i+1)}(\theta _{em}^k)\nonumber \\&\propto \pi (\theta _{em}^k)\prod _{t=1}^n \big (f_k(x_t|\theta _{em}^k)\big )^{\gamma _t^{(i)}(k)}. \end{aligned}$$
(7.8)

With these parameters, \(\xi ^{(i+1)}\) and \(\gamma _t^{(i+1)}\) can be calculated with the usual forward-backward procedure for HMMs. Then update \(u^{(i+2)}_{lj}\) and \(h_k^{(i+2)}\), and so on. After convergence, say after m steps, apply the Viterbi algorithm with transitions \((u_{ij}^{(m)})\) and emission densities \(h^{(m)}_k\). The obtained path maximizes \(q_{Y}^{(m)}(y)\) over all paths, so it is \({\hat{v}}_\mathrm{{VB}}\).
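
As a rough sketch (in our own notation, not the authors' code), the whole VB iteration can be organised as below. Here `forward_backward`, `viterbi` and `expected_log_emissions` stand for a standard HMM forward-backward routine, a log-space Viterbi routine and the computation of \(\ln h^{(i+1)}_k(x_t)\) from \(q_\theta ^{(i+1)}\) as in (7.8); they are assumed available rather than shown:

```python
import numpy as np
from scipy.special import digamma

def vb_map_path(x, alpha, gamma, xi, forward_backward, viterbi,
                expected_log_emissions, max_iter=100):
    """VB iteration (7.8) followed by a final Viterbi pass (a sketch).

    gamma, xi : initial marginals and transition counts, as in Eq. (7.5)
    forward_backward(log_u, log_h)   -> (gamma, xi) of the surrogate HMM
    viterbi(log_u, log_h)            -> MAP path of the surrogate HMM
    expected_log_emissions(x, gamma) -> (n, K) matrix of ln h_k(x_t)
    """
    for _ in range(max_iter):
        a = alpha + xi                                   # alpha_{lj} + xi(l, j)
        log_u = digamma(a) - digamma(a.sum(axis=1, keepdims=True))
        log_h = expected_log_emissions(x, gamma)
        gamma_new, xi = forward_backward(log_u, log_h)
        if np.allclose(gamma_new, gamma):                # crude convergence check
            gamma = gamma_new
            break
        gamma = gamma_new
    return viterbi(log_u, log_h)                         # the path \hat v_VB
```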

Simulated annealing. Because of the independence of the emission and transition parameters, it holds even for \(\beta >1\) that \(p_{\beta }(\theta |y,x)=p_{\beta }(\theta _{tr}|y)p_{\beta }(\theta _{em}|y,x)\); thus the transition and emission parameters can be sampled separately. When the rows of the transition matrix have independent Dirichlet priors, the l-th row can be generated from the Dirichlet distribution with parameters \(\beta (n_{lk}(s)+\alpha _{lk})+1-\beta \), \(k=1,\ldots ,K\). For given \(\theta \), sampling from \(p(y|\theta ,x)\) can be performed in various ways: we use so-called Markovian Backward Sampling (Algorithm 6.1.1 in [6]). To sample from \(p_{\beta }(y|\theta ,x)\), note that

$$\begin{aligned} p(x,y|\theta )^{\beta }={p^{\beta }_{0y_1}\over \sum _{j}p^{\beta }_{0,j}}\prod _{t=2}^{n}{{{\tilde{p}}}}_{y_{t-1} y_{t}}{\tilde{f}}_{y_t}(x_t), \end{aligned}$$

where \({\tilde{p}}_{ij}:={p^{\beta }_{ij} / \sum _{j}p^{\beta }_{ij}}\), and \({\tilde{f}}_k(x_t):=\big (\sum _{j}p^{\beta }_{kj}\big )f^{\beta }_k(x_t)\) for \(t=1,\ldots ,n-1\), \({\tilde{f}}_k(x_n):=\big (\sum _{j}p^{\beta }_{0j}\big )f^{\beta }_k(x_n).\) Although the functions \({{\tilde{f}}}_k\) are not densities, one can still use Markovian Backward Sampling.
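
For instance, the tempered draw of the transition matrix described above might look as follows in numpy (a sketch; `counts` holds the transition counts \(n_{lk}(s)\) of the current path \(s\), and the names are ours):

```python
import numpy as np

def sample_tempered_transitions(alpha, counts, beta, rng=None):
    """Draw row l from Dirichlet(beta*(n_{lk}(s) + alpha_{lk}) + 1 - beta)."""
    rng = rng if rng is not None else np.random.default_rng()
    K = alpha.shape[0]
    P = np.empty((K, K))
    for l in range(K):
        P[l] = rng.dirichlet(beta * (counts[l] + alpha[l]) + 1.0 - beta)
    return P
```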

1.2 Non-stochastic segmentation algorithms for the Dirichlet-NIX case

Suppose the emission distribution corresponding to state k is \({\mathcal {N}}(\mu _k,\sigma _k^2)\), where prior distributions for \(\mu _k\) and \(\sigma _k^2\) are given by a normal and scaled inverse-chi-square distribution, respectively:

$$\begin{aligned} \mu _k |\sigma _k^2\sim {\mathcal {N}} \big (\xi _k,{\sigma _k^2\over \kappa _0}\big ),\quad \sigma ^2_k\sim \mathrm{Inv-}\chi ^2(\nu _{0},\tau ^2_{0}). \end{aligned}$$

Here \(\kappa _0\), \(\nu _{0}\) and \(\tau ^2_{0}\) are hyperparameters that might depend on k, but in our example we assume they are equal. Recall the density of \(\mathrm{Inv-}\chi ^2(\nu ,\tau ^2)\):

$$\begin{aligned} f(x;\nu ,\tau ^2)={(\tau ^2\nu /2)^{\nu /2}\over \Gamma ({\nu /2})}x^{-(1 + \nu /2)}\exp \left[ -{\nu \tau ^2\over 2x}\right] . \end{aligned}$$

If \(X\sim \mathrm{Inv-}\chi ^2(\nu ,\tau ^2)\), then

$$\begin{aligned} EX={\tau ^2\nu \over \nu -2},\quad \mathrm{Var}(X)={2\tau ^4\nu ^2 \over (\nu -2)^2(\nu -4)},\quad E(\ln X)=\ln \big ({\nu \tau ^2\over 2}\big ){-}\psi \big ({\nu \over 2}\big ),\quad E X^{-1}=\tau ^{-2}, \end{aligned}$$

and the mode of the distribution is given by \(\nu \tau ^2/(\nu +2)\). Therefore, if \(\nu _0\) and \(\kappa _0\) are both very large, then \(\sigma _k^2\approx \tau _0^2\) and \(\mu _k\approx \xi _k\), and we get back to the first example. If \(\nu _0\) is very large, then \(\sigma ^2_k\approx \tau _0^2\), so that emission variances are \(\tau _0^2\), but the variance of the mean is approximately \(\tau _0^2 /\kappa _0\).
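
These moment formulas are easy to verify numerically by noting that \(\mathrm{Inv}\)-\(\chi ^2(\nu ,\tau ^2)\) coincides with the inverse-gamma distribution with shape \(\nu /2\) and scale \(\nu \tau ^2/2\); a small scipy check with illustrative values:

```python
from scipy.stats import invgamma

nu, tau2 = 10.0, 2.0
X = invgamma(a=nu / 2, scale=nu * tau2 / 2)        # Inv-chi^2(nu, tau^2)

print(X.mean(), nu * tau2 / (nu - 2))                              # both 2.5
print(X.var(), 2 * tau2**2 * nu**2 / ((nu - 2)**2 * (nu - 4)))     # both ~2.083
print(X.expect(lambda t: 1 / t), 1 / tau2)                         # both 0.5
```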

Since the emission and transition parameters are independent, the transition parameters can be updated as previously, that is, as described in Sect. 1. Because the emission components \((\theta ^1_{em},\ldots ,\theta ^K_{em})\) are independent under the prior and the posterior, it holds that \(p(\theta _{em}|y,x)=\prod _k p(\theta ^k_{em}|x_{S_k})\), where \(x_{S_k}\) is the subsample of x along y corresponding to state k. Let \(m_k(y)\) be the size of \(x_{S_k}\), and let \(\bar{x}_k\) and \(s^2_k\) be the mean and variance of \(x_{S_k}\). Since NIX priors are conjugate, for any state k the posterior parameters \(\kappa _{k}\), \(\nu _{k}\), \(\mu _{k}\) and \(\tau _{k}^2\) can be calculated as follows:

$$\begin{aligned} \kappa _{k}&=\kappa _0 + m_k , \quad \quad \nu _{k} =\nu _0+m_k, \end{aligned}$$
(7.9)
$$\begin{aligned} \mu _{k}&= \frac{\kappa _0}{\kappa _0+m_k}\xi _k + \frac{m_k}{\kappa _0+m_k}\bar{x}_k, \end{aligned}$$
(7.10)
$$\begin{aligned} \nu _{k} \tau _{k}^2&=\nu _0\tau _0^2 + (m_k-1)s^2_k + \frac{\kappa _0 m_k }{\kappa _0 +m_k}(\bar{x}_k-\xi _k)^2, \quad s_k^2=\frac{1}{m_k-1}\sum _{t\in S_k} (x_t-\bar{x}_k)^2,\nonumber \\ \end{aligned}$$
(7.11)

see [31]. We also need to calculate for every path y the joint probability \(p(x,y)=p(y)p(x|y)\). Due to the independence of the transition and emission parameters, p(y) is still as in (1.6) and p(x|y) depends on the emission parameters only. According to the formula for the marginal likelihood (see, e.g., [31]), we obtain

$$\begin{aligned} p(x|y)&=\prod _{k=1}^K\int \prod _{t\in S_k}f_k(x_t|\theta ^k_{em})\pi (\theta ^k_{em})d \theta ^k_{em}=\prod _{k=1}^K{\Gamma \left( {\nu _{k}\over 2}\right) \over \Gamma \left( {\nu _{0}\over 2}\right) }\sqrt{{\kappa _{0}\over \kappa _{k}}}{(\nu _0\tau _0^2)^{\nu _0\over 2}\over (\nu _{k}\tau ^2_{k})^{\nu _{k}\over 2}}\pi ^{-{m_k\over 2}}. \end{aligned}$$
(7.12)
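
A sketch of the conjugate update (7.9)–(7.11) and of the logarithm of the k-th factor of (7.12), for a single state; `xk` stands for the subsample \(x_{S_k}\) and `xi_k` for the prior mean \(\xi _k\) (names are ours, not the authors'):

```python
import numpy as np
from scipy.special import gammaln

def nix_posterior(xk, xi_k, kappa0, nu0, tau0sq):
    """Posterior NIX parameters (kappa_k, nu_k, mu_k, tau_k^2), Eqs. (7.9)-(7.11)."""
    xk = np.asarray(xk, dtype=float)
    m = len(xk)
    xbar = xk.mean() if m > 0 else 0.0
    s2 = xk.var(ddof=1) if m > 1 else 0.0
    kappa = kappa0 + m
    nu = nu0 + m
    mu = (kappa0 * xi_k + m * xbar) / kappa
    nu_tau2 = nu0 * tau0sq + (m - 1) * s2 + kappa0 * m / kappa * (xbar - xi_k) ** 2
    return kappa, nu, mu, nu_tau2 / nu

def log_marginal_factor(xk, xi_k, kappa0, nu0, tau0sq):
    """log of the k-th factor of p(x|y) in Eq. (7.12)."""
    kappa, nu, _, tau2 = nix_posterior(xk, xi_k, kappa0, nu0, tau0sq)
    m = len(xk)
    return (gammaln(nu / 2) - gammaln(nu0 / 2)
            + 0.5 * np.log(kappa0 / kappa)
            + 0.5 * nu0 * np.log(nu0 * tau0sq)
            - 0.5 * nu * np.log(nu * tau2)
            - 0.5 * m * np.log(np.pi))
```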

We will now give a more detailed description of the non-stochastic algorithms for Example 2.

Bayesian EM. Start with an initial state sequence \(y^{(0)}\). With this sequence, find for every state k the parameters \(\kappa _{k}\), \(\mu _{k}\), \(\nu _{k}\), \(\tau ^2_{k}\) as defined in (7.9), (7.10), (7.11) and calculate the posterior modes, that is, update

$$\begin{aligned} p_{lj}^{(1)}&={n_{lj}(y^{(0)})+(\alpha _{lj}-1)\over {\sum _j n_{lj}(y^{(0)})+(\alpha _{l}-K)}},\quad \mu ^{(1)}_k=\mu _{k},\quad (\sigma _k^2)^{(1)}={\nu _{k}\tau _{k}^2\over \nu _{k}+2},\quad k=1,\ldots ,K. \end{aligned}$$

With these parameters calculate the vectors \(\gamma _t^{(1)}\) and matrix \((\xi ^{(1)}(l,j))\) as in (7.3) and (7.4) using the forward-backward formulae. Given \(\gamma _t^{(i)}\) and \(\xi ^{(i)}(l,j)\), the transition parameters are updated according to (7.4). The emission updates are given by (7.2). Let us calculate \(\theta _{em}^{k,(i+1)}\) for the NIX-model. Suppress k from the notation and observe that \(\theta ^{(i+1)}_{em}=(\mu ^{(i+1)},(\sigma ^2)^{(i+1)})\) maximizes the following function over \(\mu \) and \(\sigma ^2\):

$$\begin{aligned}&\sum _t \ln f(x_t|\mu ,\sigma ^2)\gamma ^{(i)}_t+\ln \pi (\mu |\sigma ^2)+\ln \pi (\sigma ^2)\\&\quad =\mathrm{const}{-}{1\over 2}\left[ (\ln \sigma ^2) \left( \sum _t \gamma ^{(i)}_t {+} (\nu _0 {+}3) \right) {+} {1\over \sigma ^2} \left( \sum _t {(x_t{-}\mu )^2}\gamma ^{(i)}_t{+}{(\mu {-}\xi )^2\kappa _0}{+}{\nu _0 \tau _0^2}\right) \right] . \end{aligned}$$

The solutions \(\mu _k^{(i+1)}\) and \((\sigma _k^{2})^{(i+1)}\) are given by:

$$\begin{aligned} \mu _k^{(i+1)}= & {} {\sum _t x_t \gamma _t^{(i)}(k)+\xi _k\kappa _0\over \sum _t \gamma ^{(i)}_t(k) + \kappa _0}, \\ (\sigma _k^{2})^{(i+1)}= & {} {\nu _0\tau _0^2+\sum _t(x_t-\mu _k^{(i+1)})^2\gamma ^{(i)}_t(k)+(\mu _k^{(i+1)}-\xi _k)^2\kappa _0\over \sum _t \gamma ^{(i)}_t(k) + \nu _0+3}. \end{aligned}$$

With \(\kappa _0\rightarrow 0\) (non-informative prior), \(\mu _k^{(i+1)}\) is the same as in the standard EM algorithm. Using the updated parameters, calculate \(\gamma _t^{(i+1)}\) and \(\xi ^{(i+1)}(l,j)\). Keep updating until the change in the log-likelihood falls below the stopping threshold.
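
A sketch of this emission M-step for a single state (our notation; `gamma_k` is the vector of weights \(\gamma ^{(i)}_t(k)\) and `xi_k` the prior mean \(\xi _k\)):

```python
import numpy as np

def bayes_em_emission_update(x, gamma_k, xi_k, kappa0, nu0, tau0sq):
    """Posterior-mode update of (mu_k, sigma_k^2) in the Bayesian EM step."""
    x, gamma_k = np.asarray(x, float), np.asarray(gamma_k, float)
    g = gamma_k.sum()
    mu = (np.dot(gamma_k, x) + kappa0 * xi_k) / (g + kappa0)
    sigma2 = (nu0 * tau0sq
              + np.dot(gamma_k, (x - mu) ** 2)
              + kappa0 * (mu - xi_k) ** 2) / (g + nu0 + 3)
    return mu, sigma2
```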

Segmentation EM. Given sequence \(y^{(i)}\), calculate for every state k the parameters \(\kappa _{k}^{(i)}\), \(\mu ^{(i)}_{k}\), \(\nu ^{(i)}_{k}\) and \((\tau _{k}^2)^{(i)}\) using formulae (7.9)–(7.11). With these parameters, calculate \(h^{(i+1)}_k(x_t)\) as follows:

$$\begin{aligned} \ln h^{(i+1)}_k(x_t)&=-\frac{1}{2}\ln \Big (2\pi (\tau _{k}^2)^{(i)}\Big )- \frac{1}{2}\left[ \ln \left( \frac{\nu ^{(i)}_{k}}{2} \right) -\psi \left( \frac{\nu ^{(i)}_{k}}{2}\right) \right] \nonumber \\&\quad -\frac{x_t^2}{2\left( \tau ^{2}_{k}\right) ^{(i)}}+x_t\frac{\mu ^{(i)}_{k}}{\left( \tau ^{2}_{k}\right) ^{(i)}}- \frac{1}{2}\left[ \frac{1}{\kappa ^{(i)}_{k}}+\left( \frac{\mu ^{(i)}_{k}}{\tau _{k}^{(i)}}\right) ^2 \right] . \end{aligned}$$
(7.13)

Compute the matrix \((u_{lj}^{(i+1)})\), where \(\ln u^{(i+1)}_{lj}=\psi (\alpha _{lj}+n_{lj}(y^{(i)}))-\psi (\alpha _{l}+n_{l}(y^{(i)}))\). To find \(y^{(i+1)}\), apply the Viterbi algorithm with \(u_{lj}^{(i+1)}\) and \(h^{(i+1)}_k(x_t)\). Keep doing so until no changes occur in the path estimate.
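
Put together, one pass of the segmentation EM algorithm might look as follows (a sketch in our notation); `viterbi` is a standard log-space Viterbi routine, assumed rather than shown, `nix_posterior` is the conjugate update sketched after (7.12), and `prior_means[k]` plays the role of \(\xi _k\):

```python
import numpy as np
from scipy.special import digamma

def segmentation_em_step(x, y, alpha, prior_means, kappa0, nu0, tau0sq, K, viterbi):
    """One iteration y^{(i)} -> y^{(i+1)}: Eq. (7.13), the u-update, then Viterbi."""
    x, y = np.asarray(x, float), np.asarray(y)
    n = len(x)
    # ln u_{lj} = psi(alpha_{lj} + n_{lj}(y)) - psi(alpha_l + n_l(y))
    counts = np.zeros((K, K))
    np.add.at(counts, (y[:-1], y[1:]), 1.0)
    a = alpha + counts
    log_u = digamma(a) - digamma(a.sum(axis=1, keepdims=True))
    # ln h_k(x_t) of Eq. (7.13), from the NIX posterior along the current path
    log_h = np.empty((n, K))
    for k in range(K):
        kap, nu, mu, tau2 = nix_posterior(x[y == k], prior_means[k],
                                          kappa0, nu0, tau0sq)
        log_h[:, k] = (-0.5 * np.log(2 * np.pi * tau2)
                       - 0.5 * (np.log(nu / 2) - digamma(nu / 2))
                       - 0.5 * x ** 2 / tau2 + x * mu / tau2
                       - 0.5 * (1.0 / kap + mu ** 2 / tau2))
    return viterbi(log_u, log_h)
```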

Segmentation MM. Given \(y^{(i)}\), calculate \(\mu ^{(i)}_{k}\), \(\nu ^{(i)}_{k}\) and \((\tau _{k}^2)^{(i)}\) using formulae (7.9)–(7.11) and update the posterior modes as follows:

$$\begin{aligned} p_{lj}^{(i+1)}&={n_{lj}(y^{(i)})+(\alpha _{lj}-1)\over {\sum _j n_{lj}(y^{(i)})+(\alpha _{l}-K)}},\quad \mu ^{(i+1)}_k=\mu ^{(i)}_{k},\quad (\sigma _k^2)^{(i+1)}={\nu ^{(i)}_{k}\over \nu ^{(i)}_{k}+2}(\tau _{k}^2)^{(i)}. \end{aligned}$$

With these parameters find \(y^{(i+1)}\) by the Viterbi algorithm. Keep doing so until no changes occur in the estimated state path.

VB algorithm. Given an initial state sequence \(y^{(0)}\), find \(h^{(1)}_k(x_t)\) and \(u^{(1)}_{lj}\) as in the segmentation EM algorithm. With these parameters, calculate \(\gamma ^{(1)}_t\) and \(\xi ^{(1)}(l,j)\) using the forward-backward formulae. Given the matrix \(\big (\xi ^{(i)}(l,j)\big )\), update the matrix \((u_{lj}^{(i+1)})\) according to (7.8). Given \(\gamma ^{(i)}_t(k)\), the parameters \(\kappa ^{(i)}_{k}\), \(\mu ^{(i)}_{k}\), \(\nu ^{(i)}_{k}\) and \((\tau ^2_{k})^{(i)}\) can be calculated by (see, e.g., [29])

$$\begin{aligned} \kappa ^{(i)}_k&=\kappa _0+g^{(i)}_k,\quad \nu ^{(i)}_{k}=\nu _0+g^{(i)}_k,\quad g^{(i)}_k=\sum _{t=1}^n \gamma ^{(i)}_t(k),\\ \mu ^{(i)}_{k}&={\kappa _0\over \kappa _0+g^{(i)}_k }\xi _k+{g^{(i)}_k\over \kappa _0+g^{(i)}_k} {\tilde{x}}^{(i)}_k,\quad {\tilde{x}}^{(i)}_k={1\over g^{(i)}_k}\sum _{t=1}^n \gamma ^{(i)}_t(k) x_t,\\ \nu ^{(i)}_k (\tau ^2_{k})^{(i)}&=\nu _0\tau _0^2+\sum _{t=1}^n(x_t-{\tilde{x}}^{(i)}_k)^2\gamma ^{(i)}_t(k)+{\kappa _0 g^{(i)}_k\over \kappa _0+g^{(i)}_k}({\tilde{x}}^{(i)}_k-\xi _k)^2. \end{aligned}$$

Then compute \(h^{(i+1)}_k(x_t)\) as in (7.13). With the help of \(h^{(i+1)}_k(x_t)\) and \(u^{(i+1)}_{lj}\), find \(\gamma ^{(i+1)}_t\) and \(\xi ^{(i+1)}(l,j)\) using the forward-backward formulae. After that, update \(h^{(i+2)}_k(x_t)\) and \(u^{(i+2)}_{lj}\), and so on. When the VB algorithm has converged, say after m steps, apply the Viterbi algorithm with \(u_{lj}^{(m)}\) as transitions and with \(h^{(m)}_k(x_t)\) as emission values.
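
The \(\gamma \)-weighted update above, for one state, in the same (illustrative) notation as the earlier sketches; `gamma_k` denotes the vector \(\gamma ^{(i)}_t(k)\):

```python
import numpy as np

def vb_nix_update(x, gamma_k, xi_k, kappa0, nu0, tau0sq):
    """Gamma-weighted NIX update (kappa_k, nu_k, mu_k, tau_k^2) for one state."""
    x, gamma_k = np.asarray(x, float), np.asarray(gamma_k, float)
    g = gamma_k.sum()
    xtilde = np.dot(gamma_k, x) / g
    kappa, nu = kappa0 + g, nu0 + g
    mu = (kappa0 * xi_k + g * xtilde) / kappa
    nu_tau2 = (nu0 * tau0sq
               + np.dot(gamma_k, (x - xtilde) ** 2)
               + kappa0 * g / kappa * (xtilde - xi_k) ** 2)
    return kappa, nu, mu, nu_tau2 / nu
```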

1.3 Clustering formulae under normal emissions with NIX priors

From (7.12) it follows that for any sequence \(y'\), the likelihood ratio is given by

$$\begin{aligned} {p(x|y)\over p(x|y')}=\prod _{k=1}^K{\Gamma \left( {\nu _0+m_k\over 2}\right) \over \Gamma \left( {\nu _0+m'_k\over 2}\right) } \prod _{k=1}^K{\sqrt{\kappa _0+m'_k\over \kappa _0+m_k}} \prod _{k=1}^K {(\nu '_{k}\tau ^{'2}_{k})^{{\nu _{0}\over 2}}\over (\nu _{k}\tau ^2_{k})^{{\nu _0\over 2}}} \prod _{k=1}^K {(\nu '_{k}\tau ^{'2}_{k})^{{m'_k\over 2}}\over (\nu _{k}\tau ^2_{k})^{{m_k\over 2}}}. \end{aligned}$$
(7.14)

When \(\nu _0\rightarrow \infty \) and \(\tau ^2_0>0\), then due to \(\sum _k m_k=\sum _k m'_k=n\) we have

$$\begin{aligned} \lim _{\nu _0\rightarrow \infty }\prod _{k=1}^K{\Gamma \left( {\nu _0+m_k\over 2}\right) \over \Gamma \left( {\nu _0+m'_k\over 2}\right) }=1,\quad \lim _{\nu _0\rightarrow \infty } \prod _{k=1}^K {(\nu '_{k}\tau ^{'2}_{k})^{{m'_k\over 2}}\over (\nu _{k}\tau ^2_{k})^{{m_k\over 2}}}= 1. \end{aligned}$$

Write \(\nu _{k}\tau ^{2}_{k}\) as

$$\begin{aligned} \nu _{k}\tau ^{2}_{k}=\nu _0\tau _0^2 + \sum _{t\in S_k} (x_t-\bar{x}_k)^2 + \frac{\kappa _0 m_k }{\kappa _0 +m_k}(\bar{x}_k-\xi _k)^2=\nu _0\tau _0^2+A_k=\nu _0\tau _0^2\left( 1+{2A_k\over 2\nu _0\tau _0^2}\right) . \end{aligned}$$

Then

$$\begin{aligned} {(\nu '_{k}\tau ^{'2}_{k})^{{\nu _{0}\over 2}}\over (\nu _{k}\tau ^2_{k})^{{\nu _0\over 2}}}={\Big (1+{2A'_k\over 2\nu _0\tau _0^2}\Big )^{\nu _0\over 2} \over \Big (1+{2A_k\over 2\nu _0\tau _0^2}\Big )^{\nu _0\over 2}}\rightarrow \exp \left[ {A'_k-A_k\over 2\tau ^2_0}\right] . \end{aligned}$$

Therefore, when \(\nu _0\rightarrow \infty \), the likelihood ratio in (7.14) converges to

$$\begin{aligned} \prod _{k=1}^K{\sqrt{\kappa _0+m'_k\over \kappa _0+m_k}} \exp \left[ {\sum _k A'_k-\sum _kA_k\over 2\tau ^2_0}\right] . \end{aligned}$$

Thus, maximizing p(x|y) corresponds to the following clustering problem: find clusters \(S_1,\ldots , S_K\) that minimize

$$\begin{aligned} \sum _{k=1}^K \sum _{t\in S_k} (x_t-\bar{x}_k)^2+\kappa _0\sum _{k=1}^K{m_k\over \kappa _0+m_k}(\bar{x}_k-\xi _k)^2+\tau _0^2\sum _{k=1}^K\ln (\kappa _0+m_k), \end{aligned}$$

which is formula (4.2). Given cluster \(S_k\), it is easy to see that

$$\begin{aligned} \arg \min _{\mu \in {\mathbb R}}\left[ \sum _{t\in S_k}(x_t-\mu )^2+\kappa _0(\mu -\xi _k)^2\right] ={m_k{\bar{x}}_k+\kappa _0\xi _k\over \kappa _0+m_k}=:\mu _k. \end{aligned}$$
(7.15)

Since

$$\begin{aligned} \sum _{t\in S_k}(x_t-\mu _k)^2+\kappa _0(\mu _k-\xi _k)^2=\sum _{t\in S_k} (x_t-\bar{x}_k)^2+\kappa _0{m_k\over \kappa _0+m_k}(\bar{x}_k-\xi _k)^2, \end{aligned}$$

we obtain (4.3).

To understand the behavior of the segmentation EM and segmentation MM algorithms when \(\nu _0\rightarrow \infty \), recall the segmentation EM iteration formula (7.13). When \(\nu _0\rightarrow \infty \), then \(\ln (\nu ^{(i)}_{k}/{2})-\psi ({\nu ^{(i)}_{k}}/{2})\rightarrow 0\) and \((\tau _{k}^2)^{(i)}\rightarrow \tau _0^2\). Thus, leaving the superscript (i) out of the notation, we get

$$\begin{aligned} \ln h_k(x_t) \rightarrow -{1\over 2}\ln (2\pi (\tau _0^2) )-{1\over 2 \tau ^2_0} (x_t- \mu _k )^2 -{1\over 2(\kappa _0+m_k)}, \end{aligned}$$

where \(\mu _k\) is as in (7.15). The Viterbi alignment is now obtained as

$$\begin{aligned} y_t=\arg \min _{k=1,\ldots ,K}\left[ (x_t- \mu _k)^2+{\tau _0^2 \over m_k+\kappa _0}\right] . \end{aligned}$$

Cite this article

Lember, J., Gasbarra, D., Koloydenko, A. et al. Estimation of Viterbi path in Bayesian hidden Markov models. METRON 77, 137–169 (2019). https://doi.org/10.1007/s40300-019-00152-7
