Abstract
The article studies different methods for estimating the Viterbi path in the Bayesian framework. The Viterbi path is an estimate of the underlying state path in hidden Markov models (HMMs) that has maximum joint posterior probability; it is therefore also called the maximum a posteriori (MAP) path. For an HMM with given parameters, the Viterbi path can easily be found with the Viterbi algorithm. In the Bayesian framework the Viterbi algorithm is not applicable, and several iterative methods can be used instead. We introduce a new EM-type algorithm for finding the MAP path and compare it with various other methods for finding the MAP path, including the variational Bayes approach and MCMC methods. Examples with simulated data are used to compare the performance of the methods. The main focus is on non-stochastic iterative methods, and our results show that the best of those methods work as well as or better than the best MCMC methods. Our results demonstrate that when the primary goal is segmentation, it is more reasonable to perform segmentation directly, treating the transition and emission parameters as nuisance parameters.
References
Beal, M.: Variational algorithms for approximate Bayesian inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London (2003)
Beal, M., Ghahramani, Z.: The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In: Bayesian Statistics, 7 (Tenerife, 2002), pp. 453–463. Oxford University Press, New York (2003)
Benboudjema, D., Pieczynski, W.: Unsupervised statistical segmentation of nonstationary images using triplet Markov fields. IEEE Trans. Pattern Anal. Mach. Intell. 29(8), 1367–1378 (2007)
Besag, J.: An introduction to Markov chain Monte Carlo methods. In: Mathematical Foundations of Speech and Language Processing, IMA Volumes in Mathematics and Applications, vol. 138, pp. 247–270. Springer, New York (2004)
Bishop, C.: Pattern Recognition and Machine Learning, Information Science and Statistics. Springer, New York (2006)
Cappé, O., Moulines, E., Rydén, T.: Inference in Hidden Markov Models. Springer, New York (2005)
Carvalho, L., Lawrence, C.: Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc. Natl. Acad. Sci. 105(9), 3209–3214 (2008)
Christodoulopoulos, C., Goldwater, S., Steedman, M.: Two decades of unsupervised POS induction: how far have we come? In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2010)
Corander, J., Xiong, J., Cui, Y., Koski, T.: Optimal Viterbi Bayesian predictive classification for data from finite alphabets. J. Stat. Plan. Inference 143(2), 261–275 (2013)
Courbot, J.-B., Monfrini, E., Mazet, V., Collet, C.: Oriented triplet Markov fields. Pattern Recognit. Lett. 103, 16–22 (2018)
Fox, C., Roberts, S.: A tutorial on variational Bayesian inference. Artif. Intell. Rev. 38(2), 85–95 (2012)
Ganchev, K., Taskar, B., Pereira, F., Graça, J.: Posterior vs parameter sparsity in latent variable models. In: Advances in Neural Information Processing Systems 22 (2009)
Gao, J., Johnson, M.: A comparison of Bayesian estimators for unsupervised hidden Markov model POS taggers. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, Stroudsburg, PA, USA, pp. 344–352. Association for Computational Linguistics (2008)
Goldwater, S., Griffiths, T.: A fully Bayesian approach to unsupervised part-of-speech tagging. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic (2007)
Gorynin, I., Gangloff, H., Monfrini, E., Pieczynski, W.: Assessing the segmentation performance of pairwise and triplet Markov models. Signal Process. 145, 183–192 (2018)
Hjort, N.L., Holmes, C., Müller, P., Walker, S.G. (eds.): Bayesian Nonparametrics. Cambridge University Press, New York (2010)
Johnson, M.: Why doesn’t EM find good HMM POS-taggers? In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, June 2007, pp. 296–305. Association for Computational Linguistics (2007)
Jordan, M., Ghahramani, Z., Jaakkola, T., Saul, L.: An introduction to variational methods for graphical models. Mach. Learn. 37, 183–233 (1999)
Koloydenko, A., Käärik, M., Lember, J.: On adjusted Viterbi training. Acta Appl. Math. 96(1–3), 309–326 (2007)
Koloydenko, A., Lember, J.: Bridging Viterbi and posterior decoding: a generalized risk approach to hidden path inference based on hidden Markov models. J. Mach. Learn. Res. 15, 1–58 (2014)
Koski, T.: Hidden Markov Models for Bioinformatics, Computational Biology Series, vol. 2. Kluwer Academic Publishers, Dordrecht (2001)
Lanchantin, P., Lapuyade-Lahorgue, J., Pieczynski, W.: Unsupervised segmentation of triplet Markov chains hidden with long-memory noise. Signal Process. 88(5), 1134–1151 (2008)
Lember, J., Koloydenko, A.: Adjusted Viterbi training: a proof of concept. Probab. Eng. Inf. Sci. 21(3), 451–475 (2007)
Lember, J., Koloydenko, A.: The adjusted Viterbi training for hidden Markov models. Bernoulli 14(1), 180–206 (2008)
Lember, J., Kuljus, K., Koloydenko, A.: Theory of segmentation. In: Dymarsky, P. (ed.) Hidden Markov Models, Theory and Applications, pp. 51–84. InTech, New York (2011)
Marin, J., Robert, C.: Bayesian Core: a Practical Approach to Computational Bayesian Statistics. Springer Texts in Statistics. Springer, New York (2007)
Maruotti, A., Rocci, R.: A mixed non-homogeneous hidden Markov model for categorical data, with application to alcohol consumption. Stat. Med. 31(9), 871–886 (2012)
Maruotti, A., Rydén, T.: A semiparametric approach to hidden Markov models under longitudinal observations. Stat. Comput. 19, 381–393 (2009)
McGrory, C.A., Titterington, D.M.: Variational Bayesian analysis for hidden Markov models. Aust. N. Z. J. Stat. 51(2), 227–244 (2009)
McLachlan, G., Krishnan, T.: The EM Algorithm and Extensions. Wiley, Hoboken (2008)
Murphy, K.: Conjugate Bayesian analysis of the Gaussian distribution. Technical report (2007)
Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
Rydén, T.: EM versus Markov chain Monte Carlo for estimation of hidden Markov models: a computational perspective. Bayesian Anal. 3(4), 659–688 (2008)
Samé, A., Ambroise, C., Govaert, G.: A classification EM algorithm for binned data. Comput. Stat. Data Anal. 51(2), 466–480 (2006)
Samé, A., Ambroise, C., Govaert, G.: An online classification EM algorithm based on the mixture model. Stat. Comput. 17(3), 209–218 (2007)
Scott, S.: Bayesian methods for hidden Markov models: recursive computing in the 21st century. J. Am. Stat. Assoc. 97(457), 337–351 (2002)
Smidl, V., Quinn, A.: The Variational Bayes Method in Signal Processing. Springer, Berlin (2006)
Watanabe, S.: Algebraic Geometry and Statistical Learning Theory, Cambridge Monographs on Applied and Computational Mathematics, vol. 25. Cambridge University Press, Cambridge (2009)
Winkler, G.: Image Analysis, Random Fields and Markov Chain Monte Carlo Methods, Applications of Mathematics, vol. 27. Springer, Berlin (2003)
Yau, C., Holmes, C.: A decision-theoretic approach for segmental classification. Ann. Appl. Stat. 7(3), 1814–1835 (2013)
Acknowledgements
This work was supported by Estonian institutional research funding IUT34-5.
Appendix
1.1 General formulae for the segmentation methods studied
Due to our independence assumption, all emission and transition parameters can be estimated separately. In the formulae of this section we use the same notation for the random parameters \(p_{l,j}\), \(\mu _k\) and \(\sigma _k^2\), \(k,l,j\in \{1,\ldots ,K\}\), and the corresponding estimates. The exact meaning can be understood from the context.
Segmentation MM. In the case of Dirichlet priors the matrix \(\theta _{tr}^{(i+1)}\) can be found row-wise; the l-th row is given by the posterior mode:
Emission parameters can be updated independently:
where \(x_{S_k}\) is the subsample of x corresponding to state k in \(y^{(i)}\). Formally, for every sequence \(y\in S^n\) define \(S_k(y)=\{t\in \{1,\dots ,n\}: y_t=k\}\), then \(x_{S_k}=\{x_t: t\in S_k\}\).
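A minimal Python sketch of this update and of the subsample extraction is given below, assuming the standard Dirichlet posterior-mode formula \((\alpha _{lj}+n_{lj}-1)/(\sum _j(\alpha _{lj}+n_{lj})-K)\) for the l-th row (valid when all \(\alpha _{lj}+n_{lj}\ge 1\)); it is an illustration of the idea, not code from the paper.

```python
import numpy as np

def transition_counts(y, K):
    """n_{lj}(y): number of transitions from state l to state j in the path y (states 0..K-1)."""
    n = np.zeros((K, K))
    for a, b in zip(y[:-1], y[1:]):
        n[a, b] += 1
    return n

def dirichlet_row_mode(alpha_row, counts_row):
    """Mode of Dirichlet(alpha + counts): (alpha_j + n_j - 1) / (sum_j (alpha_j + n_j) - K),
    assuming every alpha_j + n_j >= 1."""
    a = alpha_row + counts_row
    return (a - 1.0) / (a.sum() - a.size)

def update_transition_matrix(y, alpha, K):
    """Row-wise posterior-mode update of the transition matrix for a given path y."""
    n = transition_counts(y, K)
    return np.vstack([dirichlet_row_mode(alpha[l], n[l]) for l in range(K)])

def subsamples(x, y, K):
    """x_{S_k}: the observations assigned to state k by the current path y."""
    x, y = np.asarray(x), np.asarray(y)
    return [x[y == k] for k in range(K)]
```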
Bayesian EM. The emission updates are given by
where
In the case of Dirichlet priors the transition updates are given by
Since one of the studied methods (ICM) starts with an initial sequence, in order for the comparison to be fair we let all the other methods start with a sequence as well. Therefore, for a given initial sequence \(y^{(0)}\), define
Variational Bayes approach. Let us have a closer look at the measure \(q_{Y}^{(i+1)}(y)\). We are going to show that there exists an HMM (Z, X) such that for every sequence y, \(q_{Y}^{(i+1)}(y)={P}(Z=y|X=x)\). By definition,
Applying the notation from (2.2) to the current case:
Since \(\ln p(\theta ,y|x)=\ln \pi (\theta )+\ln p(y,x|\theta )-\ln p(x)\), we obtain
where \({\tilde{u}}_{lj}\) is the normalized quantity, \( {\tilde{u}}_{lj}:={u_{lj}\over \sum _j u_{lj}}\), and \({\tilde{h}}_k(x_t):=(\sum _j u_{kj})h_k(x_t)\), if \(t\le n-1\), \({\tilde{h}}_k(x_n)=h_k(x_n)\). Let now (Z, X) be an HMM, where Z is the underlying Markov chain with transition matrix \(({\tilde{u}}_{lj})\) and emission densities are given by \({\tilde{h}}_k\). From (7.6) it follows that \(q_{Y}^{(i+1)}(y)\propto P(Z=y|X=x)\). Since \(q_{Y}^{(i+1)}\) and \(P(Z\in \cdot |X=x)\) are both probability measures, it follows that they are equal. To stress the dependence on iterations, we will denote \(q_{Y}^{(i+1)}(y)=P^{(i+1)}(Z=y|X=x)\).
Let us now calculate \(q_{\theta }\). Let \(\gamma _t^{(i)}(k)\) denote the marginal of \(q_{Y}^{(i)}(y)\),
Observe that
where \(C_1:=\sum _{k}(\ln p_{0k})\gamma _1^{(i)}(k)\). The sum \(\sum _y n_{lj}(y)q_{Y}^{(i)}(y)\) is the expected number of transitions from l to j, so that using the equality \(q_{Y}^{(i)}(y)=P^{(i)}(Z=y|X=x)\), we have
Therefore,
From (7.7) we can see that under \(q_{\theta }^{(i+1)}\) the parameters \(\theta _{tr},\theta ^{1}_{em},\ldots ,\theta ^{K}_{em}\) are still independent and can therefore be updated separately. In the case of Dirichlet transition priors the rows are independent as well. The transition update for the l-th row and the emission update for the k-th component are given by
The whole VB approach is applicable since \(\xi ^{(i)}(l,j)\) and \(\gamma _t^{(i)}(k)\) can be found by the standard forward-backward formulae using \({\tilde{u}}_{lj}^{(i)}\) and \({\tilde{h}}^{(i)}_k\). In fact, it is not difficult to see that in these formulae the original \(u^{(i)}_{lj}\) and \(h^{(i)}_k\) can be used instead of the normalized ones. To summarize, in our setup the VB approach yields the following algorithm for calculating \({\hat{v}}_\mathrm{{VB}}\). For a given initial sequence \(y^{(0)}\), find the vector \(\gamma _t^{(0)}\) and the matrix \(\xi ^{(0)}\) as in (7.5). Given \(\gamma _t^{(i)}\) and \(\xi ^{(i)}\), update \(u^{(i+1)}_{lj}\) and \(h_k^{(i+1)}\) as follows:
With these parameters, \(\xi ^{(i+1)}\) and \(\gamma _t^{(i+1)}\) can be calculated with the usual forward-backward procedure for HMMs. Then update \(u^{(i+2)}_{lj}\) and \(h_k^{(i+2)}\), and so on. After convergence, say after m steps, apply the Viterbi algorithm with transitions \((u_{lj}^{(m)})\) and emission densities \(h^{(m)}_k\). The obtained path maximizes \(q_{Y}^{(m)}(y)\) over all paths, so it is \({\hat{v}}_\mathrm{{VB}}\).
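For completeness, here is a self-contained sketch of the scaled forward-backward recursions used at each step: the textbook algorithm (see, e.g., Rabiner [32]), written for a generic initial weight vector p0, transition weight matrix u and matrix h of emission values; the function and variable names are ours. Since the sum-product recursions are valid for arbitrary nonnegative factors, the unnormalized \(u^{(i)}_{lj}\) and \(h^{(i)}_k\) can be passed in directly, in line with the remark above.

```python
import numpy as np

def forward_backward(p0, u, h):
    """Scaled forward-backward recursions.  p0: (K,) initial weights, u: (K, K)
    transition weights, h[k, t]: emission value of state k at time t.  Returns
    gamma[t, k] and xi[t, l, j] (t = 0..n-2), the marginals of the distribution
    proportional to the product of these factors, and its log normalizing constant."""
    K, n = h.shape
    alpha = np.zeros((n, K))
    beta = np.zeros((n, K))
    c = np.zeros(n)
    alpha[0] = p0 * h[:, 0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, n):                      # forward pass
        alpha[t] = (alpha[t - 1] @ u) * h[:, t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    beta[n - 1] = 1.0
    for t in range(n - 2, -1, -1):             # backward pass
        beta[t] = u @ (h[:, t + 1] * beta[t + 1]) / c[t + 1]
    gamma = alpha * beta                       # gamma[t] sums to one for every t
    xi = np.zeros((n - 1, K, K))
    for t in range(n - 1):
        xi[t] = alpha[t][:, None] * u * (h[:, t + 1] * beta[t + 1])[None, :] / c[t + 1]
    return gamma, xi, np.log(c).sum()
```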
Simulated annealing. Because of independence of the emission and transition parameters, it holds even for \(\beta >1\) that \(p_{\beta }(\theta |y,x)=p_{\beta }(\theta _{tr}|y)p_{\beta }(\theta _{em}|y,x)\), thus the transition and emission parameters can be sampled separately. When the rows of a transition matrix have independent Dirichlet priors, the l-th row can be generated from the Dirichlet distribution with parameters \(\beta (n_{lk}(s)+\alpha _{lk})+1-\beta \), \(k=1,\ldots ,K\). For given \(\theta \), sampling from \(p(y|\theta ,x)\) can be performed in various ways: we use so-called Markovian Backward Sampling (Algorithm 6.1.1 in [6]). To sample from \(p_{\beta }(y|\theta ,x)\), note that
where \({\tilde{p}}_{lj}:={p^{\beta }_{lj} / \sum _{j}p^{\beta }_{lj}}\), and \({\tilde{f}}_k(x_t):=\big (\sum _{j}p^{\beta }_{kj}\big )f^{\beta }_k(x_t)\), \(t=1,\ldots ,n-1\), \({\tilde{f}}_k(x_n):=\big (\sum _{j}p^{\beta }_{0j}\big )f^{\beta }_k(x_n).\) Although the functions \({{\tilde{f}}}_k\) are not densities, one can still use Markovian Backward Sampling.
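As a small illustration of the transition step, the sketch below draws one row of the transition matrix from the tempered Dirichlet posterior with parameters \(\beta (n_{lk}(s)+\alpha _{lk})+1-\beta \); the emission step and the backward sampling of the path are omitted, and the function name and example values are ours.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_tempered_row(counts_row, alpha_row, beta):
    """Draw one row of the transition matrix from the tempered posterior
    Dirichlet(beta * (n_{lk}(s) + alpha_{lk}) + 1 - beta, k = 1..K)."""
    params = beta * (counts_row + alpha_row) + 1.0 - beta
    return rng.dirichlet(params)

# Example: K = 3, the current path s gives transition counts (5, 2, 1) out of state l.
row = sample_tempered_row(np.array([5.0, 2.0, 1.0]), np.ones(3), beta=2.0)
```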
1.2 Non-stochastic segmentation algorithms for the Dirichlet-NIX case
Suppose the emission distribution corresponding to state k is \({\mathcal {N}}(\mu _k,\sigma _k^2)\), where prior distributions for \(\mu _k\) and \(\sigma _k^2\) are given by a normal and scaled inverse-chi-square distribution, respectively:
Here \(\kappa _0\), \(\nu _{0}\) and \(\tau ^2_{0}\) are hyperparameters that might depend on k, but in our example we assume they are equal. Recall the density of \(\mathrm{Inv-}\chi ^2(\nu ,\tau ^2)\):
If \(X\sim \mathrm{Inv-}\chi ^2(\nu ,\tau ^2)\), then
and the mode of the distribution is given by \(\nu \tau ^2/(\nu +2)\). Therefore, if \(\nu _0\) and \(\kappa _0\) are both very large, then \(\sigma _k^2\approx \tau _0^2\) and \(\mu _k\approx \xi _k\), and we get back to the first example. If \(\nu _0\) is very large, then \(\sigma ^2_k\approx \tau _0^2\), so that emission variances are \(\tau _0^2\), but the variance of the mean is approximately \(\tau _0^2 /\kappa _0\).
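These facts are easy to verify numerically. The sketch below codes the log-density of \(\mathrm{Inv-}\chi ^2(\nu ,\tau ^2)\) in its standard parametrization and checks that its maximizer is close to the mode \(\nu \tau ^2/(\nu +2)\); it is textbook material, not code from the paper.

```python
import numpy as np
from scipy.special import gammaln

def scaled_inv_chi2_logpdf(x, nu, tau_sq):
    """Log-density of Inv-chi^2(nu, tau^2):
    p(x) = (nu*tau^2/2)^(nu/2) / Gamma(nu/2) * x^(-(nu/2 + 1)) * exp(-nu*tau^2 / (2x))."""
    half_nu = nu / 2.0
    return (half_nu * np.log(half_nu * tau_sq) - gammaln(half_nu)
            - (half_nu + 1.0) * np.log(x) - half_nu * tau_sq / x)

# The maximizer of the density is nu * tau^2 / (nu + 2), which tends to tau^2 as nu grows.
nu, tau_sq = 50.0, 2.0
grid = np.linspace(0.5, 5.0, 100_000)
numerical_mode = grid[np.argmax(scaled_inv_chi2_logpdf(grid, nu, tau_sq))]
# numerical_mode is approximately 100/52 = 1.923, matching nu * tau_sq / (nu + 2).
```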
Since emission and transition parameters are independent, the transition parameters can be updated as previously, that is as described in Sect. 1. Because the emission components \((\theta ^1_{em},\ldots ,\theta ^K_{em})\) are independent under prior and posterior, it holds that \(p(\theta _{em}|y,x)=\prod _k p(\theta ^k_{em}|x_{S_k})\), where \(x_{S_k}\) is the subsample of x along y corresponding to state k. Let \(m_k(y)\) be the size of \(x_{S_k}\). Let \(\bar{x}_k\) and \(s^2_k\) be the mean and variance of \(x_{S_k}\). Since NIX-priors are conjugate, for any state k the posterior parameters \(\kappa _{k}\), \(\nu _{k}\), \(\mu _{k}\) and \(\tau _{k}^2\) can be calculated as follows:
see [31]. We also need to calculate for every path y the joint probability \(p(x,y)=p(y)p(x|y)\). Due to the independence of transition and emission parameters, p(y) is still as in (1.6) and p(x|y) depends only on the emission parameters. According to the formula for the marginal likelihood (see, e.g., [31]), we obtain
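A minimal sketch of the conjugate NIX update is given below, following the standard textbook formulas (see [31]): here xi0 denotes the prior mean \(\xi _k\), and the formulas are assumed to coincide with (7.9)–(7.11) up to notation.

```python
import numpy as np

def nix_posterior(x_k, kappa0, xi0, nu0, tau0_sq):
    """Standard conjugate update for normal data under a NIX prior:
    mu | sigma^2 ~ N(xi0, sigma^2 / kappa0),  sigma^2 ~ Inv-chi^2(nu0, tau0^2).
    x_k is the subsample x_{S_k}; returns (kappa_k, mu_k, nu_k, tau_k^2)."""
    x_k = np.asarray(x_k, dtype=float)
    m = x_k.size
    xbar = x_k.mean() if m > 0 else 0.0
    ss = ((x_k - xbar) ** 2).sum()                       # sum of squared deviations
    kappa_k = kappa0 + m
    mu_k = (kappa0 * xi0 + m * xbar) / kappa_k
    nu_k = nu0 + m
    tau_k_sq = (nu0 * tau0_sq + ss
                + (kappa0 * m / kappa_k) * (xbar - xi0) ** 2) / nu_k
    return kappa_k, mu_k, nu_k, tau_k_sq
```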
We will now give a more detailed description of the non-stochastic algorithms for Example 2.
Bayesian EM. Start with an initial state sequence \(y^{(0)}\). With this sequence, find for every state k the parameters \(\kappa _{k}\), \(\mu _{k}\), \(\nu _{k}\), \(\tau ^2_{k}\) as defined in (7.9), (7.10), (7.11) and calculate the posterior modes, that is, update
With these parameters calculate the vectors \(\gamma _t^{(1)}\) and matrix \((\xi ^{(1)}(l,j))\) as in (7.3) and (7.4) using the forward-backward formulae. Given \(\gamma _t^{(i)}\) and \(\xi ^{(i)}(l,j)\), the transition parameters are updated according to (7.4). The emission updates are given by (7.2). Let us calculate \(\theta _{em}^{k,(i+1)}\) for the NIX-model. Suppress k from the notation and observe that \(\theta ^{(i+1)}_{em}=(\mu ^{(i+1)},(\sigma ^2)^{(i+1)})\) maximizes the following function over \(\mu \) and \(\sigma ^2\):
The solutions \(\mu _k^{(i+1)}\) and \((\sigma _k^{2})^{(i+1)}\) are given by:
With \(\kappa _0\rightarrow 0\) (non-informative prior), \(\mu _k^{(i+1)}\) is the same as in the standard EM algorithm. Using the updated parameters, calculate \(\gamma _t^{(i+1)}\) and \(\xi ^{(i+1)}(l,j)\). Keep updating until the change in the log-likelihood is below the stopping criterion.
Segmentation EM. Given sequence \(y^{(i)}\), calculate for every state k the parameters \(\kappa _{k}^{(i)}\), \(\mu ^{(i)}_{k}\), \(\nu ^{(i)}_{k}\) and \((\tau _{k}^2)^{(i)}\) using formulae (7.9)–(7.11). With these parameters, calculate \(h^{(i+1)}_k(x_t)\) as follows:
Compute the matrix \((u_{lj}^{(i+1)})\), where \(\ln u^{(i+1)}_{lj}=\psi (\alpha _{lj}+n_{lj}(y^{(i)}))-\psi (\alpha _{l}+n_{l}(y^{(i)}))\). To find \(y^{(i+1)}\), apply the Viterbi algorithm with \(u_{lj}^{(i+1)}\) and \(h^{(i+1)}_k(x_t)\). Keep doing so until no changes occur in the path estimate.
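For concreteness, a small Python sketch of this step is given below: the digamma-based log transition scores \(\ln u^{(i+1)}_{lj}\) and a standard log-domain Viterbi recursion. The matrix log_h of values \(\ln h^{(i+1)}_k(x_t)\) from (7.13) and the initial-state scores log_p0 are assumed to be supplied by the caller; the helper names are ours.

```python
import numpy as np
from scipy.special import digamma

def log_transition_scores(y, alpha):
    """ln u_{lj} = psi(alpha_{lj} + n_{lj}(y)) - psi(alpha_l + n_l(y)),
    where alpha_l and n_l(y) are the row sums of alpha and of the transition counts."""
    K = alpha.shape[0]
    n_lj = np.zeros((K, K))
    for a, b in zip(y[:-1], y[1:]):
        n_lj[a, b] += 1
    a_post = alpha + n_lj
    return digamma(a_post) - digamma(a_post.sum(axis=1, keepdims=True))

def viterbi_log(log_p0, log_u, log_h):
    """Viterbi recursion in the log domain; log_h[k, t] = ln h_k(x_t)."""
    K, n = log_h.shape
    delta = np.empty((n, K))
    back = np.zeros((n, K), dtype=int)
    delta[0] = log_p0 + log_h[:, 0]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_u      # scores[l, j]: come from l, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_h[:, t]
    path = np.empty(n, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```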
Segmentation MM. Given \(y^{(i)}\), calculate \(\mu ^{(i)}_{k}\), \(\nu ^{(i)}_{k}\) and \((\tau _{k}^2)^{(i)}\) using formulae (7.9)–(7.11) and update the posterior modes as follows:
With these parameters find \(y^{(i+1)}\) by the Viterbi algorithm. Keep doing so until no changes occur in the estimated state path.
VB algorithm. Given an initial state sequence \(y^{(0)}\), find \(h^{(1)}_k(x_t)\) and \(u^{(1)}_{lj}\) as in the segmentation EM algorithm. With these parameters, calculate \(\gamma ^{(1)}_t\) and \(\xi ^{(1)}(l,j)\) using the forward-backward formulae. Given the matrix \(\big (\xi ^{(i)}(l,j)\big )\), update the matrix \((u_{lj}^{(i+1)})\) according to (7.8). Given \(\gamma ^{(i)}_t(k)\), the parameters \(\kappa ^{(i)}_{k}\), \(\mu ^{(i)}_{k}\), \(\nu ^{(i)}_{k}\) and \((\tau ^2_{k})^{(i)}\) can be calculated by (see, e.g., [29])
Compute then \(h^{(i+1)}_k(x_t)\) as in (7.13). With the help of \(h^{(i+1)}_k(x_t)\) and \(u^{(i+1)}_{lj}\), find \(\gamma ^{(i+1)}_t\) and \(\xi ^{(i+1)}(l,j)\) using the forward-backward formulae. Then update \(h^{(i+2)}_k(x_t)\) and \(u^{(i+2)}_{lj}\), and so on. When the VB algorithm has converged, say after m steps, apply the Viterbi algorithm with \(u_{lj}^{(m)}\) as transitions and with \(h^{(m)}_k(x_t)\) as emission values.
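In the standard VB treatment (see, e.g., [29]), the hyperparameter update with responsibilities is the soft-count analogue of the hard-count NIX update: the counts \(m_k\) are replaced by \(N_k=\sum _t\gamma ^{(i)}_t(k)\) and the sample moments by their \(\gamma \)-weighted versions. The sketch below shows this analogue and should be read as an illustration of the general form, not necessarily the paper's exact display.

```python
import numpy as np

def nix_posterior_soft(x, gamma_k, kappa0, xi0, nu0, tau0_sq):
    """Soft-count analogue of the NIX update: hard counts m_k are replaced by
    responsibilities gamma_k[t] = gamma_t(k).  Illustration only."""
    x = np.asarray(x, dtype=float)
    gamma_k = np.asarray(gamma_k, dtype=float)
    N_k = gamma_k.sum()
    xbar = (gamma_k * x).sum() / N_k
    ss = (gamma_k * (x - xbar) ** 2).sum()               # weighted squared deviations
    kappa_k = kappa0 + N_k
    mu_k = (kappa0 * xi0 + N_k * xbar) / kappa_k
    nu_k = nu0 + N_k
    tau_k_sq = (nu0 * tau0_sq + ss
                + (kappa0 * N_k / kappa_k) * (xbar - xi0) ** 2) / nu_k
    return kappa_k, mu_k, nu_k, tau_k_sq
```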
1.3 Clustering formulae under normal emissions with NIX priors
From (7.12) it follows that for any sequence \(y'\), the likelihood ratio is given by
When \(\nu _0\rightarrow \infty \) and \(\tau ^2_0>0\), then due to \(\sum _k m_k=\sum _k m'_k=n\) we have
Write \(\nu _{k}\tau ^{2}_{k}\) as
Then
Therefore, when \(\nu _0\rightarrow \infty \), the likelihood ratio in (7.14) converges to
Thus, maximizing p(x|y) corresponds to the following clustering problem: find clusters \(S_1,\ldots , S_K\) that minimize
which is formula (4.2). Given cluster \(S_k\), it is easy to see that
Since
we obtain (4.3).
To understand the behavior of the segmentation EM and segmentation MM algorithms when \(\nu _0\rightarrow \infty \), recall the segmentation EM iteration formula from (7.13). When \(\nu _0\rightarrow \infty \), then \(\ln (\nu ^{(i)}_{k}/{2})-\psi ({\nu ^{(i)}_{k}}/{2})\rightarrow 0\) and \((\tau _{k}^2)^{(i)}\rightarrow \tau _0^2\). Thus, leaving the superscript (i) out of the notation, we get
where \(\mu _k\) is as in (7.15). The Viterbi alignment is now obtained as