Abstract
It is common in functional data analysis to look at a set of related functions: a set of learning curves, a set of brain signals, a set of spatial maps, etc. One way to express relatedness is through an additive model, whereby each individual function \(g_{i}\left( x\right) \) is assumed to be a variation around some shared mean \(f(x)\). Gaussian processes provide an elegant way of constructing such additive models, but suffer from computational difficulties arising from the matrix operations that need to be performed. Recently Heersink and Furrer showed that functional additive models give rise to covariance matrices with a specific structure they called quasi-Kronecker (QK), whose inverses are relatively tractable. We show that under additional assumptions the two-level additive model leads to a class of matrices we call restricted quasi-Kronecker (rQK), which enjoy many interesting properties. In particular, we formulate matrix factorisations whose complexity scales only linearly in the number of functions in the latent field, an enormous improvement over the cubic scaling of naïve approaches. We describe how to leverage the properties of rQK matrices for inference in latent Gaussian models.
References
Barthelmé, S., Trukenbrod, H., Engbert, R., Wichmann, F.: Modeling fixation locations using spatial point processes. J. Vis. 13(12) (2013). doi:10.1167/13.12.1
Behseta, S., Kass, R.E., Wallstrom, G.L.: Hierarchical models for assessing variability among functions. Biometrika 92(2), 419–434 (2005)
Bolin, D., Lindgren, F.: Excursion and contour uncertainty regions for latent Gaussian models. J. R. Stat. Soc. B (2014)
Bornn, L., Shaddick, G., Zidek, J.V.: Modeling nonstationary processes through dimension expansion. J. Am. Stat. Assoc. 107(497), 281–289 (2012)
Cheng, W., Dryden, I.L., Huang, X.: Bayesian registration of functions and curves (2013). arXiv:1311.2105
Cressie, N., Johannesson, G.: Fixed rank kriging for very large spatial data sets. J. R. Stat. Soc.: Ser. B (Statistical Methodology) 70(1), 209–226 (2008)
Filippone, M., Girolami, M.: Pseudo-marginal Bayesian inference for Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell. (2014). doi:10.1109/TPAMI.2014.2316530
Furrer, R., Genton, M.G., Nychka, D.: Covariance tapering for interpolation of large spatial datasets. J. Comput. Graph. Stat. 15(3), 502–523 (2006)
Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)
Granger, C.W.J., Morris, M.J.: Time series modelling and interpretation. J. R. Stat. Soc. Ser. A (General) 139, 246–257 (1976)
Heersink, D.K., Furrer, R.: On Moore-Penrose inverses of quasi-Kronecker structured matrices. Linear Algebra Appl. 436, 561–570 (2011)
Illian, J., Penttinen, A., Stoyan, H., Stoyan, D.: Statistical Analysis and Modelling of Spatial Point Patterns (Statistics in Practice), 1st edn. Wiley-Interscience, Hoboken (2008)
Kaufman, C., Sain, S.: Bayesian functional ANOVA modeling using Gaussian process prior distributions. Bayesian Anal. 5, 123–150 (2009)
Kneip, A., Ramsay, J.O.: Combining registration and fitting for functional models. J. Am. Stat. Assoc. 103(483), 1155–1165 (2008)
Lindgren, F., Rue, H., Lindström, J.: An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J. R. Stat. Soc.: Ser. B (Statistical Methodology) 73(4), 423–498 (2011)
Liu, D., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(1–3), 503–528 (1989)
Minka, T.P.: Expectation propagation for approximate Bayesian inference. In: UAI ’01: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann Publishers Inc, San Francisco (2001)
Møller, J., Syversveen, A.R., Waagepetersen, R.P.: Log Gaussian Cox processes. Scand. J. Stat. 25(3), 451–482 (1998)
Murray, I., Adams, R.P., MacKay, D.J.: Elliptical slice sampling. JMLR: W&CP 9, 541–548 (2010)
Neal, R. M.: Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification (1997). arXiv:physics/9701026
Nickisch, H., Rasmussen, C.E.: Approximations for binary Gaussian process classification. J. Mach. Learn. Res. 9, 2035–2078 (2008)
Nocedal, J., Wright, S.: Numerical Optimization (Springer Series in Operations Research and Financial Engineering), 2nd edn. Springer, Berlin (2006)
Opper, M., Archambeau, C.: The variational Gaussian approximation revisited. Neural Comput. 21(3), 786–792 (2008)
Paciorek, C.J.: Bayesian smoothing with Gaussian processes using Fourier basis functions in the spectralGP package. J. Stat. Softw. 19(2) (2007)
Paninski, L., Pillow, J., Lewi, J.: Statistical models for neural encoding, decoding, and optimal stimulus design. Prog. Brain Res. 165, 493–507 (2007)
Petersen, K.B., Pedersen, M.S.: The matrix cookbook. Version 20121115 (2012)
Pouzat, C., Chaffiol, A.: Automatic spike train analysis and report generation. An implementation with R, R2HTML and STAR. J. Neurosci. Methods 181(1), 119–144 (2009)
Ramsay, J., Silverman, B.W.: Functional Data Analysis (Springer Series in Statistics), 2nd edn. Springer, Berlin (2005)
Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning series). The MIT Press, Cambridge (2005)
Rue, H., Held, L.: Gaussian Markov Random Fields: Theory and Applications (Chapman & Hall/CRC Monographs on Statistics & Applied Probability), 1st edn. Chapman and Hall/CRC, London (2005)
Rue, H., Martino, S., Chopin, N.: Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc.: Ser. B (Statistical Methodology) 71(2), 319–392 (2009)
Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia (2003)
Sain, S.R., Nychka, D., Mearns, L.: Functional ANOVA and regional climate experiments: a statistical analysis of dynamic downscaling. Environmetrics 22(6), 700–711 (2011)
Telesca, D., Inoue, L.Y.T.: Bayesian hierarchical curve registration. J. Am. Stat. Assoc. 103(481), 328–339 (2008)
Tierney, L., Kadane, J.B.: Accurate approximations for posterior moments and marginal densities. J. Am. Stat. Assoc. 81(393), 82–86 (1986)
Acknowledgments
The author wishes to thank Reinhard Furrer for pointing him to Heersink and Furrer (2011), Alexandre Pouget for support, and two anonymous referees for comments.
Appendices
Appendix 1: An orthogonal basis for zero-mean vectors
We prove below proposition 1, which we restate:
The matrix \(\mathbf {L}=\left[ \begin{array}{l} a\mathbf {1}_{m-1}^{t}\\ b+\mathbf {I}_{m-1} \end{array}\right] \), with \(a=\frac{1}{\sqrt{m}}\) and \(b=-\left( \frac{1}{\left( m-1\right) }\left( 1+\frac{1}{\sqrt{m}}\right) \right) \) is an orthonormal basis for the set of zero-mean vectors \(\mathcal {A}_{m}=\left\{ \mathbf {u}\in \mathbb {R}^{m}|\sum u_{i}=0\right\} \).
Proof
For \(\mathbf {L}_{m\times \left( m-1\right) }\) to be an orthonormal basis for \(\mathcal {A}\), all that is required is that \(\sum _{i=1}^{m}\mathbf {L}_{ij}=0\) for all \(j\) and that \(\mathbf {L}^{t}\mathbf {L}=\mathbf {I}\). To lighten the notation, define \(r=m-1\). The first condition requires
\(a+rb+1=0 \qquad (25)\)
and the second condition (orthogonality of distinct columns)
\(a^{2}+rb^{2}+2b=0 \qquad (26)\)
implying, after substituting \(a=-\left( 1+rb\right) \) from (25) into (26), that
\(rmb^{2}+2mb+1=0.\)
This is a second-order polynomial in \(b\), and it has real roots if its discriminant is non-negative:
\(\left( 2m\right) ^{2}-4rm=4m\left( m-r\right) =4m\ge 0,\)
which is true since \(r=m-1\) is non-negative, so that \(m=r+1\ge 1\).
Solving the quadratic equation yields \(b=-\left( \frac{1}{r}+\frac{1}{r\sqrt{m}}\right) \), and substituting into (25) yields \(a=\frac{1}{\sqrt{m}}\).
We can therefore always find values \(a,b\) such that \(\mathbf {L}=\left[ \begin{array}{l} a\mathbf {1}_{m-1}^{t}\\ b+\mathbf {I}_{m-1} \end{array}\right] \) is an orthogonal basis for the linear subspace of zero-mean vectors. Moreover, we can show that the columns of \(\mathbf {L}\) have unit norm:
\(\left\Vert \mathbf {L}_{\cdot j}\right\Vert ^{2}=a^{2}+rb^{2}+2b+1=1,\)
where the last equality follows from (26). The result implies that we can compute an orthonormal basis \(\mathbf {L}\) for \(\mathcal {A}\) such that matrix-vector products with \(\mathbf {L}\) are \(\mathcal {O}\left( m\right) \).
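As an illustration of the proposition (not part of the original text), the construction can be sketched in NumPy; the helper names `zero_mean_basis` and `L_matvec` are hypothetical:

```python
import numpy as np

def zero_mean_basis(m):
    """Orthonormal basis L (m x (m-1)) of the zero-mean subspace,
    built as in the proposition: top row a*1^t, below b + I."""
    a = 1.0 / np.sqrt(m)
    b = -(1.0 / (m - 1)) * (1.0 + 1.0 / np.sqrt(m))
    return np.vstack([a * np.ones((1, m - 1)), b + np.eye(m - 1)])

def L_matvec(u):
    """O(m) product L @ u without forming L: the structure of L means
    (L u)_1 = a * sum(u), and the remaining entries are b * sum(u) + u."""
    m = u.size + 1
    a = 1.0 / np.sqrt(m)
    b = -(1.0 / (m - 1)) * (1.0 + 1.0 / np.sqrt(m))
    s = u.sum()
    return np.concatenate(([a * s], b * s + u))

m = 7
L = zero_mean_basis(m)
assert np.allclose(L.sum(axis=0), 0.0)      # columns lie in the zero-mean subspace
assert np.allclose(L.T @ L, np.eye(m - 1))  # columns are orthonormal
u = np.linspace(-1.0, 1.0, m - 1)
assert np.allclose(L @ u, L_matvec(u))      # O(m) matvec agrees with the dense product
```

The last function makes the \(\mathcal {O}\left( m\right) \) claim concrete: only a running sum of \(\mathbf {u}\) is needed, never the dense matrix.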
Appendix 2: Derivative of the Laplace approximation
Using the implicit function theorem, an analytical expression for the gradient of the Laplace approximation can be found (see Rasmussen and Williams 2005 for a similar derivation). The Laplace approximation is given by:
\(\mathcal {L}\left( \varvec{\theta }\right) =f\left( \mathbf {x}{}^{\star },\varvec{\theta }\right) -\frac{1}{2}\log \det \mathbf {H}\left( \mathbf {x}{}^{\star },\varvec{\theta }\right) +\frac{n}{2}\log 2\pi ,\)
where
\(f\left( \mathbf {x},\varvec{\theta }\right) =\log p\left( \mathbf {y}|\mathbf {x}\right) +\log p\left( \mathbf {x}|\varvec{\theta }\right) \)
is the unnormalised log-posterior evaluated at its conditional mode \(\mathbf {x}{}^{\star }\), and \(\mathbf {H}\left( \mathbf {x}{}^{\star },\varvec{\theta }\right) \) is the negative Hessian of \(f(\mathbf {x},\varvec{\theta })\) with respect to \(\mathbf {x}\) evaluated at \(\left( \mathbf {x}{}^{\star },\varvec{\theta }\right) \). Since \(\mathbf {x}^{\star }\) is a maximum it satisfies the gradient equation:
\(F\left( \mathbf {x}{}^{\star },\varvec{\theta }\right) =\left. \nabla _{\mathbf {x}}f\left( \mathbf {x},\varvec{\theta }\right) \right| _{\mathbf {x}=\mathbf {x}^{\star }}=\mathbf {0}.\)
Assuming that \(f(\mathbf {x},\varvec{\theta })\) is twice-differentiable and concave in \(\mathbf {x}\), we can define an implicit function \(\mathbf {x}{}^{\star }\left( \varvec{\theta }\right) \) such that \(F(\mathbf {x}{}^{\star }\left( \varvec{\theta }\right) ,\varvec{\theta })=\mathbf {0}\) for all \(\varvec{\theta }\), with derivative
\(\frac{\partial \mathbf {x}{}^{\star }}{\partial \theta _{j}}=\mathbf {H}\left( \mathbf {x}{}^{\star },\varvec{\theta }\right) ^{-1}\frac{\partial F}{\partial \theta _{j}}\left( \mathbf {x}{}^{\star },\varvec{\theta }\right) .\)
To simplify the notation we will write \(\mathbf {G}\) for the matrix \(\varvec{\varSigma }^{-1}\left( \frac{\partial }{\partial \theta _{j}}\varvec{\varSigma }\right) \varvec{\varSigma }^{-1}\). Note that the same matrix \(\mathbf {G}\) appears in the gradient of the Gaussian likelihood with respect to the hyperparameters (Eq. 20). To compute the gradient of \(\mathcal {L}\left( \varvec{\theta }\right) \), we need the derivatives of \(f\left( \mathbf {x}{}^{\star }\left( \varvec{\theta }\right) ,\varvec{\theta }\right) \) with respect to \(\varvec{\theta }\):
\(\frac{\mathrm {d}}{\mathrm {d}\theta _{j}}f\left( \mathbf {x}{}^{\star }\left( \varvec{\theta }\right) ,\varvec{\theta }\right) =\left( \nabla _{\mathbf {x}}f\right) ^{t}\frac{\partial \mathbf {x}{}^{\star }}{\partial \theta _{j}}+\frac{\partial f}{\partial \theta _{j}},\)
where the first part is 0 since the gradient of \(f\) is \(\mathbf {0}\) at \(\mathbf {x}{}^{\star }\left( \varvec{\theta }\right) \), and
\(\frac{\partial f}{\partial \theta _{j}}=\frac{\partial }{\partial \theta _{j}}\log p\left( \mathbf {x}{}^{\star }|\varvec{\theta }\right) .\)
This is the gradient of a log-Gaussian density, and we can therefore reuse formula 20 for that part.
The gradient of \(\log \det \mathbf {H}\left( \mathbf {x}{}^{\star },\varvec{\theta }\right) \) is also needed and is sadly more troublesome. A formula for the derivative of the log-det function is given in Petersen and Pedersen (2012):
\(\frac{\partial }{\partial \theta _{j}}\log \det \mathbf {H}=\text{ tr }\left( \mathbf {H}^{-1}\frac{\partial \mathbf {H}}{\partial \theta _{j}}\right) . \qquad (30)\)
The negative Hessian at the mode is the sum of the negative Hessian of \(\log p\left( \mathbf {y}\vert \mathbf {x}{}^{\star }\right) \) (which depends on \(\varvec{\theta }\) through the implicit dependency of \(\mathbf {x}{}^{\star }\) on \(\varvec{\theta }\)), and the negative Hessian of \(\log p\left( \mathbf {x}|\varvec{\theta }\right) \), which equals the inverse covariance matrix \(\varvec{\varSigma }^{-1}\). Some additional algebra yields:
\(\frac{\partial }{\partial \theta _{j}}\mathbf {H}=-\mathbf {G}+\mathbf {D},\)
where \(\mathbf {D}\) is the diagonal matrix collecting the derivatives of the likelihood Hessian through the dependency of \(\mathbf {x}{}^{\star }\) on \(\varvec{\theta }\), and where we have assumed that the Hessian of the log-likelihood is diagonal. Accordingly \(\mathbf {H}^{-1}\) is QK, and so is \(\frac{\partial }{\partial \theta _{j}}\mathbf {H}\) (since it equals the sum of a rQK matrix \(-\mathbf {G}\) and a diagonal perturbation \(\mathbf {D}\)). Computing the trace in Eq. (30) is therefore tractable if slightly painful, and can be done by plugging in Eq. (11) and summing the diagonal elements.
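As a sanity check on the implicit-function-theorem derivation above, the following toy one-dimensional example (a hypothetical model of my own choosing, not one from the paper: scalar latent \(x\sim N(0,\theta )\) with a Poisson likelihood) computes the Laplace approximation and its analytic gradient, and compares the gradient against central finite differences:

```python
import numpy as np

def laplace_and_grad(y, theta):
    """Laplace approximation L(theta) and its analytic gradient for the
    hypothetical toy model x ~ N(0, theta), y | x ~ Poisson(exp(x)), where
    f(x, theta) = y*x - exp(x) - x^2/(2 theta) - 0.5*log(2 pi theta)
    (dropping the log(y!) constant, which does not depend on theta)."""
    x = 0.0
    for _ in range(50):                      # Newton iterations for the mode x*
        F = y - np.exp(x) - x / theta        # gradient of f in x
        fxx = -np.exp(x) - 1.0 / theta       # Hessian of f in x
        x -= F / fxx
    H = np.exp(x) + 1.0 / theta              # negative Hessian at the mode
    f = y * x - np.exp(x) - x**2 / (2 * theta) - 0.5 * np.log(2 * np.pi * theta)
    L = f - 0.5 * np.log(H) + 0.5 * np.log(2 * np.pi)
    # implicit function theorem: dx*/dtheta = H^{-1} dF/dtheta
    dF_dtheta = x / theta**2
    dx_dtheta = dF_dtheta / H
    # total derivative of f at the mode: only the direct part survives (F(x*) = 0)
    df_dtheta = x**2 / (2 * theta**2) - 0.5 / theta
    # total derivative of H, through both theta and x*(theta)
    dH_dtheta = -1.0 / theta**2 + np.exp(x) * dx_dtheta
    return L, df_dtheta - 0.5 * dH_dtheta / H

y, theta, eps = 3, 1.3, 1e-6
L0, g = laplace_and_grad(y, theta)
Lp, _ = laplace_and_grad(y, theta + eps)
Lm, _ = laplace_and_grad(y, theta - eps)
assert abs(g - (Lp - Lm) / (2 * eps)) < 1e-5   # analytic gradient matches finite differences
```

The same three ingredients appear in the multivariate case: the direct \(\partial f/\partial \theta _{j}\) term, the implicit \(\partial \mathbf {x}{}^{\star }/\partial \theta _{j}\) term, and the trace term from the log-determinant.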
Appendix 3: Computing pointwise confidence bands
In this section we outline a simple method for obtaining a Rao-Blackwellised confidence band from posterior samples of \(p(\varvec{\theta }|\mathbf {y})\). A pointwise confidence band around a latent function \(f(x)\) is a vector function \(c_{\alpha }\left( x\right) =\left( \begin{array}{c} h_{\alpha }\left( x\right) \\ l_{\alpha }\left( x\right) \end{array}\right) \), such that, for all \(x\):
\(p\left( l_{\alpha }\left( x\right) \le f\left( x\right) \le h_{\alpha }\left( x\right) \,|\,\mathbf {y}\right) =1-\alpha .\)
In latent Gaussian models the posterior over latent functions is a mixture over conditional posteriors:
\(p\left( f\left( t\right) |\mathbf {y}\right) =\int p\left( f\left( t\right) |\mathbf {y},\varvec{\theta }\right) p\left( \varvec{\theta }|\mathbf {y}\right) \mathrm {d}\varvec{\theta }. \qquad (31)\)
The conditional posteriors \(p\left( f\left( t\right) |\mathbf {y},\varvec{\theta }\right) \) are either Gaussian or approximated by a Gaussian. If we have obtained samples from \(p(\varvec{\theta }|\mathbf {y})\) (as in Sect. 3.1), we can approximate (31) using a mixture of Gaussians, which has an analytic cumulative distribution function. The c.d.f. can be inverted numerically to obtain values \(h_{\alpha }\left( t\right) \) and \(l_{\alpha }\left( t\right) \). For simultaneous rather than pointwise confidence bands, see Bolin and Lindgren (2014).
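The mixture-c.d.f. inversion can be sketched as follows. The per-sample conditional means and standard deviations at a fixed location \(t\) are simulated here purely for illustration, and the helper names are hypothetical:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Hypothetical stand-ins for the conditional moments of p(f(t) | y, theta_s),
# one pair per posterior sample theta_s of the hyperparameters.
rng = np.random.default_rng(0)
mus = rng.normal(0.5, 0.2, size=200)     # E[f(t) | y, theta_s]
sds = rng.uniform(0.3, 0.5, size=200)    # sd[f(t) | y, theta_s]

def mixture_cdf(q):
    # c.d.f. of the equal-weight Gaussian mixture approximating p(f(t) | y)
    return norm.cdf(q, loc=mus, scale=sds).mean()

def mixture_quantile(p):
    # invert the c.d.f. numerically on a bracket that surely contains the root
    lo = (mus - 10 * sds).min()
    hi = (mus + 10 * sds).max()
    return brentq(lambda q: mixture_cdf(q) - p, lo, hi)

alpha = 0.05
l, h = mixture_quantile(alpha / 2), mixture_quantile(1 - alpha / 2)
assert l < h
assert abs(mixture_cdf(h) - mixture_cdf(l) - (1 - alpha)) < 1e-8
```

Repeating this at each location \(t\) yields the pointwise band \(\left( l_{\alpha }\left( t\right) ,h_{\alpha }\left( t\right) \right) \).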
Appendix 4: Preconditioning for quasi-Newton optimisation
In the whitened parametrisation the Hessian matrix has the following form (Eq. 24)
Although the whitened parametrisation gives ideal conditioning for the prior term, the conditioning with respect to the likelihood term (\(\mathbf {H}_{l}\)) may be degraded. Preconditioning the problem can help tremendously: given a guess \(\mathbf {z}_{0}\), one precomputes the diagonal values of \(\mathbf {d}_{0}=\text{ diag }\left( \mathbf {I}+\mathbf {G}^{-t} \mathbf {H}_{l}\left( \mathbf {G}^{t}\mathbf {z}_{0}\right) \mathbf {G}^{-t}\right) \) and uses \(\text{ diag }\left( \mathbf {d}_{0}\right) \) as a preconditioner. To compute \(\mathbf {d}_{0}\) the following identity is useful:
where \(\odot \) is the element-wise product. In LGMs \(\mathbf {H}_{l}\) is diagonal and so:
We inject Eq. (16) into the above, and some rather tedious algebra shows that \(\mathbf {d}_{0}\) can be computed in the following way. Form the matrix \(\mathbf {X}_{0}\) by stacking successive subsets of length \(n\) from \(\mathbf {G}^{t}\mathbf {z}_{0}\). Rotate using the matrix \(\mathbf {B}\) (Sect. 2.5.2) to form another matrix:
then compute \(\mathbf {d}_{0}\) from:
where we have used Matlab notation for subsets of columns of \(\mathbf {Y}\), and \(\mathbf {U}\) and \(\mathbf {V}\) are as in Corollary 3.
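The element-wise identity invoked above for computing the diagonal cheaply is, I assume (the original equation is truncated), the standard fact that for a diagonal \(\mathbf {H}_{l}\) with diagonal vector \(\mathbf {h}\), \(\text{ diag }\left( \mathbf {A}\mathbf {H}_{l}\mathbf {A}^{t}\right) =\left( \mathbf {A}\odot \mathbf {A}\right) \mathbf {h}\). It can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.normal(size=(n, n))
h = rng.uniform(1.0, 2.0, size=n)        # diagonal of H_l

# Naive: form A H_l A^t and read off its diagonal, O(n^3).
naive = np.diag(A @ np.diag(h) @ A.T)
# Element-wise identity: diag(A H_l A^t) = (A ⊙ A) h, only O(n^2).
fast = (A * A) @ h
assert np.allclose(naive, fast)
```

This is what makes the preconditioner affordable: the diagonal of the whitened Hessian is obtained without ever forming the full matrix product.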
Cite this article
Barthelmé, S. Fast matrix computations for functional additive models. Stat Comput 25, 47–63 (2015). https://doi.org/10.1007/s11222-014-9490-0