Fast matrix computations for functional additive models

Abstract

It is common in functional data analysis to look at a set of related functions: a set of learning curves, a set of brain signals, a set of spatial maps, etc. One way to express relatedness is through an additive model, whereby each individual function \(g_{i}\left( x\right) \) is assumed to be a variation around some shared mean \(f(x)\). Gaussian processes provide an elegant way of constructing such additive models, but suffer from computational difficulties arising from the matrix operations that need to be performed. Recently, Heersink and Furrer have shown that functional additive models give rise to covariance matrices with a specific form they call quasi-Kronecker (QK), whose inverses are relatively tractable. We show that under additional assumptions the two-level additive model leads to a class of matrices we call restricted quasi-Kronecker (rQK), which enjoy many interesting properties. In particular, we formulate matrix factorisations whose complexity scales only linearly in the number of functions in the latent field, an enormous improvement over the cubic scaling of naïve approaches. We describe how to leverage the properties of rQK matrices for inference in latent Gaussian models.

References

  • Barthelmé, S., Trukenbrod, H., Engbert, R., Wichmann, F.: Modeling fixation locations using spatial point processes. J. Vis. 13(12), (2013). doi:10.1167/13.12.1

  • Behseta, S., Kass, R.E., Wallstrom, G.L.: Hierarchical models for assessing variability among functions. Biometrika 92(2), 419–434 (2005)

  • Bolin, D., Lindgren, F.: Excursion and contour uncertainty regions for latent Gaussian models. J. R. Stat. Soc. B (2014)

  • Bornn, L., Shaddick, G., Zidek, J.V.: Modeling nonstationary processes through dimension expansion. J. Am. Stat. Assoc. 107(497), 281–289 (2012)

  • Cheng, W., Dryden, I.L., Huang, X.: Bayesian registration of functions and curves (2013). arXiv:1311.2105

  • Cressie, N., Johannesson, G.: Fixed rank kriging for very large spatial data sets. J. R. Stat. Soc.: Ser. B (Statistical Methodology) 70(1), 209–226 (2008)

  • Filippone, M., Girolami, M.: Pseudo-marginal Bayesian inference for Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell. (2014). doi:10.1109/TPAMI.2014.2316530

  • Furrer, R., Genton, M.G., Nychka, D.: Covariance tapering for interpolation of large spatial datasets. J. Comput. Graph. Stat. 15(3), 502–523 (2006)

  • Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)

  • Granger, C.W.J., Morris, M.J.: Time series modelling and interpretation. J. R. Stat. Soc. Ser. A (General) 139, 246–257 (1976)

  • Heersink, D.K., Furrer, R.: On Moore-Penrose inverses of quasi-Kronecker structured matrices. Linear Algebra Appl. 436, 561–570 (2011)

  • Illian, J., Penttinen, A., Stoyan, H., Stoyan, D.: Statistical Analysis and Modelling of Spatial Point Patterns (Statistics in Practice), 1st edn. Wiley-Interscience, Hoboken (2008)

  • Kaufman, C., Sain, S.: Bayesian functional ANOVA modeling using Gaussian process prior distributions. Bayesian Anal. 5, 123–150 (2009)

  • Kneip, A., Ramsay, J.O.: Combining registration and fitting for functional models. J. Am. Stat. Assoc. 103(483), 1155–1165 (2008)

  • Lindgren, F., Rue, H., Lindström, J.: An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J. R. Stat. Soc.: Ser. B (Statistical Methodology) 73(4), 423–498 (2011)

  • Liu, D., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(1–3), 503–528 (1989)

  • Minka, T.P.: Expectation propagation for approximate Bayesian inference. In: UAI ’01: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann Publishers Inc, San Francisco (2001)

  • Møller, J., Syversveen, A.R., Waagepetersen, R.P.: Log Gaussian Cox processes. Scand. J. Stat. 25(3), 451–482 (1998)

  • Murray, I., Adams, R.P., MacKay, D.J.: Elliptical slice sampling. JMLR: W&CP 9, 541–548 (2010)

  • Neal, R.M.: Monte Carlo implementation of Gaussian process models for Bayesian regression and classification (1997). arXiv:physics/9701026

  • Nickisch, H., Rasmussen, C.E.: Approximations for binary Gaussian process classification. J. Mach. Learn. Res. 9, 2035–2078 (2008)

  • Nocedal, J., Wright, S.: Numerical Optimization (Springer Series in Operations Research and Financial Engineering), 2nd edn. Springer, Berlin (2006)

  • Opper, M., Archambeau, C.: The variational Gaussian approximation revisited. Neural Comput. 21(3), 786–792 (2008)

  • Paciorek, C.J.: Bayesian smoothing with Gaussian processes using Fourier basis functions in the spectralGP package. J. Stat. Softw. 19(2) (2007)

  • Paninski, L., Pillow, J., Lewi, J.: Statistical models for neural encoding, decoding, and optimal stimulus design. Prog. Brain Res. 165, 493–507 (2007)

  • Petersen, K.B., Pedersen, M.S.: The Matrix Cookbook. Version 20121115 (2012)

  • Pouzat, C., Chaffiol, A.: Automatic spike train analysis and report generation. An implementation with R, R2HTML and STAR. J. Neurosci. Methods 181(1), 119–144 (2009)

  • Ramsay, J., Silverman, B.W.: Functional Data Analysis (Springer Series in Statistics), 2nd edn. Springer, Berlin (2005)

  • Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning series). The MIT Press, Cambridge (2005)

  • Rue, H., Held, L.: Gaussian Markov Random Fields: Theory and Applications (Chapman & Hall/CRC Monographs on Statistics & Applied Probability), 1st edn. Chapman and Hall/CRC, London (2005)

  • Rue, H., Martino, S., Chopin, N.: Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc.: Ser. B (Statistical Methodology) 71(2), 319–392 (2009)

  • Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia (2003)

  • Sain, S.R., Nychka, D., Mearns, L.: Functional ANOVA and regional climate experiments: a statistical analysis of dynamic downscaling. Environmetrics 22(6), 700–711 (2011)

  • Telesca, D., Inoue, L.Y.T.: Bayesian hierarchical curve registration. J. Am. Stat. Assoc. 103(481), 328–339 (2008)

  • Tierney, L., Kadane, J.B.: Accurate approximations for posterior moments and marginal densities. J. Am. Stat. Assoc. 81(393), 82–86 (1986)


Acknowledgments

The author wishes to thank Reinhard Furrer for pointing him to Heersink and Furrer (2011), Alexandre Pouget for support, and two anonymous referees for comments.

Author information

Correspondence to Simon Barthelmé.

Appendices

Appendix 1: An orthogonal basis for zero-mean vectors

We prove below Proposition 1, which we restate:

The matrix \(\mathbf {L}=\left[ \begin{array}{l} a\mathbf {1}_{m-1}^{t}\\ b+\mathbf {I}_{m-1} \end{array}\right] \), with \(a=\frac{1}{\sqrt{m}}\) and \(b=-\left( \frac{1}{\left( m-1\right) }\left( 1+\frac{1}{\sqrt{m}}\right) \right) \) is an orthonormal basis for the set of zero-mean vectors \(\mathcal {A}_{m}=\left\{ \mathbf {u}\in \mathbb {R}^{m}|\sum u_{i}=0\right\} \).

Proof

For \(\mathbf {L}_{m\times \left( m-1\right) }\) to be an orthogonal basis for \(\mathcal {A}\), all that is required is that \(\sum _{i=1}^{m}\mathbf {L}_{ij}=0\) for all \(j\) and that \(\mathbf {L}^{t}\mathbf {L}=\mathbf {I}\). To lighten the notation, define \(r=m-1\). The first condition requires

$$\begin{aligned}&a+rb+1 = 0\nonumber \\&a = -(1+rb) \end{aligned}$$
(25)

and the second condition

$$\begin{aligned}&\mathbf {L}^{t}\mathbf {L} = \mathbf {I}\\&a^{2}+rb^{2}+2b+\mathbf {I} = \mathbf {I} \end{aligned}$$

implying that

$$\begin{aligned} a^{2}+rb^{2}+2b=0 \end{aligned}$$
(26)

Injecting (25) into (26), we get:

$$\begin{aligned}&(1+rb)^{2}+rb^{2}+2b = 0\\&\left( r+r^{2}\right) b^{2}+\left( 2+2r\right) b+1 = 0 \end{aligned}$$

This is a second-order polynomial in \(b\), which has real roots if:

$$\begin{aligned}&\left( 2+2r\right) ^{2}-4(r+r^{2}) > 0\\&4+4r > 0 \end{aligned}$$

which is true since \(r=m-1\) is non-negative.

Solving the quadratic equation yields \(b=-\left( \frac{1}{r}+\frac{1}{r\sqrt{m}}\right) \) and substituting into (25) yields \(a=\frac{1}{\sqrt{m}}\).

We can therefore always find values \(a,b\) such that \(\mathbf {L}=\left[ \begin{array}{l} a\mathbf {1}_{m-1}^{t}\\ b+\mathbf {I}_{m-1} \end{array}\right] \) is an orthogonal basis for the linear subspace of zero-mean vectors. Moreover, we can show that the columns of \(\mathbf {L}\) have unit norm:

$$\begin{aligned} \sum _{i}\mathbf {L}_{ij}^{2}&= a^{2}+\left( b+1\right) ^{2}+\left( m-2\right) b^{2}\\&= a^{2}+\left( m-1\right) b^{2}+2b+1\\&= 1 \end{aligned}$$

where the last line follows from (26). The result implies that we can compute an orthonormal basis \(\mathbf {L}\) for \(\mathcal {A}_{m}\) such that matrix-vector products with \(\mathbf {L}\) are \(\mathcal {O}\left( m\right) \).
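As a concrete check of the construction, the following short Python/NumPy sketch (the function names are ours, not from the paper) builds \(\mathbf {L}\) for a small \(m\), verifies that its columns are zero-mean and orthonormal, and exploits the "constant plus identity" structure to implement the \(\mathcal {O}\left( m\right) \) matrix-vector product.

import numpy as np

def zero_mean_basis(m):
    # Orthonormal basis L (m x (m-1)) for zero-mean vectors, as in Proposition 1:
    # first row is the constant a, lower block is b added to every entry of the identity.
    a = 1.0 / np.sqrt(m)
    b = -(1.0 / (m - 1)) * (1.0 + 1.0 / np.sqrt(m))
    return np.vstack([a * np.ones((1, m - 1)),
                      b * np.ones((m - 1, m - 1)) + np.eye(m - 1)])

def L_matvec(u, m):
    # O(m) product L @ u: only the sum of u and a rescaled copy of u are needed.
    a = 1.0 / np.sqrt(m)
    b = -(1.0 / (m - 1)) * (1.0 + 1.0 / np.sqrt(m))
    s = u.sum()
    return np.concatenate([[a * s], b * s + u])

m = 7
L = zero_mean_basis(m)
u = np.random.randn(m - 1)
assert np.allclose(L.sum(axis=0), 0)          # columns are zero-mean
assert np.allclose(L.T @ L, np.eye(m - 1))    # columns are orthonormal
assert np.allclose(L @ u, L_matvec(u, m))     # fast product matches the dense one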

Appendix 2: Derivative of the Laplace approximation

Using the implicit function theorem, an analytical expression for the gradient of the Laplace approximation can be found (see Rasmussen and Williams 2005 for a similar derivation). The Laplace approximation is given by:

$$\begin{aligned} \mathcal {L}\left( \varvec{\theta }\right) =f(\mathbf {x}{}^{\star },\varvec{\theta })-\frac{1}{2}\log \det \mathbf {H}\left( \mathbf {x}{}^{\star },\varvec{\theta }\right) \end{aligned}$$
(27)

where

$$\begin{aligned} f(\mathbf {x},\varvec{\theta })=\log p\left( \mathbf {y}\vert \mathbf {x}\right) +\log p\left( \mathbf {x}\vert \varvec{\theta }\right) \end{aligned}$$
(28)

is the unnormalised log-posterior evaluated at its conditional mode \(\mathbf {x}{}^{\star }\), and \(\mathbf {H}\left( \mathbf {x}{}^{\star },\varvec{\theta }\right) \) is the Hessian of \(f(\mathbf {x},\varvec{\theta })\) with respect to \(\mathbf {x}\), evaluated at \(\left( \mathbf {x}{}^{\star },\varvec{\theta }\right) \). Since \(\mathbf {x}^{\star }\) is a maximum, it satisfies the gradient equation:

$$\begin{aligned} F\left( \mathbf {x},\varvec{\theta }\right)&= 0\\ F(\mathbf {x},\varvec{\theta })&= \nabla \log p\left( \mathbf {y}\vert \mathbf {x}\right) +\nabla \log p\left( \mathbf {x}\vert \varvec{\theta }\right) \end{aligned}$$

Assuming that \(f(\mathbf {x},\varvec{\theta })\) is twice-differentiable and concave in \(\mathbf {x}\), we can define an implicit function \(\mathbf {x}{}^{\star }\left( \varvec{\theta }\right) \) such that \(F(\mathbf {x}{}^{\star }\left( \varvec{\theta }\right) ,\varvec{\theta })=0\) for all \(\varvec{\theta }\), with derivative

$$\begin{aligned} \frac{\partial }{\partial \theta _{j}}\mathbf {x}{}^{\star }&= -\left( \frac{\partial }{\partial \mathbf {x}}F(\mathbf {x},\varvec{\theta })\right) ^{-1} \left( \frac{\partial }{\partial \theta _{j}}F(\mathbf {x},\varvec{\theta })\right) \nonumber \\&= -\mathbf {H}^{-1}\left( \varvec{\varSigma }^{-1}\left( \frac{\partial }{\partial \theta _{j}}\varvec{\varSigma }\right) \varvec{\varSigma }^{-1}\mathbf {x}^{\star }\left( \varvec{\theta }\right) \right) \end{aligned}$$
(29)

To simplify the notation, we write \(\mathbf {G}\) for the matrix \(\varvec{\varSigma }^{-1}\left( \frac{\partial }{\partial \theta _{j}}\varvec{\varSigma }\right) \varvec{\varSigma }^{-1}\). Note that the same matrix \(\mathbf {G}\) appears in the gradient of the Gaussian likelihood with respect to the hyperparameters (Eq. 20). To compute the gradient of \(\mathcal {L}\left( \varvec{\theta }\right) \), we need the derivatives of \(f\left( \mathbf {x}{}^{\star }\left( \varvec{\theta }\right) ,\varvec{\theta }\right) \) with respect to \(\varvec{\theta }\):

$$\begin{aligned} \frac{\partial }{\partial \theta _{j}}f=\left( \frac{\partial }{\partial \mathbf {x}{}^{\star }}f \left( \mathbf {x}{}^{\star },\varvec{\theta }\right) \right) \left( \frac{\partial }{\partial \theta _{j}} \mathbf {x}{}^{\star }\right) +\frac{\partial }{\partial \theta _{j}}f\left( \mathbf {x},\varvec{\theta }\right) \end{aligned}$$

where the first part is 0 since the gradient of \(f\) is 0 at \(\mathbf {x}{}^{\star }\left( \varvec{\theta }\right) \), and

$$\begin{aligned} \frac{\partial }{\partial \theta _{j}}f\left( \mathbf {x},\varvec{\theta }\right)&= \frac{\partial }{\partial \theta _{j}}\left( \log p\left( \mathbf {y}\vert \mathbf {x}\right) +\log p\left( \mathbf {x}\vert \varvec{\theta }\right) \right) \\&= \frac{\partial }{\partial \theta _{j}}\log p\left( \mathbf {x}\vert \varvec{\theta }\right) \end{aligned}$$

This is the gradient of a log-Gaussian density, and we can therefore reuse formula 20 for that part.

The gradient of \(\log \det \mathbf {H}\left( \mathbf {x}{}^{\star },\varvec{\theta }\right) \) is also needed and is sadly more troublesome. A formula for the derivative of the log-det function is given in Petersen and Pedersen (2012):

$$\begin{aligned} \frac{\partial }{\partial \theta _{j}}\log \det \mathbf {H} =\text{ Tr }\left( \mathbf {H}^{-1}\frac{\partial }{\partial \theta _{j}} \mathbf {H}\right) \end{aligned}$$
(30)

The Hessian at the mode is the sum of the Hessian of \(\log p\left( \mathbf {y}\vert \mathbf {x}{}^{\star }\right) \) (which depends on \(\varvec{\theta }\) through the implicit dependency of \(\mathbf {x}{}^{\star }\) on \(\varvec{\theta }\)), and the Hessian of \(\log p\left( \mathbf {x}|\varvec{\theta }\right) \), which equals the inverse covariance matrix \(\varvec{\varSigma }^{-1}\). Some additional algebra yields:

$$\begin{aligned} \frac{\partial }{\partial \theta _{j}}\mathbf {H}&= \text{ diag }\left( \left( \frac{\partial }{\partial \theta _{j}}\mathbf {x}_{i}^{\star }\right) \frac{\partial ^{3}}{\partial x_{i}^{3}}\log p\left( \mathbf {y}\vert \mathbf {x}\right) \right) \\&-\varvec{\varSigma }^{-1}\left( \frac{\partial }{\partial \theta _{j}}\varvec{\varSigma }\right) \varvec{\varSigma }^{-1}\\&= \mathbf {D}-\mathbf {G} \end{aligned}$$

where we have assumed that the Hessian of the log-likelihood is diagonal. Accordingly \(\mathbf {H}^{-1}\) is QK, and so is \(\frac{\partial }{\partial \theta _{j}}\mathbf {H}\) (since it equals the sum of a rQK matrix \(-\mathbf {G}\) and a diagonal perturbation \(\mathbf {D}\)). Computing the trace in Eq. (30) is therefore tractable if slightly painful, and can be done by plugging in Eq. (11) and summing the diagonal elements.
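Equation (30) itself is easy to check numerically. The snippet below is a sanity check only: the matrix-valued function \(H(\theta )\) is an arbitrary positive-definite example, not the Hessian of any particular model, and it compares the trace formula with a finite-difference estimate of the log-determinant derivative.

import numpy as np

def H(theta, n=5):
    # Arbitrary smooth, positive-definite matrix-valued function of theta,
    # used only to illustrate Eq. (30).
    A = np.fromfunction(lambda i, j: np.exp(-theta * np.abs(i - j)), (n, n))
    return A + n * np.eye(n)

theta, eps = 0.7, 1e-6
dH = (H(theta + eps) - H(theta - eps)) / (2 * eps)       # finite-difference dH/dtheta
analytic = np.trace(np.linalg.solve(H(theta), dH))       # Tr(H^{-1} dH/dtheta)
numeric = (np.linalg.slogdet(H(theta + eps))[1]
           - np.linalg.slogdet(H(theta - eps))[1]) / (2 * eps)
assert np.isclose(analytic, numeric, rtol=1e-4)          # the two derivatives agree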

Appendix 3: Computing pointwise confidence bands

In this section we outline a simple method for obtaining a Rao-Blackwellised confidence band from posterior samples of \(p(\varvec{\theta }|\mathbf {y})\). A pointwise confidence band around a latent function \(f(x)\) is a vector function \(c_{\alpha }\left( x\right) =\left( \begin{array}{c} h_{\alpha }\left( x\right) \\ l_{\alpha }\left( x\right) \end{array}\right) \), such that, for all \(x\):

$$\begin{aligned}&p\left( f\left( x\right) >h_{\alpha }\left( x\right) |\mathbf {y}\right) < 1-\alpha \\&p\left( f\left( x\right) \le l_{\alpha }\left( x\right) |\mathbf {y}\right) < \alpha \end{aligned}$$

In latent Gaussian models the posterior over latent functions is a mixture over conditional posteriors:

$$\begin{aligned} p\left( f\left( x\right) |\mathbf {y}\right) =\int p(f|\mathbf {y},\varvec{\theta }) p(\varvec{\theta }|\mathbf {y})\text{ d }\varvec{\theta } \end{aligned}$$
(31)

The conditional posteriors \(p\left( f\left( x\right) |\mathbf {y},\varvec{\theta }\right) \) are either Gaussian or approximated by a Gaussian. If we have obtained samples from \(p(\varvec{\theta }|\mathbf {y})\) (as in Sect. 3.1) we can approximate (31) by a mixture of Gaussians, whose cumulative distribution function is available in closed form. The c.d.f. can be inverted numerically to obtain the values \(h_{\alpha }\left( x\right) \) and \(l_{\alpha }\left( x\right) \). For simultaneous rather than pointwise confidence bands, see Bolin and Lindgren (2014).
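A minimal sketch of this numerical inversion is given below, assuming that for each posterior sample of \(\varvec{\theta }\) we have stored the conditional posterior mean and standard deviation of \(f(x)\) on a grid of \(x\) values; the names, shapes and the \(\alpha /2\) tail convention are illustrative choices, not taken from the paper.

import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def pointwise_band(means, sds, alpha=0.05):
    # means, sds: arrays of shape (n_samples, n_grid) holding, for each posterior
    # sample of theta, the conditional posterior mean and standard deviation of f(x)
    # at every grid point (hypothetical storage layout).
    n_grid = means.shape[1]
    lower, upper = np.empty(n_grid), np.empty(n_grid)
    for j in range(n_grid):
        # Mixture c.d.f. at grid point j: average of the Gaussian c.d.f.s
        cdf = lambda v: norm.cdf(v, loc=means[:, j], scale=sds[:, j]).mean()
        lo = means[:, j].min() - 10 * sds[:, j].max()   # bracketing interval
        hi = means[:, j].max() + 10 * sds[:, j].max()
        lower[j] = brentq(lambda v: cdf(v) - alpha / 2, lo, hi)
        upper[j] = brentq(lambda v: cdf(v) - (1 - alpha / 2), lo, hi)
    return lower, upper

# Toy example: two hyperparameter samples, three grid points
means = np.array([[0.0, 1.0, 2.0], [0.2, 0.9, 2.1]])
sds = np.array([[1.0, 1.0, 1.0], [1.2, 0.8, 1.1]])
low, high = pointwise_band(means, sds, alpha=0.05)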

Appendix 4: Preconditioning for quasi-Newton optimisation

In the whitened parametrisation, the Hessian matrix has the following form (Eq. 24):

$$\begin{aligned} \mathbf {H}_{z}\left( \mathbf {z}\right) =\mathbf {I}+\mathbf {G}^{-t}\mathbf {H}_{l} \left( \mathbf {G}^{t}\mathbf {z}\right) \mathbf {G}^{-1} \end{aligned}$$

Although the whitened parametrisation gives ideal conditioning for the prior term, the conditioning with respect to the likelihood term (\(\mathbf {H}_{l}\)) may be degraded. Preconditioning the problem can help tremendously: given a guess \(\mathbf {z}_{0}\), one precomputes the vector of diagonal values \(\mathbf {d}_{0}=\text{ diag }\left( \mathbf {I}+\mathbf {G}^{-t} \mathbf {H}_{l}\left( \mathbf {G}^{t}\mathbf {z}_{0}\right) \mathbf {G}^{-1}\right) \) and uses \(\text{ diag }\left( \mathbf {d}_{0}\right) \) as a preconditioner. To compute \(\mathbf {d}_{0}\), the following identity is useful:

$$\begin{aligned} \text{ diag }\left( \mathbf {U}\text{ diag }\left( \mathbf {v}\right) \mathbf {U}^{t}\right) =\left( \mathbf {U}\odot \mathbf {U}\right) \mathbf {v} \end{aligned}$$

where \(\odot \) is the element-wise product. In LGMs \(\mathbf {H}_{l}\) is diagonal and so:

$$\begin{aligned} \mathbf {d}_{0}=1+\left( \mathbf {G}^{-t}\odot \mathbf {G}^{-t}\right) \mathbf {G}^{t}\mathbf {z}_{0} \end{aligned}$$
(32)

Injecting (16) into the above and working through some rather tedious algebra shows that \(\mathbf {d}_{0}\) can be computed in the following way. Form the matrix \(\mathbf {X}_{0}\) by stacking successive subsets of length \(n\) from \(\mathbf {G}^{t}\mathbf {z}_{0}\). Rotate using the matrix \(\mathbf {B}\) (Sect. 2.5.2) to form another matrix:

$$\begin{aligned} \mathbf {Y}=\mathbf {X}_{0}\mathbf {B} \end{aligned}$$

then compute \(\mathbf {d}_{0}\) from:

$$\begin{aligned} \mathbf {d}_{0}=\text{ vec }\left( \left[ \begin{array}{cc} \left( \mathbf {U}\odot \mathbf {U}\right) \mathbf {Y}_{:,1}&\left( \mathbf {V}\odot \mathbf {V}\right) \mathbf {Y}_{:,2:m}\end{array}\right] \right) \end{aligned}$$

where we have used Matlab notation for subsets of columns of \(\mathbf {Y}\), and \(\mathbf {U}\) and \(\mathbf {V}\) are as in Corollary 3.
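The diagonal identity above is easily verified numerically; the following short check uses an arbitrary \(\mathbf {U}\) and \(\mathbf {v}\).

import numpy as np

# Check diag(U diag(v) U^t) = (U ⊙ U) v for arbitrary U and v.
rng = np.random.default_rng(0)
U = rng.standard_normal((6, 4))
v = rng.standard_normal(4)
lhs = np.diag(U @ np.diag(v) @ U.T)   # diagonal of U diag(v) U^t
rhs = (U * U) @ v                     # element-wise square of U, times v
assert np.allclose(lhs, rhs)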

Cite this article

Barthelmé, S. Fast matrix computations for functional additive models. Stat Comput 25, 47–63 (2015). https://doi.org/10.1007/s11222-014-9490-0
