Structured prior distributions for the covariance matrix in latent factor models
Abstract
Factor models are widely used for dimension reduction in the analysis of multivariate data. This is achieved through decomposition of a $p \times p$ covariance matrix into the sum of two components. Through a latent factor representation, they can be interpreted as a diagonal matrix of idiosyncratic variances and a shared variation matrix, that is, the product of a $p \times k$ factor loadings matrix and its transpose. If $k \ll p$, this defines a parsimonious factorisation of the covariance matrix. Historically, little attention has been paid to incorporating prior information in Bayesian analyses using factor models where, at best, the prior for the factor loadings is order invariant. In this work, a class of structured priors is developed that can encode ideas of dependence structure about the shared variation matrix. The construction allows data-informed shrinkage towards sensible parametric structures while also facilitating inference over the number of factors. Using an unconstrained reparameterisation of stationary vector autoregressions, the methodology is extended to stationary dynamic factor models. For computational inference, parameter-expanded Markov chain Monte Carlo samplers are proposed, including an efficient adaptive Gibbs sampler. Two substantive applications showcase the scope of the methodology and its inferential benefits.
Keywords— Covariance matrix; Dimension reduction; Intraday gas demand; Latent factor models; Stationary dynamic factor models; Structured prior distributions
1 Introduction
Factor models are widely used as a tool for dimension reduction in explaining the covariance structure of multivariate data. Let $y_i$ be a $p$-dimensional random vector with mean $\mu$ and covariance matrix $\Omega$ for observation unit $i$. The classic factor model posits that the covariances between the components of $y_i$ can be explained by their mutual dependence on a smaller number $k < p$ of unknown common factors $\eta_i$. Specifically, the model expresses $y_i$ as a noisy affine transformation of $\eta_i$,
$y_i = \mu + \Lambda \eta_i + \epsilon_i, \qquad (1)$
in which the $p \times k$ matrix $\Lambda$ is called the factor loadings matrix. The components of the error term $\epsilon_i$ are often called the specific factors (or “uniquenesses”) and their variances are often termed the idiosyncratic variances. In (1), we take the specific and common factors, $\epsilon_i$ and $\eta_i$, to be uncorrelated, zero-mean multivariate normal random variables, with
$\epsilon_i \sim N_p(0, \Sigma), \qquad \eta_i \sim N_k(0, I_k). \qquad (2)$
In general, the common factors are assumed to explain all the shared variation between the components of $y_i$ and so $\Sigma$ is constrained to be a diagonal matrix with positive diagonal elements, $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_p^2)$. It then follows from (1) that the marginal covariance matrix $\Omega = \operatorname{Var}(y_i)$ can be expressed as
$\Omega = \Lambda \Lambda^\top + \Sigma, \qquad (3)$
in which the product $\Lambda\Lambda^\top$ will henceforth be termed the shared variation matrix. If $k \ll p$, the right-hand side of (3) constitutes a parsimonious factorisation of the full covariance matrix.
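To make the decomposition in (3) concrete, the following sketch builds a covariance matrix of this form; the dimensions $p = 6$ and $k = 2$ and all numerical values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 6, 2                                   # illustrative dimensions

Lambda = rng.normal(size=(p, k))              # factor loadings matrix
sigma2 = rng.uniform(0.5, 1.5, size=p)        # idiosyncratic variances

shared = Lambda @ Lambda.T                    # shared variation matrix
Omega = shared + np.diag(sigma2)              # marginal covariance, as in (3)

# The shared variation matrix has rank (at most) k...
assert np.linalg.matrix_rank(shared) == k
# ...while the factorisation uses p*k + p = 18 parameters, versus
# p*(p+1)/2 = 21 in an unconstrained covariance matrix.
assert p * k + p < p * (p + 1) // 2
```

The saving grows rapidly with $p$: for $p = 100$ and $k = 5$, the raw counts are $600$ versus $5050$.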
Factor models originally found widespread use in psychology and other social sciences (e.g. Goldberg, 1990), where a primary motivation was the prospect of domain insight through interpretation of the latent factors. Since their introduction, factor models have been extended in a variety of directions, for example, by allowing sparsity in the matrix of factor loadings for high-dimensional problems (Conti et al., 2014) or by modelling the temporal evolution of variances in stochastic volatility models (Lopes and Carvalho, 2007) or the latent factors in dynamic factor models (Sáfadi and Peña, 2008). Accordingly, the use of factor models and their extensions has spread into a diverse array of other fields, such as finance (Aguilar and West, 2000) and genetics (Carvalho et al., 2008).
In many of these application areas, the modeller is likely to hold prior beliefs about the nature of the dependence between variables. For example, when the observation vector $y_i$ represents repeated measures data, or observations at a collection of spatial locations, it would be reasonable to expect stronger associations between measurements that are closer together in time or space. Indeed, such ideas underpin the class of functional factor models which are tailored to variables that might be better represented as functions over a continuous domain, rather than vector-valued random quantities. Some functional factor models, and their dynamic extensions, have functional factors (e.g. Castellanos et al., 2004; Taylor-Rodriguez et al., 2019) while others have functional factor loadings. In the latter case, if we regard the observations on unit $i$ as measurements on a function over a continuous domain, say $\mathcal{X}$, then we replace (1) with
$y_i(x) = \mu(x) + \sum_{j=1}^{k} \lambda_j(x)\,\eta_{ij} + \epsilon_i(x)$ for $x \in \mathcal{X}$. The key idea is then to treat the factor loading curves $\lambda_j(\cdot)$ as smooth unknown functions of $x$. This can be achieved by regarding each $\lambda_j(x)$ as a linear combination of (orthogonal) basis functions, often modelled using splines, which then implies a particular covariance function for $y_i(\cdot)$. Although some of the research in this direction uses a Bayesian approach to inference (Kowal et al., 2017; Kowal and Canale, 2022), the frequentist treatment of the problem is more common, particularly in financial applications to forecasting yield curves (Hays et al., 2012; Jungbacker et al., 2014). An alternative approach in models with functional factor loadings is to assume each $\lambda_j(\cdot)$ to be a realisation of a stochastic process, for instance, a Gaussian process with a chosen covariance function; see, for example, the dynamic spatial model of Lopes et al. (2008).
Beyond analyses of functional data, most of the work in the Bayesian literature on prior specification for factor models has focused on the problem of developing priors that are exchangeable with respect to the order of the variables in the observation vector (e.g. Leung and Drton, 2016; Chan et al., 2018) with more recent work focusing additionally on allowing sparsity in the factor loadings matrix (Frühwirth-Schnatter et al., 2023). There has been very little work on constructing non-exchangeable prior specifications in a general setting. A notable exception is the structured increasing shrinkage process prior of Schiavon et al. (2022) under which each factor loading in an infinite factorisation model is assigned a spike-and-slab prior where variable-specific meta-covariates are incorporated in the spike probabilities. Other related work appears in the context of vector error correction models, which are another class of reduced rank models, where Koop et al. (2009) develop an informative prior for the matrix of cointegration vectors based on economic theory about the likely cointegration relationships between variables.
In this paper, we treat the factor model as a tool for providing a dimension-reduced parameterisation of a covariance matrix. Based on the insight that the shared variation matrix $\Lambda\Lambda^\top$ in (3) is more amenable to the specification of prior beliefs than the factor loadings matrix $\Lambda$, our main contribution is a framework for incorporating initial beliefs about $\Lambda\Lambda^\top$ in the prior for $\Lambda$ by exploiting the algebraic relationship between the two quantities. By switching the focus from properties of the prior for $\Lambda$ to properties of the prior for $\Lambda\Lambda^\top$ we obtain a methodology that is more flexible than alternatives described in the literature. The prior for the shared variation matrix is induced through a structured prior distribution for the factor loadings matrix which can encode ideas of order-invariance as well as non-exchangeable structure, modelled through dependence on covariance kernels, meta-covariates, pairwise distance metrics and more, within a single principled framework. In particular, the mean of the prior for $\Lambda\Lambda^\top$ can be any positive definite matrix of relevance. For example, it could be a matrix built from phylogenetic distances between nodes on a tree, or even the Gram matrix arising from a Gaussian process covariance function. The interpretability of this prior expectation facilitates straightforward incorporation of meaningful domain expertise. Moreover, the flexible class of priors allows data-informed shrinkage towards a region around the mean within a framework that facilitates inference on the number of factors and hence the extent to which dimension-reduction can be achieved. New theoretical results outline the nature of this shrinkage in the rank-reduced space over which $\Lambda\Lambda^\top$ has support while shedding light on some of the identifiability issues that can hamper computational inference for factor models.
Based on ideas from the literature on infinite factorisation models and parameter-expansion algorithms, we propose an efficient scheme for computational inference where a single Markov chain Monte Carlo (MCMC) run yields information about both the continuous model parameters and the number of factors. Using an unconstrained reparameterisation of stationary vector autoregressions, the methodology is also extended to a class of stationary dynamic factor models for which the shared variation matrix remains meaningfully defined. To the best of our knowledge, this is the first prior, with associated inferential scheme, for a general class of dynamic factor models that constrains inference to the stationary region without imposing further restrictions. This constitutes an important contribution to the Bayesian time-series literature.
The remainder of this paper is organised as follows. After introducing our parameterisation of Bayesian factor models in Section 2, Section 3 presents the general class of structured priors and its components. In Section 4, we extend these ideas to a class of stationary dynamic factor models. Posterior computation is discussed in Section 5, then Section 6 presents a simulation experiment and two substantive applications that illustrate the scope of the methodology and its inferential benefits.
2 Parameterisation of the Bayesian factor model
It is well known that the parameters $\Lambda$ and $\Sigma$ in the marginal covariance matrix $\Omega = \Lambda\Lambda^\top + \Sigma$ of a factor model are not identifiable from the likelihood without imposition of constraints. Indeed, even with the scale of the factors fixed at $\operatorname{Var}(\eta_i) = I_k$ in (2), there remains a rotational invariance. Consider any orthogonal $k \times k$ matrix $Q$. We can pre-multiply the factors in (1) by $Q^\top$ and post-multiply the factor loadings matrix by $Q$; this gives the transformed factors $Q^\top \eta_i$ in (2) the same distribution as the original factors and so the marginal variance in (3) remains unchanged. The factor loadings matrix can therefore only be identified up to an orthogonal transformation. This means there are $pk - k(k-1)/2 + p$ degrees of freedom determining the marginal variance $\Omega$. In an unconstrained model for $\Omega$, the number of degrees of freedom would be $p(p+1)/2$ and so the reduction in the number of parameters is $p(p+1)/2 - \{pk - k(k-1)/2 + p\}$, which is positive if $k < k^*$ where
$k^* = \dfrac{2p + 1 - \sqrt{8p + 1}}{2} \qquad (4)$
is called the Ledermann bound. Notwithstanding rotational invariance, the first identification issue is therefore whether $\Lambda\Lambda^\top$ can be identified uniquely from $\Omega$. Fortunately, Bekker and ten Berge (1997) proved that if $k < k^*$ then $\Lambda\Lambda^\top$, and therefore $\Sigma$, is almost surely globally identifiable. Given identification of $\Lambda\Lambda^\top$, solving the rotation problem would then guarantee unique identification of the factor loadings matrix $\Lambda$.
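The parameter counting behind the Ledermann bound (4) can be sketched as follows; the choice $p = 10$ is illustrative.

```python
import math

def ledermann_bound(p):
    """The Ledermann bound (4)."""
    return (2 * p + 1 - math.sqrt(8 * p + 1)) / 2

def parameter_reduction(p, k):
    # p(p+1)/2 free parameters in an unconstrained covariance matrix,
    # versus p*k loadings, minus k(k-1)/2 rotational degrees of freedom,
    # plus p idiosyncratic variances, in the factor model
    return p * (p + 1) // 2 - (p * k - k * (k - 1) // 2 + p)

p = 10
k_star = ledermann_bound(p)   # exactly 6.0 here, since 8p + 1 = 81
assert k_star == 6.0

# the reduction in the number of parameters is positive precisely when k < k*
for k in range(1, p + 1):
    assert (parameter_reduction(p, k) > 0) == (k < k_star)
```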
In the Bayesian literature, the most common solution to the rotation problem uses the positive diagonal, lower triangular (PLT) constraint; denoting the constrained matrix by $L$ and corresponding factors by $\tilde\eta_i$, this demands $\ell_{jj} > 0$ for $j = 1, \ldots, k$ and $\ell_{ij} = 0$ for $j > i$. Although historically this approach has been widely used in the Bayesian literature (e.g. Geweke and Zhou, 1996; Lopes and West, 2004), we choose not to address the rotational invariance and instead parameterise the model using the unidentified and unconstrained factor loadings matrix $\Lambda$. There are two main motivations.
First, recent work provides theoretical and empirical evidence that imposition of the PLT constraint can adversely affect computational inference. Writing $L_k$ to denote the lower triangular matrix comprising the first $k$ rows and columns of $L$, problems arise when $L_k$ is near singular, which can occur when variables in $y_i$ are highly correlated. When the $j$th diagonal element of $L$ is close to zero, the signs of the loadings in column $j$ are only very weakly identified, producing multiple modes in the posterior corresponding to the different sign flips (e.g. Aßmann et al., 2016; Man and Culpepper, 2022). This can cause poor mixing in MCMC samplers and poor approximation of the posterior for the number of factors (Chan et al., 2018). These problems have been addressed successfully by parameter-expansion techniques, which target a posterior over an expanded space with fewer pathological features (e.g. Ghosh and Dunson, 2009; Chan et al., 2018). Parameterising the model and framing the problem of computational inference in terms of the unidentified factor loadings matrix falls into this class of methods.
Second, parameterisation in terms of the unconstrained and unidentified factor loadings matrix delivers an additional benefit in terms of prior specification. Direct elicitation of a prior for the identified factor loadings in $L$ is difficult as they quantify relationships with latent factors whose interpretation is generally unclear a priori. In contrast, beliefs about the shared variation in linear Gaussian models can be related directly to beliefs about observable quantities. As a result, $\Lambda\Lambda^\top$ is generally a more intuitive parameter for the quantification of covariation. For example, in a spatial problem where the elements of $y_i$ correspond to observations at different locations, a modeller might reasonably structure their beliefs through a covariance matrix for which covariance decays as a function of distance. Fortunately, subject to $k < k^*$, identifiability of $\Lambda$ is not necessary for identification of $\Lambda\Lambda^\top$. Further, under various standard distributions for an unconstrained random matrix $\Lambda$, the first and second order moments of the shared variation matrix $\Lambda\Lambda^\top$ can be calculated in closed form. Through careful specification of the prior for $\Lambda$, a modeller can therefore capture their beliefs about the moments in the prior for the shared variation matrix.
3 Structured prior distributions
3.1 Significance of prior expectation
The prior expectation of the shared variation matrix is defined by $M = \mathrm{E}(\Lambda\Lambda^\top)$. As we detail in Section 3.4, $M$ will be chosen to reflect beliefs about the covariation amongst the elements of $y_i$. In general, it will be a rank $p$ matrix.
For $r \le p$, denote by $\mathcal{S}_r^+$ the set of rank $r$, symmetric, positive semi-definite $p \times p$ matrices and write $\mathcal{S}^+$ for the space of symmetric, positive definite $p \times p$ matrices. The factor loadings matrix is rank $k$ and so the prior distribution for the shared variation matrix only offers non-zero support to $\mathcal{S}_k^+$. Because $\mathcal{S}_k^+$ is not a convex set, it need not contain the prior expectation of $\Lambda\Lambda^\top$. Indeed, as stated previously, $M$ will generally be rank $p$. Therefore, although $M$ represents the centre of prior mass, in general there will be zero density at this point. The significance of $M$ is thus not completely clear. We will elucidate its meaning via an alternative, constrained expectation, as follows.
The Frobenius norm of a matrix is simply the Euclidean norm of its vectorised form. Define the (squared) Frobenius distance between two matrices $A$ and $B$ as the squared Frobenius norm of their difference, $d_F^2(A, B) = \|A - B\|_F^2$.
By analogy with the classic Euclidean expectation, we now define the constrained expectation of $\Lambda\Lambda^\top$ as $\bar{\mathrm{E}}(\Lambda\Lambda^\top) = S^*$ where $S^* = \operatorname{arg\,min}_{S \in \mathcal{S}_k^+} \mathrm{E}\{d_F^2(\Lambda\Lambda^\top, S)\}$.
Theorem 1 below asserts that if the $k$th and $(k+1)$th eigenvalues of $M$ are distinct, the constrained expectation of $\Lambda\Lambda^\top$ is the point in $\mathcal{S}_k^+$ that minimises the Frobenius distance to $M$. However, before proving Theorem 1, we need Proposition 1.
Proposition 1.
Let the spectral decomposition of a matrix $M \in \mathcal{S}^+$ be $M = V D V^\top$, where $D = \operatorname{diag}(d_1, \ldots, d_p)$ is a diagonal matrix of ordered eigenvalues, $d_1 \ge d_2 \ge \cdots \ge d_p > 0$, and $V$ is an orthogonal matrix whose columns comprise the corresponding eigenvectors. Assume that $d_k > d_{k+1}$. Then, for $\Lambda \in \mathbb{R}^{p \times k}$, the matrix product $\Lambda\Lambda^\top$ which minimises the Frobenius distance to $M$ is $V_k D_k V_k^\top$, where $V_k$ is the sub-matrix comprising the first $k$ columns of $V$ and $D_k = \operatorname{diag}(d_1, \ldots, d_k)$. Moreover, the minimum squared Frobenius distance is equal to the sum of the squares of the last $p - k$ eigenvalues of $M$.
The proof of Proposition 1 is provided in the Supplementary Materials.
Theorem 1.
If $M = V D V^\top$ denotes the spectral decomposition of $M = \mathrm{E}(\Lambda\Lambda^\top)$ and $d_k > d_{k+1}$, then the constrained expectation of $\Lambda\Lambda^\top$, $\bar{\mathrm{E}}(\Lambda\Lambda^\top)$, is equal to the matrix product $\Lambda\Lambda^\top$ which minimises the Frobenius distance to $M$. That is, $\bar{\mathrm{E}}(\Lambda\Lambda^\top) = V_k D_k V_k^\top$.
The proof of Theorem 1 is given in the Supplementary Materials. Its significance lies in the suggestion that the prior for $\Lambda$ encourages shrinkage towards the closest matrix in $\mathcal{S}_k^+$ to the rank $p$ prior expectation $M$, hence, for a given structure for $M$, approximating it as closely as possible in rank-reduced form. The Supplementary Materials also consider the case $d_k = d_{k+1}$ and the implications for computational inference.
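The content of Proposition 1 can be checked numerically: the closest rank-$k$ product $\Lambda\Lambda^\top$ to a positive definite target is obtained by truncating its spectral decomposition. The target matrix below is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 5, 2

A = rng.normal(size=(p, p))
M = A @ A.T + p * np.eye(p)          # a generic positive definite "target"

d, V = np.linalg.eigh(M)             # eigh returns ascending eigenvalues
d, V = d[::-1], V[:, ::-1]           # reorder so that d[0] >= d[1] >= ...

M_k = V[:, :k] @ np.diag(d[:k]) @ V[:, :k].T   # closest rank-k matrix

# the minimum squared Frobenius distance equals the sum of squares of the
# discarded eigenvalues d[k], ..., d[p-1], as Proposition 1 states
dist2 = np.sum((M - M_k) ** 2)
assert np.isclose(dist2, np.sum(d[k:] ** 2))

# any other rank-k candidate is no closer
B = rng.normal(size=(p, k))
assert np.sum((M - B @ B.T) ** 2) >= dist2
```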
3.2 A matrix normal prior
A random $p \times k$ matrix $\Lambda$ has a matrix normal distribution with location matrix $M_0$, among-row scale matrix $R$ and among-column scale matrix $C$, written $\Lambda \sim \mathcal{MN}_{p,k}(M_0, R, C)$, if $\operatorname{vec}(\Lambda)$ is multivariate normal and such that $\operatorname{vec}(\Lambda) \sim N_{pk}\{\operatorname{vec}(M_0), C \otimes R\}$. Since $C \otimes R = (cC) \otimes (R/c)$ for any scalar $c > 0$, the overall scale of either $R$ or $C$ can be fixed without loss of generality.
Suppose that we take $\Lambda \sim \mathcal{MN}_{p,k}(0, R, C)$ as a prior for the unconstrained factor loadings matrix and fix the scale of the among-row scale matrix $R$, say through a diagonal element or the trace. As remarked in Section 2, the shared variation matrix $\Lambda\Lambda^\top$ is a quantity which is amenable to the specification of prior beliefs, and so our goal is to derive its moments. Using standard theory for the matrix normal distribution (e.g. Gupta and Nagar, 2000, Chapter 2), it can be shown that the expectation of $\Lambda\Lambda^\top$ is given by
$\mathrm{E}(\Lambda\Lambda^\top) = \operatorname{tr}(C)\, R, \qquad (5)$
while the covariance between $(\Lambda\Lambda^\top)_{ij}$ and $(\Lambda\Lambda^\top)_{lm}$ is $\operatorname{tr}(C^2)(r_{il} r_{jm} + r_{im} r_{jl})$. In the special case where $i = l$ and $j = m$, the variance of $(\Lambda\Lambda^\top)_{ij}$ is then
$\operatorname{Var}\{(\Lambda\Lambda^\top)_{ij}\} = \operatorname{tr}(C^2)(r_{ii} r_{jj} + r_{ij}^2). \qquad (6)$
The derivations of the moments above are provided in the Supplementary Materials.
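The moment formula (5), $\mathrm{E}(\Lambda\Lambda^\top) = \operatorname{tr}(C)R$, can be verified by Monte Carlo simulation; the choices of $R$ and $C$ below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, n_sims = 4, 2, 50_000

# illustrative scale matrices: an AR(1)-type R and a diagonal C
R = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
C = np.diag([1.0, 0.5])
R_half = np.linalg.cholesky(R)
C_half = np.linalg.cholesky(C)

# draw Lambda ~ MN(0, R, C) as R^{1/2} Z (C^{1/2})^T with Z a matrix of
# iid N(0, 1) entries; the matmuls broadcast over the simulation axis
Z = rng.normal(size=(n_sims, p, k))
Lam = R_half @ Z @ C_half.T

mean_LLt = np.einsum('nik,njk->ij', Lam, Lam) / n_sims
assert np.allclose(mean_LLt, np.trace(C) * R, atol=0.05)
```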
The result in (5) is significant because of the interpretation it bestows on the among-row scale matrix $R$ as a standardised version of our prior expectation for the shared variation matrix $\Lambda\Lambda^\top$. Parametric forms are convenient models for covariance matrices because they provide a parsimonious representation of dependence and a way to model the relationships with as-yet unobserved variables. We can therefore model the among-row scale matrix using a parametric form. This encourages shrinkage of the shared variation matrix towards an area around that parametric form but without enforcing the structure as a model constraint.
3.3 A matrix-$t$ prior
The expressions for the means and variances in (5) and (6) under the matrix normal prior reveal a lack of flexibility; recalling that the scale of $R$ is fixed, once the mean $\operatorname{tr}(C)R$ has been chosen, it is clearly not possible to scale up or down all the variances by scaling $C$. As such, we cannot assess independently the overall scale of the mean and the uncertainty around that mean. A matrix-$t$ prior for $\Lambda$ remedies this problem through the introduction of a degree of freedom parameter.
Let $X \sim \mathcal{MN}_{p,k}(0, R, I_k)$ and $S \sim \mathcal{W}_k(\nu + k - 1, I_k)$ be independent random matrices, where $\mathcal{W}_k(\delta, \Psi)$ denotes the $k$-dimensional Wishart distribution with $\delta$ degrees of freedom and scale matrix $\Psi$. If we define
$\Lambda = \sqrt{\nu - 2}\; X S^{-1/2} C^{1/2}, \qquad (7)$
where $\nu > 2$ and $S^{-1/2}$ and $C^{1/2}$ denote symmetric matrix square roots, then the random matrix $\Lambda$ has a matrix-$t$ distribution, written $\Lambda \sim \mathcal{T}_{p,k}(\nu, 0, R, C)$. Again, one can fix the scale of either the among-row scale matrix $R$ or the among-column scale matrix $C$ without loss of generality.
Suppose that we adopt $\Lambda \sim \mathcal{T}_{p,k}(\nu, 0, R, C)$ as a prior for the factor loadings matrix. Expressions for the means, variances and covariances of $\Lambda\Lambda^\top$ are derived in the Supplementary Materials. In particular, we show that $\mathrm{E}(\Lambda\Lambda^\top) = \operatorname{tr}(C)R$ for $\nu > 2$, which is identical to the expression derived under the matrix normal prior. We also show that for a fixed value of $\mathrm{E}(\Lambda\Lambda^\top)$, the variance,
$\operatorname{Var}\{(\Lambda\Lambda^\top)_{ij}\}$, can be increased or decreased by decreasing or increasing $\nu$, respectively. Therefore, by treating $\nu$ as an unknown and assigning it a prior, we can allow the data to influence the degree of shrinkage of the shared variation matrix towards the closest rank $k$ matrix to its mean. Writing $\omega = 1/\nu$, a matrix normal distribution for $\Lambda$ is recovered as $\omega \to 0$. We can therefore allow controlled relaxation of the fixed shrinkage of the matrix normal prior by specifying a prior for $\omega$ with its mode at zero. To this end, we recommend an exponential distribution $\omega \sim \operatorname{Exp}(\lambda_\omega)$. In the special case when $i = j$ and $r_{ii} = 1$, the variance under the matrix-$t$ prior equals that under the matrix normal prior multiplied by a scale factor $c(\omega)$, where $c(\omega)$ is an increasing function in $\omega$ which takes its minimum value of 1 when $\omega = 0$. By trial and improvement, we can therefore use the quantiles of the $\operatorname{Exp}(\lambda_\omega)$ distribution to choose a value for $\lambda_\omega$ which is consistent with our prior beliefs about a chosen quantile in the distribution for the scale factor for various values of $\lambda_\omega$. This is illustrated in the application in Section 6.2.
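One Wishart-mixture construction of a matrix-$t$ loadings prior can be checked by simulation. The sketch below is an assumption-laden illustration: the degrees of freedom $\nu + k - 1$ and the $\sqrt{\nu - 2}$ scaling are choices made so that $\mathrm{E}(\Lambda\Lambda^\top) = \operatorname{tr}(C)R$ matches the matrix normal case, and the dimensions and $\nu$ are arbitrary.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(2)
p, k, nu, n_sims = 4, 2, 8, 50_000

R = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
C = np.diag([1.0, 0.5])
R_half = np.linalg.cholesky(R)
C_half = np.linalg.cholesky(C)

# S ~ W_k(nu + k - 1, I_k) and X ~ MN(0, R, I_k), independently
S = wishart.rvs(df=nu + k - 1, scale=np.eye(k), size=n_sims, random_state=rng)
X = R_half @ rng.normal(size=(n_sims, p, k))

# symmetric inverse square root of each S, via a batched eigendecomposition
w, U = np.linalg.eigh(S)
S_inv_half = (U * w[:, None, :] ** -0.5) @ np.transpose(U, (0, 2, 1))

Lam = np.sqrt(nu - 2) * X @ S_inv_half @ C_half.T
mean_LLt = np.einsum('nik,njk->ij', Lam, Lam) / n_sims

# the first moment agrees with the matrix normal prior: tr(C) * R
assert np.allclose(mean_LLt, np.trace(C) * R, atol=0.1)
```

Smaller values of $\nu$ give heavier tails around the same mean, and the matrix normal prior is recovered in the limit $\nu \to \infty$.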
3.4 Model and prior for $R$
3.4.1 General form
Under either a matrix normal or matrix-$t$ prior for $\Lambda$, $\mathrm{E}(\Lambda\Lambda^\top) = \operatorname{tr}(C)R$. As explained in Section 3.2, since we choose to fix the scale of $R$, we can therefore interpret $R$ as a standardised version of the prior expectation of $\Lambda\Lambda^\top$. For most of the parametric forms described in this section, it will be possible to factorise $M$ or $R$ as a scalar multiple of a matrix $K(\vartheta)$, where $K(\vartheta)$ yields a positive definite matrix with 1s on the diagonal. The vector $\vartheta$ typically contains a small number of unknown hyperparameters to which we assign a prior.
3.4.2 Specific examples
The simplest structure for $R$ would take $K(\vartheta) = I_p$, giving $R = I_p$. In the matrix normal case when, additionally, $C = I_k$, we obtain the order-invariant prior presented in Leung and Drton (2016) for the identified factor loadings matrix $L$, in which $\ell_{ij} \sim N(0, 1)$ for $i > j$ and $\ell_{jj} \sim \chi_{k-j+1}$ for $j = 1, \ldots, k$. Although taking $K(\vartheta) = I_p$ will give an order-invariant prior for $\Lambda\Lambda^\top$, a more general distribution that is exchangeable with respect to the order of the variables in $y_i$ arises by taking $R$ to be a two-parameter exchangeable matrix. Since we constrain the scale of $R$, this yields $R = (1 - \rho) I_p + \rho 1_p 1_p^\top$, where $\rho \in (-1/(p-1), 1)$, and $1_p$ is a $p$-vector of 1s. This is the most general form for a symmetric, positive definite matrix with unit diagonal which is invariant under a common permutation of the rows and columns. If a modeller has nothing in their prior beliefs to distinguish among the elements of $y_i$, the advantage of this specification over the order-invariant prior of Leung and Drton (2016) is that it allows shrinkage of the off-diagonal elements of the shared variation matrix towards non-zero values. This can be particularly helpful when the number of observations is small relative to the dimension of $y_i$.
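A small sketch of the two-parameter exchangeable form just described; the values of $p$ and $\rho$ are illustrative.

```python
import numpy as np

p = 5

def exchangeable(p, rho):
    """Two-parameter exchangeable matrix (1 - rho) * I + rho * 1 1^T."""
    return (1 - rho) * np.eye(p) + rho * np.ones((p, p))

# Eigenvalues are 1 + (p-1)*rho (once) and 1 - rho (p-1 times), so the
# matrix is positive definite exactly when -1/(p-1) < rho < 1.
for rho in (-0.2, 0.0, 0.7):
    w = np.linalg.eigvalsh(exchangeable(p, rho))
    assert np.all(w > 0) == (-1 / (p - 1) < rho < 1)

# invariance under a common permutation of rows and columns
R = exchangeable(p, 0.4)
perm = np.random.default_rng(4).permutation(p)
assert np.allclose(R[np.ix_(perm, perm)], R)
```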
In confirmatory factor analysis, structural zeros are often introduced in the factor loadings matrix based on prior beliefs about the relationships between the factors and the variables in $y_i$. For example, in the prototypical two-factor model, the first $p_1$ variables load only on factor 1 whilst the remaining $p - p_1$ variables load only on factor 2. This would correspond to a block diagonal shared variation matrix $\Lambda\Lambda^\top$. Rather than imposing these beliefs about independent group structure as a model constraint, one could instead encourage shrinkage towards the corresponding structure through the prior. With appropriate ordering of the variables, this could be achieved by taking $K(\vartheta)$ to be block diagonal, in which each block is a two-parameter exchangeable matrix, as described above.
Beyond these simple examples, there are many fields of statistics in which parametric forms are adopted for correlation matrices or their inverses. Typically, they are intended to capture the idea that observations from variables that are closer together, often in time or space, tend to be more strongly (positively) correlated. For instance, the precision matrix for a vector of longitudinal measurements might be assumed to be that of a stationary autoregressive process of order one. We could centre the prior for $\Lambda\Lambda^\top$ on a matrix of this form by taking $K(\vartheta)^{-1}$ to be a tridiagonal matrix with $(i,i)$ element equal to $(1 + \phi^2)/(1 - \phi^2)$ for $i = 2, \ldots, p-1$ and to $1/(1 - \phi^2)$ for $i \in \{1, p\}$, and with first sub- and super-diagonal elements equal to $-\phi/(1 - \phi^2)$, where $\phi \in (-1, 1)$.
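Such a tridiagonal specification can be checked to invert to the familiar AR(1) correlation matrix with $(i,j)$ element $\phi^{|i-j|}$; $p$ and $\phi$ below are arbitrary illustrative values.

```python
import numpy as np

p, phi = 6, 0.6

# AR(1) correlation matrix: K_{ij} = phi^{|i-j|}
K = phi ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

# tridiagonal candidate inverse
Q = np.zeros((p, p))
np.fill_diagonal(Q, (1 + phi ** 2) / (1 - phi ** 2))
Q[0, 0] = Q[-1, -1] = 1 / (1 - phi ** 2)
for i in range(p - 1):
    Q[i, i + 1] = Q[i + 1, i] = -phi / (1 - phi ** 2)

assert np.allclose(Q, np.linalg.inv(K))
```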
In spatial statistics, it is very common to use a spatial correlation function, such as one from the Matérn class. A conventional spatial correlation function will depend only on the distance, and possibly direction, between sites. However, Schmidt et al. (2011) introduced the idea of recording distances in a space of dimension greater than two, by introducing covariates into the correlation function. A special case, termed the projection model, measures the distance between sites $i$ and $j$ in this augmented space using the Mahalanobis metric with respect to an (unknown) symmetric, positive definite matrix $\Psi$, that is,
$d(i, j) = \sqrt{(x_i - x_j)^\top \Psi (x_i - x_j)}. \qquad (8)$
Here $x_i$ is a vector of coordinates for site $i$, comprising its two spatial coordinates and the values of $q$ meta-covariates. This is then combined with an exponential spatial correlation function to give $K(\vartheta)_{ij} = \exp\{-d(i, j)\}$. Omitting the spatial coordinates, these ideas can be extended to non-spatial settings if there are meta-covariates associated with the variables that are thought to influence the relationships between the deviations from the mean of any pair of variables in any vector measurement. Indeed, $K(\vartheta)$ could be any Gaussian process correlation matrix, which gives huge scope for the incorporation of meaningful domain expertise.
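A sketch of a projection-model correlation matrix built from (8); the site coordinates, the single meta-covariate and $\Psi$ are all simulated placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
n_sites = 4

coords = rng.uniform(0, 10, size=(n_sites, 2))   # two spatial coordinates
covars = rng.normal(size=(n_sites, 1))           # one meta-covariate
X = np.hstack([coords, covars])                  # augmented coordinates

A = rng.normal(size=(3, 3))
Psi = A @ A.T + np.eye(3)                        # symmetric positive definite

def mahalanobis(xi, xj, Psi):
    """Mahalanobis distance (8) between augmented coordinate vectors."""
    d = xi - xj
    return np.sqrt(d @ Psi @ d)

# exponential correlation function applied to the distances in (8)
K = np.array([[np.exp(-mahalanobis(X[i], X[j], Psi))
               for j in range(n_sites)] for i in range(n_sites)])

assert np.allclose(np.diag(K), 1.0)              # unit diagonal
assert np.allclose(K, K.T)                       # symmetric
assert np.all(np.linalg.eigvalsh(K) > 0)         # positive definite
```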
More generally, if the distances between variables can be measured by some metric, say $d(i, j)$, this could be used to structure the prior through the specification $K(\vartheta)_{ij} = \exp\{-\vartheta\, d(i, j)\}$ with $\vartheta > 0$. An ecological example is provided in Section 6.2 which uses phylogenetic distances between species in a joint model for species occurrence.
3.5 Model and prior for $C$
Except in confirmatory factor analysis, an appropriate number of factors $k$ is not usually known a priori. Various methods have been proposed in the literature to address this source of uncertainty. This includes techniques to approximate the marginal likelihood of models with different numbers of factors, including path sampling (Dutta and Ghosh, 2013) and Chib’s method (Chib et al., 2006). However, these methods are very computationally expensive, requiring separate Monte Carlo simulations for models with $k = 1, \ldots, k_{\max}$ factors, where $k_{\max}$ is the maximum number of factors that a modeller is prepared to entertain. Lopes and West (2004) propose a reversible jump MCMC algorithm but the proposals for transdimensional moves require pilot MCMC runs for models with $k = 1, \ldots, k_{\max}$ factors, which rapidly becomes untenable for high-dimensional problems. Other methods that attempt to allow simultaneous inference of the number of factors and the model parameters rely on identification of individual zeros in the factor loadings matrix (Frühwirth-Schnatter and Lopes, 2010; Conti et al., 2014; Frühwirth-Schnatter et al., 2023). The approach taken here is similar in spirit, relying on the removal of columns whose contribution to $\Lambda\Lambda^\top$ is negligible.
Due to the challenges in approximating the posterior for $k$, a recent approach that has become popular relies conceptually on an overfitted factor model (Bhattacharya and Dunson, 2011; Legramanti et al., 2020; Schiavon et al., 2022). In theory, infinitely many factors are permitted but cumulative shrinkage priors are used to increasingly shrink columns of the factor loadings matrix towards zero as the column index increases. This allows the infinite factor loadings matrix to be approximated with a finite matrix by omitting columns that have been shrunk close enough to zero. The posterior for the number of non-discarded columns is then used as a proxy for the posterior for $k$. As such, a single MCMC sample yields information about both the model parameters and the number of latent factors.
A popular increasing shrinkage prior is the multiplicative gamma process (MGP) (Bhattacharya and Dunson, 2011). Conditional on a set of hyperparameters, the factor loadings within each column are assumed to be independent normal random variables with zero mean and a column-specific precision parameter. The precisions are a cumulative product of gamma random variables and their prior expectation increases with the column index. A factor is deemed inactive a posteriori if the contribution of its loadings to $\Lambda\Lambda^\top$ is negligible under a chosen truncation criterion. In order to apply these ideas to our structured prior for the shared variation matrix, we take the among-column scale matrix to be diagonal, $C = \operatorname{diag}(c_1, \ldots, c_k)$, and assign the reciprocals of its elements a MGP prior
$c_j^{-1} = \prod_{h=1}^{j} \delta_h, \qquad \delta_1 \sim \operatorname{Ga}(a_1, 1), \quad \delta_h \sim \operatorname{Ga}(a_2, 1) \;\; (h \ge 2), \qquad (9)$
in which the $\delta_h$ are independent. To ensure identifiability of the shared variation matrix and the idiosyncratic variances in the likelihood, we choose to truncate the prior at $k = k^*$, where $k^*$ was defined in (4). Guidelines on the choice of $a_1$ and $a_2$ to ensure the precisions $c_j^{-1}$ are stochastically increasing, so that the loadings are shrunk towards a reasonable neighbourhood of zero, are provided in Durante (2017).
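The MGP prior in (9) can be simulated directly; the hyperparameters $a_1 = 2$ and $a_2 = 3$ below are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(6)
k_star, a1, a2, n_sims = 6, 2.0, 3.0, 20_000

# delta_1 ~ Ga(a1, 1) and delta_h ~ Ga(a2, 1) for h >= 2, independently
deltas = np.empty((n_sims, k_star))
deltas[:, 0] = rng.gamma(a1, 1.0, size=n_sims)
deltas[:, 1:] = rng.gamma(a2, 1.0, size=(n_sims, k_star - 1))

# column precisions 1/c_j are cumulative products of the gammas
precisions = np.cumprod(deltas, axis=1)
c = 1.0 / precisions

# with a2 > 1 the expected precision increases with the column index,
# so higher-indexed columns of Lambda are shrunk more heavily towards zero
assert np.all(np.diff(precisions.mean(axis=0)) > 0)
```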
4 Stationary dynamic factor models
4.1 Bayesian dynamic factor models
In the classic vector autoregressive (VAR) model, the number of parameters grows quadratically with the dimension $p$ of the observation vector. In contrast, the number of parameters in a dynamic factor model only grows quadratically with the dimension $k$ of the latent factors. This makes them a valuable tool for dimension-reduction in multivariate time series analysis. A dynamic factor model has the same observation equation as a static model,
$y_t = \mu + \Lambda \eta_t + \epsilon_t, \qquad \epsilon_t \sim N_p(0, \Sigma). \qquad (10)$
However, the factors in a dynamic model evolve over time according to an order $m$ vector autoregression, VAR($m$). This yields the state equation as
$\eta_t = A_1 \eta_{t-1} + \cdots + A_m \eta_{t-m} + \zeta_t, \qquad \zeta_t \sim N_k(0, \Sigma_\zeta), \qquad (11)$
for $t = 1, 2, \ldots$, in which the observation and factor innovations $\epsilon_t$ and $\zeta_t$ are serially and mutually uncorrelated. As in the static case, it is generally assumed that the common factors explain all of the shared variation and so $\Sigma$ is taken to be diagonal. Various extensions of this model have been presented; for example, Peña and Poncela (2006) incorporate a moving average component in the state equation, Aßmann et al. (2016) introduce lagged factors in (10) and Kaufmann and Schumacher (2019) allow the components of the idiosyncratic innovations $\epsilon_t$ to evolve independently as univariate stationary autoregressions.
4.2 A stationary dynamic factor model
The methodology described in this paper relies on the use of a structured prior distribution for the shared variation component of the marginal variance. The marginal variance of a dynamic factor model is only meaningfully defined when the factors evolve according to a stationary process; hereafter we focus only on situations where this is a reasonable assumption. The manner by which stationarity is enforced is discussed in Section 4.3.
In order to identify the scale of the loadings matrix in a dynamic factor model, the scale of the factors is normally constrained by fixing $\Sigma_\zeta = I_k$. However, to retain the interpretation of $\Lambda\Lambda^\top$ as a shared variation matrix, we instead constrain the stationary covariance matrix $G_0 = \operatorname{Var}(\eta_t)$ so that $G_0 = I_k$ and hence the marginal variance of $y_t$ remains as $\Omega = \Lambda\Lambda^\top + \Sigma$. We can therefore assign a prior to the unconstrained factor loadings matrix $\Lambda$ using precisely the ideas discussed in Section 3 for the static model.
To complete specification of the stationary dynamic factor model, the state equation (11) must be initialised at its stationary distribution. To this end, we augment the parameter space with auxiliary factors $\eta_{1-m}, \ldots, \eta_0$ at times $t = 1-m, \ldots, 0$. Generalising the definition of the stationary variance above, we define $G_\ell = \operatorname{Cov}(\eta_t, \eta_{t-\ell})$ as the lag-$\ell$ autocovariance of $\eta_t$. The initial distribution can then be expressed as $(\eta_{1-m}^\top, \ldots, \eta_0^\top)^\top \sim N_{km}(0, G)$ where $G$ is a positive definite block Toeplitz matrix with $G_{j-i}$ as the block in rows $k(i-1)+1$ to $ki$ and columns $k(j-1)+1$ to $kj$ ($i, j = 1, \ldots, m$) and $G_{-\ell} = G_\ell^\top$ ($\ell = 1, \ldots, m-1$).
4.3 Enforcing the stationarity constraint
Denote by $B$ the backshift operator, $B \eta_t = \eta_{t-1}$. The VAR model governing evolution of the factors can then be written as $A(B) \eta_t = \zeta_t$ where $A(z) = I_k - A_1 z - \cdots - A_m z^m$, for $z \in \mathbb{C}$, is termed the characteristic polynomial. The process is stable if and only if all the roots of $\det\{A(z)\} = 0$ lie outside the unit circle. Because all stable processes are stationary and unstable stationary processes are not generally of interest, this subset of the Cartesian product space $(\mathbb{R}^{k \times k})^m$ in which the $A_i$ lie is often referred to as the stationary region, $\mathcal{C}_{k,m}$. Even for the simplest vector case of an order-1 bivariate autoregression, it has a highly complex geometry.
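The stability condition can be checked numerically via the companion form of the VAR: the roots of $\det\{A(z)\} = 0$ lie outside the unit circle exactly when all eigenvalues of the companion matrix lie inside it. The coefficient matrices below are illustrative.

```python
import numpy as np

def is_stable(A_list):
    """Check stability of a VAR(m) via its km x km companion matrix."""
    m = len(A_list)
    k = A_list[0].shape[0]
    companion = np.zeros((k * m, k * m))
    companion[:k, :] = np.hstack(A_list)          # [A_1, ..., A_m] on top
    companion[k:, :-k] = np.eye(k * (m - 1))      # identity blocks below
    return np.max(np.abs(np.linalg.eigvals(companion))) < 1

A1 = np.array([[0.5, 0.1], [0.0, 0.4]])           # a stable VAR(1)
assert is_stable([A1])
assert not is_stable([np.eye(2) * 1.1])           # explosive: eigenvalues 1.1

# a stable bivariate VAR(2): roots of z^2 - 0.5 z - 0.3 lie inside the
# unit circle for the companion matrix, outside it for det{A(z)}
assert is_stable([0.5 * np.eye(2), 0.3 * np.eye(2)])
```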
Various authors have considered Bayesian inference for stationary dynamic factor models (e.g. Sáfadi and Peña, 2008; Lopes et al., 2008). However, due to the difficulties of designing efficient MCMC samplers with state space constrained to $\mathcal{C}_{k,m}$, the autoregressive coefficient matrices are often assumed to be diagonal, which simplifies the stationarity condition to that of a univariate autoregression. Fortunately, recent work by Heaps (2023) presents a prior for the autoregressive coefficient matrices that is constrained to $\mathcal{C}_{k,m}$ and facilitates routine computational inference. This is based on an unconstrained reparameterisation of the model through two bijective transformations. First, the model parameters $(A_1, \ldots, A_m)$ undergo a recursive mapping which yields a new parameter set $(P_1, \ldots, P_m)$. Here $P_\ell$ is the $\ell$th partial autocorrelation matrix, that is, a standardised version of the conditional cross-covariance between $\eta_t$ and $\eta_{t-\ell}$ given the intervening variables $\eta_{t-1}, \ldots, \eta_{t-\ell+1}$. Each partial autocorrelation matrix lies in the space of $k \times k$ square matrices with singular values less than one. By mapping the singular values of $P_\ell$ from the unit interval to the positive real line, a second mapping then constructs an unconstrained square matrix $Z_\ell$ through $Z_\ell = (I_k - P_\ell P_\ell^\top)^{-1/2} P_\ell$, $\ell = 1, \ldots, m$, in which $(I_k - P_\ell P_\ell^\top)^{-1/2}$ denotes the inverse of the symmetric matrix square root of $I_k - P_\ell P_\ell^\top$. Specification of a prior for the $Z_\ell$, discussed in Section 4.4, and computational inference is now routine owing to the Euclidean geometry of the parameter space.
The inverse mapping from the intermediate reparameterisation to the original parameters involves two recursions. The first outputs the stationary covariance matrix from and the second outputs (and recovers ) from and . It is therefore trivial to modify the reparameterisation to accommodate the constraint that ; one simply omits the first recursion, and calculates from and in the second recursion. We note that the autocovariance matrices needed to characterise the initial distribution of are also by-products of this second recursion.
When follows a VAR process, there is a closed form expression for the original model parameters in terms of or, equivalently, . Dropping the 1-subscript for brevity, we can write and .
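As an illustration of the VAR(1) case, the stationary covariance solves a discrete Lyapunov equation, which is how the closed form can be evaluated in practice. A sketch with purely illustrative parameter values:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Stationary covariance of a stable VAR(1) f_t = Phi f_{t-1} + e_t with
# e_t ~ N(0, Q): G0 solves G0 = Phi G0 Phi' + Q, and the lag-h
# autocovariance is then Phi^h G0.
Phi = np.array([[0.5, 0.1], [0.2, 0.4]])
Q = np.eye(2)
G0 = solve_discrete_lyapunov(Phi, Q)
G1 = Phi @ G0                 # lag-1 autocovariance as a by-product
print(G0)
```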
4.4 Prior for the transformed partial autocorrelation matrices
The stationary dynamic factor model described in the previous sections is invariant under rotations of the factors when compensatory transformations are applied to the factor loadings and parameters of the state equation. In particular, marginalising over the common factors, it can readily be shown that the likelihood evaluated at is the same as the likelihood evaluated at for any orthogonal matrix . In the absence of any meaningful information to distinguish from a priori, we assign a prior which is rotatable, that is and have the same distribution for any orthogonal matrix . To this end, we take the to be independent with , .
5 Posterior computation
5.1 Samplers with fixed truncation level
Consider first the static factor model. For computational inference, we use MCMC to sample from the posterior associated with our unconstrained parameterisation of the model. When the truncation level in the factor loadings matrix is fixed at , we propose a straightforward Gibbs sampler. The full conditional distributions for most unknowns have standard forms and can be sampled directly. However, depending on the structure chosen for the among-row scale matrix , a semi-conjugate prior for the correlation parameter(s) in is not generally available and so we update in a Metropolis-Hastings step. Computational inference is slightly complicated if a matrix-t, rather than matrix normal, prior is adopted for because it is not conjugate to the likelihood. However, sampling becomes straightforward if we make use of the representation of a matrix-t random variable in (7) and augment the state space of the sampler with the matrix . The full conditional distribution for is then matrix-normal while the full conditional distribution for is Wishart. The only other additional unknown is the degree of freedom parameter . If is assigned the prior from Section 3.3, its full conditional distribution is non-standard. Rather than updating the transformed degree of freedom parameter through its own Metropolis-Hastings step, mixing can be greatly improved by performing a joint update of and . The proposal density takes the form in which is a random walk proposal on the log-scale and is the full conditional density for .
Extension of the Gibbs sampler to handle the dynamic factor model is straightforward. The latent factors can be sampled efficiently from their full conditional distribution in a single block using a variety of sampling approaches (Wilkinson and Yeung, 2002, 2004). In the example in Section 6.3 we used the forward-filtering backward-sampling algorithm (e.g. Frühwirth-Schnatter, 1994) but other methods that rely on banded or sparse matrix algorithms will be more efficient in higher-dimensional applications; see, for example, Chan and Jeliazkov (2009). The full conditional distributions for the transformed partial autocorrelation matrices are non-standard. Given a truncation level of , each contains parameters and so it is not generally feasible to update the whole matrix in a single Metropolis-Hastings step. We therefore propose dividing the elements of each into blocks of length and updating the blocks one-at-a-time. In the application in Section 6.3, we found mixing to be greatly improved by using Metropolis-adjusted Langevin (MALA) steps rather than Gaussian random walks.
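For readers unfamiliar with MALA, the following generic sketch (illustrative Python, not the paper's code) shows a single gradient-informed update for one parameter block; the bivariate normal target is only a stand-in for the full conditional of a block of a transformed partial autocorrelation matrix, and the step size eps would be tuned in practice:

```python
import numpy as np

def mala_step(x, logpost, grad, eps, rng):
    """One Metropolis-adjusted Langevin step targeting exp(logpost): a
    gradient-informed Gaussian proposal, corrected by a Metropolis-Hastings
    accept/reject that accounts for the proposal's asymmetry."""
    mean_fwd = x + 0.5 * eps**2 * grad(x)
    y = mean_fwd + eps * rng.standard_normal(x.shape)
    mean_rev = y + 0.5 * eps**2 * grad(y)
    log_q_fwd = -np.sum((y - mean_fwd) ** 2) / (2 * eps**2)
    log_q_rev = -np.sum((x - mean_rev) ** 2) / (2 * eps**2)
    log_alpha = logpost(y) - logpost(x) + log_q_rev - log_q_fwd
    if np.log(rng.uniform()) < log_alpha:
        return y, True
    return x, False

# Illustration on a standard bivariate normal target.
rng = np.random.default_rng(0)
x, accepted, draws = np.zeros(2), 0, []
for _ in range(5000):
    x, acc = mala_step(x, lambda z: -0.5 * z @ z, lambda z: -z, 0.8, rng)
    accepted += acc
    draws.append(x)
print(accepted / 5000)     # acceptance rate
```

The gradient term pushes proposals towards regions of higher posterior density, which is why MALA typically mixes better than a Gaussian random walk of comparable step size.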
Given a large enough fixed truncation level , the MGP prior increasingly shrinks the factor loadings in columns with higher indices if the data suggest their likelihood contribution is negligible. At least conceptually, these columns can be discarded. If there are discardable columns on iteration of the MCMC sampler, the effective number of factors is defined as . The posterior for the effective number of factors can then be used as a proxy for the posterior for . There are various truncation criteria for deciding which columns can be omitted. Bhattacharya and Dunson (2011) consider as inactive any factor whose loadings are all within of zero for some chosen threshold, say . Schiavon and Canale (2020) suggest a more interpretable criterion which, unlike the choice of threshold , is unaffected by the scale of the data and its dimension. Denote by the factor loadings matrix obtained by discarding the columns of from onwards. The basic idea is that if can explain % of the total variability of the data for some , then one decides there are active factors. The choice of proportion then replaces the choice of threshold . This is the truncation criterion we adopt in the applications in Section 6.
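A simplified sketch of this style of truncation criterion, assuming total variability is measured by the trace of the implied covariance matrix (the exact criterion of Schiavon and Canale, 2020, differs in detail):

```python
import numpy as np

def effective_factors(Lambda, sigma2, prop=0.95):
    """Smallest k such that the first k columns of Lambda account for at
    least `prop` of the total variability trace(Lambda Lambda' + Sigma).
    A simplified stand-in for the Schiavon-Canale criterion."""
    total = np.sum(Lambda**2) + np.sum(sigma2)
    explained = np.cumsum(np.sum(Lambda**2, axis=0))
    k = np.searchsorted(explained, prop * total) + 1
    return int(min(k, Lambda.shape[1]))

# Loadings whose column scales decay geometrically: only the first few
# columns carry appreciable variation, so few factors are declared active.
rng = np.random.default_rng(2)
Lam = rng.normal(size=(30, 8)) * (0.5 ** np.arange(8))
print(effective_factors(Lam, np.full(30, 0.01)))
```

Unlike an absolute threshold on the loadings, the proportion in this criterion is unaffected by rescaling the data, which is what makes it the more interpretable choice.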
When the truncation level is fixed, an alternative to Gibbs sampling is to implement off-the-shelf computational inference using standard probabilistic programming software. Code for the applications in Section 6, written in Stan (Carpenter et al., 2017), is given in the Supplementary Materials. Stan uses Hamiltonian Monte Carlo to perform joint updates of all unknowns and so tends to converge faster and mix better than a Gibbs sampler. The caveat is the necessity to fix ; although this can be addressed in a bespoke implementation of Gibbs sampling (see Section 5.2 below), Stan cannot be customised in this way.
5.2 Adaptive Gibbs sampler
Especially in applications where is large, the number of effective factors is often considerably less than the Ledermann bound . Therefore fixing a truncation level which is less than, but in the vicinity of, can be computationally inefficient. In the literature on infinite factor models, a common pragmatic solution is to use an adaptive Gibbs sampler which tunes the truncation level as it proceeds. Adaptation occurs at iteration with probability where and such that the probability of adaptation decreases over the course of the simulation. This is necessary to satisfy the diminishing adaptation condition of Roberts and Rosenthal (2007).
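Assuming the common exponential-decay form for the adaptation probability, p(t) = exp(a0 + a1 t) with a1 < 0 (the constants below are illustrative), the mechanism can be sketched as:

```python
import numpy as np

# Probability of adapting the truncation level at iteration t: the common
# choice p(t) = exp(a0 + a1 * t) with a1 < 0 decays towards zero, which
# yields the diminishing adaptation property. Constants are illustrative.
a0, a1 = -1.0, -5e-4

def adapt_prob(t):
    return float(np.exp(a0 + a1 * t))

rng = np.random.default_rng(3)
adapt_now = rng.uniform() < adapt_prob(1000)   # Bernoulli adaptation decision
print(adapt_prob(0), adapt_prob(10_000))
```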
During an adaptation step, is compared to the current value of . The basic idea is to delete any inactive factors or to add an extra factor if all factors are active. In the static model, implementation is straightforward. If , is reduced to and any inactive factors are deleted along with the corresponding columns of and components of and . If and , is increased by 1, then an extra factor, column of factor loadings, and component of are sampled from their priors. In the dynamic model, the adaptation step is slightly more involved because of the additional complexity in the distribution of the latent factors. Full details of the adaptive Gibbs samplers for both static and dynamic models are given in the Supplementary Materials, along with R code for the applications in Section 6. We also provide details on how the posterior output can be post-processed to obtain samples under a parameterisation in which the factor loadings matrix is identified by the PLT constraint. This greatly aids parameter interpretation and the assessment of MCMC diagnostics.
6 Applications
6.1 Simulation experiment
6.1.1 Simulation settings
In order to investigate posterior sensitivity to the prior specification in the idealised setting in which we know the data were generated from a factor model with known shared variation matrix , we carried out a series of simulation experiments. In the simplified setting in which , denote by the proportion of the total variation in that is explained by the common factors, that is, . Then for given and , the corresponding value of can be computed as . Three different combinations of and , representing different degrees of dimension reduction, were considered: , and so that was equal to 25%, 18.75% and 12.5%, respectively. For each combination of and , we also considered three values for , , giving nine sets of values for in total. We refer to these as simulation settings. Note that we only set for the purposes of simulating the data; for inference, we still allow .
Under each simulation setting, we simulated 24 data sets of size based on an exponential correlation matrix for , taking to be the closest rank- matrix to (see Proposition 1). In the simulation of each data set, in order to fix the value of , we simulated pairs of coordinates , , by sampling uniformly at random from the unit square, and then set the expected shared variation matrix equal to with element where the length-scale parameter was equal to
For a distance of 1 in the unit square, the correlations therefore ranged from 0.05 to 0.20.
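A sketch of the simulation of a single data set (illustrative Python, using an example length-scale from the stated range, and obtaining the rank-k expected shared variation matrix by eigenvalue truncation, i.e. the Frobenius-nearest rank-k approximation):

```python
import numpy as np

rng = np.random.default_rng(4)
p, k, ell = 24, 6, 0.35     # illustrative dimension, rank and length-scale

# Exponential correlation matrix from points placed uniformly in the unit square.
coords = rng.uniform(size=(p, 2))
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
C = np.exp(-D / ell)

# Frobenius-nearest rank-k approximation: keep the k leading eigenpairs.
w, V = np.linalg.eigh(C)
idx = np.argsort(w)[::-1][:k]
C_k = (V[:, idx] * w[idx]) @ V[:, idx].T

# Simulate n observations, adding illustrative idiosyncratic variances.
n = 100
y = rng.multivariate_normal(np.zeros(p), C_k + 0.2 * np.eye(p), size=n)
```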
6.1.2 Prior distributions
Each data set generated under each simulation setting was used to update five prior distributions for . Three are structured matrix normal priors, with different assumptions about , and two are widely used priors in factor analysis which allow inference on the number of factors:
- SN (expcov.): A structured matrix normal prior in which was taken to have the same form as that used to simulate the data, that is, an exponential correlation matrix. The length-scale parameter was taken to be unknown and assigned the distribution .
- SN (fixed): A structured matrix normal prior in which was taken to have the same form as that used to simulate the data, that is, an exponential correlation matrix. The length-scale parameter was fixed at the value used to simulate the data.
- SN (exch.): A structured matrix normal prior in which was taken to have a different form to that used to simulate the data; specifically, we took , , representing a two-parameter exchangeable matrix with trace fixed at . The correlation parameter was taken to be unknown and assigned the distribution where .
- MGP: The original (exchangeable) multiplicative gamma process prior described in Bhattacharya and Dunson (2011), which is a global-local prior. We took and in the multiplicative gamma process for the global precision parameters and in the distribution for the local precision parameters.
- CUSP: A cumulative shrinkage process prior (Legramanti et al., 2020), which is a slab-and-spike prior. We took in the distribution for the slab, for the position of the spike and in the cumulative stick-breaking construction controlling the probability assigned to the spike.
For the three structured matrix normal priors, we chose and in the multiplicative gamma process for the diagonal elements of . In all cases, the idiosyncratic variances were assigned the prior independently for .
6.1.3 Analysis and results
All analyses were run using an adaptive Gibbs sampler, allowing adaptation after 500 iterations and setting the parameters in the diminishing adaptation condition at and . The chains were run for 10K iterations, following a burn-in of 10K, and thinned to retain every fifth sample to reduce computational overheads. For the three structured matrix normal priors and the original MGP prior, we applied the truncation criterion of Schiavon and Canale (2020), retaining as active enough factors to explain at least of the total variation in the data.
Prior | (dim., no. of factors) | Geodesic distance (median) | Geodesic distance (IQR) | Effective factors (median) | Effective factors (IQR)
---|---|---|---|---|---
SN (expcov.) | (24, 6) | 15.88 | 0.38 | 6.34 | 0.56 |
SN (fixed) | (24, 6) | 15.78 | 0.39 | 6.30 | 0.89 |
SN (exch.) | (24, 6) | 15.93 | 0.39 | 6.63 | 0.69 |
MGP | (24, 6) | 16.02 | 0.45 | 6.27 | 0.86 |
CUSP | (24, 6) | 16.25 | 0.36 | 3.13 | 0.57 |
SN (expcov.) | (48, 9) | 21.25 | 0.58 | 7.54 | 1.95 |
SN (fixed) | (48, 9) | 21.18 | 0.65 | 7.32 | 1.47 |
SN (exch.) | (48, 9) | 22.20 | 1.04 | 5.96 | 2.02 |
MGP | (48, 9) | 22.15 | 0.64 | 6.41 | 2.32 |
CUSP | (48, 9) | 22.86 | 0.84 | 3.00 | 0.23 |
SN (expcov.) | (72, 9) | 23.89 | 1.03 | 7.76 | 1.38 |
SN (fixed) | (72, 9) | 23.69 | 1.37 | 7.85 | 1.55 |
SN (exch.) | (72, 9) | 25.81 | 1.87 | 6.15 | 0.92 |
MGP | (72, 9) | 24.99 | 0.94 | 7.21 | 1.81 |
CUSP | (72, 9) | 27.89 | 1.07 | 3.00 | 0.00 |
The (squared) geodesic distance between two symmetric positive definite matrices and is defined as
in which denotes the th eigenvalue of (Lim and Sepulchre, 2019). For simulation experiments in which , Table 1 reports the median and interquartile range across the 24 data sets of the posterior mean of where denotes the value used to simulate the data. Corresponding tables for the simulations where and are presented in the Supplementary Materials. We chose the distance metric because, by construction, is close to singular and so the geodesic distance will be better suited than a metric, such as the Frobenius distance, which ignores the geometry of the space. In particular, it will give large values whenever and do not approximately span the same subspace of , that is, when their null spaces are not approximately the same. We also report the median and interquartile range for the posterior mean of the effective number of factors .
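A sketch of the computation, using the generalised eigenvalues of the pencil formed by the two matrices (illustrative Python):

```python
import numpy as np
from scipy.linalg import eigh

def sq_geodesic_distance(A, B):
    """Squared affine-invariant geodesic distance between SPD matrices:
    the sum of squared logs of the generalised eigenvalues of (B, A),
    i.e. of the eigenvalues of inv(A) @ B."""
    lam = eigh(B, A, eigvals_only=True)   # all positive for SPD inputs
    return float(np.sum(np.log(lam) ** 2))

A = np.array([[2.0, 0.3], [0.3, 1.0]])
print(sq_geodesic_distance(A, A))          # 0 up to rounding
print(sq_geodesic_distance(A, 4.0 * A))    # 2 * log(4)**2
```

Because the distance is a function of the eigenvalues of one matrix in the metric of the other, it diverges as either matrix approaches singularity, which is exactly the sensitivity to near-null spaces described above.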
Focusing first on the posterior mean of the geodesic distance , it is immediately clear from Table 1 and the corresponding tables in the Supplementary Materials that both the medians and interquartile ranges across data sets are typically smallest when we use a prior in which is based on an exponential correlation matrix. As we would expect, this is especially true when the length-scale parameter in the prior is fixed at the value used in simulating the data. Nevertheless, the modest differences between posterior inferences under SN (expcov.) and SN (fixed) are reassuring; in the analysis of a real data set, though we may have strong prior beliefs about a suitable structure for , we are likely to be more uncertain about the values of the hyperparameters in that structure. We also observe that the gains from assuming the correct form for in this simulation experiment are greatest when the total variation in makes up a larger proportion of the total variation in (that is, when and ) and when the degree of dimension reduction is larger. This is as expected because in both cases, the prior for , and therefore , is likely to impart more influence on the posterior for . It is interesting to observe that when we adopt a structured matrix normal prior but assume an incorrect structure for , the posterior mean for still tends to be smaller than under the MGP or CUSP priors. This is likely to be because we simulated all data sets based on exponential correlation matrices for which have strictly positive diagonal elements. Under SN (exch.), we are shrinking towards a rank-reduced approximation to a two-parameter exchangeable matrix with a common (potentially positive) off-diagonal element whereas the prior expectation for under the MGP and CUSP priors is diagonal.
A discussion of the results concerning the posterior mean of the effective number of factors can be found in the Supplementary Materials.
6.2 Co-occurrence of Finnish birds
6.2.1 Model and prior
As discussed in Section 1, most of the literature on prior specification for factor models focuses on constructing a distribution for the factor loadings matrix that is exchangeable with respect to the order of the components in the observation vector. An exception is the structured increasing shrinkage (SIS) process (Schiavon et al., 2022). This prior allows meta-covariates to inform the within-column sparsity structure of the factor loadings matrix in an infinite factorisation model. In this section, we re-examine the Finnish birds data set analysed in Schiavon et al. to illustrate the extra flexibility afforded by our approach.
The Finnish bird data comprise information on the co-occurrence of the 50 most common bird species in Finland from 137 sampling areas in 2014 (Lindström et al., 2015). The occurrences are arranged into a binary matrix where if bird species was observed in area and otherwise. Information available to explain the variation in the mean at each sampling area includes a matrix of environmental covariates containing measurements on spring temperature and its square and a five-level habitat type (broadleaved forests, coniferous forests, open habitats, urban habitats, wetlands) so that . Information available to structure the prior for the marginal variance includes a phylogenetic tree, indicating the evolutionary relationships amongst the 50 species, and a matrix of meta-covariates consisting of two species traits: logarithm of typical body mass and a three-level migratory strategy (short-distance migrant, resident species, long-distance migrant). Following Schiavon et al., we model species presence or absence using a multivariate probit regression model where and we assume the latent variables are such that for . Here is the row of , is a matrix of regression coefficients, and the idiosyncratic variances on the diagonal of are set equal to 1 to prevent compensatory rescaling of and . Our prior for the coefficients in is identical to that of Schiavon et al. and described in the Supplementary Materials.
The SIS process prior for is built on a series of assumptions of conditional independence. As a consequence, the expectation of the shared variation matrix is diagonal, irrespective of the priors assigned to its hyperparameters. Moreover, every factor loading has a spike-and-slab distribution in which the meta-covariates only enter through the logit of the spike probability. As such, there is no mechanism for incorporating the rich information on evolutionary relationships that is expressed through the phylogenetic tree.
In contrast, using the approach outlined in Section 3.4 we can structure the prior for the shared variation so that it incorporates the information from the phylogenetic tree in addition to the two meta-covariates. The former can be encoded as a matrix of phylogenetic distances between the 50 species, whose entry, say , is the sum of the branch lengths along the path between the leaves for species and . This distance function is a metric over the set of leaves (Steel, 2016, Chapter 6). Another measure of the dissimilarity between species can be obtained using the generalised distance function from the projection model in (8). As we only have binary response data, which is unlikely to be as informative as continuous data, we simplify (8) by taking the matrix to be diagonal. This yields as a metric for the distance between the meta-covariates for species and , where , and are indicators for whether species is a short-distance migrant, resident or long-distance migrant, respectively, while is the logarithm of typical body mass for species . Since and are metrics over a set comprising the species and their associated traits, the sum is also a metric. We can therefore take as a generalised measure of distance between species and , , in which for . So that all the length-scale parameters operate on a comparable scale, the meta-covariates and branch lengths are standardised prior to calculating and . Finally, we relate the generalised distance to the among-row scale matrix by using the exponential correlation function, . We assume the length-scale parameters to be independent in the prior and assign distributions .
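The construction of the among-row scale matrix can be sketched as follows (illustrative Python); the coordinates below are placeholders for the standardised phylogenetic branch-length and trait information, and the length-scale parameters are fixed here rather than assigned priors:

```python
import numpy as np

rng = np.random.default_rng(5)
p = 50

# Placeholder 1-D "phylogenetic" coordinates and 2-D trait coordinates;
# each induces a valid metric, and a sum of metrics is again a metric.
phylo = rng.uniform(size=(p, 1))
traits = rng.uniform(size=(p, 2))
d_phylo = np.abs(phylo - phylo.T)
d_trait = np.linalg.norm(traits[:, None, :] - traits[None, :, :], axis=-1)

# Exponential correlation in the combined generalised distance, with
# component-specific length-scales (fixed here; unknown in the paper).
ell_phylo, ell_trait = 0.5, 1.0
Omega = np.exp(-(d_phylo / ell_phylo + d_trait / ell_trait))
print(Omega.shape, Omega[0, 0])
```

Note that the combined matrix factorises as the elementwise product of two exponential correlation matrices, so it inherits positive definiteness via the Schur product theorem.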
Based on available phylogenetic and ecological information, this prior encourages shrinkage towards a rank-reduced approximation of an expected shared variation matrix with higher correlations between types of bird that are more closely related in terms of their evolutionary history and species traits. Since we allow the length-scale parameters to be unknown, we are also able to learn which variables are more important in explaining relationships between species. However, the structure for the expected shared variation matrix represented through is conjectural and so we elect to use the matrix-t prior whose extra degree of freedom parameter allows the data to influence the degree of shrinkage. As discussed in Section 3.3, we induce a prior for by specifying a distribution for . Specifically, we take which implies that when the number of factors is 5, 10 and 15, the probability of the scale factor lying between 1 and is 0.75 for , and , respectively. For the diagonal elements in , we assign a MGP prior (9) with , and a maximum truncation point that satisfies .
6.2.2 Analysis and results
We ran two chains, initialised at different starting points, using the adaptive Gibbs sampler discussed in Section 5.2 and another two chains using Stan’s implementation of the (non-adaptive) Hamiltonian Monte Carlo sampler with a fixed truncation level of . In the Gibbs sampler, adaptation was allowed after 500 iterations and, following Schiavon et al. (2022), the parameters in the diminishing adaptation condition were set at and . After a burn-in of 20K iterations, each chain was run for a further 20K iterations, thinning to store every fifth draw in order to reduce computational overheads. Using the truncation criterion of Schiavon and Canale (2020), we retain as active enough factors to explain at least of the total variation in the data. The posterior mode (and median) for the number of effective factors in each case was 5 with 95% credible interval . Comparing the output of identified parameters across the chains from both samplers using the usual graphical diagnostic checks gave no evidence of any lack of convergence.
To gauge performance against another non-exchangeable prior, the model was also fitted under the SIS process prior using an adaptive Gibbs sampler, with the hyperparameters in the prior and tuning parameters in the sampler set in line with the specification from Section 5 of Schiavon et al. (2022). In this case, almost all of the posterior mass for the number of effective factors was concentrated at 4.
The marginal prior and posterior densities for the logarithms of the length-scale parameters and the transformed degree of freedom parameter under the structured matrix-t prior are displayed in Figure 1. Absence of a relationship between a component of the generalised distance metric and the expected shared variation between species would be indicated by an infinite value for the corresponding length-scale parameter. Therefore the concentration of all posterior densities in Figure 1a over a finite interval near the origin highlights the role played by the generalised distance metric in explaining the structure of the shared variation matrix. It is also clear that the phylogenetic distances, whose length-scale parameter is , have the greatest influence. For the transformed degree of freedom parameter in Figure 1b, the similarity between the prior and posterior indicates that the information learned from the data is modest. Nevertheless, the shift in the position of the mode from 0 in the prior to around 0.1 in the posterior suggests that the data support less shrinkage towards the expected shared variation matrix than would be implied by a matrix normal prior for which .
Further qualitative evidence of the importance of the phylogenetic information on the structure of the marginal covariance matrix is demonstrated in Figure 2 which shows the mean of the posterior for the marginal correlation matrix , where .
The order of the 50 bird species is based on their phylogenetic tree, which is shown in the margins of the plot. There is clear structure in the matrix with several groups of species that share a recent common ancestor displaying strong positive correlations. This includes the species in the first ten rows, which correspond to birds from the Corvoidea superfamily and the Sylviidae family; the species in rows 19 to 30, which correspond largely to birds from the Muscicapoidea superfamily; and the species in rows 42 to 47, which correspond mostly to birds from the Scolopacidae family. It is also clear that most of the rows and columns predominated by near-zero correlations correspond to basal species, like Columba palumbus and Grus grus. Nevertheless, the structure of the matrix is clearly not dictated by evolutionary proximity alone. Interestingly, the corresponding plot from the analysis under the SIS process prior appears to be very sparse; see the Supplementary Materials. At least in part, this can be attributed to shrinkage of towards its prior mean, which is a diagonal matrix.
In order to provide a more formal comparison between the two model-prior combinations, we consider measures of goodness-of-fit and predictive performance. In the former case, we follow Schiavon et al. (2022) by computing the pseudo marginal likelihood (PML), defined as the product of conditional predictive ordinates (Gelfand and Dey, 1994). In terms of predictive performance, we consider the Brier score, which is a widely used proper scoring rule for categorical variables, calculated using a 4-fold cross-validation approach (e.g. Gneiting and Raftery, 2007). In our case, the score is positively oriented so that large values indicate better performance. The results are shown in Table 2 from which we can conclude that the combination of the factor model and structured matrix-t prior gives the best fit to the data and, more markedly, the best predictive performance. Further details on the calculation of the PML and Brier score can be found in the Supplementary Materials.
Measure | Structured matrix-t | SIS process
---|---|---
Log PML | -2831.6 | -2936.5
Brier score | -0.29120 | -0.85396
6.3 Hourly demand for natural gas
6.3.1 Model and prior
To illustrate the use of our methodology in the context of a dynamic model, we consider an application to modelling the hourly demand for natural gas. In the UK, gas in the national transmission system is transported to individual customers through eight regional distribution networks which are responsible for maintaining a reliable gas supply to customers at all times. This requires accurate short-term forecasts of the hourly demand for gas at individual points in the network called offtakes, where gas is taken off to supply locally. We consider data from a single offtake site in a regional distribution network in the North of England. Hourly gas demand data are available for the period from May 2012 to June 2015 along with the average daily temperature at the offtake site. In addition to the weather, the other main factors that are known to influence the mean level of residential demand for gas are seasonal and calendar effects (Soldo, 2012). We therefore additionally incorporate covariates that represent the day of the year, day of the week and whether or not a day is a public holiday.
The measurement of the demand for gas at an hourly resolution makes construction of a model challenging because the underlying dynamics of the process are likely to vary over a much longer time scale. For example, consider a dynamic linear model for hourly gas demand that allows for calendar and weather effects and incorporates a time-varying local level. Suppose that this local level is modelled as evolving according to a random walk or autoregression. Deviations from behaviour that would be considered typical given the calendar and weather effects would likely persist over days, rather than hours, and so hourly evolution of the local level would demand a near-zero innovation variance. Similarly, if calendar effects, such as the day of the week, were regarded as dynamic, hourly evolution would demand vanishingly small innovation variances.
Motivated by these concerns, we define a gas-day-vector as a longitudinal series of measurements of the log-transformed demand for gas over 24 hours, starting at 07:00, which is the beginning of a gas-day. Denote by the gas-day-vector on day so that denotes the log-transformed demand for gas at hour of gas-day . We elect to model the gas-day-vectors using a dynamic factor model with observation equation , where , . For simplicity in this illustrative application, we model evolution of the factors using a first order vector autoregression, , where . Due to the stability of the demand for gas over the time period in question, it is reasonable to assume that the departures from the time-varying mean follow a stationary process. We therefore constrain the stationary variance of the to be and take the initial distribution to be . Reparameterising the model for the factors in terms of the transformed partial autocorrelation matrices yields a single unconstrained square matrix with and . As discussed in Section 4.4, we assign a rotatable prior to , and therefore , by taking .
The variances of the specific factors are assigned the prior independently for which ensures that the distributions have finite variance. The time-varying mean is modelled as in which is row of an matrix of daily covariates . The columns of allow for an intercept, a non-linear effect of deviations from seasonal average temperature and fixed effects for the day-of-the-week, the occurrence of a public holiday, and the day-of-the-year, which is represented as a truncated Fourier series with six harmonics. Full details of the model for are given in the Supplementary Materials along with a description of the associated priors.
In order to reflect both the persistence of above or below average demand for gas and the longitudinal but circular nature of a gas-day-vector, a sensible model for the inverse of the expected shared variation matrix would be the precision matrix of a stationary circular autoregressive process of order one (Held and Rue, 2010). We therefore take to be a tridiagonal Toeplitz matrix with corners; that is, for , and for where we assume and take . Exploratory analysis using data from another offtake site suggests that this structure is sensible based on the inverse of the sample covariance matrix. Given that the existing model is already rather complex, we chose to adopt a matrix normal prior for in this example. The diagonal elements in the among-column scale matrix are given a MGP prior (9), with , and a maximum truncation point of .
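Assuming the standard parameterisation of a circular AR(1) precision matrix, with 1 + phi^2 on the diagonal and -phi on the sub-, super-diagonal and corners, up to a precision scale tau (the paper's exact parameterisation may differ), the structure can be sketched as:

```python
import numpy as np

def circular_ar1_precision(p, phi, tau=1.0):
    """Precision matrix of a stationary circular AR(1) process: a
    tridiagonal Toeplitz matrix with extra entries in the (1, p) and
    (p, 1) corners, positive definite for |phi| < 1."""
    Q = tau * (1 + phi**2) * np.eye(p)
    for i in range(p):
        Q[i, (i + 1) % p] = -tau * phi
        Q[i, (i - 1) % p] = -tau * phi
    return Q

# One entry per hour of the 24-hour gas-day; the corner entries link
# hour 1 and hour 24, reflecting the circular nature of a gas-day-vector.
Q = circular_ar1_precision(24, 0.6)
print(Q[0, :3], Q[0, -1])
```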
6.3.2 Analysis and results
We ran two chains, initialised at different starting points, using the adaptive Gibbs sampler discussed in Section 5.2 and another two chains using the (non-adaptive) Hamiltonian Monte Carlo sampler with a fixed truncation level of . After a burn-in of 50K iterations, each Gibbs chain was run for a further 200K iterations, thinning to retain every 100th draw in order to reduce computational overheads. Adaptation was allowed after 5K iterations and the parameters in the diminishing adaptation condition were set at and . The Hamiltonian Monte Carlo sampler was run for 20K iterations after a burn-in of the same length, thinning the output to retain every 10th draw. As in the Finnish birds example, we retain as active enough factors to explain at least of the total variation in the data. The posterior mode (and median) for the number of effective factors in each case was 10 with 95% credible interval . Comparing the output of identified parameters across the chains for the two samplers using the usual graphical diagnostic checks gave no evidence of any lack of convergence.
Let denote the marginal precision matrix of the process and let . Figure 3 shows the posterior mean for the standardised precision matrix . It is clear that its structure is reasonably consistent with the tridiagonal Toeplitz matrix with corners on which the prior for is centered. However, there are some deviations, most notably another one or two bands of non-zero elements below the subdiagonal and above the supradiagonal. There is also some evidence of at least a partial band in the vicinity of for . This may be due to people switching their heating on twice per day, at around 7:00 in the morning and around 19:00 in the evening. This picture is reinforced by the posterior for the identified factor loadings matrix , shown in the Supplementary Materials, many of whose columns display a double-hump shape with smaller loadings towards the middle of the gas-day. As remarked in Section 4.3, in order to simplify the geometry of the stationary region for stationary dynamic factor models, the autoregressive coefficient matrices in the state equation are often assumed to be diagonal. However, this example provides considerable evidence in favour of our fully flexible approach, with the posterior probabilities for many off-diagonal elements being close to zero or one.
In order to assess the forecasting performance of the model, the number of factors was fixed at and the model was refitted, holding back the last 25% of observations as test data. In keeping with the short-term horizon of interest, we considered forecasting hours ahead and, for comparative purposes, hour ahead. Application of the standard forward filtering and forecasting recursions for (multivariate) dynamic linear models is not appropriate here because they do not allow within-day updates or forecasts. We therefore modify the standard forward filter so that the time-step is hours, rather than days, and at hours within each gas-day, we perform a partial update, comprising an observation step but no prediction step. The forecasting algorithm is similarly modified to allow forecasts to be issued hourly. The full algorithms are given in the Supplementary Materials. By using the draws from the posterior obtained from the fit to the training data and sampling one-for-one from the -step ahead predictive distributions, we obtain samples from the -step ahead posterior predictive distributions. For and , these are visualised in Figure 4 for the first 5% of times in the hold-out period, along with the test data. An analogous figure in the Supplementary Materials shows the last 5% of times. From these plots it is clear that the posterior predictive distributions are both accurate and precise over the forecast horizons of interest.
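The modified forward filter can be sketched as a standard Kalman filter in which the prediction (state evolution) step fires only at gas-day boundaries, while hourly observations trigger observation-only partial updates. This is a minimal illustration with generic, illustrative model matrices, not the full algorithm from the Supplementary Materials.

```python
import numpy as np

def forward_filter_hourly(ys, F, G, V, W, m0, C0, steps_per_day=24):
    """Kalman forward filter where the state evolves once per gas-day but
    observations arrive hourly: within a day we perform observation
    updates only, and the prediction (evolution) step runs at day
    boundaries.  A minimal sketch; F, G, V, W are illustrative."""
    m, C = m0, C0
    out = []
    for t, y in enumerate(ys):
        if t % steps_per_day == 0 and t > 0:
            # Prediction step: evolve the state to the new gas-day.
            m, C = G @ m, G @ C @ G.T + W
        # Observation step (partial update within the gas-day).
        f = F @ m                      # one-step forecast mean
        Q = F @ C @ F.T + V            # one-step forecast variance
        K = C @ F.T @ np.linalg.inv(Q) # Kalman gain
        m = m + K @ (y - f)
        C = C - K @ Q @ K.T
        out.append((m.copy(), C.copy()))
    return out
```

Forecasting works analogously: iterating the observation-free recursions from the current filtered moments yields the hourly -step ahead predictive moments.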
7 Discussion
We have proposed a class of structured priors for the loadings matrix of a Bayesian factor model with accompanying inferential algorithms. The novelty lies in the insight that the shared variation matrix is much more amenable to the specification of prior beliefs than the factor loadings matrix . The prior is based on a matrix normal or matrix- distribution for . Two important features are the choice of parametric structure for the among-row scale matrix and the increasing shrinkage prior for the diagonal among-column scale matrix . The matrix characterises the conditional prior expectation of . Parametric forms are widely adopted as models for covariance matrices in longitudinal studies, spatial statistics, Gaussian processes, and a host of other areas. By adopting a parametric form, informed by domain expertise, as a model for the expectation of , we benefit from shrinkage towards a rank-reduced matrix that is close to that structure. In general, the number of factors in a factor model is not known a priori. This is addressed through the increasing shrinkage process prior for which allows columns of loadings whose likelihood contribution is negligible to be shrunk towards zero and discarded. A prior that encourages no more factors than are needed is helpful, particularly for prediction, because of the reduction in epistemic uncertainty that is afforded. At the cost of slightly more involved computational inference, the matrix- version of the structured prior also offers a degree of freedom parameter which allows the data to influence the degree of shrinkage of towards a rank-reduced matrix that is close to its mean.
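The centring property that motivates the prior can be illustrated directly: if the loadings matrix follows a matrix normal distribution with mean M, among-row scale Omega and among-column scale Delta, then the shared variation matrix has prior expectation MM' + tr(Delta) Omega, so centring Omega on a parametric structure centres the shared variation matrix on (a multiple of) that structure. A minimal sketch, with illustrative choices of M, Omega and Delta:

```python
import numpy as np

def sample_loadings(M, Omega, Delta, rng):
    """Draw L ~ MN(M, Omega, Delta): matrix normal with among-row scale
    Omega and among-column scale Delta, via L = M + A Z B' where
    Omega = AA' and Delta = BB'.  A sketch of the structured prior's
    centring; M, Omega and Delta here are illustrative choices."""
    A = np.linalg.cholesky(Omega)
    B = np.linalg.cholesky(Delta)
    Z = rng.standard_normal(M.shape)
    return M + A @ Z @ B.T

# Under this prior, E[L L'] = M M' + tr(Delta) * Omega, so shrinking the
# diagonal of Delta (the increasing shrinkage prior) shrinks trailing
# columns of L towards zero without disturbing the centring on Omega.
```

A diagonal Delta with decreasing entries plays the role of the increasing shrinkage process prior on the among-column scales.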
Acknowledgements
EH was supported by the EPSRC grant EP/N510129/1 via the Alan Turing Institute project “Streaming data modelling for real-time monitoring and forecasting”. We are grateful to Michael Betancourt and Malcolm Farrow for conversations which have improved the manuscript. We are also grateful to two anonymous referees for their helpful comments and suggestions. The Finnish bird data are available publicly from https://www.helsinki.fi/en/researchgroups/statistical-ecology/software/hmsc.