
Structured prior distributions for the covariance matrix in latent factor models

Sarah E. Heaps and Ian H. Jermyn
Durham University, Durham, U.K.
Email: sarah.e.heaps@durham.ac.uk
Abstract

Factor models are widely used for dimension reduction in the analysis of multivariate data. This is achieved through decomposition of a $p \times p$ covariance matrix into the sum of two components which, through a latent factor representation, can be interpreted as a diagonal matrix of idiosyncratic variances and a shared variation matrix, that is, the product of a $p \times k$ factor loadings matrix and its transpose. If $k \ll p$, this defines a parsimonious factorisation of the covariance matrix. Historically, little attention has been paid to incorporating prior information in Bayesian analyses using factor models where, at best, the prior for the factor loadings is order invariant. In this work, a class of structured priors is developed that can encode ideas of dependence structure about the shared variation matrix. The construction allows data-informed shrinkage towards sensible parametric structures while also facilitating inference over the number of factors. Using an unconstrained reparameterisation of stationary vector autoregressions, the methodology is extended to stationary dynamic factor models. For computational inference, parameter-expanded Markov chain Monte Carlo samplers are proposed, including an efficient adaptive Gibbs sampler. Two substantive applications showcase the scope of the methodology and its inferential benefits.

Keywords— Covariance matrix; Dimension reduction; Intraday gas demand; Latent factor models; Stationary dynamic factor models; Structured prior distributions

1 Introduction

Factor models are widely used as a tool for dimension reduction in explaining the covariance structure of multivariate data. Let $\boldsymbol{y}_i = (y_{i1}, \ldots, y_{ip})^{\mathrm{T}}$ be a $p$-dimensional random vector with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Omega}$ for observation unit $i = 1, \ldots, n$. The classic factor model posits that the covariances between the $p$ components of $\boldsymbol{y}_i$ can be explained by their mutual dependence on a smaller number $k$ of unknown common factors $\boldsymbol{\eta}_i = (\eta_{i1}, \ldots, \eta_{ik})^{\mathrm{T}}$. Specifically, the model expresses $\boldsymbol{y}_i$ as a noisy affine transformation of $\boldsymbol{\eta}_i$,

$$\boldsymbol{y}_i = \boldsymbol{\mu} + \boldsymbol{\Lambda}\boldsymbol{\eta}_i + \boldsymbol{\epsilon}_i, \qquad (1)$$

in which the $p \times k$ matrix $\boldsymbol{\Lambda}$ is called the factor loadings matrix. The components of the error term $\boldsymbol{\epsilon}_i$ are often called the specific factors (or "uniquenesses") and their variances are often termed the idiosyncratic variances. In (1), we take the specific and common factors, $\boldsymbol{\epsilon}_i$ and $\boldsymbol{\eta}_i$, to be uncorrelated, zero-mean multivariate normal random variables, with

$$\boldsymbol{\epsilon}_i \sim \mathrm{N}_p(\boldsymbol{0}, \boldsymbol{\Sigma}) \quad \text{and} \quad \boldsymbol{\eta}_i \sim \mathrm{N}_k(\boldsymbol{0}, \boldsymbol{I}_k). \qquad (2)$$

In general, the common factors are assumed to explain all the shared variation between the components of $\boldsymbol{y}_i$ and so $\boldsymbol{\Sigma}$ is constrained to be a diagonal matrix with positive diagonal elements, $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$. It then follows from (1) that the marginal covariance matrix $\boldsymbol{\Omega}$ can be expressed as

$$\boldsymbol{\Omega} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}} + \boldsymbol{\Sigma} \qquad (3)$$

in which the product $\boldsymbol{\Delta} = (\delta_{ij}) = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}$ will henceforth be termed the shared variation matrix. If $k \ll p$, the right-hand side of (3) constitutes a parsimonious factorisation of the full covariance matrix.
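
To make the decomposition in (3) concrete, the following minimal Python sketch simulates from (1)--(2) with illustrative (hypothetical) choices of $p$, $k$, $\boldsymbol{\Lambda}$ and $\boldsymbol{\Sigma}$, and checks that the sample covariance of the simulated data approaches $\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}} + \boldsymbol{\Sigma}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 6, 2, 200_000                # illustrative dimensions and sample size

Lambda = rng.normal(size=(p, k))       # factor loadings matrix
sigma2 = rng.uniform(0.5, 1.5, size=p) # idiosyncratic variances
Sigma = np.diag(sigma2)

eta = rng.normal(size=(n, k))                    # eta_i ~ N_k(0, I_k)
eps = rng.normal(size=(n, p)) * np.sqrt(sigma2)  # eps_i ~ N_p(0, Sigma)
y = eta @ Lambda.T + eps                         # model (1) with mu = 0

Omega = Lambda @ Lambda.T + Sigma                # decomposition (3)
print(np.abs(np.cov(y, rowvar=False) - Omega).max())  # -> 0 as n grows
```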

Factor models originally found widespread use in psychology and other social sciences (e.g. Goldberg, 1990), where a primary motivation was the prospect of domain insight through interpretation of the latent factors. Since their introduction, factor models have been extended in a variety of directions, for example, by allowing sparsity in the matrix of factor loadings for high-dimensional problems (Conti et al., 2014) or by modelling the temporal evolution of variances in stochastic volatility models (Lopes and Carvalho, 2007) or the latent factors in dynamic factor models (Sáfadi and Peña, 2008). Accordingly, the use of factor models and their extensions has spread into a diverse array of other fields, such as finance (Aguilar and West, 2000) and genetics (Carvalho et al., 2008).

In many of these application areas, the modeller is likely to hold prior beliefs about the nature of the dependence between variables. For example, when the observation vector represents repeated measures data, or observations at a collection of spatial locations, it would be reasonable to expect stronger associations between measurements that are closer together in time or space. Indeed, such ideas underpin the class of functional factor models, which are tailored to variables that might be better represented as functions over a continuous domain, rather than vector-valued random quantities. Some functional factor models, and their dynamic extensions, have functional factors (e.g. Castellanos et al., 2004; Taylor-Rodriguez et al., 2019) while others have functional factor loadings. In the latter case, if we regard the $p$ observations on unit $i$ as measurements on a function over a continuous domain, say $\tau \in \mathbb{R}$, then we replace (1) with

$$y_i(\tau_j) = \mu(\tau_j) + f_i(\tau_j) + \epsilon_i(\tau_j) = \mu(\tau_j) + \sum_{m=1}^{k} \lambda_m(\tau_j)\eta_{im} + \epsilon_i(\tau_j)$$

for $j = 1, \ldots, p$. The key idea is then to treat the factor loading curves $\lambda_m(\tau)$ as smooth unknown functions of $\tau$. This can be achieved by regarding $f_i(\tau_j) = \sum_{m=1}^{k} \lambda_m(\tau_j)\eta_{im}$ as a linear combination of (orthogonal) basis functions, often modelled using splines, which then implies a particular covariance function for $f_i(\tau)$. Although some of the research in this direction uses a Bayesian approach to inference (Kowal et al., 2017; Kowal and Canale, 2022), the frequentist treatment of the problem is more common, particularly in financial applications to forecasting yield curves (Hays et al., 2012; Jungbacker et al., 2014). An alternative approach in models with functional factor loadings is to assume each $\lambda_m(\tau)$ to be a realisation of a stochastic process, for instance, a Gaussian process with a chosen covariance function; see, for example, the dynamic spatial model of Lopes et al. (2008).
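
As a toy illustration of loading curves built from basis functions, the sketch below uses Gaussian bumps as a stand-in for the spline bases cited above; the dimensions, bandwidth and coefficients are all illustrative rather than drawn from any of the cited models.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, q = 50, 3, 8                   # grid points, factors, basis functions
tau = np.linspace(0.0, 1.0, p)       # observation points tau_1, ..., tau_p

# Gaussian-bump basis standing in for a spline basis (illustrative bandwidth)
centres = np.linspace(0.0, 1.0, q)
B = np.exp(-0.5 * ((tau[:, None] - centres[None, :]) / 0.15) ** 2)  # p x q

W = rng.normal(size=(q, k))          # basis coefficients for each loading curve
Lambda = B @ W                       # columns are smooth curves lambda_m(tau_j)

# The implied shared variation matrix inherits the smoothness of the curves,
# i.e. a smooth covariance function for f_i(tau)
Delta = Lambda @ Lambda.T
```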

Beyond analyses of functional data, most of the work in the Bayesian literature on prior specification for factor models has focused on the problem of developing priors that are exchangeable with respect to the order of the variables in the observation vector (e.g. Leung and Drton, 2016; Chan et al., 2018) with more recent work focusing additionally on allowing sparsity in the factor loadings matrix (Frühwirth-Schnatter et al., 2023). There has been very little work on constructing non-exchangeable prior specifications in a general setting. A notable exception is the structured increasing shrinkage process prior of Schiavon et al. (2022) under which each factor loading in an infinite factorisation model is assigned a spike-and-slab prior where variable-specific meta-covariates are incorporated in the spike probabilities. Other related work appears in the context of vector error correction models, which are another class of reduced rank models, where Koop et al. (2009) develop an informative prior for the matrix of cointegration vectors based on economic theory about the likely cointegration relationships between variables.

In this paper, we treat the factor model as a tool for providing a dimension-reduced parameterisation of a covariance matrix. Based on the insight that the shared variation matrix $\boldsymbol{\Delta} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}$ in (3) is more amenable to the specification of prior beliefs than the factor loadings matrix $\boldsymbol{\Lambda}$, our main contribution is a framework for incorporating initial beliefs about $\boldsymbol{\Delta}$ in the prior for $\boldsymbol{\Lambda}$ by exploiting the algebraic relationship between the two quantities. By switching the focus from properties of the prior for $\boldsymbol{\Lambda}$ to properties of the prior for $\boldsymbol{\Delta}$, we obtain a methodology that is more flexible than alternatives described in the literature. The prior for the shared variation matrix $\boldsymbol{\Delta}$ is induced through a structured prior distribution for the factor loadings matrix $\boldsymbol{\Lambda}$ which can encode ideas of order-invariance as well as non-exchangeable structure, modelled through dependence on covariance kernels, meta-covariates, pairwise distance metrics and more, within a single principled framework. In particular, the mean of the prior for $\boldsymbol{\Delta}$ can be any positive definite matrix of relevance. For example, it could be a phylogenetic distance matrix whose entries encode distances between nodes on a tree, or even the Gram matrix arising from a Gaussian process covariance function. The interpretability of this prior expectation facilitates straightforward incorporation of meaningful domain expertise. Moreover, the flexible class of priors allows data-informed shrinkage towards a region around the mean within a framework that facilitates inference on the number of factors and hence the extent to which dimension-reduction can be achieved. New theoretical results outline the nature of this shrinkage in the rank-reduced space over which $\boldsymbol{\Delta}$ has support while shedding light on some of the identifiability issues that can hamper computational inference for factor models.

Based on ideas from the literature on infinite factorisation models and parameter-expansion algorithms, we propose an efficient scheme for computational inference where a single Markov chain Monte Carlo (MCMC) run yields information about both the continuous model parameters and the number of factors. Using an unconstrained reparameterisation of stationary vector autoregressions, the methodology is also extended to a class of stationary dynamic factor models for which the shared variation matrix remains meaningfully defined. To the best of our knowledge, this is the first prior, with associated inferential scheme, for a general class of dynamic factor models that constrains inference to the stationary region without imposing further restrictions. This constitutes an important contribution to the Bayesian time-series literature.

The remainder of this paper is organised as follows. After introducing our parameterisation of Bayesian factor models in Section 2, Section 3 presents the general class of structured priors and its components. In Section 4, we extend these ideas to a class of stationary dynamic factor models. Posterior computation is discussed in Section 5, then Section 6 presents a simulation experiment and two substantive applications that illustrate the scope of the methodology and its inferential benefits.

2 Parameterisation of the Bayesian factor model

It is well known that the parameters $\boldsymbol{\Lambda}$ and $\boldsymbol{\Sigma}$ in the marginal covariance matrix $\boldsymbol{\Omega} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}} + \boldsymbol{\Sigma}$ of a factor model are not identifiable from the likelihood without imposition of constraints. Indeed, even with the scale of the factors fixed at $\mathrm{Var}(\boldsymbol{\eta}_i) = \boldsymbol{I}_k$ in (2), there remains a rotational invariance. Consider any $k \times k$ orthogonal matrix $\boldsymbol{Q}$. We can pre-multiply the factors in (1) by $\boldsymbol{Q}$ and post-multiply the factor loadings matrix by $\boldsymbol{Q}^{\mathrm{T}}$; this gives the transformed factors in (2) the same distribution as the original factors, and so the marginal variance in (3) remains unchanged. The factor loadings matrix can therefore only be identified up to an orthogonal transformation. This means there are $pk - k(k-1)/2 + p$ degrees of freedom determining the marginal variance $\boldsymbol{\Omega}$. In an unconstrained model for $\boldsymbol{y}_i$, the number of degrees of freedom would be $p(p+1)/2$ and so the reduction in the number of parameters is $p(p+1)/2 - \{pk - k(k-1)/2 + p\}$, which is positive if $k < \varphi(p)$ where

$$\varphi(p) = \frac{2p + 1 - \sqrt{8p + 1}}{2} \qquad (4)$$

is called the Ledermann bound. Notwithstanding rotational invariance, the first identification issue is therefore whether $\boldsymbol{\Sigma}$ can be identified uniquely from $\boldsymbol{\Omega}$. Fortunately, Bekker and ten Berge (1997) proved that if $k < \varphi(p)$ then $\boldsymbol{\Sigma}$, and therefore $\boldsymbol{\Delta} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}$, is almost surely globally identifiable. Given identification of $\boldsymbol{\Delta}$, solving the rotation problem would then guarantee unique identification of the factor loadings matrix $\boldsymbol{\Lambda}$.
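
The bound (4) and the associated parameter count are simple to compute; a small sketch with an illustrative choice of $p$ follows.

```python
import numpy as np

def ledermann_bound(p: int) -> float:
    """phi(p) in (4): k < phi(p) guarantees a genuine parameter reduction."""
    return (2 * p + 1 - np.sqrt(8 * p + 1)) / 2

def parameter_reduction(p: int, k: int) -> int:
    """p(p+1)/2 - {pk - k(k-1)/2 + p}: the saving over an unconstrained Omega."""
    return p * (p + 1) // 2 - (p * k - k * (k - 1) // 2 + p)

p = 10
print(ledermann_bound(p))                  # 6.0 for p = 10
for k in range(1, 8):
    print(k, parameter_reduction(p, k))    # positive precisely when k < phi(p)
```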

In the Bayesian literature, the most common solution to the rotation problem uses the positive diagonal, lower triangular (PLT) constraint; denoting the constrained matrix by $\boldsymbol{\tilde{\Lambda}}$ and the corresponding factors by $\tilde{\boldsymbol{\eta}}_1, \ldots, \tilde{\boldsymbol{\eta}}_n$, this demands $\tilde{\lambda}_{ij} = 0$ for $j > i$ and $\tilde{\lambda}_{ii} > 0$ for $i = 1, \ldots, k$. Although historically this approach has been widely used in the Bayesian literature (e.g. Geweke and Zhou, 1996; Lopes and West, 2004), we choose not to address the rotational invariance and instead parameterise the model using the unidentified and unconstrained factor loadings matrix $\boldsymbol{\Lambda}$. There are two main motivations.

First, recent work provides theoretical and empirical evidence that imposition of the PLT constraint can adversely affect computational inference. Writing $\boldsymbol{\tilde{\Lambda}}^{(k)}$ to denote the lower triangular matrix comprising the first $k$ rows and columns of $\boldsymbol{\tilde{\Lambda}}$, problems arise when $\boldsymbol{\tilde{\Lambda}}^{(k)}$ is near singular, which can occur when variables in $\boldsymbol{y}_i$ are highly correlated. When the $j^{\text{th}}$ diagonal element of $\boldsymbol{\tilde{\Lambda}}^{(k)}$ is close to zero, the signs of $(\tilde{\lambda}_{j+1,j}, \ldots, \tilde{\lambda}_{pj})$ are only very weakly identified, producing multiple modes in the posterior corresponding to the different sign flips (e.g. Aßmann et al., 2016; Man and Culpepper, 2022). This can cause poor mixing in MCMC samplers and poor approximation of the posterior for the number of factors (Chan et al., 2018). These problems have been addressed successfully by parameter-expansion techniques, which target a posterior over an expanded space with fewer pathological features (e.g. Ghosh and Dunson, 2009; Chan et al., 2018). Parameterising the model and framing the problem of computational inference in terms of the unidentified factor loadings matrix $\boldsymbol{\Lambda}$ falls into this class of methods.

Second, parameterisation in terms of the unconstrained and unidentified factor loadings matrix $\boldsymbol{\Lambda}$ delivers an additional benefit in terms of prior specification. Direct elicitation of a prior for the identified factor loadings in $\boldsymbol{\tilde{\Lambda}}$ is difficult as they quantify relationships with latent factors whose interpretation is generally unclear a priori. In contrast, beliefs about the shared variation $\boldsymbol{\Delta}$ in linear Gaussian models can be related directly to beliefs about observable quantities. As a result, $\boldsymbol{\Delta}$ is generally a more intuitive parameter for the quantification of covariation. For example, in a spatial problem where the elements of $\boldsymbol{y}_i$ correspond to observations at different locations, a modeller might reasonably structure their beliefs through a covariance matrix for which covariance decays as a function of distance. Fortunately, subject to $k < \varphi(p)$, identifiability of $\boldsymbol{\Lambda}$ is not necessary for identification of $\boldsymbol{\Delta}$. Further, under various standard distributions for an unconstrained random matrix $\boldsymbol{\Lambda}$, the first and second order moments of the shared variation matrix $\boldsymbol{\Delta} = \boldsymbol{\tilde{\Lambda}}\boldsymbol{\tilde{\Lambda}}^{\mathrm{T}} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}$ can be calculated in closed form. Through careful specification of the prior for $\boldsymbol{\Lambda}$, a modeller can therefore capture their beliefs about the moments in the prior for the shared variation matrix.

3 Structured prior distributions

3.1 Significance of prior expectation

The prior expectation of the shared variation matrix $\boldsymbol{\Delta} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}$ is defined by $E(\boldsymbol{\Delta}) = \{E(\delta_{ij})\} = E(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}})$. As we detail in Section 3.4, $E(\boldsymbol{\Delta})$ will be chosen to reflect beliefs about the covariation amongst the $p$ elements of $\boldsymbol{y}_i$. In general, it will be a rank $p$ matrix.

For $k < p$, denote by $\mathcal{S}_{p,k}^{+}$ the set of rank $k$, $p \times p$ symmetric, positive semi-definite matrices and write $\mathcal{S}_p^{+}$ for the space of $p \times p$ symmetric, positive definite matrices. The factor loadings matrix $\boldsymbol{\Lambda}$ is rank $k < \varphi(p) < p$ and so the prior distribution for the shared variation matrix $\boldsymbol{\Delta} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}$ only offers non-zero support to $\mathcal{S}_{p,k}^{+}$. Because $\mathcal{S}_{p,k}^{+}$ is not a convex set, it need not contain the prior expectation of $\boldsymbol{\Delta}$. Indeed, as stated previously, $E(\boldsymbol{\Delta})$ will generally be rank $p$. Therefore, although $E(\boldsymbol{\Delta})$ represents the centre of prior mass, in general there will be zero density at this point. The significance of $E(\boldsymbol{\Delta})$ is thus not completely clear. We will elucidate its meaning via an alternative, constrained expectation, as follows.

The Frobenius norm of a matrix is simply the Euclidean norm of its vectorised form. Define the (squared) Frobenius distance between two matrices 𝑨𝑨\boldsymbol{A}bold_italic_A and 𝑩𝑩\boldsymbol{B}bold_italic_B as the squared Frobenius norm of their difference:

$$d(\boldsymbol{A}, \boldsymbol{B})^2 = \|\boldsymbol{A} - \boldsymbol{B}\|_F^2 = \mathrm{tr}\left\{(\boldsymbol{A} - \boldsymbol{B})^{\mathrm{T}}(\boldsymbol{A} - \boldsymbol{B})\right\}.$$

By analogy with the classic Euclidean expectation, we now define the constrained expectation of $\boldsymbol{\Delta}$ as $E_F(\boldsymbol{\Delta}) = \boldsymbol{L}_0\boldsymbol{L}_0^{\mathrm{T}} \in \mathcal{S}_{p,k}^{+}$ where

$$\boldsymbol{L}_0 = \arg\min_{\boldsymbol{L} \in \mathbb{R}^{p \times k}} E_{\Lambda}\left\{d(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}, \boldsymbol{L}\boldsymbol{L}^{\mathrm{T}})^2\right\}.$$

Theorem 1 below asserts that if the $k^{\text{th}}$ and $(k+1)^{\text{th}}$ eigenvalues of $E(\boldsymbol{\Delta})$ are distinct, the constrained expectation of $\boldsymbol{\Delta}$ is the point in $\mathcal{S}_{p,k}^{+}$ that minimises the Frobenius distance to $E(\boldsymbol{\Delta})$. However, before proving Theorem 1, we need Proposition 1.

Proposition 1.

Let the spectral decomposition of a matrix $\boldsymbol{D} \in \mathcal{S}_p^{+}$ be $\boldsymbol{U}\boldsymbol{M}\boldsymbol{U}^{\mathrm{T}}$, where $\boldsymbol{M} = \mathrm{diag}(m_1, \ldots, m_p)$ is a diagonal matrix of ordered eigenvalues, $m_1 \geq m_2 \geq \cdots \geq m_p > 0$, and $\boldsymbol{U}$ is an orthogonal matrix whose columns comprise the corresponding eigenvectors. Assume that $m_k \neq m_{k+1}$. Then, for $\boldsymbol{\Lambda} \in \mathbb{R}^{p \times k}$, the matrix product $\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}$ which minimises the Frobenius distance to $\boldsymbol{D}$ is $\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}} = \boldsymbol{D}^{1/2}\boldsymbol{U}^{(k)}\boldsymbol{U}^{(k)\,\mathrm{T}}\boldsymbol{D}^{1/2}$, where $\boldsymbol{U}^{(k)}$ is the sub-matrix comprising the first $k$ columns of $\boldsymbol{U}$. Moreover, the minimum squared Frobenius distance is equal to the sum of the squares of the last $p - k$ eigenvalues of $\boldsymbol{D}$.

The proof of Proposition 1 is provided in the Supplementary Materials.

Theorem 1.

If $\boldsymbol{U}\boldsymbol{M}\boldsymbol{U}^{\mathrm{T}}$ denotes the spectral decomposition of $E(\boldsymbol{\Delta}) \in \mathcal{S}_p^{+}$ and $m_k \neq m_{k+1}$, then the constrained expectation of $\boldsymbol{\Delta}$, $E_F(\boldsymbol{\Delta}) \in \mathcal{S}_{p,k}^{+}$, is equal to the matrix product which minimises the Frobenius distance to $E(\boldsymbol{\Delta})$. That is, $E_F(\boldsymbol{\Delta}) = E(\boldsymbol{\Delta})^{1/2}\boldsymbol{U}^{(k)}\boldsymbol{U}^{(k)\,\mathrm{T}}E(\boldsymbol{\Delta})^{1/2}$.

The proof of Theorem 1 is given in the Supplementary Materials. Its significance lies in the suggestion that the prior for $\boldsymbol{\Delta}$ encourages shrinkage towards the closest matrix in $\mathcal{S}_{p,k}^{+}$ to the rank $p$ prior expectation $E(\boldsymbol{\Delta})$, hence, for a given structure for $E(\boldsymbol{\Delta})$, approximating it as closely as possible in rank-reduced form. The Supplementary Materials also consider the case $m_k = m_{k+1}$ and the implications for computational inference.
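
A numerical sketch of Proposition 1 and Theorem 1, using an arbitrary full-rank stand-in for $E(\boldsymbol{\Delta})$ and an illustrative $k$: the constrained expectation is computed from the spectral decomposition, and the minimum squared Frobenius distance is checked against the sum of the squares of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 6, 2
A = rng.normal(size=(p, p))
ED = A @ A.T + p * np.eye(p)          # a full-rank stand-in for E(Delta)

m, U = np.linalg.eigh(ED)             # eigh returns ascending eigenvalues
m, U = m[::-1], U[:, ::-1]            # reorder so that m_1 >= ... >= m_p
Uk = U[:, :k]

# E_F(Delta) = E(Delta)^{1/2} U^(k) U^(k)T E(Delta)^{1/2},
# equivalently U^(k) diag(m_1, ..., m_k) U^(k)T
Dh = U @ np.diag(np.sqrt(m)) @ U.T    # symmetric square root of E(Delta)
EF = Dh @ Uk @ Uk.T @ Dh

# Proposition 1: the minimum squared Frobenius distance equals the sum of
# the squares of the p - k discarded eigenvalues
print(np.sum((ED - EF) ** 2), np.sum(m[k:] ** 2))   # the two numbers agree
```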

3.2 A matrix normal prior

A random matrix $\boldsymbol{\Lambda}$ has a matrix normal distribution with location matrix $\boldsymbol{M} \in \mathbb{R}^{p \times k}$, among-row scale matrix $\boldsymbol{\Phi} \in \mathcal{S}_p^{+}$ and among-column scale matrix $\boldsymbol{\Psi} \in \mathcal{S}_k^{+}$, written $\boldsymbol{\Lambda} \sim \mathrm{N}_{p,k}(\boldsymbol{M}, \boldsymbol{\Phi}, \boldsymbol{\Psi})$, if $\mathrm{vec}(\boldsymbol{\Lambda})$ is multivariate normal with $\mathrm{vec}(\boldsymbol{\Lambda}) \sim \mathrm{N}_{pk}\bigl(\mathrm{vec}(\boldsymbol{M}), \boldsymbol{\Psi} \otimes \boldsymbol{\Phi}\bigr)$. Since $(\alpha\boldsymbol{\Psi}) \otimes (\alpha^{-1}\boldsymbol{\Phi}) = \boldsymbol{\Psi} \otimes \boldsymbol{\Phi}$ for any scalar $\alpha > 0$, the overall scale of either $\boldsymbol{\Phi}$ or $\boldsymbol{\Psi}$ can be fixed without loss of generality.
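
Sampling from this distribution is straightforward: if $\boldsymbol{A}\boldsymbol{A}^{\mathrm{T}} = \boldsymbol{\Phi}$, $\boldsymbol{B}\boldsymbol{B}^{\mathrm{T}} = \boldsymbol{\Psi}$ and $\boldsymbol{Z}$ is a $p \times k$ matrix of independent standard normals, then $\boldsymbol{M} + \boldsymbol{A}\boldsymbol{Z}\boldsymbol{B}^{\mathrm{T}} \sim \mathrm{N}_{p,k}(\boldsymbol{M}, \boldsymbol{\Phi}, \boldsymbol{\Psi})$. The sketch below, with illustrative scale matrices, verifies the defining Kronecker covariance by Monte Carlo.

```python
import numpy as np

def rmatnorm(rng, M, Phi, Psi):
    """Draw Lambda ~ N_{p,k}(M, Phi, Psi): vec(Lambda) ~ N(vec(M), Psi ⊗ Phi)."""
    A = np.linalg.cholesky(Phi)              # A A^T = Phi (among-row scale)
    B = np.linalg.cholesky(Psi)              # B B^T = Psi (among-column scale)
    return M + A @ rng.normal(size=M.shape) @ B.T

rng = np.random.default_rng(3)
p, k = 4, 2
Phi = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))   # tr(Phi) = p
Psi = np.diag([2.0, 1.0])
M = np.zeros((p, k))

# Check the defining property: Cov{vec(Lambda)} = Psi ⊗ Phi
draws = np.stack([rmatnorm(rng, M, Phi, Psi).ravel(order="F")
                  for _ in range(100_000)])
print(np.abs(np.cov(draws, rowvar=False) - np.kron(Psi, Phi)).max())  # small
```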

Suppose that we take $\boldsymbol{\Lambda} \sim \mathrm{N}_{p,k}(\boldsymbol{0}, \boldsymbol{\Phi}, \boldsymbol{\Psi})$ as a prior for the unconstrained factor loadings matrix and fix the scale of the among-row scale matrix by taking $\mathrm{tr}(\boldsymbol{\Phi}) = p$ or $\mathrm{tr}(\boldsymbol{\Phi}^{-1}) = p$. As remarked in Section 2, the shared variation matrix $\boldsymbol{\Delta} = (\delta_{ij}) = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}$ is a quantity which is amenable to the specification of prior beliefs, and so our goal is to derive its moments. Using standard theory for the matrix normal distribution (e.g. Gupta and Nagar, 2000, Chapter 2), it can be shown that the expectation of $\boldsymbol{\Delta}$ is given by

$$E(\boldsymbol{\Delta}) = \mathrm{tr}(\boldsymbol{\Psi})\boldsymbol{\Phi}, \qquad (5)$$

while the covariance between $\delta_{ij}$ and $\delta_{k\ell}$ is $\mathrm{Cov}(\delta_{ij}, \delta_{k\ell}) = \mathrm{tr}(\boldsymbol{\Psi}^2)(\phi_{ik}\phi_{j\ell} + \phi_{i\ell}\phi_{jk})$. In the special case where $i = k$ and $j = \ell$, the variance of $\delta_{ij}$ is then

$$\mathrm{Var}(\delta_{ij}) = \mathrm{tr}(\boldsymbol{\Psi}^2)(\phi_{ii}\phi_{jj} + \phi_{ij}^2). \qquad (6)$$

The derivations of the moments above are provided in the Supplementary Materials.
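
The moment formulas (5) and (6) can also be checked by Monte Carlo; the sketch below uses illustrative choices of $\boldsymbol{\Phi}$ and $\boldsymbol{\Psi}$.

```python
import numpy as np

rng = np.random.default_rng(4)
p, k, S = 4, 2, 50_000
Phi = 0.7 * np.eye(p) + 0.3 * np.ones((p, p))   # tr(Phi) = p
Psi = np.diag([1.5, 0.5])
A, B = np.linalg.cholesky(Phi), np.linalg.cholesky(Psi)

Deltas = np.empty((S, p, p))
for s in range(S):
    Lam = A @ rng.normal(size=(p, k)) @ B.T     # Lambda ~ N_{p,k}(0, Phi, Psi)
    Deltas[s] = Lam @ Lam.T

# Monte Carlo versus (5): E(Delta) = tr(Psi) Phi
print(np.abs(Deltas.mean(axis=0) - np.trace(Psi) * Phi).max())
# Monte Carlo versus (6): Var(delta_ij) = tr(Psi^2)(phi_ii phi_jj + phi_ij^2)
i, j = 0, 1
print(Deltas[:, i, j].var(),
      np.trace(Psi @ Psi) * (Phi[i, i] * Phi[j, j] + Phi[i, j] ** 2))
```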

The result in (5) is significant because of the interpretation it bestows on the among-row scale matrix $\boldsymbol{\Phi}$ as a standardised version of our prior expectation for the shared variation matrix $\boldsymbol{\Delta}$. Parametric forms are convenient models for covariance matrices because they provide a parsimonious representation of dependence and a way to model the relationships with as-yet unobserved variables. We can therefore model the among-row scale matrix $\boldsymbol{\Phi}$ using a parametric form. This encourages shrinkage of the shared variation matrix $\boldsymbol{\Delta}$ towards an area around that parametric form, but without enforcing the structure as a model constraint.

Suppose that $\boldsymbol{\Phi} = \boldsymbol{R}(\boldsymbol{\vartheta})$ in which $\boldsymbol{\vartheta}$ is a low-dimensional vector of hyperparameters on which the parametric form $\boldsymbol{R}(\cdot)$ depends. We complete our prior for the factor loadings matrix by taking $\boldsymbol{\Psi} = \mathrm{diag}(\psi_1, \ldots, \psi_k)$ and assigning a prior to $\boldsymbol{\vartheta}$ and the $\psi_i$. Parametric forms $\boldsymbol{R}(\cdot)$ and the prior for the $\psi_i$ are discussed in Sections 3.4 and 3.5, respectively.

3.3 A matrix-$t$ prior

The expressions for the means and variances in (5) and (6) under the matrix normal prior reveal a lack of flexibility; recalling that the scale of $\boldsymbol{\Phi}$ is fixed, once the mean $E(\boldsymbol{\Delta})$ has been chosen, it is clearly not possible to scale up or down all the variances by scaling $\boldsymbol{\Psi}$. As such, we cannot assess independently the overall scale of the mean and the uncertainty around that mean. A matrix-$t$ prior for $\boldsymbol{\Lambda}$ remedies this problem through the introduction of a degrees of freedom parameter.

Let $\boldsymbol{S} \sim \mathrm{W}_p(\varsigma + p - 1, \boldsymbol{\Phi}^{-1})$ and $\boldsymbol{X} \sim \mathrm{N}_{p,k}(\boldsymbol{0}, \boldsymbol{I}_p, \boldsymbol{\Psi})$ be independent random matrices, where $\mathrm{W}_n(q, \boldsymbol{U})$ denotes the $n$-dimensional Wishart distribution with $q > 0$ degrees of freedom and scale matrix $\boldsymbol{U} \in \mathcal{S}_n^{+}$. If we define

$$\boldsymbol{\Lambda} = (\boldsymbol{S}^{-1/2})^{\mathrm{T}}\boldsymbol{X} + \boldsymbol{M} \qquad (7)$$

where $\boldsymbol{S}^{1/2}(\boldsymbol{S}^{1/2})^{\mathrm{T}} = \boldsymbol{S}$ and $\boldsymbol{M} \in \mathbb{R}^{p \times k}$, then the random matrix $\boldsymbol{\Lambda}$ has a matrix-$t$ distribution, written $\boldsymbol{\Lambda} \sim \mathrm{t}_{p,k}(\varsigma, \boldsymbol{M}, \boldsymbol{\Phi}, \boldsymbol{\Psi})$. Again, one can fix the scale of either the among-row scale matrix $\boldsymbol{\Phi} \in \mathcal{S}_p^{+}$ or the among-column scale matrix $\boldsymbol{\Psi} \in \mathcal{S}_k^{+}$ without loss of generality.
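
Construction (7) translates directly into a sampler for the matrix-$t$ distribution. The sketch below, with illustrative dimensions and degrees of freedom and scipy's Wishart sampler, takes a lower Cholesky factor as the square root $\boldsymbol{S}^{1/2}$ and checks the expectation of $\boldsymbol{\Delta}$ stated in the next paragraph by Monte Carlo.

```python
import numpy as np
from scipy.stats import wishart

def rmatt(rng, nu, M, Phi, Psi):
    """Draw Lambda ~ t_{p,k}(nu, M, Phi, Psi) via the construction in (7)."""
    p, k = M.shape
    S = wishart(df=nu + p - 1, scale=np.linalg.inv(Phi)).rvs(random_state=rng)
    L = np.linalg.cholesky(S)                  # L L^T = S, so take S^{1/2} = L
    X = rng.normal(size=(p, k)) @ np.linalg.cholesky(Psi).T  # N_{p,k}(0, I_p, Psi)
    return np.linalg.solve(L.T, X) + M         # (S^{-1/2})^T X + M

rng = np.random.default_rng(5)
p, k, nu = 4, 2, 8.0
draws = np.stack([rmatt(rng, nu, np.zeros((p, k)), np.eye(p), np.eye(k))
                  for _ in range(10_000)])

# Monte Carlo check: E(Delta) = tr(Psi) Phi_breve/(nu - 2) = k/(nu - 2) I_p here
ED = np.einsum('sik,sjk->ij', draws, draws) / len(draws)
print(np.abs(ED - k / (nu - 2) * np.eye(p)).max())   # small
```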

Suppose that we adopt $\boldsymbol{\Lambda} \sim \mathrm{t}_{p,k}(\varsigma, \boldsymbol{0}, \boldsymbol{\breve{\Phi}}, \boldsymbol{\Psi})$ as a prior for the factor loadings matrix. Expressions for the means, variances and covariances of $\boldsymbol{\Delta} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}$ are derived in the Supplementary Materials. In particular, we show that $E(\boldsymbol{\Delta}) = \mathrm{tr}(\boldsymbol{\Psi})\boldsymbol{\Phi}$ for $\varsigma > 2$, where $\boldsymbol{\Phi} = \boldsymbol{\breve{\Phi}}/(\varsigma - 2)$, which is identical to the expression derived under the matrix normal prior. We also show that for a fixed value of $E(\boldsymbol{\Delta})$, the variance,

$$\mathrm{Var}(\delta_{ij}) = \frac{\left\{\mathrm{tr}(\boldsymbol{\Psi})^2 + (\varsigma - 2)\,\mathrm{tr}(\boldsymbol{\Psi}^2)\right\}\left\{\varsigma\phi_{ij}^2 + (\varsigma - 2)\phi_{ii}\phi_{jj}\right\}}{(\varsigma - 1)(\varsigma - 4)} \quad \text{for } \varsigma > 4,$$

can be increased or decreased by decreasing or increasing $\varsigma$, respectively. Therefore, by treating $\varsigma > 4$ as an unknown and assigning it a prior, we can allow the data to influence the degree of shrinkage of the shared variation matrix towards the closest rank $k$ matrix to its mean. Writing $\check{\varsigma} = 1/(\varsigma - 4) \in \mathbb{R}^{+}$, a matrix normal distribution for $\boldsymbol{\Lambda}$ is recovered as $\check{\varsigma} \to 0$. We can therefore allow controlled relaxation of the fixed shrinkage of the matrix normal prior by specifying a prior for $\check{\varsigma}$ with its mode at zero. To this end, we recommend an exponential distribution $\check{\varsigma} \sim \mathrm{Exp}(a_0)$. In the special case when $\boldsymbol{\Phi} = \boldsymbol{I}_p$ and $\boldsymbol{\Psi} = \psi\boldsymbol{I}_k$, we have $\mathrm{Var}(\delta_{ij}) = s_k(\check{\varsigma})E(\delta_{ij})$ where $s_k(\check{\varsigma}) = (1 + 2\check{\varsigma})\{1 + (2 + k)\check{\varsigma}\}/(1 + 3\check{\varsigma})$ is an increasing function of $\check{\varsigma}$ which takes its minimum value of 1 when $\check{\varsigma} = 0$. By trial and improvement, we can therefore use the quantiles of the $\mathrm{Exp}(a_0)$ distribution to choose a value for $a_0$ which is consistent with our prior beliefs about a chosen quantile in the distribution for the scale factor $s_k(\check{\varsigma})$ for various $k$. This is illustrated in the application in Section 6.2.
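
The quantile-matching strategy for choosing $a_0$ can be automated. In the sketch below the target quantile and value are hypothetical; the root-finding exploits the facts that $s_k$ is increasing and that the $q$-quantile of $\mathrm{Exp}(a_0)$ is $-\log(1 - q)/a_0$.

```python
import numpy as np
from scipy.optimize import brentq

def s_k(v: float, k: int) -> float:
    """Scale factor s_k(varsigma_check) = (1 + 2v){1 + (2 + k)v}/(1 + 3v);
    increasing in v, with s_k(0) = 1 recovering the matrix normal case."""
    return (1 + 2 * v) * (1 + (2 + k) * v) / (1 + 3 * v)

def choose_a0(k: int, q: float, target: float) -> float:
    """Find a0 such that the q-quantile of s_k(varsigma_check), with
    varsigma_check ~ Exp(a0), equals the target value."""
    return brentq(lambda a0: s_k(-np.log(1 - q) / a0, k) - target, 1e-8, 1e8)

# e.g. choose a0 so that, a priori, P{s_k(varsigma_check) <= 2} = 0.9 when k = 5
print(choose_a0(k=5, q=0.9, target=2.0))
```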

3.4 Model and prior for $\boldsymbol{\Phi}$

3.4.1 General form

Under either a matrix normal or matrix-$t$ prior for $\boldsymbol{\Lambda}$, $E(\boldsymbol{\Delta}|\boldsymbol{\Phi}, \boldsymbol{\Psi}) = \mathrm{tr}(\boldsymbol{\Psi})\boldsymbol{\Phi}$. As explained in Section 3.2, since we choose to fix $\mathrm{tr}(\boldsymbol{\Phi}) = p$ or $\mathrm{tr}(\boldsymbol{\Phi}^{-1}) = p$, we can therefore interpret $\boldsymbol{\Phi}$ as a standardised version of the prior expectation of $\boldsymbol{\Delta}$. For most of the parametric forms described in this section, it will be possible to factorise $\boldsymbol{\Phi}$ or $\boldsymbol{\Xi} = \boldsymbol{\Phi}^{-1}$ as $\boldsymbol{\Phi} = \boldsymbol{R}(\boldsymbol{\vartheta})$ or $\boldsymbol{\Xi} = \boldsymbol{R}(\boldsymbol{\vartheta})$, respectively, where $\boldsymbol{R}(\cdot)$ yields a positive definite matrix with 1s on the diagonal. The vector $\boldsymbol{\vartheta}$ typically contains a small number of unknown hyperparameters to which we assign a prior.

3.4.2 Specific examples

The simplest structure for $\boldsymbol{\Phi}$ would take $\boldsymbol{\Phi} = \boldsymbol{I}_p$, giving $E(\boldsymbol{\Delta}|\boldsymbol{\Phi}, \boldsymbol{\Psi}) = \mathrm{tr}(\boldsymbol{\Psi})\boldsymbol{I}_p$. In the matrix normal case when, additionally, $\boldsymbol{\Psi} = \psi\boldsymbol{I}_k$, we obtain the order-invariant prior presented in Leung and Drton (2016) for the identified factor loadings matrix, in which $\tilde{\lambda}_{ij} \sim \mathrm{N}(0, \psi)$ for $i \neq j$ and $\tilde{\lambda}_{ii}^2 \sim \mathrm{Gam}\{(k - i + 1)/2, 1/(2\psi)\}$ for $i = 1, \ldots, k$. Although taking $\boldsymbol{\Phi} = \boldsymbol{I}_p$ will give an order-invariant prior for $\boldsymbol{\Delta}$, a more general distribution that is exchangeable with respect to the order of the variables in $\boldsymbol{y}_i$ arises by taking $\boldsymbol{\Phi}$ to be a two-parameter exchangeable matrix. Since we constrain $\mathrm{tr}(\boldsymbol{\Phi}) = p$, this yields $\boldsymbol{\Phi} = (1 - \vartheta)\boldsymbol{I}_p + \vartheta\boldsymbol{J}_p$, where $-1/(p - 1) < \vartheta < 1$, $\boldsymbol{J}_p = \boldsymbol{1}_p\boldsymbol{1}_p^{\mathrm{T}}$ and $\boldsymbol{1}_p$ is a $p$-vector of 1s. This is the most general form for a $p \times p$ symmetric, positive definite matrix with $\mathrm{tr}(\boldsymbol{\Phi}) = p$ which is invariant under a common permutation of the rows and columns.
If a modeller has nothing in their prior beliefs to distinguish among the $p$ elements of $\boldsymbol{y}_i$, the advantage of this specification over the order-invariant prior of Leung and Drton (2016) is that it allows shrinkage of the $p(p - 1)/2$ off-diagonal elements of the shared variation matrix $\boldsymbol{\Delta}$ towards non-zero values. This can be particularly helpful when the number of observations is small relative to the dimension of $\boldsymbol{y}_i$.

In confirmatory factor analysis, structural zeros are often introduced in the factor loadings matrix based on prior beliefs about the relationships between the $k$ factors and the variables in $\boldsymbol{y}_i$. For example, in the prototypical two-factor model, the first $p_1$ variables load only on factor 1 whilst the remaining $p - p_1$ variables load only on factor 2. This would correspond to a block diagonal shared variation matrix $\boldsymbol{\Delta}$. Rather than imposing these beliefs about independent group structure as a model constraint, one could instead encourage shrinkage towards the corresponding structure through the prior. With appropriate ordering of the variables, this could be achieved by taking $\boldsymbol{\Phi} = \mathrm{blockdiag}(\boldsymbol{\Phi}_1, \ldots, \boldsymbol{\Phi}_k)$, in which each block $\boldsymbol{\Phi}_i$ is a two-parameter exchangeable matrix, as described above.
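To make these constructions concrete, the following is a minimal R sketch of the two forms of $\boldsymbol{\Phi}$ described above; the function name and the illustrative group sizes are ours, not part of the methodology.

```r
# Two-parameter exchangeable matrix with trace p:
# Phi = (1 - theta) I_p + theta J_p, valid for -1/(p - 1) < theta < 1
make_exch <- function(p, theta) {
  stopifnot(theta > -1 / (p - 1), theta < 1)
  (1 - theta) * diag(p) + theta * matrix(1, p, p)
}

# Block-diagonal Phi encouraging shrinkage towards an independent
# two-group structure (illustrative sizes p1 and p - p1)
p <- 6; p1 <- 3; theta <- 0.4
Phi <- matrix(0, p, p)
Phi[1:p1, 1:p1] <- make_exch(p1, theta)
Phi[(p1 + 1):p, (p1 + 1):p] <- make_exch(p - p1, theta)
all(eigen(Phi, symmetric = TRUE)$values > 0)  # positive definite: TRUE
```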

Beyond these simple examples, there are many fields of statistics in which parametric forms are adopted for correlation matrices or their inverses. Typically, they are intended to capture the idea that observations from variables that are closer together, often in time or space, tend to be more strongly (positively) correlated. For instance, the precision matrix for a vector of longitudinal measurements might be assumed to be that of a stationary autoregressive process of order one. We could centre the prior for $\boldsymbol{\Delta}$ on a matrix of this form by taking $\boldsymbol{\Xi} = (\xi_{ij})$ to be a tridiagonal matrix with $\xi_{i,i+1} = \xi_{i+1,i} = -\vartheta$ for $i=1,\ldots,p-1$, $\xi_{ii} = 1 + \vartheta^2$ for $i=2,\ldots,p-1$ and $\xi_{11} = \xi_{pp} = 1$, where $\vartheta \in (-1,1)$.
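For illustration, a short R sketch of this tridiagonal centring matrix; inverting it recovers, up to scale, the $\vartheta^{|j-k|}$ correlations of a stationary AR(1) process (the function name is ours).

```r
# Tridiagonal AR(1)-type precision structure Xi = (xi_ij)
make_ar1_prec <- function(p, theta) {
  Xi <- diag(c(1, rep(1 + theta^2, p - 2), 1))
  for (i in 1:(p - 1)) Xi[i, i + 1] <- Xi[i + 1, i] <- -theta
  Xi
}

cov2cor(solve(make_ar1_prec(5, 0.6)))  # entries theta^|j - k|
```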

In spatial statistics, it is very common to use a spatial correlation function, such as one from the Matérn class. A conventional spatial correlation function will depend only on the distance, and possibly direction, between sites. However, Schmidt et al. (2011) introduced the idea of recording distances in a space of dimension greater than two, by introducing covariates into the correlation function. A special case, termed the projection model, measures the distance between sites $j$ and $k$ in this augmented space using the Mahalanobis metric with respect to an (unknown) $c \times c$ symmetric, positive definite matrix $\boldsymbol{\Theta}$, that is,

\[
d_{C,jk} = \surd\{(\boldsymbol{x}_j - \boldsymbol{x}_k)^{\mathrm{T}} \boldsymbol{\Theta} (\boldsymbol{x}_j - \boldsymbol{x}_k)\}. \tag{8}
\]

Here $\boldsymbol{x}_j$ is a vector of coordinates for site $j$, comprising its two spatial coordinates and the values of $c-2$ meta-covariates. This is then combined with an exponential spatial correlation function to give $\phi_{jk} = \exp(-d_{C,jk})$. Omitting the spatial coordinates, these ideas can be extended to non-spatial settings if there are meta-covariates associated with the $p$ variables that are thought to influence the relationships between deviations from the mean, $(y_{ij} - \mu_j)$ and $(y_{ik} - \mu_k)$, in any vector measurement. Indeed, any Gaussian process correlation matrix could be used for $\boldsymbol{\Phi}$, offering huge scope for the incorporation of meaningful domain expertise.

More generally, if the distances between variables can be measured by some metric, say $d_{jk}$, this could be used to structure the prior through the specification $\phi_{jk} = \exp(-d_{jk}/\vartheta)$. An ecological example is provided in Section 6.2, which uses phylogenetic distances between species in a joint model for species occurrence.
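The following R sketch illustrates both constructions: the projection-model distance (8) under an assumed matrix $\boldsymbol{\Theta}$, and the generic map from a distance matrix, for example one of phylogenetic distances, to an exponential correlation matrix. All names are ours.

```r
# Mahalanobis-type distance (8) between rows of a p x c coordinate
# matrix X, with respect to a symmetric positive definite matrix Theta
proj_dist <- function(X, Theta) {
  p <- nrow(X)
  D <- matrix(0, p, p)
  for (j in 1:p) for (k in 1:p) {
    v <- X[j, ] - X[k, ]
    D[j, k] <- sqrt(drop(t(v) %*% Theta %*% v))
  }
  D
}

# Generic exponential correlation from any distance matrix D,
# with length-scale parameter theta > 0
dist_to_phi <- function(D, theta = 1) exp(-D / theta)
```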

3.5 Model and prior for $\boldsymbol{\Psi}$

Except in confirmatory factor analysis, an appropriate number of factors is not usually known a priori. Various methods have been proposed in the literature to address this source of uncertainty. These include techniques to approximate the marginal likelihood of models with different numbers of factors, such as path sampling (Dutta and Ghosh, 2013) and Chib's method (Chib et al., 2006). However, these methods are very computationally expensive, requiring separate Monte Carlo simulations for models with $k = 1, \ldots, H$ factors, where $H$ is the maximum number of factors that a modeller is prepared to entertain. Lopes and West (2004) propose a reversible jump MCMC algorithm, but the proposals for transdimensional moves require pilot MCMC runs for models with $k = 1, \ldots, H$ factors, which rapidly becomes untenable for high-dimensional problems. Other methods that attempt to allow simultaneous inference of the number of factors and the model parameters rely on identification of individual zeros in the factor loadings matrix (Frühwirth-Schnatter and Lopes, 2010; Conti et al., 2014; Frühwirth-Schnatter et al., 2023). The approach taken here is similar in spirit, relying on the removal of columns whose contribution to $\boldsymbol{\Omega}$ is negligible.

Due to the challenges in approximating the posterior for $k$, a recent approach that has become popular relies conceptually on an overfitted factor model (Bhattacharya and Dunson, 2011; Legramanti et al., 2020; Schiavon et al., 2022). In theory, infinitely many factors are permitted, but cumulative shrinkage priors are used to increasingly shrink columns of the factor loadings matrix towards zero as the column index increases. This allows the infinite factor loadings matrix to be approximated by a finite matrix, omitting columns that have been shrunk close enough to zero. The posterior for the number of non-discarded columns $k^\ast$ is then used as a proxy for the posterior for $k$. As such, a single MCMC run yields information about both the model parameters and the number of latent factors.

A popular increasing shrinkage prior is the multiplicative gamma process (MGP) (Bhattacharya and Dunson, 2011). Conditional on a set of hyperparameters, the factor loadings within each column are assumed to be independent normal random variables with zero mean and a column-specific precision parameter. The precisions are a cumulative product of gamma random variables and their prior expectation increases with the column index. A factor is deemed inactive a posteriori if the contribution of its loadings to $\boldsymbol{\Omega}$ is negligible under a chosen truncation criterion. In order to apply these ideas to our structured prior for the shared variation matrix, we take the among-column scale matrix $\boldsymbol{\Psi}$ to be diagonal and assign the reciprocals of its elements an MGP prior

\[
\psi_h^{-1} = \prod_{\ell=1}^{h} \varrho_\ell, \quad \varrho_1 \sim \mathrm{Gam}(a_1, 1), \quad \varrho_\ell \sim \mathrm{Gam}(a_2, 1), \quad \ell \geq 2, \tag{9}
\]

in which the $\varrho_\ell$ are independent. To ensure identifiability of the shared variation matrix $\boldsymbol{\Delta} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}$ and the idiosyncratic variances $\boldsymbol{\Sigma}$ in the likelihood, we choose to truncate the prior at $H \leq \lceil\varphi(p)\rceil - 1$, where $\varphi(p)$ was defined in (4). Guidelines on the choice of $a_1$ and $a_2$ to ensure the $1/\psi_h$ are stochastically increasing in a reasonable neighbourhood of zero are provided in Durante (2017).
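A draw from the truncated prior (9) is straightforward, since each $\psi_h^{-1}$ is a cumulative product of independent gamma variates. A minimal R sketch, using the hyperparameter values $a_1 = 1$, $a_2 = 2$ adopted in Section 6 (the function name is ours):

```r
# Column scales psi_1, ..., psi_H under the truncated MGP prior (9);
# with a2 > 1 the precisions 1/psi_h are stochastically increasing in h
draw_mgp_psi <- function(H, a1 = 1, a2 = 2) {
  rho <- c(rgamma(1, shape = a1, rate = 1),
           rgamma(H - 1, shape = a2, rate = 1))
  1 / cumprod(rho)
}

set.seed(1); draw_mgp_psi(H = 5)
```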

4 Stationary dynamic factor models

4.1 Bayesian dynamic factor models

In the classic vector autoregressive (VAR) model, the number of parameters grows quadratically with the dimension of the observation vector. In contrast, the number of parameters in a dynamic factor model grows quadratically only with the dimension of the latent factors. This makes them a valuable tool for dimension reduction in multivariate time series analysis. A dynamic factor model has the same observation equation as a static model,

\[
\boldsymbol{y}_t = \boldsymbol{\mu} + \boldsymbol{\Lambda}\boldsymbol{\eta}_t + \boldsymbol{\epsilon}_t, \qquad \boldsymbol{\epsilon}_t \sim \mathrm{N}_p(\boldsymbol{0}, \boldsymbol{\Sigma}). \tag{10}
\]

However, the factors in a dynamic model evolve over time according to an order $m$ vector autoregression, VAR($m$). This yields the state equation

\[
\boldsymbol{\eta}_t = \boldsymbol{\Gamma}_1\boldsymbol{\eta}_{t-1} + \cdots + \boldsymbol{\Gamma}_m\boldsymbol{\eta}_{t-m} + \boldsymbol{\zeta}_t, \qquad \boldsymbol{\zeta}_t \sim \mathrm{N}_k(\boldsymbol{0}, \boldsymbol{\Pi}), \tag{11}
\]

for $t = 1, 2, \ldots$, in which the observation and factor innovations $\boldsymbol{\epsilon}_t$ and $\boldsymbol{\zeta}_t$ are serially and mutually uncorrelated. As in the static case, it is generally assumed that the common factors explain all of the shared variation and so $\boldsymbol{\Sigma}$ is taken to be diagonal. Various extensions of this model have been presented: for example, Peña and Poncela (2006) incorporate a moving average component in the state equation, Aßmann et al. (2016) introduce lagged factors in (10), and Kaufmann and Schumacher (2019) allow the components of the idiosyncratic innovations to evolve independently as univariate stationary autoregressions.

4.2 A stationary dynamic factor model

The methodology described in this paper relies on the use of a structured prior distribution for the shared variation component of the marginal variance. The marginal variance of a dynamic factor model is only meaningfully defined when the factors evolve according to a stationary process; hereafter, we focus only on situations where this is a reasonable assumption. The manner in which stationarity is enforced is discussed in Section 4.3.

In order to identify the scale of the loadings matrix in a dynamic factor model, the scale of the factors is normally constrained by fixing $\boldsymbol{\Pi} = \mathrm{Var}(\boldsymbol{\zeta}_t) = \boldsymbol{I}_k$. However, to retain the interpretation of $\boldsymbol{\Delta} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}$ as a shared variation matrix, we instead constrain the stationary covariance matrix so that $\boldsymbol{G}_0 = \mathrm{Var}(\boldsymbol{\eta}_t) = \boldsymbol{I}_k$ and hence the marginal variance of $\boldsymbol{y}_t$ remains $\boldsymbol{\Omega} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}} + \boldsymbol{\Sigma}$. We can therefore assign a prior to the unconstrained factor loadings matrix $\boldsymbol{\Lambda}$ using precisely the ideas discussed in Section 3 for the static model.

To complete specification of the stationary dynamic factor model, the state equation (11) must be initialised at its stationary distribution. To this end, we augment the parameter space with $m$ auxiliary factors at times $t = 1-m, \ldots, 0$. Generalising the definition of the stationary variance $\boldsymbol{G}_0$ above, we define $\boldsymbol{G}_i = \mathrm{Cov}(\boldsymbol{\eta}_t, \boldsymbol{\eta}_{t+i})$ as the $i$th autocovariance of $\boldsymbol{\eta}_t$. The initial distribution can then be expressed as $(\boldsymbol{\eta}_{1-m}^{\mathrm{T}}, \ldots, \boldsymbol{\eta}_0^{\mathrm{T}})^{\mathrm{T}} \sim \mathrm{N}_{km}(\boldsymbol{0}, \boldsymbol{G})$, where $\boldsymbol{G}$ is a positive definite block Toeplitz matrix with $\boldsymbol{G}_{j-i}$ as the block in rows $\{k(i-1)+1\}$ to $ki$ and columns $\{k(j-1)+1\}$ to $kj$ ($i,j = 1, \ldots, m$), and $\boldsymbol{G}_{-\ell} = \boldsymbol{G}_\ell^{\mathrm{T}}$ ($\ell = 1, \ldots, m-1$).
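Assembling $\boldsymbol{G}$ from the autocovariances is mechanical; a minimal R sketch, where the function name and the convention that Glist[[l + 1]] holds $\boldsymbol{G}_l$ are ours:

```r
# Build the km x km block Toeplitz matrix G with block G_{j-i} in
# block-row i and block-column j, using G_{-l} = t(G_l)
block_toeplitz <- function(Glist) {
  m <- length(Glist); k <- nrow(Glist[[1]])
  G <- matrix(0, k * m, k * m)
  for (i in 1:m) for (j in 1:m) {
    B <- if (j >= i) Glist[[j - i + 1]] else t(Glist[[i - j + 1]])
    G[((i - 1) * k + 1):(i * k), ((j - 1) * k + 1):(j * k)] <- B
  }
  G
}
```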

4.3 Enforcing the stationarity constraint

Denote by $B$ the backshift operator, $B\boldsymbol{\eta}_t = \boldsymbol{\eta}_{t-1}$. The VAR model governing evolution of the factors can then be written as $\boldsymbol{\zeta}_t = (\boldsymbol{I}_k - \boldsymbol{\Gamma}_1 B - \cdots - \boldsymbol{\Gamma}_m B^m)\boldsymbol{\eta}_t = \boldsymbol{\Gamma}(B)\boldsymbol{\eta}_t$, where $\boldsymbol{\Gamma}_i \in \mathbb{R}^{k \times k}$ for $i = 1, \ldots, m$ and $\boldsymbol{\Gamma}(u) = \boldsymbol{I}_k - \boldsymbol{\Gamma}_1 u - \cdots - \boldsymbol{\Gamma}_m u^m$, $u \in \mathbb{C}$, is termed the characteristic polynomial. The process is stable if and only if all the roots of $\det\{\boldsymbol{\Gamma}(u)\} = 0$ lie outside the unit circle. Because all stable processes are stationary and unstable stationary processes are not generally of interest, this subset of the Cartesian product space in which the $\boldsymbol{\Gamma}_i$ lie is often referred to as the stationary region, $\mathcal{C}_{m,k}$. Even for the simplest vector case of an order-1 bivariate autoregression, it has a highly complex geometry.
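Membership of $\mathcal{C}_{m,k}$ can be checked numerically through the companion form of the VAR: the roots of $\det\{\boldsymbol{\Gamma}(u)\} = 0$ lie outside the unit circle exactly when the $km \times km$ companion matrix has spectral radius less than one. A short R sketch (the function name is ours):

```r
# Gammas: list of the k x k coefficient matrices Gamma_1, ..., Gamma_m
is_stable <- function(Gammas) {
  m <- length(Gammas); k <- nrow(Gammas[[1]])
  C <- matrix(0, k * m, k * m)
  C[1:k, ] <- do.call(cbind, Gammas)                            # top block row
  if (m > 1) C[(k + 1):(k * m), 1:(k * (m - 1))] <- diag(k * (m - 1))
  all(Mod(eigen(C, only.values = TRUE)$values) < 1)
}
```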

Various authors have considered Bayesian inference for stationary dynamic factor models (e.g. Sáfadi and Peña, 2008; Lopes et al., 2008). However, due to the difficulties of designing efficient MCMC samplers with state space constrained to $\mathcal{C}_{m,k}$, the autoregressive coefficient matrices are often assumed to be diagonal, which simplifies the stationarity condition to that of a univariate autoregression. Fortunately, recent work by Heaps (2023) presents a prior for the autoregressive coefficient matrices that is constrained to $\mathcal{C}_{m,k}$ and facilitates routine computational inference. This is based on an unconstrained reparameterisation of the model through two bijective transformations. First, the model parameters $\{(\boldsymbol{\Gamma}_1, \ldots, \boldsymbol{\Gamma}_m), \boldsymbol{\Pi}\}$ undergo a recursive mapping which yields a new parameter set $\{(\boldsymbol{P}_1, \ldots, \boldsymbol{P}_m), \boldsymbol{\Pi}\}$. Here $\boldsymbol{P}_{i+1}$ is the $(i+1)$th partial autocorrelation matrix, that is, a standardised version of the conditional cross-covariance between $\boldsymbol{\eta}_{t+1}$ and $\boldsymbol{\eta}_{t-i}$ given the $i$ intervening variables $(\boldsymbol{\eta}_t, \ldots, \boldsymbol{\eta}_{t-i+1})$. Each partial autocorrelation matrix lies in the space of $k \times k$ square matrices with singular values less than one.
By mapping the singular values of $\boldsymbol{P}_i$ from $[0,1)$ to the positive real line, a second mapping then constructs an unconstrained $k \times k$ square matrix $\boldsymbol{A}_i$ through $\boldsymbol{A}_i = (\boldsymbol{I}_k - \boldsymbol{P}_i\boldsymbol{P}_i^{\mathrm{T}})^{-1/2}\boldsymbol{P}_i$, $i = 1, \ldots, m$, in which $\boldsymbol{X}^{-1/2}$ denotes the inverse of the symmetric matrix square root of $\boldsymbol{X}$. Specification of a prior for the $\boldsymbol{A}_i$, discussed in Section 4.4, and computational inference are now routine owing to the Euclidean geometry of the parameter space.

The inverse mapping from the intermediate reparameterisation $\{(\boldsymbol{P}_1, \ldots, \boldsymbol{P}_m), \boldsymbol{\Pi}\}$ to the original parameters $\{(\boldsymbol{\Gamma}_1, \ldots, \boldsymbol{\Gamma}_m), \boldsymbol{\Pi}\}$ involves two recursions. The first outputs the stationary covariance matrix $\boldsymbol{G}_0$ from $\{(\boldsymbol{P}_1, \ldots, \boldsymbol{P}_m), \boldsymbol{\Pi}\}$ and the second outputs $(\boldsymbol{\Gamma}_1, \ldots, \boldsymbol{\Gamma}_m)$ (and recovers $\boldsymbol{\Pi}$) from $(\boldsymbol{P}_1, \ldots, \boldsymbol{P}_m)$ and $\boldsymbol{G}_0$. It is therefore trivial to modify the reparameterisation to accommodate the constraint that $\boldsymbol{G}_0 = \boldsymbol{I}_k$; one simply omits the first recursion and calculates $\{(\boldsymbol{\Gamma}_1, \ldots, \boldsymbol{\Gamma}_m), \boldsymbol{\Pi}\}$ from $(\boldsymbol{P}_1, \ldots, \boldsymbol{P}_m)$ and $\boldsymbol{G}_0 = \boldsymbol{I}_k$ in the second recursion. We note that the autocovariance matrices $\boldsymbol{G}_1, \ldots, \boldsymbol{G}_{m-1}$ needed to characterise the initial distribution of $\boldsymbol{\eta}_{(1-m):0}$ are also by-products of this second recursion.

When $\boldsymbol{\eta}_t$ follows a VAR(1) process, there is a closed-form expression for the original model parameters in terms of $\boldsymbol{P}_1$ or, equivalently, $\boldsymbol{A}_1$. Dropping the 1-subscript for brevity, we can write $\boldsymbol{\Gamma} = \boldsymbol{P} = (\boldsymbol{I}_k + \boldsymbol{A}\boldsymbol{A}^{\mathrm{T}})^{-1/2}\boldsymbol{A}$ and $\boldsymbol{\Pi} = \boldsymbol{I}_k - \boldsymbol{P}\boldsymbol{P}^{\mathrm{T}} = (\boldsymbol{I}_k + \boldsymbol{A}\boldsymbol{A}^{\mathrm{T}})^{-1}$.
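As a check on this closed form, a short R sketch mapping an arbitrary real matrix $\boldsymbol{A}$ to a stationary pair $(\boldsymbol{\Gamma}, \boldsymbol{\Pi})$ (the function name is ours):

```r
A_to_var1 <- function(A) {
  k <- nrow(A)
  M <- diag(k) + A %*% t(A)                   # I_k + A A^T
  e <- eigen(M, symmetric = TRUE)
  M_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
  P <- M_inv_sqrt %*% A                       # Gamma = P, singular values < 1
  list(Gamma = P, Pi = solve(M))              # Pi = I_k - P P^T = M^{-1}
}

out <- A_to_var1(matrix(rnorm(9), 3, 3))
max(Mod(eigen(out$Gamma)$values))             # < 1, so the VAR(1) is stable
```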

4.4 Prior for the transformed partial autocorrelation matrices

The stationary dynamic factor model described in the previous sections is invariant under rotations of the factors when compensatory transformations are applied to the factor loadings and the parameters of the state equation. In particular, marginalising over the common factors, it can readily be shown that the likelihood evaluated at $(\boldsymbol{\mu}, \boldsymbol{\Lambda}, \boldsymbol{\Sigma}, \boldsymbol{A}_1, \ldots, \boldsymbol{A}_m)$ is the same as the likelihood evaluated at $(\boldsymbol{\mu}, \boldsymbol{\Lambda}\boldsymbol{Q}^{\mathrm{T}}, \boldsymbol{\Sigma}, \boldsymbol{Q}\boldsymbol{A}_1\boldsymbol{Q}^{\mathrm{T}}, \ldots, \boldsymbol{Q}\boldsymbol{A}_m\boldsymbol{Q}^{\mathrm{T}})$ for any orthogonal matrix $\boldsymbol{Q} \in \mathcal{O}(k)$. In the absence of any meaningful information to distinguish $\boldsymbol{A}_i$ from $\boldsymbol{Q}\boldsymbol{A}_i\boldsymbol{Q}^{\mathrm{T}}$ a priori, we assign a prior which is rotatable, that is, $\boldsymbol{A}_i$ and $\boldsymbol{Q}\boldsymbol{A}_i\boldsymbol{Q}^{\mathrm{T}}$ have the same distribution for any orthogonal matrix $\boldsymbol{Q}$. To this end, we take the $\boldsymbol{A}_i$ to be independent with $\boldsymbol{A}_i \sim \mathrm{N}_{k,k}(\boldsymbol{0}, \boldsymbol{I}_k, \boldsymbol{I}_k)$, $i = 1, \ldots, m$.

5 Posterior computation

5.1 Samplers with fixed truncation level

Consider first the static factor model. For computational inference, we use MCMC to sample from the posterior associated with our unconstrained parameterisation of the model. When the truncation level in the factor loadings matrix is fixed at $H$, we propose a straightforward Gibbs sampler. The full conditional distributions for most unknowns have standard forms and can be sampled directly. However, depending on the structure chosen for the among-row scale matrix $\boldsymbol{\Phi} = \boldsymbol{R}(\boldsymbol{\vartheta})$, a semi-conjugate prior for the correlation parameter(s) in $\boldsymbol{\vartheta}$ is not generally available and so we update $\boldsymbol{\vartheta}$ in a Metropolis-Hastings step. Computational inference is slightly complicated if a matrix-$t$, rather than matrix normal, prior is adopted for $\boldsymbol{\Lambda}$ because it is not conjugate to the likelihood. However, sampling becomes straightforward if we make use of the representation of a matrix-$t$ random variable in (7) and augment the state space of the sampler with the matrix $\boldsymbol{S}$. The full conditional distribution for $\boldsymbol{\Lambda}$ is then matrix normal while the full conditional distribution for $\boldsymbol{S}$ is Wishart. The only other additional unknown is the degrees-of-freedom parameter $\varsigma$. If $\varsigma$ is assigned the prior from Section 3.3, its full conditional distribution is non-standard. Rather than updating the transformed degrees-of-freedom parameter $\check{\varsigma}$ through its own Metropolis-Hastings step, mixing can be greatly improved by performing a joint update of $\boldsymbol{S}$ and $\check{\varsigma}$. The proposal density takes the form $q(\boldsymbol{S}^\ast, \check{\varsigma}^\ast | \boldsymbol{S}, \check{\varsigma}, \ldots) = q_1(\check{\varsigma}^\ast | \check{\varsigma})\, q_2(\boldsymbol{S}^\ast | \check{\varsigma}^\ast, \ldots)$, in which $q_1(\cdot)$ is a random walk proposal on the log scale and $q_2(\cdot)$ is the full conditional density for $\boldsymbol{S}$.

Extension of the Gibbs sampler to handle the dynamic factor model is straightforward. The latent factors $\boldsymbol{\eta}_{1-m}, \ldots, \boldsymbol{\eta}_n$ can be sampled efficiently from their full conditional distribution in a single block using a variety of sampling approaches (Wilkinson and Yeung, 2002, 2004). In the example in Section 6.3 we used the forward-filtering backward-sampling algorithm (e.g. Frühwirth-Schnatter, 1994), but other methods that rely on banded or sparse matrix algorithms will be more efficient in higher-dimensional applications; see, for example, Chan and Jeliazkov (2009). The full conditional distributions for the transformed partial autocorrelation matrices $\boldsymbol{A}_1, \ldots, \boldsymbol{A}_m$ are non-standard. Given a truncation level of $H$, each $\boldsymbol{A}_i$ contains $H^2$ parameters and so it is not generally feasible to update the whole matrix in a single Metropolis-Hastings step. We therefore propose dividing the elements of each $\boldsymbol{A}_i$ into blocks of length $b$ and updating the blocks one-at-a-time. In the application in Section 6.3, we found mixing to be greatly improved by using Metropolis-adjusted Langevin (MALA) steps rather than Gaussian random walks.

Given a large enough fixed truncation level $H \leq \lceil\varphi(p)\rceil - 1$, the MGP prior increasingly shrinks the factor loadings in columns with higher indices if the data suggest their likelihood contribution is negligible. At least conceptually, these columns can be discarded. If there are $m^{[i]}$ discardable columns on iteration $i$ of the MCMC sampler, the effective number of factors is defined as $k^{\ast[i]} = H - m^{[i]}$. The posterior for the effective number of factors can then be used as a proxy for the posterior for $k$. There are various truncation criteria for deciding which columns can be omitted. Bhattacharya and Dunson (2011) consider as inactive any factor whose loadings are all within $\epsilon$ of zero for some chosen threshold, say $\epsilon = 10^{-4}$. Schiavon and Canale (2020) suggest a more interpretable criterion which, unlike the choice of threshold $\epsilon$, is unaffected by the scale and dimension of the data. Denote by $\boldsymbol{\Lambda}_{k^\ast}$ the factor loadings matrix obtained by discarding the columns of $\boldsymbol{\Lambda}$ from $k^\ast + 1$ onwards. The basic idea is that if $\boldsymbol{\Omega}_{k^\ast} = \boldsymbol{\Lambda}_{k^\ast}\boldsymbol{\Lambda}_{k^\ast}^{\mathrm{T}} + \boldsymbol{\Sigma}$ can explain $100T\%$ of the total variability of the data for some $T \in (0,1)$, then one decides there are $k^\ast$ active factors. The choice of proportion $T$ then replaces the choice of threshold $\epsilon$. This is the truncation criterion we adopt in the applications in Section 6.
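Under this criterion, the effective number of factors at a given MCMC iteration can be computed directly from the sampled loadings and idiosyncratic variances. A minimal R sketch, where the function name is ours and Tprop plays the role of $T$ in the text:

```r
# Smallest k* such that tr(Lambda_k* Lambda_k*^T + Sigma) explains at
# least a proportion Tprop of tr(Lambda Lambda^T + Sigma)
effective_k <- function(Lambda, sigma2, Tprop = 0.95) {
  total  <- sum(Lambda^2) + sum(sigma2)
  shared <- cumsum(colSums(Lambda^2))  # traces of the truncated products
  which((shared + sum(sigma2)) / total >= Tprop)[1]
}
```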

When the truncation level $H$ is fixed, an alternative to Gibbs sampling is to implement off-the-shelf computational inference using standard probabilistic programming software. Code for the applications in Section 6, written in Stan (Carpenter et al., 2017), is given in the Supplementary Materials. Stan uses Hamiltonian Monte Carlo to perform joint updates of all unknowns and so tends to converge faster and mix better than a Gibbs sampler. The caveat is the necessity of fixing $H$; although this can be addressed in a bespoke implementation of Gibbs sampling (see Section 5.2 below), Stan cannot be customised in this way.

5.2 Adaptive Gibbs sampler

Especially in applications where $p$ is large, the number of effective factors is often considerably less than the Ledermann bound $\varphi(p)$. Therefore, fixing a truncation level $H$ which is less than, but in the vicinity of, $\varphi(p)$ can be computationally inefficient. In the literature on infinite factor models, a common pragmatic solution is to use an adaptive Gibbs sampler which tunes the truncation level $H$ as it proceeds. Adaptation occurs at iteration $i$ with probability $p(i) = \exp(\alpha_0 + \alpha_1 i)$, where $\alpha_0 \leq 0$ and $\alpha_1 < 0$, so that the probability of adaptation decreases over the course of the simulation. This is necessary to satisfy the diminishing adaptation condition of Roberts and Rosenthal (2007).

During an adaptation step, $k^{\ast[i]}$ is compared to the current value of $H$. The basic idea is to delete any inactive factors or to add an extra factor if all $H$ factors are active. In the static model, implementation is straightforward, as sketched below. If $k^{\ast[i]} < H$, $H$ is reduced to $k^{\ast[i]}$ and any inactive factors are deleted along with the corresponding columns of $\boldsymbol{\Lambda}$ and components of $\boldsymbol{\varrho}$ and $\boldsymbol{\Psi}$. If $k^{\ast[i]} = H$ and $H < \lceil\varphi(p)\rceil - 1$, $H$ is increased by 1, then an extra factor, column of factor loadings, and component of $\boldsymbol{\varrho}$ are sampled from their priors. In the dynamic model, the adaptation step is slightly more involved because of the additional complexity in the distribution of the latent factors. Full details of the adaptive Gibbs samplers for both static and dynamic models are given in the Supplementary Materials, along with R code for the applications in Section 6. We also provide details on how the posterior output can be post-processed to obtain samples under a parameterisation in which the factor loadings matrix is identified by the PLT constraint. This is extremely valuable for parameter interpretation and the assessment of MCMC diagnostics.
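In outline, an adaptation step in the static model might look as follows. This is a schematic sketch, not the full sampler from the Supplementary Materials; all names are ours.

```r
adapt_prob <- function(i, alpha0 = -1, alpha1 = -5e-4) exp(alpha0 + alpha1 * i)

# Inside the sampler, at iteration i, with H_max = ceiling(phi_p) - 1:
# if (runif(1) < adapt_prob(i)) {
#   k_star <- effective_k(Lambda, sigma2)
#   if (k_star < H) {              # drop inactive factors
#     keep <- 1:k_star
#     Lambda <- Lambda[, keep]; rho <- rho[keep]; eta <- eta[, keep]
#     H <- k_star
#   } else if (H < H_max) {        # all factors active: grow by one,
#     H <- H + 1                   # drawing the new column of Lambda,
#   }                              # element of rho and factor from their priors
# }
```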

6 Applications

6.1 Simulation experiment

6.1.1 Simulation settings

In order to investigate posterior sensitivity to the prior specification in the idealised setting in which we know the data were generated from a factor model with known shared variation matrix $\boldsymbol{\Delta}$, we carried out a series of simulation experiments. In the simplified setting in which $\boldsymbol{\Sigma} = \sigma^2\boldsymbol{I}_p$, denote by $\beta$ the proportion of the total variation in $\boldsymbol{\Omega} = \boldsymbol{\Delta} + \boldsymbol{\Sigma}$ that is explained by the common factors, that is, $\beta = \mathrm{tr}(\boldsymbol{\Delta})/\mathrm{tr}(\boldsymbol{\Omega})$. Then for given $\beta$ and $\boldsymbol{\Delta}$, since $\mathrm{tr}(\boldsymbol{\Sigma}) = p\sigma^2$, the corresponding value of $\sigma^2$ can be computed as $\sigma^2 = (1-\beta)\,\mathrm{tr}(\boldsymbol{\Delta})/(p\beta)$. Three different combinations of $p$ and $k$, representing different degrees of dimension reduction, were considered: $(p=24, k=6)$, $(p=48, k=9)$ and $(p=72, k=9)$, so that $k/p$ was equal to 25%, 18.75% and 12.5%, respectively. For each combination of $p$ and $k$, we also considered three values for $\beta$, $\beta \in \{0.9, 0.95, 0.99\}$, giving nine sets of values for $\{(p,k),\beta\}$ in total. We refer to these as simulation settings. Note that we only set $\boldsymbol{\Sigma} = \sigma^2\boldsymbol{I}_p$ for the purposes of simulating the data; for inference, we still allow $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$.

Under each simulation setting, we simulated 24 data sets of size $n = 50$ based on an exponential correlation matrix for $\boldsymbol{\Phi}$, taking $\boldsymbol{\Delta}$ to be the closest rank-$k$ matrix to $\boldsymbol{\Phi}$ (see Proposition 1). In the simulation of each data set, in order to fix the value of $\boldsymbol{\Delta}$, we simulated $p$ pairs of coordinates $\boldsymbol{x}_j = (x_{j1}, x_{j2})^{\mathrm{T}}$, $j = 1, \ldots, p$, by sampling uniformly at random from the unit square, and then set the expected shared variation matrix equal to $\boldsymbol{\Phi}$ with $(j,k)$ element $\phi_{jk} = \exp(-\|\boldsymbol{x}_j - \boldsymbol{x}_k\|_2/\vartheta)$, where the length-scale parameter was equal to

\[
\vartheta = \begin{cases} -1/\log(0.05), & \text{for data sets } 1, \ldots, 6; \\ -1/\log(0.10), & \text{for data sets } 7, \ldots, 12; \\ -1/\log(0.15), & \text{for data sets } 13, \ldots, 18; \\ -1/\log(0.20), & \text{for data sets } 19, \ldots, 24. \end{cases}
\]

For a distance of 1 in the unit square, the correlations therefore ranged from 0.05 to 0.20.
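For concreteness, a sketch in R of how one such data set could be generated, using the truncated eigendecomposition for the closest rank-$k$ matrix (consistent with Proposition 1 for a symmetric positive semi-definite target) and the expression for $\sigma^2$ above; the MASS package is assumed to be available.

```r
set.seed(1)
p <- 24; k <- 6; n <- 50; beta <- 0.99
theta <- -1 / log(0.05)                   # length-scale for data sets 1-6
x <- cbind(runif(p), runif(p))            # coordinates in the unit square
Phi <- exp(-as.matrix(dist(x)) / theta)   # exponential correlation matrix
e <- eigen(Phi, symmetric = TRUE)
Delta <- e$vectors[, 1:k] %*% diag(e$values[1:k]) %*% t(e$vectors[, 1:k])
sigma2 <- (1 - beta) * sum(diag(Delta)) / (p * beta)
y <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = Delta + sigma2 * diag(p))
```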

6.1.2 Prior distributions

Each data set generated under each simulation setting was used to update five prior distributions for $\boldsymbol{\Lambda}$. Three are structured matrix normal priors, with different assumptions about $\boldsymbol{\Phi}$, and two are widely used priors in factor analysis which allow inference on the number of factors:

SN (expcov.)

A structured matrix normal prior in which $\boldsymbol{\Phi}$ was taken to have the same form as that used to simulate the data, that is, an exponential correlation matrix. The length-scale parameter was taken to be unknown and assigned the distribution $\log(\vartheta) \sim \mathrm{N}(0,1)$.

SN (fixed)

A structured matrix normal prior in which $\boldsymbol{\Phi}$ was taken to have the same form as that used to simulate the data, that is, an exponential correlation matrix. The length-scale parameter was fixed at the value used to simulate the data.

SN (exch.)

A structured matrix normal prior in which $\boldsymbol{\Phi}$ was taken to have a different form to that used to simulate the data; specifically, we took $\boldsymbol{\Phi} = (1-\vartheta)\boldsymbol{I}_p + \vartheta\boldsymbol{J}_p$, $-1/(p-1) < \vartheta < 1$, representing a two-parameter exchangeable matrix with trace fixed at $p$. The correlation parameter was taken to be unknown and assigned the distribution $\mathrm{logit}(\check{\vartheta}) \sim \mathrm{N}(0,1)$, where $\check{\vartheta} = \{1 + (p-1)\vartheta\}/p$.

MGP

The original (exchangeable) multiplicative gamma process prior described in Bhattacharya and Dunson (2011), which is a global-local prior. We took $a_1 = 1$ and $a_2 = 2$ in the multiplicative gamma process for the global precision parameters and $\nu = 3$ in the distribution for the local precision parameters.

CUSP

A cumulative shrinkage process prior (Legramanti et al., 2020), which is a spike-and-slab prior. We took $a_\theta = b_\theta = 2$ in the distribution for the slab, $\theta_\infty = 0.05$ for the position of the spike and $\alpha = 5$ in the cumulative stick-breaking construction controlling the probability assigned to the spike.

For the three structured matrix normal priors, we chose $a_1 = 1$ and $a_2 = 2$ in the multiplicative gamma process for the diagonal elements of $\boldsymbol{\Psi}$. In all cases, the idiosyncratic variances were assigned the prior $1/\sigma_j^2 \sim \mathrm{Gam}(1, 0.3)$ independently for $j = 1, \ldots, p$.

6.1.3 Analysis and results

All analyses were run using an adaptive Gibbs sampler, allowing adaptation after 500 iterations and setting the parameters in the diminishing adaptation condition to $\alpha_0 = -1$ and $\alpha_1 = -5 \times 10^{-4}$. The chains were run for 10K iterations, following a burn-in of 10K, and thinned to retain every fifth sample to reduce computational overheads. For the three structured matrix normal priors and the original MGP prior, we applied the truncation criterion of Schiavon and Canale (2020), retaining as active enough factors to explain at least a proportion $T = 0.95$ of the total variation in the data.

Table 1: Summaries of the simulation experiment when $\beta = 0.99$: median and interquartile range (IQR), across the 24 data sets, of the posterior means $E\{d_G(\boldsymbol{\Omega}, \boldsymbol{\Omega}_0)\,|\,\boldsymbol{y}\}$ (first pair of columns) and $E(k^\ast\,|\,\boldsymbol{y})$ (second pair).

Prior           (p, k)     Median   IQR     Median   IQR
SN (expcov.)    (24, 6)    15.88    0.38    6.34     0.56
SN (fixed)      (24, 6)    15.78    0.39    6.30     0.89
SN (exch.)      (24, 6)    15.93    0.39    6.63     0.69
MGP             (24, 6)    16.02    0.45    6.27     0.86
CUSP            (24, 6)    16.25    0.36    3.13     0.57
SN (expcov.)    (48, 9)    21.25    0.58    7.54     1.95
SN (fixed)      (48, 9)    21.18    0.65    7.32     1.47
SN (exch.)      (48, 9)    22.20    1.04    5.96     2.02
MGP             (48, 9)    22.15    0.64    6.41     2.32
CUSP            (48, 9)    22.86    0.84    3.00     0.23
SN (expcov.)    (72, 9)    23.89    1.03    7.76     1.38
SN (fixed)      (72, 9)    23.69    1.37    7.85     1.55
SN (exch.)      (72, 9)    25.81    1.87    6.15     0.92
MGP             (72, 9)    24.99    0.94    7.21     1.81
CUSP            (72, 9)    27.89    1.07    3.00     0.00

The (squared) geodesic distance between two symmetric positive definite matrices $\boldsymbol{A} \in \mathcal{S}_p^+$ and $\boldsymbol{B} \in \mathcal{S}_p^+$ is defined as

\[
d_G(\boldsymbol{A}, \boldsymbol{B})^2 = \sum_{i=1}^{p} \left[\log\left\{\lambda_i\left(\boldsymbol{A}^{-1}\boldsymbol{B}\right)\right\}\right]^2
\]

in which $\lambda_i(\boldsymbol{X})$ denotes the $i$th eigenvalue of $\boldsymbol{X}$ (Lim and Sepulchre, 2019). For simulation experiments in which $\beta = 0.99$, Table 1 reports the median and interquartile range across the 24 data sets of the posterior mean of $d_G(\boldsymbol{\Omega}, \boldsymbol{\Omega}_0)$, where $\boldsymbol{\Omega}_0$ denotes the value used to simulate the data. Corresponding tables for the simulations where $\beta = 0.95$ and $\beta = 0.9$ are presented in the Supplementary Materials. We chose the distance metric $d_G(\cdot,\cdot)$ because, by construction, $\boldsymbol{\Omega}$ is close to singular and so the geodesic distance is better suited than a metric, such as the Frobenius distance, which ignores the geometry of the space. In particular, it will give large values whenever $\boldsymbol{\Omega}$ and $\boldsymbol{\Omega}_0$ do not approximately span the same subspace of $\mathbb{R}^p$, that is, when their null spaces are not approximately the same. We also report the median and interquartile range for the posterior mean of the effective number of factors $k^\ast$.
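In R, this distance can be computed from the generalised eigenvalues; a short sketch (the function name is ours):

```r
# Geodesic distance between SPD matrices A and B via the (real,
# positive) eigenvalues of A^{-1} B
geodesic_dist <- function(A, B) {
  lam <- eigen(solve(A, B), only.values = TRUE)$values
  sqrt(sum(log(Re(lam))^2))
}
```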

Focusing first on the posterior mean of the geodesic distance $d_G(\boldsymbol{\Omega},\boldsymbol{\Omega}_0)$, it is immediately clear from Table 1 and the corresponding tables in the Supplementary Materials that both the medians and interquartile ranges across data sets are typically smallest when we use a prior in which $\boldsymbol{\Phi}$ is based on an exponential correlation matrix. As we would expect, this is especially true when the length-scale parameter in the prior is fixed at the value used in simulating the data. Nevertheless, the modest differences between posterior inferences under SN (expcov.) and SN (fixed) are reassuring; in the analysis of a real data set, though we may have strong prior beliefs about a suitable structure for $E(\boldsymbol{\Delta})$, we are likely to be more uncertain about the values of the hyperparameters in that structure. We also observe that the gains from assuming the correct form for $\boldsymbol{\Phi}$ in this simulation experiment are greatest when the total variation in $\boldsymbol{\Delta}$ makes up a larger proportion of the total variation in $\boldsymbol{\Omega}$ (that is, when $\beta=0.99$ and $\beta=0.95$) and when the degree of dimension reduction is larger. This is as expected because, in both cases, the prior for $\boldsymbol{\Lambda}$, and therefore $\boldsymbol{\Delta}$, is likely to impart more influence on the posterior for $\boldsymbol{\Omega}$. It is interesting to observe that when we adopt a structured matrix normal prior but assume an incorrect structure for $\boldsymbol{\Phi}$, the posterior mean for $d_G(\boldsymbol{\Omega},\boldsymbol{\Omega}_0)$ still tends to be smaller than under the MGP or CUSP priors. This is likely to be because we simulated all data sets using exponential correlation matrices for $\boldsymbol{\Phi}$, which have strictly positive off-diagonal elements. Under SN (exch.), we shrink $\boldsymbol{\Delta}$ towards a rank-reduced approximation to a two-parameter exchangeable matrix with a common (potentially positive) off-diagonal element, whereas the prior expectation for $\boldsymbol{\Delta}$ under the MGP and CUSP priors is diagonal.

A discussion of the results concerning the posterior mean of the effective number of factors $k^\ast$ can be found in the Supplementary Materials.

6.2 Co-occurrence of Finnish birds

6.2.1 Model and prior

As discussed in Section 1, most of the literature on prior specification for factor models focuses on constructing a distribution for the factor loadings matrix that is exchangeable with respect to the order of the components in the observation vector. An exception is the structured increasing shrinkage (SIS) process (Schiavon et al., 2022). This prior allows meta-covariates to inform the within-column sparsity structure of the factor loadings matrix in an infinite factorisation model. In this section, we re-examine the Finnish birds data set analysed in Schiavon et al. to illustrate the extra flexibility afforded by our approach.

The Finnish bird data comprise information on the co-occurrence of the 50 most common bird species in Finland from 137 sampling areas in 2014 (Lindström et al., 2015). The occurrences are arranged into an $n\times p$ binary matrix $\boldsymbol{Y}=(y_{ij})$, where $y_{ij}=1$ if bird species $j$ was observed in area $i$ and $y_{ij}=0$ otherwise. The information available to explain variation in the mean $\boldsymbol{\mu}_i$ at each sampling area comprises an $n\times c$ matrix of environmental covariates $\boldsymbol{W}$ containing measurements on spring temperature and its square and a five-level habitat type (broadleaved forests, coniferous forests, open habitats, urban habitats, wetlands), so that $c=7$. The information available to structure the prior for the marginal variance $\boldsymbol{\Omega}$ comprises a phylogenetic tree, indicating the evolutionary relationships amongst the 50 species, and a $p\times q$ matrix of meta-covariates $\boldsymbol{X}$ consisting of two species traits: the logarithm of typical body mass and a three-level migratory strategy (short-distance migrant, resident species, long-distance migrant). Following Schiavon et al., we model species presence or absence using a multivariate probit regression model in which $y_{ij}=\mathbb{I}(z_{ij}>0)$ and the latent variables $\boldsymbol{z}_i=(z_{i1},\ldots,z_{ip})^{\mathrm{T}}$ satisfy $\boldsymbol{z}_i=\boldsymbol{B}^{\mathrm{T}}\boldsymbol{w}_i+\boldsymbol{\Lambda}\boldsymbol{\eta}_i+\boldsymbol{\epsilon}_i$ for $i=1,\ldots,n$.
Here $\boldsymbol{w}_i^{\mathrm{T}}$ is the $i$th row of $\boldsymbol{W}$, $\boldsymbol{B}=(\beta_{ij})$ is a $c\times p$ matrix of regression coefficients, and the idiosyncratic variances on the diagonal of $\mathrm{Var}(\boldsymbol{\epsilon}_i)=\boldsymbol{\Sigma}$ are set equal to 1 to prevent compensatory rescaling of $\boldsymbol{B}$; the marginal variance is then $\boldsymbol{\Omega}=\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}+\boldsymbol{\Sigma}$. Our prior for the coefficients in $\boldsymbol{B}$ is identical to that of Schiavon et al. and is described in the Supplementary Materials.
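To fix ideas, the following sketch simulates presence/absence data from this probit factor model. The dimensions match the application, but the covariates, coefficients and loadings are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, c, k = 137, 50, 7, 5           # n areas, p species, c covariates; k is illustrative

W = rng.normal(size=(n, c))          # stand-in for the environmental covariates
B = rng.normal(scale=0.5, size=(c, p))       # hypothetical regression coefficients
Lam = rng.normal(scale=0.5, size=(p, k))     # hypothetical factor loadings
eta = rng.normal(size=(n, k))                # latent factors

# Latent probit scores z_i = B'w_i + Lambda eta_i + eps_i with Var(eps_i) = I_p,
# and presence/absence indicators y_ij = I(z_ij > 0)
Z = W @ B + eta @ Lam.T + rng.normal(size=(n, p))
Y = (Z > 0).astype(int)
```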

The SIS process prior for $\boldsymbol{\Lambda}$ is built on a series of assumptions of conditional independence. As a consequence, the expectation of the shared variation matrix $\boldsymbol{\Delta}$ is diagonal, irrespective of the priors assigned to its hyperparameters. Moreover, every factor loading $\lambda_{ij}$ has a spike-and-slab distribution in which the meta-covariates only enter through the logit of the spike probability. As such, there is no mechanism for incorporating the rich information on evolutionary relationships that is expressed through the phylogenetic tree.

In contrast, using the approach outlined in Section 3.4 we can structure the prior for the shared variation so that it incorporates the information from the phylogenetic tree in addition to the two meta-covariates. The former can be encoded as a matrix of phylogenetic distances between the 50 species, whose $(i,j)$th entry, say $d_{P,ij}$, is the sum of the branch lengths along the path between the leaves for species $i$ and $j$. This distance function is a metric over the set of leaves (Steel, 2016, Chapter 6). Another measure of the dissimilarity between species can be obtained using the generalised distance function from the projection model in (8). As we only have binary response data, which are unlikely to be as informative as continuous data, we simplify (8) by taking the matrix $\boldsymbol{\Theta}$ to be diagonal. This yields $d_{C,ij}=\{\sum_{k=1}^{4}(x_{ik}-x_{jk})^2/\vartheta_k^2\}^{1/2}$ as a metric for the distance between the meta-covariates for species $i$ and $j$, where $x_{i1}$, $x_{i2}$ and $x_{i3}$ are indicators for whether species $i$ is a short-distance migrant, resident or long-distance migrant, respectively, while $x_{i4}$ is the logarithm of typical body mass for species $i$. Since $d_{P,ij}$ and $d_{C,ij}$ are metrics over a set comprising the species and their associated traits, their sum is also a metric. We can therefore take $d_{ij}=d_{C,ij}+d_{P,ij}/\vartheta_5$, in which $\vartheta_i>0$ for $i=1,\ldots,5$, as a generalised measure of distance between species $i$ and $j$.
So that all the length-scale parameters $\vartheta_i$ operate on a comparable scale, the meta-covariates and branch lengths are standardised prior to calculating $d_{C,ij}$ and $d_{P,ij}$. Finally, we relate the generalised distance $d_{ij}$ to the among-row scale matrix $\boldsymbol{\Phi}$ through the exponential correlation function, $\phi_{ij}=\exp(-d_{ij})$. We assume the length-scale parameters to be independent in the prior and assign the distributions $\log\vartheta_i\sim\mathrm{N}(0,10)$.
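A sketch of this construction of $\boldsymbol{\Phi}$ under the definitions above, assuming the meta-covariates and phylogenetic distances have already been standardised; the function name and argument layout are our own:

```python
import numpy as np

def exp_corr_scale(X, D_phylo, theta):
    """Among-row scale matrix Phi with phi_ij = exp(-d_ij), where
    d_ij = d_C,ij + d_P,ij / theta[4] and
    d_C,ij = sqrt(sum_k (x_ik - x_jk)^2 / theta[k]^2).

    X       : (p, 4) standardised meta-covariates
    D_phylo : (p, p) standardised phylogenetic distances
    theta   : (5,) positive length-scale parameters
    """
    diff = X[:, None, :] - X[None, :, :]              # (p, p, 4) pairwise differences
    d_C = np.sqrt(np.sum((diff / theta[:4]) ** 2, axis=-1))
    return np.exp(-(d_C + D_phylo / theta[4]))
```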

Based on available phylogenetic and ecological information, this prior encourages shrinkage towards a rank-reduced approximation of an expected shared variation matrix $E(\boldsymbol{\Delta})$ with higher correlations between types of bird that are more closely related in terms of their evolutionary history and species traits. Since we allow the length-scale parameters $\vartheta_i$ to be unknown, we are also able to learn which variables are more important in explaining relationships between species. However, the structure for the expected shared variation matrix represented through $\boldsymbol{\Phi}$ is conjectural and so we elect to use the matrix-$t$ prior, whose extra degree of freedom parameter $\varsigma$ allows the data to influence the degree of shrinkage. As discussed in Section 3.3, we induce a prior for $\varsigma>4$ by specifying a distribution for $\check{\varsigma}=1/(\varsigma-4)\in\mathbb{R}^+$. Specifically, we take $\check{\varsigma}\sim\mathrm{Exp}(1)$, which implies that when the number of factors $k$ is 5, 10 and 15, the probability of the scale factor $s_k(\check{\varsigma})$ lying between 1 and $x$ is 0.75 for $x=7.8$, $x=12.9$ and $x=18.0$, respectively. For the diagonal elements of $\boldsymbol{\Psi}$, we assign an MGP prior (9) with $a_1=2$, $a_2=6$ and a maximum truncation point $H$ that satisfies $H\leq\lceil\varphi(50)\rceil-1=40$.
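For concreteness, prior draws for the degree of freedom and column-scale parameters can be generated in a few lines. This is a sketch only: it takes the standard Bhattacharya and Dunson (2011) parameterisation of the MGP as an assumption, since equation (9) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Matrix-t degree of freedom: varsigma_check ~ Exp(1), varsigma = 4 + 1/varsigma_check > 4
varsigma_check = rng.exponential(1.0)
varsigma = 4.0 + 1.0 / varsigma_check

# Diagonal of Psi under an MGP prior with a1 = 2, a2 = 6 and truncation H = 40.
# Assumed (standard Bhattacharya-Dunson) form: psi_h = 1 / prod_{l<=h} delta_l,
# with delta_1 ~ Ga(a1, 1) and delta_l ~ Ga(a2, 1) for l >= 2.
a1, a2, H = 2.0, 6.0, 40
delta = np.concatenate([rng.gamma(a1, 1.0, size=1),
                        rng.gamma(a2, 1.0, size=H - 1)])
psi = 1.0 / np.cumprod(delta)            # stochastically decreasing column scales
```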

6.2.2 Analysis and results

We ran two chains, initialised at different starting points, using the adaptive Gibbs sampler discussed in Section 5.2 and another two chains using Stan's implementation of the (non-adaptive) Hamiltonian Monte Carlo sampler with a fixed truncation level of $H=10$. In the Gibbs sampler, adaptation was allowed after 500 iterations and, following Schiavon et al. (2022), the parameters in the diminishing adaptation condition were set at $\alpha_0=-1$ and $\alpha_1=-5\times10^{-4}$. After a burn-in of 20K iterations, each chain was run for a further 20K iterations, thinning to store every fifth draw in order to reduce computational overheads. Using the truncation criterion of Schiavon and Canale (2020), we retain as active enough factors to explain at least $T=0.999$ of the total variation in the data. The posterior mode (and median) for the number of effective factors in each case was 5, with 95% credible interval $(3,6)$. Comparing the output for identified parameters across the chains from both samplers using the usual graphical diagnostic checks gave no evidence of any lack of convergence.

To gauge performance against another non-exchangeable prior, the model was also fitted under the SIS process prior using an adaptive Gibbs sampler, with the hyperparameters in the prior and the tuning parameters in the sampler set in line with the specification from Section 5 of Schiavon et al. (2022). In this case, essentially all the posterior mass for the number of effective factors was concentrated at 4.

The marginal prior and posterior densities for the logarithms of the length-scale parameters $\vartheta_1,\ldots,\vartheta_5$ and the transformed degree of freedom parameter $\check{\varsigma}=1/(\varsigma-4)$ under the structured matrix-$t$ prior are displayed in Figure 1. Absence of a relationship between a component of the generalised distance metric and the expected shared variation between species would be indicated by an infinite value for the corresponding length-scale parameter. The concentration of all the posterior densities in Figure 1a over a finite interval near the origin therefore highlights the role played by the generalised distance metric in explaining the structure of the shared variation matrix. It is also clear that the phylogenetic distances, whose length-scale parameter is $\vartheta_5$, have the greatest influence. For the transformed degree of freedom parameter $\check{\varsigma}$ in Figure 1b, the similarity between the prior and posterior indicates that the information learned from the data is modest. Nevertheless, the shift in the position of the mode from 0 in the prior to around 0.1 in the posterior suggests that the data support less shrinkage towards the expected shared variation matrix than would be implied by a matrix normal prior, for which $\check{\varsigma}=0$.

Figure 1: Marginal prior and posterior densities for (a) the logarithms of the length-scale parameters $\vartheta_1,\ldots,\vartheta_5$ and (b) the transformed degree of freedom parameter $\check{\varsigma}$.

Further qualitative evidence of the importance of the phylogenetic information for the structure of the marginal covariance matrix $\boldsymbol{\Omega}=(\omega_{ij})$ is provided by Figure 2, which shows the mean of the posterior for the marginal correlation matrix $\boldsymbol{S}_\Omega^{-1}\boldsymbol{\Omega}\boldsymbol{S}_\Omega^{-1}$, where $\boldsymbol{S}_\Omega=\mathrm{diag}(\surd\omega_{11},\ldots,\surd\omega_{pp})$.

Figure 2: Mean of the posterior for the marginal correlation matrix under the structured matrix-$t$ prior. The species are ordered according to their phylogenetic tree, which is visualised in the margins.

The order of the 50 bird species is based on their phylogenetic tree, which is shown in the margins of the plot. There is clear structure in the matrix, with several groups of species that share a recent common ancestor displaying strong positive correlations. This includes the species in the first ten rows, which correspond to birds from the Corvoidea superfamily and the Sylviidae family; the species in rows 19 to 30, which correspond largely to birds from the Muscicapoidea superfamily; and the species in rows 42 to 47, which correspond mostly to birds from the Scolopacidae family. It is also clear that most of the rows and columns dominated by near-zero correlations correspond to basal species, like Columba palumbus and Grus grus. Nevertheless, the structure of the matrix is clearly not dictated by evolutionary proximity alone. Interestingly, the corresponding plot from the analysis under the SIS process prior appears to be very sparse; see the Supplementary Materials. At least in part, this can be attributed to shrinkage of $\boldsymbol{\Delta}$ towards its prior mean, which is a diagonal matrix.

In order to provide a more formal comparison between the two model-prior combinations, we consider measures of goodness-of-fit and predictive performance. For the former, we follow Schiavon et al. (2022) in computing the pseudo marginal likelihood (PML), defined as the product of conditional predictive ordinates (Gelfand and Dey, 1994). For predictive performance, we consider the Brier score, a widely used proper scoring rule for categorical variables, calculated using a 4-fold cross-validation approach (e.g. Gneiting and Raftery, 2007). In our case, the score is positively oriented so that larger values indicate better performance. The results are shown in Table 2, from which we conclude that the combination of the factor model and structured matrix-$t$ prior gives the best fit to the data and, more markedly, the best predictive performance. Further details on the calculation of the PML and Brier score can be found in the Supplementary Materials.

Table 2: Goodness-of-fit and predictive performance of the two model-prior combinations.

               Structured matrix-$t$    SIS process
Log PML        -2831.6                  -2936.5
Brier score    -0.29120                 -0.85396
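For concreteness, one positively oriented variant of the Brier score for binary presence/absence data is sketched below; the precise multicategory form used here is given in the Supplementary Materials, so this should be read as an illustration rather than the exact computation:

```python
import numpy as np

def brier_score(y, p_hat):
    """Positively oriented Brier score for binary outcomes: minus the mean
    squared difference between outcomes and predictive probabilities, so
    values nearer zero indicate better forecasts."""
    y, p_hat = np.asarray(y, float), np.asarray(p_hat, float)
    return -np.mean((y - p_hat) ** 2)
```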

6.3 Hourly demand for natural gas

6.3.1 Model and prior

To illustrate the use of our methodology in the context of a dynamic model, we consider an application to modelling the hourly demand for natural gas. In the UK, gas in the national transmission system is transported to individual customers through eight regional distribution networks which are responsible for maintaining a reliable gas supply to customers at all times. This requires accurate short-term forecasts of the hourly demand for gas at individual points in the network called offtakes, where gas is taken off to supply locally. We consider data from a single offtake site in a regional distribution network in the North of England. Hourly gas demand data are available for the period from May 2012 to June 2015 along with the average daily temperature at the offtake site. In addition to the weather, the other main factors that are known to influence the mean level of residential demand for gas are seasonal and calendar effects (Soldo, 2012). We therefore additionally incorporate covariates that represent the day of the year, day of the week and whether or not a day is a public holiday.

The measurement of the demand for gas at an hourly resolution makes construction of a model challenging because the underlying dynamics of the process are likely to vary over a much longer time scale. For example, consider a dynamic linear model for hourly gas demand that allows for calendar and weather effects and incorporates a time-varying local level. Suppose that this local level is modelled as evolving according to a random walk or autoregression. Deviations from behaviour that would be considered typical given the calendar and weather effects would likely persist over days, rather than hours, and so hourly evolution of the local level would demand a near-zero innovation variance. Similarly, if calendar effects, such as the day of the week, were regarded as dynamic, hourly evolution would demand vanishingly small innovation variances.

Motivated by these concerns, we define a gas-day-vector as a longitudinal series of measurements of the log-transformed demand for gas over 24 hours, starting at 07:00, which is the beginning of a gas-day. Denote by $\boldsymbol{y}_t=(y_{t1},\ldots,y_{t,24})^{\mathrm{T}}$ the gas-day-vector on day $t$, so that $y_{t,h}$ denotes the log-transformed demand for gas at hour $h$ of gas-day $t$. We elect to model the gas-day-vectors using a dynamic factor model with observation equation $\boldsymbol{y}_t=\boldsymbol{\mu}_t+\boldsymbol{\Lambda}\boldsymbol{\eta}_t+\boldsymbol{\epsilon}_t$, where $\boldsymbol{\epsilon}_t\sim\mathrm{N}_p(\boldsymbol{0},\boldsymbol{\Sigma})$, $t=1,\ldots,n$. For simplicity in this illustrative application, we model the evolution of the factors using a first order vector autoregression, $\boldsymbol{\eta}_t=\boldsymbol{\Gamma}\boldsymbol{\eta}_{t-1}+\boldsymbol{\zeta}_t$, where $\boldsymbol{\zeta}_t\sim\mathrm{N}_k(\boldsymbol{0},\boldsymbol{\Pi})$. Due to the stability of the demand for gas over the time period in question, it is reasonable to assume that the departures from the time-varying mean, $\boldsymbol{y}_t-\boldsymbol{\mu}_t$, follow a stationary process. We therefore constrain the stationary variance of the $\boldsymbol{\eta}_t$ to be $\boldsymbol{I}_k$ and take the initial distribution to be $\boldsymbol{\eta}_0\sim\mathrm{N}_k(\boldsymbol{0},\boldsymbol{I}_k)$.
Reparameterising the model for the factors in terms of the transformed partial autocorrelation matrices yields a single unconstrained square matrix $\boldsymbol{A}=(\alpha_{ij})$ with $\boldsymbol{\Gamma}=(\boldsymbol{I}_k+\boldsymbol{A}\boldsymbol{A}^{\mathrm{T}})^{-1/2}\boldsymbol{A}$ and $\boldsymbol{\Pi}=(\boldsymbol{I}_k+\boldsymbol{A}\boldsymbol{A}^{\mathrm{T}})^{-1}$. As discussed in Section 4.4, we assign a rotatable prior to $\boldsymbol{A}$, and therefore $\boldsymbol{\Gamma}$, by taking $\alpha_{ij}\sim\mathrm{N}(0,1)$.
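A sketch of this mapping, which also verifies that the implied stationary variance is $\boldsymbol{I}_k$:

```python
import numpy as np

def var1_from_unconstrained(A):
    """Map an unconstrained k x k matrix A to (Gamma, Pi) with
    Gamma = (I_k + A A')^{-1/2} A and Pi = (I_k + A A')^{-1}, so that the
    VAR(1) eta_t = Gamma eta_{t-1} + zeta_t, zeta_t ~ N(0, Pi), is
    stationary with stationary variance I_k."""
    k = A.shape[0]
    w, V = np.linalg.eigh(np.eye(k) + A @ A.T)   # spectral decomposition of I + AA'
    Gamma = (V / np.sqrt(w)) @ V.T @ A           # symmetric inverse square root times A
    Pi = (V / w) @ V.T                           # (I + AA')^{-1}
    return Gamma, Pi

# Sanity check: Gamma Gamma' + Pi = I_k, confirming unit stationary variance.
A = np.random.default_rng(3).normal(size=(4, 4))
Gamma, Pi = var1_from_unconstrained(A)
assert np.allclose(Gamma @ Gamma.T + Pi, np.eye(4))
```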

The variances of the specific factors $\boldsymbol{\epsilon}_t$ are assigned the prior $1/\sigma_j^2\sim\gamma(3.1,2.1)$ independently for $j=1,\ldots,p$, which ensures that the distributions have finite variance. The time-varying mean $\boldsymbol{\mu}_t$ is modelled as $\boldsymbol{\mu}_t=\boldsymbol{B}^{\mathrm{T}}\boldsymbol{w}_t$, in which $\boldsymbol{w}_t^{\mathrm{T}}=(w_{t1},\ldots,w_{tc})$ is row $t$ of an $n\times c$ matrix of daily covariates $\boldsymbol{W}$. The columns of $\boldsymbol{W}$ allow for an intercept, a non-linear effect of deviations from seasonal average temperature and fixed effects for the day of the week, the occurrence of a public holiday, and the day of the year, which is represented as a truncated Fourier series with six harmonics. Full details of the model for $\boldsymbol{\mu}_t$ are given in the Supplementary Materials, along with a description of the associated priors.

In order to reflect both the persistence of above or below average demand for gas and the longitudinal but circular nature of a gas-day-vector, a sensible model for the inverse of the expected shared variation matrix, $\boldsymbol{\Xi}=\boldsymbol{\Phi}^{-1}$, is the precision matrix of a stationary circular autoregressive process of order one (Held and Rue, 2010). We therefore take $\boldsymbol{\Xi}=(\xi_{ij})$ to be a tridiagonal Toeplitz matrix with corners; that is, $\xi_{i,i+1}=\xi_{i+1,i}=-\vartheta/2$ for $i=1,\ldots,p-1$, $\xi_{1p}=\xi_{p1}=-\vartheta/2$ and $\xi_{ii}=1$ for $i=1,\ldots,p$, where we assume $\vartheta\in[0,1)$ and take $\mathrm{logit}(\vartheta)\sim\mathrm{N}(0,2)$. Exploratory analysis using data from another offtake site suggests that this structure is sensible, based on the inverse of the sample covariance matrix. Given that the existing model is already rather complex, we chose to adopt a matrix normal prior for $\boldsymbol{\Lambda}$ in this example. The diagonal elements of the among-column scale matrix $\boldsymbol{\Psi}$ are given an MGP prior (9), with $a_1=2$, $a_2=3$ and a maximum truncation point of $H=\lceil\varphi(24)\rceil-1=17$.
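A sketch of this construction, for a given $\vartheta\in[0,1)$:

```python
import numpy as np

def circular_ar1_precision(p, theta):
    """Tridiagonal Toeplitz precision matrix 'with corners': xi_ii = 1,
    xi_{i,i+1} = xi_{i+1,i} = -theta/2 and xi_{1p} = xi_{p1} = -theta/2,
    the precision of a stationary circular AR(1) for theta in [0, 1)."""
    Xi = np.eye(p)
    idx = np.arange(p)
    Xi[idx, (idx + 1) % p] = -theta / 2.0   # superdiagonal plus the (p, 1) corner
    Xi[(idx + 1) % p, idx] = -theta / 2.0   # subdiagonal plus the (1, p) corner
    return Xi

Xi = circular_ar1_precision(24, 0.5)        # p = 24 hours in a gas-day
```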

6.3.2 Analysis and results

We ran two chains, initialised at different starting points, using the adaptive Gibbs sampler discussed in Section 5.2 and another two chains using the (non-adaptive) Hamiltonian Monte Carlo sampler with a fixed truncation level of $H=\lceil\varphi(24)\rceil-1=17$. After a burn-in of 50K iterations, each Gibbs chain was run for a further 200K iterations, thinning to retain every 100th draw in order to reduce computational overheads. Adaptation was allowed after 5K iterations and the parameters in the diminishing adaptation condition were set at $\alpha_0=-1$ and $\alpha_1=-5\times10^{-5}$. The Hamiltonian Monte Carlo sampler was run for 20K iterations after a burn-in of the same length, thinning the output to retain every 10th draw. As in the Finnish birds example, we retain as active enough factors to explain at least $T=0.999$ of the total variation in the data. The posterior mode (and median) for the number of effective factors in each case was 10, with 95% credible interval $(9,12)$. Comparing the output for identified parameters across the chains for the two samplers using the usual graphical diagnostic checks gave no evidence of any lack of convergence.

Figure 3: Mean of the posterior for the marginal standardised precision matrix. The labels indicate the hour of the day on a 24-hour clock.

Let $\boldsymbol{\Upsilon}=\boldsymbol{\Omega}^{-1}=(\upsilon_{ij})$ denote the marginal precision matrix of the process and let $\boldsymbol{S}_\Upsilon=\mathrm{diag}(\surd\upsilon_{11},\ldots,\surd\upsilon_{pp})$. Figure 3 shows the posterior mean for the standardised precision matrix $\boldsymbol{S}_\Upsilon^{-1}\boldsymbol{\Upsilon}\boldsymbol{S}_\Upsilon^{-1}$. Its structure is reasonably consistent with the tridiagonal Toeplitz matrix with corners on which the prior for $\boldsymbol{\Delta}^{-1}$ is centred. However, there are some deviations, most notably another one or two bands of non-zero elements below the subdiagonal and above the supradiagonal. There is also some evidence of at least a partial band in the vicinity of $\upsilon_{i,i+12}$ for $i=1,\ldots,12$. This may be due to people switching their heating on twice per day, at around 07:00 in the morning and around 19:00 in the evening. This picture is reinforced by the posterior for the identified factor loadings matrix $\boldsymbol{\tilde{\Lambda}}$, shown in the Supplementary Materials, many of whose columns display a double-hump shape with smaller loadings towards the middle of the gas-day. As remarked in Section 4.3, in order to simplify the geometry of the stationary region for stationary dynamic factor models, the autoregressive coefficient matrices in the state equation are often assumed to be diagonal. However, this example provides considerable evidence in favour of our fully flexible approach, with the posterior probabilities $\Pr(\tilde{\gamma}_{i,jk}>0\,|\,\boldsymbol{y}_{1:n})$ for many off-diagonal elements being close to zero or one.
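For reference, the standardisation applied in Figure 3 is, for any draw (or the posterior mean) of $\boldsymbol{\Omega}$:

```python
import numpy as np

def standardised_precision(Omega):
    """Compute S^{-1} Upsilon S^{-1} with Upsilon = Omega^{-1} and
    S = diag(sqrt(upsilon_11), ..., sqrt(upsilon_pp)), so the result
    has unit diagonal."""
    Upsilon = np.linalg.inv(Omega)
    s = np.sqrt(np.diag(Upsilon))
    return Upsilon / np.outer(s, s)
```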

In order to assess the forecasting performance of the model, the number of factors was fixed at $k=10$ and the model was refitted, holding back the last 25% of observations as test data. In keeping with the short-term horizon of interest, we considered forecasting $h=24$ hours ahead and, for comparative purposes, $h=1$ hour ahead. The standard forward filtering and forecasting recursions for (multivariate) dynamic linear models are not appropriate here because they do not allow within-day updates or forecasts. We therefore modify the standard forward filter so that the time-step is hours, rather than days, and at hours $2,\ldots,24$ within each gas-day we perform a partial update, comprising an observation step but no prediction step. The forecasting algorithm is similarly modified to allow forecasts to be issued hourly. The full algorithms are given in the Supplementary Materials. By using the draws from the posterior obtained from the fit to the training data and sampling one-for-one from the $h$-step ahead predictive distributions, we obtain samples from the $h$-step ahead posterior predictive distributions. For $h=1$ and $h=24$, these are visualised in Figure 4 for the first 5% of times in the hold-out period, along with the test data. An analogous figure in the Supplementary Materials shows the last 5% of times. From these plots it is clear that the posterior predictive distributions are both accurate and precise over the forecast horizons of interest.
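The modified recursion is easy to sketch for the VAR(1) state equation used here: the state $\boldsymbol{\eta}_t$ is propagated once per gas-day and then updated sequentially with each hour's scalar observation. A minimal Python sketch, assuming known $\boldsymbol{\mu}_t$, $\boldsymbol{\Lambda}$, $\sigma_j^2$, $\boldsymbol{\Gamma}$ and $\boldsymbol{\Pi}$; the full algorithms are in the Supplementary Materials:

```python
import numpy as np

def hourly_forward_filter(Y, Mu, Lam, sigma2, Gamma, Pi):
    """Forward filter with an hourly time-step for the dynamic factor model.
    The state eta_t is propagated through the VAR(1) once per gas-day; each
    hour h then contributes a scalar observation-only (partial) update.

    Y, Mu  : (n, 24) observations and mean values
    Lam    : (24, k) factor loadings; sigma2 : (24,) idiosyncratic variances
    Gamma, Pi : (k, k) VAR(1) coefficient and innovation variance matrices
    """
    n, p = Y.shape
    k = Gamma.shape[0]
    m, C = np.zeros(k), np.eye(k)                    # eta_0 ~ N(0, I_k)
    for t in range(n):
        m, C = Gamma @ m, Gamma @ C @ Gamma.T + Pi   # daily prediction step
        for h in range(p):                           # hourly partial updates
            lam = Lam[h]
            f = Mu[t, h] + lam @ m                   # one-step forecast mean
            q = lam @ C @ lam + sigma2[h]            # forecast variance
            K = C @ lam / q                          # Kalman gain
            m = m + K * (Y[t, h] - f)
            C = C - np.outer(K, lam @ C)
    return m, C                                      # filtered moments of eta_n
```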

Figure 4: For the first 5% of times in the hold-out period, means and 95% equi-tailed credible intervals for the one-step ahead and 24-step ahead posterior predictive distributions, together with the observed data.

7 Discussion

We have proposed a class of structured priors for the loadings matrix of a Bayesian factor model, with accompanying inferential algorithms. The novelty lies in the insight that the shared variation matrix $\boldsymbol{\Delta}=\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathrm{T}}$ is much more amenable to the specification of prior beliefs than the factor loadings matrix $\boldsymbol{\Lambda}$. The prior is based on a matrix normal or matrix-$t$ distribution for $\boldsymbol{\Lambda}$. Two important features are the choice of parametric structure for the among-row scale matrix $\boldsymbol{\Phi}$ and the increasing shrinkage prior for the diagonal among-column scale matrix $\boldsymbol{\Psi}$. The matrix $\boldsymbol{\Phi}$ characterises the conditional prior expectation of $\boldsymbol{\Delta}$. Parametric forms are widely adopted as models for covariance matrices in longitudinal studies, spatial statistics, Gaussian processes, and a host of other areas. By adopting a parametric form, informed by domain expertise, as a model for the expectation of $\boldsymbol{\Delta}$, we benefit from shrinkage towards a rank-reduced matrix that is close to that structure. In general, the number of factors in a factor model is not known a priori. This is addressed through the increasing shrinkage process prior for $\psi_1,\ldots,\psi_k$, which allows columns of loadings whose likelihood contribution is negligible to be shrunk towards zero and discarded. A prior that encourages no more factors than are needed is helpful, particularly for prediction, because of the reduction in epistemic uncertainty that is afforded. At the cost of slightly more involved computational inference, the matrix-$t$ version of the structured prior also offers a degree of freedom parameter $\varsigma$ which allows the data to influence the degree of shrinkage of $\boldsymbol{\Delta}$ towards a rank-reduced matrix that is close to its mean.

Acknowledgements

EH was supported by the EPSRC grant EP/N510129/1 via the Alan Turing Institute project “Streaming data modelling for real-time monitoring and forecasting”. We are grateful to Michael Betancourt and Malcolm Farrow for conversations which have improved the manuscript. We are also grateful to two anonymous referees for their helpful comments and suggestions. The Finnish bird data are available publicly from https://www.helsinki.fi/en/researchgroups/statistical-ecology/software/hmsc.

References

  • Aguilar, O. and M. West (2000). Bayesian dynamic factor models and portfolio allocation. J. Bus. Econ. Stat. 18(3), 338–357.
  • Aßmann, C., J. Boysen-Hogrefe, and M. Pape (2016). Bayesian analysis of static and dynamic factor models: an ex-post approach towards the rotation problem. J. Economet. 192, 190–206.
  • Bekker, P. A. and J. M. F. ten Berge (1997). Generic global identification in factor analysis. Linear Algebra and its Applications 264, 255–263.
  • Bhattacharya, A. and D. B. Dunson (2011). Sparse Bayesian infinite factor models. Biometrika 98(2), 291–306.
  • Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell (2017). Stan: A probabilistic programming language. J. Statist. Software 76(1), 1–32.
  • Carvalho, C. M., J. Chang, J. E. Lucas, J. R. Nevins, Q. Wang, and M. West (2008). High-dimensional sparse factor modeling: applications in gene expression genomics. J. Amer. Statist. Assoc. 103(484), 1438–1456.
  • Castellanos, L., V. Q. Vu, S. Perel, A. B. Schwartz, and R. E. Kass (2004). A multivariate Gaussian process factor model for hand shape during reach-to-grasp movements. Statist. Sinica 25(1), 5–24.
  • Chan, J., R. Leon-Gonzalez, and R. W. Strachan (2018). Invariant inference and efficient computation in the static factor model. J. Amer. Statist. Assoc. 113(522), 819–828.
  • Chan, J. C. C. and I. Jeliazkov (2009). Efficient simulation and integrated likelihood estimation in state space models. International Journal of Mathematical Modelling and Numerical Optimisation 1, 101–120.
  • Chib, S., F. Nardari, and N. Shephard (2006). Analysis of high dimensional multivariate stochastic volatility models. J. Economet. 134(2), 341–371.
  • Conti, G., S. Frühwirth-Schnatter, J. J. Heckman, and R. Piatek (2014). Bayesian exploratory factor analysis. J. Economet. 183, 31–57.
  • Durante, D. (2017). A note on the multiplicative gamma process. Stat. Probab. Lett. 122, 198–204.
  • Dutta, R. and J. K. Ghosh (2013). Bayes model selection with path sampling: factor models and other examples. Stat. Sci. 28(1), 95–115.
  • Frühwirth-Schnatter, S. (1994). Data augmentation and dynamic linear models. Journal of Time Series Analysis 15, 183–202.
  • Frühwirth-Schnatter, S., D. Hosszejni, and H. F. Lopes (2023). Sparse Bayesian factor analysis when the number of factors is unknown. arXiv:2301.06459.
  • Frühwirth-Schnatter, S. and H. F. Lopes (2010). Parsimonious Bayesian factor analysis when the number of factors is unknown. Technical Report, Chicago Booth.
  • Gelfand, A. E. and D. K. Dey (1994). Bayesian model choice: asymptotics and exact calculations. J. R. Statist. Soc. B 56(3), 501–514.
  • Geweke, J. and G. Zhou (1996). Measuring the pricing error of the arbitrage pricing theory. The Review of Financial Studies 9(2), 557–587.
  • Ghosh, J. and D. B. Dunson (2009). Default prior distributions and efficient posterior computation in Bayesian factor analysis. J. Comput. Graph. Stat. 18(2), 306–320.
  • Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc. 102(477), 359–378.
  • Goldberg, L. R. (1990). An alternative 'description of personality': the big-five factor structure. Journal of Personality and Social Psychology 59(6), 1216–1229.
  • Gupta, A. K. and D. K. Nagar (2000). Matrix Variate Distributions. Boca Raton, Florida: Chapman & Hall/CRC.
  • Hays, S., H. Shen, and J. X. Huang (2012). Functional dynamic factor models with application to yield curve forecasting. Ann. Appl. Statist. 6(3), 870–894.
  • Heaps, S. E. (2023). Enforcing stationarity through the prior in vector autoregressions. J. Comput. Graph. Stat. 32(1), 74–83.
  • Held, L. and H. Rue (2010). Conditional and intrinsic autoregressions. In A. E. Gelfand, P. J. Diggle, M. Fuentes, and P. Guttorp (Eds.), Handbook of Spatial Statistics (1st ed.), Handbooks of Modern Statistical Methods, Chapter 13, pp. 201–216. Boca Raton, Florida: CRC Press.
  • Jungbacker, B., S. J. Koopman, and M. Van der Wel (2014). Smooth dynamic factor analysis with application to the US term structure of interest rates. J. Appl. Economet. 29(3), 65–90.
  • Kaufmann, S. and C. Schumacher (2019). Bayesian estimation of sparse dynamic factor models with order-independent and ex-post identification. J. Economet. 210, 116–134.
  • Koop, G., R. Leon-Gonzalez, and R. W. Strachan (2009). Efficient posterior simulation for cointegrated models with priors on the cointegration space. Econometric Reviews 29(2), 224–242.
  • Kowal, D. R. and A. Canale (2022). Semiparametric functional factor models with Bayesian rank selection. arXiv:2108.02151v3.
  • Kowal, D. R., D. S. Matteson, and D. Ruppert (2017). A Bayesian multivariate functional dynamic linear model. J. Amer. Statist. Assoc. 112(518), 733–744.
  • Legramanti, S., D. Durante, and D. B. Dunson (2020). Bayesian cumulative shrinkage for infinite factorizations. Biometrika 107(3), 745–752.
  • Leung, D. and M. Drton (2016). Order-invariant prior specification in Bayesian factor analysis. Stat. Probab. Lett. 111, 60–66.
  • Lim, L.-H. and R. Sepulchre (2019). Geometric distance between positive definite matrices of different dimensions. IEEE Transactions on Information Theory 65(9), 5401–5405.
  • Lindström, Å., M. Green, M. Husby, J. A. Kålås, and A. Lehikoinen (2015). Large-scale monitoring of waders on their Boreal and Arctic breeding grounds in Northern Europe. Ardea 103(1), 3–15.
  • Lopes, H. F. and C. M. Carvalho (2007). Factor stochastic volatility with time varying loadings and Markov switching regimes. J. Statist. Plan. Inf. 137, 3082–3091.
  • Lopes, H. F., E. Salazar, and D. Gamerman (2008). Spatial dynamic factor analysis. Bayesian Anal. 3(4), 759–792.
  • Lopes, H. F. and M. West (2004). Bayesian model assessment in factor analysis. Statist. Sinica 14, 41–67.
  • Man, A. X. and S. A. Culpepper (2022). A mode-jumping algorithm for Bayesian factor analysis. J. Amer. Statist. Assoc. 117(537), 277–290.
  • Peña, D. and P. Poncela (2006). Nonstationary dynamic factor analysis. J. Statist. Plan. Inf. 136, 1237–1257.
  • Roberts, G. O. and J. S. Rosenthal (2007). Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J. Appl. Probab. 44(2), 458–475.
  • Sáfadi, T. and D. Peña (2008). Bayesian analysis of dynamic factor models: an application to air pollution and mortality in São Paulo, Brazil. Environmetrics 19(1), 95–117.
  • Schiavon, L. and A. Canale (2020). On the truncation criteria in infinite factor models. Stat 9(1), e298.
  • Schiavon, L., A. Canale, and D. B. Dunson (2022). Generalized infinite factorization models. Biometrika 109(3), 817–835.
  • Schmidt, A. M., P. Guttorp, and A. O'Hagan (2011). Considering covariates in the covariance structure of spatial processes. Environmetrics 22, 487–500.
  • Soldo, B. (2012). Forecasting natural gas consumption. Applied Energy 92, 26–37.
  • Steel, M. (2016). Phylogeny: Discrete and Random Processes in Evolution. Philadelphia: Society for Industrial and Applied Mathematics.
  • Taylor-Rodriguez, D., A. O. Finley, A. Dhatta, C. Babcock, H. Andersen, B. D. Cook, D. C. Mortin, and S. Banerjee (2019). Spatial factor models for high-dimensional and large spatial data. Statist. Sinica 29(3), 1155–1180.
  • Wilkinson, D. J. and S. K. H. Yeung (2002). Conditional simulation from highly structured Gaussian systems, with application to blocking-MCMC for the Bayesian analysis of very large linear models. Statistics and Computing 12, 287–300.
  • Wilkinson, D. J. and S. K. H. Yeung (2004). A sparse matrix approach to Bayesian computation in large linear models. Comput. Statist. Data Anal. 44, 493–516.