1 Introduction

In standard settings of learning from independent and identically distributed (iid) data, labels y of training and test instances \(\mathbf {x}\) are drawn independently and are governed by a fixed conditional distribution \(p(y|\mathbf {x})\). Problem settings that relax this assumption are widely referred to as transfer learning. We study a transfer-learning setting in which the conditional \(p(y|\mathbf {x})\) is assumed to vary as a function of additional observable variables \(\mathbf {t}\). The variables \(\mathbf {t}\) can identify a specific domain that an observation was drawn from (as in multitask learning), or can be continuous attributes that describe, for instance, the time or location at which an observation was made (sometimes called concept drift). We focus on applications in which \(\mathbf {t}\) represents a geographic location, or both a location and a point in time.

A natural model for this setting is to assume a conditional \(p(y|\mathbf {x};\mathbf {w})\) with parameters \(\mathbf {w}\) that vary with \(\mathbf {t}\). Such models are known as varying-coefficient models (e.g., Hastie and Tibshirani 1993; Gelfand et al. 2003). In iid learning, it is common to assume an isotropic Gaussian prior \(p(\mathbf {w})\) over model parameters. When the parameters vary as a function of a task variable \(\mathbf {t}\), it is natural to instead assume a Gaussian-process (GP) prior over functions that map values of \(\mathbf {t}\) to values of \(\mathbf {w}\). A Gaussian process implements a prior \(p(\varvec{\omega })\) over functions \(\varvec{\omega }: \mathcal {T} \rightarrow \mathbb {R}^m\) that couple parameters \(\mathbf {w}\in \mathbb {R}^m\) for different values of \(\mathbf {t}\in \mathcal {T}\) and make it possible to generalize over different domains, time, or space. While this model allows to extend Bayesian inference naturally to a variety of transfer-learning problems, inference in these varying-coefficient models for large problems is often impractical: It involves Kronecker products that result in matrices of size \(nm \times nm\), with n the number of instances and m the number of attributes of \(\mathbf {x}\) (Gelfand et al. 2003; Wheeler and Calder 2006).

Alternatively, varying-coefficient models can be derived in a regularized risk-minimization framework. Such models infer point estimates of parameters \(\mathbf {w}\) for different observed values of \(\mathbf {t}\) under some model that expresses how \(\mathbf {w}\) changes smoothly with \(\mathbf {t}\). At test time, point estimates of \(\mathbf {w}\) are required for all \(\mathbf {t}\) observed at the test data points. This is again computationally challenging because typically a separate optimization problem needs to be solved for each test instance. Most prominent are estimation techniques based on kernel-local smoothing (Fan and Zhang 2008; Wu and Chiang 2000; Fan and Huang 2005).

Logistic and ridge regression, among other discriminative models for iid data, rely on an isotropic Gaussian prior \(p(\mathbf {w})\). By this assumption, \(p(\mathbf {w})\) is a product of Gaussians; taking the logarithm results in the standard \(\ell _2\) regularization term \(\mathbf {w}^\top \mathbf {w}\). However, since discriminative models do not involve the likelihood \(p(\mathbf {x}|y)\) of the input variables, an isotropy assumption on \(\mathbf {w}\) does not amount to the assumption that the dimensions of \(\mathbf {x}\) are independent. In analogy, we explore Bayesian varying-coefficient models in conjunction with isotropic GP priors. Our main theoretical result is that Bayesian inference in varying-coefficient models with isotropic GP priors is equal to Bayesian inference in a standard Gaussian process with a specific product kernel. The main practical implication of this result is that inference for varying-coefficient models becomes practical by using standard GP tools.

Our theoretical result also leads to insights regarding existing transfer learning methods: First, we identify the exact modeling assumptions under which Bayesian inference amounts to multitask learning using a Gaussian process with task kernels and instance kernels (Bonilla et al. 2007). Secondly, we show that hierarchical Bayesian multitask models (e.g., Gelman et al. 1995; Finkel and Manning 2009) can be represented as Gaussian process priors; inference then resolves to inference in standard Gaussian processes with multitask kernels based on graph Laplacians (Evgeniou et al. 2005; Álvarez et al. 2011).

Predicting real-estate prices is an economically relevant problem that has previously been addressed using varying-coefficient models (Gelfand et al. 2003). Due to both the limited scalability of known inference methods for varying-coefficient models and limited availability of real-estate transaction data, previous studies have been carried out on small-scale data collections. Substantial amounts of real-estate transactions have been disclosed in the course of recent open data initiatives. We compile a data set of real-estate transaction from the State of New York and rent prices from the States of New York and California and explore varying-coeffient models on this data set.

In probabilistic seismic-hazard analysis, a ground-motion model estimates the expected future distribution of a ground-motion parameter of interest at a specific site, depending on the event’s origin and site-related parameters such as magnitude and distance. A typical ground-motion parameter is the peak ground acceleration that a site experiences during a seismic event. Accurate ground-motion models are important to establish building codes and determine insurance risks. Traditional ground-motion models are ergodic; that is, the conditional distribution of the ground-motion parameter of interest at a given site is identical to the conditional distribution at any other site, given the same magnitude, distance, and site conditions (Anderson and Brune 1999). These ergodic models compete with specialized regional models—e.g., for Greece (Danciu and Tselentis 2007), Italy (Bindi et al. 2011), the Eastern Alps (Bragato and Slejko 2005), and Turkey (Akkar and Cagnan 2010). These regional models suffer from smaller numbers of data points.

In order to weaken the assumption of ergodicity in ground-motion models, Gianniotis et al. (2014) and Stafford (2014) estimate ground-motion models from a larger data set and constrain the coefficients to be similar across each region. Regional adjustments can be broken down into smaller geographical units (Al-Atik et al. 2010; Lin et al. 2011). This approach, however, relies on the availability of sufficiently many observations in each geographical compartment.

Our theoretical findings allow us to derive a ground-motion model in which the coefficients of the model can vary smoothly with geographical location and for which inference is computationally tractable. The model is developed and evaluated on a subset of the NGA West 2 dataset (Ancheta et al. 2014), based on Californian data used by Abrahamson et al. (2014). In California, regional differences between Northern California and Southern California have been found previously (Atkinson and Morrison 2009; Chiou et al. 2010), though the recent NGA West 2 models treat California as a whole.

The rest of this paper is structured as follows. Section 2 describes the problem setting and the varying-coefficient model. Section 3 studies Bayesian inference and presents our main results. Section 4 presents experiments on prediction of real estate prices and seismic-hazard analysis; Sect. 5 discusses related work and concludes.

2 Problem setting and model

In multi-task learning, one or several of the available observable variables are singled out and treated as task variables \(\mathbf {t}\)—for instance, the identity of a speaker in speech recognition or the location in a geospatial prediction problem. This modeling decision reflects the assumption that \(p(y|\mathbf {x},\mathbf {t},\mathbf {w})\) is similar across the values of \(\mathbf {t}\). If the underlying application satisfies this assumption, it allows for better predictions; for instance, for values of \(\mathbf {t}\) that are poorly covered by the training data. This section describes a stochastic process that models applications which are characterized by a conditional distribution \(p(y|\mathbf {x},\varvec{\omega }(\mathbf {t}))\) whose parameterization \(\varvec{\omega }(\mathbf {t})\) varies smoothly as a function of \(\mathbf {t}\).

A fixed set of instances \(\mathbf {x}_1,\ldots ,\mathbf {x}_n\) with \(\mathbf {x}_i\in \mathcal {X}\subseteq \mathbb {R}^m\) is observable, along with values \(\mathbf {t}_1,\ldots ,\mathbf {t}_n \in \mathcal {T}\) of a task variable. The stochastic process starts by drawing a function \(\varvec{\omega }:\mathcal {T}\rightarrow \mathbb {R}^m\) according to a prior \(p(\varvec{\omega })\). The function \(\varvec{\omega }\) associates any task variable \(\mathbf {t}\in \mathcal {T}\) with a corresponding parameter vector \(\varvec{\omega }(\mathbf {t}) \in \mathbb {R}^m\) that defines the conditional distribution \(p(y|\mathbf {x},\varvec{\omega }(\mathbf {t}))\) for task \(\mathbf {t}\in \mathcal {T}\). The domain \(\mathcal {T}\) of the task variable depends on the application at hand. In the case of multitask learning, \(\mathcal {T}=\{1,\ldots ,k\}\) is a set of task identifiers. In hierarchical Bayesian multitask models, a tree \(\mathcal {G} = (\mathcal {T},\mathbf {A})\) over the tasks \(\mathcal {T} = \{1,\ldots ,k\}\) reflects how tasks are related; we represent this tree by its adjacency matrix \(\mathbf {A}\in \mathbb {R}^{k \times k}\). We focus on geospatial transfer-learning problems in which the conditional distribution of y given \(\mathbf {x}\) varies smoothly in the task variables \(\mathbf {t}\) that represent spatial coordinates, or both space and time. In this case, \(\mathcal {T} \subset \mathbb {R}^d\) is a continuous-valued space.

We model \(p(\varvec{\omega })\) using a zero-mean Gaussian process

$$\begin{aligned} \varvec{\omega }\sim \mathcal {GP}(\mathbf {0},\varvec{\kappa }) \end{aligned}$$
(1)

that generates vector-valued functions \(\varvec{\omega }:\mathcal {T}\rightarrow \mathbb {R}^m\). The process is specified by a matrix-valued kernel function \(\varvec{\kappa }: \mathcal {T}\times \mathcal {T} \rightarrow \mathbb {R}^{m \times m}\) that reflects closeness in \(\mathcal {T}\). Here, \(\varvec{\kappa }(\mathbf {t},\mathbf {t}') \in \mathbb {R}^{m\times m}\) is the matrix of covariances between dimensions of the vectors \(\varvec{\omega }(\mathbf {t})\) and \(\varvec{\omega }(\mathbf {t}')\).

In discriminative machine-learning models, one usually assumes that the dimensions of model-parameter vector \(\mathbf {w}\) are generated independently of one another. For instance, one often assumes an isotropic Gaussian prior \(p(\mathbf {w})=\prod _{j=1}^m {{\mathcal {N}}}[0,\sigma ^2](w_j)\); the negative log-posterior then resolves to an \(\ell _2\)-regularized loss function. In analogy, we assume that the dimensions of \(\varvec{\omega }\) are generated by independent Gaussian processes; that is, \(\omega _j\sim \mathcal {GP}(\mathbf {0},k_j)\), where \(k_j(\mathbf {t},\mathbf {t}')\) is the covariance between coefficients \(\omega _j(\mathbf {t})\) and \(\omega _j(\mathbf {t}')\). A special case of this is an isotropic GP; here, kernel functions \(k_j(\mathbf {t},\mathbf {t}')\) are identical for all dimensions j. Therefore, for all \(\mathbf {t}, \mathbf {t}' \in \mathcal {T}\), covariance matrix \(\varvec{\kappa }(\mathbf {t},\mathbf {t}')\) is a diagonal matrix with diagonal elements \(k_j(\mathbf {t},\mathbf {t}')\), where the \(k_j:\mathcal {T}\times \mathcal {T} \rightarrow \mathbb {R}\) are scalar-valued, positive semidefinite kernel functions. Note that we do not make any assumptions about \(p(\mathbf {x}|y,\varvec{\omega }(\mathbf {t}))\), and the independence of the dimensions of \(\varvec{\omega }\) does not imply that the dimensions of the input vector \(\mathbf {x}\) are independent. Also note that isotropy of \(p(\varvec{\omega }(\mathbf {t}))\) is different from the assumption of an isotropic kernel, meaning a kernel that is uniform in all directions of the input space.

The process evaluates function \(\varvec{\omega }\) for all \(\mathbf {t}_i\) to create parameter vectors \(\mathbf {w}_1=\varvec{\omega }(\mathbf {t}_1), \ldots ,\mathbf {w}_n=\varvec{\omega }(\mathbf {t}_n)\). The process concludes by generating labels \(y_i\) from an observation model,

$$\begin{aligned} y_i \sim p(y|\mathbf {x}_i,\mathbf {w}_i); \end{aligned}$$
(2)

for instance, a standard linear model with Gaussian noise for regression or a logistic function of the inner product of \(\mathbf {w}_i\) and \(\mathbf {x}_i\) for classification.

The prediction problem is to infer the distribution of the label \(y_{\star }\) for a new observation \(\mathbf {x}_{\star }\) with task variable \(\mathbf {t}_{\star }\). For notational convenience, we aggregate the training instances into matrix \(\mathbf {X}\in \mathbb {R}^{n\times m}\), task variables into matrix \(\mathbf {T}\in \mathbb {R}^{n\times d}\), the parameter vectors associated with training observations into matrix \(\mathbf {W}\in \mathbb {R}^{n \times m}\) with row vectors \(\mathbf {w}_1^\mathsf{T},\ldots ,\mathbf {w}_n^\mathsf{T}\), and the labels \(y_1,\ldots ,y_n\) into vector \(\mathbf {y}\in \mathcal {Y}^n\).

In this model, the GP prior \(p(\varvec{\omega })\) over functions \(\varvec{\omega }: \mathcal {T} \rightarrow \mathbb {R}^m\) couples parameter vectors \(\varvec{\omega }(\mathbf {t})\) for different values \(\mathbf {t}\) of the task variable. The hierarchical Bayesian model of multitask learning assumes a coupling of parameters based on a hierarchical Bayesian prior (e.g., Gelman et al. 1995; Finkel and Manning 2009). We will now show that the varying-coefficient model with isotropic GP prior subsumes hierarchical Bayesian multitask models by choice of an appropriate kernel function \(\varvec{\kappa }\) of the Gaussian process that defines \(p(\varvec{\omega })\). Together with results on inference presented in Sect. 3, this result shows how inference for hierarchical Bayesian multitask models can be carried out using a Gaussian process. The following definition formalizes the hierarchical Bayesian multitask model.

Definition 1

(Hierarchical Bayesian multitask model) Let \(\mathcal {G} = (\mathcal {T},\mathbf {A})\) denote a tree structure over a set of tasks \(\mathcal {T} = \{1,\ldots ,k\}\) given by an adjacency matrix \(\mathbf {A}\), with \(1 \in \mathcal {T}\) the root node. Let \(\varvec{\sigma }\in \mathbb {R}^k\) denote a vector with entries \(\sigma _1,\ldots ,\sigma _k\). The following process generates the distribution \(p(\mathbf {y}|\mathbf {X},\mathbf {T};\mathcal {G},\varvec{\sigma })\) over labels \(\mathbf {y}\in \mathcal {Y}^n\) given instances \(\mathbf {X}\), task variables \(\mathbf {T}\), the task hierarchy \(\mathcal {G}\), and variances \(\varvec{\sigma }\): The process first samples parameter vectors \(\bar{\mathbf {w}}_{1},\ldots ,\bar{\mathbf {w}}_{k} \in \mathbb {R}^m\) according to

(3)
(4)

where \(pa(l) \in \mathcal {T}\) is a unique node with \(\mathbf {A}_{pa(l),l} = 1\) for each \(l \in \mathcal {T}\). Then, the process generates labels \(y_i \sim p(y|\mathbf {x}_i,\bar{\mathbf {w}}_{i})\), where \(p(y|\mathbf {x}_i,\bar{\mathbf {w}}_{i})\) is the same conditional distribution over labels given an instance and a parameter vector as was chosen for the varying-coefficient model in Eq. (2). This process defines the hierarchical Bayesian multitask model.

The following proposition shows that the varying-coefficient model presented in Sect. 2 subsumes the hierarchical Bayesian multitask model.

Proposition 1

Let \(\mathcal {G} = (\mathcal {T},\mathbf {A})\) denote a tree structure over a set of \(\mathcal {T} = \{1,\ldots ,k\}\) given by an adjacency matrix \(\mathbf {A}\). Let \(\varvec{\sigma }\in \mathbb {R}^k\) be a vector with entries \(\sigma _1,\ldots ,\sigma _k\). Let \(k_{\mathbf {A},\varvec{\sigma }}: \mathcal {T} \times \mathcal {T} \rightarrow \mathbb {R}\) be given by \(k_{\mathbf {A},\varvec{\sigma }}(t,t') = G_{t,t'}\), with \(G_{i,j}\) the entry at row i and column j of the matrix

and \(\mathbf {S}\in \mathbb {R}^{k \times k}\) denotes the diagonal matrix with entries \(\sigma _1^2,\ldots ,\sigma _k^2\). Let \(\varvec{\kappa }: \mathcal {T} \times \mathcal {T} \rightarrow \mathbb {R}^{m \times m}\) be given by and let \(p(\mathbf {y}|\mathbf {X},\mathbf {T};\varvec{\kappa }) = \int p(\mathbf {y}|\mathbf {W},\mathbf {X})p(\mathbf {W}|\mathbf {T};\varvec{\kappa })\mathrm {d}\mathbf {W}\) be the marginal distribution over labels given instances and task variables defined by the varying-coefficient model. Then it holds that \(p(\mathbf {y}|\mathbf {X},\mathbf {T};\varvec{\kappa }) = p(\mathbf {y}|\mathbf {X},\mathbf {T};\mathcal {G},\varvec{\sigma })\).

Proposition 1 implies that Bayesian prediction in the varying-coefficient model with the specified kernel function is identical to Bayesian inference in the hierarchical Bayesian multitask model. The proof is included in the “Appendix”. In Proposition 1, entries \(G_{t,t'}\) of \(\mathbf {G}\) represent a task similarity derived from the tree structure \(\mathcal {G}\). Instead of a tree structure over tasks, feature vectors describing individual tasks may also be given (Bonilla et al. 2007; Yan and Zhang 2009). In this case, \(\varvec{\kappa }(t,t')\) can be computed from the task features; the varying-coefficient model then subsumes existing approaches for multitask learning with task features (see Sect. 3.4).

Note that Equation 3 of Daumé III (2009) is a special case of our model for learning with two tasks and a task kernel \(\varvec{\kappa }\) that is for identical tasks and for differing tasks \(t\ne t'\).

3 Inference

We now address the problem of inferring predictions \(y_{\star }\) for instances \(\mathbf {x}_{\star }\) and task variables \(\mathbf {t}_{\star }\). Section 3.1 presents exact Bayesian solutions for regression; Sect. 3.2 discusses approximate inference for classification. Section 3.4 derives existing multitask models as special cases.

3.1 Regression

This subsection studies linear regression models of the form \(p(y | \mathbf {x}, \mathbf {w}) = \mathcal {N}(y|\mathbf {x}^\mathsf{T}\mathbf {w}, \tau ^2)\). Note that by substituting for the slightly heavier notation \(p(y | \mathbf {x}, \mathbf {w}) = \mathcal {N}(y|\varPhi (\mathbf {x})^\mathsf{T}\mathbf {w}, \tau ^2)\), this treatment also covers finite-dimensional feature maps. The predictive distribution for test instance \(\mathbf {x}_{\star }\) with task variable \(\mathbf {t}_{\star }\) is obtained by integrating over the possible parameter values \(\mathbf {w}_{\star }\) of the conditional distribution that has generated value \(y_{\star }\):

$$\begin{aligned} p(y_{\star }| \mathbf {X}, \mathbf {y},&\mathbf {T}, \mathbf {x}_{\star }, \mathbf {t}_{\star }) = \int p(y_{\star }| \mathbf {x}_{\star }, \mathbf {w}_{\star }) p(\mathbf {w}_{\star }| \mathbf {X}, \mathbf {y}, \mathbf {T}, \mathbf {t}_{\star }) \mathrm {d}\mathbf {w}_{\star }, \end{aligned}$$
(5)

where the posterior over \(\mathbf {w}_{\star }\) is obtained by integrating over the joint parameter values \(\mathbf {W}\) that have generated the labels \(\mathbf {y}\) for instances \(\mathbf {X}\) and task variables \(\mathbf {T}\):

$$\begin{aligned} p(\mathbf {w}_{\star }| \mathbf {X}, \mathbf {y}, \mathbf {T}, \mathbf {t}_{\star }) = \int p(\mathbf {w}_{\star }| \mathbf {W}, \mathbf {T}, \mathbf {t}_{\star })p(\mathbf {W}| \mathbf {X}, \mathbf {y}, \mathbf {T}) \mathrm {d}\mathbf {W}. \end{aligned}$$
(6)

Posterior distribution \(p(\mathbf {W}| \mathbf {X}, \mathbf {y}, \mathbf {T})\) in Eq. (6) depends on the likelihood function—the linear model—and the GP prior \(p(\varvec{\omega })\). The extrapolated posterior \(p(\mathbf {w}_{\star }| \mathbf {W}, \mathbf {T}, \mathbf {t}_{\star })\) for test instance \(\mathbf {x}_{\star }\) with task variable \(\mathbf {t}_{\star }\) depends on the Gaussian process. The following theorem states how the predictive distribution given by Eq. (5) can be computed.

Theorem 1

(Bayesian predictive distribution) Let \(\mathcal {Y} = \mathbb {R}\) and \(p(y|\mathbf {x},\mathbf {w}) = \mathcal {N}(y|\mathbf {x}^\mathsf{T}\mathbf {w}, \tau ^2)\). For all attributes \(j\in \{1,\ldots ,m\}\), let \(k_j(\mathbf {t},\mathbf {t}')\) a positive definite kernel function and let the task-kernel function \(\varvec{\kappa }(\mathbf {t},\mathbf {t}')\) return a diagonal matrix with diagonal elements \(k_j(\mathbf {t},\mathbf {t}')\). Let \(\mathbf {K}\in \mathbb {R}^{n\times n}\) be a matrix with components \(k_{ij}=\mathbf {x}_i^\mathsf{T}\varvec{\kappa }(\mathbf {t}_i,\mathbf {t}_j) \mathbf {x}_j\) and \(\mathbf {k}\in \mathbb {R}^n\) be a vector with components \(k_{i}=\mathbf {x}_i^\mathsf{T}\varvec{\kappa }(\mathbf {t}_i,\mathbf {t}_{\star }) \mathbf {x}_{\star }\). Then, the predictive distribution for the varying-coefficient model defined in Sect. 2 is given by

(7)

Before we prove Theorem 1, we highlight three observations about this result. First, the distribution \(p(y_{\star }|\mathbf {X},\mathbf {y},\mathbf {T}, \mathbf {x}_{\star },\mathbf {t}_{\star })\) has a surprisingly simple form. It is identical to the predictive distribution of a standard Gaussian process that uses concatenated vectors \((\mathbf {x}_1,\mathbf {t}_1),\ldots ,(\mathbf {x}_n,\mathbf {t}_n) \in \mathcal {X}\times \mathcal {T}\) as training instances, labels \(y_1,\ldots ,y_n\), and the kernel function \(k((\mathbf {x}_i,\mathbf {t}_i),(\mathbf {x}_j,\mathbf {t}_j)) = \mathbf {x}_i^\mathsf{T}\varvec{\kappa }(\mathbf {t}_i,\mathbf {t}_j) \mathbf {x}_j\). Covariance matrix \(\varvec{\kappa }(\mathbf {t}_i,\mathbf {t}_j)\) is diagonal; when the GP prior is isotropic, then all diagonal elements are identical and we refer to their value as \(k(\mathbf {t}_i,\mathbf {t}_j)\). We can see that in this case, the predictive distribution of Eq. (7) is identical to the predictive distribution of a standard Gaussian process with concatenated vectors \((\mathbf {x}_i,\mathbf {t}_i)\) and product kernel function \(k((\mathbf {x}_i,\mathbf {t}_i),(\mathbf {x}_j,\mathbf {t}_j)) = \mathbf {x}_i^\mathsf{T}\mathbf {x}_j k(\mathbf {t}_i,\mathbf {t}_j)\).

Secondly, when the GP is isotropic, then instances \(\mathbf {x}_1,\ldots ,\mathbf {x}_n,\mathbf {x}_{\star }\in \mathcal {X}\) only enter Eq. (7) in the form of inner products. The model can therefore directly be kernelized by defining the kernel matrix as \(\mathbf {K}_{ij}=k_{\mathcal {X}}(\mathbf {x}_i,\mathbf {x}_j) k(\mathbf {t}_i,\mathbf {t}_j)\) with kernel function \(k_{\mathcal {X}}(\mathbf {x}_i,\mathbf {x}_j) = \varPhi (\mathbf {x}_i)^\mathsf{T}\varPhi (\mathbf {x}_j)\) where \(\varPhi \) maps to a reproducing kernel Hilbert space. When the feature space is finite, then \(\varvec{\omega }\) maps the \(\mathbf {t}_i\) to a finite-dimensional \(\mathbf {w}_i\) and Theorem 1 implies a Bayesian predictive distribution derived from the generative process that Sect. 2 specifies. When the reproducing kernel Hilbert space does not have a finite dimension, Sect. 2 does no longer specify a corresponding proper stochastic process because \(p(\mathbf {w}_1,\ldots ,\mathbf {w}_n|\mathbf {T})\) would become infinite-dimensionally normally distributed. However, given the finite sample \(\mathbf {X}\) and \(\mathbf {T}\), a Mercer map (see, e.g., Schölkopf and Smola 2002, Section 2.2.4) constitutes a finite-dimensional space \(\mathbb {R}^n\) for which Sect. 2 again characterizes a corresponding stochastic process.

Thirdly and finally, Theorem 1 shows how Bayesian inference in varying-coefficient models with isotropic priors can be implemented efficiently. For general varying-coefficient models, the most expensive step of inference is to perform, for each sample generated by Gibbs sampling, a Cholesky decomposition of a matrix of size \(mn \times mn\) (discussed above Equation 17 of Gelfand et al. 2003). A sampled parameter vector \(\hat{\omega }(\mathbf {t}_1)\ldots \hat{\omega }(\mathbf {t}_n)\) is of size nm. In each sampling step, a Cholesky decomposition of the covariance matrix (which then has size \(nm\times nm\)) of parameter vectors has to be performed. This makes inference impractical for large-scale problems. Theorem 1 shows that under the assumption of an isotropic prior—or at least an independent prior distribution for each dimension—the latent parameter vectors \(\mathbf {w}_1,\ldots ,\mathbf {w}_n\) can be integrated out, which results in a GP formulation in which the covariance structure over parameter vectors resolves to an \(n \times n\) product-kernel matrix.

Proof (Proof of Theorem 1)

Let \(w_{ir}\) and \(w_{\star r}\) denote the r-th elements of vectors \(\mathbf {w}_i\) and \(\mathbf {w}_{\star }\), and let \(x_{ir}\) and \(x_{\star r}\) denote the r-th elements of vectors \(\mathbf {x}_i\) and \(\mathbf {x}_{\star }\). Let \(\mathbf {z}_{\star } = (z_1,\ldots ,z_n,z_{\star })^\mathsf{T}\in \mathbb {R}^{n+1}\) with \(z_i = \mathbf {x}_i^\mathsf{T}\mathbf {w}_i\) and \(z_{\star } = \mathbf {x}_{\star }^\mathsf{T}\mathbf {w}_{\star }\). Because \(\mathbf {w}_1,\ldots ,\mathbf {w}_n,\mathbf {w}_{\star }\) are evaluations of the function \(\varvec{\omega }\) drawn from a Gaussian process (Eq. 1), they are jointly Gaussian distributed and thus \(z_1,\ldots ,z_n,z_{\star }\) are also jointly Gaussian (e.g., Murphy 2012, Chapter 10.2.5). Because \(\varvec{\omega }\) is drawn from a zero-mean process, it holds that \(\mathbb {E}[z_i] = \mathbb {E}[\sum _{r=1}^m{x_{ir} w_{ir}}] = \sum _{r=1}^m{x_{ir}\mathbb {E}[w_{ir}]} = 0\) as well as \(\mathbb {E}[z_{\star }] = 0\) and therefore

$$\begin{aligned} p(\mathbf {z}_{\star }|\mathbf {X},\mathbf {T},\mathbf {x}_{\star },\mathbf {t}_{\star }) = \mathcal {N}(\mathbf {z}_{\star }|\mathbf {0},\mathbf {C}) \end{aligned}$$

where \(\mathbf {C}\in \mathbb {R}^{(n+1) \times (n+1)}\) denotes the covariance matrix.

For the covariances \(\mathbb {E}[z_i z_j]\) it holds that

$$\begin{aligned} \mathbb {E} \left[ z_i z_j\right]&= \mathbb {E} \left[ \mathbf {x}_i^\mathsf{T}\mathbf {w}_i \mathbf {x}_j^\mathsf{T}\mathbf {w}_j \right] \nonumber \\&= \mathbb {E} \left[ \left( \sum _{s=1}^m x_{is}w_{is} \right) \left( \sum _{r=1}^m x_{jr}w_{jr}\right) \right] \nonumber \\&= \sum _{s=1}^m \sum _{r=1}^m x_{is} x_{jr} \mathbb {E} \left[ w_{is} w_{jr} \right] \nonumber \\&= \sum _{s=1}^m x_{is} x_{js} \mathbb {E} \left[ w_{is} w_{js} \right] \end{aligned}$$
(8)
$$\begin{aligned}&= \mathbf {x}_i^\mathsf{T}\varvec{\kappa }(\mathbf {t}_i,\mathbf {t}_j)\mathbf {x}_j. \end{aligned}$$
(9)

In Eqs. (8) and (9) we exploit the independence of the Gaussian process priors for all dimensions: the covariance \(\mathbb {E}[w_{is} w_{jr}]\) is the element in row s and column r of the matrix \(\varvec{\kappa }(\mathbf {t}_i,\mathbf {t}_j) \in \mathbb {R}^{m \times m}\) obtained by evaluating the kernel function \(\varvec{\kappa }:\mathcal {T}\times \mathcal {T} \rightarrow \mathbb {R}^{m \times m}\) at \((\mathbf {t}_i,\mathbf {t}_j)\); by the independence assumption, \(\varvec{\kappa }(\mathbf {t}_i,\mathbf {t}_j)\) is a diagonal matrix and \(\mathbb {E}[w_{is} w_{jr}] = 0\) for \(s \ne r\) (see Sect. 2). We analogously derive

$$\begin{aligned} \mathbb {E}[z_i z_{\star }]&= \mathbf {x}_i ^\mathsf{T}\varvec{\kappa }(\mathbf {t}_i,\mathbf {t}_{\star })\mathbf {x}_{\star }, \end{aligned}$$
(10)
$$\begin{aligned} \mathbb {E}[z_{\star } z_{\star }]&= \mathbf {x}_{\star }^\mathsf{T}\varvec{\kappa }(\mathbf {t}_{\star },\mathbf {t}_{\star })\mathbf {x}_{\star }. \end{aligned}$$
(11)

Equations (9), (10) and (11) define the covariance matrix \(\mathbf {C}\), yielding

$$\begin{aligned} p(\mathbf {z}_{\star }|\mathbf {X},\mathbf {T},\mathbf {x}_{\star },\mathbf {t}_{\star }) = \mathcal {N}\left( \mathbf {z}_{\star }|\mathbf {0},\left( \begin{array}{cc} \mathbf {K}&{} \mathbf {k}\\ \mathbf {k}^\mathsf{T}&{} k_{\star } \end{array} \right) \right) \end{aligned}$$

where \(k_{\star } = \mathbf {x}_{\star }^\mathsf{T}\varvec{\kappa }(\mathbf {t}_{\star },\mathbf {t}_{\star })\mathbf {x}_{\star }\). For \(\mathbf {y}_{\star } = (y_1,\ldots ,y_n,y_{\star })\) it now follows that

(12)

The claim now follows by applying standard Gaussian identities to compute the conditional distribution \(p(y_{\star }|\mathbf {X},\mathbf {y},\mathbf {T},\mathbf {x}_{\star },\mathbf {t}_{\star })\) from Eq. (12). \(\square \)

3.2 Classification

The result given by Theorem 1 can be extended to classification settings with \(\mathcal {Y} = \{0,1\}\) by using non-Gaussian likelihoods p(y|z) that generate labels \(y \in \mathcal {Y}\) given outputs \(z \in \mathbb {R}\) of the linear model.

Corollary 1

(Bayesian predictive distribution for non-Gaussian likelihoods) Let \(\mathcal {Y} = \{0,1\}\). Let \(p(y_i|\mathbf {x}_i,\mathbf {w}_i)\) be given by a generalized linear model, defined by \(z_i~\sim ~\mathcal {N}(z|\mathbf {w}_i^\mathsf{T}\mathbf {x}_i,\tau ^2)\) and \(y_i \sim p(y|z_i)\). Let \(p(y_{\star }|\mathbf {x}_{\star },\mathbf {w}_{\star })\) be given by \(z_{\star }\sim \mathcal {N}(z|\mathbf {w}_{\star }^\mathsf{T}\mathbf {x}_{\star },\tau ^2)\) and \(y_{\star }\sim p(y|z_{\star })\). Let furthermore \(\mathbf {z}= (z_1,\ldots ,z_n)^\mathsf{T}\in \mathbb {R}^n\). For all attributes \(j\in \{1,\ldots ,m\}\), let \(k_j(\mathbf {t},\mathbf {t}')\) a positive definite kernel function and let the task-kernel function \(\varvec{\kappa }(\mathbf {t},\mathbf {t}')\) return a diagonal matrix with diagonal elements \(k_j(\mathbf {t},\mathbf {t}')\). Let \(\mathbf {K}\in \mathbb {R}^{n\times n}\) be a matrix with components \(k_{ij}=\mathbf {x}_i^\mathsf{T}\varvec{\kappa }(\mathbf {t}_i,\mathbf {t}_j)\mathbf {x}_j\) and \(\mathbf {k}\in \mathbb {R}^n\) a vector with components \(k_{i}=\mathbf {x}_i^\mathsf{T}\varvec{\kappa }(\mathbf {t}_i,\mathbf {t}_\star )\mathbf {x}_{\star }\). Then, the predictive distribution for the GP model defined in Sect. 2 is given by

(13)

with

A straightforward calculation shows that Eq. (13) is identical to the predictive distribution of a standard Gaussian process that uses concatenated vectors \((\mathbf {x}_1,\mathbf {t}_1),\ldots ,(\mathbf {x}_n,\mathbf {t}_n) \in \mathcal {X}\times \mathcal {T}\) as training instances, labels \(y_1,\ldots ,y_n\), the kernel \(k((\mathbf {x}_i,\mathbf {t}_i),(\mathbf {x}_j,\mathbf {t}_j)) = \mathbf {x}_i^\mathsf{T}\varvec{\kappa }(\mathbf {t}_i,\mathbf {t}_j) \mathbf {x}_j\), and likelihood function p(y|z). For isotropic GP priors, the kernel function is the product kernel \(k((\mathbf {x}_i,\mathbf {t}_i),(\mathbf {x}_j,\mathbf {t}_j)) = \mathbf {x}_i^\mathsf{T}\mathbf {x}_j k(\mathbf {t}_i,\mathbf {t}_j)\). For non-Gaussian likelihoods, exact inference in Gaussian processes is generally intractable, but approximate inference methods based on, e.g., Laplace approximation, variational inference or expectation propagation are available. The proof is included in the “Appendix”.

3.3 Algorithm

We are now ready to summarize the inference procedure in Algorithm 1. The \(\textit{isoVCM} \) algorithm summarizes two cases. In the primal case, instances \(\mathbf {x}\) are represented by explicit features. In this case, task-kernel function \(\varvec{\kappa }(\mathbf {t},\mathbf {t}')\) maps a pair of tasks to a diagonal covariance matrix whose diagonal elements \(k_j(\mathbf {t},\mathbf {t}')\) may differ over the features. This allows us to express background knowledge about differing variances of features. In the dual case, a kernel function \(k_{\mathcal {X}}(\mathbf {x},\mathbf {x}')\) is provided. In this case, an explicit feature representation could be constructed for any finite data set using a Mercer map. However, one then has no background knowledge about the variance of these artificially constructed features and the covariance matrices \(\varvec{\kappa }(\mathbf {t},\mathbf {t}')\) are isotropic with diagonal elements \(k(\mathbf {t},\mathbf {t}')\). Algorithm 1 combines input matrix \(\mathbf {X}\) and task variables \(\mathbf {T}\) into an overall kernel matrix \(\mathbf {K}\), and infers the distribution of target variable \(y_{\star }\) for test input \(\mathbf {x}_{\star }\) by standard inference in a GP with kernel matrix \(\mathbf {K}\).

To implement Algorithm 1, we use the Gaussian Processes for Machine Learning (GPML) toolbox (Rasmussen and Nickisch 2010). We use a normal likelihood function for regression. For classification experiments, we use the logistic likelihood function, and the Laplace approximation. To further speed up inference calculations for large data sets, we use the FITC approximation (Snelson and Ghahramani 2005) as implemented in GPML with 1000 randomly sampled inducing points. FITC approximates the overall kernel matrix \(\mathbf {K}\) by the covariances between instances and a set of inducing points, and the covariances between the inducing points. This reduces the costs of handling the kernel matrix from quadratic in the number of instances, to linear in the number of instances and quadratic in the (constant) number of inducing points.

Hyper-parameters of \(\textit{isoVCM} \) are the observation noise parameter and the kernel parameters; these are always tuned on the respective training set using gradient ascent in the marginal likelihood (again implemented through GPML).

figure a

3.4 Product kernels in transfer learning

Sections 3.1 and 3.2 have shown that inference in the varying-coefficient model with isotropic GP priors is equivalent to inference in standard Gaussian processes with products of task kernels and instance kernels. Similar product kernels are used in several existing transfer learning models. Our results identify the generative assumptions that underlie these models by showing that the product kernels which they employ can be derived from the assumption of a varying-coefficient model with isotropic GP prior and an appropriate kernel function.

Bonilla et al. (2007) study a setting in which there is a discrete set of k tasks, which are described by task-specific attribute vectors \(\mathbf {t}_1,\ldots ,\mathbf {t}_k\). They study a Gaussian process model based on concatenated feature vectors \((\mathbf {x},\mathbf {t})\) and a product kernel \(k((\mathbf {x},\mathbf {t}),(\mathbf {x}',\mathbf {t}')) = k_\mathcal {X}(\mathbf {x},\mathbf {x}')k_{\mathcal {T}}(\mathbf {t},\mathbf {t}')\), where \(k_\mathcal {X}(\mathbf {x},\mathbf {x}')\) reflects instance similarity and \(k_{\mathcal {T}}(\mathbf {t},\mathbf {t}')\) reflects task similarity. Theorem 1 and Corollary 1 identify the generative assumptions underlying this model: a varying-coefficient model with isotropic Gaussian process prior and kernel \(k_{\mathcal {T}}\) generates task-specific parameter vectors in a reproducing Hilbert space of the instance kernel \(k_{\mathcal {X}}\); a linear model in that Hilbert space generates the observed labels.

Evgeniou et al. (2005) and Álvarez et al. (2011) study multitask-learning problems in which task similarities are given in terms of a task graph. Their method uses the product of an instance kernel and the graph-Laplacian kernel of the task graph. We will now show that, when the task graph is a tree, that kernel emerges from Proposition 1. This signifies that, when the task graph is a tree, the graph regularization method of Evgeniou et al. (2005) is the dual formulation of hierarchical Bayesian multitask learning, and therefore Bayesian inference for hierarchical Bayesian models can be carried out efficiently using a standard Gaussian process with a graph-Laplacian kernel.

Definition 2

(Graph-Laplacian multitask kernel) Let \(\mathcal {G} = (\mathcal {T},\mathbf {M})\) denote a weighted undirected graph structure over tasks \(\mathcal {T} = \{1,\ldots ,k\}\) given by a symmetric adjacency matrix \(\mathbf {M}\in \mathbb {R}^{k \times k}\), where \(\mathbf {M}_{i,j}\) is the positive weight of the edge between tasks i and j or \(\mathbf {M}_{i,j}=0\) if no such edge exists. Let \(\mathbf {D}\) denote the weighted degree matrix of the graph, and \(\mathbf {L}= \mathbf {D}+ \mathbf {R}- \mathbf {M}\) the graph Laplacian, where a diagonal matrix \(\mathbf {R}\) that acts as a regularizer has been added to the degree matrix (Álvarez et al. 2011). The kernel function \(k_{\mathbf {M},\mathbf {R}}: \left( \mathcal {X}\times \mathcal {T}\right) \times \left( \mathcal {X}\times \mathcal {T}\right) \rightarrow \mathbb {R}\) given by

$$\begin{aligned} k_{\mathbf {M},\mathbf {R}}((\mathbf {x},\mathbf {t}),(\mathbf {x}',\mathbf {t}')) = \mathbf {L}^{\dagger }_{\mathbf {t},\mathbf {t}'} \mathbf {x}^\mathsf{T}\mathbf {x}', \end{aligned}$$

where \(\mathbf {L}^{\dagger }\) is the pseudoinverse of  \(\mathbf {L}\), will be referred to as the graph-Laplacian multitask kernel.

Proposition 2 states that the graph-Laplacian multitask kernel emerges as kernel function of the dual formulation of hierarchical Bayesian multitask learning (Definition 1).

Proposition 2

Let \(\mathcal {G}=(\mathcal {T},\mathbf {A})\) denote a directed tree structure given by an adjacency matrix \(\mathbf {A}\). Let \(\varvec{\sigma }\in \mathbb {R}^k\) be a vector with entries \(\sigma _1,\ldots ,\sigma _k\). Let \(\mathbf {B}\in \mathbb {R}^{k \times k}\) denote the diagonal matrix with entries \(0,\sigma _2^{-2},\ldots ,\sigma _k^{-2}\), let \(\mathbf {R}\in \mathbb {R}^{k \times k}\) denote the diagonal matrix with entries \(\sigma _1^{-2},0,\ldots ,0\), let \(\mathbf {M}= \mathbf {B}\mathbf {A}+ (\mathbf {B}\mathbf {A})^\mathsf{T}\), and let \(k_{\mathbf {A},\varvec{\sigma }}(\mathbf {t},\mathbf {t}')\) be defined as in Proposition 1. Then

$$\begin{aligned} k_{\mathbf {M},\mathbf {R}}((\mathbf {x},\mathbf {t}),(\mathbf {x}',\mathbf {t}')) = k_{\mathbf {A},\varvec{\sigma }}(\mathbf {t},\mathbf {t}')\mathbf {x}^\mathsf{T}\mathbf {x}'. \end{aligned}$$

Note that in Proposition 2, \(\mathbf {B}\mathbf {A}\) is an adjacency matrix in which an edge from node i to node j is weighted by the respective precision \(\sigma _j^{-2}\) of the conditional distribution (Eq. 4); adding the transpose yields a symmetric matrix \(\mathbf {M}\) of task relationship weights. The precision \(\sigma _1^{-2}\) of the root node prior is subsumed in the regularizer \(\mathbf {R}\). The proof is included in the “Appendix”.

4 Empirical study

The main application areas of varying-coefficient models are prediction problems with underlying spatial or temporal dynamics (Gelfand et al. 2003; Fan and Zhang 2008; Zhu et al. 2014; Estes et al. 2014). In this section, we will explore housing-price prediction and seismic-hazard analysis empirically.

4.1 Housing prices

A typical instance of this class of problems is housing-price prediction. Theorem 1 and Corollary 1 show how Bayesian inference for varying-coefficient models can be carried out efficiently for regression and classification problems. We will therefore study varying-coefficient models and reference methods for larger-scale real-estate-price and monthly-rent prediction problems. These applications do not only represent the class of geospatial prediction problems, but are economically relevant problems in their own right.

Propositions 1 and 2 explain how hierarchical Bayesian multitask-learning models can be derived as varying-coefficient models with a hierarchy of tasks. They reveal previously unknown relationships between known models which have been studied extensively, but do not lead to new learning techniques. These theoretical findings do not raise any questions that require empirical investigation.

4.1.1 Models under investigation

We study \(\textit{isoVCM} \), as shown in Algorithm 1 (dual case). As kernel function \(k(\mathbf {t}_i,\mathbf {t}_j)\) we always employ a Matérn kernel of degree \(\nu =1/2\), that is, \(k(\mathbf {t}_i,\mathbf {t}_j) = \theta _{\mathbf {t}} \exp ( - ||\mathbf {t}_i-\mathbf {t}_j||/\rho _{\mathbf {t}})\). As kernel function \(k_{\mathcal {X}}(\mathbf {x}_i,\mathbf {x}_j)\), we study both a linear kernel \(k_{\mathcal {X}}(\mathbf {x}_i,\mathbf {x}_j) = \theta _{\mathbf {x}} \mathbf {x}_i^\mathsf{T}\mathbf {x}_j\) and a Matérn kernel \(k_{\mathcal {X}}(\mathbf {x}_i,\mathbf {x}_j) = \theta _{\mathbf {x}} \exp ( - ||\mathbf {x}_i-\mathbf {x}_j||/\rho _{\mathbf {x}}) \) of degree \(\nu =1/2\). Here, \(\theta _{\mathbf {t}}\), \(\theta _{\mathbf {x}}\), \(\rho _{\mathbf {t}}\), and \(\rho _{\mathbf {x}}\) are hyperparameters of the kernel functions that are tuned according to marginal likelihood. The two resulting versions of our model are denoted by isoVCM \(^{\textit{lin}}\) and isoVCM \(^{\textit{mat}}\), respectively.

We compare Algorithm 1 to the varying-coefficient model with nonisotropic GP prior by Gelfand et al. (2003), in which the covariances are inferred from data (denoted Gelfand). We also compare against the kernel-local smoothing varying-coefficient model of Fan and Zhang (2008) that infers point estimates of model parameters. We study this model using a linear feature map for instances \(\mathbf {x}\in \mathcal {X}\) (Fan & Zhang \(^\textit{lin}\)) and a nonlinear feature map constructed from the same Matérn kernel as used for isoVCM \(^{\textit{mat}}\) (Fan & Zhang \(^\textit{mat}\)). We add an \(\ell _2\) regularizer to the models of Fan and Zhang (2008), because this improves their prediction accuracy. Hyper-parameters of the model of Fan and Zhang (2008) are the regularization parameter and a bandwidth parameter in the smoothing kernel employed by their model. As the model only infers point estimates, hyper-parameters cannot be tuned by marginal likelihood; instead, optimal values of these parameters are inferred by cross-validation grid search on the training data.

We finally compare against iid models that assume that \(p(y|\mathbf {x})\) is constant in \(\mathbf {t}\), and models that treat the variables in \(\mathbf {t}\) as additional features by appending them to the feature vector \(\mathbf {x}\). Specifically, we study standard Gaussian processes on the original features \(\mathbf {x}\), with a linear kernel \(k(\mathbf {x}_i,\mathbf {x}_j) = \theta \mathbf {x}_i^\mathsf{T}\mathbf {x}_j\) and a Matérn kernel \(k(\mathbf {x}_i,\mathbf {x}_j) = \theta \exp ( - ||\mathbf {x}_i-\mathbf {x}_j||/\rho ) \) of degree \(\nu =1/2\), denoted as GP \(_{\mathbf {x}}^{\textit{lin}}\) and GP \(_{\mathbf {x}}^{\textit{mat}}\). We also study standard Gaussian processes on a concatenated feature representation \((\mathbf {x}, \mathbf {t})\), again using a linear kernel \(k((\mathbf {x}_i,\mathbf {t}_i),(\mathbf {x}_j,\mathbf {t}_j)) = \theta (\mathbf {x}_i,\mathbf {t}_i)^\mathsf{T}(\mathbf {x}_j,\mathbf {t}_j)\) and a Matérn kernel \(k((\mathbf {x}_i,\mathbf {t}_i),(\mathbf {x}_j,\mathbf {t}_j)) = \theta \exp ( - ||(\mathbf {x}_i,\mathbf {t}_i)-(\mathbf {x}_j,\mathbf {t}_j)||/\rho ) \) of degree \(\nu =1/2\). These baselines are denoted as GP \(_{\mathbf {x},\mathbf {t}}^{\textit{lin}}\) and GP \(_{\mathbf {x},\mathbf {t}}^{\textit{mat}}\). The standard Gaussian process baselines are also implemented using GPML and use the same likelihood functions and inference methods (including FITC) as isoVCM. Hyper-parameters of the baselines (observation noise parameter and kernel parameters \(\theta \), \(\rho \)) are also tuned using gradient ascent in the marginal likelihood on the training data.

4.1.2 Experimental setting

We acquire records of real-estate sales in New York City (2013). The data set and our preprocessing are detailed in the “Appendix”. For regression, the sales price serves as target variable y; for classification, y is a binary indicator that distinguishes between transactions with a price above the median of 450,000 dollars from transactions below it. After preprocessing, the data set contains 231,708 sales records with 94 attributes such as the the floor space, plot area, property class (e.g., family home, condominium, office, or store), and the date of construction. We transform addresses into geographical latitude and longitude. We encode the sales date and geographical latitude and longitude of the property as task variable \(\mathbf {t}\in \mathbb {R}^3\). This reflects the assumption that the relationship between property features and sales price vary smoothly in geographical location and time. We divide the records, which span dates from January 2003 to December 2009, into 25 consecutive blocks. Models are trained on a set of n instances sampled randomly from a window of five blocks of historical data and evaluated on the subsequent block; results are averaged over all blocks.

Fig. 1
figure 1

Execution time over training set size n

For rent prediction, we acquire records on the monthly rent for apartments and houses in the states of California and New York (US Census Bureau 2013). Again, data set and preprocessing are detailed in the “Appendix”. For regression, the target variable y is the monthly rent; for classification, y is a binary indicator that distinguishes contracts with a monthly rent above the median from those with a rent below. The preprocessed data sets contain 36,785 records (state of California) and 17,944 records (state of New York) with 24 input variables. Geographical latitude and longitude constitute the task variable \(\mathbf {t}\in \mathbb {R}^2\). Models are evaluated using 20-fold cross validation; in each iteration, a random subset of n training instances is sampled randomly from the training part of the data.

4.1.3 Execution time

We compare the execution times of isoVCM \(^{\textit{lin}}\), Gelfand, and Fan & Zhang \(^\textit{lin}\). Figure 1 shows the execution time for training and prediction on one block of test instances in the sales-price prediction task over the training sample size n.

For Gelfand, the most expensive step during inference is computation of the inverse of a Cholesky decomposition of an \(nm \times nm\) matrix, which needs to be performed within each Gibbs sampling iteration. Figure 1 shows the execution time of 5000 iterations of this step (3000 burn-in and 2000 sampling iterations, according to Gelfand et al. 2003) which is a lower bound on the overall execution time. Bayesian inference in isoVCM \(^{\textit{lin}}\) is between 6 and 7 orders of magnitude faster than in the Gelfand model. isoVCM \(^{\textit{lin}}\) uses the FITC approximation; but since we use 1000 inducing points and the sample size n stays below that in this experiment, FITC does not accelerate the inference. We conclude that Gelfand is impractical for this application and exclude this method from the remaining experiments.

For Fan & Zhang \(^\textit{lin}\), separate point estimates of model parameters have to be inferred for each test instance, which involves solving a separate optimization problem. For regression, efficient closed-form solutions for parameter estimates are available. For classification, more expensive numerical optimization is required (Fan and Zhang 2008); this results in a higher execution time.

Fig. 2
figure 2

Mean absolute error for real-estate sales prices in New York City. Error bars indicate the standard error

4.1.4 Prediction accuracy

In all subsequent experiments, each method is given 30 CPU core days of execution time; experiments are run sequentially for increasing number n of training instances until the cumulative execution time reaches this limit.

Figure 2 shows the mean absolute error for real-estate sales-price predictions (left) and the mean zero-one loss for classifying sales transactions (right) as a function of training set size n. For regression, Fan & Zhang \(^\textit{lin}\) and Fan & Zhang \(^\textit{mat}\) partially completed the experiments; for classification, both methods did not complete the experiment for the smallest value of n. All other methods completed the experiments within the time limit. For regression, we observe that isoVCM \(^{\textit{lin}}\) is substantially more accurate than GP \(_{\mathbf {x}}^{\textit{lin}}\), GP \(_{\mathbf {x},\mathbf {t}}^{\textit{lin}}\), and Fan & Zhang \(^\textit{lin}\); isoVCM \(^{\textit{mat}}\) is more accurate than GP \(_{\mathbf {x}}^{\textit{mat}}\) and GP \(_{\mathbf {x},\mathbf {t}}^{\textit{mat}}\) with \(p<0.01\) for all training set sizes according to a paired t-test. Significance values of paired t-test comparing isoVCM \(^{\textit{mat}}\) and Fan & Zhang \(^\textit{mat}\) fluctuate between \(p<0.01\) and \(p < 0.2\) for different n, indicating that isoVCM \(^{\textit{mat}}\) is likely more accurate than Fan & Zhang \(^\textit{mat}\). For classification, isoVCM \(^{\textit{lin}}\) substantially outperforms GP \(_{\mathbf {x}}^{\textit{lin}}\) and GP \(_{\mathbf {x},\mathbf {t}}^{\textit{lin}}\); isoVCM \(^{\textit{mat}}\) outperforms GP \(_{\mathbf {x}}^{\textit{mat}}\) and GP \(_{\mathbf {x},\mathbf {t}}^{\textit{mat}}\) (\(p < 0.01\) for \(n > 125\)).

Figure 3 shows the mean absolute error for predicting the monthly rent (left) and the mean zero-one loss for classifying rental contracts (right) for the states of California (upper row) and New York (lower row) as a function of training set size n. Fan & Zhang \(^\textit{lin}\) completed the regression experiments within the time limit and partially completed the classification experiment; Fan & Zhang \(^\textit{mat}\) partially completed the regression experiment but did not complete the classification experiment for the smallest value of n. We again observe that isoVCM \(^{\textit{mat}}\) yields the most accurate predictions for both classification and regression problems; isoVCM \(^{\textit{lin}}\) always yields more accurate predictions than Fan & Zhang \(^\textit{lin}\) and more accurate predictions than GP \(_{\mathbf {x},\mathbf {t}}^{\textit{lin}}\) for training set sizes larger than \(n = 1000\).

Fig. 3
figure 3

Mean absolute error for monthly housing rents in the states of California (upper row) and New York (lower row)

4.2 Seismic-hazard analysis

In this section, we study the model’s ability to predict ground motion during seismic events in the State of California. Here, we focus on evaluating the model and exploring its versatility; an extended description of this study that focuses on the seismological findings has been published by Landwehr et al. (2016).

In this application, instances \(\mathbf {x}_i=[M, R_{JB},V_{S30},F_{NM},F_R]\) are seismic readings that consist of the magnitude of an earthquake, the Joyner–Boore distance (the distance to the vertical projection of the fault to the earth’s surface), the time-averaged shear-wave velocity in the upper 30ms, and the style (normal or reverse) of faulting. This representation is commonly used in current ground-motion models. Target values \(y_i\) are the logarithmic peak ground acceleration—the highest acceleration which the ground at the given location will experience—and logarithmic spectral accelerations at time periods of 0.02, 0.05, 0.1, 0.2, 0.5, 1, and 4 s, in units of the earth gravity g. Spectral accelerations indicate resonance that may occur in buildings.

For each ground-motion record, latitude and longitude for both the seismic event, \(\mathbf {t}_e\), and the station, \(\mathbf {t}_s\), are available. The event coordinates are given by the horizontal projection of the geographical center of the rupture, estimated from the NGA West 2 source flatfile (Ancheta et al. 2014).

4.2.1 Models under investigation

The structure of a ground-motion model is strongly guided by the underlying basic physical processes. An ergodic ground-motion model has the form

$$\begin{aligned} y= & {} w_1+w_2M+w_3M^2+(w_4+w_5M)\sqrt{R_{JB}^2+h^2}\nonumber \\&+\,w_6R_{JB}+w_7\log V_{s30}+w_8 F_R+w_9F_{NM} \end{aligned}$$
(14)

Since h is a constant, Eq. (14) can be phrased as a linear model \(y=\mathbf {w}^\top \mathbf {x}\) by including the compound terms \(M^2\), \(\sqrt{R_{JB}^2+h^2}\), and \(M\sqrt{R_{JB}^2+h^2}\) in the input vector \(\mathbf {x}\). Based on our understanding of the seismic process, we now allow some of these coefficients to vary with \(\mathbf {t}_e\) and \(\mathbf {t}_s\) by imposing a GP prior on them. For the remaining parameters, which by the nature of the underlying physical processes cannot depend on \(\mathbf {t}_e\) or \(\mathbf {t}_s\), the GP prior resolves to a standard Gaussian prior. Hence, the ground-motion model with varying coefficients has the form

$$\begin{aligned} y= & {} \varvec{\omega }_0(\mathbf {t}_e)+\varvec{\omega }_1(\mathbf {t}_s)+\varvec{\omega }_2M+\varvec{\omega }_3M^2+(\varvec{\omega }_4(\mathbf {t}_e) +\varvec{\omega }_5M)\sqrt{R_{JB}^2+h^2}\nonumber \\&+\,\varvec{\omega }_6(\mathbf {t}_e)R_{JB}+\varvec{\omega }_7(\mathbf {t}_s)\log V_{s30}+\varvec{\omega }_8 F_R+\varvec{\omega }_9F_{NM} \end{aligned}$$
(15)

This model is implemented using Algorithm 1 (primal case). Values of \(\varvec{\kappa }(\mathbf {t},\mathbf {t}')\) are diagonal matrices with entries \(k_1(\mathbf {t},\mathbf {t}'),\ldots ,k_d(\mathbf {t},\mathbf {t}')\). This means that each dimension of \(\varvec{\omega }(\mathbf {t})\)—corresponding to a particular coefficient—is generated by an independent scalar-valued Gaussian process whose covariance is given by \(k_j(\mathbf {t},\mathbf {t}') \in \mathbb {R}\). The kernel functions \(k_j(\mathbf {t},\mathbf {t}')\) are given by

$$\begin{aligned} k_j(\mathbf {t},\mathbf {t}')&= \left\{ \begin{array}{ll} \theta _j &{} \quad \text { if }\,\,j \in \{2,3,5,8,9\}\\ \theta _j \exp \left( - \frac{||\mathbf {t}_e-\mathbf {t}'_e||}{\rho _j} \right) +\pi _j &{}\quad \text { if }\,\,j \in \{0,4,6\}\\ \theta _j \exp \left( - \frac{||\mathbf {t}_s-\mathbf {t}'_s||}{\rho _j} \right) +\pi _j &{}\quad \text { if }\,\,j \in \{1,7\} \end{array} \right. \end{aligned}$$
(16)

where \(\theta _j\), \(\rho _j\) and \(\pi _j\) are kernel parameters. For coefficients that do not depend on either event or station coordinates (\(j\in \{2,3,5,8,9\}\)), the kernel function \(k_j(\mathbf {t},\mathbf {t}')\) is constant, which implies that any function \(\varvec{\omega }\) drawn from the GP prior (Eq. 1) is constant in its j-th dimension. For coefficients that depend on event or station coordinates (\(j \in \{0,4,6\}\) or \(j \in \{1,7\}\), respectively), kernel function \(k_j(\mathbf {t},\mathbf {t}')\) is a Matérn kernel function of degree \(\nu = 1/2\) based on the Euclidian distance between event or station coordinates, implying that the j-th dimension of \(\varvec{\omega }\) varies with \(\mathbf {t}_s\) or \(\mathbf {t}_e\). Parameter \(\pi _j\) is a constant offset to the Matérn kernel.

We compare against a global GP model of the form stated in Eq. (14) that does not assume that any parameter varies with geographical latitude or longitude. Both models are implemented in GPML, using FITC inference and tuning all hyper-parameters according to marginal likelihood on the training data.

4.2.2 Experimental setting

We use the NGA West 2 data set (Abrahamson et al. 2014). We use only the data from California and Nevada, since data from other regions will be spatially uncorrelated to this region. In total, there are 10,692 records from 221 earthquakes, recorded at 1425 stations. We run-tenfold cross validation on this data set.

Fig. 4
figure 4

Prediction of logarithmic spectral acceleration: RMSE for the VCM and the ergodic global model, estimated by tenfold cross validation

4.2.3 Results

Figure 4 shows the root-mean-squared prediction error (RMSE), estimated by tenfold cross-validation. The VCM has a consistently lower RMSE than the global model. This indicates that incorporating spatial differences improves ground-motion prediction, even for a relatively small region such as California.

5 Discussion and related work

Gaussian processes are used widely in geospatial analysis (e.g.,  Matheron 1963). The covariogram is a basic tool in spatial models (e.g.,  Cressie 2015); it models the covariance between the values of a quantity at different points in space as a function of the distance between these points. Gaussian-process convolutions (Higdon 2002) let an arbitrary kernel function induce a covariance function in (time and) space. Gaussian-process convolutions can be applied to discrete multi-task problems: multiple dependent output variables can share the same underlying spatial covariance (Ver Hoef and Barry 1998). For instance, this allows to build a model for the concentration levels of multiple pollutants that share a joint spatial covariance.

In the linear model of coregionalization, multiple output dimensions (or discrete tasks) are coupled by scalar weights; the resulting kernel is a sum where each summand is the product of a covariance functions describing the dependence of output dimensions and a covariance function of the input coordinates. Gaussian-process regression networks (Wilson et al. 2011) model the dependency of multiple output variables (tasks) by an adaptive mixture of Gaussian processes, which resembles the way in which neural networks couple multiple output units via shared hidden variables. In all these models, time and space constitute the input or are part of the input; the multi-task nature of the learning problems results from multiple dependent output variables.

By contrast, varying-coefficient models reflect applications in which the conditional distribution \(p(y|\mathbf {x},\mathbf {t},\mathbf {w})\) of a single output variable y shares much of its structure across different values of task variables \(\mathbf {t}\); in our applications, these task variables are continuous and reflect space, time, or both. That is, a geospatial correlation structure couples the parameters \(\mathbf {w}\) of the relationship between \(\mathbf {x}\) and y, and input variables \(\mathbf {x}\) are treated differently from time and/or space variables \(\mathbf {t}\). In varying-coefficient models with GP priors, \(p(y|\mathbf {x},\varvec{\omega }(\mathbf {t}))\) varies smoothly in \(\mathbf {t}\). While ridge and logistic regression assume that parameters \(\mathbf {w}\) are generated by an isotropic Gaussian prior, we explore a model in which function \(\varvec{\omega }\) is governed by an independent GP prior for each dimension. Ridge regression, logistic regression, and isoVCM are discriminative models—they do not model the likelihood \(p(\mathbf {x}|y,\mathbf {w})\) of the input variables. For discriminative models, the isotropy assumption on the model parameters does not translate into an isotropy assumption on the input attributes.

Propositions 1 and 2 shows that a GP with graph-Laplacian multi-task kernel (Evgeniou et al. 2005) emerges as dual formulation of hierarchical Bayesian multitask learning. The main motivation of varying-coefficient models, however, lies in application domains in which \(p(y|\mathbf {x},\varvec{\omega }(\mathbf {t}))\) varies in time, location, or both.

In varying-coefficient models, each output \(y_i\) is generated by its own parameter vector \(\mathbf {w}_i=\varvec{\omega }(\mathbf {t}_i)\). Inference therefore involves nm parameters; and without an independence or isotropy assumption, \(nm\times nm\) many covariances have to be inferred. This makes GP priors with full covariance (Gelfand et al. 2003) impractical for all but the smallest samples. Theorem 1 shows that, for isotropic GP priors, Bayesian inference in varying-coefficient models can be carried out efficiently with a standard GP using the product of a task kernel and an instance kernel. This clarifies the exact modeling assumptions required to derive the multitask kernel of Bonilla et al. (2007), and also highlights that hierarchical Bayesian inference can be carried out efficiently by using a standard GP with graph-Laplacian kernel (Evgeniou et al. 2005).

Product kernels play a role in other multitask learning models. In the linear coregionalization model, several related functions are modeled as linear combinations of GPs; the covariance function then resolves to a product of a kernel function on instances and a matrix of mixing coefficients (Journel and Huijbregts 1978; Álvarez et al. 2011). A similar model is studied by Wang et al. (2007); here mixing coefficients are given by latent variables. Zhang and Yeung (2010) study a model for learning task relationships, and show that under a matrix-normal regularizer the solution of a multitask-regularized risk minimization problem can be expressed using a product kernel. Theorem 1 can be seen as a generalization of their result in which the regularizer is replaced by a prior over functions, and the regularized risk minimization perspective by a fully Bayesian analysis.

Non-stationarity can also be modeled in GPs by assuming that either the residual variance  (Wang and Neal 2012), the scale of the covariance function (Tolvanen et al. 2014), or the amplitude of the output (Adams and Stegle 2008) are input-dependent. The varying-coefficient model differs from these models in that the source of nonstationarity is observed in the task variable.

The main application areas of varying-coefficient models are prediction problems with underlying spatial or temporal dynamics, such as real-estate pricing (Gelfand et al. 2003; Fan and Zhang 2008) neuroimaging (Zhu et al. 2014), and modeling of time-varying medical risks (Estes et al. 2014). In the domain of real estate price prediction, the dependency between property attributes and the market price changes continuously with geographical coordinates and time. Empirically, we observe that Bayesian inference in isoVCM is several orders of magnitude faster than inference in varying-coefficient models with nonisotropic GP priors. We observe that the linear and kernelized isoVCM models predict real estate prices and housing rents more accurately over time and space than kernel-local smoothing varying-coefficient models, and are also more accurate than linear and kernelized models that append the task variables to the attribute vector or ignore the task variables.

In seismic hazard analysis, the model allows parameters of ground-motion models to vary smoothly in the location of seismic event and station. Today, seismic hazards in the State of California are predicted with an ergodic model whose parameters are fixed over all of California. We have derived a ground-motion model that imposes a GP prior on some model parameters. We observe that the varying-coefficient model consistently reduces the RMSE for predictions of the peak-ground acceleration and spectral accelerations substantially compared to the ergodic model.