Tractable Function-Space Variational Inference in
Bayesian Neural Networks

Tim G. J. Rudner (University of Oxford), Zonghao Chen (University College London), Yee Whye Teh (University of Oxford), Yarin Gal (University of Oxford)
Corresponding author. Email: <tim.rudner@cs.ox.ac.uk>.

Abstract

Reliable predictive uncertainty estimation plays an important role in enabling the deployment of neural networks to safety-critical settings. A popular approach for estimating the predictive uncertainty of neural networks is to define a prior distribution over the network parameters, infer an approximate posterior distribution, and use it to make stochastic predictions. However, explicit inference over neural network parameters makes it difficult to incorporate meaningful prior information about the data-generating process into the model. In this paper, we pursue an alternative approach. Recognizing that the primary object of interest in most settings is the distribution over functions induced by the posterior distribution over neural network parameters, we frame Bayesian inference in neural networks explicitly as inferring a posterior distribution over functions and propose a scalable function-space variational inference method that allows incorporating prior information and results in reliable predictive uncertainty estimates. We show that the proposed method leads to state-of-the-art uncertainty estimation and predictive performance on a range of prediction tasks and demonstrate that it performs well on a challenging safety-critical medical diagnosis task in which reliable uncertainty estimation is essential.

1 Introduction

Machine learning models succeed at an increasingly wide range of narrowly defined tasks (Krizhevsky et al., 2012; Mnih et al., 2013; Silver et al., 2016; Jumper et al., 2021) but may fail without warning when used on inputs that are meaningfully different from the data they were trained on (Amodei et al., 2016; Hendrycks et al., 2021; Rudner and Toner, 2021a, b). To deploy machine learning models in safety-critical environments where failures are costly or may endanger human lives, machine learning methods must be reliable and have the ability to ‘fail gracefully.’ A promising tool for incorporating fail-safe mechanisms into machine learning systems, predictive uncertainty quantification allows machine learning models to express their confidence in the correctness of their predictions.

In this paper, we develop a method for obtaining reliable uncertainty estimates in Bayesian neural networks (bnns, Neal (1996)). While bnns have promised to combine the advantages of deep learning and Bayesian inference, existing approaches for approximate inference in bnns fall short of this promise and have been demonstrated to result in approximate posterior predictive distributions that underperform ‘non-Bayesian’ methods both in terms of predictive accuracy and uncertainty quantification—making them of limited use in practice (Ovadia et al., 2019; Foong et al., 2019; Farquhar et al., 2020a; Band et al., 2021). A potential reason for this shortcoming is that commonly used parameter-space inference methods make it difficult to define meaningful priors that effectively incorporate information about the data-generating process into inference.

To avoid this limitation, we follow Sun et al. (2019) and consider a variational objective defined explicitly in terms of distributions over functions induced by distributions over parameters. In contrast to prior works that rely on approximation techniques that prevent such function-space variational objectives from being used with high-dimensional inputs and highly overparameterized neural networks, we propose a simple estimator of the Kullback-Leibler divergence between distributions over functions that enables us to perform stochastic variational inference. The proposed estimation procedure allows defining priors that explicitly encourage high predictive uncertainty away from the training data as well as priors that reflect relevant information about the task at hand.

We demonstrate that this approach leads to posterior approximations that exhibit significantly improved predictive uncertainty estimates compared to a wide array of state-of-the-art Bayesian and non-Bayesian methods. Figure 1 shows examples of predictive distributions obtained via function-space variational inference on low-dimensional, easy-to-visualize datasets. As can be seen in the figures, the predictive distributions fit the training data well while also exhibiting a high degree of predictive uncertainty in parts of the input space far away from the training data, as desired.

Contributions. We propose a simple estimation procedure for performing function-space variational inference in bnns. The variational method allows for the incorporation of meaningful prior information about the data-generating process into the inference and produces reliable predictive uncertainty estimates. We perform a thorough empirical evaluation in which we compare the proposed approach to a wide array of competing methods and show that it consistently results in high predictive performance and reliable predictive uncertainty estimates, outperforming other methods in terms of predictive accuracy, robustness to distribution shifts, and uncertainty-based detection of distributionally shifted data samples. We evaluate the proposed method on standard benchmarking datasets as well as on a safety-critical medical diagnosis task in which reliable uncertainty estimation is essential. Our code can be accessed at https://github.com/timrudner/FSVI.

Figure 1 (a): Predictive Distribution.

2 Preliminaries

We consider supervised learning tasks on data $\mathcal{D}\,\dot{=}\,\{(\mathbf{x}_{n},\mathbf{y}_{n})\}_{n=1}^{N}=(\mathbf{X}_{\mathcal{D}},\mathbf{y}_{\mathcal{D}})$ with inputs $\mathbf{x}_{n}\in\mathcal{X}\subseteq\mathbb{R}^{D}$ and targets $\mathbf{y}_{n}\in\mathcal{Y}$, where $\mathcal{Y}\subseteq\mathbb{R}^{Q}$ for regression and $\mathcal{Y}\subseteq\{0,1\}^{Q}$ for classification tasks. Bayesian neural networks (bnns) are stochastic neural networks trained using (approximate) Bayesian inference. Denoting the parameters of such a stochastic neural network by the multivariate random variable $\bm{\Theta}\in\mathbb{R}^{P}$ and letting the function mapping defined by a neural network architecture be given by $f:\mathcal{X}\times\mathbb{R}^{P}\rightarrow\mathbb{R}^{Q}$, we obtain a random function $f(\cdot\,;\bm{\Theta})$. For a parameter realization ${\bm{\theta}}$, we obtain a corresponding function realization, $f(\cdot\,;{\bm{\theta}})$. When evaluated at a finite collection of points $\mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{m}$, $f(\mathbf{X};\bm{\Theta})$ is a multivariate random variable and $f(\mathbf{X};{\bm{\theta}})$ is a vector.

Letting $p_{\mathbf{y}|f(\mathbf{X};\bm{\Theta})}$ be a likelihood function and $p_{\mathbf{y}|f(\mathbf{X};\bm{\Theta})}(\mathbf{y}_{\mathcal{D}}\,|\,f(\mathbf{X}_{\mathcal{D}};{\bm{\theta}}))$ be the likelihood of observing the targets $\mathbf{y}_{\mathcal{D}}$ under the stochastic function $f(\cdot\,;\bm{\Theta})$ evaluated at inputs $\mathbf{X}_{\mathcal{D}}$, and letting $p_{\bm{\Theta}}$ be a prior distribution over the stochastic network parameters $\bm{\Theta}$, we can use Bayes’ Theorem to find the posterior distribution, $p_{\bm{\Theta}|\mathcal{D}}$ (MacKay, 1992; Neal, 1996). However, since the mapping $f$ is a nonlinear function of the stochastic parameters $\bm{\Theta}$, exact inference is analytically intractable. Variational inference is an approach that seeks to sidestep this intractability by framing posterior inference as a variational optimization problem, where the goal is to find a distribution $q_{\bm{\Theta}}$ in a variational family $\mathcal{Q}_{q_{\bm{\Theta}}}$ that solves the variational problem $\min_{q_{\bm{\Theta}}\in\mathcal{Q}_{q_{\bm{\Theta}}}}\mathbb{D}_{\textrm{KL}}(q_{\bm{\Theta}}\;\|\;p_{\bm{\Theta}|\mathcal{D}})$ (Wainwright and Jordan, 2008). If $\mathcal{Q}_{q_{\bm{\Theta}}}$ is the family of mean-field Gaussian distributions and the prior distribution over parameters $p_{\bm{\Theta}}$ is given by a diagonal Gaussian distribution, the resulting variational objective is amenable to stochastic variational inference and can be optimized using gradient-based methods (Hinton and van Camp, 1993; Graves, 2011; Hoffman et al., 2013; Blundell et al., 2015).

2.1 A Function-Space Perspective on Variational Inference in Bayesian Neural Networks

Instead of seeking to infer an approximate posterior distribution over parameters, we frame variational inference in stochastic neural networks as inferring an approximation to the posterior distribution over functions $p_{f(\cdot\,;\bm{\Theta})|\mathcal{D}}$ induced by the posterior distribution over parameters $p_{\bm{\Theta}|\mathcal{D}}$ , that is,

$$p_{f(\cdot\,;\bm{\Theta})|\mathcal{D}}(f(\cdot\,;{\bm{\theta}})\,|\,\mathcal{D})=\int_{\mathbb{R}^{P}}p_{\bm{\Theta}|\mathcal{D}}({\bm{\theta}}^{\prime}\,|\,\mathcal{D})\,\delta(f(\cdot\,;{\bm{\theta}})-f(\cdot\,;{\bm{\theta}}^{\prime}))\,\textrm{d}{\bm{\theta}}^{\prime}, \tag{1}$$

where $\delta(\cdot)$ is the Dirac delta function (Wolpert, 1993). Considering the prior distribution over functions $p_{f(\cdot\,;\bm{\Theta})}$ induced by a prior distribution over parameters $p_{\bm{\Theta}}$ ,

$$p_{f(\cdot\,;\bm{\Theta})}(f(\cdot\,;{\bm{\theta}}))=\int_{\mathbb{R}^{P}}p_{\bm{\Theta}}({\bm{\theta}}^{\prime})\,\delta(f(\cdot\,;{\bm{\theta}})-f(\cdot\,;{\bm{\theta}}^{\prime}))\,\textrm{d}{\bm{\theta}}^{\prime}, \tag{2}$$

and the variational distribution over functions $q_{f(\cdot\,;\bm{\Theta})}$ induced by a variational distribution over parameters $q_{\bm{\Theta}}$ ,

$$q_{f(\cdot\,;\bm{\Theta})}(f(\cdot\,;{\bm{\theta}}))=\int_{\mathbb{R}^{P}}q_{\bm{\Theta}}({\bm{\theta}}^{\prime})\,\delta(f(\cdot\,;{\bm{\theta}})-f(\cdot\,;{\bm{\theta}}^{\prime}))\,\textrm{d}{\bm{\theta}}^{\prime}, \tag{3}$$

we can express the problem of finding a posterior distribution over functions variationally as

$$\min_{q_{\bm{\Theta}}\in\mathcal{Q}_{q_{\bm{\Theta}}}}\mathbb{D}_{\textrm{KL}}(q_{f(\cdot\,;\bm{\Theta})}\,\|\,p_{f(\cdot\,;\bm{\Theta})|\mathcal{D}}), \tag{4}$$

which allows us to effectively incorporate meaningful prior information about the underlying data-generating process into training. As discussed by Burt et al. (2021), this variational objective is guaranteed to be well-defined for suitably chosen prior distributions over functions. Specifically, the KL divergence between two distributions over functions generated from different distributions over parameters applied to the same mapping (e.g., the same neural network architecture) is well-defined (i.e., finite) if the KL divergence between the distributions over parameters is finite, since, by the strong data processing inequality (Polyanskiy and Wu, 2017),

$$\mathbb{D}_{\textrm{KL}}(q_{f(\cdot\,;\bm{\Theta})}\,\|\,p_{f(\cdot\,;\bm{\Theta})})\leq\mathbb{D}_{\textrm{KL}}(q_{\bm{\Theta}}\,\|\,p_{\bm{\Theta}}). \tag{5}$$

As a result, if $\mathbb{D}_{\textrm{KL}}(q_{\bm{\Theta}}\,\|\,p_{\bm{\Theta}})<\infty$ , which is the case for finite-dimensional parameter vectors $\bm{\Theta}$ and $q_{\bm{\Theta}}$ absolutely continuous with respect to $p_{\bm{\Theta}}$ , then the function-space KL divergence is finite and thus well-defined as a variational objective.
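As a worked illustration of Equation 5, consider a deliberately simple case that is not part of the method itself and is assumed here only for exposition: a linear model $f(\mathbf{x};{\bm{\theta}})=\mathbf{x}^{\top}{\bm{\theta}}$ evaluated at a single point $\mathbf{x}\neq\mathbf{0}$, with $q_{\bm{\Theta}}=\mathcal{N}({\bm{\mu}},\sigma_{q}^{2}\mathbf{I}_{P})$, $p_{\bm{\Theta}}=\mathcal{N}(\mathbf{0},\sigma_{p}^{2}\mathbf{I}_{P})$, and $r\,\dot{=}\,\sigma_{q}^{2}/\sigma_{p}^{2}$. The induced distributions over $f(\mathbf{x};\bm{\Theta})$ are univariate Gaussians, so both divergences are available in closed form:

$$\mathbb{D}_{\textrm{KL}}(q_{\bm{\Theta}}\,\|\,p_{\bm{\Theta}})=\frac{P}{2}\,(r-1-\log r)+\frac{\|{\bm{\mu}}\|^{2}}{2\sigma_{p}^{2}},\qquad\mathbb{D}_{\textrm{KL}}(q_{f(\mathbf{x};\bm{\Theta})}\,\|\,p_{f(\mathbf{x};\bm{\Theta})})=\frac{1}{2}\,(r-1-\log r)+\frac{(\mathbf{x}^{\top}{\bm{\mu}})^{2}}{2\sigma_{p}^{2}\|\mathbf{x}\|^{2}}.$$

Since $r-1-\log r\geq 0$ for all $r>0$ and $(\mathbf{x}^{\top}{\bm{\mu}})^{2}\leq\|\mathbf{x}\|^{2}\|{\bm{\mu}}\|^{2}$ by the Cauchy-Schwarz inequality, the divergence between the induced distributions at $\mathbf{x}$ never exceeds the divergence between the distributions over parameters in this example, in line with Equation 5.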

Hence, for a likelihood function defined on a finite set of training targets $\mathbf{y}_{\mathcal{D}}$ and a suitably defined prior distribution over functions, we can express the variational problem above equivalently as the well-defined maximization problem $\max_{q_{\bm{\Theta}}\in\mathcal{Q}_{q_{\bm{\Theta}}}}\mathcal{F}(q_{\bm{\Theta}})$ with

$$\mathcal{F}(q_{\bm{\Theta}})\,\dot{=}\,\mathbb{E}_{q_{f(\mathbf{X}_{\mathcal{D}};\bm{\Theta})}}[\log p_{\mathbf{y}|f(\mathbf{X};\bm{\Theta})}(\mathbf{y}_{\mathcal{D}}\,|\,f(\mathbf{X}_{\mathcal{D}};{\bm{\theta}}))]-\mathbb{D}_{\textrm{KL}}(q_{f(\cdot;\bm{\Theta})}\,\|\,p_{f(\cdot;\bm{\Theta})}), \tag{6}$$

where $\mathbb{D}_{\textrm{KL}}(q_{f(\cdot;\bm{\Theta})}\,\|\,p_{f(\cdot;\bm{\Theta})})$ is also a KL divergence between distributions over functions.

Unfortunately, evaluating the KL divergence in Equation 6 is in general intractable for arbitrary mappings $f$ . To obtain a tractable objective, Sun et al. (2019) showed that $\mathbb{D}_{\textrm{KL}}(q_{f(\cdot;\bm{\Theta})}\,\|\,p_{f(\cdot;\bm{\Theta})})$ can be expressed as the supremum of the KL divergence from $q_{f(\cdot;\bm{\Theta})}$ to $p_{f(\cdot;\bm{\Theta})}$ over all finite sets of evaluation points, resulting in the objective function

$$\mathcal{F}(q_{\bm{\Theta}})=\mathbb{E}_{q_{f(\mathbf{X}_{\mathcal{D}};\bm{\Theta})}}[\log p_{\mathbf{y}|f(\mathbf{X};\bm{\Theta})}(\mathbf{y}_{\mathcal{D}}\,|\,f(\mathbf{X}_{\mathcal{D}};{\bm{\theta}}))]-\sup_{\mathbf{X}\in\mathcal{X}_{\mathbb{N}}}\mathbb{D}_{\textrm{KL}}(q_{f(\mathbf{X};\bm{\Theta})}\,\|\,p_{f(\mathbf{X};\bm{\Theta})}), \tag{7}$$

where $\mathcal{X}_{\mathbb{N}}\,\dot{=}\,\bigcup_{n\in\mathbb{N}}\{\mathbf{X}\in\mathcal{X}_{n}\,|\,\mathcal{X}_{n}\subseteq\mathbb{R}^{n\times D}\}$ is the collection of all finite sets of evaluation points. However, this objective function is still challenging to optimize in practice: the supremum cannot be obtained analytically, and the KL divergence term itself is analytically intractable and difficult to estimate in high dimensions, even for a single evaluation point.

In the next section, we will describe an approximation and estimation procedure that allows scaling function-space variational inference to large neural networks and high-dimensional input data.

3 Deriving a Tractable Function-Space Variational Objective

The primary obstacle to computing the objective in Equation 6 is the KL divergence from $q_{f(\cdot;\bm{\Theta})}$ to $p_{f(\cdot;\bm{\Theta})}$ . There are two reasons why the KL divergence in Equation 7 is intractable: First, for bnns or other non-linear models, we do not have access to the probability density functions of the multivariate distributions $q_{f(\mathbf{X};\bm{\Theta})}$ and $p_{f(\mathbf{X};\bm{\Theta})}$ ; second, for all but extremely simple input spaces, we are unable to compute the supremum over all possible finite sets of evaluation points. In the remainder of this section, we outline an approach for obtaining an estimator of a locally accurate approximation to the KL divergence that allows for scalable gradient-based optimization of Equation 7.

We first approach the problem of computing the KL divergence between two bnns evaluated at a finite set of points. To do so, we derive tractable approximations to the distributions over functions $q_{f(\mathbf{X};\bm{\Theta})}$ and $p_{f(\mathbf{X};\bm{\Theta})}$. Next, we show that under these approximations, we are able to obtain a closed-form approximation to the KL divergence and describe a simple Monte Carlo estimator of the supremum in the function-space KL divergence.

3.1 Approximating Distributions over Functions via Local Linearization

To obtain an approximation to the probability distributions of $q_{f(\mathbf{X};\bm{\Theta})}$ and $p_{f(\mathbf{X};\bm{\Theta})}$ , we use a first-order Taylor expansion of the mapping $f$ about the mean parameters of $q_{\bm{\Theta}}$ and $p_{\bm{\Theta}}$ , respectively, and derive the induced distributions under the linearized mapping.

For a stochastic function $f(\cdot\,;\bm{\Theta})$ defined in terms of stochastic parameters $\bm{\Theta}$ distributed according to distribution $g_{\bm{\Theta}}$ with $\mathbf{m}\,\dot{=}\,\operatorname{\mathbb{E}}_{g_{\bm{\Theta}}}[\bm{\Theta}]$ and $\mathbf{S}\,\dot{=}\,\text{Cov}_{g_{\bm{\Theta}}}[\bm{\Theta}]$ , we denote the linearization of the stochastic function $f(\cdot\,;\bm{\Theta})$ about $\mathbf{m}$ by

$$f(\cdot\,;\bm{\Theta})\approx\smash{\tilde{f}}(\cdot\,;\mathbf{m},\bm{\Theta})\,\dot{=}\,f(\cdot\,;\mathbf{m})+\mathcal{J}(\cdot\,;\mathbf{m})(\bm{\Theta}-\mathbf{m}), \tag{8}$$

where $\mathcal{J}(\cdot\,;\mathbf{m})\,\dot{=}\,(\partial f(\cdot\,;\bm{\Theta})/\partial\bm{\Theta})|_{\bm{\Theta}=\mathbf{m}}$ is the Jacobian of $f(\cdot\,;\bm{\Theta})$ evaluated at $\bm{\Theta}=\mathbf{m}$, and the mean and covariance of the distribution over the linearized mapping $\smash{\tilde{f}}$ at $\mathbf{X},\mathbf{X}^{\prime}\in\mathcal{X}$ are given by

$$\operatorname{\mathbb{E}}[\smash{\tilde{f}}(\mathbf{X};\bm{\Theta})]=f(\mathbf{X};\mathbf{m}) \tag{9}$$
$$\textrm{Cov}[\smash{\tilde{f}}(\mathbf{X};\bm{\Theta}),\smash{\tilde{f}}(\mathbf{X}^{\prime};\bm{\Theta})]=\mathcal{J}(\mathbf{X};\mathbf{m})\mathbf{S}\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})^{\top}. \tag{10}$$

For a derivation of this result, see Appendix A. Since Gaussianity is preserved under affine transformations, if $g_{\bm{\Theta}}$ is a multivariate Gaussian distribution with mean $\mathbf{m}$ and diagonal covariance $\mathbf{S}$, then the distribution $\tilde{g}$ over $\smash{\tilde{f}}(\mathbf{X}\,;\bm{\Theta})$ is given by

$$\tilde{g}_{\smash{\tilde{f}}(\mathbf{X};\mathbf{m},\bm{\Theta})}=\mathcal{N}(f(\mathbf{X};\mathbf{m}),\mathcal{J}(\mathbf{X};\mathbf{m})\mathbf{S}\mathcal{J}(\mathbf{X};\mathbf{m})^{\top}). \tag{11}$$
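The construction in Equations 8 to 11 can be sketched numerically. The snippet below is a minimal illustration rather than the implementation used in the paper: the toy network `mlp`, the finite-difference Jacobian `jacobian_fd`, and all sizes are assumptions made for this example, and in practice the Jacobian would be obtained by automatic differentiation.

```python
import numpy as np

def mlp(x, theta, hidden=4):
    """Toy one-hidden-layer tanh network with scalar output (illustrative only)."""
    d = x.shape[1]
    w1 = theta[: d * hidden].reshape(d, hidden)
    b1 = theta[d * hidden : d * hidden + hidden]
    w2 = theta[d * hidden + hidden : d * hidden + 2 * hidden]
    b2 = theta[-1]
    return np.tanh(x @ w1 + b1) @ w2 + b2  # shape (num_points,)

def jacobian_fd(f, x, m, eps=1e-6):
    """Central finite-difference Jacobian of f(x; theta) w.r.t. theta at theta = m."""
    cols = []
    for i in range(m.size):
        step = np.zeros_like(m)
        step[i] = eps
        cols.append((f(x, m + step) - f(x, m - step)) / (2.0 * eps))
    return np.stack(cols, axis=1)  # shape (num_points, P)

def linearized_gaussian(f, x, m, s_diag):
    """Mean f(X; m) and covariance J S J^T for a diagonal parameter covariance S."""
    jac = jacobian_fd(f, x, m)
    mean = f(x, m)
    cov = (jac * s_diag) @ jac.T  # equals J diag(s_diag) J^T
    return mean, cov

rng = np.random.default_rng(0)
D, H = 2, 4
P = D * H + 2 * H + 1            # number of parameters of the toy network
m = rng.normal(size=P)           # mean of the Gaussian over parameters
s_diag = np.full(P, 1e-2)        # diagonal of the parameter covariance
X = rng.normal(size=(3, D))      # three evaluation points
mean, cov = linearized_gaussian(mlp, X, m, s_diag)
```

Here `mean` has one entry per evaluation point and `cov` is the corresponding 3-by-3 covariance matrix of the linearized function values.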

For stochastic functions parameterized by many millions of parameters, obtaining the covariance of $\tilde{g}_{\smash{\tilde{f}}(\mathbf{X};\bm{\Theta})}$ —which requires computing an inner product of two Jacobian matrices—can be computationally expensive. Instead of computing the distribution over the linearized mapping exactly, we can construct a suitable Monte Carlo estimator. To do so, we consider a partition of the set of parameters into sets $\alpha$ and $\beta$ (with $|\beta|\ll|\alpha|$ ) and note that the linearized mapping can then be expressed as

$$\smash{\tilde{f}}(\cdot\,;\mathbf{m},\bm{\Theta})=f(\cdot\,;\mathbf{m})+\smash{\tilde{f}}_{\alpha}(\cdot\,;\mathbf{m},\bm{\Theta}_{\alpha})+\mathcal{J}_{\beta}(\cdot\,;\mathbf{m})(\bm{\Theta}_{\beta}-\mathbf{m}_{\beta}), \tag{12}$$

with

$$\smash{\tilde{f}}_{\alpha}(\cdot\,;\mathbf{m},\bm{\Theta}_{\alpha})\,\dot{=}\,\mathcal{J}_{\alpha}(\cdot\,;\mathbf{m})(\bm{\Theta}_{\alpha}-\mathbf{m}_{\alpha}), \tag{13}$$

where $\mathcal{J}_{\alpha}(\cdot\,;\mathbf{m})$ and $\mathcal{J}_{\beta}(\cdot\,;\mathbf{m})$ are the columns of the Jacobian matrix corresponding to the sets of parameters $\alpha$ and $\beta$ , respectively, and $\bm{\Theta}_{\alpha}$ and $\bm{\Theta}_{\beta}$ are the corresponding random parameter vectors. Noting that Equation 12 expresses $\smash{\tilde{f}}$ as a sum of (affine transformations of) random variables, we can use the fact that for independent Gaussian random variables $\mathbf{X}$ and $\mathbf{Y}$ , the distribution $h_{\mathbf{Z}}$ of $\mathbf{Z}=\mathbf{X}+\mathbf{Y}$ is equal to the convolution of the distributions $h_{\mathbf{X}}$ and $h_{\mathbf{Y}}$ to obtain an approximation to $\smash{\tilde{f}}$ . In particular, we can show that if $g_{\bm{\Theta}}$ is a multivariate Gaussian distribution with $\bm{\Theta}_{\alpha}\perp\bm{\Theta}_{\beta}$ , the distribution $\tilde{g}_{\smash{\tilde{f}}(\mathbf{X};\bm{\Theta})}$ can be approximated by the Monte Carlo estimator

$$\hat{\tilde{g}}_{\smash{\tilde{f}}(\mathbf{X};\mathbf{m},\bm{\Theta})}=\frac{1}{R}\sum\nolimits_{j=1}^{R}\mathcal{N}\Big(f(\mathbf{X};\mathbf{m})+\smash{\tilde{f}}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\Theta}_{\alpha})^{(j)},\mathcal{J}_{\beta}(\mathbf{X};\mathbf{m})\mathbf{S}_{\beta}{\mathcal{J}_{\beta}(\mathbf{X};\mathbf{m})}^{\top}\Big), \tag{14}$$

where $g_{\bm{\Theta}_{\beta}}=\mathcal{N}(\mathbf{m}_{\beta},\mathbf{S}_{\beta})$ and samples $\smash{\tilde{f}}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\Theta}_{\alpha})^{(j)}$ are obtained by sampling parameters from the distribution $g_{\bm{\Theta}_{\alpha}}=\mathcal{N}(\mathbf{m}_{\alpha},\mathbf{S}_{\alpha})$. For a derivation of this result, see Appendix A. This estimator is biased for finite $R$ but converges to $\tilde{g}_{\smash{\tilde{f}}(\mathbf{X};\mathbf{m},\bm{\Theta})}$ as $R\rightarrow\infty$. Similarly, for finite $R$, the smaller $[\mathbf{S}_{\alpha}]_{ii}$, the more accurate and less biased the estimator will be. In our empirical evaluation, we use a single Monte Carlo sample, $R=1$, to preserve Gaussianity and choose $\alpha$ to be the set of parameters in neural network layers $1:L-1$ and $\beta$ to be the set of parameters in the final neural network layer.
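A single-sample ($R=1$) version of the estimator in Equation 14, with $\beta$ taken to be the final layer, might look as follows. This is a sketch under stated assumptions: the two-layer network, the helper names, and the use of a finite-difference Jacobian-vector product in place of automatic differentiation are all choices made for this example. For a final linear layer, $\mathcal{J}_{\beta}$ is simply the matrix of last-layer features with a column of ones appended for the bias.

```python
import numpy as np

def features(x, theta_alpha, hidden):
    """Hidden-layer features; theta_alpha holds first-layer weights and biases."""
    d = x.shape[1]
    w1 = theta_alpha[: d * hidden].reshape(d, hidden)
    b1 = theta_alpha[d * hidden :]
    return np.tanh(x @ w1 + b1)  # shape (num_points, hidden)

def network(x, theta_alpha, theta_beta, hidden):
    """Scalar-output network; theta_beta holds final-layer weights and bias."""
    return features(x, theta_alpha, hidden) @ theta_beta[:-1] + theta_beta[-1]

def estimator_single_sample(x, m_alpha, s_alpha, m_beta, s_beta, hidden, rng, eps=1e-6):
    """One Gaussian component of Equation 14 (R = 1) with diagonal covariances."""
    # Sample Theta_alpha - m_alpha with Theta_alpha ~ N(m_alpha, diag(s_alpha)).
    delta = np.sqrt(s_alpha) * rng.normal(size=m_alpha.size)
    # J_alpha(X; m) @ delta via a central finite-difference directional derivative.
    f_alpha = (network(x, m_alpha + eps * delta, m_beta, hidden)
               - network(x, m_alpha - eps * delta, m_beta, hidden)) / (2.0 * eps)
    mean = network(x, m_alpha, m_beta, hidden) + f_alpha
    # The final layer is linear in its parameters, so J_beta = [phi(X), 1].
    phi = features(x, m_alpha, hidden)
    j_beta = np.concatenate([phi, np.ones((x.shape[0], 1))], axis=1)
    cov = (j_beta * s_beta) @ j_beta.T  # J_beta diag(s_beta) J_beta^T
    return mean, cov

rng = np.random.default_rng(0)
D, H = 2, 4
m_alpha = rng.normal(size=D * H + H)
m_beta = rng.normal(size=H + 1)
s_alpha = np.full(m_alpha.size, 1e-4)
s_beta = np.full(m_beta.size, 1e-2)
X = rng.normal(size=(3, D))
mean, cov = estimator_single_sample(X, m_alpha, s_alpha, m_beta, s_beta, H, rng)
```

Only the last-layer Jacobian is ever materialized, which is what keeps the covariance computation cheap when $|\beta|\ll|\alpha|$.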

3.2 Approximating the Function-Space Kullback-Leibler Divergence

From Section 3.1, we know that if $q_{\bm{\Theta}}$ and $p_{\bm{\Theta}}$ are both Gaussian distributions, then the induced distributions under the linearized mapping $\smash{\tilde{f}}$ evaluated at a finite set of evaluation points will be Gaussian as well. This means that for Gaussian variational and prior distributions over $\bm{\Theta}$, we can obtain locally accurate approximations to the induced distributions $q_{f(\cdot;\bm{\Theta})}$ and $p_{f(\cdot;\bm{\Theta})}$ and use them to approximate the KL divergence in the variational objective by $\mathbb{D}_{\textrm{KL}}(\smash{\tilde{q}}_{\smash{\tilde{f}}(\mathbf{X};\bm{\Theta})}\,\|\,\smash{\tilde{p}}_{\smash{\tilde{f}}(\mathbf{X};\bm{\Theta})})$. Moreover, for an isotropic Gaussian prior and a mean-field Gaussian variational distribution, $\mathbb{D}_{\textrm{KL}}(\smash{\tilde{q}}_{\smash{\tilde{f}}(\mathbf{X};\bm{\Theta})}\,\|\,\smash{\tilde{p}}_{\smash{\tilde{f}}(\mathbf{X};\bm{\Theta})})$ is a KL divergence between two multivariate Gaussians and can be obtained analytically.
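For reference, the closed-form KL divergence between two multivariate Gaussians mentioned above can be written in a few lines. The function name and the optional `jitter` argument (a small diagonal term sometimes added for numerical stability) are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def gaussian_kl(mean_q, cov_q, mean_p, cov_p, jitter=0.0):
    """KL( N(mean_q, cov_q) || N(mean_p, cov_p) ) for full-covariance Gaussians."""
    k = mean_q.shape[0]
    cov_q = cov_q + jitter * np.eye(k)
    cov_p = cov_p + jitter * np.eye(k)
    diff = mean_p - mean_q
    trace_term = np.trace(np.linalg.solve(cov_p, cov_q))  # tr(cov_p^{-1} cov_q)
    maha_term = diff @ np.linalg.solve(cov_p, diff)       # squared Mahalanobis distance
    logdet_q = np.linalg.slogdet(cov_q)[1]
    logdet_p = np.linalg.slogdet(cov_p)[1]
    return 0.5 * (trace_term + maha_term - k + logdet_p - logdet_q)

# Two univariate sanity checks with known values.
kl_shift = gaussian_kl(np.zeros(1), np.eye(1), np.ones(1), np.eye(1))
kl_scale = gaussian_kl(np.zeros(1), np.eye(1), np.zeros(1), 4.0 * np.eye(1))
```

In the two sanity checks, `kl_shift` is $1/2$ (unit mean shift, equal unit variances) and `kl_scale` is $\tfrac{1}{2}(\log 4-\tfrac{3}{4})$. In the function-space setting, the means and covariances passed to such a function would be those of the linearized variational and prior distributions evaluated at the same set of points.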

Using this approximation, we obtain an estimator of the variational objective given by

$$\tilde{\mathcal{F}}(q_{\bm{\Theta}})\,\dot{=}\,\mathbb{E}_{q_{f(\mathbf{X}_{\mathcal{D}};\bm{\Theta})}}[\log p_{\mathbf{y}|f(\mathbf{X};\bm{\Theta})}(\mathbf{y}_{\mathcal{D}}\,|\,f(\mathbf{X}_{\mathcal{D}};{\bm{\theta}}))]-\sup_{\mathbf{X}\in\mathcal{X}_{\mathbb{N}}}\mathbb{D}_{\textrm{KL}}(\smash{\tilde{q}}_{\smash{\tilde{f}}(\mathbf{X};\bm{\Theta})}\,\|\,\smash{\tilde{p}}_{\smash{\tilde{f}}(\mathbf{X};\bm{\Theta})}), \tag{15}$$

where the arguments of the KL divergence have been replaced by the (locally accurate) approximations to the variational and prior distributions over functions evaluated at $\mathbf{X}$ , respectively. Since the stochastic functions $\smash{\tilde{f}}(\cdot\,;\bm{\Theta})$ induced by $q_{\bm{\Theta}}$ and $p_{\bm{\Theta}}$ under the linearized mapping will be closer to the stochastic function under $f$ the smaller the variance of $q_{\bm{\Theta}}$ and $p_{\bm{\Theta}}$ , respectively, the approximation to the KL divergence will be more accurate the smaller the variance of $q_{\bm{\Theta}}$ and $p_{\bm{\Theta}}$ .

Next, we turn to computing the supremum. Unlike Sun et al. (2019), who consider the supremum as a separate optimization problem, we do not seek to compute the supremum by searching over points $\mathbf{X}\in\mathcal{X}_{\mathbb{N}}$ but instead propose to estimate the supremum at every gradient step via a simple finite-sample estimator. Specifically, letting $I(\mathbf{X})\,\dot{=}\,\mathbb{D}_{\textrm{KL}}(\smash{\tilde{q}}_{\smash{\tilde{f}}(\mathbf{X};\bm{\Theta})}\,\|\,\smash{\tilde{p}}_{\smash{\tilde{f}}(\mathbf{X};\bm{\Theta})})$, we estimate $G=\sup_{\mathbf{X}\in\mathcal{X}_{\mathbb{N}}}I(\mathbf{X})$ using the Monte Carlo estimator

$$\hat{G}(\mathcal{X}_{\mathcal{C}}^{S})=\max_{\mathbf{X}\in\mathcal{X}_{\mathcal{C}}^{S}}I(\mathbf{X}), \tag{16}$$

where $\mathcal{X}_{\mathcal{C}}^{S}\,\dot{=}\,\{\mathbf{X}_{\mathcal{C}}^{(i)}\}_{i=1}^{S}$ is a collection of $S$ sets of context points $\mathbf{X}_{\mathcal{C}}^{(i)}\,\dot{=}\,\{\mathbf{x}^{(j)}\}_{j=1}^{K}$ jointly sampled from a context distribution $p_{\mathcal{X}_{\mathcal{C}}}$. Each context set $\mathbf{X}_{\mathcal{C}}^{(i)}$ can be viewed as a single Monte Carlo sample from the input space so that the estimator $\hat{G}(\mathcal{X}_{\mathcal{C}}^{S})$ provides an $S$-sample Monte Carlo estimate of the supremum. While this estimator is crude and only provides a rough approximation to the true supremum, it encourages the variational distribution over functions to match the prior distribution over functions on the sets of context points. The choice of the context distribution $p_{\mathcal{X}_{\mathcal{C}}}$ can be informed by knowledge about the prediction task and should be viewed as a problem-specific modeling choice. Similarly, the numbers of samples $S$ and $K$ are hyperparameters to be optimized with a validation set. For details on how $p_{\mathcal{X}_{\mathcal{C}}}$ is chosen for the empirical evaluation in Section 5, see Appendix D.
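The estimator in Equation 16 amounts to sampling $S$ context sets and keeping the largest value of $I(\mathbf{X})$. The sketch below makes two assumptions purely for illustration: the context distribution is uniform over a box, and `toy_kl` is a placeholder standing in for the linearized function-space KL divergence $I(\mathbf{X})$.

```python
import numpy as np

def sample_context_sets(rng, num_sets, set_size, low, high):
    """Draw S context sets of K points each, uniformly from the box [low, high]."""
    dim = low.shape[0]
    return rng.uniform(low, high, size=(num_sets, set_size, dim))

def estimate_supremum(kl_at, context_sets):
    """Equation 16: the maximum of I(X) over the sampled context sets."""
    values = np.array([kl_at(x_c) for x_c in context_sets])
    return values.max(), int(values.argmax())

def toy_kl(x_c):
    """Placeholder for I(X); NOT the actual function-space KL divergence."""
    return float(np.sum(x_c ** 2))

rng = np.random.default_rng(0)
context_sets = sample_context_sets(
    rng, num_sets=5, set_size=3, low=np.array([-1.0, -1.0]), high=np.array([1.0, 1.0])
)
g_hat, best = estimate_supremum(toy_kl, context_sets)
```

Because fresh context sets are drawn at every gradient step, different regions of the input space contribute to the regularizer over the course of training.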

3.3 Stochastic Estimation of the Approximate Function-Space Variational Objective

Let $q_{\bm{\Theta}}$ be a Gaussian mean-field variational distribution, let $p_{\bm{\Theta}}$ be an isotropic Gaussian prior, let $(\mathbf{X}_{\mathcal{B}},\mathbf{y}_{\mathcal{B}})$ be a mini-batch of the training data, and reparameterize $\bm{\Theta}$ as $\hat{\bm{\Theta}}({\bm{\mu}},\bm{\Sigma},\bm{\epsilon}^{(j)})\,\dot{=}\,{\bm{\mu}}+\bm{\Sigma}\odot\bm{\epsilon}^{(j)}$. Using the estimator $\hat{G}(\mathcal{X}_{\mathcal{C}}^{S})$ defined above and estimating the expected log-likelihood via Monte Carlo sampling, we obtain a Monte Carlo estimator for the function-space variational objective:

$$\bar{\mathcal{F}}({\bm{\mu}},\bm{\Sigma})=\frac{1}{M}\sum\nolimits_{j=1}^{M}\log p_{\mathbf{y}|f(\mathbf{X};\bm{\Theta})}(\mathbf{y}_{\mathcal{B}}\,|\,f(\mathbf{X}_{\mathcal{B}};\hat{\bm{\Theta}}({\bm{\mu}},\bm{\Sigma},\bm{\epsilon}^{(j)})))-\max_{\mathbf{X}\in\mathcal{X}_{\mathcal{C}}^{S}}\mathbb{D}_{\textrm{KL}}(\smash{\tilde{q}}_{\smash{\tilde{f}}(\mathbf{X};\hat{\bm{\Theta}})}\,\|\,\smash{\tilde{p}}_{\smash{\tilde{f}}(\mathbf{X};\hat{\bm{\Theta}})}) \tag{17}$$

with $\bm{\epsilon}^{(j)}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{P})$ and $\mathcal{X}_{\mathcal{C}}^{S}$ as defined above. This Monte Carlo estimator is biased due to the linearization and context-set approximations but allows for scalable gradient-based stochastic optimization.

Selection of Prior. For all experiments that involve uncertainty quantification, we choose a prior distribution over parameters that induces a prior distribution over functions $p_{f(\cdot;\bm{\Theta})}$ and a prior predictive distribution that exhibits a high degree of predictive uncertainty at evaluation points from regions in input space where $p_{\mathcal{X}_{\mathcal{C}}}$ has non-zero support and, under smoothness constraints, on evaluation points in nearby regions. For settings where prior information is encoded in data (for example, in the form of expert demonstrations of robotic manipulation tasks (Rudner et al., 2021) or in the form of pre-trained networks in continual or transfer learning (Rudner et al., 2022)), an empirical prior that reflects this information can be specified. For further details, see Appendix D.

Selection of Context Distribution. The distribution $p_{\mathcal{X}_{\mathcal{C}}}$ allows us to incorporate information about the data-generating process into training and encourage the variational distribution to match the prior over functions in relevant parts of the input space. By taking advantage of the abundance of data available in real-world settings, context distributions can be constructed from large datasets like ImageNet (Krizhevsky et al., 2012), from small but diverse datasets like CIFAR-100, or by using any set of task-related unlabeled data. In our experiments, we choose two types of context distributions. One of the context distributions is constructed from the training data and only contains randomly sampled monochrome images, and one is constructed from a real-world dataset generated from a data distribution related to that of the training data. For example, when training on FashionMNIST, we use KMNIST as the context distribution, and when training on CIFAR-10, we use CIFAR-100 as the context distribution. For further details, see Appendix D.

Posterior Predictive Distribution. After optimizing the variational objective with respect to the parameters of the variational distribution $q_{\bm{\Theta}}$ , we use the fact that we can obtain function draws by sampling from the distribution over parameters to obtain an approximate posterior predictive distribution

$$\begin{aligned}q(\mathbf{y}_{\ast}\,|\,\mathbf{x}_{\ast})&=\int p(\mathbf{y}_{\ast}\,|\,f(\mathbf{x}_{\ast};{\bm{\theta}}))\,q_{f(\mathbf{x}_{\ast};\bm{\Theta})}\,\textrm{d}f(\mathbf{x}_{\ast};{\bm{\theta}})\\ &\approx\frac{1}{M_{\ast}}\sum\nolimits_{j=1}^{M_{\ast}}p(\mathbf{y}_{\ast}\,|\,f(\mathbf{x}_{\ast};\bm{\Theta}^{(j)}))\quad\text{with}\quad\bm{\Theta}^{(j)}\sim q_{\bm{\Theta}},\end{aligned} \tag{18}$$

where $M_{\ast}$ is the number of Monte Carlo samples used to estimate the predictive distribution.
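The Monte Carlo approximation in Equation 18 can be sketched for a classification likelihood as follows. The linear `linear_logits` model is a stand-in for the unlinearized network, the softmax likelihood is an assumption of this example, and `sigma` denotes the per-parameter standard deviations of the mean-field Gaussian variational distribution.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def posterior_predictive(x, mu, sigma, logits_fn, num_samples, rng):
    """Average class probabilities over parameter samples Theta^(j) ~ N(mu, diag(sigma^2))."""
    probs = []
    for _ in range(num_samples):
        theta = mu + sigma * rng.normal(size=mu.shape)  # reparameterized sample
        probs.append(softmax(logits_fn(x, theta)))
    return np.mean(probs, axis=0)  # shape (num_points, num_classes)

def linear_logits(x, theta, num_classes=3):
    """Hypothetical stand-in for the network f(x; theta)."""
    return x @ theta.reshape(x.shape[1], num_classes)

rng = np.random.default_rng(0)
D, Q = 2, 3
mu = rng.normal(size=D * Q)
sigma = np.full(D * Q, 0.1)
x_star = rng.normal(size=(4, D))
probs = posterior_predictive(x_star, mu, sigma, linear_logits, num_samples=10, rng=rng)
```

Each row of `probs` is an average of valid probability vectors and therefore sums to one.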

4 Related Work

There is a growing body of work on function-space approaches to inference in bnns, deep learning, and applications such as continual learning (Benjamin et al., 2019; Sun et al., 2019; Titsias et al., 2020; Burt et al., 2021; Pan et al., 2020; Ma and Hernández-Lobato, 2021; Rudner et al., 2022).

Function-Space Inference in Bayesian Neural Networks.

Previously proposed methods for fsvi in bnns are based on approximate gradient estimators and either replace the supremum in Equation 7 with an expectation (Sun et al., 2019) or do not define an explicit variational objective (Wang et al., 2019). Sun et al. (2019) and Carvalho et al. (2020) use Gaussian process priors over functions for which the function-space variational inference problem is not well-defined (see Section 2.1 and Burt et al. (2021)). More recent work has attempted to circumvent the intractability of the variational objective in Equation 6 by proposing alternative objectives for function-space inference in bnns (Ma et al., 2019; Ober and Aitchison, 2020; Ma and Hernández-Lobato, 2021). Rudner et al. (2022) extend the approach presented in Section 3 to sequential inference problems and apply it to continual learning.

Linear Models.

Immer et al. (2020) and Khan et al. (2019) show that approximate bnn posterior distributions obtained via the Laplace and generalized Gauss-Newton approximations correspond to exact posteriors under linearizations of different models. Unlike our approach, these methods use a Laplace approximation, do not perform variational inference, and do not optimize the variance parameters. Furthermore, Immer et al. (2020) and Khan et al. (2019) use a neural network model to obtain a parameter maximum a posteriori estimate, but then use a linearization of the neural network model to compute a posterior predictive distribution. In contrast, our work only uses the linearization to obtain an estimator of the variational objective but uses the unlinearized model to construct a posterior predictive distribution.

Pathologies of Variational Inference in Bayesian Neural Networks.

Burt et al. (2021) consider the function-space variational objective in Equation 6 and show that the KL divergence between bnns with different network architectures is not well-defined. A parallel line of research showed that posterior predictive distributions of shallow bnns with mean-field variational distributions have a limited ability to represent complex covariance structures in function space (Foong et al., 2019, 2020) but that deep bnns do not suffer from this limitation (Farquhar et al., 2020b). Our results are consistent with the findings of Farquhar et al. (2020b) that mean-field variational distributions are able to represent complex covariance structures in function space.

5 Empirical Evaluation

In this section, we evaluate fsvi on high-dimensional classification tasks that were out of reach for function-space variational inference methods proposed in prior work and compare fsvi to several well-established and state-of-the-art Bayesian deep learning and deterministic uncertainty quantification methods. We show that fsvi (sometimes significantly) outperforms existing Bayesian and non-Bayesian methods in terms of in-distribution uncertainty calibration and out-of-distribution predictive uncertainty estimation. For details on models, training and validation procedures, and datasets used, see Appendix D. For a comparison to Sun et al. (2019) on small-scale regression tasks, see Section B.2.

5.1 Predictive Performance, Uncertainty Estimation, and Distribution Shift Detection

In this set of experiments, we assess the reliability of the uncertainty estimates generated by fsvi. If a bnn trained via fsvi is able to perform reliable uncertainty estimation, its predictive uncertainty will be significantly higher on input points that were generated according to a different data-generating distribution than the training data. For models trained on the FashionMNIST dataset, we use the MNIST and NotMNIST datasets as out-of-distribution evaluation points, while for models trained on the CIFAR-10 dataset, we use the SVHN dataset as out-of-distribution evaluation points.

For models trained on either FashionMNIST or CIFAR-10, we evaluate their in-distribution performance in terms of test accuracy, test log-likelihood, and test calibration. To evaluate the quality of different models’ uncertainty estimates, we compute uncertainty estimates for the pairs FashionMNIST/MNIST, FashionMNIST/NotMNIST, and CIFAR-10/SVHN and measure, for a range of thresholds, how well the datasets in each pair can be separated solely based on the uncertainty estimates. This experiment setup follows prior work by van Amersfoort et al. (2020) and Immer et al. (2020). We report the area under the receiver operating characteristic (ROC) curve in Tables 1 and 2.
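The separation measure described above can be sketched as follows. This is a minimal illustration and not the evaluation code behind Tables 1 and 2: the function name `auroc` and the toy uncertainty scores are hypothetical. The sketch uses the rank-statistic form of the AUROC, which coincides with the area obtained by sweeping a threshold over the uncertainty estimates.

```python
def auroc(in_scores, out_scores):
    """Area under the ROC curve for separating OOD from in-distribution inputs.

    Scores are per-example uncertainty estimates (e.g., predictive entropy);
    OOD inputs are treated as the positive class. The value equals the
    probability that a random OOD input receives a higher score than a
    random in-distribution input, with ties counted as one half.
    """
    wins = 0.0
    for s_out in out_scores:
        for s_in in in_scores:
            if s_out > s_in:
                wins += 1.0
            elif s_out == s_in:
                wins += 0.5
    return wins / (len(in_scores) * len(out_scores))


# Hypothetical uncertainty scores, for illustration only.
in_dist_uncertainty = [0.05, 0.10, 0.20, 0.40]
ood_uncertainty = [0.30, 0.60, 0.90]
ood_auroc = auroc(in_dist_uncertainty, ood_uncertainty)
```

A value of 1 corresponds to perfect separation of the two datasets based on uncertainty alone, and a value of 0.5 corresponds to uncertainty estimates that carry no information about the shift.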

Table 1: Comparison of in- and out-of-distribution performance metrics on FashionMNIST (mean $\pm$ standard error over ten random seeds). The last two columns show the AUROC for binary in- vs. out-of-distribution detection on MNIST (M) and NotMNIST (NM). MNIST and NotMNIST are used as out-of-distribution datasets. Best overall results for single and ensemble models are printed in boldface with gray shading. Results within a 95% confidence interval of the best overall result are printed in boldface only. All methods use the same four-layer CNN architecture. For further details about model architectures and training and evaluation protocols, see Appendix D.

Method	Accuracy $\uparrow$	ECE $\downarrow$	AUROC M $\uparrow$	AUROC NM $\uparrow$
map	91.73 ${\scriptstyle\pm 0.08}$	0.037 ${\scriptstyle\pm 0.001}$	87.00 ${\scriptstyle\pm 0.30}$	74.85 ${\scriptstyle\pm 1.31}$
mfvi (Blundell et al., 2015)	91.03 ${\scriptstyle\pm 0.04}$	0.038 ${\scriptstyle\pm 0.001}$	93.10 ${\scriptstyle\pm 0.34}$	88.88 ${\scriptstyle\pm 0.74}$
mfvi (tempered)	91.38 ${\scriptstyle\pm 0.05}$	0.058 ${\scriptstyle\pm 0.001}$	86.30 ${\scriptstyle\pm 0.29}$	80.78 ${\scriptstyle\pm 0.68}$
mfvi (radial) (Farquhar et al., 2020a)	90.31 ${\scriptstyle\pm 0.11}$	0.035 ${\scriptstyle\pm 0.001}$	84.40 ${\scriptstyle\pm 0.68}$	82.11 ${\scriptstyle\pm 1.15}$
mc dropout (Gal and Ghahramani, 2016)	90.55 ${\scriptstyle\pm 0.04}$	$0.012$ ${\scriptstyle\pm 0.001}$	88.46 ${\scriptstyle\pm 0.57}$	80.02 ${\scriptstyle\pm 1.04}$
swag (Maddox et al., 2019)	92.56 ${\scriptstyle\pm 0.05}$	0.043 ${\scriptstyle\pm 0.001}$	85.18 ${\scriptstyle\pm 0.35}$	80.31 ${\scriptstyle\pm 0.30}$
duq (van Amersfoort et al., 2020)	$92.40$ ${\scriptstyle\pm 0.20}$	$-$	95.50 ${\scriptstyle\pm 0.70}$	94.60 ${\scriptstyle\pm 1.80}$
bnn-laplace (Immer et al., 2020)	92.25 ${\scriptstyle\pm 0.10}$	$0.012{\scriptstyle\pm 0.003}$	95.55 ${\scriptstyle\pm 0.60}$	$-$
spg (Ma and Hernández-Lobato, 2021)	91.60 ${\scriptstyle\pm 0.14}$	$-$	95.60 ${\scriptstyle\pm 6.00}$	$-$
fsvi ( $p_{\mathbf{X}_{\mathcal{C}}}$ = random monochrome)	$93.13{\scriptstyle\pm 0.13}$	$0.012{\scriptstyle\pm 0.002}$	${96.23}{\scriptstyle\pm 0.46}$	${95.02}{\scriptstyle\pm 0.69}$
fsvi ( $p_{\mathbf{X}_{\mathcal{C}}}$ = KMNIST)	$\mathbf{93.48}{\scriptstyle\pm 0.12}$	$\mathbf{0.010}$ ${\scriptstyle\pm 0.001}$	$\mathbf{99.80}{\scriptstyle\pm 0.20}$	$\mathbf{97.26}{\scriptstyle\pm 0.23}$
Deep Ensemble	92.49 ${\scriptstyle\pm 0.01}$	$\mathbf{0.019}$ ${\scriptstyle\pm 0.000}$	89.22 ${\scriptstyle\pm 0.09}$	83.17 ${\scriptstyle\pm 0.91}$
fsvi Ensemble ( $p_{\mathbf{X}_{\mathcal{C}}}$ = random monochrome)	$\mathbf{94.44}$ ${\scriptstyle\pm 0.07}$	0.020 ${\scriptstyle\pm 0.001}$	$\mathbf{97.85}$ ${\scriptstyle\pm 0.15}$	$\mathbf{96.95}{\scriptstyle\pm 0.20}$

Predictive Performance and Calibration. To assess in-distribution predictive performance and calibration, we report the test accuracy, negative log-likelihood (NLL), and expected calibration error (ECE) for models trained on FashionMNIST and CIFAR-10 in Tables 1 and 2. On both FashionMNIST and CIFAR-10, fsvi achieves the lowest NLL and either the best or second-best predictive accuracy and ECE, respectively, across all methods. Notably, fsvi significantly outperforms spg (Ma and Hernández-Lobato, 2021), an alternative function-space variational inference method.
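For reference, the expected calibration error can be computed along the following lines. This is a sketch under stated assumptions: equal-width confidence bins, half-open bins of the form [a, b), and a default of ten bins are illustrative choices, and the binning used for Tables 1 and 2 may differ. The function name is hypothetical.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    `confidences` holds the predicted probability of the predicted class and
    `correct` holds 1 for a correct prediction and 0 otherwise.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical predictions: two confident hits, one hit and one miss at 0.65.
toy_ece = expected_calibration_error([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0])
```

In the toy example, the high-confidence bin is slightly under-confident (accuracy 1.0 versus confidence 0.95) and the other bin is over-confident (accuracy 0.5 versus confidence 0.65), giving an ECE of roughly 0.1.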

Predictive Uncertainty under Distribution Shift. In Tables 1 and 2, we report evaluation metrics that elucidate the reliability of different methods’ predictive uncertainty under distribution shift. fsvi exhibits reliable predictive uncertainty estimates that allow distinguishing between in- and out-of-distribution inputs with high accuracy. As would be expected, we observe that using context distributions that reflect our knowledge about the data-generating process can significantly improve uncertainty quantification under fsvi. For the FashionMNIST experiment, we used the KMNIST dataset, which contains grayscale images of Kuzushiji letters, and for the CIFAR-10 experiment, we used the CIFAR-100 dataset, which contains RGB images of 100 classes. Both KMNIST and CIFAR-100 differ from the OOD datasets (MNIST and NotMNIST, and SVHN, respectively) used to compute OOD-AUROC metrics in Tables 1 and 2, but using them as context distributions significantly increased the ability of bnns trained via fsvi to identify distributionally shifted samples. Since the variational objective encourages matching the prior (which we chose to have high variance) on samples from the context distribution, a suitably chosen context distribution can improve uncertainty estimation in regions of the input space far from the training data.

Table 2: Comparison of in- and out-of-distribution performance metrics on CIFAR-10 (mean $\pm$ standard error over ten random seeds). SVHN and corrupted CIFAR-10 (C-CIFAR) are used as out-of-distribution datasets. The penultimate column shows the AUROC for binary in- vs. out-of-distribution detection on SVHN. Best overall results for single and ensemble models are printed in boldface with gray shading. Results within a 95% confidence interval of the best overall result are printed in boldface only. All methods use a ResNet-18 architecture. For further details about model architectures and training and evaluation protocols, see Appendix D.

Method	Accuracy $\uparrow$	ECE $\downarrow$	OOD-AUROC $\uparrow$	C-CIFAR Acc $\uparrow$
map	93.19 ${\scriptstyle\pm 0.11}$	0.043 ${\scriptstyle\pm 0.001}$	94.65 ${\scriptstyle\pm 0.27}$	78.87 ${\scriptstyle\pm 1.39}$
mfvi (Blundell et al., 2015)	89.98 ${\scriptstyle\pm 0.09}$	0.040 ${\scriptstyle\pm 0.001}$	92.14 ${\scriptstyle\pm 0.34}$	79.36 ${\scriptstyle\pm 1.35}$
mfvi (tempered)	90.87 ${\scriptstyle\pm 0.11}$	0.048 ${\scriptstyle\pm 0.001}$	91.82 ${\scriptstyle\pm 0.90}$	79.86 ${\scriptstyle\pm 1.32}$
mc dropout (Gal and Ghahramani, 2016)	93.55 ${\scriptstyle\pm 0.07}$	0.040 ${\scriptstyle\pm 0.001}$	92.44 ${\scriptstyle\pm 0.57}$	80.13 ${\scriptstyle\pm 1.37}$
swag (Maddox et al., 2019)	93.13 ${\scriptstyle\pm 0.14}$	0.067 ${\scriptstyle\pm 0.002}$	89.79 ${\scriptstyle\pm 0.50}$	76.12 ${\scriptstyle\pm 0.51}$
vogn (Osawa et al., 2019)	84.27 ${\scriptstyle\pm 0.20}$	0.040 ${\scriptstyle\pm 0.002}$	87.60 ${\scriptstyle\pm 0.20}$	$-$
duq (van Amersfoort et al., 2020)	$\mathbf{94.10}$ ${\scriptstyle\pm 0.20}$	$-$	92.70 ${\scriptstyle\pm 1.30}$	$-$
spg (Ma and Hernández-Lobato, 2021)	77.69 ${\scriptstyle\pm 0.64}$	$-$	88.30 ${\scriptstyle\pm 4.00}$	$-$
fsvi ( $p_{\mathbf{X}_{\mathcal{C}}}$ = random monochrome)	$93.35$ ${\scriptstyle\pm 0.04}$	0.034 ${\scriptstyle\pm 0.001}$	$94.76$ ${\scriptstyle\pm 0.24}$	$\mathbf{80.81}$ ${\scriptstyle\pm 0.43}$
fsvi ( $p_{\mathbf{X}_{\mathcal{C}}}$ = CIFAR-100)	$93.57$ ${\scriptstyle\pm 0.04}$	$\mathbf{0.026}$ ${\scriptstyle\pm 0.001}$	$\mathbf{98.07}$ ${\scriptstyle\pm 0.10}$	$\mathbf{81.20}$ ${\scriptstyle\pm 0.42}$
Deep Ensemble	$95.13$ ${\scriptstyle\pm 0.06}$	0.019 ${\scriptstyle\pm 0.001}$	$\mathbf{98.04}$ ${\scriptstyle\pm 0.07}$	$\mathbf{81.22}$ ${\scriptstyle\pm 0.37}$
fsvi Ensemble ( $p_{\mathbf{X}_{\mathcal{C}}}$ = random monochrome)	$\mathbf{95.19}$ ${\scriptstyle\pm 0.03}$	0.013 ${\scriptstyle\pm 0.001}$	$\mathbf{99.19}$ ${\scriptstyle\pm 0.41}$	$\mathbf{81.35}$ ${\scriptstyle\pm 0.48}$

5.2 Generalization and Reliability of Predictive Uncertainty under Distribution Shift

To assess the reliability of predictive models in deep learning, Ovadia et al. (2019) propose the following desiderata: In order for a model to be considered reliable, it ought to (i) exhibit low predictive uncertainty on training data and high predictive uncertainty on out-of-distribution inputs, (ii) generate predictive uncertainty estimates that allow distinguishing in- from out-of-distribution inputs, and (iii) if possible, maintain high predictive accuracy even under distribution shift. Models that satisfy these desiderata are less likely to make poor, high-confidence predictions and more amenable for use in safety-critical downstream tasks.

To illustrate these desiderata, we follow Ovadia et al. (2019) and consider the rotated MNIST task, where a model is trained on MNIST and evaluated on rotated MNIST digits. The goal is to maintain a high level of predictive accuracy (measured in terms of Brier scores) while exhibiting an increasing level of predictive uncertainty on distribution shifts of increasing magnitude. Figure 2 shows Brier scores (lower is better) and predictive entropy estimates (higher means more uncertain) of four different models. As rotating the MNIST digits gradually shifts the data distribution, we would expect Brier scores to increase (corresponding to worse predictive accuracy) as the rotation angle increases. A reliable model would experience only a small decrease in predictive performance under distribution shift while exhibiting a large increase in predictive uncertainty. As can be seen in the plot, the predictive performance of fsvi, as measured by the Brier score, deteriorates the least, while the predictive uncertainty of fsvi is significantly higher than that of the other models. To assess the reliability of different uncertainty quantification methods on a more challenging distribution-shift task, we consider corrupted CIFAR-10 inputs under the second-mildest corruption level used in Ovadia et al. (2019) and report our results in Table 2. Consistent with the rotated MNIST results, fsvi achieves the highest accuracy on the corrupted data.
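The two quantities plotted for the rotated MNIST task can be sketched as below. This is an illustrative example with hypothetical predictive distributions, not the evaluation code used for Figure 2. The Brier score is written here as a sum over classes; some implementations instead average over classes, which rescales the value by a constant.

```python
import math


def brier_score(probs, label):
    """Squared error between predicted class probabilities and the one-hot label."""
    return sum((p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(probs))


def predictive_entropy(probs):
    """Entropy (in nats) of a predictive distribution; 0 * log 0 is taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical predictive distributions for an input whose true label is 0.
confident = [0.90, 0.05, 0.05]  # e.g., an unrotated digit
uncertain = [0.40, 0.30, 0.30]  # e.g., a strongly rotated digit

brier_confident = brier_score(confident, 0)
brier_uncertain = brier_score(uncertain, 0)
entropy_confident = predictive_entropy(confident)
entropy_uncertain = predictive_entropy(uncertain)
```

In this toy case, both the Brier score and the predictive entropy are larger for the second distribution, which is the qualitative behavior one would hope to see as the rotation angle grows: accuracy degrades, and the model signals this through higher uncertainty.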

5.3 Safety-Critical Uncertainty-Aware Selective Prediction: Diabetic Retinopathy Diagnosis

To evaluate the reliability of the predictive uncertainty of fsvi in a real-world safety-critical setting, we consider the task of diagnosing diabetic retinopathy (DR), a medical condition that can lead to impaired vision, from retina scans (Leibig et al., 2017; Filos et al., 2019; Band et al., 2021). We use two publicly available datasets, EyePACS (2015) and APTOS (2019), each containing RGB images of a human retina graded by a medical expert on the following scale: 0 (no DR), 1 (mild DR), 2 (moderate DR), 3 (severe DR), and 4 (proliferative DR). The EyePACS dataset (also referred to as the Kaggle dataset) was collected from patients in the United States, while the APTOS dataset was collected from patients in India using cheaper but more modern scanning devices. We follow Leibig et al. (2017), Filos et al. (2019), and Band et al. (2021) and binarize all examples from both the EyePACS and APTOS datasets by dividing the classes into sight-threatening diabetic retinopathy, defined as moderate diabetic retinopathy or worse (classes $\{2,3,4\}$), and non-sight-threatening diabetic retinopathy, defined as no or mild diabetic retinopathy (classes $\{0,1\}$). This results in a binary prediction task.

To assess the reliability of predictive models when medical training and test data are obtained from different patient populations or collected with different medical equipment, we follow Band et al. (2021) and use the Kaggle dataset for training and the distributionally shifted APTOS dataset for evaluation. The results are shown in Figure 4, which plots the ROC curves for the binary prediction problems as well as the area under the ROC curve for an uncertainty-aware selective prediction task. For further details about the uncertainty-aware selective prediction evaluation protocol, see Section D.4. Figure 4 shows that fsvi performs well on all four tasks and is only outperformed by mc dropout. For full tabular results, see Section B.1.
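One common form of uncertainty-aware selective prediction is sketched below: the most uncertain examples are referred to an expert, and performance is measured on the retained examples only. This is an assumption-laden illustration. The protocol actually used is the one specified in Section D.4, the sketch reports accuracy rather than the area under the ROC curve, and the function name and toy data are hypothetical.

```python
def selective_accuracy(uncertainties, correct, referral_fraction):
    """Accuracy on the examples retained after referring the most uncertain ones.

    `referral_fraction` is the share of examples (between 0 and 1) handed to
    an expert instead of being predicted by the model.
    """
    # Indices sorted from least to most uncertain.
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    num_retained = len(order) - int(round(referral_fraction * len(order)))
    retained = order[:num_retained]
    if not retained:
        return float("nan")
    return sum(correct[i] for i in retained) / len(retained)


# Hypothetical per-example uncertainties and correctness indicators.
uncertainties = [0.1, 0.9, 0.2, 0.8, 0.3]
correct = [1, 0, 1, 0, 1]
curve = [selective_accuracy(uncertainties, correct, f) for f in (0.0, 0.2, 0.4)]
```

When uncertainty estimates are reliable, the errors are concentrated among the most uncertain examples, so performance on the retained examples improves as the referral fraction grows, as it does in this toy example.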

6 Conclusion

The paper proposed a scalable and effective approach to function-space variational inference in bnns. We demonstrated that the proposed estimator of the function-space variational objective can be scaled up to high-dimensional data and large neural network architectures and that fsvi exhibits consistently reliable in- and out-of-distribution predictive performance on a wide range of datasets when compared to well-established and state-of-the-art uncertainty quantification methods. We hope that this work will lead to further research into function-space variational inference and the development of more sophisticated data-driven prior distributions over functions.

Acknowledgements

We thank Bryn Elesedy, Bobby He, and Andrew Jesson for feedback on an early draft of this paper. We thank Joost van Amersfoort for helpful discussions about experiment design and implementations. Tim G. J. Rudner is funded by the Rhodes Trust and the Engineering and Physical Sciences Research Council (EPSRC). We gratefully acknowledge donations of computing resources by the Alan Turing Institute.

References

Amodei et al. (2016) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety, 2016.
APTOS (2019) APTOS. APTOS 2019 Blindness Detection Dataset, 2019.
Band et al. (2021) Neil Band, Tim G. J. Rudner, Qixuan Feng, Angelos Filos, Zachary Nado, Michael W. Dusenberry, Ghassen Jerfel, Dustin Tran, and Yarin Gal. Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks. 2021.
Benjamin et al. (2019) Ari Benjamin, David Rolnick, and Konrad Kording. Measuring and regularizing networks in function space. In International Conference on Learning Representations, 2019.
Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. volume 37 of Proceedings of Machine Learning Research, pages 1613–1622, Lille, France, 07–09 Jul 2015. PMLR.
Burt et al. (2021) David R. Burt, Sebastian W. Ober, Adrià Garriga-Alonso, and Mark van der Wilk. Understanding variational inference in function-space. In Third Symposium on Advances in Approximate Bayesian Inference, 2021.
Carvalho et al. (2020) Eduardo D. C. Carvalho, Ronald Clark, Andrea Nicastro, and Paul H. J. Kelly. Scalable uncertainty for computer vision with functional variational inference. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
EyePACS (2015) EyePACS. Diabetic Retinopathy Detection Dataset, 2015.
Farquhar et al. (2020a) Sebastian Farquhar, Michael A. Osborne, and Yarin Gal. Radial Bayesian neural networks: Beyond discrete support in large-scale Bayesian deep learning. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1352–1362. PMLR, 26–28 Aug 2020a.
Farquhar et al. (2020b) Sebastian Farquhar, Lewis Smith, and Yarin Gal. Liberty or depth: Deep Bayesian neural nets do not need complex weight posterior approximations. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020b.
Filos et al. (2019) Angelos Filos, Sebastian Farquhar, Aidan N. Gomez, Tim G. J. Rudner, Zachary Kenton, Lewis Smith, Milad Alizadeh, Arnoud de Kroon, and Yarin Gal. A systematic comparison of Bayesian deep learning robustness in diabetic retinopathy tasks, 2019.
Foong et al. (2019) Andrew Y. K. Foong, Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner. ‘In-between’ uncertainty in Bayesian neural networks, 2019.
Foong et al. (2020) Andrew Y. K. Foong, David R. Burt, Yingzhen Li, and Richard E. Turner. On the expressiveness of approximate inference in Bayesian neural networks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML 2016, pages 1050–1059, 2016.
Graves (2011) Alex Graves. Practical variational inference for neural networks. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, page 2348–2356, Red Hook, NY, USA, 2011. Curran Associates Inc. ISBN 9781618395993.
Hendrycks et al. (2021) Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ML safety, 2021.
Hinton and van Camp (1993) Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT ’93, page 5–13, New York, NY, USA, 1993. Association for Computing Machinery. ISBN 0897916115.
Hoffman et al. (2013) Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, May 2013. ISSN 1532-4435.
Immer et al. (2020) Alexander Immer, Maciej Korzepa, and Matthias Bauer. Improving predictions of Bayesian neural networks via local linearization, 2020.
Izmailov et al. (2020) Pavel Izmailov, Wesley J. Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Subspace inference for Bayesian deep learning. In Ryan P. Adams and Vibhav Gogate, editors, Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research, pages 1169–1179. PMLR, 22–25 Jul 2020.
Jumper et al. (2021) John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Zidek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andrew J Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021. doi: 10.1038/s41586-021-03819-2.
Khan et al. (2019) Mohammad Emtiyaz E Khan, Alexander Immer, Ehsan Abedi, and Maciej Korzepa. Approximate inference turns deep networks into Gaussian processes. In Advances in Neural Information Processing Systems 32, pages 3094–3104. Curran Associates, Inc., 2019.
Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
Leibig et al. (2017) Christian Leibig, Vaneeda Allken, Murat Seçkin Ayhan, Philipp Berens, and Siegfried Wahl. Leveraging Uncertainty Information From Deep Neural Networks for Disease Detection. Nature Scientific Reports, 7(1):17816, 2017.
Ma and Hernández-Lobato (2021) Chao Ma and José Miguel Hernández-Lobato. Functional variational inference based on stochastic process generators. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
Ma et al. (2019) Chao Ma, Yingzhen Li, and Jose Miguel Hernandez-Lobato. Variational implicit processes. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4222–4233. PMLR, 09–15 Jun 2019.
MacKay (1992) David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Comput., 4(3):448–472, May 1992. ISSN 0899-7667. doi: 10.1162/neco.1992.4.3.448.
Maddox et al. (2019) Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pages 13153–13164, 2019.
Matthews et al. (2016) Alexander G. de G. Matthews, James Hensman, Richard Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. volume 51 of Proceedings of Machine Learning Research, pages 231–239, Cadiz, Spain, 09–11 May 2016. PMLR.
Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
Neal (1996) Radford M Neal. Bayesian Learning for Neural Networks. 1996.
Ober and Aitchison (2020) Sebastian W. Ober and Laurence Aitchison. Global inducing point variational posteriors for Bayesian neural networks and deep Gaussian processes, 2020.
Osawa et al. (2019) Kazuki Osawa, Siddharth Swaroop, Mohammad Emtiyaz E Khan, Anirudh Jain, Runa Eschenhagen, Richard E Turner, and Rio Yokota. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, volume 32, pages 4287–4299. Curran Associates, Inc., 2019.
Ovadia et al. (2019) Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems 32. 2019.
Pan et al. (2020) Pingbo Pan, Siddharth Swaroop, Alexander Immer, Runa Eschenhagen, Richard E. Turner, and Mohammad Emtiyaz Khan. Continual deep learning by functional regularisation of memorable past. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
Polyanskiy and Wu (2017) Yury Polyanskiy and Yihong Wu. Strong data-processing inequalities for channels and Bayesian networks. In Eric Carlen, Mokshay Madiman, and Elisabeth M. Werner, editors, Convexity and Concentration, pages 211–249, New York, NY, 2017. Springer New York. ISBN 978-1-4939-7005-6.
Rudner and Toner (2021a) Tim G. J. Rudner and Helen Toner. Key Concepts in AI Safety: An Overview. In CSET Issue Briefs, 2021a.
Rudner and Toner (2021b) Tim G. J. Rudner and Helen Toner. Key Concepts in AI Safety: Robustness and Adversarial Examples. In CSET Issue Briefs, 2021b.
Rudner et al. (2021) Tim G. J. Rudner, Cong Lu, Michael A. Osborne, Yarin Gal, and Yee Whye Teh. On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations. In Advances in Neural Information Processing Systems 34, 2021.
Rudner et al. (2022) Tim G. J. Rudner, Freddie Bickford Smith, Qixuan Feng, Yee Whye Teh, and Yarin Gal. Continual Learning via Sequential Function-Space Variational Inference. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2022.
Schervish (1995) M. J. Schervish. Theory of Statistics. Springer-Verlag, New York, NY, 1995.
Silver et al. (2016) David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529, 2016.
Snelson and Ghahramani (2006) Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. MIT Press, 2006.
Sun et al. (2019) Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger B. Grosse. Functional variational Bayesian neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
Titsias et al. (2020) Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, and Yee Whye Teh. Functional regularisation for continual learning with Gaussian processes. In International Conference on Learning Representations, 2020.
van Amersfoort et al. (2020) Joost van Amersfoort, Lewis Smith, Yee Whye Teh, and Yarin Gal. Uncertainty estimation using a single deep deterministic neural network. In International Conference on Machine Learning, 2020.
van Amersfoort et al. (2021) Joost van Amersfoort, Lewis Smith, Andrew Jesson, Oscar Key, and Yarin Gal. Variational deterministic uncertainty quantification, 2021.
Wainwright and Jordan (2008) Martin J Wainwright and Michael I Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., Hanover, MA, USA, 2008. ISBN 1601981848.
Wang et al. (2019) Ziyu Wang, Tongzheng Ren, Jun Zhu, and Bo Zhang. Function space particle optimization for Bayesian neural networks. In International Conference on Learning Representations, 2019.
Widdowson (2016) D. T. S. Widdowson. The Management of Grading Quality: Good Practice in the Quality Assurance of Grading. Tech. Rep., 2016.
Wolpert (1993) David H. Wolpert. Bayesian backpropagation over i-o functions rather than weights. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6. Morgan-Kaufmann, 1993.

Appendix


Appendix A Proofs & Derivations

A.1 Function-Space Variational Objective

This proof follows steps from Matthews et al. [2016]. Consider measures $\hat{P}$ and $P$ both of which define distributions over some function $f$ , indexed by an infinite index set $X$ . Let $\mathcal{D}$ be a dataset and let $\mathbf{X}_{\mathcal{D}}$ denote a set of inputs and $\mathbf{y}_{\mathcal{D}}$ a set of targets. Consider the measure-theoretic version of Bayes’ Theorem [Schervish, 1995]:

$$\frac{d\hat{P}}{dP}(f)=\frac{p_{X}(Y\,|\,f)}{p(Y)}, \tag{A.1}$$

where $p_{X}(Y\,|\,f)$ is the likelihood and $p(Y)=\int_{\mathbb{R}^{X}}p_{X}(Y\,|\,f)\,\textrm{d}P(f)$ is the marginal likelihood. We assume that the likelihood function is evaluated at a finite subset of the index set $X$. Denote by $\pi_{C}:\mathbb{R}^{X}\to\mathbb{R}^{C}$ a projection function that takes a function and returns the same function, evaluated at a finite set of points $C$, so we can write

$$\frac{d\hat{P}}{dP}(f)=\frac{d\hat{P}_{\mathbf{X}_{\mathcal{D}}}}{dP_{\mathbf{X}_{\mathcal{D}}}}(\pi_{\mathbf{X}_{\mathcal{D}}}(f))=\frac{p(\mathbf{y}_{\mathcal{D}}\,|\,\pi_{\mathbf{X}_{\mathcal{D}}}(f))}{p(\mathbf{y}_{\mathcal{D}})}, \tag{A.2}$$

and similarly, the marginal likelihood becomes $p(\mathbf{y}_{\mathcal{D}})=\int p_{\mathbf{y}|f_{\mathbf{X}}}(\mathbf{y}_{\mathcal{D}}\,|\,f_{\mathbf{X}_{\mathcal{D}}})\,\textrm{d}P_{\mathbf{X}_{\mathcal{D}}}(f_{\mathbf{X}_{\mathcal{D}}})$. Now, considering the measure-theoretic version of the KL divergence between an approximating stochastic process $Q$ and a posterior stochastic process $\hat{P}$, we can write

$$\mathbb{D}_{\textrm{KL}}(Q\,\|\,\hat{P})=\int\log{\frac{dQ}{dP}(f)}\,\textrm{d}Q(f)-\int\log{\frac{d\hat{P}}{dP}(f)}\,\textrm{d}Q(f), \tag{A.3}$$

where $P$ is some prior stochastic process. Now, we can apply the measure-theoretic Bayes’ Theorem to obtain

\begin{align}
\mathbb{D}_{\textrm{KL}}(Q\,\|\,\hat{P}) &=\int\log{\frac{dQ}{dP}(f)}\,\textrm{d}Q(f)-\int\log{\frac{d\hat{P}}{dP}(f)}\,\textrm{d}Q(f) \tag{A.4}\\
&=\int\log{\frac{dQ^{\pi}}{dP^{\pi}}(f)}\,\textrm{d}Q^{\pi}(f)-\int\log{\frac{d\hat{P}_{\mathbf{X}_{\mathcal{D}}}}{dP_{\mathbf{X}_{\mathcal{D}}}}\left(f_{\mathbf{X}_{\mathcal{D}}}\right)}\,\textrm{d}Q_{\mathbf{X}_{\mathcal{D}}}\left(f_{\mathbf{X}_{\mathcal{D}}}\right) \tag{A.5}\\
&=\int\log{\frac{dQ^{\pi}}{dP^{\pi}}(f)}\,\textrm{d}Q^{\pi}(f)-\operatorname{\mathbb{E}}_{Q_{\mathbf{X}_{\mathcal{D}}}}\left[\log p\left(\mathbf{y}_{\mathcal{D}}\,|\,f_{\mathbf{X}_{\mathcal{D}}}\right)\right]+\log p(\mathbf{y}_{\mathcal{D}}), \tag{A.6}
\end{align}

where $\frac{dQ^{\pi}}{dP^{\pi}}(f)$ is marginally consistent given the projection $\pi$ . Rearranging, we can get

\begin{align}
\log p(\mathbf{y}_{\mathcal{D}}) &=\operatorname{\mathbb{E}}_{Q_{\mathbf{X}_{\mathcal{D}}}}\left[\log p_{\mathbf{y}|f_{\mathbf{X}}}(\mathbf{y}_{\mathcal{D}}\,|\,f_{\mathbf{X}_{\mathcal{D}}})\right]-\int\log{\frac{dQ^{\pi}}{dP^{\pi}}(f)}\,\textrm{d}Q^{\pi}(f)+\mathbb{D}_{\textrm{KL}}(Q\,\|\,\hat{P}) \tag{A.7}\\
&\geq\operatorname{\mathbb{E}}_{Q_{\mathbf{X}_{\mathcal{D}}}}\left[\log p_{\mathbf{y}|f_{\mathbf{X}}}(\mathbf{y}_{\mathcal{D}}\,|\,f_{\mathbf{X}_{\mathcal{D}}})\right]-\int\log{\frac{dQ^{\pi}}{dP^{\pi}}(f)}\,\textrm{d}Q^{\pi}(f) \tag{A.8}\\
&=\operatorname{\mathbb{E}}_{Q_{\mathbf{X}_{\mathcal{D}}}}\left[\log p_{\mathbf{y}|f_{\mathbf{X}}}(\mathbf{y}_{\mathcal{D}}\,|\,f_{\mathbf{X}_{\mathcal{D}}})\right]-\mathbb{D}_{\textrm{KL}}(Q^{\pi}\,\|\,P^{\pi}). \tag{A.9}
\end{align}

Finally, this lower bound can equivalently be expressed as

$$\log p(\mathbf{y}_{\mathcal{D}})\geq\operatorname{\mathbb{E}}_{Q_{\mathbf{X}_{\mathcal{D}}}}\left[\log p_{\mathbf{y}|f_{\mathbf{X}}}(\mathbf{y}_{\mathcal{D}}\,|\,f_{\mathbf{X}_{\mathcal{D}}})\right]-\mathbb{D}_{\textrm{KL}}(Q_{\mathbf{X}_{\mathcal{D}},\mathbf{X}_{\backslash\mathcal{D}}}\,\|\,P_{\mathbf{X}_{\mathcal{D}},\mathbf{X}_{\backslash\mathcal{D}}}), \tag{A.10}$$

where $\mathbf{X}_{\backslash\mathcal{D}}$ is an infinite index set excluding the finite index set $\mathbf{X}_{\mathcal{D}}$ , that is, $\mathbf{X}_{\backslash\mathcal{D}}\cap\mathbf{X}_{\mathcal{D}}=\varnothing$ , or by Theorem 1 in Sun et al. [2019], we can write

$$\log p(\mathbf{y}_{\mathcal{D}})\geq\operatorname{\mathbb{E}}_{Q_{\mathbf{X}_{\mathcal{D}}}}\left[\log p_{\mathbf{y}|f_{\mathbf{X}}}(\mathbf{y}_{\mathcal{D}}\,|\,f_{\mathbf{X}_{\mathcal{D}}})\right]-\sup_{\mathbf{X}\in\mathcal{X}_{\mathbb{N}}}\mathbb{D}_{\textrm{KL}}(Q_{\mathbf{X}}\,\|\,P_{\mathbf{X}}). \tag{A.11}$$
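The decomposition underlying this bound, in which the log marginal likelihood equals the expected log-likelihood minus the KL divergence to the prior plus the KL divergence to the exact posterior, can be checked numerically in a setting where every term has a closed form. The sketch below is a toy illustration only: it replaces the stochastic processes by a single scalar function value with a Gaussian prior and Gaussian likelihood, and all names and numerical values are hypothetical.

```python
import math


def gaussian_kl(m_q, v_q, m_p, v_p):
    """KL(N(m_q, v_q) || N(m_p, v_p)) for scalar Gaussians (v_* are variances)."""
    return 0.5 * (math.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)


def check_decomposition(y, prior_var, noise_var, m_q, v_q):
    """Return (log evidence, lower bound, KL to exact posterior) for the toy model
    f ~ N(0, prior_var), y | f ~ N(f, noise_var), with q(f) = N(m_q, v_q)."""
    # Expected log-likelihood under q, available in closed form.
    exp_loglik = (-0.5 * math.log(2.0 * math.pi * noise_var)
                  - ((y - m_q) ** 2 + v_q) / (2.0 * noise_var))
    elbo = exp_loglik - gaussian_kl(m_q, v_q, 0.0, prior_var)
    # Exact posterior and log marginal likelihood of the conjugate model.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * y / noise_var
    evidence_var = prior_var + noise_var
    log_evidence = (-0.5 * math.log(2.0 * math.pi * evidence_var)
                    - y ** 2 / (2.0 * evidence_var))
    gap = gaussian_kl(m_q, v_q, post_mean, post_var)
    return log_evidence, elbo, gap


log_evidence, elbo, gap = check_decomposition(
    y=1.3, prior_var=2.0, noise_var=0.5, m_q=0.7, v_q=0.3)
```

In this example, `log_evidence` equals `elbo + gap` up to floating-point error, and since `gap` is a KL divergence and hence non-negative, `elbo` is a lower bound on the log marginal likelihood.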

A.2 Distribution under Linearized Function Mapping

Proposition 1 (Distribution under Linearized Mapping).

For a stochastic function $f(\cdot\,;\bm{\Theta})$ defined in terms of stochastic parameters $\bm{\Theta}$ distributed according to distribution $g_{\bm{\Theta}}$ with $\mathbf{m}\,\dot{=}\,\mathbb{E}_{g_{\bm{\Theta}}}[\bm{\Theta}]$ and $\mathbf{S}\,\dot{=}\,\mathrm{Cov}_{g_{\bm{\Theta}}}[\bm{\Theta}]$, denote the linearization of the stochastic function $f(\cdot\,;\bm{\Theta})$ about $\mathbf{m}$ by

\begin{equation*}
f(\cdot\,;\bm{\Theta})\approx\tilde{f}(\cdot\,;\mathbf{m},\bm{\Theta})\,\dot{=}\,f(\cdot\,;\mathbf{m})+\mathcal{J}(\cdot\,;\mathbf{m})(\bm{\Theta}-\mathbf{m}),
\end{equation*}

where $\mathcal{J}(\cdot\,;\mathbf{m})\,\dot{=}\,(\partial f(\cdot\,;\bm{\Theta})/\partial\bm{\Theta})|_{\bm{\Theta}=\mathbf{m}}$ is the Jacobian of $f(\cdot\,;\bm{\Theta})$ evaluated at $\bm{\Theta}=\mathbf{m}$. Then the mean and covariance of the distribution over the linearized mapping $\tilde{f}$ at $\mathbf{X},\mathbf{X}^{\prime}\in\mathcal{X}$ are given by

\begin{align*}
\mathbb{E}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})]&=f(\mathbf{X};\mathbf{m}),\\
\mathrm{Cov}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta}),\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta})]&=\mathcal{J}(\mathbf{X};\mathbf{m})\mathbf{S}\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})^{\top}.
\end{align*}

Proof.

We wish to find $\mathbb{E}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})]$ and

\begin{equation}
\begin{split}
&\mathrm{Cov}(\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta}),\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta}))\\
&=\mathbb{E}[(\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})-\mathbb{E}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})])\,(\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta})-\mathbb{E}[\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta})])^{\top}].
\end{split}
\tag{A.12}
\end{equation}

To see that $\mathbb{E}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})]=f(\mathbf{X};\mathbf{m})$, note that, by linearity of expectation, we have

\begin{equation}
\begin{split}
\mathbb{E}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})]&=\mathbb{E}[f(\mathbf{X};\mathbf{m})+\mathcal{J}(\mathbf{X};\mathbf{m})(\bm{\Theta}-\mathbf{m})]\\
&=f(\mathbf{X};\mathbf{m})+\mathcal{J}(\mathbf{X};\mathbf{m})(\mathbb{E}[\bm{\Theta}]-\mathbf{m})=f(\mathbf{X};\mathbf{m}).
\end{split}
\tag{A.13}
\end{equation}

To see that $\mathrm{Cov}(\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta}),\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta}))=\mathcal{J}(\mathbf{X};\mathbf{m})\mathbf{S}\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})^{\top}$, note that in general, for multivariate random variables $\mathbf{Z}$ and $\mathbf{Z}^{\prime}$, $\mathrm{Cov}(\mathbf{Z},\mathbf{Z}^{\prime})=\mathbb{E}[\mathbf{Z}\mathbf{Z}^{\prime\top}]-\mathbb{E}[\mathbf{Z}]\mathbb{E}[\mathbf{Z}^{\prime}]^{\top}$, and hence,

\begin{equation}
\begin{split}
&\mathrm{Cov}(\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta}),\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta}))\\
&=\mathbb{E}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta})^{\top}]-\mathbb{E}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})]\mathbb{E}[\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta})]^{\top}.
\end{split}
\tag{A.14}
\end{equation}

We already know that $\mathbb{E}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})]=f(\mathbf{X};\mathbf{m})$, so we only need to find $\mathbb{E}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta})^{\top}]$:

\begin{align}
&\mathbb{E}_{g_{\bm{\Theta}}}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta})^{\top}] \notag\\
&=\mathbb{E}_{g_{\bm{\Theta}}}[(f(\mathbf{X};\mathbf{m})+\mathcal{J}(\mathbf{X};\mathbf{m})(\bm{\Theta}-\mathbf{m}))(f(\mathbf{X}^{\prime};\mathbf{m})+\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})(\bm{\Theta}-\mathbf{m}))^{\top}] \tag{A.15}\\
&=\mathbb{E}_{g_{\bm{\Theta}}}[f(\mathbf{X};\mathbf{m})f(\mathbf{X}^{\prime};\mathbf{m})^{\top}+(\mathcal{J}(\mathbf{X};\mathbf{m})(\bm{\Theta}-\mathbf{m}))(\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})(\bm{\Theta}-\mathbf{m}))^{\top} \notag\\
&\qquad\qquad+f(\mathbf{X};\mathbf{m})(\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})(\bm{\Theta}-\mathbf{m}))^{\top}+\mathcal{J}(\mathbf{X};\mathbf{m})(\bm{\Theta}-\mathbf{m})f(\mathbf{X}^{\prime};\mathbf{m})^{\top}] \tag{A.16}\\
&=\mathbb{E}_{g_{\bm{\Theta}}}[f(\mathbf{X};\mathbf{m})f(\mathbf{X}^{\prime};\mathbf{m})^{\top}+\mathcal{J}(\mathbf{X};\mathbf{m})(\bm{\Theta}-\mathbf{m})(\bm{\Theta}-\mathbf{m})^{\top}\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})^{\top} \notag\\
&\qquad\qquad+f(\mathbf{X};\mathbf{m})(\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})(\bm{\Theta}-\mathbf{m}))^{\top}+\mathcal{J}(\mathbf{X};\mathbf{m})(\bm{\Theta}-\mathbf{m})f(\mathbf{X}^{\prime};\mathbf{m})^{\top}] \tag{A.17}\\
&=f(\mathbf{X};\mathbf{m})f(\mathbf{X}^{\prime};\mathbf{m})^{\top}+\mathcal{J}(\mathbf{X};\mathbf{m})\mathbb{E}_{g_{\bm{\Theta}}}[(\bm{\Theta}-\mathbf{m})(\bm{\Theta}-\mathbf{m})^{\top}]\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})^{\top} \notag\\
&\qquad\qquad+f(\mathbf{X};\mathbf{m})(\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})(\underbrace{\mathbb{E}_{g_{\bm{\Theta}}}[\bm{\Theta}]-\mathbf{m}}_{=0}))^{\top}+\mathcal{J}(\mathbf{X};\mathbf{m})(\underbrace{\mathbb{E}_{g_{\bm{\Theta}}}[\bm{\Theta}]-\mathbf{m}}_{=0})f(\mathbf{X}^{\prime};\mathbf{m})^{\top}, \tag{A.18}
\end{align}

where the last line follows from linearity of expectation and the definition $\mathbf{m}\,\dot{=}\,\mathbb{E}_{g_{\bm{\Theta}}}[\bm{\Theta}]$. By definition of the covariance, we then obtain

\begin{align}
&\mathbb{E}_{g_{\bm{\Theta}}}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta})^{\top}] \notag\\
&=f(\mathbf{X};\mathbf{m})f(\mathbf{X}^{\prime};\mathbf{m})^{\top}+\mathcal{J}(\mathbf{X};\mathbf{m})\mathbb{E}_{g_{\bm{\Theta}}}[(\bm{\Theta}-\mathbf{m})(\bm{\Theta}-\mathbf{m})^{\top}]\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})^{\top} \tag{A.19}\\
&=f(\mathbf{X};\mathbf{m})f(\mathbf{X}^{\prime};\mathbf{m})^{\top}+\mathcal{J}(\mathbf{X};\mathbf{m})\mathrm{Cov}(\bm{\Theta})\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})^{\top}. \tag{A.20}
\end{align}

With this result, we obtain the covariance function

\begin{align}
&\mathrm{Cov}(\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta}),\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta})) \notag\\
&=\mathbb{E}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta})^{\top}]-\mathbb{E}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})]\mathbb{E}[\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta})]^{\top} \tag{A.21}\\
&=\mathbb{E}[\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta})^{\top}]-f(\mathbf{X};\mathbf{m})f(\mathbf{X}^{\prime};\mathbf{m})^{\top} \tag{A.22}\\
&=f(\mathbf{X};\mathbf{m})f(\mathbf{X}^{\prime};\mathbf{m})^{\top}+\mathcal{J}(\mathbf{X};\mathbf{m})\mathrm{Cov}(\bm{\Theta})\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})^{\top}-f(\mathbf{X};\mathbf{m})f(\mathbf{X}^{\prime};\mathbf{m})^{\top} \tag{A.23}\\
&=\mathcal{J}(\mathbf{X};\mathbf{m})\mathrm{Cov}(\bm{\Theta})\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})^{\top}. \tag{A.24}
\end{align}

Finally, $\mathrm{Cov}(\bm{\Theta})=\mathbf{S}$ yields $\mathrm{Cov}(\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta}),\tilde{f}(\mathbf{X}^{\prime};\mathbf{m},\bm{\Theta}))=\mathcal{J}(\mathbf{X};\mathbf{m})\mathbf{S}\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})^{\top}$. This concludes the proof. ∎
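The statement of Proposition 1 can be checked numerically. The sketch below is an illustration under stated assumptions rather than code from the paper: the toy model `f`, its hand-derived `jacobian`, the diagonal covariance `s_diag`, and all numerical values are hypothetical. It compares a Monte Carlo estimate of the mean and covariance of the linearized mapping with $f(\mathbf{X};\mathbf{m})$ and $\mathcal{J}(\mathbf{X};\mathbf{m})\mathbf{S}\mathcal{J}(\mathbf{X}^{\prime};\mathbf{m})^{\top}$.

```python
import math
import random


def f(x, theta):
    """Toy scalar model f(x; theta) = theta[2] * tanh(theta[0] * x + theta[1])."""
    return theta[2] * math.tanh(theta[0] * x + theta[1])


def jacobian(x, m):
    """Row vector d f(x; theta) / d theta evaluated at theta = m."""
    t = math.tanh(m[0] * x + m[1])
    return [m[2] * (1.0 - t * t) * x, m[2] * (1.0 - t * t), t]


def f_lin(x, m, theta):
    """Linearization of f about m: f(x; m) + J(x; m) (theta - m)."""
    jac = jacobian(x, m)
    return f(x, m) + sum(j * (th - mi) for j, th, mi in zip(jac, theta, m))


def analytic_cov(x, x_prime, m, s_diag):
    """J(x; m) S J(x'; m)^T for a diagonal covariance S."""
    jac, jac_prime = jacobian(x, m), jacobian(x_prime, m)
    return sum(a * s * b for a, s, b in zip(jac, s_diag, jac_prime))


m = [0.8, -0.3, 1.5]          # mean of the parameter distribution
s_diag = [0.04, 0.09, 0.01]   # diagonal of the covariance S (assumed diagonal)
x, x_prime = 0.5, -1.2

rng = random.Random(0)
n = 50_000
samples = [[rng.gauss(mi, math.sqrt(si)) for mi, si in zip(m, s_diag)]
           for _ in range(n)]
fx = [f_lin(x, m, th) for th in samples]
fxp = [f_lin(x_prime, m, th) for th in samples]

mean_x = sum(fx) / n
mean_xp = sum(fxp) / n
mc_cov = sum((a - mean_x) * (b - mean_xp) for a, b in zip(fx, fxp)) / (n - 1)
```

Up to Monte Carlo error, `mean_x` agrees with `f(x, m)` and `mc_cov` agrees with `analytic_cov(x, x_prime, m, s_diag)`. Only the first two moments of the parameter distribution enter the result; the sketch samples from a Gaussian purely for convenience.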

Proposition 2 (Approximate Distribution under Linearized Mapping).

For a stochastic function $f(\cdot\,;\bm{\Theta})$ defined in terms of stochastic parameters $\bm{\Theta}$ distributed according to distribution $g_{\bm{\Theta}}=\mathcal{N}(\mathbf{m},\mathbf{S})$, denote the linearization of the stochastic function $f(\cdot\,;\bm{\Theta})$ about $\mathbf{m}$ by

\begin{equation*}
f(\cdot\,;\bm{\Theta})\approx\tilde{f}(\cdot\,;\mathbf{m},\bm{\Theta})\,\dot{=}\,f(\cdot\,;\mathbf{m})+\mathcal{J}(\cdot\,;\mathbf{m})(\bm{\Theta}-\mathbf{m}),
\end{equation*}

where $\mathcal{J}(\cdot\,;\mathbf{m})\,\dot{=}\,(\partial f(\cdot\,;\bm{\Theta})/\partial\bm{\Theta})|_{\bm{\Theta}=\mathbf{m}}$ is the Jacobian of $f(\cdot\,;\bm{\Theta})$ evaluated at $\bm{\Theta}=\mathbf{m}$. Then, for a partition of the set of parameters into sets $\alpha$ and $\beta$ such that $\bm{\Theta}_{\alpha}\perp\bm{\Theta}_{\beta}$ under $g_{\bm{\Theta}}$, the distribution $\tilde{g}_{\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})}$ can be approximated via the Monte Carlo estimator

\begin{equation}
\hat{\tilde{g}}_{\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})}=\frac{1}{R}\sum\nolimits_{j=1}^{R}\mathcal{N}\Big(f(\mathbf{X};\mathbf{m})+\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\Theta}_{\alpha})^{(j)},\,\mathcal{J}_{\beta}(\mathbf{X};\mathbf{m})\mathbf{S}_{\beta}\mathcal{J}_{\beta}(\mathbf{X};\mathbf{m})^{\top}\Big),
\tag{A.25}
\end{equation}

where $g_{\bm{\Theta}_{\alpha}}=\mathcal{N}(\mathbf{m}_{\alpha},\mathbf{S}_{\alpha})$, $g_{\bm{\Theta}_{\beta}}=\mathcal{N}(\mathbf{m}_{\beta},\mathbf{S}_{\beta})$, and

\begin{equation}
\tilde{f}_{\alpha}(\cdot\,;\mathbf{m},\bm{\Theta}_{\alpha})\,\dot{=}\,\mathcal{J}_{\alpha}(\cdot\,;\mathbf{m})(\bm{\Theta}_{\alpha}-\mathbf{m}_{\alpha}),
\tag{A.26}
\end{equation}

with $\mathcal{J}_{\alpha}(\cdot\,;\mathbf{m})$ denoting the columns of the Jacobian matrix corresponding to the set of parameters $\alpha$ and $\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\Theta}_{\alpha})^{(j)}$ for $j=1,\ldots,R$ obtained by sampling parameters from the distribution $g_{\bm{\Theta}_{\alpha}}=\mathcal{N}(\mathbf{m}_{\alpha},\mathbf{S}_{\alpha})$.

Proof.

Consider a partition of the set of parameters into sets $\alpha$ and $\beta$ and express the linearized mapping as

\begin{equation}
\tilde{f}(\cdot\,;\mathbf{m},\bm{\Theta})=\tilde{f}_{\alpha}(\cdot\,;\mathbf{m},\bm{\Theta}_{\alpha})+\tilde{f}_{\beta}(\cdot\,;\mathbf{m},\bm{\Theta}_{\beta}),
\tag{A.27}
\end{equation}

with

\begin{equation}
\tilde{f}_{\alpha}(\cdot\,;\mathbf{m},\bm{\Theta}_{\alpha})\,\dot{=}\,\mathcal{J}_{\alpha}(\cdot\,;\mathbf{m})(\bm{\Theta}_{\alpha}-\mathbf{m}_{\alpha})
\tag{A.28}
\end{equation}

and

\begin{equation}
\tilde{f}_{\beta}(\cdot\,;\mathbf{m},\bm{\Theta}_{\beta})\,\dot{=}\,f(\cdot\,;\mathbf{m})+\mathcal{J}_{\beta}(\cdot\,;\mathbf{m})(\bm{\Theta}_{\beta}-\mathbf{m}_{\beta}).
\tag{A.29}
\end{equation}

Noting that Equation A.27 expresses $\tilde{f}$ as a sum of (affine transformations of) random variables, we can use the fact that for independent Gaussian random variables $\mathbf{X}$ and $\mathbf{Y}$, the distribution $h_{\mathbf{Z}}$ of $\mathbf{Z}=\mathbf{X}+\mathbf{Y}$ is equal to the convolution of the distributions $h_{\mathbf{X}}$ and $h_{\mathbf{Y}}$ to obtain an approximation to the distribution of $\tilde{f}$. In particular, for $\mathbf{Z}=\mathbf{X}+\mathbf{Y}$,

\begin{equation}
h_{\mathbf{Z}}(\mathbf{z})=\int_{-\infty}^{\infty}h_{\mathbf{Y}}(\mathbf{z}-\mathbf{x})h_{\mathbf{X}}(\mathbf{x})\,\mathrm{d}\mathbf{x}.
\tag{A.30}
\end{equation}

Letting $\mathbf{X}=\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\Theta}_{\alpha})$, $\mathbf{Y}=\tilde{f}_{\beta}(\mathbf{X};\mathbf{m},\bm{\Theta}_{\beta})$, and $\mathbf{Z}=\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})$, we can write

\begin{align}
&\tilde{g}_{\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})}(\tilde{f}(\mathbf{X};\mathbf{m},\bm{\theta})) \notag\\
&=\int_{-\infty}^{\infty}\tilde{g}_{\tilde{f}_{\beta}(\mathbf{X};\mathbf{m},\bm{\Theta}_{\beta})}(\tilde{f}(\mathbf{X};\mathbf{m},\bm{\theta})-\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\theta}_{\alpha}))\,\tilde{g}_{\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\Theta}_{\alpha})}(\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\theta}_{\alpha}))\,\mathrm{d}\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\theta}_{\alpha}) \tag{A.31}\\
&=\int_{-\infty}^{\infty}\mathcal{N}(\tilde{f}(\mathbf{X};\mathbf{m},\bm{\theta})\,;\bm{\mu}(\mathbf{X};\mathbf{m},\bm{\theta}_{\alpha},\tilde{f}_{\alpha}),\bm{\Sigma}(\mathbf{X};\mathbf{m},\mathbf{S}_{\beta},\mathcal{J}_{\beta}))\,\tilde{g}_{\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\Theta}_{\alpha})}(\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\theta}_{\alpha}))\,\mathrm{d}\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\theta}_{\alpha}), \tag{A.32}
\end{align}

with

\begin{equation}
\bm{\mu}(\mathbf{X};\mathbf{m},\bm{\theta}_{\alpha},\tilde{f}_{\alpha})=f(\mathbf{X};\mathbf{m})+\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\theta}_{\alpha})
\tag{A.33}
\end{equation}

and

\begin{equation}
\bm{\Sigma}(\mathbf{X};\mathbf{m},\mathbf{S}_{\beta},\mathcal{J}_{\beta})=\mathcal{J}_{\beta}(\mathbf{X};\mathbf{m})\mathbf{S}_{\beta}\mathcal{J}_{\beta}(\mathbf{X};\mathbf{m})^{\top},
\tag{A.34}
\end{equation}

where we have used the fact that for a Gaussian distribution with mean $m$ and covariance $S$, $\mathcal{N}(z-y;m,S)=\mathcal{N}(z;m+y,S)$. We can then approximate the probability density function $\tilde{g}_{\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})}(\tilde{f}(\mathbf{X};\mathbf{m},\bm{\theta}))$ via the Monte Carlo estimator

\begin{equation}
\begin{split}
&\hat{\tilde{g}}_{\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})}(\tilde{f}(\mathbf{X};\mathbf{m},\bm{\theta}))\\
&=\frac{1}{R}\sum\nolimits_{j=1}^{R}\mathcal{N}(\tilde{f}(\mathbf{X};\mathbf{m},\bm{\theta})\,;\bm{\mu}(\mathbf{X};\mathbf{m},\bm{\theta}_{\alpha}^{(j)},\tilde{f}_{\alpha}),\bm{\Sigma}(\mathbf{X};\mathbf{m},\mathbf{S}_{\beta},\mathcal{J}_{\beta}))
\end{split}
\tag{A.35}
\end{equation}

with $\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\theta}_{\alpha})^{(j)}\,\dot{=}\,\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\theta}_{\alpha}^{(j)})\sim\tilde{g}_{\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\Theta}_{\alpha})}$. Finally, we can express the distribution $\hat{\tilde{g}}_{\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})}$ as

\begin{equation}
\hat{\tilde{g}}_{\tilde{f}(\mathbf{X};\mathbf{m},\bm{\Theta})}=\frac{1}{R}\sum\nolimits_{j=1}^{R}\mathcal{N}\Big(f(\mathbf{X};\mathbf{m})+\tilde{f}_{\alpha}(\mathbf{X};\mathbf{m},\bm{\Theta}_{\alpha})^{(j)},\,\mathcal{J}_{\beta}(\mathbf{X};\mathbf{m})\mathbf{S}_{\beta}\mathcal{J}_{\beta}(\mathbf{X};\mathbf{m})^{\top}\Big).
\tag{A.36}
\end{equation}
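The Monte Carlo estimator of Proposition 2 can be illustrated with a small numerical sketch. It is not the authors' implementation: the toy model `f`, its hand-derived `jacobian`, the diagonal covariance `s_diag`, the index sets `alpha` and `beta`, and all numerical values are hypothetical. Since the parameters are Gaussian and the mapping is linear, the exact density of the linearized output at a single input is $\mathcal{N}(f(\mathbf{X};\mathbf{m}),\mathcal{J}(\mathbf{X};\mathbf{m})\mathbf{S}\mathcal{J}(\mathbf{X};\mathbf{m})^{\top})$, which gives a reference against which the mixture can be compared.

```python
import math
import random


def f(x, theta):
    """Toy scalar model f(x; theta) = theta[2] * tanh(theta[0] * x + theta[1])."""
    return theta[2] * math.tanh(theta[0] * x + theta[1])


def jacobian(x, m):
    """Row vector d f(x; theta) / d theta evaluated at theta = m."""
    t = math.tanh(m[0] * x + m[1])
    return [m[2] * (1.0 - t * t) * x, m[2] * (1.0 - t * t), t]


def normal_pdf(y, mean, var):
    """Density of a univariate Gaussian N(mean, var) at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


m = [0.8, -0.3, 1.5]
s_diag = [0.04, 0.09, 0.25]   # diagonal S, so Theta_alpha and Theta_beta are independent
alpha, beta = [0, 1], [2]     # hypothetical partition of the parameter indices
x = -1.2

jac = jacobian(x, m)
var_beta = sum(jac[i] ** 2 * s_diag[i] for i in beta)      # J_beta S_beta J_beta^T
var_total = sum(j * j * s for j, s in zip(jac, s_diag))    # J S J^T

rng = random.Random(0)
R = 5000
component_means = []
for _ in range(R):
    # Sample Theta_alpha ~ N(m_alpha, S_alpha) and evaluate the alpha part
    # of the linearization, J_alpha (Theta_alpha - m_alpha).
    f_alpha = sum(jac[i] * (rng.gauss(m[i], math.sqrt(s_diag[i])) - m[i])
                  for i in alpha)
    component_means.append(f(x, m) + f_alpha)


def mixture_pdf(y):
    """Equal-weight mixture of R Gaussians with shared variance var_beta."""
    return sum(normal_pdf(y, mu, var_beta) for mu in component_means) / R


def exact_pdf(y):
    """Exact density of the linearized output under Gaussian parameters."""
    return normal_pdf(y, f(x, m), var_total)
```

Evaluating `mixture_pdf` and `exact_pdf` at the same point gives values that agree up to Monte Carlo error, which shrinks as `R` grows.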

Appendix B Further Empirical Results

B.1 Tabular Results for Diabetic Retinopathy Diagnosis Tasks

The results below were reproduced from Band et al. [2021] using the retina benchmark.

Table 3: Country Shift. Prediction and uncertainty quality of baseline methods in terms of the area under the receiver operating characteristic curve (AUC) and classification accuracy, as a function of the proportion of data referred to a medical expert. All methods are tuned on in-domain validation AUC, and ensembles have $K=3$ constituent models (true for all subsequent tables unless specified otherwise). On in-domain data, mc dropout performs best across all thresholds. On distributionally shifted data, no method consistently performs best.

EyePACS Dataset (In-Domain)
	No Referral		$50\%$ Data Referred		$70\%$ Data Referred
Method	AUC (%) $\uparrow$	Accuracy (%) $\uparrow$	AUC (%) $\uparrow$	Accuracy (%) $\uparrow$	AUC (%) $\uparrow$	Accuracy (%) $\uparrow$
map (Deterministic)	$87.4{\scriptstyle\pm 1.3}$	$88.6{\scriptstyle\pm 0.7}$	$91.1{\scriptstyle\pm 1.8}$	$95.9{\scriptstyle\pm 0.4}$	$94.9{\scriptstyle\pm 1.1}$	$96.5{\scriptstyle\pm 0.3}$
mfvi	$83.3{\scriptstyle\pm 0.2}$	$85.7{\scriptstyle\pm 0.1}$	$85.5{\scriptstyle\pm 0.7}$	$94.5{\scriptstyle\pm 0.1}$	$88.2{\scriptstyle\pm 0.7}$	$95.9{\scriptstyle\pm 0.1}$
radial-mfvi	$83.2{\scriptstyle\pm 0.5}$	$74.2{\scriptstyle\pm 5.0}$	$88.9{\scriptstyle\pm 0.9}$	$81.8{\scriptstyle\pm 6.0}$	$91.2{\scriptstyle\pm 1.3}$	$83.8{\scriptstyle\pm 5.5}$
fsvi	$88.5{\scriptstyle\pm 0.1}$	$89.8{\scriptstyle\pm 0.0}$	$91.0{\scriptstyle\pm 0.4}$	$96.4{\scriptstyle\pm 0.0}$	$94.3{\scriptstyle\pm 0.3}$	$97.2{\scriptstyle\pm 0.1}$
mc dropout	$91.4{\scriptstyle\pm 0.2}$	$90.9{\scriptstyle\pm 0.1}$	$95.3{\scriptstyle\pm 0.2}$	$97.4{\scriptstyle\pm 0.1}$	$97.4{\scriptstyle\pm 0.1}$	$98.1{\scriptstyle\pm 0.0}$
rank-1	$85.6{\scriptstyle\pm 1.4}$	$87.7{\scriptstyle\pm 0.8}$	$87.1{\scriptstyle\pm 2.3}$	$95.3{\scriptstyle\pm 0.5}$	$90.9{\scriptstyle\pm 2.0}$	$96.4{\scriptstyle\pm 0.4}$
deep ensemble	$90.3{\scriptstyle\pm 0.2}$	$90.3{\scriptstyle\pm 0.3}$	$91.7{\scriptstyle\pm 0.6}$	$97.2{\scriptstyle\pm 0.0}$	$95.0{\scriptstyle\pm 0.5}$	$97.9{\scriptstyle\pm 0.0}$
mfvi ensemble	$85.4{\scriptstyle\pm 0.0}$	$87.8{\scriptstyle\pm 0.0}$	$86.3{\scriptstyle\pm 0.4}$	$95.4{\scriptstyle\pm 0.0}$	$89.2{\scriptstyle\pm 0.4}$	$96.7{\scriptstyle\pm 0.1}$
radial-mfvi ensemble	$84.9{\scriptstyle\pm 0.1}$	$74.2{\scriptstyle\pm 1.5}$	$91.4{\scriptstyle\pm 0.2}$	$83.4{\scriptstyle\pm 1.7}$	$93.3{\scriptstyle\pm 0.3}$	$85.9{\scriptstyle\pm 1.6}$
fsvi ensemble	$90.3{\scriptstyle\pm 0.1}$	$90.6{\scriptstyle\pm 0.0}$	$92.1{\scriptstyle\pm 0.2}$	$97.1{\scriptstyle\pm 0.0}$	$95.2{\scriptstyle\pm 0.2}$	$97.8{\scriptstyle\pm 0.1}$
mc dropout ensemble	$\mathbf{92.5{\scriptstyle\pm 0.0}}$	$\mathbf{91.6{\scriptstyle\pm 0.0}}$	$\mathbf{95.8{\scriptstyle\pm 0.1}}$	$\mathbf{97.8{\scriptstyle\pm 0.0}}$	$\mathbf{97.7{\scriptstyle\pm 0.1}}$	$\mathbf{98.4{\scriptstyle\pm 0.0}}$
rank-1 ensemble	$89.5{\scriptstyle\pm 0.8}$	$89.3{\scriptstyle\pm 0.4}$	$88.5{\scriptstyle\pm 1.3}$	$96.9{\scriptstyle\pm 0.3}$	$91.6{\scriptstyle\pm 1.2}$	$97.6{\scriptstyle\pm 0.3}$
APTOS 2019 Dataset (Population Shift)
map (Deterministic)	$92.2{\scriptstyle\pm 0.2}$	$86.2{\scriptstyle\pm 0.6}$	$80.1{\scriptstyle\pm 3.6}$	$87.6{\scriptstyle\pm 1.5}$	$55.4{\scriptstyle\pm 4.3}$	$85.4{\scriptstyle\pm 1.2}$
mfvi	$91.4{\scriptstyle\pm 0.2}$	$84.1{\scriptstyle\pm 0.3}$	$93.8{\scriptstyle\pm 0.4}$	$92.1{\scriptstyle\pm 0.5}$	$93.0{\scriptstyle\pm 0.6}$	$92.7{\scriptstyle\pm 0.5}$
radial-mfvi	$90.7{\scriptstyle\pm 0.7}$	$71.8{\scriptstyle\pm 4.6}$	$82.0{\scriptstyle\pm 2.5}$	$81.5{\scriptstyle\pm 2.7}$	$66.4{\scriptstyle\pm 2.1}$	$85.9{\scriptstyle\pm 1.0}$
fsvi	$94.1{\scriptstyle\pm 0.1}$	$87.6{\scriptstyle\pm 0.5}$	$90.6{\scriptstyle\pm 0.9}$	$90.7{\scriptstyle\pm 0.7}$	$77.2{\scriptstyle\pm 4.6}$	$89.8{\scriptstyle\pm 0.3}$
mc dropout	$94.0{\scriptstyle\pm 0.2}$	$86.8{\scriptstyle\pm 0.2}$	$87.4{\scriptstyle\pm 0.3}$	$88.1{\scriptstyle\pm 0.2}$	$65.3{\scriptstyle\pm 1.7}$	$88.2{\scriptstyle\pm 0.4}$
rank-1	$92.5{\scriptstyle\pm 0.3}$	$86.2{\scriptstyle\pm 0.5}$	$90.1{\scriptstyle\pm 2.5}$	$91.4{\scriptstyle\pm 1.1}$	$75.1{\scriptstyle\pm 7.8}$	$89.5{\scriptstyle\pm 1.5}$
deep ensemble	$94.2{\scriptstyle\pm 0.2}$	$87.5{\scriptstyle\pm 0.1}$	$91.2{\scriptstyle\pm 1.9}$	$92.4{\scriptstyle\pm 0.9}$	$67.4{\scriptstyle\pm 7.3}$	$90.1{\scriptstyle\pm 1.2}$
mfvi ensemble	$93.2{\scriptstyle\pm 0.1}$	$87.0{\scriptstyle\pm 0.2}$	$\mathbf{94.9{\scriptstyle\pm 0.3}}$	$\mathbf{93.7{\scriptstyle\pm 0.3}}$	$\mathbf{94.2{\scriptstyle\pm 0.3}}$	$\mathbf{94.0{\scriptstyle\pm 0.3}}$
radial-mfvi ensemble	$91.8{\scriptstyle\pm 0.2}$	$69.0{\scriptstyle\pm 1.9}$	$78.6{\scriptstyle\pm 0.6}$	$79.8{\scriptstyle\pm 0.9}$	$60.9{\scriptstyle\pm 0.3}$	$86.7{\scriptstyle\pm 0.2}$
fsvi ensemble	$\mathbf{94.6{\scriptstyle\pm 0.1}}$	$\mathbf{88.9{\scriptstyle\pm 0.2}}$	$90.7{\scriptstyle\pm 0.5}$	$91.1{\scriptstyle\pm 0.6}$	$74.1{\scriptstyle\pm 3.4}$	$89.8{\scriptstyle\pm 0.2}$
mc dropout ensemble	$94.1{\scriptstyle\pm 0.1}$	$87.6{\scriptstyle\pm 0.1}$	$86.8{\scriptstyle\pm 0.2}$	$88.0{\scriptstyle\pm 0.2}$	$62.3{\scriptstyle\pm 0.4}$	$87.7{\scriptstyle\pm 0.2}$
rank-1 ensemble	$94.1{\scriptstyle\pm 0.2}$	$88.3{\scriptstyle\pm 0.2}$	$\mathbf{94.9{\scriptstyle\pm 0.4}}$	$93.5{\scriptstyle\pm 0.3}$	$92.4{\scriptstyle\pm 1.5}$	$93.8{\scriptstyle\pm 0.3}$

B.2 UCI Regression

Table 4: Comparison of the predictive performance of the method proposed in this paper and the method proposed by Sun et al. [2019] on six datasets from the UCI database. We followed the same training protocol as Sun et al. [2019] and used the code provided by the authors to load and process the data. The same network architecture was used (one hidden layer with 50 hidden units). We report the results for the best set of hyperparameters, computed over ten random seeds. Lower RMSE and higher log-likelihood are better. Best results are shown in bold. The first five rows are small-scale UCI experiments, and the sixth row (“Protein”) is a larger-scale experiment (45,740 data points).

Dataset	RMSE		Log-Likelihood
	Sun et al. [2019]	Ours	Sun et al. [2019]	Ours
Boston	$\mathbf{2.378\pm 0.104}$	$3.632\pm 0.515$	$\mathbf{-2.301\pm 0.038}$	$-3.150\pm 0.495$
Concrete	$4.935\pm 0.180$	$\mathbf{4.177\pm 0.443}$	$-3.096\pm 0.016$	$\mathbf{-2.855\pm 0.116}$
Energy	$0.412\pm 0.017$	$\mathbf{0.409\pm 0.060}$	$-0.684\pm 0.020$	$\mathbf{-0.539\pm 0.138}$
Wine	$0.673\pm 0.014$	$\mathbf{0.615\pm 0.033}$	$-1.040\pm 0.013$	$\mathbf{-0.959\pm 0.034}$
Yacht	$0.607\pm 0.068$	$\mathbf{0.514\pm 0.242}$	$-1.033\pm 0.033$	$\mathbf{-0.888\pm 0.334}$
Protein	$4.326\pm 0.019$	$\mathbf{4.248\pm 0.043}$	$-2.892\pm 0.004$	$\mathbf{-2.866\pm 0.009}$

Appendix C Illustrative Examples

C.1 Two Moons Classification Task

C.2 Synthetic 1D Regression Datasets

Appendix D Implementation, Training, and Evaluation Details

D.1 Hyperparameter Selection Protocol

For fsvi, we used a holdout validation set (10% of the training set) to conduct a hyperparameter search over the prior variance, the number of context points used to evaluate the KL divergence, the context distribution, and the number of Monte Carlo samples used to evaluate the expected log-likelihood. We selected the set of hyperparameters that yielded the highest validation log-likelihood for all experiments. We state the hyperparameters selected for the different datasets below.

For other methods, we used a holdout validation set of the same size and selected the best-performing hyperparameters. We used implementations provided by the authors of mfvi (radial) and swag. All other methods were implemented from scratch unless stated otherwise.

D.2 FashionMNIST vs. MNIST/NotMNIST

We train all models on the FashionMNIST dataset and evaluate their predictive uncertainty on out-of-distribution data from the MNIST dataset. Both datasets consist of images of size $28\times 28$ pixels. The FashionMNIST dataset is normalized to have zero mean and a standard deviation of one. The MNIST dataset is normalized with the same transformation, that is, using the same mean and standard deviation used for the in-distribution data. We chose FashionMNIST/MNIST instead of MNIST/NotMNIST because the latter is notably easier than the former.
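The normalization convention described above can be sketched as follows. This is an illustration only: the helper names `fit_normalizer` and `normalize` and the pixel values are hypothetical, and images are represented as flat lists of pixel values for brevity. The point is that the statistics are estimated once on the in-distribution training data and then reused unchanged for the out-of-distribution data.

```python
import math


def fit_normalizer(train_pixels):
    """Mean and standard deviation of the in-distribution training pixels."""
    n = len(train_pixels)
    mean = sum(train_pixels) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in train_pixels) / n)
    return mean, std


def normalize(pixels, mean, std):
    """Apply a fixed affine normalization; the statistics are not re-estimated."""
    return [(p - mean) / std for p in pixels]


# Hypothetical flattened pixel values standing in for the two datasets.
train_pixels = [0.0, 2.0, 4.0, 6.0]   # in-distribution data
ood_pixels = [3.0, 8.0]               # out-of-distribution data

mean, std = fit_normalizer(train_pixels)
train_normalized = normalize(train_pixels, mean, std)
ood_normalized = normalize(ood_pixels, mean, std)
```

After this transformation the training data has zero mean and unit standard deviation, whereas the out-of-distribution data generally does not.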

In this experiment, a network architecture with two convolutional layers of 32 and 64 $3\times 3$ filters and a fully-connected final layer of 128 hidden units is used. A max pooling operation is placed after each convolutional layer and ReLU activations are used. We do not use batch normalization. All models are trained for 30 epochs with a mini-batch size of 128 using SGD with a learning rate of $5\times 10^{-3}$, momentum (with momentum parameter 0.9), and a cosine learning rate schedule with parameter $0.05$.

For fsvi with $p_{\mathbf{X}_{\mathcal{C}}}=$ random monochrome, we sampled 50% of the context points for each gradient step from the mini-batch and the other 50% according to the method described in Section D.8. For fsvi with $p_{\mathbf{X}_{\mathcal{C}}}=$ KMNIST, we used the KMNIST dataset.

D.3 CIFAR-10 vs. SVHN

We train all models on the CIFAR-10 dataset and evaluate their predictive uncertainty on out-of-distribution data from the SVHN dataset. Both datasets consist of images of size $32\times 32\times 3$, with RGB channels. The CIFAR-10 dataset is normalized to have zero mean and a standard deviation of one. The SVHN dataset is normalized with the same transformation, that is, using the same mean and standard deviation used for the in-distribution data. The training data is augmented with random horizontal flips (with a probability of 0.5) and random crops (4 zero pixels on all sides).
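A minimal sketch of this augmentation is given below. It assumes the common reading of the description, namely that the image is padded with 4 zero pixels on every side, cropped back to its original size at a random offset, and then flipped horizontally with probability 0.5. The function name `augment` and the nested-list image layout (channels, rows, pixels) are hypothetical.

```python
import random


def augment(image, rng, pad=4, flip_prob=0.5):
    """Random crop after zero padding, followed by a random horizontal flip.

    `image` is a list of channels, each channel a list of rows of pixel values.
    The same crop offset and flip decision are applied to every channel.
    """
    height, width = len(image[0]), len(image[0][0])
    top = rng.randint(0, 2 * pad)    # inclusive bounds: offsets 0..2*pad
    left = rng.randint(0, 2 * pad)
    flip = rng.random() < flip_prob
    out = []
    for channel in image:
        # Zero-pad the channel by `pad` pixels on all four sides.
        padded = [[0.0] * (width + 2 * pad) for _ in range(pad)]
        padded += [[0.0] * pad + list(row) + [0.0] * pad for row in channel]
        padded += [[0.0] * (width + 2 * pad) for _ in range(pad)]
        # Crop back to the original resolution at the sampled offset.
        crop = [row[left:left + width] for row in padded[top:top + height]]
        if flip:
            crop = [row[::-1] for row in crop]
        out.append(crop)
    return out


# Usage on a hypothetical 3-channel 32x32 image filled with ones.
rng = random.Random(0)
image = [[[1.0] * 32 for _ in range(32)] for _ in range(3)]
augmented = augment(image, rng)
```

The output has the same shape as the input, and any zeros it contains come from the padding.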

In this experiment, a standard ResNet-18 network architecture is used. All models are trained for 200 epochs with a mini-batch size of 128 using SGD with a learning rate of $5\times 10^{-3}$, momentum (with momentum parameter 0.9), and a cosine learning rate schedule with parameter $0.05$.

For fsvi with $p_{\mathbf{X}_{\mathcal{C}}}=$ random monochrome, we sampled 50% of the context points for each gradient step from the mini-batch and the other 50% according to the method described in Section D.8. For fsvi with $p_{\mathbf{X}_{\mathcal{C}}}=$ CIFAR-100, we used the CIFAR-100 dataset.

D.4 Diabetic Retinopathy Diagnosis

Prediction and Expert Referral.

In real-world settings where the evaluation data may be sampled from a shifted distribution, incorrect predictions may become increasingly likely. To account for that possibility, predictive uncertainty estimates can be used to identify datapoints where the likelihood of an incorrect prediction is particularly high and refer them for further review. We consider a corresponding selective prediction task, where the predictive performance of a given model is evaluated for varying expert referral rates. That is, for a given referral rate of $\gamma\in[0,1]$, a model’s predictive uncertainty is used to identify the $\gamma$ proportion of images in the evaluation set for which the model’s predictions are most uncertain. Those images are referred to a medical professional for further review, and the model is assessed on its predictions on the remaining $(1-\gamma)$ proportion of images. By repeating this process for all possible referral rates and assessing the model’s predictive performance on the retained images, we estimate how reliable it would be in a safety-critical downstream task, where predictive uncertainty estimates are used in conjunction with human expertise to avoid harmful predictions. Importantly, selective prediction tolerates out-of-distribution examples. For example, even if unfamiliar features appear in certain images, a model with reliable uncertainty estimates will perform better in selective prediction by assigning these images high epistemic (and predictive) uncertainty, so that they are referred to an expert at a lower referral rate $\gamma$.
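The referral procedure can be sketched in a few lines. The code below is an illustration rather than the benchmark implementation: the function name `selective_accuracy` and the toy uncertainty and correctness values are hypothetical, and accuracy stands in for whichever metric is evaluated on the retained examples.

```python
def selective_accuracy(uncertainties, correct, referral_rate):
    """Accuracy on the examples retained after referring the most uncertain ones."""
    n = len(uncertainties)
    num_referred = int(round(referral_rate * n))
    # Sort example indices from least to most uncertain and keep the
    # first n - num_referred of them; the rest are referred to an expert.
    order = sorted(range(n), key=lambda i: uncertainties[i])
    retained = order[: n - num_referred]
    if not retained:
        return float("nan")  # everything was referred
    return sum(correct[i] for i in retained) / len(retained)


# Hypothetical per-example predictive uncertainties and correctness flags.
uncertainties = [0.05, 0.40, 0.10, 0.90, 0.20, 0.70]
correct = [1, 0, 1, 0, 1, 1]

curve = {rate: selective_accuracy(uncertainties, correct, rate)
         for rate in (0.0, 0.5)}
```

In this toy example the accuracy rises from 4/6 without referral to 1.0 when half of the examples are referred, because both misclassified examples are among the three most uncertain ones.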

For all methods, experiments are performed using a ResNet-50 network architecture. Training and evaluation scripts as well as model checkpoints can be found at

github.com/google/uncertainty-baselines/.../diabetic_retinopathy_detection.

D.5 Two Moons

In this experiment, we use a multi-layer perceptron (MLP) consisting of two fully-connected layers with 30 hidden units each and tanh activations. We train all models with a learning rate of $10^{-3}$.

For fsvi, we sampled context points uniformly from $[-10,10]\times[-10,10]$.

D.6 1D Regression

In this experiment, we use a multi-layer perceptron (MLP) consisting of two fully-connected layers with 100 hidden units each and ReLU activations.

For fsvi, we sampled context points uniformly from $[-10,10]$.

D.7 Further Implementation Details

We use the Adam optimizer with default settings of $\beta_{1}=0.9$, $\beta_{2}=0.99$, and $\epsilon=10^{-8}$ for all experiments. The deterministic neural networks that were used for the ensemble were trained with a weight decay of $\lambda=10^{-1}$. mfvi (tempered) was trained with a KL scaling factor of 0.1 to obtain a cold posterior.

D.8 Selection of Context Distribution

We estimate the supremum by sampling a set of context points $\mathbf{X}_{\mathcal{C}}$ from a distribution $p_{\mathbf{X}_{\mathcal{C}}}$ at every gradient step. For tasks with image inputs, we define $p_{\mathbf{X}_{\mathcal{C}}}$ as a uniform distribution over images with monochromatic channels. To generate a sample from this “monochrome images” distribution, we first take all images in the training data, flatten each channel, and stack the flattened channels into a single vector per channel. We then draw a random element (i.e., a pixel) from each channel vector and generate a monochrome image of a given resolution by setting every pixel of each channel to the value drawn for that channel. For regression tasks with a $D$-dimensional input space, $p_{\mathbf{X}_{\mathcal{C}}}$ is defined as a uniform distribution with lower and upper bounds set to the empirical lower and upper bounds of the training data. For further details on the effect of different sampling schemes on the posterior predictive distribution’s performance, see Appendix B.
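The two context distributions can be sketched as follows. This is one possible reading of the description and not the code used in the experiments: the function names `sample_monochrome_image` and `sample_uniform_context` and the nested-list data layout are hypothetical, and the sketch assumes that one pixel value is drawn per channel from the pooled training pixels of that channel.

```python
import random


def sample_monochrome_image(train_images, height, width, rng):
    """Draw one image in which every channel is constant.

    `train_images` is a list of images, each a list of channels, each channel
    a list of rows of pixel values.
    """
    num_channels = len(train_images[0])
    image = []
    for c in range(num_channels):
        # Flatten channel c of every training image into one long vector.
        channel_pixels = [p for img in train_images for row in img[c] for p in row]
        value = rng.choice(channel_pixels)  # one random pixel from this vector
        image.append([[value] * width for _ in range(height)])
    return image


def sample_uniform_context(train_inputs, num_points, rng):
    """Uniform samples within the per-dimension empirical bounds of the inputs."""
    dims = range(len(train_inputs[0]))
    lows = [min(x[d] for x in train_inputs) for d in dims]
    highs = [max(x[d] for x in train_inputs) for d in dims]
    return [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
            for _ in range(num_points)]


rng = random.Random(0)

# Two hypothetical 3-channel 2x2 training images.
train_images = [
    [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.5], [0.6, 0.6]], [[0.9, 0.8], [0.7, 0.6]]],
    [[[0.0, 0.1], [0.2, 0.3]], [[0.4, 0.4], [0.5, 0.5]], [[1.0, 0.9], [0.8, 0.7]]],
]
context_image = sample_monochrome_image(train_images, height=4, width=4, rng=rng)

# Hypothetical 2-dimensional regression inputs.
train_inputs = [[-1.0, 0.0], [2.0, 5.0], [0.5, 3.0]]
context_points = sample_uniform_context(train_inputs, num_points=10, rng=rng)
```

Both samplers are cheap enough to be called at every gradient step, and the resolution of the monochrome image is a free argument.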

D.9 Compute Resources

All experiments were carried out on an NVIDIA V100 GPU with 32GB of memory.
