Vincent Fortuin
Department of Computer Science
ETH Zürich
Zürich, Switzerland
arXiv:2105.06868v1 [stat.ML] 14 May 2021
While the choice of prior is one of the most critical parts of the Bayesian infer-
ence workflow, recent Bayesian deep learning models have often fallen back on
uninformative priors, such as standard Gaussians. In this review, we highlight the
importance of prior choices for Bayesian deep learning and present an overview of
different priors that have been proposed for (deep) Gaussian processes, variational
autoencoders, and Bayesian neural networks. We also outline different methods
of learning priors for these models from data. We hope to motivate practitioners
in Bayesian deep learning to think more carefully about the prior specification for
their models and to provide them with some inspiration in this regard.
1 Introduction
Bayesian models have enjoyed stable popularity in data analysis [1] and machine learning [2].
In recent years especially, interest in combining these models with deep learning has surged.¹ The
main idea of Bayesian modeling is to infer a posterior distribution over the parameters θ of the
model given some observed data D using Bayes’ theorem [3, 4] as
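$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta} \tag{1}$$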
where p(D|θ) is the likelihood, p(D) is the marginal likelihood (or evidence), and p(θ) is the prior.
The prior can often be parameterized by hyperparameters ψ, in which case we will write it as p(θ; ψ)
if we want to highlight this dependence. This posterior can then be used to model new unseen data
D∗ using the posterior predictive
$$p(\mathcal{D}^* \mid \mathcal{D}) = \int p(\mathcal{D}^* \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta \tag{2}$$
The integral in Eq. (2) is also called the Bayesian model average, because it averages the predic-
tions of all plausible models weighted by their posterior probability. This is in contrast to standard
maximum-likelihood learning, where only one parameter θ ∗ is used for the predictions as
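$$p(\mathcal{D}^* \mid \mathcal{D}) \approx p(\mathcal{D}^* \mid \theta^*) \quad \text{with} \quad \theta^* = \arg\max_{\theta}\, p(\mathcal{D} \mid \theta) \tag{3}$$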
While much previous work has focused on the properties of the posterior predictive [5, 6], the
approximation of the integrals in Eq. (1) and Eq. (2) [7–9], or the use of the marginal likelihood
for Bayesian model selection [10, 11], we want to use this survey to shed some light on the often-
neglected term in Eq. (1): the prior p(θ).
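To make Eqs. (1) to (3) concrete, the following minimal sketch uses a Beta-Bernoulli coin-flip model. This toy model is not taken from the survey; it is chosen here only because its posterior and posterior predictive are available in closed form. The function names and the hyperparameters `a` and `b` (which play the role of ψ) are illustrative assumptions.

```python
def posterior_params(heads, tails, a=1.0, b=1.0):
    """Posterior Beta(a + heads, b + tails) for a Bernoulli likelihood
    with a Beta(a, b) prior, i.e. Eq. (1) in closed form."""
    return a + heads, b + tails


def bma_predictive(heads, tails, a=1.0, b=1.0):
    """Posterior predictive p(next flip is heads | D), i.e. Eq. (2).
    For a Beta posterior the integral equals the posterior mean."""
    a_post, b_post = posterior_params(heads, tails, a, b)
    return a_post / (a_post + b_post)


def mle_predictive(heads, tails):
    """Plug-in predictive of Eq. (3) with theta* = heads / (heads + tails)."""
    return heads / (heads + tails)


# Three heads and no tails: the maximum-likelihood plug-in is certain of heads,
# whereas the Bayesian model average keeps some probability mass on tails.
p_mle = mle_predictive(3, 0)
p_flat = bma_predictive(3, 0, a=1.0, b=1.0)      # uniform prior
p_strong = bma_predictive(3, 0, a=10.0, b=10.0)  # prior concentrated near 0.5
```

With so few observations, the two Bayesian predictions differ noticeably from each other, which illustrates how strongly the prior can shape the predictive in the small-data regime.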
In orthodox Bayesianism, the prior should be chosen in a way such that it accurately reflects our
beliefs about the parameters θ before seeing any data [12]. This has been described as being the
most crucial part of Bayesian model building, but also the hardest one, since it is often not trivial to
map the subjective beliefs of the practitioner unambiguously onto tractable probability distributions
[13]. However, in practice, choosing the prior is often seen as a nuisance, and there have been
many attempts to avoid having to choose a meaningful prior, for instance, through objective
priors [14, 15], empirical Bayes [16], or combinations of the two [17]. Especially in Bayesian deep
learning, it is common practice to choose a (seemingly) “uninformative” prior, such as a standard
Gaussian [cf. 18].
¹ As attested, for instance, by the growing interest in the Bayesian Deep Learning workshop at NeurIPS.
This trend is troubling, because choosing a bad prior can have detrimental consequences for the
whole inference endeavor. While the choice of uninformative (or weakly informative) priors is often
motivated by invoking the asymptotic consistency guarantees of the Bernstein-von-Mises
theorem [19], this theorem does not in fact hold in many applications, since its regularity conditions
are not satisfied [20]. Moreover, in the non-asymptotic regime of our practical inferences, the prior
can have a strong influence on the posterior, often forcing the probability mass onto arbitrary sub-
spaces of the parameter space, such as a spherical subspace in the case of the seemingly innocuous
standard Gaussian prior [21].
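The concentration of a standard Gaussian prior on a spherical shell can be checked numerically. The sketch below is an illustration added here, not code from [21]: it draws samples θ ~ N(0, I) in d dimensions and measures their Euclidean norms. The dimension, the sample count, and the helper name are arbitrary choices.

```python
import math
import random


def gaussian_prior_norms(dim, num_samples, seed=0):
    """Euclidean norms of samples theta ~ N(0, I) in `dim` dimensions."""
    rng = random.Random(seed)
    norms = []
    for _ in range(num_samples):
        theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norms.append(math.sqrt(sum(t * t for t in theta)))
    return norms


dim = 1000
norms = gaussian_prior_norms(dim, num_samples=200)
mean_norm = sum(norms) / len(norms)
# Spread of the sampled norms relative to sqrt(dim). It shrinks as dim grows,
# so almost all prior mass lies in a thin shell far away from the origin.
relative_spread = (max(norms) - min(norms)) / math.sqrt(dim)
```

The norms cluster around √d, so a parameter vector near the origin is extremely unlikely under this prior even though the density is highest there.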
Worse yet, prior misspecification can undermine the very properties that compel us to use Bayesian
inference in the first place. For instance, marginal likelihoods can become meaningless under prior
misspecification, leading us to choose suboptimal models when using Bayesian model selection
[22]. Moreover, de Finetti’s famous Dutch book argument [23] can be extended to cases where we
can be convinced to take wagers that lose money in expectation when using bad priors, which even
holds for the aforementioned objective (Jeffreys) priors [24]. In a similar vein, Savage’s theorem
[25], which promises us optimal decisions under Bayesian decision theory, breaks down under prior
misspecification [26]. Finally, it can even be shown that PAC-Bayesian inference can exceed the
Bayesian one in terms of generalization performance when the prior is misspecified [27, 28].
On a more optimistic note, the no-free-lunch theorem [29] states that no learning algorithm is univer-
sally superior, or in other words, that different learning algorithms outperform each other on different
datasets. Applied to Bayesian learning, this means that there is also no universally preferred prior,
but that each task is potentially endowed with its own optimal prior. Finding (or at least approxi-
mating) this optimal prior then offers the potential for significantly improving the performance of
the inference or even enabling successful inference in cases where it otherwise would not have been possible.
All these observations should at least motivate us to think a bit more carefully about our priors
than is often done in practice. But do we really have reason to believe that the commonly used
priors in Bayesian deep learning are misspecified? One recent piece of evidence is the fact that in
Bayesian linear models, it can be shown that prior misspecification leads to the necessity to temper
the posterior for optimal performance (i.e., use a posterior $p_T(\theta \mid \mathcal{D}) \propto p(\theta \mid \mathcal{D})^{1/T}$ for some T < 1)
[30]. And indeed, this need for posterior tempering has also been observed empirically in modern
Bayesian deep learning models [e.g., 31–33, 18].
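As a small worked example of a tempered posterior, i.e. the posterior density raised to the power 1/T and renormalized, consider a conjugate Gaussian model with known noise variance. This toy setting is assumed here for illustration and is not one of the models studied in [30]. Raising a Gaussian density N(m, v) to the power 1/T and renormalizing gives N(m, T v), so T < 1 sharpens the posterior around its mean. All names below are hypothetical.

```python
def gaussian_posterior(ys, prior_var=1.0, noise_var=1.0):
    """Posterior N(mean, var) over theta for observations y_i ~ N(theta, noise_var)
    and a zero-mean Gaussian prior theta ~ N(0, prior_var)."""
    n = len(ys)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * sum(ys) / noise_var
    return post_mean, post_var


def temper(post_mean, post_var, temperature):
    """Tempered posterior proportional to N(mean, var)^(1/T), which is N(mean, T * var)."""
    return post_mean, temperature * post_var


ys = [0.9, 1.1, 1.0]
post_mean, post_var = gaussian_posterior(ys)
cold_mean, cold_var = temper(post_mean, post_var, temperature=0.5)
```

In this conjugate case tempering leaves the posterior mean untouched and only rescales the variance; in non-Gaussian models the effect on the shape of the posterior is less simple.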
Based on all these insights, it is thus high time that we critically reflect upon our choices of priors
in Bayesian deep learning models. Luckily for us, there are many alternative priors that we could
choose over the standard uninformative ones. This survey shall attempt to provide an overview
of them. We will review existing prior designs for (deep) Gaussian processes in Section 2, for
variational autoencoders in Section 3, and for Bayesian neural networks in Section 4. We will then
finish by giving some brief outline of methods for learning priors from data in Section 5.
This prior is called a Gaussian process because it has the property that when evaluating the function
at any finite set of points x, the function values f := f (x) are distributed as p(f ) = N (mx , Kxx ),
where mx = mψ (x) is the vector of mean function outputs, the (i, j)’th element of the kernel
matrix Kxx is given by kψ (xi , xj ), and the d-dimensional multivariate Gaussian N (f ; µ, Σ) is
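$$p(\mathbf{f}) = \mathcal{N}(\mu, \Sigma) := \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left(-\frac{1}{2} (\mathbf{f} - \mu)^\top \Sigma^{-1} (\mathbf{f} - \mu)\right) \tag{5}$$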
The Gaussian process can also be seen as an infinite-dimensional version of this multivariate Gaus-
sian distribution, following the Kolmogorov extension theorem [36].
This model is often combined with a Gaussian observation likelihood p(y | f ) = N (f , σ 2 I), since
it then allows for a closed-form posterior inference [35] on unseen data points (x∗ , y ∗ ) as
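$$p(y^* \mid x^*, x, y) = \mathcal{N}(m^*, K^*) \quad \text{with} \tag{6}$$
$$m^* = m_{x^*} + K_{x^* x} \left(K_{xx} + \sigma^2 I\right)^{-1} (y - m_x)$$
$$K^* = K_{x^* x^*} - K_{x^* x} \left(K_{xx} + \sigma^2 I\right)^{-1} K_{x x^*} + \sigma^2 I$$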
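The closed-form predictive in Eq. (6) can be sketched in a few lines of NumPy. The RBF kernel, the zero mean function, the noise variance, and the toy data below are assumptions made for illustration; the survey does not prescribe them.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_posterior(x, y, x_star, noise_var=0.1, mean_fn=lambda x: np.zeros_like(x)):
    """Predictive mean m* and covariance K* of Eq. (6) for a Gaussian likelihood."""
    K_xx = rbf_kernel(x, x)
    K_sx = rbf_kernel(x_star, x)
    K_ss = rbf_kernel(x_star, x_star)
    A = K_xx + noise_var * np.eye(len(x))
    # Solve linear systems instead of forming the inverse of A explicitly.
    m_star = mean_fn(x_star) + K_sx @ np.linalg.solve(A, y - mean_fn(x))
    K_star = K_ss - K_sx @ np.linalg.solve(A, K_sx.T) + noise_var * np.eye(len(x_star))
    return m_star, K_star


x = np.array([-1.0, 0.0, 1.0])
y = np.sin(x)
x_star = np.array([0.0, 3.0])   # one test point inside the data, one far away
m_star, K_star = gp_posterior(x, y, x_star)
```

The predictive variance at the distant test point stays close to the prior variance plus the noise, while it shrinks near the training inputs. How quickly it grows with distance is governed by the kernel, which is one way in which the prior choice shows up in the predictions.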
While these models are not deep per se, there are many ways in which they connect to Bayesian deep
learning, which merits their appearance in this survey. In the following, we are going to present how
GP priors can be parameterized by deep neural networks (Section 2.1), how GPs can be stacked to
build deeper models (Section 2.2), and how deep neural networks can themselves turn into GPs or
be approximated by GPs (Section 2.3).
tent functions {f1 , . . . , fk } with function outputs {f1 , . . . , fk } and latent variables {z1 , . . . , zk−1 },
where each function uses the previous latent variable as its inputs, that is, fi+1 = fi+1 (zi ) and
f1 = f1 (x). In the simplest case, all these latent GPs still have Gaussian latent likelihoods
$p(z_i \mid \mathbf{f}_i) = \mathcal{N}(\mathbf{f}_i, \sigma_i^2 I)$ and a Gaussian output likelihood $p(y \mid \mathbf{f}_k) = \mathcal{N}(\mathbf{f}_k, \sigma_k^2 I)$. If each of
these functions is endowed with a GP prior p(fi ) = GP(mψi (·), kψi (·, ·)), this model is called a
deep Gaussian process [50]. Similarly to deep neural networks, these models can represent increas-
ingly complex distributions with increasing depth, but unlike neural networks, they still offer a fully
Bayesian treatment. Crucially, in contrast to standard GPs, deep GPs can model a larger class of
output distributions [51], which includes distributions with non-Gaussian marginals [52]. For in-
creased flexibility, these models can also be coupled with warping functions between the GP layers
[53]. Moreover, they can be combined with the convolutional GP kernels mentioned above, to yield
models that are similar in spirit to deep CNNs [54, 55].
While these models seem to be strictly superior and preferable to standard GPs, their additional flex-
ibility comes at a price: the posterior inference is not tractable in closed form anymore. This means
that the posterior has to be estimated using approximate inference techniques, such as variational
inference [50, 56], expectation propagation [57], or amortized inference [58]. A very popular ap-
proximate inference technique for GPs is based on so-called inducing points, which are chosen to
be a subset of the training points or generally of the training domain [59–63]. This technique can
also be extended to inference in deep GPs [50, 64] or replaced by variational random features [65].
In contrast to the inference techniques, the choice of priors for deep GPs has generally been under-
studied. While a deep GP as a whole can model a rather complex prior over functions, the priors
for the single layers in terms of mψi and kψi are often chosen to be quite simple, for instance,
RBF kernels with different lengthscales [50]. Another option to build models with more expres-
sive kernels is to actually parameterize the kernel of a GP with another GP [66]. Particularly, the
(hierarchical) prior is then $p(f) = \mathcal{GP}(m_\psi(\cdot), \hat{k}(\cdot, \cdot))$ with $\hat{k}(x, x') = \mathrm{FT}^{-1}(\exp s(x - x'))$ and
$p(s) = \mathcal{GP}(0, k_\psi(\cdot, \cdot))$, where $\mathrm{FT}^{-1}$ is the inverse Fourier transform. This can also be seen as
a deep GP with one hidden layer, and it also does not allow closed-form inference, but relies on
approximate inference, for instance, using elliptical slice sampling [66].
Another way to connect GPs to DNNs is via neural network limits. It has been known for some
time now that the function-space prior p(f ) induced by a Bayesian neural network (BNN) with a
single hidden layer and any independent finite-variance parameter prior p(θ) converges in the limit
of infinite width to a GP, due to the central limit theorem [67, 68]. The limiting GP prior is given by
p(f ) = GP(0, kNN (·, ·)) with
$$k_{\mathrm{NN}}(x, x') = \sigma_{w_2}^2\, \mathbb{E}_{w, b}\left[\varphi(w^\top x + b)\, \varphi(w^\top x' + b)\right] + \sigma_{b_2}^2 \tag{8}$$
where $w \sim \mathcal{N}(0, \sigma_{w_1}^2 I)$ and $b \sim \mathcal{N}(0, \sigma_{b_1}^2 I)$,
with the prior weight and bias variances $\sigma_{w_1}^2, \sigma_{b_1}^2$ in the first layer, $\sigma_{w_2}^2, \sigma_{b_2}^2$ in the second layer, and
nonlinear activation function ϕ(·). Note that here it is usually assumed that the weight variances are
set as $\sigma_{w_i}^2 \propto 1/n_i$, where $n_i$ is the number of units in the i’th layer. The kernel kNN (·, ·) is then
called the neural network GP (NNGP) kernel. This result has recently been extended to BNNs with
ReLU activations [69] and deep BNNs [70–72], where the lower layer GP kernel takes the same
form as above and the kernel for the higher layers assumes the recursive form
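$$k_{\mathrm{NN}}^{\ell}(x, x') = \sigma_{w_\ell}^2\, \mathbb{E}_{(z_1, z_2) \sim \mathcal{N}(0, K_{xx'}^{\ell-1})}\left[\varphi(z_1)\, \varphi(z_2)\right] + \sigma_{b_\ell}^2 \tag{9}$$
with $K_{xx'}^{\ell-1}$ being the 2 × 2 kernel matrix of the (ℓ − 1)’th layer kernel evaluated at x and x′.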
Moreover, these convergence results can also be shown for convolutional BNNs [73, 74], even with
weight correlations [75], and for attention neural networks [76].
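The expectation in the single-hidden-layer kernel of Eq. (8) rarely needs to be derived by hand for a first experiment, since it can be estimated by Monte Carlo. The sketch below does this for a tanh activation. The default variances, the sample count, and the function name are assumptions for illustration, and the second-layer weight variance is treated as already including any width scaling.

```python
import numpy as np


def nngp_kernel_mc(x1, x2, sigma_w1=1.0, sigma_b1=1.0, sigma_w2=1.0, sigma_b2=0.1,
                   phi=np.tanh, num_samples=100_000, seed=0):
    """Monte Carlo estimate of k_NN(x1, x2) from Eq. (8)."""
    rng = np.random.default_rng(seed)
    d = x1.shape[0]
    # First-layer weights and biases drawn from their Gaussian priors.
    w = rng.normal(0.0, sigma_w1, size=(num_samples, d))
    b = rng.normal(0.0, sigma_b1, size=num_samples)
    h1 = phi(w @ x1 + b)
    h2 = phi(w @ x2 + b)
    return sigma_w2**2 * np.mean(h1 * h2) + sigma_b2**2


x_a = np.array([1.0, 0.0])
x_b = np.array([0.0, 1.0])
k_aa = nngp_kernel_mc(x_a, x_a)
k_ab = nngp_kernel_mc(x_a, x_b)
k_ba = nngp_kernel_mc(x_b, x_a)
k_bb = nngp_kernel_mc(x_b, x_b)
```

Because the same seed is reused for every call, the four estimates share their random draws and the resulting 2 × 2 Gram matrix is symmetric and positive semi-definite by construction. For some activations, closed-form expressions exist and are preferable to sampling.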
While these results only hold for independent finite-variance priors, they can be extended to de-
pendent priors, where they yield GPs that are marginalized over a hyperprior [77], and to infinite-
variance priors, where they lead to α-stable processes [78]. Excitingly, it has been shown that this
convergence of the BNN prior to a stochastic process also implies the convergence of the posterior
under mild regularity assumptions [79]. While these results have typically been derived manually,
the recent theoretical framework of tensor programs allows one to rederive them in a unified way,
including for recurrent architectures and batch-norm [80–82]. Moreover, it allows one to derive limits for
networks where only a subset of the layers converge to infinite width, which recovers the models’
ability to learn latent features [83].
GP limits arise not only for infinitely wide BNNs, but also for infinitely wide standard
DNNs. Crucially, however, in this case, the GP arises not as a function-space prior at initialization,
but as a model of training under gradient descent [84, 85]. Specifically, neural networks under
gradient descent training can be shown to follow the kernel gradient of their functional loss with
respect to the so-called neural tangent kernel (NTK) which is
kNTK (x, x′ ) = Jθ (x)Jθ (x′ )⊤ (10)
where Jθ (x) is the Jacobian of the neural network with respect to the parameters θ evaluated at
input x. In the limit of infinite width, this kernel becomes stable over training and can be recursively
computed as
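$$k_{\mathrm{NTK}}^{\ell}(x, x') = k_{\mathrm{NTK}}^{\ell-1}(x, x')\, \dot{\Sigma}^{\ell}(x, x') + \Sigma^{\ell}(x, x') \quad \text{with} \tag{11}$$
$$k_{\mathrm{NTK}}^{1}(x, x') = \Sigma^{1}(x, x') = \frac{1}{n_0}\, x^\top x' + 1$$
$$\Sigma^{\ell}(x, x') = \mathbb{E}_{(z_1, z_2) \sim \mathcal{N}(0, K_{xx'}^{\ell-1})}\left[\varphi(z_1)\, \varphi(z_2)\right]$$
$$\dot{\Sigma}^{\ell}(x, x') = \mathbb{E}_{(z_1, z_2) \sim \mathcal{N}(0, K_{xx'}^{\ell-1})}\left[\dot{\varphi}(z_1)\, \dot{\varphi}(z_2)\right]$$
where $n_0$ is the number of inputs, $K_{xx'}^{\ell-1}$ is again the kernel matrix of the NTK at the previous layer,
$\dot{\varphi}(\cdot)$ is the derivative of the activation function, and the $\Sigma^{\ell}$ are so-called activation kernels.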
In the case of finite width, this kernel will not model the training behavior exactly, but there exist
approximate corrections [86]. Interestingly, this same kernel can also be derived from approximate
inference in the neural network, leading to an implicit linearization [87]. This linearization can also
be made explicit and can then be used to improve the performance of BNN predictives [88] and
for fast domain adaptation in multi-task learning [89]. Moreover, when using the NTK in a kernel
machine, such as a support vector machine, it can outperform the original neural network it was
derived from, at least in the small-data regime [90]. Similarly to the aforementioned NNGP kernels,
the NTKs for different architectures can also be rederived using the framework of tensor programs
[80, 91] and there exist practical Python packages for the efficient computation of NNGP kernels
and NTKs [92]. Finally, it should be noted that this linearization of neural networks has also been
linked to the scaling of the parameters and described as lazy training, which has been argued to be
inferior to standard neural network training [93].
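For a finite network, the NTK of Eq. (10) can be computed directly from the parameter Jacobian. The sketch below does so for a one-hidden-layer tanh network with scalar output and hand-derived gradients. The architecture, the initialization, and all function names are assumptions for illustration; this is not the API of the packages mentioned in [92].

```python
import numpy as np


def forward(x, W, b, v):
    """Scalar output of a one-hidden-layer tanh network f(x) = v^T tanh(W x + b)."""
    return v @ np.tanh(W @ x + b)


def jacobian(x, W, b, v):
    """Gradient of f(x) with respect to all parameters, flattened as (W, b, v)."""
    h = np.tanh(W @ x + b)
    dh = 1.0 - h**2             # derivative of tanh at the pre-activations
    dW = np.outer(v * dh, x)    # df/dW, shape (n_hidden, n_in)
    db = v * dh                 # df/db
    dv = h                      # df/dv
    return np.concatenate([dW.ravel(), db, dv])


def empirical_ntk(xs, W, b, v):
    """Finite-width kernel matrix with entries J(x) J(x')^T as in Eq. (10)."""
    J = np.stack([jacobian(x, W, b, v) for x in xs])
    return J @ J.T


rng = np.random.default_rng(0)
n_in, n_hidden = 2, 64
W = rng.normal(size=(n_hidden, n_in))
b = rng.normal(size=n_hidden)
v = rng.normal(size=n_hidden) / np.sqrt(n_hidden)
xs = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
ntk = empirical_ntk(xs, W, b, v)
```

At finite width this kernel depends on the random initialization and changes during training; the stability described above only emerges in the infinite-width limit.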
choices, which we will explore in the following. Particularly, we will look at some proper probability
distributions that can directly replace the standard Gaussian (Section 3.1), at some structural priors
that also require changes to the architecture (Section 3.2), and finally at a particularly interesting
VAE model with idiosyncratic architecture and prior, namely the neural process (Section 3.3).
where µ is again the mean, κ the concentration parameter, and Γ(·) is the Gamma function. Since
the Gamma function is easier to evaluate than the modified Bessel function, this density allows for
closed form evaluation and reparameterizable sampling. Empirically, it yields the same performance
in VAEs as the vMF prior, while being numerically more stable [99].
Another type of prior is the mixture prior [100–102], typically a mixture of Gaussians of the form
$$p(z) = \sum_{i=1}^{K} \pi_i\, \mathcal{N}(\mu_i, \sigma_i^2 I) \quad \text{with} \quad \sum_{i=1}^{K} \pi_i = 1 \tag{15}$$
with K mixture components where πi are the mixture weights that are often set to πi = 1/K
in the prior. These priors have been motivated by the idea that the data might consist of clusters,
which should also be disjoint in the latent space [100], and they have been shown to outperform
many other clustering methods on challenging datasets [102]. However, similarly to many other
clustering methods, one challenge is to choose the number of clusters K a priori. This can also be
optimized automatically, for instance by specifying a stick-breaking or Dirichlet process hyperprior
[103], albeit at the cost of more involved inference.
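A mixture-of-Gaussians prior with isotropic components, as in Eq. (15), is straightforward to sample from and to evaluate. The sketch below uses uniform mixture weights and hand-picked component means; these values and the helper names are assumptions for illustration only.

```python
import numpy as np


def mog_log_prob(z, means, sigmas, weights):
    """Log density of a mixture of isotropic Gaussians at a latent vector z."""
    d = z.shape[0]
    sq_dist = np.sum((z - means) ** 2, axis=1)  # squared distance to each mean
    log_comp = -0.5 * sq_dist / sigmas**2 - 0.5 * d * np.log(2 * np.pi * sigmas**2)
    a = np.log(weights) + log_comp
    # Log-sum-exp over components for numerical stability.
    return np.max(a) + np.log(np.sum(np.exp(a - np.max(a))))


def mog_sample(means, sigmas, weights, rng):
    """Pick a component according to the weights, then sample from its Gaussian."""
    i = rng.choice(len(weights), p=weights)
    return means[i] + sigmas[i] * rng.standard_normal(means.shape[1])


num_components = 3
means = np.array([[-2.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
sigmas = np.full(num_components, 0.5)
weights = np.full(num_components, 1.0 / num_components)
rng = np.random.default_rng(0)
z_sample = mog_sample(means, sigmas, weights, rng)
log_p = mog_log_prob(z_sample, means, sigmas, weights)
```

In a VAE, the component means and scales would typically be learned jointly with the encoder and decoder rather than fixed by hand as they are here.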
Finally, most of these priors assume independence between data points. If we have prior knowledge
about potential similarity between data points and we can encode it into a kernel function, a Gaussian
process can be a powerful prior for a VAE [104–106]. The prior is usually defined as
p(Z) = N (0, Kzz ) (16)
where Z = (z1 , . . . , zn ) is the matrix of latent variables and Kzz is again the kernel matrix with
(i, j)’th element k(zi , zj ) for some suitable kernel function k(·, ·). These models have been shown
to excel at conditional generation [104], time series modeling [106], missing data imputation [105],
and disentanglement [107, 108]. It should be noted that this comes at additional computational cost
compared to standard VAEs, since it requires the $O(n^3)$ inversion of the kernel matrix (see Eq. (6)).
However, this operation can be made more scalable, either through the use of inducing point methods
[109, 110] (cf. Section 2.2) or through factorized kernels [111]. Moreover, depending on the prior
knowledge of the generative process, these models can also be extended to use additive GP priors
[112] or tensor-valued ones [113].
In contrast to the distributional priors discussed above, we will use the term structural priors to
refer to priors that do not only change the actual prior distribution p(z) in the VAE model, but also
the model architecture itself. Some of these structural priors are extensions of the distributional
priors mentioned above. For instance, the aforementioned Gaussian mixture priors can be extended
with a mixture-of-experts decoder, that is, a factorized generative likelihood, where each factor only
depends on one of the latent mixture components [102]. Another example are the Gaussian process
priors, which are defined over the whole latent dataset Z and thus benefit from a modified encoder
(i.e., inference network), which encodes the complete dataset X jointly [105].
In addition to these distributional priors with modified architectures, there are also structural priors
which could not be realized with the standard VAE architecture. One example are hierarchical priors
[114–116], such as
$$p(z_1, \ldots, z_K) = p(z_1) \prod_{i=2}^{K} p(z_i \mid z_{i-1}) \tag{17}$$
$$\text{or} \quad p(z_1, \ldots, z_K) = p(z_1) \prod_{i=2}^{K} p(z_i \mid z_1, \ldots, z_{i-1}) \tag{18}$$
We see here that instead of having a single latent variable z, these models feature K different latent
variables {z1 , . . . , zK }, which depend on each other hierarchically. These models require additional
generative networks to parameterize the conditional probabilities in Eq. (17), which then enable
them to better model data with intrinsically hierarchical features [114, 115] and to reach state-of-
the-art performance in image generation with VAEs [116].
Another type of structural priors are discrete latent priors, such as the VQ-VAE prior [117]
$$p(z_q) = \frac{1}{|E|} \quad \text{with} \quad z_q = \arg\min_{e \in E} \lVert z_e - e \rVert_2^2 \tag{19}$$
where E is a finite dictionary of prototypes and ze is a continuous latent variable that is then dis-
cretized to zq . Crucially, the prior is not placed over the continuous ze , but over the discrete zq ,
namely as a uniform prior over the dictionary E. These discrete latent variables can then be saved
very cheaply and thus lead to much stronger compression than standard VAEs [117]. When combin-
ing these models with the hierarchical latent variables described above, they can also reach competi-
tive image generation performance [118]. Moreover, these discrete latent variables can be extended
to include neighborhood structures such as self-organizing maps [119], leading to more interpretable
latent representations that can also be used for clustering [120–122].
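The discretization step in Eq. (19) amounts to a nearest-neighbor lookup in the dictionary E. The sketch below shows that lookup together with the uniform prior over dictionary entries. The codebook values and function names are hypothetical, and the training procedure of the VQ-VAE is not covered.

```python
import numpy as np


def quantize(z_e, codebook):
    """Return the index and the codebook vector nearest to z_e, as in Eq. (19)."""
    sq_dist = np.sum((codebook - z_e) ** 2, axis=1)
    idx = int(np.argmin(sq_dist))
    return idx, codebook[idx]


def uniform_prior_log_prob(codebook):
    """Log probability of any single codebook entry under the uniform prior."""
    return -np.log(len(codebook))


codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([0.9, 0.8])                  # continuous encoder output
idx, z_q = quantize(z_e, codebook)
bits_per_latent = np.log2(len(codebook))    # cost of storing one index
```

Storing only the index of each latent is what makes the representation so compact: each latent costs log2 |E| bits regardless of its dimensionality.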
To conclude this section, we will look at a structural VAE prior that has spawned a lot of inter-
est in recent years and thus deserves its own subsection: the neural process (NP). This model has
been independently proposed under the names of partialVAE [123] and (conditional) neural process
[124, 125], but the latter nomenclature has caught on in the literature. The main novelty of this VAE
architecture is that it not only models the distribution of one type of observed variable x, but of
two variables (x, y), which can be split into a context and target set (x, y) = (xc , yc ) ∪ (xt , yt ).
These sets are conditionally independent given z, that is, p(x, y | z) = p(xc , yc | z) p(xt , yt | z).
This then allows one to infer an unobserved yt based on the other variables using a variational approximation
q(z | xc , yc ) and the conditional likelihood p(yt | z, xt ). Thus, the model can be used for
missing data imputation [123] and regression [125] tasks. Note that, since the likelihood is typically
conditioned on xt instead of just on z, this model can be framed as a conditional VAE [126].
One remarkable feature of this model is its prior, namely
p(z) = p(z | xc , yc ) ≈ q(z | xc , yc ) (20)
This means that instead of using an unconditional prior p(z) for the full posterior p(z | x, y), a
part of the data (the context set) is used to condition the prior, which is in turn approximated by
the variational posterior with reduced conditioning set. While this is atypical for classical Bayesian
inference and generally frowned upon by orthodox Bayesians, it bears resemblance to the data-dependent
oracle priors that can be used in PAC-Bayesian bounds and have been shown to make
those bounds tighter [127, 128].
The NP model has been heavily inspired by stochastic processes (hence the name) and has been
shown to constitute a stochastic process itself under some assumptions [125]. Moreover, when the
conditional likelihood p(yt | z, xt ) is chosen to be an affine transformation, the model is actually
equivalent to a Gaussian process with neural network kernel [129].
Since their inception, NP models have been extended in expressivity in different ways, both in
terms of their inference and their generative model. On the inference side, there are attentive NPs
[130], which endow the encoder with self-attention (and thus make it Turing complete [131]), and
convolutional (conditional) NPs [132, 133], which add translation equivariance to the model. On the
generative side, there are functional NPs [134], which introduce dependence between the predictions
by learning a relational graph structure over the latents z, and Gaussian NPs [135], which achieve a
similar property by replacing the generative likelihood with a Gaussian process, the mean and kernel
of which are inferred based on the latents.
where M is the mean matrix, U and V are the row and column covariances, and tr[·] is the trace
operator. These matrix-valued Gaussians can then also be used as variational distributions, leading
to increased performance compared to isotropic Gaussians on many tasks [147].
Another way to improve the expressiveness of Gaussian priors is to combine them with hierarchical
hyperpriors [148, 149], which has already been proposed in early work on BNNs [136] as
$$p(\theta) = \int \mathcal{N}(\mu, \Sigma)\, p(\Sigma)\, d\Sigma \tag{22}$$
where p(Σ) is a hyperprior over the covariance. An example of such a hyperprior is the inverse
Wishart distribution [e.g., 43], which is in d dimensions given by
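$$p(\Sigma) = \mathcal{IW}_d(\nu, K) = \frac{(\det K)^{\frac{\nu+d-1}{2}}\, (\det \Sigma)^{-\frac{\nu+2d}{2}} \exp\left(-\frac{1}{2} \operatorname{tr}\left[K \Sigma^{-1}\right]\right)}{2^{\frac{(\nu+d-1)d}{2}}\, \Gamma_d\left(\frac{\nu+d-1}{2}\right)} \tag{23}$$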
where ν are the degrees of freedom and K is the mean of p(Σ). When marginalizing the prior
in Eq. (22) over the hyperprior in Eq. (23), it turns out that one gets a d-dimensional multivariate
Student-t distribution with ν degrees of freedom [150], namely
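$$p(\theta) = \frac{\Gamma\left(\frac{\nu+d}{2}\right)}{((\nu-2)\pi)^{\frac{d}{2}}\, \Gamma\left(\frac{\nu}{2}\right)}\, (\det K)^{-\frac{1}{2}} \left(1 + \frac{(\theta - \mu)^\top K^{-1} (\theta - \mu)}{\nu - 2}\right)^{-\frac{\nu+d}{2}} \tag{24}$$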
Such distributions have been shown to model the predictive variance more flexibly in stochastic
processes [150] and BNNs [43]. Moreover, in BNNs, it has been shown that priors like these,
which are heavy-tailed (also including Laplace priors [151]) and allow for weight correlations, can
decrease the cold posterior effect [18], suggesting that they are less misspecified than isotropic
Gaussians. Finally, when using Student-t priors, it has been shown that one can obtain expressive
BNN posteriors even when forcing the posterior mean of the weights to be zero [152], which
highlights the flexibility of these distributions.
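The heavy tails that arise from marginalizing a Gaussian over a hyperprior on its scale can be illustrated in one dimension. The sketch below uses the textbook Gaussian scale-mixture construction of the Student-t distribution, with a Gamma-distributed precision. This parameterization is an assumption chosen for simplicity and differs from the inverse Wishart form in Eqs. (23) and (24); the function names are hypothetical.

```python
import math
import random


def sample_student_t(nu, rng):
    """Draw theta ~ Student-t(nu) as a Gaussian scale mixture:
    precision ~ Gamma(shape=nu/2, rate=nu/2), theta | precision ~ N(0, 1/precision)."""
    precision = rng.gammavariate(nu / 2.0, 2.0 / nu)  # second argument is the scale
    return rng.gauss(0.0, 1.0 / math.sqrt(precision))


def tail_fraction(samples, threshold=3.0):
    """Fraction of samples whose absolute value exceeds the threshold."""
    return sum(abs(s) > threshold for s in samples) / len(samples)


rng = random.Random(0)
student_samples = [sample_student_t(3.0, rng) for _ in range(20000)]
gauss_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
student_tail = tail_fraction(student_samples)
gauss_tail = tail_fraction(gauss_samples)
```

The hierarchical prior places far more mass beyond three units from the origin than the standard Gaussian does, which is the sense in which such priors are heavy-tailed.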
Another hierarchical Gaussian prior is the horseshoe prior [153], which is
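$$p(\theta_i) = \mathcal{N}(0, \tau^2 \sigma_i^2) \quad \text{with} \quad p(\tau) = \mathcal{C}^+(0, b_0) \quad \text{and} \quad p(\sigma_i) = \mathcal{C}^+(0, b_1) \tag{25}$$
where $b_0$ and $b_1$ are scale parameters and $\mathcal{C}^+$ is the half-Cauchy distribution
$$p(\sigma) = \mathcal{C}^+(\mu, b) = \begin{cases} \dfrac{2}{\pi b \left(1 + \frac{(\sigma - \mu)^2}{b^2}\right)} & \text{if } \sigma \geq \mu \\ 0 & \text{otherwise} \end{cases} \tag{26}$$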
In BNNs, the horseshoe prior can encourage sparsity [154] and enable interpretable feature selection
[155]. It can also be used to aid compression of the neural network weights [156]. Moreover,
in application areas such as genomics, where prior knowledge about the signal-to-noise ratio is
available, this knowledge can be encoded in such sparsity-inducing hierarchical priors [157].
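Sampling from the horseshoe prior of Eq. (25) only requires half-Cauchy draws for the global and local scales, followed by a Gaussian draw per weight. The sketch below uses inverse-CDF sampling for the half-Cauchy with location zero. The function names, the default scales, and the seed are assumptions for illustration.

```python
import math
import random


def half_cauchy_pdf(sigma, mu=0.0, b=1.0):
    """Density of the half-Cauchy distribution C+(mu, b) from Eq. (26)."""
    if sigma < mu:
        return 0.0
    return 2.0 / (math.pi * b * (1.0 + ((sigma - mu) / b) ** 2))


def sample_half_cauchy(scale, rng):
    """Inverse-CDF sample from C+(0, scale), using F(s) = (2 / pi) * arctan(s / scale)."""
    return scale * math.tan(0.5 * math.pi * rng.random())


def sample_horseshoe(num_weights, b0=1.0, b1=1.0, seed=0):
    """Draw weights theta_i ~ N(0, tau^2 sigma_i^2) with half-Cauchy scales."""
    rng = random.Random(seed)
    tau = sample_half_cauchy(b0, rng)  # global scale shared by all weights
    weights = []
    for _ in range(num_weights):
        sigma_i = sample_half_cauchy(b1, rng)  # local scale of this weight
        weights.append(rng.gauss(0.0, tau * sigma_i))
    return weights


theta = sample_horseshoe(num_weights=1000)
```

The shared global scale shrinks all weights together, while the heavy-tailed local scales let individual weights escape that shrinkage, which is the mechanism behind the sparsity mentioned above.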
Another interesting prior is the radial-directional prior, which disentangles the direction of the
weight vector from its length [158]. It is given by
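$$\theta = \theta_r\, \theta_d \quad \text{with} \quad \theta_r \sim p_{\mathrm{rad}}(\theta_r) \quad \text{and} \quad \theta_d \sim p_{\mathrm{dir}}(\theta_d) \tag{27}$$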
where pdir is a distribution over the d-dimensional unit sphere and prad is a distribution over R. It
has been proposed by Oh et al. [158] to use the von-Mises-Fisher distribution (see Eq. (13)) for
pdir and the half-Cauchy (see Eq. (26)) for prad . Conversely, Farquhar et al. [159] suggest using a
Gaussian for prad and a uniform distribution over the unit sphere for pdir , which they reparameterize
by sampling from a standard Gaussian and normalizing the sampled vectors to unit length. It should
be noted that the idea of the radial-directional prior is related to the Goldilocks zone hypothesis,
which says that there exists an annulus at a certain distance from the origin which has a particularly
high density of high-performing weight vectors [160].
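The variant described above, with a Gaussian radial part and a uniform directional part obtained by normalizing a standard Gaussian sample, can be sketched as follows. This is an illustration of the sampling scheme as summarized in the text, not the implementation of [159]; the function name and the radial scale are hypothetical.

```python
import math
import random


def sample_radial_directional(dim, radial_scale=1.0, rng=None):
    """Sample theta = theta_r * theta_d with a Gaussian radius and a uniform direction."""
    rng = rng or random.Random(0)
    # Direction: normalize a standard Gaussian vector to unit length.
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    direction = [v / norm for v in g]
    # Radius: a scalar Gaussian, independent of the direction.
    radius = rng.gauss(0.0, radial_scale)
    return [radius * v for v in direction], radius, direction


theta, radius, direction = sample_radial_directional(dim=100)
theta_norm = math.sqrt(sum(t * t for t in theta))
```

In contrast to an isotropic Gaussian over all coordinates, the length of the sampled weight vector here does not grow with the dimension, because it is controlled by the scalar radius alone.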
In terms of even more expressive priors, it has been proposed to model the parameters in terms of
the units of the neural network instead of the weights themselves [161]. The weight θij between
units i and j would then have the prior
$$p(\theta_{ij}) = g(z_i, z_j, \epsilon) \quad \text{with} \quad p(z) = p(\epsilon) = \mathcal{N}(0, I) \tag{28}$$
where the function g can be either parameterized by a neural network [161] or by a Gaussian process
[162]. A similarly implicit model, with even more flexibility, has been proposed by Atanov et al.
[163] and is simply given by
$$p(\theta) = g(z, \epsilon) \quad \text{with} \quad p(z) = p(\epsilon) = \mathcal{N}(0, I) \tag{29}$$
In both of these priors, the main challenge is to choose the function g. Since this is hard to do
manually, the function is usually (meta-)learned (see Section 5.3).
As we saw, there are many different weight space priors that one can choose for Bayesian neural
networks. However, choosing the right one can be challenging, since we often have better intuitions
about the functions we would expect rather than the parameters themselves. The trouble is then that
the mapping from parameters to functions in neural networks is highly non-trivial due to their many
weight-space symmetries [164] and complex function-space geometries [165]. This has led to an
alternative approach to prior specification in BNNs, namely to specify the priors directly in function
space, such that
$$\int \delta(\phi(\cdot\,; \theta))\, p(\theta \mid \mathcal{D})\, d\theta \approx p(f \mid \mathcal{D}) \propto p(\mathcal{D} \mid f)\, p(f) \tag{30}$$
where p(f ) is the function-space prior, φ(· ; θ) is the function implemented by a neural network with
parameters θ and δ(·) is the Dirac delta measure (in function space).
As we have seen before (c.f., Section 2), Gaussian processes offer an excellent model class to
encode functional prior knowledge through the choice of kernel and mean functions, that is,
p(f ) = GP(m(·), k(·, ·)). It is thus a natural idea to use GP priors as function-space priors for
BNNs. If one applies this idea in the most straightforward way, one can just optimize a posterior
that now depends on the KL divergence between the BNN posterior and the GP prior. However,
since this KL is defined in an infinite-dimensional space, it requires approximations, such as Stein
kernel gradient estimators [166]. Alternatively, one can first optimize a weight-space distribution on
a BNN to minimize the KL divergence with the desired GP prior (e.g., using Monte Carlo estimates)
and then use this optimized weight prior as the BNN prior during inference [167].
While both of these approaches seem reasonable at first sight, it has been discovered that GP and
BNN function-space distributions do not actually have the same support and that the true KL diver-
gence is thus infinite (or undefined) [168]. It has therefore recently been proposed to use the Wasser-
stein distance instead, although this also requires approximations [169]. If one wants to forego the
need for a well-defined divergence, one can also use a hypernetwork [170, 171] as an implicit dis-
tribution of BNN weights and then train the network to match the GP samples on a certain set of
function outputs [172]. Finally, it has recently been discovered that the ridgelet transform [173] can
be used to approximate GP function-space distributions with BNN weight-space distributions [174].
As a side note, the reverse can actually be achieved more easily, namely fitting
a GP to the outputs of a BNN [175], which can also be of interest in certain applications.
If one does not want to use a GP prior in function space, one can still encode useful functional prior
knowledge into BNN priors. For instance, through the study of the infinite-width limits of BNNs
(see Section 2.3), one finds that the activation function of the network has a strong influence on the
functions being implemented and one can, for instance, modulate the smoothness or periodicity of
the BNN output by choosing different activation functions [176]. Moreover, one can directly define
priors over the BNN outputs, which can encode strong prior assumptions about the values that the
functions are allowed to take in certain parts of the input space [177], that is,
p(θ) = pbase (θ) D(φ(Cx ; θ), Cy ) =⇒ p(θ | D) ∝ p(D | θ) D(φ(Cx ; θ), Cy ) pbase (θ) (31)
where pbase (θ) is some base prior in weight space, (Cx , Cy ) are the inputs and outputs in terms of
which the functional constraint is defined and D(·, ·) is a discrepancy function. We see that these
priors on output constraints end up looking like additional likelihood terms in the posterior and can
thus help to encourage specific features of the output function, for instance, to ensure safety features
in critical applications. A similar idea is that of noise-contrastive priors, which are also specified in function
space directly through a prior over unseen data p(D̃) [178], which yields the prior predictive
$$p(\mathcal{D}^*) = \iint p(\mathcal{D}^* \mid \theta)\, p(\theta)\, p(\tilde{\mathcal{D}})\, d\theta\, d\tilde{\mathcal{D}} \tag{32}$$
This prior can encode the belief that the epistemic uncertainty should grow away from the in-
distribution data and can thus also lead to more GP-like behavior in BNN posteriors. Finally, if
we have the prior belief that the BNN functions should not be much more complex than the ones
of a different function class (e.g., shallower or even linear models), we can use this other class as a
functional reference prior and thus regularize the predictive complexity of the model [179].
Deep neural network ensembles, or deep ensembles, are a frequentist method similar to the bootstrap
[180] that has been used to gain uncertainty estimates in neural networks [181]. However, it has been
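recently argued that these ensembles actually approximate the BNN posterior predictive [139], that is,
$$p(\mathcal{D}^* \mid \mathcal{D}) = \int p(\mathcal{D}^* \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta \approx \frac{1}{K} \sum_{i=1}^{K} p(\mathcal{D}^* \mid \theta_i) \tag{33}$$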
where θi are the weights of K independently trained ensemble members of the same architecture.
These models can also be extended to ensembles with different hyperparameters [182], thus also
approximating a hierarchical hyperposterior. Moreover, they can be made more parameter-efficient
by sharing certain parameters between ensemble members [183], which can then also be used for
approximate BNN inference [144]. While these models have performed well in many practical tasks
[6], they can still severely overfit in some scenarios [184], leading to ill-calibrated uncertainties
[185]. However, it has been shown recently that each ensemble member can be combined with a
random function that is sampled from a function-space prior [186, 187], and that this can indeed
yield uncertainties that are conservative with respect to the Bayesian ones [188]. More specifically,
the uncertainties of such ensembles are with high probability at least as large as the ones from a
Gaussian process with the corresponding NNGP kernel (see Section 2.3). These results can also be
extended to the NTK [189].
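As a minimal sketch of the averaging in Eq. (33), assume a classification setting in which each of the K ensemble members outputs class logits. The random logits below are a hypothetical stand-in for trained networks, and the function names are ours.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def ensemble_predictive(member_logits):
    """Average the member predictive distributions, as in Eq. (33).

    member_logits: array of shape (K, N, C) holding the class logits of
    K independently trained members on N test inputs with C classes.
    Returns an (N, C) array of averaged class probabilities.
    """
    member_probs = softmax(member_logits)    # p(D* | theta_i) for each member
    return member_probs.mean(axis=0)         # (1/K) * sum_i p(D* | theta_i)

# Hypothetical stand-in for trained networks: random logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4, 3))          # K=5 members, N=4 inputs, C=3 classes
probs = ensemble_predictive(logits)
```

Note that the average is taken over probabilities rather than logits, which is what makes the result a mixture of the member predictives.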
Another way of making these deep ensembles more Bayesian and incorporating priors is to use
particle-based approximate inference methods, such as Stein variational gradient descent (SVGD) [190]. In
SVGD, the ensemble members (or particles) are updated according to
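$$\theta_i \leftarrow \theta_i + \eta\, \phi(\theta_i) \quad \text{with} \quad \phi(\theta_i) = \sum_{j=1}^{K} \left[ k(\theta_i, \theta_j)\, \nabla_{\theta_j} \log p(\theta_j \mid \mathcal{D}) - \nabla_{\theta_i} k(\theta_i, \theta_j) \right] \tag{34}$$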
where η is a step-size and k(·, ·) is a kernel function in weight space. With the right step-size
schedule, this update rule converges asymptotically to the true posterior [191] and even enjoys some
non-asymptotic guarantees [192]. Moreover, note that it only requires sample-based access to the
gradient of the log posterior (and thus also the log prior), which allows it to be used with different
weight-space priors [193] and even function-space priors, such as GPs [194].
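The SVGD update in Eq. (34) can be sketched as follows, assuming an RBF kernel with a fixed bandwidth and a hypothetical one-dimensional Gaussian posterior as the target. The update is averaged over particles, as in the original SVGD formulation, which only rescales the step size η. All names are illustrative.

```python
import numpy as np

def rbf_kernel(theta, bandwidth):
    """Return k(theta_i, theta_j) and its gradient with respect to theta_i."""
    diff = theta[:, None, :] - theta[None, :, :]          # (K, K, D): theta_i - theta_j
    kern = np.exp(-(diff ** 2).sum(-1) / (2.0 * bandwidth ** 2))
    grad_kern = -diff / bandwidth ** 2 * kern[:, :, None]
    return kern, grad_kern

def svgd_step(theta, grad_log_post, step_size, bandwidth=1.0):
    """One SVGD update of all K particles (rows of theta)."""
    kern, grad_kern = rbf_kernel(theta, bandwidth)
    drive = kern @ grad_log_post(theta)       # sum_j k(theta_i, theta_j) * score(theta_j)
    repulse = -grad_kern.sum(axis=1)          # -sum_j grad_{theta_i} k(theta_i, theta_j)
    phi = (drive + repulse) / theta.shape[0]  # averaged over particles (assumption)
    return theta + step_size * phi

def grad_log_post(theta):
    """Score of a hypothetical toy posterior p(theta | D) = N(2, 1)."""
    return -(theta - 2.0)

rng = np.random.default_rng(0)
particles = rng.normal(size=(50, 1))          # K=50 particles initialized around 0
for _ in range(1000):
    particles = svgd_step(particles, grad_log_post, step_size=0.1)
```

Only the score function enters the update, so swapping the prior amounts to changing `grad_log_post`. The repulsive kernel-gradient term is what keeps the particles from collapsing onto the mode.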
5 Learning Priors
So far, we have explored different types of distributions and methods to encode our prior knowledge
into Bayesian deep learning models. But what if we do not have any useful prior knowledge to
encode? While orthodox Bayesianism would prescribe an uninformative prior in such a case [15,
1], there are alternative ways to elicit priors, namely by learning them from data. If we go
the traditional route of Bayesian model selection using the marginal likelihood (the term p(D) in
Eq. (1)), we can choose a functional form p(θ; ψ) for the prior and optimize its hyperparameters ψ
with respect to this quantity. This is called empirical Bayes [16] or type-II maximum likelihood (ML-
II) estimation [35]. While there are reasons to be worried about overfitting in such a setting, there
are also arguments that the marginal likelihood automatically trades off the goodness of fit with the
model complexity and thus leads to model parsimony in the spirit of Occam’s razor principle [195].
In the case where we have previously solved tasks that are related to the task at hand (so-called
meta-tasks), we can alternatively also rely on the framework of learning to learn [196, 197] or
meta-learning [198]. If we apply this idea to learning priors for Bayesian models in a hierarchical
Bayesian way, we arrive at Bayesian meta-learning [199–202]. This can then also be extended to
modern gradient-based methods [203–205].
While these ML-II optimization and Bayesian meta-learning ideas can in principle be used to learn
hyperparameters for most of the priors discussed above, we will briefly review some successful
examples of their application below. Following the general structure from above, we will explore
learning priors for Gaussian processes (Section 5.1), variational autoencoders (Section 5.2), and
Bayesian neural networks (Section 5.3).
5.1 Learning GP priors
Following the idea of ML-II optimization, we can use the marginal likelihood to select hyperparam-
eters for the mean and kernel functions of GPs. Conveniently, the marginal likelihood for GPs (with
Gaussian observation likelihood) is available in closed form as
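$$\begin{aligned} p_\psi(y \mid x) &= \int p(y \mid f, x)\, \mathcal{GP}(m_\psi(\cdot), k_\psi(\cdot, \cdot))\, df \\ \log p_\psi(y \mid x) &= -\frac{1}{2} \left[ (y - m(x))^\top (K_{xx} + \sigma^2 I)^{-1} (y - m(x)) + \log\det(K_{xx} + \sigma^2 I) + N \log 2\pi \right] \end{aligned} \tag{35}$$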
with N being the number of data points, Kxx the kernel matrix on the data points, and σ 2 the noise
of the observation likelihood. We can see that the first term measures the goodness of fit, while the
second term (the log determinant of the kernel matrix) measures the complexity of the model and
thus incorporates the Occam’s razor principle [35].
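To make the two terms concrete, the following is a minimal sketch of the log marginal likelihood in Eq. (35) and of ML-II selection of a lengthscale. It assumes a zero mean function, a unit-variance RBF kernel, a fixed noise variance, and a simple grid search in place of gradient-based optimization. The toy data are hypothetical.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale):
    """Unit-variance RBF kernel matrix for 1D inputs."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

def gp_log_marginal_likelihood(x, y, lengthscale, noise_var):
    """Log marginal likelihood of a zero-mean GP with Gaussian noise."""
    n = x.shape[0]
    cov = rbf_kernel(x, x, lengthscale) + noise_var * np.eye(n)
    chol = np.linalg.cholesky(cov)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    data_fit = y @ alpha                              # y^T (K + sigma^2 I)^{-1} y
    complexity = 2.0 * np.log(np.diag(chol)).sum()    # log det(K + sigma^2 I)
    return -0.5 * (data_fit + complexity + n * np.log(2.0 * np.pi))

# Hypothetical toy data: a noisy sine wave.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)

# ML-II by grid search over the lengthscale.
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [gp_log_marginal_likelihood(x, y, ls, noise_var=0.01) for ls in grid]
best_lengthscale = grid[int(np.argmax(scores))]
```

Very short lengthscales pay through the complexity term and very long ones through the data-fit term, which is the trade-off described above.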
While this quantity can be optimized to select the hyperparameters of simple kernels, such as the
lengthscale of an RBF kernel, it can also be used for more expressive ones. For instance, one can
define a spectral mixture kernel in the Fourier domain and then optimize the basis functions’ coeffi-
cients using the marginal likelihood, which can recover a range of different kernel functions [206].
To make the kernels even more expressive, we can also allow for addition and multiplication of dif-
ferent kernels [207], which can ultimately lead to an automatic statistician [208], that is, a model
that can choose its own problem-dependent kernel combination based on the data and some kernel
grammar. While this model naïvely scales rather unfavorably due to the size of the combinatorial
search space, it can be made more scalable through cheaper approximations [209] or by making the
kernel grammar differentiable [210].
Another avenue, which was already alluded to above (see Section 2.1), is to use a neural network
to parameterize the kernel. The first attempt at this trained a deep belief network on the data and
then used it as the kernel function [40], but later approaches optimized the neural network kernel
directly using the marginal likelihood [37], often in combination with sparse approximations [38]
or stochastic variational inference [39] for scalability (see Eq. (7)). In this vein, it has recently been
proposed to regularize the Lipschitzness of the used neural network, in order for the learned kernel to
preserve distances between data points and thus improve its out-of-distribution uncertainties [211].
While all these approaches still rely on the log determinant term in Eq. (35) to protect them from
overfitting, it has been shown that this is unfortunately not effective enough when the employed
neural networks are overparameterized [43]. However, this can be remedied by adding a prior over
the neural network parameters, thus effectively turning them into BNNs and the whole model into a
proper hierarchical Bayesian model. It should be noted that these techniques can be used not only to
learn GP priors that work well for a particular task, but also to learn certain invariances from data
[212] or to fit GP priors to other (implicit) function-space distributions [175] (cf. Section 4.2).
As mentioned above, if we have related tasks available, we can use them to meta-learn the GP prior.
This can be applied to the kernel [48, 213] as well as the mean function [48], by optimizing the
marginal likelihood on these meta-tasks as
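$$\psi^* = \arg\max_{\psi} \sum_{i=1}^{|\mathcal{D}_M|} \log p_\psi(y_i \mid x_i) \quad \text{with} \quad \mathcal{D}_M = \{(x_i, y_i)\}_{i=1}^{|\mathcal{D}_M|} \tag{36}$$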
where DM is the set of meta-tasks. Note that the mean function can only safely be optimized in this
meta-learning setting, but not in the ML-II setting, since Eq. (35) does not provide any complexity
penalty on the mean function and it would thus severely overfit. While meta-learning does not
risk overfitting on the actual training data (since it is not used), it might overfit on the meta-tasks,
if there are too few of them, or if they are too similar to each other [214, 215]. In the Bayesian
meta-learning setting, this can be overcome by specifying a hierarchical hyperprior, which turns out
to be equivalent to optimizing a PAC-Bayesian bound [216]. This has been shown to successfully
meta-learn GP priors from as few as five meta-tasks.
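A minimal sketch of the meta-learning objective in Eq. (36) is given below. It assumes a zero-mean GP with a unit-variance RBF kernel whose lengthscale plays the role of ψ, a fixed noise variance, and a grid search. The meta-tasks are hypothetical datasets sampled from a GP, and no hyperprior or PAC-Bayesian regularization is included.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale):
    """Unit-variance RBF kernel matrix for 1D inputs."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

def log_marginal_likelihood(x, y, lengthscale, noise_var=0.01):
    """Zero-mean GP log marginal likelihood on a single task."""
    n = x.shape[0]
    cov = rbf_kernel(x, x, lengthscale) + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (y @ np.linalg.solve(cov, y) + logdet + n * np.log(2.0 * np.pi))

def meta_objective(meta_tasks, lengthscale):
    """Sum of per-task log marginal likelihoods, as in Eq. (36)."""
    return sum(log_marginal_likelihood(x, y, lengthscale) for x, y in meta_tasks)

# Hypothetical meta-tasks: five small datasets drawn from a GP with lengthscale 1.
rng = np.random.default_rng(0)
meta_tasks = []
for _ in range(5):
    x = np.sort(rng.uniform(0.0, 5.0, size=10))
    cov = rbf_kernel(x, x, 1.0) + 0.01 * np.eye(10)
    y = np.linalg.cholesky(cov) @ rng.normal(size=10)
    meta_tasks.append((x, y))

grid = np.array([0.1, 0.3, 1.0, 3.0, 10.0])
objective = np.array([meta_objective(meta_tasks, ls) for ls in grid])
meta_learned_lengthscale = grid[np.argmax(objective)]
```

The selected lengthscale would then define the GP prior for a new task, whose own training data never enter this objective.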
5.2 Learning VAE priors
Variational autoencoders are already trained using the ELBO (see Eq. (12)), which is a lower bound
on the marginal likelihood. Moreover, their likelihood p(x | z) is trained on this objective, as op-
posed to being fixed a priori as in most other Bayesian models. One could thus expect that VAEs
would be well suited to also learn their prior using their ELBO. Indeed, the ELBO can be further
decomposed as
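$$\mathcal{L}(x, \vartheta) = \mathbb{E}_{z \sim q_\vartheta(z \mid x)} [\log p_\vartheta(x \mid z)] - I_{q_\vartheta(z, x)}(z, x) - D_{\mathrm{KL}}(\bar{q}_\vartheta(z) \,\|\, p(z)) \tag{37}$$
where $I_{q_\vartheta(z, x)}(z, x)$ is the mutual information between z and x under the joint distribution
$q_\vartheta(z, x) = q_\vartheta(z \mid x)\, p(x)$ and $\bar{q}_\vartheta$ is the aggregated approximate posterior
$\bar{q}_\vartheta(z) = \frac{1}{K} \sum_{i=1}^{K} q_\vartheta(z \mid x_i)$.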
Since the KL term in this objective is the only term that depends on the prior and the complexity
of qϑ (z | x) is already penalized by the mutual information term, it has been argued that optimizing
the prior p(z) with respect to the ELBO could be beneficial [217]. One can then show that the opti-
mal prior under this objective is the aggregated posterior Ex∼p(x) [p(z | x)], where p(x) is the data
distribution [218].
As mentioned above, a more expressive family of prior distributions than the common standard
Gaussian priors are Gaussian mixture priors [100] (see Section 3.1). In particular, with an increasing
number of components, these mixtures can approximate any smooth distribution arbitrarily closely
[219]. These VAE priors can be optimized using the ELBO [220]; however, it has been found that this
can severely overfit [218], highlighting again that the marginal likelihood (or its lower bound) cannot
always protect against overfitting (see Section 5.1). Instead, it has been proposed to parameterize
the mixture components as variational posteriors on certain inducing points, that is
$$p(z) = \frac{1}{K} \sum_{i=1}^{K} q(z \mid x_i) \tag{38}$$
where the xi ’s are learned [218]. This can indeed improve the VAE performance without overfitting,
and since the prior is defined in terms of inducing points in data space, it can also straightforwardly
be used with hierarchical VAEs [221].
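A minimal sketch of evaluating the log density of such an inducing-point mixture prior is shown below. It assumes uniform mixture weights over K components, a diagonal-Gaussian variational posterior, and a hypothetical linear encoder standing in for the VAE encoder network. All names are illustrative.

```python
import numpy as np

def encode(x, w_mean, w_logvar):
    """Hypothetical linear encoder: mean and log-variance of q(z | x)."""
    return x @ w_mean, x @ w_logvar

def log_diag_gaussian(z, mean, logvar):
    """Log density of a diagonal Gaussian, summed over latent dimensions."""
    return -0.5 * (np.log(2.0 * np.pi) + logvar + (z - mean) ** 2 / np.exp(logvar)).sum(-1)

def inducing_point_prior_logpdf(z, pseudo_inputs, w_mean, w_logvar):
    """log p(z) for p(z) = (1/K) sum_i q(z | x_i) with learnable inducing points x_i."""
    mean, logvar = encode(pseudo_inputs, w_mean, w_logvar)                 # each (K, Dz)
    log_comp = log_diag_gaussian(z[:, None, :], mean[None], logvar[None])  # (N, K)
    top = log_comp.max(axis=1, keepdims=True)                              # for a stable log-sum-exp
    log_sum = top[:, 0] + np.log(np.exp(log_comp - top).sum(axis=1))
    return log_sum - np.log(pseudo_inputs.shape[0])

rng = np.random.default_rng(0)
w_mean, w_logvar = rng.normal(size=(4, 2)), 0.1 * rng.normal(size=(4, 2))
pseudo_inputs = rng.normal(size=(3, 4))      # K = 3 inducing points in data space
z = rng.normal(size=(5, 2))
log_p = inducing_point_prior_logpdf(z, pseudo_inputs, w_mean, w_logvar)
```

Because the prior reuses the encoder, the only additional parameters are the inducing points themselves, which would be trained jointly with the rest of the model.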
Since mixture models can complicate the computation of the KL divergence and require the difficult
choice of a number of components K, an alternative is to use implicit priors, which are parameterized by
learnable functions. One specific example for image data has been proposed for VAEs in which the
latent space preserves the shape of the data, that is, the z’s are not just vectors, but 2D or 3D tensors.
In such models, one can define a hierarchical prior over z, which is parameterized by learnable
convolutions over the latent dimensions [222]. Another way of specifying a learnable hierarchical
prior is to use memory modules, where the prior is then dependent on the stored memories and the
memory is learned together with the rest of the model [223]. More generally, one can define implicit
prior distributions in VAEs as
z = g(ξ; ψ) with p(ξ) = N (0, I) (39)
where g(· ; ψ) is a learnable diffeomorphism, such as a normalizing flow [224]. This has been
successfully demonstrated with RealNVP flows [225], where it has been shown that the VAE can
learn very expressive latent representations even with a single latent dimension [226]. Moreover,
it has been shown that using an autoregressive flow [227] in this way for the prior is equivalent to
using an inverse autoregressive flow as part of the decoder [228].
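The construction in Eq. (39) can be illustrated with the simplest possible diffeomorphism. The sketch below assumes an elementwise affine map as a stand-in for g(· ; ψ); actual flows such as RealNVP stack many coupling layers, but the change-of-variables computation of log p(z) has the same structure.

```python
import numpy as np

def flow_forward(xi, log_scale, shift):
    """Elementwise affine flow z = g(xi; psi) with psi = (log_scale, shift)."""
    return np.exp(log_scale) * xi + shift

def flow_prior_logpdf(z, log_scale, shift):
    """log p(z) = log N(g^{-1}(z); 0, I) - log |det dg/dxi|."""
    xi = (z - shift) * np.exp(-log_scale)                  # inverse map g^{-1}(z)
    log_base = -0.5 * (xi ** 2 + np.log(2.0 * np.pi)).sum(-1)
    return log_base - log_scale.sum()                      # log-det of a diagonal Jacobian

def sample_flow_prior(n, log_scale, shift, rng):
    """Draw xi ~ N(0, I) and push it through the flow."""
    xi = rng.standard_normal((n, log_scale.shape[0]))
    return flow_forward(xi, log_scale, shift)

rng = np.random.default_rng(0)
log_scale, shift = np.array([0.3, -0.2]), np.array([1.0, -1.0])
z = sample_flow_prior(5, log_scale, shift, rng)
log_p = flow_prior_logpdf(z, log_scale, shift)
```

With this particular choice the prior is just a diagonal Gaussian, which makes the sketch easy to check; the expressiveness discussed above comes from richer invertible maps.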
Finally, one can also reshape some base prior by a multiplicative term, that is
p(z) ∝ pbase (z) α(z; ψ) with pbase (z) = N (0, I) (40)
where α(z; ψ) is some learnable acceptance function [229]. Depending on the form of the α-
function, the normalization constant of this prior might be intractable, thus requiring approxima-
tions such as accept/reject sampling [229]. Interestingly, when defining an energy E(z; ψ) =
− log α(z; ψ), the model above can be seen as a latent energy-based model [230, 231]. More-
over, when defining this function in terms of a discriminator d(·) in the data space, that is,
α(z; ψ) = Ex∼p(x | z) [d(x; ψ)], this yields a so-called pull-back prior [232], which is related to
generative adversarial networks [233].
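Sampling from the reshaped prior in Eq. (40) can be sketched with plain rejection sampling, assuming that the acceptance function takes values in (0, 1). The logistic form of α below is a hypothetical choice for illustration and is not the parameterization used in [229].

```python
import numpy as np

def acceptance(z, weights, bias):
    """Hypothetical acceptance function alpha(z; psi) in (0, 1): a logistic unit."""
    return 1.0 / (1.0 + np.exp(-(z @ weights + bias)))

def sample_reshaped_prior(n_samples, weights, bias, rng, batch_size=1024):
    """Draw from p(z) proportional to N(z; 0, I) * alpha(z; psi) by accept/reject."""
    dim = weights.shape[0]
    kept, n_kept = [], 0
    while n_kept < n_samples:
        z = rng.standard_normal((batch_size, dim))          # proposals from the base prior
        accept = rng.uniform(size=batch_size) < acceptance(z, weights, bias)
        kept.append(z[accept])
        n_kept += int(accept.sum())
    return np.concatenate(kept)[:n_samples]

rng = np.random.default_rng(0)
weights, bias = np.array([3.0, 0.0]), 0.0    # favors positive values of the first latent
samples = sample_reshaped_prior(2000, weights, bias, rng)
```

The average acceptance rate equals the normalization constant of Eq. (40), so a sharply peaked α makes this naive sampler slow.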
5.3 Learning BNN priors
Finally, we will consider learning priors for Bayesian neural networks. Due to the large dimen-
sionality of BNN weight spaces and the complex mapping between weights and functions (see
Section 4.1), learning BNN priors has not been attempted very often in the literature. A manual
prior specification procedure that may be loosely called “learning” is the procedure in Fortuin et al.
[18], where the authors train standard neural networks using gradient descent and use their empiri-
cal weight distributions to inform their prior choices. When it comes to proper ML-II optimization,
BNNs pose an additional challenge, because their marginal likelihoods are typically intractable and
even lower bounds are hard to compute. Learning BNN priors using ML-II has therefore so far
only focused on learning the parameters of Gaussian priors in BNNs with Gaussian approximate
posteriors, where the posteriors were computed either using moment-matching [149] or using the
Laplace-Generalized-Gauss-Newton method [234], that is
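$$\log p(\mathcal{D}) \overset{\text{Lap}}{\approx} \log q(\mathcal{D}) \overset{\text{GGN}}{\approx} \log p(\mathcal{D} \mid \theta^*) - \frac{1}{2} \log\det\left( \frac{1}{2\pi} \hat{H}_{\theta^*} \right) \tag{41}$$
$$\text{with} \quad \hat{H}_{\theta^*} = J_{\theta^*}^\top H^{L}_{\theta^*} J_{\theta^*} + H^{P}_{\theta^*}$$
where $q(\mathcal{D})$ is the marginal likelihood of a Laplace approximation, $\theta^* = \arg\max_\theta p(\theta \mid \mathcal{D})$ is the
maximum a posteriori (MAP) estimate of the parameters, $\hat{H}_{\theta^*}$ is an approximate Hessian around
$\theta^*$, $J_{\theta^*}$ is the Jacobian of the BNN outputs with respect to the parameters, $H^{L}_{\theta^*}$ is the Hessian of
the log likelihood, and $H^{P}_{\theta^*}$ is the Hessian of the log prior. Using this approximation, the marginal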
likelihood is actually differentiable with respect to the prior hyperparameters ψ, such that they can
be trained together with the BNN posterior [234].
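As a sanity check of this idea, the sketch below applies a Laplace approximation of the evidence to Bayesian linear regression, where the model is linear in θ and the GGN is therefore exact. It assumes the standard form of the Laplace evidence, which uses the joint log p(D, θ∗) = log p(D | θ∗) + log p(θ∗) and the Hessians of the negative log likelihood and negative log prior. The prior precision plays the role of ψ and is selected by grid search; the data and names are hypothetical.

```python
import numpy as np

def laplace_log_evidence(X, y, prior_precision, noise_var):
    """Laplace estimate of log p(D) for y = X theta + noise, theta ~ N(0, I / prior_precision)."""
    n, p = X.shape
    hessian = X.T @ X / noise_var + prior_precision * np.eye(p)   # J^T H^L J + H^P
    theta_map = np.linalg.solve(hessian, X.T @ y / noise_var)     # MAP estimate
    resid = y - X @ theta_map
    log_lik = -0.5 * (n * np.log(2.0 * np.pi * noise_var) + resid @ resid / noise_var)
    log_prior = -0.5 * (p * np.log(2.0 * np.pi / prior_precision)
                        + prior_precision * (theta_map @ theta_map))
    _, logdet = np.linalg.slogdet(hessian / (2.0 * np.pi))
    return log_lik + log_prior - 0.5 * logdet

# Hypothetical toy regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=30)

# ML-II over the prior precision by grid search.
grid = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
evidence = np.array([laplace_log_evidence(X, y, d, noise_var=0.09) for d in grid])
best_prior_precision = grid[np.argmax(evidence)]
```

For this linear-Gaussian model the estimate coincides with the exact log evidence. For a BNN it is only an approximation, and the grid search would typically be replaced by gradient steps on ψ interleaved with training.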
Again, if meta-tasks are available, one can try to meta-learn the BNN prior. For CNNs, one can for
instance train standard neural networks on the meta-tasks and then learn a generative model (e.g., a
VAE) for the filter weights. This generative model can then be used as a BNN prior for convolutional
filters [163]. In the case of only few meta-tasks, we can also again successfully use PAC-Bayesian
bounds to avoid meta-overfitting, at least when meta-learning Gaussian BNN priors [216]. Finally,
if we do not have access to actual meta-tasks, but we are aware of invariances in our data, we can
construct meta-tasks using data augmentation and use them to learn a prior that is (approximately)
invariant to these augmentations [235], that is
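$$\psi^* = \arg\min_{\psi}\, \mathbb{E}_{\theta \sim p(\theta; \psi)}\, \mathbb{E}_{\tilde{x} \sim q(\tilde{x} \mid x)} \left[ D_{\mathrm{KL}}(p(y \mid x, \theta) \,\|\, p(y \mid \tilde{x}, \theta)) \right]$$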
6 Conclusion
We have argued that choosing good priors in Bayesian models is crucial to actually achieve the
theoretical and empirical properties that they are commonly celebrated for, including uncertainty
estimation, model selection, and optimal decision support. While practitioners in Bayesian deep
learning currently often resort to the option of isotropic Gaussian (or similarly uninformative) pri-
ors, we have also highlighted that these priors are usually misspecified and can lead to several unin-
tended negative consequences during inference. On the other hand, well chosen priors can improve
performance and even enable novel applications. Luckily, a plethora of alternative prior choices is
available for popular Bayesian deep learning models, such as (deep) Gaussian processes, variational
autoencoders, and Bayesian neural networks. Moreover, in certain cases, useful priors for these
models can even be learned from data alone.
We hope that this survey—while necessarily being incomplete in certain ways—has provided the
interested reader with a first overview of the existing literature on priors for Bayesian deep learning
and with some guidance on how to choose them. We also hope to encourage practitioners in this
field to consider their prior choices a bit more carefully, and to potentially choose one of the priors
presented here instead of the standard Gaussian ones, or better yet, to use inspiration from these
priors and come up with even better suited ones for their own models. If only a small fraction of the
time usually spent thinking about increasingly elaborate inference techniques is instead spent
on thinking about the priors used, this effort will have been worthwhile.
We acknowledge funding from the Swiss Data Science Center through a PhD fellowship. We thank
Alex Immer, Adrià Garriga-Alonso, and Claire Vernade for helpful feedback on the draft.
