Priors in Bayesian Learning
Vincent Fortuin
Department of Computer Science
arXiv:2105.06868v1 [stat.ML] 14 May 2021
ETH Zürich
Zürich, Switzerland
fortuin@inf.ethz.ch
ABSTRACT
While the choice of prior is one of the most critical parts of the Bayesian infer-
ence workflow, recent Bayesian deep learning models have often fallen back on
uninformative priors, such as standard Gaussians. In this review, we highlight the
importance of prior choices for Bayesian deep learning and present an overview of
different priors that have been proposed for (deep) Gaussian processes, variational
autoencoders, and Bayesian neural networks. We also outline different methods
of learning priors for these models from data. We hope to motivate practitioners
in Bayesian deep learning to think more carefully about the prior specification for
their models and to provide them with some inspiration in this regard.
1 Introduction
Bayesian models have gained a stable popularity in data analysis [1] and machine learning [2]. Especially
in recent years, the interest in combining these models with deep learning has surged.¹ The
main idea of Bayesian modeling is to infer a posterior distribution over the parameters θ of the
model given some observed data D using Bayes’ theorem [3, 4] as
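$$ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\int p(\mathcal{D} \mid \theta)\, p(\theta)\, \mathrm{d}\theta} \tag{1} $$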
where p(D|θ) is the likelihood, p(D) is the marginal likelihood (or evidence), and p(θ) is the prior.
The prior can often be parameterized by hyperparameters ψ, in which case we will write it as p(θ; ψ)
if we want to highlight this dependence. This posterior can then be used to model new unseen data
D∗ using the posterior predictive
$$ p(\mathcal{D}^* \mid \mathcal{D}) = \int p(\mathcal{D}^* \mid \theta)\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta \tag{2} $$
The integral in Eq. (2) is also called the Bayesian model average, because it averages the predic-
tions of all plausible models weighted by their posterior probability. This is in contrast to standard
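maximum-likelihood learning, where only one parameter θ* is used for the predictions as
$$ p(\mathcal{D}^* \mid \mathcal{D}) \approx p(\mathcal{D}^* \mid \theta^*) \quad \text{with} \quad \theta^* = \arg\max_{\theta}\, p(\mathcal{D} \mid \theta) \tag{3} $$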
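As a minimal sketch of Eqs. (1) to (3), consider a Bernoulli likelihood with a conjugate Beta prior, for which the posterior and the posterior predictive are available in closed form. The model, the hyperparameters `a` and `b`, and the toy data are assumptions chosen purely for illustration and do not appear in the works cited above.

```python
import numpy as np

def beta_bernoulli_posterior(data, a=1.0, b=1.0):
    """Posterior Beta(a + #ones, b + #zeros) for a Bernoulli likelihood, Eq. (1)."""
    data = np.asarray(data)
    ones = data.sum()
    zeros = data.size - ones
    return a + ones, b + zeros

def posterior_predictive(data, a=1.0, b=1.0):
    """p(x* = 1 | D): the Bayesian model average of Eq. (2) in closed form."""
    a_post, b_post = beta_bernoulli_posterior(data, a, b)
    return a_post / (a_post + b_post)

def ml_predictive(data):
    """p(x* = 1 | theta*) with the maximum-likelihood estimate of Eq. (3)."""
    return float(np.mean(data))

# Assumed toy data: three successes and no failures.
data = [1, 1, 1]
p_bayes = posterior_predictive(data)   # (1 + 3) / (2 + 3) = 0.8
p_ml = ml_predictive(data)             # 3 / 3 = 1.0
```

With so few observations, the maximum-likelihood plug-in assigns zero probability to a failure, whereas the Bayesian model average remains hedged by the (here uniform) prior.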
While much previous work has focused on the properties of the posterior predictive [5, 6], the
approximation of the integrals in Eq. (1) and Eq. (2) [7–9], or the use of the marginal likelihood
for Bayesian model selection [10, 11], we want to use this survey to shed some light on the often-
neglected term in Eq. (1): the prior p(θ).
In orthodox Bayesianism, the prior should be chosen in a way such that it accurately reflects our
beliefs about the parameters θ before seeing any data [12]. This has been described as being the
most crucial part of Bayesian model building, but also the hardest one, since it is often not trivial to
map the subjective beliefs of the practitioner unambiguously onto tractable probability distributions
[13]. However, in practice, choosing the prior is often seen as a nuisance, and there have been
many attempts to avoid having to choose a meaningful prior, for instance, through objective
priors [14, 15], empirical Bayes [16], or combinations of the two [17]. Especially in Bayesian deep
learning, it is common practice to choose a (seemingly) “uninformative” prior, such as a standard
Gaussian [cf. 18].
¹ As attested, for instance, by the growing interest in the Bayesian Deep Learning workshop at NeurIPS.
This trend is troubling, because choosing a bad prior can have detrimental consequences for the
whole inference endeavor. While the choice of uninformative (or weakly informative) priors is often
motivated by invoking the asymptotic consistency guarantees of the Bernstein-von-Mises
theorem [19], this theorem does not in fact hold in many applications, since its regularity conditions
are not satisfied [20]. Moreover, in the non-asymptotic regime of our practical inferences, the prior
can have a strong influence on the posterior, often forcing the probability mass onto arbitrary sub-
spaces of the parameter space, such as a spherical subspace in the case of the seemingly innocuous
standard Gaussian prior [21].
Worse yet, prior misspecification can undermine the very properties that compel us to use Bayesian
inference in the first place. For instance, marginal likelihoods can become meaningless under prior
misspecification, leading us to choose suboptimal models when using Bayesian model selection
[22]. Moreover, de Finetti’s famous Dutch book argument [23] can be extended to cases where we
can be convinced to take wagers that lose money in expectation when using bad priors, which even
holds for the aforementioned objective (Jeffreys) priors [24]. In a similar vein, Savage’s theorem
[25], which promises us optimal decisions under Bayesian decision theory, breaks down under prior
misspecification [26]. Finally, it can even be shown that PAC-Bayesian inference can exceed the
Bayesian one in terms of generalization performance when the prior is misspecified [27, 28].
On a more optimistic note, the no-free-lunch theorem [29] states that no learning algorithm is univer-
sally superior, or in other words, that different learning algorithms outperform each other on different
datasets. Applied to Bayesian learning, this means that there is also no universally preferred prior,
but that each task is potentially endowed with its own optimal prior. Finding (or at least approxi-
mating) this optimal prior then offers the potential for significantly improving the performance of
the inference or even enabling successful inference in cases where it otherwise would not have been
possible.
All these observations should at least motivate us to think a bit more carefully about our priors
than is often done in practice. But do we really have reason to believe that the commonly used
priors in Bayesian deep learning are misspecified? One recent piece of evidence is the fact that in
Bayesian linear models, it can be shown that prior misspecification leads to the necessity to temper
the posterior for optimal performance (i.e., use a posterior $p_T(\theta \mid \mathcal{D}) \propto p(\theta \mid \mathcal{D})^{1/T}$ for some T < 1)
[30]. And indeed, this need for posterior tempering has also been observed empirically in modern
Bayesian deep learning models [e.g., 31–33, 18].
Based on all these insights, it is thus high time that we critically reflect upon our choices of priors
in Bayesian deep learning models. Luckily for us, there are many alternative priors that we could
choose over the standard uninformative ones. This survey shall attempt to provide an overview
of them. We will review existing prior designs for (deep) Gaussian processes in Section 2, for
variational autoencoders in Section 3, and for Bayesian neural networks in Section 4. We will then
finish by giving some brief outline of methods for learning priors from data in Section 5.
This prior is called a Gaussian process because it has the property that when evaluating the function
at any finite set of points $\mathbf{x}$, the function values $\mathbf{f} := f(\mathbf{x})$ are distributed as $p(\mathbf{f}) = \mathcal{N}(m_x, K_{xx})$,
where $m_x = m_\psi(\mathbf{x})$ is the vector of mean function outputs, the (i, j)’th element of the kernel
matrix $K_{xx}$ is given by $k_\psi(x_i, x_j)$, and the d-dimensional multivariate Gaussian $\mathcal{N}(\mathbf{f}; \mu, \Sigma)$ is
$$ p(\mathbf{f}) = \mathcal{N}(\mu, \Sigma) := \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left( -\frac{1}{2} (\mathbf{f} - \mu)^\top \Sigma^{-1} (\mathbf{f} - \mu) \right) \tag{5} $$
The Gaussian process can also be seen as an infinite-dimensional version of this multivariate Gaus-
sian distribution, following the Kolmogorov extension theorem [36].
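This model is often combined with a Gaussian observation likelihood $p(y \mid \mathbf{f}) = \mathcal{N}(\mathbf{f}, \sigma^2 I)$, since
it then allows for closed-form posterior inference [35] on unseen data points $(x^*, y^*)$ as
$$ \begin{aligned} p(y^* \mid x^*, x, y) &= \mathcal{N}(m^*, K^*) \quad \text{with} \\ m^* &= m_{x^*} + K_{x^* x} \left( K_{xx} + \sigma^2 I \right)^{-1} (y - m_x) \\ K^* &= K_{x^* x^*} - K_{x^* x} \left( K_{xx} + \sigma^2 I \right)^{-1} K_{x x^*} + \sigma^2 I \end{aligned} \tag{6} $$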
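The closed-form predictive in Eq. (6) can be sketched in a few lines of NumPy. The RBF kernel, the zero mean function, the noise variance, and the sine toy data below are assumptions made for illustration; the function names are hypothetical and not taken from any of the cited works.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_predict(x, y, x_star, noise_var=0.1, mean_fn=lambda x: np.zeros_like(x)):
    """Predictive mean m* and covariance K* of Eq. (6)."""
    K_xx = rbf_kernel(x, x)
    K_sx = rbf_kernel(x_star, x)
    K_ss = rbf_kernel(x_star, x_star)
    A = K_xx + noise_var * np.eye(len(x))
    # Solve linear systems instead of forming the inverse explicitly.
    m_star = mean_fn(x_star) + K_sx @ np.linalg.solve(A, y - mean_fn(x))
    K_star = K_ss - K_sx @ np.linalg.solve(A, K_sx.T) + noise_var * np.eye(len(x_star))
    return m_star, K_star

# Assumed toy data: noiseless sine observations on a symmetric grid.
x = np.linspace(-3.0, 3.0, 20)
y = np.sin(x)
x_star = np.array([0.0, 10.0])   # one point inside the data, one far away
m_star, K_star = gp_predict(x, y, x_star)
```

Far away from the training inputs the predictive variance reverts to the prior variance plus the noise variance, which is one way in which the prior (here through the kernel) directly shapes the predictions.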
While these models are not deep per se, there are many ways in which they connect to Bayesian deep
learning, which merits their appearance in this survey. In the following, we are going to present how
GP priors can be parameterized by deep neural networks (Section 2.1), how GPs can be stacked to
build deeper models (Section 2.2), and how deep neural networks can themselves turn into GPs or
be approximated by GPs (Section 2.3).
tent functions $\{f_1, \ldots, f_k\}$ with function outputs $\{\mathbf{f}_1, \ldots, \mathbf{f}_k\}$ and latent variables $\{z_1, \ldots, z_{k-1}\}$,
where each function uses the previous latent variable as its inputs, that is, $\mathbf{f}_{i+1} = f_{i+1}(z_i)$ and
$\mathbf{f}_1 = f_1(x)$. In the simplest case, all these latent GPs still have Gaussian latent likelihoods
$p(z_i \mid \mathbf{f}_i) = \mathcal{N}(\mathbf{f}_i, \sigma_i^2 I)$ and a Gaussian output likelihood $p(y \mid \mathbf{f}_k) = \mathcal{N}(\mathbf{f}_k, \sigma_k^2 I)$. If each of
these functions is endowed with a GP prior $p(f_i) = \mathcal{GP}(m_{\psi_i}(\cdot), k_{\psi_i}(\cdot, \cdot))$, this model is called a
deep Gaussian process [50]. Similarly to deep neural networks, these models can represent increas-
ingly complex distributions with increasing depth, but unlike neural networks, they still offer a fully
Bayesian treatment. Crucially, in contrast to standard GPs, deep GPs can model a larger class of
output distributions [51], which includes distributions with non-Gaussian marginals [52]. For in-
creased flexibility, these models can also be coupled with warping functions between the GP layers
[53]. Moreover, they can be combined with the convolutional GP kernels mentioned above, to yield
models that are similar in spirit to deep CNNs [54, 55].
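To make the layered construction concrete, the following sketch draws a sample from a deep GP prior by repeatedly sampling a GP at the latent variables produced by the previous layer. The zero mean functions, the RBF kernels with unit lengthscale, the latent noise level, and the jitter term are assumptions chosen for illustration only.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale**2)

def sample_deep_gp_prior(x, num_layers=2, latent_noise_std=0.05, jitter=1e-6, seed=0):
    """Draw the outputs f_k of a deep GP prior with zero-mean RBF layers at inputs x."""
    rng = np.random.default_rng(seed)
    layer_input = np.asarray(x, dtype=float)
    for layer in range(num_layers):
        # Each layer is a GP evaluated at the previous layer's latent variables.
        # The jitter (an assumption) keeps the Cholesky factorization stable.
        K = rbf_kernel(layer_input, layer_input) + jitter * np.eye(len(layer_input))
        f = np.linalg.cholesky(K) @ rng.standard_normal(len(layer_input))
        if layer < num_layers - 1:
            # Gaussian latent likelihood p(z_i | f_i) = N(f_i, sigma_i^2 I).
            layer_input = f + latent_noise_std * rng.standard_normal(len(f))
    return f

x = np.linspace(-2.0, 2.0, 30)
f_shallow = sample_deep_gp_prior(x, num_layers=1)   # ordinary GP prior sample
f_deep = sample_deep_gp_prior(x, num_layers=3)      # three stacked GP layers
```

Sampling from the prior in this way is cheap; it is the posterior inference over all layers that requires the approximate techniques discussed next.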
While these models seem to be strictly superior and preferable to standard GPs, their additional flex-
ibility comes at a price: the posterior inference is not tractable in closed form anymore. This means
that the posterior has to be estimated using approximate inference techniques, such as variational
inference [50, 56], expectation propagation [57], or amortized inference [58]. A very popular ap-
proximate inference technique for GPs is based on so-called inducing points, which are chosen to
be a subset of the training points or generally of the training domain [59–63]. This technique can
also be extended to inference in deep GPs [50, 64] or replaced by variational random features [65].
In contrast to the inference techniques, the choice of priors for deep GPs has generally been under-
studied. While a deep GP as a whole can model a rather complex prior over functions, the priors
for the single layers in terms of $m_{\psi_i}$ and $k_{\psi_i}$ are often chosen to be quite simple, for instance,
RBF kernels with different lengthscales [50]. Another option to build models with more expres-
sive kernels is to actually parameterize the kernel of a GP with another GP [66]. Particularly, the
(hierarchical) prior is then $p(f) = \mathcal{GP}(m_\psi(\cdot), \hat{k}(\cdot, \cdot))$ with $\hat{k}(x, x') = \mathrm{FT}^{-1}(\exp s(x - x'))$ and
$p(s) = \mathcal{GP}(0, k_\psi(\cdot, \cdot))$, where $\mathrm{FT}^{-1}$ is the inverse Fourier transform. This can also be seen as
a deep GP with one hidden layer, and it also does not allow closed-form inference, but relies on
approximate inference, for instance, using elliptical slice sampling [66].
Another way to connect GPs to DNNs is via neural network limits. It has been known for some
time now that the function-space prior p(f ) induced by a Bayesian neural network (BNN) with a
single hidden layer and any independent finite-variance parameter prior p(θ) converges in the limit
of infinite width to a GP, due to the central limit theorem [67, 68]. The limiting GP prior is given by
p(f ) = GP(0, kNN (·, ·)) with
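$$ k_{\mathrm{NN}}(x, x') = \sigma_{w_2}^2\, \mathbb{E}_{w,b}\left[ \phi(w^\top x + b)\, \phi(w^\top x' + b) \right] + \sigma_{b_2}^2 \quad \text{where} \quad w \sim \mathcal{N}(0, \sigma_{w_1}^2 I) \ \text{and} \ b \sim \mathcal{N}(0, \sigma_{b_1}^2 I) \tag{8} $$
with the prior weight and bias variances $\sigma_{w_1}^2, \sigma_{b_1}^2$ in the first layer, $\sigma_{w_2}^2, \sigma_{b_2}^2$ in the second layer, and
nonlinear activation function ϕ(·). Note that here it is usually assumed that the weight variances are
set as $\sigma_{w_i}^2 \propto 1/n_i$, where $n_i$ is the number of units in the i’th layer. The kernel $k_{\mathrm{NN}}(\cdot, \cdot)$ is then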
called the neural network GP (NNGP) kernel. This result has recently been extended to BNNs with
ReLU activations [69] and deep BNNs [70–72], where the lower layer GP kernel takes the same
form as above and the kernel for the higher layers assumes the recursive form
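$$ k_{\mathrm{NN}}^{\ell}(x, x') = \sigma_{w_\ell}^2\, \mathbb{E}_{(z_1, z_2) \sim \mathcal{N}(0, K_{xx'}^{\ell-1})}\left[ \phi(z_1)\, \phi(z_2) \right] + \sigma_{b_\ell}^2 \tag{9} $$
with $K_{xx'}^{\ell-1}$ being the 2 × 2 kernel matrix of the (ℓ − 1)’th layer kernel evaluated at x and x′.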
Moreover, these convergence results can also be shown for convolutional BNNs [73, 74], even with
weight correlations [75], and for attention neural networks [76].
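For a concrete instance of Eq. (8), the sketch below evaluates the single-hidden-layer NNGP kernel for a ReLU activation in two ways: with the known closed-form expression for the Gaussian expectation of a product of ReLUs (the arc-cosine form), and with a plain Monte Carlo estimate over w and b. The choice of ReLU, the variance values, and the test inputs are assumptions for illustration; the function names are hypothetical.

```python
import numpy as np

def nngp_relu_closed_form(x1, x2, sw1=1.0, sb1=1.0, sw2=1.0, sb2=0.0):
    """Eq. (8) for phi = ReLU, using the closed-form Gaussian expectation."""
    # Covariance of the pre-activations z = w^T x + b at the two inputs.
    k11 = sw1**2 * (x1 @ x1) + sb1**2
    k22 = sw1**2 * (x2 @ x2) + sb1**2
    k12 = sw1**2 * (x1 @ x2) + sb1**2
    cos_t = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
    theta = np.arccos(cos_t)
    expectation = np.sqrt(k11 * k22) * (np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)
    return sw2**2 * expectation + sb2**2

def nngp_relu_monte_carlo(x1, x2, sw1=1.0, sb1=1.0, sw2=1.0, sb2=0.0,
                          num_samples=200_000, seed=0):
    """Eq. (8) for phi = ReLU, estimated by sampling first-layer weights and biases."""
    rng = np.random.default_rng(seed)
    w = sw1 * rng.standard_normal((num_samples, x1.size))
    b = sb1 * rng.standard_normal(num_samples)
    h1 = np.maximum(w @ x1 + b, 0.0)
    h2 = np.maximum(w @ x2 + b, 0.0)
    return sw2**2 * np.mean(h1 * h2) + sb2**2

# Assumed test inputs.
x1 = np.array([1.0, 0.0])
x2 = np.array([0.6, 0.8])
k_exact = nngp_relu_closed_form(x1, x2)
k_mc = nngp_relu_monte_carlo(x1, x2)
```

The Monte Carlo version works for any activation function, while the closed form is specific to ReLU; the recursion in Eq. (9) applies the same expectation layer by layer.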
While these results only hold for independent finite-variance priors, they can be extended to de-
pendent priors, where they yield GPs that are marginalized over a hyperprior [77], and to infinite-
variance priors, where they lead to α-stable processes [78]. Excitingly, it has been shown that this
convergence of the BNN prior to a stochastic process also implies the convergence of the posterior
under mild regularity assumptions [79]. While these results have typically been derived manually,
the recent theoretical framework of tensor programs allows one to rederive them in a unified way, including
for recurrent architectures and batch-norm [80–82]. Moreover, it allows one to derive limits for
networks where only a subset of the layers converge to infinite width, which recovers the models’
ability to learn latent features [83].
Not only can infinitely wide BNNs lead to GP limits, but the same is also true for infinitely wide standard
DNNs. Crucially, however, in this case, the GP arises not as a function-space prior at initialization,
but as a model of training under gradient descent [84, 85]. Specifically, neural networks under
gradient descent training can be shown to follow the kernel gradient of their functional loss with
respect to the so-called neural tangent kernel (NTK) which is
$$ k_{\mathrm{NTK}}(x, x') = J_\theta(x)\, J_\theta(x')^\top \tag{10} $$
where $J_\theta(x)$ is the Jacobian of the neural network with respect to the parameters θ evaluated at
input x. In the limit of infinite width, this kernel becomes stable over training and can be recursively
computed as
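$$ \begin{aligned} k_{\mathrm{NTK}}^{\ell}(x, x') &= k_{\mathrm{NTK}}^{\ell-1}(x, x')\, \dot{\Sigma}^{\ell}(x, x') + \Sigma^{\ell}(x, x') \quad \text{with} \\ k_{\mathrm{NTK}}^{1}(x, x') &= \Sigma^{1}(x, x') = \frac{1}{n_0} x^\top x' + 1 \\ \Sigma^{\ell}(x, x') &= \mathbb{E}_{(z_1, z_2) \sim \mathcal{N}(0, K_{xx'}^{\ell-1})}\left[ \phi(z_1)\, \phi(z_2) \right] \\ \dot{\Sigma}^{\ell}(x, x') &= \mathbb{E}_{(z_1, z_2) \sim \mathcal{N}(0, K_{xx'}^{\ell-1})}\left[ \dot{\phi}(z_1)\, \dot{\phi}(z_2) \right] \end{aligned} \tag{11} $$
where $n_0$ is the number of inputs, $K_{xx'}^{\ell-1}$ is again the kernel matrix of the NTK at the previous layer,
ϕ̇(·) is the derivative of the activation function, and the $\Sigma^{\ell}$ are so-called activation kernels.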
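At finite width, Eq. (10) can be evaluated directly from the parameter Jacobian. The sketch below does so for a small one-hidden-layer tanh network with a scalar output and a 1/√n output scaling; the architecture, the scaling, and all names are assumptions for illustration, and the Jacobian is written out by hand rather than obtained from an automatic differentiation library.

```python
import numpy as np

def init_params(num_inputs, width, seed=0):
    """Standard-normal parameters of a hypothetical one-hidden-layer network."""
    rng = np.random.default_rng(seed)
    return {
        "W": rng.standard_normal((width, num_inputs)),
        "b": rng.standard_normal(width),
        "v": rng.standard_normal(width),
    }

def forward(params, x):
    """Scalar network output f(x) = v^T tanh(W x + b) / sqrt(width)."""
    a = params["W"] @ x + params["b"]
    return params["v"] @ np.tanh(a) / np.sqrt(len(a))

def jacobian(params, x):
    """Flattened gradient of the output w.r.t. all parameters, ordered as (W, b, v)."""
    a = params["W"] @ x + params["b"]
    h = np.tanh(a)
    scale = 1.0 / np.sqrt(len(a))
    dv = h * scale
    db = params["v"] * (1.0 - h**2) * scale
    dW = np.outer(db, x)
    return np.concatenate([dW.ravel(), db, dv])

def empirical_ntk(params, x1, x2):
    """Finite-width neural tangent kernel of Eq. (10)."""
    return jacobian(params, x1) @ jacobian(params, x2)

# Assumed toy network and inputs.
params = init_params(num_inputs=3, width=16)
x1 = np.array([1.0, -0.5, 0.2])
x2 = np.array([0.3, 0.1, -1.0])
k12 = empirical_ntk(params, x1, x2)
```

This empirical kernel depends on the current parameters and changes during training; only in the infinite-width limit does it become the fixed kernel given by the recursion in Eq. (11).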
In the case of finite width, this kernel will not model the training behavior exactly, but there exist
approximate corrections [86]. Interestingly, this same kernel can also be derived from approximate
inference in the neural network, leading to an implicit linearization [87]. This linearization can also
be made explicit and can then be used to improve the performance of BNN predictives [88] and
for fast domain adaptation in multi-task learning [89]. Moreover, when using the NTK in a kernel
machine, such as a support vector machine, it can outperform the original neural network it was
derived from, at least in the small-data regime [90]. Similarly to the aforementioned NNGP kernels,
the NTKs for different architectures can also be rederived using the framework of tensor programs
[80, 91] and there exist practical Python packages for the efficient computation of NNGP kernels
and NTKs [92]. Finally, it should be noted that this linearization of neural networks has also been
linked to the scaling of the parameters and described as lazy training, which has been argued to be
inferior to standard neural network training [93].
choices, which we will explore in the following. Particularly, we will look at some proper probability
distributions that can directly replace the standard Gaussian (Section 3.1), at some structural priors
that also require changes to the architecture (Section 3.2), and finally at a particularly interesting
VAE model with idiosyncratic architecture and prior, namely the neural process (Section 3.3).
where µ is again the mean, κ the concentration parameter, and Γ(·) is the Gamma function. Since
the Gamma function is easier to evaluate than the modified Bessel function, this density allows for
closed form evaluation and reparameterizable sampling. Empirically, it yields the same performance
in VAEs as the vMF prior, while being numerically more stable [99].
Another type of prior is the mixture prior [100–102], typically a mixture of Gaussians of the form
$$ p(z) = \sum_{i=1}^{K} \pi_i\, \mathcal{N}(\mu_i, \sigma_i^2 I) \quad \text{with} \quad \sum_{i=1}^{K} \pi_i = 1 \tag{15} $$
with K mixture components, where the $\pi_i$ are the mixture weights, which are often set to $\pi_i = 1/K$
in the prior. These priors have been motivated by the idea that the data might consist of clusters,
which should also be disjoint in the latent space [100], and they have been shown to outperform
many other clustering methods on challenging datasets [102]. However, similarly to many other
clustering methods, one challenge is to choose the number of clusters K a priori. This can also be
optimized automatically, for instance by specifying a stick-breaking or Dirichlet process hyperprior
[103], albeit at the cost of more involved inference.
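The mixture prior of Eq. (15) is straightforward to sample from and to evaluate, which is what a VAE objective needs from its prior. The following is a minimal sketch with uniform default weights; the component means, standard deviations, and function names are assumptions for illustration.

```python
import numpy as np

def sample_mixture_prior(num_samples, means, stds, weights=None, seed=0):
    """Draw latents z from the Gaussian mixture prior of Eq. (15)."""
    rng = np.random.default_rng(seed)
    K, dim = means.shape
    if weights is None:
        weights = np.full(K, 1.0 / K)   # uniform weights pi_i = 1/K
    components = rng.choice(K, size=num_samples, p=weights)
    noise = rng.standard_normal((num_samples, dim))
    return means[components] + stds[components, None] * noise, components

def log_mixture_prior(z, means, stds, weights=None):
    """Log density log p(z) of Eq. (15), using a stable log-sum-exp."""
    K, dim = means.shape
    if weights is None:
        weights = np.full(K, 1.0 / K)
    sq_dist = np.sum((z[:, None, :] - means[None, :, :]) ** 2, axis=-1)  # shape (N, K)
    log_comp = (-0.5 * sq_dist / stds**2
                - dim * np.log(stds)
                - 0.5 * dim * np.log(2 * np.pi))
    log_terms = np.log(weights) + log_comp
    m = log_terms.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_terms - m).sum(axis=1, keepdims=True)))[:, 0]

# Assumed toy prior: two well-separated clusters in a 2-D latent space.
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
stds = np.array([0.5, 0.5])
z, components = sample_mixture_prior(1000, means, stds)
log_p = log_mixture_prior(z, means, stds)
```

The number of components K enters only through the shape of `means`, which makes explicit that it has to be fixed before inference unless a nonparametric hyperprior is used.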
Finally, most of these priors assume independence between data points. If we have prior knowledge
about potential similarity between data points and we can encode it into a kernel function, a Gaussian
process can be a powerful prior for a VAE [104–106]. The prior is usually defined as
p(Z) = N (0, Kzz ) (16)
where Z = (z1 , . . . , zn ) is the matrix of latent variables and Kzz is again the kernel matrix with
(i, j)’th element k(zi , zj ) for some suitable kernel function k(·, ·). These models have been shown
to excel at conditional generation [104], time series modeling [106], missing data imputation [105],
and disentanglement [107, 108]. It should be noted that this comes at additional computational cost
compared to standard VAEs, since it requires the O(n3 ) inversion of the kernel matrix (see Eq. (6)).
However, this operation can be made more scalable, either through the use of inducing point methods
[109, 110] (cf. Section 2.2) or through factorized kernels [111]. Moreover, depending on the prior
knowledge of the generative process, these models can also be extended to use additive GP priors
[112] or tensor-valued ones [113].
In contrast to the distributional priors discussed above, we will use the term structural priors to
refer to priors that do not only change the actual prior distribution p(z) in the VAE model, but also
the model architecture itself. Some of these structural priors are extensions of the distributional
priors mentioned above. For instance, the aforementioned Gaussian mixture priors can be extended
with a mixture-of-experts decoder, that is, a factorized generative likelihood, where each factor only
depends on one of the latent mixture components [102]. Another example are the Gaussian process
priors, which are defined over the whole latent dataset Z and thus benefit from a modified encoder
(i.e., inference network), which encodes the complete dataset X jointly [105].
In addition to these distributional priors with modified architectures, there are also structural priors
which could not be realized with the standard VAE architecture. One example are hierarchical priors
[114–116], such as
$$ p(z_1, \ldots, z_K) = p(z_1) \prod_{i=2}^{K} p(z_i \mid z_{i-1}) \tag{17} $$
$$ \text{or} \quad p(z_1, \ldots, z_K) = p(z_1) \prod_{i=2}^{K} p(z_i \mid z_1, \ldots, z_{i-1}) \tag{18} $$
We see here that instead of having a single latent variable z, these models feature K different latent
variables $\{z_1, \ldots, z_K\}$, which depend on each other hierarchically. These models require additional
generative networks to parameterize the conditional probabilities in Eq. (17), which then enable
them to better model data with intrinsically hierarchical features [114, 115] and to reach state-of-
the-art performance in image generation with VAEs [116].
Another type of structural priors are discrete latent priors, such as the VQ-VAE prior [117]
$$ p(z_q) = \frac{1}{|E|} \quad \text{with} \quad z_q = \arg\min_{e \in E} \| z_e - e \|_2^2 \tag{19} $$
where E is a finite dictionary of prototypes and ze is a continuous latent variable that is then dis-
cretized to zq . Crucially, the prior is not placed over the continuous ze , but over the discrete zq ,
namely as a uniform prior over the dictionary E. These discrete latent variables can then be saved
very cheaply and thus lead to much stronger compression than standard VAEs [117]. When combin-
ing these models with the hierarchical latent variables described above, they can also reach competi-
tive image generation performance [118]. Moreover, these discrete latent variables can be extended
to include neighborhood structures such as self-organizing maps [119], leading to more interpretable
latent representations that can also be used for clustering [120–122].
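The discretization step in Eq. (19) amounts to a nearest-neighbor lookup in the dictionary E, followed by a uniform prior over the selected indices. A minimal sketch follows, with an assumed toy codebook and encoder outputs; the function names are hypothetical and the sketch covers only the quantization and the prior, not the training of a VQ-VAE.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each continuous latent z_e to its nearest prototype, as in Eq. (19)."""
    sq_dist = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    indices = np.argmin(sq_dist, axis=1)
    return codebook[indices], indices

def log_prior(indices, codebook):
    """Uniform prior over the dictionary: log p(z_q) = -log |E|."""
    return np.full(len(indices), -np.log(len(codebook)))

# Assumed toy dictionary with |E| = 3 prototypes and three encoder outputs.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([[0.1, -0.2], [0.9, 1.3], [-0.8, 1.7]])
z_q, indices = quantize(z_e, codebook)
```

Only the integer `indices` need to be stored to recover z_q, which is the source of the cheap storage mentioned above.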
To conclude this section, we will look at a structural VAE prior that has spawned a lot of inter-
est in recent years and thus deserves its own subsection: the neural process (NP). This model has
been independently proposed under the names of partialVAE [123] and (conditional) neural process
[124, 125], but the latter nomenclature has caught on in the literature. The main novelty of this VAE
architecture is that it not only models the distribution of one type of observed variable x, but of
two variables (x, y), which can be split into a context and a target set, $(x, y) = (x_c, y_c) \cup (x_t, y_t)$.
These sets are conditionally independent given z, that is, $p(x, y \mid z) = p(x_c, y_c \mid z)\, p(x_t, y_t \mid z)$.
This then allows one to infer an unobserved $y_t$ based on the other variables using a variational approximation
$q(z \mid x_c, y_c)$ and the conditional likelihood $p(y_t \mid z, x_t)$. Thus, the model can be used for
missing data imputation [123] and regression [125] tasks. Note that, since the likelihood is typically
conditioned on xt instead of just on z, this model can be framed as a conditional VAE [126].
One remarkable feature of this model is its prior, namely
$$ p(z) = p(z \mid x_c, y_c) \approx q(z \mid x_c, y_c) \tag{20} $$
This means that instead of using an unconditional prior p(z) for the full posterior p(z | x, y), a
part of the data (the context set) is used to condition the prior, which is in turn approximated by
the variational posterior with reduced conditioning set. While this is atypical for classical Bayesian
inference and generally frowned upon by orthodox Bayesians, it bears resemblance to the data-dependent
oracle priors that can be used in PAC-Bayesian bounds and have been shown to make
those bounds tighter [127, 128].
The NP model has been heavily inspired by stochastic processes (hence the name) and has been
shown to constitute a stochastic process itself under some assumptions [125]. Moreover, when the
conditional likelihood p(yt | z, xt ) is chosen to be an affine transformation, the model is actually
equivalent to a Gaussian process with neural network kernel [129].
Since their inception, NP models have been extended in expressivity in different ways, both in
terms of their inference and their generative model. On the inference side, there are attentive NPs
[130], which endow the encoder with self-attention (and thus make it Turing complete [131]), and
convolutional (conditional) NPs [132, 133], which add translation equivariance to the model. On the
generative side, there are functional NPs [134], which introduce dependence between the predictions
by learning a relational graph structure over the latents z, and Gaussian NPs [135], which achieve a
similar property by replacing the generative likelihood with a Gaussian process, the mean and kernel
of which are inferred based on the latents.
where M is the mean matrix, U and V are the row and column covariances, and tr[·] is the trace
operator. These matrix-valued Gaussians can then also be used as variational distributions, leading
to increased performance compared to isotropic Gaussians on many tasks [147].
Another way to improve the expressiveness of Gaussian priors is to combine them with hierarchical
hyperpriors [148, 149], which has already been proposed in early work on BNNs [136] as
$$ p(\theta) = \int \mathcal{N}(\mu, \Sigma)\, p(\Sigma)\, \mathrm{d}\Sigma \tag{22} $$
where p(Σ) is a hyperprior over the covariance. An example of such a hyperprior is the inverse
Wishart distribution [e.g., 43], which is in d dimensions given by
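$$ p(\Sigma) = \mathcal{IW}_d(\nu, K) = \frac{(\det K)^{\frac{\nu+d-1}{2}}\, (\det \Sigma)^{-\frac{\nu+2d}{2}}\, \exp\left( -\frac{1}{2} \operatorname{tr}\left[ K \Sigma^{-1} \right] \right)}{2^{\frac{(\nu+d-1)d}{2}}\, \Gamma_d\left( \frac{\nu+d-1}{2} \right)} \tag{23} $$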
where ν are the degrees of freedom and K is the mean of p(Σ). When marginalizing the prior
in Eq. (22) over the hyperprior in Eq. (23), it turns out that one gets a d-dimensional multivariate
Student-t distribution with ν degrees of freedom [150], namely
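$$ p(\theta) = \frac{\Gamma\left( \frac{\nu+d}{2} \right)}{((\nu-2)\pi)^{\frac{d}{2}}\, \Gamma\left( \frac{\nu}{2} \right)}\, (\det K)^{-\frac{1}{2}} \left( 1 + \frac{(\theta - \mu)^\top K^{-1} (\theta - \mu)}{\nu - 2} \right)^{-\frac{\nu+d}{2}} \tag{24} $$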
Such distributions have been shown to model the predictive variance more flexibly in stochastic
processes [150] and BNNs [43]. Moreover, in BNNs, it has been shown that priors like these,
which are heavy-tailed (also including Laplace priors [151]) and allow for weight correlations, can
decrease the cold posterior effect [18], suggesting that they are less misspecified than isotropic
Gaussians. Finally, when using Student-t priors, it has been shown that one can obtain expressive
BNN posteriors even when forcing the posterior mean of the weights to be zero [152], which
highlights the flexibility of these distributions.
Another hierarchical Gaussian prior is the horseshoe prior [153], which is
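$$ p(\theta_i) = \mathcal{N}(0, \tau^2 \sigma_i^2) \quad \text{with} \quad p(\tau) = \mathcal{C}^+(0, b_0) \quad \text{and} \quad p(\sigma_i) = \mathcal{C}^+(0, b_1) \tag{25} $$
where $b_0$ and $b_1$ are scale parameters and $\mathcal{C}^+$ is the half-Cauchy distribution
$$ p(\sigma) = \mathcal{C}^+(\mu, b) = \begin{cases} \frac{2}{\pi b} \left( 1 + \frac{(\sigma - \mu)^2}{b^2} \right)^{-1} & \text{if } \sigma \geq \mu \\ 0 & \text{otherwise} \end{cases} \tag{26} $$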
In BNNs, the horseshoe prior can encourage sparsity [154] and enable interpretable feature selection
[155]. It can also be used to aid compression of the neural network weights [156]. Moreover,
in application areas such as genomics, where prior knowledge about the signal-to-noise ratio is
available, this knowledge can be encoded in such sparsity-inducing hierarchical priors [157].
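Sampling from the horseshoe prior in Eqs. (25) and (26) follows the hierarchy directly: draw one global scale τ, one local scale σ_i per weight, and then the weights themselves. The sketch below uses the fact that the absolute value of a Cauchy variable with location zero is half-Cauchy distributed; the scale values and function names are assumptions for illustration.

```python
import numpy as np

def sample_half_cauchy(rng, scale, size=None):
    """Half-Cauchy C+(0, scale): absolute value of a scaled standard Cauchy draw."""
    return scale * np.abs(rng.standard_cauchy(size))

def sample_horseshoe(num_weights, b0=1.0, b1=1.0, seed=0):
    """Draw weights theta_i ~ N(0, tau^2 sigma_i^2) under the prior of Eq. (25)."""
    rng = np.random.default_rng(seed)
    tau = sample_half_cauchy(rng, b0)                  # global scale, shared by all weights
    sigma = sample_half_cauchy(rng, b1, num_weights)   # local scales, one per weight
    theta = tau * sigma * rng.standard_normal(num_weights)
    return theta, tau, sigma

theta, tau, sigma = sample_horseshoe(1000)
```

Most local scales are small while a few are very large due to the heavy tails, which produces the mixture of strongly shrunk and essentially unshrunk weights that underlies the sparsity-inducing behavior.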
Another interesting prior is the radial-directional prior, which disentangles the direction of the
weight vector from its length [158]. It is given by
$$ \theta = \theta_r\, \theta_d \quad \text{with} \quad \theta_r \sim p_{\mathrm{rad}}(\theta_r) \quad \text{and} \quad \theta_d \sim p_{\mathrm{dir}}(\theta_d) \tag{27} $$
where $p_{\mathrm{dir}}$ is a distribution over the d-dimensional unit sphere and $p_{\mathrm{rad}}$ is a distribution over ℝ.
Oh et al. [158] propose using the von-Mises-Fisher distribution (see Eq. (13)) for
$p_{\mathrm{dir}}$ and the half-Cauchy (see Eq. (26)) for $p_{\mathrm{rad}}$. Conversely, Farquhar et al. [159] suggest using a
Gaussian for $p_{\mathrm{rad}}$ and a uniform distribution over the unit sphere for $p_{\mathrm{dir}}$, which they reparameterize
by sampling from a standard Gaussian and normalizing the sampled vectors to unit length. It should
be noted that the idea of the radial-directional prior is related to the Goldilocks zone hypothesis,
which says that there exists an annulus at a certain distance from the origin which has a particularly
high density of high-performing weight vectors [160].
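The second variant described above, with a Gaussian radius and a uniform direction obtained by normalizing standard Gaussian samples, can be sketched as follows. The mean and standard deviation of the radial Gaussian are assumptions, since the text does not specify them, and the function name is hypothetical.

```python
import numpy as np

def sample_radial_directional(num_samples, dim, radius_mean=1.0, radius_std=0.1, seed=0):
    """Draw theta = theta_r * theta_d as in Eq. (27), with Gaussian p_rad and uniform p_dir."""
    rng = np.random.default_rng(seed)
    # Direction: normalizing standard Gaussian draws gives a uniform sample on the unit sphere.
    g = rng.standard_normal((num_samples, dim))
    theta_d = g / np.linalg.norm(g, axis=1, keepdims=True)
    # Radius: Gaussian over the real line (mean and std are illustrative assumptions).
    theta_r = radius_mean + radius_std * rng.standard_normal((num_samples, 1))
    return theta_r * theta_d, theta_r, theta_d

theta, theta_r, theta_d = sample_radial_directional(1000, 5)
```

In contrast to an isotropic Gaussian, whose samples concentrate at a distance from the origin that grows with the dimension, the typical length of the weight vector is here controlled directly through the radial distribution.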
In terms of even more expressive priors, it has been proposed to model the parameters in terms of
the units of the neural network instead of the weights themselves [161]. The weight θij between
units i and j would then have the prior
$$ p(\theta_{ij}) = g(z_i, z_j, \epsilon) \quad \text{with} \quad p(z) = p(\epsilon) = \mathcal{N}(0, I) \tag{28} $$
where the function g can be either parameterized by a neural network [161] or by a Gaussian process
[162]. A similarly implicit model, with even more flexibility, has been proposed by Atanov et al.
[163] and is simply given by
$$ p(\theta) = g(z, \epsilon) \quad \text{with} \quad p(z) = p(\epsilon) = \mathcal{N}(0, I) \tag{29} $$
In both of these priors, the main challenge is to choose the function g. Since this is hard to do
manually, the function is usually (meta-)learned (see Section 5.3).
As we saw, there are many different weight space priors that one can choose for Bayesian neural
networks. However, choosing the right one can be challenging, since we often have better intuitions
about the functions we would expect rather than the parameters themselves. The trouble is then that
the mapping from parameters to functions in neural networks is highly non-trivial due to their many
weight-space symmetries [164] and complex function-space geometries [165]. This has led to an
alternative approach to prior specification in BNNs, namely to specify the priors directly in function
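space, such that
$$ \int \delta(\varphi(\cdot\,; \theta))\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta \approx p(f \mid \mathcal{D}) \propto p(\mathcal{D} \mid f)\, p(f) \tag{30} $$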
where p(f ) is the function-space prior, φ(· ; θ) is the function implemented by a neural network with
parameters θ and δ(·) is the Dirac delta measure (in function space).
As we have seen before (cf. Section 2), Gaussian processes offer an excellent model class to
encode functional prior knowledge through the choice of kernel and mean functions, that is,
p(f ) = GP(m(·), k(·, ·)). It is thus a natural idea to use GP priors as function-space priors for
BNNs. If one applies this idea in the most straightforward way, one can just optimize a posterior
that now depends on the KL divergence between the BNN posterior and the GP prior. However,
since this KL is defined in an infinite-dimensional space, it requires approximations, such as Stein
kernel gradient estimators [166]. Alternatively, one can first optimize a weight-space distribution on
a BNN to minimize the KL divergence with the desired GP prior (e.g., using Monte Carlo estimates)
and then use this optimized weight prior as the BNN prior during inference [167].
While both of these approaches seem reasonable at first sight, it has been discovered that GP and
BNN function-space distributions do not actually have the same support and that the true KL diver-
gence is thus infinite (or undefined) [168]. It has therefore recently been proposed to use the Wasser-
stein distance instead, although this also requires approximations [169]. If one wants to forego the
need for a well-defined divergence, one can also use a hypernetwork [170, 171] as an implicit dis-
tribution of BNN weights and then train the network to match the GP samples on a certain set of
function outputs [172]. Finally, it has recently been discovered that the ridgelet transform [173] can
be used to approximate GP function-space distributions with BNN weight-space distributions [174].
As a side note, the reverse can actually be achieved more easily, namely fitting
a GP to the outputs of a BNN [175], which can also be of interest in certain applications.
If one does not want to use a GP prior in function space, one can still encode useful functional prior
knowledge into BNN priors. For instance, through the study of the infinite-width limits of BNNs
(see Section 2.3), one finds that the activation function of the network has a strong influence on the
functions being implemented and one can, for instance, modulate the smoothness or periodicity of
the BNN output by choosing different activation functions [176]. Moreover, one can directly define
priors over the BNN outputs, which can encode strong prior assumptions about the values that the
functions are allowed to take in certain parts of the input space [177], that is,
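$$ p(\theta) = p_{\mathrm{base}}(\theta)\, D(\varphi(C_x; \theta), C_y) \implies p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, D(\varphi(C_x; \theta), C_y)\, p_{\mathrm{base}}(\theta) \tag{31} $$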
where pbase (θ) is some base prior in weight space, (Cx , Cy ) are the inputs and outputs in terms of
which the functional constraint is defined and D(·, ·) is a discrepancy function. We see that these
priors on output constraints end up looking like additional likelihood terms in the posterior and can
thus help to encourage specific features of the output function, for instance, to ensure safety features
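in critical applications. A similar idea is that of noise-contrastive priors, which are also specified in function
space directly through a prior over unseen data $p(\tilde{\mathcal{D}})$ [178], which yields the prior predictive
$$ p(\mathcal{D}^*) = \iint p(\mathcal{D}^* \mid \theta)\, p(\theta)\, p(\tilde{\mathcal{D}})\, \mathrm{d}\theta\, \mathrm{d}\tilde{\mathcal{D}} \tag{32} $$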
This prior can encode the belief that the epistemic uncertainty should grow away from the in-
distribution data and can thus also lead to more GP-like behavior in BNN posteriors. Finally, if
we have the prior belief that the BNN functions should not be much more complex than the ones
of a different function class (e.g., shallower or even linear models), we can use this other class as a
functional reference prior and thus regularize the predictive complexity of the model [179].
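As a minimal sketch of the output-constraint prior in Eq. (31), assume a toy linear model φ(x; θ) = θ0 + θ1 x, a standard Gaussian base prior, and a Gaussian-shaped term D that is largest when the model output at the constraint inputs Cx matches Cy. The model, the shape of D, the `tolerance` parameter, and all function names are illustrative assumptions and not the construction used in [177].

```python
import numpy as np

def model(x, theta):
    # Hypothetical toy network phi(x; theta): a linear function of x.
    return theta[0] + theta[1] * x

def log_base_prior(theta):
    # Standard Gaussian base prior p_base(theta) in weight space.
    return -0.5 * np.sum(theta ** 2 + np.log(2.0 * np.pi))

def log_constraint(theta, c_x, c_y, tolerance=0.1):
    # log D(phi(C_x; theta), C_y): zero when the constraint holds exactly,
    # increasingly negative as the outputs move away from C_y.
    return -0.5 * np.sum((model(c_x, theta) - c_y) ** 2) / tolerance ** 2

def log_prior(theta, c_x, c_y):
    # Unnormalized log prior: log p_base(theta) + log D(phi(C_x; theta), C_y).
    return log_base_prior(theta) + log_constraint(theta, c_x, c_y)

c_x, c_y = np.array([0.0]), np.array([1.0])  # constraint: output 1 at x = 0
theta_ok = np.array([1.0, 0.0])              # satisfies the constraint
theta_bad = np.array([0.0, 0.0])             # violates the constraint
```

Under these assumptions, `log_prior(theta_ok, c_x, c_y)` exceeds `log_prior(theta_bad, c_x, c_y)`, so the constraint term acts like an additional likelihood term that favors weights whose outputs respect the constraint.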
Deep neural network ensembles, or deep ensembles, are a frequentist method similar to the bootstrap
[180] that has been used to gain uncertainty estimates in neural networks [181]. However, it has been
recently argued that these ensembles actually approximate the BNN posterior predictive [139], that
is
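$$ p(\mathcal{D}^* \mid \mathcal{D}) = \int p(\mathcal{D}^* \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta \approx \sum_{i=1}^{K} p(\mathcal{D}^* \mid \theta_i) \tag{33} $$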
where θi are the weights of K independently trained ensemble members of the same architecture.
These models can also be extended to ensembles with different hyperparameters [182], thus also
approximating a hierarchical hyperposterior. Moreover, they can be made more parameter-efficient
by sharing certain parameters between ensemble members [183], which can then also be used for
approximate BNN inference [144]. While these models have performed well in many practical tasks
[6], they can still severely overfit in some scenarios [184], leading to ill-calibrated uncertainties
[185]. However, it has been shown recently that each ensemble member can be combined with a
random function that is sampled from a function-space prior [186, 187], and that this can indeed
yield uncertainties that are conservative with respect to the Bayesian ones [188]. More specifically,
the uncertainties of such ensembles are with high probability at least as large as the ones from a
Gaussian process with the corresponding NNGP kernel (see Section 2.3). These results can also be
extended to the NTK [189].
Another way of making these deep ensembles more Bayesian and incorporating priors is to use
particle-based approximate inference methods, such as Stein variational gradient descent (SVGD) [190].
In SVGD, the ensemble members (or particles) are updated according to
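$$ \theta_i \leftarrow \theta_i + \eta\, \phi(\theta_i) \quad \text{with} \quad \phi(\theta_i) = \sum_{j=1}^{K} \left[ k(\theta_i, \theta_j)\, \nabla_{\theta_j} \log p(\theta_j \mid \mathcal{D}) - \nabla_{\theta_i} k(\theta_i, \theta_j) \right] \tag{34} $$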
where η is a step-size and k(·, ·) is a kernel function in weight space. With the right step-size
schedule, this update rule converges asymptotically to the true posterior [191] and even enjoys some
non-asymptotic guarantees [192]. Moreover, note that it only requires sample-based access to the
gradient of the log posterior (and thus also the log prior), which allows it to be used with different
weight-space priors [193] and even function-space priors, such as GPs [194].
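The SVGD update in Eq. (34) can be sketched in a few lines of NumPy. The sketch below assumes an RBF kernel with a fixed bandwidth and divides the update by the number of particles K (the usual averaging convention, which can equally be absorbed into the step size); the function names and the toy one-dimensional Gaussian "posterior" are hypothetical choices made for illustration only.

```python
import numpy as np

def rbf_kernel(theta, bandwidth=1.0):
    # Kernel matrix k(theta_i, theta_j) and its gradient w.r.t. theta_i.
    diff = theta[:, None, :] - theta[None, :, :]        # (K, K, d): theta_i - theta_j
    sq_dist = np.sum(diff ** 2, axis=-1)                # (K, K)
    k = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    grad_k = -diff / bandwidth ** 2 * k[:, :, None]     # grad_{theta_i} k(theta_i, theta_j)
    return k, grad_k

def svgd_step(theta, grad_log_post, step_size=0.1, bandwidth=1.0):
    # theta: (K, d) array of particles; grad_log_post maps (K, d) -> (K, d).
    k, grad_k = rbf_kernel(theta, bandwidth)
    scores = grad_log_post(theta)
    attraction = k @ scores                  # sum_j k(theta_i, theta_j) * score_j
    repulsion = -np.sum(grad_k, axis=1)      # -sum_j grad_{theta_i} k(theta_i, theta_j)
    phi = (attraction + repulsion) / theta.shape[0]
    return theta + step_size * phi

# Toy target: a Gaussian log posterior with mean 2 and standard deviation 0.5.
grad_log_post = lambda th: -(th - 2.0) / 0.25

rng = np.random.default_rng(0)
particles = rng.normal(size=(20, 1))
for _ in range(1000):
    particles = svgd_step(particles, grad_log_post)
```

Only the gradient of the log posterior enters the update, so swapping the prior amounts to changing `grad_log_post`. The repulsion term keeps the particles from collapsing onto the mode, which is what distinguishes this scheme from independently trained ensemble members.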
5 Learning Priors
So far, we have explored different types of distributions and methods to encode our prior knowledge
into Bayesian deep learning models. But what if we do not have any useful prior knowledge to
encode? While orthodox Bayesianism would prescribe an uninformative prior in such a case [1, 15],
there are alternative ways to elicit priors, namely by learning them from data. If we go
the traditional route of Bayesian model selection using the marginal likelihood (the term p(D) in
Eq. (1)), we can choose a functional form p(θ; ψ) for the prior and optimize its hyperparameters ψ
with respect to this quantity. This is called empirical Bayes [16] or type-II maximum likelihood (ML-
II) estimation [35]. While there are reasons to be worried about overfitting in such a setting, there
are also arguments that the marginal likelihood automatically trades off the goodness of fit with the
model complexity and thus leads to model parsimony in the spirit of Occam’s razor principle [195].
In the case where we have previously solved tasks that are related to the task at hand (so-called
meta-tasks), we can alternatively also rely on the framework of learning to learn [196, 197] or
meta-learning [198]. If we apply this idea to learning priors for Bayesian models in a hierarchical
Bayesian way, we arrive at Bayesian meta-learning [199–202]. This can then also be extended to
modern gradient-based methods [203–205].
While these ML-II optimization and Bayesian meta-learning ideas can in principle be used to learn
hyperparameters for most of the priors discussed above, we will briefly review some successful
examples of their application below. Following the general structure from above, we will explore
learning priors for Gaussian processes (Section 5.1), variational autoencoders (Section 5.2), and
Bayesian neural networks (Section 5.3).
5.1 Learning GP priors
Following the idea of ML-II optimization, we can use the marginal likelihood to select hyperparam-
eters for the mean and kernel functions of GPs. Conveniently, the marginal likelihood for GPs (with
Gaussian observation likelihood) is available in closed form as
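$$ \log p_\psi(y \mid x) = \log \int p(y \mid f, x)\, \mathcal{GP}(m_\psi(\cdot), k_\psi(\cdot, \cdot))\, df \tag{35} $$
$$ = -\frac{1}{2} \left[ (y - m(x))^\top (K_{xx} + \sigma^2 I)^{-1} (y - m(x)) + \log\det(K_{xx} + \sigma^2 I) + N \log 2\pi \right] $$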
with N being the number of data points, Kxx the kernel matrix on the data points, and σ² the noise
variance of the observation likelihood. We can see that the first term measures the goodness of fit, while the
second term (the log determinant of the kernel matrix) measures the complexity of the model and
thus incorporates the Occam’s razor principle [35].
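To make the closed-form expression in Eq. (35) concrete, the following sketch evaluates the GP log marginal likelihood with a Cholesky factorization and uses it for a crude ML-II selection of an RBF lengthscale over a small grid. The RBF kernel, the constant mean, the toy data, and the grid are assumptions made for this illustration; in practice one would typically use gradient-based optimization instead of a grid.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale, variance=1.0):
    # RBF kernel on one-dimensional inputs.
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_log_marginal_likelihood(x, y, lengthscale, noise_var, mean=0.0):
    n = x.shape[0]
    K = rbf_kernel(x, x, lengthscale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)                          # K = L L^T
    resid = y - mean
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, resid))
    data_fit = resid @ alpha                           # goodness-of-fit term
    log_det = 2.0 * np.sum(np.log(np.diag(L)))         # complexity term
    return -0.5 * (data_fit + log_det + n * np.log(2.0 * np.pi))

# ML-II over a small grid of candidate lengthscales on toy data.
x = np.linspace(0.0, 5.0, 30)
y = np.sin(x)
grid = np.array([0.01, 0.1, 1.0, 10.0])
scores = [gp_log_marginal_likelihood(x, y, ls, noise_var=0.01) for ls in grid]
best_lengthscale = grid[int(np.argmax(scores))]
```

Very short lengthscales are penalized by the log determinant term, while very long ones are penalized by the data-fit term, which is the trade-off described above.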
While this quantity can be optimized to select the hyperparameters of simple kernels, such as the
lengthscale of an RBF kernel, it can also be used for more expressive ones. For instance, one can
define a spectral mixture kernel in the Fourier domain and then optimize the basis functions’ coeffi-
cients using the marginal likelihood, which can recover a range of different kernel functions [206].
To make the kernels even more expressive, we can also allow for addition and multiplication of dif-
ferent kernels [207], which can ultimately lead to an automatic statistician [208], that is, a model
that can choose its own problem-dependent kernel combination based on the data and some kernel
grammar. While this model naïvely scales rather unfavorably due to the size of the combinatorial
search space, it can be made more scalable through cheaper approximations [209] or by making the
kernel grammar differentiable [210].
Another avenue, which was already alluded to above (see Section 2.1), is to use a neural network
to parameterize the kernel. The first attempt at this trained a deep belief network on the data and
then used it as the kernel function [40], but later approaches optimized the neural network kernel
directly using the marginal likelihood [37], often in combination with sparse approximations [38]
or stochastic variational inference [39] for scalability (see Eq. (7)). In this vein, it has recently been
proposed to regularize the Lipschitzness of the used neural network, in order for the learned kernel to
preserve distances between data points and thus improve its out-of-distribution uncertainties [211].
While all these approaches still rely on the log determinant term in Eq. (35) to protect them from
overfitting, it has been shown that this is unfortunately not effective enough when the employed
neural networks are overparameterized [43]. However, this can be remedied by adding a prior over
the neural network parameters, thus effectively turning them into BNNs and the whole model into a
proper hierarchical Bayesian model. It should be noted that these techniques can be used not only to
learn GP priors that work well for a particular task, but also to learn certain invariances from data
[212] or to fit GP priors to other (implicit) function-space distributions [175] (cf. Section 4.2).
As mentioned above, if we have related tasks available, we can use them to meta-learn the GP prior.
This can be applied to the kernel [48, 213] as well as the mean function [48], by optimizing the
marginal likelihood on these meta-tasks as
$$ \psi^* = \arg\max_\psi \sum_{i=1}^{K} \log p_\psi(y_i \mid x_i) \quad \text{with} \quad \mathcal{D}_M = \{(x_i, y_i)\}_{i=1}^{K} \tag{36} $$
where DM is the set of meta-tasks. Note that the mean function can only safely be optimized in this
meta-learning setting, but not in the ML-II setting, since Eq. (35) does not provide any complexity
penalty on the mean function and it would thus severely overfit. While meta-learning does not
risk overfitting on the actual training data (since it is not used), it might overfit on the meta-tasks,
if there are too few of them, or if they are too similar to each other [214, 215]. In the Bayesian
meta-learning setting, this can be overcome by specifying a hierarchical hyperprior, which turns out
to be equivalent to optimizing a PAC-Bayesian bound [216]. This has been shown to successfully
meta-learn GP priors from as few as five meta-tasks.
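The meta-learning objective in Eq. (36) simply sums the log marginal likelihoods of the individual meta-tasks. As a minimal sketch, assume a GP with a constant mean and an RBF kernel, so that ψ consists of the mean value and the lengthscale, and select ψ by grid search on five synthetic meta-tasks. The tasks, the grid, and the function names are hypothetical; this is not the procedure of [48] or [216] and includes no hyperprior.

```python
import numpy as np

def gp_log_ml(x, y, mean, lengthscale, noise_var=0.1):
    # Log marginal likelihood of a GP with constant mean and RBF kernel.
    n = x.shape[0]
    sq_dist = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * sq_dist / lengthscale ** 2) + noise_var * np.eye(n)
    resid = y - mean
    _, log_det = np.linalg.slogdet(K)
    quad = resid @ np.linalg.solve(K, resid)
    return -0.5 * (quad + log_det + n * np.log(2.0 * np.pi))

def meta_objective(tasks, mean, lengthscale):
    # Sum of the per-task log marginal likelihoods.
    return sum(gp_log_ml(x, y, mean, lengthscale) for x, y in tasks)

# Synthetic meta-tasks that share an offset of 3 but differ in phase.
x = np.linspace(0.0, 1.0, 10)
meta_tasks = [(x, 3.0 + 0.1 * np.sin(2.0 * np.pi * x + phase)) for phase in range(5)]

candidates = [(m, ls) for m in (0.0, 3.0, 6.0) for ls in (0.3, 1.0, 3.0)]
best_mean, best_lengthscale = max(
    candidates, key=lambda psi: meta_objective(meta_tasks, *psi)
)
```

Because the mean is selected on the meta-tasks rather than on the training data of the target task, the shared offset can be picked up without fitting the target data directly.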
5.2 Learning VAE priors
Variational autoencoders are already trained using the ELBO (see Eq. (12)), which is a lower bound
on the marginal likelihood. Moreover, their likelihood p(x | z) is trained on this objective, as op-
posed to being fixed a priori as in most other Bayesian models. One could thus expect that VAEs
would be well suited to also learn their prior using their ELBO. Indeed, the ELBO can be further
decomposed as
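$$ \mathcal{L}(x, \vartheta) = \mathbb{E}_{z \sim q_\vartheta(z \mid x)}\left[\log p_\vartheta(x \mid z)\right] - I_{q_\vartheta(z, x)}(z, x) - D_{\mathrm{KL}}\left(\bar{q}_\vartheta(z) \,\|\, p(z)\right) \tag{37} $$
where $I_{q_\vartheta(z, x)}(z, x)$ is the mutual information between z and x under the joint distribution
$q_\vartheta(z, x) = q_\vartheta(z \mid x)\, p(x)$ and $\bar{q}_\vartheta$ is the aggregated approximate posterior $\bar{q}_\vartheta(z) = \sum_{i=1}^{K} q_\vartheta(z \mid x_i)$.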
Since the KL term in this objective is the only term that depends on the prior and the complexity
of qϑ (z | x) is already penalized by the mutual information term, it has been argued that optimizing
the prior p(z) with respect to the ELBO could be beneficial [217]. One can then show that the opti-
mal prior under this objective is the aggregated posterior Ex∼p(x) [p(z | x)], where p(x) is the data
distribution [218].
As mentioned above, a more expressive family of prior distributions than the common standard
Gaussian priors are Gaussian mixture priors [100] (see Section 3.1). In particular, with an increasing
number of components, these mixtures can approximate any smooth distribution arbitrarily closely
[219]. These VAE priors can be optimized using the ELBO [220]; however, it has been found that this
can severely overfit [218], highlighting again that the marginal likelihood (or its lower bound) cannot
always protect against overfitting (see Section 5.1). Instead, it has been proposed to parameterize
the mixture components as variational posteriors on certain inducing points, that is,
$$ p(z) = \sum_{i=1}^{K} q(z \mid x_i) \tag{38} $$
where the xi ’s are learned [218]. This can indeed improve the VAE performance without overfitting,
and since the prior is defined in terms of inducing points in data space, it can also straightforwardly
be used with hierarchical VAEs [221].
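A minimal sketch of the mixture prior in Eq. (38) is given below. It assumes a toy linear-Gaussian encoder for q(z | x) and weights the K components by 1/K so that the mixture is a normalized density, a normalization that Eq. (38) leaves implicit. The encoder, its parameters `W` and `b`, and the function names are hypothetical stand-ins for a trained VAE encoder and learned inducing points.

```python
import numpy as np

def encoder(x, W, b):
    # Toy encoder: returns the mean and log-variance of q(z | x).
    h = x @ W + b
    d = h.shape[-1] // 2
    return h[..., :d], h[..., d:]

def log_gaussian(z, mu, log_var):
    # Log density of a diagonal Gaussian, summed over latent dimensions.
    return -0.5 * np.sum(
        log_var + np.log(2.0 * np.pi) + (z - mu) ** 2 / np.exp(log_var), axis=-1
    )

def inducing_mixture_log_prior(z, inducing_points, W, b):
    # log p(z) = log (1/K) sum_i q(z | x_i), with x_i the inducing points.
    mu, log_var = encoder(inducing_points, W, b)                     # (K, d) each
    log_comp = log_gaussian(z[:, None, :], mu[None], log_var[None])  # (N, K)
    m = log_comp.max(axis=1, keepdims=True)                          # stable log-sum-exp
    return m[:, 0] + np.log(np.mean(np.exp(log_comp - m), axis=1))

W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])   # maps 2-D inputs to 2 means and 2 log-variances
b = np.zeros(4)
inducing_points = np.array([[1.0, -1.0], [-1.0, 1.0]])
z = np.zeros((1, 2))
log_p = inducing_mixture_log_prior(z, inducing_points, W, b)
```

Since the prior reuses the encoder, the only additional parameters are the inducing points themselves, which live in data space and can be trained jointly with the rest of the model.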
Since mixture models can complicate the computation of the KL divergence and require the difficult
choice of a number of components K, an alternative is to use implicit priors, which are parameterized by
learnable functions. One specific example for image data has been proposed for VAEs in which the
latent space preserves the shape of the data, that is, the z’s are not just vectors, but 2D or 3D tensors.
In such models, one can define a hierarchical prior over z, which is parameterized by learnable
convolutions over the latent dimensions [222]. Another way of specifying a learnable hierarchical
prior is to use memory modules, where the prior is then dependent on the stored memories and the
memory is learned together with the rest of the model [223]. More generally, one can define implicit
prior distributions in VAEs as
$$ z = g(\xi; \psi) \quad \text{with} \quad p(\xi) = \mathcal{N}(0, I) \tag{39} $$
where g(· ; ψ) is a learnable diffeomorphism, such as a normalizing flow [224]. This has been
successfully demonstrated with RealNVP flows [225], where it has been shown that the VAE can
learn very expressive latent representations even with a single latent dimension [226]. Moreover,
it has been shown that using an autoregressive flow [227] in this way for the prior is equivalent to
using an inverse autoregressive flow as part of the decoder [228].
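To illustrate Eq. (39), the sketch below uses the simplest possible diffeomorphism, an element-wise affine map, as a stand-in for a normalizing flow such as RealNVP. The prior density then follows from the change-of-variables formula, log p(z) = log N(g⁻¹(z); 0, I) − log |det ∂g/∂ξ|. The affine form and the parameter names `log_scale` and `shift` are assumptions made for illustration.

```python
import numpy as np

def flow_forward(xi, log_scale, shift):
    # z = g(xi; psi): element-wise affine map with psi = (log_scale, shift).
    return np.exp(log_scale) * xi + shift

def flow_inverse(z, log_scale, shift):
    # xi = g^{-1}(z; psi).
    return (z - shift) / np.exp(log_scale)

def flow_prior_log_density(z, log_scale, shift):
    # Change of variables: base log density at xi minus log |det dg/dxi|.
    xi = flow_inverse(z, log_scale, shift)
    log_base = -0.5 * np.sum(xi ** 2 + np.log(2.0 * np.pi), axis=-1)
    log_det_jacobian = np.sum(log_scale)
    return log_base - log_det_jacobian

log_scale = np.log(np.array([2.0, 0.5]))
shift = np.array([1.0, 0.0])

rng = np.random.default_rng(0)
xi = rng.standard_normal((5, 2))           # samples from the base distribution
z = flow_forward(xi, log_scale, shift)     # samples from the learned prior
log_p = flow_prior_log_density(z, log_scale, shift)
```

For this affine map the resulting prior is just a Gaussian with mean `shift` and standard deviation `exp(log_scale)`; more expressive flows follow the same pattern with a more flexible g and its log-determinant.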
Finally, one can also reshape some base prior by a multiplicative term, that is
$$ p(z) \propto p_{\mathrm{base}}(z)\, \alpha(z; \psi) \quad \text{with} \quad p_{\mathrm{base}}(z) = \mathcal{N}(0, I) \tag{40} $$
where α(z; ψ) is some learnable acceptance function [229]. Depending on the form of the α-
function, the normalization constant of this prior might be intractable, thus requiring approxima-
tions such as accept/reject sampling [229]. Interestingly, when defining an energy E(z; ψ) =
− log α(z; ψ), the model above can be seen as a latent energy-based model [230, 231]. More-
over, when defining this function in terms of a discriminator d(·) in the data space, that is,
α(z; ψ) = Ex∼p(x | z) [d(x; ψ)], this yields a so-called pull-back prior [232], which is related to
generative adversarial networks [233].
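Sampling from the reweighted prior in Eq. (40) can be sketched with plain accept/reject sampling, assuming that the acceptance function α takes values in [0, 1]: draw z from the base prior and keep it with probability α(z). This is a generic rejection sampler written for illustration, with a hypothetical hard-threshold acceptance function; it is not the specific scheme of [229].

```python
import numpy as np

def sample_reweighted_prior(acceptance, num_samples, dim, rng):
    # Draws samples from p(z) proportional to N(z; 0, I) * acceptance(z),
    # assuming 0 <= acceptance(z) <= 1 for all z.
    samples = []
    while len(samples) < num_samples:
        z = rng.standard_normal(dim)          # proposal from the base prior
        if rng.uniform() < acceptance(z):     # accept with probability alpha(z)
            samples.append(z)
    return np.stack(samples)

# Hypothetical acceptance function: only keep latents with a positive first coordinate.
acceptance = lambda z: float(z[0] > 0.0)

rng = np.random.default_rng(0)
samples = sample_reweighted_prior(acceptance, num_samples=100, dim=2, rng=rng)
```

The expected number of proposals per accepted sample is the inverse of the average acceptance probability under the base prior, so this simple scheme becomes expensive when α is small almost everywhere.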
5.3 Learning BNN priors
Finally, we will consider learning priors for Bayesian neural networks. Due to the large dimen-
sionality of BNN weight spaces and the complex mapping between weights and functions (see
Section 4.1), learning BNN priors has not been attempted very often in the literature. A manual
prior specification procedure that may be loosely called “learning” is the procedure in Fortuin et al.
[18], where the authors train standard neural networks using gradient descent and use their empiri-
cal weight distributions to inform their prior choices. When it comes to proper ML-II optimization,
BNNs pose an additional challenge, because their marginal likelihoods are typically intractable and
even lower bounds are hard to compute. Learning BNN priors using ML-II has therefore so far
only focused on learning the parameters of Gaussian priors in BNNs with Gaussian approximate
posteriors, where the posteriors were computed either using moment-matching [149] or using the
Laplace-Generalized-Gauss-Newton method [234], that is
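$$ \log p(\mathcal{D}) \overset{\mathrm{Lap}}{\approx} \log q(\mathcal{D}) \overset{\mathrm{GGN}}{\approx} \log p(\mathcal{D}, \theta^*) - \frac{1}{2} \log\det\left(\frac{1}{2\pi} \hat{H}_{\theta^*}\right) \tag{41} $$
$$ \text{with} \quad \hat{H}_{\theta^*} = J_{\theta^*}^\top H^{L}_{\theta^*} J_{\theta^*} + H^{P}_{\theta^*} $$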
where q(D) is the marginal likelihood of a Laplace approximation, θ* = arg maxθ p(θ | D) is the
maximum a posteriori (MAP) estimate of the parameters, Ĥθ* is an approximate Hessian around
θ*, Jθ* is the Jacobian of the BNN outputs with respect to the parameters, H^L_θ* is the Hessian of
the negative log likelihood with respect to the BNN outputs, and H^P_θ* is the Hessian of the negative log prior. Using this approximation, the marginal
likelihood is actually differentiable with respect to the prior hyperparameters ψ, such that they can
be trained together with the BNN posterior [234].
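The Laplace-GGN estimate of the log marginal likelihood can be checked on a model where it is exact. The sketch below assumes Bayesian linear regression with a Gaussian likelihood and an isotropic Gaussian prior with precision `prior_precision`, evaluates the log joint log p(D | θ*) + log p(θ*) at the MAP estimate, and uses Hessians of the negative log likelihood and negative log prior. The model and all names are illustrative assumptions rather than the implementation of [234].

```python
import numpy as np

def laplace_ggn_log_evidence(X, y, prior_precision, noise_var):
    n, d = X.shape
    J = X                                    # Jacobian of the outputs f = X theta
    H_lik = np.eye(n) / noise_var            # Hessian of the negative log likelihood w.r.t. f
    H_prior = prior_precision * np.eye(d)    # Hessian of the negative log prior
    H = J.T @ H_lik @ J + H_prior            # GGN approximation of the Hessian
    theta_map = np.linalg.solve(H, X.T @ y / noise_var)
    resid = y - X @ theta_map
    log_lik = -0.5 * (resid @ resid / noise_var + n * np.log(2.0 * np.pi * noise_var))
    log_prior = -0.5 * (prior_precision * theta_map @ theta_map
                        - d * np.log(prior_precision) + d * np.log(2.0 * np.pi))
    _, log_det = np.linalg.slogdet(H / (2.0 * np.pi))
    return log_lik + log_prior - 0.5 * log_det

def exact_log_evidence(X, y, prior_precision, noise_var):
    # Closed form: y ~ N(0, noise_var * I + X X^T / prior_precision).
    n = X.shape[0]
    cov = noise_var * np.eye(n) + X @ X.T / prior_precision
    _, log_det = np.linalg.slogdet(cov)
    return -0.5 * (y @ np.linalg.solve(cov, y) + log_det + n * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.standard_normal(20)
approx = laplace_ggn_log_evidence(X, y, prior_precision=2.0, noise_var=0.25)
exact = exact_log_evidence(X, y, prior_precision=2.0, noise_var=0.25)
```

For this linear-Gaussian model the posterior is Gaussian, so `approx` and `exact` agree up to numerical error; for a nonlinear BNN the same expression is only an approximation. Evaluating it for several values of `prior_precision` gives a simple ML-II procedure for the prior.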
Again, if meta-tasks are available, one can try to meta-learn the BNN prior. For CNNs, one can for
instance train standard neural networks on the meta-tasks and then learn a generative model (e.g., a
VAE) for the filter weights. This generative model can then be used as a BNN prior for convolutional
filters [163]. In the case of only few meta-tasks, we can also again successfully use PAC-Bayesian
bounds to avoid meta-overfitting, at least when meta-learning Gaussian BNN priors [216]. Finally,
if we do not have access to actual meta-tasks, but we are aware of invariances in our data, we can
construct meta-tasks using data augmentation and use them to learn a prior that is (approximately)
invariant to these augmentations [235], that is
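$$ \psi^* = \arg\min_\psi \; \mathbb{E}_{\theta \sim p(\theta; \psi)}\, \mathbb{E}_{\tilde{x} \sim q(\tilde{x} \mid x)} \left[ D_{\mathrm{KL}}\left(p(y \mid x, \theta) \,\|\, p(y \mid \tilde{x}, \theta)\right) \right] \tag{42} $$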
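As a toy illustration of the objective in Eq. (42), assume a model f(x; θ) = θ0 x² + θ1 x with a unit-variance Gaussian likelihood, a zero-mean Gaussian prior whose per-weight standard deviations play the role of ψ, and a single deterministic augmentation x̃ = −x. The KL divergence between the two predictive distributions is then 0.5 (f(x) − f(x̃))², and the expectation over the prior is estimated by Monte Carlo. All of these modeling choices and names are hypothetical and serve only to show the structure of the objective.

```python
import numpy as np

def model(x, theta):
    # theta[:, 0] weights the even feature x^2, theta[:, 1] the odd feature x.
    return theta[:, :1] * x[None, :] ** 2 + theta[:, 1:] * x[None, :]

def invariance_objective(prior_std, x, num_samples, rng):
    # Monte Carlo estimate of E_theta[ KL(p(y | x, theta) || p(y | x_aug, theta)) ].
    theta = prior_std * rng.standard_normal((num_samples, 2))  # theta ~ p(theta; psi)
    x_aug = -x                                                 # sign-flip augmentation
    kl = 0.5 * (model(x, theta) - model(x_aug, theta)) ** 2    # KL for unit-variance Gaussians
    return float(kl.mean())

x = np.linspace(-1.0, 1.0, 11)
rng = np.random.default_rng(0)
loss_invariant = invariance_objective(np.array([1.0, 0.0]), x, 1000, rng)
loss_generic = invariance_objective(np.array([1.0, 1.0]), x, 1000, rng)
```

In this toy setting the objective vanishes when the prior puts no mass on the odd feature, so minimizing it over ψ drives the prior toward functions that are invariant to the sign flip.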
6 Conclusion
We have argued that choosing good priors in Bayesian models is crucial to actually achieve the
theoretical and empirical properties that they are commonly celebrated for, including uncertainty
estimation, model selection, and optimal decision support. While practitioners in Bayesian deep
learning currently often resort to the option of isotropic Gaussian (or similarly uninformative) pri-
ors, we have also highlighted that these priors are usually misspecified and can lead to several unin-
tended negative consequences during inference. On the other hand, well chosen priors can improve
performance and even enable novel applications. Luckily, a plethora of alternative prior choices is
available for popular Bayesian deep learning models, such as (deep) Gaussian processes, variational
autoencoders, and Bayesian neural networks. Moreover, in certain cases, useful priors for these
models can even be learned from data alone.
We hope that this survey—while necessarily being incomplete in certain ways—has provided the
interested reader with a first overview of the existing literature on priors for Bayesian deep learning
and with some guidance on how to choose them. We also hope to encourage practitioners in this
field to consider their prior choices a bit more carefully, and to potentially choose one of the priors
presented here instead of the standard Gaussian ones, or better yet, to use inspiration from these
priors and come up with even better-suited ones for their own models. If only a small fraction of the
time usually spent thinking about increasingly elaborate inference techniques is instead spent
on thinking about the priors used, this effort will have been worthwhile.
Acknowledgments
We acknowledge funding from the Swiss Data Science Center through a PhD fellowship. We thank
Alex Immer, Adrià Garriga-Alonso, and Claire Vernade for helpful feedback on the draft.
References
[1] Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B
Rubin. Bayesian data analysis. CRC press, 2013.
[2] Kevin P Murphy. Machine learning: a probabilistic perspective. 2012.
[3] Thomas Bayes. An essay towards solving a problem in the doctrine of chances. By the
late Rev. Mr. Bayes, FRS communicated by Mr. Price, in a letter to John Canton, AMFRS.
Philosophical transactions of the Royal Society of London, (53):370–418, 1763.
[4] Pierre Simon Laplace. Mémoire sur la probabilité de causes par les évenements. Mémoire de
l’académie royale des sciences, 1774.
[5] Andrew Gelman, Xiao-Li Meng, and Hal Stern. Posterior predictive assessment of model
fitness via realized discrepancies. Statistica sinica, pages 733–760, 1996.
[6] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin,
Joshua V Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your
model’s uncertainty? Evaluating predictive uncertainty under dataset shift. arXiv preprint
arXiv:1906.02530, 2019.
[7] Robert E Kass, Bradley P Carlin, Andrew Gelman, and Radford M Neal. Markov chain
Monte Carlo in practice: a roundtable discussion. The American Statistician, 52(2):93–100,
1998.
[8] Martin J Wainwright and Michael Irwin Jordan. Graphical models, exponential families, and
variational inference. Now Publishers Inc, 2008.
[9] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for
statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
[10] Fernando Llorente, Luca Martino, David Delgado, and Javier Lopez-Santiago. Marginal
likelihood computation for model selection and hypothesis testing: an extensive review. arXiv
preprint arXiv:2005.08334, 2020.
[11] Edwin Fong and CC Holmes. On the marginal likelihood and cross-validation. Biometrika,
107(2):489–496, 2020.
[12] Andrew Gelman. Bayesian model-building by pure thought: some principles and examples.
Statistica Sinica, pages 215–232, 1996.
[13] Christian Robert. The Bayesian choice: from decision-theoretic foundations to computational
implementation. Springer Science & Business Media, 2007.
[14] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Pro-
ceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186
(1007):453–461, 1946.
[15] Edwin T Jaynes. Prior probabilities. IEEE Transactions on systems science and cybernetics,
4(3):227–241, 1968.
[16] Herbert Robbins. An Empirical Bayes Approach to Statistics. Office of Scientific Research,
US Air Force, 1955.
[17] Ilja Klebanov, Alexander Sikorski, Christof Schütte, and Susanna Röblitz. Objective priors
in the empirical Bayes framework. Scandinavian Journal of Statistics, 2020.
[18] Vincent Fortuin, Adrià Garriga-Alonso, Florian Wenzel, Gunnar Rätsch, Richard Turner,
Mark van der Wilk, and Laurence Aitchison. Bayesian Neural Network Priors Revisited.
arXiv preprint arXiv:2102.06571, 2021.
[19] Joseph L Doob. Application of the theory of martingales. Le calcul des probabilites et ses
applications, pages 23–27, 1949.
[20] Bas JK Kleijn, Aad W van der Vaart, et al. The Bernstein-von-Mises theorem under misspec-
ification. Electronic Journal of Statistics, 6:354–381, 2012.
[21] Andrew Gelman, Daniel Simpson, and Michael Betancourt. The prior can often only be
understood in the context of the likelihood. Entropy, 19(10):555, 2017.
[22] Andrew Gelman and Yuling Yao. Holes in Bayesian statistics. Journal of Physics G: Nuclear
and Particle Physics, 48(1):014002, 2020.
[23] Bruno De Finetti. Sul significato soggettivo della probabilita. Fundamenta mathematicae, 17
(1):298–329, 1931.
[24] Morris L Eaton, David A Freedman, et al. Dutch book against some ‘objective’ priors.
Bernoulli, 10(5):861–872, 2004.
[25] Leonard J Savage. The foundations of statistics. Courier Corporation, 1972.
[26] Simone Cerreia-Vioglio, Lars Peter Hansen, Fabio Maccheroni, and Massimo Marinacci.
Making Decisions under Model Misspecification. arXiv preprint arXiv:2008.01071, 2020.
[27] Andrés R Masegosa. Learning under model misspecification: Applications to variational and
ensemble methods. arXiv preprint arXiv:1912.08335, 2019.
[28] Warren R Morningstar, Alexander A Alemi, and Joshua V Dillon. PAC^m-Bayes: Narrowing
the Empirical Risk Gap in the Misspecified Bayesian Regime. arXiv preprint
arXiv:2010.09629, 2020.
[29] David H Wolpert. The lack of a priori distinctions between learning algorithms. Neural
computation, 8(7):1341–1390, 1996.
[30] Peter Grünwald, Thijs Van Ommen, et al. Inconsistency of Bayesian inference for misspec-
ified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4):1069–1103,
2017.
[31] Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient
as variational inference. In International Conference on Machine Learning, pages 5852–5861.
PMLR, 2018.
[32] Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E Turner, Rio
Yokota, and Mohammad Emtiyaz Khan. Practical deep learning with bayesian principles.
arXiv preprint arXiv:1906.02506, 2019.
[33] Florian Wenzel, Kevin Roth, Bastiaan Veeling, Jakub Swiatkowski, Linh Tran, Stephan
Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How Good
is the Bayes Posterior in Deep Neural Networks Really? In International Conference on
Machine Learning, pages 10248–10259. PMLR, 2020.
[34] CKI Williams and CE Rasmussen. Gaussian Processes for Regression. In Ninth Annual
Conference on Neural Information Processing Systems (NIPS 1995), pages 514–520. MIT
Press, 1996.
[35] Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine
Learning. MIT Press, 2006.
[36] Bernt Oksendal. Stochastic differential equations: an introduction with applications. Springer
Science & Business Media, 2013.
[37] Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth. Mani-
fold Gaussian processes for regression. In 2016 International Joint Conference on Neural
Networks (IJCNN), pages 3338–3345. IEEE, 2016.
[38] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel
learning. In Artificial intelligence and statistics, pages 370–378. PMLR, 2016.
[39] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Stochastic
variational deep kernel learning. arXiv preprint arXiv:1611.00336, 2016.
[40] Ruslan Salakhutdinov and Geoffrey Hinton. Using deep belief nets to learn covariance ker-
nels for Gaussian processes. In Proceedings of the 20th International Conference on Neural
Information Processing Systems, pages 1249–1256, 2007.
[41] Miguel Lázaro-Gredilla and Aníbal R Figueiras-Vidal. Marginalized neural network mixtures
for large-scale regression. IEEE transactions on neural networks, 21(8):1345–1351, 2010.
[42] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sun-
daram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using
deep neural networks. In International conference on machine learning, pages 2171–2180.
PMLR, 2015.
[43] Sebastian W Ober and Laurence Aitchison. Global inducing point variational posteriors for
bayesian neural networks and deep gaussian processes. arXiv preprint arXiv:2005.08140,
2020.
[44] Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being bayesian, even just a bit, fixes
overconfidence in relu networks. In International Conference on Machine Learning, pages
5436–5446. PMLR, 2020.
[45] Joe Watson, Jihao Andreas Lin, Pascal Klink, Joni Pajarinen, and Jan Peters. Latent Deriva-
tive Bayesian Last Layer Networks. In International Conference on Artificial Intelligence
and Statistics, pages 1198–1206. PMLR, 2021.
[46] John Bradshaw, Alexander G de G Matthews, and Zoubin Ghahramani. Adversarial examples,
uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks. arXiv
preprint arXiv:1707.02476, 2017.
[47] Tomoharu Iwata and Zoubin Ghahramani. Improving output uncertainty estimation and
generalization in deep learning via neural network Gaussian processes. arXiv preprint
arXiv:1707.05922, 2017.
[48] Vincent Fortuin, Heiko Strathmann, and Gunnar Rätsch. Meta-Learning Mean Functions for
Gaussian Processes. arXiv e-prints, pages arXiv–1901, 2019.
[49] Mark van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional Gaussian
processes. In Proceedings of the 31st International Conference on Neural Information Pro-
cessing Systems, pages 2845–2854, 2017.
[50] Andreas Damianou and Neil D Lawrence. Deep gaussian processes. In Artificial intelligence
and statistics, pages 207–215. PMLR, 2013.
[51] David Duvenaud, Oren Rippel, Ryan Adams, and Zoubin Ghahramani. Avoiding pathologies
in very deep networks. In Artificial Intelligence and Statistics, pages 202–210. PMLR, 2014.
[52] Tim GJ Rudner, Dino Sejdinovic, and Yarin Gal. Inter-domain deep Gaussian processes. In
International Conference on Machine Learning, pages 8286–8294. PMLR, 2020.
[53] Matthew M Dunlop, Mark A Girolami, Andrew M Stuart, and Aretha L Teckentrup. How
deep are deep Gaussian processes? Journal of Machine Learning Research, 19(54):1–46,
2018.
[54] Vinayak Kumar, Vaibhav Singh, PK Srijith, and Andreas Damianou. Deep Gaussian pro-
cesses with convolutional kernels. arXiv preprint arXiv:1806.01655, 2018.
[55] Kenneth Blomqvist, Samuel Kaski, and Markus Heinonen. Deep convolutional Gaussian
processes. In Joint European Conference on Machine Learning and Knowledge Discovery in
Databases, pages 582–597. Springer, 2019.
[56] Hugh Salimbeni, Vincent Dutordoir, James Hensman, and Marc Deisenroth. Deep Gaussian
processes with importance-weighted variational inference. In International Conference on
Machine Learning, pages 5589–5598. PMLR, 2019.
[57] Thang Bui, Daniel Hernández-Lobato, Jose Hernandez-Lobato, Yingzhen Li, and Richard
Turner. Deep Gaussian processes for regression using approximate expectation propagation.
In International conference on machine learning, pages 1472–1481. PMLR, 2016.
[58] Zhenwen Dai, Andreas C Damianou, Javier González, and Neil D Lawrence. Variational
Auto-encoded Deep Gaussian Processes. In ICLR, 2016.
[59] Joaquin Quinonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approxi-
mate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959,
2005.
[60] Edward Snelson and Zoubin Ghahramani. Local and global sparse Gaussian process approx-
imations. In Artificial Intelligence and Statistics, pages 524–531. PMLR, 2007.
[61] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In
Artificial intelligence and statistics, pages 567–574. PMLR, 2009.
[62] James Hensman, Nicolò Fusi, and Neil D Lawrence. Gaussian processes for Big data. In
Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages
282–290, 2013.
[63] Vincent Fortuin, Gideon Dresdner, Heiko Strathmann, and Gunnar Rätsch. Scalable gaussian
processes on discrete domains. arXiv preprint arXiv:1810.10368, 2018.
[64] Hugh Salimbeni and Marc Peter Deisenroth. Doubly stochastic variational inference for deep
Gaussian processes. In Proceedings of the 31st International Conference on Neural Informa-
tion Processing Systems, pages 4591–4602, 2017.
[65] Kurt Cutajar, Edwin V Bonilla, Pietro Michiardi, and Maurizio Filippone. Random feature
expansions for deep Gaussian processes. In International Conference on Machine Learning,
pages 884–893. PMLR, 2017.
[66] Gregory W Benton, Wesley J Maddox, Jayson P Salkey, Júlio Albinati, and Andrew Gordon
Wilson. Function-space distributions over kernels. Advances in Neural Information Process-
ing Systems, 32, 2019.
[67] Radford M Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto,
1995.
[68] Christopher KI Williams. Computing with infinite networks. In Proceedings of the 9th Inter-
national Conference on Neural Information Processing Systems, pages 295–301, 1996.
[69] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Proceedings
of the 22nd International Conference on Neural Information Processing Systems, pages 342–
350, 2009.
[70] Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural
networks. arXiv preprint arXiv:1508.05133, 2015.
[71] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and
Jascha Sohl-Dickstein. Deep Neural Networks as Gaussian Processes. In International Con-
ference on Learning Representations, 2018.
[72] Alexander G de G Matthews, Jiri Hron, Mark Rowland, Richard E Turner, and Zoubin
Ghahramani. Gaussian Process Behaviour in Wide Deep Neural Networks. In International
Conference on Learning Representations, 2018.
[73] Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep Convolu-
tional Networks as shallow Gaussian Processes. In International Conference on Learning
Representations, 2018.
[74] Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A
Abolafia, Jeffrey Pennington, and Jascha Sohl-dickstein. Bayesian Deep Convolutional Net-
works with Many Channels are Gaussian Processes. In International Conference on Learning
Representations, 2018.
[75] Adrià Garriga-Alonso and Mark van der Wilk. Correlated Weights in Infinite Limits of Deep
Convolutional Neural Networks. arXiv preprint arXiv:2101.04097, 2021.
[76] Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, and Roman Novak. Infinite attention:
NNGP and NTK for deep attention networks. In International Conference on Machine Learn-
ing, pages 4376–4386. PMLR, 2020.
[77] Russell Tsuchida, Fred Roosta, and Marcus Gallagher. Richer priors for infinitely wide multi-
layer perceptrons. arXiv preprint arXiv:1911.12927, 2019.
[78] Stefano Peluchetti, Stefano Favaro, and Sandra Fortini. Stable behaviour of infinitely wide
deep neural networks. In International Conference on Artificial Intelligence and Statistics,
pages 1137–1146. PMLR, 2020.
[79] Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, and Jascha Sohl-Dickstein.
Exact posterior distributions of wide Bayesian neural networks. arXiv preprint
arXiv:2006.10541, 2020.
[80] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian pro-
cess behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint
arXiv:1902.04760, 2019.
[81] Greg Yang. Tensor programs i: Wide feedforward or recurrent neural networks of any archi-
tecture are gaussian processes. arXiv preprint arXiv:1910.12478, 2019.
[82] Greg Yang. Tensor programs iii: Neural matrix laws. arXiv preprint arXiv:2009.10685, 2020.
[83] Greg Yang and Edward J Hu. Feature Learning in Infinite-Width Neural Networks. arXiv
preprint arXiv:2011.14522, 2020.
[84] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: convergence and
generalization in neural networks. In Proceedings of the 32nd International Conference on
Neural Information Processing Systems, pages 8580–8589, 2018.
[85] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Roman Novak, Jascha
Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear
models under gradient descent. Journal of Statistical Mechanics: Theory and Experiment,
2020(12):124002, 2020.
[86] Boris Hanin and Mihai Nica. Finite Depth and Width Corrections to the Neural Tangent
Kernel. In International Conference on Learning Representations, 2019.
[87] Mohammad Emtiyaz Khan, Alexander Immer, Ehsan Abedi, and Maciej Jan Korzepa. Ap-
proximate Inference Turns Deep Networks into Gaussian Processes. In 33rd Conference on
Neural Information Processing Systems, page 1751. Neural Information Processing Systems
Foundation, 2019.
[88] Alexander Immer, Maciej Korzepa, and Matthias Bauer. Improving predictions of Bayesian
neural networks via local linearization. arXiv preprint arXiv:2008.08400, 2020.
[89] Wesley Maddox, Shuai Tang, Pablo Moreno, Andrew Gordon Wilson, and Andreas Dami-
anou. Fast Adaptation with Linearized Neural Networks. In International Conference on
Artificial Intelligence and Statistics, pages 2737–2745. PMLR, 2021.
[90] Sanjeev Arora, Simon S Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, and Dingli
Yu. Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks. In International
Conference on Learning Representations, 2019.
[91] Greg Yang. Tensor Programs II: Neural tangent kernel for any architecture. arXiv preprint
arXiv:2006.14548, 2020.
[92] Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A Alemi, Jascha Sohl-
Dickstein, and Samuel S Schoenholz. Neural Tangents: Fast and Easy Infinite Neural Net-
works in Python. In International Conference on Learning Representations, 2019.
[93] Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On Lazy Training in Differentiable Pro-
gramming. Advances in Neural Information Processing Systems, 32:2937–2947, 2019.
[94] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint
arXiv:1312.6114, 2013.
[95] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation
and approximate inference in deep generative models. In International conference on machine
learning, pages 1278–1286. PMLR, 2014.
[96] Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. Hyper-
spherical variational auto-encoders. In 34th Conference on Uncertainty in Artificial Intelli-
gence 2018, UAI 2018, pages 856–865. Association For Uncertainty in Artificial Intelligence
(AUAI), 2018.
[97] Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of data science. Cambridge
University Press, 2020.
[98] Tim R Davidson, Jakub M Tomczak, and Efstratios Gavves. Increasing Expressivity of a
Hyperspherical VAE. arXiv preprint arXiv:1910.02912, 2019.
[99] Nicola De Cao and Wilker Aziz. The power spherical distribution. arXiv preprint
arXiv:2006.04437, 2020.
[100] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni,
Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with Gaussian mixture
variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
[101] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational
deep embedding: an unsupervised and generative approach to clustering. In Proceedings of
the 26th International Joint Conference on Artificial Intelligence, pages 1965–1972, 2017.
[102] Andreas Kopf, Vincent Fortuin, Vignesh Ram Somnath, and Manfred Claassen. Mixture-of-
Experts Variational Autoencoder for clustering and generating from similarity-based repre-
sentations. arXiv preprint arXiv:1910.07763, 2019.
[103] Eric Nalisnick and Padhraic Smyth. Stick-breaking variational autoencoders. arXiv preprint
arXiv:1605.06197, 2016.
[104] Francesco Paolo Casale, Adrian V Dalca, Luca Saglietti, Jennifer Listgarten, and Nicolo Fusi.
Gaussian process prior variational autoencoders. In Proceedings of the 32nd International
Conference on Neural Information Processing Systems, pages 10390–10401, 2018.
[105] Vincent Fortuin, Dmitry Baranchuk, Gunnar Rätsch, and Stephan Mandt. GP-VAE: Deep
probabilistic time series imputation. In International Conference on Artificial Intelligence
and Statistics, pages 1651–1661. PMLR, 2020.
[106] Michael Pearce. The Gaussian process prior VAE for interpretable latent dynamics from pixels.
In Symposium on Advances in Approximate Bayesian Inference, pages 1–12. PMLR, 2020.
[107] Sarthak Bhagat, Shagun Uppal, Zhuyun Yin, and Nengli Lim. Disentangling Multiple Fea-
tures in Video Sequences using Gaussian Processes in Variational Autoencoders. In European
Conference on Computer Vision, pages 102–117. Springer, 2020.
[108] Simon Bing, Vincent Fortuin, and Gunnar Rätsch. On Disentanglement in Gaussian Process
Variational Autoencoders. arXiv preprint arXiv:2102.05507, 2021.
[109] Metod Jazbec, Vincent Fortuin, Michael Pearce, Stephan Mandt, and Gunnar Rätsch. Scalable
Gaussian process variational autoencoders. arXiv preprint arXiv:2010.13472, 2020.
[110] Matthew Ashman, Jonathan So, William Tebbutt, Vincent Fortuin, Michael Pearce, and
Richard E Turner. Sparse Gaussian Process Variational Autoencoders. arXiv preprint
arXiv:2010.10177, 2020.
[111] Metod Jazbec, Michael Pearce, and Vincent Fortuin. Factorized Gaussian Process Variational
Autoencoders. arXiv preprint arXiv:2011.07255, 2020.
[112] Siddharth Ramchandran, Gleb Tikhonov, Miika Koskinen, and Harri Lähdesmäki. Longitudi-
nal Variational Autoencoder. arXiv preprint arXiv:2006.09763, 2020.
[113] Alex Campbell and Pietro Liò. tvGP-VAE: Tensor-variate Gaussian process prior variational
autoencoder. arXiv preprint arXiv:2006.04788, 2020.
[114] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther.
Ladder variational autoencoders. In Proceedings of the 30th International Conference on
Neural Information Processing Systems, pages 3745–3753, 2016.
[115] Shengjia Zhao, Jiaming Song, and Stefano Ermon. Learning hierarchical features from deep
generative models. In International Conference on Machine Learning, pages 4091–4099.
PMLR, 2017.
[116] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. arXiv
preprint arXiv:2007.03898, 2020.
[117] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representa-
tion learning. In Proceedings of the 31st International Conference on Neural Information
Processing Systems, pages 6309–6318, 2017.
[118] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images
with VQ-VAE-2. arXiv preprint arXiv:1906.00446, 2019.
[119] Teuvo Kohonen. Self-Organizing Maps. Springer Series in Information Sciences, 30:362,
1995.
[120] Vincent Fortuin, Matthias Hüser, Francesco Locatello, Heiko Strathmann, and Gunnar Rätsch.
SOM-VAE: Interpretable Discrete Representation Learning on Time Series. In International
Conference on Learning Representations, 2018.
[121] Florent Forest, Mustapha Lebbah, Hanane Azzag, and Jérôme Lacaille. Deep architectures
for joint clustering and visualization with self-organizing maps. In Pacific-Asia Conference
on Knowledge Discovery and Data Mining, pages 105–116. Springer, 2019.
[122] Laura Manduchi, Matthias Hüser, Julia Vogt, Gunnar Rätsch, and Vincent Fortuin. DPSOM:
Deep probabilistic clustering with self-organizing maps. arXiv preprint arXiv:1910.01590,
2019.
[123] Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez-Lobato, Se-
bastian Nowozin, and Cheng Zhang. EDDI: Efficient Dynamic Discovery of High-Value
Information with Partial VAE. In International Conference on Machine Learning, pages
4234–4243. PMLR, 2019.
[124] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Mur-
ray Shanahan, Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural pro-
cesses. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2018.
[125] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Es-
lami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.
[126] Kihyuk Sohn, Xinchen Yan, and Honglak Lee. Learning structured output representation us-
ing deep conditional generative models. In Proceedings of the 28th International Conference
on Neural Information Processing Systems-Volume 2, pages 3483–3491, 2015.
[127] Omar Rivasplata, Ilja Kuzborskij, Csaba Szepesvári, and John Shawe-Taylor. PAC-Bayes
analysis beyond the usual bounds. arXiv preprint arXiv:2006.13057, 2020.
[128] Gintare Karolina Dziugaite, Kyle Hsu, Waseem Gharbieh, Gabriel Arpino, and Daniel Roy.
On the role of data in PAC-Bayes. In International Conference on Artificial Intelligence and
Statistics, pages 604–612. PMLR, 2021.
[129] Tim GJ Rudner, Vincent Fortuin, Yee Whye Teh, and Yarin Gal. On the connection between
neural processes and Gaussian processes with deep kernels. In Workshop on Bayesian Deep
Learning, NeurIPS, 2018.
[130] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum,
Oriol Vinyals, and Yee Whye Teh. Attentive Neural Processes. In International Conference
on Learning Representations, 2018.
[131] Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Attention is Turing-Complete. Journal
of Machine Learning Research, 22(75):1–35, 2021.
[132] Jonathan Gordon, Wessel P Bruinsma, Andrew YK Foong, James Requeima, Yann Dubois,
and Richard E Turner. Convolutional Conditional Neural Processes. In International Confer-
ence on Learning Representations, 2019.
[133] Andrew Foong, Wessel Bruinsma, Jonathan Gordon, Yann Dubois, James Requeima, and
Richard Turner. Meta-Learning Stationary Stochastic Process Prediction with Convolutional
Neural Processes. Advances in Neural Information Processing Systems, 33, 2020.
[134] Christos Louizos, Xiahan Shi, Klamer Schutte, and Max Welling. The functional neural
process. arXiv preprint arXiv:1906.08324, 2019.
[135] Wessel P Bruinsma, James Requeima, Andrew YK Foong, Jonathan Gordon, and Richard E
Turner. The Gaussian Neural Process. arXiv preprint arXiv:2101.03606, 2021.
[136] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural
computation, 4(3):448–472, 1992.
[137] Laurent Valentin Jospin, Wray Buntine, Farid Boussaid, Hamid Laga, and Mohammed Ben-
namoun. Hands-on Bayesian Neural Networks–a Tutorial for Deep Learning Users. arXiv
preprint arXiv:2007.06823, 2020.
[138] Eric Thomas Nalisnick. On priors for Bayesian neural networks. PhD thesis, UC Irvine,
2018.
[139] Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic per-
spective of generalization. arXiv preprint arXiv:2002.08791, 2020.
[140] Daniele Silvestro and Tobias Andermann. Prior choice affects ability of Bayesian neural
networks to identify unknowns. arXiv preprint arXiv:2005.04987, 2020.
[141] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable
learning of Bayesian neural networks. In International Conference on Machine Learning,
pages 1861–1869. PMLR, 2015.
[142] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian
neural networks. In International Conference on Machine Learning, pages 2218–2227.
PMLR, 2017.
[143] Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson.
Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning. In International Con-
ference on Learning Representations, 2019.
[144] Michael Dusenberry, Ghassen Jerfel, Yeming Wen, Yian Ma, Jasper Snoek, Katherine Heller,
Balaji Lakshminarayanan, and Dustin Tran. Efficient and scalable Bayesian neural nets with
rank-1 factors. In International conference on machine learning, pages 2782–2792. PMLR,
2020.
[145] Mariia Vladimirova, Jakob Verbeek, Pablo Mesejo, and Julyan Arbel. Understanding priors in
Bayesian neural networks at the unit level. In International Conference on Machine Learning,
pages 6458–6467. PMLR, 2019.
[146] David JC MacKay. Introduction to Gaussian processes. NATO ASI series F computer and
systems sciences, 168:133–166, 1998.
[147] Christos Louizos and Max Welling. Structured and efficient variational deep learning with
matrix Gaussian posteriors. In International Conference on Machine Learning, pages
1708–1716. PMLR, 2016.
[148] Alex Graves. Practical variational inference for neural networks. In Advances in neural
information processing systems, pages 2348–2356. Citeseer, 2011.
[149] Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E Turner, José Miguel Hernández-
Lobato, and Alexander L Gaunt. Deterministic Variational Inference for Robust Bayesian
Neural Networks. In International Conference on Learning Representations, 2018.
[150] Amar Shah, Andrew Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to
Gaussian processes. In Artificial intelligence and statistics, pages 877–885. PMLR, 2014.
[151] Peter M Williams. Bayesian regularization and pruning using a Laplace prior. Neural com-
putation, 7(1):117–143, 1995.
[152] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variance Net-
works: When Expectation Does Not Meet Your Expectations. In International Conference
on Learning Representations, 2018.
[153] Carlos M Carvalho, Nicholas G Polson, and James G Scott. Handling sparsity via the horse-
shoe. In Artificial Intelligence and Statistics, pages 73–80. PMLR, 2009.
[154] Soumya Ghosh, Jiayu Yao, and Finale Doshi-Velez. Structured variational learning of
Bayesian neural networks with horseshoe priors. In International Conference on Machine
Learning, pages 1744–1753. PMLR, 2018.
[155] Hiske Overweg, Anna-Lena Popkes, Ari Ercole, Yingzhen Li, José Miguel Hernández-
Lobato, Yordan Zaykov, and Cheng Zhang. Interpretable outcome prediction with sparse
Bayesian neural networks in intensive care. arXiv preprint arXiv:1905.02599, 2019.
[156] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learn-
ing. In Proceedings of the 31st International Conference on Neural Information Processing
Systems, pages 3290–3300, 2017.
[157] Tianyu Cui, Aki Havulinna, Pekka Marttinen, and Samuel Kaski. Informative Gaussian scale
mixture priors for Bayesian neural networks. arXiv preprint arXiv:2002.10243, 2020.
[158] Changyong Oh, Kamil Adamczewski, and Mijung Park. Radial and directional posteriors for
Bayesian neural networks. arXiv preprint arXiv:1902.02603, 2019.
[159] Sebastian Farquhar, Michael A Osborne, and Yarin Gal. Radial Bayesian neural networks:
Beyond discrete support in large-scale Bayesian deep learning. In International Conference
on Artificial Intelligence and Statistics, pages 1352–1362. PMLR, 2020.
[160] Stanislav Fort and Adam Scherlis. The Goldilocks Zone: Towards Better Understanding
of Neural Network Loss Landscapes. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 33, pages 3574–3581, 2019.
[161] Theofanis Karaletsos, Peter Dayan, and Zoubin Ghahramani. Probabilistic meta-
representations of neural networks. arXiv preprint arXiv:1810.00555, 2018.
[162] Theofanis Karaletsos and Thang D Bui. Hierarchical Gaussian Process Priors for Bayesian
Neural Network Weights. Advances in Neural Information Processing Systems, 33, 2020.
[163] Andrei Atanov, Arsenii Ashukha, Kirill Struminsky, Dmitriy Vetrov, and Max Welling. The
Deep Weight Prior. In International Conference on Learning Representations, 2018.
[164] Johanni Brea, Berfin Simsek, Bernd Illing, and Wulfram Gerstner. Weight-space symmetry
in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the
loss landscape. arXiv preprint arXiv:1907.02911, 2019.
[165] Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss land-
scapes. arXiv preprint arXiv:1906.04724, 2019.
[166] Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger Grosse. Functional Variational
Bayesian Neural Networks. In International Conference on Learning Representations, 2018.
[167] Daniel Flam-Shepherd, James Requeima, and David Duvenaud. Mapping Gaussian process
priors to Bayesian neural networks. In NIPS Bayesian deep learning workshop, volume 13,
2017.
[168] David R Burt, Sebastian W Ober, Adrià Garriga-Alonso, and Mark van der Wilk. Understand-
ing Variational Inference in Function-Space. arXiv preprint arXiv:2011.09421, 2020.
[169] Ba-Hien Tran, Simone Rossi, Dimitrios Milios, and Maurizio Filippone. All You Need is a
Good Functional Prior for Bayesian Deep Learning. arXiv preprint arXiv:2011.12829, 2020.
[170] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106,
2016.
[171] David Krueger, Chin-Wei Huang, Riashat Islam, Ryan Turner, Alexandre Lacoste, and Aaron
Courville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.
[172] Daniel Flam-Shepherd, James Requeima, and David Duvenaud. Characterizing and Warping
the Function Space of Bayesian Neural Networks. In NeurIPS Workshop on Bayesian Deep
Learning, 2018.
[173] Emmanuel Jean Candes. Ridgelets: Theory and application. Ph. D. dissertation, Dept. of
Statistics, Stanford Univ., 1998.
[174] Takuo Matsubara, Chris J Oates, and François-Xavier Briol. The Ridgelet Prior: A Covari-
ance Function Approach to Prior Specification for Bayesian Neural Networks. arXiv preprint
arXiv:2010.08488, 2020.
[175] Chao Ma, Yingzhen Li, and José Miguel Hernández-Lobato. Variational implicit processes.
In International Conference on Machine Learning, pages 4222–4233. PMLR, 2019.
[176] Tim Pearce, Russell Tsuchida, Mohamed Zaki, Alexandra Brintrup, and Andy Neely.
Expressive priors in Bayesian neural networks: Kernel combinations and periodic functions. In
Uncertainty in Artificial Intelligence, pages 134–144. PMLR, 2020.
[177] Wanqian Yang, Lars Lorch, Moritz A Graule, Srivatsan Srinivasan, Anirudh Suresh, Jiayu
Yao, Melanie F Pradier, and Finale Doshi-Velez. Output-constrained Bayesian neural net-
works. arXiv preprint arXiv:1905.06287, 2019.
[178] Danijar Hafner, Dustin Tran, Timothy Lillicrap, Alex Irpan, and James Davidson. Noise
contrastive priors for functional uncertainty. In Uncertainty in Artificial Intelligence, pages
905–914. PMLR, 2020.
[179] Eric Nalisnick, Jonathan Gordon, and José Miguel Hernández-Lobato. Predictive Complexity
Priors. In International Conference on Artificial Intelligence and Statistics, pages 694–702.
PMLR, 2021.
[180] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC press, 1994.
[181] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable pre-
dictive uncertainty estimation using deep ensembles. In Proceedings of the 31st International
Conference on Neural Information Processing Systems, pages 6405–6416, 2017.
[182] Florian Wenzel, Jasper Snoek, Dustin Tran, and Rodolphe Jenatton. Hyperparameter ensem-
bles for robustness and uncertainty quantification. arXiv preprint arXiv:2006.13570, 2020.
[183] Yeming Wen, Dustin Tran, and Jimmy Ba. BatchEnsemble: an Alternative Approach to
Efficient Ensemble and Lifelong Learning. In International Conference on Learning Repre-
sentations, 2019.
[184] Rahul Rahaman and Alexandre H Thiery. Uncertainty quantification and deep ensembles.
arXiv preprint arXiv:2007.08792, 2020.
[185] Jiayu Yao, Weiwei Pan, Soumya Ghosh, and Finale Doshi-Velez. Quality of uncertainty quan-
tification for Bayesian neural network inference. arXiv preprint arXiv:1906.09686, 2019.
[186] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep rein-
forcement learning. In Proceedings of the 32nd International Conference on Neural Informa-
tion Processing Systems, pages 8626–8638, 2018.
[187] Ian Osband, Benjamin Van Roy, Daniel J Russo, and Zheng Wen. Deep Exploration via
Randomized Value Functions. Journal of Machine Learning Research, 20(124):1–62, 2019.
[188] Kamil Ciosek, Vincent Fortuin, Ryota Tomioka, Katja Hofmann, and Richard Turner. Con-
servative uncertainty estimation by fitting prior networks. In International Conference on
Learning Representations, 2019.
[189] Bobby He, Balaji Lakshminarayanan, and Yee Whye Teh. Bayesian deep ensembles via the
neural tangent kernel. arXiv preprint arXiv:2007.05864, 2020.
[190] Qiang Liu and Dilin Wang. Stein variational gradient descent: a general purpose Bayesian
inference algorithm. In Proceedings of the 30th International Conference on Neural Informa-
tion Processing Systems, pages 2378–2386, 2016.
[191] Qiang Liu. Stein variational gradient descent as gradient flow. In Proceedings of the 31st In-
ternational Conference on Neural Information Processing Systems, pages 3118–3126, 2017.
[192] Anna Korba, Adil Salim, Michael Arbel, Giulia Luise, and Arthur Gretton. A non-asymptotic
analysis for Stein variational gradient descent. arXiv preprint arXiv:2006.09797, 2020.
[193] Xinyu Hu, Paul Szerlip, Theofanis Karaletsos, and Rohit Singh. Applying SVGD to
Bayesian Neural Networks for Cyclical Time-Series Prediction and Inference. arXiv preprint
arXiv:1901.05906, 2019.
[194] Ziyu Wang, Tongzheng Ren, Jun Zhu, and Bo Zhang. Function Space Particle Optimization
for Bayesian Neural Networks. In International Conference on Learning Representations,
2018.
[195] CE Rasmussen and Z Ghahramani. Occam’s razor. Advances in Neural Information Process-
ing Systems, pages 294–300, 2001.
[196] Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how
to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.
[197] Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning
to learn, pages 3–17. Springer, 1998.
[198] Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence re-
search, 12:149–198, 2000.
[199] Tom Heskes. Solving a Huge Number of Similar Tasks: A Combination of Multi-Task Learn-
ing and a Hierarchical Bayesian Approach. In Proceedings of the Fifteenth International
Conference on Machine Learning, pages 233–241, 1998.
[200] Joshua B Tenenbaum. A Bayesian Framework for Concept Learning. PhD thesis, Citeseer,
1999.
[201] Li Fei-Fei et al. A Bayesian approach to unsupervised one-shot learning of object categories.
In Proceedings Ninth IEEE International Conference on Computer Vision, pages 1134–1141.
IEEE, 2003.
[202] Neil D Lawrence and John C Platt. Learning to learn with the informative vector machine. In
Proceedings of the twenty-first international conference on Machine learning, page 65, 2004.
[203] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting
Gradient-Based Meta-Learning as Hierarchical Bayes. In International Conference on Learn-
ing Representations, 2018.
[204] Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn.
Bayesian model-agnostic meta-learning. In Proceedings of the 32nd International Conference
on Neural Information Processing Systems, pages 7343–7353, 2018.
[205] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In
Proceedings of the 32nd International Conference on Neural Information Processing Systems,
pages 9537–9548, 2018.
[206] Andrew Wilson and Ryan Adams. Gaussian process kernels for pattern discovery and extrap-
olation. In International conference on machine learning, pages 1067–1075. PMLR, 2013.
[207] David Duvenaud, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani.
Structure discovery in nonparametric regression through compositional kernel search. In
International Conference on Machine Learning, pages 1166–1174. PMLR, 2013.
[208] James Lloyd, David Duvenaud, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani.
Automatic construction and natural-language description of nonparametric regression models.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
[209] Hyunjik Kim and Yee Whye Teh. Scaling up the Automatic Statistician: Scalable structure
discovery using Gaussian processes. In International Conference on Artificial Intelligence
and Statistics, pages 575–584. PMLR, 2018.
[210] Shengyang Sun, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li, and Roger
Grosse. Differentiable compositional kernel learning for Gaussian processes. In International
Conference on Machine Learning, pages 4828–4837. PMLR, 2018.
[211] Jeremiah Zhe Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax-Weiss, and Balaji Laksh-
minarayanan. Simple and principled uncertainty estimation with deterministic deep learning
via distance awareness. arXiv preprint arXiv:2006.10108, 2020.
[212] M Van der Wilk, M Bauer, ST John, and J Hensman. Learning Invariances using the Marginal
Likelihood. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018),
pages 9938–9948, 2019.
[213] Massimiliano Patacchiola, Jack Turner, Elliot J Crowley, Michael O’Boyle, and Amos J
Storkey. Bayesian Meta-Learning for the Few-Shot Setting via Deep Kernels. Advances
in Neural Information Processing Systems, 33, 2020.
[214] Yunxiao Qin, Weiguo Zhang, Chenxu Zhao, Zezheng Wang, Hailin Shi, Guojun Qi, Jingping
Shi, and Zhen Lei. Rethink and redesign meta learning. arXiv preprint arXiv:1812.04955,
2018.
[215] Mingzhang Yin, George Tucker, Mingyuan Zhou, Sergey Levine, and Chelsea Finn. Meta-
Learning without Memorization. In International Conference on Learning Representations,
2019.
[216] Jonas Rothfuss, Vincent Fortuin, Martin Josifoski, and Andreas Krause. PACOH: Bayes-
optimal meta-learning with PAC-guarantees. arXiv preprint arXiv:2002.05551, 2020.
[217] Matthew D Hoffman and Matthew J Johnson. ELBO Surgery: Yet another way to carve up
the evidence lower bound.
[218] Jakub Tomczak and Max Welling. VAE with a VampPrior. In International Conference on
Artificial Intelligence and Statistics, pages 1214–1223. PMLR, 2018.
[219] SR Dalal and WJ Hall. Approximating priors by mixtures of natural conjugate priors. Journal
of the Royal Statistical Society: Series B (Methodological), 45(2):278–286, 1983.
[220] Chunsheng Guo, Jialuo Zhou, Huahua Chen, Na Ying, Jianwu Zhang, and Di Zhou. Varia-
tional autoencoder with optimizing Gaussian mixture model priors. IEEE Access, 8:43992–
44005, 2020.
[221] Philip Botros and Jakub M Tomczak. Hierarchical VampPrior variational fair auto-encoder.
arXiv preprint arXiv:1806.09918, 2018.
[222] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David
Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv
preprint arXiv:1611.05013, 2016.
[223] Jörg Bornschein, Andriy Mnih, Daniel Zoran, and Danilo J Rezende. Variational memory
addressing in generative models. In Proceedings of the 31st International Conference on
Neural Information Processing Systems, pages 3923–3932, 2017.
[224] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In
International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.
[225] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP.
arXiv preprint arXiv:1605.08803, 2016.
[226] Chin-Wei Huang, Ahmed Touati, Laurent Dinh, Michal Drozdzal, Mohammad Havaei, Lau-
rent Charlin, and Aaron Courville. Learnable explicit density for continuous latent space and
variational inference. arXiv preprint arXiv:1710.02248, 2017.
[227] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max
Welling. Improved variational inference with inverse autoregressive flow. In Proceedings of
the 30th International Conference on Neural Information Processing Systems, pages 4743–
4751, 2016.
[228] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schul-
man, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint
arXiv:1611.02731, 2016.
[229] Matthias Bauer and Andriy Mnih. Resampled priors for variational autoencoders. In The
22nd International Conference on Artificial Intelligence and Statistics, pages 66–75. PMLR,
2019.
[230] Bo Pang, Tian Han, Erik Nijkamp, Song-Chun Zhu, and Ying Nian Wu. Learning Latent
Space Energy-Based Prior Model. Advances in Neural Information Processing Systems, 33,
2020.
[231] Jyoti Aneja, Alexander Schwing, Jan Kautz, and Arash Vahdat. NCP-VAE: Variational Au-
toencoders with Noise Contrastive Priors. arXiv preprint arXiv:2010.02917, 2020.
[232] Wenxiao Chen, Wenda Liu, Zhenting Cai, Haowen Xu, and Dan Pei. VAEPP: Variational
Autoencoder with a Pull-Back Prior. In International Conference on Neural Information
Processing, pages 366–379. Springer, 2020.
[233] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings
of the 27th International Conference on Neural Information Processing Systems-Volume 2,
pages 2672–2680, 2014.
[234] Alexander Immer, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Mohammad Emtiyaz
Khan. Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning. arXiv
preprint arXiv:2104.04975, 2021.
[235] Eric Nalisnick and Padhraic Smyth. Learning priors for invariance. In International Confer-
ence on Artificial Intelligence and Statistics, pages 366–375. PMLR, 2018.