Continual Learning via Sequential Function-Space Variational Inference

Tim G. J. Rudner Freddie Bickford Smith Qixuan Feng Yee Whye Teh Yarin Gal

Abstract

Sequential Bayesian inference over predictive functions is a natural framework for continual learning from streams of data. However, applying it to neural networks has proved challenging in practice. Addressing the drawbacks of existing techniques, we propose an optimization objective derived by formulating continual learning as sequential function-space variational inference. In contrast to existing methods that regularize neural network parameters directly, this objective allows parameters to vary widely during training, enabling better adaptation to new tasks. Compared to objectives that directly regularize neural network predictions, the proposed objective allows for more flexible variational distributions and more effective regularization. We demonstrate that, across a range of task sequences, neural networks trained via sequential function-space variational inference achieve better predictive accuracy than networks trained with related methods while depending less on maintaining a set of representative points from previous tasks.

1 Introduction

Continual learning promises to enable applications of machine learning to settings with resource constraints, privacy concerns, or non-stationary data distributions. However, continual learning in deep neural networks remains a challenge. While progress has been made to mitigate “forgetting” of previously learned abilities, existing objective-based approaches to continual learning still fall short.

A popular family of objectives penalizes changes in parameters from one task to another (Ahn et al., 2019; Aljundi et al., 2018; Chaudhry et al., 2018; Kirkpatrick et al., 2017; Lee et al., 2017; Liu et al., 2018; Loo et al., 2020; Nguyen et al., 2018; Park et al., 2019; Ritter et al., 2018; Schwarz et al., 2018; Swaroop et al., 2019; Yin et al., 2020a; Yin et al., 2020b; Zenke et al., 2017). However, explicitly regularizing parameters in this way may be ineffective, since parameters are only a proxy for a neural network’s predictive function. For example, predictive functions defined by overparameterized neural networks may be obtained with several different parameter configurations, and small changes in a network’s parameters may cause large changes in its predictions.

An alternative approach that addresses this shortcoming is to regularize the predictive function directly (Benjamin et al.,, 2019; Bui et al.,, 2017; Buzzega et al.,, 2020; Jung et al.,, 2018; Kapoor et al.,, 2021; Kim et al.,, 2018; Li and Hoiem,, 2018; Moreno-Muñoz et al.,, 2019; Pan et al.,, 2020; Titsias et al.,, 2020). Existing function-space regularization methods represent the state of the art among objective-based approaches to continual learning (Kapoor et al.,, 2021; Pan et al.,, 2020; Titsias et al.,, 2020). Yet, as we demonstrate, these methods still leave room for improvement. For example, “functional regularization of the memorable past” (fromp; Pan et al.,, 2020) uses a Laplace approximation and as such does not directly optimize variance parameters, while “functional regularization for continual learning” (frcl; Titsias et al.,, 2020) is constrained to linear models.

To address these limitations, we frame continual learning as sequential function-space variational inference (s-fsvi) and adapt the variational objective proposed by Rudner et al., (2021) to the continual-learning setting. The resulting variational optimization objective has three key advantages over existing alternatives. First, it is expressed purely in terms of distributions over predictive functions, which allows greater flexibility than with parameter-space regularization methods (Figure 1). Second, unlike fromp, it allows direct optimization of variational variance parameters. Third, unlike frcl, it can be applied to fully-stochastic neural networks—not just to Bayesian linear models.

We demonstrate that s-fsvi outperforms existing objective-based continual learning methods—in some cases by a significant margin—on a wide range of task sequences, including single-head split mnist, multi-head split cifar, and multi-head sequential Omniglot. We further present empirical results that showcase the usefulness of learned variational variance parameters and demonstrate that s-fsvi is less reliant on careful selection of datapoints that summarize past tasks than other methods.

Figure 1: Schematic of how sequential function-space variational inference (s-fsvi) allows a Bayesian neural network to learn new tasks while maintaining previously learned abilities. (Top: predictive distributions.) On task 1, the model fits dataset $\mathcal{D}_{1}$ by updating an initial distribution over parameters $q_{0}({\bm{\theta}})$ to a variational posterior $q_{1}({\bm{\theta}})$, which in turn induces a distribution over functions $q_{1}(f)$. On task 2, the variational objective encourages the posterior distribution over functions to match $q_{1}(f)$ on a small set of data points from task 1 while also fitting dataset $\mathcal{D}_{2}$. The mean and two standard deviations of the distributions over functions learned on task 1 and task 2 are shown in grey and blue, respectively. (Bottom: learning trajectories.) On task 1, the distribution over functions changes by a large amount for inputs $\mathbf{X}_{1}$ (left) but by a small amount for inputs $\mathbf{X}_{2}$ (right). On task 2, the reverse is true. On both tasks, the change in the distribution over parameters (center) is decoupled from the changes in the distribution over functions (left, right).

2 Background

2.1 Continual Learning as Bayesian Inference

Consider a sequence of tasks indexed by $t\in\{1,\ldots,T\}$ . Each task involves making predictions on a supervised dataset $\mathcal{D}_{t}=(\mathbf{X}_{t},\mathbf{y}_{t})$ . Continual learning is the problem of inferring a distribution over predictive functions that fits the whole collection of datasets $\{\mathcal{D}_{1},\ldots,\mathcal{D}_{T}\}$ as well as possible given access to only a single full dataset at a time.

Sequential Bayesian inference over predictive functions $f$ provides a natural framework for this. Assuming we have a prior $p(f)$ , the posterior distribution over $f$ at task 1 is

$$p(f\,|\,\mathcal{D}_{1})=p(\mathcal{D}_{1}\,|\,f)\,p(f)/p(\mathcal{D}_{1}). \tag{1}$$

For subsequent tasks $t$ , the posterior can be expressed as

$$p(f\,|\,\mathcal{D}_{1},\ldots,\mathcal{D}_{t})\propto p(\mathcal{D}_{t}\,|\,f)\,p(f\,|\,\mathcal{D}_{1},\ldots,\mathcal{D}_{t-1}), \tag{2}$$

where the posterior after task $t-1$ is treated as the prior for task $t$. Given the intractability of computing this posterior exactly, we need to use approximate inference.
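The recursion in Equation 2 can be checked on a conjugate toy problem where the posterior is available in closed form. The sketch below assumes a Beta-Bernoulli model, which is not used in this paper and is chosen only because its update is exact; the latent quantity is a single coin bias rather than a predictive function, and `update` is a hypothetical helper name.

```python
def update(prior, data):
    """Exact Bayesian update: Beta(a, b) prior with Bernoulli observations."""
    a, b = prior
    ones = sum(data)
    return (a + ones, b + len(data) - ones)

# Three "tasks", each revealing a separate batch of binary observations.
tasks = [[1, 0, 1], [1, 1], [0, 0, 1, 0]]
prior = (1, 1)  # uniform Beta(1, 1) prior (illustrative choice)

# Sequential inference: the posterior after task t-1 becomes the prior for task t.
posterior = prior
for data in tasks:
    posterior = update(posterior, data)

# Batch inference on the union of all datasets gives the same posterior.
batch_posterior = update(prior, [y for data in tasks for y in data])
print(posterior, batch_posterior)
```

In this conjugate setting the sequential and batch posteriors coincide exactly. With neural networks the posterior has no closed form, which is why approximate inference is needed.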

2.2 Function-Space Variational Inference

Given a dataset $\mathcal{D}=(\mathbf{X},\mathbf{y})$ , a prior $p(f)$ and a variational family $\mathcal{Q}_{f}$ , function-space variational inference (Burt et al.,, 2021; Matthews et al.,, 2016; Rudner et al.,, 2021; Sun et al.,, 2019) consists of finding the variational distribution $q(f)\in\mathcal{Q}_{f}$ that maximizes

$$\mathbb{E}_{q(f)}[\log p(\mathbf{y}\,|\,f(\mathbf{X}))]-\mathbb{D}_{\textrm{KL}}(q(f)\,\|\,p(f)). \tag{3}$$

This variational optimization problem presents a trade-off between fitting the data and matching a prior over functions. To address the fact that the KL divergence between distributions over functions is not in general tractable, prior works have developed estimation procedures that allow turning Equation 3 into an objective function that can be used in practice (Rudner et al.,, 2021; Sun et al.,, 2019).
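For orientation, it may help to recall the closed form of the KL divergence between two $n$-dimensional Gaussians. A KL divergence between distributions over functions takes this form whenever both distributions are Gaussian once evaluated at a finite set of $n$ inputs. The symbols $\mathbf{m}_{q}$, $\mathbf{m}_{p}$, $\mathbf{K}_{q}$ and $\mathbf{K}_{p}$ are illustrative notation introduced here, and the identity is a textbook result rather than the specific estimator developed in the works cited above:

$$\mathbb{D}_{\textrm{KL}}\big(\mathcal{N}(\mathbf{m}_{q},\mathbf{K}_{q})\,\|\,\mathcal{N}(\mathbf{m}_{p},\mathbf{K}_{p})\big)=\frac{1}{2}\left(\log\frac{|\mathbf{K}_{p}|}{|\mathbf{K}_{q}|}-n+\operatorname{Tr}(\mathbf{K}_{p}^{-1}\mathbf{K}_{q})+(\mathbf{m}_{q}-\mathbf{m}_{p})^{\top}\mathbf{K}_{p}^{-1}(\mathbf{m}_{q}-\mathbf{m}_{p})\right).$$

The log-determinant and trace terms compare the two covariances, while the quadratic term penalizes differences between the means, scaled by the prior covariance.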

3 Continual Learning via Sequential Function-Space Variational Inference

The ideas presented in Section 2 provide a starting point for our method. To approximate the posterior in Equation 2 at task $t$ , we would like to find a variational distribution $q_{t}(f)\in\mathcal{Q}_{f}$ that minimizes

$$\mathbb{D}_{\textrm{KL}}(q_{t}(f)\,\|\,p_{t}(f\,|\,\mathcal{D}_{1},\ldots,\mathcal{D}_{t})), \tag{4}$$

which can equivalently be expressed as maximizing

$$\mathbb{E}_{q_{t}(f)}[\log p(\mathbf{y}_{t}\,|\,f(\mathbf{X}_{t}))]-\mathbb{D}_{\textrm{KL}}(q_{t}(f)\,\|\,p_{t}(f\,|\,\mathcal{D}_{1},\ldots,\mathcal{D}_{t-1})).$$

Since we do not have access to $p_{t}(f|\mathcal{D}_{1},...,\mathcal{D}_{t-1})$ , we simplify the inference problem to maximizing the variational objective

$$\mathbb{E}_{q_{t}(f)}[\log p(\mathbf{y}_{t}\,|\,f(\mathbf{X}_{t}))]-\mathbb{D}_{\textrm{KL}}(q_{t}(f)\,\|\,p_{t}(f)), \tag{5}$$

where for $t=1$ we assume some prior $p_{1}(f)$ and for $t>1$ the prior is given by the variational posterior distribution over functions inferred on the previous task. That is,

$$p_{t}(f)\doteq q_{t-1}(f).$$

While this objective is in general intractable for distributions over functions induced by neural networks with stochastic parameters, Rudner et al., (2021) proposed an approximation that makes this objective amenable to gradient-based optimization and scalable to large neural networks. To perform sequential function-space variational inference, we adapt the estimation procedure proposed by Rudner et al., (2021) to the continual-learning setting:

Proposition 1 (Sequential Function-Space Variational Inference (s-fsvi); adapted from Rudner et al., (2021)).

Let $D_{t}$ be the number of model output dimensions for $t$ tasks, let $f:\mathcal{X}\times\mathbb{R}^{P}\rightarrow\mathbb{R}^{D_{t}}$ be a mapping defined by a neural network architecture, let $\bm{\Theta}\in\mathbb{R}^{P}$ be a multivariate random vector of network parameters, and let $q_{t}({\bm{\theta}})\doteq\mathcal{N}({\bm{\mu}}_{t},\bm{\Sigma}_{t})$ and $q_{t-1}({\bm{\theta}})\doteq\mathcal{N}({\bm{\mu}}_{t-1},\bm{\Sigma}_{t-1})$ be variational distributions over $\bm{\Theta}$. Additionally, let $\mathbf{X}_{\mathcal{C}}$ denote a set of context points, and let $\bar{\mathbf{X}}_{t}\subseteq\{\mathbf{X}_{t}\cup\mathbf{X}_{\mathcal{C}}\}$. Under a diagonal approximation of the prior and variational posterior covariance functions across output dimensions, the objective in Equation 5 can be approximated by

$$\begin{aligned}
&\mathcal{F}(q_{t},q_{t-1},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})\\
&\doteq\mathbb{E}_{q_{t}({\bm{\theta}})}[\log p(\mathbf{y}_{t}\,|\,f(\mathbf{X}_{t};{\bm{\theta}}))]\\
&\quad-\sum_{k=1}^{D_{t}}\frac{1}{2}\bigg(\log\frac{|[\mathbf{K}^{p_{t}}]_{k}|}{|[\mathbf{K}^{q_{t}}]_{k}|}-\frac{|\bar{\mathbf{X}}_{t}|}{D_{t}}+\operatorname{Tr}([\mathbf{K}^{p_{t}}]_{k}^{-1}[\mathbf{K}^{q_{t}}]_{k})\\
&\qquad+\Delta(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t},{\bm{\mu}}_{t-1})^{\top}[\mathbf{K}^{p_{t}}]_{k}^{-1}\Delta(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t},{\bm{\mu}}_{t-1})\bigg),
\end{aligned} \tag{6}$$

where

$$\Delta(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t},{\bm{\mu}}_{t-1})\doteq[f(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t})]_{k}-[f(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t-1})]_{k} \tag{7}$$

and

$$\mathbf{K}^{p_{t}}\doteq\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t-1})\bm{\Sigma}_{t-1}\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t-1})^{\top} \tag{8}$$

$$\mathbf{K}^{q_{t}}\doteq\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t})\bm{\Sigma}_{t}\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t})^{\top}, \tag{9}$$

are covariance matrix estimates constructed from Jacobians $\mathcal{J}(\cdot,\mathbf{m})\doteq\frac{\partial f(\cdot\,;\bm{\Theta})}{\partial\bm{\Theta}}|_{\bm{\Theta}=\mathbf{m}}$ with $\mathbf{m}\in\{{\bm{\mu}}_{t},{\bm{\mu}}_{t-1}\}$.

Proof.

See Appendix A. ∎

“Functional regularization for continual learning” (frcl; Titsias et al.,, 2020) and “functional regularization of the memorable past” (fromp; Pan et al.,, 2020) use objectives conceptually similar to the objective in Equation 5 and mathematically similar to the objective in Equation 6. To highlight the differences between the s-fsvi objective above and fromp and frcl, respectively, we make the relationship between these two methods and s-fsvi precise in the following two propositions.

Proposition 2 (Relationship between fromp and s-fsvi).

With the s-fsvi objective $\mathcal{F}$ defined as in Equation 6, let $\bar{\mathbf{X}}_{t}=\mathbf{X}_{\mathcal{C}}$. Then, up to a multiplicative constant, the fromp objective corresponds to the s-fsvi objective with the prior covariance given by a Laplace approximation about ${\bm{\mu}}_{t-1}$ and the variational distribution given by a Dirac delta distribution $q_{t}^{\textsc{fromp}}({\bm{\theta}})\doteq\delta({\bm{\theta}}-{\bm{\mu}}_{t})$. Denoting the prior covariance under a Laplace approximation about ${\bm{\mu}}_{t-1}$ by $\hat{\bm{\Sigma}}_{0}({\bm{\mu}}_{t-1})$ so that $q_{t-1}^{\textsc{fromp}}({\bm{\theta}})\doteq\mathcal{N}({\bm{\mu}}_{t-1},\hat{\bm{\Sigma}}_{0}({\bm{\mu}}_{t-1}))$, the fromp objective can be expressed as

$$\begin{aligned}
&\mathcal{L}^{\textsc{fromp}}(q_{t}^{\textsc{fromp}},q_{t-1}^{\textsc{fromp}},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})\\
&\quad=\mathcal{F}(q_{t}^{\textsc{fromp}},q_{t-1}^{\textsc{fromp}},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})-\mathcal{V},
\end{aligned}$$

where

$$\mathcal{V}\doteq-\frac{1}{2}\sum_{k}\left(\log\frac{[\bar{\mathbf{K}}^{\hat{p}_{t}}]_{k}}{[\bar{\mathbf{K}}^{q_{t}}]_{k}}+\frac{[\bar{\mathbf{K}}^{q_{t}}]_{k}}{[\bar{\mathbf{K}}^{\hat{p}_{t}}]_{k}}-1\right),$$

with $\bar{\mathbf{K}}$ denoting a covariance matrix under a block-diagonalization without inter-task dependence, and

$$\bar{\mathbf{K}}^{\hat{p}_{t}}\doteq\text{block-diag}\left(\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t-1})\hat{\bm{\Sigma}}_{0}({\bm{\mu}}_{t-1})\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t-1})^{\top}\right).$$

Proof.

See Appendix A. ∎

Proposition 2 shows that the fromp objective nearly corresponds to the s-fsvi objective but is missing the term in the s-fsvi objective (denoted by $\mathcal{V}$ above) that encourages learning variational variance parameters that accurately reflect the variance of the prior. This insight reflects a shortcoming of the fromp objective. Unlike the s-fsvi objective, which allows optimization over $\bm{\Sigma}$, the fromp objective is restricted to covariance estimates given by the Laplace approximation.

The frcl objective can be related to the s-fsvi objective in a similar way:

Proposition 3 (Relationship between frcl and s-fsvi).

With the s-fsvi objective $\mathcal{F}$ defined as in Equation 6, let $\bar{\mathbf{X}}_{t}=\mathbf{X}_{\mathcal{C}}$, and let $f^{\text{LM}}(\cdot\,;\bm{\Theta})\doteq\Phi_{\psi}(\cdot)\bm{\Theta}$ be a Bayesian linear model, where $\Phi_{\psi}(\cdot)$ is a deterministic feature map parameterized by $\psi$. Then the frcl objective corresponds to the s-fsvi objective for the model $f^{\text{LM}}(\cdot\,;\bm{\Theta})$ plus an additional weight-space KL divergence penalty. That is, for $p_{t}({\bm{\theta}})\doteq\mathcal{N}(\mathbf{0},\mathbf{I})$ and $q_{t}({\bm{\theta}})\doteq\mathcal{N}({\bm{\mu}}_{t},\bm{\Sigma}_{t})$,

$$\begin{aligned}
&\mathcal{L}^{\textsc{frcl}}(q_{t}^{\textsc{frcl}},q_{t-1}^{\textsc{frcl}},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})\\
&\quad=\mathcal{F}(q_{t}^{\textsc{frcl}},q_{t-1}^{\textsc{frcl}},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})+\mathbb{D}_{\textrm{KL}}(q_{t}({\bm{\theta}})\,\|\,p_{t}({\bm{\theta}})).
\end{aligned} \tag{10}$$

Proof.

See Appendix A. ∎

Proposition 3 highlights that the frcl objective is restricted to Bayesian linear models and does not regularize the deterministic parameters in the feature map as effectively as if they were variational parameters.
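To see why the linear-model case is special, note that for a fixed feature map the Jacobian in Equations 8 and 9 no longer depends on the point of linearization. The short derivation below only restates the definitions from Propositions 1 and 3; $\bar{\bm{\Phi}}$ is shorthand introduced here for $\Phi_{\psi}(\bar{\mathbf{X}}_{t})$, and $\psi$ is held fixed.

$$\mathcal{J}(\bar{\mathbf{X}}_{t},\mathbf{m})=\frac{\partial\,\Phi_{\psi}(\bar{\mathbf{X}}_{t})\bm{\Theta}}{\partial\bm{\Theta}}\bigg|_{\bm{\Theta}=\mathbf{m}}=\Phi_{\psi}(\bar{\mathbf{X}}_{t})\doteq\bar{\bm{\Phi}}\quad\text{for all }\mathbf{m},$$

$$\mathbf{K}^{p_{t}}=\bar{\bm{\Phi}}\,\bm{\Sigma}_{t-1}\bar{\bm{\Phi}}^{\top},\qquad\mathbf{K}^{q_{t}}=\bar{\bm{\Phi}}\,\bm{\Sigma}_{t}\,\bar{\bm{\Phi}}^{\top}.$$

Since a linear map of a Gaussian random vector is again Gaussian, these are the exact covariances of $f^{\text{LM}}(\bar{\mathbf{X}}_{t};\bm{\Theta})$ under $q_{t-1}({\bm{\theta}})$ and $q_{t}({\bm{\theta}})$ for this fixed $\psi$. For a general neural network, by contrast, the Jacobian depends on the point of linearization and the covariance matrices are estimates.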

3.1 Simplified Sequential Function-Space VI

For ease of computation and to ensure scalability to large neural networks, we consider mean-field distributions $q^{\text{MF}}_{t}({\bm{\theta}})$ for all tasks, diagonalize the covariance matrix estimates $\mathbf{K}^{p_{t}}$ and $\mathbf{K}^{q_{t}}$ across input points in $\bar{\mathbf{X}}_{t}$ , and let $(\mathbf{X}_{\mathcal{B}},\mathbf{y}_{\mathcal{B}})\subset\mathcal{D}_{t}$ be a mini-batch from the current dataset. This way, we obtain the simplified variational objective

$$\begin{aligned}
&\tilde{\mathcal{F}}(q^{\text{MF}}_{t},q^{\text{MF}}_{t-1},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{\mathcal{B}},\mathbf{y}_{\mathcal{B}})\\
&=\frac{1}{S}\sum_{i=1}^{S}\log p(\mathbf{y}_{\mathcal{B}}\,|\,f(\mathbf{X}_{\mathcal{B}};h({\bm{\mu}}_{t},\bm{\Sigma}_{t},\bm{\epsilon}^{(i)})))\\
&\quad-\sum_{j=1}^{|\bar{\mathbf{X}}_{t}|}\sum_{k=1}^{D_{t}}\frac{1}{2}\bigg(\log\frac{[\mathbf{K}^{p_{t}}]_{j,k}}{[\mathbf{K}^{q_{t}}]_{j,k}}+\frac{[\mathbf{K}^{q_{t}}]_{j,k}}{[\mathbf{K}^{p_{t}}]_{j,k}}-1\\
&\qquad+\frac{\left([f(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t})]_{j,k}-[f(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t-1})]_{j,k}\right)^{2}}{[\mathbf{K}^{p_{t}}]_{j,k}}\bigg),
\end{aligned} \tag{11}$$

where $h({\bm{\mu}}_{t},\bm{\Sigma}_{t},\bm{\epsilon}^{(i)})\doteq{\bm{\mu}}_{t}+\bm{\Sigma}_{t}\odot\bm{\epsilon}^{(i)}$ is a reparameterization of $\bm{\Theta}\in\mathbb{R}^{P}$ with $\bm{\epsilon}^{(i)}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{P})$, $S$ is the number of Monte Carlo samples, $D_{t}$ is as defined before, and

$$\mathbf{K}^{p_{t}}\doteq\operatorname{diag}\left(\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t-1})\bm{\Sigma}_{t-1}\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t-1})^{\top}\right) \tag{12}$$

$$\mathbf{K}^{q_{t}}\doteq\operatorname{diag}\left(\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t})\bm{\Sigma}_{t}\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t})^{\top}\right). \tag{13}$$

This simplified objective does not require matrix inversion, and the time and space complexity for gradient estimation and prediction scale linearly in the number of context points $\bar{\mathbf{X}}_{t}$ and network parameters. The context set $\mathbf{X}_{\mathcal{C}}$ can be constructed from coresets containing representative points from previous tasks.
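The regularization term of Equation 11 can be illustrated on a model small enough for the Jacobian to be written by hand. The following is a minimal sketch under stated assumptions, not the implementation used for the experiments. It assumes a hypothetical two-parameter model $f(x;\theta)=w\tanh(ax)$ with a single output dimension and diagonal covariances stored as lists of variances. All function and variable names are illustrative.

```python
import math

def f(x, theta):
    """Toy scalar-output model f(x; theta) = w * tanh(a * x)."""
    w, a = theta
    return w * math.tanh(a * x)

def jacobian(x, theta):
    """Analytic gradient of f with respect to (w, a) at a single input x."""
    w, a = theta
    t = math.tanh(a * x)
    return [t, w * x * (1.0 - t * t)]

def linearized_variance(x, mean, var):
    """One diagonal entry of J Sigma J^T for diagonal Sigma (cf. Eqs. 12-13)."""
    return sum(j * j * v for j, v in zip(jacobian(x, mean), var))

def function_space_kl(context, mu_t, var_t, mu_prev, var_prev):
    """Regularizer of Eq. 11 for one output dimension, summed over context points."""
    total = 0.0
    for x in context:
        k_p = linearized_variance(x, mu_prev, var_prev)  # prior variance at x
        k_q = linearized_variance(x, mu_t, var_t)        # variational variance at x
        delta = f(x, mu_t) - f(x, mu_prev)               # change in mean prediction
        total += 0.5 * (math.log(k_p / k_q) + k_q / k_p - 1.0 + delta ** 2 / k_p)
    return total

context = [-1.0, 0.5, 2.0]  # x = 0 is avoided: this toy Jacobian vanishes there
mu_prev, var_prev = [1.0, 1.0], [0.1, 0.1]
mu_t, var_t = [1.5, 0.8], [0.05, 0.2]

kl_same = function_space_kl(context, mu_prev, var_prev, mu_prev, var_prev)
kl_moved = function_space_kl(context, mu_t, var_t, mu_prev, var_prev)
print(kl_same, kl_moved)
```

Each summand is a KL divergence between two univariate Gaussians, so no matrix inversion is needed and the cost grows linearly with the number of context points. The penalty is zero when the variational distribution equals the prior and positive once the mean predictions or variances at the context points change.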

We provide an empirical comparison of the simplified s-fsvi, fromp, and frcl objectives in Section 5 to assess the extent to which the differences described above affect continual learning.

Table 1: Predictive accuracies of a selection of objective-based methods for continual learning. Results are reported for three task sequences: split mnist (s-mnist), split Fashion mnist (s-fmnist) and permuted mnist (p-mnist). In some cases, a multi-head setup (MH) is used; in others, a single-head setup (SH). Best results for identical network architectures are printed in boldface (exception: var-gp uses a non-parametric model). Best overall results are highlighted in gray. Each numerical entry denotes the mean accuracy across tasks at the end of training. Where possible, this accuracy is based on experiments repeated with different random seeds (10 repeats for s-fsvi), with both the mean value and standard error reported. All methods use the same architecture and coreset size unless indicated otherwise. See Appendix C for more experimental details.

${}^{1}$ Accuracies computed using the best coreset-selection method (either random or $k$-center).
${}^{2}$ Uses random coreset selection.
${}^{3}$ Requires a multi-head setup with task identifiers, including for permuted mnist. This requirement explains the missing frcl result for s-mnist (SH).
${}^{4}$ Uses a larger MLP architecture (see Table 4 in appendix).
${}^{5}$ Evaluates the KL divergence at points sampled from the empirical data distribution of the current task.
${}^{6}$ Uses one sample per class as a coreset.

Method	s-mnist (MH)	s-fmnist (MH)	p-mnist (SH)	s-mnist (SH)
ewc (Kirkpatrick et al.,, 2017)	63.10%	—	84.00%	—
si (Zenke et al.,, 2017)	98.90%	—	86.00%	—
vcl (Nguyen et al.,, 2018) ${}^{1}$	98.40%	98.60% ${\scriptstyle\pm 0.04}$	93.00%	32.11% ${\scriptstyle\pm 1.16}$
vcl (no coreset)	97.00%	89.60% ${\scriptstyle\pm 1.75}$	87.50% ${\scriptstyle\pm 0.61}$	17.74% ${\scriptstyle\pm 1.20}$
frcl (Titsias et al.,, 2020) ${}^{3}$	97.80% ${\scriptstyle\pm 0.22}$	97.28% ${\scriptstyle\pm 0.17}$	94.30% ${\scriptstyle\pm 0.06}$	—
fromp (Pan et al.,, 2020)	99.00% ${\scriptstyle\pm 0.04}$	99.00% ${\scriptstyle\pm 0.03}$	94.90% ${\scriptstyle\pm 0.04}$	35.29% ${\scriptstyle\pm 0.52}$
var-gp (Kapoor et al.,, 2021)	—	—	97.20% ${\scriptstyle\pm 0.08}$	90.57% ${\scriptstyle\pm 1.06}$
s-fsvi (ours) ${}^{2}$	99.54% ${\scriptstyle\pm 0.04}$	99.19% ${\scriptstyle\pm 0.02}$	95.76% ${\scriptstyle\pm 0.02}$	92.87% ${\scriptstyle\pm 0.14}$
s-fsvi Ablation Study:
s-fsvi (larger networks) ${}^{4}$	99.76% ${\scriptstyle\pm 0.00}$	99.16% ${\scriptstyle\pm 0.03}$	97.50% ${\scriptstyle\pm 0.01}$	93.38% ${\scriptstyle\pm 0.10}$
s-fsvi (no coreset) ${}^{5}$	99.62% ${\scriptstyle\pm 0.02}$	99.54% ${\scriptstyle\pm 0.01}$	84.06% ${\scriptstyle\pm 0.46}$	20.15% ${\scriptstyle\pm 0.52}$
s-fsvi (minimal coreset) ${}^{6}$	—	—	89.59% ${\scriptstyle\pm 0.30}$	51.44% ${\scriptstyle\pm 1.22}$
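The "mean ± standard error" entries in Table 1 summarize repeated runs with different random seeds. As a small sketch of that arithmetic, the code below computes both quantities for a list of per-seed accuracies. The numbers are made-up placeholders, not results from this paper, and `mean_and_standard_error` is a hypothetical helper.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Return the sample mean and the standard error of the mean."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error = sample standard deviation (n - 1 denominator) / sqrt(n).
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, standard_error

# Placeholder accuracies (in percent) from four hypothetical seeds.
accuracies = [99.5, 99.6, 99.5, 99.4]
mean, se = mean_and_standard_error(accuracies)
print(f"{mean:.2f}% +/- {se:.2f}")
```

The standard error shrinks with the square root of the number of repeats, so it is smaller than the standard deviation across seeds.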

4 Related Work

There are three main (partially overlapping) categories of methods for continual learning in a deep neural network. Objective-based approaches modify the objective function used to train the neural network. Replay-based approaches summarize past tasks using either stored data or freshly generated synthetic data. Architecture-based approaches change the neural network’s structure from one task to another. For extensive reviews, see De Lange et al., (2021) and Parisi et al., (2019). As sequential function-space variational inference (s-fsvi) centers around a new training objective, we focus on objective-based approaches in this review. (Like the methods reviewed below, s-fsvi does incorporate a form of replay in that it uses context points, but the primary interest is the training objective.)

For a neural network to retain abilities it has previously learned, its predictions on data associated with past tasks must not change significantly from one task to another. One way of achieving this is to include in the training objective a form of function-space regularization to discourage important changes in the network’s predictions or internal representations. “Learning without forgetting” (Li and Hoiem,, 2018) uses a modified cross-entropy loss that penalizes the difference between the predictions of the current network on the current task data and the predictions of the previous network on the current task data. “Less-forgetful learning” (Jung et al.,, 2018) employs the same method but uses squared Euclidean distance rather than the modified cross-entropy loss and applies it to the penultimate-layer representations rather than the network’s predictions. “Keep and learn” (Kim et al.,, 2018) also uses internal representations as a basis for regularization. The method subsequently proposed by Benjamin et al., (2019) involves comparing the current network with all previous versions of the network and on data from all past tasks instead of with only the most recent network on data from the current task. Each pair of networks is compared by computing the Euclidean distance between the networks’ predictions. “Dark experience replay” (Buzzega et al.,, 2020) extends this method to work in a setting where task boundaries are not clearly defined.

While these approaches mitigate forgetting, they do not explicitly account for predictive uncertainty, which is an issue if the neural network is a poor fit to the data. This deficiency is addressed by probabilistic approaches to function-space regularization, which encourage a network’s predictions to agree with a prior distribution over functions rather than with a single function. “Functional regularization for continual learning” (frcl; Titsias et al.,, 2020) considers a network whose final layer is a Bayesian linear model. Based on the duality between parameter space and function space, the frcl objective includes the KL divergence between predictive distributions at a selection of input points. This encourages similarity between the network’s current predictive distribution and the distributions from past tasks. frcl is theoretically appealing, building on a well-understood method for stochastic variational inference using inducing points, but is only applicable to Bayesian linear models. In contrast, “functional regularization of the memorable past” (fromp; Pan et al.,, 2020) maintains a posterior distribution over all the parameters of a neural network. While fromp achieves state-of-the-art performance on several continual-learning task sequences, it relies on a change in the underlying probabilistic model and uses a surrogate objective for optimization, which divorces it from function-space variational objectives. As we show, this results in suboptimal performance compared to sequential function-space variational inference, which maintains a stronger link to the underlying Bayesian approximation.

Although our focus is on methods for training deep neural networks, for completeness, we also note methods based on Gaussian processes (gps). Incremental variational sparse gp regression (Cheng and Boots,, 2016), streaming sparse gps (Bui et al.,, 2017) and online sparse multi-output gp regression (Yang et al.,, 2019) built on the work of Csató and Opper, (2002) and Csató, (2002), and are effective approaches to continual learning for regression tasks. Continual multi-task gps (Moreno-Muñoz et al.,, 2019) extend to multi-output settings with non-Gaussian likelihoods. The success of variational autoregressive gps (var-gp; Kapoor et al.,, 2021) on continual learning for task sequences with image inputs gives reason for inclusion where relevant in Section 5. However, we note that var-gp scales poorly with the number of tasks: the time complexity for inference is cubic in the number of context points and hence in the number of tasks, which may limit its applicability to task sequences like sequential Omniglot. In contrast, the time complexity of s-fsvi is linear in the number of context points.

Also distinct from but related to our method are a number of objective-based approaches to continual learning that directly regularize the parameters of a neural network. We briefly discuss these approaches in Appendix D.

5 Empirical Evaluation

After visualizing how s-fsvi works in practice (Section 5.1), we compare s-fsvi’s performance with that of existing objective-based methods for continual learning (Sections 5.2, 5.3 and 5.4). For a comprehensive comparison, we evaluate s-fsvi on a range of task sequences used in related work. To use the strongest baselines available, we report results taken directly from the literature in most cases (and mention when we do not). Reporting baselines in this way leaves gaps in our comparison: for each existing technique, results are available for only a subset of the task sequences we consider here (e.g., Pan et al. (2020) report results for split cifar but not sequential Omniglot, while Titsias et al. (2020) do the reverse).

Our evaluation pays attention to two factors important in the assessment of continual-learning methods: the use of task identifiers when making predictions, and the use of coresets of data points to summarize past tasks (Farquhar and Gal, 2018). To provide some commentary on the first of these factors, we run an experiment that compares the performance of a single-head neural network (which does not use task identifiers) to that of a multi-head neural network (which uses task identifiers). Regarding the second factor, we explore how performance changes when the coreset size changes or a context set unrelated to previous tasks is used.

Details about the experimental setups (e.g., optimization routines and hyperparameter searches) can be found in Appendix C. Our code can be accessed at:

5.1 Illustrative Example

To provide intuition for how s-fsvi allows learning on new tasks while maintaining previously acquired abilities, we apply it to a task sequence based on easy-to-visualize synthetic 2D data, originally proposed by Pan et al., (2020). In this task sequence, each data point belongs to one of two classes, and more data points are revealed as the task sequence progresses. The data-generating process is assumed to reveal data from mostly non-overlapping subsets of the input space. The continual-learning problem is then to infer the decision boundary around data points revealed up to and including the current task without forgetting the decision boundary inferred on previous tasks. We use a single-head neural network.

In Figure 2, we plot the model’s posterior predictive distribution after training on each of five tasks. After training on task 1, the model has low predictive uncertainty close to the data points and high uncertainty (class probabilities around 0.5) everywhere else (Figure 2a). On task 2, s-fsvi seeks to match the distribution over functions inferred on the previous task while fitting the new set of data points. s-fsvi achieves this and expands the area in input space where the model is confident in its predictions (Figure 2b).

As more tasks and data are revealed, s-fsvi allows the model to continually explore the data space and infer the decision boundary while maintaining accurate, high-confidence predictions on data points in parts of the input space where it was previously trained on observed data. Finally, after training on five tasks, the model has inferred the decision boundary between the two classes, while maintaining high predictive uncertainty in parts of the input space where no data points have been observed yet (Figure 2e). The model maintains high predictive uncertainty away from the data, which makes it easier to learn on new tasks. This is unlike deterministic neural networks, which tend to make highly confident predictions in parts of the input space where no data has been observed, or on data points that lie outside of the distribution of the training data.

5.2 Split (Fashion) MNIST & Permuted MNIST

Having established some intuition for how s-fsvi works, we demonstrate how this translates to high predictive accuracy on three task sequences commonly used to evaluate continual-learning methods. First is split mnist (s-mnist), in which each task consists of binary classification on a pair of mnist classes (0 vs. 1, 2 vs. 3, and so on). Second is split Fashion mnist (s-fmnist), which has the same structure but uses data from Fashion mnist, posing a harder problem. Third is permuted mnist (p-mnist), in which each task consists of ten-way classification on mnist images whose pixels have been randomly reordered. A multi-head setup (MH) with task identifiers provided at prediction time is the default for s-mnist and s-fmnist, while a single-head setup (SH) without task identifiers is standard for p-mnist. In addition to running the default setup for all three task sequences, we run a single-head setup for s-mnist.
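The structure of these task sequences can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the data pipeline used in the experiments. It works on plain Python lists in place of real images, and the helper names, the toy label list and the number of permuted tasks are all hypothetical.

```python
import random

def split_tasks(labels, classes_per_task=2):
    """Group example indices into tasks of consecutive classes (0 vs. 1, 2 vs. 3, ...)."""
    classes = sorted(set(labels))
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        task_classes = classes[start:start + classes_per_task]
        indices = [i for i, y in enumerate(labels) if y in task_classes]
        tasks.append((task_classes, indices))
    return tasks

def permuted_tasks(num_pixels, num_tasks, seed=0):
    """Draw one fixed random pixel ordering per task."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(num_tasks):
        order = list(range(num_pixels))
        rng.shuffle(order)
        permutations.append(order)
    return permutations

def apply_permutation(image, permutation):
    """Reorder the pixels of a flattened image."""
    return [image[p] for p in permutation]

# Toy label list standing in for a ten-class dataset.
labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1]
tasks = split_tasks(labels)

# Toy 4-pixel "image" and three permuted tasks.
permutations = permuted_tasks(num_pixels=4, num_tasks=3)
image = [10, 20, 30, 40]
permuted_image = apply_permutation(image, permutations[0])
print(tasks[0], permuted_image)
```

In the split setting, each task sees a disjoint subset of classes. In the permuted setting, every task contains all ten classes and differs only in the pixel ordering, which stays fixed within a task.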

With a standard configuration, s-fsvi outperforms all existing methods based on deep neural networks by a statistically significant margin on all task sequences (Table 1). As noted in Section 4, var-gp’s conceptual connection to our method warrants its inclusion in our comparison. var-gp performs better than our standard configuration of s-fsvi on permuted mnist, but this advantage disappears once a larger neural network is used with s-fsvi. Moreover, var-gp is unlikely to scale well to more challenging task sequences, such as those in Sections 5.3 and 5.4.

Table 2: Predictive accuracies of s-fsvi and related methods on sequential Omniglot. For s-fsvi and frcl, the coreset consists of two data points per class. All baseline results are from Titsias et al., (2020). For all methods, the mean and standard deviation over five random task permutations are reported.

Method	Test Accuracy
Learning Without Forgetting ${}^{1}$	62.06% ${\scriptstyle\pm 2.0}$
ewc	67.32% ${\scriptstyle\pm 4.7}$
Online ewc ${}^{2}$	69.99% ${\scriptstyle\pm 3.2}$
Progress & Compress ${}^{3}$	70.32% ${\scriptstyle\pm 3.3}$
frcl ${}^{4}$	81.47% ${\scriptstyle\pm 1.6}$
s-fsvi (ours) ${}^{5}$	83.29% ${\scriptstyle\pm 1.2}$

${}^{1}$ Li and Hoiem, (2018).
${}^{2}$ Schwarz et al., (2018).
${}^{3}$ Schwarz et al., (2018).
${}^{4}$ Coreset selected using frcl’s “trace” method.
${}^{5}$ Details in Appendix C.

5.3 Sequential Omniglot

Sequential Omniglot (Lake et al.,, 2015; Schwarz et al.,, 2018) provides a more challenging task sequence than those considered in Section 5.2. It consists of 50 classification tasks, where the number of classes varies between the tasks (details in Appendix C). We find that s-fsvi produces better predictive accuracy than all available baselines, including frcl, by a statistically significant margin (Table 2). To illustrate the stability of s-fsvi across long task sequences, we plot its mean accuracy over 50 tasks in Figure 5.

5.4 Split CIFAR

Moving beyond classification tasks on grayscale images, we evaluate s-fsvi on split cifar (Pan et al.,, 2020; Zenke et al.,, 2017). This uses the full cifar-10 dataset for the first task, followed by five ten-way classification tasks drawn from cifar-100. Our results show s-fsvi achieving higher accuracy on all tasks than fromp and vcl after learning all six tasks (Figure 4a). Notably, on each task except the first, s-fsvi performs close to or better than two baselines: a model trained only on that task, and a model trained on all tasks jointly. The latter is a particularly strong baseline, because all data is available during training.

As in related work (Lopez-Paz and Ranzato,, 2017; Pan et al.,, 2020), we compute the forward transfer (FT) and backward transfer (BT) for s-fsvi on split cifar. FT captures by how much the accuracy on the current tasks increases as the number of past tasks increases; BT captures by how much the accuracy on the previous tasks increases as more tasks are observed (see Section C.6 for mathematical definitions). As well as having the best overall accuracy, s-fsvi significantly outperforms all baselines in terms of FT and has BT comparable to ewc and fromp (Table 3).
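To make the two metrics concrete, here is a minimal sketch of how FT and BT could be computed from a table of accuracies. It assumes the averaging convention commonly used in related work (our reading of Pan et al., (2020)): FT averages $R_{i,i}-R_{i}^{\textrm{ind}}$ over all tasks but the first, and BT averages $R_{T,i}-R_{i,i}$ over all tasks but the last. The function names and the toy numbers are made up for illustration.

```python
def forward_transfer(acc, acc_independent):
    """Mean gain of the continual model over independently trained models.

    acc[j][i] is the accuracy on task i after training on tasks 0..j
    (0-indexed); acc_independent[i] is the accuracy of a model trained
    only on task i. The first task is skipped: it has no past tasks.
    """
    num_tasks = len(acc)
    gains = [acc[i][i] - acc_independent[i] for i in range(1, num_tasks)]
    return sum(gains) / (num_tasks - 1)


def backward_transfer(acc):
    """Mean change in accuracy on earlier tasks after training on all tasks."""
    num_tasks = len(acc)
    final = acc[num_tasks - 1]
    gains = [final[i] - acc[i][i] for i in range(num_tasks - 1)]
    return sum(gains) / (num_tasks - 1)


# Toy accuracies for three tasks (illustrative values, not results).
toy_acc = [
    [0.90],
    [0.85, 0.80],
    [0.80, 0.78, 0.70],
]
toy_independent = [0.90, 0.75, 0.68]

ft = forward_transfer(toy_acc, toy_independent)
bt = backward_transfer(toy_acc)
```

A negative BT, as in the toy example, indicates some forgetting of earlier tasks; a positive FT indicates that earlier tasks helped with later ones.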

Table 3: Forward transfer (FT) and backward transfer (BT) of s-fsvi and related methods on split cifar. All baseline results are from Pan et al., (2020). For all methods, the mean and standard error over five repeated experiments are reported.

Method	Test Accuracy	FT	BT
ewc	71.6% ${\scriptstyle\pm 0.4}$	0.2 ${\scriptstyle\pm 0.4}$	-2.3 ${\scriptstyle\pm 0.6}$
vcl	67.4% ${\scriptstyle\pm 0.6}$	1.8 ${\scriptstyle\pm 1.4}$	-9.2 ${\scriptstyle\pm 0.8}$
fromp	76.2% ${\scriptstyle\pm 0.2}$	6.1 ${\scriptstyle\pm 0.3}$	-2.6 ${\scriptstyle\pm 0.4}$
s-fsvi (ours) ${}^{1}$	77.6% ${\scriptstyle\pm 0.2}$	7.3 ${\scriptstyle\pm 0.2}$	-2.5 ${\scriptstyle\pm 0.2}$

${}^{1}$ Details in Appendix C.

5.5 Function- vs. Parameter-Space Inference

To demonstrate the importance of performing inference in function space, we compare how the accuracies of s-fsvi and vcl evolve from one task to another on split mnist and permuted mnist (Figure 7). We find that s-fsvi consistently outperforms vcl, whose predictive performance steadily degrades. This suggests that function-space inference may be more effective than parameter-space inference at transferring prior knowledge from one task to another, and that this may offset the information loss in the KL divergence between distributions over functions compared to the KL divergence between distributions over parameters.

5.6 Coreset Size and Selection

Similar to existing methods such as fromp and frcl, s-fsvi includes in the training objective a function-space regularization term that encourages matching the prior distribution over functions at a set of context points. Typically, this requires keeping a representative coreset of data points from each task, from which a context set can be constructed.

s-fsvi offers two benefits with respect to coresets. First, it is insensitive to which points get included in the coresets. Whereas existing methods often require expensive procedures to select important data points from previous tasks, Figures 3 and 4b show that s-fsvi achieves strong performance while only using randomly selected coresets. Second, s-fsvi does not require large coresets to perform well. On permuted mnist, s-fsvi achieves better predictive accuracy than ewc and si even if the coreset used for s-fsvi consists of only a single data point per class (Table 1). On the single-head version of split mnist, a minimal coreset (one point per class, or two points per task) allows s-fsvi to outperform vcl and fromp, both with coresets of 40 points per task (Table 1). In some multi-head settings, s-fsvi achieves state-of-the-art predictive accuracies with randomly-generated noise coresets (Table 1 and Figure 7).

6 Conclusion

We presented sequential function-space variational inference (s-fsvi), a method for continual learning in deep neural networks. We showed that s-fsvi improves on the predictive performance of existing objective-based continual learning methods—often by a significant margin—including on task sequences with high-dimensional inputs (split cifar) and large numbers of tasks (sequential Omniglot). Lastly, we demonstrated that—unlike existing function-space regularization methods—s-fsvi does not rely on careful coreset selection and, in multi-head settings, can achieve state-of-the-art performance even without coresets collected on previous tasks. We hope that this work will lead to future research into further improving function-space objectives for continual learning.

Acknowledgements

Tim G. J. Rudner and Freddie Bickford Smith are funded by the Engineering and Physical Sciences Research Council (EPSRC). Tim G. J. Rudner is also funded by the Rhodes Trust and by a Qualcomm Innovation Fellowship. We gratefully acknowledge donations of computing resources by the Alan Turing Institute.

References

Ahn et al., (2019) Ahn, H., Cha, S., Lee, D., and Moon, T. (2019). Uncertainty-based continual learning with adaptive regularization. In Advances in Neural Information Processing Systems.
Aljundi et al., (2018) Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., and Tuytelaars, T. (2018). Memory aware synapses: learning what (not) to forget. In European Conference on Computer Vision.
Benjamin et al., (2019) Benjamin, A., Rolnick, D., and Kording, K. (2019). Measuring and regularizing networks in function space. In International Conference on Learning Representations.
Broderick et al., (2013) Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. (2013). Streaming variational Bayes. In Advances in Neural Information Processing Systems.
Bui et al., (2017) Bui, T. D., Nguyen, C., and Turner, R. E. (2017). Streaming sparse Gaussian process approximations. In Advances in Neural Information Processing Systems.
Burt et al., (2021) Burt, D. R., Ober, S. W., Garriga-Alonso, A., and van der Wilk, M. (2021). Understanding variational inference in function-space. In Symposium on Advances in Approximate Bayesian Inference.
Buzzega et al., (2020) Buzzega, P., Boschini, M., Porrello, A., Abati, D., and Calderara, S. (2020). Dark experience for general continual learning: a strong, simple baseline. In Advances in Neural Information Processing Systems.
Chaudhry et al., (2018) Chaudhry, A., Dokania, P., Ajanthan, T., and Torr, P. (2018). Riemannian walk for incremental learning: understanding forgetting and intransigence. In European Conference on Computer Vision.
Cheng and Boots, (2016) Cheng, C.-A. and Boots, B. (2016). Incremental variational sparse Gaussian process regression. In Advances in Neural Information Processing Systems.
Cover and Thomas, (1991) Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley, New York.
Csató, (2002) Csató, L. (2002). Gaussian processes: iterative sparse approximations. PhD thesis, Aston University.
Csató and Opper, (2002) Csató, L. and Opper, M. (2002). Sparse on-line Gaussian processes. Neural Computation.
De Lange et al., (2021) De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. (2021). A continual learning survey: defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Ebrahimi et al., (2020) Ebrahimi, S., Elhoseiny, M., Darrell, T., and Rohrbach, M. (2020). Uncertainty-guided continual learning with Bayesian neural networks. In International Conference on Learning Representations.
Farquhar and Gal, (2018) Farquhar, S. and Gal, Y. (2018). Towards robust evaluations of continual learning. ICML Workshop on Lifelong Learning: A Reinforcement Learning Approach.
Ghahramani and Attias, (2000) Ghahramani, Z. and Attias, H. (2000). Online variational Bayesian learning. In NIPS Workshop on Online Learning.
Honkela and Valpola, (2003) Honkela, A. and Valpola, H. (2003). On-line variational Bayesian learning. In International Symposium on Independent Component Analysis and Blind Signal Separation.
Jung et al., (2018) Jung, H., Ju, J., Jung, M., and Kim, J. (2018). Less-forgetful learning for domain expansion in deep neural networks. In AAAI Conference on Artificial Intelligence.
Kapoor et al., (2021) Kapoor, S., Karaletsos, T., and Bui, T. D. (2021). Variational auto-regressive Gaussian processes for continual learning. In International Conference on Machine Learning.
Kessler et al., (2019) Kessler, S., Nguyen, V., Zohren, S., and Roberts, S. (2019). Hierarchical Indian buffet neural networks for Bayesian continual learning. arXiv.
Kim et al., (2018) Kim, H.-E., Kim, S., and Lee, J. (2018). Keep and learn: continual learning by constraining the latent space for knowledge preservation in neural networks. In Medical Image Computing and Computer Assisted Intervention.
Kirkpatrick et al., (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N. C., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences.
Lake et al., (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science.
Lee et al., (2017) Lee, S.-W., Kim, J.-H., Jun, J., Ha, J.-W., and Zhang, B.-T. (2017). Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems.
Li and Hoiem, (2018) Li, Z. and Hoiem, D. (2018). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Liu et al., (2018) Liu, X., Masana, M., Herranz, L., van de Weijer, J., López, A. M., and Bagdanov, A. D. (2018). Rotate your networks: better weight consolidation and less catastrophic forgetting. International Conference on Pattern Recognition.
Loo et al., (2020) Loo, N., Swaroop, S., and Turner, R. E. (2020). Generalized variational continual learning. In International Conference on Learning Representations.
Lopez-Paz and Ranzato, (2017) Lopez-Paz, D. and Ranzato, M. A. (2017). Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems.
Matthews et al., (2016) Matthews, A. G. d. G., Hensman, J., Turner, R., and Ghahramani, Z. (2016). On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In International Conference on Artificial Intelligence and Statistics.
Moreno-Muñoz et al., (2019) Moreno-Muñoz, P., Artés-Rodríguez, A., and Álvarez, M. A. (2019). Continual multi-task Gaussian processes. arXiv.
Nguyen et al., (2018) Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. (2018). Variational continual learning. In International Conference on Learning Representations.
Pan et al., (2020) Pan, P., Swaroop, S., Immer, A., Eschenhagen, R., Turner, R., and Khan, M. E. E. (2020). Continual deep learning by functional regularisation of memorable past. In Advances in Neural Information Processing Systems.
Parisi et al., (2019) Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. (2019). Continual lifelong learning with neural networks: a review. Neural Networks.
Park et al., (2019) Park, D., Hong, S., Han, B., and Lee, K. M. (2019). Continual learning by asymmetric loss approximation with single-side overestimation. In International Conference on Computer Vision.
Ritter et al., (2018) Ritter, H., Botev, A., and Barber, D. (2018). Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems.
Rudner et al., (2021) Rudner, T. G. J., Chen, Z., and Gal, Y. (2021). Rethinking function-space variational inference in Bayesian neural networks. In Symposium on Advances in Approximate Bayesian Inference.
Sato, (2001) Sato, M.-A. (2001). Online model selection based on the variational Bayes. Neural Computation.
Schwarz et al., (2018) Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., and Hadsell, R. (2018). Progress & compress: a scalable framework for continual learning. In International Conference on Machine Learning.
Shannon and Weaver, (1949) Shannon, C. E. and Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press, Urbana and Chicago.
Sun et al., (2019) Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational Bayesian neural networks. In International Conference on Learning Representations.
Swaroop et al., (2019) Swaroop, S., Nguyen, C. V., Bui, T. D., and Turner, R. E. (2019). Improving and understanding variational continual learning. In NeurIPS Workshop on Continual Learning.
Titsias et al., (2020) Titsias, M. K., Schwarz, J., de G. Matthews, A. G., Pascanu, R., and Teh, Y. W. (2020). Functional regularisation for continual learning with Gaussian processes. In International Conference on Learning Representations.
Yang et al., (2019) Yang, L., Wang, K., and Mihaylova, L. S. (2019). Online sparse multi-output Gaussian process regression and learning. IEEE Transactions on Signal and Information Processing over Networks.
Yin et al., (2020a) Yin, D., Farajtabar, M., and Li, A. (2020a). SOLA: continual learning with second-order loss approximation. arXiv.
Yin et al., (2020b) Yin, D., Farajtabar, M., Li, A., Levine, N., and Mott, A. (2020b). Optimization and generalization of regularization-based continual learning: a loss approximation viewpoint. arXiv.
Zenke et al., (2017) Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In International Conference on Machine Learning.

Supplementary Material

Appendix A: Proofs
Appendix B: Further Empirical Results
Appendix C: Experimental Details
Appendix D: Further Related Work

Appendix A Proofs

A.1 Variational Objective

Proposition 1 (Sequential Function-Space Variational Inference (s-fsvi); adapted from (Rudner et al.,, 2021)).

Let $D_{t}$ be the number of model output dimensions for $t$ tasks, let $f:\mathcal{X}\times\mathbb{R}^{P}\rightarrow\mathbb{R}^{D_{t}}$ be a mapping defined by a neural network architecture, let $\bm{\Theta}\in\mathbb{R}^{P}$ be a multivariate random vector of network parameters, and let $q_{t}({\bm{\theta}})\doteq\mathcal{N}({\bm{\mu}}_{t},\bm{\Sigma}_{t})$ and $q_{t-1}({\bm{\theta}})\doteq\mathcal{N}({\bm{\mu}}_{t-1},\bm{\Sigma}_{t-1})$ be variational distributions over $\bm{\Theta}$. Additionally, let $\mathbf{X}_{\mathcal{C}}$ denote a sample of context points, and let $\bar{\mathbf{X}}_{t}\subseteq\{\mathbf{X}_{t}\cup\mathbf{X}_{\mathcal{C}}\}$. Under a diagonal approximation of the prior and variational posterior covariance functions across output dimensions, the objective in Equation 5 can be approximated by

$$
\begin{aligned}
&\mathcal{F}(q_{t},q_{t-1},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})\\
&\doteq\mathbb{E}_{q_{t}({\bm{\theta}})}[\log p(\mathbf{y}_{t}\,|\,f(\mathbf{X}_{t};{\bm{\theta}}))]\\
&\qquad-\sum_{k=1}^{D_{t}}\frac{1}{2}\bigg(\log\frac{|[\mathbf{K}^{p_{t}}]_{k}|}{|[\mathbf{K}^{q_{t}}]_{k}|}-\frac{|\bar{\mathbf{X}}_{t}|}{D_{t}}+\operatorname{Tr}([\mathbf{K}^{p_{t}}]_{k}^{-1}[\mathbf{K}^{q_{t}}]_{k})+\Delta(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t},{\bm{\mu}}_{t-1})^{\top}[\mathbf{K}^{p_{t}}]_{k}^{-1}\Delta(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t},{\bm{\mu}}_{t-1})\bigg),
\end{aligned}
\tag{A.1}
$$

where

$$
\Delta(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t},{\bm{\mu}}_{t-1})\doteq[f(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t})]_{k}-[f(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t-1})]_{k}
\tag{A.2}
$$

and

$$
\mathbf{K}^{p_{t}}\doteq\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t-1})\bm{\Sigma}_{t-1}\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t-1})^{\top}\quad\text{and}\quad\mathbf{K}^{q_{t}}\doteq\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t})\bm{\Sigma}_{t}\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t})^{\top}.
\tag{A.3}
$$

Proof.

The result follows directly from the variational objective derived by Rudner et al., (2021) when setting the prior to $p\doteq q_{t-1}$ and specifying the context set to be constructed from the coreset. ∎
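The regularization term in Proposition 1 is, for each output dimension $k$, a KL divergence between two multivariate Gaussians whose covariances come from the linearized network. The sketch below evaluates that quantity for a single output dimension in plain Python. It is a minimal illustration under stated assumptions: the parameter covariance is taken to be diagonal, the constant offset uses the textbook closed form (minus the number of evaluation points), which may differ from the constant in Equation A.1 without affecting gradients, and all function names are hypothetical.

```python
import math


def linearized_covariance(jacobian, sigma_diag):
    """K = J diag(sigma) J^T, with one Jacobian row per evaluation point."""
    return [
        [sum(a * s * b for a, s, b in zip(row_i, sigma_diag, row_j))
         for row_j in jacobian]
        for row_i in jacobian
    ]


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][m] * lower[j][m] for m in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def solve_spd(lower, rhs):
    """Solve (L L^T) x = rhs by forward and then backward substitution."""
    n = len(lower)
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(lower[i][m] * y[m] for m in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(lower[m][i] * x[m] for m in range(i + 1, n))
        x[i] = (y[i] - tail) / lower[i][i]
    return x


def gaussian_kl_term(k_q, k_p, delta):
    """KL(N(m_q, k_q) || N(m_p, k_p)) with delta = m_q - m_p."""
    n = len(delta)
    chol_p = cholesky(k_p)
    chol_q = cholesky(k_q)
    log_det_p = 2.0 * sum(math.log(chol_p[i][i]) for i in range(n))
    log_det_q = 2.0 * sum(math.log(chol_q[i][i]) for i in range(n))
    # Tr(K_p^{-1} K_q): solve against each column of K_q, sum the diagonal.
    trace = sum(
        solve_spd(chol_p, [k_q[r][c] for r in range(n)])[c] for c in range(n)
    )
    mahalanobis = sum(d * s for d, s in zip(delta, solve_spd(chol_p, delta)))
    return 0.5 * (log_det_p - log_det_q - n + trace + mahalanobis)


# Two evaluation points, two parameters (toy values for illustration).
toy_k = linearized_covariance([[1.0, 0.0], [1.0, 1.0]], [2.0, 3.0])
kl_same = gaussian_kl_term(toy_k, toy_k, [0.0, 0.0])
kl_scalar = gaussian_kl_term([[1.0]], [[4.0]], [2.0])
```

The term vanishes when the two distributions coincide and grows both with the mismatch in covariances and with the change in mean predictions $\Delta$ at the evaluation points.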

A.2 Derivation of Correspondence to Other Function-Space Objectives

Proposition 2 (Relationship between fromp and s-fsvi).

$$
\mathcal{L}^{\textsc{fromp}}(q_{t}^{\textsc{fromp}},q_{t-1}^{\textsc{fromp}},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})=\mathcal{F}(q_{t}^{\textsc{fromp}},q_{t-1}^{\textsc{fromp}},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})-\mathcal{V},
$$

where

$$
\mathcal{V}\doteq-\frac{1}{2}\sum_{k}\left(\log\frac{[\bar{\mathbf{K}}^{\hat{p}_{t}}]_{k}}{[\bar{\mathbf{K}}^{q_{t}}]_{k}}+\frac{[\bar{\mathbf{K}}^{q_{t}}]_{k}}{[\bar{\mathbf{K}}^{\hat{p}_{t}}]_{k}}-1\right),
$$

with $\bar{\mathbf{K}}$ denoting a covariance matrix under a block-diagonalization without inter-task dependence, and

$$
\bar{\mathbf{K}}^{\hat{p}_{t}}\doteq\text{block-diag}\left(\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t-1})\hat{\bm{\Sigma}}_{0}({\bm{\mu}}_{t-1})\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t-1})^{\top}\right).
$$

Proof.

By Equation (8) in Pan et al., (2020), the fromp objective function is given by

$$
\begin{aligned}
&\mathcal{L}^{\textsc{fromp}}(q_{t}^{\textsc{fromp}},q_{t-1}^{\textsc{fromp}},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})\\
&\doteq\mathbb{E}_{q_{t}({\bm{\theta}})}[\log p(\mathbf{y}_{t}\,|\,f(\mathbf{X}_{t};{\bm{\mu}}_{t}))]+\sum_{k=1}^{t-1}\frac{\tau}{2}\left([f(\mathbf{X}_{\mathcal{C}};{\bm{\mu}}_{t})]_{k}-[f(\mathbf{X}_{\mathcal{C}};{\bm{\mu}}_{t-1})]_{k}\right)^{\top}[\mathbf{K}^{\hat{p}_{t}}]^{-1}_{k}\left([f(\mathbf{X}_{\mathcal{C}};{\bm{\mu}}_{t})]_{k}-[f(\mathbf{X}_{\mathcal{C}};{\bm{\mu}}_{t-1})]_{k}\right),
\end{aligned}
\tag{A.4}
$$

with temperature parameter $\tau$. The result follows directly from the definition of $\mathcal{F}(q_{t}^{\textsc{fromp}},q_{t-1}^{\textsc{fromp}},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})$ by setting $\tau=1$. ∎

Proposition 3 (Relationship between frcl and s-fsvi).

With the s-fsvi objective $\mathcal{F}$ defined as in Equation 6, let $\bar{\mathbf{X}}_{t}=\mathbf{X}_{\mathcal{C}}$, and let $f^{\text{LM}}(\cdot\,;\bm{\Theta})\doteq\Phi_{\psi}(\cdot)\bm{\Theta}$ be a Bayesian linear model, where $\Phi_{\psi}(\cdot)$ is a deterministic feature map parameterized by $\psi$. Then the frcl objective corresponds to the s-fsvi objective for the model $f^{\text{LM}}(\cdot\,;\bm{\Theta})$ plus an additional weight-space KL divergence penalty. That is, for $p_{t}({\bm{\theta}})\doteq\mathcal{N}(\mathbf{0},\mathbf{I})$ and $q_{t}({\bm{\theta}})\doteq\mathcal{N}({\bm{\mu}}_{t},\bm{\Sigma}_{t})$,

$$
\mathcal{L}^{\textsc{frcl}}(q_{t}^{\textsc{frcl}},q_{t-1}^{\textsc{frcl}},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})=\mathcal{F}(q_{t}^{\textsc{frcl}},q_{t-1}^{\textsc{frcl}},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})-\mathbb{D}_{\textrm{KL}}(q_{t}({\bm{\theta}})\,\|\,p_{t}({\bm{\theta}})).
\tag{A.5}
$$

Proof.

By Section 2.3 in Titsias et al., (2020), the frcl objective function is given by

$$
\begin{aligned}
\mathcal{L}^{\textsc{frcl}}({\bm{\mu}}_{t},\bm{\Sigma}_{t},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})&\doteq\mathbb{E}_{q_{t}({\bm{\theta}})}[\log p(\mathbf{y}_{t}\,|\,\Phi_{\psi}(\mathbf{X}_{t}){\bm{\theta}})]-\mathbb{D}_{\textrm{KL}}(q_{t}({\bm{\theta}})\,\|\,p_{t}({\bm{\theta}}))\\
&\quad-\mathbb{D}_{\textrm{KL}}(\tilde{q}_{t}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{t}};{\bm{\theta}}))\,\|\,\tilde{p}_{t}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{t}};{\bm{\theta}})))-\sum_{k=1}^{t-1}\mathbb{D}_{\textrm{KL}}(\perp(\tilde{q}_{k}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{k}};{\bm{\theta}})))\,\|\,\tilde{p}_{k}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{k}};{\bm{\theta}}))),
\end{aligned}
\tag{A.6}
$$

with the inducing points associated with task $k$ denoted by $\mathbf{X}_{\mathcal{C}_{k}}$ and $\perp$ denoting the stop-gradient operator, whereas the s-fsvi objective for a Bayesian linear model is

$$
\begin{aligned}
&\mathcal{F}(q_{t},q_{t-1},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})\doteq\mathbb{E}_{q_{t}({\bm{\theta}})}[\log p(\mathbf{y}_{t}\,|\,\Phi_{\psi}(\mathbf{X}_{t}){\bm{\theta}})]\\
&\qquad-\sum_{k=1}^{D_{t}}\frac{1}{2}\bigg(\log\frac{|[\mathbf{K}^{p_{t}}]_{k}|}{|[\mathbf{K}^{q_{t}}]_{k}|}-\frac{|\bar{\mathbf{X}}_{t}|}{D_{t}}+\operatorname{Tr}([\mathbf{K}^{p_{t}}]_{k}^{-1}[\mathbf{K}^{q_{t}}]_{k})+\Delta(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t},{\bm{\mu}}_{t-1})^{\top}[\mathbf{K}^{p_{t}}]_{k}^{-1}\Delta(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t},{\bm{\mu}}_{t-1})\bigg),
\end{aligned}
\tag{A.7}
$$

with

$$
\Delta(\bar{\mathbf{X}}_{t};{\bm{\mu}}_{t},{\bm{\mu}}_{t-1})\doteq[\Phi_{\psi}(\bar{\mathbf{X}}_{t}){\bm{\mu}}_{t}]_{k}-[\Phi_{\psi}(\bar{\mathbf{X}}_{t}){\bm{\mu}}_{t-1}]_{k}
\tag{A.8}
$$

and

$$
\mathbf{K}^{p_{t}}\doteq\Phi_{\psi}(\bar{\mathbf{X}}_{t})\bm{\Sigma}_{t-1}\Phi_{\psi}(\bar{\mathbf{X}}_{t})^{\top}\qquad\mathbf{K}^{q_{t}}\doteq\Phi_{\psi}(\bar{\mathbf{X}}_{t})\bm{\Sigma}_{t}\Phi_{\psi}(\bar{\mathbf{X}}_{t})^{\top}.
\tag{A.9}
$$

Letting $\mathbf{X}_{\mathcal{C}_{k}}$ be the context points associated with task $k$ and letting $\bar{\mathbf{K}}$ denote a covariance matrix under a block-diagonalization without inter-task dependence, we define

$$
\bar{\mathbf{K}}^{p_{t}}\doteq\text{block-diag}\left(\Phi_{\psi}(\bar{\mathbf{X}}_{t})\bm{\Sigma}_{t-1}\Phi_{\psi}(\bar{\mathbf{X}}_{t})^{\top}\right)\qquad\bar{\mathbf{K}}^{q_{t}}\doteq\text{block-diag}\left(\Phi_{\psi}(\bar{\mathbf{X}}_{t})\bm{\Sigma}_{t}\Phi_{\psi}(\bar{\mathbf{X}}_{t})^{\top}\right),
\tag{A.10}
$$

with diagonal entries $\{\mathbf{K}^{p_{t}}_{1},...,\mathbf{K}^{p_{t}}_{t}\}$ and $\{\mathbf{K}^{q_{t}}_{1},...,\mathbf{K}^{q_{t}}_{t}\}$, respectively, where each $\mathbf{K}^{p_{t}}_{k}$ is computed from task-specific context points $\mathbf{X}_{\mathcal{C}_{k}}$. Fixing $[{\bm{\mu}}_{t-1}]_{k}=\mathbf{0}$ and $[\bm{\Sigma}_{t-1}]_{k}=\mathbf{I}_{M_{k}}$ for all $k\leq t$ with $M_{k}=|\mathbf{X}_{\mathcal{C}_{k}}|$, as in Titsias et al., (2020), we then get

$$
\bar{\mathbf{K}}^{p_{t}}_{k}=\Phi_{\psi}(\bar{\mathbf{X}}_{\mathcal{C}_{k}})\Phi_{\psi}(\bar{\mathbf{X}}_{\mathcal{C}_{k}})^{\top}\quad\forall k\leq t.
\tag{A.11}
$$

Considering $[{\bm{\mu}}_{t}]_{k}$ and $[\bm{\Sigma}_{t}]_{k}$ as fixed for all $k\leq t-1$, as in Titsias et al., (2020), using the stop-gradient operator $\perp$, we can write the s-fsvi objective as

$$
\begin{aligned}
&\mathcal{F}(q_{t},q_{t-1},\mathbf{X}_{\mathcal{C}},\mathbf{X}_{t},\mathbf{y}_{t})\doteq\mathbb{E}_{q_{t}({\bm{\theta}})}[\log p(\mathbf{y}_{t}\,|\,\Phi_{\psi}(\mathbf{X}_{t}){\bm{\theta}})]\\
&\qquad-\mathbb{D}_{\textrm{KL}}(\tilde{q}_{t}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{t}};{\bm{\theta}}))\,\|\,\tilde{p}_{t}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{t}};{\bm{\theta}})))-\sum_{k=1}^{t-1}\mathbb{D}_{\textrm{KL}}(\perp(\tilde{q}_{k}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{k}};{\bm{\theta}})))\,\|\,\tilde{p}_{k}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{k}};{\bm{\theta}}))),
\end{aligned}
\tag{A.12}
$$

concluding the proof. ∎
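One step the proof leaves implicit is why the Jacobian-based covariances of Proposition 1 reduce to the feature-map form used for the Bayesian linear model. A short derivation, assuming the Jacobian $\mathcal{J}$ is taken with respect to the parameters ${\bm{\theta}}$ and the feature-map parameters $\psi$ are held fixed:

```latex
\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}})
  = \frac{\partial\,\Phi_{\psi}(\bar{\mathbf{X}}_{t}){\bm{\theta}}}{\partial{\bm{\theta}}}\bigg|_{{\bm{\theta}}={\bm{\mu}}}
  = \Phi_{\psi}(\bar{\mathbf{X}}_{t})
\quad\Longrightarrow\quad
\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t-1})\,\bm{\Sigma}_{t-1}\,\mathcal{J}(\bar{\mathbf{X}}_{t},{\bm{\mu}}_{t-1})^{\top}
  = \Phi_{\psi}(\bar{\mathbf{X}}_{t})\,\bm{\Sigma}_{t-1}\,\Phi_{\psi}(\bar{\mathbf{X}}_{t})^{\top}
  = \mathbf{K}^{p_{t}}.
```

Because the model is linear in ${\bm{\theta}}$, the Jacobian does not depend on the point of linearization, so no approximation is introduced at this step. The same argument with $\bm{\Sigma}_{t}$ gives $\mathbf{K}^{q_{t}}$, and substituting an identity covariance gives the product of feature maps in Equation A.11.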

Appendix B Further Empirical Results

Appendix C Experimental Details

Our empirical evaluation centers around six sequences of classification tasks: a synthetic sequence of binary-classification tasks with 2D inputs; split mnist; split Fashion mnist; permuted mnist; split cifar; and sequential Omniglot. With the exception of permuted mnist, each of these task sequences can be tackled by a neural network with either a multi-head setup (MH) or a single-head setup (SH). In a multi-head setup, the neural network has a separate output layer (or head) for each task, and task identifiers are provided at test time in order to select the appropriate head. In a single-head setup, the neural network has just one output layer shared across all tasks, and task identifiers are not provided. In our experiments, we use multi-head setups for split Fashion mnist, split cifar and sequential Omniglot, and single-head setups for the synthetic task sequence along with permuted mnist. For split mnist, we run both setups.

C.1 Illustrative Example

The task sequence shown in Figure 2 was created by Pan et al., (2020). Each of the five tasks in this sequence involves binary classification on 2D inputs, where the number of training examples per task is 3,600. Following Pan et al., (2020), we use a fully connected neural network with an input layer of size 2, two hidden layers of size 20 and an output layer of size 2. When running s-fsvi, we set the prior covariance as $\bm{\Sigma}_{0}=0.1$ and train the neural network for 250 epochs on each task. We use the Adam optimizer with an initial learning rate of $0.0005$ ($\beta_{1}=0.9,\beta_{2}=0.999$) and a batch size of 128. The coreset is constructed by choosing 40 samples from the training data for each task. To evaluate the KL divergence between the posterior and the prior distributions over functions, for each previous task we sample 20 input points from the context set and generate another 30 samples by sampling each input dimension uniformly from the range $[-4,4]$. For example, when we train the model on task $t\in\{1,2,3,\ldots\}$, we use $20(t-1)$ samples chosen from the context set and $30t$ white-noise samples. The noise samples encourage the neural network to preserve high predictive uncertainty in regions far from the training data.
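Read literally, the recipe above yields $20(t-1)$ stored points plus $30t$ noise points when training on task $t$. The sketch below assembles such a set of evaluation points. The function name, argument names, and the representation of each previous task's stored points as a list of 2D points are illustrative assumptions.

```python
import random


def build_context_points(task_number, previous_coresets, rng,
                         per_task_stored=20, per_task_noise=30,
                         low=-4.0, high=4.0, input_dim=2):
    """Combine stored points from earlier tasks with uniform noise points.

    task_number is 1-indexed; previous_coresets holds one list of stored
    points per previous task, so it has task_number - 1 entries.
    """
    stored = []
    for coreset in previous_coresets:
        # Sample without replacement from each previous task's points.
        stored.extend(rng.sample(coreset, per_task_stored))
    noise = [
        [rng.uniform(low, high) for _ in range(input_dim)]
        for _ in range(per_task_noise * task_number)
    ]
    return stored + noise


rng = random.Random(0)
# Two previous tasks with 40 stored 2D points each (toy values).
toy_coresets = [[[float(i), float(i)] for i in range(40)] for _ in range(2)]
points = build_context_points(3, toy_coresets, rng)
```

For the third task this gives 40 stored points and 90 noise points, matching the counts $20(t-1)$ and $30t$ for $t=3$.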

C.2 Task Sequences Based on (Fashion) MNIST

Split mnist consists of five tasks, where each task is binary classification on a pair of mnist classes. Split Fashion mnist has the same form but uses data from Fashion mnist. Permuted mnist comprises ten tasks, where each task involves classifying images into the ten mnist classes after the image pixels have been randomly reordered. Unless specified otherwise, the following setups apply to Figures 3, 7, 7 and 8 and Table 1.

Dataset. In all cases, 60,000 data samples are used for training and 10,000 data samples are used for testing. The input images are converted to floating-point numbers with values in the range $[0,1]$.

Neural-Network Size & Coreset Size. To ensure fair comparison, all methods in Table 1 (unless explicitly indicated otherwise) use the same neural-network size and (where applicable) coreset size. As in prior work (Pan et al.,, 2020; Titsias et al.,, 2020), we use fully connected neural networks, with two hidden layers of size 100 for permuted mnist and two hidden layers of size 256 for split (Fashion) mnist. In all cases, the ReLU activation function is applied to non-output units. For single-head setups, we use 200 coreset points; for multi-head setups, we use 40 points.

Coreset Selection. For s-fsvi with a coreset, when training on the first task, 40 context points are generated by sampling each pixel uniformly from the range $[0,1]$; during training on subsequent tasks, 40 context points are chosen randomly from the context set. For s-fsvi without a coreset, 40 context points are chosen uniformly at random from the training data of the current task (corresponding to the “Random” label in Figure 3).

Prior Distribution. For the first task, s-fsvi uses a prior distribution over functions with fixed mean and diagonal covariance. When using a coreset, the prior distribution is assumed to be Gaussian with zero mean and a diagonal covariance of magnitude 0.001. When not using a coreset, the prior distribution is assumed to be Gaussian with zero mean and a diagonal covariance of magnitude 100. The prior variance is optimized via hyperparameter selection on a validation set.

Optimization. We use the Adam optimizer with an initial learning rate of $0.0005$ ($\beta_{1}=0.9,\beta_{2}=0.999$). The number of epochs on each task is 60 for split mnist (MH), 60 for split Fashion mnist (MH), 10 for permuted mnist (SH) and 80 for split mnist (SH). The batch size is 128.

Prediction. The predictive distribution used for computing the expected log-likelihood is estimated using five Monte Carlo samples.

Hyperparameter Selection. For “s-fsvi (optimized)” in Table 1, we used the optimized hyperparameters chosen on a validation set after exploring the configurations shown in Table 4. For cases where no configuration is significantly better than the rest, the default value given in Section C.2 is used.

Table 4: Hyperparameter selection. Optimal values (in bold) were chosen based on validation-set accuracy. Standard errors were computed across ten random seeds.

Task Sequences	Number of Layers & Units	Magnitude of Prior Variance	Number of Epochs
Split mnist (MH)	{1, 2} * {100, 200, 300, 400}	{0.001, 0.01, 0.1, 1, 10, 100}	{40, 60, 80, 120, 160}
Split Fashion mnist (MH)	{4} * {50, 200, 300, 400}	{0.001, 0.01, 0.1, 1, 10, 100}	{40, 60, 80, 120, 160}
Permuted mnist (SH)	{2} * {100, 200, 400, 500}	{0.001, 0.01, 0.1, 1, 10, 100}	{10, 20, 40, 60, 80}
Split mnist (SH)	{1, 2} * {100, 200, 300, 400}	{0.001, 0.01, 0.1, 1, 10, 100}	{60, 80, 120, 160, 240}
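The set notation in Table 4 can be read as a Cartesian product of candidate values. As a sketch of what enumerating such a search space looks like, the snippet below lists every configuration for the split mnist (MH) row. Whether every combination was actually evaluated is not stated here, so the exhaustive grid, like the dictionary keys and function name, is an assumption made for illustration.

```python
from itertools import product

# Candidate values from the "Split mnist (MH)" row of Table 4.
split_mnist_mh_space = {
    "num_layers": [1, 2],
    "units_per_layer": [100, 200, 300, 400],
    "prior_variance": [0.001, 0.01, 0.1, 1, 10, 100],
    "num_epochs": [40, 60, 80, 120, 160],
}


def enumerate_configs(space):
    """Return one dict per combination of candidate values."""
    names = list(space)
    return [
        dict(zip(names, values))
        for values in product(*(space[name] for name in names))
    ]


configs = enumerate_configs(split_mnist_mh_space)
```

Under this reading, the row spans 2 × 4 × 6 × 5 = 240 configurations, each of which would be scored by validation-set accuracy.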

C.3 Split CIFAR

Split cifar, as described in Pan et al., (2020), consists of six tasks. The first is ten-way classification on the full cifar-10 dataset. Each of the following five is also ten-way classification, with classes drawn from cifar-100. Following Pan et al., (2020), we use a neural network with four convolutional layers followed by two fully connected layers followed by multiple output heads (one for each task). For s-fsvi, we use the following setup: Adam optimizer with learning rate 0.0005, prior with covariance 0.01, random coreset selection, 200 coreset points per task, 50 context points at each task. We also use this setup (and a training duration of 2000 epochs) when training individual neural networks for the “separate” baseline.

C.4 Sequential Omniglot

Sequential Omniglot, as described in Schwarz et al., (2018), comprises 50 classification tasks. Each task is associated with an alphabet, and the number of characters (classes) varies between alphabets. Following Schwarz et al., (2018), we use a neural network with four convolutional layers followed by one fully connected layer. For s-fsvi, we use two coreset points per character, as used by Titsias et al., (2020). The coreset points are sampled from the training set with probability proportional to the entropy of the neural network’s posterior predictive distribution. To limit memory usage, we draw no more than 25 context points from the context set at each gradient step after task 25. We use a learning rate of 0.001 and a prior covariance of 1.0. For the first task, the neural network trains for 200 epochs; for subsequent tasks, it trains for ten epochs per task. We use the same data augmentation and train-test split as Titsias et al., (2020).

C.5 Coreset-Selection Methods

We consider different distributions from which to sample points to be added to the coreset. For each of the scoring methods below, we use the scores to create a probability mass function from which points can be sampled.

Random. Points are sampled uniformly from the training data.

Predictive-Entropy Scoring. Points are scored according to the total predictive uncertainty (i.e., the predictive entropy) of the model. For a model with stochastic parameters $\bm{\Theta}$, pre-likelihood outputs $f(\mathbf{X};{\bm{\theta}})$, and a likelihood function $p(\mathbf{y}\,|\,f(\mathbf{X};{\bm{\theta}}))$, the predictive entropy is given by $\mathcal{H}(\operatorname{\mathbb{E}}[p(\mathbf{y}\,|\,f(\mathbf{X};{\bm{\theta}}))])$ (Cover and Thomas,, 1991; Shannon and Weaver,, 1949), where the expectation is taken with respect to the model parameters and $\mathcal{H}(\cdot)$ is the entropy functional. The predictive entropy is the sum of $\mathcal{I}(\mathbf{y}_{\ast};\,\bm{\Theta})$, the mutual information between the model parameters and its predictions, and the expected entropy of the likelihood under the distribution over parameters.
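As a rough illustration of predictive-entropy scoring for a single candidate point, the sketch below averages class probabilities over parameter samples and takes the entropy of the average. The function name `predictive_entropy`, the Monte Carlo approximation of the expectation, and the categorical likelihood are assumptions made for this example and are not taken from the paper's implementation.

```python
import math


def predictive_entropy(sampled_probs):
    """Entropy of the averaged predictive distribution for one input point.

    sampled_probs: list of S class-probability vectors, one per parameter
    sample (hypothetical input format), each of length K.
    """
    num_samples = len(sampled_probs)
    num_classes = len(sampled_probs[0])
    # Monte Carlo estimate of E_theta[p(y | f(x; theta))].
    mean_probs = [
        sum(probs[k] for probs in sampled_probs) / num_samples
        for k in range(num_classes)
    ]
    # H(p) = -sum_k p_k log p_k, treating 0 log 0 as 0.
    return -sum(p * math.log(p) for p in mean_probs if p > 0.0)


# Two parameter samples that disagree completely give the maximum entropy
# for two classes; a single confident sample gives zero entropy.
disagreeing = predictive_entropy([[1.0, 0.0], [0.0, 1.0]])
confident = predictive_entropy([[1.0, 0.0]])
```

Under this scoring, points on which the sampled predictions are spread across classes receive higher scores than points the model classifies confidently.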

Evidence-Lower-Bound Scoring. Points are scored according to the value of the evidence lower bound (ELBO) given in Equation 11.

Kullback-Leibler-Divergence Scoring. Points are scored according to the value of the approximation to the function-space KL divergence given in Equation 11.

Score-Based Distributions. After scoring with the above methods, points are added to the coreset by sampling from one of the following probability mass functions:

\displaystyle\begin{split}\textrm{Lowest:}\quad\mathbb{P}(i)\doteq\frac{\bar{s}_{i}}{\sum_{j=1}^{N}\bar{s}_{j}}\qquad\textrm{and}\qquad\textrm{Highest:}\quad\mathbb{P}(i)\doteq\frac{s_{i}}{\sum_{j=1}^{N}s_{j}},\end{split}

(C.13)

where $s_{i}$ is the score of the $i$-th point, $\bar{s}_{i}=\max_{j=1}^{N}s_{j}-s_{i}$, and $N$ is the number of candidate points.
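The two score-based distributions can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, scores are assumed to be non-negative and not all equal (so that the normalizers are positive), and sampling without replacement is an assumption, since the text does not specify it.

```python
import random


def highest_pmf(scores):
    """P(i) proportional to s_i: favors high-scoring points."""
    total = sum(scores)
    return [s / total for s in scores]


def lowest_pmf(scores):
    """P(i) proportional to max_j s_j - s_i: favors low-scoring points."""
    top = max(scores)
    flipped = [top - s for s in scores]
    total = sum(flipped)
    return [s / total for s in flipped]


def sample_coreset_indices(pmf, num_points, seed=0):
    """Draw up to num_points distinct indices (assumed: no replacement)."""
    rng = random.Random(seed)
    weights = list(pmf)
    chosen = []
    for _ in range(num_points):
        if sum(weights) <= 0.0:
            break  # no candidate with positive probability remains
        idx = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        chosen.append(idx)
        weights[idx] = 0.0  # exclude the chosen point from later draws
    return chosen


scores = [1.0, 3.0]
pmf_high = highest_pmf(scores)
pmf_low = lowest_pmf(scores)
picked = sample_coreset_indices(pmf_low, num_points=1)
```

Note that under the "Lowest" distribution the highest-scoring candidate has probability zero, so it is never selected.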

C.6 Forward and Backward Transfer

In Table 3, we report forward and backward transfer metrics as defined in Pan et al., (2020). Backward transfer (BT) indicates the performance gain on past tasks when new tasks are learnt, while forward transfer (FT) quantifies how much knowledge from past tasks helps the learning of new tasks. Higher is better for both. For $T$ tasks, let $R_{j,i}$ be the accuracy of the model on task $t_{i}$ after training on task $t_{j}$, and let $R_{i}^{\textrm{ind}}$ be the accuracy of an independent model trained only on task $t_{i}$. Then

\displaystyle\begin{split}\textrm{BT}\doteq\frac{1}{T-1}\sum_{i=1}^{T-1}\left(R_{T,i}-R_{i,i}\right)\qquad\textrm{and}\qquad\textrm{FT}\doteq\frac{1}{T-1}\sum_{i=2}^{T}\left(R_{i,i}-R_{i}^{\textrm{ind}}\right).\end{split}
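For concreteness, the two metrics can be computed from an accuracy matrix as in the sketch below. The function names and the zero-indexed layout, in which `R[j][i]` holds the accuracy on task $i$ after training on task $j$, are conventions assumed for this example, and the accuracy values are made up for illustration.

```python
def backward_transfer(R):
    """Mean change in accuracy on tasks 1..T-1 after training on all T tasks.

    R[j][i]: accuracy on task i after training on task j (zero-indexed).
    """
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


def forward_transfer(R, R_ind):
    """Mean gain on tasks 2..T over independently trained models.

    R_ind[i]: accuracy of a model trained only on task i.
    """
    T = len(R)
    return sum(R[i][i] - R_ind[i] for i in range(1, T)) / (T - 1)


# Illustrative accuracies for T = 3 tasks (not results from the paper).
# Entries above the diagonal are unused by either metric.
R = [
    [0.9, 0.0, 0.0],
    [0.8, 0.7, 0.0],
    [0.7, 0.6, 0.5],
]
R_ind = [0.9, 0.6, 0.6]

bt = backward_transfer(R)        # ((0.7 - 0.9) + (0.6 - 0.7)) / 2
ft = forward_transfer(R, R_ind)  # ((0.7 - 0.6) + (0.5 - 0.6)) / 2
```

A negative BT, as in this toy matrix, corresponds to forgetting: accuracy on earlier tasks drops after later tasks are learned.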

Appendix D Further Related Work

Objective-based approaches to continual learning involve training a neural network using a specially designed objective function. Typically the objective includes a regularization term that penalizes changes in the neural network’s configuration. Whereas in Section 4 we summarize methods that regularize in function space, here we cover methods that regularize directly in terms of the parameters of a neural network. Among these, most relevant to our work are those that approximate Bayesian updating, in which the posterior from the previous task forms the prior for the current task.

A key idea is shared between many methods for parameter-space regularization: for each parameter, apply a penalty on the difference between its current setting and its prior setting, weighted by a measure of the parameter’s importance. Methods vary in how they measure importance. Variational continual learning (vcl; Nguyen et al.,, 2018; Swaroop et al.,, 2019), which extends the concept of online variational inference (Broderick et al.,, 2013; Ghahramani and Attias,, 2000; Honkela and Valpola,, 2003; Sato,, 2001) to deep neural networks, uses the parameter covariance matrix of the model currently serving as the prior. Elastic weight consolidation (ewc; Kirkpatrick et al.,, 2017) and its successors (Chaudhry et al.,, 2018; Lee et al.,, 2017; Liu et al.,, 2018; Schwarz et al.,, 2018) use a Fisher information matrix computed on each task. Online structured Laplace (Ritter et al.,, 2018) and second-order loss approximation (Yin et al., 2020a, ) respectively use Kronecker-factored and low-rank Hessians. Synaptic intelligence (si; Zenke et al.,, 2017) uses a cumulative sum of the gradient of the training objective with respect to the parameters. Memory-aware synapses (mas; Aljundi et al.,, 2018) use the gradient of the model output with respect to the parameters.
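The shared idea can be sketched as an importance-weighted quadratic penalty. The code below is a generic per-parameter (diagonal) form written for illustration only. It is not the exact objective of any of the cited methods, and the function name, the `strength` coefficient, and the factor of one half are assumptions.

```python
def importance_weighted_penalty(params, prior_params, importance, strength=1.0):
    """Penalize deviation from the prior setting, parameter by parameter.

    params:       current parameter values
    prior_params: parameter values after the previous task
    importance:   non-negative per-parameter weights (hypothetical input;
                  methods differ in how these weights are computed)
    """
    return 0.5 * strength * sum(
        w * (p - q) ** 2
        for p, q, w in zip(params, prior_params, importance)
    )


# Only the first parameter has moved, and it carries a high importance weight.
penalty = importance_weighted_penalty(
    params=[1.0, 2.0],
    prior_params=[0.0, 2.0],
    importance=[4.0, 1.0],
)
```

In training, a term of this kind would be added to the loss on the current task, so that parameters with large importance weights stay close to their previous values while others remain free to change.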

Other related work on parameter-space regularization includes various modifications to vcl (Ahn et al.,, 2019; Kessler et al.,, 2019), uncertainty-guided continual learning in Bayesian neural networks (Ebrahimi et al.,, 2020), and a variation of si known as asymmetric loss approximation with single-side overestimation (Park et al.,, 2019). There have also been efforts to conceptually unify some of the approaches outlined above: Loo et al., (2020) draw a link between vcl and online ewc; Chaudhry et al., (2018) combine ewc and si in a single method; Yin et al., 2020b generalize ewc, online structured Laplace, si, and mas.
