License: arXiv.org perpetual non-exclusive license
arXiv:2312.17210v1 [stat.ML] 28 Dec 2023

Continual Learning via Sequential Function-Space Variational Inference

Tim G. J. Rudner    Freddie Bickford Smith    Qixuan Feng    Yee Whye Teh    Yarin Gal
Abstract

Sequential Bayesian inference over predictive functions is a natural framework for continual learning from streams of data. However, applying it to neural networks has proved challenging in practice. Addressing the drawbacks of existing techniques, we propose an optimization objective derived by formulating continual learning as sequential function-space variational inference. In contrast to existing methods that regularize neural network parameters directly, this objective allows parameters to vary widely during training, enabling better adaptation to new tasks. Compared to objectives that directly regularize neural network predictions, the proposed objective allows for more flexible variational distributions and more effective regularization. We demonstrate that, across a range of task sequences, neural networks trained via sequential function-space variational inference achieve better predictive accuracy than networks trained with related methods while depending less on maintaining a set of representative points from previous tasks.


1 Introduction

Continual learning promises to enable applications of machine learning to settings with resource constraints, privacy concerns, or non-stationary data distributions. However, continual learning in deep neural networks remains a challenge. While progress has been made to mitigate “forgetting” of previously learned abilities, existing objective-based approaches to continual learning still fall short.

A popular family of objectives penalizes changes in parameters from one task to another (Ahn et al., 2019; Aljundi et al., 2018; Chaudhry et al., 2018; Kirkpatrick et al., 2017; Lee et al., 2017; Liu et al., 2018; Loo et al., 2020; Nguyen et al., 2018; Park et al., 2019; Ritter et al., 2018; Schwarz et al., 2018; Swaroop et al., 2019; Yin et al., 2020a; Yin et al., 2020b; Zenke et al., 2017). However, explicitly regularizing parameters in this way may be ineffective, since parameters are only a proxy for a neural network’s predictive function. For example, predictive functions defined by overparameterized neural networks may be obtained with several different parameter configurations, and small changes in a network’s parameters may cause large changes in its predictions.
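The first of these two points can be illustrated with a minimal sketch. The tiny tanh network below is a hypothetical stand-in, not a model used in this paper: swapping its two hidden units leaves every prediction unchanged while moving the parameters a long way in Euclidean distance, so a penalty on parameter change would treat two identical predictive functions as very different.

```python
import math

def mlp(x, w_in, w_out):
    """One-hidden-layer tanh network with scalar input and output (no biases)."""
    return sum(v * math.tanh(w * x) for w, v in zip(w_in, w_out))

# Two parameter settings that differ only by a swap of the two hidden units.
w_in_a, w_out_a = [2.0, -1.0], [0.5, 3.0]
w_in_b, w_out_b = [-1.0, 2.0], [3.0, 0.5]

# Largest difference between the two networks' predictions on a few inputs.
inputs = [-2.0, -0.5, 0.0, 0.7, 1.5]
max_output_gap = max(
    abs(mlp(x, w_in_a, w_out_a) - mlp(x, w_in_b, w_out_b)) for x in inputs
)

# Euclidean distance between the two flattened parameter vectors.
theta_a = w_in_a + w_out_a
theta_b = w_in_b + w_out_b
param_distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_a, theta_b)))

print("max output gap:", max_output_gap)
print("parameter distance:", param_distance)
```

The sketch covers only the many-to-one relationship between parameters and functions; it does not demonstrate the second point, that small parameter changes can cause large changes in predictions.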

An alternative approach that addresses this shortcoming is to regularize the predictive function directly (Benjamin et al., 2019; Bui et al., 2017; Buzzega et al., 2020; Jung et al., 2018; Kapoor et al., 2021; Kim et al., 2018; Li and Hoiem, 2018; Moreno-Muñoz et al., 2019; Pan et al., 2020; Titsias et al., 2020). Existing function-space regularization methods represent the state of the art among objective-based approaches to continual learning (Kapoor et al., 2021; Pan et al., 2020; Titsias et al., 2020). Yet, as we demonstrate, these methods still leave room for improvement. For example, “functional regularization of the memorable past” (fromp; Pan et al., 2020) uses a Laplace approximation and as such does not directly optimize variance parameters, while “functional regularization for continual learning” (frcl; Titsias et al., 2020) is constrained to linear models.

To address these limitations, we frame continual learning as sequential function-space variational inference (s-fsvi) and adapt the variational objective proposed by Rudner et al. (2021) to the continual-learning setting. The resulting variational optimization objective has three key advantages over existing alternatives. First, it is expressed purely in terms of distributions over predictive functions, which allows greater flexibility than parameter-space regularization methods (Figure 1). Second, unlike fromp, it allows direct optimization of variational variance parameters. Third, unlike frcl, it can be applied to fully stochastic neural networks, not just to Bayesian linear models.

We demonstrate that s-fsvi outperforms existing objective-based continual learning methods—in some cases by a significant margin—on a wide range of task sequences, including single-head split mnist, multi-head split cifar, and multi-head sequential Omniglot. We further present empirical results that showcase the usefulness of learned variational variance parameters and demonstrate that s-fsvi is less reliant on careful selection of datapoints that summarize past tasks than other methods.

Figure 1: Schematic of how sequential function-space variational inference (s-fsvi) allows a Bayesian neural network to learn new tasks while maintaining previously learned abilities. (Top: predictive distributions.) On task 1, the model fits dataset $\mathcal{D}_1$ by updating an initial distribution over parameters $q_0(\boldsymbol{\theta})$ to a variational posterior $q_1(\boldsymbol{\theta})$, which in turn induces a distribution over functions $q_1(f)$. On task 2, the variational objective encourages the posterior distribution over functions to match $q_1(f)$ on a small set of data points from task 1 while also fitting dataset $\mathcal{D}_2$. The mean and two standard deviations of the distributions over functions learned on task 1 and task 2 are shown in grey and blue, respectively. (Bottom: learning trajectories.) On task 1, the distribution over functions changes by a large amount for inputs $\mathbf{X}_1$ (left) but by a small amount for inputs $\mathbf{X}_2$ (right). On task 2, the reverse is true. On both tasks, the change in the distribution over parameters (center) is decoupled from the changes in the distribution over functions (left, right).

2 Background

2.1 Continual Learning as Bayesian Inference

Consider a sequence of tasks indexed by $t \in \{1, \ldots, T\}$. Each task involves making predictions on a supervised dataset $\mathcal{D}_t = (\mathbf{X}_t, \mathbf{y}_t)$. Continual learning is the problem of inferring a distribution over predictive functions that fits the whole collection of datasets $\{\mathcal{D}_1, \ldots, \mathcal{D}_T\}$ as well as possible given access to only a single full dataset at a time.

Sequential Bayesian inference over predictive functions $f$ provides a natural framework for this. Assuming we have a prior $p(f)$, the posterior distribution over $f$ at task 1 is

$$p(f \,|\, \mathcal{D}_1) = p(\mathcal{D}_1 \,|\, f)\, p(f) / p(\mathcal{D}_1). \tag{1}$$

For subsequent tasks $t$, the posterior can be expressed as

$$p(f \,|\, \mathcal{D}_1, \ldots, \mathcal{D}_t) \propto p(\mathcal{D}_t \,|\, f)\, p(f \,|\, \mathcal{D}_1, \ldots, \mathcal{D}_{t-1}), \tag{2}$$

where the posterior after task $t-1$ is treated as the prior for task $t$. Given the intractability of computing this posterior exactly, we need to use approximate inference.
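As a sanity check on the recursion in Equation 2, consider a toy case where exact inference is tractable. The sketch below assumes a constant predictive function $f(x) = m$ with a Gaussian prior on $m$ and Gaussian observation noise, a conjugate model chosen purely for illustration; the function name and the data values are hypothetical. Updating on two datasets in sequence, with the first posterior reused as the second prior, gives the same result as conditioning on both datasets at once.

```python
def gaussian_update(prior_mean, prior_var, observations, noise_var):
    """Exact posterior over an unknown mean m.

    Prior: m ~ N(prior_mean, prior_var).
    Likelihood: each observation y ~ N(m, noise_var), independently.
    Returns the posterior mean and variance of m.
    """
    precision = 1.0 / prior_var + len(observations) / noise_var
    mean = (prior_mean / prior_var + sum(observations) / noise_var) / precision
    return mean, 1.0 / precision

data_task1 = [0.9, 1.1, 1.3]  # illustrative observations for task 1
data_task2 = [0.7, 1.0]       # illustrative observations for task 2
noise_var = 0.5

# Sequential inference: the posterior after task 1 is the prior for task 2.
mean_1, var_1 = gaussian_update(0.0, 1.0, data_task1, noise_var)
mean_seq, var_seq = gaussian_update(mean_1, var_1, data_task2, noise_var)

# Batch inference: condition on both datasets at once.
mean_batch, var_batch = gaussian_update(
    0.0, 1.0, data_task1 + data_task2, noise_var
)

print("sequential:", mean_seq, var_seq)
print("batch:     ", mean_batch, var_batch)
```

For neural networks no such closed form is available, which is why the recursion has to be approximated.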

2.2 Function-Space Variational Inference

Given a dataset $\mathcal{D} = (\mathbf{X}, \mathbf{y})$, a prior $p(f)$ and a variational family $\mathcal{Q}_f$, function-space variational inference (Burt et al., 2021; Matthews et al., 2016; Rudner et al., 2021; Sun et al., 2019) consists of finding the variational distribution $q(f) \in \mathcal{Q}_f$ that maximizes

$$\mathbb{E}_{q(f)}[\log p(\mathbf{y} \,|\, f(\mathbf{X}))] - \mathbb{D}_{\text{KL}}(q(f) \,\|\, p(f)). \tag{3}$$

This variational optimization problem presents a trade-off between fitting the data and matching a prior over functions. To address the fact that the KL divergence between distributions over functions is not in general tractable, prior works have developed estimation procedures that allow turning Equation 3 into an objective function that can be used in practice (Rudner et al., 2021; Sun et al., 2019).

3 Continual Learning via Sequential Function-Space Variational Inference

The ideas presented in Section 2 provide a starting point for our method. To approximate the posterior in Equation 2 at task $t$, we would like to find a variational distribution $q_t(f) \in \mathcal{Q}_f$ that minimizes

$$\mathbb{D}_{\text{KL}}(q_t(f) \,\|\, p_t(f \,|\, \mathcal{D}_1, \ldots, \mathcal{D}_t)), \tag{4}$$

which can equivalently be expressed as maximizing

$$\mathbb{E}_{q_t(f)}[\log p(\mathbf{y}_t \,|\, f(\mathbf{X}_t))] - \mathbb{D}_{\text{KL}}(q_t(f) \,\|\, p_t(f \,|\, \mathcal{D}_1, \ldots, \mathcal{D}_{t-1})).$$
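The equivalence between minimizing Equation 4 and maximizing the expression above follows from substituting the recursion in Equation 2 into the KL divergence. A brief sketch of that step, treating the distributions over functions informally as densities, identifying $p(\mathcal{D}_t \,|\, f)$ with $p(\mathbf{y}_t \,|\, f(\mathbf{X}_t))$, and writing $Z_t$ (a symbol introduced here for illustration) for the normalizing constant of Equation 2:

```latex
\begin{aligned}
\mathbb{D}_{\text{KL}}\big(q_t(f) \,\|\, p_t(f \,|\, \mathcal{D}_1, \ldots, \mathcal{D}_t)\big)
&= \mathbb{E}_{q_t(f)}\!\left[
     \log \frac{q_t(f)\, Z_t}
               {p(\mathbf{y}_t \,|\, f(\mathbf{X}_t))\,
                p_t(f \,|\, \mathcal{D}_1, \ldots, \mathcal{D}_{t-1})}
   \right] \\
&= -\,\mathbb{E}_{q_t(f)}\big[\log p(\mathbf{y}_t \,|\, f(\mathbf{X}_t))\big]
   + \mathbb{D}_{\text{KL}}\big(q_t(f) \,\|\, p_t(f \,|\, \mathcal{D}_1, \ldots, \mathcal{D}_{t-1})\big)
   + \log Z_t .
\end{aligned}
```

Since $\log Z_t$ does not depend on $q_t$, minimizing the left-hand side is the same as maximizing the expected log-likelihood minus the KL divergence to the previous posterior.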

Since we do not have access to $p_t(f \,|\, \mathcal{D}_1, \ldots, \mathcal{D}_{t-1})$, we simplify the inference problem to maximizing the variational objective

$$\mathbb{E}_{q_t(f)}[\log p(\mathbf{y}_t \,|\, f(\mathbf{X}_t))] - \mathbb{D}_{\text{KL}}(q_t(f) \,\|\, p_t(f)), \tag{5}$$

where for $t = 1$ we assume some prior $p_1(f)$ and for $t > 1$ the prior is given by the variational posterior distribution over functions inferred on the previous task. That is,

$$p_t(f) \doteq q_{t-1}(f).$$

While this objective is in general intractable for distributions over functions induced by neural networks with stochastic parameters, Rudner et al. (2021) proposed an approximation that makes this objective amenable to gradient-based optimization and scalable to large neural networks. To perform sequential function-space variational inference, we adapt the estimation procedure proposed by Rudner et al. (2021) to the continual-learning setting:

Proposition 1 (Sequential Function-Space Variational Inference (s-fsvi); adapted from Rudner et al. (2021)).

Let $D_t$ be the number of model output dimensions for $t$ tasks, let $f : \mathcal{X} \times \mathbb{R}^{P} \rightarrow \mathbb{R}^{D_t}$ be a mapping defined by a neural network architecture, let $\boldsymbol{\Theta} \in \mathbb{R}^{P}$ be a multivariate random vector of network parameters, and let $q_t(\boldsymbol{\theta}) \doteq \mathcal{N}(\boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t)$ and $q_{t-1}(\boldsymbol{\theta}) \doteq \mathcal{N}(\boldsymbol{\mu}_{t-1}, \boldsymbol{\Sigma}_{t-1})$ be variational distributions over $\boldsymbol{\Theta}$. Additionally, let $\mathbf{X}_{\mathcal{C}}$ denote a set of context points, and let $\bar{\mathbf{X}}_t \subseteq \{\mathbf{X}_t \cup \mathbf{X}_{\mathcal{C}}\}$. Under a diagonal approximation of the prior and variational posterior covariance functions across output dimensions, the objective in Equation 5 can be approximated by

$$
\begin{aligned}
&\mathcal{F}(q_t, q_{t-1}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_t, \mathbf{y}_t) \\
&\quad\doteq \mathbb{E}_{q_t(\boldsymbol{\theta})}[\log p(\mathbf{y}_t \,|\, f(\mathbf{X}_t; \boldsymbol{\theta}))] \\
&\qquad - \sum_{k=1}^{D_t} \frac{1}{2} \bigg( \log \frac{|[\mathbf{K}^{p_t}]_k|}{|[\mathbf{K}^{q_t}]_k|} - \frac{|\bar{\mathbf{X}}_t|}{D_t} + \operatorname{Tr}\big([\mathbf{K}^{p_t}]_k^{-1} [\mathbf{K}^{q_t}]_k\big) \\
&\qquad\qquad + \Delta(\bar{\mathbf{X}}_t; \boldsymbol{\mu}_t, \boldsymbol{\mu}_{t-1})^{\top} [\mathbf{K}^{p_t}]_k^{-1} \Delta(\bar{\mathbf{X}}_t; \boldsymbol{\mu}_t, \boldsymbol{\mu}_{t-1}) \bigg),
\end{aligned}
\tag{6}
$$

where

$$\Delta(\bar{\mathbf{X}}_t; \boldsymbol{\mu}_t, \boldsymbol{\mu}_{t-1}) \doteq [f(\bar{\mathbf{X}}_t; \boldsymbol{\mu}_t)]_k - [f(\bar{\mathbf{X}}_t; \boldsymbol{\mu}_{t-1})]_k \tag{7}$$

and

$$\mathbf{K}^{p_t} \doteq \mathcal{J}(\bar{\mathbf{X}}_t, \boldsymbol{\mu}_{t-1})\, \boldsymbol{\Sigma}_{t-1}\, \mathcal{J}(\bar{\mathbf{X}}_t, \boldsymbol{\mu}_{t-1})^{\top} \tag{8}$$

$$\mathbf{K}^{q_t} \doteq \mathcal{J}(\bar{\mathbf{X}}_t, \boldsymbol{\mu}_t)\, \boldsymbol{\Sigma}_t\, \mathcal{J}(\bar{\mathbf{X}}_t, \boldsymbol{\mu}_t)^{\top}, \tag{9}$$

are covariance matrix estimates constructed from Jacobians $\mathcal{J}(\cdot, \mathbf{m}) \doteq \frac{\partial f(\cdot\,; \boldsymbol{\Theta})}{\partial \boldsymbol{\Theta}}\big|_{\boldsymbol{\Theta} = \mathbf{m}}$ with $\mathbf{m} = \{\boldsymbol{\mu}_t, \boldsymbol{\mu}_{t-1}\}$.

Proof.

See Appendix A. ∎
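The regularization term in Equation 6 can be sketched for a deliberately small case. The code below assumes a single output dimension ($D_t = 1$), exactly two context points, a mean-field (diagonal) covariance over parameters, and a three-parameter tanh model with a hand-written Jacobian. All function names and numerical values are illustrative and are not taken from the authors' implementation. It builds the linearized covariances of Equations 8 and 9 and evaluates the Gaussian KL divergence between the induced function distributions at the context points, which is the quantity subtracted from the expected log-likelihood.

```python
import math

def model(x, theta):
    """Tiny stand-in network: f(x; theta) = theta[2] * tanh(theta[1] * x + theta[0])."""
    return theta[2] * math.tanh(theta[1] * x + theta[0])

def jacobian_row(x, theta):
    """Analytic gradient of model(x, theta) with respect to theta."""
    t = math.tanh(theta[1] * x + theta[0])
    slope = theta[2] * (1.0 - t * t)
    return [slope, slope * x, t]

def linearized_cov(xs, mean, variances):
    """K = J diag(variances) J^T at the points xs (cf. Equations 8 and 9)."""
    rows = [jacobian_row(x, mean) for x in xs]
    return [
        [sum(a * s * b for a, s, b in zip(ri, variances, rj)) for rj in rows]
        for ri in rows
    ]

def function_space_kl(xs, mean_q, var_q, mean_p, var_p):
    """KL( N(f(xs; mean_q), K_q) || N(f(xs; mean_p), K_p) ) for exactly two points."""
    (a, b), (c, d) = linearized_cov(xs, mean_p, var_p)
    kq = linearized_cov(xs, mean_q, var_q)
    det_p = a * d - b * c
    det_q = kq[0][0] * kq[1][1] - kq[0][1] * kq[1][0]
    # Closed-form inverse of the 2x2 prior covariance.
    inv_p = [[d / det_p, -b / det_p], [-c / det_p, a / det_p]]
    trace = sum(inv_p[i][j] * kq[j][i] for i in range(2) for j in range(2))
    # Difference of mean predictions, as in Equation 7.
    delta = [model(x, mean_q) - model(x, mean_p) for x in xs]
    quad = sum(delta[i] * inv_p[i][j] * delta[j] for i in range(2) for j in range(2))
    return 0.5 * (math.log(det_p / det_q) - len(xs) + trace + quad)

context = [-1.0, 1.0]                                  # two context points
mu_prev, var_prev = [0.1, 0.5, 1.2], [0.1, 0.2, 0.3]   # plays the role of q_{t-1}
mu_new, var_new = [0.3, 0.4, 1.0], [0.1, 0.2, 0.3]     # plays the role of q_t

kl_same = function_space_kl(context, mu_prev, var_prev, mu_prev, var_prev)
kl_moved = function_space_kl(context, mu_new, var_new, mu_prev, var_prev)
print("KL when q_t equals q_{t-1}:", kl_same)
print("KL after moving the mean:  ", kl_moved)
```

A practical implementation would presumably obtain the Jacobians by automatic differentiation and replace the closed-form 2×2 inverse with a Cholesky-based solve, so that any number of context points and output dimensions can be handled.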

“Functional regularization for continual learning” (frcl; Titsias et al., 2020) and “functional regularization of the memorable past” (fromp; Pan et al., 2020) use objectives conceptually similar to the objective in Equation 5 and mathematically similar to the objective in Equation 6. To highlight how the s-fsvi objective above differs from fromp and frcl, we make the relationship between these two methods and s-fsvi precise in the following two propositions.

Proposition 2 (Relationship between fromp and s-fsvi).

With the s-fsvi objective $\mathcal{F}$ defined as in Equation 6, let $\bar{\mathbf{X}}_t = \mathbf{X}_{\mathcal{C}}$. Then, up to a multiplicative constant, the fromp objective corresponds to the s-fsvi objective with the prior covariance given by a Laplace approximation about $\boldsymbol{\mu}_{t-1}$ and the variational distribution given by a Dirac delta distribution $q_t^{\text{fromp}}(\boldsymbol{\theta}) \doteq \delta(\boldsymbol{\theta} - \boldsymbol{\mu}_t)$. Denoting the prior covariance under a Laplace approximation about $\boldsymbol{\mu}_{t-1}$ by $\hat{\boldsymbol{\Sigma}}_0(\boldsymbol{\mu}_{t-1})$ so that $q_{t-1}^{\text{fromp}}(\boldsymbol{\theta}) \doteq \mathcal{N}(\boldsymbol{\mu}_{t-1}, \hat{\boldsymbol{\Sigma}}_0(\boldsymbol{\mu}_{t-1}))$, the fromp objective can be expressed as

$$
\begin{aligned}
&\mathcal{L}^{\text{fromp}}(q_t^{\text{fromp}}, q_{t-1}^{\text{fromp}}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_t, \mathbf{y}_t) \\
&\quad= \mathcal{F}(q_t^{\text{fromp}}, q_{t-1}^{\text{fromp}}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_t, \mathbf{y}_t) - \mathcal{V},
\end{aligned}
$$

where

$$\mathcal{V} \doteq -\frac{1}{2} \sum_{k} \left( \log \frac{[\bar{\mathbf{K}}^{\hat{p}_t}]_k}{[\bar{\mathbf{K}}^{q_t}]_k} + \frac{[\bar{\mathbf{K}}^{q_t}]_k}{[\bar{\mathbf{K}}^{\hat{p}_t}]_k} - 1 \right),$$

with $\bar{\mathbf{K}}$ denoting a covariance matrix under a block-diagonalization without inter-task dependence, and

$$\bar{\mathbf{K}}^{\hat{p}_t} \doteq \text{block-diag}\left( \mathcal{J}(\bar{\mathbf{X}}_t, \boldsymbol{\mu}_{t-1})\, \hat{\boldsymbol{\Sigma}}_0(\boldsymbol{\mu}_{t-1})\, \mathcal{J}(\bar{\mathbf{X}}_t, \boldsymbol{\mu}_{t-1})^{\top} \right).$$

Proof.

See Appendix A. ∎

Proposition 2 shows that the fromp objective nearly corresponds to the s-fsvi objective but is missing the term in the s-fsvi objective (denoted by $\mathcal{V}$ above) that encourages learning variational variance parameters that accurately reflect the variance of the prior. This insight reflects a shortcoming of the fromp objective. Unlike the s-fsvi objective, which allows optimization over $\boldsymbol{\Sigma}$, the fromp objective is restricted to covariance estimates given by the Laplace approximation.
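The role of the variance term can be seen in a scalar analogue, sketched below under the assumption that each covariance block is a single positive number (the function names and numerical values are illustrative). The KL divergence between two univariate Gaussians splits into a mean-matching part and a variance-dependent part. The variance-dependent part plays the role of $-\mathcal{V}$ for one output dimension, so subtracting $\mathcal{V}$ from $\mathcal{F}$ cancels it and leaves only the mean-matching penalty.

```python
import math

def mean_term(delta, k_prior):
    """Mean-matching part of the scalar Gaussian KL: delta^2 / (2 * k_prior)."""
    return 0.5 * delta * delta / k_prior

def variance_term(k_post, k_prior):
    """Variance-dependent part of the scalar Gaussian KL."""
    return 0.5 * (math.log(k_prior / k_post) + k_post / k_prior - 1.0)

def gaussian_kl(m_post, k_post, m_prior, k_prior):
    """Closed-form KL( N(m_post, k_post) || N(m_prior, k_prior) ) for scalars."""
    return (
        math.log(math.sqrt(k_prior / k_post))
        + (k_post + (m_post - m_prior) ** 2) / (2.0 * k_prior)
        - 0.5
    )

# Illustrative values: posterior N(0.3, 0.25) against prior N(0.0, 1.0).
full = gaussian_kl(0.3, 0.25, 0.0, 1.0)
mean_part = mean_term(0.3 - 0.0, 1.0)
var_part = variance_term(0.25, 1.0)

print("full KL:       ", full)
print("mean part:     ", mean_part)
print("variance part: ", var_part)
print("matched variances:", variance_term(1.0, 1.0))
```

The variance part vanishes only when the two variances agree and is positive otherwise, which is the pressure on variance parameters that is absent once this term is removed.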

The frcl objective can be related to the s-fsvi objective in a similar way:

Proposition 3 (Relationship between frcl and s-fsvi).

With the s-fsvi objective \mathcal{F}caligraphic_F defined as in Equation 6, let 𝐗¯t=𝐗𝒞subscriptnormal-¯𝐗𝑡subscript𝐗𝒞\bar{\mathbf{X}}_{t}=\mathbf{X}_{\mathcal{C}}over¯ start_ARG bold_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_X start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT, and let fLM(;𝚯)Φψ()𝚯approaches-limitsuperscript𝑓LMnormal-⋅𝚯subscriptnormal-Φ𝜓normal-⋅𝚯f^{\text{\emph{LM}}}(\cdot\,;\bm{\Theta})\doteq\Phi_{\psi}(\cdot)\bm{\Theta}italic_f start_POSTSUPERSCRIPT LM end_POSTSUPERSCRIPT ( ⋅ ; bold_Θ ) ≐ roman_Φ start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( ⋅ ) bold_Θ be a Bayesian linear model, where Φψ()subscriptnormal-Φ𝜓normal-⋅\Phi_{\psi}(\cdot)roman_Φ start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( ⋅ ) is a deterministic feature map parameterized by ψ𝜓\psiitalic_ψ. Then the frcl objective corresponds to the s-fsvi objective for the model fLM(;𝚯)superscript𝑓LMnormal-⋅𝚯f^{\text{\emph{LM}}}(\cdot\,;\bm{\Theta})italic_f start_POSTSUPERSCRIPT LM end_POSTSUPERSCRIPT ( ⋅ ; bold_Θ ) plus an additional weight-space KL divergence penalty. That is, for pt(𝛉)𝒩(𝟎,𝐈p_{t}({\bm{\theta}})\doteq\mathcal{N}(\mathbf{0},\mathbf{I}italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_θ ) ≐ caligraphic_N ( bold_0 , bold_I, and qt(𝛉)𝒩(𝛍t,𝚺t)approaches-limitsubscript𝑞𝑡𝛉𝒩subscript𝛍𝑡subscript𝚺𝑡q_{t}({\bm{\theta}})\doteq\mathcal{N}({\bm{\mu}}_{t},\bm{\Sigma}_{t})italic_q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_θ ) ≐ caligraphic_N ( bold_italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_Σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ),

$$
\begin{aligned}
&\mathcal{L}^{\text{frcl}}(q_{t}^{\textsc{frcl}}, q_{t-1}^{\textsc{frcl}}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_{t}, \mathbf{y}_{t}) \\
&\quad = \mathcal{F}(q_{t}^{\textsc{frcl}}, q_{t-1}^{\textsc{frcl}}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_{t}, \mathbf{y}_{t}) + \mathbb{D}_{\text{KL}}(q_{t}(\bm{\theta}) \,\|\, p_{t}(\bm{\theta})).
\end{aligned}
\tag{10}
$$
Proof.

See Appendix A. ∎

Proposition 3 highlights that the frcl objective is restricted to Bayesian linear models and does not regularize the deterministic parameters in the feature map as effectively as if they were variational parameters.
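The restriction to linear models can be made explicit with a short worked equation. As a sketch for a single output dimension (an assumption made here for brevity), a Gaussian distribution over the weights of a linear model induces an exact Gaussian distribution over the function values at any finite set of inputs $\mathbf{X}$:

$$
\bm{\theta} \sim \mathcal{N}(\bm{\mu}_{t}, \bm{\Sigma}_{t})
\quad\Longrightarrow\quad
\Phi_{\psi}(\mathbf{X})\,\bm{\theta} \sim \mathcal{N}\big(\Phi_{\psi}(\mathbf{X})\,\bm{\mu}_{t},\; \Phi_{\psi}(\mathbf{X})\,\bm{\Sigma}_{t}\,\Phi_{\psi}(\mathbf{X})^{\top}\big).
$$

Because the Jacobian of a linear model with respect to its weights is the feature matrix $\Phi_{\psi}(\mathbf{X})$ itself, this covariance has the same $\mathcal{J}\bm{\Sigma}\mathcal{J}^{\top}$ form as the Jacobian-based covariance estimates in Equations 12 and 13. The feature-map parameters $\psi$ receive no distribution in this construction, which is the limitation noted above.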

[Figure 2 panels: (a) Task 1; (b) Task 2; (c) Task 3; (d) Task 4; (e) Task 5; (f) no panel title recovered.]
Figure 2: A practical demonstration of sequential function-space variational inference (s-fsvi) on a sequence of five binary-classification tasks with 2D inputs. The neural network infers a decision boundary between the two classes while maintaining high predictive uncertainty away from the data. The experimental setup is described in detail in Appendix C.

3.1 Simplified Sequential Function-Space VI

For ease of computation and to ensure scalability to large neural networks, we consider mean-field distributions $q^{\text{MF}}_{t}(\bm{\theta})$ for all tasks, diagonalize the covariance matrix estimates $\mathbf{K}^{p_{t}}$ and $\mathbf{K}^{q_{t}}$ across input points in $\bar{\mathbf{X}}_{t}$, and let $(\mathbf{X}_{\mathcal{B}}, \mathbf{y}_{\mathcal{B}}) \subset \mathcal{D}_{t}$ be a mini-batch from the current dataset. This way, we obtain the simplified variational objective

$$
\begin{aligned}
&\tilde{\mathcal{F}}(q^{\text{MF}}_{t}, q^{\text{MF}}_{t-1}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_{\mathcal{B}}, \mathbf{y}_{\mathcal{B}}) \\
&= \frac{1}{S}\sum_{i=1}^{S} \log p(\mathbf{y}_{\mathcal{B}} \,|\, f(\mathbf{X}_{\mathcal{B}}; h(\bm{\mu}_{t}, \bm{\Sigma}_{t}, \bm{\epsilon}^{(i)}))) \\
&\quad - \sum_{j=1}^{|\bar{\mathbf{X}}|} \sum_{k=1}^{D_{t}} \frac{1}{2} \bigg( \log \frac{[\mathbf{K}^{p_{t}}]_{j,k}}{[\mathbf{K}^{q_{t}}]_{j,k}} + \frac{[\mathbf{K}^{q_{t}}]_{j,k}}{[\mathbf{K}^{p_{t}}]_{j,k}} - 1 \\
&\qquad\qquad + \frac{\left([f(\bar{\mathbf{X}}_{t}; \bm{\mu}_{t})]_{j,k} - [f(\bar{\mathbf{X}}_{t}; \bm{\mu}_{t-1})]_{j,k}\right)^{2}}{[\mathbf{K}^{p_{t}}]_{j,k}} \bigg),
\end{aligned}
\tag{11}
$$

where $h(\bm{\mu}_{t}, \bm{\Sigma}_{t}, \bm{\epsilon}^{(i)}) \doteq \bm{\mu}_{t} + \bm{\Sigma}_{t} \odot \bm{\epsilon}^{(i)}$ is a reparameterization of $\bm{\Theta} \in \mathbb{R}^{P}$ with $\bm{\epsilon}^{(i)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_{P})$, $S$ is the number of Monte Carlo samples, $D_{t}$ is as defined before, and

$$
\begin{aligned}
\mathbf{K}^{p_{t}} &\doteq \text{diag}\left(\mathcal{J}(\bar{\mathbf{X}}_{t}, \bm{\mu}_{t-1})\, \bm{\Sigma}_{t-1}\, \mathcal{J}(\bar{\mathbf{X}}_{t}, \bm{\mu}_{t-1})^{\top}\right) && (12) \\
\mathbf{K}^{q_{t}} &\doteq \text{diag}\left(\mathcal{J}(\bar{\mathbf{X}}_{t}, \bm{\mu}_{t})\, \bm{\Sigma}_{t}\, \mathcal{J}(\bar{\mathbf{X}}_{t}, \bm{\mu}_{t})^{\top}\right). && (13)
\end{aligned}
$$
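To make Equations 12 and 13 concrete, the following is a minimal NumPy sketch, assuming the mean-field covariance is stored as a vector of per-parameter variances. The function name `diag_function_variance` and the array layout are illustrative assumptions, not the authors' implementation. With a diagonal covariance, each diagonal entry of the Jacobian product reduces to a weighted sum of squared Jacobian entries, so the full matrix is never formed.

```python
import numpy as np

def diag_function_variance(jacobian, param_var):
    """Diagonal of J diag(param_var) J^T, computed entry by entry.

    jacobian:  shape (N, D, P), derivative of each of the D outputs at each
               of the N context points with respect to each of P parameters.
    param_var: shape (P,), mean-field variances (assumed storage layout).
    Returns an array of shape (N, D) holding the entries [K]_{j,k}.
    """
    return np.einsum("jkp,p->jk", jacobian ** 2, param_var)

# Toy check against the explicit (and much larger) matrix product.
rng = np.random.default_rng(0)
N, D, P = 4, 3, 5
J = rng.normal(size=(N, D, P))
var = rng.uniform(0.1, 1.0, size=P)

K_fast = diag_function_variance(J, var)

J_flat = J.reshape(N * D, P)
K_full = J_flat @ np.diag(var) @ J_flat.T   # shape (N*D, N*D)
K_slow = np.diag(K_full).reshape(N, D)
```

The shortcut costs on the order of N·D·P operations, whereas the explicit product builds an (N·D)×(N·D) matrix only to discard its off-diagonal entries.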

This simplified objective does not require matrix inversion, and the time and space complexity for gradient estimation and prediction scale linearly in the number of context points $\bar{\mathbf{X}}_{t}$ and network parameters. The context set $\mathbf{X}_{\mathcal{C}}$ can be constructed from coresets containing representative points from previous tasks.
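The regularization term in Equation 11 is a sum of KL divergences between univariate Gaussians, one per context point and output dimension. The sketch below evaluates that sum and the resulting objective in NumPy. The function names and the convention of passing precomputed means, variances, and per-sample log-likelihoods are assumptions made for illustration; in Equation 11 the means are the network outputs at $\bm{\mu}_{t}$ and $\bm{\mu}_{t-1}$.

```python
import numpy as np

def diagonal_function_kl(f_q, f_p, k_q, k_p):
    """Sum over entries of KL(N(f_q, k_q) || N(f_p, k_p)).

    All arguments have shape (N, D); k_q and k_p must be positive variances.
    """
    per_entry = 0.5 * (
        np.log(k_p / k_q) + k_q / k_p - 1.0 + (f_q - f_p) ** 2 / k_p
    )
    return per_entry.sum()

def simplified_objective(log_liks, f_q, f_p, k_q, k_p):
    """Monte Carlo expected log-likelihood minus the diagonal KL penalty.

    log_liks: shape (S,), one mini-batch log-likelihood per parameter sample.
    """
    return np.mean(log_liks) - diagonal_function_kl(f_q, f_p, k_q, k_p)

# Identical distributions incur no penalty.
ones = np.ones((2, 3))
kl_same = diagonal_function_kl(ones, ones, ones, ones)

# A single entry with unit variances and a mean shift of 1 gives KL = 0.5.
kl_shift = diagonal_function_kl(
    np.array([[1.0]]), np.array([[0.0]]), np.array([[1.0]]), np.array([[1.0]])
)

objective = simplified_objective(
    np.array([-1.0, -3.0]),
    np.array([[1.0]]), np.array([[0.0]]), np.array([[1.0]]), np.array([[1.0]]),
)
```

Only elementwise operations and a reduction appear, which is consistent with the statement that no matrix inversion is needed.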

We provide an empirical comparison of the simplified s-fsvi, fromp, and frcl objectives in Section 5 to assess the extent to which the differences described above affect continual learning.

Table 1: Predictive accuracies of a selection of objective-based methods for continual learning. Results are reported for three task sequences: split mnist (s-mnist), split Fashion mnist (s-fmnist) and permuted mnist (p-mnist). In some cases, a multi-head setup (MH) is used; in others, a single-head setup (SH). Best results for identical network architectures are printed in boldface (exception: var-gp uses a non-parametric model). Best overall results are highlighted in gray. Each numerical entry denotes the mean accuracy across tasks at the end of training. Where possible, this accuracy is based on experiments repeated with different random seeds (10 repeats for s-fsvi), with both the mean value and standard error reported. All methods use the same architecture and coreset size unless indicated otherwise. See Appendix C for more experimental details. $^{1}$Accuracies computed using the best coreset-selection method (either random or $k$-center). $^{2}$Uses random coreset selection. $^{3}$Requires a multi-head setup with task identifiers, including for permuted mnist. This requirement explains the missing frcl result for s-mnist (SH). $^{4}$Uses a larger MLP architecture (see Table 4 in appendix). $^{5}$Evaluates the KL divergence at points sampled from the empirical data distribution of the current task. $^{6}$Uses one sample per class as a coreset.

| Method | s-mnist (MH) | s-fmnist (MH) | p-mnist (SH) | s-mnist (SH) |
| --- | --- | --- | --- | --- |
| ewc (Kirkpatrick et al., 2017) | 63.10% | | 84.00% | |
| si (Zenke et al., 2017) | 98.90% | | 86.00% | |
| vcl (Nguyen et al., 2018)$^{1}$ | 98.40% | 98.60% ± 0.04 | 93.00% | 32.11% ± 1.16 |
| vcl (no coreset) | 97.00% | 89.60% ± 1.75 | 87.50% ± 0.61 | 17.74% ± 1.20 |
| frcl (Titsias et al., 2020)$^{3}$ | 97.80% ± 0.22 | 97.28% ± 0.17 | 94.30% ± 0.06 | |
| fromp (Pan et al., 2020) | 99.00% ± 0.04 | 99.00% ± 0.03 | 94.90% ± 0.04 | 35.29% ± 0.52 |
| var-gp (Kapoor et al., 2021) | | | 97.20% ± 0.08 | 90.57% ± 1.06 |
| s-fsvi (ours)$^{2}$ | 99.54% ± 0.04 | 99.19% ± 0.02 | 95.76% ± 0.02 | 92.87% ± 0.14 |
| s-fsvi Ablation Study: | | | | |
| s-fsvi (larger networks)$^{4}$ | 99.76% ± 0.00 | 99.16% ± 0.03 | 97.50% ± 0.01 | 93.38% ± 0.10 |
| s-fsvi (no coreset)$^{5}$ | 99.62% ± 0.02 | 99.54% ± 0.01 | 84.06% ± 0.46 | 20.15% ± 0.52 |
| s-fsvi (minimal coreset)$^{6}$ | | | 89.59% ± 0.30 | 51.44% ± 1.22 |

4 Related Work

There are three main (partially overlapping) categories of methods for continual learning in a deep neural network. Objective-based approaches modify the objective function used to train the neural network. Replay-based approaches summarize past tasks using either stored data or freshly generated synthetic data. Architecture-based approaches change the neural network’s structure from one task to another. For extensive reviews, see De Lange et al. (2021) and Parisi et al. (2019). As sequential function-space variational inference (s-fsvi) centers around a new training objective, we focus on objective-based approaches in this review. (Like the methods reviewed below, s-fsvi does incorporate a form of replay in that it uses context points, but the primary interest is the training objective.)

For a neural network to retain abilities it has previously learned, its predictions on data associated with past tasks must not change significantly from one task to another. One way of achieving this is to include in the training objective a form of function-space regularization to discourage substantial changes in the network’s predictions or internal representations. “Learning without forgetting” (Li and Hoiem, 2018) uses a modified cross-entropy loss that penalizes the difference between the predictions of the current network on the current task data and the predictions of the previous network on the current task data. “Less-forgetful learning” (Jung et al., 2018) employs the same method but uses squared Euclidean distance rather than the modified cross-entropy loss and applies it to the penultimate-layer representations rather than the network’s predictions. “Keep and learn” (Kim et al., 2018) also uses internal representations as a basis for regularization. The method subsequently proposed by Benjamin et al. (2019) compares the current network with all previous versions of the network on data from all past tasks, rather than only with the most recent network on data from the current task. Each pair of networks is compared by computing the Euclidean distance between the networks’ predictions. “Dark experience replay” (Buzzega et al., 2020) extends this method to work in a setting where task boundaries are not clearly defined.

While these approaches mitigate forgetting, they do not explicitly account for predictive uncertainty, which is an issue if the neural network is a poor fit to the data. This deficiency is addressed by probabilistic approaches to function-space regularization, which encourage a network’s predictions to agree with a prior distribution over functions rather than with a single function. “Functional regularization for continual learning” (frcl; Titsias et al., 2020) considers a network whose final layer is a Bayesian linear model. Based on the duality between parameter space and function space, the frcl objective includes the KL divergence between predictive distributions at a selection of input points. This encourages similarity between the network’s current predictive distribution and the distributions from past tasks. frcl is theoretically appealing, building on a well-understood method for stochastic variational inference using inducing points, but is only applicable to Bayesian linear models. In contrast, “functional regularization of the memorable past” (fromp; Pan et al., 2020) maintains a posterior distribution over all the parameters of a neural network. While fromp achieves state-of-the-art performance on several continual-learning task sequences, it relies on a change in the underlying probabilistic model and uses a surrogate objective for optimization, which divorces it from function-space variational objectives. As we show, this results in suboptimal performance compared to sequential function-space variational inference, which maintains a stronger link to the underlying Bayesian approximation.

Although our focus is on methods for training deep neural networks, for completeness, we also note methods based on Gaussian processes (gps). Incremental variational sparse gp regression (Cheng and Boots, 2016), streaming sparse gp (Bui et al., 2017) and online sparse multi-output gp regression (Yang et al., 2019) build on the work of Csató and Opper (2002) and Csató (2002), and are effective approaches to continual learning for regression tasks. Continual multi-task gp (Moreno-Muñoz et al., 2019) extends this line of work to multi-output settings with non-Gaussian likelihoods. The success of variational autoregressive gps (var-gp; Kapoor et al., 2021) on continual learning for task sequences with image inputs warrants its inclusion, where relevant, in Section 5. However, we note that var-gp scales poorly with the number of tasks: the time complexity for inference is cubic in the number of context points and hence in the number of tasks, which may limit its applicability to task sequences like sequential Omniglot. In contrast, the time complexity of s-fsvi is linear in the number of context points.

Also distinct from but related to our method are a number of objective-based approaches to continual learning that directly regularize the parameters of a neural network. We briefly discuss these approaches in Appendix D.

[Figure 3 panels: (a) s-mnist (MH); (b) s-fmnist (MH); (c) p-mnist (SH); (d) s-mnist (SH).]
Figure 3: Effect of the coreset size and coreset-selection method on the predictive accuracy of s-fsvi. Three coreset-selection methods are presented: sampling data points with uniform probability; sampling with probability proportional to the model’s predictive entropy; and sampling with probability proportional to the KL divergence between the posterior predictive distribution and the prior predictive distribution. Ten inducing points are used in each case. No coreset-selection method consistently yields higher accuracy.
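The score-proportional selection schemes named in the caption above can be sketched in a few lines of NumPy. The helper names, the choice to sample without replacement, and the toy probabilities are assumptions made for illustration; the caption does not specify these details. Uniform selection corresponds to constant scores, and the KL-based scheme would pass per-point KL values as the scores instead of entropies.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Entropy of each row of class probabilities: shape (N, C) -> (N,)."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def sample_coreset(scores, size, rng):
    """Sample `size` distinct indices with probability proportional to scores."""
    weights = scores / scores.sum()
    return rng.choice(len(scores), size=size, replace=False, p=weights)

rng = np.random.default_rng(0)

# Hypothetical predictive probabilities for four candidate points.
probs = np.array([[0.50, 0.50], [0.90, 0.10], [0.99, 0.01], [0.60, 0.40]])
entropies = predictive_entropy(probs)

entropy_idx = sample_coreset(entropies, size=2, rng=rng)          # entropy-weighted
uniform_idx = sample_coreset(np.ones(len(probs)), size=2, rng=rng)  # uniform
```

Points with near-uniform class probabilities receive the largest weights under the entropy scheme, so uncertain points are more likely to enter the coreset.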

5 Empirical Evaluation

After visualizing how s-fsvi works in practice (Section 5.1), we compare s-fsvi’s performance with that of existing objective-based methods for continual learning (Sections 5.2, 5.3 and 5.4). For a comprehensive comparison, we evaluate s-fsvi on a range of task sequences used in related work. Aiming to use baselines that are as strong as possible, we report results taken directly from the literature in most cases (and mention when we do not). Reporting baselines in this way leaves gaps in our comparison: for each existing technique, results are available for only a subset of the task sequences we consider here (e.g., Pan et al. (2020) report results for split cifar but not sequential Omniglot, while Titsias et al. (2020) do the reverse).

Our evaluation pays attention to two factors important in the assessment of continual-learning methods: the use of task identifiers when making predictions, and the use of coresets of data points to summarize past tasks (Farquhar and Gal, 2018). To provide some commentary on the first of these factors, we run an experiment that compares the performance of a single-head neural network (which does not use task identifiers) to that of a multi-head neural network (which uses task identifiers). Regarding the second factor, we explore how performance changes when the coreset size changes or a context set unrelated to previous tasks is used.

Details about the experimental setups (e.g., optimization routines and hyperparameter searches) can be found in Appendix C. Our code can be accessed at:

[Figure 4 panels: (a) Split cifar Accuracies After Training on All Tasks; (b) s-fsvi Accuracy as a Function of Coreset Size.]
Figure 4: Predictive accuracies of s-fsvi and related methods on split cifar. (a) Per-task and average accuracy after training on six tasks. The result for the “joint” baseline is obtained using a model trained on data from all tasks at the same time. The accuracy at task $t$ for the “separate” baseline is the accuracy of an independent model trained only on task $t$. We use the best-performing method for each baseline: fromp for “joint”, s-fsvi for “separate”. (b) Average accuracy after training on six tasks with different coreset sizes. “Random” coreset selection denotes uniform sampling from the training set. “Entropy” coreset selection denotes sampling from the training set with probability proportional to the entropy of the model’s posterior predictive distribution.

5.1 Illustrative Example

To provide intuition for how s-fsvi allows learning on new tasks while maintaining previously acquired abilities, we apply it to a task sequence based on easy-to-visualize synthetic 2D data, originally proposed by Pan et al. (2020). In this task sequence, each data point belongs to one of two classes, and more data points are revealed as the task sequence progresses. The data-generating process is assumed to reveal data from mostly non-overlapping subsets of the input space. The continual-learning problem is then to infer the decision boundary around data points revealed up to and including the current task without forgetting the decision boundary inferred on previous tasks. We use a single-head neural network.

In Figure 2, we plot the model’s posterior predictive distribution after training on each of five tasks. After training on task 1, the model has low predictive uncertainty close to the data points and high uncertainty (class probabilities around 0.5) everywhere else (Figure 2a). On task 2, s-fsvi seeks to match the distribution over functions inferred on the previous task while fitting the new set of data points. s-fsvi achieves this and expands the area in input space where the model is confident in its predictions (Figure 2b).

As more tasks and data are revealed, s-fsvi allows the model to continually explore the data space and infer the decision boundary while maintaining accurate, high-confidence predictions on data points in parts of the input space where it was previously trained on observed data. Finally, after training on five tasks, the model has inferred the decision boundary between the two classes, while maintaining high predictive uncertainty in parts of the input space where no data points have been observed yet (Figure 2e). The model maintains high predictive uncertainty away from the data, which makes it easier to learn on new tasks. This is unlike deterministic neural networks, which tend to make highly confident predictions in parts of the input space where no data has been observed, or on data points that lie outside of the distribution of the training data.

5.2 Split (Fashion) MNIST & Permuted MNIST

Having established some intuition for how s-fsvi works, we demonstrate how this translates to high predictive accuracy on three task sequences commonly used to evaluate continual-learning methods. First is split mnist (s-mnist), in which each task consists of binary classification on a pair of mnist classes (0 vs. 1, 2 vs. 3, and so on). Second is split Fashion mnist (s-fmnist), which has the same structure but uses data from Fashion mnist, posing a harder problem. Third is permuted mnist (p-mnist), in which each task consists of ten-way classification on mnist images whose pixels have been randomly reordered. A multi-head setup (MH) with task identifiers provided at prediction time is the default for s-mnist and s-fmnist, while a single-head setup (SH) without task identifiers is standard for p-mnist. In addition to running the default setup for all three task sequences, we run a single-head setup for s-mnist.
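The structure of these task sequences can be sketched as follows, assuming flattened images and integer labels held in NumPy arrays. The helper names and the random stand-in data are hypothetical, and details such as whether the first permuted task keeps the original pixel order are not specified above, so this sketch simply draws a fresh permutation for every task.

```python
import numpy as np

def split_task_indices(labels, num_classes=10):
    """Group example indices into binary tasks (0 vs. 1, 2 vs. 3, and so on)."""
    tasks = []
    for first in range(0, num_classes, 2):
        mask = (labels == first) | (labels == first + 1)
        tasks.append(np.flatnonzero(mask))
    return tasks

def permuted_tasks(images, num_tasks, rng):
    """Yield one pixel-permuted copy of `images` (shape (N, P)) per task."""
    for _ in range(num_tasks):
        perm = rng.permutation(images.shape[1])  # one fixed permutation per task
        yield images[:, perm]

# Stand-in data with the same shapes as flattened 28x28 images.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=200)
images = rng.random((200, 784))

split = split_task_indices(labels)
permuted = list(permuted_tasks(images, num_tasks=3, rng=rng))
```

In the split sequences each task sees a disjoint pair of classes, whereas in the permuted sequence every task contains all ten classes and only the pixel order changes.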

With a standard configuration, s-fsvi outperforms all existing methods based on deep neural networks by a statistically significant margin on all task sequences (Table 1). As noted in Section 4, var-gp’s conceptual connection to our method warrants its inclusion in our comparison. var-gp performs better than our standard configuration of s-fsvi on permuted mnist, but this advantage disappears once a larger neural network is used with s-fsvi. Moreover, var-gp is unlikely to scale well to more challenging task sequences, such as those in Sections 5.3 and 5.4.

Table 2: Predictive accuracies of s-fsvi and related methods on sequential Omniglot. For s-fsvi and frcl, the coreset consists of two data points per class. All baseline results are from Titsias et al. (2020). For all methods, the mean and standard deviation over five random task permutations are reported. $^{1}$Li and Hoiem (2018). $^{2}$Schwarz et al. (2018). $^{3}$Schwarz et al. (2018). $^{4}$Coreset selected using frcl’s “trace” method. $^{5}$Details in Appendix C.

| Method | Test Accuracy |
| --- | --- |
| Learning Without Forgetting$^{1}$ | 62.06% ± 2.0 |
| ewc | 67.32% ± 4.7 |
| Online ewc$^{2}$ | 69.99% ± 3.2 |
| Progress & Compress$^{3}$ | 70.32% ± 3.3 |
| frcl$^{4}$ | 81.47% ± 1.6 |
| s-fsvi (ours)$^{5}$ | 83.29% ± 1.2 |

5.3 Sequential Omniglot

Sequential Omniglot (Lake et al., 2015; Schwarz et al., 2018) provides a more challenging task sequence than those considered in Section 5.2. It consists of 50 classification tasks, where the number of classes varies between the tasks (details in Appendix C). We find that s-fsvi produces better predictive accuracy than all available baselines, including frcl, by a statistically significant margin (Table 2). To illustrate the stability of s-fsvi across long task sequences, we plot its mean accuracy over 50 tasks in Figure 5.

Figure 5: Predictive accuracies of s-fsvi and frcl on sequential Omniglot. For s-fsvi, the accuracy shown at task $t$ is the mean accuracy across all tasks up to that point (mean $\pm$ one standard error as computed across five permutations of the task order). We were unable to reproduce the result reported in Titsias et al. (2020) using the authors’ code. However, we compare against the result from the paper (only the accuracy at task 50 is reported) here to provide a strong baseline.

[Figure 6 panels: (a) s-mnist (MH); (b) p-mnist (SH).]
Figure 6: Predictive accuracies of s-fsvi and parameter-space variational inference (vcl) on split mnist and permuted mnist. The accuracy shown at task $t$ is the mean accuracy across all tasks up to that point (mean $\pm$ one standard error as computed across ten repetitions of the experiment). With a coreset, s-fsvi outperforms vcl on both task sequences. Without a coreset, s-fsvi performs poorly on permuted mnist.

[Figure 7 panels: (a) s-mnist (MH); (b) s-fmnist (MH).]
Figure 7: Predictive accuracies of s-fsvi, fromp and frcl on multi-head split (Fashion) mnist without using coresets. Inducing inputs for evaluating the KL divergence are sampled according to three different sampling schemes derived from the current task’s empirical data distribution (see Appendix C for details). Using s-fsvi with images sampled from the current task’s training set significantly outperforms all other methods.

5.4 Split CIFAR

Moving beyond classification tasks on grayscale images, we evaluate s-fsvi on split cifar (Pan et al.,, 2020; Zenke et al.,, 2017). This uses the full cifar-10 dataset for the first task, followed by five ten-way classification tasks drawn from cifar-100. Our results show s-fsvi achieving higher accuracy on all tasks than fromp and vcl after learning all six tasks (Figure 4a). Notably, on each task except the first, s-fsvi performs close to or better than two baselines: a model trained only on that task, and a model trained on all tasks jointly. The latter is a particularly strong baseline, because all data is available during training.
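
The task sequence described above can be written down as a list of label sets. The sketch below is illustrative only: the assignment of cifar-100 classes to tasks (here, consecutive blocks of ten starting at class 0), the function name, and the dictionary fields are assumptions, not a description of the experimental code.

```python
def split_cifar_tasks(num_cifar100_tasks=5, classes_per_task=10):
    """Return task descriptions for a split-CIFAR-style sequence.

    Task 0 uses all ten CIFAR-10 classes; each later task uses a disjoint
    block of CIFAR-100 classes (consecutive blocks are an assumption).
    """
    tasks = [{"dataset": "cifar10", "classes": list(range(10))}]
    for j in range(num_cifar100_tasks):
        start = j * classes_per_task
        tasks.append({
            "dataset": "cifar100",
            "classes": list(range(start, start + classes_per_task)),
        })
    return tasks

tasks = split_cifar_tasks()
```

With the default arguments this yields six ten-way tasks, matching the sequence length used in this section.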

As in related work (Lopez-Paz and Ranzato,, 2017; Pan et al.,, 2020), we compute the forward transfer (FT) and backward transfer (BT) for s-fsvi on split cifar. FT captures by how much the accuracy on the current tasks increases as the number of past tasks increases; BT captures by how much the accuracy on the previous tasks increases as more tasks are observed (see Section C.6 for mathematical definitions). As well as having the best overall accuracy, s-fsvi significantly outperforms all baselines in terms of FT and has BT comparable to ewc and fromp (Table 3).
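
To make the two metrics concrete, the sketch below computes them from a matrix of accuracies. It assumes definitions of a form commonly used in related work: BT averages R[T-1][i] - R[i][i] over all but the last task, and FT averages R[i][i] - b[i] over all but the first task, where R[t][i] is the accuracy on task i after training on task t and b[i] is the accuracy of a model trained on task i alone. Section C.6 contains the definitions actually used here; the names `R` and `b` and the example numbers are hypothetical.

```python
def backward_transfer(R):
    """Average change in accuracy on earlier tasks once all tasks are learned."""
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)

def forward_transfer(R, b):
    """Average gain on each new task relative to training on it in isolation."""
    T = len(R)
    return sum(R[i][i] - b[i] for i in range(1, T)) / (T - 1)

# Made-up accuracies (in percent) for a three-task sequence.
R = [[80.0, 0.0, 0.0],
     [78.0, 70.0, 0.0],
     [77.0, 68.0, 75.0]]
b = [80.0, 65.0, 72.0]

bt = backward_transfer(R)    # ((77 - 80) + (68 - 70)) / 2
ft = forward_transfer(R, b)  # ((70 - 65) + (75 - 72)) / 2
```

Under these assumed definitions, a negative BT indicates forgetting and a positive FT indicates that earlier tasks helped with later ones.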

Table 3: Forward transfer (FT) and backward transfer (BT) of s-fsvi and related methods on split cifar. All baseline results are from Pan et al., (2020). For all methods, the mean and standard error over five repeated experiments are reported. ¹Details in Appendix C.
Method | Test Accuracy | FT | BT
ewc | 71.6% ± 0.4 | 0.2 ± 0.4 | -2.3 ± 0.6
vcl | 67.4% ± 0.6 | 1.8 ± 1.4 | -9.2 ± 0.8
fromp | 76.2% ± 0.2 | 6.1 ± 0.3 | -2.6 ± 0.4
s-fsvi (ours)¹ | 77.6% ± 0.2 | 7.3 ± 0.2 | -2.5 ± 0.2

5.5 Function- vs. Parameter-Space Inference

To demonstrate the importance of performing inference in function space, we compare how the accuracies of s-fsvi and vcl evolve from one task to another on split mnist and permuted mnist (Figure 6). We find that s-fsvi consistently outperforms vcl, whose predictive performance steadily degrades. This suggests that function-space inference may be more effective than parameter-space inference at transferring prior knowledge from one task to another, and that this may offset the information loss in the KL divergence between distributions over functions compared to the KL divergence between distributions over parameters.
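
The information loss mentioned above can be stated as an inequality. Writing $q_t(f(\mathbf{X}))$ and $p_t(f(\mathbf{X}))$ for the distributions over function values obtained by pushing parameter distributions $q_t(\bm{\theta})$ and $p_t(\bm{\theta})$ through the same mapping $f(\mathbf{X}; \cdot)$ (notation assumed here for illustration), the data processing inequality gives:

```latex
\mathbb{D}_{\mathrm{KL}}\big( q_t(f(\mathbf{X})) \,\|\, p_t(f(\mathbf{X})) \big)
\;\le\;
\mathbb{D}_{\mathrm{KL}}\big( q_t(\bm{\theta}) \,\|\, p_t(\bm{\theta}) \big)
```

The function-space divergence therefore never exceeds the parameter-space one, and it can be strictly smaller when different parameter values map to the same function values.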

5.6 Coreset Size and Selection

Similar to existing methods such as fromp and frcl, s-fsvi includes in the training objective a function-space regularization term that encourages matching the prior distribution over functions at a set of context points. Typically, this requires keeping a representative coreset of data points from each task, from which a context set can be constructed.

s-fsvi offers two benefits with respect to coresets. First, it is insensitive to which points get included in the coresets. Whereas existing methods often require expensive procedures to select important data points from previous tasks, Figures 3 and 4b show that s-fsvi achieves strong performance while only using randomly selected coresets. Second, s-fsvi does not require large coresets to perform well. On permuted mnist, s-fsvi achieves better predictive accuracy than ewc and si even if the coreset used for s-fsvi consists of only a single data point per class (Table 1). On the single-head version of split mnist, a minimal coreset (one point per class, or two points per task) allows s-fsvi to outperform vcl and fromp, both with coresets of 40 points per task (Table 1). In some multi-head settings, s-fsvi achieves state-of-the-art predictive accuracies with randomly-generated noise coresets (Table 1 and Figure 7).
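
As an illustration of how little machinery a randomly selected coreset needs, the sketch below draws a fixed number of points per class uniformly at random. The function name, its interface, and the toy labels are hypothetical, and this is not the selection code used in the experiments.

```python
import random
from collections import defaultdict

def random_coreset(labels, points_per_class=1, seed=0):
    """Pick `points_per_class` indices uniformly at random from each class.

    `labels` is a list of class labels for one task's training set
    (hypothetical interface). Returns a sorted list of dataset indices.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    coreset = []
    for label in sorted(by_class):
        coreset.extend(rng.sample(by_class[label], points_per_class))
    return sorted(coreset)

# A two-class task, as in split mnist: one point per class gives two points per task.
labels = [0, 1, 0, 1, 1, 0, 0, 1]
coreset = random_coreset(labels, points_per_class=1, seed=0)
```

No scoring or ranking of data points is involved, in contrast to selection procedures that try to identify important points from previous tasks.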

6 Conclusion

We presented sequential function-space variational inference (s-fsvi), a method for continual learning in deep neural networks. We showed that s-fsvi improves on the predictive performance of existing objective-based continual learning methods—often by a significant margin—including on task sequences with high-dimensional inputs (split cifar) and large numbers of tasks (sequential Omniglot). Lastly, we demonstrated that—unlike existing function-space regularization methods—s-fsvi does not rely on careful coreset selection and, in multi-head settings, can achieve state-of-the-art performance even without coresets collected on previous tasks. We hope that this work will lead to future research into further improving function-space objectives for continual learning.

Acknowledgements

Tim G. J. Rudner and Freddie Bickford Smith are funded by the Engineering and Physical Sciences Research Council (EPSRC). Tim G. J. Rudner is also funded by the Rhodes Trust and by a Qualcomm Innovation Fellowship. We gratefully acknowledge donations of computing resources by the Alan Turing Institute.

References

  • Ahn et al., (2019) Ahn, H., Cha, S., Lee, D., and Moon, T. (2019). Uncertainty-based continual learning with adaptive regularization. In Advances in Neural Information Processing Systems.
  • Aljundi et al., (2018) Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., and Tuytelaars, T. (2018). Memory aware synapses: learning what (not) to forget. In European Conference on Computer Vision.
  • Benjamin et al., (2019) Benjamin, A., Rolnick, D., and Kording, K. (2019). Measuring and regularizing networks in function space. In International Conference on Learning Representations.
  • Broderick et al., (2013) Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. (2013). Streaming variational Bayes. In Advances in Neural Information Processing Systems.
  • Bui et al., (2017) Bui, T. D., Nguyen, C., and Turner, R. E. (2017). Streaming sparse Gaussian process approximations. In Advances in Neural Information Processing Systems.
  • Burt et al., (2021) Burt, D. R., Ober, S. W., Garriga-Alonso, A., and van der Wilk, M. (2021). Understanding variational inference in function-space. In Symposium on Advances in Approximate Bayesian Inference.
  • Buzzega et al., (2020) Buzzega, P., Boschini, M., Porrello, A., Abati, D., and Calderara, S. (2020). Dark experience for general continual learning: a strong, simple baseline. In Advances in Neural Information Processing Systems.
  • Chaudhry et al., (2018) Chaudhry, A., Dokania, P., Ajanthan, T., and Torr, P. (2018). Riemannian walk for incremental learning: understanding forgetting and intransigence. In European Conference on Computer Vision.
  • Cheng and Boots, (2016) Cheng, C.-A. and Boots, B. (2016). Incremental variational sparse Gaussian process regression. In Advances in Neural Information Processing Systems.
  • Cover and Thomas, (1991) Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley, New York.
  • Csató, (2002) Csató, L. (2002). Gaussian processes: iterative sparse approximations. PhD thesis, Aston University.
  • Csató and Opper, (2002) Csató, L. and Opper, M. (2002). Sparse on-line Gaussian processes. Neural Computation.
  • De Lange et al., (2021) De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. (2021). A continual learning survey: defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Ebrahimi et al., (2020) Ebrahimi, S., Elhoseiny, M., Darrell, T., and Rohrbach, M. (2020). Uncertainty-guided continual learning with Bayesian neural networks. In International Conference on Learning Representations.
  • Farquhar and Gal, (2018) Farquhar, S. and Gal, Y. (2018). Towards robust evaluations of continual learning. ICML Workshop on Lifelong Learning: A Reinforcement Learning Approach.
  • Ghahramani and Attias, (2000) Ghahramani, Z. and Attias, H. (2000). Online variational Bayesian learning. In NIPS Workshop on Online Learning.
  • Honkela and Valpola, (2003) Honkela, A. and Valpola, H. (2003). On-line variational Bayesian learning. In International Symposium on Independent Component Analysis and Blind Signal Separation.
  • Jung et al., (2018) Jung, H., Ju, J., Jung, M., and Kim, J. (2018). Less-forgetful learning for domain expansion in deep neural networks. In AAAI Conference on Artificial Intelligence.
  • Kapoor et al., (2021) Kapoor, S., Karaletsos, T., and Bui, T. D. (2021). Variational auto-regressive Gaussian processes for continual learning. In International Conference on Machine Learning.
  • Kessler et al., (2019) Kessler, S., Nguyen, V., Zohren, S., and Roberts, S. (2019). Hierarchical Indian buffet neural networks for Bayesian continual learning. arXiv.
  • Kim et al., (2018) Kim, H.-E., Kim, S., and Lee, J. (2018). Keep and learn: continual learning by constraining the latent space for knowledge preservation in neural networks. In Medical Image Computing and Computer Assisted Intervention.
  • Kirkpatrick et al., (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N. C., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences.
  • Lake et al., (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science.
  • Lee et al., (2017) Lee, S.-W., Kim, J.-H., Jun, J., Ha, J.-W., and Zhang, B.-T. (2017). Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems.
  • Li and Hoiem, (2018) Li, Z. and Hoiem, D. (2018). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Liu et al., (2018) Liu, X., Masana, M., Herranz, L., van de Weijer, J., López, A. M., and Bagdanov, A. D. (2018). Rotate your networks: better weight consolidation and less catastrophic forgetting. International Conference on Pattern Recognition.
  • Loo et al., (2020) Loo, N., Swaroop, S., and Turner, R. E. (2020). Generalized variational continual learning. In International Conference on Learning Representations.
  • Lopez-Paz and Ranzato, (2017) Lopez-Paz, D. and Ranzato, M. A. (2017). Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems.
  • Matthews et al., (2016) Matthews, A. G. d. G., Hensman, J., Turner, R., and Ghahramani, Z. (2016). On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In International Conference on Artificial Intelligence and Statistics.
  • Moreno-Muñoz et al., (2019) Moreno-Muñoz, P., Artés-Rodríguez, A., and Álvarez, M. A. (2019). Continual multi-task Gaussian processes. arXiv.
  • Nguyen et al., (2018) Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. (2018). Variational continual learning. In International Conference on Learning Representations.
  • Pan et al., (2020) Pan, P., Swaroop, S., Immer, A., Eschenhagen, R., Turner, R., and Khan, M. E. E. (2020). Continual deep learning by functional regularisation of memorable past. In Advances in Neural Information Processing Systems.
  • Parisi et al., (2019) Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. (2019). Continual lifelong learning with neural networks: a review. Neural Networks.
  • Park et al., (2019) Park, D., Hong, S., Han, B., and Lee, K. M. (2019). Continual learning by asymmetric loss approximation with single-side overestimation. In International Conference on Computer Vision.
  • Ritter et al., (2018) Ritter, H., Botev, A., and Barber, D. (2018). Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems.
  • Rudner et al., (2021) Rudner, T. G. J., Chen, Z., and Gal, Y. (2021). Rethinking function-space variational inference in Bayesian neural networks. In Symposium on Advances in Approximate Bayesian Inference.
  • Sato, (2001) Sato, M.-A. (2001). Online model selection based on the variational Bayes. Neural Computation.
  • Schwarz et al., (2018) Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., and Hadsell, R. (2018). Progress & compress: a scalable framework for continual learning. In International Conference on Machine Learning.
  • Shannon and Weaver, (1949) Shannon, C. E. and Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press, Urbana and Chicago.
  • Sun et al., (2019) Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational Bayesian neural networks. In International Conference on Learning Representations.
  • Swaroop et al., (2019) Swaroop, S., Nguyen, C. V., Bui, T. D., and Turner, R. E. (2019). Improving and understanding variational continual learning. In NeurIPS Workshop on Continual Learning.
  • Titsias et al., (2020) Titsias, M. K., Schwarz, J., de G. Matthews, A. G., Pascanu, R., and Teh, Y. W. (2020). Functional regularisation for continual learning with Gaussian processes. In International Conference on Learning Representations.
  • Yang et al., (2019) Yang, L., Wang, K., and Mihaylova, L. S. (2019). Online sparse multi-output Gaussian process regression and learning. IEEE Transactions on Signal and Information Processing over Networks.
  • Yin et al., (2020a) Yin, D., Farajtabar, M., and Li, A. (2020a). SOLA: continual learning with second-order loss approximation. arXiv.
  • Yin et al., (2020b) Yin, D., Farajtabar, M., Li, A., Levine, N., and Mott, A. (2020b). Optimization and generalization of regularization-based continual learning: a loss approximation viewpoint. arXiv.
  • Zenke et al., (2017) Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In International Conference on Machine Learning.

Supplementary Material

Table of Contents

  1. Appendix A: Proofs

  2. Appendix B: Further Empirical Results

  3. Appendix C: Experimental Details

  4. Appendix D: Further Related Work

Appendix A Proofs

A.1 Variational Objective

Proposition 1 (Sequential Function-Space Variational Inference (s-fsvi); adapted from (Rudner et al.,, 2021)).

Let $D_t$ be the number of model output dimensions for $t$ tasks, let $f : \mathcal{X} \times \mathbb{R}^{P} \rightarrow \mathbb{R}^{D_t}$ be a mapping defined by a neural network architecture, let $\bm{\Theta} \in \mathbb{R}^{P}$ be a multivariate random vector of network parameters, and let $q_t(\bm{\theta}) \doteq \mathcal{N}(\bm{\mu}_t, \bm{\Sigma}_t)$ and $q_{t-1}(\bm{\theta}) \doteq \mathcal{N}(\bm{\mu}_{t-1}, \bm{\Sigma}_{t-1})$ be variational distributions over $\bm{\Theta}$.
Additionally, let $\mathbf{X}_{\mathcal{C}}$ denote a sample of context points, and let $\bar{\mathbf{X}}_t \subseteq \{\mathbf{X}_t \cup \mathbf{X}_{\mathcal{C}}\}$. Under a diagonal approximation of the prior and variational posterior covariance functions across output dimensions, the objective in Equation 5 can be approximated by

$$
\begin{split}
&\mathcal{F}(q_t, q_{t-1}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_t, \mathbf{y}_t) \\
&\doteq \mathbb{E}_{q_t(\bm{\theta})}[\log p(\mathbf{y}_t \,|\, f(\mathbf{X}_t; \bm{\theta}))] \\
&\qquad - \sum_{k=1}^{D_t} \frac{1}{2} \bigg( \log \frac{|[\mathbf{K}^{p_t}]_k|}{|[\mathbf{K}^{q_t}]_k|} - \frac{|\bar{\mathbf{X}}_t|}{D_t} + \text{Tr}([\mathbf{K}^{p_t}]_k^{-1} [\mathbf{K}^{q_t}]_k) + \Delta(\bar{\mathbf{X}}_t; \bm{\mu}_t, \bm{\mu}_{t-1})^\top [\mathbf{K}^{p_t}]_k^{-1} \Delta(\bar{\mathbf{X}}_t; \bm{\mu}_t, \bm{\mu}_{t-1}) \bigg),
\end{split}
\tag{A.1}
$$

where

$$
\Delta(\bar{\mathbf{X}}_t; \bm{\mu}_t, \bm{\mu}_{t-1}) \doteq [f(\bar{\mathbf{X}}_t; \bm{\mu}_t)]_k - [f(\bar{\mathbf{X}}_t; \bm{\mu}_{t-1})]_k
\tag{A.2}
$$

and

$$
\mathbf{K}^{p_t} \doteq \mathcal{J}(\bar{\mathbf{X}}_t, \bm{\mu}_{t-1}) \bm{\Sigma}_{t-1} \mathcal{J}(\bar{\mathbf{X}}_t, \bm{\mu}_{t-1})^\top \quad \text{and} \quad \mathbf{K}^{q_t} \doteq \mathcal{J}(\bar{\mathbf{X}}_t, \bm{\mu}_t) \bm{\Sigma}_t \mathcal{J}(\bar{\mathbf{X}}_t, \bm{\mu}_t)^\top,
\tag{A.3}
$$

are covariance matrix estimates constructed from Jacobians $\mathcal{J}(\cdot, \mathbf{m}) \doteq \frac{\partial f(\cdot\,; \bm{\Theta})}{\partial \bm{\Theta}} \big|_{\bm{\Theta} = \mathbf{m}}$ with $\mathbf{m} = \{\bm{\mu}_t, \bm{\mu}_{t-1}\}$.

Proof.

The result follows directly from the variational objective derived in (Rudner et al.,, 2021) when setting the prior to $p \doteq q_{t-1}$ and specifying the context set to be constructed from the coreset. ∎
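
The regularization term in Equation A.1 can be illustrated numerically. The NumPy sketch below builds the per-output covariance blocks of Equation A.3 from Jacobians and sums the Gaussian KL divergences over output dimensions. It is a sketch under stated assumptions and not the implementation used for the experiments: the toy network `mlp`, the finite-difference Jacobian (in place of automatic differentiation), the diagonal parameter covariances, and the jitter term are all illustrative choices. It also uses the standard Gaussian KL constant $-n$, with $n$ the number of context points, where Equation A.1 writes $-|\bar{\mathbf{X}}_t| / D_t$; the two differ only by a constant that does not depend on the variational parameters.

```python
import numpy as np

def mlp(x, theta):
    """Toy network with 2 inputs, 4 hidden units, 3 outputs (hypothetical)."""
    w1, b1 = theta[:8].reshape(2, 4), theta[8:12]
    w2, b2 = theta[12:24].reshape(4, 3), theta[24:27]
    return np.tanh(x @ w1 + b1) @ w2 + b2          # shape (n, 3)

def jacobian(f, x, theta, eps=1e-5):
    """Central-difference Jacobian of f(x, .) at theta; shape (n, D, P)."""
    columns = []
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        columns.append((f(x, theta + step) - f(x, theta - step)) / (2 * eps))
    return np.stack(columns, axis=-1)

def linearized_cov(f, x, mu, var, jitter=1e-6):
    """Per-output blocks [K]_k = J_k diag(var) J_k^T; shape (D, n, n)."""
    J = jacobian(f, x, mu)
    K = np.einsum("ndp,p,mdp->dnm", J, var, J)
    return K + jitter * np.eye(x.shape[0])         # jitter keeps each block invertible

def function_space_kl(f, x_ctx, mu_q, var_q, mu_p, var_p):
    """Sum over output dimensions k of KL(N(f_q, [K_q]_k) || N(f_p, [K_p]_k))."""
    K_q = linearized_cov(f, x_ctx, mu_q, var_q)
    K_p = linearized_cov(f, x_ctx, mu_p, var_p)
    delta = f(x_ctx, mu_q) - f(x_ctx, mu_p)        # mean difference, shape (n, D)
    n = x_ctx.shape[0]
    total = 0.0
    for k in range(K_q.shape[0]):
        logdet_p = np.linalg.slogdet(K_p[k])[1]
        logdet_q = np.linalg.slogdet(K_q[k])[1]
        trace = np.trace(np.linalg.solve(K_p[k], K_q[k]))
        quad = delta[:, k] @ np.linalg.solve(K_p[k], delta[:, k])
        total += 0.5 * (logdet_p - logdet_q - n + trace + quad)
    return total

rng = np.random.default_rng(0)
x_ctx = rng.normal(size=(4, 2))                    # four context points
mu_prev = rng.normal(size=27)                      # plays the role of mu_{t-1}
var_prev = np.full(27, 0.1)                        # diagonal of Sigma_{t-1}
mu_new = mu_prev + 0.1 * rng.normal(size=27)       # plays the role of mu_t
kl_same = function_space_kl(mlp, x_ctx, mu_prev, var_prev, mu_prev, var_prev)
kl_moved = function_space_kl(mlp, x_ctx, mu_new, var_prev, mu_prev, var_prev)
```

When the variational distribution equals the prior, the divergence vanishes up to numerical error. Moving the mean parameters makes it positive, and the size of the penalty depends on how much the predictions at the context points change rather than on the distance travelled in parameter space.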

A.2 Derivation of Correspondence to Other Function-Space Objectives

Proposition 2 (Relationship between fromp and s-fsvi).

With the s-fsvi objective $\mathcal{F}$ defined as in Equation 6, let $\bar{\mathbf{X}}_t = \mathbf{X}_{\mathcal{C}}$. Then, up to a multiplicative constant, the fromp objective corresponds to the s-fsvi objective with the prior covariance given by a Laplace approximation about $\bm{\mu}_{t-1}$ and the variational distribution given by a Dirac delta distribution $q_t^{\textsc{fromp}}(\bm{\theta}) \doteq \delta(\bm{\theta} - \bm{\mu}_t)$. Denoting the prior covariance under a Laplace approximation about $\bm{\mu}_{t-1}$ by $\hat{\bm{\Sigma}}_0(\bm{\mu}_{t-1})$ so that $q_{t-1}^{\textsc{fromp}}(\bm{\theta}) \doteq \mathcal{N}(\bm{\mu}_{t-1}, \hat{\bm{\Sigma}}_0(\bm{\mu}_{t-1}))$, the fromp objective can be expressed as

$$
\mathcal{L}^{\textsc{fromp}}(q_t^{\textsc{fromp}}, q_{t-1}^{\textsc{fromp}}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_t, \mathbf{y}_t) = \mathcal{F}(q_t^{\textsc{fromp}}, q_{t-1}^{\textsc{fromp}}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_t, \mathbf{y}_t) - \mathcal{V},
$$

where

$$
\mathcal{V} \doteq -\frac{1}{2} \sum_{k} \left( \log \frac{[\bar{\mathbf{K}}^{\hat{p}_t}]_k}{[\bar{\mathbf{K}}^{q_t}]_k} + \frac{[\bar{\mathbf{K}}^{q_t}]_k}{[\bar{\mathbf{K}}^{\hat{p}_t}]_k} - 1 \right),
$$

with $\bar{\mathbf{K}}$ denoting a covariance matrix under a block-diagonalization without inter-task dependence, and

$$
\bar{\mathbf{K}}^{\hat{p}_t} \doteq \text{block-diag}\left( \mathcal{J}(\bar{\mathbf{X}}_t, \bm{\mu}_{t-1}) \hat{\bm{\Sigma}}_0(\bm{\mu}_{t-1}) \mathcal{J}(\bar{\mathbf{X}}_t, \bm{\mu}_{t-1})^\top \right).
$$

Proof.

By Equation (8) in Pan et al., (2020), the fromp objective function is given by

$$
\begin{split}
&\mathcal{L}^{\textsc{fromp}}(q_t^{\textsc{fromp}}, q_{t-1}^{\textsc{fromp}}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_t, \mathbf{y}_t) \\
&\doteq \mathbb{E}_{q_t(\bm{\theta})}[\log p(\mathbf{y}_t \,|\, f(\mathbf{X}_t; \bm{\mu}_t))] + \sum_{k=1}^{t-1} \frac{\tau}{2} \left( [f(\mathbf{X}_{\mathcal{C}}; \bm{\mu}_t)]_k - [f(\mathbf{X}_{\mathcal{C}}; \bm{\mu}_{t-1})]_k \right)^\top [\mathbf{K}^{\hat{p}_t}]^{-1}_k \left( [f(\mathbf{X}_{\mathcal{C}}; \bm{\mu}_t)]_k - [f(\mathbf{X}_{\mathcal{C}}; \bm{\mu}_{t-1})]_k \right),
\end{split}
\tag{A.4}
$$

with temperature parameter $\tau$. The result follows directly from the definition of $\mathcal{F}(q_{t}^{\textsc{fromp}}, q_{t-1}^{\textsc{fromp}}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_{t}, \mathbf{y}_{t})$ and $\tau = 1$. ∎
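As a rough illustration of the regularizer in Equation A.4, the following NumPy sketch evaluates the quadratic term $\frac{\tau}{2}\Delta^{\top}\mathbf{K}^{-1}\Delta$ for a single output dimension $k$. The function name `quadratic_penalty`, the toy predictions and the toy kernel matrix are assumptions made for illustration; they are not taken from any fromp or s-fsvi implementation.

```python
import numpy as np


def quadratic_penalty(f_new, f_old, kernel, tau=1.0):
    """Return (tau / 2) * d^T K^{-1} d with d = f_new - f_old.

    f_new, f_old: predictions at the context points for one output
    dimension k, shape (n,). kernel: positive-definite matrix, shape (n, n).
    """
    delta = f_new - f_old
    # Solve K x = delta instead of forming the explicit inverse.
    solved = np.linalg.solve(kernel, delta)
    return 0.5 * tau * float(delta @ solved)


# Toy values (hypothetical): three context points, one output dimension.
f_old = np.array([0.2, -0.1, 0.5])
f_new = np.array([0.3, -0.1, 0.1])
kernel = np.array([[2.0, 0.5, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.0, 4.0]])

penalty = quadratic_penalty(f_new, f_old, kernel, tau=1.0)
identity_penalty = quadratic_penalty(f_new, f_old, np.eye(3), tau=1.0)
```

With an identity kernel the term reduces to half the squared Euclidean distance between the two prediction vectors; a non-identity kernel reweights the differences between predictions by the inverse prior covariance.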

Proposition 3 (Relationship between frcl and s-fsvi).

With the s-fsvi objective $\mathcal{F}$ defined as in Equation 6, let $\bar{\mathbf{X}}_{t} = \mathbf{X}_{\mathcal{C}}$, and let $f^{\text{LM}}(\cdot\,; \bm{\Theta}) \doteq \Phi_{\psi}(\cdot)\bm{\Theta}$ be a Bayesian linear model, where $\Phi_{\psi}(\cdot)$ is a deterministic feature map parameterized by $\psi$. Then the frcl objective corresponds to the s-fsvi objective for the model $f^{\text{LM}}(\cdot\,; \bm{\Theta})$ plus an additional weight-space KL divergence penalty. That is, for $p_{t}(\bm{\theta}) \doteq \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $q_{t}(\bm{\theta}) \doteq \mathcal{N}(\bm{\mu}_{t}, \bm{\Sigma}_{t})$,

$$
\mathcal{L}^{\text{frcl}}(q_{t}^{\textsc{frcl}}, q_{t-1}^{\textsc{frcl}}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_{t}, \mathbf{y}_{t}) = \mathcal{F}(q_{t}^{\textsc{frcl}}, q_{t-1}^{\textsc{frcl}}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_{t}, \mathbf{y}_{t}) + \mathbb{D}_{\text{KL}}(q_{t}(\bm{\theta}) \,\|\, p_{t}(\bm{\theta})).
\tag{A.5}
$$
Proof.

By Section 2.3 in Titsias et al. (2020), the frcl objective function is given by

$$
\begin{aligned}
\mathcal{L}^{\text{frcl}}(\bm{\mu}_{t}, \bm{\Sigma}_{t}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_{t}, \mathbf{y}_{t}) &\doteq \mathbb{E}_{q_{t}(\bm{\theta})}[\log p(\mathbf{y}_{t} \,|\, \Phi_{\psi}(\mathbf{X}_{t})\bm{\theta})] - \mathbb{D}_{\text{KL}}(q_{t}(\bm{\theta}) \,\|\, p_{t}(\bm{\theta})) \\
&\quad - \mathbb{D}_{\text{KL}}(\tilde{q}_{t}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{t}}; \bm{\theta})) \,\|\, \tilde{p}_{t}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{t}}; \bm{\theta}))) - \sum_{k=1}^{t-1} \mathbb{D}_{\text{KL}}(\perp(\tilde{q}_{k}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{k}}; \bm{\theta}))) \,\|\, \tilde{p}_{k}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{k}}; \bm{\theta}))),
\end{aligned}
\tag{A.6}
$$

with the inducing points associated with task $k$ denoted by $\mathbf{X}_{\mathcal{C}_{k}}$ and $\perp$ denoting the stop-gradient operator, whereas the s-fsvi objective for a Bayesian linear model is

$$
\begin{aligned}
&\mathcal{F}(q_{t}, q_{t-1}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_{t}, \mathbf{y}_{t}) \doteq \mathbb{E}_{q_{t}(\bm{\theta})}[\log p(\mathbf{y}_{t} \,|\, \Phi_{\psi}(\mathbf{X}_{t})\bm{\theta})] \\
&\qquad - \sum_{k=1}^{D_{t}} \frac{1}{2} \bigg( \log \frac{|[\mathbf{K}^{p_{t}}]_{k}|}{|[\mathbf{K}^{q_{t}}]_{k}|} - \frac{|\bar{\mathbf{X}}_{t}|}{D_{t}} + \text{Tr}([\mathbf{K}^{p_{t}}]_{k}^{-1} [\mathbf{K}^{q_{t}}]_{k}) + \Delta(\bar{\mathbf{X}}_{t}; \bm{\mu}_{t}, \bm{\mu}_{t-1})^{\top} [\mathbf{K}^{p_{t}}]_{k}^{-1} \Delta(\bar{\mathbf{X}}_{t}; \bm{\mu}_{t}, \bm{\mu}_{t-1}) \bigg),
\end{aligned}
\tag{A.7}
$$

with

$$
\Delta(\bar{\mathbf{X}}_{t}; \bm{\mu}_{t}, \bm{\mu}_{t-1}) \doteq [\Phi_{\psi}(\bar{\mathbf{X}}_{t})\bm{\mu}_{t}]_{k} - [\Phi_{\psi}(\bar{\mathbf{X}}_{t})\bm{\mu}_{t-1}]_{k}
\tag{A.8}
$$

and

$$
\mathbf{K}^{p_{t}} \doteq \Phi_{\psi}(\bar{\mathbf{X}}_{t}) \bm{\Sigma}_{t-1} \Phi_{\psi}(\bar{\mathbf{X}}_{t})^{\top} \qquad \mathbf{K}^{q_{t}} \doteq \Phi_{\psi}(\bar{\mathbf{X}}_{t}) \bm{\Sigma}_{t} \Phi_{\psi}(\bar{\mathbf{X}}_{t})^{\top}.
\tag{A.9}
$$
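The quantities in Equations A.7 to A.9 can be sketched numerically for one output dimension. The code below is a minimal NumPy illustration, not the authors' implementation: the feature matrix `phi` stands in for $\Phi_{\psi}(\bar{\mathbf{X}}_{t})$, all sizes and parameter values are hypothetical, and a small `jitter` is added to keep the induced covariances invertible. It uses the standard closed-form KL divergence between two multivariate Gaussians, with the dimension term written as the number of context points `n`; how this maps onto the $|\bar{\mathbf{X}}_{t}|/D_{t}$ term in Equation A.7 depends on notation from the main text and is left as an assumption here.

```python
import numpy as np


def induced_covariance(features, weight_cov, jitter=1e-6):
    """K = Phi Sigma Phi^T at the context points, plus a small jitter."""
    n = features.shape[0]
    return features @ weight_cov @ features.T + jitter * np.eye(n)


def gaussian_kl(mean_q, cov_q, mean_p, cov_p):
    """KL(N(mean_q, cov_q) || N(mean_p, cov_p)) for n-dimensional Gaussians."""
    n = mean_q.shape[0]
    delta = mean_q - mean_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    trace_term = np.trace(np.linalg.solve(cov_p, cov_q))
    quad_term = delta @ np.linalg.solve(cov_p, delta)
    return 0.5 * (logdet_p - logdet_q - n + trace_term + quad_term)


rng = np.random.default_rng(0)
n_context, n_features = 4, 6          # hypothetical; n_context <= n_features
phi = rng.normal(size=(n_context, n_features))

mu_prev = np.zeros(n_features)        # mu_{t-1}
sigma_prev = np.eye(n_features)       # Sigma_{t-1}
mu_curr = 0.1 * rng.normal(size=n_features)   # mu_t
sigma_curr = 0.5 * np.eye(n_features)         # Sigma_t

k_p = induced_covariance(phi, sigma_prev)     # K^{p_t}
k_q = induced_covariance(phi, sigma_curr)     # K^{q_t}
kl = gaussian_kl(phi @ mu_curr, k_q, phi @ mu_prev, k_p)
kl_same = gaussian_kl(phi @ mu_prev, k_p, phi @ mu_prev, k_p)
```

The divergence vanishes when the variational and prior parameters coincide and grows as either the mean predictions or the induced covariances move apart.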

Letting $\mathbf{X}_{\mathcal{C}_{k}}$ be the context points associated with task $k$ and letting $\bar{\mathbf{K}}$ denote a covariance matrix under a block-diagonalization without inter-task dependence, we define

$$
\bar{\mathbf{K}}^{p_{t}} \doteq \text{block-diag}\left(\Phi_{\psi}(\bar{\mathbf{X}}_{t}) \bm{\Sigma}_{t-1} \Phi_{\psi}(\bar{\mathbf{X}}_{t})^{\top}\right) \qquad \bar{\mathbf{K}}^{q_{t}} \doteq \text{block-diag}\left(\Phi_{\psi}(\bar{\mathbf{X}}_{t}) \bm{\Sigma}_{t} \Phi_{\psi}(\bar{\mathbf{X}}_{t})^{\top}\right),
\tag{A.10}
$$

with diagonal entries $\{\mathbf{K}^{p_{t}}_{1}, \ldots, \mathbf{K}^{p_{t}}_{t}\}$ and $\{\mathbf{K}^{q_{t}}_{1}, \ldots, \mathbf{K}^{q_{t}}_{t}\}$, respectively, where each $\mathbf{K}^{p_{t}}_{k}$ is computed from task-specific context points $\mathbf{X}_{\mathcal{C}_{k}}$.
Fixing $[\bm{\mu}_{t-1}]_{k} = \mathbf{0}$ and $[\bm{\Sigma}_{t-1}]_{k} = \mathbf{I}_{M_{k}}$ for all $k \leq t$ with $M_{k} = |\mathbf{X}_{\mathcal{C}_{k}}|$, as in Titsias et al. (2020), we then get

$$
\bar{\mathbf{K}}^{p_{t}}_{k} = \Phi_{\psi}(\bar{\mathbf{X}}_{\mathcal{C}_{k}}) \Phi_{\psi}(\bar{\mathbf{X}}_{\mathcal{C}_{k}})^{\top} \quad \forall k \leq t.
\tag{A.11}
$$

Considering $[\bm{\mu}_{t}]_{k}$ and $[\bm{\Sigma}_{t}]_{k}$ as fixed for all $k \leq t-1$, as in Titsias et al. (2020), using the stop-gradient operator $\perp$, we can write the s-fsvi objective as

$$
\begin{aligned}
&\mathcal{F}(q_{t}, q_{t-1}, \mathbf{X}_{\mathcal{C}}, \mathbf{X}_{t}, \mathbf{y}_{t}) \doteq \mathbb{E}_{q_{t}(\bm{\theta})}[\log p(\mathbf{y}_{t} \,|\, \Phi_{\psi}(\mathbf{X}_{t})\bm{\theta})] \\
&\qquad - \mathbb{D}_{\text{KL}}(\tilde{q}_{t}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{t}}; \bm{\theta})) \,\|\, \tilde{p}_{t}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{t}}; \bm{\theta}))) - \sum_{k=1}^{t-1} \mathbb{D}_{\text{KL}}(\perp(\tilde{q}_{k}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{k}}; \bm{\theta}))) \,\|\, \tilde{p}_{k}(\tilde{f}(\mathbf{X}_{\mathcal{C}_{k}}; \bm{\theta}))),
\end{aligned}
\tag{A.12}
$$

concluding the proof. ∎
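The step from the block-diagonal covariances in Equation A.10 to the sum of per-task KL terms in Equation A.12 relies on a standard property: when both Gaussians have block-diagonal covariances with the same block structure, their KL divergence is the sum of the KL divergences of the blocks. The following NumPy sketch checks this numerically. The feature matrices, sizes, and the helper names `gaussian_kl` and `block_diag` are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_kl(mean_q, cov_q, mean_p, cov_p):
    """KL(N(mean_q, cov_q) || N(mean_p, cov_p)) for n-dimensional Gaussians."""
    n = mean_q.shape[0]
    delta = mean_q - mean_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    trace_term = np.trace(np.linalg.solve(cov_p, cov_q))
    quad_term = delta @ np.linalg.solve(cov_p, delta)
    return 0.5 * (logdet_p - logdet_q - n + trace_term + quad_term)


def block_diag(blocks):
    """Place square matrices along the diagonal of a zero matrix."""
    size = sum(b.shape[0] for b in blocks)
    out = np.zeros((size, size))
    start = 0
    for b in blocks:
        stop = start + b.shape[0]
        out[start:stop, start:stop] = b
        start = stop
    return out


rng = np.random.default_rng(1)
n_features = 8
context_sizes = [3, 2]                 # hypothetical sizes of X_{C_1}, X_{C_2}
mu = 0.1 * rng.normal(size=n_features)
sigma = 0.5 * np.eye(n_features)
jitter = 1e-6

means_q, covs_q, covs_p = [], [], []
for m_k in context_sizes:
    phi_k = rng.normal(size=(m_k, n_features))
    means_q.append(phi_k @ mu)
    covs_q.append(phi_k @ sigma @ phi_k.T + jitter * np.eye(m_k))
    # An identity weight covariance gives the prior block Phi Phi^T (Equation A.11).
    covs_p.append(phi_k @ phi_k.T + jitter * np.eye(m_k))

kl_per_task = [
    gaussian_kl(m, cq, np.zeros_like(m), cp)
    for m, cq, cp in zip(means_q, covs_q, covs_p)
]
kl_joint = gaussian_kl(
    np.concatenate(means_q), block_diag(covs_q),
    np.zeros(sum(context_sizes)), block_diag(covs_p),
)
```

Up to floating-point error, `kl_joint` equals `sum(kl_per_task)`, which is why dropping inter-task dependence lets the function-space regularizer be written task by task.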

Appendix B Further Empirical Results

Figure 8: Effect of Empirical Prior Covariance.  Comparison of predictive performance under the induced prior covariance function $\mathbf{K}^{p_{t}} = \text{diag}\left(\mathcal{J}_{\bm{\mu}_{t-1}}(\mathbf{x}) \bm{\Sigma}_{t-1} \mathcal{J}_{\bm{\mu}_{t-1}}(\mathbf{x}')^{\top}\right)$ (left) vs. an identity covariance function (right). Panels: (a) s-mnist (MH); (b) s-fmnist (MH); (c) p-mnist (SH); (d) s-mnist (SH).
Figure 9: Effect of Neural-Network Size, First-Task Prior Covariance, and the Number of Training Epochs.  We explore settings of neural-network size (e.g., $2 \times 100$ means a fully connected neural network with two hidden layers of size 100), initial prior covariance and number of training epochs for each task. To limit the computational resources required, we vary the values of one hyperparameter at a time instead of carrying out a full grid search. Panels: (a) s-mnist (MH); (b) s-fmnist (MH); (c) p-mnist (SH); (d) s-mnist (SH).
Figure 10: Effect of Neural-Network Size under Minimal Coresets.  Predictive accuracy under s-fsvi on permuted mnist (SH) and split mnist (SH) as a function of network width, using only a minimal coreset of one sample per class, selected randomly. Panels: (a) Permuted MNIST (SH); (b) Split MNIST (SH).
Figure 11: Hyperparameter Search on Split CIFAR.  We explore different settings of the initial first-task prior covariance and the number of epochs for the first task. To limit the computational resources required, we vary the values of one hyperparameter at a time instead of carrying out a full grid search. Panels: (a) and (b).
Figure 12: Comparison of Different Coreset-Selection Methods on Split CIFAR.  For score-based coreset-selection methods, we first score each coreset point (using Equation 11 for elbo scoring, the predictive entropy for entropy scoring, and the KL divergence in Equation 11 for KL scoring), then sample context points from the coreset according to the probability mass function defined in Equation C.13.
Figure 13: Hyperparameter Search on Sequential Omniglot.  We compare two settings. In the first, we always sample one context point for each previous task from the context set at each gradient step. In the second, we sample a larger number of context points (with a budget of 60 samples per gradient step) from the context set when learning on the first 25 tasks.

Appendix C Experimental Details

Our empirical evaluation centers around six sequences of classification tasks: a synthetic sequence of binary-classification tasks with 2D inputs; split mnist; split Fashion mnist; permuted mnist; split cifar; and sequential Omniglot. With the exception of permuted mnist, each of these task sequences can be tackled by a neural network with either a multi-head setup (MH) or a single-head setup (SH). In a multi-head setup, the neural network has a separate output layer (or head) for each task, and task identifiers are provided at test time in order to select the appropriate head. In a single-head setup, the neural network has just one output layer shared across all tasks, and task identifiers are not provided. In our experiments, we use multi-head setups for split Fashion mnist, split cifar and sequential Omniglot, and single-head setups for the synthetic task sequence along with permuted mnist. For split mnist, we run both setups.
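The difference between the two setups can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the architecture used in the experiments: the single shared ReLU layer, all sizes, and the function names are hypothetical, and giving the single head one output per class across all tasks is an assumption that fits split-style task sequences.

```python
import numpy as np


def hidden_features(x, w_hidden):
    """Shared body: one ReLU layer standing in for the shared layers."""
    return np.maximum(x @ w_hidden, 0.0)


def multi_head_logits(x, w_hidden, heads, task_id):
    """Multi-head (MH): the task identifier selects the output layer."""
    return hidden_features(x, w_hidden) @ heads[task_id]


def single_head_logits(x, w_hidden, head):
    """Single-head (SH): one output layer shared across all tasks."""
    return hidden_features(x, w_hidden) @ head


rng = np.random.default_rng(0)
n_inputs, n_hidden, n_tasks, classes_per_task = 784, 16, 5, 2   # toy sizes
x = rng.uniform(size=(3, n_inputs))                             # three inputs
w_hidden = rng.normal(size=(n_inputs, n_hidden))

heads = [rng.normal(size=(n_hidden, classes_per_task)) for _ in range(n_tasks)]
shared_head = rng.normal(size=(n_hidden, n_tasks * classes_per_task))

mh = multi_head_logits(x, w_hidden, heads, task_id=2)   # requires the task id
sh = single_head_logits(x, w_hidden, shared_head)       # no task id needed
```

In the multi-head call the prediction is restricted to the classes of the selected task, whereas the single-head network must discriminate among all classes seen so far, which is why single-head evaluation is generally the harder setting.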

C.1 Illustrative Example

The task sequence shown in Figure 2 was created by Pan et al. (2020). Each of the five tasks in this sequence involves binary classification on 2D inputs, where the number of training examples per task is 3,600. Following Pan et al. (2020), we use a fully connected neural network with an input layer of size 2, two hidden layers of size 20 and an output layer of size 2. When running s-fsvi, we set the prior covariance as $\bm{\Sigma}_{0} = 0.1$ and train the neural network for 250 epochs on each task. We use the Adam optimizer with an initial learning rate of $0.0005$ ($\beta_{1} = 0.9$, $\beta_{2} = 0.999$) and a batch size of 128. The coreset is constructed by choosing 40 samples from the training data for each task. To evaluate the KL divergence between the posterior and the prior distributions over functions, for each previous task we sample 20 input points from the context set and generate another 30 samples by sampling each input dimension uniformly from the range $[-4, 4]$. For example, when we train the model on task $t \in \{1, 2, 3, \ldots\}$, we use $20(t-1)$ samples chosen from the context set and $30t$ white-noise samples. The noise samples encourage the neural network to preserve high predictive uncertainty in regions far from the training data.
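The context-point counts described above can be made concrete with a short NumPy sketch. It assumes that coreset points are drawn without replacement and that each task's coreset is stored as an array of 2D inputs; the function name `build_context_points` and the random stand-in coresets are hypothetical.

```python
import numpy as np


def build_context_points(coresets, task_index, rng,
                         n_coreset_per_task=20, n_noise_per_task=30,
                         low=-4.0, high=4.0):
    """Context points for training on task `task_index` (1-based).

    Draws 20 points from the coreset of each previous task and
    30 * task_index noise points sampled uniformly from [low, high].
    """
    previous = coresets[: task_index - 1]
    coreset_part = [
        c[rng.choice(len(c), size=n_coreset_per_task, replace=False)]
        for c in previous
    ]
    n_noise = n_noise_per_task * task_index
    input_dim = coresets[0].shape[1]
    noise_part = rng.uniform(low, high, size=(n_noise, input_dim))
    return np.concatenate(coreset_part + [noise_part], axis=0)


rng = np.random.default_rng(0)
# Hypothetical coresets: 40 stored 2D inputs for each of the five tasks.
coresets = [rng.normal(size=(40, 2)) for _ in range(5)]

context_task_1 = build_context_points(coresets, 1, rng)   # noise only
context_task_3 = build_context_points(coresets, 3, rng)   # 40 coreset + 90 noise
```

On the first task there are no previous tasks, so the context set consists of noise points only; on task 3 it contains $20 \cdot 2 = 40$ coreset points and $30 \cdot 3 = 90$ noise points.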

C.2 Task Sequences Based on (Fashion) MNIST

Split mnist consists of five tasks, where each task is binary classification on a pair of mnist classes. Split Fashion mnist has the same form but uses data from Fashion mnist. Permuted mnist comprises ten tasks, where each task involves classifying images into the ten mnist classes after the image pixels have been randomly reordered. Unless specified otherwise, the following setups apply to Figures 3, 7 and 8 and Table 1.

Dataset.  In all cases, 60,000 data samples are used for training and 10,000 data samples are used for testing. The input images are converted to floating-point numbers with values in the range $[0, 1]$.

Neural-Network Size & Coreset Size.  To ensure fair comparison, all methods in Table 1 (unless explicitly indicated otherwise) use the same neural-network size and (where applicable) coreset size. As in prior work (Pan et al., 2020; Titsias et al., 2020), we use fully connected neural networks, with two hidden layers of size 100 for permuted mnist and two hidden layers of size 256 for split (Fashion) mnist. In all cases, the ReLU activation function is applied to non-output units. For single-head setups, we use 200 coreset points; for multi-head setups, we use 40 points.

Coreset Selection.  For s-fsvi with a coreset, when training on the first task, 40 context points are generated by sampling each pixel uniformly from the range $[0, 1]$; during training on subsequent tasks, 40 context points are chosen randomly from the context set. For s-fsvi without a coreset, 40 context points are chosen uniformly at random from the training data of the current task (corresponding to the “Random” label in Figure 3).
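The three cases in this paragraph can be written as a small selection routine. The sketch below is an illustration only: the function name, the `use_coreset` flag, sampling without replacement, and the random stand-in arrays are assumptions, and the input dimension of 784 corresponds to flattened $28 \times 28$ images.

```python
import numpy as np


def select_context_points(rng, task_index, use_coreset, coreset=None,
                          current_inputs=None, n_points=40, input_dim=784):
    """Return one draw of context points (task_index is 1-based)."""
    if not use_coreset:
        # Without a coreset: draw from the current task's training inputs.
        idx = rng.choice(len(current_inputs), size=n_points, replace=False)
        return current_inputs[idx]
    if task_index == 1:
        # With a coreset, first task: each pixel sampled uniformly from [0, 1].
        return rng.uniform(0.0, 1.0, size=(n_points, input_dim))
    # With a coreset, later tasks: random subset of the stored points.
    idx = rng.choice(len(coreset), size=n_points, replace=False)
    return coreset[idx]


rng = np.random.default_rng(0)
current_inputs = rng.uniform(size=(500, 784))   # stand-in for training images
coreset = rng.uniform(size=(200, 784))          # stand-in for stored points

first = select_context_points(rng, 1, use_coreset=True)
later = select_context_points(rng, 2, use_coreset=True, coreset=coreset)
random_variant = select_context_points(
    rng, 2, use_coreset=False, current_inputs=current_inputs)
```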

Prior Distribution.  For the first task, s-fsvi uses a prior distribution over functions with fixed mean and diagonal covariance. When using a coreset, the prior distribution is assumed to be Gaussian with zero mean and a diagonal covariance of magnitude 0.001. When not using a coreset, the prior distribution is assumed to be Gaussian with zero mean and a diagonal covariance of magnitude 100. The prior variance is optimized via hyperparameter selection on a validation set.

Optimization.  We use the Adam optimizer with an initial learning rate of $0.0005$ ($\beta_{1}=0.9$, $\beta_{2}=0.999$). The number of epochs on each task is 60 for split mnist (MH), 60 for split Fashion mnist (MH), 10 for permuted mnist (SH) and 80 for split mnist (SH). The batch size is 128.

Prediction.  The predictive distribution used for computing the expected log-likelihood is estimated using five Monte Carlo samples.

Hyperparameter Selection.  For “s-fsvi (optimized)” in Table 1, we used the optimized hyperparameters chosen on a validation set after exploring the configurations shown in Table 4. For cases where no configuration is significantly better than the rest, the default value given in Section C.2 is used.

Table 4: Hyperparameter selection. Optimal values (in bold) were chosen based on validation-set accuracy. Standard errors were computed across ten random seeds.
| Task Sequences | Number of Layers & Units | Magnitude of Prior Variance | Number of Epochs |
| --- | --- | --- | --- |
| Split mnist (MH) | {1, 2} × {100, 200, 300, 400} | {0.001, 0.01, 0.1, 1, 10, 100} | {40, 60, 80, 120, 160} |
| Split Fashion mnist (MH) | {4} × {50, 200, 300, 400} | {0.001, 0.01, 0.1, 1, 10, 100} | {40, 60, 80, 120, 160} |
| Permuted mnist (SH) | {2} × {100, 200, 400, 500} | {0.001, 0.01, 0.1, 1, 10, 100} | {10, 20, 40, 60, 80} |
| Split mnist (SH) | {1, 2} × {100, 200, 300, 400} | {0.001, 0.01, 0.1, 1, 10, 100} | {60, 80, 120, 160, 240} |

C.3 Split CIFAR

Split cifar, as described in Pan et al. (2020), consists of six tasks. The first is ten-way classification on the full cifar-10 dataset. Each of the following five is also ten-way classification, with classes drawn from cifar-100. Following Pan et al. (2020), we use a neural network with four convolutional layers followed by two fully connected layers followed by multiple output heads (one for each task). For s-fsvi, we use the following setup: Adam optimizer with learning rate 0.0005, prior with covariance 0.01, random coreset selection, 200 coreset points per task, 50 context points at each task. We also use this setup (and a training duration of 2000 epochs) when training individual neural networks for the “separate” baseline.

C.4 Sequential Omniglot

Sequential Omniglot, as described in Schwarz et al. (2018), comprises 50 classification tasks. Each task is associated with an alphabet, and the number of characters (classes) varies between alphabets. Following Schwarz et al. (2018), we use a neural network with four convolutional layers followed by one fully connected layer. For s-fsvi, we use two coreset points per character, as used by Titsias et al. (2020). The coreset points are sampled from the training set with probability proportional to the entropy of the neural network’s posterior predictive distribution. To limit memory usage, we draw no more than 25 context points from the context set at each gradient step after task 25. We use a learning rate of 0.001 and a prior covariance of 1.0. For the first task, the neural network trains for 200 epochs; for subsequent tasks, it trains for ten epochs per task. We use the same data augmentation and train-test split as Titsias et al. (2020).

C.5 Coreset-Selection Methods

We consider different distributions from which to sample points to be added to the coreset. For each of the scoring methods below, we use the scores to create a probability mass function from which points can be sampled.

Random.  Points are sampled uniformly from the training data.

Predictive-Entropy Scoring.  Points are scored according to the total predictive uncertainty (i.e., the predictive entropy) of the model. For a model with stochastic parameters $\bm{\Theta}$, pre-likelihood outputs $f(\mathbf{X};\bm{\theta})$, and a likelihood function $p(\mathbf{y}\,|\,f(\mathbf{X};\bm{\theta}))$, the predictive entropy is given by $\mathcal{H}(\operatorname{\mathbb{E}}[p(\mathbf{y}\,|\,f(\mathbf{X};\bm{\theta}))])$ (Cover and Thomas, 1991; Shannon and Weaver, 1949). The expectation is taken with respect to the model parameters. $\mathcal{H}(\cdot)$ is the entropy functional, and $\mathcal{I}(\mathbf{y}_{\ast};\,\bm{\Theta})$ is the mutual information between the model parameters and its predictions.
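As an illustration of this score, the sketch below estimates $\mathcal{H}(\operatorname{\mathbb{E}}[p(\mathbf{y}\,|\,f(\mathbf{x};\bm{\theta}))])$ for a single input by averaging class probabilities over a finite set of parameter samples. It assumes a categorical likelihood obtained by applying a softmax to the pre-likelihood outputs; the function names and the list-of-logit-vectors input format are hypothetical and are not taken from the experimental code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(logit_samples):
    """Entropy (in nats) of the Monte Carlo average of class probabilities.

    logit_samples: list of S logit vectors f(x; theta_s), one per
                   parameter sample theta_s (assumed input format).
    """
    probs = [softmax(z) for z in logit_samples]
    n_classes = len(probs[0])
    # Monte Carlo estimate of E[p(y | f(x; theta))].
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    # Entropy of the averaged distribution; zero-probability classes contribute 0.
    return -sum(p * math.log(p) for p in mean if p > 0.0)
```

Two parameter samples that confidently predict different classes yield a high score, since the entropy is taken of the averaged prediction rather than of each sample.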

Evidence-Lower-Bound Scoring.  Points are scored according to the value of the evidence lower bound (ELBO) given in Equation 11.

Kullback-Leibler-Divergence Scoring.  Points are scored according to the value of the approximation to the function-space KL divergence given in Equation 11.

Score-Based Distributions.  After scoring with the above methods, points are added to the coreset by sampling from one of the following probability mass functions:

$$\textrm{Lowest:}\quad\mathbb{P}(i)\doteq\frac{\bar{s}_{i}}{\sum_{j=1}^{N}\bar{s}_{j}}\qquad\textrm{and}\qquad\textrm{Highest:}\quad\mathbb{P}(i)\doteq\frac{s_{i}}{\sum_{j=1}^{N}s_{j}},\tag{C.13}$$

where $s_{i}$ is the score of the $i$-th point, $\bar{s}_{i}=\max_{j=1}^{N}s_{j}-s_{i}$, and $N$ is the number of candidate points.
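The two probability mass functions can be sketched in a few lines. The function names are hypothetical, and the sketch assumes non-negative scores that are not all equal, so that both normalizing sums are positive. Sampling with replacement via `random.choices` is also an assumption made for brevity; the text does not specify whether points are drawn with or without replacement.

```python
import random


def highest_pmf(scores):
    """P(i) proportional to s_i: favours high-scoring points."""
    total = sum(scores)
    return [s / total for s in scores]


def lowest_pmf(scores):
    """P(i) proportional to max_j s_j - s_i: favours low-scoring points."""
    top = max(scores)
    flipped = [top - s for s in scores]
    total = sum(flipped)
    return [f / total for f in flipped]


def sample_coreset_indices(pmf, k, seed=0):
    """Draw k candidate indices according to pmf (with replacement)."""
    rng = random.Random(seed)
    return rng.choices(range(len(pmf)), weights=pmf, k=k)
```

Under the “Lowest” distribution, the highest-scoring candidate receives probability zero.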

C.6 Forward and Backward Transfer

In Table 3, we report forward and backward transfer metrics as defined in Pan et al. (2020). Backward transfer (BT) indicates the performance gain on past tasks when new tasks are learnt, while forward transfer (FT) quantifies how much knowledge from past tasks helps the learning of new tasks. Higher is better for both. For $T$ tasks, let $R_{j,i}$ be the accuracy of the model on task $t_{i}$ after training on task $t_{j}$ (so $R_{i,i}$ is the accuracy on task $t_{i}$ immediately after training on it), and let $R_{i}^{\textrm{ind}}$ be the accuracy of an independent model trained only on task $t_{i}$. Then

$$\textrm{BT}\doteq\frac{1}{T-1}\sum_{i=1}^{T-1}\left(R_{T,i}-R_{i,i}\right)\qquad\textrm{and}\qquad\textrm{FT}\doteq\frac{1}{T-1}\sum_{i=2}^{T}\left(R_{i,i}-R_{i}^{\textrm{ind}}\right).$$
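A minimal sketch of how the two metrics could be computed from recorded accuracies is given below. It assumes a $T\times T$ nested list `R` in which `R[j][i]` holds the accuracy on task $i$ after training on task $j$ (0-indexed), and a list `R_ind` of independent-model accuracies; these names and the storage layout are illustrative choices.

```python
def backward_transfer(R):
    """Mean change in accuracy on tasks 1..T-1 between the time each was
    learnt and the end of training on the final task."""
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


def forward_transfer(R, R_ind):
    """Mean gain on tasks 2..T over an independent model trained on that
    task alone."""
    T = len(R)
    return sum(R[i][i] - R_ind[i] for i in range(1, T)) / (T - 1)
```

For example, with three tasks, diagonal accuracies of 0.90, 0.95 and 0.99, and final accuracies of 0.70 and 0.90 on the first two tasks, BT is $((0.70-0.90)+(0.90-0.95))/2=-0.125$, indicating forgetting.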

Appendix D Further Related Work

Objective-based approaches to continual learning involve training a neural network using a specially designed objective function. Typically the objective includes a regularization term that penalizes changes in the neural network’s configuration. Whereas in Section 4 we summarize methods that regularize in function space, here we cover methods that regularize directly in terms of the parameters of a neural network. Among these, most relevant to our work are those that approximate Bayesian updating, in which the posterior from the previous task forms the prior for the current task.

A key idea is shared between many methods for parameter-space regularization: for each parameter, apply a penalty on the difference between its current setting and its prior setting, weighted by a measure of the parameter’s importance. Methods vary in how they measure importance. Variational continual learning (vcl; Nguyen et al., 2018; Swaroop et al., 2019), which extends the concept of online variational inference (Broderick et al., 2013; Ghahramani and Attias, 2000; Honkela and Valpola, 2003; Sato, 2001) to deep neural networks, uses the parameter covariance matrix of the model currently serving as the prior. Elastic weight consolidation (ewc; Kirkpatrick et al., 2017) and its successors (Chaudhry et al., 2018; Lee et al., 2017; Liu et al., 2018; Schwarz et al., 2018) use a Fisher information matrix computed on each task. Online structured Laplace (Ritter et al., 2018) and second-order loss approximation (Yin et al., 2020a) respectively use Kronecker-factored and low-rank Hessians. Synaptic intelligence (si; Zenke et al., 2017) uses a cumulative sum of the gradient of the training objective with respect to the parameters. Memory-aware synapses (mas; Aljundi et al., 2018) use the gradient of the model output with respect to the parameters.
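The shared structure of these penalties can be illustrated with a short sketch. It assumes a squared difference and a per-parameter (diagonal) importance weight, as in a diagonal-Fisher, ewc-style penalty; this is one common instance and not the exact form used by every method listed above. The function name and the `strength` coefficient are hypothetical.

```python
def parameter_penalty(params, prior_params, importance, strength=1.0):
    """Importance-weighted quadratic penalty on parameter change.

    params:       current parameter values (flat list of floats).
    prior_params: parameter values after the previous task.
    importance:   one non-negative weight per parameter, e.g. a diagonal
                  Fisher estimate (assumed form).
    """
    return 0.5 * strength * sum(
        w * (p - p0) ** 2
        for p, p0, w in zip(params, prior_params, importance)
    )
```

A parameter that has not moved contributes nothing, however large its importance weight. The penalty is added to the training loss on the current task.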

Other related work on parameter-space regularization includes various modifications to vcl (Ahn et al., 2019; Kessler et al., 2019), uncertainty-guided continual learning in Bayesian neural networks (Ebrahimi et al., 2020), and a variation of si known as asymmetric loss approximation with single-side overestimation (Park et al., 2019). There have also been efforts to conceptually unify some of the approaches outlined above: Loo et al. (2020) draw a link between vcl and online ewc; Chaudhry et al. (2018) combine ewc and si in a single method; Yin et al. (2020b) generalize ewc, online structured Laplace, si and mas.