Toward Understanding Catastrophic Forgetting in Continual Learning

Cuong V. Nguyen†, Alessandro Achille†, Michael Lam†, Tal Hassner‡∗, Vijay Mahadevan†, Stefano Soatto†
† Amazon Web Services
{nguycuo,aachille,michlam,vmahad,soattos}@amazon.com
‡ Facebook Inc.
{talhassner}@gmail.com
arXiv:1908.01091v1 [cs.LG] 2 Aug 2019
∗ Work done at Amazon.

Abstract

We study the relationship between catastrophic forgetting and properties of task sequences. In particular, given a sequence of tasks, we would like to understand which properties of this sequence influence the error rates of continual learning algorithms trained on the sequence. To this end, we propose a new procedure that makes use of recent developments in task space modeling as well as correlation analysis to specify and analyze the properties we are interested in. As an application, we apply our procedure to study two properties of a task sequence: (1) total complexity and (2) sequential heterogeneity. We show that error rates are strongly and positively correlated to a task sequence’s total complexity for some state-of-the-art algorithms. We also show that, surprisingly, the error rates have no or even negative correlations in some cases to sequential heterogeneity. Our findings suggest directions for improving continual learning benchmarks and methods.

1. Introduction

Continual learning (or life-long learning) [32, 37, 44] is the ability of a machine learning model to continuously learn from a stream of data, which could possibly be non-iid or come from different but related tasks. A continual learning system is required to adapt its current model to the new tasks or datasets without revisiting the previous data. Such a system should be able to positively transfer its current knowledge (summarized in its model) to the new tasks using as little data as possible, to avoid catastrophically forgetting the old tasks, and to transfer back its knowledge from new tasks to old tasks in order to improve overall performance.

In recent years, interest in continual learning has risen [1, 21, 26, 27, 29, 39, 40, 42, 51], especially from the deep learning research community, due to its potential to reduce training time and training set sizes (e.g., by continuously adapting from previous models), both of which are critical to the training of modern deep neural networks. Solving continual learning is also an essential step toward artificial general intelligence as it allows machines to continuously adapt to changes in the environment with minimal human intervention, a process analogous to human learning.

However, continual learning by deep models has proven to be very challenging due to catastrophic forgetting, a long-known problem of training deep neural networks [4, 5, 14, 15, 28, 31, 34]. Catastrophic forgetting refers to the tendency of a model to forget all its previously learned tasks if not trained properly on a new task, e.g., when fine-tuning on the new task for a long time without proper regularization to the previous model parameters.

Recent work has attempted to tackle this problem through better training algorithms [21, 25, 33, 51], structure sharing [36, 41, 49], episodic memory [9, 27, 29], machine-generated pseudo-data [18, 26, 42], or a combination of these approaches [29, 39]. Benchmarks to compare these methods typically constructed a sequence of tasks and then measured the algorithms’ performance when transferring from one task to another. Two popular examples of these benchmarks are the permuted MNIST [15] and split MNIST [51].

In this paper, we seek to understand catastrophic forgetting at a more fundamental level. Specifically, we investigate the following question:

    Given a sequence of tasks, which properties of the tasks influence the hardness of the entire sequence?

We measure task sequence hardness by the final error rate of a model trained sequentially on the tasks in the sequence.

An answer to this question is useful for continual learning research in several ways. First, it helps us estimate the hardness of a benchmark based on its individual tasks, thereby potentially assisting the development of new and better benchmarks for continual learning. Additionally, knowing the hardness of a task sequence allows us to estimate a priori the cost and limits of running continual learning algorithms on it. Crucially, by gaining a better understanding of catastrophic forgetting at a more fundamental level, we gain more insights to develop better methods to mitigate it.

This work is the first attempt to answer the above question. We propose a new and general procedure that can be applied to study the relationship between catastrophic forgetting and properties of task sequences. Our procedure makes use of recent developments in task space modeling methods, such as the Task2Vec framework [2], to specify the properties of interest. Then, we apply correlation analysis to study the relationship between the specified properties and the actual measures of catastrophic forgetting.

As an application, we use our procedure to analyze two properties of a task sequence, total complexity and sequential heterogeneity, and design experiments to study their correlations with the sequence’s actual hardness. We refer to total complexity as the total hardness of individual tasks in the sequence, while sequential heterogeneity measures the total dissimilarity between pairs of consecutive tasks.

We show how these two properties are estimated using the Task2Vec framework [2], which maps datasets (or equivalently, tasks) to vectors in a vector space. We choose these two properties for our analysis because of their intuitive relationships to the hardness of task sequences: since continual learning algorithms attempt to transfer knowledge from one task to another, both the hardness of each individual task and the dissimilarity between them should play a role in determining the effectiveness of the transfer.

The findings from our analysis are summarized below.

• Total complexity has a strong correlation with the task sequence hardness measured by the actual error rate.

• Sequential heterogeneity has little or no correlation with the task sequence hardness. When factoring out the task complexity, we even find negative correlations in some cases.

The first finding, although expected, emphasizes that we should take into account the complexity of each task when designing new algorithms or benchmarks, which is currently lacking in continual learning research. Besides, the research community is currently somewhat divided on the issue of whether task similarity helps or hurts continual learning performance. Some authors showed that task similarity helps improve performance in the context of transfer learning [2, 3, 35], while some others conjectured that task dissimilarity could help improve continual learning performance [13]. Our second finding gives evidence that supports the latter view.

Deeper analysis into these phenomena suggests that (a) the task sequence hardness also depends on the ability to perform backward transfer (i.e., learning a new task helps a previous task) and (b) continual learning algorithms should be customized for specific task pairs to improve their effectiveness. We give detailed analysis and discussions in Sec. 7.

2. Continual learning algorithms and existing benchmarks

We overview modern continual learning algorithms and existing benchmarks used to evaluate them. For more comprehensive reviews of continual learning, we refer to Chen et al. [11] and Parisi et al. [30].

2.1. Continual learning algorithms

The simplest and most common approaches to continual learning use weight regularization to prevent catastrophic forgetting. Weight regularization adds a regularizer to the likelihood during training to pull the new weights toward the previous weights. It has been improved and applied to continual learning of deep networks in the elastic weight consolidation (EWC) algorithm [21], where the regularizer is scaled by the diagonal of the Fisher information matrix computed from the previous task. Since the diagonal Fisher information approximates the average Hessian of the likelihoods, EWC is closely related to Laplace propagation [17, 43], where Laplace’s approximation is applied after each task to compute the regularizers. Besides Fisher information, the path integral of the gradient vector field along the parameter optimization trajectory can also be used for the regularizer, as in the synaptic intelligence (SI) approach [51].

Another form of regularization naturally arises by using Bayesian methods. For instance, variational continual learning (VCL) [29, 45] applied a sequence of variational approximations to the true posterior and used the current approximate posterior as prior for the new task. The Kullback-Leibler term in the variational lower bound of VCL naturally regularizes the approximate posterior toward the prior. Improved training procedures have also been developed for this type of approximate Bayesian continual learning through the use of natural gradients [10, 47], fixed-point updates [52], and local approximation [6]. More expressive classes of variational distributions were also considered, including channel factorized Gaussian [22], multiplicative normalizing flow [22], or structured Laplace approximations [33].

The above methods can be complemented by an episodic memory, sometimes called a coreset, which stores a subset of previous data. Several algorithms have been developed for utilizing coresets, including gradient episodic memory (GEM) [27], averaged GEM [9], coreset VCL [29], and Stein gradients coreset [10].

Other algorithmic ideas to prevent catastrophic forgetting include moment matching [25], learning without forgetting [26], and deep generative replay [13, 19, 42]. Structure
sharing [36, 39] is another promising direction that can be combined with the above algorithmic solutions to improve continual learning.

2.2. Existing benchmarks

The most common benchmarks for continual learning use MNIST [24] as the base dataset and construct various task sequences for continual learning. For example, permuted MNIST [15] applies a fixed random permutation on the pixels of MNIST input images for each task, creating a sequence of tasks that keep the original labels but have different input structures. Split MNIST [51], on the other hand, considers five consecutive binary classification tasks based on MNIST: 0/1, 2/3, . . . , 8/9. Another variant is rotated MNIST [27], where the digits are rotated by a fixed angle between 0 and 180 degrees in each task. Similar constructions can also be applied to the not-MNIST set [7], the fashion MNIST set [48], or the CIFAR set [23] such as in the split not-MNIST [29] and split CIFAR benchmarks [51].

Other continual learning benchmarks include ones typically used for reinforcement learning. For instance, Kirkpatrick et al. [21] tested the performance of EWC when learning to play Atari games. Schwarz et al. [38] proposed a new benchmark for continual learning based on the StarCraft II video game, where an agent must master a sequence of skills without forgetting the previously acquired skills.

3. Analysis of catastrophic forgetting

Recent developments in task space modeling, such as Task2Vec [2] and Taskonomy [50], provide excellent tools to specify and analyze relationships between different tasks from data. In this paper, we propose a novel and general procedure that utilizes these tools to study catastrophic forgetting. Our procedure is conceptually simple and can be summarized in the following steps:

1. Specify the properties of a task sequence that we are interested in and estimate these properties using a suitable task space modeling method.

2. Estimate actual measures of catastrophic forgetting from real experiments. In our case, we measure catastrophic forgetting by the task sequence hardness, defined as the final error rate of a model trained sequentially on the sequence.

3. Use correlation analysis to study the correlations between the estimated properties in Step 1 and the actual measures in Step 2.

This procedure can be used even in other cases, such as transfer or multi-task learning, to study properties of new algorithms. For the rest of this paper, we demonstrate its use for analyzing two properties of task sequences and their effects on continual learning algorithms.

4. Total complexity and sequential heterogeneity of task sequences

We define two properties that we would like to investigate: the total complexity and sequential heterogeneity of a task sequence, and detail the methodology used to estimate these quantities from data. We start by introducing the Task2Vec framework [2], the main tool that we employ to quantify the above properties.

4.1. Preliminaries: Task2Vec

Task2Vec [2] is a recently developed framework for embedding visual classification tasks as vectors in a real vector space. The embeddings have many desirable properties that allow reasoning about the semantic and taxonomic relations between different visual tasks. This is one of several recent attempts to provide tools for understanding the structure of task space. Other related efforts that can be used as alternatives to Task2Vec include, e.g., [12, 46, 50].

Given a labeled classification dataset, $D = \{(x_i, y_i)\}_{i=1}^{N}$, Task2Vec works as follows. First, a network pre-trained on a large dataset (e.g., ImageNet), called the probe network, is applied to all the images $x_i$ in the dataset to extract the features from the last hidden layer (i.e., the value vectors returned by this layer). Using these features as new inputs and the labels $y_i$, we then train the classification layer for the task. After the training, we compute the Fisher information matrix for the feature extractor parameters. Since the Fisher information matrix is very large for deep networks, in practice we usually approximate it by (1) using only the diagonal entries and (2) averaging the Fisher information of all weights in the same filter. This results in a vector representation with size equal to the number of filters in the probe network. In this paper, we will use a ResNet [16] probe network that only has convolutional layers.

Task2Vec embeddings have many properties that can be used to study the relationships between tasks. We discuss two properties that are most relevant to our work. The first of these properties is that the norms of the embeddings encode the difficulty of the tasks. This property can be explained intuitively by noticing that easy examples (those that the model is very confident about) contribute less to the Fisher information while uncertain examples (those that are near the decision boundary) contribute more. Hence, if the task is difficult, the model would be uncertain on many examples, leading to an embedding with a large norm.

The second property that we are interested in is that Task2Vec embeddings can encode the similarity between tasks. Achille et al. [2] empirically showed this effect on the iNaturalist dataset [53], where the distances between Task2Vec embeddings strongly agree with the distances between natural taxonomical orders, hinting that the dissimilarity between tasks can be approximated from the distance
between them in the embedding space. The embeddings were also shown to be useful for model selection between different domains and tasks.

4.2. Total complexity

We now discuss the notions of total complexity and sequential heterogeneity of task sequences, and how we can estimate them from Task2Vec embeddings. We note that these definitions only capture specific aspects of sequence complexity and heterogeneity; however, they are enough to serve the purpose of our paper. In future work, we will consider more sophisticated definitions of sequence complexity and heterogeneity.

We define the total complexity of a task sequence as the sum of the complexities of its individual tasks. Formally, let $T = (t_1, t_2, \ldots, t_k)$ be a sequence of $k$ distinct tasks and $C(t)$ be a function measuring the complexity of a task $t$. The total complexity of the task sequence $T$ is:

$$C(T) = \sum_{i=1}^{k} C(t_i). \tag{1}$$

We slightly abuse notation by using the same function $C(\cdot)$ for the complexity of both sequences and tasks.

For simplicity, we only consider sequences of distinct tasks where data for each task are only observed once. The scenario where data for one task may be observed many times requires different definitions of total complexity and sequential heterogeneity. We will leave this extension to future work.

A simple way to estimate the complexity $C(t)$ of a task $t$ is to measure the error rate of a model trained for this task. However, this method often gives unreliable estimates since it depends on various factors such as the choice of model and the training algorithm.

In this work, we propose to estimate $C(t)$ from the Task2Vec embedding of task $t$. Specifically, we adopt the suggestion from Achille et al. [2] to measure the complexity of task $t$ by its distance to the trivial task (i.e., the task embedded at the origin for standard Fisher embedding) in the embedding space. That is,

$$C(t) = d(e_t, e_0), \tag{2}$$

where $e_t$ and $e_0$ are the embeddings of task $t$ and the trivial task respectively, and $d(\cdot, \cdot)$ is a symmetric distance between two tasks in the embedding space. Following Achille et al. [2], we choose $d(\cdot, \cdot)$ to be the normalized cosine distance:

$$d(e_1, e_2) = \cos\left(\frac{e_1}{e_1 + e_2}, \frac{e_2}{e_1 + e_2}\right), \tag{3}$$

where $e_1$ and $e_2$ are two task embeddings and the division is element-wise. This distance was shown to be well correlated with natural distances between tasks [2].

The total complexity in Eq. (1) depends on the sequence length. We can also consider the total complexity per task, $C(T)/k$, which does not depend on sequence length. In our analysis, however, we will only consider sequences of the same length. Hence, our results are not affected by whether total complexity or total complexity per task is used.

We note that our total complexity measure is very crude and only captures some aspects of task sequence complexity. However, as we will show in Sec. 6.2, our measure is positively correlated with catastrophic forgetting and thus can be used to explain catastrophic forgetting. A possible future research direction would be to design better measures of task sequence complexity that can better explain catastrophic forgetting (i.e., by giving better correlation scores).

4.3. Sequential heterogeneity

We define the sequential heterogeneity of a task sequence as the sum of the dissimilarities between all pairs of consecutive tasks in the sequence. Formally, for a task sequence $T = (t_1, t_2, \ldots, t_k)$ of distinct tasks, its sequential heterogeneity is:

$$F(T) = \sum_{i=1}^{k-1} F(t_i, t_{i+1}), \tag{4}$$

where $F(t, t')$ is a function measuring the dissimilarity between tasks $t$ and $t'$. Note that we also use the same notation $F(\cdot)$ for sequential heterogeneity and task dissimilarity here, but its interpretation should be clear from the context.

The dissimilarity $F(t, t')$ can be naively estimated by applying transfer learning algorithms and measuring how well we can transfer between the two tasks. However, this would give a tautological measure of dissimilarity that is affected by both the model choice and the choice of the transfer learning algorithm.

To avoid this problem, we also propose to estimate $F(t, t')$ from the Task2Vec embedding. For our purpose, it is clear that we can use the distance $d(\cdot, \cdot)$ of Eq. (3) as an estimate for $F(\cdot, \cdot)$. That is,

$$F(t, t') = d(e_t, e_{t'}) = \cos\left(\frac{e_t}{e_t + e_{t'}}, \frac{e_{t'}}{e_t + e_{t'}}\right). \tag{5}$$

The sequential heterogeneity in Eq. (4) only considers pairs of consecutive tasks, under the assumption that catastrophic forgetting is mostly influenced by the dissimilarity between these task pairs. In general, we can define other measures of heterogeneity, such as the total dissimilarity between all pairs of tasks. We will leave these extensions to future work.

Our choice of using Task2Vec to estimate $C(T)$ and $F(T)$ is more compatible with the multi-head models for continual learning [29, 51], which we will use in our experiments. In multi-head models, a separate last layer (the SoftMax layer) is trained for each different task and the other weights
are shared among tasks. This setting is consistent with the way Task2Vec is constructed in many cases. For instance, if we have two binary classification tasks whose labels are reversed, they would be considered similar by Task2Vec and are indeed very easy to transfer from one to another in the multi-head setting, by changing only the head.

5. Correlation analysis

Having defined total complexity and sequential heterogeneity, we now discuss how we can study their relationships to the hardness of a task sequence. Given a task sequence $T = (t_1, t_2, \ldots, t_k)$, we measure its actual hardness with respect to a continual learning algorithm $A$ by the final error rate obtained after running $A$ on the tasks $t_1, t_2, \ldots, t_k$ sequentially. That is, the hardness of $T$ with respect to $A$ is:

$$H_A(T) = \mathrm{err}_A(T). \tag{6}$$

In this paper, we choose final error rate as the measure of actual hardness as it is an important metric commonly used to evaluate continual learning algorithms. In future work, we will explore other metrics such as the forgetting rate [8].

To analyze the relationships between the hardness and total complexity or sequential heterogeneity, we employ correlation analysis as the main statistical tool. In particular, we sample $M$ task sequences $T_1, T_2, \ldots, T_M$ and compute their hardness measures $(H_A(T_i))_{i=1}^{M}$ as well as their total complexity $(C(T_i))_{i=1}^{M}$ and sequential heterogeneity $(F(T_i))_{i=1}^{M}$ measures. From these measures, we compute the Pearson correlation coefficients between hardness and total complexity measures or between hardness and sequential heterogeneity measures. These coefficients tell us how correlated these quantities are.

Formally, the Pearson correlation coefficient between two corresponding sets of measures $X = (X_i)_{i=1}^{M}$ and $Y = (Y_i)_{i=1}^{M}$ is defined as:

$$r_{XY} = \frac{\sum_{i=1}^{M} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{M} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{M} (Y_i - \bar{Y})^2}}, \tag{7}$$

where $\bar{X}$ and $\bar{Y}$ are the means of $X$ and $Y$ respectively. In addition to the correlation coefficients, we can also compute the p-values, which tell us how statistically significant these correlations are.

When computing the correlations between the hardness measures $H_A(T_i)$ and the total complexities $C(T_i)$, it is often a good idea to constrain the task sequences $T_i$ to have the same length. The reason for this normalization is that longer sequences tend to have larger complexities, thus the correlation may be biased by the sequence lengths rather than reflecting the complexity of individual tasks.

Similarly, when computing the correlations between $H_A(T_i)$ and the sequential heterogeneity $F(T_i)$, it is also a good idea to constrain the total complexity of the task sequences $T_i$ to be the same, so that the individual tasks’ complexities would not affect the correlations. This can be achieved by using the same set of individual tasks for all the sequences (i.e., the sequences are permutations of each other). We call the sequential heterogeneity obtained from this method the normalized sequential heterogeneity.

6. Experiments

We next describe the settings of our experiments and discuss our results. More detailed discussions on the implications of these results to continual learning and catastrophic forgetting research are provided in Sec. 7.

6.1. Settings

Datasets and task construction. We conduct our experiments on two datasets: MNIST and CIFAR-10, which are the most common datasets used to evaluate continual learning algorithms. For each of these sets, we construct a more general split version as follows. First, we consider each pair of different labels as a unit binary classification task, resulting in a total of 45 unit tasks. From these unit tasks, we then create 120 task sequences of length five by randomly drawing, for each sequence, five unit tasks without replacement.

We also construct 120 split task sequences which are permutations of a fixed task set containing five random unit tasks to compute the normalized sequential heterogeneity.

For each unit task, we train its Task2Vec embedding using a ResNet18 [16] probe network pre-trained on a combined dataset containing both MNIST and CIFAR-10.

Algorithms and network architectures. We choose two recent continual learning algorithms to analyze in our experiments: synaptic intelligence (SI) [51] and variational continual learning (VCL) [29]. For the experiments on MNIST, we also consider the coreset version of VCL (coreset VCL). These algorithms are among the state-of-the-art continual learning algorithms on the considered datasets, with SI representing the regularization-based methods, VCL representing the Bayesian methods, and coreset VCL combining Bayesian and rehearsal methods.

On CIFAR-10, we run SI with the same network architecture as that considered in [51]: a CNN with four convolutional layers, followed by two dense layers with dropout. Since VCL was not originally developed with convolutional layers, we flatten the input images and train with a fully connected network containing four hidden layers, each of which has 256 hidden units. On MNIST, we run both SI and VCL with a fully connected network containing two hidden layers, each of which has 256 hidden units. We denote this setting by MNIST-256².

Since MNIST is a relatively easy dataset, we may not observe meaningful results if all the errors obtained from different sequences are low and not very different. Thus,
to make the dataset harder for the learning algorithms, we also consider smaller network architectures. In particular, we consider fully connected networks with a single hidden layer, containing either 50 hidden units (for MNIST-50) or 20 hidden units (for MNIST-20). Following [29, 51], we also use the multi-head version of the models where a separate last layer (the SoftMax layer) is trained for each different task and the other weights are shared among tasks. For coreset VCL, we use random coresets with sizes 40, 40, 20 for MNIST-256², MNIST-50 and MNIST-20 respectively.

Optimizer settings. For both SI and VCL, we set the regularization strength parameter to the default value λ = 1. In all of our experiments, the models are trained using the Adam optimizer [20] with learning rate 0.001. Similar to [51], we set the batch size to be 256 in CIFAR-10 and 64 in MNIST settings for SI. We run this algorithm for 60, 10, 10, 5 epochs per task on CIFAR-10, MNIST-256², MNIST-50 and MNIST-20 respectively.

For VCL and coreset VCL, we set the batch size to be the training set size [29] and run the algorithms for 50, 120, 50, 20 epochs per task on CIFAR-10, MNIST-256², MNIST-50 and MNIST-20 respectively. For all algorithms, we run each setting ten times using different random seeds and average their errors to get the final error rates.

6.2. Results

Tables 1(a–c) show the correlation coefficients and their p-values obtained from our experiments for the total complexity, sequential heterogeneity, and normalized sequential heterogeneity, respectively. We also show the scatter plots of the errors versus these quantities, together with the linear regression fits for the CIFAR-10 dataset in Fig. 1. All plots in the experiments, including those for the MNIST dataset, are provided in Fig. 4, 5, and 6.

Table 1(a) and Fig. 1(a) show strong positive correlations between error rate and total complexity for both SI and VCL in the CIFAR-10 setting, with a correlation coefficient of 0.86 for the former algorithm and 0.69 for the latter. These correlations are both statistically significant with p-values less than 0.01. On the MNIST-256² settings, SI and coreset VCL have weak positive correlations with total complexity, with correlation coefficients of 0.24 and 0.28 respectively, both with p-values less than 0.01.

When we reduce the capacity of the network and make the problem relatively harder (i.e., in the MNIST-50 and MNIST-20 settings), we observe stronger correlations for all three algorithms. With the smallest network (in MNIST-20), all the algorithms have statistically significant positive correlation with total complexity.

In terms of sequential heterogeneity, Table 1(b) and Fig. 1(b) show that it has a weak positive correlation with error rate in the CIFAR-10 setting. In particular, SI and VCL have correlation coefficients of 0.30 and 0.21 (both statistically significant), respectively. Interestingly, we find no significant correlation between error rate and sequential heterogeneity in any of the MNIST settings, which suggests that heterogeneity may not be a significant factor determining the performance of continual learning algorithms on this dataset.

Since the complexity of each individual task in a sequence may influence the heterogeneity between the tasks (e.g., an easy task may be more similar to another easy task than to a hard task), the complexity may indirectly affect the results in Table 1(b). To avoid this problem, we also look at the normalized sequential heterogeneity in Table 1(c) and Fig. 1(c), where the set of tasks is fixed and thus task complexity has been factored out.

Surprisingly, Table 1(c) reports some negative correlations between error rate and sequential heterogeneity. For example, the correlation coefficient for SI on CIFAR-10 is -0.25 with a p-value less than 0.01, while there is no significant correlation for this algorithm on the MNIST dataset. VCL, on the other hand, has negative correlations with coefficients -0.20 and -0.21 on MNIST-50 and MNIST-20 respectively, with p-values less than 0.05. Coreset VCL also has negative correlation between its error rate and sequential heterogeneity on MNIST-50, with coefficient -0.26 and p-value less than 0.01. These unexpected results suggest that in some cases, dissimilarity between tasks may even help continual learning algorithms, a fact contrary to the common assumption that the performance of continual learning algorithms would degrade if the tasks they need to solve are very different [3, 35].

7. Discussions

On total complexity. The strong positive correlations between error rate and total complexity found in our analysis show that task complexity is an important factor in determining the effectiveness of continual learning algorithms. However, this factor is usually not taken into consideration when designing new algorithms or benchmarks. We suggest that task complexity be explicitly considered to improve algorithm and benchmark design. For example, different transfer methods can be used depending on whether one transfers from an easy task to a hard one or vice versa, rather than using a single transfer technique across all task complexities, as currently done in the literature. Similarly, when designing new benchmarks for continual learning, it is also useful to provide different complexity structures to test the effectiveness of continual learning algorithms on a broader range of scenarios and difficulty levels.

To illustrate the usefulness of comparing on various benchmarks, we construct two split MNIST sequences, one of which has high total complexity while the other has low total complexity. The sequences are constructed by starting with the binary classification task 0/1 and greedily adding tasks that have the highest (or lowest) complexity C(t).
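The greedy construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `greedy_sequence`, the complexity scores, and the sequence length of five are all assumptions made for the example. The text does not say whether candidate tasks must use disjoint labels, so the sketch does not enforce that.

```python
def greedy_sequence(complexity, start, length, highest=True):
    """Start from `start`, then repeatedly add the remaining task with the
    highest (or lowest) complexity score until `length` tasks are chosen."""
    remaining = {t: c for t, c in complexity.items() if t != start}
    sequence = [start]
    choose = max if highest else min
    while len(sequence) < length and remaining:
        pick = choose(remaining, key=remaining.get)
        sequence.append(pick)
        del remaining[pick]
    return sequence


# Hypothetical complexity scores C(t) for a few unit tasks (label pairs).
# Real scores would come from Task2Vec embeddings via Eq. (2).
complexity = {
    (0, 1): 0.05, (2, 3): 0.09, (4, 9): 0.12,
    (3, 5): 0.11, (6, 7): 0.06, (1, 7): 0.07,
}

hard = greedy_sequence(complexity, start=(0, 1), length=5, highest=True)
easy = greedy_sequence(complexity, start=(0, 1), length=5, highest=False)
print(hard)  # [(0, 1), (4, 9), (3, 5), (2, 3), (1, 7)]
print(easy)  # [(0, 1), (6, 7), (1, 7), (2, 3), (3, 5)]
```

Because C(t) is a fixed score per task, the greedy choice here reduces to taking the top (or bottom) tasks by complexity after the starting task.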
[Figure 1: scatter plots of average error (y-axis) for SI (top row) and VCL (bottom row) versus total complexity, sequential heterogeneity, and normalized sequential heterogeneity (x-axes). Panels: (a) Total complexity, (b) Sequential heterogeneity, (c) Normalized sequential heterogeneity.]

Figure 1. Error vs. (a) total complexity, (b) sequential heterogeneity and (c) normalized sequential heterogeneity on CIFAR-10, together with the linear regression fits and 95% confidence intervals. Green (red) color indicates statistically significant positive (negative) correlations. Black color indicates negligible correlations.
Variable                                 Algorithm    MNIST-2562         MNIST-50           MNIST-20           CIFAR-10
(a) Total Complexity                     SI           0.24 (p < 0.01)*   0.22 (p < 0.05)*   0.36 (p < 0.01)*   0.86 (p < 0.01)*
                                         VCL          0.05 (p = 0.59)    0.17 (p = 0.07)    0.21 (p < 0.05)*   0.69 (p < 0.01)*
                                         Coreset VCL  0.28 (p < 0.01)*   0.41 (p < 0.01)*   0.37 (p < 0.01)*   -
(b) Sequential Heterogeneity             SI           -0.01 (p = 0.86)   0.05 (p = 0.55)    0.07 (p = 0.48)    0.30 (p < 0.01)*
                                         VCL          0.04 (p = 0.69)    0.01 (p = 0.88)    0.05 (p = 0.58)    0.21 (p < 0.05)*
                                         Coreset VCL  0.09 (p = 0.31)    0.12 (p = 0.18)    0.18 (p = 0.05)    -
(c) Normalized Sequential Heterogeneity  SI           -0.07 (p = 0.43)   -0.04 (p = 0.65)   0.05 (p = 0.58)    -0.25 (p < 0.01)*
                                         VCL          0.03 (p = 0.76)    -0.20 (p < 0.05)*  -0.21 (p < 0.05)*  -0.17 (p = 0.06)
                                         Coreset VCL  -0.08 (p = 0.37)   -0.26 (p < 0.01)*  -0.16 (p = 0.07)   -
Table 1. Correlation coefficients (p-values) between error rate and (a) total complexity, (b) sequential heterogeneity, and (c) normalized sequential heterogeneity of three state-of-the-art continual learning algorithms (SI, VCL, coreset VCL) on four different tests conducted with the CIFAR-10 and MNIST datasets. Results with statistical significance (p < 0.05) are marked with an asterisk.
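Entries of the kind reported in Table 1 pair a correlation coefficient with a p-value. The sketch below shows one way such a pair could be computed from per-sequence measurements. It assumes a Pearson correlation coefficient and swaps in a two-sided permutation test for the p-value, which is not necessarily the test used for the table. The function names and the toy numbers are made up for illustration.

```python
import random
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the observed |r|.

    The error values are shuffled to break any association with xs;
    the p-value is the share of shuffles with |r| at least as large.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data: one (total complexity, average error) pair per task sequence.
total_complexity = [0.35, 0.38, 0.40, 0.42, 0.45, 0.47, 0.50]
average_error = [0.05, 0.06, 0.08, 0.07, 0.10, 0.12, 0.13]

r = pearson_r(total_complexity, average_error)
p = permutation_p_value(total_complexity, average_error)
print(f"r = {r:.2f}, p = {p:.4f}")
```

A permutation test is used here only because it needs nothing beyond the standard library; with SciPy available, a parametric test would be the more common choice.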
Fig. 3 shows these sequences and the error rates of VCL, coreset VCL and SI when evaluated on them. We also show the error rates of the algorithms on the standard split MNIST sequence for comparison. From the figure, if we only compare on the standard sequence, we may conclude that coreset VCL and SI have the same performance. However, if we consider the other two sequences, we can see that SI is in fact slightly better than coreset VCL. This small experiment suggests that we should use various benchmarks, ideally with different levels of complexity, for better comparison of continual learning algorithms.
It is also worth noting that although the correlation between error rate and task complexity seems trivial, it is still unclear which definition of task sequence complexity would best explain catastrophic forgetting (i.e., give the best correlations). In this paper, we propose the first measure for this purpose, the total complexity.
[Figure 3 plot. Legend: VCL, coreset VCL, SI. Rows: Low complexity, Standard, High complexity. x-axis: Average error.]
Figure 3. Average error rates of VCL, coreset VCL and SI on 3 task sequences from MNIST with different complexity levels. The high complexity sequence contains the binary tasks 0/1, 2/5, 3/5, 2/3, 2/6 with total complexity 0.48, while the low complexity sequence contains the tasks 0/1, 1/8, 1/3, 1/5, 7/8 with total complexity 0.35. The standard sequence contains the common split 0/1, 2/3, 4/5, 6/7, 8/9 with total complexity 0.41.
On sequential heterogeneity. The weak or negative correlations between error rate and sequential heterogeneity found in our analysis show an interesting contradiction to our intuition on the relationship between catastrophic forgetting and task dissimilarity. We emphasize that in our context, the weak and negative correlations are not a negative result, but actually a positive result. In fact, some previous work showed that task similarity helps improve performance in the context of transfer learning [2, 3, 35], while some others claimed that task dissimilarity could help continual learning [13], although their discussion was more related to the permuted MNIST setting. Our finding gives evidence that supports the latter view in the split MNIST and split CIFAR-10 settings.
To identify possible causes of this phenomenon, we carefully analyze the changes in error rates of VCL and SI on CIFAR-10 and observe some issues that may cause the negative correlations. For illustration, we show in Fig. 2 the detailed error rates of these algorithms on two typical task sequences where the final average error rates do not conform with the sequential heterogeneity. Both of these sequences have the same total complexity, with the first sequence having higher sequential heterogeneity.
[Figure 2 plots. Legend: VCL (Sequence 1), VCL (Sequence 2), SI (Sequence 1), SI (Sequence 2). Panels: Task 1 error, Task 2 error, Task 3 error, Task 4 error, Task 5 error, Average error. x-axis: Current task; y-axis: Error.]
Figure 2. Details of the error rates of VCL and SI on two typical task sequences from CIFAR-10. Each column shows the errors on a particular task when subsequent tasks are continuously observed. Sequence 1 contains the binary tasks 2/9, 0/4, 3/9, 4/8, 1/2 with sequential heterogeneity 0.091, while sequence 2 contains the tasks 1/2, 2/9, 3/9, 0/4, 4/8 with sequential heterogeneity 0.068 (the labels are encoded to 0, 1, . . . , 9 as usually done for this dataset). For both algorithms, the final average errors (the last points in the right-most plots) on sequence 2 are higher than those on sequence 1, despite sequence 1's higher sequential heterogeneity.
From the changes in error rates of VCL in Fig. 2, we observe that for the first sequence, learning a new task would cause forgetting of its immediate predecessor task but could also help a task learned before that. For instance, learning task 3 and task 5 increases the errors on task 2 and task 4 respectively, but helps reduce errors on task 1 (i.e., backward transferring to task 1). This observation suggests that the dissimilarities between only consecutive tasks may not be enough to explain catastrophic forgetting, and thus we should take into account the dissimilarities between a task and all the previously learned tasks.
From the error rates of SI in Fig. 2, we observe a different situation. In this case, catastrophic forgetting is not severe, but the algorithm tends not to transfer very well on the second sequence. This inability to transfer leads to higher error rates on tasks 3, 4, and 5 even when the algorithm learns them for the first time. One possible cause of this problem could be that a fixed regularization strength λ = 1 is used for all tasks, making the algorithm unable to adapt well to new tasks. This explanation suggests that we should customize the algorithm (e.g., by tuning the λ values or the optimizer) for effectively transferring between different pairs of tasks in the sequence.
Future directions. The analysis offered by our paper provides a general and novel methodology to study the relationship between catastrophic forgetting and properties of task sequences. Although the two measures considered in our paper, total complexity and sequential heterogeneity, can explain some aspects of catastrophic forgetting, the correlations in Table 1 are not very strong (i.e., their coefficients are not near 1 or -1). Thus, they can still be improved to provide better explanations for the phenomenon. Besides these two measures, we can also design other measures for properties such as intransigence [8].

8. Conclusion

This paper developed a new analysis for studying relationships between catastrophic forgetting and properties of task sequences. An application of our analysis to two simple properties suggested that task complexity should be considered when designing new continual learning algorithms or benchmarks, and continual learning algorithms should be customized for specific transfers. Our analysis can be extended to study other relationships between algorithms and task structures such as the effectiveness of transfer or multi-task learning with respect to properties of tasks.
[Figure 4 plots. Rows: SI, VCL, coreset VCL. Columns: MNIST-2562, MNIST-50, MNIST-20, CIFAR-10 (no CIFAR-10 panel for coreset VCL). x-axis: Total Complexity; y-axis: Average Error.]
Figure 4. Total complexity vs. average error, together with the linear regression fit and 95% confidence interval, for each algorithm and
test in Table 1(a). Green color indicates statistically significant positive correlations. Black color indicates negligible correlations.

[Figure 5 plots. Rows: SI, VCL, coreset VCL. x-axis: Sequential Heterogeneity; y-axis: Average Error.]
Figure 5. Sequential heterogeneity vs. average error, together with the linear regression fit and 95% confidence interval, for each algorithm
and test in Table 1(b). Green color indicates statistically significant positive correlations. Black color indicates negligible correlations.
[Figure 6 plots. Rows: SI, VCL, coreset VCL. x-axis: Sequential Heterogeneity; y-axis: Average Error.]
Figure 6. Normalized sequential heterogeneity vs. average error, together with the linear regression fit and 95% confidence interval, for
each algorithm and test in Table 1(c). Red color indicates statistically significant negative correlations. Black color indicates negligible
correlations.
References

[1] A. Achille, T. Eccles, L. Matthey, C. Burgess, N. Watters, A. Lerchner, and I. Higgins. Life-long disentangled representation learning with cross-domain latent homologies. In Advances in Neural Information Processing Systems, pages 9895–9905, 2018.
[2] A. Achille, M. Lam, R. Tewari, A. Ravichandran, S. Maji, C. Fowlkes, S. Soatto, and P. Perona. Task2Vec: Task embedding for meta-learning. arXiv:1902.03545, 2019.
[3] H. B. Ammar, E. Eaton, M. E. Taylor, D. C. Mocanu, K. Driessens, G. Weiss, and K. Tuyls. An automated measure of MDP similarity for transfer in reinforcement learning. In AAAI Conference on Artificial Intelligence Workshops, 2014.
[4] B. Ans and S. Rousset. Avoiding catastrophic forgetting by coupling two reverberating neural networks. Comptes Rendus de l'Académie des Sciences-Series III-Sciences de la Vie, 320(12):989–997, 1997.
[5] B. Ans and S. Rousset. Neural networks with a self-refreshing memory: Knowledge transfer in sequential learning tasks without catastrophic forgetting. Connection Science, 12(1):1–19, 2000.
[6] T. D. Bui, C. V. Nguyen, S. Swaroop, and R. E. Turner. Partitioned variational inference: A unified framework encompassing federated and continual learning. arXiv:1811.11206, 2018.
[7] Y. Bulatov. notMNIST dataset. http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html, 2011. Accessed: 2019-01-10.
[8] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In European Conference on Computer Vision, 2018.
[9] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with A-GEM. In International Conference on Learning Representations, 2019.
[10] Y. Chen, T. Diethe, and N. Lawrence. Facilitating Bayesian continual learning by natural gradients and Stein gradients. In Continual Learning Workshop @ NeurIPS, 2018.
[11] Z. Chen and B. Liu. Lifelong Machine Learning. Morgan & Claypool Publishers, 2016.
[12] H. Edwards and A. Storkey. Towards a neural statistician. In International Conference on Learning Representations, 2017.
[13] S. Farquhar and Y. Gal. Towards robust evaluations of continual learning. arXiv:1805.09733, 2018.
[14] R. M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[15] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. In International Conference on Learning Representations, 2014.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[17] F. Huszár. Note on the quadratic penalties in elastic weight consolidation. Proceedings of the National Academy of Sciences, 115(11):E2496–E2497, 2018.
[18] D. Isele and A. Cosgun. Selective experience replay for lifelong learning. In AAAI Conference on Artificial Intelligence, 2018.
[19] N. Kamra, U. Gupta, and Y. Liu. Deep generative dual memory network for continual learning. arXiv:1710.10368, 2017.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[21] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.
[22] M. Kochurov, T. Garipov, D. Podoprikhin, D. Molchanov, A. Ashukha, and D. Vetrov. Bayesian incremental learning for deep neural networks. In International Conference on Learning Representations Workshop, 2018.
[23] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[24] Y. LeCun, C. Cortes, and C. Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[25] S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, and B.-T. Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pages 4652–4662, 2017.
[26] Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.
[27] D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
[28] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989.
[29] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational continual learning. In International Conference on Learning Representations, 2018.
[30] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. arXiv:1802.07569, 2018.
[31] R. Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285, 1990.
[32] M. B. Ring. CHILD: A first step towards continual learning. Machine Learning, 28(1):77–104, 1997.
[33] H. Ritter, A. Botev, and D. Barber. Online structured Laplace approximations for overcoming catastrophic forgetting. arXiv:1805.07810, 2018.
[34] A. Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
[35] S. Ruder and B. Plank. Learning to select data for transfer learning with Bayesian optimization. In Conference on Empirical Methods in Natural Language Processing, pages 372–382, 2017.
[36] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv:1606.04671, 2016.
[37] J. C. Schlimmer and D. Fisher. A case study of incremental concept induction. In AAAI Conference on Artificial Intelligence, volume 86, pages 496–501, 1986.
[38] J. Schwarz, D. Altman, A. Dudzik, O. Vinyals, Y. W. Teh, and R. Pascanu. Towards a natural benchmark for continual learning. In Continual Learning Workshop @ NeurIPS, 2018.
[39] J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, 2018.
[40] J. Serrà, D. Surís, M. Miron, and A. Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, 2018.
[41] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.
[42] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
[43] A. J. Smola, S. Vishwanathan, and E. Eskin. Laplace propagation. In Advances in Neural Information Processing Systems, pages 441–448, 2004.
[44] R. S. Sutton and S. D. Whitehead. Online learning with random representations. In International Conference on Machine Learning, pages 314–321, 1993.
[45] S. Swaroop, C. V. Nguyen, T. D. Bui, and R. E. Turner. Improving and understanding variational continual learning. In Continual Learning Workshop @ NeurIPS, 2018.
[46] A. T. Tran, C. V. Nguyen, and T. Hassner. Transferability and hardness of supervised classification tasks. In IEEE/CVF International Conference on Computer Vision, 2019.
[47] H. Tseran, M. E. Khan, T. Harada, and T. D. Bui. Natural variational continual learning. In Continual Learning Workshop @ NeurIPS, 2018.
[48] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.
[49] J. Yoon, E. Yang, J. Lee, and S. J. Hwang. Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations, 2018.
[50] A. R. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[51] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987–3995, 2017.
[52] C. Zeno, I. Golan, E. Hoffer, and D. Soudry. Task agnostic continual learning using online variational Bayes. In Bayesian Deep Learning Workshop @ NeurIPS, 2018.
[53] X. Zhang, Y. Cui, Y. Song, H. Adam, and S. Belongie. The iMaterialist challenge 2017 dataset. In FGVC Workshop at CVPR, volume 2, 2017.
