
Joint Input and Output Coordination for Class-Incremental Learning

Shuai Wang1,2    Yibing Zhan3    Yong Luo1,2∗    Han Hu4    Wei Yu1∗    Yonggang Wen5    Dacheng Tao5
1Institute of Artificial Intelligence, School of Computer Science, Wuhan University, China.
2Hubei Luojia Laboratory, Wuhan, China.
3JD Explore Academy, JD.com, Inc., China.
4School of Information and Electronics, Beijing Institute of Technology, China.
5College of Computing & Data Science, Nanyang Technological University, Singapore.
wangshuai123@whu.edu.cn, zhanyibing@jd.com, luoyong@whu.edu.cn, hhu@bit.edu.cn, yuwei@whu.edu.cn, ygwen@ntu.edu.sg, dacheng.tao@ntu.edu.sg
∗Corresponding authors: Yong Luo, Wei Yu.
Abstract

Incremental learning is nontrivial due to severe catastrophic forgetting. Although storing a small amount of data from old tasks during incremental learning is a feasible solution, current strategies still do not 1) adequately address the class bias problem, 2) alleviate the mutual interference between new and old tasks, or 3) consider the problem of class bias within tasks. This motivates us to propose a joint input and output coordination (JIOC) mechanism to address these issues. This mechanism assigns different weights to different categories of data according to the gradient of the output score, and uses knowledge distillation (KD) to reduce the mutual interference between the outputs of old and new tasks. The proposed mechanism is general and flexible, and can be incorporated into different incremental learning approaches that use memory storage. Extensive experiments show that our mechanism can significantly improve their performance.

1 Introduction

In recent years, incremental learning has attracted much attention since it can play an important role in a wide variety of fields, including unmanned driving Santoso and Finn (2022) and human-computer interaction Tschandl et al. (2020). Incremental learning is nontrivial since the parameters that deep models learned for the old tasks are often destroyed in the process of learning new tasks. This leads to catastrophic forgetting French and Chater (2002). Preserving past information well while fully exploring new knowledge has become a major challenge of incremental learning.

Existing incremental learning approaches mainly focus on memory storage replay Ahn et al. (2021); Li and Hoiem (2017); Wu et al. (2019); Rebuffi et al. (2017); Yan et al. (2021), dynamic model expansion Serra et al. (2018); Mallya and Lazebnik (2018), and the design of regularization constraints Aljundi et al. (2019). Memory storage replay has been demonstrated to be very effective, and it alleviates the destruction of old task weights by storing past data or simulating human memory. However, due to privacy restrictions and limited memory, the data that can be accessed from old tasks are often quite scarce. This makes incremental learning models suffer from severe inter-task class bias, also known as the class imbalance issue between old and new tasks.

Some recent approaches Ahn et al. (2021); Rebuffi et al. (2017); Yan et al. (2021) alleviate the problem of class imbalance between old and new tasks by utilizing rescaling, balanced scoring, or softmax separation. Although these approaches can improve the performance to some extent, the problem of category imbalance still exists, since the imbalance becomes more severe as incremental learning progresses and the number of sample categories continuously increases. Moreover, the mutual interference between old and new tasks has not been well addressed. That is, these approaches only try to maintain the predictions on old tasks, and the output scores of old task data on the classification heads of new tasks are not well suppressed. The output consistency of new task data on old classification heads before and after updating the new task model is also not considered. Besides, none of the existing approaches deal with the class bias within tasks. An illustration is shown in Figure 1.

Figure 1: An illustration of the class imbalance and mutual interference issues. The difference in the number of input data for each class between tasks and within tasks makes the weights of fully connected layers greatly biased (neuron size). The output scores of data from old tasks ($1,\cdots,t-1$) on the classification heads of new task $t$ should approximate zero, but may be much larger than zero (green solid line) after training the new task model. The output scores of data from the new task on the classification heads of old tasks may be inconsistent before (blue dotted line) and after (blue solid line) updating the old task models.

In order to address these issues, we propose a joint input and output coordination (JIOC) mechanism, which enables incremental learning models to simultaneously alleviate the class imbalance and reduce the interference between the predictions of old and new tasks. Specifically, different weights are adaptively assigned to different input data according to their gradients for the output scores during the training of the new task and updating of the old task models. Then the outputs of old task data on new classification heads are explicitly suppressed and knowledge distillation (KD) Menon et al. (2021) is utilized for harmonization of the output scores based on the principle of human inductive memory Williams (1999); Redondo and Morris (2011).

The main contributions are summarized as follows:

  • We propose a joint input and output coordination mechanism for incremental learning. To the best of our knowledge, this is the first work that simultaneously adjusts the input data and the output layer for incremental learning.

  • We design an adaptive input weighting strategy. The samples of different classes are weighted according to their gradients of the output scores. This alleviates the class bias problem both in and between tasks.

  • We develop an output coordination strategy, which maintains the outputs of new task data on the old task classification heads before and after training, and suppresses the outputs of old task data on the new task classification heads.

The proposed method is general and flexible, and can be utilized as a plug-and-play tool for existing incremental learning approaches that use memory storage. To demonstrate the effectiveness of our mechanism, we incorporate it into some recent or competitive incremental learning approaches on multiple popular datasets (CIFAR10-LT, CIFAR100-LT, CIFAR100 Krizhevsky et al. (2009), MiniImageNet Vinyals et al. (2016), TinyImageNet Le and Yang (2015) and CUB-200-2011 Wah et al. (2011)). The results show that we can consistently improve the existing approaches, and the relative improvement sometimes exceeds $10\%$.

2 Related Work

2.1 Incremental Learning

Incremental learning De Lange et al. (2021) has received extensive attention in recent decades. In incremental learning, input data in new tasks are continuously used to extend the knowledge of existing models. This makes incremental learning a dynamic learning technique. An incremental learning model can be defined as one that meets the following conditions: (1) the model can learn useful knowledge from new task data; (2) the old task data that has been used to train the model does not need to be accessed, or only a small amount of it is accessed; (3) the model retains a memory of the knowledge that has been learned. Current studies on incremental learning mainly focus on domain incremental learning Mirza et al. (2022); Garg et al. (2022); Mallya et al. (2018), class-incremental learning Ahn et al. (2021); Rebuffi et al. (2017); Yan et al. (2021); Zhang et al. (2020); Liu et al. (2021), and small-sample incremental learning Tao et al. (2020); Cheraghian et al. (2021).

There are many works on class-incremental learning (CIL), and most of them overcome catastrophic forgetting by using knowledge distillation (KD) together with a small amount of stored old task data. For example, DMC Zhang et al. (2020) utilizes separate models for the new and old classes and trains the two models by combining double distillation. SPB Liu et al. (2021) utilizes a cosine classifier and reciprocal adaptive weights, and designs a new method of learning class-independent knowledge and multi-view knowledge to balance the stability-plasticity dilemma of incremental learning.

Although the above approaches can sometimes achieve promising performance, none of them addresses class bias within tasks, nor do they adequately address class bias between old and new tasks. Therefore, we propose a joint input and output coordination (JIOC) mechanism that enables incremental learning models to alleviate class imbalance and reduce interference between the predictions of old and new tasks.

2.2 Human Inductive Memory

The inductive memory method is a unique ability of human beings. It causes the memorized content to be grouped according to different attributes or categories; subsequently, these contents are memorized by category or attribute. As early as 1999, Williams Williams (1999) investigated the relationship between memory for input and inductive learning of morphological rules relating to functional categories in a semi-artificial form of Italian. The ability to perform induction appears at an early age in humans, while the underlying mechanisms remain unclear. Fisher et al. Fisher and Sloutsky (2005) demonstrated that category- and similarity-based induction should result in different memory traces and thus different memory accuracy. Hayes et al. Hayes et al. (2013) examined the development of the relationship between inductive reasoning and visual recognition memory, and demonstrated it through two studies. Inspired by human inductive memory, Geng et al. Geng et al. (2020) proposed a Dynamic Memory Induction Network (DMIN) to further address the small-sample challenge. These examples of inductive memory inspire us to propose an output distribution coordination mechanism.

Figure 2: Overall structure of the proposed method. Firstly, the absolute gradient of the output scores is computed, based on $\hat{p}^{\tau}_{i,j,k=i}$ and $y^{\tau}_{i,j,k=i}$, to induce a weight for each sample, where the weights are adaptively updated during training. Then $L_{OC,1\rightarrow t-1}$ is employed to maintain the outputs for each old task. $L_{OC,1\rightarrow t-1}$ is also utilized to make the outputs of new task data on old task classification heads after updating the old task models agree with those before the update. Finally, to suppress the outputs of old task data on new task classification heads, their output scores $\hat{p}^{t}_{i,j,k}$ are directly optimized to approach zero. (The solid blue line and solid green line represent the output distributions of the new task and old task data, respectively, on the new task classification head; the dashed blue line and dashed green line represent the output distributions of the new task and old task data, respectively, on the old task classification head.)

3 Method

3.1 Notations and Problem Setup

In CIL, data for new tasks arrive constantly and are represented as $\mathcal{D}=\{\mathcal{D}^{1},\mathcal{D}^{2},\cdots,\mathcal{D}^{t},\cdots,\mathcal{D}^{T}\}$. The data in the $t$-th new task is $\mathcal{D}^{t}=\{(x^{t}_{i,j},y^{t}_{i,j})_{i=1,2,\cdots,m;\,j=1,2,\cdots,n_{m}}\}$, where $m$ is the number of classes, $n_{m}$ is the number of samples for the $m$-th category, $x$ is the input data, and $y$ is the corresponding data label. The number of samples may vary for different categories in the new task. When learning the $t$-th new task, we assume that a small amount of data is stored for the old tasks, i.e.,

$$\mathcal{D}^{t}_{old}=\left\{(x^{1}_{i,j},y^{1}_{i,j}),\cdots,(x^{t-1}_{i,j},y^{t-1}_{i,j})\right\}, \tag{1}$$

where $i=1,2,\cdots,m$, $j=1,2,\cdots,n_{old}$, and $n_{old}\ll n$. That is, the amount of old data $\mathcal{D}^{t}_{old}$ in the repository is much smaller than that of $\mathcal{D}^{t}$. In CIL, a feature extractor $f(\cdot)$ (such as ResNet He et al. (2016)) and a fully connected layer (FCL) together with a softmax classifier are generally adopted, i.e.,

$$\mathbf{x}^{\tau}_{i,j}=f(x^{\tau}_{i,j};\Theta), \tag{2}$$
$$\hat{\mathbf{p}}^{\tau}_{i,j}=softmax(\mathbf{x}^{\tau}_{i,j};W), \tag{3}$$

where $\tau=\{1,2,\cdots,t\}$, $\Theta$ is the parameter of the feature extractor, $W$ is the parameter of the classifier, and $\hat{\mathbf{p}}^{\tau}_{i,j}$ is a vector of output scores. When incremental learning proceeds to the $t$-th task, all the data in $\mathcal{D}^{t}\cup\mathcal{D}^{t}_{old}=\left\{(x^{\tau}_{i,j},y^{\tau}_{i,j}),(\tau=1,2,\cdots,t)\right\}$ are utilized for training, and the following cross-entropy loss is usually adopted:

$$L_{ce,t}=-\dfrac{1}{N_{old}+n_{new}}\sum_{i,j,k,\tau=1}^{t}y^{\tau}_{i,j,k}\log(\hat{p}^{\tau}_{i,j,k}), \tag{4}$$

where $N_{old}$ is the total number of stored data for old tasks, $n_{new}$ is the total number of samples for the new task, and $\hat{p}^{\tau}_{i,j,k}$ is the output score at the $k$-th neuron.
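As a reader aid, the pipeline of Eqs. (2)-(4) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the random `features` array stands in for the extractor output $f(x;\Theta)$, and the toy batch sizes, class counts, and helper names (`softmax`, `cross_entropy`) are assumptions made only for this example.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, onehot):
    """Eq. (4): negative log-likelihood averaged over all N_old + n_new samples."""
    n_total = probs.shape[0]
    return -np.sum(onehot * np.log(probs + 1e-12)) / n_total

# Hypothetical toy batch: 2 replayed old-task samples and 4 new-task samples.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 8))       # stands in for f(x; Theta), Eq. (2)
W = rng.normal(size=(8, 4)) * 0.1        # classifier weights, 4 classes seen so far
labels = np.array([0, 1, 2, 2, 3, 3])    # classes 0-1 are old, classes 2-3 are new
onehot = np.eye(4)[labels]

probs = softmax(features @ W)            # Eq. (3)
loss = cross_entropy(probs, onehot)
print(f"L_ce = {loss:.4f}")
```

Because old and new samples share one normalising constant, the few replayed samples contribute only a small share of the loss, which is the imbalance the following sections address.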

3.2 Overview

According to the above problem setup, it can be seen that when performing incremental learning, only a limited number of samples from the old tasks will be retained. Due to the large number of samples in the new task, incremental learning suffers from the class imbalance issue between the old and new tasks. The class imbalance issue also exists within the new task, but this is ignored by existing CIL approaches Ahn et al. (2021); Rebuffi et al. (2017); Yan et al. (2021).

Figure 3: A comparison of the SSIL approach (left) and the proposed output coordination (right). In SSIL, only the outputs of old task data on old classification heads are kept consistent before and after updating. We improve it by further enforcing the outputs of new task data on old classification heads to be consistent, and suppress the outputs of old task data on new classification heads.

Therefore, we propose the joint input and output coordination (JIOC) mechanism, as shown in Figure 2, where we assign different weights to different input data according to the absolute value of their gradient for the output scores. In addition, in order to prevent the mutual interference of output distributions between old and new tasks, we split the softmax layer, inspired by the principle of human inductive memory. This is similar to the SSIL Ahn et al. (2021) approach, but has several significant differences, as shown in Figure 3: 1) for each of the old tasks, we utilize KD to maintain the output distribution of each task. To make the output scores of new task data on the classification heads of old tasks consistent, we also employ KD to force the outputs after updating the old task models to agree with the scores before the update; 2) to suppress the outputs of old task data on the classification heads of new tasks, their ground-truth target values are directly set to zero for training.

3.3 Input Coordination

As we know, the class imbalance issue may lead to significant bias in the learned weights of the fully connected layers Li et al. (2020). Therefore, $\hat{p}^{\tau}_{i,j,k}$ may deviate greatly from its corresponding true value $p^{\tau}_{i,j,k}$, and hence it is necessary to balance the weights of fully connected layers between tasks and within tasks.

Due to the severe bias in the weights of the fully connected layer, we propose to utilize the outputs of the layer preceding the fully connected layer to adjust the weights. Suppose that $\hat{\mathbf{q}}^{\tau}_{i,j}$ is the vector of the previous layer that outputs scores $\hat{\mathbf{p}}^{\tau}_{i,j}$. The derivative of $L_{ce,t}$ w.r.t. $\hat{\mathbf{q}}^{\tau}_{i,j}$ (we refer to the supplementary material for the detailed calculation) is given by:

$$\dfrac{\partial L_{ce,t}}{\partial\hat{\mathbf{q}}^{\tau}_{i,j}}=\begin{bmatrix}\hat{p}^{\tau}_{i,j,1}-y^{\tau}_{i,j,1}\\ \vdots\\ \hat{p}^{\tau}_{i,j,k}-y^{\tau}_{i,j,k}\\ \vdots\\ \hat{p}^{\tau}_{i,j,mt}-y^{\tau}_{i,j,mt}\end{bmatrix}. \tag{5}$$
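The detailed calculation is deferred to the supplementary material. As a reader aid, the standard softmax cross-entropy derivation that yields the form of Eq. (5) can be sketched as follows. The sketch assumes that $\hat{\mathbf{q}}$ holds the pre-softmax activations and that $y$ is one-hot; it drops the sample indices $i,j,\tau$ and the constant factor $1/(N_{old}+n_{new})$, and the symbols $\ell$ and $l$ are introduced here only for illustration.

```latex
\begin{align*}
\hat{p}_{k} &= \frac{\exp(\hat{q}_{k})}{\sum_{l}\exp(\hat{q}_{l})}, \qquad
\ell = -\sum_{k} y_{k}\log \hat{p}_{k}, \\
\frac{\partial \log \hat{p}_{k}}{\partial \hat{q}_{l}} &= \mathbb{1}[k=l]-\hat{p}_{l}, \\
\frac{\partial \ell}{\partial \hat{q}_{l}}
  &= -\sum_{k} y_{k}\left(\mathbb{1}[k=l]-\hat{p}_{l}\right)
   = -y_{l}+\hat{p}_{l}\sum_{k}y_{k}
   = \hat{p}_{l}-y_{l}.
\end{align*}
```

The last step uses $\sum_{k}y_{k}=1$. At the ground-truth neuron, where $y=1$, the entry equals $\hat{p}-1$, so its absolute value is $1-\hat{p}$.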

Then the absolute value of the gradient of the output score for the input data when $k=i$ is:

$$\delta^{\tau}_{i,j,k=i}=|\hat{p}^{\tau}_{i,j,k=i}-y^{\tau}_{i,j,k=i}|. \tag{6}$$

When the amount of data is large for a certain category, the model tends to be biased toward this category, and thus the absolute value of the gradient in Eq. (6) tends to be small during learning. To alleviate the bias issue, we propose to regard this absolute value as the weight for the corresponding input sample and incorporate it into the loss during training. That is, smaller weights will be adaptively assigned to the samples of the category that has more input data, and hence the model focuses more on the category that has fewer samples.
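A small numerical illustration of this weighting, assuming one-hot labels so that the weight reduces to one minus the predicted probability of the true class. The probabilities below are made up for the example, and the helper name `true_class_delta` is hypothetical.

```python
import numpy as np

def true_class_delta(probs, labels):
    """Eq. (6): |p_hat - y| at the ground-truth neuron (k = i), one value per sample."""
    p_true = probs[np.arange(len(labels)), labels]
    return np.abs(p_true - 1.0)  # y = 1 at the true class for one-hot labels

# Made-up softmax outputs: the first sample is predicted confidently, as is
# typical for a well-represented class; the second is predicted poorly.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.50, 0.30, 0.20]])
labels = np.array([0, 1])

delta = true_class_delta(probs, labels)
print(delta)
```

The confidently predicted sample receives a weight of about 0.1 and the poorly predicted one a weight of about 0.7, so the latter dominates the weighted loss.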

Based on the above analysis, we utilize the absolute values $\delta^{\tau}_{i,j}$ of the gradient to induce a weight for each input sample during training. First of all, we incorporate the absolute value of the gradient of the input data into the traditional cross-entropy loss (Eq. (4)), i.e.,

$$L_{IC,t}=-\dfrac{1}{N_{old}+n_{new}}\sum_{i,j,k=i,\tau=1}^{t}y^{\tau}_{i,j,k}\delta^{\tau}_{i,j,k}\log(\hat{p}^{\tau}_{i,j,k}). \tag{7}$$

Then, we can use Eq. (7) to balance the loss of each category. In this way, the category weights of the fully connected layer can be balanced according to the absolute value $\delta_{i,j}$ of the gradient of each input data. It not only alleviates the category bias between old and new tasks in incremental learning, but also greatly reduces the within-task bias. The main procedure is summarized in Algorithm 1 (see footnote 2).

Footnote 2: In the entire algorithm pipeline, the outer loop and inner loop iterate $Total$ and $\dfrac{m*n_{m}+N_{old}}{batchsize}$ times, respectively. We neglect the time complexity of Eq. (6), Eq. (7), as well as the parameter updates for $\Theta$ and $W$. The overall time complexity of the algorithm pipeline is $O(Total*\dfrac{m*n_{m}+N_{old}}{batchsize})$.
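To make the effect of Eq. (7) concrete, the following sketch compares the plain cross-entropy of Eq. (4) with the weighted loss on a made-up batch. It is an illustration under stated assumptions, not the authors' code: labels are one-hot, the function names are hypothetical, and the third sample plays the role of a poorly fitted class.

```python
import numpy as np

def cross_entropy_terms(probs, labels):
    """Per-sample terms of Eq. (4): -log p_hat at the true class."""
    p_true = probs[np.arange(len(labels)), labels]
    return -np.log(p_true)

def input_coordination_terms(probs, labels):
    """Per-sample terms of Eq. (7): delta * (-log p_hat) at the true class."""
    p_true = probs[np.arange(len(labels)), labels]
    delta = np.abs(p_true - 1.0)          # Eq. (6) with one-hot labels
    return delta * -np.log(p_true)

# Made-up softmax outputs: two well-fitted samples and one poorly fitted sample.
probs = np.array([[0.95, 0.03, 0.02],
                  [0.90, 0.06, 0.04],
                  [0.40, 0.35, 0.25]])
labels = np.array([0, 0, 2])

ce_terms = cross_entropy_terms(probs, labels)
ic_terms = input_coordination_terms(probs, labels)
l_ce, l_ic = ce_terms.mean(), ic_terms.mean()      # both normalised by the sample count

# Share of the total loss carried by the poorly fitted sample.
ce_share = ce_terms[2] / ce_terms.sum()
ic_share = ic_terms[2] / ic_terms.sum()
print(f"L_ce = {l_ce:.4f}, L_IC = {l_ic:.4f}")
print(f"share of hard sample: CE {ce_share:.3f}, IC {ic_share:.3f}")
```

Since every weight lies in $[0,1]$, the weighted loss never exceeds the plain cross-entropy, while the relative contribution of the poorly fitted sample grows.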

Algorithm 1 Main procedure of input coordination.

Input: The data of the incremental learning model $\{\mathcal{D}^{t}_{old},\mathcal{D}^{t}\}$; the feature extractor of the current model is $f(\cdot,\Theta)$; the parameter of the current fully-connected layer is $W$;

Output: The updated parameters $\Theta$ and $W$;

1:  for $epoch=1$; $epoch<Total$; $epoch{+}{+}$ do
2:     while $batchsize$ loads $\{\mathcal{D}^{t}_{old},\mathcal{D}^{t}\}$ data do
3:        (1) $\delta_{i,j}^{\tau}\leftarrow\{\hat{\mathbf{q}}^{\tau}_{i,j}\}$, by using Eq. (6);
4:        (2) $L_{IC,t}\leftarrow L_{ce,t}$ and $\delta_{i,j}^{\tau}$, by using Eq. (7);
5:        (3) According to the loss value $L_{IC,t}$ obtained in the previous step, the parameters $\Theta$ and $W$ of the incremental learning model are updated.
6:     end while
7:     Return the updated $\Theta$ and $W$.
8:  end for
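
The loss computation in steps (1) and (2) above can be illustrated with a minimal sketch. Eq. (6) and Eq. (7) are not reproduced in this sketch: the weights are passed in as given values, and we assume, purely for illustration, that the weighted loss is a per-sample weighted cross-entropy. The names `softmax` and `weighted_cross_entropy` and the toy numbers are hypothetical and are not taken from the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, labels, weights):
    """Mean of per-sample cross-entropy terms, each scaled by its weight.

    `weights` stands in for the delta values of Eq. (6); how they are
    derived from the output-score gradients is NOT reproduced here.
    """
    losses = []
    for logits, label, w in zip(batch_logits, labels, weights):
        probs = softmax(logits)
        losses.append(-w * math.log(probs[label]))
    return sum(losses) / len(losses)


# Toy batch (assumed values): one new-task sample and one memory sample.
batch_logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 1.5]]
labels = [0, 2]
uniform = weighted_cross_entropy(batch_logits, labels, [1.0, 1.0])
reweighted = weighted_cross_entropy(batch_logits, labels, [0.5, 2.0])
```

With uniform weights the function reduces to the ordinary cross-entropy; raising the weight of the memory sample increases its share of the loss, which is the kind of rebalancing the input coordination step aims for.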

3.4 Output Coordination

According to the above analysis, the output distribution of the new task data $\mathcal{D}^{t}$ on the old task classification heads must be kept consistent before and after updating the old task models (footnote 3: during the updating of the $(t-1)$ old tasks, there are only $m\cdot(t-1)$ classification heads, which do not include the classification heads for the $t$-th task). It is also necessary to suppress the output scores of the old task data $\mathcal{D}^{t}_{old}$ on the classification heads of the new task. In Figure 1, this corresponds to keeping the blue solid line consistent with the dotted line and making the green solid line approach $0$.

When the model trains on the $t$-th task, we denote the output score of the data $\mathcal{D}^{t}\cup\mathcal{D}^{t}_{old}$ before the softmax layer by $\hat{z}^{\tau}_{i,j,k}$. Before updating the old task models and training the $t$-th task, the output score of the data $\mathcal{D}^{t}\cup\mathcal{D}^{t}_{old}$ is $\tilde{z}^{\tau}_{i,j,k}$. Following the principle of human inductive memory, KD is used to enforce the output consistency of the new task data on the classification heads of each old task before and after updating the corresponding model, i.e.,

$$L_{OC,1\rightarrow t-1}=\sum_{\tau=1}^{t-1}\left[\sum_{i,j,k}\rho^{\epsilon}_{KL}\left(\hat{z}^{\tau}_{i,j,k},\tilde{z}^{\tau}_{i,j,k}\right)\right], \tag{8}$$

where $\rho^{\epsilon}_{KL}(\cdot)$ is the distillation loss, and $\epsilon$ is a temperature scaling parameter.
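
As a minimal sketch of Eq. (8) for a single sample, the snippet below sums a temperature-softened KL term over the old-task heads. The exact form of $\rho^{\epsilon}_{KL}$ is not spelled out here, so we assume the common choice of KL(previous || current) between softmax distributions computed at temperature $\epsilon$. The function names and toy logits are hypothetical.

```python
import math


def softened(logits, temperature):
    """Softmax of logits divided by the temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl(current_logits, previous_logits, temperature=2.0):
    """Assumed distillation term: KL(previous || current) on softened outputs."""
    p_prev = softened(previous_logits, temperature)
    p_cur = softened(current_logits, temperature)
    return sum(pp * math.log(pp / pc) for pp, pc in zip(p_prev, p_cur))


def output_consistency_loss(current_by_task, previous_by_task, temperature=2.0):
    """Sum the per-task terms over the old-task heads (tau = 1, ..., t-1)."""
    return sum(
        kd_kl(cur, prev, temperature)
        for cur, prev in zip(current_by_task, previous_by_task)
    )


# Toy logits of one new-task sample on two old-task heads (assumed values).
previous = [[2.0, 1.0], [0.5, 0.1]]   # before updating the model
drifted = [[1.0, 2.0], [0.5, 0.1]]    # after updating: the first head drifted
loss_same = output_consistency_loss(previous, previous)
loss_drifted = output_consistency_loss(drifted, previous)
```

The loss is zero when the old-task heads respond exactly as before and grows as the responses drift, which is the consistency property that Eq. (8) enforces.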

The output of the old task data $\mathcal{D}^{t}_{old}$ on the classification head of the new task can be adjusted according to:

$$L^{old}_{OC,t}=\frac{1}{n_{new}}\sum_{i,j,k}\left(\hat{p}^{t}_{i,j,k}-0\right), \tag{9}$$

where $i\in\{1,\cdots,m(t-1)\}$.
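
A small numerical sketch of Eq. (9) is given below. We assume that $\hat{p}^{t}_{i,j,k}$ denotes the (non-negative) output of an old-task sample on a new-task head, and we pass the normalizer $n_{new}$ in as an explicit argument because its definition is not restated in this section. The function name and the toy values are hypothetical.

```python
def suppression_loss(new_head_outputs, n_new):
    """Sum of (p - 0) over old-task samples and new-task heads, divided by n_new.

    `new_head_outputs[s][h]` is the output of memory sample s on new-task
    head h; `n_new` is the normalizer of Eq. (9), taken here as an input.
    """
    total = sum(p - 0.0 for sample in new_head_outputs for p in sample)
    return total / n_new


# Two memory samples scored on two new-task heads (assumed values).
outputs = [[0.10, 0.05], [0.20, 0.15]]
loss = suppression_loss(outputs, n_new=2)
```

Because every term is non-negative, the loss reaches zero only when the old-task data produce no response on the new-task heads, matching the goal of driving these outputs towards $0$.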

Although the principle of Eq. (8) is similar to the SSIL Ahn et al. (2021) approach, the output coordination mechanism proposed in this paper differs from SSIL, as shown in Figure 3. Combining the output coordination losses with the input coordination loss $L_{IC,t}$ of Eq. (7), we obtain the overall loss function $L_{JIOC,t}$ of the proposed method, i.e.,

$$L_{JIOC,t}=L_{IC,t}+\gamma_{1}L_{OC,1\rightarrow t-1}+\gamma_{2}L^{old}_{OC,t}, \tag{10}$$

where $\gamma_{1}\geq 0$ and $\gamma_{2}\geq 0$ are trade-off hyper-parameters.
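
The combination in Eq. (10) is a plain weighted sum, sketched below. The function name `jioc_loss`, the default values of the trade-off weights, and the toy loss values are assumptions for illustration; this section does not state the values of $\gamma_{1}$ and $\gamma_{2}$ used in the experiments.

```python
def jioc_loss(l_ic, l_oc_old_heads, l_oc_new_heads, gamma1=1.0, gamma2=1.0):
    """Overall loss of Eq. (10): L_IC + gamma1 * L_OC(1->t-1) + gamma2 * L_OC^old."""
    if gamma1 < 0 or gamma2 < 0:
        raise ValueError("trade-off hyper-parameters must be non-negative")
    return l_ic + gamma1 * l_oc_old_heads + gamma2 * l_oc_new_heads


# Toy loss values (assumed) combined with two different weightings.
only_input = jioc_loss(1.0, 0.5, 0.25, gamma1=0.0, gamma2=0.0)
combined = jioc_loss(1.0, 0.5, 0.25, gamma1=2.0, gamma2=4.0)
```

Setting both weights to zero recovers the input coordination loss alone, which corresponds to the input-coordination-only ablation setting.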

4 Experiment

4.1 Datasets and Evaluation Criteria

Datasets. We validate the effectiveness of our method not only on the unbalanced CIFAR10-LT and CIFAR100-LT datasets but also on the balanced CIFAR100 Krizhevsky et al. (2009), MiniImageNet Vinyals et al. (2016), TinyImageNet Le and Yang (2015), and Cub-200-2011 Wah et al. (2011) datasets. The CIFAR10 and CIFAR100 datasets both consist of 50,000 training images and 10,000 test images, with 10 and 100 categories respectively. To create unbalanced settings from the balanced datasets, we reduce the number of training samples for some classes. To ensure that our method is applicable to various settings (footnote 4: since we use a small amount of old task data, the setup is slightly different from that of Cui et al. (2019)), we consider long-tail imbalances Cui et al. (2019); a summary of the datasets is reported in Table 1. The MiniImageNet dataset was excerpted from the ImageNet Russakovsky et al. (2015) dataset, and it contains 100 classes, each with 600 images of size $84\times 84$.

Datasets | CIFAR10-LT | CIFAR100-LT
Training images | 16,271 | 32,775
Classes | 10 | 100
Max #{images} | 5,000 | 500
Min #{images} | 206 | 200
Imbalance factor | 24 | 2.5
Table 1: The detailed information of long-tail imbalance datasets.

Typically, the training and test split of MiniImageNet is 80:20. The TinyImageNet dataset is also a subset of the ImageNet Russakovsky et al. (2015) dataset and contains 200 classes, each with 500 training images, 50 validation images, and 50 testing images. The Cub-200-2011 dataset is a bird dataset used for image classification. It covers 200 categories with a total of 11,788 images.
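
To make the long-tail setting of Table 1 concrete, the sketch below generates per-class training-set sizes from the maximum and minimum class sizes. It assumes the exponential decay profile commonly used for long-tailed CIFAR following Cui et al. (2019); since the setup here is stated to differ slightly from that work, the totals produced by this sketch are not expected to reproduce Table 1 exactly. The function name `long_tail_counts` is hypothetical.

```python
def long_tail_counts(n_max, n_min, num_classes):
    """Per-class sample counts decaying exponentially from n_max to n_min."""
    ratio = n_min / n_max
    return [
        round(n_max * ratio ** (c / (num_classes - 1)))
        for c in range(num_classes)
    ]


# Class-size extremes taken from Table 1.
counts10 = long_tail_counts(5000, 206, 10)     # CIFAR10-LT
counts100 = long_tail_counts(500, 200, 100)    # CIFAR100-LT

# Imbalance factor = largest class size / smallest class size.
imbalance10 = max(counts10) / min(counts10)
imbalance100 = max(counts100) / min(counts100)
```

The two ratios come out at roughly 24 and exactly 2.5, in line with the imbalance factors listed in Table 1.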

Evaluation Criteria. Following Shi et al. (2022), the average accuracy is used to measure the performance of the incremental learning algorithm, i.e.,

$$\bar{A}=\dfrac{1}{t}\sum_{\tau=1}^{t}A_{\tau}, \tag{11}$$

where $A_{\tau}$ is the accuracy of the $\tau$-th task.
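
As a worked instance of Eq. (11), the sketch below averages the ten per-task accuracies reported for the original ICARL ($L_{ce}+L_{KD}$, all tasks) in Table 4. The function name `average_accuracy` is hypothetical; the numbers are copied from that table.

```python
def average_accuracy(per_task_accuracy):
    """Mean of the accuracies A_1, ..., A_t, as in Eq. (11)."""
    return sum(per_task_accuracy) / len(per_task_accuracy)


# Accuracies after each of the 10 incremental tasks (ICARL, CIFAR100-LT).
icarl_all_tasks = [80.70, 58.95, 50.97, 39.60, 35.06,
                   30.45, 29.07, 22.15, 19.57, 18.21]
avg = average_accuracy(icarl_all_tasks)
```

Rounded to two decimals this gives 38.47, which matches both the Avg column of Table 4 and the corresponding ICARL entry in Table 2.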

Dataset | CIFAR10-LT | CIFAR10-LT | CIFAR100-LT | CIFAR100-LT | CIFAR100-LT | CIFAR100-LT
Network | ResNet18 | ResNet18 | ResNet18 | ResNet18 | ResNet32 | ResNet32
$T$ | 5 | 5 | 10 | 10 | 10 | 10
$N_{old}$ | 1K | 1.5K | 1K | 1.5K | 1K | 1.5K
Methods | Average accuracy
BiC Wu et al. (2019) | 65.46 | 66.48 | 39.91 | 42.88 | 60.03 | 61.64
PODNet Douillard et al. (2020) | 76.00 | 71.23 | 38.38 | 41.33 | 47.15 | 49.21
SSIL Ahn et al. (2021) | 70.67 | 73.14 | 42.24 | 46.14 | 50.22 | 55.32
COIL Zhou et al. (2021) | 79.60 | 79.64 | 48.98 | 50.72 | 58.31 | 58.31
ICARL Rebuffi et al. (2017) | 70.18 | 73.25 | 38.47 | 42.04 | 50.17 | 54.67
ICARL_JIOC | 71.37 (+1.19) | 74.16 (+0.91) | 46.51 (+8.04) | 49.73 (+7.69) | 55.24 (+5.07) | 58.41 (+3.74)
DER Yan et al. (2021) | 71.05 | 73.27 | 52.22 | 54.22 | 64.18 | 66.13
DER_JIOC | 71.78 (+0.73) | 74.59 (+1.32) | 53.20 (+0.98) | 54.52 (+0.30) | 66.46 (+2.28) | 68.16 (+2.03)
FOSTER Wang et al. (2022) | 71.74 | 74.39 | 51.72 | 49.43 | 60.98 | 62.82
FOSTER_JIOC | 73.95 (+2.52) | 77.20 (+2.72) | 53.28 (+1.56) | 54.42 (+4.99) | 62.32 (+1.34) | 64.94 (+2.12)
Table 2: Results on the CIFAR10-LT and CIFAR100-LT datasets ($\ast$ means our implementation).

Dataset | MiniImageNet | MiniImageNet | TinyImageNet | TinyImageNet | Cub-200-2011 | CIFAR100 | CIFAR100 | CIFAR100 | CIFAR100
Network | ResNet18 | ResNet18 | ResNet18 | ResNet18 | ResNet18 | ResNet18 | ResNet18 | ResNet32 | ResNet32
$T$ | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
$N_{old}$ | 1K | 2K | 1K | 2K | 0.5K | 1K | 2K | 1K | 2K
Methods | Average accuracy
BiC Wu et al. (2019) | 52.95 | 52.10 | 52.95 | 54.18 | 29.13 | 34.43 | 45.77 | 58.01 | 61.93
PODNet Douillard et al. (2020) | 55.74 | 59.27 | 43.30 | 45.66 | 31.81 | 42.12 | 47.25 | 51.66 | 53.63
SSIL Ahn et al. (2021) | 47.59 | 54.52 | 35.94 | 42.13 | 33.20 | 43.63 | 50.75 | 51.76 | 56.12
COIL Zhou et al. (2021) | 64.97 | 65.10 | 42.87 | 42.84 | 34.81 | 51.25 | 56.21 | 57.12 | 60.03
ICARL Rebuffi et al. (2017) | 53.43 | 60.90 | 33.63 | 39.24 | 30.10 | 38.29 | 44.32 | 52.20 | 56.35
ICARL_JIOC | 59.88 (+6.45) | 65.98 (+5.08) | 38.60 (+4.97) | 44.74 (+5.50) | 35.11 (+5.01) | 47.49 (+9.20) | 53.74 (+9.42) | 56.66 (+4.46) | 59.84 (+3.49)
DER Yan et al. (2021) | 69.06 | 72.36 | 53.19 | 56.54 | 36.16 | 53.16 | 57.28 | 67.57 | 70.12
DER_JIOC | 70.06 (+1.00) | 73.08 (+0.72) | 56.37 (+3.18) | 57.63 (+1.09) | 38.82 (+2.26) | 56.29 (+3.13) | 59.31 (+2.03) | 70.45 (+2.88) | 71.88 (+1.76)
FOSTER Wang et al. (2022) | 67.93 | 69.37 | 50.36 | 54.78 | 27.49 | 52.09 | 52.09 | 64.05 | 65.93
FOSTER_JIOC | 70.84 (+2.91) | 73.70 (+4.33) | 52.41 (+2.05) | 55.91 (+1.13) | 38.10 (+10.61) | 55.69 (+3.60) | 59.85 (+7.76) | 65.12 (+1.07) | 67.79 (+1.86)
Table 3: Results on the MiniImageNet, TinyImageNet, Cub-200-2011, and CIFAR100 datasets ($\ast$ means our implementation).

Baseline Protocol. The training set of CIFAR10-LT is divided into $T=5$ tasks, and the number of categories for each task is 2. The number of samples in the memory is fixed to be $N_{old}=\{1000,2000\}$ during the incremental training. For the CIFAR100-LT, CIFAR100, MiniImageNet, TinyImageNet, and Cub-200-2011 datasets, the training set is divided into $T=10$ tasks. The memory sizes of the CIFAR100, MiniImageNet, and TinyImageNet datasets are also fixed as $N_{old}=\{1000,2000\}$, and the memory size is chosen from $N_{old}=\{1000,1500\}$ for the CIFAR100-LT dataset. In addition, for the Cub-200-2011 dataset, the memory size is fixed as $N_{old}=\{500\}$. The numbers of categories for each task of the CIFAR100-LT, CIFAR100, MiniImageNet, TinyImageNet and Cub-200-2011 datasets are 10, 10, 10, 20 and 20, respectively. For the samples stored in memory when our mechanism is fused with an existing algorithm, we follow the corresponding original algorithm Rebuffi et al. (2017); Yan et al. (2021); Wang et al. (2022).
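
The task splits described above can be sketched as an even partition of class indices. The helper `split_into_tasks` is hypothetical, and assigning classes in their natural index order is an assumption: the class order actually used in the experiments is not specified in this section.

```python
def split_into_tasks(num_classes, num_tasks):
    """Partition class indices 0..num_classes-1 into equally sized tasks."""
    if num_classes % num_tasks != 0:
        raise ValueError("num_classes must be divisible by num_tasks")
    per_task = num_classes // num_tasks
    return [
        list(range(start, start + per_task))
        for start in range(0, num_classes, per_task)
    ]


cifar10_lt_tasks = split_into_tasks(10, 5)      # 5 tasks of 2 classes
cifar100_tasks = split_into_tasks(100, 10)      # 10 tasks of 10 classes
tiny_imagenet_tasks = split_into_tasks(200, 10) # 10 tasks of 20 classes
```

The resulting task sizes (2, 10, and 20 classes per task) agree with the per-task category counts stated in the protocol.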

Implementation details. Our method and all the compared approaches (BiC Wu et al. (2019), PODNet Douillard et al. (2020), COIL Zhou et al. (2021), SSIL Ahn et al. (2021), ICARL Rebuffi et al. (2017), DER Yan et al. (2021) and FOSTER Wang et al. (2022)) are implemented using PyCIL Zhou et al. (2023) and PyTorch Paszke et al. (2017).

We use ResNet18 He et al. (2016) and ResNet32 as feature extractors. ResNet32 is used only to further demonstrate that the mechanism remains effective with other network architectures. For the parameter settings, we align with the original methods in PyCIL Zhou et al. (2023) to facilitate a fair comparison. The batch size is set to 128, and the SGD optimizer is used to update the weights during incremental learning model training. The learning rate is initially set to 0.1 and gradually decays. We run the training on two NVIDIA RTX 3090 GPUs.

4.2 Results and Analysis

We incorporate the proposed JIOC strategy into existing class-incremental learning algorithms (ICARL Rebuffi et al. (2017), DER Yan et al. (2021), and FOSTER Wang et al. (2022)). The experimental results on different datasets are shown in Table 2 and Table 3.

Results on CIFAR10-LT and CIFAR100-LT. Table 2 shows that the ICARL, DER, and FOSTER algorithms are significantly improved on our created imbalanced datasets. The relative improvements of ICARL_JIOC are 1.70%, 1.24%, 20.90%, 18.29%, 10.11% and 6.84% compared with the original ICARL algorithm. For the DER algorithm, our DER_JIOC improves it by 1.03%, 1.80%, 1.88%, 0.55%, 3.55% and 3.07%. In regard to the FOSTER algorithm, the relative improvements are 3.51%, 3.66%, 3.02%, 10.10%, 2.20%, and 3.37% respectively. Compared with all the counterparts, the best performance is usually achieved by the proposed FOSTER_JIOC method. This not only demonstrates the effectiveness of our method on imbalanced datasets, but also further confirms its ability to alleviate catastrophic forgetting in other network frameworks.
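
The relative improvements quoted above are obtained by dividing the absolute accuracy gain by the baseline accuracy. The short sketch below reproduces two of them from the entries of Table 2; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline, improved):
    """Relative gain in percent: (improved - baseline) / baseline * 100."""
    return (improved - baseline) / baseline * 100.0


# CIFAR100-LT, ResNet18, N_old = 1K (values from Table 2).
icarl_gain = relative_improvement(38.47, 46.51)   # ICARL -> ICARL_JIOC
der_gain = relative_improvement(52.22, 53.20)     # DER -> DER_JIOC
```

Rounded to two decimals these evaluate to 20.90 and 1.88, matching the third values in the ICARL and DER lists above.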

Results on MiniImageNet, TinyImageNet, Cub-200-2011 and CIFAR100. We can observe from Table 3 that the mechanism proposed in this paper also yields significant improvements on the MiniImageNet, TinyImageNet, Cub-200-2011 and CIFAR100 datasets. For example, our FOSTER_JIOC outperforms the original FOSTER algorithm by 4.28%, 6.24%, 4.07%, 2.06%, 38.60%, 6.91%, 14.90%, 1.67% and 2.82%, respectively. The best performance is also achieved by the proposed DER_JIOC and ICARL_JIOC, which are comparable. This further demonstrates that the proposed method not only alleviates catastrophic forgetting in class-imbalanced datasets but also has a forgetting-mitigation effect on normal data.

4.3 Ablation Studies

In this section, we first separately investigate the effectiveness of the input and output coordination strategies, and then show that the proposed output coordination strategy mitigates forgetting more effectively than the SSIL method.

ICARL | Loss | Task 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Avg
$All_{Tasks}$ | $L_{ce}+L_{KD}$ | 80.70 | 58.95 | 50.97 | 39.60 | 35.06 | 30.45 | 29.07 | 22.15 | 19.57 | 18.21 | 38.47
$All_{Tasks}$ | $\boldsymbol{L_{IC}}+L_{KD}$ | 80.70 | 63.50 | 55.80 | 45.18 | 40.84 | 36.88 | 33.46 | 27.21 | 25.24 | 22.79 | 43.16
$All_{Tasks}$ | $L_{ce}+\boldsymbol{L_{OC}}$ | 80.70 | 63.75 | 57.50 | 46.90 | 43.58 | 38.07 | 36.56 | 30.71 | 28.56 | 24.98 | 45.13
$All_{Tasks}$ | $\boldsymbol{L_{JIOC}}$ | 80.70 | 64.90 | 58.37 | 47.50 | 46.08 | 40.33 | 36.79 | 32.66 | 30.46 | 27.26 | 46.51
$New_{Task}$ | $L_{ce}+L_{KD}$ | 80.70 | 61.50 | 76.1 | 70.20 | 81.9 | 70.01 | 74.7 | 73.00 | 77.10 | 71.20 | 73.64
$New_{Task}$ | $\boldsymbol{L_{IC}}+L_{KD}$ | 80.70 | 66.40 | 81.00 | 73.10 | 84.60 | 72.50 | 77.7 | 74.90 | 79.70 | 71.60 | 76.22
$New_{Task}$ | $L_{ce}+\boldsymbol{L_{OC}}$ | 80.70 | 64.40 | 80.30 | 71.70 | 83.20 | 69.70 | 76.50 | 72.40 | 78.80 | 70.60 | 74.83
$New_{Task}$ | $\boldsymbol{L_{JIOC}}$ | 80.70 | 66.40 | 80.40 | 70.90 | 84.20 | 71.50 | 80.30 | 73.10 | 78.40 | 72.40 | 75.83
$Old_{Tasks}$ | $L_{ce}+L_{KD}$ | - | 56.40 | 38.40 | 29.40 | 23.35 | 22.52 | 19.15 | 14.89 | 12.38 | 12.32 | 25.42
$Old_{Tasks}$ | $\boldsymbol{L_{IC}}+L_{KD}$ | - | 60.60 | 45.35 | 37.43 | 32.52 | 30.28 | 25.65 | 21.89 | 19.30 | 18.00 | 32.34
$Old_{Tasks}$ | $L_{ce}+\boldsymbol{L_{OC}}$ | - | 63.10 | 46.10 | 38.63 | 33.67 | 31.74 | 29.90 | 24.76 | 22.28 | 19.91 | 34.45
$Old_{Tasks}$ | $\boldsymbol{L_{JIOC}}$ | - | 63.40 | 47.35 | 39.70 | 36.55 | 34.1 | 29.53 | 26.89 | 24.46 | 22.24 | 36.02
Table 4: The results obtained with $N_{old}=1000$, using ResNet18 as the feature extractor on the CIFAR100-LT dataset ($L_{ce}+L_{KD}$ is the loss function used by the ICARL algorithm).
Figure 4 (panels (a), (b), and (c)): Comparison of the experimental results of SSIL with those of SSIL_OC (SSIL_OC is formed by integrating the proposed OC strategy into the SSIL algorithm; $\ast$ means our implementation).

Study on the effectiveness of Input and Output Coordination. To validate the effectiveness of the input and output coordination strategies, we employ ResNet18 as the feature extractor on the CIFAR100-LT dataset, as shown in Table 4. Furthermore, on the CIFAR100 dataset, we conduct experiments separately using ResNet18 and ResNet32 as feature extractors, as shown in Table 5 and Table 6 (we refer to the supplementary material for Table 5 and Table 6).

(1) The results are reported in Table 4, where we can see that the average accuracy of the ICARL algorithm with the proposed input coordination is 32.34 on the old tasks, 76.22 on the new task, and 43.16 overall. Compared with the original ICARL approach, the relative improvements are 27.22%, 3.50%, and 12.19%, respectively. The results on the old tasks, new task, and all tasks in Table 5 and Table 6 also illustrate the competitiveness of the input coordination strategy. This demonstrates that the input coordination strategy can alleviate class imbalance in incremental learning. Besides, the ICARL algorithm only uses KD to maintain the output distribution on the old task classification heads. It does not take into account the human inductive memory mechanism for coordinating output distributions across different tasks. In Table 4, the ICARL algorithm with the proposed output coordination achieves average accuracies of 34.45, 74.83, and 45.13 on the old tasks, new task, and overall, respectively, which correspond to relative improvements of 35.52%, 1.62%, and 17.31% over the original ICARL. It can be concluded that the input and output coordination strategies proposed in this paper yield significant improvements, whether applied to the CIFAR100 dataset, the CIFAR100-LT dataset, or network architectures of different depths.

(2) On the CIFAR100-LT dataset, the input coordination strategy notably improves the results on the old tasks (27.22%), the new task (3.50%), and all tasks (12.19%) compared to the original ICARL algorithm, as shown in Table 4. Similarly, on the CIFAR100 dataset, the input coordination strategy improves the original ICARL algorithm by 42.05%, 3.61%, and 19.51% on the old tasks, new task, and all tasks, as shown in Table 5. These improvements on CIFAR100-LT and CIFAR100 indicate that the input coordination strategy effectively regulates the imbalance between old and new categories.

Experimental Comparison between the Output Coordination Strategy and SSIL. To quantitatively analyze the differences between the output coordination strategy and SSIL, we conducted experiments with different feature extractors and datasets, including class-imbalanced and balanced datasets, as shown in Figure 4. The results in Figure 4 show that the output coordination strategy leads to significant improvements at each task stage on class-imbalanced datasets and with the deeper feature network (ResNet32). This also indicates that, during incremental learning, the output coordination strategy enables new task data to maintain consistent output distributions on the old task classification heads and suppresses the outputs of old task data on the new task classification head. In this way, interference between new and old tasks is avoided. In contrast, SSIL does not avoid interference from old task outputs, which makes its performance inferior to that of the output coordination strategy.

5 Conclusion

Although the existing approaches address the class bias issue in class-incremental learning (CIL) to a certain extent by scaling and dividing the softmax layer, they all ignore the bias within the task. In addition, the mutual interference between old and new tasks has not been well resolved. Therefore, we propose a joint input and output coordination (JIOC) mechanism to enable incremental learning models to simultaneously reduce the interference between predictions for these tasks and alleviate the class imbalance issue between and within tasks. From the extensive experiments on multiple popular datasets, we observe significant improvements when incorporating the proposed mechanism into the existing CIL approaches that utilize memory storage. In the future, we intend to design more sophisticated strategies to reweight the inputs, and develop a general end-to-end framework for CIL.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (Grant No. U23A20318), the Special Fund of Hubei Luojia Laboratory under Grant 220100014, the Fundamental Research Funds for the Central Universities (No. 2042024kf0039), the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-GC-2023-006), and the CCF-Zhipu AI Large Model Fund OF 202224.

References

  • Ahn et al. [2021] Hongjoon Ahn, Jihwan Kwak, Subin Lim, Hyeonsu Bang, Hyojun Kim, and Taesup Moon. Ss-il: Separated softmax for incremental learning. In Proceedings of the IEEE/CVF International conference on computer vision, pages 844–853, 2021.
  • Aljundi et al. [2019] Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars. Task-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11254–11263, 2019.
  • Cheraghian et al. [2021] Ali Cheraghian, Shafin Rahman, Sameera Ramasinghe, Pengfei Fang, Christian Simon, Lars Petersson, and Mehrtash Harandi. Synthesized feature based few-shot class-incremental learning on a mixture of subspaces. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8661–8670, 2021.
  • Cui et al. [2019] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9268–9277, 2019.
  • De Lange et al. [2021] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021.
  • Douillard et al. [2020] Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. Podnet: Pooled outputs distillation for small-tasks incremental learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, pages 86–102. Springer, 2020.
  • Fisher and Sloutsky [2005] Anna V Fisher and Vladimir M Sloutsky. When induction meets memory: Evidence for gradual transition from similarity-based to category-based induction. Child development, 76(3):583–597, 2005.
  • French and Chater [2002] Robert M French and Nick Chater. Using noise to compute error surfaces in connectionist networks: A novel means of reducing catastrophic forgetting. Neural computation, 14(7):1755–1769, 2002.
  • Garg et al. [2022] Prachi Garg, Rohit Saluja, Vineeth N Balasubramanian, Chetan Arora, Anbumani Subramanian, and CV Jawahar. Multi-domain incremental learning for semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 761–771, 2022.
  • Geng et al. [2020] Ruiying Geng, Binhua Li, Yongbin Li, Jian Sun, and Xiaodan Zhu. Dynamic memory induction networks for few-shot text classification. arXiv preprint arXiv:2005.05727, 2020.
  • Hayes et al. [2013] Brett K Hayes, Kristina Fritz, and Evan Heit. The relationship between memory and inductive reasoning: Does it develop? In Developmental Psychology, volume 44, pages 848–860, 2013.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Le and Yang [2015] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
  • Li and Hoiem [2017] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.
  • Li et al. [2020] Yu Li, Tao Wang, Bingyi Kang, Sheng Tang, Chunfeng Wang, Jintao Li, and Jiashi Feng. Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10991–11000, 2020.
  • Liu et al. [2021] Yaoyao Liu, Bernt Schiele, and Qianru Sun. Adaptive aggregation networks for class-incremental learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 2544–2553, 2021.
  • Mallya and Lazebnik [2018] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
  • Mallya et al. [2018] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–82, 2018.
  • Menon et al. [2021] Aditya K Menon, Ankit Singh Rawat, Sashank Reddi, Seungyeon Kim, and Sanjiv Kumar. A statistical perspective on distillation. In International Conference on Machine Learning, pages 7632–7642. PMLR, 2021.
  • Mirza et al. [2022] M Jehanzeb Mirza, Marc Masana, Horst Possegger, and Horst Bischof. An efficient domain-incremental learning approach to drive in all weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3001–3011, 2022.
  • Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • Rebuffi et al. [2017] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
  • Redondo and Morris [2011] Roger L Redondo and Richard GM Morris. Making memories last: the synaptic tagging and capture hypothesis. Nature Reviews Neuroscience, 12(1):17–30, 2011.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
  • Santoso and Finn [2022] Fendy Santoso and Anthony Finn. A data-driven cyber–physical system using deep-learning convolutional neural networks: Study on false-data injection attacks in an unmanned ground vehicle under fault-tolerant conditions. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 53(1):346–356, 2022.
  • Serra et al. [2018] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pages 4548–4557. PMLR, 2018.
  • Shi et al. [2022] Yujun Shi, Kuangqi Zhou, Jian Liang, Zihang Jiang, Jiashi Feng, Philip HS Torr, Song Bai, and Vincent YF Tan. Mimicking the oracle: an initial phase decorrelation approach for class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16722–16731, 2022.
  • Tao et al. [2020] Xiaoyu Tao, Xiaopeng Hong, Xinyuan Chang, Songlin Dong, Xing Wei, and Yihong Gong. Few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12183–12192, 2020.
  • Tschandl et al. [2020] Philipp Tschandl, Christoph Rinner, Zoe Apalla, Giuseppe Argenziano, Noel Codella, Allan Halpern, Monika Janda, Aimilios Lallas, Caterina Longo, Josep Malvehy, et al. Human–computer collaboration for skin cancer recognition. Nature Medicine, 26(8):1229–1234, 2020.
  • Vinyals et al. [2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.
  • Wah et al. [2011] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • Wang et al. [2022] Fu-Yun Wang, Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan. Foster: Feature boosting and compression for class-incremental learning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXV, pages 398–414. Springer, 2022.
  • Williams [1999] John N Williams. Memory, attention, and inductive learning. Studies in Second Language Acquisition, 21(1):1–48, 1999.
  • Wu et al. [2019] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 374–382, 2019.
  • Yan et al. [2021] Shipeng Yan, Jiangwei Xie, and Xuming He. Der: Dynamically expandable representation for class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3014–3023, 2021.
  • Zhang et al. [2020] Junting Zhang, Jie Zhang, Shalini Ghosh, Dawei Li, Serafettin Tasci, Larry Heck, Heming Zhang, and C-C Jay Kuo. Class-incremental learning via deep model consolidation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1131–1140, 2020.
  • Zhou et al. [2021] Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan. Co-transport for class-incremental learning. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1645–1654, 2021.
  • Zhou et al. [2023] Da-Wei Zhou, Fu-Yun Wang, Han-Jia Ye, and De-Chuan Zhan. Pycil: a python toolbox for class-incremental learning. SCIENCE CHINA Information Sciences, 66(9):197101, 2023.

Appendix

Appendix A Absolute Value of Gradient

According to the above problem setup, the data $\left\{(x^{t'}_{i,j}, y^{t'}_{i,j})\right\}$ in the $t$-th task is simplified to $\left\{(x_u, y_u), (u = 1, 2, \cdots, m \cdot t)\right\}$, where $m \cdot t$ is the number of classes. Besides, the corresponding output score $\hat{\mathbf{p}}^{t'}_{i,j}$ is simplified to $\hat{\mathbf{p}}_u$. The previous layer's output score for softmax is $\hat{\mathbf{q}}_u$, i.e.,

$$\hat{\mathbf{p}}_u = \dfrac{e^{\hat{\mathbf{q}}_u}}{\sum_{r=1}^{m \cdot t} e^{\hat{\mathbf{q}}_r}} \tag{12}$$

If the $k$-th neuron corresponds to the correct output label, then $y_k = 1$ in $\left[y_1, y_2, \cdots, y_{m \cdot t}\right]$ and all other entries are 0. The derivative of $L_{ce,t}$ w.r.t. $\hat{\mathbf{q}}_u$ can be given by:

$$\begin{aligned}
\dfrac{\vartheta L_{ce,t}}{\vartheta \hat{\mathbf{q}}_u} &= \dfrac{\vartheta L_{ce,t}}{\vartheta \hat{\mathbf{p}}_u} \cdot \dfrac{\vartheta \hat{\mathbf{p}}_u}{\vartheta \hat{\mathbf{q}}_u} \\
&= \dfrac{\vartheta \left(-\sum_{u=1}^{m \cdot t} y_u \log \hat{\mathbf{p}}_u\right)}{\vartheta \hat{\mathbf{p}}_u} \cdot \dfrac{\vartheta \hat{\mathbf{p}}_u}{\vartheta \hat{\mathbf{q}}_u}
\end{aligned} \tag{13}$$

Since $y_k = 1$ and all other labels are 0, only the $k$-th term of the sum contributes: $\dfrac{\vartheta L_{ce,t}}{\vartheta \hat{p}_k} = -\dfrac{y_k}{\hat{p}_k} \neq 0$, and the derivatives w.r.t. the other outputs are 0. Eq. 13 can therefore be simplified as:

$$\dfrac{\vartheta L_{ce,t}}{\vartheta \hat{\mathbf{q}}_u} = -\dfrac{y_k}{\hat{p}_k} \cdot \dfrac{\vartheta \hat{p}_k}{\vartheta \hat{\mathbf{q}}_u} \tag{14}$$

The solution of $\dfrac{\vartheta L_{ce,t}}{\vartheta \hat{\mathbf{q}}_u}$ in Eq. 14 needs to be divided into two cases ($u = k$ and $u \neq k$).

Case $u = k$:

$$\begin{aligned}
\dfrac{\vartheta \hat{p}_k}{\vartheta \hat{\mathbf{q}}_u} &= \dfrac{\vartheta \hat{p}_k}{\vartheta \hat{q}_k} \\
&= \dfrac{e^{\hat{q}_k}}{\sum_{r=1}^{m \cdot t} e^{\hat{q}_r}} - \left(\dfrac{e^{\hat{q}_k}}{\sum_{r=1}^{m \cdot t} e^{\hat{q}_r}}\right)^2 \\
&= \hat{p}_k \left(1 - \hat{p}_k\right)
\end{aligned} \tag{15}$$

Case $u \neq k$:

$$\dfrac{\vartheta \hat{p}_k}{\vartheta \hat{\mathbf{q}}_u} = -\hat{p}_k \hat{\mathbf{p}}_u \tag{16}$$
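
The two cases of the softmax derivative can be checked numerically. The following is a minimal sketch, not part of the original experiments: it compares the closed forms of Eq. 15 and Eq. 16 against central finite differences. The logit values, the index `k`, and all function names are hypothetical choices made for illustration.

```python
import math

def softmax(q):
    """Numerically stable softmax over a list of logits (Eq. 12)."""
    m = max(q)
    exps = [math.exp(v - m) for v in q]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_derivative_row(q, k):
    """Closed-form d p_k / d q_u for every u: Eq. 15 when u == k, Eq. 16 otherwise."""
    p = softmax(q)
    return [p[k] * (1.0 - p[k]) if u == k else -p[k] * p[u]
            for u in range(len(q))]

def numeric_derivative_row(q, k, eps=1e-6):
    """Central finite-difference estimate of d p_k / d q_u for every u."""
    row = []
    for u in range(len(q)):
        hi = list(q)
        lo = list(q)
        hi[u] += eps
        lo[u] -= eps
        row.append((softmax(hi)[k] - softmax(lo)[k]) / (2 * eps))
    return row

q = [2.0, -1.0, 0.5, 0.0]  # illustrative logits (assumed values)
k = 2                      # index of the output being differentiated
analytic = softmax_derivative_row(q, k)
numeric = numeric_derivative_row(q, k)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest deviation from finite differences:", max_err)
```

The entries of `analytic` sum to zero, because the softmax outputs always sum to one: raising one logit increases its own probability by exactly the amount removed from the others.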

Substituting Eq. 15 and Eq. 16 into Eq. 14, the derivative of $L_{ce,t}$ w.r.t. $\hat{\mathbf{q}}_u$ can be given by:

$$\dfrac{\vartheta L_{ce,t}}{\vartheta \hat{\mathbf{q}}_u} = \begin{bmatrix} \hat{p}_1 \\ \vdots \\ \hat{p}_k - 1 \\ \vdots \\ \hat{p}_{m \cdot t} \end{bmatrix} \tag{17}$$

Since $y_k = 1$ and others are 0, the above Eq. 17 can be further rewritten, i.e.,

$$\dfrac{\vartheta L_{ce,t}}{\vartheta \hat{\mathbf{q}}_u} = \begin{bmatrix} \hat{p}_1 - y_1 \\ \vdots \\ \hat{p}_k - y_k \\ \vdots \\ \hat{p}_{m \cdot t} - y_{m \cdot t} \end{bmatrix} \tag{18}$$
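
As a sanity check on Eq. 18, the sketch below compares the closed-form gradient (softmax output minus one-hot label) with a finite-difference gradient of the cross-entropy loss, and then sums the absolute values of its entries. It is an illustration only: the logits, the label index, and the helper names are assumptions, and the summed absolute value is shown as a simple observation rather than as the exact weighting used in the main text.

```python
import math

def softmax(q):
    """Numerically stable softmax over a list of logits."""
    m = max(q)
    exps = [math.exp(v - m) for v in q]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(q, k):
    """Cross-entropy for a one-hot label at index k: -log p_k."""
    return -math.log(softmax(q)[k])

def grad_closed_form(q, k):
    """Eq. 18: dL_ce / dq_u = p_u - y_u with y one-hot at index k."""
    p = softmax(q)
    return [p[u] - (1.0 if u == k else 0.0) for u in range(len(q))]

def grad_finite_difference(q, k, eps=1e-6):
    """Central finite-difference gradient of the loss w.r.t. each logit."""
    grad = []
    for u in range(len(q)):
        hi = list(q)
        lo = list(q)
        hi[u] += eps
        lo[u] -= eps
        grad.append((cross_entropy(hi, k) - cross_entropy(lo, k)) / (2 * eps))
    return grad

q = [1.5, 0.2, -0.7]  # illustrative logits (assumed values)
k = 0                 # index of the correct class (assumed)
g = grad_closed_form(q, k)
g_num = grad_finite_difference(q, k)
max_err = max(abs(a - b) for a, b in zip(g, g_num))

# Sum of absolute gradient entries; algebraically this equals 2 * (1 - p_k).
abs_sum = sum(abs(v) for v in g)
p_k = softmax(q)[k]
print(max_err, abs_sum, 2 * (1 - p_k))
```

The entry for the correct class is negative and all others are positive, so the summed absolute gradient shrinks toward zero as the predicted probability of the correct class approaches one.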

Appendix B Ablation Studies Table

On the CIFAR100 dataset, we conducted experiments separately using ResNet18 and ResNet32 as feature extractors, as shown in Table 5 and Table 6.

| ICARL | Loss | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8 | Task 9 | Task 10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $All_{Tasks}$ | $L_{ce}+L_{KD}$ | 87.10 | 60.75 | 49.13 | 39.22 | 33.54 | 29.03 | 26.40 | 21.31 | 19.54 | 16.83 | 38.29 |
| | $\boldsymbol{L_{IC}}+L_{KD}$ | 87.10 | 68.65 | 59.07 | 48.12 | 42.36 | 37.00 | 35.37 | 29.65 | 26.94 | 23.34 | 45.76 |
| | $L_{ce}+\boldsymbol{L_{OC}}$ | 87.10 | 69.15 | 59.57 | 48.42 | 45.10 | 40.98 | 39.23 | 32.48 | 30.26 | 27.55 | 47.98 |
| | $\boldsymbol{L_{JIOC}}$ | 87.10 | 68.80 | 58.93 | 49.55 | 45.96 | 40.68 | 38.73 | 33.39 | 29.86 | 27.88 | 48.09 |
| $New_{Task}$ | $L_{ce}+L_{KD}$ | 87.10 | 66.80 | 80.70 | 74.70 | 85.10 | 75.40 | 79.10 | 76.40 | 79.50 | 73.40 | 77.82 |
| | $\boldsymbol{L_{IC}}+L_{KD}$ | 87.10 | 71.80 | 85.20 | 77.30 | 86.20 | 77.90 | 82.70 | 79.90 | 82.40 | 75.80 | 80.63 |
| | $L_{ce}+\boldsymbol{L_{OC}}$ | 87.10 | 71.90 | 84.50 | 76.00 | 85.50 | 77.70 | 82.50 | 78.40 | 82.20 | 76.40 | 80.22 |
| | $\boldsymbol{L_{JIOC}}$ | 87.10 | 71.70 | 84.10 | 77.40 | 85.40 | 76.30 | 81.70 | 79.30 | 82.20 | 77.20 | 80.24 |
| $Old_{Tasks}$ | $L_{ce}+L_{KD}$ | - | 54.70 | 33.35 | 27.40 | 20.65 | 19.76 | 17.62 | 13.44 | 12.05 | 10.54 | 23.28 |
| | $\boldsymbol{L_{IC}}+L_{KD}$ | - | 65.50 | 46.00 | 38.40 | 31.40 | 28.82 | 27.48 | 22.47 | 20.01 | 17.51 | 33.07 |
| | $L_{ce}+\boldsymbol{L_{OC}}$ | - | 66.40 | 47.10 | 39.23 | 35.00 | 33.64 | 32.02 | 25.91 | 23.76 | 22.12 | 36.13 |
| | $\boldsymbol{L_{JIOC}}$ | - | 65.90 | 46.35 | 40.27 | 36.10 | 33.56 | 31.57 | 26.83 | 23.31 | 22.40 | 36.25 |

Table 5: The results obtained by running with $N_{old} = 1000$, using ResNet18 as the feature extractor on the CIFAR100 dataset ($L_{ce}+L_{KD}$ is the loss function used by the ICARL algorithm).

| ICARL | Loss | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8 | Task 9 | Task 10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $All_{Tasks}$ | $L_{ce}+L_{KD}$ | 89.90 | 75.05 | 67.20 | 55.50 | 49.62 | 45.03 | 41.26 | 35.62 | 32.77 | 30.06 | 52.20 |
| | $\boldsymbol{L_{IC}}+L_{KD}$ | 89.90 | 74.70 | 66.57 | 55.65 | 51.84 | 45.60 | 43.76 | 37.16 | 34.18 | 33.69 | 53.31 |
| | $L_{ce}+\boldsymbol{L_{OC}}$ | 89.90 | 75.85 | 68.97 | 58.98 | 54.98 | 50.02 | 47.26 | 41.30 | 39.53 | 37.21 | 56.40 |
| | $\boldsymbol{L_{JIOC}}$ | 89.90 | 76.25 | 67.47 | 58.82 | 55.54 | 51.77 | 48.39 | 42.26 | 39.01 | 37.23 | 56.66 |
| $New_{Task}$ | $L_{ce}+L_{KD}$ | 89.90 | 78.50 | 87.80 | 82.70 | 90.90 | 83.20 | 88.20 | 84.00 | 87.90 | 83.30 | 85.64 |
| | $\boldsymbol{L_{IC}}+L_{KD}$ | 89.90 | 77.80 | 88.70 | 83.30 | 89.30 | 84.70 | 89.00 | 85.00 | 88.20 | 83.40 | 85.93 |
| | $L_{ce}+\boldsymbol{L_{OC}}$ | 89.90 | 79.20 | 89.50 | 82.80 | 89.10 | 81.60 | 87.10 | 81.90 | 86.10 | 80.30 | 84.75 |
| | $\boldsymbol{L_{JIOC}}$ | 89.90 | 79.00 | 88.70 | 82.70 | 89.80 | 82.30 | 88.00 | 81.40 | 86.80 | 81.50 | 85.01 |
| $Old_{Tasks}$ | $L_{ce}+L_{KD}$ | - | 71.60 | 56.90 | 46.43 | 39.30 | 37.40 | 33.43 | 28.71 | 25.88 | 24.74 | 40.49 |
| | $\boldsymbol{L_{IC}}+L_{KD}$ | - | 71.60 | 55.50 | 46.43 | 42.48 | 37.78 | 36.22 | 30.33 | 27.42 | 28.17 | 41.77 |
| | $L_{ce}+\boldsymbol{L_{OC}}$ | - | 72.50 | 58.70 | 51.03 | 46.45 | 43.70 | 40.62 | 35.50 | 33.71 | 32.42 | 46.07 |
| | $\boldsymbol{L_{JIOC}}$ | - | 73.50 | 56.85 | 50.87 | 46.98 | 45.66 | 41.78 | 36.67 | 33.04 | 32.31 | 46.41 |

Table 6: The results obtained by running with $N_{old} = 1000$, using ResNet32 as the feature extractor on the CIFAR100 dataset ($L_{ce}+L_{KD}$ is the loss function used by the ICARL algorithm).
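
The Avg column appears to be the arithmetic mean of the per-task accuracies in each row. The short sketch below illustrates this reading using the $All_{Tasks}$ rows of Table 6 for $L_{ce}+L_{KD}$ and $L_{JIOC}$. The values are copied from the table; the variable and function names are hypothetical.

```python
# Per-task accuracies (%) copied from the All_Tasks rows of Table 6 (ResNet32).
baseline = [89.90, 75.05, 67.20, 55.50, 49.62, 45.03, 41.26, 35.62, 32.77, 30.06]
jioc = [89.90, 76.25, 67.47, 58.82, 55.54, 51.77, 48.39, 42.26, 39.01, 37.23]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

avg_baseline = mean(baseline)   # compare with the Avg entry for L_ce + L_KD
avg_jioc = mean(jioc)           # compare with the Avg entry for L_JIOC
gain = avg_jioc - avg_baseline  # average improvement in percentage points

print(f"baseline avg: {avg_baseline:.2f}")
print(f"JIOC avg:     {avg_jioc:.2f}")
print(f"gain:         {gain:.2f}")
```

The same computation applies to the $Old_{Tasks}$ rows, where the mean is taken over the nine tasks that have an entry.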