
CL2R: Compatible Lifelong Learning Representations

Published: 06 January 2023
    Abstract

    In this article, we propose a method to partially mimic natural intelligence for the problem of lifelong learning representations that are compatible. We take the perspective of a learning agent that is interested in recognizing object instances in an open dynamic universe in a way in which any update to its internal feature representation does not render the features in the gallery unusable for visual search. We refer to this learning problem as Compatible Lifelong Learning Representations (CL2R), as it considers compatible representation learning within the lifelong learning paradigm. We identify stationarity as the property that the feature representation is required to hold to achieve compatibility and propose a novel training procedure that encourages local and global stationarity on the learned representation. Due to stationarity, the statistical properties of the learned features do not change over time, making them interoperable with previously learned features. Extensive experiments on standard benchmark datasets show that our CL2R training procedure outperforms alternative baselines and state-of-the-art methods. We also provide novel metrics to specifically evaluate compatible representation learning under catastrophic forgetting in various sequential learning tasks. Code is available at https://github.com/NiccoBiondi/CompatibleLifelongRepresentation.

    1 Introduction

    The universe is dynamic, and the emergence of novel data and new knowledge is unavoidable. The unique ability of natural intelligence to learn throughout a lifetime is highly dependent on memory and knowledge representation [18]. Through memory and knowledge representation, natural intelligent systems continually search, recognize, and learn new objects in an open universe after exposure to one or a few samples. Memory is essentially a cognitive function that encodes, stores, and retrieves knowledge. Artificial representations learned by Deep Convolutional Neural Network (DCNN) models [3, 61, 63, 64, 76] and stored in a memory bank (i.e., the gallery-set) have been shown to be quite effective in searching and recognizing objects in an open-set/open-world learning context. Successful examples are face recognition [10, 14, 59], person re-identification [78, 79, 80], and image retrieval [19, 65, 73].
    These approaches rely on learning feature representations from static datasets in which all images are accessible at training time. However, dynamic assimilation of new data for lifelong learning suffers from catastrophic forgetting: the tendency of neural networks to abruptly forget previously learned information [37, 52].
    In the case of visual search, even when catastrophic forgetting is avoided by repeatedly training DCNN models on both old and new data, the feature representation still irreversibly changes [31]. Thus, to benefit from the newly learned model, features stored in the gallery must be reprocessed and the “old” features replaced with the “new” ones. Reprocessing not only requires the storage of the original images (a noticeable leap from natural intelligence) but also the authorization to access them [66]. More importantly, extracting new features at each update of the model is computationally expensive or infeasible in the case of large gallery-sets. The speed at which the representation must be updated to benefit from the newly learned data may impose time constraints on the re-indexing process, ranging from timescales on the order of weeks or months, as in retrieval systems or social networks [62], to within seconds, as in autonomous robotics or real-time surveillance [43, 48]. Recently, in the work of Shen et al. [62], a novel training procedure was proposed to avoid re-indexing the gallery-set. The representation obtained in this manner is said to be compatible, as the features before and after the learning upgrade can be directly compared. Training takes advantage of all data from previous tasks (i.e., no lifelong learning), guaranteeing the absence of catastrophic forgetting. The advantage of considering compatible representation learning within the lifelong learning paradigm, as in this work, is that compatible representations allow visual search systems not only to distribute the computation over time but also to avoid, or possibly limit, the storage of gallery images on private servers. This can have important implications for the societal debate related to privacy, ethical, and sustainability issues (e.g., carbon footprint) of modern AI systems [11, 49, 60, 66].
    We identify stationarity as the key requirement for feature representation to be compatible during lifelong learning. Stationary features have been shown to be biologically plausible in many studies of working memory in the prefrontal cortex of macaques [33, 39, 40]. Some works [39, 40] decoded the information from the neural activity of the working memory using a classifier with a single fixed set of weights. They noted that a non-stationary feature representation seems to be biologically problematic since it would imply that the synaptic weights would have to change continuously for the information to be continuously available in memory.
    Inspired by this, in this article, we formalize the problem of Compatible Lifelong Learning Representations ( \(\textbf {CL}^{\bf 2}{\bf R}\) ) in relation to the relevant areas of compatible learning and lifelong (continual) learning. We refer to any training procedure that aims to obtain compatible features while minimizing catastrophic forgetting as CL2R training, and we propose (1) a novel set of metrics to properly evaluate CL2R training procedures and (2) a training procedure based on rehearsal [52, 54] and feature stationarity [46, 47] that jointly addresses catastrophic forgetting and feature compatibility. Figure 1 provides an overview of the problem and the training procedure. Specifically, our CL2R training procedure encourages global and local stationarity in the learned features.
    Fig. 1.
    Fig. 1. Overview of the Compatible Lifelong Learning Representations (CL2R) problem and proposed training procedure. The learning agent searches object instances from query images \(I_\mathcal {Q}\) without re-indexing the gallery-set. Any update to the internal feature representation \(\phi\) does not render the features in the gallery-set unusable (i.e., no images are stored). A compatible feature representation under catastrophic forgetting is learned by imposing stationarity on the features learned from the class-incremental learning surrogate task. Training is based on rehearsal with the episodic memory \(\mathcal {M}_t\) .
    The rest of the article is organized as follows. In Section 2, we discuss related work, and in Section 3, we highlight our contributions. Section 4 presents the formulation of CL2R, Section 5 proposes new metrics to evaluate compatibility, and Section 6 describes a new training procedure. In Section 7, we compare our results with adapted state-of-the-art methods. Section 8 presents the ablation study. We conclude in Section 9.

    2 Related Work

    Compatible learning. The work proposed by Shen et al. [62], called Backward-Compatible Training (BCT), first formalizes the problem of learning compatible representations to avoid re-indexing. The method takes advantage of an influence loss that encourages the new feature representation toward one that can be used by the old classifier. The old classifier is fixed while learning with the novel data (i.e., its parameters are no longer updated by back-propagation) and cooperates with the new representation model. Cooperation is achieved by aligning the prototypes of the new classifier with the prototypes of the old fixed one. The underlying assumption is that the upgraded feature representation follows the representation learned by the old classifier. BCT has been evaluated in scenarios without the effects of catastrophic forgetting by repeatedly training DCNN models on both old and new images (i.e., jointly re-training from scratch at each upgrade). To compare with this learning strategy in a lifelong learning scenario, instead of starting from scratch at every upgrade, we extended BCT with the capability of fine-tuning the previously learned model according to a memory-based rehearsal strategy [52, 54].
    Compatibility under catastrophic forgetting has been implicitly studied in the work of Iscen et al. [25] (FAN), in which the authors presented a method for storing features instead of images in Class-incremental Learning (CiL). They introduce a feature adaptation function to update the preserved features as the network learns novel classes. We compare with this method by storing the updated preserved features obtained at each task. Although designed to improve classification accuracy, the work can be considered close to a lifelong learning approach with compatible representation, in which the feature adaptation function they defined implicitly addresses the problem of feature compatibility, as in other works [6, 23, 38, 68]. Differently from BCT, these methods do not completely avoid the cost of re-indexing, since the learned mappings must be evaluated every time the gallery is upgraded; they are therefore not suited to lifelong learning and/or large gallery-sets. For example, the mapping proposed in the work of Chen et al. [6] is only one order of magnitude faster than the ResNet-101 architecture; therefore, when the size of the gallery increases by an order of magnitude, applying the mapping becomes equivalent to re-indexing the images. The method described in the work of Ramanujan et al. [51], in addition to the current feature model, trains from the same data an auxiliary model in a different way (i.e., using self-supervised learning). The auxiliary feature model is then used with future learned models to learn a mapping model that yields compatible representations, as in other works [25, 38, 68]. The underlying assumption is that, as the auxiliary feature model is trained with a different strategy, it encodes different knowledge that may facilitate learning the mapping between the representation spaces.
    Compatibility of the representation in a more general sense has been considered in the work of Li et al. [31] and Wang et al. [70], where similarity between features extracted from identical architectures and trained from different initialization has been extensively evaluated. The work of Budnik and Avrithis [5] avoids re-indexing the gallery, although the new model used for queries is not trained on more data. Their work is motivated by the scenario where the gallery is indexed by a large model and the queries are captured from mobile devices in which the use of small models is the only viable solution.
    Lifelong learning. Lifelong learning or continual learning studies the problem of learning from a non-i.i.d. stream of data with the goal of assimilating new knowledge while preventing catastrophic forgetting [9, 37]. Methods for preventing catastrophic forgetting have been explored primarily in the classification task, where catastrophic forgetting often manifests itself as a significant drop in classification accuracy [2, 13, 35, 41, 67]. The key aspects that distinguish lifelong feature learning for visual search from classification are the following: (i) categorical data often have coarser granularity than visual search data, (ii) evaluation metrics do not involve classification accuracy, and (iii) class labels are not required to be explicitly learned. These differences suggest that the two manifestations of catastrophic forgetting may have different origins. In this context, recent works have discussed the importance of the specific task in assessing catastrophic forgetting of learned representations [1, 7, 8, 12, 47, 50]. Among others, empirical evidence presented in the work of Davari and Belilovsky [12] suggests that feature forgetting is not as catastrophic as classification forgetting and that many approaches that address the problem of catastrophic forgetting do not improve feature forgetting in terms of the usefulness of the representation. We argue that such evidence is relevant in visual search and that it can be exploited with techniques that further encourage learning compatible feature representations. Accordingly, we consider CiL as the basic building block for the general purpose of learning feature representations incrementally.
    In this article, the focus is on CiL methods based on Knowledge Distillation (KD) [21] and rehearsal [55], which are known to be versatile, effective, and widely applicable to reduce catastrophic forgetting. We leverage the classification task in CiL as a surrogate task to learn the feature representation, as typically performed in face/body identification and retrieval [14, 65, 79]. The work of Li and Hoiem [32] first introduces KD in lifelong learning as an effective way to preserve the knowledge previously acquired from old tasks. In iCaRL [53], KD is combined with rehearsal, reserving exemplars of already seen classes in an episodic memory. The BiC work, proposed by Wu et al. [71], extends the work of Rebuffi et al. [53] by developing a bias correction layer that recalibrates the output probabilities by learning an additional linear layer on a small set of data. Along a similar vein, in the work of Zhao et al. [77], the bias correction is performed by aligning the norms of the weight vectors of the classifier for new classes to those for old classes without using additional model parameters or reserved data. The work of Romero et al. [56] introduces Feature Distillation (FD), a distillation loss evaluated on the feature vectors instead of on the classifier outputs. FD has recently been successfully applied by Hou et al. [22] (LUCIR) and Douillard et al. [16] (PODNet) to reduce catastrophic forgetting. Differently from LUCIR, PODNet uses a spatial-based distillation loss to constrain the statistics of intermediate features after each residual block. Similarly to LUCIR, PODNet, and many other works on continual/lifelong learning in the literature, our problem formulation takes advantage of the general concept of KD. Differently from these works, our approach is novel in that it considers FD for the dual purpose of learning feature compatibility and mitigating feature forgetting. The work of Iscen et al. [25] (FAN), also discussed in the previous paragraph, combines strategies from other works [22, 32, 53] to learn and preserve previous features. Although the work does not consider the compatibility problem, it is the closest work to our approach. Recently, Yan et al. [72] (DER) showed an interesting performance improvement in CiL by freezing the previously learned representation and expanding its dimension with a new learnable feature extractor. Despite the clear improvements in classification performance, this has no trivial exploitation in compatible training, as the varying dimensions across tasks do not allow direct application of nearest-neighbor search between models. Features with different dimensions typically need to be projected into a single common space before nearest-neighbor search can be applied. The FOSTER method [69] improves upon DER by addressing this specific problem, transforming the growing feature representation with a trainable linear layer that maps the growing feature vector into a fixed dimension. More generally, CiL methods addressing catastrophic forgetting are in a certain sense related to compatible representation, since forgetting is a change in the feature representation with respect to classifiers that will be learned in the future. We evaluate these methods as baselines to quantify the level of lifelong-compatible representation they may intrinsically have.

    3 Main Contributions

    (1)
    We consider compatible representation learning within the lifelong learning paradigm. We refer to this general learning problem as CL2R.
    (2)
    We define a novel set of metrics to properly evaluate CL2R training procedures.
    (3)
    We propose a CL2R training procedure that imposes global and local stationarity on the learned features to achieve compatibility between representations under catastrophic forgetting. The interaction between global and local stationarity yields a significant performance improvement when local stationarity is promoted only on already observed samples in the episodic memory.
    (4)
    We empirically assess the effectiveness of our approach in several benchmarks showing improvements over baselines and adapted state-of-the-art methods.

    4 CL2R Problem Formulation

    In a CL2R setting, a sequence of representation models, \(\lbrace \phi _t \rbrace _{t=1}^{T}\) , is learned incrementally with a sequence of T tasks, \(\lbrace (\mathcal {D}_t, K_t) \rbrace _{t=1}^T\) , where \(\mathcal {D}_t\) are the images of the t-th task represented by \(K_t\) different classes. Specifically, each task is disjoint from the others: \(K_k \cap K_t= \emptyset\) with \(t \ne k\) . The learned representation model \(\phi _t\) is used to transform the query images into feature vectors that are used to retrieve the images most similar to a set of given gallery images transformed with a previous model \(\phi _k\) . Specifically, we indicate with the couple \(\mathcal {G}=(I_\mathcal {G},F_\mathcal {G})\) the gallery-set, where \({I}_\mathcal {G}=\lbrace \mathbf {x}_i\rbrace _{i=1}^N\) is the image collection from which the features \(F_\mathcal {G}=\lbrace \mathbf {f}_i \rbrace _{i=1}^N\) are extracted, and N is the number of elements of the two sets. Without loss of generality, we assume that the features in \(F_\mathcal {G}\) are extracted using the representation model \(\phi _{ k}:{\mathbb {R}}^D \rightarrow {\mathbb {R}}^d\) that transforms an image \(\mathbf {x} \in {\mathbb {R}}^D\) into a feature vector \(\mathbf {f} \in {\mathbb {R}}^d\) , where d and D are the dimensionality of the feature and the image space, respectively. Analogously, we will refer to \(\mathcal {Q}=(I_\mathcal {Q},F_\mathcal {Q})\) as the query-set, where \(I_\mathcal {Q}\) and \(F_\mathcal {Q}\) are the corresponding image-set and the feature-set, respectively. As the t-th task becomes available, the model \(\phi _{t}\) is incrementally learned from the previous one along with the new task data \(\mathcal {D}_t\) . Our goal is to design a training procedure to learn the model \(\phi _{t}\) so that any query image transformed with it can be used to perform visual search through some distance \({\rm dist}:{\mathbb {R}}^d \times {\mathbb {R}}^d \rightarrow \mathbb {R}_+\) to identify the closest features \({F}_\mathcal {G}\) to the query features \({F}_\mathcal {Q}\) without forgetting the previous representation and without computing \(F_\mathcal {G}=\lbrace \mathbf {f} \in \mathbb {R}^d \, | \, \mathbf {f} = \phi _{t}(\mathbf {x}) \, \forall \mathbf {x} \in I_\mathcal {G}\rbrace\) (i.e., re-indexing). If this holds, then the resulting representation \(\phi _{t}\) is said to be lifelong compatible with \(\phi _{k}\) .
    The main challenge of the CL2R problem is to jointly alleviate catastrophic forgetting and learn a compatible representation between the previously learned models. In Figure 1, we illustrate the complete CL2R training example using rehearsal to alleviate the effects of catastrophic forgetting.
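    To make the setting concrete, the following is a minimal PyTorch-style sketch of how visual search is performed without re-indexing: the gallery features were extracted once with an earlier model \(\phi_k\) and are never recomputed, whereas queries are embedded with the current model \(\phi_t\) . The function name, the cosine distance, and the tensor shapes are illustrative assumptions, not part of the formulation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def search(phi_t, query_images, gallery_features, top_k=5):
    """Retrieve the top_k gallery items closest to each query.

    phi_t            : current representation model (nn.Module), images -> R^d
    query_images     : tensor of shape (Q, C, H, W)
    gallery_features : tensor of shape (N, d), extracted with a previous model phi_k
                       and kept as-is (no re-indexing)
    """
    q = F.normalize(phi_t(query_images), dim=1)   # (Q, d) query features
    g = F.normalize(gallery_features, dim=1)      # (N, d) old gallery features
    dist = 1.0 - q @ g.t()                        # cosine distance matrix (Q, N)
    return dist.topk(top_k, dim=1, largest=False).indices
```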

    5 Compatibility Evaluation

    A representation model \(\phi _{\rm new}\) upgraded with new data is said to be compatible with an old representation model \(\phi _{\rm old}\) when the following holds [62]:
    \(\begin{equation} M\big (\phi _{\rm new}^{\mathcal {Q}}, \phi _{\rm old}^{\mathcal {G}} \big) \gt {M} \big (\phi _{\rm old}^{\mathcal {Q}}, \phi _{\rm old}^{\mathcal {G}} \big). \end{equation}\)
    (1)
    Equation (1) represents the Empirical Compatibility Criterion (ECC), where \({M}\) is an evaluation metric specific to the given visual search problem. Notable examples of the metric M can be found in face verification accuracy [24, 30], face verification/identification accuracy in terms of true acceptance rate and false acceptance rate (TAR \(@\) FAR) [27], and person re-identification mean average precision (mAP) [74]. The intuition of these metrics is based on the observation that they can be instantiated with two different representation models \(\phi _{\rm new}\) and \(\phi _{\rm old}\) when considering the query-gallery pair. The specific notation \({M} (\phi _{\rm new}^{\mathcal {Q}}, \phi _{\rm old}^{\mathcal {G}})\) defines the cross-test between the new and the old model, and it represents the case in which \(\phi _{\rm new}\) is used to extract the features of the query-set, \(F_\mathcal {Q}\) , whereas \(\phi _{\rm old}\) is used to extract the gallery-set ones, \(F_\mathcal {G}\) . \({M} (\phi _{\rm old}^{\mathcal {Q}}, \phi _{\rm old}^\mathcal {G})\) is the self-test, and it represents the case in which both query and gallery features are extracted with \(\phi _{\rm old}\) . When the model is trained incrementally on T tasks, Equation (1) is evaluated according to the multi-model ECC introduced by Biondi et al. [4]:
    \(\begin{eqnarray} M \big (\phi _t^\mathcal {Q}, \phi _k^\mathcal {G} \big) \gt M \big (\phi _k^\mathcal {Q}, \phi _k^\mathcal {G} \big) {\rm \quad with \:} t \gt k, \end{eqnarray}\)
    (2)
    where \(t, k \in \lbrace 1,2,\ldots ,T\rbrace\) refer to two different tasks such that task k is processed by the model before task t. The model \(\phi _t\) is compatible with the model \(\phi _k\) , when the cross-test \(M (\phi _t^\mathcal {Q}, \phi _k^\mathcal {G})\) between \(\phi _t\) and \(\phi _k\) is greater than the self-test \(M (\phi _k^\mathcal {Q}, \phi _k^\mathcal {G})\) of the model \(\phi _k\) . The underlying intuition is that if the performance of matching the gallery feature vectors extracted with the old model with the query feature vectors extracted with the new model (i.e., cross-test) is better than the performance of matching the gallery feature vectors with the query feature vectors both extracted with the old model (i.e., self-test), then the system is learning compatible representations. In other words, learning from the new task data improves the representation without breaking the compatibility with the previously learned model. Based on Equation (2), the compatibility matrix C is defined as follows:
    \(\begin{equation} C_{t, k} = {\left\lbrace \begin{array}{ll} M \big (\phi _t^\mathcal {Q}, \phi _k^\mathcal {G} \big) & \text{if } t \gt k \\ M \big (\phi _k^\mathcal {Q}, \phi _k^\mathcal {G} \big) & \text{if } t = k \\ \qquad 0 & \text{if } t \lt k \end{array}\right.}, \end{equation}\)
    (3)
    where the element in row t and column k of the compatibility matrix denotes the evaluation metric M of the model of task t against the model of task k. This definition combines the basic intuition of the classification accuracy matrix R defined elsewhere [15, 34], used to evaluate the CiL problem, with the two specific aspects that distinguish the \(\text{CL}^2\text{R}\) learning setting from the CiL one. Namely, (i) in CiL, at each task the train and test data are sampled from the same distribution, whereas in \(\text{CL}^2\text{R,}\) the test-set classes are sampled from an unknown distribution (i.e., \(\text{CL}^2\text{R}\) addresses the open-set recognition problem); (ii) in CiL, the test-set is dynamic (i.e., it grows by including images from the task distributions), whereas in \(\text{CL}^2\text{R,}\) it is assumed static for the purpose of a reliable evaluation [62]. In the \(\text{CL}^2\text{R}\) setting, a dynamic test-set, as used in CiL, is difficult to define, as there are infinitely many ways to make the gallery dynamic and each of them may unexpectedly change the outcome of the evaluation. We follow Shen et al. [62] and perform the evaluation assuming a static test-set (i.e., a static query-gallery pair). Accordingly, we set the elements of the matrix C with \(t\lt k\) to zero to indicate the impossibility of a reliable evaluation on a growing test-set that should be sampled from an unknown changing distribution. For the remaining elements, the cross-test values are the elements of the matrix with \(t \gt k\) , whereas the self-test values are those of the main diagonal (i.e., when \(t = k\) ). Given a compatibility matrix C, the average compatibility (AC) is defined as follows:
    \(\begin{equation} AC = \frac{2}{T(T-1)} \sum \limits _{1 \le k \lt t \le T}{1\!\!1}{ \Big (M \big (\phi _t^\mathcal {Q}, \phi _k^\mathcal {G} \big) \gt M \big (\phi _k^\mathcal {Q}, \phi _k^\mathcal {G} \big)} \Big), \end{equation}\)
    (4)
    where \({1\!\!1}(\cdot)\) denotes the indicator function. AC summarizes the compatibility matrix values in a single number that quantifies the number of times that compatibility is verified against all possible \(\frac{T(T-1)}{2}\) occurrences.
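    As an illustration, the compatibility matrix of Equation (3) and the AC score of Equation (4) can be computed as in the following sketch, where `metric(query_model, gallery_model)` is assumed to implement the search metric M on a fixed query-gallery pair (e.g., verification accuracy); the function names are illustrative.

```python
import numpy as np

def compatibility_matrix(models, metric):
    """Build C of Eq. (3): cross-tests below the diagonal, self-tests on it."""
    T = len(models)
    C = np.zeros((T, T))
    for t in range(T):
        for k in range(t + 1):
            # model t embeds the queries, model k embedded the gallery
            C[t, k] = metric(models[t], models[k])
    return C

def average_compatibility(C):
    """AC of Eq. (4): fraction of (t, k) pairs, t > k, where cross-test beats self-test."""
    T = C.shape[0]
    hits = sum(C[t, k] > C[k, k] for t in range(T) for k in range(t))
    return 2.0 * hits / (T * (T - 1))
```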

    5.1 Proposed CL2R Metrics

    The work of Díaz-Rodríguez et al. [15] and Lopez-Paz and Ranzato [34] proposes a set of metrics to assess the ability of the learner to transfer knowledge based on a matrix that reports the test classification accuracy of the model on task j after learning task i. Along a similar vein, we present a set of metrics to evaluate the compatibility between representation models in a compatible lifelong learning setting.
    Let \(C \in \mathbb {R}^{ T \times T}\) be the compatibility matrix of Equation (3) for T tasks, and the proposed criteria are the following:
    (1)
    Backward compatibility (BC) measures the gap in compatibility performance between the representation learned at task T with respect to the representation learned at task k with \(k \in \lbrace 1, \ldots , T-1\rbrace\) . When BC \(\lt 0,\) the learning procedure is also influenced by catastrophic forgetting because the performance degrades with newer learned tasks. BC is defined as follows:
    \(\begin{equation} \mbox{ $BC$} = \frac{1}{T-1} \sum _{k=1}^{T-1} \left(C_{T,k} - C_{k,k}\right). \end{equation}\)
    (5)
    (2)
    Forward compatibility (FC) estimates the influence that learning a representation on a task \(k-1\) has on the compatibility performance of the representation learned on a future task k by comparing the cross-test (between models at task k and \(k-1\) ) with respect to the self-test at task k. FC \(\ge 0\) denotes that, on average, the cross-test values are greater than the self-test evaluated on the subsequent tasks, and therefore re-indexing does not necessarily provide improved results. FC is defined as follows:
    \(\begin{equation} \mbox{ $FC$} = \frac{1}{T-1} \sum _{k=2}^{T} \left(C_{k,k-1} - C_{k,k}\right). \end{equation}\)
    (6)
    The intuition behind the definition of this metric comes from noticing that, as the number of tasks increases, the cross-test may turn out better than the self-test. As this is not typically observed when there is no catastrophic forgetting (i.e., when repeatedly training with new and old data), we argue this is due to the joint interaction between the compatibility constraint and catastrophic forgetting. This observation led us to define as “positive” the case in which the cross-test with the previously learned model is higher than the self-test of the current model. This metric is designed to yield high values when a CL2R training procedure is able to positively exploit the joint interaction between feature forgetting and compatible representation.
    From Equations (5) and (6), it can be deduced that BC and FC \(\in [-1,1]\) . Backward compatibility for the first task and forward compatibility for the last task are not defined. The larger these metrics, the better the model. When AC values are comparable, both BC and FC represent two metrics that quantify the positive interaction between search accuracy under catastrophic forgetting and compatibility. This allows evaluating how catastrophic forgetting affects the representation and its compatibility.
    As BC evaluates the relationship between the representations learned at the final task T and the previous ones, it is possible to follow their evolution during CL2R training. Accordingly, we define the backward compatibility at task t as \(BC{(t)} = \frac{1}{t-1} \sum _{k=1}^{t-1} (C_{t,k} - C_{k,k}), \; {\rm with } \; t \gt 1,\) where \(t \in \lbrace 1, 2, \ldots , T\rbrace\) . This represents the average of the element-wise difference between the t-th row and the first \(t-1\) elements of the main diagonal of the compatibility matrix.
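    The following sketch computes BC (Equation (5)), FC (Equation (6)), and BC(t) from a compatibility matrix C built as above (0-based task indices); it is an illustrative implementation of the definitions rather than part of the released code.

```python
import numpy as np

def backward_compatibility(C, t=None):
    """BC of Eq. (5); BC(t) when t (1-based, t > 1) is given, BC(T) otherwise."""
    T = C.shape[0]
    t = T if t is None else t
    return float(np.mean([C[t - 1, k] - C[k, k] for k in range(t - 1)]))

def forward_compatibility(C):
    """FC of Eq. (6): average gap between C_{k,k-1} and C_{k,k} over k = 2..T."""
    return float(np.mean([C[k, k - 1] - C[k, k] for k in range(1, C.shape[0])]))
```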

    6 Proposed CL2R Training

    To achieve compatibility, we encourage global and local stationarity in the feature representation.
    Global stationarity is encouraged according to the approach described in the work of Pernici et al. [46], in which features are learned to follow a set of special fixed classifier prototypes. Pernici et al. [46] impose global stationarity using a classifier whose prototypes cannot be trained (i.e., they are fixed) and are set before training. Under this condition, only the direction of the features aligns toward the fixed directions of the classifier prototypes and not the opposite. This constraint forces the learned features to follow their corresponding fixed prototypes, therefore encouraging representation stationarity. The functionality of the missing trainable classifier is essentially taken over by the previous layers. The fixed prototypes are set according to the coordinate vertices of a d-Simplex regular polytope that, in addition to stationarity, allows maximally separated features to be learned [44, 45].
    We take advantage of this result and perform CiL as a surrogate task to learn a stationary feature representation and achieve compatibility. More formally, let \(\mathbf {W} \; \forall t \in \lbrace 1, 2, \ldots , T\rbrace\) be the d-Simplex fixed classifier; we instantiate the CiL problem as \(\sigma (\phi _t \circ \mathbf {W})\) , where \(\sigma\) indicates the softmax function, and perform learning according to incremental fine-tuning. The evolving training-set \(\mathcal {T}_t \leftarrow \mathcal {M}_{t} \cup \mathcal {D}_t\) is computed according to a rehearsal-based strategy using the episodic memory \(\mathcal {M}_{t}\) , which contains an updating set of samples from \(\lbrace \mathcal {D}_1, \ldots , \mathcal {D}_{t-1} \rbrace\) . The memory is updated as \(\mathcal {M}_{t+1} \leftarrow \mathcal {M}_{t} \cup {\rm S}{\rm\small{AMPLING}}(\mathcal {D}_{t})\) . The loss optimized in the work of Pernici et al. [46] is adapted to CL2R training as follows:
    \(\begin{eqnarray} \mathcal {L}_t= -\dfrac{1}{|\mathcal {T}_{t}|} \sum \limits _{(\mathbf {x}_i, y_i) \in \mathcal {T}_{t}} \log \! \left(\dfrac{\exp { \big ({\mathbf {w}}_{y_i}^{\top }\cdot {\phi (\mathbf {x}_i)} }\big)}{\sum \nolimits _{\scriptscriptstyle j \in K_s} \exp \big ({ {\mathbf {w}}_{j}^{\top }\cdot {\phi (\mathbf {x}_i)} }\big) + \sum \nolimits _{\scriptscriptstyle j \in K_u} \exp {\big ({\mathbf {w}}_{j}^{\top }\cdot {\phi (\mathbf {x}_i)} \big) }} \right) , \end{eqnarray}\)
    (7)
    where \(K_s\) is the set of classes learned up to time t, \(|\mathcal {T}_{t}|\) is the number of elements in the training-set, \(K_u\) is the set of the outputs of the classifier that have not yet been assigned to classes at time t (i.e., future unseen classes [47]), \(\mathbf {w}^{\top }_{(\cdot)}\) is a class prototype of the fixed classifier \(\mathbf {W}\) , and \(y_i\) is the supervising label. In particular, \(\mathbf {W}\) is the weight matrix of the fixed classifier, which does not undergo learning during model training. In the work of Pernici et al. [46], the d-Simplex prototypes are defined as \(\mathbf {W} = \lbrace e_1,e_2,\dots ,e_{d-1}, \alpha \sum _{i=1}^{d-1} e_i \rbrace ,\) where d is the feature dimensionality of the d-Simplex, \(\alpha =\frac{1-\sqrt {d+1}}{d}\) , and \(e_i\) denotes the standard basis in \(\mathbb {R}^{d-1}\) , with \(i \in \lbrace 1,2, \dots , d-1\rbrace\) .
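    For illustration, the fixed d-Simplex prototypes defined above can be constructed as in the following sketch; the function name is illustrative, and the prototypes are kept frozen (no gradient updates) during training.

```python
import torch

def d_simplex_prototypes(d):
    """Return a (d, d-1) matrix whose rows are the fixed class prototypes:
    the d-1 standard basis vectors of R^{d-1} plus alpha * sum_i e_i,
    with alpha = (1 - sqrt(d + 1)) / d, as in Pernici et al. [46]."""
    alpha = (1.0 - (d + 1) ** 0.5) / d
    eye = torch.eye(d - 1)                       # e_1, ..., e_{d-1}
    last = alpha * eye.sum(dim=0, keepdim=True)  # alpha * sum_i e_i
    return torch.cat([eye, last], dim=0)

W = d_simplex_prototypes(100)   # e.g., 100 maximally separated prototypes in R^99
W.requires_grad_(False)         # the classifier is fixed and never trained
```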
    The loss of Equation (7) imposes global stationarity and does not require any knowledge to be extracted from the previously learned models. However, catastrophic forgetting causes misalignment between features and fixed classifier prototypes. Therefore, we further impose additional stationarity constraints in a local neighborhood of a feature by encouraging the current model to mimic the feature representation of the model previously learned. This allows the overall stationarity to also be determined by a local learning mechanism interacting with the global one provided by the d-Simplex classifier of Equation (7). The global-to-local interaction is achieved through the FD loss [56]. Differently from the more common practice of FD in CiL [16, 22, 26] in which each mini-batch is sampled from both the episodic memory and the current task (i.e., \(\mathcal {T}_t \leftarrow \mathcal {M}_{t} \cup \mathcal {D}_t\) ), we evaluate the FD loss, at each task t, only on the samples stored in episodic memory \(\mathcal {M}_t\) observed from previous tasks:
    \(\begin{equation} \mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}= \frac{1}{|\mathcal {M}_{t}|} \sum _{\mathbf {x}_i \in \mathcal {M}_{t}} \left(1 - \frac{\phi _{t}(\mathbf {x}_i) \cdot \phi _{t-1}(\mathbf {x}_i)}{\left\Vert \phi _{t}(\mathbf {x}_i)\right\Vert \left\Vert \phi _{t-1}(\mathbf {x}_i)\right\Vert } \right), \end{equation}\)
    (8)
    where \(\phi _{t-1}\) is the model learned in the previous task. This encourages local stationarity and stability only for the previous classes in the episodic memory, and the assimilation of new knowledge (plasticity) only from the classes of the current task. As confirmed by the ablation in Section 8, this learning strategy leads to a significant performance improvement. The final optimized loss function is the sum of Equations (7) and (8):
    \(\begin{equation} \mathcal {L} = \mathcal {L}_{t} + \lambda \; \mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}, \end{equation}\)
    (9)
    where \(\lambda\) balances the contribution of global and local alignment provided by the two losses. The pseudo-code in Algorithms 1 and 2 details our training procedure and its application to visual search, respectively.
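    The following is a minimal sketch of one training step combining Equations (7) through (9) on a mini-batch, assuming a fixed prototype matrix W as above, the frozen previous model phi_prev, and a boolean mask marking which samples of the batch come from the episodic memory; all names are illustrative assumptions, and the softmax is taken over all prototypes (seen and unseen classes), as in Equation (7).

```python
import torch
import torch.nn.functional as F

def cl2r_step(phi_t, phi_prev, W, images, labels, in_memory, lam):
    """One CL2R loss evaluation: global stationarity (Eq. (7)) + memory-only FD (Eq. (8))."""
    feats = phi_t(images)                          # (B, d-1) current features
    logits = feats @ W.t()                         # fixed d-Simplex classifier outputs
    loss_global = F.cross_entropy(logits, labels)  # softmax over seen + unseen prototypes

    loss_local = feats.new_zeros(())
    if in_memory.any():                            # FD only on episodic-memory samples
        with torch.no_grad():
            old_feats = phi_prev(images[in_memory])
        cos = F.cosine_similarity(feats[in_memory], old_feats, dim=1)
        loss_local = (1.0 - cos).mean()

    return loss_global + lam * loss_local          # Eq. (9)
```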

    7 Experimental Results

    7.1 Datasets and Verification Protocol

    We compare our proposed CL2R training procedure and the baseline methods on several benchmarks: CIFAR10 [28], ImageNet20,1 ImageNet100 [22, 53, 71], Labeled Faces in the Wild (LFW) [24], and IJB-C [36]. Evaluation is performed in the open-set 1:1 search problem, with verification accuracy as the performance metric M in Equations (1) and (2) for all datasets except IJB-C, for which the true acceptance rate at a given false acceptance rate (TAR@FAR) is used. They are defined as \(\text{TAR} = {\text{TP}}/{(\text{TP} + \text{FN})}\) , \(\text{FAR} = {\text{FP}}/{(\text{FP} + \text{TN})}\) , and \(\text{ACC} = {(\text{TP} + \text{TN})}/{(\text{TP} + \text{TN} + \text{FP} + \text{FN})}\) , where TP, TN, FP, and FN indicate true positives, true negatives, false positives, and false negatives, respectively [27, 58]. Following the verification protocol defined in the work of Huang et al. [24], we generate a set of pairs of images that do or do not belong to the same class. A pair is verified on the basis of the distance between the feature vectors of its samples. During the evaluation of task t, \(\phi _t\) is used to extract the feature representation for the first image of each pair (i.e., the query-set) and \(\phi _k\) , with \(k \in \lbrace 1, \ldots , t\rbrace\) , is used to extract the feature representation for the second image (i.e., the gallery-set). When \(k=t\) , the compatibility test is the self-test; otherwise it is the cross-test between the two representations learned from the tasks at times t and k. For the LFW and IJB-C evaluation, we use the original pairs provided by the respective datasets; for the CIFAR10, ImageNet20, and ImageNet100 evaluation, the verification pairs are randomly generated. As the open-set evaluation requires no overlap between the classes of the training-set and test-set, we use CIFAR100 to perform CiL (i.e., classification is the surrogate task from which the feature representation is learned) and the CIFAR10 pairs are used as the verification test-set. Similarly, Tiny-ImageNet200 [29] is used as the training-set to evaluate the ImageNet20 pairs; LFW and IJB-C pairs are evaluated with models trained on CASIA-WebFace [75]. Finally, for ImageNet100, we train the models with images not included in ImageNet100 (i.e., the subset of the images of the remaining 900 classes, which we named ImageNet900). These datasets are divided into tasks as described in Section 7.2.
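    As an illustrative sketch of the cross-test evaluation described above, the first image of each pair is embedded with \(\phi_t\) (query side) and the second with \(\phi_k\) (gallery side); a pair is accepted as “same” when the cosine distance falls below a threshold. Threshold selection and pair generation are omitted, and all names are assumptions rather than the released evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def verification_metrics(phi_t, phi_k, imgs_a, imgs_b, same, thr):
    """Compute TAR, FAR, and ACC for pairs (imgs_a[i], imgs_b[i]) with labels same[i]."""
    fa = F.normalize(phi_t(imgs_a), dim=1)     # query-side features (new model)
    fb = F.normalize(phi_k(imgs_b), dim=1)     # gallery-side features (old model)
    accept = (1.0 - (fa * fb).sum(dim=1)) < thr
    tp = (accept & same).sum().item()
    fp = (accept & ~same).sum().item()
    tn = (~accept & ~same).sum().item()
    fn = (~accept & same).sum().item()
    tar = tp / max(tp + fn, 1)                 # TAR = TP / (TP + FN)
    far = fp / max(fp + tn, 1)                 # FAR = FP / (FP + TN)
    acc = (tp + tn) / max(tp + tn + fp + fn, 1)
    return tar, far, acc
```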

    7.2 Implementation Details

    Our CL2R training procedure is implemented in PyTorch [42] and uses the publicly available library Continuum [17]. We used four NVIDIA Tesla A100s to train the representation models, and the neural network architectures are based on the PODNet implementation.2 The evaluation is carried out on several ResNet [20] architectures. Specifically, a 32-, 18-, and 50-layer ResNet is used for CIFAR10, ImageNet20 and ImageNet100, and LFW and IJB-C, respectively. As is typically used in CiL [22, 71], the episodic memory \(\mathcal {M}\) contains 20 samples for each class. The value of \(\lambda\) in Equation (9) is set to \(\lambda = \lambda _{\rm base} \sqrt {{k_n}/{k_0}}\) [22], in which \(\lambda _{\rm base}\) is a scalar, \(k_n\) is the number of classes of the current task, and \(k_0\) is the number of old classes in the episodic memory. The training details for each dataset are listed next.
    CIFAR100 and CIFAR10. We train the model for 70 epochs for each task with batch size 128, and optimization is performed with SGD with an initial learning rate of 0.1 and weight decay of \(2\cdot 10^{-4}\) . The learning rate is divided by 10 at epochs 50 and 64. The input images are RGB, \(32 \times 32\) . \(\lambda _{\rm base}\) is set to 5.
    Tiny-ImageNet200 and ImageNet20. We train the model for 90 epochs at each task with batch size 256, and optimization is performed with SGD with an initial learning rate of 0.1 and a weight decay of \(2\cdot 10^{-4}\) . The learning rate is divided by 10 at epochs 30 and 60. To properly evaluate the models in this learning setting, input images and the ImageNet test images are resized to match the Tiny-ImageNet200 input size (RGB \(64 \times 64\) ). \(\lambda _{\rm base}\) is set to 5.
    ImageNet900 and ImageNet100. We train the model for 90 epochs in each task with batch size 256, and optimization is performed with SGD with an initial learning rate of 0.1 and weight decay of \(2\cdot 10^{-4}\) . The learning rate is divided by 10 at epochs 30 and 60. The input images are RGB, \(224 \times 224\) . \(\lambda _{\rm base}\) is set to 10.
    CASIA-WebFace and LFW/IJB-C. For each task, we train the model for 120 epochs with batch size 1,024. Optimization is carried out with SGD with an initial learning rate of 0.1 and a weight decay of \(2\cdot 10^{-4}\) . The learning rate is divided by 10 at epochs 30, 60, and 90. The input images are RGB, \(112 \times 112\) . \(\lambda _{\rm base}\) is set to 10.
    In Table 1, we summarize the datasets and the training details of our experiments.
    Table 1.
    network   | input size          | training-set dataset | # classes | test-set dataset | # pairs
    ResNet-32 | \(32 \times 32\)    | CIFAR100             | 100       | CIFAR10          | 6k
    ResNet-18 | \(64 \times 64\)    | Tiny-ImageNet200     | 200       | ImageNet20       | 6k
    ResNet-18 | \(224 \times 224\)  | ImageNet900          | 900       | ImageNet100      | 6k
    ResNet-50 | \(112 \times 112\)  | CASIA-WebFace        | 10,575    | LFW              | 6k
    ResNet-50 | \(112 \times 112\)  | CASIA-WebFace        | 10,575    | IJB-C            | 15M
    Table 1. Datasets Used in CL2R Training Procedures
    Training-set and test-set of the same configuration have non-overlapping classes to properly evaluate the different approaches in an open-set setup.

    7.3 Baselines and Compared Methods

    We compare our training procedure with both the CiL methods and the recently proposed methods for compatible learning. Our baselines include LwF [32], LUCIR [22], BiC [71], PODNet [16], FOSTER [69], FAN [25], and BCT [62]. In particular, FAN and BCT are the only approaches with an explicit mechanism to address feature compatibility. We adapted FAN so that the learned adaptation functions are used to transform the features into compatible features. Since in BCT the model is trained from scratch at each task using all available data, for a fair comparison, we also re-implemented it with an episodic memory and refer to it as lifelong-BCT ( \(\ell\) -BCT). At each task, the model is initialized with the parameters of the model of the previous task and the data of the previous tasks can be accessed only through the episodic memory. For LwF, BiC, and PODNet, we use their publicly available implementations,2 whereas for LUCIR and FOSTER, we adopted their official implementations.3 Finally, we also include a traditional Experience Replay (ER)-based baseline, where the model is continuously fine-tuned as new tasks become available. To evaluate our training procedure without considering the catastrophic forgetting phenomenon, we define as upper bound (UB) our training procedure re-trained from scratch at each task using an episodic memory with infinite size.

    7.4 Evaluation on CIFAR10

    In this section, we report the experiments in 2-, 3-, 5-, and 10-task CL2R settings with models trained on CIFAR100 (i.e., using 50, 33, 20, and 10 classes per task) where compatibility is evaluated on the CIFAR10 generated pairs.
    In Table 2, we summarize the performance of our CL2R training procedure with respect to the other baselines in the two-task scenario. We evaluate the compatibility of the updated model according to the ECC (Equation (1)), BC (Equation (5)), and FC (Equation (6)). The first row of Table 2 reports the verification accuracy of the model trained on the first 50 classes of CIFAR100. Experiments show that, among the methods compared, LUCIR and PODNet may have an inherent, although limited, level of compatible representations. This substantially confirms the importance of having some form of mechanism to preserve the local geometry of the learned features. Our training procedure achieves the highest cross-test, BC, and FC, thus being the most suitable training procedure to avoid re-indexing.
    Table 2.
    method        | self-test | cross-test | ECC        | BC     | FC
    Initial Task  | 0.65      |            |            |        |
    ER            | 0.64      | 0.62       | \(\times\) | –0.034 | –0.210
    LwF           | 0.64      | 0.64       | \(\times\) | –0.009 | 0.002
    BiC           | 0.66      | 0.63       | \(\times\) | –0.015 | –0.028
    LUCIR         | 0.70      | 0.66       | \(\surd\)  | –0.012 | –0.038
    FAN           | 0.66      | 0.63       | \(\times\) | –0.023 | –0.035
    FOSTER        | 0.66      | 0.57       | \(\times\) | –0.080 | –0.090
    \(\ell\) -BCT | 0.65      | 0.60       | \(\times\) | –0.047 | –0.044
    PODNet        | 0.67      | 0.66       | \(\surd\)  | 0.014  | –0.013
    Ours          | 0.66      | 0.67       | \(\surd\)  | 0.017  | 0.006
    BCT*          | 0.72      | 0.65       | \(\surd\)  | –0.003 | –0.071
    Ours (UB)*    | 0.73      | 0.69       | \(\surd\)  | –0.039 | –0.040
    Table 2. CIFAR10 Evaluation
    Two-task CL2R setting with models trained on CIFAR100. Initial Task (i.e., the previous task) shows the verification accuracy on the first 50 classes, and the other rows represent the performance obtained after two tasks.
    *Not subject to catastrophic forgetting.
    In the last rows of the table, we report the performance of the BCT and our UB that are not affected by catastrophic forgetting. The effect of catastrophic forgetting and its implications on the reduction of performance in compatibility can be observed in the self-test, as these values are significantly higher than the values reported by the methods learned using CiL.
    In Table 3, results for the scenario of 3-, 5-, and 10-task CL2R are presented. For each experiment, we report AC (Equation (4)), BC (Equation (5)), and FC (Equation (6)). As can be noticed, our method always achieves the highest AC, thus obtaining the largest number of compatible representations between models, and always achieves the highest BC between methods that are subject to catastrophic forgetting. FAN achieves almost the same performance as our procedure in the 3-task scenario, but when the number of tasks increases, it has a significant decrease in performance, especially in the 10-task setting. This may be due to the increasing number of adaptation functions between different feature spaces that FAN uses to adapt old features with respect to the new ones. As can be noticed from the two tables, FOSTER does not learn compatible features. This may be due to the fact that feature space compression forces the representation to change abruptly reducing the overall compatibility with previous models. BCT reports higher values since its representation is learned from scratch for each new task. Compared to the UB, our training procedure achieves lower AC and BC, and this is due to the influence of catastrophic forgetting. From the table, it can also be noticed that BiC, LUCIR, and PODNet do not satisfy compatibility when catastrophic forgetting is more severe, as, for example, in the case of 10 tasks. Overall, these results suggest that the interaction between local and global stationarity promoted by our training procedure shows a significant improvement in performance that FD alone cannot provide.
    Table 3.
    Table 3. Evaluation of CIFAR10

    7.5 Evaluation on ImageNet

    In this section, we conducted the experiments with models trained on Tiny-ImageNet200 in CL2R settings with 2 (Table 4), 3, 5, and 10 (Table 5) tasks.
    Table 4.
    method        | self-test | cross-test | ECC        | BC     | FC
    Initial Task  | 0.61      |            |            |        |
    ER            | 0.62      | 0.59       | \(\times\) | −0.012 | −0.028
    LwF           | 0.63      | 0.60       | \(\times\) | −0.007 | −0.032
    BiC           | 0.60      | 0.61       | \(\times\) | −0.001 | 0.005
    LUCIR         | 0.60      | 0.62       | \(\surd\)  | 0.012  | 0.015
    FAN           | 0.61      | 0.62       | \(\surd\)  | 0.008  | 0.009
    \(\ell\) -BCT | 0.61      | 0.57       | \(\times\) | −0.042 | −0.038
    Ours          | 0.61      | 0.63       | \(\surd\)  | 0.017  | 0.015
    BCT*          | 0.65      | 0.64       | \(\surd\)  | 0.026  | −0.05
    Ours (UB)*    | 0.66      | 0.64       | \(\surd\)  | 0.031  | −0.018
    Table 4. ImageNet20 Evaluation
    The two-task CL2R setting with models trained on Tiny-ImageNet200. The Initial Task (i.e., the previous task) shows verification accuracy on the first 100 classes, and the other rows represent the performance obtained after two tasks.
    *Not subject to catastrophic forgetting.
    Table 5.
    Table 5. ImageNet20 Evaluation
    Table 4 follows the same structure as Table 2, showing the ECC (Equation (1)), BC (Equation (5)), and FC (Equation (6)) values. For all compared methods, the initial model (i.e., the previous model) is trained on the first 100 classes of Tiny-ImageNet200. As can be seen in the table, our method achieves the best performance. However, albeit with low values, other methods such as FAN and LUCIR show a certain level of compatibility, which again confirms that the distillation they are equipped with is a useful tool to support learning compatible features. As also observed in the CIFAR results, methods not subject to catastrophic forgetting (i.e., BCT and our UB) achieve higher BC and lower FC.
    Table 5 shows the 3-, 5-, and 10-task CL2R settings for Tiny-ImageNet200. In these learning scenarios, each task is made up of 66, 40, and 20 classes, respectively. In this table, we discuss the results by analyzing the values of AC (Equation (4)), BC (Equation (5)), and FC (Equation (6)). Our approach always achieves the highest value of AC. In particular, ER, LwF, BiC, FAN, and \(\ell\) -BCT do not achieve lifelong-compatible representation in the 3-task setting, as shown by AC = 0. In the 10-task CL2R setting, it is more evident that as the number of tasks increases, methods without any specific mechanism to preserve the representation typically cannot learn compatible representations. LUCIR, BiC, and \(\ell\) -BCT obtain significantly lower values than our method. Specifically, our AC performance is more than twice that of BCT, which means that our CL2R procedure obtains more than twice as many compatible representations as BCT. This may be caused by the fact that the constraints imposed by these techniques on the learned representation seem to have very little effect on its stationarity, and consequently on its compatibility. The results on the 10-task setting are also important, as they suggest that catastrophic forgetting is not an intrinsic impediment to learning compatible representations. The performance difference of 0.11 in AC with respect to the UB can be considered clear evidence of this effect. Finally, the table shows that our training procedure provides the highest FC and is the only case where FC is always positive. As a result, our training procedure achieves, on average, cross-tests higher than self-tests, indicating that the system performs better even without re-indexing the gallery.
    Table 6 reports the ImageNet100 results when models are trained on ImageNet900 with two and three tasks. We compare our approach with the \(\ell\) -BCT method, as it achieves reasonable performance and has an explicit mechanism to learn compatible features under catastrophic forgetting. As can be noticed from the table, our CL2R training clearly outperforms \(\ell\) -BCT. Our method achieves good AC scores in both scenarios. The reduced performance of \(\ell\) -BCT appears to be connected to the fact that its training procedure is only based on pairwise model training (i.e., compatibility is only learned from the previous model). In contrast, our method is not based only on pairwise learning and does not use previous classifiers, which may be incorrectly learned.
    Table 6.
    method        | two tasks AC / BC / FC | three tasks AC / BC / FC
    \(\ell\) -BCT | 0 / −0.127 / −0.101    | 0.00 / −0.073 / 0.006
    Ours          | 1.00 / 0.005 / −0.009  | 0.67 / 0.019 / −0.011
    Table 6. ImageNet100 Evaluation
    The two- and three-task CL2R settings with models trained on ImageNet900. We compare our training procedure and \(\ell\) -BCT reporting AC, BC, and FC.

    7.6 Face Verification

    In this section, we report the experimental results on the LFW and IJB-C benchmarks in 2-, 3-, 5-, and 10-task CL2R settings. We incrementally train the representation models on CASIA-WebFace, resulting in tasks composed of 5,287, 3,525, 2,115, and 1,057 classes, respectively.
    The results are summarized in Tables 7 and 8 for LFW and IJB-C, respectively. In particular, for IJB-C, we report AC, BC, and FC at different false acceptance rates (FAR): \(10^{-1}\) , \(10^{-2}\) , and \(10^{-4}\) . In this evaluation, we do not report LUCIR when training on CASIA-WebFace due to the excessive memory requirements of the original implementation.3 Although in the 2-task scenario results comparable to those of \(\ell\) -BCT are observed, in the 3- and 5-task settings, our training procedure achieves complete compatibility, resulting in AC = 1 and BC always positive. In the 10-task setting, the difference in performance increases more significantly, confirming a clear overall positive performance. Generally, the reported performances are higher on face datasets than on CIFAR10, ImageNet20, and ImageNet100. A possible reason may be that in face recognition the domain shift between classes is lower than for CIFAR or ImageNet. Finally, this experiment shows that the proposed method is effective not only with a larger number of model updates but also with larger datasets.
    Table 7.
    Method        | two tasks AC / BC / FC | three tasks AC / BC / FC | five tasks AC / BC / FC | ten tasks AC / BC / FC
    \(\ell\) -BCT | 1.00 / 0.005 / −0.010  | 0.67 / −0.007 / −0.005   | 0.40 / −0.002 / −0.015  | 0.31 / −0.002 / −0.010
    Ours          | 1.00 / 0.003 / −0.001  | 1.00 / 0.004 / −0.005    | 1.00 / 0.006 / −0.005   | 0.82 / 0.006 / −0.005
    Table 7. Face Verification on the LFW Dataset
    The 2-, 3-, 5-, and 10-task CL2R settings with models trained on CASIA-WebFace. We compare our training procedure and \(\ell\) -BCT reporting BC (Equation (5)), FC (Equation (6)), and AC (Equation (4)), which corresponds to the ECC (Equation (1)) when evaluated in two tasks.
    Table 8.
    FAR          | Method        | two tasks AC / BC / FC | three tasks AC / BC / FC | five tasks AC / BC / FC | ten tasks AC / BC / FC
    10 \(^{-1}\) | \(\ell\) -BCT | 1.00 / 0.002 / −0.010  | 0.33 / −0.007 / −0.017   | 0.20 / −0.031 / −0.028  | 0.22 / −0.029 / −0.029
                 | Ours          | 1.00 / 0.005 / −0.009  | 1.00 / 0.004 / −0.006    | 0.80 / 0.001 / −0.008   | 0.76 / 0.011 / −0.002
    10 \(^{-2}\) | \(\ell\) -BCT | 0 / −0.026 / −0.015    | 0.33 / −0.011 / −0.010   | 0.10 / −0.038 / −0.025  | 0.09 / −0.020 / −0.034
                 | Ours          | 1.00 / 0.005 / −0.017  | 1.00 / 0.010 / 0.009     | 0.80 / 0.008 / −0.014   | 0.73 / 0.010 / −0.003
    10 \(^{-4}\) | \(\ell\) -BCT | 0 / −0.012 / −0.004    | 0.33 / −0.010 / −0.012   | 0 / −0.041 / −0.028     | 0.09 / −0.016 / −0.009
                 | Ours          | 1.00 / 0.023 / 0.005   | 0.67 / 0.002 / 0.005     | 0.80 / 0.001 / −0.003   | 0.73 / 0.012 / 0.007
    Table 8. Face Verification on the IJB-C Dataset
    The 2-, 3-, 5-, and 10-task CL2R settings with models trained on CASIA-WebFace. We report \(\text{($AC$, $BC$, $FC$)@FAR}=\) \({10}^{-1}, {10}^{-2},\ {\mathrm{and}}\ {10}^{-4}\) and compare our training procedure with \(\ell\) -BCT.

    7.7 Compatibility and Catastrophic Forgetting

    In this section, we study how compatibility is related to the problem of catastrophic forgetting. In Figure 2, we show the evolution of BC in the 5- and 10-task CL2R scenarios. In particular, Figure 2(a) and (b) and Figure 2(c) and (d) show the evaluations on the CIFAR10 and ImageNet20 datasets, respectively. We compare our approach with ER, LwF, BiC, LUCIR, FAN, and \(\ell\) -BCT. As can be observed, our training procedure achieves the highest performance. As our BC metric is, on average, closer to zero than that of the other evaluated methods, the representation learned by our training procedure can be considered the most compatible and, from the perspective of visual search, equivalent to the representation models learned from previous tasks. More practically, this allows for a reduction of the computational cost of re-indexing.
    Fig. 2.
    Fig. 2. Backward compatibility evolution across tasks t (i.e., \(BC(t)\) ). Comparison between our CL2R training and other methods in 5- and 10-task learning setups. (a, b) CIFAR10 results. (c, d) ImageNet20 results.
    In contrast, FAN achieves a negative value of BC in all four settings, confirming that the composition of an increasing number of feature adaptation functions between sequentially learned representations causes a decrease in compatibility. Even in the absence of a considerable performance loss, as in the case of FAN, negative BC values indicate a steady deterioration in performance as the number of tasks increases. In general, the figure shows that all methods except ours follow a common trend of decreasing performance.

    8 Ablation Studies

    We analyze by ablation the main components of our training procedure. The ablation is performed on the CIFAR100 dataset as described in Section 7.4 and considers the 10-task CL2R setting, which can be regarded as a worst-case scenario for this dataset. We analyze the impact of (i) the specific classifier, Trainable vs. fixed d-Simplex, with or without the FD component, (ii) how the FD loss is evaluated, and (iii) the sensitivity to the number of samples reserved per class in the episodic memory.
    Impact of the d-Simplex fixed classifier and FD. As can be noticed from Table 9, the Trainable classifier is not able to learn compatible representations. When combined with FD, the performance improves only marginally and not enough to be compared with the CiL approaches shown in Table 3. FD evaluated only on the samples stored in episodic memory, as defined in \(\mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}\) (Equation (8)), improves the values of the reported metrics, showing a better supervision signal for the updated model. The d-Simplex alone improves on the previous components, obtaining AC = 0.27 and FC = 0.003, which are higher than the Trainable classifier with \(\mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}\) . This underlines the importance of preserving the global geometry of the learned features according to the d-Simplex fixed classifier.
    Table 9.
    classifier | distillation | ten tasks
    Trainable | Fixed | \(\mathcal{L}_{\scriptscriptstyle\textrm{FD}}\) | \(\mathcal{L}_{\scriptscriptstyle\textrm{FD}}^{\scriptscriptstyle\mathcal{M}}\) | AC | BC | FC
    \(\surd\) | | | | 0.04 | −0.130 | −0.083
    \(\surd\) | | \(\surd\) | | 0.13 | −0.050 | −0.049
    \(\surd\) | | | \(\surd\) | 0.22 | −0.043 | −0.013
     | \(\surd\) | | | 0.27 | −0.078 | 0.003
     | \(\surd\) | \(\surd\) | | 0.40 | −0.019 | −0.011
     | \(\surd\) | | \(\surd\) | 0.44 | −0.003 | 0.005
    Table 9. Ablation of the Different Main Components of Our CL2R Training Procedure
    The evaluation is performed on CIFAR10 and training is based on CIFAR100 with 10 tasks. Trainable indicates the traditional ER baseline, Fixed indicates ER with stationary features learned from Equation (7) according to the fixed d-Simplex classifier, \(\mathcal{L}_{\scriptscriptstyle\textrm{FD}}\) is the traditional FD, and \(\mathcal{L}_{\scriptscriptstyle\textrm{FD}}^{\scriptscriptstyle\mathcal{M}}\) is the FD evaluated only on the samples stored in the episodic memory, as defined in Equation (8).
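    For concreteness, a minimal sketch of a fixed d-Simplex classifier, the component ablated in the rows above, is shown next. The class prototypes are placed at mutually equidistant, unit-norm positions and are never updated, which is what enforces the global stationarity of the features. The exact construction and feature dimensionality used in the article follow the regular polytope networks of [46]; the centered-basis construction below is only one simple way to obtain such fixed prototypes.

```python
import torch
import torch.nn as nn

def d_simplex_weights(num_classes: int) -> torch.Tensor:
    """K unit-norm, pairwise-equidistant class prototypes (regular simplex).

    Built by centering the standard basis of R^K; the vertices span a
    (K-1)-dimensional subspace, so the feature dimension in this sketch is K.
    """
    eye = torch.eye(num_classes)
    vertices = eye - eye.mean(dim=0, keepdim=True)    # center the basis vectors
    return nn.functional.normalize(vertices, dim=1)   # one prototype per row

class FixedSimplexClassifier(nn.Module):
    """Linear classifier whose weights are frozen at the simplex vertices."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Registered as a buffer, not a parameter: the prototypes never move.
        self.register_buffer("weight", d_simplex_weights(num_classes))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Logits are similarities to the fixed prototypes; since the class
        # geometry never changes across tasks, the feature layout stays
        # stationary as new tasks are learned.
        return features @ self.weight.T
```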
    Impact of memory samples on FD. Table 9 shows that when the distillation loss is evaluated only on the samples stored in the episodic memory, \(\mathcal{L}_{\scriptscriptstyle\textrm{FD}}^{\scriptscriptstyle\mathcal{M}}\) (Equation (8)), our approach achieves better overall results. We argue that this positive effect is mostly due to the interaction between the global feature stationarity learned through the fixed classifier and the local stationarity promoted through FD on the samples observed in the episodic memory. The interaction is most likely related to the fact that the fixed d-Simplex classifier generally prevents novel classes from interfering with the feature space of the already learned ones. This in turn provides favorable working conditions (i.e., a kind of coarse pre-alignment) for the distillation loss to align features with respect to the previous model. As expected, the effect is stronger when the loss is evaluated only on already known classes, since the alignment is less exposed to noisy features of unseen classes that may degrade it. This confirms the effectiveness of restricting FD to memory samples, in contrast to the traditional FD commonly used in CiL.
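    A minimal sketch of a feature distillation term restricted to episodic-memory samples is given below. The actual \(\mathcal{L}_{\scriptscriptstyle\textrm{FD}}^{\scriptscriptstyle\mathcal{M}}\) is defined in Equation (8); the L2 penalty between old- and new-model features used here is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def feature_distillation_on_memory(new_model, old_model, memory_images,
                                   device="cpu"):
    """L2 feature distillation computed only on episodic-memory samples.

    memory_images: images replayed from the episodic memory (already-seen
    classes). New-task images are deliberately excluded, so the alignment
    signal is not polluted by features of classes the old model never saw.
    """
    images = memory_images.to(device)
    with torch.no_grad():
        old_feats = old_model(images)             # frozen, previous-task model
    new_feats = new_model(images)
    return F.mse_loss(new_feats, old_feats)

# Hypothetical use inside the task-t training loop (names are placeholders):
# loss = ce_loss + lambda_fd * feature_distillation_on_memory(model, prev_model,
#                                                             mem_images)
```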
    Impact of the episodic memory size. Figure 3 shows the effect of different numbers of reserved samples per class for both our learning procedure and the other baselines. As expected, the more samples per class are reserved in the episodic memory, the better the performance. Our approach with 20 samples per class achieves results similar to those obtained by the other methods with more samples per class. Although ER, LUCIR, and FAN show a larger relative improvement with 50 samples per class, our approach overall achieves the highest performance in learning compatible features.
    Fig. 3. The effect of the number of reserved samples per class in the episodic memory.
    We also evaluated the methods in the challenging memory-free training setting (i.e., without the episodic memory). Our training procedure achieves the highest results also in this condition, underlining the fact that CiL methods typically do not have an inherent mechanism to learn compatible features.
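    As a reference for the memory budget discussed above, the sketch below reserves a fixed number of exemplars per class. The article does not prescribe this particular selection rule, so the random per-class sampling is an assumption; herding or other exemplar-selection strategies could be substituted.

```python
import random
from collections import defaultdict

class EpisodicMemory:
    """Stores up to `per_class` exemplars for each class seen so far."""
    def __init__(self, per_class: int = 20, seed: int = 0):
        self.per_class = per_class
        self.buffer = defaultdict(list)           # class id -> list of samples
        self.rng = random.Random(seed)

    def add_task(self, samples, labels):
        """Reserve exemplars for the classes of the task just learned."""
        by_class = defaultdict(list)
        for x, y in zip(samples, labels):
            by_class[y].append(x)
        for y, xs in by_class.items():
            k = min(self.per_class, len(xs))
            self.buffer[y] = self.rng.sample(xs, k)

    def replay(self):
        """All stored (sample, label) pairs, e.g., for ER or for the
        memory-restricted feature distillation."""
        return [(x, y) for y, xs in self.buffer.items() for x in xs]
```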

    9 Conclusion

    In this article, we have introduced the problem of CL2R, which considers the compatibility learning problem within the lifelong learning paradigm. We introduced a novel set of metrics to properly evaluate this problem and proposed a novel CL2R training procedure that imposes global and local stationarity on the learned features to achieve compatibility between representations under catastrophic forgetting. Global and local stationarity are imposed through the d-Simplex fixed classifier and the FD loss, respectively. Empirical evaluation of the learned lifelong-compatible representations shows the effectiveness of our method with respect to baselines and state-of-the-art methods.

    Footnotes

    1. To meet the open-set protocol, we generated a training set from ImageNet [57] by randomly sampling 20 classes that are not included in the Tiny-ImageNet200 dataset. The indices of the ImageNet classes we use are the following: {n02276258, n01728572, n03814906, n02817516, n03769881, n03220513, n04442312, n04252225, n13037406, n04266014, n03929855, n02804414, n01873310, n03532672, n01818515, n03916031, n03345487, n02114855, n04589890, n03776460}.
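    For completeness, a minimal sketch of assembling such a subset from a standard per-wnid ImageNet directory layout is shown below; the source and destination paths are hypothetical, and only the wnid list comes from the footnote.

```python
import shutil
from pathlib import Path

# The 20 ImageNet wnids listed in the footnote (disjoint from Tiny-ImageNet200).
IMAGENET20_WNIDS = [
    "n02276258", "n01728572", "n03814906", "n02817516", "n03769881",
    "n03220513", "n04442312", "n04252225", "n13037406", "n04266014",
    "n03929855", "n02804414", "n01873310", "n03532672", "n01818515",
    "n03916031", "n03345487", "n02114855", "n04589890", "n03776460",
]

def build_imagenet20(imagenet_train_dir: str, out_dir: str) -> None:
    """Copy the per-wnid folders of the 20 selected classes into `out_dir`.

    Assumes the usual ImageNet training layout: one sub-folder per wnid
    containing that class's images (paths are hypothetical).
    """
    src_root, dst_root = Path(imagenet_train_dir), Path(out_dir)
    dst_root.mkdir(parents=True, exist_ok=True)
    for wnid in IMAGENET20_WNIDS:
        shutil.copytree(src_root / wnid, dst_root / wnid, dirs_exist_ok=True)
```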

    References

    [1]
    Tommaso Barletti, Niccoló Biondi, Federico Pernici, Matteo Bruni, and Alberto Del Bimbo. 2022. Contrastive supervised distillation for continual representation learning. In Proceedings of the International Conference on Image Analysis and Processing. 597–609.
    [2]
    Eden Belouadah, Adrian Popescu, and Ioannis Kanellos. 2021. A comprehensive study of class incremental learning algorithms for visual tasks. Neural Networks 135 (2021), 38–54.
    [3]
    Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1798–1828.
    [4]
    Niccolo Biondi, Federico Pernici, Matteo Bruni, and Alberto Del Bimbo. 2021. CoReS: Compatible representations via stationarity. arXiv preprint arXiv:2111.07632 (2021).
    [5]
    Mateusz Budnik and Yannis Avrithis. 2021. Asymmetric metric learning for knowledge transfer. In Proceedings of the 2021 Conference on Computer Vision and Pattern Recognition (CVPR’21). IEEE, Los Alamitos, CA, 8228–8238.
    [6]
    Ken Chen, Yichao Wu, Haoyu Qin, Ding Liang, Xuebo Liu, and Junjie Yan. 2019. R3 Adversarial network for cross model face recognition. In Proceedings of the 2019 Conference on Computer Vision and Pattern Recognition (CVPR’19). IEEE, Los Alamitos, CA, 9868–9876.
    [7]
    Wei Chen, Yu Liu, Nan Pu, Weiping Wang, Li Liu, and Michael S. Lew. 2021. Feature estimations based correlation distillation for incremental image retrieval. IEEE Transactions on Multimedia 24 (2021), 1844–1856.
    [8]
    Wei Chen, Yu Liu, Weiping Wang, Tinne Tuytelaars, Erwin M. Bakker, and Michael S. Lew. 2020. On the exploration of incremental learning for fine-grained image retrieval. In Proceedings of the 31st British Machine Vision Conference (BMVC’20).
    [9]
    Zhiyuan Chen and Bing Liu. 2018. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 12, 3 (2018), 1–207.
    [10]
    Sumit Chopra, Raia Hadsell, and Yann LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1. IEEE, Los Alamitos, CA, 539–546.
    [11]
    Andrea Cossu, Marta Ziosi, and Vincenzo Lomonaco. 2021. Sustainable artificial intelligence through continual learning. arXiv preprint arXiv:2111.09437 (2021).
    [12]
    MohammadReza Davari and Eugene Belilovsky. 2021. Probing representation forgetting in continual learning. In Proceedings of the NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications.
    [13]
    Matthias Delange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Greg Slabaugh, and Tinne Tuytelaars. 2021. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021), 3366–3385.
    [14]
    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4690–4699.
    [15]
    Natalia Díaz-Rodríguez, Vincenzo Lomonaco, David Filliat, and Davide Maltoni. 2018. Don’t forget, there is more than forgetting: new metrics for Continual Learning. arXiv preprint arXiv:1810.13166 (2018).
    [16]
    Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. 2020. PODNet: Pooled outputs distillation for small-tasks incremental learning. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. Springer, 86–102.
    [17]
    Arthur Douillard and Timothée Lesort. 2021. Continuum: Simple management of complex continual learning scenarios. arXiv:2102.06253 (2021).
    [18]
    K. Anders Ericsson and Walter Kintsch. 1995. Long-term working memory. Psychological Review 102, 2 (1995), 211.
    [19]
    Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. 2016. Deep image retrieval: Learning global representations for image search. In Computer Vision—ECCV 2016, Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.). Springer International Publishing, Cham, Switzerland, 241–257.
    [20]
    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
    [21]
    Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In Proceedings of the NIPS Deep Learning and Representation Learning Workshop.
    [22]
    Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. 2019. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 831–839.
    [23]
    Jie Hu, Rongrong Ji, Hong Liu, Shengchuan Zhang, Cheng Deng, and Qi Tian. 2019. Towards visual feature translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3004–3013.
    [24]
    Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. 2007. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Technical Report 07-49. University of Massachusetts, Amherst.
    [25]
    Ahmet Iscen, Jeffrey Zhang, Svetlana Lazebnik, and Cordelia Schmid. 2020. Memory-efficient incremental learning through feature adaptation. In Proceedings of the European Conference on Computer Vision. 699–715.
    [26]
    Heechul Jung, Jeongwoo Ju, Minju Jung, and Junmo Kim. 2018. Less-forgetful learning for domain expansion in deep neural networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
    [27]
    Brendan F. Klare, Ben Klein, Emma Taborsky, Austin Blanton, Jordan Cheney, Kristen Allen, Patrick Grother, Alan Mah, and Anil K. Jain. 2015. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1931–1939.
    [28]
    Alex Krizhevsky and Geoffrey Hinton. 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report TR-2009. University of Toronto.
    [29]
    Ya Le and Xuan Yang. 2015. Tiny ImageNet visual recognition challenge. CS 231N 7, 7 (2015), 3.
    [30]
    Gary B. Huang and Erik Learned-Miller. 2014. Labeled Faces in the Wild: Updates and New Reporting Procedures. Technical Report UM-CS-2014-003. University of Massachusetts, Amherst.
    [31]
    Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. 2015. Convergent learning: Do different neural networks learn the same representations? In Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015. 196–212.
    [32]
    Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 12 (2017), 2935–2947.
    [33]
    Alexandra Libby and Timothy J. Buschman. 2021. Rotational dynamics reduce interference between sensory and memory representations. Nature Neuroscience 24 (2021), 715–726.
    [34]
    David Lopez-Paz and Marc’Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems 30 (2017).
    [35]
    Marc Masana, Xialei Liu, Bartlomiej Twardowski, Mikel Menta, Andrew D. Bagdanov, and Joost van de Weijer. 2020. Class-incremental learning: Survey and performance evaluation. arXiv preprint arXiv:2010.15277 (2020).
    [36]
    Brianna Maze, Jocelyn Adams, James A. Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K. Jain, et al. 2018. IARPA Janus Benchmark C: Face dataset and protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB’18). IEEE, Los Alamitos, CA, 158–165.
    [37]
    Michael McCloskey and Neal J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, Vol. 24. Elsevier, 109–165.
    [38]
    Qiang Meng, Chixiang Zhang, Xiaoqiang Xu, and Feng Zhou. 2021. Learning compatible embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’21). 9939–9948.
    [39]
    Ethan M. Meyers. 2018. Dynamic population coding and its relationship to working memory. Journal of Neurophysiology 120, 5 (2018), 2260–2268.
    [40]
    John D. Murray, Alberto Bernacchia, Nicholas A. Roy, Christos Constantinidis, Ranulfo Romo, and Xiao-Jing Wang. 2017. Stable population coding for working memory coexists with heterogeneous neural dynamics in prefrontal cortex. Proceedings of the National Academy of Sciences 114, 2 (2017), 394–399.
    [41]
    German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. 2019. Continual lifelong learning with neural networks: A review. Neural Networks 113 (2019), 54–71.
    [42]
    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019), 8026–8037.
    [43]
    Federico Pernici, Federico Bartoli, Matteo Bruni, and Alberto Del Bimbo. 2018. Memory based online learning of deep representations from video streams. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).
    [44]
    Federico Pernici, Matteo Bruni, Claudio Baecchi, and Alberto Del Bimbo. 2019. Fix your features: Stationary and maximally discriminative embeddings using regular polytope (fixed classifier) networks. arXiv preprint arXiv:1902.10441 (2019).
    [45]
    Federico Pernici, Matteo Bruni, Claudio Baecchi, and Alberto Del Bimbo. 2019. Maximally compact and separated features with regular polytope networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
    [46]
    Federico Pernici, Matteo Bruni, Claudio Baecchi, and Alberto Del Bimbo. 2022. Regular polytope networks. IEEE Transactions on Neural Networks and Learning Systems 33, 9 (2022), 4373–4387.
    [47]
    Federico Pernici, Matteo Bruni, Claudio Baecchi, Francesco Turchini, and Alberto Del Bimbo. 2021. Class-incremental learning with pre-allocated fixed classifiers. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR’20). IEEE, Los Alamitos, CA, 6259–6266.
    [48]
    Federico Pernici, Matteo Bruni, and Alberto Del Bimbo. 2020. Self-supervised on-line cumulative learning from video streams. Computer Vision and Image Understanding 197 (2020), 102983.
    [49]
    W. Nicholson Price and I. Glenn Cohen. 2019. Privacy in the age of medical big data. Nature Medicine 25, 1 (2019), 37–43.
    [50]
    Nan Pu, Wei Chen, Yu Liu, Erwin M. Bakker, and Michael S. Lew. 2021. Lifelong person re-identification via adaptive knowledge accumulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7901–7910.
    [51]
    Vivek Ramanujan, Pavan Kumar Anasosalu Vasu, Ali Farhadi, Oncel Tuzel, and Hadi Pouransari. 2022. Forward compatible training for representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
    [52]
    Roger Ratcliff. 1990. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review 97, 2 (1990), 285.
    [53]
    Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. 2017. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2001–2010.
    [54]
    Anthony Robins. 1993. Catastrophic forgetting in neural networks: The role of rehearsal mechanisms. In Proceedings of the 1993 1st New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems. IEEE, Los Alamitos, CA, 65–68.
    [55]
    Anthony Robins. 1995. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7, 2 (1995), 123–146.
    [56]
    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2015. FitNets: Hints for thin deep nets. In Proceedings of the 3rd International Conference on Learning Representations (ICLR’15).
    [57]
    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
    [58]
    Mohammadreza Salehi, Hossein Mirzaei, Dan Hendrycks, Yixuan Li, Mohammad Hossein Rohban, and Mohammad Sabokrou. 2021. A unified survey on anomaly, novelty, open-set, and out-of-distribution detection: Solutions and future challenges. arXiv preprint arXiv:2110.14051 (2021).
    [59]
    Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 815–823.
    [60]
    Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green AI. Communications of the ACM 63, 12 (2020), 54–63.
    [61]
    Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 806–813.
    [62]
    Yantao Shen, Yuanjun Xiong, Wei Xia, and Stefano Soatto. 2020. Towards backward-compatible representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6368–6377.
    [63]
    Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. 2014. Deep learning face representation by joint identification-verification. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14). 1988–1996.
    [64]
    Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. 2014. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1701–1708.
    [65]
    Giorgos Tolias, Ronan Sicre, and Hervé Jégou. 2016. Particular object retrieval with integral max-pooling of CNN activations. In Proceedings of the International Conference on Learning Representations (ICLR’16).
    [66]
    Richard Van Noorden. 2020. The ethical questions that haunt facial-recognition research. Nature 587, 7834 (2020), 354–358.
    [67]
    Mochitha Vijayan and S. S. Sridhar. 2021. Continual learning for classification problems: A survey. In Proceedings of the International Conference on Computational Intelligence in Data Science. 156–166.
    [68]
    Chien-Yi Wang, Ya-Liang Chang, Shang-Ta Yang, Dong Chen, and Shang-Hong Lai. 2020. Unified representation learning for cross model compatibility. In Proceedings of the 31st British Machine Vision Conference (BMVC’20).
    [69]
    Fu-Yun Wang, Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan. 2022. FOSTER: Feature boosting and compression for class-incremental learning. arXiv preprint arXiv:2204.04662 (2022).
    [70]
    Liwei Wang, Lunjia Hu, Jiayuan Gu, Zhiqiang Hu, Yue Wu, Kun He, and John Hopcroft. 2018. Towards understanding learning representations: To what extent do different neural networks learn the same representation. In Advances in Neural Information Processing Systems, Vol. 31.
    [71]
    Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. 2019. Large scale incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 374–382.
    [72]
    Shipeng Yan, Jiangwei Xie, and Xuming He. 2021. DER: Dynamically expandable representation for class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3014–3023.
    [73]
    Artem Babenko Yandex and Victor Lempitsky. 2015. Aggregating local deep features for image retrieval. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV’15). 1269–1277.
    [74]
    Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven C. H. Hoi. 2021. Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021), 2872–2893.
    [75]
    Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z. Li. 2014. Learning face representation from scratch. arXiv preprint arXiv:1411.7923 (2014).
    [76]
    Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, Vol. 27.
    [77]
    Bowen Zhao, Xi Xiao, Guojun Gan, Bin Zhang, and Shu-Tao Xia. 2020. Maintaining discrimination and fairness in class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13208–13217.
    [78]
    Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision. 1116–1124.
    [79]
    Liang Zheng, Yi Yang, and Alexander G. Hauptmann. 2016. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984 (2016).
    [80]
    Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang. 2019. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3702–3712.
