Federated Continual Learning Goes Online:
Leveraging Uncertainty for Modality-Agnostic Class-Incremental Learning

Giuseppe Serra
Goethe University Frankfurt
German Cancer Consortium (DKTK)
serra@med.uni-frankfurt.de
Florian Buettner
Goethe University Frankfurt
German Cancer Consortium (DKTK)
German Cancer Research Center (DKFZ)
florian.buettner@dkfz-heidelberg.de

Abstract

Given the ability to model more realistic and dynamic problems, Federated Continual Learning (FCL) has been increasingly investigated recently. A well-known problem encountered in this setting is the so-called catastrophic forgetting, for which the learning model is inclined to focus on more recent tasks while forgetting the previously learned knowledge. The majority of the current approaches in FCL propose generative-based solutions to solve said problem. However, this setting requires multiple training epochs over the data, implying an offline setting where datasets are stored locally and remain unchanged over time. Furthermore, the proposed solutions are tailored for vision tasks solely. To overcome these limitations, we propose a new modality-agnostic approach to deal with the online scenario where new data arrive in streams of mini-batches that can only be processed once. To solve catastrophic forgetting, we propose an uncertainty-aware memory-based approach. In particular, we suggest using an estimator based on the Bregman Information (BI) to compute the model’s variance at the sample level. Through measures of predictive uncertainty, we retrieve samples with specific characteristics, and – by retraining the model on such samples – we demonstrate the potential of this approach to reduce the forgetting effect in realistic settings.

1 Introduction

In recent years, federated learning (FL) (McMahan et al., 2017) has received increasing attention for its ability to allow local clients to work towards the same objective without the need for sharing limited and sensitive information. However, despite the benefits provided by the privacy-preserving collaborative training, the assumptions of the standard scenario are far from realistic (Babakniya et al., 2024). For example, the standard static-single-task framework has very limited practical applicability in real-world cases where local clients often need to continuously learn new tasks. Let us consider the problem of classifying new COVID-19 variants as an illustrative use case. Given the evolving nature of the virus, new variants (i.e., new classes) arise over time. In this context, healthcare facilities which collaborate using FL would fail since the standard setting does not consider the dynamic increment of new classes. For this reason, a new paradigm was recently introduced to model more complex dynamics: federated continual learning (FCL) (Yoon et al., 2021a; Dong et al., 2022). This new paradigm takes the global-local communication and privacy-preserving abilities enabled by FL and combines them with the Continual Learning (CL) ability of learning different tasks sequentially over time. In this way, FCL frameworks allow local clients to learn continuously from a stream of data where new classes can be added at any time. Consequently, this new paradigm not only shares the characteristics at the intersection of FL and CL, but also inherits their respective challenges. In particular, it inherits one of the most prominent problems in CL, the so-called catastrophic forgetting, in which the model is prone to significant performance degradation on older tasks.
In CL, this problem is addressed in different ways ranging from memory-based approaches (Yoon et al., 2021b; Kumari et al., 2022; Hurtado et al., 2023) to generative methods (Shin et al., 2017; Lesort et al., 2019) or to regularization techniques (Lee et al., 2017; Aljundi et al., 2018; Chaudhry et al., 2018b; He and Jaeger, 2018). In the context of FCL, instead, the majority of the proposed approaches exploit generative models at both local and global level to trace and encode the past information and generate synthetic instances which faithfully mimic previous history whilst preserving data privacy. A substantial drawback of such generative approaches is that they are tailored to image data and it is not clear how they could translate to other data modalities.
Importantly, such generator-based approaches also imply an offline setting where task data are collected before training and remain unchanged over time: local generators require large training datasets of complete tasks and similarly, more recently proposed generative models based on data distillation can only be trained at the end of each task (Babakniya et al., 2024).

In realistic conditions, however, all data from the current task may not be available for collection at the same time but may rather arrive sequentially in small chunks. Furthermore, given the ubiquity of smart and edge devices with limited capabilities (e.g., wearable devices), the learning model needs to be updated with high frequency on the new incoming data to minimize memory overload and communication bandwidth (Ma et al., 2022). This problem of learning from an online stream of data is well investigated in CL, while it remains largely unexplored in FCL.

Inspired by the problem of training models over online streams of data, we first formalise a new scenario to tackle the online problem in FCL (online-FCL). In this new scenario, we assume each client to learn from a stream of data where new data arrive in mini-batches which can only be processed once. As such, in line with the definition of online-CL, the model is updated with high frequency (Soutif-Cormerais et al., 2023). Then, taking inspiration from the most popular solutions in online-CL, we introduce memory buffers to alleviate catastrophic forgetting at local level. More in detail, we propose an uncertainty-based memory management where data points are stored in local buffers according to their predictive uncertainty. Intuitively, predictive uncertainty provides a glimpse of the samples’ location in the decision space; samples with low uncertainty are the most representative ones for the respective class, while samples with high uncertainty represent data points that are close to the decision boundary and/or outliers. Thus, via estimates of predictive uncertainty, we can store in the memory samples with desired properties. Here, we propose to quantify uncertainty by directly estimating a generalized variance term from the loss function, which can be interpreted as a measure of epistemic uncertainty. To this end, we leverage a recently proposed bias-variance decomposition of the cross-entropy loss (Gruber and Buettner, 2023) and estimate the Bregman Information (BI) as the variance term in logit space. The uncertainty-based memory management makes our approach modality-agnostic. In fact, independent of the data modality (e.g., images, texts), our solution allows us to compute uncertainty estimates and thus populate the memory accordingly. In contrast to most of the current solutions in FCL which are limited to vision tasks and the offline setting, our simple but effective baseline is flexible in terms of data modality.

In the last part of the paper, we demonstrate the ability of the proposed approach to reduce the forgetting effect in different scenarios. In the first part of the experiments, we evaluate our approach on CIFAR-10 (Krizhevsky et al., 2009), a standard dataset used in this context. The goal is to understand how the proposed (epistemic) uncertainty estimate based on the Bregman Information performs in comparison with other standard estimates of overall model uncertainty under a memory-based regime. Then, departing from the common evaluation pipeline that would involve datasets like EMNIST (as in Qi et al. (2023); Wuerkaixi et al. (2024)) or larger datasets from the same naturalistic domain (e.g., CIFAR100 or ImageNet – as in Qi et al. (2023); Babakniya et al. (2024)), we validate our results on more probing real-world datasets from the medical domain. Finally, to showcase the ability of our approach to work with different data modalities, we test our findings on a text classification task.

The contributions of this work can be summarized as follows:

• We propose and formalise a novel framework to tackle the FCL problem in the online setting.
• We highlight the limitations of the current state-of-the-art generative-based solution in the online setting.
• We propose a modality-agnostic memory-based solution that populates the memory using an alternative estimate of predictive uncertainty, which stems directly from a bias-variance decomposition of the cross-entropy loss for classification tasks (Gruber and Buettner, 2023).
• We demonstrate the efficacy of our method in more realistic scenarios, including datasets that come from different domains, are imbalanced, and cover different modalities.

Figure 1: Schematic overview of the proposed online-FCL scenario.

2 Related Work

Continual Learning.

There are three types of incremental learning (IL) (van de Ven et al., 2022): task-, domain-, and class-IL. In task-IL, training and testing data explicitly include task IDs. In this unrealistic scenario, the learning algorithm knows which task needs to be completed. In the domain-IL setup, instead, the classification problem remains the same while the input distribution (or domain) shifts over time. For instance, domain drift occurs when a model for detecting even/odd digits in images first learns from 1s and 2s and then from 4s and 5s. Lastly, class-IL scenarios represent the most realistic setting, where the model continuously learns from an increasing number of classes. CL problems can also be classified based on whether the data can be stored and viewed multiple times (offline) or whether they can be processed only once (online) (Mai et al., 2022).

Class-Incremental Learning.

Methods for class-IL can be divided into several categories (Mai et al., 2022) as follows; a) Regularization techniques alter the model parameter updates by adding penalty terms to the loss function (Lee et al., 2017; Aljundi et al., 2018), adjusting parameter gradients during optimization (Chaudhry et al., 2018b; He and Jaeger, 2018), or employing knowledge distillation (Rannen et al., 2017; Wu et al., 2019); b) Memory-based techniques exploit a fixed-size buffer containing samples from past tasks for replay (Aljundi et al., 2019; Chaudhry et al., 2019) or regularization (Nguyen et al., 2018; Tao et al., 2020); c) Generative-based techniques involve training generative models that can produce pseudo-samples mimicking data from the past (Shin et al., 2017; Lesort et al., 2019); d) Parameter-isolation-based techniques assign different model parameters to each task. This can be done either by activating only the relevant parameters for each task (Fixed Architecture) (Mallya and Lazebnik, 2018; Serra et al., 2018) or by adding new parameters and keeping the existing ones unchanged (Dynamic Architecture) (Aljundi et al., 2017; Yoon et al., 2018). In online-CL, where data arrive in single mini-batches and the model is updated frequently, rehearsal-based methods are favoured over more complex solutions, like generative methods, because they offer greater flexibility and require less training data and computational time (Mai et al., 2022).

In the context of memory-based techniques, the main question is how to optimally manage the memory. In order to reduce catastrophic forgetting (CF), samples in the memory should be representative of their own class, discriminative towards the other classes, and informative enough for the model to recall the information about the old classes. In the literature, many conflicting strategies can be found. Some of them suggest using the most representative samples (Yoon et al., 2021b; Hurtado et al., 2023), while others consider the samples near the decision boundary as the most useful to reduce CF (Kumari et al., 2022).

Federated Continual Learning.

The intersection of federated learning and continual learning has only been investigated very recently. One of the first papers in this direction is Yoon et al. (2021a). The approach focuses on the less challenging task-IL scenario, where the task ID information is required during inference and testing. In Dong et al. (2022), the authors introduce the federated class-IL (FCIL) problem, which aims to model more realistic scenarios. The clients have access to a large memory buffer and can share perturbed prototype samples with the global server, which differs from the standard FL setting where only the parameters of the local models are shared with the server (Babakniya et al., 2024). Ma et al. (2022) use knowledge distillation on both local and global levels via unlabeled surrogate datasets. Other works (Hendryx et al., 2021; Qi et al., 2023; Babakniya et al., 2024; Wuerkaixi et al., 2024) focus on the FCIL problem without the use of memory replay data. Hendryx et al. (2021) focus on few-shot learning and allow overlapping classes between tasks. Qi et al. (2023), instead, introduce the constraint of non-overlapping classes for intra-client tasks. The work is based on generative replay where clients train a discriminator and a generator locally. At the same time, at each communication round, the server performs a consolidation step and generates synthetic data using the locally trained generators. Following a similar principle, Babakniya et al. (2024) present a generative model which is trained by the server in a data-free manner. Another FCIL scenario is considered in Shenaj et al. (2023), where tasks can arrive asynchronously at each client. The problem is tackled via a combination of prototype-based learning, representation loss, and stabilization of server aggregation.

Here we briefly summarise the identified limitations of such generator-based approaches. As noted above, they assume an offline setting where static datasets are collected in advance and stored locally. This allows them to be trained over the same task dataset many times, which is needed to reach convergence and to learn patterns that yield meaningful synthetic images. Finally, storing the whole task datasets and the generator models may be infeasible on resource-limited devices.

3 Our Approach: O-FCIL

3.1 Online-FCL: Problem Formulation

Following the notation proposed in Qi et al. (2023), we assume a set $\bm{\mathcal{C}}=\left\{\mathcal{C}_{1},\dots,\mathcal{C}_{n}\right\}$ of $n$ different clients. Each client $\mathcal{C}_{k}$ keeps its private data $\bm{\mathcal{D}}_{k}=\{D_{k}^{1},\dots,D_{k}^{t}\}$ with its corresponding sequence of tasks $\bm{\mathcal{T}}_{k}=\{t_{k}^{1},\dots,t_{k}^{t}\}$. For each client $k$ and task $t$, we have an associated dataset $D_{k}^{t}=\{(x_{i},y_{i})\}_{i=1}^{n_{k}^{t}}$ with $x_{i}$ an input sample, $y_{i}$ the corresponding class label, and $n_{k}^{t}$ the number of training samples. In the proposed online setting, we assume that samples for each task $t$ arrive gradually in a stream of mini-batches $b_{k}^{t}=\{(x_{i},y_{i})\in D_{k}^{t}\}_{i=1}^{bs}$, each of which can only be seen once. Similar to Qi et al. (2023), we assume non-overlapping classes for intra-client tasks. In other words, each class appears in only one task of a client's sequence. At each communication round $r$, the client trains the local model on its own data and shares the local parameters $\bm{\theta}_{k}^{r}$ to update the global model parameters $\bm{\theta}_{g}^{r}$. Since data points can be processed only once, in order to make the best use of the information in the incoming data points, communications with the server are performed after a single mini-batch or multiple consecutive mini-batches are processed. This is considerably different from the standard FCL scenario, where multiple iterations and communication rounds are performed for the whole task dataset. Following a popular trend in FCL, we also focus on the FCIL scenario, which represents the most realistic and challenging case.
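The single-pass constraint on the mini-batch stream can be illustrated with a small generator. This is only a sketch: the name `minibatch_stream` is hypothetical, and the samples are stand-in integers rather than real $(x_{i},y_{i})$ pairs.

```python
import itertools

def minibatch_stream(samples, batch_size=10):
    """Yield consecutive mini-batches from an iterable of samples.

    Each sample is consumed exactly once: after a batch has been
    yielded, the stream keeps no copy of it.
    """
    iterator = iter(samples)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Toy usage: 25 stand-in samples arrive as batches of 10, 10 and 5.
for batch in minibatch_stream(range(25), batch_size=10):
    pass  # a local update on `batch` would happen here
```

Because the generator consumes its input, a second pass over the same stream yields nothing, which mirrors the assumption that data cannot be revisited unless explicitly stored elsewhere.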

3.2 Methodology

As briefly anticipated above, the online nature of the newly introduced problem poses new challenges compared to the standard case. For this, we propose a memory-based approach on the client-side to alleviate catastrophic forgetting. The motivation is threefold: 1) the most effective solutions in online-CL are approaches with memory buffers; 2) given the online nature of the problem, generative-based approaches would be limited as they require many iterations over the same dataset; 3) compared with generative-based solutions, we only store a small amount of data reducing the overall overhead on the local device.

In the following subsections, we describe in more detail the characteristics of our approach at the client and server levels. For the client side, we outline our uncertainty-aware memory management, detailing the properties of the employed uncertainty estimate and its advantages compared to standard scores. For the server side, we detail the adjustments made to improve the effectiveness of communication and parameter averaging in the online scenario.

3.2.1 Client-level

Memory management.

For each client $k$, we introduce a fixed-size memory buffer $\mathcal{M}_{k}$. Similar to Chrysakis and Moens (2020), the memory population strategy is based on a class-balanced update, which is crucial when the stream of data is highly imbalanced. In fact, if classes are not equally represented in the memory, sampling from the memory may further deteriorate the predictive performance of the framework for under-represented classes. In contrast to the populating strategy proposed in Chrysakis and Moens (2020), the criterion for deciding which samples to keep in the memory is not random but based on predictive uncertainty estimates, in order to intentionally store samples with desired characteristics for each class. There are different ways to select the samples. We can decide to store a) the class-representative data points, by selecting the least uncertain ones for each class (bottom-k), or b) the easiest-to-forget samples, by selecting the ones with high uncertainty (top-k).
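A minimal sketch of an uncertainty-ranked, class-balanced memory update is given below. It rests on simplifying assumptions that are not taken from the paper: the function name `update_memory` is hypothetical, every observed class receives an equal quota (`memory_size // number_of_classes`, which is not necessarily the exact rule of Chrysakis and Moens (2020)), and uncertainty scores are stored once at insertion time and not refreshed as the model changes.

```python
def update_memory(memory, batch, scores, memory_size, mode="bottom"):
    """Insert a mini-batch into a class-balanced memory ranked by uncertainty.

    memory: dict mapping class label -> list of (score, sample) pairs.
    batch:  list of (sample, label) pairs.
    scores: one uncertainty score per element of `batch`.
    mode:   "bottom" keeps the least uncertain samples per class,
            "top" keeps the most uncertain ones.
    """
    for (x, y), s in zip(batch, scores):
        memory.setdefault(y, []).append((s, x))
    if not memory:
        return memory
    # Assumption: equal share of the buffer for every class seen so far.
    quota = max(1, memory_size // len(memory))
    keep_highest = (mode == "top")
    for y in memory:
        memory[y].sort(key=lambda entry: entry[0], reverse=keep_highest)
        del memory[y][quota:]  # drop everything beyond the class quota
    return memory

# Toy usage with string stand-ins for samples.
batch = [("a", 0), ("b", 0), ("c", 0), ("d", 1)]
scores = [0.9, 0.1, 0.5, 0.3]
memory = update_memory({}, batch, scores, memory_size=4, mode="bottom")
```

With two classes and `memory_size=4`, each class keeps at most two entries, so class 0 retains its two least uncertain samples while class 1 keeps its single sample.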

From the second task on, we need a strategy for sampling from the memory a subset of data points used for replay (replay set). Since the memory represents a prototypical set of data points for each class, we assume that a random sampling is sufficient to extract informative data points from the memory. Following the standard practice, the number of samples in the replay set is equal to the batch size. It is important to note that the memory may already contain samples from the current task. In such cases, samples from the current task are excluded from the sampling process to ensure that the focus remains solely on those belonging to past tasks.
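The replay-set construction described above can be sketched as follows. The helper name `sample_replay` and the dictionary layout of the memory are assumptions made for illustration only.

```python
import random

def sample_replay(memory, current_classes, batch_size, rng=random):
    """Draw a random replay set from the memory, skipping current-task classes.

    memory:          dict mapping class label -> list of stored samples.
    current_classes: set of labels that belong to the task being learned.
    batch_size:      size of the replay set (capped by the available pool).
    """
    pool = [
        (x, y)
        for y, samples in memory.items()
        if y not in current_classes
        for x in samples
    ]
    k = min(batch_size, len(pool))
    return rng.sample(pool, k)

# Toy usage: class 2 belongs to the current task and is never replayed.
memory = {0: ["a", "b"], 1: ["c"], 2: ["d", "e"]}
replay = sample_replay(memory, current_classes={2}, batch_size=10,
                       rng=random.Random(0))
```

When fewer past-task samples are stored than the batch size, the sketch simply returns all of them, which is one possible way to handle the early phase of the second task.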

Predictive uncertainty estimation in the logit space.

Different measures of predictive uncertainty can capture distinct aspects of a model’s irreducible aleatoric uncertainty (inherent in the data) and its epistemic uncertainty (which stems from the finite training data and can be reduced by gathering more data). In online CL, the most commonly used measures are derived directly from the confidence scores of the model and mostly capture the irreducible aleatoric uncertainty (Wimmer et al., 2023). Here, we hypothesize it may be more beneficial for the model to replay instances which are representative in the sense that there is low uncertainty about the data generating process (we refer to this as low epistemic uncertainty, while acknowledging that there exist varying definitions). To avoid relying on specialized Bayesian models, we leverage a recently proposed bias-variance decomposition of the cross-entropy loss and compute a generalized variance term directly from the loss (Gruber and Buettner, 2023). Such a decomposition splits the expected prediction error (loss) into the model’s bias, variance, and an irreducible error (noise term). The latter is related to aleatoric uncertainty, whereas the variance term can directly be related to epistemic uncertainty (Gruber et al., 2023).
Gruber and Buettner (2023) have recently shown that a bias-variance decomposition of cross-entropy loss gives rise to the Bregman Information as the variance term and measures the variability of the prediction in the logit space. We illustrate the different aspects of uncertainty captured by confidence scores and BI respectively in Figure 2. For example, data points close to the decision boundary have a low confidence score due to the inherently high aleatoric uncertainty; in contrast, due to the high density of observed data, there is actually a low uncertainty about the data generating process (DGP), resulting in a low BI (low epistemic uncertainty). Outliers far away from the decision boundary can have a high confidence score, but will have a high BI due to the high uncertainty regarding the DGP. We hypothesize that it is samples with a low BI that are most useful for replay in online-FCL.
To populate the memory with representative samples, we therefore propose using an uncertainty estimator based on Bregman Information (BI) (Gruber and Buettner, 2023). The authors demonstrate that BI at the sample level can be estimated through deep ensembles or test-time augmentation (TTA). To reduce the computational overhead at the local level, we employ TTA to compute the estimates. Let us consider the problem of multi-class classification (as in our case), where the standard loss is the cross-entropy. Considering a set of $P$ perturbations of a given data point $x$, we can compute the variance term of the classification loss $u(x)_{BI}$ as follows:

$$u(x)_{BI}=\frac{1}{P}\sum_{i=1}^{P}\text{LSE}(\hat{z}_{i})-\text{LSE}\left(\frac{1}{P}\sum_{i=1}^{P}\hat{z}_{i}\right), \tag{1}$$

where $\hat{z}_{i}\in\mathbb{R}^{c}$ denotes the logits predicted for the $i$-th perturbed input and $\text{LSE}(x_{1},\dots,x_{n})=\ln\sum_{i=1}^{n}e^{x_{i}}$ is the LogSumExp (LSE) function. Intuitively, a large value of $u(x)_{BI}$ means that the logits predicted across the perturbations vary significantly, suggesting a high uncertainty of the prediction and the DGP at this point of the input space. The use of this estimator is motivated by the fact that, in comparison with other uncertainty scores such as entropy, smallest margin, or least confidence, there is no information loss in the estimation step. In fact, if we inspect these alternative metrics (see Appendix A.1 for the equations), we can notice that they either need a normalization step to map the logits to the probability space or rest on the largest activation value only. Furthermore, as reported in Gruber and Buettner (2023), Ovadia et al. (2019), and Tomani and Buettner (2021), common confidence scores are reliable only for well-calibrated models.
In contrast, the BI-based estimation of the epistemic uncertainty is meaningful also under distribution shift and able to identify robust and representative samples (Gruber and Buettner, 2023).
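The estimator in Equation (1) can be written in a few lines of NumPy. The sketch below assumes that the logits of the $P$ perturbed copies of a sample are already available as a $(P, c)$ array; the function names and the toy logits are illustrative and not taken from the paper.

```python
import numpy as np

def logsumexp(z, axis=-1):
    """Numerically stable LogSumExp along `axis`."""
    z = np.asarray(z, dtype=float)
    m = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(z - m), axis=axis))

def bregman_information(logits):
    """Equation (1): mean of LSE over perturbations minus LSE of the mean logits.

    logits: array of shape (P, c), one row of logits per perturbed input.
    The result is non-negative because LSE is convex (Jensen's inequality).
    """
    logits = np.asarray(logits, dtype=float)
    mean_of_lse = np.mean(logsumexp(logits, axis=-1))
    lse_of_mean = logsumexp(np.mean(logits, axis=0), axis=-1)
    return float(mean_of_lse - lse_of_mean)

# Toy logits: predictions that agree across perturbations vs. ones that do not.
stable = np.tile([2.0, -1.0, 0.5], (4, 1))
unstable = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0],
                     [0.0, 0.0, 5.0], [5.0, 0.0, 0.0]])
print(bregman_information(stable), bregman_information(unstable))
```

When all perturbed logits coincide, the two terms cancel and the score is zero; the more the logits move under perturbation, the larger the score. Under the bottom-k strategy, samples such as `stable` would be preferred for storage.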

3.2.2 Global-level

Communication rounds.

Following the standard assumptions of the federated scenario, we allow local clients to share only the model parameters. However, the proposed online framework poses additional challenges compared to the standard case. In the standard scenario, the communication round is performed at the end of several iterations over the same task dataset. This implies that the model has probably reached convergence and can effectively share the learned information during the communication round. In our case, only a few new samples are available at each step and, to keep the model up-to-date, the communication round cannot be performed at the end of the task only. For this reason, as anticipated in Section 3.1, communications with the server are performed after a single mini-batch or multiple consecutive mini-batches are processed. Given the instability of the model when a new task starts, we propose to set a burn-in period to ensure effective and efficient information sharing. During this period, the local model learns independently for a certain number of batches, neither sharing information with the other clients nor receiving it from them. This is intended to ensure that, when the local client starts to be involved in the communication rounds, the model has already learned relevant information on the current task. Additionally, since every parameter update degrades the predictive performance at local level (Qi et al., 2023), we propose to limit the number of communication rounds per task. Instead of performing a model averaging every time a batch is processed, we let the local models learn for $q$ consecutive mini-batches before actively participating in the communication round.
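One possible reading of the burn-in period and the $q$-batch interval is sketched below. The exact alignment is an assumption: here the batch counter restarts at every new task, the first communication happens as soon as the burn-in is complete, and subsequent ones follow every $q$ mini-batches. The function name `should_communicate` is hypothetical.

```python
def should_communicate(batches_seen, burn_in=30, q=5):
    """Decide whether a client joins a communication round.

    batches_seen: number of mini-batches processed so far in the current task.
    burn_in:      number of mini-batches learned locally before any sharing.
    q:            number of mini-batches between two communication rounds.
    """
    if batches_seen < burn_in:
        return False
    return (batches_seen - burn_in) % q == 0

# Mini-batch counts at which a client would communicate within one task.
rounds = [b for b in range(1, 46) if should_communicate(b)]
print(rounds)
```

Under these assumptions, a client stays silent for the first 29 mini-batches of a task and then participates at mini-batches 30, 35, 40, and so on.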

Parameter averaging.

Although clients work on the same task, they may receive data coming from different classes (as depicted in Figure 1). However, some of them may receive mini-batches that contain the same label (see $\mathcal{C}_{3}$ and $\mathcal{C}_{4}$ in Figure 1). Thus, we need to consider that local parameters collected for parameter averaging can be biased towards the most common classes. To address this, we propose to first create an aggregated model for each class available in the current round, and then compute $\bm{\theta}_{g}^{r}$ using the class-based models just created. In this way, the current classes contribute equally during the updates of the global parameters. In case all the clients work on the same task sequence, the parameter averaging reduces to the standard computation, e.g., FedAvg (McMahan et al., 2017).

Finally, once the new global parameters $\bm{\theta}_{g}^{r}$ are computed, to avoid a drastic change between the old parameters and the new ones, following Shenaj et al. (2023), we propose to average the newly computed parameters at round $r$ with the previous parameters computed at round $r-1$ .
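The two aggregation steps can be sketched with flattened parameter vectors. Several details below are assumptions rather than statements from the paper: parameters are represented as 1-D NumPy arrays, per-class and across-class averages are unweighted, and the smoothing with the previous round uses an equal weight `alpha=0.5`. The function names are hypothetical.

```python
import numpy as np

def class_balanced_average(client_params, client_classes):
    """Average client parameters per class, then average the class models.

    client_params:  list of 1-D arrays, one flattened parameter vector per client.
    client_classes: list of sets with the labels each client saw in this round.
    """
    classes = sorted(set().union(*client_classes))
    class_models = []
    for c in classes:
        members = [p for p, seen in zip(client_params, client_classes) if c in seen]
        class_models.append(np.mean(members, axis=0))
    return np.mean(class_models, axis=0)

def smooth_global(new_global, prev_global, alpha=0.5):
    """Blend the parameters of round r with those of round r-1."""
    return alpha * new_global + (1.0 - alpha) * prev_global

# Toy round: two of the three clients saw the same class (label 1).
params = [np.array([0.0, 0.0]), np.array([2.0, 2.0]), np.array([4.0, 4.0])]
classes = [{0}, {1}, {1}]
theta_new = class_balanced_average(params, classes)   # class 0 and class 1 weigh equally
theta_r = smooth_global(theta_new, prev_global=np.array([0.5, 0.5]))
```

In the toy round, a plain average over clients would give class 1 twice the weight of class 0, whereas the class-wise step gives both classes the same influence. If every client sees the same classes, the class-wise step coincides with an unweighted average over clients.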

4 Experiments

4.1 Datasets and Settings

Datasets.

To understand the behaviour of different algorithms in the online-FCL scenario, we use a dataset commonly used in the literature, namely, CIFAR10 (Krizhevsky et al., 2009). We randomly assign a set of classes to 5 tasks and we split the task data evenly among the clients such that the task sequence $T_{k}^{t}$ for task $t$ and client $k$ has two classes and does not share any data with the other clients. In this way, since the class assignment per task is different every time, we can identify the strategy providing greater flexibility and effectiveness irrespective of the composition of the tasks. Then, as anticipated in the introductory part, we assess the performance under more difficult and realistic conditions. Instead of using datasets that, to some extent, share similar characteristics with CIFAR10 (i.e., CIFAR100 and ImageNet) or represent unrealistic tasks (such as EMNIST), we validate our results on datasets for biomedical image analysis. Apart from the change in the domain (which in turn means different backgrounds in the images, different statistics, etc.), these datasets pose an additional challenge compared to the standard benchmark datasets, i.e., data imbalance. To reflect realistic conditions where recent tasks contain fewer data points than the older ones since the time to collect them is shorter, we assign classes to tasks based on the class size. We focus on two biomedical datasets annotated by expert clinical pathologists: the colorectal cancer histology (CRC-Tissue) dataset (Kather et al., 2019; Yang et al., 2023), containing images divided into 8 classes of hematoxylin–eosin (HE)–stained slides taken from patients with colorectal cancer (CRC), and the kidney cortex cells (KC-Cell) dataset, containing 8 classes of human kidney cortex tissue sections (Ljosa et al., 2012; Yang et al., 2023).
Finally, to test our approach on text classification tasks, we use the 20NewsGroups dataset (Lang, 1995), consisting of thousands of news posts evenly partitioned across 20 different newsgroups. As for CIFAR10, we randomly assign classes to tasks in every run. Statistics and more details about the datasets used can be found in the supplementary material (Appendix A.3).

Experimental Settings.

Given the novelty of the online setting, there are no direct competitors which we could refer to. We therefore investigate the most successful algorithms from the online-CL and the FCL literature. We compare our approach with a standard baseline for FL, i.e., FedAvg (McMahan et al., 2017), a standard memory-based approach for online-CL, namely Experience Replay (ER) (Chaudhry et al., 2019), and the state-of-the-art approach for FCL, i.e., MFCL (Babakniya et al., 2024). Our decision to use ER is based on the fact that, despite its simplicity, it is surprisingly competitive when compared with more sophisticated and newer approaches, as shown in recently conducted empirical surveys (Soutif-Cormerais et al., 2023). For FCL, we use MFCL because its data-free solution trains the generator on the server side. As such, we believe it may be feasible in the online scenario, in contrast to other approaches that train local generators at the client side (Qi et al., 2023). In addition to the standard ER, we also include the class-balanced version (CBR) presented in Chrysakis and Moens (2020). Finally, to show the competitiveness of $u(x)_{BI}$, we also consider other uncertainty scores, namely least confidence (LC), margin sampling (MS), ratio of confidence (RC), and entropy (EN) (Shannon, 1948; Campbell et al., 2000; Culotta and McCallum, 2005).
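For orientation, the four confidence-based scores can be sketched as below. The exact equations used in the experiments are those of Appendix A.1; the forms here are the common active-learning definitions and should be read as an assumption, as should the convention that a larger value always means higher uncertainty. The function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Map a 1-D logit vector to class probabilities."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def confidence_scores(logits):
    """Common confidence-based uncertainty scores (larger = more uncertain)."""
    p = np.sort(softmax(logits))[::-1]           # probabilities, largest first
    return {
        "LC": 1.0 - p[0],                        # least confidence
        "MS": 1.0 - (p[0] - p[1]),               # margin between the top two classes
        "RC": p[1] / p[0],                       # ratio of the top two probabilities
        "EN": float(-np.sum(p * np.log(p + 1e-12))),  # Shannon entropy
    }

# Toy logits: a confident prediction vs. a maximally ambiguous one.
confident = confidence_scores([10.0, 0.0, 0.0])
ambiguous = confidence_scores([0.0, 0.0, 0.0])
```

All four scores operate on softmax probabilities, and LC, MS, and RC look only at the one or two largest entries, which is the kind of reduction the BI estimate in logit space avoids.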

In all the experiments for image classification, we employ a slim version of ResNet18 (He et al., 2016), as done in previous works in online-CL (Kumari et al., 2022; Hurtado et al., 2023; Soutif-Cormerais et al., 2023), and use the SGD optimizer with a learning rate of $0.1$. For text classification, we employ a simple MLP with a single fully connected hidden layer (512 units) and the Adam optimizer with a learning rate of $0.01$. To generate the perturbed inputs for estimating predictive uncertainty, we use two different strategies according to the data modality. For vision tasks, the perturbations are generated via standard augmentation. The list of augmentations used in our experiments is provided in the appendix. For the natural language experiment, we leverage recent progress in foundation models and first create general-purpose latent representations of the input texts via a pre-trained sentence embedder. In particular, we employ e5-small-v2 (Wang et al., 2022) (384 dimensions) via HuggingFace (Wolf et al., 2020). Then, to perturb the generated vector representations, we add Gaussian noise from $\mathcal{N}(0,0.1)$ to each latent dimension.
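The latent-space perturbation for text can be sketched as follows. Two points are assumptions: the second argument of $\mathcal{N}(0,0.1)$ is read here as a standard deviation (it could equally denote a variance), and the number of perturbed copies is a free parameter that the text does not specify. The zero vector stands in for a real 384-dimensional sentence embedding, and the function name is hypothetical.

```python
import numpy as np

def perturb_embedding(embedding, num_perturbations=8, sigma=0.1, seed=0):
    """Return noisy copies of a latent vector, one row per perturbation.

    embedding:         1-D array, e.g. a 384-dimensional sentence embedding.
    num_perturbations: number of noisy copies to generate (hypothetical default).
    sigma:             noise scale, interpreted here as a standard deviation.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(embedding, dtype=float)
    noise = rng.normal(loc=0.0, scale=sigma, size=(num_perturbations,) + z.shape)
    return z + noise

# Stand-in embedding; a real one would come from the sentence embedder.
embedding = np.zeros(384)
perturbed = perturb_embedding(embedding, num_perturbations=8)
print(perturbed.shape)
```

Each row of `perturbed` would then be passed through the classifier, and the resulting logits would serve as the perturbed predictions in the uncertainty estimate.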

Following standard practice, we set the batch size to $10$. The memory size was set to different values for each dataset in order to examine the performance of various memory configurations, including both large and small buffers. The burn-in period is set to $30$. A communication round with the server is performed after $q=5$ mini-batches. The number of clients is set to $5$. For parameter averaging, we employ FedAvg (McMahan et al., 2017). Given the flexibility of the approach, the model averaging strategy can be changed as desired. For evaluation, following recent works (Yoon et al., 2021a; Wuerkaixi et al., 2024), we use the average last accuracy (A) and average forgetting (F) (see Appendix A.4 for the definitions). All the experiments are run with three different random seeds. For each dataset, experiments were run on a Linux machine using a single Quadro RTX 5000 and 16 GB RAM.
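The two evaluation metrics can be sketched from a matrix of per-task accuracies. The definitions used in the paper are in Appendix A.4; the forms below follow the convention that is common in the continual learning literature and are an assumption here. The accuracy values are toy numbers, not results.

```python
import numpy as np

def average_last_accuracy(acc):
    """Mean accuracy over all tasks after training on the last task.

    acc[i][j] is the accuracy on task j measured after training on task i.
    """
    acc = np.asarray(acc, dtype=float)
    return float(acc[-1].mean())

def average_forgetting(acc):
    """Mean drop from the best earlier accuracy to the final accuracy, per old task."""
    acc = np.asarray(acc, dtype=float)
    num_tasks = acc.shape[0]
    if num_tasks < 2:
        return 0.0
    drops = [acc[j:num_tasks - 1, j].max() - acc[-1, j] for j in range(num_tasks - 1)]
    return float(np.mean(drops))

# Toy accuracy matrix for three tasks (entries above the diagonal are unused).
acc = [[0.9, 0.0, 0.0],
       [0.6, 0.8, 0.0],
       [0.5, 0.6, 0.7]]
A = average_last_accuracy(acc)
F = average_forgetting(acc)
```

A method can reach a similar A with very different F: high accuracy on the latest task can mask large drops on earlier ones, which is why both metrics are reported together.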

4.2 Empirical Results

From the experiments conducted on CIFAR10, we can observe that the proposed approach is able to reduce CF consistently across different memory sizes and class-per-task assignments. The results in Table 1 suggest that storing the class-representative samples (i.e., the least uncertain data points for each class) provides a benefit in terms of predictive performance gain and forgetting reduction. Our findings are substantiated in the classification tasks for biomedical images. Although the images originate from another domain, Tables 2 and 3 confirm that in terms of CF, BI outperforms the considered baselines in all cases. Crucially, we also show the ability of our approach to work with imbalanced real-world data. Figure 3 summarises our findings: in comparison with standard baselines, our simple approach is able to perform consistently across tasks (left plot) and reduce CF on different datasets (right plot). Finally, we demonstrate that our method can work on tasks other than image classification. Table 4 reports the results on the 20NewsGroups dataset.
More generally, we can notice that the difference in terms of predictive accuracy is less marked. This may be due to the fact that the baselines have higher accuracy on the last tasks and poor performance on the first ones, while in our case the accuracy is kept more uniform across the tasks (reflected in a lower average forgetting) as shown in the left plot in Figure 3.
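To make the notion of class-representative samples concrete, the following sketch selects, for each class, the samples with the lowest (or highest) uncertainty score. It is an illustration only: the function `select_per_class` is hypothetical, and we read "Bottom" in the tables as keeping the lowest-scoring samples and "Top" as keeping the highest-scoring ones.

```python
import numpy as np

def select_per_class(scores, labels, per_class, bottom=True):
    """Indices of the per_class least (or most) uncertain samples of each class."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # Ascending sort puts the least uncertain samples first.
        order = idx[np.argsort(scores[idx], kind="stable")]
        chosen = order[:per_class] if bottom else order[::-1][:per_class]
        keep.extend(chosen.tolist())
    return sorted(keep)

# Toy uncertainty scores for six samples from two classes.
scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]
labels = [0, 0, 0, 1, 1, 1]
bottom_idx = select_per_class(scores, labels, per_class=2, bottom=True)
top_idx = select_per_class(scores, labels, per_class=2, bottom=False)
```

Selecting per class keeps the buffer balanced even when the incoming stream is not, which is one plausible reason such a strategy could help on imbalanced data.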

Table 1: Comparison of average last accuracy (A) and last forgetting (F) on CIFAR10 (5 tasks).

Score		M=200		M=500		M=1000
ER	A $(\uparrow)$	20.74 $\pm$ 2.53		27.90 $\pm$ 2.03		33.64 $\pm$ 0.72
ER	F $(\downarrow)$	48.48 $\pm$ 3.95		30.18 $\pm$ 2.69		24.30 $\pm$ 1.73
CBR	A $(\uparrow)$	22.82 $\pm$ 1.55		24.17 $\pm$ 3.23		32.67 $\pm$ 1.84
CBR	F $(\downarrow)$	46.79 $\pm$ 2.92		28.54 $\pm$ 2.22		23.62 $\pm$ 2.63
		Top	Bottom	Top	Bottom	Top	Bottom
LC	A $(\uparrow)$	21.58 $\pm$ 1.56	20.92 $\pm$ 0.70	29.49 $\pm$ 1.46	27.38 $\pm$ 1.68	36.14 $\pm$ 2.44	31.94 $\pm$ 2.44
LC	F $(\downarrow)$	50.78 $\pm$ 1.96	49.78 $\pm$ 3.16	36.13 $\pm$ 2.54	30.17 $\pm$ 2.52	21.97 $\pm$ 4.06	29.53 $\pm$ 4.46
MS	A $(\uparrow)$	22.34 $\pm$ 1.04	21.88 $\pm$ 1.28	29.72 $\pm$ 1.98	28.96 $\pm$ 1.54	34.42 $\pm$ 0.91	32.28 $\pm$ 1.20
MS	F $(\downarrow)$	49.38 $\pm$ 3.09	47.20 $\pm$ 4.97	35.16 $\pm$ 3.20	28.07 $\pm$ 2.71	29.58 $\pm$ 2.73	28.64 $\pm$ 4.34
RC	A $(\uparrow)$	21.80 $\pm$ 1.65	22.75 $\pm$ 1.84	28.36 $\pm$ 2.07	31.17 $\pm$ 1.75	35.08 $\pm$ 1.11	28.98 $\pm$ 2.26
RC	F $(\downarrow)$	56.91 $\pm$ 1.90	45.05 $\pm$ 2.11	36.47 $\pm$ 2.84	31.86 $\pm$ 4.16	26.10 $\pm$ 1.76	28.39 $\pm$ 2.95
EN	A $(\uparrow)$	21.15 $\pm$ 1.37	20.14 $\pm$ 1.41	28.63 $\pm$ 0.86	26.48 $\pm$ 1.59	36.25 $\pm$ 1.22	26.42 $\pm$ 1.49
EN	F $(\downarrow)$	53.30 $\pm$ 2.97	55.58 $\pm$ 3.49	36.70 $\pm$ 2.21	35.85 $\pm$ 3.17	25.52 $\pm$ 3.01	34.64 $\pm$ 3.40
BI	A $(\uparrow)$	21.31 $\pm$ 1.46	24.89 $\pm$ 0.83	26.57 $\pm$ 0.83	27.84 $\pm$ 2.31	35.12 $\pm$ 2.51	35.83 $\pm$ 2.60
BI	F $(\downarrow)$	54.65 $\pm$ 1.83	35.77 $\pm$ 4.13	42.86 $\pm$ 3.81	24.59 $\pm$ 3.16	27.00 $\pm$ 2.99	19.07 $\pm$ 2.17

Table 2: Comparison of average last accuracy (A) and last forgetting (F) on CRC-Tissue (4 tasks).

Score		M=80		M=120
ER	A $(\uparrow)$	50.08 $\pm$ 3.41		58.36 $\pm$ 2.67
ER	F $(\downarrow)$	27.36 $\pm$ 9.11		24.12 $\pm$ 4.50
CBR	A $(\uparrow)$	49.52 $\pm$ 2.71		54.57 $\pm$ 2.08
CBR	F $(\downarrow)$	22.98 $\pm$ 7.46		24.55 $\pm$ 5.41
		Top	Bottom	Top	Bottom
LC	A $(\uparrow)$	47.66 $\pm$ 2.52	47.24 $\pm$ 4.48	56.49 $\pm$ 2.21	51.68 $\pm$ 2.18
LC	F $(\downarrow)$	41.42 $\pm$ 3.64	36.51 $\pm$ 6.86	26.65 $\pm$ 6.30	33.94 $\pm$ 8.80
MS	A $(\uparrow)$	47.87 $\pm$ 1.95	49.35 $\pm$ 2.57	52.82 $\pm$ 3.51	51.98 $\pm$ 2.42
MS	F $(\downarrow)$	36.96 $\pm$ 4.74	41.18 $\pm$ 7.67	30.42 $\pm$ 4.05	33.07 $\pm$ 2.59
RC	A $(\uparrow)$	50.42 $\pm$ 3.92	48.96 $\pm$ 2.13	55.93 $\pm$ 5.85	53.23 $\pm$ 1.58
RC	F $(\downarrow)$	32.95 $\pm$ 4.04	39.84 $\pm$ 5.70	26.57 $\pm$ 6.35	29.90 $\pm$ 7.36
EN	A $(\uparrow)$	49.18 $\pm$ 2.36	48.58 $\pm$ 2.09	53.80 $\pm$ 4.56	52.80 $\pm$ 2.77
EN	F $(\downarrow)$	29.55 $\pm$ 6.39	36.78 $\pm$ 5.75	29.48 $\pm$ 12.65	35.76 $\pm$ 10.16
BI	A $(\uparrow)$	49.18 $\pm$ 2.61	59.72 $\pm$ 2.39	50.16 $\pm$ 2.79	59.06 $\pm$ 2.55
BI	F $(\downarrow)$	36.01 $\pm$ 7.81	18.84 $\pm$ 8.64	37.28 $\pm$ 9.50	23.10 $\pm$ 9.81

Table 3: Comparison of average last accuracy (A) and last forgetting (F) on KC-Cell (4 tasks).

Score		M=120		M=160
ER	A $(\uparrow)$	19.59 $\pm$ 1.32		22.29 $\pm$ 0.71
ER	F $(\downarrow)$	65.00 $\pm$ 3.68		60.33 $\pm$ 6.00
CBR	A $(\uparrow)$	18.69 $\pm$ 1.56		20.38 $\pm$ 1.54
CBR	F $(\downarrow)$	66.25 $\pm$ 2.36		62.90 $\pm$ 4.43
		Top	Bottom	Top	Bottom
LC	A $(\uparrow)$	17.94 $\pm$ 1.19	19.66 $\pm$ 0.83	19.18 $\pm$ 1.31	21.32 $\pm$ 1.66
LC	F $(\downarrow)$	67.25 $\pm$ 3.88	65.16 $\pm$ 4.07	62.90 $\pm$ 2.48	62.84 $\pm$ 4.47
MS	A $(\uparrow)$	17.46 $\pm$ 0.89	20.05 $\pm$ 1.79	19.30 $\pm$ 1.01	21.39 $\pm$ 0.96
MS	F $(\downarrow)$	66.90 $\pm$ 3.52	62.22 $\pm$ 4.61	65.13 $\pm$ 5.05	65.63 $\pm$ 2.85
RC	A $(\uparrow)$	18.14 $\pm$ 1.69	17.83 $\pm$ 1.20	19.59 $\pm$ 0.88	21.21 $\pm$ 1.23
RC	F $(\downarrow)$	66.12 $\pm$ 3.44	68.04 $\pm$ 3.76	63.50 $\pm$ 3.76	63.27 $\pm$ 9.27
EN	A $(\uparrow)$	17.64 $\pm$ 1.63	19.46 $\pm$ 1.54	18.94 $\pm$ 1.23	21.25 $\pm$ 1.17
EN	F $(\downarrow)$	68.34 $\pm$ 4.37	64.66 $\pm$ 4.48	65.57 $\pm$ 4.24	64.63 $\pm$ 5.01
BI	A $(\uparrow)$	16.63 $\pm$ 0.73	20.91 $\pm$ 1.32	18.03 $\pm$ 0.61	21.61 $\pm$ 1.23
BI	F $(\downarrow)$	72.63 $\pm$ 1.45	61.03 $\pm$ 4.42	67.90 $\pm$ 3.89	59.05 $\pm$ 4.42

Table 4: Comparison of average last accuracy (A) and last forgetting (F) on 20NewsGroups (5 tasks).

		ER	CBR	LC	MS	RC	EN	BI
M=60	A $(\uparrow)$	42.33 $\pm$ 1.66	45.21 $\pm$ 1.86	42.90 $\pm$ 1.23	44.14 $\pm$ 1.63	44.15 $\pm$ 1.32	41.65 $\pm$ 1.82	44.92 $\pm$ 1.63
M=60	F $(\downarrow)$	31.12 $\pm$ 2.18	30.39 $\pm$ 0.80	32.05 $\pm$ 2.27	31.67 $\pm$ 2.57	31.61 $\pm$ 1.77	33.77 $\pm$ 1.43	29.98 $\pm$ 1.37
M=500	A $(\uparrow)$	45.95 $\pm$ 1.76	46.22 $\pm$ 1.55	45.72 $\pm$ 2.28	46.46 $\pm$ 2.53	46.58 $\pm$ 2.83	44.92 $\pm$ 2.71	46.72 $\pm$ 2.11
M=500	F $(\downarrow)$	27.43 $\pm$ 1.12	28.00 $\pm$ 0.98	28.58 $\pm$ 1.61	27.97 $\pm$ 2.36	28.02 $\pm$ 2.53	29.63 $\pm$ 2.45	26.97 $\pm$ 2.36

5 Conclusion and Limitations

In this work, we investigate the challenges of real-world federated continual learning, specifically when data arrive in streams of small chunks and the local models need to be updated with high frequency. Current research in FCL focuses on generative solutions that imply an offline setting where generators are trained at the end of each task. However, we argue that in realistic conditions, to keep the model up to date and reduce the communication overhead, local models should be trained every time new data are received. To this end, we devise and formalise a new scenario for the online problem in FCL. To address catastrophic forgetting, we introduce a simple memory-based baseline that, via a combination of uncertainty-based update and random replay, outperforms state-of-the-art approaches. The newly proposed BI-based estimator of epistemic uncertainty reduces CF more than standard uncertainty metrics in our experiments; it is simple to implement and applicable across a wide range of data modalities. Furthermore, we validate our findings on more demanding scenarios from the biomedical domain, which confirms the ability of the approach to select more robust and representative samples in challenging tasks.
A limitation of the proposed approach is the need to use an ensembling technique (in our case, TTA) in order to estimate the BI scores. This may result in a reduced efficiency compared to, e.g., ER. Furthermore, the TTA-based estimation of the Bregman Information is only an estimate of the true unknown uncertainty; the estimator used throughout the experiments (eq. 1) is only asymptotically unbiased, and underestimates the theoretical quantity (Gruber and Buettner, 2023).

Acknowledgments and Disclosure of Funding

This work was supported by the Federal Ministry for Economic Affairs and Climate Action of Germany (BMWK, Project OpenFLAAS 01MD23001E). Co-funded by the European Union (ERC, TAIPO, 101088594). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.

References

Aljundi et al. [2017] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3366–3375, 2017.
Aljundi et al. [2018] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.
Aljundi et al. [2019] Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. Online continual learning with maximal interfered retrieval. Advances in neural information processing systems, 32, 2019.
Babakniya et al. [2024] Sara Babakniya, Zalan Fabian, Chaoyang He, Mahdi Soltanolkotabi, and Salman Avestimehr. A data-free approach to mitigate catastrophic forgetting in federated class incremental learning for vision tasks. Advances in Neural Information Processing Systems, 36, 2024.
Campbell et al. [2000] Colin Campbell, Nello Cristianini, Alex Smola, et al. Query learning with large margin classifiers. In ICML, volume 20, page 0, 2000.
Chaudhry et al. [2018a] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European conference on computer vision (ECCV), pages 532–547, 2018a.
Chaudhry et al. [2018b] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. In International Conference on Learning Representations, 2018b.
Chaudhry et al. [2019] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, P Dokania, P Torr, and M Ranzato. Continual learning with tiny episodic memories. In Workshop on Multi-Task and Lifelong Reinforcement Learning, 2019.
Chrysakis and Moens [2020] Aristotelis Chrysakis and Marie-Francine Moens. Online continual learning from imbalanced data. In International Conference on Machine Learning, pages 1952–1961. PMLR, 2020.
Culotta and McCallum [2005] Aron Culotta and Andrew McCallum. Reducing labeling effort for structured prediction tasks. In AAAI, volume 5, pages 746–751, 2005.
Dong et al. [2022] Jiahua Dong, Lixu Wang, Zhen Fang, Gan Sun, Shichao Xu, Xiao Wang, and Qi Zhu. Federated class-incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10164–10173, 2022.
Gruber et al. [2023] Cornelia Gruber, Patrick Oliver Schenk, Malte Schierholz, Frauke Kreuter, and Göran Kauermann. Sources of uncertainty in machine learning–a statisticians’ view. arXiv preprint arXiv:2305.16703, 2023.
Gruber and Buettner [2023] Sebastian Gruber and Florian Buettner. Uncertainty estimates of predictions via a general bias-variance decomposition. In International Conference on Artificial Intelligence and Statistics, pages 11331–11354. PMLR, 2023.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
He and Jaeger [2018] Xu He and Herbert Jaeger. Overcoming catastrophic interference using conceptor-aided backpropagation. In International Conference on Learning Representations, 2018.
Hendryx et al. [2021] Sean M Hendryx, Dharma Raj KC, Bradley Walls, and Clayton T Morrison. Federated reconnaissance: Efficient, distributed, class-incremental learning. arXiv preprint arXiv:2109.00150, 2021.
Hurtado et al. [2023] Julio Hurtado, Alain Raymond-Sáez, Vladimir Araujo, Vincenzo Lomonaco, Alvaro Soto, and Davide Bacciu. Memory population in continual learning via outlier elimination. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3481–3490, 2023.
Kather et al. [2019] Jakob Nikolas Kather, Johannes Krisam, Pornpimol Charoentong, Tom Luedde, Esther Herpel, Cleo-Aron Weis, Timo Gaiser, Alexander Marx, Nektarios A Valous, Dyke Ferber, et al. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS medicine, 16(1):e1002730, 2019.
Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. online, 2009.
Kumari et al. [2022] Lilly Kumari, Shengjie Wang, Tianyi Zhou, and Jeff A Bilmes. Retrospective adversarial replay for continual learning. Advances in Neural Information Processing Systems, 35:28530–28544, 2022.
Lang [1995] Ken Lang. Newsweeder: Learning to filter netnews. In Machine learning proceedings 1995, pages 331–339. Elsevier, 1995.
Lee et al. [2017] Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. Advances in neural information processing systems, 30, 2017.
Lesort et al. [2019] Timothée Lesort, Hugo Caselles-Dupré, Michael Garcia-Ortiz, Andrei Stoian, and David Filliat. Generative models from the perspective of continual learning. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019.
Ljosa et al. [2012] Vebjorn Ljosa, Katherine L Sokolnicki, and Anne E Carpenter. Annotated high-throughput microscopy image sets for validation. Nature methods, 9(7):637–637, 2012.
Ma et al. [2022] Yuhang Ma, Zhongle Xie, Jue Wang, Ke Chen, and Lidan Shou. Continual federated learning based on knowledge distillation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, volume 3, 2022.
Mai et al. [2022] Zheda Mai, Ruiwen Li, Jihwan Jeong, David Quispe, Hyunwoo Kim, and Scott Sanner. Online continual learning in image classification: An empirical survey. Neurocomputing, 469:28–51, 2022.
Mallya and Lazebnik [2018] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
Nguyen et al. [2018] Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. In International Conference on Learning Representations, 2018.
Ovadia et al. [2019] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems, 32, 2019.
Qi et al. [2023] Daiqing Qi, Handong Zhao, and Sheng Li. Better generative replay for continual federated learning. In The Eleventh International Conference on Learning Representations, 2023.
Rannen et al. [2017] Amal Rannen, Rahaf Aljundi, Matthew B Blaschko, and Tinne Tuytelaars. Encoder based lifelong learning. In Proceedings of the IEEE international conference on computer vision, pages 1320–1328, 2017.
Serra et al. [2018] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning, pages 4548–4557. PMLR, 2018.
Shannon [1948] Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948.
Shenaj et al. [2023] Donald Shenaj, Marco Toldo, Alberto Rigon, and Pietro Zanuttigh. Asynchronous federated continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5054–5062, 2023.
Shin et al. [2017] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017.
Soutif-Cormerais et al. [2023] Albin Soutif-Cormerais, Antonio Carta, Andrea Cossu, Julio Hurtado, Vincenzo Lomonaco, Joost Van de Weijer, and Hamed Hemati. A comprehensive empirical evaluation on online continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3518–3528, 2023.
Tao et al. [2020] Xiaoyu Tao, Xinyuan Chang, Xiaopeng Hong, Xing Wei, and Yihong Gong. Topology-preserving class-incremental learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16, pages 254–270. Springer, 2020.
Tomani and Buettner [2021] Christian Tomani and Florian Buettner. Towards trustworthy predictions from deep neural networks with fast adversarial calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9886–9896, 2021.
van de Ven et al. [2022] Gido M van de Ven, Tinne Tuytelaars, and Andreas S Tolias. Three types of incremental learning. Nature Machine Intelligence, 4(12):1185–1197, 2022.
Wang et al. [2022] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022.
Wimmer et al. [2023] Lisa Wimmer, Yusuf Sale, Paul Hofman, Bernd Bischl, and Eyke Hüllermeier. Quantifying aleatoric and epistemic uncertainty in machine learning: Are conditional entropy and mutual information appropriate measures? In Uncertainty in Artificial Intelligence, pages 2282–2292. PMLR, 2023.
Wolf et al. [2020] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45, 2020.
Wu et al. [2019] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 374–382, 2019.
Wuerkaixi et al. [2024] Abudukelimu Wuerkaixi, Sen Cui, Jingfeng Zhang, Kunda Yan, Bo Han, Gang Niu, Lei Fang, Changshui Zhang, and Masashi Sugiyama. Accurate forgetting for heterogeneous federated continual learning. In The Twelfth International Conference on Learning Representations, 2024.
Yang et al. [2023] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 10(1):41, 2023.
Yoon et al. [2018] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations, 2018.
Yoon et al. [2021a] Jaehong Yoon, Wonyong Jeong, Giwoong Lee, Eunho Yang, and Sung Ju Hwang. Federated continual learning with weighted inter-client transfer. In International Conference on Machine Learning, pages 12073–12086. PMLR, 2021a.
Yoon et al. [2021b] Jaehong Yoon, Divyam Madaan, Eunho Yang, and Sung Ju Hwang. Online coreset selection for rehearsal-based continual learning. In International Conference on Learning Representations, 2021b.

Appendix A Appendix / supplemental material

A.1 Predictive Uncertainty Scores

The uncertainty metrics considered in our assessment are the following:

Least Confidence (LC) [Culotta and McCallum, 2005] measures the predictive uncertainty by looking at the samples with the smallest predicted class probability. If the probability associated with the most probable class $y_{(1)}$ is low, the model is less certain about the given sample.

u(x)_{LC}=1-\frac{1}{P}\sum_{i=1}^{P}p(y_{(1)}=c|\tilde{x}_{i})

(2)

Margin Sampling (MS) [Campbell et al., 2000] measures the predictive uncertainty by looking at the difference between the most probable predicted class $y_{(1)}$ and the second largest one $y_{(2)}$ . If the two probabilities are similar, the model is uncertain.

u(x)_{MS}=1-\frac{1}{P}\sum_{i=1}^{P}\left(p(y_{(1)}=c|\tilde{x}_{i})-p(y_{(2)}=c|\tilde{x}_{i})\right)

(3)

Ratio of Confidence (RC) [Campbell et al., 2000] is similar to MS. In this case, instead of computing the difference, the ratio between the probabilities of the two most probable classes is considered.

u(x)_{RC}=\frac{1}{P}\sum_{i=1}^{P}\left(\frac{p(y_{(2)}=c|\tilde{x}_{i})}{p(y_{(1)}=c|\tilde{x}_{i})}\right)

(4)

Entropy (EN) [Shannon, 1948] considers, differently from the previously introduced metrics, the whole probability distribution. The entropy is computed as follows:

u(x)_{EN}=-\frac{1}{P}\sum_{i=1}^{P}\left(\sum_{j}p(y_{j}=c|\tilde{x}_{i})\log(p(y_{j}=c|\tilde{x}_{i}))\right)

(5)
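The four scores in Eqs. (2) to (5) can be computed from the softmax outputs of the $P$ perturbed inputs as in the sketch below. The function name `uncertainty_scores` and the small constant `eps` inside the logarithm are assumptions made for illustration; the top-1 and top-2 probabilities are taken separately for each perturbed input.

```python
import numpy as np

def uncertainty_scores(probs, eps=1e-12):
    """probs: array of shape (P, C), one softmax vector per perturbed input."""
    probs = np.asarray(probs, dtype=np.float64)
    ordered = np.sort(probs, axis=1)[:, ::-1]  # probabilities in descending order
    p1, p2 = ordered[:, 0], ordered[:, 1]      # most and second most probable class
    return {
        "LC": 1.0 - p1.mean(),                 # Eq. (2)
        "MS": 1.0 - (p1 - p2).mean(),          # Eq. (3)
        "RC": (p2 / p1).mean(),                # Eq. (4)
        # Eq. (5); eps guards against log(0) and is not part of the definition.
        "EN": -(probs * np.log(probs + eps)).sum(axis=1).mean(),
    }

# Two perturbed predictions for a confidently classified sample and an ambiguous one.
confident = uncertainty_scores([[0.98, 0.01, 0.01], [0.96, 0.03, 0.01]])
uncertain = uncertainty_scores([[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]])
```

All four scores are constructed so that larger values indicate higher uncertainty, which is why the ambiguous sample scores higher on each of them.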

A.2 Set of Augmentations

As described in the main paper, we employ TTA to measure the epistemic uncertainty via BI and to compute all the other uncertainty estimates. The set of augmentations used in our experiments is presented in Figure 4. Each augmentation is applied individually to the data points of interest.

```python
from torchvision.transforms import v2

# CutoutAfterToTensor and args are not defined in this excerpt;
# they are assumed to come from the experiment code.
transform_cands = [
    CutoutAfterToTensor(args, 1, 10),
    CutoutAfterToTensor(args, 1, 20),
    v2.RandomHorizontalFlip(),
    v2.RandomVerticalFlip(),
    v2.RandomRotation(degrees=10),
    v2.RandomRotation(45),
    v2.RandomRotation(90),
    v2.ColorJitter(brightness=0.1),
    v2.RandomPerspective(),
    v2.RandomAffine(degrees=20, translate=(0.1, 0.3), scale=(0.5, 0.75)),
    v2.RandomResizedCrop(args.input_size[1:], scale=(0.8, 1.0), ratio=(0.9, 1.1), antialias=True),
    v2.RandomInvert(),
]
```

Figure 4: Augmentation set used sequentially for the calculation of uncertainty in the experiments.
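To illustrate how the outputs of the augmented views could be turned into a BI score, the sketch below assumes the log-sum-exp form of the Bregman Information from Gruber and Buettner (2023), i.e. the mean of the log-sum-exp of the logits minus the log-sum-exp of the mean logits. Eq. (1) is not reproduced in this appendix, so this form, as well as the function names, should be read as assumptions rather than as the exact estimator used in the experiments.

```python
import numpy as np

def logsumexp(z, axis=-1):
    """Numerically stable log-sum-exp along an axis."""
    z = np.asarray(z, dtype=np.float64)
    m = z.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(z - m).sum(axis=axis))

def bregman_information(logits):
    """logits: array of shape (P, C) from P augmented views of one sample."""
    logits = np.asarray(logits, dtype=np.float64)
    mean_of_lse = logsumexp(logits, axis=1).mean()
    lse_of_mean = logsumexp(logits.mean(axis=0), axis=0)
    # Non-negative by Jensen's inequality, since log-sum-exp is convex.
    return mean_of_lse - lse_of_mean

# Identical logits across views versus logits that disagree across views.
stable = bregman_information([[2.0, 0.0, -1.0]] * 4)
unstable = bregman_information([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]])
```

A sample whose logits do not change under augmentation receives a score of zero, while disagreement across views increases the score.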

A.3 Dataset Statistics

In Table 5, we report the statistics of the datasets used for the main experiments. We use datasets from different domains (CIFAR10, CRC-Tissue, KC-Cell) and with different modalities (20NewsGroups). For CRC-Tissue, in order to create tasks with decreasing size and an equal number of classes, we remove the smallest class from our evaluation.

Table 5: Statistics of the datasets.

Dataset	# Classes	# Samples	# Tasks
CIFAR10	10	60000	5
CRC-Tissue	9	107180	4
KC-Cell	8	236386	4
20NewsGroups	20	18828	5

A.4 Evaluation Metrics

In line with Yoon et al. [2021a] and Wuerkaixi et al. [2024], we employ the average Last Accuracy (A) and average Last Forgetting (F) – a federated adaptation of the metrics defined in Chaudhry et al. [2018a]. Last refers to the measurement of the value at the end of the stream for all the clients. Suppose $a_{k}^{t,i}$ represents the accuracy of task $i$ after learning task $t$ on client $k$ . The last accuracy $A_{k}$ at task $T$ on client $k$ is defined as $A_{k}=\frac{1}{T}\sum_{i}a_{k}^{T,i}$ . Let $K$ represent the total number of clients. The average last accuracy is then defined as:

A=\frac{1}{K}\sum_{k=1}^{K}A_{k}

(6)

Forgetting measures the difference between the peak accuracy and the final accuracy of each task. Usually, the peak accuracy is reached when the considered task is trained. After that, since the model is prone to focus more on the upcoming tasks, we observe a performance degradation on the previous tasks. For this reason, we want to keep the forgetting as low as possible. The average last forgetting is defined as follows:

F=\frac{1}{K}\sum_{k=1}^{K}F_{k}

(7)

where $F_{k}$ is defined as

F_{k}=\frac{1}{T-1}\sum_{t=1}^{T-1}\left(\max_{l\in\{1,\dots,T-1\}}a_{k}^{l,j}-a_{k}^{t,j}\right),\qquad\forall j<t.

(8)
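The two metrics can be computed from a per-client accuracy matrix as in the sketch below. It follows the textual description above (peak accuracy minus final accuracy for each of the first $T-1$ tasks, then averaged over clients); this reading of the indices in Eq. (8), as well as the function names and the toy numbers, are assumptions made for illustration.

```python
import numpy as np

def last_accuracy(acc):
    """acc[t, i]: accuracy on task i after learning task t (0-based), shape (T, T)."""
    acc = np.asarray(acc, dtype=np.float64)
    return acc[-1].mean()

def last_forgetting(acc):
    """Mean drop from peak to final accuracy over the first T-1 tasks."""
    acc = np.asarray(acc, dtype=np.float64)
    T = acc.shape[0]
    # Peak accuracy of each earlier task over the first T-1 training stages.
    peak = acc[: T - 1, : T - 1].max(axis=0)
    final = acc[-1, : T - 1]
    return (peak - final).mean()

def federated_average(per_client_values):
    """Average a per-client metric over the K clients, as in Eqs. (6) and (7)."""
    return float(np.mean(per_client_values))

# Toy accuracy matrices for K = 2 clients and T = 3 tasks.
client_acc = [
    [[0.9, 0.0, 0.0], [0.6, 0.8, 0.0], [0.4, 0.5, 0.9]],
    [[0.8, 0.0, 0.0], [0.7, 0.9, 0.0], [0.5, 0.6, 0.7]],
]
A = federated_average([last_accuracy(a) for a in client_acc])
F = federated_average([last_forgetting(a) for a in client_acc])
```

In this toy example both clients reach the same last accuracy, yet the first client forgets more, which shows why the two metrics are reported side by side.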
