5.1 Class-Incremental Scenario
Table 3 reports the performance results in terms of the ACC and BWT metrics (Equations (2) and (3), respectively). Figures 5 and 6 illustrate the evolution of \(R_{i,i}\) as \(i\) approaches \(T\), as well as the resource consumption in terms of memory usage and training duration. The low values of ACC and BWT in Table 3 show that catastrophic forgetting occurs not only in the fine-tuning approach but also in the LwF [27] and PNN [37] methods. In LwF [27], the knowledge distillation model accumulates increasingly many errors as the number of classes grows, suffering an exponential decrease in test prediction accuracy (see the LwF [27] curves in Figures 5 and 6). With respect to PNN [37], as the number of classes increases, the lack of a replay memory causes the method to struggle to distinguish between classes that are not jointly trained; this is known as inter-task confusion [31] and is evidenced by the low ACC values. EWC [40] offers slightly better performance, but is still very limited due to the trade-off in weights between old and new tasks.
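As a reading aid, the two metrics can be computed from the accuracy matrix \(R\), where \(R_{i,j}\) is the test accuracy on task \(j\) after training on task \(i\). The sketch below follows the usual GEM-style definitions (which Equations (2) and (3) are assumed to match); the matrix values are purely illustrative.

```python
# Illustrative computation of ACC and BWT from the accuracy matrix R.
# R[i][j]: test accuracy on task j after training on task i.
# (Values are made up for illustration; assumes the standard GEM-style
# definitions of the two metrics.)

def acc_bwt(R):
    T = len(R)
    # ACC: average accuracy over all tasks after training on the last one
    acc = sum(R[T - 1][j] for j in range(T)) / T
    # BWT: average change in accuracy on old tasks caused by later training
    bwt = sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)
    return acc, bwt

R = [
    [0.90, 0.00, 0.00],
    [0.70, 0.88, 0.00],
    [0.55, 0.60, 0.85],  # row after training on the final task
]
acc, bwt = acc_bwt(R)
print(round(acc, 4), round(bwt, 4))
```

A strongly negative BWT, as in this toy matrix, is exactly the signature of catastrophic forgetting discussed below.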
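The weight trade-off in EWC can be made concrete with a minimal sketch of its quadratic penalty (hypothetical scalar parameters, not the paper's implementation): parameters deemed important to the old task by the Fisher information are pulled toward their old values, limiting how freely the new task can reshape them.

```python
# Minimal sketch of the EWC regularizer (illustrative values only):
# the new-task loss is penalized by the Fisher-weighted distance to the
# parameters learned on the old task.

def ewc_penalty(params, old_params, fisher, lam=100.0):
    # lam trades off plasticity (new task) against stability (old task)
    return 0.5 * lam * sum(
        f * (p - p_old) ** 2
        for p, p_old, f in zip(params, old_params, fisher)
    )

# The first parameter has a high Fisher value, so moving it away from its
# old value is heavily penalized; the second is free to change.
print(ewc_penalty([1.0, 2.0], [0.5, 2.0], [4.0, 0.1]))  # → 50.0
```

When old and new tasks need the same weights to take different values, no setting of `lam` satisfies both, which is the limitation observed above.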
Better results are obtained with AGEM [5], which is able to restrain forgetting to a certain extent (notice the ACC and BWT metrics in Table 3). However, AGEM populates the replay memory with the last seen samples of each class. In practice, for HAR, this means that the replay memory has low variability in the ways (styles) the activities can be performed; that is, the replay samples contain similar information. This not only leads the neural network to forget other ways the activities can be performed, but also to overfit the limited styles present in the replay memory. GEM [28] employs the same selection criterion for its replay memory; however, it uses the entire replay memory at each optimization step and is therefore able to prevent forgetting more effectively.
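AGEM's single-constraint mechanism can be sketched as follows (pure-Python vectors for illustration, not the authors' code): if the current gradient conflicts with the average gradient computed on the replayed samples, it is projected so that it no longer increases the loss on the replay memory.

```python
# Sketch of AGEM's gradient projection (illustrative, plain lists as
# vectors): g is the gradient on the current batch, g_ref the average
# gradient on a batch drawn from the replay memory.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def agem_project(g, g_ref):
    if dot(g, g_ref) >= 0:            # no interference: keep g as-is
        return g
    scale = dot(g, g_ref) / dot(g_ref, g_ref)
    # remove the component of g that conflicts with the replay gradient
    return [gi - scale * ri for gi, ri in zip(g, g_ref)]

g = [1.0, -1.0]
g_ref = [0.0, 1.0]                    # replay gradient along the y-axis
print(agem_project(g, g_ref))         # → [1.0, 0.0]
```

GEM instead keeps one such constraint per past task and solves a quadratic program over all of them, which is why it forgets less but costs more, as discussed below.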
Our method achieves better results than the second best-performing method, GEM [28]. As a result of the downsampling and precision reduction, we are able to store and utilize 16× more samples in the replay memory than the other methods. This not only circumvents catastrophic forgetting but also combats inter-task confusion when the neural network is expanded. The better performance of our method can also be explained by the expected variability of the selected samples. Together, these two factors enable performance superior to that of complex and resource-demanding methods such as GEM [28].
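The 16× factor is consistent with, for instance, 4× temporal downsampling combined with a 4× precision reduction (32-bit floats stored as 8-bit integers); note that this particular split into two 4× factors is an assumption for illustration, as only the combined factor is stated.

```python
# Back-of-the-envelope check of the 16x replay-memory gain. The split
# into 4x downsampling and 4x precision reduction is assumed here for
# illustration; only the combined 16x factor is given in the text.

BASELINE_BYTES = 4       # 32-bit float per stored value
REDUCED_BYTES = 1        # 8-bit integer per stored value -> 4x smaller
DOWNSAMPLE_FACTOR = 4    # keep every 4th time step -> 4x fewer values

gain = (BASELINE_BYTES / REDUCED_BYTES) * DOWNSAMPLE_FACTOR
print(gain)  # → 16.0: the same memory budget holds 16x more samples
```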
We measured the training duration and memory footprint when running our method and the baselines on the two microcontrollers, the Raspberry Pi 4B and the Raspberry Pi 3B+. The results are shown in Figures 5 and 6. With respect to the memory footprint, there is no distinction between the microcontrollers; therefore, the following comments regarding resource utilization apply to both. In EWC [40], the importance weights (the Fisher information matrix) are calculated at the end of each task using all the data pertaining to that task. This incurs a high memory cost, since a large number of samples must be kept in memory until the task ends. The calculation of the importance weights is also responsible for the very long training duration, as it consists of performing numerous backward passes on the neural network. Given that each task has an equal number of samples, the memory footprint of EWC [40] remains constant and the training duration increases linearly. In LwF [27], samples can be discarded as soon as the neural network has been trained with them, which reduces the memory cost considerably. Its computational cost per task remains constant as the number of tasks increases; hence, the cumulative training duration grows linearly.
This is not the case for GEM [28], which solves a quadratic optimization problem with a number of constraints equal to the number of tasks. The constraints are formed from the gradients computed on the replay memory of each task, which leads to an ever-increasing computational cost and memory footprint. AGEM [5] abandons the growing set of constraints in favor of one single constraint; this guarantees constant memory usage and an approximately equal training duration per task.
Our method performs a significantly lower number of backward passes, and this number remains constant as the number of tasks grows. Moreover, when the neural network is expanded, the computational costs do not increase substantially. For these reasons, our method exhibits a linear growth in the elapsed training duration. Concerning memory usage, the results of our method are significantly lower than those of the non-expandable neural networks. An expandable neural network does not need to start with a high complexity as a future-proofing measure; instead, its complexity grows as new classes need to be learned, which results in more efficient memory usage. Eventually, our neural network can reach a memory usage higher than that of the static neural networks utilized in the other methods (except for PNNs). However, the increase in memory utilization is minimal and means that the neural network still has the capacity to learn entirely new features important to the recognition of new classes. Static neural networks simply become saturated and begin to underfit the data.
Our method employs operations not present in other continual learning methods: the storage of sample identifiers, the downsampling of samples before their storage in the replay memory, their precision reduction, and lastly their interpolation before rehearsal (i.e., performing backward passes with replay samples). The first operation incurs negligible processing cost and memory space (less than \(3\times10^{-9}\%\) of the total memory usage). The other three operations account for less than 10% of the total processing expenses of our method; hence, most of the processing cost is attributable to the backward passes on the neural network. Note that all four operations are only present during training.
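The downsample-then-quantize storage path and the interpolation path before rehearsal can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the downsampling factor and the fixed quantization scale are assumptions.

```python
# Illustrative sketch of the replay pre/post-processing: a sensor window
# is downsampled and quantized to 8 bits before storage, then dequantized
# and linearly interpolated back to its original length before rehearsal.
# (factor and scale are hypothetical; the real method may choose them
# per-sensor or per-window.)

def store(window, factor=4, scale=0.05):
    down = window[::factor]                       # temporal downsampling
    # quantize to signed 8-bit codes
    return [max(-128, min(127, round(x / scale))) for x in down]

def rehearse(stored, factor=4, scale=0.05):
    vals = [q * scale for q in stored]            # dequantize
    out = []
    for a, b in zip(vals, vals[1:]):              # linear interpolation
        out += [a + (b - a) * k / factor for k in range(factor)]
    return out + [vals[-1]]                       # restore the endpoint

w = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
restored = rehearse(store(w))                     # same length as w
```

Because both directions are a handful of arithmetic operations per value, it is plausible that they stay well under the 10% processing share reported above, with the backward passes dominating.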
PNNs have a memory cost similar to that of our neural network. However, as more lateral connections are instantiated, the computational graph grows comparatively more complex, thus requiring a higher memory cost than our method. This increase in complexity is not accompanied by any performance improvement. PNNs are still able to train faster, since backward passes do not occur in the frozen portion of the neural network. However, this also contributes to inter-task confusion and, consequently, catastrophic forgetting.
The inference results are illustrated in Figure 7. For all datasets, our model achieves a significantly lower inference time than the static model utilized in the continual learning baselines (with the exception of PNNs [37]). This is a consequence of the simpler computational graph, in which the features of distinct groups of classes are learned separately. Although PNNs [37] have the same number of weights as our model, their inference time is longer due to the lateral connections between the expanded blocks. Notice that the results for the ARM Cortex-A72 are similar to those of the ARM Cortex-A53, except that the former can execute the inference 3–4× faster than the latter. Additionally, we evaluated the performance degradation when the model's parameters are quantized to an 8-bit fixed-point representation before being deployed to the microcontroller. In this case, the decrease in the F1-score remained below 2%, whereas the model's size was reduced fourfold. Libraries such as TensorFlow Lite take advantage of fixed-point quantization to speed up model inference by 3× or more. Note that these quantities are independent of the continual learning method applied.
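The fourfold size reduction follows directly from replacing 32-bit floats with 8-bit codes. The toy example below illustrates the idea with a single symmetric scale per tensor; this is a simplification for illustration and not TensorFlow Lite's actual quantization scheme.

```python
# Toy post-training quantization (illustrative only, not TensorFlow
# Lite's scheme): weights are mapped to 8-bit integers with one shared
# scale, cutting storage from 4 bytes to 1 byte per weight.

def quantize(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]   # signed 8-bit codes
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.8, -0.25, 0.1, -1.27]
q, s = quantize(w)
err = max(abs(a - b) for a, b in zip(w, dequantize(q, s)))
print(len(w) * 4, "->", len(q), "bytes")      # 4x smaller
```

The reconstruction error is bounded by half the quantization step, which is why the accuracy drop after quantization can remain small, as reported above.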
5.2 Style-Incremental Scenario
Table 4 exhibits the results in terms of the ACC and BWT metrics, and Figure 8 reports the values of \(R_{i,i}\) and the resource consumption metrics. Since the data distribution varies on a much smaller scale across styles than across classes, forgetting is significantly less pronounced in the style-incremental scenario. This results in an accuracy that either grows slowly or oscillates in the fine-tuning approach (see the evolution of the curve in Figure 8), making it hard to spot any forgetting. Nevertheless, forgetting is still present in the fine-tuning approach, since the continual learning methods, in particular our method, result in higher ACC values.
We also observe that the continual learning baselines provide a close-to-negligible, or even detrimental, performance improvement compared with the fine-tuning approach. This indicates that these methods are not appropriate for scenarios where the variation in the data distribution decreases across tasks. In contrast, our method delivers an appreciable leap in performance with respect to fine-tuning, since our design is grounded in a rich, yet small in memory footprint, replay memory. Concerning resource consumption, the observations made about the EWC [40], LwF [27], GEM [28], and AGEM [5] methods remain valid in the style-incremental scenario. Since our model shares the same neural network architecture as the baselines, the inference time is exactly the same for all methods in the style-incremental scenario: 27.78 ms and 6.91 ms for PAMAP2 and MHEALTH, respectively, running on the ARM Cortex-A72. Again, the results on the Cortex-A53 are approximately 3–4× slower.
Although we utilized four datasets with distinct characteristics, such as the number of participants, number of classes, number of sensors, and sampling rate, the results were consistent across them for both scenarios. This holds not only for our method but also for the baselines.