1 Introduction
While deep neural networks (DNNs) undergo thorough training and validation using extensive datasets before their deployment on edge devices, they remain susceptible to degradation in prediction accuracy in real-world post-deployment scenarios. This degradation is primarily caused by shifts in the distribution of new input data samples, a phenomenon known as “dataset shifts” [12]. Such shifts often arise from sensor disturbances or environmental interferences [69]. To bolster the robustness of DNNs, various techniques, including data augmentation [29] and adversarial training [35], have been employed during offline training. However, these methods may not be adequate to handle the diverse range of data shifts that can occur after deployment. As a result, it is imperative to adapt neural networks to improve their prediction accuracy [22, 39, 49, 68].
Various transfer learning methodologies have been proposed to adapt neural networks on edge devices. One approach involves fine-tuning with a modest set of labeled target data. This strategy has proven effective at improving performance, outperforming domain generalization methods at lower cost [39, 47, 56, 74]. It entails first training a model on a large source dataset and then fine-tuning the pre-trained model on a smaller target dataset to adapt to distribution shifts. Another class of applications arises when labels for the new, potentially domain-shifted test data are unavailable, such as devices operating in remote places without human intervention [9, 18], or when the cost of annotating the new data is prohibitively high [27, 69]. Representative scenarios include (i) DNNs performing human action recognition on drones without labeled samples [18], (ii) techniques such as laser-induced breakdown spectroscopy in extreme environments (e.g., other planets) [8], and (iii) medical imaging, where scanners may introduce noise and the analysis DNN needs to adapt rapidly without labeled data [27].
To ensure that deployed DNNs maintain or elevate their prediction accuracy while meeting stringent performance constraints in streaming applications, real-time on-device adaptation to new shifted test data is crucial. Relying on cloud-based adaptation may not always be practical due to stringent timing deadlines or devices being in connectivity-deficient regions. Prediction-time adaptation at the edge presents unique challenges, primarily concerning speed and efficiency. DNNs processing streaming data are frequently up against tight timelines, making rapid adaptation a necessity. Moreover, edge devices are often resource-constrained and may rely on battery power, making lightweight and energy-efficient adaptation imperative.
In the race to achieve efficient adaptation on resource-constrained edge platforms, Compute-In-Memory (CIM) systems have gained prominence. Such systems, including analog crossbar arrays, effectively combat the memory bottlenecks intrinsic to von Neumann architectures [59]. Analog crossbars harness various memristive devices, such as Resistive Random-Access Memories (RRAMs), Phase Change Memories (PCMs), and Ferroelectric Field-Effect Transistors (FeFETs), which are extensively researched for their capability to perform low-precision DNN inference efficiently with high throughput [62]. Despite their advantages, CIM systems face unique challenges stemming from device-to-device variations, temporal conductance drift in memristive devices, and parasitic resistances in metallic interconnects within crossbar arrays [2, 13, 19, 32, 67]. These non-idealities can perturb a DNN’s weights and lead to accuracy degradation in real-world applications.
This article introduces a benchmarking framework and conducts a comprehensive measurement study of prediction-time DNN adaptation techniques, encompassing both supervised and unsupervised approaches, on CIM hardware substrates at the edge. To the best of our knowledge, this study is the first to explore both supervised and unsupervised approaches for resource-constrained devices with CIM technology. Our study aims to address the following algorithm-hardware co-design questions: (i) For each CIM hardware configuration, what constitutes the optimal choice of a robust DNN and test-time adaptation algorithm in terms of three key objectives: prediction accuracy, adaptation time, and energy dissipated during adaptation? (ii) What bottlenecks are encountered when executing these algorithms on various CIM architectures? (iii) Can adaptation techniques effectively address both environmental data shifts and inherent hardware noise to improve accuracy? Built upon the benchmarking framework, we conduct a series of design space exploration analyses, revealing intriguing and, at times, non-obvious outcomes, and illustrating crucial trade-offs between accuracy, performance, energy consumption, and memory utilization. Furthermore, our assessment extends to scenarios involving autonomous navigation by unmanned aerial vehicles (UAVs), where we showcase significant reductions in energy consumption and compute latency without compromising on performance, as gauged by the Mean Safe Flight (MSF) metric, i.e., the average flight distance before a crash.
This article, therefore, makes the following contributions:
— A benchmarking framework for evaluating DNN adaptation techniques, both supervised and unsupervised, on resource-constrained edge devices and UAV autonomous systems equipped with CIM hardware substrates.
— A holistic evaluation of DNN adaptation techniques across diverse hardware configurations, showcasing their ability to adapt to both external environmental shifts and inherent hardware noise.
— Insights from cross-stack algorithm-hardware-technology co-design space exploration, highlighting critical trade-offs between accuracy, performance, and energy efficiency concerning different DNN adaptation algorithms and CIM designs.
The rest of this article is organized as follows. Section 2 describes various techniques of DNN adaptation and CIM. Section 3 presents our proposed benchmarking suite for DNN adaptation with CIM at the edge. Section 4 conducts a cross-layer algorithm-hardware-technology evaluation of the adaptation with design space exploration. Section 5 further evaluates adaptation on UAV autonomous systems. Section 6 discusses pipelining partial adaptation. Section 7 concludes the article with a summary of test-time DNN adaptation at the edge with CIM. We have released the framework in a GitHub repository (https://github.com/SenFFF/CIM_Adaptation/).

4 Adaptation Evaluation Results
In this section, we elucidate the outcomes of our experiments on test-time DNN adaptation. Initially, we delve into the influence of various adaptation strategies on model performance in both supervised and unsupervised learning scenarios, considering different types of noise. Capitalizing on the complementary attributes of NVM devices, we then probe into the potential of hybrid CIM adaptation. Subsequently, we underscore the ability of test-time adaptation to manage both shifts in data distribution and inherent hardware noise. Finally, we venture into the adaptation design space tailored for distinct deployment contexts.
To provide clear insights, we employ varied configurations of our adaptation benchmarking platform to address the following Research Questions (RQs):

RQ1: How effective is DNN test-time adaptation with CIM in handling data distribution shifts?
RQ2: What is the effectiveness of DNN test-time adaptation with CIM in addressing output-level shifts?
RQ3: How does DNN test-time adaptation with CIM perform for unsupervised learning tasks?
RQ4: What constitutes the hardware overhead of NN test-time adaptation in CIM?
RQ5: How does an SRAM/NVM hybrid-based CIM system benefit DNN test-time adaptation?
RQ6: Is DNN test-time adaptation capable of adjusting to both data distribution shifts and inherent CIM hardware noise?
4.1 DNN Adaptation for Data Distribution Shift
RQ1: Evaluating adaptation performance for various DNN models under data distribution shift. In this RQ, we investigate the impact of adaptation on three distinct DNN models. We elucidate that test-time adaptation with CIM effectively elevates task accuracy. Notably, the choice of adaptation strategy influences not just the model’s performance but also the hardware overhead stemming from the adaptation process.
In this work, we use adaptation to enhance the robustness of DNN models. Tuning all trainable parameters of the network with distribution-shifted data seems a natural choice, yet it can be energy-consuming and may yield suboptimal performance. Previous studies have pointed out that adapting the whole network can degrade some learned features [39], as tuning all layers together can cause modules to deviate from their pre-trained optima. When addressing data distribution shifts in isolation from other noise sources, the optimal adaptation strategy appears to be context specific. For instance, deploying a pre-trained network on edge devices may expose it to low image quality, where adapting shallow layers can significantly enhance accuracy. This is because the initial layers, responsible for low-level feature extraction, may not be well suited to processing noisy images; adapting them can thus yield substantial improvements. Conversely, shifts in higher-level features may be best addressed by adapting middle layers, which bridge the gap between low-level features and more abstract representations. For example, adapting these layers can help the model recognize new shapes or patterns. Similarly, shifts at the output or label level, such as those exemplified by the CIFAR-Flip dataset, call for adapting the final mapping layers to correct mismatches between high-level features and labels.
We first train three DNN models, VGG8, ResNet20, and DenseNet40, on CIFAR-10 while quantizing them with WAGE regulation. All three models suffer accuracy loss when tested on CIFAR-10-C images; after test-time adaptation with CIM, however, accuracy on CIFAR-10-C improves. As adaptation is conducted at the granularity of blocks, we treat every two consecutive layers as one block in VGG8; for ResNet/DenseNet, we treat each basic block/Dense block as one block unit. The relative accuracy gain of test-time adaptation with CIM with respect to no adaptation is shown in Figure 5.
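To make the block-wise procedure concrete, the following PyTorch sketch shows one way to restrict adaptation to a single parameter block by freezing all others. The block partitioning and layer sizes here are illustrative stand-ins, not the exact VGG8 definition used in our framework.

```python
import torch
import torch.nn as nn

def freeze_all_but_block(blocks, adapt_idx):
    """Enable gradients only for the block selected for adaptation."""
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = (i == adapt_idx)

# Illustrative VGG8-style partitioning: two conv layers per block.
blocks = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 128, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Flatten(), nn.Linear(512 * 4 * 4, 1024), nn.ReLU(),
                  nn.Linear(1024, 10)),
])
model = nn.Sequential(*blocks)

freeze_all_but_block(blocks, adapt_idx=0)  # adapt only the first block
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```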
Figure 5 shows that both full and partial model adaptation enhance accuracy compared with the original model. Intriguingly, adapting select blocks within DNN models often yields better results than full-model adjustment: quantitatively, partial adaptation surpasses full adaptation in more than 80% of test cases throughout our experiments. Specifically, by fine-tuning a single parameter block while keeping the others static, we achieve superior outcomes in addressing distribution shifts. Furthermore, optimal performance is observed when different blocks are tuned to cater to different types of distribution shifts. For instance, adjusting the first block of VGG8 proves most effective for input-level shifts such as CIFAR-C (image corruption).
However, this observation is not universally valid, as evident from the results of ResNet20 and DenseNet40. We speculate that this discrepancy arises from the quantization effect. To validate this assumption, we examine the mean weight gradient across layers for the three networks. As depicted in Figure 6(a), deeper models such as ResNet20 and DenseNet40 have layers where the gradient is quantized to zero, meaning that the correction signal for those layers is discarded, a phenomenon absent in the comparatively shallow VGG8. Subsequent experiments reveal that, on average, 12.3% of trainable parameters in ResNet20 and 9.7% in DenseNet40 receive a zero weight gradient, whereas none do in VGG8. Because backpropagation processes deeper layers first and gradient computation for shallow layers relies on deeper layers’ activation gradients, shallow layers accumulate more quantization error. This reduces the adaptation accuracy boost for shallow blocks compared with their floating-point counterparts. The inefficiency is further exacerbated by the minimal learning rate used for adaptation, which makes gradient values even more susceptible to quantization.
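The following sketch illustrates how such zero gradients can be counted. The uniform symmetric quantizer is a simplified stand-in for WAGE gradient quantization, chosen only to demonstrate the measurement itself.

```python
import torch

def quantize_grad(g, bits=8):
    """Uniform symmetric quantizer (stand-in for WAGE): gradients
    smaller than half a quantization step round to zero."""
    scale = 2.0 ** (bits - 1) - 1
    step = g.abs().max() / scale
    if step == 0:
        return g.clone()
    return torch.round(g / step) * step

def zero_grad_fraction(model):
    """Fraction of trainable parameters whose quantized gradient is
    zero; call after loss.backward() has populated p.grad."""
    zero, total = 0, 0
    for p in model.parameters():
        if p.grad is None:
            continue
        q = quantize_grad(p.grad)
        zero += (q == 0).sum().item()
        total += q.numel()
    return zero / max(total, 1)
```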
4.2 DNN Adaptation for Output-Level Shift
RQ2: Adaptation performance under output-level shift for different DNN models. In this RQ, we study the effects of adaptation on the CIFAR-Flip dataset for three different DNN models. CIFAR-Flip can be seen as an output-level shift because the only difference between CIFAR-10 and CIFAR-Flip is the flipped label (\(x\rightarrow 9-x\)). Figure 6(b) presents the relative accuracy gain (i.e., block-wise adaptation accuracy minus full adaptation accuracy) on the CIFAR-Flip dataset. We observe that fine-tuning partial blocks surpasses full tuning in accuracy. Moreover, tuning the last layer works best for output-level shifts, and blocks closer to the loss function in the computation flow also suffer less from quantization error. The gap between last-block adaptation and fine-tuning other blocks shrinks with more adaptation epochs; however, more adaptation epochs also incur extra compute latency and energy overhead.
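As a concrete illustration of this output-level shift, a dataset wrapper along the following lines (a hypothetical sketch, not our benchmark code) reproduces the label mapping \(x\rightarrow 9-x\):

```python
from torch.utils.data import Dataset

class LabelFlip(Dataset):
    """Wrap a 10-class dataset and flip each label x -> 9 - x.
    Inputs are untouched, so the shift is purely at the output level."""
    def __init__(self, base):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        image, label = self.base[idx]
        return image, 9 - label
```

Wrapping the CIFAR-10 test split in `LabelFlip` would yield a CIFAR-Flip-style test set.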
4.3 Unsupervised Partial Adaptation
RQ3: Adaptation performance under unsupervised learning tasks. A vast number of applications cannot enjoy the benefit of annotation and must cope with a new working environment blindly, which prompts our experimentation on unsupervised adaptation under distribution shift. We apply Shannon entropy as the unsupervised loss function [68]. Figure 5 demonstrates the performance of unsupervised adaptation on the three DNN models. Interestingly, we find that for input-level shifts, unsupervised adaptation is effective, even slightly surpassing supervised adaptation in more than 50% of our test cases. For output-level shifts, however, unsupervised adaptation barely works: testing accuracies are all below 20% under the same adaptation strategy. This is conceivable, since Shannon entropy depends only on the model prediction and possesses no knowledge of label shifting.
From the hardware perspective, unsupervised entropy minimization requires exponential and logarithmic operations, whereas the supervised loss function we adopt is the summed square error. Squaring and summation require fewer hardware resources than their logarithmic and exponential unsupervised counterparts.
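The contrast between the two losses is visible in a direct sketch (assuming softmax outputs; the exact loss formulations in our framework may differ in scaling):

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits):
    """Shannon entropy of the prediction [68] -- unsupervised,
    requiring exponentials (softmax) and logarithms."""
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=1).mean()

def sse_loss(logits, labels):
    """Summed square error against one-hot labels -- supervised,
    requiring only squaring and summation."""
    onehot = F.one_hot(labels, num_classes=logits.size(1)).float()
    return ((logits - onehot) ** 2).sum(dim=1).mean()
```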
4.4 Hardware Cost Estimation
RQ4: Hardware cost (compute latency and energy) under different adaptation strategies and DNN models. With adaptation carried out at the edge, compute latency and energy play critical roles alongside task accuracy. Adapting different parts of the network incurs different latency and energy costs. We conduct a hardware cost estimation for different adaptation strategies with our adaptation benchmarking framework.
In a single epoch, training (or adapting) can be segmented into three main stages: forward pass, backward pass, and weight update. One batch of training finishes as soon as all weights are updated. When adapting blocks at different depths, the closer a block is to the loss function in the computation flow, the lower its adaptation latency. The flow is pictured in Figure 7. Since shallower layers have a true dependency on activation gradients from deeper layers, they must wait for deeper layers to finish their activation gradient calculations, even if those layers are not being adapted. Considering that adapting different blocks yields different accuracy gains, there is clearly room for exploration to strike a balance between accuracy, latency, and energy consumption.
We demonstrate this trade-off with a quantized VGG8 network carried on CIM, as shown in Figure 9. The whole network is partitioned into four blocks, each consisting of two consecutive layers; non-parameterized layers, such as max-pooling layers, are excluded. Given \(n\) layers and using \(\tau\) to denote latency, the latency of partially adapting the \(i^{th}\) layer is

\(\tau_{\mathrm{adapt}}(i)=\sum_{j=1}^{n}\tau_{\mathrm{fw}}(j)+\sum_{j=i+1}^{n}\tau_{\mathrm{ag}}(j)+\tau_{\mathrm{wg}}(i)+\tau_{\mathrm{wu}}(i),\)

where \(\tau_{\mathrm{fw}}\), \(\tau_{\mathrm{ag}}\), \(\tau_{\mathrm{wg}}\), and \(\tau_{\mathrm{wu}}\) denote a layer's forward, activation-gradient, weight-gradient, and weight-update latencies, respectively.
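A small numerical sketch of this latency model follows; the per-layer component latencies are arbitrary placeholders, not measured values.

```python
def adapt_latency(i, t_fw, t_ag, t_wg, t_wu):
    """Latency of adapting only layer i (0-indexed): a full forward
    pass, activation-gradient backprop through all layers deeper
    than i, then layer i's weight gradient and weight update."""
    return sum(t_fw) + sum(t_ag[i + 1:]) + t_wg[i] + t_wu[i]

# Placeholder latencies (arbitrary units) for a 5-layer network.
t_fw = [4, 3, 3, 2, 1]
t_ag = [4, 3, 3, 2, 1]
t_wg = [2, 2, 2, 1, 1]
t_wu = [1, 1, 1, 1, 1]

for i in range(5):
    print(f"adapt layer {i}: latency = {adapt_latency(i, t_fw, t_ag, t_wg, t_wu)}")
# Deeper layers need fewer activation gradients, hence lower latency.
```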
If more than one layer is adapted, the overall latency is the maximum adaptation latency among all adapted layers. We provide the temporal graph of a 5-layer network adapting two layers to illustrate this latency definition (Figure 8). Here, we consider mostly dynamic energy, since the sum of forward, weight-gradient, and activation-gradient dynamic energy dominates total energy consumption (>98%). The adaptation delay and energy consumption are given in Figure 9. Adapting the first block costs almost the same compute time as full adaptation, since the shallowest layer, if adapted, is very likely to be on the critical path among all layers. However, bypassing weight gradient computation and weight updates for the other blocks saves a large amount of dynamic energy, as those calculations are omitted. In practice, a more energy-efficient system is possible through techniques such as power gating the weight gradient unit while computing deeper layers’ activation gradients.
As the adapting block goes deeper, latency decreases because fewer activation gradients are computed. Weight gradient computation is executed with an SRAM-based CIM module in the core. The gradient compute engine consists of multiple subarrays, initialized so that the subarray cluster can fit the largest activation gradient across all layers. This implementation accelerates weight gradient derivation, and large layers can be further accelerated through operand replication and parallel execution across subarrays. Consequently, weight gradient computation is rarely the critical path in adaptation. Energy consumption increases with depth, which seems counterintuitive at first glance; a closer look at the energy breakdown provides clarity. The energy consumption predominantly stems from forward-read dynamic energy, activation-gradient dynamic energy, and weight-gradient dynamic energy. The trends in forward and activation dynamic energy mirror those of latency, as their variations align precisely with the computation flow’s fluctuations. Weight-gradient dynamic energy, however, relates directly to the number of weight parameters in the adapting layers: layers with more trainable parameters demand more dynamic energy during weight gradient derivation. Figure 9(c) elucidates the correlation between layer parameter size and weight-gradient dynamic energy consumption. This observation underscores how pivotal the adaptation strategy is in determining the platform’s algorithmic efficiency, physical performance, and deployment factors. In scenarios where accuracy is paramount, one might lean towards adapting shallower layers (subject to input-level shifts), even if this implies longer adaptation latency. Conversely, where minimizing latency is the priority, deeper layers may be chosen for adaptation, though this choice can come with the trade-offs of elevated energy consumption and diminished accuracy gains.
We also conduct a similar analysis on ResNet20, as shown in Figure 10. Latency declines as the adapting block goes deeper within the network. Energy consumption, however, relates to both block depth and the number of parameters in the adapting block. We choose the adapting unit of ResNet20 to be one basic block; the block with index 2 has the most weights, whereas the last block (the final FC layer) has the fewest. Though adapting the last block may not give the best accuracy recovery, it may still be an appealing option given its latency and energy benefits.
4.5 Hybrid Memory CIM Adaptation
RQ5: The effectiveness of adaptation based on SRAM/NVM hybrid-based CIM systems. This RQ prompts us to extend our adaptation benchmarking framework to SRAM/NVM hybrid-based CIM systems, with the aim of investigating the advantages of combining different memory cells to enhance system performance. NVMs generally offer more compact area and superior read characteristics, while SRAM is better at writing. Additionally, SRAM can utilize more advanced technology nodes, whereas NVMs are typically a few generations behind, even though some are compatible with CMOS processes. From a dataflow perspective, SRAM-CIM is aligned with sequential readout, as MAC operations are conducted row-wise. In contrast, mixed-signal CIM, exemplified by RRAM, achieves high throughput by activating multiple word-lines simultaneously. This discrepancy in dataflow leads to different peripherals being assembled within the array to support the respective workflows, resulting in different performance estimates.
We first measure the latency and energy consumption of different adaptation schemes with a full SRAM (14 nm) or full RRAM configuration; the results are shown in Figure 9. An RRAM-based CIM macro is approximately 20.21% faster than an SRAM-based system, while an SRAM-based CIM consumes 22.24% less energy than RRAM on average. Since only a portion of the network is adapted, we switch the adapting layers from an RRAM-based macro to an SRAM-based macro, forming a hybrid memory system: only the adapting layers are equipped with SRAM to speed up writing. The full adaptation scheme is omitted, since that would amount to a full SRAM-based system.
The performance of the hybrid system is shown in the histogram of Figure 9. Generally speaking, the hybrid CIM system exhibits latency and power consumption intermediate between the pure RRAM and pure SRAM CIM systems. For example, using the hybrid memory system to adapt block 0 saves 21.2% of overall energy compared with a pure RRAM system, at the price of a 24.6% latency increase (relative to full RRAM). For a design with a larger latency budget but tighter power constraints, swapping NVM for SRAM could be a viable solution. Moreover, as a charge-based technology, SRAM is free from the cell non-idealities of NVM; concerns such as cell wear-out do not degrade system performance, which could mean a higher accuracy gain in the long run.
From Figure 9, we can see that, when adapting the first or second block, the compute latency is even slightly better than that of a full RRAM system (0.96 s \(\rightarrow\) 0.95 s for the 1st block, 0.81 s \(\rightarrow\) 0.78 s for the 2nd block). This indicates that, even if latency is the only parameter of concern, there are still potential optima in the design space once cell type is introduced as a new dimension.
One thing we noticed during our experiments is that, although our assumptions for hybrid systems were based on the write properties of NVM/SRAM cells, weight update is not a dominant contributor to either latency or power consumption. This can be attributed to the adaptation workload, in which the learning rate is typically much smaller than when training from scratch. A smaller learning rate yields smaller parameter gradients, and with quantization, updates become even less critical, since zero-quantized gradients imply omitted writes and certain cells remain entirely untouched. In the core, the weight update cost is estimated by counting the number of write pulses applied to the subarrays, which is proportional to the magnitude of the weight gradient. As fine-tuning implies fewer write pulses than training, weight update is observed not to contribute the majority of the adaptation overhead.
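A minimal sketch of this pulse-counting estimate is shown below; the weight change per pulse is a hypothetical constant introduced only for illustration.

```python
import torch

def write_pulse_count(weight_grad, lr, delta_w_per_pulse=1e-3):
    """Estimate programming cost: pulses per cell scale with the
    magnitude of the applied update; zero-quantized gradients
    translate to zero pulses, i.e., the cell is never written."""
    delta_w = (lr * weight_grad).abs()
    pulses = torch.ceil(delta_w / delta_w_per_pulse)
    return int(pulses.sum().item())
```

With the small learning rates typical of adaptation, `delta_w` and hence the pulse count stay small, consistent with weight update not dominating the overhead.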
Another distinct advantage of NVMs is their compactness: an NVM cell can occupy an area smaller than \(10F^{2}\), whereas a traditional SRAM cell can span nearly \(150F^{2}\). Thus, a hybrid system could opt for more NVM layers for compactness or more SRAM for energy efficiency. We conduct an area estimation of the proposed hybrid system, shown in Figure 11. In summary, when swapping the RRAM 1T1R crossbar array for an SRAM array to facilitate partial adaptation, the area overhead is proportional to the number of tiles occupied by the adapted block. Since block size expands as we proceed from shallow blocks to deeper blocks, more SRAM tiles are required to hold the weights, leading to higher area overhead for the hybrid system.
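The area argument reduces to simple arithmetic over per-block cell counts, as in the sketch below; the per-block cell counts are hypothetical, and peripheral circuitry is ignored.

```python
# Cell sizes quoted above, in units of F^2.
F2_NVM, F2_SRAM = 10, 150

def hybrid_cell_area(cells_per_block, adapt_idx):
    """Total cell area when only the adapted block is swapped to SRAM."""
    return sum(n * (F2_SRAM if i == adapt_idx else F2_NVM)
               for i, n in enumerate(cells_per_block))

# Hypothetical per-block cell counts, growing with depth as in VGG8.
cells = [150_000, 600_000, 2_400_000, 8_400_000]
baseline = sum(cells) * F2_NVM  # all-RRAM reference
for i in range(len(cells)):
    overhead = hybrid_cell_area(cells, i) / baseline - 1
    print(f"adapt block {i}: cell-area overhead = {overhead:.1%}")
```

The overhead grows sharply when deeper, larger blocks are swapped to SRAM, matching the trend described above.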
4.6 Adapting NVM Non-ideality
RQ6: Adaptation on both environmental data shift and inherent hardware noise. In this RQ, we explore whether our proposed test-time DNN adaptation at the edge with CIM can adapt to data distribution shift and inherent NVM device noise at the same time. Non-idealities associated with NVM devices, such as unpredictable cell-to-cell variation, IR drop, and ON/OFF ratio degradation, have been a persistent issue for decades [10] and cause models deployed on edge NVM-based systems to suffer algorithmic performance degradation.
With the proposed framework, we find for the first time that such degradation caused by device imperfection can be amended by test-time DNN adaptation alongside the data distribution shift. As demonstrated in Figure 12, we evaluate the pre-trained models on the CIFAR-10-C dataset, with endurance degradation simulated as a reduced on-off ratio, i.e., the ratio between the HRS and LRS [70]. We observe that with an on-off ratio of 80, inference accuracy drops roughly 3% for VGG8; when the on-off ratio further drops to 30, inference accuracy quickly falls below 50%. Deeper networks appear more vulnerable to on-off ratio degradation, as their inference accuracy drops severely at either on-off ratio value. In all cases, applying 3 epochs of adaptation restores the network to a condition with better accuracy, and adaptation works for layers regardless of their depth. This shows that adaptation is capable of handling not only data corruption but also model decay from non-ideality. Partial adaptation on VGG8, ResNet20, and DenseNet40 also consolidates this observation; moreover, partial adaptation results in higher accuracy than full adaptation. While training a hardware-aware model might seem advantageous, certain noise sources cannot be realistically accounted for during the training stage. Indeed, our findings reveal that adapting a model aware of the on-off ratio on the CIFAR-10-C dataset yields an accuracy improvement of no more than 2% over a model unaware of the on-off ratio. This observation underscores the efficacy of test-time adaptation in simultaneously addressing both hardware and input noise, and it suggests a potential inclination towards direct edge-based model adaptation, circumventing the need for additional training and pre-deployment profiling of CIM non-idealities. Considering that limited endurance manifests as a reduced on-off ratio, adaptation may also be a potential defense against conductance shift in edge-deployed networks.
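One simple way to emulate a reduced on-off ratio in such an evaluation is sketched below. The model assumes weight magnitudes are programmed linearly into a conductance window calibrated for a nominal ratio while the actual window has shrunk; this is an illustrative approximation, not the exact endurance model of [70].

```python
import torch

def degraded_weights(w, nominal_ratio=100.0, actual_ratio=30.0):
    """Emulate on-off ratio degradation: magnitudes are programmed
    into a shrunken conductance window [g_min_act, g_max] but read
    back with a calibration that assumes the nominal window, which
    biases small weights the most."""
    g_max = 1.0
    g_min_nom = g_max / nominal_ratio   # window assumed at calibration
    g_min_act = g_max / actual_ratio    # degraded window after endurance
    w_max = w.abs().max().clamp_min(1e-12)
    g = g_min_act + (w.abs() / w_max) * (g_max - g_min_act)
    return torch.sign(w) * (g - g_min_nom) / (g_max - g_min_nom) * w_max
```

Replacing each layer's weights with `degraded_weights(w)` before inference lets one probe the accuracy drop that adaptation then has to recover.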
4.7 Design Space Exploration
We conduct a set of experiments to show the discrepancy between different CIM macros carrying out different adaptation strategies; the results are demonstrated in Figure 13. We evaluate RRAM macros, SRAM macros based on 22-nm and 14-nm technologies, and a hybrid CIM system combining RRAM and 14-nm SRAM. These systems are analyzed while adapting the VGG8 network on the CIFAR-10-C dataset over three epochs.
We establish an accuracy threshold of at least 75%; consequently, all adaptation schemes involving the 3rd block are eliminated. We further limit latency to under 2.85 s and cap energy consumption at 850 mJ. This filtering narrows our options to a few viable candidates concentrated in the bottom-left quadrant of the figure.
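The filtering itself is straightforward; a sketch with made-up candidate points (our measured design points live in the released framework) looks like this:

```python
# Each design point: (label, accuracy %, latency s, energy mJ).
# The values below are placeholders, not measured results.
candidates = [
    ("RRAM, adapt block 0",      78.0, 2.80, 840.0),
    ("RRAM, adapt block 3",      72.0, 1.20, 900.0),
    ("14nm SRAM, adapt block 2", 76.5, 2.10, 610.0),
]

viable = [c for c in candidates
          if c[1] >= 75.0 and c[2] < 2.85 and c[3] <= 850.0]
for label, acc, lat, energy in viable:
    print(f"{label}: acc={acc}%, latency={lat}s, energy={energy}mJ")
```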
As previously discussed in this section, the deeper the adapting block, the more the trend moves towards reduced latency, increased energy consumption, and diminished accuracy gains. Hence, systems primarily concerned with latency tend to favor deeper adapting layers, whereas those prioritizing energy efficiency typically lean towards adapting shallower layers. For tasks that weigh both latency and energy as significant factors, a few configurations emerge as suitable: RRAM adapting the 1st and 2nd blocks, 14-nm SRAM adapting the 3rd block, and the hybrid system adapting the 1st and 2nd blocks. These configurations are projected to be proficient for the task at hand.