1 Introduction
The Convolutional Neural Network (CNN) has become ubiquitous in artificial intelligence tasks such as image classification, object detection, and speech recognition thanks to performance that surpasses human levels [9, 10, 27, 31, 34]. To satisfy the prohibitively massive computational requirements of current deep CNNs, various domain-specific accelerators have been widely deployed in large-scale systems [8, 10, 12]. Among them, the systolic array, which exploits input/weight reuse to accelerate the dominant Multiply-and-Accumulate (MAC) operations of CNNs, is one of the most effective accelerator architecture designs and has received considerable attention from both academia and industry [31]; prominent examples include Google's Tensor Processing Unit (TPU) [25] and Gemmini [14] from UC Berkeley.
However, with shrinking semiconductor feature sizes and growing chip integration density, a systolic array with tens of thousands of Processing Elements (PEs) is increasingly vulnerable to high-energy neutron or \(\alpha\)-particle strikes, which cause transient bit flips during CNN execution, known as soft errors [1, 4, 11, 19, 27, 32]. In modern Deep Neural Network (DNN) systems, high data reuse and parallelism extend and accelerate error propagation, making these systems more susceptible to soft errors [21, 26]. Soft errors in modern DNN systems may cause severe accuracy degradation [9], prolonged training time [21], or fuzzy regions in semantic segmentation tasks [6]. For example, Google found that soft errors can cause a severe drop in model accuracy (e.g., more than 60%) that may take 10K to 100M iterations to recover from [21]. Meanwhile, as DNNs become widely deployed, accelerators are being integrated into various SoCs, and when applied in industrial settings they must comply with strict reliability standards. For example, under ASIL-D of ISO 26262, the failure rate of the SoC must be less than 10 Failures In Time (<10 FIT) [9]. Because the accelerator occupies only a small area of the SoC, it must satisfy even stricter reliability requirements for the whole SoC to meet its safety demands [27]. Therefore, while DNN accelerators pursue both higher performance and energy efficiency, reliable execution in the presence of soft errors also deserves particular attention [3, 18, 23, 24, 40, 45, 46].
As a result, researchers have adopted various resilience techniques, such as Dual Modular Redundancy (DMR), Triple Modular Redundancy (TMR), Error Correction Codes (ECCs), and hardening of hardware components, to mitigate the effect of soft errors in CNN accelerators [3, 10, 40]. However, these solutions typically incur expensive software/hardware costs, which may run contrary to the original design philosophy of the CNN accelerator. Thus, a better tradeoff between reliability and performance is desirable in commodity CNN accelerator design.
In this article, we observe two key characteristics of the systolic array and of CNNs that allow us to explore a hardware-software co-designed, opportunistic fault-tolerant paradigm for DNN accelerators. First, due to the fixed hardwired design of the CNN accelerator, entire columns of PEs sit idle when small-scale convolutional layers are deployed on the systolic array, and the same column-idle situation arises whenever the number of filters is not evenly divisible by the systolic array width; the widespread adoption of recent model pruning techniques further exacerbates this idleness. Since the systolic array exploits input reuse across its columns of PEs (i.e., filters), these column-idle characteristics naturally provide opportunities to execute some filters redundantly for reliability improvement (see the sketch below). Second, at the CNN model level, we observe that filters within the same layer exhibit distinctly different error sensitivities under soft errors. This characteristic guides us to preferentially assign the most vulnerable filters to the recycled idle-column PEs for maximal CNN reliability improvement.
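To make the column-idle observation concrete, the following minimal sketch (our own illustration, not part of the ReIPE artifact) counts the PE columns that would sit unused in the final mapping round of a layer:

```python
def idle_columns_last_round(num_filters: int, array_width: int) -> int:
    """Columns left unoccupied in the final mapping round of a layer.

    Each round maps up to `array_width` filters, one per PE column; a
    remainder that does not fill the array leaves whole columns idle.
    """
    remainder = num_filters % array_width
    return array_width - remainder if remainder else 0

# Example: 512 filters fill a 256-wide array exactly (no idle columns),
# but pruning 30 of them leaves 256 - (482 % 256) = 30 idle columns.
print(idle_columns_last_round(512, 256))  # 0
print(idle_columns_last_round(482, 256))  # 30
```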
Leveraging the preceding observations, we propose ReIPE, which intelligently Recycles Idle PEs in the CNN accelerator to detect soft errors in vulnerable filters. The main novelty of this study is leveraging program-level, filter-wise error resilience knowledge to improve the efficiency of hardware-level opportunistic soft error detection in CNN accelerators. ReIPE operates in five steps. First, before loading the weights of a layer into the systolic array, we carry out a filter-level gradient analysis offline, replacing traditional Fault Injection (FI) for fast filter-wise error resilience estimation. Second, combining the systolic array idleness with the CNN's filter-wise error resilience profile, ReIPE selects the most vulnerable filters offline for redundant execution. Third, to meet both the real-time and efficiency demands of error detection, ReIPE folds the detection process into the accelerator's original computation flow: the duplicated filter is assigned to the column next to the original vulnerable filter so that their outputs reach the check unit in adjacent cycles. Moreover, by exploiting the error-masking potential of the activation function (which masks errors on negative feature values) and of pooling layers (which mask 17.83% of errors in our experiments), ReIPE performs error detection after the pooling process to avoid unnecessary checks. Fourth, once an error is detected, recovery proceeds with low overhead by recomputing only the affected part, guided by the error detection information. Finally, for layers that fill the systolic array columns exactly (i.e., offer no idle opportunities), ReIPE triggers an extra calculation round to selectively protect the most vulnerable filters for better reliability.
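As a rough illustration of the second and third steps, the sketch below (our simplification; function and variable names are hypothetical) selects the most vulnerable filters for a layer's idle slots and places each duplicate in the column immediately following its original, so the two outputs reach the check unit in adjacent cycles:

```python
def build_column_order(vulnerability, num_idle):
    """Return the filter index loaded into each PE column for one round.

    `vulnerability[i]` is the error-sensitivity score of filter i (e.g.,
    from filter-level gradient analysis); `num_idle` idle columns are
    recycled to duplicate the top-`num_idle` most vulnerable filters.
    """
    num_filters = len(vulnerability)
    # Top-k vulnerable filters get a duplicate (k = number of idle columns).
    protected = set(sorted(range(num_filters),
                           key=lambda i: vulnerability[i],
                           reverse=True)[:num_idle])
    order = []
    for i in range(num_filters):
        order.append(i)          # original filter
        if i in protected:
            order.append(i)      # duplicate in the adjacent column
    return order

# Example: 6 filters, 2 idle columns -> duplicate the 2 most vulnerable.
print(build_column_order([0.9, 0.1, 0.4, 0.8, 0.2, 0.3], 2))
# [0, 0, 1, 2, 3, 3, 4, 5]
```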
In summary, we make the following contributions in this study:
– Suggest a filter-level gradient analysis method as an alternative to time-consuming FI for fast filter-wise soft error resilience estimation in CNNs.
– Explore an opportunistic soft error detection technique that recycles column-idle PEs in DNN accelerators to perform filter-wise duplication.
– Build a hardware/software co-designed soft error detection framework, named ReIPE, by mapping program-level vulnerable filters onto idle PEs of the specialized CNN accelerator. Experimental results show that ReIPE covers 96.40% of errors while reducing the performance loss of DMR by 75.06% and its energy consumption by 67.79%.
– Demonstrate the versatility of ReIPE across application scenarios, including (1) its adaptability to pruned, quantized, and Vision Transformer (ViT) models (Section 7.1), (2) its portability to other dataflows and accelerator architectures (Section 7.2), and (3) its extensibility to other fault models (Section 7.3).
5 Experimental Methodology
To evaluate the effectiveness of ReIPE, we select six open source CNN models of different scales. Table 1 lists the main characteristics of each model. We trained LeNet-5 on MNIST (10 classes of 28\(\times\)28-pixel handwriting images), ResNet-20 and Cifar-10-CNN on CIFAR-10 (10 classes of 32\(\times\)32-pixel RGB tiny images), and VGG-16, AlexNet, and ResNet-50 on ImageNet (1,000 classes of 224\(\times\)224-pixel RGB images).
To simulate the occurrence of soft errors in accelerators, we perform a hardware-aware FI that illustrates how a hardware fault affects the execution of the DNN. Specifically, leveraging the open source simulator SCALE-Sim [39], each random transient fault in a PE can be mapped to the corresponding parameter of the DNN model. For errors incurred in the weight register, the corrupted partial sum accumulates along the PE column; for errors occurring in the input register, the fault is broadcast along the row to affect all filters to its right. Both types of transient faults are wiped out by the following data load (i.e., the next calculation round). Additionally, the number of idle columns/cycles/rounds when deploying a particular layer on the systolic array is collected from SCALE-Sim, and the energy consumption of ReIPE is calculated with the metrics of Eyeriss [8].
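To make this fault model concrete, the following single-fault sketch (our own illustration, not the SCALE-Sim flow itself) flips one bit in the weight held by one PE for a single calculation round and shows that only that PE's column is poisoned:

```python
import struct

import numpy as np

def flip_bit(value, bit):
    """Flip one bit in the IEEE-754 binary32 representation of `value`."""
    (packed,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", packed ^ (1 << bit)))
    return np.float32(flipped)

def inject_weight_register_fault(weights, pe_row, pe_col, bit):
    """Model a transient flip in the weight register of PE (pe_row, pe_col).

    In a weight-stationary mapping, column `pe_col` holds one filter and
    row `pe_row` one of its weights, so the corrupted weight poisons every
    partial sum accumulated along that column for the current round.
    """
    faulty = weights.copy()
    faulty[pe_row, pe_col] = flip_bit(faulty[pe_row, pe_col], bit)
    return faulty  # the next weight load wipes out the fault

# One calculation round with and without the fault: only column 7 differs.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)).astype(np.float32)  # rows: weights, cols: filters
x = rng.standard_normal(16).astype(np.float32)        # one input row
golden = x @ W
faulty = x @ inject_weight_register_fault(W, pe_row=3, pe_col=7, bit=30)
print(np.flatnonzero(golden != faulty))  # [7]
```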
Moreover, even though transient faults in PEs can be mapped to DNN parameters, some hardware states are still not directly addressable by software (e.g., control faults). Such states are not the main focus of this study; typically, a more comprehensive fault model needs to be designed to simulate these errors. For example, to simulate a kind of control fault associated with the MAC unit, the Google research team replaced the corresponding output with a random faulty value [21].
6 Evaluation
We evaluate our proposed ReIPE by considering the following research questions:
– \(RQ_1\): What is the accuracy of gradient analysis for filter-wise error resilience estimation, and is its overhead acceptable?
– \(RQ_2\): How many errors can be covered by ReIPE?
– \(RQ_3\): Is the performance degradation incurred by ReIPE acceptable?
– \(RQ_4\): What is the energy cost introduced by ReIPE?
– \(RQ_5\): What is the effectiveness of ReIPE under different design scenarios?
6.1 Filter-Level Gradient Analysis Evaluation (\(RQ_1\))
ReIPE preferentially maps the top-\(k\) vulnerable filters onto the finite idle-column PEs for redundancy. Therefore, the effectiveness of ReIPE depends on the accuracy of gradient analysis in identifying the top-\(k\) vulnerable filters. This section evaluates the accuracy and performance of gradient analysis by comparing it against FI. Similar to prior studies in the area [9], for each model we perform random FI trials covering 20% (constant) of the exhaustive FI space to evaluate the representative vulnerability of filters. The corresponding error margin is at most 0.24% at the 99% confidence level, which guarantees statistically sound analysis results.
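That margin follows the standard population-proportion bound for sampled FI campaigns. As a quick sanity check, assuming the usual normal approximation with finite population correction, worst-case \(p=0.5\), and a hypothetical exhaustive campaign size:

```python
import math

def fi_error_margin(population, sample, z=2.576, p=0.5):
    """Margin of error for a sampled fault injection campaign.

    Normal approximation with finite population correction; z = 2.576
    corresponds to the 99% confidence level and p = 0.5 is the worst case.
    """
    fpc = (population - sample) / (population - 1)
    return z * math.sqrt(p * (1 - p) / sample * fpc)

# Hypothetical population: sampling 20% of ~1.5M fault sites keeps the
# margin around 0.21%, within the 0.24% bound quoted above.
N = 1_500_000
print(f"{fi_error_margin(N, int(0.2 * N)):.4%}")  # ~0.2103%
```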
We evaluate the accuracy of the gradient-based error resilience estimation approach by assessing the top-\(k\) coverage (i.e., \(C_{top\hbox{-}k}\)) in each layer, which represents the proportion of vulnerable filters correctly estimated by gradient analysis among the total vulnerable filters identified by FI (i.e., the ground truth). \(C_{top\hbox{-}k}\) can be calculated by
\[C_{top\hbox{-}k}=\frac{|S_{grad}\cap S_{FI}|}{|S_{FI}|},\]
where \(S_{grad}\) represents the set of top-\(k\) vulnerable filters estimated by gradient analysis and \(S_{FI}\) is the ideal set of top-\(k\) vulnerable filters identified by FI. For brevity and ease of comparison, we average \(C_{top\hbox{-}k}\) over the layers of each CNN, excluding layers with full redundancy opportunities. Besides, since a 256\(\times\)256 systolic array is overwhelming for LeNet-5, Cifar-10-CNN, and ResNet-20, we evaluate their \(C_{top\hbox{-}k}\) on a 16\(\times\)16 systolic array.
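Given per-filter vulnerability scores from both methods, \(C_{top\hbox{-}k}\) for one layer reduces to a set intersection, as in this minimal sketch (the score arrays below are made-up placeholders):

```python
import numpy as np

def topk_coverage(grad_scores, fi_scores, k):
    """C_top-k: overlap between the top-k filters ranked by gradient
    analysis and the top-k ground-truth filters ranked by FI."""
    s_grad = set(np.argsort(grad_scores)[::-1][:k])
    s_fi = set(np.argsort(fi_scores)[::-1][:k])
    return len(s_grad & s_fi) / k

# Example: 3 of the top-4 filters agree -> coverage 0.75.
grad = np.array([0.9, 0.1, 0.7, 0.8, 0.3, 0.6])
fi   = np.array([0.8, 0.2, 0.9, 0.1, 0.4, 0.7])
print(topk_coverage(grad, fi, k=4))  # 0.75
```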
Moreover, gradients can vary substantially across inputs. Therefore, to eliminate input-induced gradient discrepancies and characterize filter-wise error resilience accurately and fairly, we choose 10 inputs from each category and use the average gradient (consistent with the ground-truth FI). As shown in Figure 8, gradient analysis covers 92.17% of the top-\(k\) vulnerable filters on average, which implies that gradient analysis is effective for characterizing the soft error sensitivity of filters.
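For reference, a minimal PyTorch sketch of such input-averaged, filter-level gradient analysis might look as follows (our own reconstruction of the method's spirit; the model, loss, and data loader are placeholders):

```python
import torch
import torch.nn.functional as F

def filter_gradient_scores(model, data_loader, device="cpu"):
    """Average per-filter gradient magnitude over a set of inputs.

    For each Conv2d weight of shape (out_ch, in_ch, kh, kw), the score of
    filter f is the mean absolute loss gradient over its weights, averaged
    across all evaluated inputs; larger scores flag more vulnerable filters.
    """
    convs = [m for m in model.modules() if isinstance(m, torch.nn.Conv2d)]
    scores = [torch.zeros(m.out_channels, device=device) for m in convs]
    batches = 0
    model.eval()
    for inputs, labels in data_loader:  # e.g., 10 inputs per class
        inputs, labels = inputs.to(device), labels.to(device)
        model.zero_grad()
        loss = F.cross_entropy(model(inputs), labels)
        loss.backward()
        for s, m in zip(scores, convs):
            # |dL/dW| averaged over each filter's (in_ch, kh, kw) weights
            s += m.weight.grad.abs().mean(dim=(1, 2, 3))
        batches += 1
    return [s / batches for s in scores]  # one score vector per conv layer
```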
Beyond estimation accuracy, we compare the average time consumption of gradient analysis and the baseline FI, as shown in Table 2. For a post-training CNN, gradient analysis only needs to execute each sample in the test set once to obtain the average gradient of each filter, whereas the time consumption of FI depends on the number of parameters in the CNN. On average, gradient analysis is 2,364\(\times\) faster than FI, and the speedup grows with the total number of FI trials. For large-scale CNNs, owing to the tremendous number of FI trials required, gradient analysis exhibits an even more pronounced speedup over FI; for example, it reaches 7,505\(\times\) on ResNet-50.
6.2 Error Coverage Evaluation (\(RQ_2\))
We first report the error coverage of ReIPE, normalized to full DMR. As exhibited in Figure 9, by recycling column-idle PEs in the systolic array for redundant execution of vulnerable filters, ReIPE detects 96.40% of errors on average across six networks of different scales. For the small-scale networks (i.e., LeNet-5, Cifar-10-CNN, and ResNet-20), ReIPE achieves 100% error coverage because the number of idle columns far exceeds the number of filters in each layer, so every filter can be executed redundantly for error detection. For the large-scale networks, by preferentially loading duplicates of error-sensitive filters onto the systolic array, ReIPE covers 94.04%, 91.28%, and 93.08% of errors for VGG-16, AlexNet, and ResNet-50, respectively.
As mentioned in Section 4.1, ReIPE leverages filter-wise gradient analysis to preferentially select a set of error-sensitive filters for duplicate execution. Section 6.1 compared the accuracy and efficiency of gradient analysis against the baseline FI. To further verify its effectiveness for improving CNN accelerator reliability, we compare the error detection ability of ReIPE against random-ReIPE (i.e., randomly selecting filters for redundancy). For VGG-16, AlexNet, and ResNet-50, ReIPE covers 33.70%, 25.34%, and 30.76% more errors than random-ReIPE, respectively. These results demonstrate that our modified filter-level gradient analysis is effective for characterizing filter-wise soft error sensitivity.
6.3 Performance Evaluation (\(RQ_3\))
We further analyze the performance of the CNN accelerator under ReIPE. As shown in Figure 10, we normalize the execution times of ReIPE and R-DMR to the unprotected system. On average, ReIPE reduces the performance loss of DMR by 75.06% by selectively protecting the vulnerable filters of the CNN. Owing to the high fraction of idle columns in small-scale CNNs, the original and redundant calculations are performed in the same horizontal round; thus, when processing LeNet-5, ResNet-20, and Cifar-10-CNN, ReIPE has virtually no effect on performance. For large-scale CNNs, however, as mentioned in Section 4.2.2, redundant rounds are triggered to improve reliability, sacrificing a small amount of execution time.
For safety-critical scenarios, we propose R-DMR to provide full error coverage. Like ReIPE, R-DMR recycles column-idle PEs in the systolic array for error detection. As shown in Figure 10, R-DMR reduces the performance loss incurred by DMR by 60.97%. By recycling idle-column PEs, R-DMR fits part of the redundant calculations into the original rounds and consequently completes full redundancy in fewer cycles than DMR. Especially for small-scale layers, R-DMR finishes both the original and redundant calculations in the same round and thus has nearly no effect on accelerator performance.
In CNN accelerator architectures, tens of thousands of PEs are designed to provide massive computational throughput. Although DMR provides full error coverage, it introduces non-negligible performance loss, which may be contrary to the original systolic array design philosophy. In comparison, ReIPE maps software-level vulnerable filters onto idle PEs of the specialized CNN accelerator to perform selective redundant execution, achieving a better tradeoff between reliability and performance. Furthermore, for safety-critical scenarios with extraordinarily high error coverage requirements, our proposed R-DMR offers an optimized DMR design that addresses both reliability and performance concerns.
6.4 Energy Evaluation (\(RQ_4\))
In this section, we report the energy consumption of ReIPE. Figure 11 exhibits the normalized energy consumption of ReIPE, R-DMR, and DMR. As in the performance evaluation, the baseline is the unprotected system. On average, ReIPE and R-DMR reduce energy consumption by 67.79% and 40.35%, respectively, compared to traditional DMR.
When processing small-scale CNNs (e.g., LeNet-5, Cifar-10-CNN, and ResNet-20), ReIPE and R-DMR consume the same amount of energy because small-scale networks offer enough idle opportunities for ReIPE to perform full redundancy (i.e., ReIPE becomes equivalent to R-DMR). In addition, although ReIPE and R-DMR perform the same amount of calculation as traditional DMR in this case, they still consume less extra energy: by directly recycling idle columns for redundancy, the original horizontal input (i.e., the ifmap) is reused by the redundant execution, avoiding extra memory accesses.
For large-scale CNNs (e.g., AlexNet, VGG-16, and ResNet-50), ReIPE achieves relatively high error coverage by selectively protecting vulnerable filters, and thereby reduces extra energy consumption in memory access, computation, and communication. Unlike ReIPE, R-DMR incurs the same computational cost as DMR. Once one or more redundant rounds are triggered, both the filters and the corresponding inputs must be accessed and streamed into the accelerator, which results in extra energy cost. Nevertheless, compared with traditional DMR, R-DMR can still save part of the extra input-access energy by opportunistically recycling idle-column PEs in the systolic array.
6.5 Sensitivity Analysis (\(RQ_5\))
In this section, we first illustrate the scalability of ReIPE by deploying it on systolic arrays of various scales, and then explore the performance and energy consumption of ReIPE under different protection budgets (i.e., \(r_{protection}\)).
6.5.1 The Impact of Different Systolic Array Sizes.
To explore the effectiveness of ReIPE under different design scenarios, we conduct a sensitivity analysis across systolic array scales. Figure 12 shows the normalized error coverage and execution time of VGG-16, AlexNet, and ResNet-50 on 64\(\times\)64, 128\(\times\)128, 256\(\times\)256, and 512\(\times\)512 systolic arrays. Owing to their small scale, LeNet-5, Cifar-10-CNN, and ResNet-20 already achieve full error coverage at the 64\(\times\)64 size, so their results are not exhibited here.
First, error coverage grows steadily with array size, demonstrating that as the array grows, more filters can leverage idle-column PEs for error detection. In comparison, the execution time shows an upward and then downward trend, peaking at the 128\(\times\)128 size. As mentioned in Section 4.2.1, for layers without column-idle opportunities, ReIPE triggers one extra round to obtain \(k\) idle columns for reliability improvement, where \(k=width\). Because it triggers fewer idle opportunities than the 128\(\times\)128 size, the 64\(\times\)64 size incurs less performance loss for reliability improvement and consequently suffers lower error coverage. As the array size increases further, the number of inherent idle columns grows while fewer redundant rounds are triggered; eventually, the normalized execution time starts to decrease.
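These competing effects can be seen in a back-of-the-envelope round count (a deliberately simplified model of the extra-round policy in Section 4.2.1; it ignores row tiling and partially idle rounds):

```python
import math

def rounds_with_reipe(num_filters, width):
    """(baseline rounds, rounds under ReIPE's extra-round policy).

    Baseline: ceil(F / W) rounds map all filters. If a layer fills its
    columns exactly, ReIPE triggers one extra round, yielding W idle
    columns (k = width) for duplicating the top-k vulnerable filters.
    """
    base = math.ceil(num_filters / width)
    extra = 1 if num_filters % width == 0 else 0
    return base, base + extra

# A 512-filter layer fills every width below exactly, so an extra round is
# always triggered, but its relative cost shrinks on narrower arrays; a
# 384-filter layer gains inherent idle columns at widths 256 and 512.
for w in (64, 128, 256, 512):
    print(w, rounds_with_reipe(512, w), rounds_with_reipe(384, w))
```

On a narrow array, one triggered round is amortized over many baseline rounds, while on a wide array inherent idle columns make triggered rounds unnecessary; the normalized execution time therefore peaks at an intermediate size.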
6.5.2 The Impact of Different Protection Budgets.
We further explore the effectiveness of ReIPE under various \(r_{protection}\) values. Figure 13 reports the error coverage, normalized execution time, and energy consumption of VGG-16 under different protection rates for illustration. Under low \(r_{protection}\) requirements, ReIPE only opportunistically leverages idle opportunities for redundant execution. As \(r_{protection}\) increases, to ensure the protection rate of each layer, ReIPE triggers redundant rounds to create sufficient idle opportunities until \(r_{protection}\) is met. In the extreme, when \(r_{protection}=100\%\), ReIPE is equivalent to R-DMR. As shown in Figure 13, as \(r_{protection}\) gradually increases, the energy consumption and execution time progressively increase, since more redundant rounds are triggered to duplicate vulnerable filters. Interestingly, the error coverage curve starts to rise slowly once \(r_{protection}\) reaches 40%, demonstrating that the reliability benefit brought by triggering more rounds is feeble. As observed in Section 3, large-scale convolutional layers contain only a small number of error-sensitive filters; thus, the partial duplication of the top-\(k\) vulnerable filters proposed by ReIPE achieves a better tradeoff between reliability and performance. Moreover, compared to distributing fault-tolerant resources evenly across layers (the dashed line), leveraging \(B_{i,k}\) to select vulnerable filters achieves better error coverage with lower overhead (e.g., under a 30% \(r_{protection}\), optimized vulnerable filter selection with \(B_{i,k}\) saves 6.73% execution time and 12.49% energy consumption while gaining 5.1% additional error coverage).