To satisfy prohibitively massive computational requirements of current deep Convolutional Neural Networks (CNNs), CNN-specific accelerators are widely deployed in large-scale systems. Caused by high-energy neutrons and α-particle strikes, soft error may lead to catastrophic failures when CNN is deployed on high integration density accelerators. As CNNs become ubiquitous in mission-critical domains, ensuring the reliable execution of CNN accelerators in the presence of soft errors is increasingly essential.

In this article, we propose to Recycle Idle Processing Elements (PEs) in the CNN accelerator for vulnerable filters soft error detection (ReIPE). Considering the error-sensitivity of filters, ReIPE first carries out a filter-level gradient analysis process to replace fault injection for fast filter-wise error resilience estimation. Then, to achieve maximal reliability benefits, combining the hardware-level systolic array idleness and software-level CNN filter-wise error resilience profile, ReIPE preferentially duplicated loads the most vulnerable filters onto systolic array to recycle idle-column PEs for opportunistically redundant execution (error detection). Exploiting the data reuse properties of accelerators, ReIPE incorporates the error detection process into the original computation flow of accelerators to perform real-time error detection. Once the error is detected, ReIPE will trigger a correction round to rectify the erroneous output. Experimental results performed on LeNet-5, Cifar-10-CNN, AlexNet, ResNet-20, VGG-16, and ResNet-50 exhibit that ReIPE can cover 96.40% of errors while reducing 75.06% performance degradation and 67.79% energy consumption of baseline dual modular redundancy on average. Moreover, to satisfy the reliability requirements of various application scenarios, ReIPE is also applicable for pruned, quantized, and Transformer-based models, as well as portable to other accelerator architectures.

1 Introduction

The Convolutional Neural Network (CNN) has become ubiquitous for artificial intelligence tasks such as image classification, target detection, and speech recognition due to its surpassing human-level performance [9, 10, 27, 31, 34]. To satisfy prohibitively massive computational requirements of current deep CNNs, various domain-specific accelerators have been widely deployed in large-scale systems [8, 10, 12]. Among them, the systolic array that exploits input/weight reuse to accelerate the major Multiply-and-Accumulate (MAC) operations of CNN appears to be one of the most effective accelerator architecture designs and has also received a lot of attention from academia and industry [31]—for instance, Google’s Tensor Processing Unit (TPU) [25], and Gemmini [14] proposed by UC Berkeley.

However, with the small scale of semiconductor size and growth of chip integration density, the systolic array with tens of thousands of Processing Elements (PEs) is increasingly vulnerable to high-energy neutrons or \(\alpha\)-particle strikes, which will cause transient bit flips during CNN execution, called soft errors [1, 4, 11, 19, 27, 32]. In modern Deep Neural Network (DNN) systems, the high data reuse and parallelism characteristics will extend and accelerate error propagation, making them more susceptible to soft error [21, 26]. Soft error in modern DNN systems may cause severe accuracy degradation [9], prolonged training time [21], or fuzzy regions in semantic segmentation tasks [6]. For example, Google found that soft errors may cause a severe model accuracy decrease (e.g., more than 60%), which even needs 10K to 100M iterations to recover [21]. Meanwhile, as DNNs become widely deployed, accelerators are integrated into various SoCs; when applied to industrial applications, they need to comply with strict industry reliability standards. For example, in ASIL-D of ISO 26262, the failure rate of the SoC must be less than 10 Failures In Time (< 10 FIT) [9]. Accelerators occupy only a small area in the SoC. To ensure the whole SoC meets the safety demands, the accelerator must satisfy more strict reliability requirements [27]. Therefore, while the DNN accelerator is in the pursuit of both higher performance and energy efficiency, the reliable execution in the presence of soft errors also deserves particular attention [3, 18, 23, 24, 40, 45, 46].

As a result, researchers adopt different resilience technologies, such as Dual Modular Redundancy (DMR), triple modular redundancy, Error Correction Codes (ECCs), and hardware components hardening, to mitigate the effect of soft errors in CNN accelerators [3, 10, 40]. However, these solutions typically incur expensive software/hardware costs, which may be contrary to the original design philosophy of the CNN accelerator. Thus, a better tradeoff between reliability and performance is desirable in commodity CNN accelerator design.

In this article, we observe two key characteristics in both systolic array and CNN, which can help us explore the hardware-software co-designed opportunistic fault-tolerant paradigm for DNN accelerators. First, due to the fixed hardwired design of the CNN accelerator, we observe that some whole columns of PEs are idle when deploying small-scale convolutional layers to the systolic array, and such column-idle situations also appear when the scale/number of filters cannot be evenly divided by the systolic array width. Besides, the widespread deployment of recent model pruning techniques further exacerbates the idle situation. Systolic array adopts a delicate design that exploits input reuse among columns of PEs (or filters). Thus, the column-idle characteristics naturally provide us with opportunities to execute some filters redundantly for reliability improvement. Second, at CNN model level, we observe that multiple filters in the same layer exhibit distinctive error sensitivity discrepancy under soft error. This characteristic guides us to preferentially assign the most vulnerable filters to the recycled idle-column PEs for maximal CNN reliability improvement.

Leveraging the preceding observations together, we proposed ReIPE, which intelligently Recycles Idle PEs in the CNN accelerator for vulnerable filters soft error detection. The main novelty of this study is leveraging program-level filter-wise error resilience knowledge to improve the efficiency of hardware-level opportunistic soft error detection techniques in CNN accelerators. The overall setup of ReIPE mainly consists of the following steps. As a first step, before loading the weights of a specific layer into the systolic array, we carry out a filter-level gradient analysis process offline to replace traditional Fault Injection (FI) for fast filter-wise error resilience estimation. As a second step, combined with the systolic array idleness and CNN filter-wise error resilience profile, ReIPE offline selects the most vulnerable filters for redundancy execution. As a third step, to harmonize the real-time and efficient demand of error detection, ReIPE incorporates the error detection process into the original computation flow of accelerators. Specifically, we assign the duplicated filter to the next column of the original vulnerable filter to ensure their outputs reach the check unit in the adjacent cycles. Moreover, by squeezing the error-masking potential of the activation function (mask errors on negative feature values) and pooling layers (mask 17.83% errors in our experiment), ReIPE performs error detection after the pooling process to avoid unnecessary error detection for better efficiency. As a fourth step, once an error is detected, the recovery stage can be performed with low overhead by recomputing only the corresponding part based on the error detection information. As a fifth step, in addition, for some layers that appropriately fill the systolic array column (i.e., without idle opportunities), ReIPE will trigger an extra calculation round to selectively protect the most vulnerable filters for better reliability.

In summary, we make the following contributions in this study:

–

Suggest a filter-level gradient analysis method alternative to time-consuming FI for fast filter-wise soft error resilience estimation in CNN.

–

Explore an opportunistic soft error detection technique that recycles column-idle PEs in DNN accelerators to perform filter-wise duplication.

–

Mapping program-level vulnerable filters onto idle PEs of the specialized CNN accelerator, we build a hardware/software co-designed soft error detection framework named ReIPE. Experimental results exhibit that ReIPE can cover 96.40% errors while reducing 75.06% performance loss and 67.79% energy consumption of DMR.

–

Demonstrate the possibilities of ReIPE for various application scenarios, including the adaptability to (1) pruned, quantized, and Vision Transformer (ViT) models (Section 7.1), (2) the portability to other dataflows and accelerator architectures (Section 7.2), and (3) the extensibility to other fault models (Section 7.3).

2 Background

2.1 CNNs and Systolic Array Design

The systolic array consists of a set of simple structured PEs, and its horizontal and vertical pipelines bring high throughput, thus providing significant performance advantages in accelerating GEMM, the core computation CNN, GNN, Transformer, and so on [14, 25, 39, 56]. As exhibited in Figure 1, each PE executes the MAC operations and connects tightly with their neighbors in two dimensions. By reusing different types of data, the systolic array can employ three common dataflow strategies, including Weight Stationary (WS), Input Stationary (IS), and Output Stationary (OS). They respectively fix the weights, input feature map (i.e., ifmap), and partial sum (i.e., psum) on the systolic array during the computation [14, 39]. In this work, we consider the WS-style systolic arrays containing 256 \(\times\) 256 PEs as the baseline architecture (just like Google’s TPUv1 [25]).

Fig. 1.

In the WS dataflow, each weight in a filter will be pre-allocated into a PE along a column. The ifmap streams the array from the left side to the right PE for input data reused among filters. At the same time, each PE will generate a psum and propagate to the underneath neighbor for accumulation. Systolic array processes CNN in a layer-by-layer manner. Since the scale of the systolic array is fixed, the calculation of large-scale layers needs to be divided into several filter-group (horizontal) execution rounds according to the array size [39]. Likewise, the filters containing massive weights also need to be executed in multiple vertical rounds (i.e., the number of weights in a filter is much more than the number of PEs in a column of accelerators).

2.2 Fault Model

The execution procedure of CNNs consists of two steps: training and inference. The training phase is typically performed on high-performance platforms, where the back-propagation process generally squashes errors [37]. In contrast, the inference phase will be executed multiple times and may be deployed on unreliable platforms (e.g., energy-constrained embedded systems). Therefore, same as prior work [31], we focus on the error resilience of the CNN inference phase. Moreover, considering that ECC can effectively protect accelerator memories [29], we focus on errors occurring on the register (typically FF [21]) of on-chip PEs during DNN execution. Due to the high-parallelism characteristics of accelerators, applying ECC on PE registers (i.e., weight, input, or psum) means performing ECC generation, verification, and update for all PEs in each cycle, which is an unacceptable penalty in both area and performance overhead due to the high integration of PE and the large scale of the arrays. Therefore, an effective hardening/duplicating technique is desired to enhance the reliability of accelerator PEs. In addition, during the inference, errors may occur on weights, ifmaps, and psum, where the ifmaps and psums change as the mission varies while the weights will be shared. Therefore, we focus on the errors in the weights due to their frequent reuse.

Soft errors may manifest as single-bit flip or multiple-bit flips in practice. The data perturbation caused by soft errors may result in network misclassification (the corrupted output differs from the error-free one) [9]. Similar to previous works [37, 41], we mainly consider the single-bit flip model, as it is typically considered the most common error type [34, 53]. Consistent with traditional HPC programs [46], the DNN model also exhibits error sensitivity discrepancies under different inputs [32]. Therefore, in line with previous studies [9, 53], this study chooses 10 inputs for each category (i.e., 10 categories in MINST and CIFAR-10, 1,000 categories in ImageNet) to capture the representative error resilience of CNNs.

3 Observation and Motivation

In this section, we introduce two key observations of this study: (1) idle columns on the systolic array caused by the mismatch between CNN computation and accelerator scale, and (2) natural error resilience discrepancy of CNNs in both inter- and intra-layer levels. The combination of program- and architecture-level observations motivates us to explore the efficient opportunistic fault-tolerant strategy in CNN accelerators.

3.1 Idle Columns in the Systolic Array

For CNNs, the scale of convolutional layers (i.e., filter number) typically grows gradually as the network goes deep [20, 43]. While the CNN accelerator is hardwired, the total number of rows/columns/PEs is fixed. As mentioned in Section 2.1, when executing a specific layer of CNN, weights are strictly tied to the column of PEs at the granularity of a filter. Consequently, when small-scale convolutional layers are deployed to the CNN accelerator (i.e., the filter number is less than the systolic array width (\(f_{num}\lt width\))), some whole columns of PEs in the systolic array are idle, and such column-idle situations also appear when the scale of filters in the specific layer cannot be evenly divided by the systolic array width (i.e., \(f_{num} \% width \ne 0\)). In addition, the widespread deployment of recent model compression techniques further exacerbates the idle situation (e.g., pruning) [17]. Figure 2 shows the fraction of column-idle when deploying different convolutional layers of the selected CNN models to the baseline CNN accelerator. We observe that for small-scale and pruned CNN models, the fraction of column-idle remains high throughout the entire model inference process (77.1%, 70.2%, 86.4%, 30.1%, and 56.4% for LeNet-5, Cifar-10-CNN, ResNet-20, pruned VGG-16 [16], and pruned ResNet-50 [30], on average, respectively). For original large-scale models such as VGG-16, AlexNet, and ResNet-50, the column-idle situations also exist when executing the first few convolution layers. During the execution of CNN on the systolic array, the weights of the later layer can be loaded and processed only after the computation of the former layers is finished. Therefore, although column-idle situations exist, it is difficult to directly leverage these idle-column PEs to improve cross-layer CNN processing performance.

Fig. 2.

To further reveal the impact of column idleness on the performance of the systolic array, we depict the normalized performance growth trend at different convolutional layer scales on the systolic array. Taking a specific layer that includes \(f_{num}\) filters as an example, each filter contains 3 \(\times\) 3 \(\times\) 256 weights, and padding and stride are set to 1. As shown in Figure 3, the execution time grows in a stepwise manner as \(f_{num}\) increases, and within each step interval, the execution time grows slightly. As mentioned in Section 2.1, given the sub-round processing behavior and high parallelism nature of the systolic array, the execution time of a specific layer mainly depends on the number of computation rounds rather than \(f_{num}\). For example, the execution of 512 filters (i.e., without idle column) is only 0.13% slower than the execution of 257 filters (i.e., with 255 idle columns) on the baseline systolic array. As can be seen, the idle columns are not only difficult to utilize for other layer computations, but allowing idle columns in the array is also a non-negligible performance waste for the execution of this layer.

Fig. 3.

Fortunately, we observe that the systolic array exhibits unique data reuse characteristics among columns—that i, the ifmap will be broadcast from the left-side PEs to right PEs along the row line for data reuse among filters. Hence, we can attempt to recycle idle columns for redundant executing partial filters. Due to the stepwise growth of execution time, opportunistic duplication will introduce negligible performance penalties.

3.2 Inter- and Intra-Layer Soft Error Resilience Discrepancy

The architecture-level column-idle phenomenon provides the chance to duplicate part of filters for opportunistic CNN reliability improvement. To further verify the preceding fault-tolerant design philosophy, in this section we explore the error sensitivity discrepancy of CNNs in both inter- and intra-layer levels.

Figure 4 exhibits the layer-level misclassification rate of the investigated CNN models when performing FIs. As can be seen, for all models, the earlier small-scale convolutional layers (i.e., with column-idle opportunities in the CNN accelerator) have a higher risk of causing erroneous output under the impact of soft error. In contrast, the large-scale deep layers are resilient to soft error striking. For example, as illustrated in Figure 4(c), the misclassification rate in the first layer (with 64 filters) of VGG-16 is 19.14\(\times\) than that in the 32nd layer (with 512 filters). The reason different layers exhibit various error sensitivities is that errors occurring in the first few layers may propagate to the entire network via the convolutional computation. The inter-layer error resilience discrepancy is exciting, as it indicates that the error-sensitive layers of CNN coincidentally have more chances to leverage idle-column PEs for redundant execution, which may bring considerable reliability improvement for the entire network.

Fig. 4.

Furthermore, we observe that filters in the same layer also exhibit resilience discrepancies (i.e., intra-layer level). For instance, Figure 5 compares the misclassification rate of various filters in the first layer of ResNet-50. We observe that the most vulnerable filter achieves a 12.35% misclassification rate, whereas some filters are resilient to soft errors. We have observed similar results on other CNN models and layers, which are not exhibited here due to space constraints. The intra-layer error resilience discrepancy further guides us to leverage idle columns to preferentially duplicate error-sensitive filters for better reliability improvements.

Fig. 5.

4 ReIPE: Recycling Idle PEs for Vulnerable Filters Sof Error Detection

Leveraging the column-idle feature of the systolic array and error resilience discrepancies of CNN, we detail the hardware/software co-designed soft error detection framework, ReIPE, which intelligently Recycles the whole column of Idle PEs in the CNN accelerator to perform vulnerable filters soft error detection.

4.1 Filter-wise Error Resilience Profile Estimation

Typically, fault-tolerant resources (i.e., idle columns) are finite, and protecting more vulnerable filters will bring greater reliability benefits. Therefore, for each layer with redundancy opportunities, the first essential task is identifying a set of filters for preferential duplication.

To obtain the filter-wise error resilience profile of CNNs, the typical way is to perform sufficient FI trials for each filter [9, 37]. Each FI campaign needs a complete CNN execution. Therefore, this way usually introduces significant time/energy overhead due to the tremendous number of weights/filters in modern CNNs. As a substitute for FI, in this section we suggest a filter-level gradient analysis method for fast filter-wise soft error resilience estimation.

Previous works have noticed that the weight gradient refers to the weight’s contribution to the model performance [7, 15]. The underlying philosophy is that the gradient \(\frac{\partial loss}{\partial w}\) generated by the back-propagation process can reflect the model accuracy loss caused by introducing an infinitesimal change in weight. Inspired by the preceding concept, we heuristically conclude that weights with significant gradients typically have a greater influence on CNN correctness under the impact of soft error [10]. Since ReIPE recycles idle columns to perform filter-level redundancy, we extend the gradient analysis to the filter level for assessing filter-wise soft error sensitivity.

During the model test process, we leverage back-propagation to get the gradient of weights. Taking cross-entropy (represented by C) as the loss function:

\begin{equation} C(x,y)=- {\textstyle \sum _{i}^{Class}}x_{i}logy_{i}, \end{equation}

(1)

where Class is the number of classes and \(x_{i}\) is the binary indicator (one hot code, 0 or 1) if class label i is the correct classification. \(y_{i}\) is the predicted probability of class i. The weight-level gradient can be calculated as follows:

\begin{equation} gradient\_w=\frac{\sum _{n=1}^{N}\frac{\partial C_{n}}{\partial w}}{N}, \end{equation}

(2)

where N is the number of test samples in the dataset and \(\frac{\partial C_{n}}{\partial w}\) represents the gradient of weight w under sample n. To set generalizability analysis conclusions, we average the gradient results of all test samples. By accumulating the absolute value of the weight-level gradients in the filter, we get the filter-level gradient, formulated in Equation (3):

\begin{equation} gradient\_f=\sum _{f=1}^{F}\mid gradient\_w_{f} \mid , \end{equation}

(3)

where F is the number of weights in the filter. Note that since the gradient analysis process is microarchitecture agnostic, for a post-training model we can perform this stage offline for vulnerable filter identification.

The traditional FI also can be performed offline; however, it typically introduces millions of trials for a network to get credible conclusions [9]. On the contrary, while the gradient analysis consumes a bit more execution time than a single injected model inference process, it can calculate the average gradient of each weight by a limited N back-propagation process on all samples, which is far more efficient than FI campaigns. More details about the comparison of gradient analysis and FI will be discussed in Section 6.1.

4.2 ReIPE Design

In this section, we will depict the implementation of ReIPE, which leverages our proposed gradient analysis method to select program-level vulnerable filters and map them onto idle-column PEs of CNN accelerators for efficient soft error detection. ReIPE mainly consists of the following three stages: (1) duplicate filters identification/selection, (2) weights mapping and loading, and (3) error detection and correction.

4.2.1 Duplicate Filters Identification/Selection.

The possibility of performing redundant filter execution on idle PEs depends on the fraction of column-idle. For each layer, ReIPE will duplicate the top-k vulnerable filters in three scenarios according to the variation column-idle rate (represented as \(r_{idle}\)), which can be easily calculated by the systolic array width (represented by width) and the number of filters in a layer (represented by \(f_{num}\)), formulated in Equation (4) \((when\; f_{num} \% width \ne 0)\):

\begin{equation} r_{idle}=(width-(f_{num} \% width)) / (\lceil f_{num} / width\rceil \times width), \end{equation}

(4)

where \(\lceil f_{num} / width\rceil\) is the number of horizontal execution rounds for the specific convolutional layer in the systolic array. \((f_{num} \% width)\) is the number of filters in the last horizontal execution round. As mentioned before, each filter is tied to one specific column of the systolic array. Consequently, \((width-(f_{num} \% width))\) represents the number of idle columns when deploying a convolutional layer on CNN accelerators.

For intuitive and clear description, Figure 6 depicts the three top-k vulnerable filter selection scenarios:

Fig. 6.

–

When \(r_{idle}\ge 0.5\), as shown in Figure 6(a), it implies that there will be enough idle columns for the full redundancy of filters. All original and duplicated filters can finish execution within one round in this circumstance. This scenario usually occurs in the first few layers of the original CNN and most layers of pruned CNN, where we set \(k=f_{num}\).

–

When \(0\lt r_{idle}\lt 0.5\), this demonstrates that there are some idle columns, but not enough for full/complete duplication of all filters in this layer. Hence, partial redundancy will be triggered (as exhibited in Figure 6(b)). Based on the method described in Section 4.1, ReIPE selects the top-k sensitive filters for duplication, where k is the number of idle columns—that is, \(k=(width-(f_{num} \% width))\). Based on the observation in Section 3.2, large-scale layers contain only a small number of vulnerable filters. Given the relatively rare occurrence of soft errors [9, 52, 53], partial duplication for the most top-k vulnerable filters will bring decent error coverage in most situations.

–

While processing some middle layers, as shown in Figure 6(c), filters may appropriately fill the systolic array column (i.e., \(f_{num} \% width = 0\) or \(r_{idle}=0\)), indicating that there are no opportunities to recycle any idle-column PEs for soft error detection. In pursuit of better reliability, ReIPE provides an optional strategy that triggers a redundant round to create opportunities for filter redundancy. Note that these large-scale convolutional layers typically require multiple horizontal rounds to execute. Therefore, triggering one extra round does not equal complete duplication for the entire layer. Due to the stepwise performance characteristic of the systolic array, as illustrated in Section 3.1, once an extra round is triggered, the number of duplicated filters has little effect on the performance. More duplicated filters mean higher error coverage. Consequently, in this scenario, we sacrifice a bit of execution time to fill all triggered idle columns with duplicated filters for redundant execution (i.e., \(k=width\)).

Since the scale of the network, the filter-level gradient, and the systolic array size are fixed, the preceding vulnerable filter identification/selection stage is easy to carry out off-chip.

4.2.2 Weights Mapping and Loading.

As shown in Figure 7, based on vulnerable filter identification results in the preceding step, the selected vulnerable filters (i.e., the bright columns) are repeatedly loaded to the recycled idle columns during the weight-loading process. In addition, to ensure the error detection processes can finish timely, ReIPE maps the duplicate filters (i.e., the shaded columns) next to their original filters so that their outputs can be obtained in adjacent cycles.

Fig. 7.

The only difference from the original execution process is that the vulnerable filters will be loaded twice during the weight-loading process. ReIPE does not disturb the original horizontal-input-reuse-vertical-psum-accumulation dataflow of the baseline systolic array, which means the high parallelism of the systolic array is preserved. Therefore, ReIPE will not introduce much performance degradation.

4.2.3 Error Detection and Correction.

Considering that activation functions (e.g., RELU) and pooling can mask some bit flip errors during CNN execution [32, 37], as Figure 7 shows, we place the error detection unit after the pooling to avoid unnecessary comparisons overhead for better efficiency.

At the feature map writeback stage, ReIPE introduces a demultiplexer (i.e., DEMUX) to determine whether to perform the error detection process. If the output is attached to the immaterial filter (i.e., the gray columns), indicating this filter is resilient to soft error attacks, ReIPE avoids unnecessary error correction processes and sends the results to the input buffer directly. In another case, the output corresponds to a vulnerable filter, demonstrating that errors occurring in the filter may significantly affect CNN correctness, and ReIPE first sends the output to the compare buffer. Then, in the next cycle, the output of the duplicated filter will reach the comparator for error detection. If these two adjacent cycle outputs are equivalent, meaning soft error did not occur or has been masked by activation functions and pooling, ReIPE writes the verified data to the input buffer. Otherwise, indicating an error is detected, ReIPE will reload the fault-free weights from the weight buffer to overwrite the parameters in the PE.

Moreover, considering the erroneous filter’s former filters (i.e., left filters in systolic array) have finished execution, we optimize the error correction strategy by only restarting the erroneous filter and its latter/right filters for energy saving. Besides, compared to the sizeable PE array, the scale of the checking unit is small, and errors rarely occur in it. Similar to prior studies [38, 55], we adopt simple gate-sizing hardening techniques to reduce its SER. Once a transient fault occurs on the checking unit, the most likely consequence is that one correct result is recognized as an erroneous one, activating the unnecessary error recovery stage of ReIPE. However, the computational correctness will not be affected, as the corresponding transient fault will be wiped.

4.3 Costumed Selective Fault Tolerance Strategy

For common CNN deployment cases, we give the default top-k setting in Section 4.2.1. Furthermore, to satisfy the various reliability requirements of different application scenarios, ReIPE still allows users to specify the protection rate for each layer (i.e., \(r_{protection}\)) based on their protection budget. In the limit, \(r_{protection}=100\%\) equal to DMR, which is essential for some safety-critical scenarios, such as autonomous vehicles. Similar to ReIPE, R-DMR can also recycle column-idle PEs in the systolic array for error detection.

Moreover, we suggest a reliability-performance metric for assisting users in maximizing reliability benefits within the user-defined limited Fault Tolerance Performance Budget (FTPB):

\begin{equation} B_{i,k}=\frac{V_{i}\times \frac{ {\textstyle \sum _{j=1}^{k}gradient_{j}}}{\textstyle \sum _{l=1}^{N}gradient_{l}}}{T_{i,k}}. \end{equation}

(5)

For each layer, \(V_i\) is the vulnerable factor of layer i (i.e., the misclassification contribution of layer i under the impact of soft errors), which can be obtained from layer-level FI). \(\frac{{\textstyle \sum _{j=1}^{k}gradient_{j}}}{\textstyle \sum _{l=1}^{N}gradient_{l}}\) is the ratio of the top-k filter’s gradient over the N filters’ gradient in the i-th layer representing the relative vulnerability of the selected k filters. Consequently, \({V_{i}\times \frac{ {\textstyle \sum _{j=1}^{k}gradient_{j}}}{\textstyle \sum _{l=1}^{N}gradient_{l}}}\) can represent the increased error coverage/reliability benefit of duplicating the corresponding top-k filter in layer i. \(T_{i,k}\) is the increased execution time of the k duplicated filters, as well as the protection overhead. Therefore, we can leverage \(B_{i,k}\)(i.e., the proportion of reliability benefit and performance overhead) to denote the relative benefit of protecting k filters of layer i. With the help of \(B_{i,k}\), we can perform layer-wise prioritize redundancy until the user FTPB is achieved (can be formulated as a knapsack problem similar to that in prior work [47]).

The detailed sensitivity analysis of error coverage and performance/energy cost under different \(r_{protection}\) will be discussed in Section 6.5.

5 Experimental Methodology

To evaluate the effectiveness of ReIPE, we select six open source CNN models with different scales. Table 1 lists the brief characteristics of each model. We trained LeNet-5 on MINST (10-class of 28 \(\times\) 28 pixel handwriting images), ResNet-20 and Cifar-10-CNN on CIFAR-10 (10-class of 32 \(\times\) 32 pixel RGB tiny images), VGG-16, AlexNet, and ResNet-50 on ImageNet (1,000-class of 224 \(\times\) 224 pixel RGB images).

Table 1.

Model	Dataset	Layers	# Params	Accuracy
LeNet-5	MINST	2Conv,3FC	60K	99.14% (Top-1)
Cifar-10-CNN	CIFAR-10	4Conv,3FC	478K	79.61% (Top-1)
ResNet-20	CIFAR-10	19Conv,1FC	1.03M	91.70% (Top-1)
VGG-16	ImageNet	13Conv,3FC	138M	89.68% (Top-4)
AlexNet		5Conv,3FC	233.08M	79.06% (Top-5)
ResNet-50		53Conv,1FC	36M	90.35% (Top-5)

Table 1. Model Description

To simulate the occurrence of soft errors in accelerators, in this study we perform a hardware-aware FI to illustrate how the fault in hardware affects the execution process of DNN. Specifically, leveraging the open source simulator SCALE-Sim [39], each random transilient fault in PEs can be mapped to the corresponding parameter in the DNN model. For errors incurred in the weight register, we consider that the poisoned psum will accumulate along the PE column; for errors occurring in the input register, we consider that the fault will be broadcast along the row line to affect all right-hand filters. The preceding two types of transient faults will be wiped in the following data load process (i.e., the next calculation round). Additionally, the number of idle columns/cycles/rounds when deploying a particular layer on the systolic array is collected from SCALE-Sim, and the energy consumption of ReIPE is calculated with the metric in Eyeriss [8].

Moreover, even though the transilient fault in PE can be mapped to DNN parameters, some computer states are still not directly addressable by software (e.g., control faults). The preceding state is not the main focus of this study. Typically, a more comprehensive fault model is needed to design to simulate these errors. For example, to simulate a kind of control fault associated with the MAC unit, the Google research team replaced the corresponding output with a random faulty value [21].

6 Evaluation

We evaluate our proposed ReIPE by considering the following research questions:

–

\(RQ_1:\) What is the accuracy of gradient analysis for filter-wise error resilience estimation, and is the overhead acceptable?

–

\(RQ_2\): How many errors can be covered by ReIPE?

–

\(RQ_3\): Is the performance degradation incurred by ReIPE acceptable?

–

\(RQ_4\): What is the energy cost introduced by ReIPE?

–

\(RQ_5\): What is the effectiveness of ReIPE under different design scenarios?

6.1 Filter-Level Gradient Analysis Evaluation (\(RQ_1\))

ReIPE preferentially maps top-k vulnerable filters onto finite idle-column PEs for redundancy. Therefore, the effectiveness of ReIPE depends on the accuracy of gradient analysis in identifying the top-k vulnerable filters. This section will evaluate the accuracy and performance of gradient analysis by comparing it with FIs. Similar to the prior studies in the area [9], for each model we perform random FI trials with 20% (constant) of the exhaustive FI to evaluate the representative vulnerability of filters. The corresponding error margin is at most 0.24% at the 99% confidence level, which is able to guarantee the statistically desired analysis results.

We evaluate the accuracy of a gradient-based error resilience estimation approach by assessing top-k coverage (i.e., \(C_{top\hbox{-}k}\)) in each layer, which represents the proportion of vulnerable filters correctly estimated by gradient analysis to total vulnerable filters identified by FI (i.e., ground truth). \(C_{top\hbox{-}k}\) can be calculated by

\begin{equation} C_{top\hbox{-}k}= \frac{\left|S_{grad}\bigcap S_{FI}\right|}{\left| S_{FI}\right|}, \end{equation}

(6)

where \(S_{grad}\) represents the set of top-k vulnerable filters estimated by gradient analysis and \(S_{FI}\) is the ideal set of top-k vulnerable filters identified by FI. For brevity and ease of comparison, we averaged \(C_{top\hbox{-}k}\) of each layer for each CNN, and the layers with full redundancy opportunities are not taken into account. Besides, since the 256 \(\times\) 256 scale of the systolic array is overwhelming for LeNet-5, Cifar-10-CNN, and ResNet-20, we evaluate \(C_{top\hbox{-}k}\) of them under a 16 \(\times\) 16 systolic array.

Moreover, the gradient error bars can be substantial under different inputs. Therefore, to eliminate input-induced gradient discrepancies and accurately/fairly characterize filter-wise error resilience, we chose 10 inputs of each category to get an average gradient (consistent with the ground truth FI). As shown in Figure 8, we observe that the gradient analysis can cover 92.17% top-k vulnerable filters on average, which implies that gradient analysis is efficient for filter soft error sensitivity characterization.

Fig. 8.

Beyond estimation accuracy, we further compare the average time consumption of gradient analysis and baseline FI, as shown in Table 2. For a post-training CNN, gradient analysis only needs to execute each sample in the test set once to get the average gradient of each filter, whereas the time consumption of FIs depends on the number of parameters in CNNs. On average, gradient analysis is 2,364\(\times\) faster than FI, and the degree of speedup will increase as the total number of FI trials increases. For large-scale CNN, due to the tremendous FI trials needed, gradient analysis exhibits a more noticeable speedup versus FI. For example, the speedup of gradient analysis relative to FI achieves 7,505\(\times\) when testing on ResNet-50.

Table 2.

CNNs	LeNet-5	Cifar-10-CNN	ResNet-20	VGG-16
Speedup	59	269	856	4,707
CNNs	AlexNet	ResNet-50	AVG
Speedup	789	7,505	2,364

Table 2. Average Speedup of Gradient Analysis Relative to FI

6.2 Error Coverage Evaluation (\(RQ_2\))

We first report the error coverage of ReIPE. The results are normalized to the full DMR. As exhibited in Figure 9, by recycling column-idle PEs in the systolic array for vulnerable filters redundant execution, we observe that ReIPE can detect 96.40% errors on average across six different scales of networks. For the small-scale network (i.e., LeNet-5, Cifar-10-CNN, and ResNet-20), ReIPE achieves 100% error coverage. This phenomenon is because the number of idle columns is much larger than the number of filters for each layer. Consequently, each filter has opportunities to perform redundancy for error detection. As for the large-scale network, by preferentially duplicated loading error-sensitive filters onto the systolic array, ReIPE can cover 94.04%, 91.28%, and 93.08% of errors for VGG-16, AlexNet, and ResNet-50, respectively.

Fig. 9.

As mentioned in Section 4.1, ReIPE leverages filter-wise gradient analysis to preferentially select a set of error-sensitive filters to perform duplicate execution for better reliability. Section 6.1 compared the accuracy and efficiency of gradient analysis with baseline FI. To further verify its effectiveness in the reliability improvement of CNN accelerators, we further compare the error detection ability between ReIPE and random-ReIPE (i.e., randomly selecting filters for redundancy). As can be seen, for VGG-16, AlexNet, and ResNet-50, ReIPE can covers 33.70%, 25.34%, and 30.76% more errors than random-ReIPE, respectively. The results demonstrate that our modified filter-level gradient analysis is efficient for filter-wise soft error sensitivity characterization.

6.3 Performance Evaluation (\(RQ_3\))

We further analyze the performance of the CNN accelerator under the impact of ReIPE. As shown in Figure 10, we normalize the execution time taken by ReIPE and R-DMR to the system without any protection. We observe that ReIPE, on average, reduces 75.06% performance loss of DMR due to implementing selective protection for vulnerable filters in CNN. Due to the high fraction of column-idle for small-scale CNN, the original and redundant calculations are performed in the same horizontal round. Thus, when processing LeNet-5, ResNet-20, and Cifar-10-CNN, ReIPE virtually does not affect performance. However, for large-scale CNN, as mentioned in Section 4.2.2, the redundant round will be triggered to improve reliability, thus sacrificing a bit of execution time.

Fig. 10.

For some safety-critical scenarios, we propose R-DMR for full error coverage. Like ReIPE, R-DMR can also recycle column-idle PEs in the systolic array for error detection. As shown in Figure 10, R-DMR can reduce 60.97% performance loss incurred by DMR. By recycling idle-column PEs, R-DMR put part of redundant calculations into the original round. As a consequence, R-DMR completes full redundancy with fewer cycles than DMR. Especially for small-scale layers, R-DMR finishes both the original and redundant calculations in the same round, thus nearly having no effect on accelerator performance.

In CNN accelerator architecture, tens of thousands of PEs are designed to provide massive computational throughput. Although DMR provides full error coverage, it introduces non-negligible performance loss, which may be contrary to the original systolic array design philosophy. In comparison, ReIPE can map software-level vulnerable filters onto idle PEs of specialized CNN accelerators to perform selective redundant execution, achieving a better tradeoff between reliability and performance. Furthermore, for some safety-critical scenarios with extraordinarily high error coverage requirements, our proposed R-DMR tries to provide an optimized DMR design to guarantee both reliability and performance concerns.

6.4 Energy Evaluation (\(RQ_4\))

In this section, we give the energy consumption of ReIPE. Figure 11 exhibits the normalized energy consumption of ReIPE, R-DMR, and DMR. Similar to performance evaluation, the baseline is the system without any protection. As we can see, on average, ReIPE and R-DMR can reduce energy consumption by 67.79% and 40.35% compared to the traditional DMR, respectively.

Fig. 11.

When processing small-scale CNNs (e.g., LeNet-5, Cifar-10-CNN, and ResNet-20), we observe that ReIPE and R-DMR consume the same amount of energy. The reason is that there will be enough idle opportunities for ReIPE to perform full redundancy in small-scale networks (i.e., ReIPE is equal to R-DMR). In addition, while ReIPE and R-DMR introduce identical calculation costs compared with traditional DMR, we notice that they still introduce less extra energy. ReIPE and R-DMR can directly recycle idle columns to perform redundancy, where the original horizontal input (i.e., ifmap) will be reused by redundant execution, thus avoiding extra memory access.

For large-scale CNNs (e.g., AlexNet, VGG-16, and ResNet-50), ReIPE achieves relatively high error coverage by selectively protecting vulnerable filters. Therefore, ReIPE can reduce extra energy consumption in terms of memory access, computation, and communication. Unlike ReIPE, R-DMR performs identical computational cost as DMR. Once a(multiple) redundant round(s) is triggered, both filters and corresponding input will be accessed and streamed into accelerators, which results in extra energy cost. However, compared with traditional DMR, R-DMR still has opportunities to reduce part of extra input access energy by opportunistically recycling idle-column PEs in the systolic array.

6.5 Sensitivity Analysis (\(RQ_5\))

In this section, we first illustrate the scalability of ReIPE by deploying it on various scales of systolic arrays, followed by exploring the performance and energy consumption of ReIPE under different protection budgets (i.e., \(r_{protection}\)).

6.5.1 The Impact of Different Systolic Array Sizes.

To explore the effectiveness of ReIPE under different design scenarios, we conduct a sensitivity analysis under different systolic array scales. Figure 12 shows the normalized error coverage and execution time of VGG-16, AlexNet, and ResNet-50 under 64 \(\times\) 64, 128 \(\times\) 128, 256 \(\times\) 256, and 512 \(\times\) 512 systolic array sizes. Due to their small scale, LeNet-5, Cifar-10-CNN, and ResNet-20 will achieve full error coverage in the 64 \(\times\) 64 size; these results are not exhibited here.

Fig. 12.

First, as can be seen, error coverage grows steadily with array size, demonstrating that more filters can leverage idle-column PEs to perform error detection with the growth of array size. In comparison, we notice that the execution time shows an upward and then downward trend. When the array size is 128 \(\times\) 128, the normalized execution time reaches a peak. As mentioned in Section 4.2.1, for layers without column-idle opportunities, ReIPE triggers one round to get k idle columns for reliability improvement, where \(k=width\). Due to fewer triggered idle opportunities compared to the 128 \(\times\) 128 size, the 64 \(\times\) 64 size incurs less performance loss for reliability improvement and consequently suffers lower error coverage. As the array size increases further, the number of inherent idle columns starts to increase while fewer redundant rounds are triggered. Eventually, the normalized execution time starts to decrease.

6.5.2 The Impact of Different Protection Budgets.

We further explore the effectiveness of ReIPE at various \(r_{protection}\). Figure 13 reports the error coverage, normalized execution time, and energy consumption of VGG-16 under different protection rates for illustration. Facing low \(r_{protection}\) requirements, ReIPE only opportunistically leverages the idle opportunities for redundant execution. As \(r_{protection}\) increases, to ensure the protection rate of each layer, ReIPE will trigger some redundant rounds to create sufficient idle opportunities until the \(r_{protection}\) is met. Exceptionally, when \(r_{protection}=100\%\), ReIPE is equivalent to R-DMR. As shown in Figure 13, as \(r_{protection}\) gradually increases, the energy consumption and execution time progressively increase since more redundant rounds are triggered to duplicate vulnerable filters. As an interesting characteristic, we notice the error coverage curve starts to increase slowly when the \(r_{protection}\) reaches 40%, demonstrating that the reliability benefits brought by triggering more rounds are feeble. As observed in Section 3, large-scale convolutional layers contain only a small number of error sensitivity filters; thus, the partial duplication for the top-k vulnerable filters proposed by ReIPE can achieve a better tradeoff between reliability and error coverage. Moreover, compared to evenly distributing fault-tolerant resources at each layer (the dashed line), leveraging \(B_{i,k}\) to select vulnerable filters can achieve better error coverage with lower overhead (e.g., under 30% \(r_{protection}\), optimized vulnerable filter selection with \(B_{i,k}\) can save 6.73% execution time and 12.49% energy consumption to reach 5.1% additional error coverage).

Fig. 13.

7 Discussion

7.1 Adaptability of ReIPE to Pruned, Quantized, and ViT Based Models

7.1.1 Pruned Model.

Pruning is a prevalent technique to compress DNNs by reducing redundant parameters for performance and energy benefits [15, 28]. However, the model pruning technique removes weights with less importance and only preserves weights that contribute significantly to the results. Along with compressing the model scale, the pruning technique may also squeeze the original resilience of the DNN model, making it more susceptible to errors. As shown in Figure 14, taking VGG-16 as an example, the misclassification rate of the pruned model is, on average, 2.26% higher than the original model, and it is even 3.40% higher at layer 14, which indicates that the pruned network is indeed more susceptible to the influence of errors. Therefore, beyond cost optimization, ensuring the reliability of the pruned network is also vital. In this section, we discuss the efficiency of ReIPE for pruned models.

Fig. 14.

Structured pruning will prune weights at the granularity of filters to achieve structured sparsity [42]. Thus, the structured pruned CNN will face more idle columns when deployed to the systolic array, which can be utilized by ReIPE for filter-wise redundant execution, thereby boosting the error resilience of the network. As a use case, we performed ReIPE with a filter-level structured pruning on VGG-16 at 55.47% pruning rate. As shown in Figure 15, ReIPE can provide 93.72% error coverage by introducing merely 13.35% performance degradation and 22.72% extra energy overhead. Compared to the original network, due to the high fraction of column-idle opportunities, most duplicate calculations can be performed in the same round with original filters, thus nearly introducing no performance loss.

Fig. 15.

7.1.2 Quantized Model.

The modern accelerator also applied different data types to balance the inference efficiency and accuracy [21, 25]. We further explore the error sensitivity under different data types, including FP16 and INT8.

As an illustration, Figure 16 exhibits the layer-level misclassification rate of VGG-16 in FP32, FP16, and INT8. Similar to our previous observation, for all three data types, the shallow small-scale layers have a higher risk of causing erroneous output than the deep large-scale layer. Moreover, different data types also exhibit error sensitivity discrepancies. First, FP16 and FP32 exhibit similar layer-level misclassification rate distribution. In addition, in each layer, FP16 exhibits a higher misclassification rate than FP32. We attribute the discrepancy to a higher exponents ratio by analyzing the data format. FP16 contains 5 exponent bits and 10 fraction bits, whereas FP32 contains 8 exponent bits and 23 fraction bits. Typically, errors are more serious when they occur in the exponent than in the fractions [22]. Higher exponents percentage makes FP16 more susceptible to soft errors. Second, in shallow layers, error impact is typically more severe on the INT8 than float point (FP) format. However, as the layer goes deeper, the model in INT8 is more resilient than that in FP. One possible reason is the discrepancy in data representation. For shallow layers of FP format, only part of exponent bits are vulnerable to soft errors; fractions typically exhibit error resilience characteristics. In contrast, errors’ impact on integers grows bit by bit. For the deeper layers, the reason FP32 is more vulnerable than INT8 may be attributed to data distribution. For example, in the deeper layers of VGG-16, weights in FP32 are distributed within the ± 0.05 interval. Consequently, errors in FP bits, even in the fractions field, may also lead to severe data offsets, resulting in model misclassification. Compared to INT8, the fraction field robustness advantage in shallow layers no longer exists.

Fig. 16.

Although different data types exhibit different error sensitivities, filter-level error sensitivity discrepancies still exist. The column idleness phenomenon still allows us to improve system reliability opportunistically.

7.1.3 Vision Transformer.

Modern ViT models have received increasing attention, and their reliability also deserves concern. ViT consists of multiple stacked encoders. In each encoder, Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) are the main functional components, both of which can be converted into GEMM/GEMV and accelerated by the systolic array based accelerators [2, 50]. Through performing reliability analysis, the specific error resilience discrepancy manifestations and the corresponding selective fault tolerance opportunities are as follows.

Multi-Head Attention. MHA contains three parameter matrices: query matrix Q, key matrix K, and value matrix V. The basic calculation (attention) of MHA can be referred to the following formula:

\begin{equation} Attention(Q,K,V)=softmax \left(\frac{QK^T}{\sqrt {d_k}}\right)V. \end{equation}

(7)

The systolic arrays mainly accelerate the following two steps of MHA: (1) the scoring matrix generation (i.e., \(QK^T)\) and (2) multiply the normalized score matrix \(softmax(\frac{QK^T}{\sqrt {d_k}})\) with the value matrix (V) to generate the output. We observe the misclassification rate of V is much higher than Q and K. Taking the first layer of ViT-B/16 as an illustration, on average, V exhibits 7.67% misclassification rate, whereas Q and K are only 0.03% and 0.03%, respectively. A possible reason for this phenomenon is that the result of Q and K will be sent to the softmax function, and even if there is a serious error, the parameters will be shrunk into the range (0, 1). On the contrary, when an error occurs on V, the fault result will be passed directly backward and propagated to the following layer norm and MLP, which in turn leads to serious error consequences. The preceding reliability discrepancies guide us to enhance the vulnerable V-correlation computation, and relax the constraints on Q and K to reduce the fault tolerance overheads.

Multi-Layer Perceptron. MLPs in ViT usually consist of a two-layer Fully Connected (FC) network. Similar to FC layers in CNNs, MLPs in ViT can be transferred to matrix-vector multiplication (GEMV) and accelerated by the systolic array [2]. We also observe error sensitivity discrepancy among neurons in MLP, which is consistent with previous works [10, 22]. For example, in the first MLP in ViT-b, the most vulnerable neurons exhibited a 3.38% misclassification rate, while some neurons are robust to errors. The preceding phenomena inspired us to selectively duplicate the execution of vulnerable neurons for efficient fault-tolerant design.

Besides, except for reliability discrepancies within the encoder, recent work [49] also found that different transformer encoders (i.e., layers) of ViT also exhibit vulnerability discrepancies. The layer-level reliability discrepancies can further guide us in designing flexible layer-wise protection strategies based on each layer’s resilience profile/knowledge to achieve a better reliability-performance balance.

7.2 Portability of ReIPE to Other Dataflows and Accelerator Architectures

7.2.1 Dataflow.

This study mainly focuses on WS dataflow, which is widely applied in state-of-the-art CNN accelerators. Caused by high-energy neutrons and \(\alpha\)-particle strikes, the same problem (i.e., soft errors) also occurs in other dataflow styles. Different dataflow will cause varying idle PE opportunities and redundancy dimensions.

Under IS dataflow, the ifmap will be pre-loaded and fixed on the column of the systolic array at the granularity of a convolution window [48, 51]. The weight matrix will be streamed in from the left edge for reused among convolution windows. Similar to WS dataflow, each PE will generate a psum and propagate to the underneath neighbor for accumulation. Since the number of convolution windows is usually not divisible by the array size, as shown in Figure 17(a), when processing a specific layer, ReIPE can leverage the idle-column PEs to perform opportunistic redundancy at the granularity of the convolution window. For example, the third convolution layer of AlexNet trained on the ImageNet dataset requires 169 columns of PEs for all convolution windows. When deploying this layer on a 256 \(\times\) 256 systolic array, there will be 87 idle-column PEs that can be recycled for soft error detection.

Fig. 17.

As for OS dataflow, none of the parameters will be pre-loaded into the systolic array. During the computation process, the psum will be fixed on PEs, and the organized weight matrix and input matrix are streamed into accelerators from the top and left sides, respectively. The newly generated psum will overwrite the original one in each cycle. When the corresponding calculation is completed, the psum becomes the final output and is transferred to the input buffer for the further subsequent layer computation. Analogous to WS dataflow, the calculation of a specific filter will be assigned to a column of PEs. The same column-idle opportunities also exist in the OS-based systolic array. Besides, like IS dataflow, the inputs within a convolution window are arranged in a row. When processing a layer with few convolution windows, there are also row-idle PEs. Consequently, as portrayed in Figure 17(b), leveraging the column- and row-idle opportunities in IS architecture, we can duplicately stream some filter(input) vectors from the top(left) side of the accelerator for selective redundancy. For instance, for the third convolution layer of AlexNet mentioned before, 128 idle-column PEs and 87 idle-row PEs can be recycled for reliability enhancement.

7.2.2 Other Accelerator Architectures.

There are some other architectures proposed for DNN accelerators, such as Eyeriss [8] and ShiDianNao [12]: they both contain PE arrays and profit from data pipelining to reduce data access. Differing from the systolic array, Eyeriss employs horizontal filter reuse (i.e., PEs in the same row contain the same weights of a filter), diagonal ifmap reuse, and vertical psum accumulation; ShiDianNao broadcast weights to PEs and employs FIFOs for input reusing.

The opportunistic selective fault tolerance design philosophy of ReIPE is also applicable to these architectures. Taking Eyeriss as an example, it divides PEs into PE sets [8], and a PE set undertakes the computation of a filter. As described in the work of Chen et al. [8], the hardware size is fixed at 12 rows and 14 columns, and the size mismatch between filter number and accelerator size may also lead to PE set idling. As shown in Figure 18, we can assign a duplication of the vulnerable filter to the neighboring PE set (map the duplicated filter 2 to PE set 3) for real-time soft error detection.

Fig. 18.

7.3 Extensibility of ReIPE to Other Fault Models

7.3.1 Faults on Inputs.

As introduced in Section 2.2, ReIPE focuses on errors in the weights of convolutional and FC layers, which can be attributed to two reasons. First, the weights of a trained network are fixed, and the same network performs inference on multiple inputs. Thus, an error in the weights affects multiple inference processes, whereas an error in the inputs affects the inference process only once. Second, errors in weights have a wider “footprint” than errors in inputs: an error in weights can cause faults on all activation values in a specific channel on output feature map (ofmap), whereas an error in inputs may lead to only one fault in a channel and affect at most width channels.

Apart from errors on the vulnerable filters, we further discuss the detection and correction of input errors in this section. Due to the horizontal-input-reuse nature of the systolic array, once a soft error occurs, the vulnerable filter and its duplication will share the same faulty input, which prevents ReIPE from detecting errors in the inputs. For this circumstance, we provide a modified prototype for input error detection. The specific design and workflow are as follows.

As shown in Figure 19, we place a column of modified PEs as the checking PEs on the right side of the systolic array. In the data setup stage, we generate a checksum for each input vector and feed its inversion (\(P_i\)) into the psum register of check PEs (e.g., \(P_1\) = –(\(I_{11}\)+\(I_{12}\)+\(I_{13}\)+\(\ldots\))). During the weight loading stage, the weight of the check PEs will be set to 1. In the calculation and checking stage, the input value is multiplied by the 1-weight and then added to psum in the registers, which is equivalent to removing I from \(P_i\). Consequently, when the calculation is completed, if all psums in registers are zero, this means the inputs are error free. Otherwise, indicating a soft error occurs on inputs during the calculation process, ReIPE will restart the corresponding calculation to eliminate errors. Nearly all input errors can be detected by performing a simple checksum validation in each PE row, except the extreme case where the same bit flips twice during the computation, which is extremely rare. Like the original ReIPE, the input checking does not disrupt the dataflow of the systolic array, but introduces 10.01%, 15.73%, and 19.17% performance overhead for AlexNet, ResNet-50, and VGG-16, respectively.

Fig. 19.

7.3.2 Multi-Bit Upset.

Except for single-bit flips, soft errors may also manifest as multi-bit upset/flips. As described in the work of Ebrahimi et al. [13], double-bit upsets are the most common pattern of multi-bit upset.

As introduced in Section 2, each PE contains three registers, which store weight, input, and psum, respectively. Due to the relative independence of the preceding three registers, double-bit upsets typically manifest as two neighboring bits flip in the same parameter (i.e., weight, input, or psum). To explore the different error effects between single-bit flip mode and double-bit flip mode, Figure 20 compares the misclassification rates of VGG-16 under these two fault modes for illustration. We observe that the layer-level misclassification rate incurred by two-neighboring-bit flips shows a similar first-high-then-decrease trend with single-bit flips, but is always lower than that of single-bit flips. We suggest possible reasons for such a scenario as follows.

Fig. 20.

There are four manifestations of two-neighboring-bit flips: 00→11, 01→10, 10→01, 11→00. First, compared to single-bit flip, 01→10 and 10→01 have less impact on weights since the effects of two opposite-direction bit flips may counteract each other. Taking 01→10 as an example, the higher 0-bit flip increases the absolute value of the weights, whereas the lower 1-bit flip reduces this effect. Second, 11→00 will cause more value reduction than 1→0. However, the erroneous weights typically remain within the original weight distribution, so the error impact may not be severe. Third, for 00→11, two 0-bit flips incur more serve data corruption than 0→1. Since the 00 combinations are usually present in the fractions and lowest exponents, the weight shifts caused by error may also be acceptable.

Moreover, owing to the performance limitations of statistical FI, in this work we suggest a gradient analysis approach for fast error resilience evaluation, characterizing filter vulnerability by exploiting the filter-level gradient, which is fault mode independent.

7.3.3 Architectural Vulnerability Factor Analysis.

Although errors may also occur in duplicated filters, they can be detected and corrected by ReIPE as well. From the perspective of the Architectural Vulnerability Factor (AVF), we further analyze the system reliability benefits brought by ReIPE. The AVF is the probability that a soft error in the accelerator incurs the misclassification of CNN. According to the classical AVF analysis approach [33] and the latest AVF analysis approach in DNN accelerator architectures [44], the AVF can be calculated by the following equation:

\begin{equation} AVF=\frac{{\textstyle \sum _{}^{Total\ cycles}} ACE\ bits\ in\ the\ structure}{bits\ in\ the\ structure \times total\ execution\ cycles}, \end{equation}

(8)

where \(ACE\ bits\) are bits that may affect the final system output when in errors; \(bits\ in\ the\) \(structure \times total\ execution\ cycles\) represent the total number of bits during hardware execution [33]. The duplicated filters are introduced to protect the error-sensitive filters. Errors occurring in these filters will be detected and will not affect the correctness of DNN execution. Consequently, the corresponding bits in the extra round will not be counted as ACE bits. Compared to an unprotected system, ReIPE reduces the number of ACE bits, so according to the preceding equation, the AVF of the system will benefit from the duplication-based fault tolerance design.

8 Related Work

In this section, we discuss two categories of study related to our work: (1) the first category introduces the studies on the selective fault-tolerant design, and (2) the second category summarizes the studies that focus on checksum-based and algorithm-based fault tolerance approaches.

8.1 Selective Fault Tolerance Approaches

Recently, several studies have focused on performing selective protection to improve system reliability [5, 10, 35, 40]. With regard to hardware-based approaches, Choi et al. [10] propose an active fault tolerance scheme, which first analyzes in detail the neuron-level error sensitivity discrepancy and then deploys the vulnerable neurons to the protected PE to reduce the impact of errors. However, this approach needs a customized robust protected PE design, which may introduce extra hardware overhead. Schorn et al. [40] propose a fine-grained row-level switch approach to match vulnerable weight with robust PE for fault effects elimination. To ensure computational correctness, it modifies the structure of PEs so that inputs can switch across the rows. However, due to the random distribution of fragility weights, the dataflow needs to be rearranged for each computation round, which may introduce extra performance overhead. In contrast, ReIPE remaps the original and duplicated filters into adjacent PE columns, which incorporates the error detection process into the original computation flow of accelerators to perform real-time error detection. With regard to software-based approaches, Bolchini et al. [5] analyze the layer-level reliability discrepancy of CNN and propose a coarse-grained (layer) selective replication strategy. Oz and Arslan [35] summarize some redundant multi-threading approaches that leverage underutilized resources (low-ILP phase) to run redundant threads. However, as the execution characteristics of accelerators are not co-considered, the additional performance overhead is challenging to control in a quantifiable manner, which may limit their efficient deployment for dedicated accelerator scenarios. In comparison, in the systolic array, ReIPE recycles the idle column PE to preferentially protect the most vulnerable filters, which opportunistically improves system reliability while introducing negligible performance overhead.

8.2 Checksum-Based and Algorithm-Based Fault Tolerance Approaches

Some other studies tried to design low-cost checksum-based and algorithm-based fault tolerance approaches [17, 36, 54]. Checksum-based approaches can detect errors with lower overhead by validating the consistency between the outputs and the corresponding checksum. Har et al. [17] perform the DNN and checksum computation with the GPU and assigns the subsequent validation to the host CPU; the error detection needs to be initiated after the calculation of the whole CONV is finished, which may delay the error detection stage and results in more state rollback overhead. Ozen et al. [36] attempt to integrate the checksum generation and verification stage into the systolic array workflow. However, they typically cannot obtain the location information of errors, which limits the error traceability; thus, they usually need to introduce a coarse-grained error recovery stage (e.g., recalculating the whole round). Compared with the checksum-based error detection method, Zhao et al. [54] propose an efficient and fine-grained algorithm-based fault tolerance approach, which can perform efficient and fine-grained error correction processes based on error location information. However, for the structured accelerator, it is typically challenging to deploy such complex algorithm-based fault tolerance logic on-chip directly. To implement the algorithm-based fault tolerance mechanism, the checksum verification and error correction stage need to be carried out by the host CPU, and dedicated workflows are required to ensure efficient collaboration between the CPU and the accelerator.

9 Conclusion

In this article, we proposed an efficient soft error resilient technique named ReIPE, which recycles column idle PEs in the CNN accelerator for vulnerable filters soft error detection. For a post-training CNN model, the modified filter-level gradient analysis is used to perform filter-wise sensitivity estimation. Based on the error sensitivity of filters within each convolutional layer, by leveraging/recycling column-idle PEs, the most vulnerable filters are duplicated assigned onto systolic array to perform soft error detection. Moreover, for convolutional layers without idle opportunities, ReIPE triggers a redundant round for better reliability. Experimental results show that, on average, ReIPE is able to cover 96.40% error in the CNN accelerator while reducing 75.06% performance degradation and 67.79% energy consumption of baseline DMR.

References

[1]

Dimitris Agiakatsikas, George Papadimitriou, Vasileios Karakostas, Dimitris Gizopoulos, Mihalis Psarakis, Camille Bélanger-Champagne, and Ewart Blackmore. 2023. Impact of voltage scaling on soft errors susceptibility of multicore server CPUs. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 957–971.

Abstract

1 Introduction

2 Background

2.1 CNNs and Systolic Array Design

2.2 Fault Model

3 Observation and Motivation

3.1 Idle Columns in the Systolic Array

3.2 Inter- and Intra-Layer Soft Error Resilience Discrepancy

4 ReIPE: Recycling Idle PEs for Vulnerable Filters Sof Error Detection

4.1 Filter-wise Error Resilience Profile Estimation

4.2 ReIPE Design

4.2.1 Duplicate Filters Identification/Selection.

4.2.2 Weights Mapping and Loading.

4.2.3 Error Detection and Correction.

4.3 Costumed Selective Fault Tolerance Strategy

5 Experimental Methodology

6 Evaluation

6.1 Filter-Level Gradient Analysis Evaluation (\(RQ_1\))

6.2 Error Coverage Evaluation (\(RQ_2\))

6.3 Performance Evaluation (\(RQ_3\))

6.4 Energy Evaluation (\(RQ_4\))

6.5 Sensitivity Analysis (\(RQ_5\))

6.5.1 The Impact of Different Systolic Array Sizes.

6.5.2 The Impact of Different Protection Budgets.

7 Discussion

7.1 Adaptability of ReIPE to Pruned, Quantized, and ViT Based Models

7.1.1 Pruned Model.

7.1.2 Quantized Model.

7.1.3 Vision Transformer.

7.2 Portability of ReIPE to Other Dataflows and Accelerator Architectures

7.2.1 Dataflow.

7.2.2 Other Accelerator Architectures.

7.3 Extensibility of ReIPE to Other Fault Models

7.3.1 Faults on Inputs.

7.3.2 Multi-Bit Upset.

7.3.3 Architectural Vulnerability Factor Analysis.

8 Related Work

8.1 Selective Fault Tolerance Approaches

8.2 Checksum-Based and Algorithm-Based Fault Tolerance Approaches

9 Conclusion

References

Index Terms

Recommendations

An FPGA-based Fine Tuning Accelerator for a Sparse CNN

Implementation of a CNN accelerator on an Embedded SoC Platform using SDSoC

Instruction Driven Cross-layer CNN Accelerator for Fast Detection on FPGA

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

View options

PDF

eReader

Get Access

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations