4.2.3 Sensitivity Metric Comparison.
Our next set of experiments aims to understand how well different metrics identify the bits most sensitive to single-bit faults. We use four models: three ECON-T autoencoder models and a CIFAR-10 edge CNN, specifically hls4ml's submission to the MLPerf Tiny Inference Benchmark [10]. We compare the abilities of four metrics to rank the weights: Random, MSB to LSB, Hessian, and Gradient.
Random picks a bit at random. MSB to LSB selects the MSBs first, followed by the second MSBs, and so on down to the LSBs; within a bit position, bits are selected in the weight order provided by Keras after flattening a layer's weight matrix, e.g., the ECON-T Medium NN has the weight ordering shown in Figure 4. Hessian uses the Hessian-based sensitivity score computed in Equation (3). Gradient uses the parameter's gradient value from Equation (1) and sorts the bits from MSB to LSB in the same manner as Hessian (see Section 3.1).
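The four rankings can be sketched as follows. This is a minimal illustration, assuming per-weight sensitivity scores (the Hessian-based score of Equation (3) or the gradient magnitude of Equation (1)) are available as a NumPy array; the function and variable names are ours, not part of any released artifact.

```python
import numpy as np

def rank_bits(scores, bits_per_weight, method="hessian", rng=None):
    """Rank (weight, bit) pairs, most sensitive first; bit 0 is the MSB.

    `scores` holds one sensitivity score per weight, in Keras flattened
    order (e.g., the Hessian score of Eq. (3) or the |gradient| of Eq. (1)).
    Bit i of weight w maps to flat index w * bits_per_weight + i.
    """
    n_weights = len(scores)
    n_bits = n_weights * bits_per_weight
    if method == "random":
        rng = rng or np.random.default_rng()
        return rng.permutation(n_bits)
    if method == "msb_to_lsb":
        weight_order = np.arange(n_weights)          # Keras flattened order
    else:                                            # "hessian" or "gradient"
        weight_order = np.argsort(-np.abs(scores))   # largest score first
    # Sweep bit positions from MSB to LSB, ordering the weights within
    # each position; this mirrors how Hessian and Gradient order bits.
    ranking = [w * bits_per_weight + bit
               for bit in range(bits_per_weight)
               for w in weight_order]
    return np.array(ranking)
```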
We compare the ability of these metrics to predict the sensitivity accurately. Figure 6 shows the Spearman's rank correlation coefficient \(\rho\) for the four models. The results compare the four metrics and an Oracle (a perfect ranking derived from the brute-force single-fault experiments) at predicting the bit sensitivity. The \(\rho\) values are shown for all bits, followed by a breakdown by individual bit position, from most significant to least significant. As expected, the Random ranking has a near-zero correlation, and the Oracle has a perfect correlation across all models. The Hessian is consistently the best, especially for the more significant bits. The Gradient performs nearly as well as the Hessian but is clearly lower in most cases. The MSB to LSB ranking performs quite well and provides a relatively simple way to consider bit sensitivity, e.g., for an FI campaign.
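For reference, the correlation in Figure 6 can be computed with an off-the-shelf routine. The sketch below uses synthetic stand-in data; the array names and sizes are illustrative, not the paper's measurements.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Stand-ins: measured per-bit mean ΔEMD from the brute-force campaign
# (the Oracle's ground truth) and a metric's predicted sensitivity score.
oracle_delta_emd = rng.exponential(size=12_720)
metric_score = oracle_delta_emd + rng.normal(scale=0.5, size=12_720)

# Spearman's rho compares rank orders, so any monotone rescaling of the
# scores leaves it unchanged; rho = 1 matches the Oracle, rho ≈ 0 is Random.
rho, _ = spearmanr(metric_score, oracle_delta_emd)
print(f"Spearman rho = {rho:.3f}")
```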
Next, we consider the relative magnitude of the error and not just the relative ranking. Figure 7 shows the results of an experiment that plots the cumulative error when ranking the sensitivity of the weight bits. A larger error indicates that flipping that particular bit increases the error of the overall model more. The error measure depends on the model. The three ECON-T autoencoder NNs use EMD for error; recall that EMD is a measure of error where larger indicates worse autoencoder performance. The CIFAR-10 CNN uses the number of mispredictions for the error, where larger likewise indicates more error.
Consider first Figure 7(a), which plots the cumulative \(\Delta\)EMD versus the number of bits flipped for the four metrics and an Oracle on the ECON-T Medium Pareto NN. The Oracle is the optimal or best-case ranking calculated from the brute-force single-fault experiments, e.g., from Figure 4 for ECON-T Medium. The Oracle ranks the bits with the largest mean \(\Delta\)EMD first. Each bit's \(\Delta\)EMD is the difference between the faulty model's EMD and the EMD of a model with no faults; the cumulative \(\Delta\)EMD sums these differences over the bits flipped so far. The EMD for the non-faulty model is 1.100. The cumulative \(\Delta\)EMD for the Oracle quickly approaches the maximum cumulative \(\Delta\)EMD of 57.37. Only 63.5% (8,080/12,720) of the bits are sensitive, i.e., they have a non-zero \(\Delta\)EMD. The remaining 36.5% do not affect the autoencoder EMD.
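Each curve in Figure 7 reduces to a prefix sum over a ranking. A minimal sketch, assuming the per-bit \(\Delta\)EMD values from the brute-force campaign are in a NumPy array indexed as in the earlier sketch; the function names are ours.

```python
import numpy as np

def oracle_ranking(delta_emd):
    """Oracle: flip the bits with the largest mean ΔEMD first."""
    return np.argsort(-delta_emd)

def cumulative_delta_emd(ranking, delta_emd):
    """Cumulative ΔEMD as bits are flipped in `ranking` order.

    The curve saturates at delta_emd.sum() (57.37 for ECON-T Medium
    Pareto) once every sensitive bit has been flipped; insensitive
    bits (ΔEMD = 0) contribute nothing.
    """
    return np.cumsum(delta_emd[ranking])
```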
The Random metric is the worst of the metrics: chance alone gives every bit an equal probability of being picked, so the bits that contribute most to the EMD are found only at a uniform rate. MSB to LSB performs significantly better than Random, showing that bit order matters. The MSB has the lion's share of the cumulative \(\Delta\)EMD (40.91 of the 57.37), and the impact on \(\Delta\)EMD falls quickly with bit position; the least significant bits have little effect on the EMD. Hessian and Gradient both perform better still. Hessian orders the MSB weights better than Gradient, as indicated by the separation between the two lines. In particular, Hessian is more accurate for the first 2,120 bits (corresponding to the weights' MSBs). The subsequent bits are approximately equal between Hessian and Gradient; these bits contribute less to the overall EMD and are thus overall less sensitive.
It is interesting to compare the difference between the Random and Oracle curves on the three ECON-T NNs. The ECON-T Small NN (Figure 7(c)) has a much smaller spread because the model is smaller and all of its weights are sensitive. Conversely, the spread for the ECON-T Large NN (Figure 7(b)) is the largest of the three. The large model has a small percentage of sensitive weights, as indicated by the steep initial slope of the Oracle. In other words, most of the weights are insensitive to faults, which is not surprising given that the model has many more weight bits. The ECON-T Large NN has 61,344 weight bits compared to ECON-T Medium (12,720 weight bits) and ECON-T Small (10,240 weight bits).
CIFAR-10 is a different classification problem with a different error measure; thus, the results are not directly comparable to those of the three ECON-T NNs. Overall, CIFAR-10 is the largest model with 459,520 weight bits. The fairly steep initial slope of the Oracle indicates that most of the sensitivity resides in a small number of bits. However, there is a relatively long tail, more similar to the ECON-T Medium Pareto NN. The relatively large separation between Random and Oracle indicates that the bit sensitivity is not easy to predict. Hessian generally performs best at identifying the most sensitive bits.
In Figure 8, we compare the cumulative \(\Delta\)EMD and mispredicts with state-of-the-art work in FI: BinFI [18] and StatFI [61]. To recap Section 2, BinFI performs a binary search within each value in the NN to find the bit that is the "inflection point," wherein all of the bits more significant than it are considered sensitive to faults. BinFI calls this the silent data corruption (SDC) boundary. While BinFI applies this technique to the NN activations, we instead apply it to the NN weights according to our fault model. Since BinFI performs a binary search to find the SDC boundary, we first plot the actual bits BinFI flips during the binary search, which we call BinFI (Actual Bits Flipped). The SDC boundary implies that all of the bits more significant than it are also sensitive; we plot the actual bits flipped plus these implied sensitive bits as BinFI (Actual+Implied Bits Flipped). BinFI does not specify the order in which to search the NN values, so we flip them in the order they appear in the NN. StatFI introduces two FI methods for finding the sensitive NN weight bits: data-aware and data-unaware. StatFI statistically determines how many bits to flip per weight bit index (e.g., the MSB or second MSB) per layer. StatFI (Data-aware) statically measures the change in magnitude of each weight caused by flipping a bit to determine the sample size per weight bit index per layer; the larger the magnitude change, the higher the sample size, and vice versa. StatFI (Data-unaware) does not rely on changes in magnitude and selects the same sample size for every weight bit index in every layer. StatFI randomly samples based on the determined sample size, i.e., it does not specify the order in which to flip bits. We thus order them MSB to LSB, as StatFI provides a list of bits to flip per MSB, second MSB, and so on.
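The per-weight search BinFI performs can be sketched as follows. This is our illustration of the idea, not BinFI's released code; `flip_and_eval` is a hypothetical callback that flips one bit of the weight, runs inference, and reports whether the output is corrupted.

```python
def sdc_boundary(flip_and_eval, n_bits):
    """Binary-search one weight's SDC boundary, BinFI-style.

    Bit 0 is the MSB. BinFI assumes monotonicity: if flipping bit b
    corrupts the output, flipping any more significant bit does too.
    Returns (boundary, actual_flips): bits 0..boundary are the implied
    sensitive set ("Actual+Implied"), while `actual_flips` lists only
    the bits actually flipped during the search ("Actual").
    A boundary of -1 means no bit of this weight is sensitive.
    """
    lo, hi, boundary = 0, n_bits - 1, -1
    actual_flips = []
    while lo <= hi:
        mid = (lo + hi) // 2
        actual_flips.append(mid)
        if flip_and_eval(mid):   # corrupted: boundary is at mid or below
            boundary = mid
            lo = mid + 1
        else:                    # clean: boundary must be above mid
            hi = mid - 1
    return boundary, actual_flips
```

Each search costs only a logarithmic number of inferences per weight, which is why the actual flips reveal so little about relative sensitivity: most of them exist only to implicate other bits.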
Let us first consider Figure 8(a) and look at how BinFI compares with the Hessian and the Oracle on the ECON-T Medium Pareto NN. We find that BinFI (Actual Bits Flipped) is not very impressive, as expected: BinFI uses each of these bit flips to implicate other bits. As such, BinFI (Actual+Implied Bits Flipped) performs more impressively in the beginning; however, it begins to falter after a few hundred bit flips. This is expected because BinFI does not specify an order in which to search the weights, whereas the Hessian does. By flipping only around half of all of the NN weight bits, though, BinFI (Actual+Implied Bits Flipped) identifies most of the sensitive bits, as indicated by the dashed vertical line that drops down after \(\sim\)6,000 bits flipped. It fails to identify 676 of the 8,080 sensitive bits (\(\sim\)5% of the model's 12,720 weight bits). Comparing BinFI (Actual+Implied Bits Flipped) to our work, the Hessian finds more sensitive bits much sooner in its search. In fact, for the first \(\sim\)5,500 bit flips, the Hessian's bit flips are much more informative, as it finds the highly sensitive bits early on. Thus, from Figure 8(a) as well as Figure 8(b), we observe a tradeoff for the ECON-T Medium and Large Pareto models: the Hessian finds the most sensitive bits earlier in its search, whereas BinFI (Actual+Implied Bits Flipped) covers a majority of the sensitive bits sooner.
Figure 8(c) and Figure 8(d) show more complex trends. In both the ECON-T Small Pareto (Figure 8(c)) and CIFAR-10 (Figure 8(d)) models, we see that BinFI (Actual+Implied Bits Flipped) rises slowly, as we have seen previously, before exceeding the Oracle. This happens because BinFI (Actual+Implied Bits Flipped) implies multiple bits are sensitive per bit flip, whereas the Oracle only implicates one bit based on how we plot it. Clearly, the Oracle knows all of the sensitive bits prior to FI (and could reach the maximum cumulative \(\Delta\)EMD/mispredicts with 0 bit flips); however, we plot it this way because we are interested in the Oracle as the ceiling on what the Hessian could achieve. As such, BinFI (Actual+Implied Bits Flipped) exceeds the best the Hessian could do for these two models. This is likely because most of the bits in the ECON-T Small Pareto and CIFAR-10 models are sensitive: 100% of the ECON-T Small Pareto bits and 82.72% of the CIFAR-10 bits. These NNs are thus easier tasks for BinFI, since each bit flip is highly likely to find a sensitive bit. We are not finding a needle in a haystack the way we are for ECON-T Large, where only 6.5% of the bits are sensitive; the Hessian is clearly better in that case (see Figure 8(b)).
However, there is a major caveat with BinFI: the only information we receive on how sensitive the model bits are comes from BinFI (Actual Bits Flipped). As seen in all four charts in Figure 8, BinFI (Actual Bits Flipped) reveals very little information and has the lowest cumulative \(\Delta\)EMD/mispredicts of all of the methods. We have no way of determining the cumulative \(\Delta\)EMD/mispredicts unless we actually flip all of the bits plotted by BinFI (Actual+Implied Bits Flipped). As a result, it is difficult to determine which bits are a higher priority to protect, a tradeoff worth considering in the resource-constrained environments of edge NNs. Overall, BinFI performs well when the lion's share of a model's bits are sensitive and poorly when most of a model's bits are insensitive.
StatFI performs the worst in all cases in Figure 8. Its statistical sampling is ineffective at selecting the sensitive bits in an NN. In particular, StatFI (Data-aware) depends on large changes in magnitude from flipping a weight bit to determine the sample size per weight bit index per layer. These large changes are more likely to occur in float32 and less likely to occur when we represent our weights with \(\le\)8-bit fixed-point precision. We therefore observe a large gap between the StatFI (Data-aware) line and the Hessian line in the ECON-T Medium Pareto (see Figure 8(a)), ECON-T Large Pareto (see Figure 8(b)), and CIFAR-10 (see Figure 8(d)) charts. We would expect either StatFI method to work well for ECON-T Large Pareto, where few of the bits are sensitive, because StatFI was designed to find the few sensitive bits with fewer FIs; however, StatFI randomly selects the bits to sample per weight bit index per layer, which is ineffective. For the ECON-T Small Pareto model (see Figure 8(c)), where 100% of the bits are sensitive, StatFI (Data-aware) fails to identify 37% of the sensitive bits, whereas StatFI (Data-unaware) fails to identify 5% of the bits. StatFI (Data-unaware) samples more bits; in every case its pink line drops down after flipping more bits than StatFI (Data-aware)'s brown line. Nevertheless, it fails to find many of the bits, especially for the ECON-T Large Pareto (see Figure 8(b)), missing 35% of the sensitive bits, and for CIFAR-10 (see Figure 8(d)), missing 76% of the sensitive bits. Neither StatFI method samples cleverly enough. Since the Hessian captures how sensitive the NN weights are, it easily outperforms both StatFI (Data-aware) and StatFI (Data-unaware).
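To make the sampling behavior concrete, the following sketch shows a data-unaware, StatFI-style sampler under our reading of the method; the function name and signature are ours, and a real StatFI campaign would derive `sample_size` from its statistical model rather than take it as a constant.

```python
import numpy as np

def statfi_unaware_sites(n_weights_per_layer, bits_per_weight,
                         sample_size, seed=0):
    """Draw (layer, weight, bit) fault sites, data-unaware StatFI style.

    For every (layer, bit index) pair, the same number of weights is
    sampled uniformly at random and that bit is flipped. Because the
    choice within a bit index is random, nothing steers the campaign
    toward the sensitive weights, which is why it ranks bits poorly.
    """
    rng = np.random.default_rng(seed)
    sites = []
    for layer, n_weights in enumerate(n_weights_per_layer):
        for bit in range(bits_per_weight):           # bit 0 = MSB
            k = min(sample_size, n_weights)
            for w in rng.choice(n_weights, size=k, replace=False):
                sites.append((layer, int(w), bit))
    return sites
```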
Figure 9 provides a different analysis related to the ability of the sensitivity metrics to find the top-k percentile of the sensitive bits. The Oracle provides a perfect ranking with respect to bit sensitivity. In other words, the Oracle perfectly selects the top-k percentile of the bits and provides a lower bound (best-case result) for a sensitivity metric. The Oracle will not reach 1.0 on the y-axis when a subset of the bits is insensitive to single-bit faults. For example, only \(63.5\%\) of the bits are sensitive for the ECON-T Medium Pareto NN, only \(6.55\%\) of the bits are sensitive for the ECON-T Large Pareto NN, \(100\%\) of the bits are sensitive for the ECON-T Small Pareto NN, and \(82.72\%\) of the bits are sensitive for the CIFAR-10 NN. Random provides the other extreme: the top-k bits are randomly distributed through its ranking, and thus the entire ranking must be enumerated to find the top-k bits for all but the smallest values of k.
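Each point on a curve in Figures 9 and 10 can be computed as below. A minimal sketch under our assumptions about the data layout; `ranking` is a method's ordered list of bits to flip (possibly a subset of all bits, as for BinFI and StatFI), and the names are illustrative.

```python
import numpy as np

def bits_to_find_top_k(ranking, delta_emd, k_pct):
    """Fraction of all bits a method must flip to find every top-k bit.

    Returns NaN when `ranking` omits any top-k percentile bit: the
    method has failed, since no prefix of its ranking contains them
    all (the curves that fall to 0 in Figure 10).
    """
    n = len(delta_emd)
    k = int(np.ceil(n * k_pct / 100))
    top_k = set(np.argsort(-delta_emd)[:k])     # ground-truth top-k bits
    found = 0
    for i, bit in enumerate(ranking):
        if bit in top_k:
            found += 1
            if found == k:
                return (i + 1) / n              # fraction of bits flipped
    return float("nan")                         # a top-k bit was missed
```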
For the ECON-T NNs, MSB to LSB performs the worst of the remaining metrics. Overall, Hessian is better than Gradient for the ECON-T Large and Small Pareto NNs, while Gradient is generally lower (better) than Hessian for the ECON-T Medium Pareto model.
The CIFAR-10 classification task is interesting in that MSB to LSB performs quite well overall, whereas it is the worst sensitivity metric for the ECON-T NNs. The CIFAR-10 NN is a two-stack ResNet model with five convolutional layers [10]. This is fundamentally different from the ECON-T models, which have only two layers, especially ECON-T Small, which consists only of dense layers.
We then compare the top-k percentile performance of the Hessian and the Oracle with BinFI [18] and StatFI [61] in Figure 10. We first compare with BinFI, focusing on the ECON-T Medium model in Figure 10(a). To find the top-1st percentile of the sensitive bits, the Hessian only needs to flip \(\sim\)18% of the bits, whereas BinFI (Actual+Implied Bits Flipped) must flip \(\sim\)41% of the bits, taking longer to find the most sensitive bits. BinFI then drops to 0 after the top-15th percentile because it produces false negatives, i.e., it does not find all of the sensitive bits. The top-k percentile requirement is stringent: if a method fails to identify even a single bit in the top-k percentile, we say the method has failed, because no number of bits flipped according to that method will find all of the top-k sensitive bits. We can thus infer from Figure 10(a) that BinFI (Actual+Implied Bits Flipped) fails to identify bits beyond the top-15th percentile, e.g., it misses bits that are quite sensitive (such as those in the 20th percentile). The Hessian provides both weight-level and bit-level sensitivity information to rank weight bits, whereas BinFI only provides bit-level sensitivity information per weight without any way of ranking the weights. Therefore, the Hessian is significantly more efficient than BinFI at finding the most sensitive bits because it captures more sensitivity information. For the ECON-T Large Pareto (see Figure 10(b)), the Hessian outperforms BinFI (Actual+Implied Bits Flipped) up to the top-10th percentile; BinFI (Actual+Implied Bits Flipped) then exceeds the Hessian until the top-20th percentile, when it falls to 0, once again due to its failure to find sensitive bits. For ECON-T Small Pareto, the Hessian is the closest to the Oracle up until the top-15th percentile, after which BinFI (Actual+Implied Bits Flipped) is the better method at finding the top sensitive bits, eventually exceeding the Oracle past the top-50th percentile. As discussed previously, this is because BinFI (Actual+Implied Bits Flipped) implicates multiple bits as sensitive per bit flip, whereas the Oracle only implicates one bit per bit flip. Since all bits in ECON-T Small Pareto are sensitive, BinFI (Actual+Implied Bits Flipped) always succeeds in finding a sensitive bit. Moreover, it only needs to flip about half of the bits to identify all of the sensitive bits. This is due to the easy nature of this task, i.e., when most if not all of the bits are sensitive. We see a similar pattern for the CIFAR-10 model, where BinFI (Actual+Implied Bits Flipped) briefly outperforms the Oracle around the top-60th percentile before it falls to 0 due to its failure to identify all of the sensitive bits beyond the 60th percentile. Note that BinFI (Actual Bits Flipped) always lies at 0 for all four NNs because it primarily flips bits to implicate other bits in the model and is thus not good at flipping the most sensitive bits first.
We then compare with StatFI (Data-aware) and StatFI (Data-unaware). For all four models, both StatFI methods fail to identify any top-k percentile of sensitive bits, so all StatFI lines lie at 0, except for StatFI (Data-unaware) on the ECON-T Small Pareto model (see Figure 10(c)), which can find the top-1st percentile by flipping 80% of the bits before immediately falling to 0. Beyond this instance, no number of bits flipped according to StatFI will find all of any top-k percentile of the sensitive bits.
Next, we summarize the relationship between model size and the sensitivity of its weights. Figure 11(a) plots the number of sensitive bits versus the total number of bits for the three ECON-T NNs. All of the bits in the ECON-T Small Pareto model are sensitive. As the model size increases, the number of sensitive bits decreases; the ECON-T Large Pareto model has only \(6.55\%\) of its bits sensitive to single-bit faults. Figure 11(b) shows the same three ECON-T models with respect to the EMD (error) of the non-faulty model. The ECON-T Large Pareto model has the smallest EMD (0.807), which is expected given that it is the most complex. Reducing the model size increases the EMD (i.e., decreases its performance).
Figure 12 breaks out the number of sensitive bits according to their relative bit position in the weight, from the MSB to the LSB. All models are quantized to a fixed-point representation in which the MSB is a sign bit, followed by 1 to 3 integer bits, with the remaining bits fractional. Note that each model has a different quantization: ECON-T Small Pareto has 8-bit weights, ECON-T Medium Pareto has 6-bit weights, and ECON-T Large Pareto has both 5-bit and 7-bit weights. In the ECON-T Small Pareto NN, all of the bits are sensitive; thus, the sensitive bits are equally distributed across all bit indices. A total of 63.5% of the bits are sensitive in the ECON-T Medium Pareto NN. The sensitive bits are relatively equally distributed across the bit indices, although more reside in the MSB and MSB-1 bit indices. In the ECON-T Large Pareto NN, only a tiny fraction (6.5%) of the bits are sensitive, and they are clustered in the first three MSBs out of (at most) 7 bits. Figure 12(b) shows that when only some of a model's bits are sensitive, the majority of them reside in the MSBs. Moreover, as model size increases, the percentage of sensitive bits decreases (as shown in Figure 11(a)).
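The breakdown in Figure 12 amounts to a histogram over bit positions. A minimal sketch under the same hypothetical layout as the earlier snippets (flat index = weight * bits_per_weight + bit, with bit 0 the MSB); it assumes a uniform bit width, so for ECON-T Large Pareto the 5-bit and 7-bit weight groups would be handled separately.

```python
import numpy as np

def sensitive_bits_by_position(delta_emd, bits_per_weight):
    """Count sensitive bits (ΔEMD > 0) at each bit position, MSB first.

    Reshaping to (n_weights, bits_per_weight) puts one bit position per
    column; summing the boolean mask over weights yields the per-index
    counts plotted in Figure 12.
    """
    mask = np.asarray(delta_emd).reshape(-1, bits_per_weight) > 0
    return mask.sum(axis=0)        # one count per bit index, MSB first
```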