5.1.1 New Method from SE.
5.1.1.1 Output-Based Methods. In 2018, Kim et al. [54] proposed the surprise adequacy (SA) criteria for testing DNN systems, called SADL, which is a very early work that introduced test selection for DNNs. SADL first extracts the intermediate outputs of the test data and training data from the DNN as features. Then, it measures the SA according to the dissimilarity between these two sets of features. Two measurements were proposed in the original work, likelihood-based surprise adequacy (LSA) and distance-based surprise adequacy (DSA). LSA uses kernel density estimation to calculate the distance, while DSA uses the Euclidean distance directly. Even though the authors did not state that SADL can be used to detect the faults of DNNs, they evaluated the relation between the accuracy and the SA scores of selected test inputs. This evaluation indicates that SADL can reveal the faults of DNNs. In their extended version [55], the authors proposed a variant of SA, Mahalanobis Distance based Surprise Adequacy (MDSA), and evaluated the SA criteria on complex models trained on large datasets.
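As an illustration, the DSA of a single test input can be sketched as follows; the activation-trace extraction step is omitted, and all names (`dsa`, `train_traces`, etc.) are illustrative rather than taken from SADL's implementation:

```python
import numpy as np

def dsa(test_trace, train_traces, train_labels, pred_label):
    """Distance-based surprise adequacy of a single test input (sketch)."""
    same = train_traces[train_labels == pred_label]
    other = train_traces[train_labels != pred_label]
    # Nearest same-class training trace to the test trace.
    d_same = np.linalg.norm(same - test_trace, axis=1)
    nearest = same[np.argmin(d_same)]
    dist_a = d_same.min()
    # Distance from that neighbor to the closest other-class trace.
    dist_b = np.linalg.norm(other - nearest, axis=1).min()
    return dist_a / dist_b  # higher => more surprising input
```

An input deep inside its predicted class's training distribution yields a low ratio, while an input near the boundary between classes yields a high one.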
Another early work, DeepGini [26], was proposed by Feng et al. DeepGini is an output-probability-based test prioritization technique. After obtaining the output of a test input, the Gini score is calculated by Equation (1) in Table 5. An input is more likely to be a fault if it has a higher Gini score. The authors demonstrated that DeepGini has a powerful ability to reveal DNN faults.
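Concretely, the Gini score of a softmax output \(p\) is \(1 - \sum_i p_i^2\); a minimal sketch (the function name is ours, not from the DeepGini code):

```python
import numpy as np

def deepgini(probs):
    """Gini score of a softmax output; higher => more likely a fault."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - float(np.sum(probs ** 2))
```

A uniform output (maximally uncertain) gives the highest score, while a confident one-hot-like output gives a score near zero.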
Wang et al. [106] proposed to leverage neuron coverage to rank test inputs and find faults. Concretely, given an unlabeled data pool, the method iteratively selects a subset that has the same coverage score as the whole unlabeled set and removes the selected subset from the pool. Finally, the data remaining in the pool are the ones whose activation distributions differ from the whole set, and these are regarded as faults. Similar to [106], Yan et al. used the activation patterns of classes to determine whether the prediction for an input is reliable [116]. Concretely, given all the training data of each class, the authors computed the lower and upper bounds of the activation values of each neuron as the pattern of that class. Then, given the test data and their predicted labels, the authors compared their activation information with the activation pattern of the corresponding class. The priority score is calculated by the number of neurons whose activation values exceed a threshold plus the number of neurons whose activation values fall below a threshold, divided by the total number of neurons. A greater priority score indicates that the prediction of this input is more reliable.
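One plausible reading of this score, sketched below under the assumption that activations staying within the class pattern count toward reliability, is the fraction of neurons whose activations fall inside the class's [lower, upper] bounds (the exact thresholding in [116] may differ):

```python
import numpy as np

def pattern_priority(activations, lower, upper):
    """Fraction of neurons whose activation lies inside the class pattern
    [lower, upper]; a higher score suggests a more reliable prediction."""
    activations = np.asarray(activations, dtype=float)
    inside = (activations >= lower) & (activations <= upper)
    return float(inside.mean())
```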
Guerriero et al. introduced DeepEST [35], an adaptive test selection method for fault detection. It randomly selects one input sample first and then uses two selection methods, simple random sampling (SRS) and weight-based sampling (WBS), to add more test inputs to the selected pool. Thus, the key component of DeepEST is the WBS method. Roughly speaking, WBS assigns a weight to each input based on its distance to the labeled data divided by the total distance between all unlabeled data and the labeled data. Here, the distance is defined by auxiliary variables, namely the prediction confidence, a distance based on SA, and the combination of confidence and SA distance. Another adaptive test selection method, ATS [28], was proposed by Gao et al. to select diverse faults from an unlabeled dataset. In this work, the output vectors are used to represent the model behaviors and measure the diversity of test inputs. ATS first projects the output vectors (the top-3 maximum values are considered in ATS) onto a plane and then calculates the coverage of each data point on this plane. After that, the difference between the coverage of a single data point and that of the whole candidate set is utilized to distinguish faults from correctly predicted test data. Compared to previous works, ATS can select more diverse faults across different classes.
Inspired by metamorphic testing in SE, Xie et al. proposed a diversity-guided method to detect faults [115]. The key idea is to rank metamorphic test case pairs (MPs) based on their ability to reveal violations of DNNs. Specifically, the method chooses an internal layer L in the DNN model and splits the model into two parts: (1) the first part consists of the layers from the input layer to L, and (2) the second part consists of the layers from L to the output layer. Then, MPs are prioritized based on the diversity of the outputs of the first part. Here, the authors tried different distribution discrepancy measures to compute the output diversity, for example, Kullback-Leibler (KL) divergence. Finally, the sorted MPs are fed to the second part of the model to check the output consistency for fault detection.
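The KL-divergence-based diversity measure mentioned above can be sketched as follows (a generic smoothed KL over the first part's outputs for an MP pair, not the authors' exact implementation):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence between two discrete distributions, with smoothing
    to avoid division by zero on sparse outputs."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

An MP pair whose two intermediate outputs diverge strongly receives a high score and is prioritized for consistency checking.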
Ma et al. [77] proposed to use the prediction difference between the DNN model and its subspecialized models to select fault data. Specifically, a series of models (called subspecialized models) are trained, each to classify one single class. Here, each subspecialized model follows the same architecture as the original model. Then, inputs are selected if they exhibit high output discrimination between the original and subspecialized models. Similar to [77] and inspired by mutation testing in SE, Wei et al. proposed Efficient Mutation Analysis for Prioritization (EffiMAP) [108] for fault detection. EffiMAP has three components: Generator, Tracer, and Estimator. The Generator aims at generating fault-revealing model and input mutants. Specifically, it selects model mutants with higher killing scores and uses an autoencoder to generate input mutants. This autoencoder is iteratively updated with the generated diverse inputs. Besides, the Tracer collects trace information for the given inputs. Three trace features are considered by EffiMAP: the feature map, the proportion of activated neurons, and the entropy of outputs. Finally, EffiMAP uses XGBoost to learn from the above two types of features for fault prediction.
To further enhance uncertainty-based methods, Li et al. proposed distance-based dynamic random testing with prioritization (D-DRT-P) [67]. D-DRT-P first extracts features of inputs and clusters them into different subdomains. Then, D-DRT-P borrows uncertainty metrics to assign a prioritization score to each data point in each subdomain. Finally, the input with the highest prioritization score is selected. At the same time, if the selected input is a fault, the selection probability of its subdomain is increased. In this way, the importance of each subdomain changes dynamically, and there is a greater chance of detecting faults. Bao et al. introduced a nearest neighbor smoothing (NNS) based test case selection method [7] that calculates the uncertainty score over an input and its neighbors. Specifically, NNS first extracts representations of inputs using the outputs of the inner layers of the model. Then, the cosine distance between each pair of inputs is computed using these representations. Inputs that are close to each other are put in the same group, and their predictions are used for output probability smoothing via the label smoothing technique [100]. Finally, NNS uses the smoothed outputs to compute the uncertainty score for prioritization.
Tao et al. proposed TPFL [101], a test prioritization method based on fault localization. Firstly, TPFL uses all the training data to collect the activation information of each neuron. Different from the classical way of setting the activation threshold, TPFL sets the threshold of each neuron to the sum of the average and the standard deviation of its outputs. Then, TPFL marks neurons as suspicious if they are activated more frequently by test inputs on which the DNN makes incorrect decisions and less frequently by test inputs on which the DNN makes correct decisions. Finally, if new unlabeled data activate more suspicious neurons, they are more likely to be faults.
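TPFL's thresholding rule and the resulting neuron suspiciousness can be sketched as follows; the frequency-gap proxy in `suspiciousness` is our simplification of the idea, not TPFL's exact formula:

```python
import numpy as np

def activation_thresholds(train_acts):
    """TPFL's per-neuron threshold: mean plus standard deviation
    of the neuron's outputs over the training data."""
    train_acts = np.asarray(train_acts, dtype=float)
    return train_acts.mean(axis=0) + train_acts.std(axis=0)

def suspiciousness(acts_incorrect, acts_correct, thresholds):
    """Activation-frequency gap between misclassified and correctly
    classified inputs (a simple proxy for TPFL's suspiciousness)."""
    freq_inc = (np.asarray(acts_incorrect) > thresholds).mean(axis=0)
    freq_cor = (np.asarray(acts_correct) > thresholds).mean(axis=0)
    return freq_inc - freq_cor
```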
Al-Qadasi et al. proposed DeepAbstraction [4], which leverages runtime monitors to detect faults. Specifically, DeepAbstraction first extracts the features of the training data to build abstraction boxes, taking the vectors extracted from the penultimate layer together with the predicted classes as the features. In the abstraction boxes, the inputs are divided into two groups, correct inputs and incorrect inputs. Then, given new unlabeled data, DeepAbstraction checks whether their features belong to either of these two groups. If not, DeepAbstraction assigns these data to a third group, the uncertain group. Finally, using uncertainty metrics, DeepAbstraction prioritizes the data from each group in the order incorrect, uncertain, correct.
Different from prior works that primarily focused on computer vision tasks and feed-forward neural networks (FNNs), Liu et al. proposed DeepState [71], the first method to specifically target the fault detection problem of RNNs. DeepState extracts the predictions from the internal output states and builds a label sequence for each input. Then, DeepState defines a changing rate (CR) to measure the uncertainty of inputs based on the label changes in the collected sequence. A higher CR indicates a more uncertain input. DeepState uses the changing trend (CT) metric to help remove redundant inputs with the same CR. CT measures how different two label sequences are. In this way, the selected inputs are both uncertain and diverse.
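Assuming CR simply counts label changes along the collected sequence, a minimal sketch is:

```python
def changing_rate(label_seq):
    """Fraction of adjacent positions in an RNN's internal label sequence
    where the predicted label changes; higher => more uncertain input."""
    changes = sum(a != b for a, b in zip(label_seq, label_seq[1:]))
    return changes / (len(label_seq) - 1)
```

A stable sequence such as `[1, 1, 1, 1]` yields 0, while a sequence that flips at every step yields 1.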
Aghababaeyan et al. proposed a black-box method for fault detection [1]. Specifically, it first used a VGG16 model to extract the features of mispredicted inputs from the training and test sets. Then, it used the Uniform Manifold Approximation and Projection (UMAP) method to reduce the dimension of the collected features. After that, hierarchical density-based spatial clustering of applications with noise (HDBSCAN) was used to cluster the faults into different groups. Finally, the authors found that, given new inputs, retraining the model with inputs from one cluster can help fix the model with respect to the fault represented by that cluster.
To tackle more practical problems, Deng et al. proposed Scenario-based Test Reduction and Prioritization (STRaP) [24] to efficiently detect faults in self-driving systems. Firstly, given a driving recording, STRaP converts the messages in each time frame into vectors based on a driving scene schema. Here, the authors defined multiple driving scene schemas, for example, Is there any pedestrian crossing the road? (1 for yes and 0 for no). Then, STRaP slices the vectors produced in the last step into segments based on their similarity. Finally, it prioritizes the vectors based on their coverage and the rarity of the driving scene features.
Lastly, Zheng et al. proposed CertPri [122], a certifiable method for fault detection. The intuition behind CertPri is that there is a significant difference in the movement cost of correctly and incorrectly predicted test inputs, where the movement cost means the cost of moving an input to its class center. Interestingly, the movement cost of correctly predicted inputs is significantly higher than that of incorrectly predicted inputs. Thus, this cost can be used to distinguish the faults.
5.1.1.2 Learning-Based Methods. Wang et al. proposed Dissector [104] to detect unexpected conditions produced by faults. The intuition behind Dissector is that a DNN should have increasing confidence in a normal input (a correctly predicted input) from the input layer to the output layer. Based on this intuition, Dissector slices the DNN model into multiple sub-models and adds an output layer to each sub-model. Then, Dissector collects the output (a so-called snapshot) from each sub-model to form a sequence of confidence values for the original DNN model. Finally, the authors defined an SVscore to measure snapshot validity and computed the final profile validity score of an input by weighting all of its SVscores. A higher profile validity score indicates that the input is more likely a normal one.
Based on mutation testing techniques, Wang et al. proposed PRioritizing test inputs via Intelligent Mutation Analysis (PRIMA) [107] for fault detection. The intuition behind PRIMA is that faults lie near the decision boundary and are, thus, more sensitive to perturbation. Based on this intuition, the authors designed two types of mutation rules, model mutation and input mutation. Then, given an input and these mutation rules, PRIMA collects multiple features based on the killing information and leverages a learning-to-rank strategy to train a ranking model for fault prediction. Here, the well-known XGBoost ranking algorithm is used to build the ranking model.
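The killing information at the core of such mutation analysis can be illustrated by a basic killing score (PRIMA's actual feature set is considerably richer):

```python
def killing_score(original_pred, mutant_preds):
    """Fraction of mutants whose prediction disagrees with the original
    model; a higher score suggests the input lies near the decision
    boundary and is thus more likely a fault."""
    return sum(m != original_pred for m in mutant_preds) / len(mutant_preds)
```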
He et al. proposed Parallel Signal Routing Paths (PSRP) [39] to identify misclassified samples. PSRP contains three components: feature space compression, extraction of PSRPs, and misclassified sample detection. In the first step, PSRP compresses the input by computing the mean value of the two-dimensional tensor produced by each convolution kernel. Then, PSRP trains an SVM model to solve a binary classification task for fault detection. The training data are the traces of the compressed data from the first CNN layer to the last one.
To tackle other types of data, Stocco et al. proposed SelfOracle [97], an online misbehavior prediction method for self-driving cars. SelfOracle first uses an autoencoder to reconstruct the training inputs of self-driving cars. Then, it computes the mean pixel-wise squared error as the reconstruction error. After that, SelfOracle uses maximum likelihood estimation to fit the sum of squared pixel-wise errors and estimates the parameters of a Gamma distribution. Finally, by setting a threshold, the Gamma distribution is used to predict the probability of misbehavior on unseen inputs. As stated by the authors, SelfOracle can be used for online misbehavior prediction and time-aware anomaly score prediction. Dang et al. proposed GNN-oriented Test Prioritization (GraphPrior) [18], a fault detection method specifically designed for graph data. GraphPrior defines 10 rules for mutating GNN models, for example, adding self-loops to the nodes. After the model mutation, inputs are prioritized based on their killing scores on the mutants, which means an input is more likely a fault if the mutants have higher disagreement on it.
5.1.2 New Method from ML.
5.1.2.1 Output-Based Methods. In 2017, Hendrycks et al. introduced the well-known baseline for fault detection, maximum softmax probability based detection [41]. This work showed that it is promising to detect out-of-distribution data and misclassified faults by simply using the maximum softmax probability of the output.
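This baseline reduces to one line; the score below is \(1 - \max_i p_i\), so higher values flag more suspicious inputs:

```python
import numpy as np

def msp_fault_score(probs):
    """Maximum-softmax-probability baseline: 1 - max(p);
    higher score => the input is more likely a fault."""
    return 1.0 - float(np.max(probs))
```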
To enhance uncertainty-based methods, Malinin et al. proposed Dirichlet Prior Networks (DPNs) [79] to model the predictive uncertainty of DNNs. Simply speaking, a DPN directly parameterizes a Dirichlet distribution as a prior for predicting the classification distribution on the probability simplex. The most important part is the loss function of DPNs, which is defined as the KL divergence between the model and a sharp Dirichlet distribution on the appropriate class for in-distribution data, plus the KL divergence between the model and a flat Dirichlet distribution for out-of-distribution data. Lastly, existing uncertainty measurements, maximum probability and entropy, are used to detect faults based on the output of DPNs. Importantly, this work distinguished the different types of uncertainty, model uncertainty, data uncertainty, and distributional uncertainty, which previous works had conflated.
Similar to mutation-based methods from SE [107], Chen et al. proposed a framework that uses an ensemble of models to identify faults [10]. Specifically, the disagreement between the original model and the majority vote of the ensemble is used as the indicator for fault detection; that is, if the ensemble disagrees with the original model, the input is more likely to be a fault. Most importantly, to improve the performance of fault detection, the authors built the ensemble iteratively, adding the selected faults at each iteration to the training data and then performing self-training on the ensemble.
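The majority-vote disagreement check can be sketched as follows (the iterative self-training loop is omitted):

```python
from collections import Counter

def ensemble_flags_fault(original_pred, ensemble_preds):
    """Flag the input as a likely fault when the ensemble's majority
    vote disagrees with the original model's prediction."""
    majority = Counter(ensemble_preds).most_common(1)[0][0]
    return majority != original_pred
```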
Lust et al. proposed a Gradient's Norm (GraN) based method [73] to detect faults. GraN contains three steps: input pre-processing, feature extraction, and feature processing. In the first step, given an input image, GraN utilizes image smoothing techniques to generate a new input. Then, GraN feeds the smoothed image, together with the predicted label of the original image, into the DNN model to calculate the gradients via backpropagation. In the last step, a logistic regression network is used to learn the connection between the gradient features extracted in the previous step and the correctness of the inputs. The output of the regression model indicates the likelihood that an input is correctly predicted.
Instead of directly using the prediction score to detect faults, Jiang et al. proposed the trust score [51] to identify the correctness of unlabeled test data. To do so, the authors first removed low-density data from the training set for each class. Here, any data representation and distance method can be used for this process. After that, the trust score was defined to measure the reliability of inputs. Given an input, the trust score is calculated as the ratio between the distance from this input to the nearest class different from the predicted class and the distance to its predicted class. A lower trust score indicates that this input is more likely a fault.
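Assuming a plain nearest-point distance per class (the density filtering step of [51] is omitted for brevity), the trust score can be sketched as:

```python
import numpy as np

def trust_score(x, class_points, pred_class):
    """Ratio of the distance to the nearest other class over the distance
    to the predicted class; lower => more likely a fault."""
    dists = {c: np.linalg.norm(np.asarray(pts) - x, axis=1).min()
             for c, pts in class_points.items()}
    d_other = min(d for c, d in dists.items() if c != pred_class)
    return d_other / dists[pred_class]
```

An input much closer to its predicted class than to any other class gets a score well above 1; a score below 1 means some other class is actually nearer.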
5.1.2.2 Learning-Based Methods. Aigrain et al. proposed Introspection-Net [3], which uses a 3-layer regression NN to predict the correctness of inputs. Introspection-Net is a binary classification model that takes the output logits of the original DNN as inputs and produces a confidence value indicating whether the classification is correct (output value of 1) or not (output value of 0). The experiments showed that Introspection-Net, accompanied by adversarial training and data augmentation, has fault detection ability competitive with the trust score approach [51] and the softmax baseline [41].
Li et al. proposed TestRank [66] to rank test inputs based on their likelihood of being a failure. Specifically, TestRank extracts two types of features from inputs: the output of the logits layer and graph information that represents the distance of an input to the others (here, the cosine distance is used for the computation). After that, a GNN model is used to learn the graph information and predict the learned contextual attributes of the data. Finally, given the correctness labels (e.g., 0 for incorrect, 1 for correct) of the inputs, TestRank utilizes a simple binary classification model to learn from the output information and the contextual attributes and predict the correctness of inputs.
Granese et al. proposed DOCTOR [32], which defines two types of discriminators to determine whether an input is a fault. The first is based on the sum of the squared softmax probabilities, and the second is computed from the predicted posterior class probability. Interestingly, the authors considered two scenarios for the evaluation, Totally Black Box (TBB), where only the predictions are available, and Partially Black Box (PBB), where gradient information is allowed. In the PBB situation, the authors found that adding a small perturbation brought advantages for fault detection.
Sensoy et al. proposed risk-calibrated evidential classifiers [93], whose outputs are more sensitive to faults. The purpose of such classifiers is to force faults to be biased toward less risky categories to increase the fault detection rate. To do so, the authors first utilized Dirichlet distributions to reform the outputs and, therefore, to quantify the uncertainty of inputs against DNNs. After that, an evidential deep classifier was trained, where the activation function of the classifier was changed to the exponential function to predict the Dirichlet distribution for each sample. Finally, uncertainty methods were applied to the evidential deep classifier for fault detection.
Qiu et al. proposed Residual-based Error Detection (RED) [89] to enhance the fault detection score based on the maximum class probability. Specifically, RED first assigns a target detection score to each training input according to whether it is correctly classified by the base model, that is, 1 for correct and 0 for incorrect. Then, it computes the residual between the target score and the maximum class probability produced by the original model. After that, a Residual prediction with an Input/Output kernel (RIO) model is trained to predict this residual. Finally, when new inputs come, RED combines the score produced by the RIO model and the output probability produced by the original model for fault detection.
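The residual targets and the final combination can be sketched as follows; the additive combination in `red_score` is an assumption about how RED merges the two scores:

```python
import numpy as np

def residual_targets(max_probs, correct):
    """Targets for the residual (RIO) model:
    (1 if correctly classified else 0) minus the maximum class probability."""
    return np.asarray(correct, dtype=float) - np.asarray(max_probs, dtype=float)

def red_score(max_prob, predicted_residual):
    """Corrected confidence for a new input: MCP plus the residual
    predicted by the trained RIO model (additive combination assumed)."""
    return max_prob + predicted_residual
```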
Zhu et al. [126] conducted a study and found that adding out-of-distribution (OOD) detection methods, for example, Outlier Exposure, to the training process harms the performance of output-based fault detection. Based on this finding, the authors proposed a method called OpenMix to help fault detection. Specifically, OpenMix changes the OE loss in Outlier Exposure to a Mixup-based loss over in-distribution (ID) and OOD data to increase the exposure of low-density regions. In this way, the OOD detection methods obtain both good OOD detection ability and good fault detection ability.
5.1.3 New Method from Others.
5.1.3.1 Output-Based Methods. Aghababaeyan et al. introduced DeepGD [2], a search-based test selection method for detecting faults. Specifically, DeepGD considers both uncertainty scores and diversity scores of inputs to conduct test prioritization. The Gini score [26] is used for the uncertainty calculation, and the geometric diversity of features extracted from VGG models is applied for the diversity calculation. The NSGA-II algorithm is used to optimize uncertainty and diversity and assign a final score to each input for prioritization.
Yang et al. proposed PROPHET [117] to detect faults in automated speech recognition (ASR) systems. In this work, the authors introduced a new fine-grained word-level error metric for evaluating the correctness of audio-to-text tasks. By comparing the reference text and the predicted text, PROPHET labels a token as 0 (1) if it is correctly (incorrectly) predicted. Then, it trains a BERT model to predict word errors. Given new incoming data, after feeding them to the trained BERT model, the ones with higher word errors have higher probabilities of being faults.
Lee et al. proposed a neuron activation similarity-based sample selection method [63] for fault detection. The method first collects the neuron activation information for each class from the training data and then checks the activation of the test data. If the difference between the activation similarities of the training and test data is below a threshold, these inputs are considered to be faults.
Wu et al. proposed RNNtcs [112], a test prioritization method for RNNs. Given unlabeled test data, RNNtcs first extracts the outputs of the test inputs and utilizes HDBSCAN to cluster the outputs into different groups. Then, it utilizes uncertainty methods (least confidence is used in RNNtcs) to compute uncertainty scores for the outliers identified by HDBSCAN and prioritizes them accordingly. If the number of outliers is less than the labeling budget, the uncertainty scores of the other inputs in each cluster (normal data) are also calculated and prioritized. After that, the hidden state CR of the RNN model is computed as a second uncertainty measurement. Finally, the budgeted number of inputs with higher hidden state CRs is selected from the prioritized sets, taking outliers before normal data.
5.1.3.2 Learning-Based Methods. Chen et al. proposed ActGraph [9], an activation graph-based test prioritization method. ActGraph first extracts the activation values from the last K layers and builds an adjacency matrix. Then, a GNN model is used to aggregate the activation features, and an aggregation function is used to obtain the center node feature; ActGraph uses \(Sum(\cdot)\) as the aggregation function. Finally, a learning-to-rank algorithm is applied to build a ranking model for test prioritization; XGBoost is applied in ActGraph as the ranking model.
5.1.4 Empirical Study from SE.
The very first empirical study was conducted by Byun et al. [8]. In this work, the authors explored the effectiveness of three fault detection methods: entropy (method 2 in Table 5), uncertainty in Bayesian neural networks (method 4 in Table 5), and DSA [54]. The key finding from this empirical study is that these methods are more effective in detecting faults when the DNN under test has a higher test accuracy. Besides, Mosin et al. compared SA-based, autoencoder-based, and similarity-based fault detection methods [85]. They found that SA has the most effective fault detection ability among the considered methods, but the similarity-based method is the most efficient, being more than 1,000 times faster than the SA method.
Ma et al. evaluated the fault detection ability of 12 methods [76]. In their work, they divided existing fault detection methods into two groups: methods from the ML testing literature, including coverage-guided methods, SA-based methods, and so on, and model uncertainty-based methods, such as maximum probability (method 5 in Table 5). Importantly, the authors suggested using the Silhouette coefficient [92] to detect faults and also defined two new methods, Variance and Weighted Variance (methods 6 and 7 in Table 5). They found that uncertainty-based methods have stronger fault detection ability than SA and neuron coverage-based methods. Shi et al. also empirically studied the effectiveness of test case prioritization metrics (fault detection methods) [95]. In total, 11 fault detection methods were considered by the authors, including SA methods and uncertainty-based methods. Different from the previous work, this work suggested using model mutants to replace dropout prediction for fault detection and introduced three new methods called Variation Ratio (VR), Variation Ratio for Original (VRO), and Mutual Information (MI), which are listed as methods 9, 10, and 11 in Table 5. In addition to analyzing the effectiveness of existing methods, this study also explored the impact of the test suite size, the mutation operators, and the number of mutants.
More recently, Weiss et al. [111] empirically showed that simple test prioritization methods outperform SA and neuron coverage methods in terms of fault detection. Here, simple methods refers to methods that rely only on the model outputs, for example, directly using the maximum probability to detect faults (method 5 in Table 5).