1 Introduction
Deep neural networks (DNNs) have achieved extremely high predictive accuracy in various domains, such as computer vision [3, 72], autonomous driving [50, 88], and natural language processing [22, 62]. Despite their superior performance, the lack of explainability of DNN models remains an issue in many contexts. While they can approximate complex and arbitrary functions, studying their structure often provides little or no insight into the underlying prediction mechanisms. There seems to be an intrinsic tension between Machine Learning (ML) performance and explainability. Often the highest-performing methods (for example, Deep Learning) are the least explainable, and the most explainable (for example, decision trees) are the least accurate [35].
For DNNs to be trustworthy, in many critical contexts where they are used, we must understand why they behave the way they do [7]. Explanation methods aim at making neural network decisions trustworthy [32]. Several explanation methods have been proposed in the literature (see Section 5). In our work, because of our focus on safety analysis, we focus on explanation methods for root cause analysis, that is, identifying the underlying reason for a DNN failure (the root cause), which is, in our context, an incorrect DNN prediction. More precisely, we aim at identifying root causes in terms of characteristics of the input images leading to failures; in other words, we are interested in identifying the different scenarios in which the DNN may fail. Such characterization is the first step toward retraining the DNN.
Root cause analysis techniques based on unsupervised learning have proven their effectiveness [86, 96]. These methods group failure samples (e.g., data collected during hardware testing) without requiring diagnostic labels, such that the samples in each cluster share similar root causes.
Our previous work is the first application of unsupervised learning to perform root cause analysis targeting DNN failures. Precisely, we proposed two DNN explanation methods: Safety Analysis based on Feature Extraction (SAFE) [4] and Heatmap-based Unsupervised Debugging of DNNs (HUDD) [24]. They both process a set of failure-inducing images and generate clusters of similar images. Commonalities across images in each cluster provide information about the root cause of the failure. Further, the identified root causes support safety analysis because they help identify possible countermeasures to address the problem. For example, applying our approaches to failure-inducing images for a DNN that classifies car seat occupancy may yield a cluster of images with child seats containing a bag; such a cluster may help engineers determine that bags inside child seats are likely to be misclassified. Possible countermeasures could be to retrain the DNN using more images of child seats with objects or, if that does not work, to integrate additional components that make the approach safer (e.g., radar technology [44]). Both SAFE and HUDD also support the identification of additional images to be used to retrain the DNN.
HUDD and SAFE differ with respect to the kind of data used to perform clustering and the pipeline of steps they rely on. HUDD applies clustering based on internal DNN information; precisely, for all failure-inducing images, it generates heatmaps capturing the relevance of DNN neurons on the DNN output. It then applies a hierarchical clustering algorithm relying on a distance metric based on the generated heatmaps. SAFE is black-box as it does not rely on internal DNN information. It generates clusters based on the visual similarity across failure-inducing images. To this end, it relies on feature extraction based on transfer learning, dimensionality reduction, and the DBSCAN clustering algorithm.
SAFE and HUDD rely on a pipeline that has been configured in specific ways according to best practices. However, several variants exist for each component of both approaches (e.g., different transfer learning models, different clustering algorithms).
In this article, we aim at evaluating these pipeline variants for both SAFE and HUDD. Therefore, we propose an empirical evaluation of 99 alternative configurations (pipelines) for SAFE and HUDD. These pipelines were obtained using different combinations of feature extraction methods, clustering algorithms, and dimensionality reduction techniques; in addition, we assessed the effect of fine-tuning the transfer learning models used by feature extraction methods. Consistent with HUDD and SAFE, our pipelines support the characterization of DNNs tested at the level of models, not systems. Model-level testing, also called offline testing [38], concerns testing DNN models in isolation, whereas system-level testing, also called online testing [38], targets the system integrating the DNN (e.g., an autonomous driving system tested within a simulator [38]). Supporting system-level testing is part of future work.
For our empirical evaluation we considered six case study subjects, two of which were provided by our industry partner in the automotive domain, IEE Sensing1. Our subjects’ applications include head pose classification, eye gaze detection, drowsiness detection, steering angle prediction, unattended child detection, and car position detection.
We present a systematic and extensive evaluation scheme for these pipelines, which entails generating failure causes that resemble realistic scenarios (e.g., poor lighting conditions or camera misconfiguration). Since in these scenarios the causes of failures are known a priori, such an evaluation scheme enables us to objectively analyze and evaluate the performance of pipelines while controlling the frequency of such failure scenarios.
Our empirical results suggest that the best pipelines support and facilitate the process of functional safety analysis in that they (1) can generate root-cause clusters (RCCs) that group together a very high proportion of images capturing the same root cause (\(94.3\%\), on average), (2) can capture most of the root causes of failures for all case study subjects (\(96.7\%\), on average), and (3) can perform well (i.e., are reliable) in the presence of rare failure instances in a dataset (i.e., when some causes of failures affect less than 10% of the failure-inducing images). In our approach, the root causes of failures are determined by engineers after inspecting the identified clusters. Although such a solution still requires human involvement, it simplifies an engineer’s task2; indeed, it is unlikely that a human can manually identify similarities across a large set of images leading to DNN failures. Further, though our previous work (i.e., SEDE [27]) aims at improving the degree of automation by automatically deriving expressions capturing commonalities in failure-inducing images, in this article, we tackle an orthogonal problem: assessing which pipelines lead to clusters with better purity and coverage. One possible future work is the integration of the best analysis pipeline with SEDE.
The remainder of this article is organized as follows. In Section 2, we briefly present the main features and limitations of SAFE and HUDD, along with other feature extraction models (Autoencoders and Backpropagation-based Heatmaps). In Section 3, we describe the different models and algorithms we use in our evaluated pipelines. In Section 4, we present the research questions, the experiment design, and the results, including a comparison between 99 pipelines. In Section 5, we discuss and compare related work. Finally, we conclude this article in Section 6.
2 Background
This section provides an overview of our previous work that inspired this research. We focus on clustering methods, heatmap-based DNN Explanations, the HUDD and SAFE DNN explanation methods, and Autoencoders.
2.1 Clustering
Clustering is a data analysis method that mines essential information from a dataset by grouping data into several groups called clusters. In clustering, similar data points are grouped into the same cluster, while non-similar data points are put into different clusters. Data clustering pursues two main objectives: minimizing the dissimilarity within each cluster and maximizing the dissimilarity between clusters. HUDD and SAFE rely on hierarchical agglomerative clustering (HAC [71]) and density-based clustering (DBSCAN [23]), respectively. In HAC, each observation starts in its own cluster and pairs of clusters are iteratively merged to minimize an objective function (e.g., the error sum of squares [94]). DBSCAN works by considering dense regions as clusters; it is detailed in Section 3.
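To make the two algorithms concrete, the following sketch (using scikit-learn, with illustrative toy 2D points rather than heatmaps or image features, and illustrative parameter values rather than the ones used by HUDD or SAFE) shows how HAC and DBSCAN are typically invoked:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

# Two well-separated toy groups of 2D points.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.1, size=(20, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.1, size=(20, 2)),
])

# HAC: every point starts in its own cluster; pairs of clusters are
# merged bottom-up (Ward linkage minimizes the error sum of squares).
hac = AgglomerativeClustering(n_clusters=2, linkage="ward")
hac_labels = hac.fit_predict(points)

# DBSCAN: dense regions become clusters; points belonging to no dense
# region are labeled -1 (noise).
dbscan = DBSCAN(eps=0.5, min_samples=5)
db_labels = dbscan.fit_predict(points)

print(len(set(hac_labels)))            # 2 clusters
print(len(set(db_labels) - {-1}))      # 2 dense clusters
```

Note that HAC requires the number of clusters up front (HUDD chooses it with the knee-point method, as discussed in Section 2.3), whereas DBSCAN derives it from the density parameters.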
2.2 Heatmap-based DNN Explanations
Approaches that aim at explaining DNN results have been developed in recent years [31]. Most of these concern the generation of heatmaps that capture the importance of pixels in image predictions. They include black-box [15, 68] and white-box approaches [59, 76, 80, 99, 101]. Black-box approaches generate heatmaps for the input layer and do not provide insights regarding internal DNN layers. White-box approaches rely on the backpropagation of the relevance score computed by the DNN [59, 76, 80, 99, 101].
In this section, we focus on a white-box technique called Layer-Wise Relevance Propagation (LRP) [59] because it has been integrated into HUDD. LRP was selected because it does not present the shortcomings of other heatmap generation approaches [24].
LRP redistributes the relevance scores of neurons in a higher layer to those of the lower layer. Figure 1 illustrates how LRP operates on a fully connected network used to classify inputs. In the forward pass, the DNN receives an input and generates an output (e.g., classifies the gaze direction as TopLeft) while recording the activations of each neuron. In the backward pass, LRP generates internal heatmaps for a DNN layer k, which consist of a matrix with the relevance scores computed for all the neurons of layer k.
The heatmap in Figure 1 shows that the pupil and part of the eyelid, which are the non-white parts in the heatmap, had a significant effect on the DNN output. Furthermore, the heatmap in Figure 2 shows that the mouth and part of the nose are the input pixels that most impacted the DNN output.
A heatmap is a matrix with entries in \(\mathbb {R}\), i.e., it is a triple \((N,M,f)\) where \(N,M \in \mathbb {N}\), and f is a map \([N] \times [M] \rightarrow \mathbb {R}\). We use the syntax \(H[i,j]_x^L\) to refer to an entry in row i (i.e., \(i \lt N\)) and column j (i.e., \(j \lt M\)) of a heatmap H computed on layer L from an image x. The size of the heatmap matrix (i.e., the number of entries) is \(N \cdot M\), where N and M are determined by the dimensions of the DNN layer L. For convolution layers, N represents the number of neurons in the feature map, whereas M represents the number of feature maps. For example, the heatmap for the eighth layer of AlexNet has size \(169 \times 256\) (convolution layer), while the heatmap for the tenth layer has size \(4096 \times 1\) (linear layer).
2.3 Heatmap-based Unsupervised Debugging of DNNs (HUDD)
Although heatmaps may provide useful information to determine the characteristics of an image that led to an erroneous result from the DNN, they are of limited applicability because, to determine the cause of all DNN errors observed in the test set, engineers may need to visually inspect all the error-inducing images, which is practically infeasible. To overcome such limitations, we recently developed HUDD [24], a technique that facilitates the explanation and removal of the DNN errors observed in a test set. HUDD generates clusters of images that lead to a DNN error because of the same root cause. The root cause is determined by the engineer who visualizes a subset of the images belonging to each cluster and identifies the commonality across each image (e.g., for a Gaze detection DNN, all the images present a closed eye). To further support DNN debugging, HUDD automatically retrains the DNN by selecting a subset from a pool of unlabeled images that will likely lead to DNN errors because of the same root causes observed in the test set.
Figure 3 provides an overview of HUDD, which consists of six steps. In Step 1, root cause clusters are identified by relying on a hierarchical clustering algorithm applied to heatmaps generated for each failure-inducing image. Step 2 involves a visual inspection of clustered images. In this step, engineers visualize a few representative images for each RCC; the inspection enables the engineers to determine the commonalities across the images in each cluster and, therefore, the failure root cause. Example root causes include the presence of an object inside a child seat (as reported in the Introduction) or a face turned left, thus making an eye not visible and causing misclassification in a gaze detection system. HUDD’s Step 2 supports functional safety analysis because each failure root cause represents a usage scenario in which the DNN is likely to fail, and, based on domain knowledge, engineers can determine the likelihood of each failure scenario, its safety impact, and possible countermeasures, as required by functional safety analysis standards [45, 46]. For example, objects inside child seats might be very common, but they lead to false alarms, not hazards; misclassified gaze may instead prevent the system from determining that the driver is not paying attention to the road. Countermeasures include the retraining of the DNN, which is supported by HUDD’s Step 3. In Step 3, a new set of images, referred to as the improvement set, is provided by the engineers to retrain the model. In Step 4, HUDD automatically selects a subset of images from the improvement set called the unsafe set. The engineers label the images in the unsafe set in Step 5. Finally, in Step 6, HUDD automatically retrains the model to enhance its prediction accuracy.
Heatmap-based Clustering in HUDD. Clustering based on heatmaps is a key component of HUDD, and its functioning is useful to understand some of the pipelines considered in this article. HUDD relies on LRP to generate a heatmap for every internal layer of the DNN, for each failure-inducing image. However, since distinct DNN layers lead to entries defined on different value ranges [60], to enable the comparison of clustering results across different layers, we generate normalized heatmaps by relying on min-max normalization [36].
For each DNN layer L, a distance matrix is constructed using the generated heatmaps; it captures the distance between every pair of failure-inducing images in the test set. The distance between a pair of images \(\langle a,b \rangle\), at layer L, is computed as follows:

\[ \mathit {distance}^{L}(a,b) = \mathit {EuclideanDistance}\big (\tilde{H}^L_a, \tilde{H}^L_b\big ) \]

where \(\tilde{H}^L_x\) is the heatmap computed for image x at layer L. \(\mathit {EuclideanDistance}\) is a function that computes the Euclidean distance between two \(N \times M\) matrices according to the formula

\[ \mathit {EuclideanDistance}(A,B) = \sqrt {\sum _{i=1}^{N} \sum _{j=1}^{M} \left(A_{i,j} - B_{i,j}\right)^2 } \]

where \(A_{i,j}\) and \(B_{i,j}\) are the values in the cell at row i and column j of the matrices.
HUDD applies the HAC clustering algorithm multiple times, once for every DNN layer. For each DNN layer, HUDD selects the optimal number of clusters using the knee-point method applied to the weighted average intra-cluster distance (\(\mathit {WICD}\)). \(\mathit {WICD}\) is defined according to the following formula:

\[ \mathit {WICD}(L_l) = \sum _{j=1}^{|L_l|} \mathit {ICD}(L_l, C_j) \cdot \frac{|C_j|}{|C|} \]

where \(L_l\) is a specific layer of the DNN, \(|L_l|\) is the number of clusters in the layer \(L_l\), \(\mathit {ICD}\) is the intra-cluster distance for cluster \(C_j\) belonging to layer \(L_l\), \(|C_j|\) represents the number of elements in cluster \(C_j\), whereas \(|C|\) represents the number of images in all the clusters.

In Formula 3, \(\mathit {ICD}(L_l,C_j)\) is computed as follows:

\[ \mathit {ICD}(L_l, C_j) = \frac{\sum _{i=1}^{N_j} \mathit {distance}^{L_l}(p_i^a, p_i^b)}{N_j} \]

where \(p_i\) is a unique pair of images in cluster \(C_j\), and \(N_j\) is the total number of pairs it contains. The superscripts a and b refer to the two images of the pair to which the distance formula is applied.
HUDD then selects the layer \(L_m\) with the minimal \(\mathit {WICD}\). By definition, the clusters generated for layer \(L_m\) are the ones that maximize cohesion; we therefore anticipate that they will group together images that exhibit similar characteristics.
In our study, we rely on HUDD as a feature extraction method; precisely, we use the heatmaps generated by the layer selected by HUDD as features.
2.4 Safety Analysis based on Feature Extraction (SAFE)
SAFE is based on a combination of a transfer learning-based feature extraction method, a clustering algorithm, and a dimensionality reduction technique. The workflow of SAFE matches HUDD’s, except for Step 1 and Step 4. In SAFE’s Step 1, RCCs are identified by relying on non-convex clustering (DBSCAN) applied to features extracted from failure-inducing images; HUDD, instead, applies hierarchical clustering to heatmaps. In Step 4, SAFE selects the unsafe set from the improvement set using a procedure that relies on DBSCAN’s outputs.
The pipelines evaluated in this article were inspired by the pipeline implemented in SAFE’s Step 1, which consists of three stages (see Figure 4): Feature Extraction, Dimensionality Reduction, and Clustering. In this article, we investigate variants of the SAFE pipeline using different combinations of these components. Additionally, we introduce a fine-tuning stage where we fine-tune the pre-trained transfer learning models to generate more domain-specific models. Excluding clustering, which was introduced in Section 2.1, the components of SAFE’s pipeline are briefly described below.
2.4.1 Transfer Learning and Feature Extraction.
To maximize the accuracy of image-processing DNNs in a cost-effective way, engineers often rely on the transfer learning approach, which consists of transferring knowledge from a generic domain, usually ImageNet [81], to another, specific domain (e.g., safety analysis, in our case). In other terms, we try to exploit what has been learned in one task and generalize it to another task. Researchers have demonstrated the efficiency of transfer learning from ImageNet to other domains [85].
Transfer learning-based Feature Extraction is an efficient method to transform unstructured data into structured raw data to be exploited by any machine learning algorithm. In this method, the features are extracted from images using a pre-trained CNN model [18].
The standard CNN architecture comprises three types of layers: convolutional layers, pooling layers, and fully connected layers. The convolutional layer is considered the primary building block of a CNN. This layer extracts relevant features from input images during training. Convolutional and pooling layers are stacked to form a hierarchical feature extraction module. The CNN model receives an input image of size \((224,224,3)\). This image is then passed through the network’s layers to generate a vector of features. The feature extraction process, for each image, generates raw data represented by a 2D matrix (denoted as X) formalized below:

\[ X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,m} & C_1 \\ x_{2,1} & x_{2,2} & \cdots & x_{2,m} & C_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{k,1} & x_{k,2} & \cdots & x_{k,m} & C_k \end{bmatrix} \]

where \(C_i\) represent the class categories, c is the number of categories, \(m = N \times N\) is the number of features, and k is the size of the dataset.
SAFE uses the VGG16 model pre-trained on the ImageNet dataset as a feature extraction method.
2.4.2 Dimensionality Reduction.
Dimensionality reduction aims at approximating data in high-dimensional vector spaces [34]. It is important in our context since we extract a high number of features from the images (512 to 2048). In SAFE, we used the Principal Component Analysis (PCA) dimensionality reduction method to reduce the number of features from 2048 to 100.
2.5 Autoencoders
Autoencoders (AE) are unsupervised artificial neural networks that learn how to compress and encode the data before reconstructing it from the compressed encoded version to a representation that resembles the original input as much as possible. AEs can extract features that can be used to improve downstream tasks, such as clustering or supervised learning, that benefit from dimensionality reduction and higher-level features. In other words, AEs try to learn an approximation of the identity function and, by placing various constraints on the network’s architecture and activation functions, they extract useful representations [28].
Figure 5 illustrates the neural network architecture of a simple AE. It consists of four main components:
— Encoder: learns how to compress the input data and reduce its dimensions into an encoded representation.
— Bottleneck: contains the encoded representation of the input data (i.e., the extracted feature vector).
— Decoder: reconstructs the input data from the encoded version (retrieved from the Bottleneck) such that it resembles the original input data as much as possible.
— Reconstruction Loss: the difference between the Encoder’s input and the reconstructed version (the Decoder’s output). The objective is to minimize such loss during training.
The objective of an AE’s training process is to minimize its reconstruction loss, measured as either the mean squared error or the cross-entropy loss between the original inputs and their reconstructions.
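The training objective can be illustrated with a deliberately tiny linear autoencoder in plain NumPy (a sketch, not the architecture used in our pipelines): the encoder compresses inputs into a lower-dimensional bottleneck, the decoder reconstructs them, and gradient descent minimizes the mean squared reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data lying (approximately) on a 2D subspace of R^8.
basis = rng.normal(size=(2, 8))
X = rng.normal(size=(200, 2)) @ basis

# Linear encoder/decoder with a 2-unit bottleneck.
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

def loss(X, W_enc, W_dec):
    X_hat = (X @ W_enc) @ W_dec          # encode, then decode
    return np.mean((X_hat - X) ** 2)     # reconstruction MSE

initial = loss(X, W_enc, W_dec)
lr = 0.1
for _ in range(200):
    Z = X @ W_enc                         # bottleneck codes
    G = 2.0 * ((Z @ W_dec) - X) / X.size  # dLoss/dX_hat
    grad_dec = Z.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
final = loss(X, W_enc, W_dec)
print(final < initial)   # True: training reduced the reconstruction loss
```

After training, only the encoder (here, `W_enc`) is kept to extract bottleneck features, mirroring how we use AEs in our pipelines.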
4 Empirical Evaluation
In this section, we aim at evaluating the pipelines presented in Section 3. A pipeline leads to the generation of clusters of images that are visually inspected by safety engineers to determine the root cause captured by each. We assume that a root cause can be described in terms of the commonalities across the images in a cluster; each root cause is thus a distinct scenario in which the DNN may fail (hereafter, failure scenario). The pipeline that best supports such a process should be the one requiring minimal effort toward accurate identification of root causes. Therefore, the best pipeline is the one that generates clusters having a high proportion of similar images (to facilitate the identification of the root cause, based on analyzing similarities across images in a cluster), enables the detection of all the root causes of failures, and is reliable in the presence of rare instances of a particular root cause (to avoid ignoring infrequent but unsafe failure scenarios). Based on the above, we defined four research questions to drive our empirical evaluation:
RQ1 Which pipeline generates root cause clusters with the highest purity? We define a pure cluster as one that contains only images representing the same failure scenario. Such clusters are expected to be easier to interpret; indeed, the engineer should more easily determine the root cause of failures if all the images share the same characteristics. Therefore, the best pipeline is the one that leads to clusters with the highest degree of purity. The purity of a cluster is computed as the maximum proportion of images in the cluster belonging to the same failure scenario.
RQ2 Which pipelines generate root cause clusters covering more failure scenarios? This research question investigates to which extent the different pipelines miss failure scenarios. Ideally, all failure scenarios should be captured by one or more clusters. We say that a failure scenario is covered by a cluster if a majority of the images in the cluster belong to the scenario; indeed, commonalities shared by most of the images in a cluster should be noticed by engineers during visual inspection. We aim at determining which pipeline maximizes such coverage.
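The purity (RQ1) and coverage (RQ2) metrics defined above can be computed as follows (a sketch; each cluster is represented as the list of known failure scenarios of its images):

```python
from collections import Counter

def cluster_purity(cluster_scenarios: list) -> float:
    """Purity: maximum proportion of images in the cluster
    belonging to the same failure scenario."""
    counts = Counter(cluster_scenarios)
    return max(counts.values()) / len(cluster_scenarios)

def covered_scenarios(clusters: list) -> set:
    """A scenario is covered if it holds the majority of the
    images in at least one cluster."""
    covered = set()
    for cluster in clusters:
        scenario, count = Counter(cluster).most_common(1)[0]
        if count > len(cluster) / 2:
            covered.add(scenario)
    return covered

clusters = [
    ["blur", "blur", "blur", "noise"],   # purity 0.75, covers "blur"
    ["noise", "noise", "shadow"],        # purity ~0.67, covers "noise"
]
print(cluster_purity(clusters[0]))       # 0.75
print(covered_scenarios(clusters))       # {'blur', 'noise'}
```

Note that "shadow" remains uncovered in this toy example: it never holds a majority in any cluster, which is exactly the situation RQ2 aims to quantify.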
RQ3 How is the quality of the generated root cause clusters affected by infrequent failure scenarios? Some failure scenarios may be infrequent but are nevertheless important to identify as they may lead to severe hazards once the DNN is deployed in the field. Ideally, a pipeline should be able to produce high-quality clusters even when a small number of images belong to one or more failure scenarios. In this research question, we vary the number of images belonging to failure scenarios and study how the effectiveness of pipelines—purity and coverage of the generated clusters—is affected.
RQ4 How do pipelines perform with failure scenarios that are not synthetically injected? The only way to know which failure scenarios affect our subject DNNs, for RQ1 to RQ3, is to rely on test set images presenting alterations (e.g., blurriness) that the DNN cannot process (e.g., because it was not trained on such images). However, the results observed with injected failure scenarios may not generalize to pre-existing failure scenarios (i.e., scenarios that the original DNN cannot properly handle despite being trained for them). This research question assesses if the pipelines that perform best with injected failure scenarios also perform best with pre-existing failure scenarios and vice-versa.
To perform our empirical evaluation, we implemented our pipelines’ components using different libraries. Feature extraction based on LRP was implemented with PyTorch [70], Tensorflow [1], and Keras [12] as an extension of the DNNs under test, whereas transfer learning models were implemented using Tensorflow and Keras. The clustering algorithms and the dimensionality reduction methods rely on the Scikit-Learn library [66]. All the experiments were carried out on an Intel Core i9 processor running macOS with 32 GB RAM. Additionally, in our experiments, we relied on the LRP implementation provided by the LRP authors [58] for well-known types of layers (i.e., MaxPooling, AvgPooling, Linear, and Convolutional layers).
4.1 Subjects of the Study
To evaluate our pipelines, we consider four different DNNs that process synthetic images in the automotive domain. These DNNs support gaze detection, drowsiness detection, headpose detection, and unattended child detection, which are subjects of ongoing innovation projects at IEE Sensing, our industry partner developing sensing components for automotive. Additionally, we consider two DNNs that process real-world images to support autonomous driving: steering angle prediction and car position detection.
The gaze detection DNN (GD) performs gaze tracking; it can be used to determine a driver’s focus and attention. It divides gaze directions into eight categories: TopLeft, TopCenter, TopRight, MiddleLeft, MiddleRight, BottomLeft, BottomCenter, and BottomRight. The drowsiness detection DNN (OC) has the same architecture as the gaze detection DNN and relies on the same dataset, except that it predicts whether the driver’s eyes are open or closed.
The head-pose detection DNN (HPD) provides an important cue for scene interpretation and computer remote control, such as in driver assistance systems. It determines the pose of a driver’s head in an image based on nine categories: straight, rotated left, rotated bottom left, rotated top left, rotated bottom right, rotated right, rotated top right, tilted, and headed up.
The unattended child detection DNN is trained with the Synthetic dataset for Vehicle Interior Rear seat Occupancy detection (SVIRO) [14]. SVIRO is a dataset generated by IEE Sensing that represents scenes in the passenger compartment of ten different vehicles. The dataset has been used to train DNNs performing rear seat occupancy detection using a camera system. The original IEE DNN classifies SVIRO images into seven classes: adult, child, infant, child seat (empty or occupied), and infant seat (empty or occupied). However, the trained IEE DNN cannot be made publicly available for replication studies; therefore, in our study, we use SVIRO to retrain IEE’s DNN from scratch with only three output classes (i.e., empty seats, children/infants not accompanied by adults, and the presence of an adult). For our classification task, we relabeled the SVIRO dataset as follows: a seat is labeled empty when it contains an object, an empty child/infant seat, or nothing; the presence of a child/infant and the presence of an adult are distinct classes. IEE’s DNN architecture is open source [17]; it follows a VGG-19 architecture, and we retrained it for 2,000 epochs with a batch size of 64.
Steering angle prediction (SAP) datasets are commonly used in autonomous driving or vehicle control systems [20]. These datasets are designed to train machine learning models to predict the appropriate steering angle for a given input image. The steering angle is a crucial parameter that determines the direction in which a vehicle should turn. The images can represent different perspectives of the road ahead, including images from a front-facing camera, multiple camera angles, or even side or rear cameras. For example, an image in the dataset could show the view of the road ahead from the driver’s perspective.
For Steering Angle Prediction, we rely on the pre-trained Autumn DNN model [69], which follows the DAVE-2 architecture [6] provided by NVIDIA. It is a DNN to automate steering commands of self-driving vehicles [89]; it predicts the angle at which the wheel should be turned. It has been trained on a dataset of road images captured by a dashboard camera mounted in the car.
Car Position Detection (CPD) DNNs are used by most Advanced Driver-Assistance Systems (ADAS) to predict the positions of the cars in the scene [93]. For example, a dataset could include images captured from different angles or heights, representing various driving scenarios like urban environments, highways, or parking lots. The goal is to predict the position of each car in the scene. We rely on the CenterNet DNN [21], which is an accurate DNN used by most competition-winning approaches for object detection [90]. It has been trained on images from the ApolloScape dataset [42] collected using a dashboard camera to estimate the absolute position of vehicles with respect to the ego-vehicle.
For each subject DNN, we apply our pipelines to a set of failure-inducing images. Such sets consist of (1) images belonging to a provided test set and leading to a DNN failure and (2) test set images that were not leading to a DNN failure but had been modified to cause a DNN failure; the latter are images with injected root causes of failures and are described in Section 4.2. In classifier DNNs (i.e., OC, GD, HPD, and SVIRO), a failure occurs in the presence of an image being incorrectly classified. For SAP and CPD, which are regression DNNs, we set a threshold to determine DNN failures. For SAP, we observe a DNN failure when the squared error between the predicted and the true angle is above 0.18 radians (\(10.3^{\circ }\)), which is deemed to be an acceptable error in related work [88]. For CPD, since it tackles a multi-object detection problem, we report a DNN failure when the result contains at least one false positive (i.e., the distance between the predicted position of the car and the ground truth is above 10 pixels [79]).
In Table 1, we provide details about the case study subjects used to evaluate our pipelines. For each subject, we report the source of the dataset (e.g., the simulator used to generate the data), the training and test set sizes, the accuracy of the DNN on the original test set, the number of failure-inducing images, and the number of images for each injected root cause (detailed in Section 4.2).
We fine-tune the pipelines relying on transfer learning using the test sets of the respective case studies. We use the resulting fine-tuned model to extract the features from the failure-inducing sets. We train on the test sets because the number of images in each set is sufficient for the model to learn the features. We also train the autoencoders on the training set and use the test set of the respective case study to validate the results. The termination criterion is 50 epochs unless we reach an early stopping point (i.e., the model stops improving). After training, we use only the encoder part to extract the features from the images in the failure-inducing set.
4.2 Injected Failure Scenarios
To assess the ability of different pipelines to generate clusters that are pure and cover all the root causes of failures, we need to know the root causes of failures in the test set. Such root causes may vary (e.g., lack of sufficient illumination, presence of a shadow in a specific part of the image), and it is not possible to objectively demonstrate that a failure cause has been correctly captured by a cluster (e.g., some readers may not agree that certain images show a lack of sufficient illumination). Therefore, to avoid introducing bias and subjectivity in our results, we modify a subset of the provided test set images so that they fail because of known root causes of failures. In total, we considered nine different root causes to be injected into our test set images and refer to them as injected failure scenarios (i.e., failure scenarios with injected root causes).
We derive an image belonging to an injected failure scenario by modifying a test set image according to the specific root cause we aim at injecting; for example, by covering the mouth of a person with a mask. To ensure that a modified image leads to a DNN failure because of the injected root cause, we modify only test set images that, before modification, lead to a correct DNN output.
Figure
9 illustrates the different injected failure scenarios. Below, we describe the nine root causes considered in our study:
—
Hand: The presence of a hand blocking the full view of the driver’s face could affect the DNN result, leading it to mispredict the driver’s head direction. We simulate a hand that is partially covering the face appearing in the image.
—
Mask: Similar to Hand, the presence of a mask covering the nose or the mouth may affect a DNN that recognizes the driver’s head pose. Using image key points, we drew the shape of a white mask to simulate a mask covering the nose and the mouth.
—
Sunglasses: As for the Mask, we use the eyes’ key points to draw sunglasses covering the driver’s eyes.
—
Eyeglasses: Different from the Sunglasses, we draw glasses with the eyes being still visible.
—
Noise: A noisy image is one that contains random perturbations of colors or brightness. In real-world automotive systems, such a failure scenario occurs due to a defective camera or a low
signal-to-noise ratio (
SNR) in the communication channel between different
electronic control units (
ECUs), resulting in a noisy input. Also, some image compression algorithms, particularly those used in certain file formats like JPEG, can introduce artifacts and noise into the image during the compression process [
39,
52]. Related work has considered this failure scenario to assess the fault tolerance of DNNs [
100]. We use the Scikit-Image library [
92] to add Gaussian noise, i.e., statistical noise whose probability density function follows a normal distribution, also known as the Gaussian distribution.
—
Blurriness: This scenario can occur because of camera shake, especially when the camera is integrated into the car. Motion blur can also happen when capturing moving objects such as cars and pedestrians. This failure scenario was used to evaluate DNN robustness [
88]. We use the Pillow library [
13] to add blurriness to images using a radius of 30 pixels.
—
Darkness: In practice, poor lighting conditions could make the DNN fail because it cannot clearly recognize what is depicted in a relatively dark image. This failure scenario was used in a related work to evaluate DNN robustness [
67]. We use the Pillow library [
13] to decrease the brightness of images by a factor of 0.3; we selected 0.3 because it is the lowest value introducing failures in our subject results.
—
Scaling: Such a failure scenario mimics the situation where a camera is misconfigured, leading to rescaled images being fed to the DNN. We reduce the size of an image by a value based on the image size (i.e., large 1200px
\(\times\) 1200px images are scaled by 400px, small 320px
\(\times\) 320px images by 70px) and insert a black background using the Pillow library [
13]. Camera malfunctions or technical issues with the zooming mechanism can result in a scaling failure. Scaling was also used in the literature for data augmentation [
10].
—
Everyday Object: For the SVIRO dataset, we introduce, in the car’s rear seat, an everyday object (e.g., a washing machine or a handbag) never observed in the training set, thus simulating the effect of an incomplete training dataset. Such objects capture the case of an unseen label during the training, which is a commonly used faulty scenario [
43].
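For illustration, the Pillow-based modifications described above (blurriness, darkness, and scaling onto a black background) can be reproduced along the following lines. This is a sketch using the parameter values reported in the text; the function names are ours, and the noise, occlusion, and object-insertion scenarios are omitted:

```python
from PIL import Image, ImageEnhance, ImageFilter

def add_blur(img, radius=30):
    # Gaussian blur with a 30-pixel radius (Blurriness scenario)
    return img.filter(ImageFilter.GaussianBlur(radius=radius))

def darken(img, factor=0.3):
    # Reduce brightness to 30% of the original (Darkness scenario)
    return ImageEnhance.Brightness(img).enhance(factor)

def scale_down(img, reduction):
    # Shrink the image by `reduction` pixels per side and paste it
    # centered on a black background of the original size (Scaling scenario)
    w, h = img.size
    small = img.resize((w - reduction, h - reduction))
    background = Image.new(img.mode, (w, h), "black")
    offset = reduction // 2
    background.paste(small, (offset, offset))
    return background
```

For a small 320px \(\times\) 320px image, `scale_down(img, 70)` matches the reduction value reported above.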
For regression DNNs (SAP and CPD), we randomly selected 90 images to be copied and modified for each failure scenario. For classifier DNNs, for each failure scenario, we randomly selected 10 images for each class label.
4.3 Pre-existing Failure Scenarios
Since it is usually not possible to achieve perfect accuracy through training, our DNNs, like any machine learning model, are affected by failure scenarios whose effects are visible in the original test set (e.g., borderline images that are misclassified because they are very similar to the ones belonging to another class). In other words, some failure scenarios could already be identified in the original test set and we refer to such scenarios as pre-existing failure scenarios.
Unfortunately, it is not possible to identify pre-existing failure scenarios in a test set because commonalities across failure-inducing images might be only partially perceptible (e.g., shadows on faces) and, consequently, it might be difficult to precisely determine the causes for such failures. Therefore, we cannot perform an accurate assessment of our pipelines on pre-existing failure scenarios. However, for some of the subject DNNs classifying simulator images, it is possible to make assumptions on some of the possible causes of DNN errors; such causes can be expressed in terms of simulator parameters leading to borderline cases that are likely hard to classify by a DNN. We refer to such parameters as failure-inducing parameters. For each failure-inducing parameter, it is possible to identify one or more unsafe values. We then generate images that are likely to cause a DNN failure by configuring the simulator with a value for a failure-inducing parameter close to an unsafe value.
In our previous work [
4], we have identified a set of failure-inducing parameters affecting the HPD, OC, and GD DNNs; they are listed in Table
2. For GD, we identified unsafe values related to the angle of the eye gaze (8 values) and the openness of the eye (1 value) because they all may affect gaze detection results. For OC, we consider the openness of the eye (1 unsafe value), which directly affects classification, and values characterizing an unrealistic image, with a pupil below the eyelid (i.e., a distance between the pupil and the bottom eyelid below -16 pixels) or above the eyelid (i.e., a distance between the top eyelid and the pupil below -16 pixels). For HPD, we consider the Horizontal and Vertical Headpose parameters, which represent the classification classes of the DNN (8 unsafe values). For Gaze Angle, Openness, Headpose-X, and Headpose-Y, the value of a failure-inducing parameter is considered close to an unsafe value if the difference between them is below 25% of the length of the subrange including the average value. For PupilToBottom and TopToPupil, the value of a failure-inducing parameter is considered close to an unsafe value if it is below or equal to it. Table
3 provides the list of failure scenarios for each subject DNN; basically, we have one failure scenario for each unsafe value except for the unsafe values of PupilToBottom and TopToPupil, which capture the same unsafe scenario (i.e., unrealistic image). Table
3 also reports the number of failure-inducing test set images belonging to each pre-existing failure scenario; note that an image can belong to one or more pre-existing failure scenarios and it was not possible to associate every image to a failure scenario. For example, this may happen because the DNN failure is due to the rendering of the image (e.g., a shadow may affect how the shape of the nose is perceived by the DNN), which is not controllable through simulator parameters but is the result of complex interactions among them (e.g., illumination direction, head orientation, light intensity).
For the HPD, OC, and GD DNNs, we could determine unsafe values for each failure-inducing parameter, because we know which simulator parameter values were used to generate each image. For the SVIRO case study, we could not identify failure-inducing parameters because we only have access to the dataset, not the parameters associated with each image. Therefore, the possible reasons for misclassification (e.g., object size) cannot be directly mapped to the information provided to us, which is coarse grained (e.g., presence of an object on the seat).
Since we cannot know what are all the failure scenarios in our case study subjects, we do not compare our pipelines based on pre-existing failure scenarios. However, for completeness, in Section
4.7, we report on the performance of our pipelines with such failure scenarios affecting the OC, GD, and HPD DNNs.
For the experiments with injected failure scenarios (i.e., experiments assessing RQ1 to RQ3), we still include images belonging to pre-existing failure scenarios into the dataset since they are usually observed for any DNN and, therefore, should be considered when generating RCCs. However, clusters that do not include any image belonging to an injected failure scenario are assumed to capture root causes related to pre-existing failure scenarios and, therefore, are ignored for computing purity and coverage (details are provided in the next Sections).
For RQ1-3, since we cannot make assumptions about the distribution of pre-existing and other failure scenarios, we include the same number of images for pre-existing failure scenarios and injected failure scenarios (see Table
1). For the experiments assessing pipelines with pre-existing failure scenarios (i.e., RQ4), instead, to be realistic, we consider the whole set of failure-inducing test images belonging to a pre-existing failure scenario.
4.4 RQ1: Which Pipeline Generates Root Cause Clusters with the Highest Purity?
4.4.1 Design and Measurements.
A pure cluster includes only images presenting the same root cause (i.e., common cause leading to a DNN failure); for example, a hand covering a person’s mouth. Pure clusters simplify root cause analysis because they should make it easier for an engineer to determine the commonality across images and therefore the cause of failures.
Since the likely root cause of the failure in our
injected failure scenarios is known, we focus on these scenarios to respond to RQ1. For each RCC, we compute the proportion of images belonging to each injected failure scenario. Therefore, we measure the purity
P of a cluster
C (hereafter,
\(P_C\) ) as the highest proportion of images belonging to one injected failure scenario
\(f \in F\) assigned to cluster
C, where
F is the set of all failure scenarios.
\(P_C\) is computed as follows:
\(P_C = \max _{f \in F} \frac{|C_f|}{|C|}\)
The proportion of a failure scenario f in a cluster C is computed as the number of images belonging to f assigned to cluster C ( \(|C_f|\) ), divided by the size of cluster C ( \(|C|\) ).
Clusters that do not include any image belonging to an injected failure scenario are assumed to capture root causes due to pre-existing failure scenarios and, consequently, are excluded from our analysis.
We study the purity distribution across RCCs generated for the different case study subjects. Since, ideally, we would like to obtain pure clusters, the best pipeline is the one that maximizes the average purity across the generated RCCs.
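As a concrete illustration, the purity measure can be computed from cluster assignments as follows. This is a minimal sketch; the data structures (lists of per-image scenario labels) are our own choice:

```python
from collections import Counter

def cluster_purity(cluster_labels):
    """Purity of one cluster: the highest proportion of images
    belonging to a single injected failure scenario.
    `cluster_labels` lists the failure scenario of each image in the cluster."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)

def average_purity(clusters):
    # Clusters without any injected-scenario image are excluded upstream.
    purities = [cluster_purity(c) for c in clusters if c]
    return sum(purities) / len(purities)
```

For example, a 10-image cluster in which 9 images belong to the Mask scenario has purity 0.9.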
4.4.2 Methodology.
We use the
Conditional Inference Tree (
CTree) algorithm [
41] to generate a decision tree with a maximum depth set to 4 (we have four components in a pipeline) and a minimum split set to 10 (i.e., the weight of a node to be considered for splitting). The dataset used to build the tree consists of the components of each pipeline as attributes, and the purity of the generated clusters as the predicted outcome. The dataset size is equal to 99, the number of pipelines. We rely on decision trees because they enable us to determine how the different pipeline components contribute to the results (i.e., to cluster purity); the manual inspection of the configurations leading to the highest purity would not have enabled us to determine which components contribute most to it.
Each node of the tree represents a feature of the pipeline. Leaves (terminal nodes) depict box plots representing distributions of the average purity across RCCs generated by the pipelines belonging to each leaf. Each point in the box plot is the average purity of one pipeline (i.e., the average of the purity of all the RCCs generated across all case study subjects). To split a node, the CTree algorithm first identifies the feature with the strongest association with the response variable (purity, in our case). Precisely, it relies on a permutation test of independence (null hypothesis) between each feature and the response [
82]; the feature with the lowest significant p-value is then selected (
\(\mathit {alpha} = 0.05\) , in our case). Once a feature has been selected, a binary split is then performed by identifying the value that maximizes the test statistics across the two subsets. Since we are in the presence of multiple hypotheses (assume
m, for each node), to prevent a Type I error, for each feature
j, CTree computes its Bonferroni-adjusted [
98]
\(p\text{-value}_j\) as \(p\text{-value}_j = \min (1, m \cdot p_j)\), where \(p_j\) is the unadjusted p-value obtained for feature j.
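The standard Bonferroni correction for m simultaneous hypotheses can be sketched in a few lines (our illustration, not the CTree implementation itself):

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjust raw p-values for m simultaneous hypotheses:
    each p-value is multiplied by m and capped at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]
```

A feature is then selected when its adjusted p-value falls below the chosen alpha (0.05 in our case).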
4.4.3 Results.
Figure
10 depicts a regression tree illustrating how the different components of a pipeline (feature extraction methods, fine-tuning, dimensionality reduction techniques and clustering algorithms) determine the purity of the clusters generated by a pipeline. We notice that the pipelines with fine-tuned models (Nodes 3 and 4) generate lower-purity clusters than those without any fine-tuning (Nodes 6 and 7), which can be explained by the fine-tuning dataset not including the injected failure scenarios. For our approach, the objective of fine-tuning is to learn features that are specific to the context of use; recall that our transfer learning models are based on ImageNet and we rely on them for feature extraction. However, we perform fine-tuning using the test set, which is smaller than the training set and thus leads to a quicker process. Further, to simulate a realistic usage, we did not include the injected failures into the dataset used for fine-tuning; indeed, since our injected root causes aim at capturing scenarios not foreseen at training time, it would be unrealistic to consider such scenarios for fine-tuning. Finally, fine-tuning with images including injected failures (e.g., noise) may affect the quality of fine-tuning. Because of the choices above, fine-tuning leads to features that do not capture the injected faults but the characteristics of images without faults. As a result, in our experiments, images are clustered based only on their pre-existing fault (e.g., borderline class) instead of the injected faults. ImageNet models, instead, may capture features that are useful to cluster injected faults (e.g., the presence of everyday objects in SVIRO), but such features are then forgotten as an effect of catastrophic forgetting during fine-tuning [
9], thus leading to clustering results that are worse for fine-tuned transfer-learning models.
The pipelines using non-fine-tuned transfer learning models as a feature extraction method (Node 7) generate purer clusters (min = \(50\%\) , median = \(80\%\) , max = \(96\%\) ) than the pipelines using an autoencoder model, HUDD, or LRP (Node 6) (min = \(50\%\) , median = \(65\%\) , max = \(70\%\) ). The purpose of the autoencoder model is to provide a condensed representation of the image to be used for reconstruction. This is done by ignoring the features that the model considers insignificant and only keeping the features that help the encoder reconstruct the image accurately. Therefore, a possible explanation for our result is that since the autoencoder is trained on the training set, the injected faults are ignored. Given that clustering is based on the condensed representation, the generated clusters are less pure than the ones generated by the pipelines with transfer learning models. Note that without empirical assessment, it is not possible to know in advance how autoencoders support clustering; indeed, injected faults may mask certain autoencoder features (e.g., presence of non-black pixels around the borders for scaled images) that turn out to be useful for clustering.
As for HUDD and LRP, it seems that their main limitation is that heatmaps cannot capture the presence of root causes affecting all the pixels in an image (i.e., the result of noise, blurriness, darkness, scaling). Heatmaps mainly capture which pixels of the image drive the DNN output, thus leading clustering to group images where the same pixels affected the output. For instance, the DNN’s response to a blurred image with a shadow on the mouth could be different from that of another blurred image with a shadow on the eyes, thus leading to different clusters for these images although they represent the same injected failure scenario (blurriness).
Finally, we notice that the pipelines using HDBSCAN and DBSCAN (Node 3) as a clustering algorithm yield purer clusters (min =
\(25\%\) , median =
\(40\%\) , max =
\(80\%\) ) than those using K-means (Node 4, min =
\(22\%\) , median =
\(27\%\) , max =
\(29\%\) ). This is because K-means faces difficulty dealing with non-convex clusters. A cluster is convex if, for every pair of points belonging to it, it also includes every point on the straight line segment between them [
49], which gives the cluster a hyperspherical form. Nevertheless, in many practical cases, the data leads to clusters with arbitrary, non-convex shapes. Such clusters, however, cannot be appropriately detected by a centroid-based algorithm (e.g., K-means), as they are not designed for arbitrary-shaped clusters.
DBSCAN and HDBSCAN are density-based clustering algorithms. They consider high-density regions as clusters (see Section
2). The root cause clusters generated by DBSCAN and HDBSCAN are arbitrary-shaped and more homogeneous (i.e., clusters with higher within-cluster similarity) with very similar images. In contrast, a convex cluster generated by K-means tends to be less dense and can group rather dissimilar images. As a result, a convex cluster is less pure than a non-convex one.
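The limitation of centroid-based clustering on non-convex clusters is easy to reproduce. The sketch below is our illustration (not part of the evaluated pipelines), using scikit-learn's two-moons generator: DBSCAN recovers the two arbitrarily shaped clusters, while K-means cuts them with a straight boundary:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: dense but non-convex clusters.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Agreement with the true moon assignment (1.0 = perfect recovery).
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
```

The `eps` and `min_samples` values are assumptions chosen for this synthetic dataset, not the values used in our experiments.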
We report the significance of these results in Table
4, including the values of the Vargha and Delaney’s
\(\hat{A}_{12}\) effect size and the
p-values resulting from performing a Mann-Whitney U-test to compare the average purity of the pipelines using transfer learning models (Node 7 in the decision tree) and the pipelines represented by the other nodes. Typically, an
\(\hat{A}_{12}\) effect size above 0.56 is considered practically significant with higher thresholds for medium (0.64) and large (0.71) effects [
47], thus suggesting the effect sizes between the pipelines using transfer learning models and other pipelines are large. Further,
p-values suggest these differences are statistically significant.
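The two statistics can be computed as follows; this is a self-contained sketch in which the sample purity values are made up for illustration:

```python
from scipy.stats import mannwhitneyu

def vargha_delaney_a12(x, y):
    """Vargha and Delaney's A-hat-12: the probability that a random
    value from x exceeds one from y, counting ties as 0.5."""
    wins = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    return wins / (len(x) * len(y))

purity_tl = [0.96, 0.90, 0.85, 0.80]     # e.g., transfer-learning pipelines
purity_other = [0.65, 0.60, 0.55, 0.50]  # e.g., other pipelines
a12 = vargha_delaney_a12(purity_tl, purity_other)
_, p_value = mannwhitneyu(purity_tl, purity_other, alternative="two-sided")
```

With fully separated samples, as here, the effect size is 1.0 (a large effect by the thresholds above).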
Finally, in Table
5, we report the pipelines that generated clusters with an average purity above
\(90\%\) across all case study subjects, along with the purity obtained for each subject; the complete results obtained for all pipelines appear in Appendix
A, Table
14. An average purity of
\(100\%\) means that all the clusters generated by the pipeline are pure. Interestingly, all the pipelines in Table
5 belong to Node 7 in Figure
10, thus confirming our main finding. Five of these seven best pipelines rely on UMAP combined with a non-fine-tuned transfer learning model, which is therefore our suggested configuration for root cause analysis. The best result is obtained with ResNet-50 combined with UMAP and DBSCAN.
4.5 RQ2: Which Pipelines Generate Root Cause Clusters Covering More Failure Scenarios?
4.5.1 Design and Measurements.
This research question investigates the extent to which our pipelines identify all failure scenarios. We compare pipelines in terms of the percentage of injected failure scenarios being covered by at least one RCC. A failure scenario is covered by an RCC if it enables the engineer to determine the root cause of the failure. Precisely, when images belonging to a failure scenario f represent a sufficiently large share of images in a cluster C, it is easier for an engineer to notice that f is a likely root cause. Therefore, we assume that an injected failure scenario f is covered by a cluster C if at least \(90\%\) of the images in C belong to f. Since this threshold is relatively high, our results can be considered conservative.
Given that our injected failure scenarios are represented by the same number of images in the failure-inducing test set, every failure scenario has the same likelihood of being observed. Therefore, we expect to obtain RCCs corresponding to each failure scenario.
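Under this definition, coverage can be computed as follows (a minimal sketch with hypothetical data structures: each cluster is a list of per-image scenario labels):

```python
from collections import Counter

def covered_scenarios(clusters, threshold=0.9):
    """Return the failure scenarios covered by at least one cluster,
    i.e., scenarios whose images make up >= threshold of some cluster."""
    covered = set()
    for cluster in clusters:
        scenario, count = Counter(cluster).most_common(1)[0]
        if count / len(cluster) >= threshold:
            covered.add(scenario)
    return covered

def coverage(clusters, all_scenarios, threshold=0.9):
    # Fraction of injected failure scenarios covered by at least one RCC.
    return len(covered_scenarios(clusters, threshold)) / len(all_scenarios)
```

For instance, a cluster where 9 of 10 images belong to the Mask scenario covers Mask at the 90% threshold.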
4.5.2 Methodology.
We follow the same methodology as for RQ1 (see Section
4.4.2) but we construct a decision tree considering, for each pipeline, the average coverage achieved across case study subjects instead of the average purity.
4.5.3 Results.
Figure
11 shows a decision tree illustrating how the different components of a pipeline determine the coverage of failure scenarios.
Each leaf node depicts a box plot with the distribution of the percentages of failure scenarios covered by the set of pipelines that include the components listed in the decision nodes.
For instance, Node 9 provides the distribution of the percentage of failure scenarios covered by the RCCs generated by pipelines using UMAP as a dimensionality reduction technique and non-fine-tuned transfer learning models as feature extraction methods (12 pipelines). Ideally, the root-cause clusters generated by a pipeline should cover \(100\%\) of the failure scenarios.
The decision tree in Figure
11 confirms RQ1 results. The pipelines without fine-tuning (Nodes 6, 8, and 9) outperform the pipelines with fine-tuning (Nodes 3 and 4). The pipelines with transfer learning models (Nodes 8 and 9) generate clusters that cover more failure scenarios than those generated by the pipelines using HUDD, LRP, and AE (Node 6). Also, the pipelines using the DBSCAN and HDBSCAN clustering algorithms (Node 3) yield better results than the ones using K-means (Node 4).
Further, the decision tree in Figure
11 gives us more insights into which dimensionality reduction method is more effective. We notice that the root-cause clusters generated by the pipelines using UMAP (Node 9) lead to a better distribution (min =
\(45\%\) , median =
\(85\%\) , max =
\(100\%\) ) than the pipelines using PCA or not using any dimensionality reduction (Node 8, min =
\(25\%\) , median =
\(55\%\) , max =
\(90\%\) ). This is because UMAP yields a better separation of the clusters (i.e., less overlap between clusters) compared to PCA. When using UMAP, all the data points converge toward their closest neighbor (the most similar data point). Therefore, neighboring data points in higher dimensions end up in the same neighborhood in lower dimensions, resulting in compact and well-separated clusters that are easier for the clustering algorithms to distinguish.
We report the significance of these results in Table
6, including the values of the Vargha and Delaney’s
\(\hat{A}_{12}\) effect size and the
p-values resulting from performing a Mann-Whitney U-test to compare the percentages of covered failure scenarios resulting from the pipelines using UMAP (Node 9 in the decision tree in Figure
11), and the other pipelines. Table
6 shows that the
p-values, when comparing the pipelines using UMAP to the other pipelines, are always below 0.05. This implies that in all the cases, differences are statistically significant with large effect sizes (above 0.77).
In Table
7, we report the pipelines that generated clusters covering at least
\(90\%\) of the failure scenarios across all case study subjects, along with the coverage obtained for each case study subject (complete results for all the pipelines are reported in Appendix
B, Table
15). If the coverage is equal to
\(100\%\) , all the failure scenarios are covered by the RCCs. Unsurprisingly, the pipelines in Table
7 belong to Node 7 in Figure
11: they rely on a non-fine-tuned transfer learning model for feature extraction, and UMAP for dimensionality reduction. Further, they all use DBSCAN for clustering. These pipelines consistently yielded the best results for all individual case studies (confirming the results obtained in RQ1).
Such findings are further supported by the results in Tables
14 and
15, where we notice that the combination of UMAP with DBSCAN always achieves higher purity and coverage (in bold) than its alternatives, regardless of the used feature extraction method.
4.6 RQ3: How is the Quality of Root Cause Clusters Generated Affected by Infrequent Failure Scenarios?
4.6.1 Design and Measurements.
We study the effect of infrequent failure scenarios on the quality of the RCCs generated by the pipelines. Indeed, infrequent scenarios may not be properly captured by clustering algorithms. With K-means, the number of clusters depends on within-cluster SSD (see Section
3.3.1) but the exclusion of small clusters may lead to unnoticeable changes in the computed SSD. With DBSCAN, small clusters may be treated as noise. With HDBSCAN, small clusters, which have a limited persistence (
\(\epsilon\) cannot be higher than the number of datapoints, see Section
3.3.3), may not be identified.
We consider a failure scenario infrequent when it is observed in a low proportion of the images in the failure-inducing set. To be practically useful, a good pipeline should be able to generate root-cause clusters even for infrequent failure scenarios; indeed, in safety-critical contexts, infrequent failure scenarios may lead to hazards and thus should be detected when testing the system. For instance, if only five out of a hundred failure-inducing images belong to a failure scenario and we have three failure scenarios in total, a reliable pipeline should still generate an RCC containing only the images of the infrequent failure scenario.
4.6.2 Methodology.
We generate 10 different failure-inducing sets for each case study subject (a total of 60 failure-inducing sets). To construct a failure-inducing set, for each root cause that might affect the case study (see Table
1, Page 17), we generate a number
n of images affected by the injected root cause. We randomly select a number
n that is lower than the number of images selected for the same root cause in RQ1 (see Table
1). Further, for classifier DNNs, we select a value higher than the number of classes of the corresponding case study (we enforce one root cause of failures for one image per class, at least); for regression DNNs, we select a value above 2. Since
n is randomly selected (uniform distribution), we obtain failure-inducing sets whose failure scenarios are represented by varying numbers of images. Table
16, Appendix
C, provides the details for each case study; for instance, the number of images representing a failure scenario for each failure-inducing set of the HPD case study (9 classes) is randomly selected between 9 and 90.
In addition, we also include a randomly selected number of images belonging to pre-existing failure scenarios, to mimic what happens in practice (see RQ1). The number of images belonging to pre-existing failure scenarios varies between two and the total number of injected failure scenario images.
Since we aim at studying the effect of infrequent failure scenarios on the quality of the generated RCCs, we categorize our 290 failure scenarios into infrequent and frequent. Infrequent failure scenarios are the ones that include a proportion of injected images that is lower than the median proportion across all the generated failure-inducing sets (equal to \(18\%\) in our study). For example, noise is frequent in the dataset GD_1 ( \(64\gt 18\) ) but infrequent in the dataset OC_2 ( \(4\lt 18\) ).
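This categorization can be sketched as follows (the proportions below are illustrative; only the GD_1 and OC_2 noise values are taken from the text):

```python
from statistics import median

def categorize_scenarios(proportions):
    """Split failure scenarios into 'frequent' and 'infrequent' based on
    the median proportion of injected images across all generated
    failure-inducing sets. `proportions` maps (dataset, scenario)
    pairs to percentages of injected images."""
    threshold = median(proportions.values())
    return {
        key: "frequent" if value >= threshold else "infrequent"
        for key, value in proportions.items()
    }

proportions = {
    ("GD_1", "noise"): 64,   # frequent (64 > 18)
    ("OC_2", "noise"): 4,    # infrequent (4 < 18)
    ("HPD_1", "mask"): 18,   # hypothetical value at the median
}
categories = categorize_scenarios(proportions)
```

Whether a scenario exactly at the median counts as frequent is our assumption; the paper only defines infrequent scenarios as those strictly below the median.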
We consider only the best pipelines resulting from the experiments in RQ1 and RQ2 (i.e., having purity or coverage above 90% as shown in Tables
5 and
7); they are pipeline 26 (
VGG16/DBSCAN/UMAP/NoFT), 44 (
ResNet50/DBSCAN/UMAP/NoFT), 62 (
InceptionV3/DBSCAN/UMAP/NoFT), 19 (
VGG16/K-means/None/NoFT), 25 (
VGG16/K-means/UMAP/NoFT), 39 (
ResNet50/HDBSCAN/None/NoFT), 43 (
ResNet50/K-means/UMAP/NoFT), and 80 (
Xception/DBSCAN/UMAP/NoFT). The first three pipelines (i.e., 26, 44, 62) were the best for both RQ1 and RQ2, the next four (i.e., 19, 25, 39, 43) were selected based on RQ1 results, while the last one (i.e., 80) was selected based on the results of RQ2. We compute the purity and coverage of the RCCs generated by each of these pipelines, following the same procedures adopted for RQ1 and RQ2. We then compare the distribution of purity and coverage for infrequent and frequent failure scenarios. The most reliable pipelines are the ones being affected the least, in terms of purity and coverage, by infrequent failure scenarios.
4.6.3 Results.
In Figure
12, for each selected pipeline, we report the average purity across all the RCCs
3 with the injected failure scenarios having a certain frequency. The
x-axis reports the proportion of images for failure scenarios whereas the
y-axis reports the average purity of the RCCs associated with each failure scenario.
Figure
12 shows that when the frequency of the failure scenarios is below the median (infrequent scenario), the cluster purity obtained by pipelines tends to be significantly lower and decreases rapidly as the frequency decreases. This is expected because when a failure scenario is infrequent, the clustering algorithm tends to either treat its images as noise or distribute them over the other clusters. For density-based clustering algorithms, images belonging to infrequent scenarios may not become core points when the identification of a core point requires more data points in their neighborhood. In such cases, images belonging to infrequent scenarios will be labeled either as noise points or as border points (belonging to other clusters). The same is true for K-means, where these points are usually spread across other clusters because they cannot form a cluster on their own.
To strengthen our findings, in Table
8, we report the results when comparing the purity of the selected pipelines for frequent and infrequent failure scenarios; further, we report the Vargha and Delaney’s
\(\hat{A}_{12}\) effect size and the
p-values resulting from performing a Mann-Whitney U-test. We notice that for all pipelines, the difference between frequent and infrequent scenarios is significant (p-value < 0.05). However, the effect sizes for the pipelines including DBSCAN (i.e., Pipelines 26, 44, 62, and 80) are small, while they are medium for the other pipelines, which indicates that the pipelines including DBSCAN are much more reliable with infrequent scenarios than the others (i.e., the difference between frequent and infrequent scenarios is less pronounced). Actually, the pipelines using DBSCAN fare better than the rest also in the general case. Indeed, almost all the injected failure scenarios with frequency above 18% have 100% purity (see Figure
12); further, for infrequent failure scenarios they include fewer data points below 100% purity than the other pipelines. This is because DBSCAN tends to find clusters with different sizes if these clusters are dense enough; K-means, instead, derives clusters that are of similar size.
Further, we notice that the purity of the clusters generated by Pipeline 26 (
VGG16/DBSCAN/UMAP/NoFT), for infrequent failure scenarios, is higher (average is 94%) than the purity of the clusters generated by the other pipelines; differences are significant (see Table
9), thus suggesting Pipeline 26 might be the best choice.
Concerning
coverage, Figure
13 shows, for each pipeline, histograms with the average coverage obtained for failure scenarios having proportions of failure-inducing images within specific ranges. In general, we observe that coverage is higher for frequent scenarios. This is due to the correlation between pure clusters and coverage; the less pure the generated clusters, the fewer failure scenarios they cover. When the failure scenarios are infrequent, their images are distributed over the other clusters, reducing their purity and, thus, reducing the probability of these scenarios being covered. To demonstrate the significance of the difference between coverage results obtained with frequent and infrequent scenarios, we apply Fisher’s exact test
4 to compare the coverage of frequent and infrequent scenarios for the clusters generated by the selected pipelines. We report the
p-values resulting from the Fisher’s Exact test in Table
10 and observe that differences are statistically significant thus indicating that pipelines perform better with frequent failure scenarios.
Further, Figure 13 shows that Pipeline 62 (InceptionV3/DBSCAN/UMAP/NoFT) performs best with the least frequent scenarios (i.e., range 0-5%), although no pipeline fares well in that range. Pipeline 26 (VGG16/DBSCAN/UMAP/NoFT) performs best with infrequent scenarios in the range 5% to 20%; indeed, it is the only pipeline providing an average coverage above 90% for that range. To further demonstrate the significance of the difference in performance between Pipeline 26 and the other pipelines, we apply Fisher's exact test to the coverage obtained for infrequent scenarios. We report the p-values resulting from this test in Table 11. We notice that all the p-values are below 0.05 except when Pipeline 26 is compared to Pipeline 62; indeed, the results of these two pipelines are similar (as visible in Figure 13), even though Pipeline 26 performs slightly better on average.
In conclusion, infrequent failure scenarios affect both purity and coverage; pipelines tend to perform worse when failure scenarios are infrequent (i.e., their frequency is below the median). However, some pipelines fare better than others. Our results suggest that the pipeline relying on a non-fine-tuned VGG16 model, with UMAP and DBSCAN (Pipeline 26), is the best choice because it yields significantly higher purity and coverage than the other pipelines. Pipeline 26 is also less negatively affected by infrequent failure scenarios, since its coverage is above 90% when the frequency is above 5%, which is not the case for all the other pipelines.
4.7 RQ4: How do Pipelines Perform with Failure Scenarios That are Not Synthetically Injected?
4.7.1 Design and Measurements.
Our objective is to determine if the best pipelines identified in RQ1, RQ2, and RQ3 also perform best with pre-existing failure scenarios. As stated in Section 4.3, to address this research question, we considered only the subject DNNs for which it is possible to determine the pre-existing failure scenarios each failure-inducing image may belong to; the selected DNNs are OC, GD, and HPD. The list of pre-existing failure scenarios is shown in Table 3 (page 20).
A pipeline should, ideally, identify all the pre-existing failure scenarios (i.e., generate at least one cluster for each pre-existing failure scenario, thus maximizing coverage). Also, the generated clusters should be pure, that is, include only images belonging to the same pre-existing failure scenario. Consequently, as for RQ1 to RQ3, we compare pipelines based on the purity and coverage of the generated clusters.
4.7.2 Methodology.
For each subject DNN, we applied all our pipelines to the set of failure-inducing images that are in the original test set and belong to a pre-existing failure scenario.
As for RQ1 to RQ3, we compute the coverage and purity of each cluster as follows. For each image, we know the pre-existing failure scenarios it belongs to. Therefore, for each generated cluster, we can determine the number of images belonging to each pre-existing failure scenario. Each cluster is considered to cover the pre-existing failure scenario with the largest number of clustered images; indeed, since these images are the most frequent, their commonalities are likely to be noticed by the engineer inspecting the results. For each cluster, purity is computed as the proportion of clustered images belonging to the scenario covered by the cluster.
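The purity and coverage computation described above can be sketched as follows; the data structures are hypothetical stand-ins for our datasets:

```python
from collections import Counter

def purity_and_coverage(clusters, scenario_of):
    """Per-cluster purity and overall scenario coverage, following the
    rule above: each cluster covers the scenario of its most frequent
    images, and purity is the proportion of the cluster's images that
    belong to that scenario. `clusters` maps a cluster id to a list of
    image ids; `scenario_of` maps an image id to its failure scenario."""
    purities, covered = {}, set()
    for cid, images in clusters.items():
        counts = Counter(scenario_of[img] for img in images)
        scenario, majority = counts.most_common(1)[0]
        purities[cid] = majority / len(images)
        covered.add(scenario)
    coverage = len(covered) / len(set(scenario_of.values()))
    return purities, coverage
```

For example, a cluster with two images from scenario s1 and one from s2 has purity 2/3 and covers s1.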
We consider the pipelines leading to the best results for purity and coverage for RQ1, RQ2, and RQ3, and compare them with the pipelines leading to the best purity and coverage results when applied to the failure-inducing images described above, across the three selected subject DNNs.
4.7.3 Results.
In Table 12, we report the pipelines leading to the best purity and coverage when applied to the datasets with injected failure scenarios (RQ1, RQ2, RQ3) and pre-existing failure scenarios (RQ4). The values in parentheses capture the ranking of a pipeline for each dataset. For both purity and coverage, for each RQ, we rank our pipelines after sorting them in decreasing order based on the average of the metric value computed for the OC, HPD, and GD DNNs; pipelines having the same average are assigned the same rank.
The results in Table 12 show that the pipeline with the highest coverage for pre-existing failure scenarios is Pipeline 26 (see column Coverage-RQ4), which confirms our findings for RQ3 (Section 4.6.3), where Pipeline 26 leads to the highest coverage results when failure scenarios do not occur with the same frequency; the results observed for RQ4 can thus be explained by the fact that, in the original test set, failure scenarios do not have the same frequency. Further, Pipeline 26 achieves high purity with pre-existing failure scenarios; indeed, it is ranked 4th in column Purity-RQ4. Interestingly, a white-box pipeline (i.e., Pipeline 8, combining HUDD, DBSCAN, and UMAP) leads to the highest purity for RQ4's dataset; however, it does not lead to the best coverage (only 91%, ranked 7th). Since, in safety-critical systems, one would prioritize the discovery of all failure scenarios, Pipeline 26 should be a better option than Pipeline 8; indeed, Pipeline 26 achieves top coverage while having a very high purity (87% vs. 92% for Pipeline 8). Further, for pre-existing failure scenarios, Pipeline 26 is the only pipeline ranked fourth or better for purity that is also among the best 10 pipelines for coverage.
Pipeline 26 and Pipeline 80 are the only two pipelines among the best ten for both purity and coverage with pre-existing failure scenarios. Also, they are among the ten best pipelines for all the other datasets (i.e., RQ1, RQ2, and RQ3). More generally, Pipelines 26, 44, 62, and 80, which are all the pipelines relying on transfer learning and DBSCAN without fine-tuning, lead to top-ranked results. However, only Pipeline 26 achieves the highest rank for more than one dataset, thus confirming it is a preferable choice, as we suggested in Section 4.6.3.
Interestingly, four of the ten best-ranked pipelines for coverage with pre-existing failure scenarios include fine-tuning; however, they perform poorly in terms of purity. Based on our discussion in Section 4.4.3, it is expected that fine-tuning performs better with pre-existing failure scenarios; indeed, the failure-inducing images do not differ from the ones considered for fine-tuning (i.e., fine-tuning captures features that are present in the failure-inducing test set). However, the reason fine-tuning did not help achieve clusters with high purity is its reliance on a dataset whose scenarios occur at very different frequencies. Indeed, fine-tuning may overfit the features belonging to the most frequent scenarios; consequently, the fine-tuned autoencoder may not extract relevant features for infrequent scenarios. To conclude, fine-tuning does not seem advisable because (1) failure scenarios, as shown in our experiment, are unlikely to include the same proportion of images, (2) it is not realistic to expect engineers to construct datasets with the same proportion of images for all failure scenarios, and (3) failure scenarios may largely differ from the images observed in the training set, which led to poor performance for fine-tuned pipelines in Section 4.4.3.
4.8 Discussion
The results of RQ1 and RQ2 show that there is a family of pipelines leading to higher purity (i.e., they simplify the identification of root causes) and coverage (i.e., they enable the identification of all root causes). Such pipelines rely on transfer learning, UMAP for dimensionality reduction, and DBSCAN for clustering, and do not use fine-tuning. Among such pipelines, considering that it is reasonable to expect unsafe scenarios to be infrequent, based on the results of RQ3, we suggest using the pipeline relying on VGG16 as the transfer learning model (Pipeline 26). Pipeline 26 also leads to the best results when applied to pre-existing failure scenarios (RQ4), probably because some pre-existing failure scenarios are infrequent.
In our study, we focused on effectiveness, not cost; indeed, our main purpose is to identify the pipeline that generates clusters that do not confuse the end-user (i.e., they are pure) and that is likely to help identify all the root causes of failures (i.e., the clusters have high coverage). In contrast, cost is related to the number of clusters being inspected. To discuss such cost, we report in Figure 14 a boxplot with the size of the clusters generated for RQ1, RQ2, RQ3, and RQ4 by Pipeline 26. As shown in Figure 14, across all our experiments, the number of images per cluster ranges from 2 to 76, with 75% of our clusters including at most 13 images (third quartile in Figure 14). Based on these numbers, we can conclude that the effort required to inspect a cluster is limited (i.e., at most 13 images to be visualized for 75% of our clusters); further, we have previously demonstrated through a user study that the inspection of five images per cluster is sufficient for a correct identification of the root cause of a DNN failure [4]. Finally, our root cause analysis toolset [25] includes the generation of animated GIFs, one for each cluster, thus enabling the quick visualization of all the images in a cluster. In conclusion, whether with animated GIFs or when cluster images are inspected in sequence, we conjecture that the number of images per cluster does not strongly impact cost, since clusters are typically small and small subsets of larger clusters are sufficient for a correct identification of failure root causes.
What is important, instead, is the purity of clusters, as low purity makes it difficult for the end-user to determine commonalities among images.
Nevertheless, to further discuss cost, we measure the number of clusters to be inspected for each pipeline, considering the dataset used for RQ1 and RQ2. We count only clusters capturing the injected failure scenarios. A lower number of clusters should indicate lower cost and, since a number of clusters higher than the number of failure scenarios to be discovered implies the presence of redundant clusters, we compute the degree of redundancy as the ratio of the number of generated clusters to the number of covered failure scenarios, that is, \(\mathit{redundancy} = \frac{|\mathit{clusters}|}{|\mathit{covered\ scenarios}|}\).
Finally, to discuss how well each pipeline improves current practice in industry, we estimate the degree of savings with respect to such practice, which entails the visual inspection of all images. To do so, we assume that inspecting a single cluster, especially when using animated GIFs, is as inexpensive as visualizing one single image. Indeed, though clusters involve several images, through animation they actually make it easier to quickly identify commonalities than guessing root causes from a single image. Figure 15 shows four example clusters where all the images present a commonality (i.e., the root cause of the DNN failure) that is easy to determine when visualizing all the images in sequence. Therefore, we estimate savings as the relative reduction in the number of inspections, that is, \(\mathit{savings} = 1 - \frac{|\mathit{clusters}|}{|\mathit{images}|}\).
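A minimal sketch of the two cost metrics, assuming redundancy is the ratio of generated clusters to covered failure scenarios and savings the relative reduction in inspections compared to examining every image (both formulas are our reading of the metrics described above):

```python
def redundancy(n_clusters, n_covered_scenarios):
    # clusters per identified failure scenario; 1.0 means no redundancy
    return n_clusters / n_covered_scenarios

def savings(n_clusters, n_images):
    # inspecting one cluster is assumed as cheap as inspecting one image,
    # so cost drops from n_images inspections to n_clusters inspections
    return 1 - n_clusters / n_images

# Hypothetical counts: 59 clusters covering 5 scenarios, 1,000 failing images
r = redundancy(59, 5)   # 11.8 clusters per scenario
s = savings(59, 1000)   # 0.941, i.e., ~94% fewer inspections
```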
Table 13 shows our results; it reports the number of RCCs generated for each case study DNN and across all of them. Further, it reports the percentage and number of failure scenarios covered by each pipeline (used to compute redundancy and providing information about the effectiveness of a pipeline), along with the redundancy ratio and savings. We report only the results for the best pipelines identified when addressing RQ1 and RQ2, because there is no reason to select pipelines that do not achieve high purity and coverage.
The number of clusters generated by the selected pipelines ranges between 18 and 284. The pipelines leading to the lowest number of clusters are the ones including K-means: ResNet50/K-means/UMAP/NoFT (18), VGG16/K-means/None/NoFT (19), and VGG16/K-means/UMAP/NoFT (24). Pipelines with DBSCAN and HDBSCAN lead to a much higher number of clusters. To discuss the practical impact of such a high number of clusters, we focus on the redundancy ratio, which ranges between 1.12 and 11.8; the redundancy ratio indicates that the pipeline with the highest number of clusters (i.e., ResNet50/HDBSCAN/None/NoFT), on average, presents 11 redundant clusters for each identified failure scenario. Given that, in the presence of pure clusters, understanding the scenario captured by a cluster is quick with animated GIFs, we consider that inspecting 11 redundant clusters per fault has a limited impact on cost. Finally, if we focus on savings, we can observe that, with respect to current practice, all the pipelines except ResNet50/HDBSCAN/None/NoFT lead to savings above 90%, thus showing that their adoption is highly beneficial.
Although the pipelines including K-means lead to the lowest cost, their coverage is particularly low for infrequent scenarios (see Table 10, with coverage below 35% for the range [0–5], and below 60% for the range [5–10]), which is bound to be a common situation in practice. Since pipelines leading to a small number of clusters can be highly ineffective in realistic safety-critical contexts (i.e., when some failure scenarios are infrequent), and assuming that redundant clusters are easy to manage, we conclude that the best choice is the pipeline that maximizes purity and coverage, as discussed above (i.e., Pipeline 26, VGG16/DBSCAN/UMAP/NoFT). A possible tradeoff is Pipeline 80 (Xception/DBSCAN/UMAP/NoFT), which is among the best performing for RQ3 (e.g., coverage above 40% for the range [0–5], and above 70% for the range [5–10]) and leads to only 3.6 redundant clusters, on average.
4.9 Threats to Validity
We discuss internal, conclusion, construct, and external validity according to conventional practices [
97].
4.9.1 Internal Validity.
Since 72 of our 99 pipelines use a pre-trained transfer learning model to extract features, a possible internal threat is that this model could negatively affect our results if inadequate; indeed, clustering relies on the similarity computed on the extracted features. However, since every pre-trained model is integrated into at least one of the best pipelines identified in our experiments (see Table 12), we conclude that they are suitable. Also, to mitigate the risk that our purity metric might not reflect what is perceived by the end-user as a pure cluster, we relied on the same purity metric adopted in our previous work [4], where an empirical study with human subjects demonstrated that end-users can understand the root causes of failures by looking at a small random subset of images in each cluster. Further, we visually inspected a random subset of our clusters to check their consistency. Such consistency suggests that the features extracted by the models contain enough information to cluster the images based on their similarity.
Another potential threat might be that the dataset (with the injected faults) was created with the proposed approach in mind; therefore, there might be a risk of bias. To mitigate this risk, we note that all the methods used in our pipelines (feature extraction methods, clustering algorithms, dimensionality reduction techniques) are independent of the data; these methods do not require any a priori knowledge of the data. We also publish our data to further mitigate this risk. All the experiments can be reproduced with any injected faulty scenario.
4.9.2 External Validity.
To alleviate the threats related to the choice of the case study DNNs, we use six well-studied datasets with diverse complexity. Four out of six subject DNNs implement tasks motivated by IEE business needs. These DNNs address problems that are quite common in the automotive industry. The other two DNNs are also related to the automotive industry and were used in many Kaggle challenges [
64,
90].
Although our pipelines were only tested on case study DNNs related to the automotive industry, we believe they will perform well with other datasets. This is because the models used for feature extraction were pre-trained on ImageNet, which means that they can capture features related to 1,000 classes, including humans, animals, and objects. As for the autoencoder (AE), it can learn the characteristics of any dataset during training and provide high-quality clusters. Finally, for HUDD and LRP, the extraction of heatmap-based features is performed on well-known layer types that are part of any DNN model, regardless of the task at hand (i.e., they can be extended to DNNs that were not studied in this work).
4.9.3 Construct Validity.
The construct considered in our work is effectiveness. We measure the effectiveness through complementary indicators as follows:
For RQ1, we evaluate the effectiveness of our pipelines by computing the purity of the generated clusters. The purity of a cluster is measured as the maximum proportion of images representing one faulty scenario in this cluster.
For RQ2, we evaluate the effectiveness of our pipelines based on the coverage of the injected faulty scenarios by the root cause clusters. A faulty scenario is covered by a cluster if at least \(90\%\) of the images in this cluster represent that faulty scenario.
Finally, for RQ3, we consider both the purity and the coverage to measure the robustness of the top-performing pipelines to rare faulty scenarios.
4.9.4 Conclusion Validity.
Conclusion validity addresses threats that impact the ability to draw appropriate conclusions. To mitigate such threats, and to avoid violating parametric assumptions in our statistical analysis, we rely on a non-parametric test and effect size measure (i.e., the Mann-Whitney U-test and the Vargha and Delaney's \(\hat{A}_{12}\) statistic, respectively) to assess the statistical significance of differences in our results. Additionally, we applied Fisher's exact test, which is commonly used in similar contexts, when comparing coverage results related to different distributions of faulty scenarios (i.e., RQ3). All results were reported based on both purity and coverage, and six datasets were analyzed in our experiments.
4.10 Data Availability
All our implementations, the failure-inducing sets, the generated root-cause clusters and the data generated to address our research questions are available online [
5].
5 Related Work
Our article is related to the literature on DNN debugging and applications of transfer learning to perform root cause analysis [
63,
86].
Heatmap-based approaches [
15,
59,
68,
76,
80,
99,
101] explain the DNN's prediction of an image by highlighting which region of that image most influenced the DNN output. For example, Grad-CAM generates a heatmap from the gradient flowing into the last layer. The heatmap is then superposed on the original image to highlight the regions of the image that activated the DNN and influenced the decision [76]. The main limitation of these approaches is that they require the inspection of all the heatmaps generated for the images processed by the DNN (e.g., error-inducing images) and, different from our pipelines, do not provide engineers with guidance for their inspection (i.e., one cluster for each failure root cause). SHAP [53] generates explanations by calculating the contribution of each feature to predictions, thus explaining which features are the most important for each prediction. In the case of an image CNN, SHAP considers a group of pixels as a feature and calculates their contribution to the decision made by the DNN. Like heatmap-based approaches, SHAP does not provide guidance for the investigation of multiple failure-inducing images.
DeepJanus [
73] helps identify misbehaviors in a Deep Learning system by finding a set of pairs of inputs that are similar to each other and that trigger different behaviors of the Deep Learning system. This set of pairs represents the border between the input regions where the DNN behaves as expected and those where it fails. Different from our work, DeepJanus characterizes the behavior of a DNN that can be tested with a simulator but cannot provide explanations for failures observed with real-world images.
Some DNN testing approaches explain the input regions where DNN errors are observed by relying on decision trees constructed using the simulator parameters used to generate test input images [
2,
37]. Although decision trees are an effective means to provide explanations for failures detected during simulator-based testing, they cannot be applied to provide explanations for failures observed with real-world images. To overcome such a limitation, we have recently developed SEDE [26], a technique that applies HUDD to failure-inducing real-world images to generate root cause clusters and then relies on evolutionary algorithms to drive the generation, for each RCC, of similar images using simulators. The simulator parameter values used to generate such images are then fed into PART [29], a tree-based rule learning algorithm, to characterize each RCC in terms of simulator parameters (i.e., it generates expressions that constrain simulator parameters). The work in this article is complementary to SEDE since the latter can be applied to clusters generated with the best pipeline (i.e., Pipeline 26).
Pan et al. [
63] combine Transfer Learning with clustering to find root causes of hardware failures. The proposed method uses different clustering algorithms (K-means [
55], decision tree clustering [
51], hierarchical clustering [
48]) on hardware test data to cluster failures likely due to the same causes. Different from their work, we aim at explaining failures in DNNs that process images (i.e., our feature space is much larger). Ter Burg et al. [
86] explain DNNs based on a transfer learning model that has been fine-tuned to detect geometric shapes connecting face landmarks. Such shapes are treated as features, and the contribution of each feature is computed by relying on SHAP. The output should help end-users determine what influenced the DNN output. Unfortunately, similar to heatmap-based approaches, this approach does not support the explanation of multiple failures but requires engineers to process them one by one.
To conclude, our previous works (i.e., HUDD [
24] and SAFE [
4]) have been the first to apply clustering algorithms to white-box and black-box feature extraction approaches to explain failure causes in DNN-based systems. This study is the first to systematically assess and compare the performance of alternative white-box and black-box feature extraction approaches, dimensionality reduction techniques, and clustering algorithms using a wide variety of practical, realistic failure scenarios.
6 Conclusion
In this article, we presented a large-scale empirical evaluation of 99 different pipelines for root cause analysis of DNN failures. Our pipelines receive as input a set of images leading to DNN failures and generate as output clusters of images sharing similar characteristics. As demonstrated by our previous work, by visualizing the images in each cluster, an engineer can notice commonalities across them; such commonalities represent the root causes of failures, help characterize failure scenarios, and, thus, support engineers in improving the system (e.g., by selecting additional similar images to retrain the DNN or by introducing countermeasures in the system).
We considered 99 pipelines resulting from the combination of five methods for feature extraction, two techniques for dimensionality reduction, and three clustering algorithms. Our methods for feature extraction include white-box approaches (i.e., heatmap generation techniques) and black-box approaches (i.e., fine-tuned and non-fine-tuned transfer learning models). Additionally, we rely on PCA and UMAP for dimensionality reduction and on K-means, DBSCAN, and HDBSCAN for clustering.
We evaluated our pipelines in terms of cluster purity and coverage of failures based on a controlled set of failure scenarios artificially injected into our datasets and widely varying in terms of frequency, thus analyzing the impact of rare scenarios on our best pipelines. Further, we assessed the performance of our clustering pipelines in identifying failure scenarios not artificially injected but naturally present in our datasets. Based on six case study subjects in the automotive domain, our results show that the best results are obtained with a pipeline relying on VGG16 as the transfer learning model, not using fine-tuning, leveraging UMAP as the dimensionality reduction technique, and using DBSCAN as the clustering algorithm. When the failure scenarios are equally distributed, the best pipeline achieved a purity of 94.3% (i.e., almost all the images in RCCs present the same failure scenario) and a coverage of 96.7%. The same pipeline also performs well with rare failure scenarios; indeed, when the images belonging to a failure scenario represent between 5% and 10% of the total number of images, it can still cover 90% of the failure scenarios with a cluster purity above 70%. Finally, we observed that the pipeline performing best with injected failure scenarios also leads to the best results with failure scenarios already present in the datasets; specifically, it achieves 100% coverage and an average purity of 87% across clusters.