Efficient and Concise Explanations for Object Detection with Gaussian-Class Activation Mapping Explainer
Abstract.
To address the challenges of providing quick and plausible explanations in Explainable AI (XAI) for object detection models, we introduce the Gaussian Class Activation Mapping Explainer (G-CAME). Our method efficiently generates concise saliency maps by utilizing activation maps from selected layers and applying a Gaussian kernel to emphasize critical image regions for the predicted object. Compared with other Region-based approaches, G-CAME significantly reduces explanation time to 0.5 seconds without compromising the quality. Our evaluation of G-CAME, using Faster-RCNN and YOLOX on the MS-COCO 2017 dataset, demonstrates its ability to offer highly plausible and faithful explanations, especially in reducing the bias on tiny object detection.
Keywords: Explainable AI, Object Detection, Class Activation Mapping
Khanh Nguyen¹, Hung Nguyen¹,²,*, Khang Nguyen¹, Binh Truong¹, Tuong Phan¹,³, Hung Cao²
¹ Quy Nhon AI, FPT Software, Vietnam
² Analytics Everywhere Lab, University of New Brunswick, Canada
³ University of Waterloo, Canada
* hung.ntt@unb.ca
1. Introduction
In object detection, Deep Neural Networks [girshick2014rich] have improved significantly with the adoption of Convolutional Neural Networks. However, the deeper the network is, the more difficult it is to understand, debug, or improve, which poses a potentially serious problem in critical areas [nguyen2023towards]. To help humans gain a thorough understanding of the model’s decisions, several Explainable Artificial Intelligence (XAI) methods have been introduced that use saliency maps to highlight the important regions of input images.
A common and simple way to explain the object detector is to disregard the model’s architecture and only consider the input and output. This approach aims to determine the importance of each region in the input image based on the change in the model’s output. For example, Detector Randomized Input Sampling for Explanation (D-RISE) [petsiuk2021black] estimates each region’s effect on the input image by creating thousands of perturbed images, and subsequently feeding them into the model to predict and get the score for each perturbed mask. Another method is Surrogate Object Detection Explainer (SODEx) [sejr2021surrogate], an upgrade of Local Interpretable Model-Agnostic Explanations (LIME) [ribeiro2016should], which also employs the same technique as D-RISE to explain object detectors. Although the results of both SODEx and D-RISE are compelling, the generation of a large number of perturbations slows the explanation generation considerably.
Other approaches, such as Class Activation Mapping (CAM) [zhou2016learning] and GradCAM [selvaraju2017grad], use the activation maps of a specific layer in the model’s architecture as the main component to form the explanation. These methods are faster than the region-based methods mentioned above, but their explanations still contain some meaningless information since the feature maps are not necessarily related to the target object [zhang2021group]. Such methods give satisfactory results for the classification task. Still, they cannot be applied directly to the object detection task because they highlight all regions belonging to the target class and fail to focus on one specific region.
In this paper, we propose Gaussian Class Activation Mapping Explainer (G-CAME), which can explain the classification and localization of the target objects. Our method extends the applicability of CAM-based XAI to object detectors. By adding the Gaussian kernel as the weight for each pixel in the feature map, G-CAME’s final saliency map can explain each specific object. Our contributions can be summarized as follows:
(1) We propose the first CAM-based method tailored for object detection, G-CAME, which explains an object detector through a saliency map for a specific target object. G-CAME produces an explanation in a reasonably short time, overcoming the time constraints of existing methods such as D-RISE [petsiuk2021black] and SODEx [sejr2021surrogate].
(2) We qualitatively and quantitatively compare our method with D-RISE on two main types of object detectors, namely YOLOX [ge2021yolox] (one-stage detector) and Faster-RCNN [ren2015faster] (two-stage detector), and show that our method gives a less noisy, more accurate saliency map in a shorter time than D-RISE.
Our code is available at https://github.com/khanhnguyenuet/GCAME.
2. Explainable AI in Object Detection
Object detection, a field in computer vision (CV), involves models that are broadly classified into two categories: one-stage and two-stage models. One-stage models, such as the YOLO series [redmon2016you], SSD [liu2016ssd], and RetinaNet [lin2017focal], detect objects directly over a dense sampling of locations. In contrast, two-stage models like the R-CNN family [girshick2014rich], FPN [lin2017feature], and R-FCN [dai2016r], involve a two-phase process. Initially, these models select Regions of Interest (ROI) from the feature extraction stage, followed by classification based on each proposed ROI.
While several XAI methods have been applied to analyze deep CNN models in classification tasks, their applicability in object detection is comparatively limited due to constraints in flexibility, suitability, and computational efficiency [8689279].
This section discusses two XAI types: Region-based saliency methods and CAM-based saliency methods. These methods are evaluated for their applicability in both classification and object detection tasks. A significant gap in current XAI methods, particularly in object detection, is identified, laying the groundwork for the introduction of our method.
2.1. Region-based saliency methods
Region-based saliency methods use masks to isolate specific regions of an input image, assessing their impact on the output by processing the masked input through the model and quantifying each region’s influence. In classification, LIME [ribeiro2016should] and its extension, RISE [petsiuk2018rise], are notable examples, where the latter employs thousands of masks to generate a composite saliency map. Recent advancements have adapted these methods for object detection. SODEx [sejr2021surrogate] applies LIME to explain object detectors, modifying the metric to focus on target bounding boxes. D-RISE [petsiuk2021black] refines this by altering the computation of weighted scores for each random mask, specifically for object detection. D-CLOSE [truong2023towards] further utilizes multiple levels of segmentation on the image and combines them to deliver more concise and consistent explanations. Region-based methods offer an intuitive approach, as they do not require end users to have an in-depth understanding of the model’s architecture.
However, a notable challenge is the sensitivity of these explanations to changes in hyper-parameters, which can result in multiple potential explanations for a single object. Consequently, careful fine-tuning of hyper-parameters is essential to achieve a clear and satisfactory explanation. Additionally, a significant drawback of region-based methods is the considerable amount of time required to generate an explanation.
2.2. CAM-based methods
Conversely, CAM-based XAI requires a thorough understanding of the model’s architecture. Techniques such as CAM [zhou2016learning] and its successors, GradCAM [selvaraju2017grad], GradCAM++ [chattopadhay2018grad], and XGradCAM [fu2020axiom], are noteworthy for producing detailed saliency maps. These methods utilize partial derivatives of feature maps in selected layers relative to the target class score. While CAM-based methods are generally more efficient than Region-based methods [nguyen2021evaluation], their reliance on feature maps can result in less meaningful saliency maps. Additionally, these methods have primarily been developed for classification tasks, with no existing adaptations for object detection.
In light of these limitations, we introduce G-CAME, a novel CAM-based XAI method tailored for object detection. G-CAME is the first of its kind to offer stable and rapid explanations for both one-stage and two-stage object detection models, addressing the shortcomings of existing approaches.
3. Proposed method
For a given image $I$ with size $h$ by $w$ and an object detector $f$, the prediction $d$ includes the bounding box and predicted class. We aim to provide a saliency map $S$ to explain why the model makes that prediction. The saliency map $S$ has the same size as the input $I$. Each value $S_{(i,j)}$ shows the importance of pixel $(i,j)$ in $I$ in influencing $f$ to give prediction $d$. We propose a new method that produces this saliency map in a white-box manner. Our method is inspired by GradCAM [selvaraju2017grad], which uses the class activation mapping technique to generate the explanation for the model’s prediction. The main idea of our method is to combine the normal distribution with the CAM-based method to measure how one region in the input image affects the predicted output. Fig. 1 shows an overview of our method.
Due to their output difference, we cannot directly apply XAI methods for the classification model to the object detection model. In the classification task, the model only gives one prediction that shows the image’s label. However, in the object detection task, the model gives multiple boxes with corresponding labels and the probabilities of objects. Most object detectors, such as YOLO [redmon2016you] and R-CNN [girshick2014rich], usually produce predicted bounding boxes in the format:
$$d_i = \left(x_1^i, y_1^i, x_2^i, y_2^i, p_{obj}^i, p_1^i, \ldots, p_C^i\right) \tag{1}$$
The prediction is encoded as a vector $d_i$ that consists of:
• Bounding box information: $(x_1^i, y_1^i, x_2^i, y_2^i)$ denotes the top-left and bottom-right corners of the predicted box.
• Objectness probability score: $p_{obj}^i$ denotes the probability of an object’s occurrence in the predicted box.
• Class score information: $(p_1^i, \ldots, p_C^i)$ denotes the probability of each of the $C$ classes in the predicted box.
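As a minimal sketch of this encoding, the snippet below splits a flat prediction vector into its three parts. The helper names `Detection`, `split_prediction`, and `predicted_class`, and the assumption that the vector is ordered as box corners, objectness, then class scores, are illustrative only and are not taken from any specific detector's API.

```python
from typing import NamedTuple, Sequence, Tuple


class Detection(NamedTuple):
    box: Tuple[float, ...]           # (x1, y1, x2, y2): top-left and bottom-right corners
    objectness: float                # probability that the box contains an object
    class_scores: Tuple[float, ...]  # one score per class


def split_prediction(d: Sequence[float]) -> Detection:
    """Split a flat prediction vector into box, objectness, and class scores.

    Assumes the layout [x1, y1, x2, y2, p_obj, p_1, ..., p_C].
    """
    if len(d) < 6:
        raise ValueError("expected 4 box values, 1 objectness score, and >= 1 class score")
    return Detection(tuple(d[:4]), float(d[4]), tuple(d[5:]))


def predicted_class(det: Detection) -> int:
    """Index of the class with the highest score."""
    return max(range(len(det.class_scores)), key=det.class_scores.__getitem__)
```

For example, `split_prediction([10, 20, 110, 220, 0.9, 0.1, 0.7, 0.2])` yields a box of `(10, 20, 110, 220)`, an objectness of `0.9`, and class index `1` as the predicted class.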
In almost all object detectors, such as Faster-RCNN [ren2015faster], YOLOX [ge2021yolox], the anchor boxes technique is widely used to detect bounding boxes. G-CAME utilizes this technique to find and estimate the region related to the predicted box. Our method can be divided into 4 phases (Fig. 1) as follows: 1) Choosing target layers, 2) Object Locating, 3) Weighting Feature Map, and 4) Masking Target Region.
3.1. Target layers selection
One-stage object detector (YOLOX). For a one-stage object detector, such as YOLOX, we choose the final convolution layer in each branch of the model as the target layer to calculate the derivative, as convolutional layers naturally retain spatial information that is lost in fully connected layers. Hence, the last convolutional layers are expected to offer the best compromise between high-level semantics and detailed spatial information [selvaraju2017grad]. The neurons in these layers look for semantic class-specific information in the image.
Two-stage object detector (Faster-RCNN). Two-stage object detectors, such as Faster-RCNN, contain two phases. In the first stage, features are extracted by passing the image through the stacked convolution layers of the backbone and the Feature Pyramid Network (FPN) [lin2017feature], which includes four branches to detect objects of different sizes. Subsequently, the Region Proposal Network (RPN) identifies potential object-containing regions, which are then resized uniformly via the Region of Interest (ROI) Pooling layer. For a two-stage object detector, we use the convolution layers in the FPN as the target layers because they are the last layers of the feature extractor that contain spatial information.
3.2. Object Localization with Gradient
Most detector models, such as Faster-RCNN [ren2015faster] and PAFNet [xin2021pafnet], use the anchor box technique to predict bounding boxes. In the final feature map, each pixel predicts multiple bounding boxes under the anchor box technique, but only one bounding box under the anchor-free technique used by YOLOX [ge2021yolox]. To find the pixel representing the box that we aim to explain, we take the derivative of the target box with respect to the final feature map to get the location map, as in the following formula:
$$G_k^{l(c)} = \frac{\partial y^c}{\partial A_k^l} \tag{2}$$
where $G_k^{l(c)}$ denotes the gradient map of layer $l$ for feature map $k$, and $\frac{\partial y^c}{\partial A_k^l}$ is the derivative of the target class score $y^c$ with respect to the feature map $A_k^l$. In the regression task of most one-stage object detectors, a 1×1 convolution is used for predicting the bounding box, so in the backward pass the gradient map has a value at only one pixel.
In a two-stage object detector, such as Faster-RCNN, the regression and classification tasks are in two separate branches, so we tailor G-CAME for two-stage models as follows. First, we calculate the partial derivative of the class score with respect to each feature map of the selected layers. Faster-RCNN has four branches for detecting objects, and we choose the last convolution layer of each branch to calculate the derivative. When we take the derivative of the class score with respect to the target layer, more than one pixel in the gradient map has a value, because anchor boxes are created in the next phase, namely the detecting phase. The ROI pooling layer replaces the 1×1 convolution, and they are in a separate branch from the classification stage. Thus, we cannot get the pixel representing the object’s center directly from the gradient map. To solve this issue, we set the pixel with the highest value in the gradient map as the center of the Gaussian mask. We estimate that the area around the highest-value pixel likely contains relevant features.
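The center-selection rule for two-stage detectors can be sketched as follows. This is a minimal NumPy illustration that assumes the gradient map is available as a 2-D array of shape (height, width); the function name `locate_center` is hypothetical and does not come from the authors' code.

```python
import numpy as np


def locate_center(grad_map: np.ndarray) -> tuple:
    """Return (row, col) of the pixel with the highest value in a 2-D gradient map."""
    # argmax works on the flattened array; unravel_index maps it back to 2-D coordinates.
    row, col = np.unravel_index(np.argmax(grad_map), grad_map.shape)
    return int(row), int(col)
```

When several pixels share the maximum value, `np.argmax` returns the first one in row-major order, which is one possible tie-breaking choice.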
3.3. Weighting Feature Map via Gradient-based method
We adopt a gradient-based method, as GradCAM [selvaraju2017grad] does for classification, to get the weight for each feature map. As the values in the gradient map can be either positive or negative, we divide all feature maps into two parts: one with positive gradient and another with negative gradient. The weight for each feature map of the target layer is calculated by taking the mean value of its gradient map. The negative part is considered to reduce the target score, so we sum the two parts separately and then subtract the negative part from the positive one (as in Eq. 5) to get a smoother saliency map, and then use the ReLU function to remove the pixels that do not contribute to the prediction.
(3) |
(4) |
(5) |
Because GradCAM can only explain classification models, it highlights all objects of the same class. By detecting the target object’s location, we can tailor G-CAME to the object detection problem by explaining only one target object.
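The weighting step described in this section can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' implementation: the function name `weighted_feature_map`, the (K, h, w) array layout, and in particular the choice to accumulate the negative part by the magnitude of its weights are assumptions made here for clarity.

```python
import numpy as np


def weighted_feature_map(features: np.ndarray, grads: np.ndarray) -> np.ndarray:
    """Combine K feature maps (K, h, w) using their gradient maps (K, h, w)."""
    # Weight of each feature map: the mean value of its gradient map.
    alpha = grads.mean(axis=(1, 2))
    pos = alpha > 0
    neg = alpha < 0
    # Sum the positive-gradient and negative-gradient groups separately.
    pos_part = (alpha[pos][:, None, None] * features[pos]).sum(axis=0)
    # Assumption: the negative part is accumulated with the magnitude of its weights.
    neg_part = (np.abs(alpha[neg])[:, None, None] * features[neg]).sum(axis=0)
    # Subtract the negative part from the positive one, then apply ReLU.
    return np.maximum(pos_part - neg_part, 0.0)
```

Under this sign convention the result coincides with a GradCAM-style weighted sum followed by ReLU; the split into two groups becomes relevant once each group is masked separately, as in the next section.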
3.4. Masking Target Region with Gaussian Distribution
To deal with the localization issue, we propose to use a Gaussian distribution to estimate the region around the object’s center. Because the gradient map shows the target object’s location, we estimate the object region around the pixel representing the object’s center by using a Gaussian mask as the weight for each pixel in the weighted feature map. The Gaussian kernel is defined as:
$$G_{\sigma} = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right) \tag{6}$$
where the term $\sigma$ is the standard deviation of the values in the Gaussian kernel and controls the kernel size. $x$ and $y$ are two linear-space vectors spanning the kernel, one vertical and the other horizontal. The bigger $\sigma$ is, the larger the highlighted region we get. For each feature map in the target layer, we apply the Gaussian kernel to get the region of the target object and then sum all these weighted feature maps. In general, we slightly adjust the weighted feature map (Eq. 5) to get the final saliency map, as shown in Eq. 7:
(7) |
3.4.1. Choosing $\sigma$ for Gaussian mask
The Gaussian masks are applied to all feature maps, with the kernel size being the size of each feature map, and $\sigma$ is calculated as in Eq. 10.
(8) |
(9) |
(10) |
where $\sigma$ is the combination of two terms. In the first term, we calculate the expansion factor from a quantity representing the importance of the location map and the scale between the original image size and the feature map size. We use the logarithm function to adjust the value of the first term so that it matches the size of the gradient map. For multi-scale object detectors, we have a different scale factor for each scale level. In the second term, we choose the Gaussian kernel size based on the three-sigma rule [pukelsheim1994three], as in Eq. 11, and take the inverse value.
(11) |
3.4.2. Gaussian mask generation
We generate each Gaussian mask with the following steps:
(1) Create a grid of coordinate values along the width and the height of the location map.
(2) Shift the grid by subtracting the position of the center pixel of the target object on the location map.
(3) Evaluate the Gaussian kernel (Eq. 6) on the shifted grid with the chosen $\sigma$.
(4) Normalize all values to the range $[0, 1]$.
By normalizing all values to the range $[0, 1]$, Gaussian masks only keep the region relating to the object we aim to explain and remove other unrelated regions in the weighted feature map.
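The mask-generation steps can be sketched in NumPy as below. This is a minimal illustration under stated assumptions: $\sigma$ is passed in directly instead of being computed from Eq. 10, the grid uses integer pixel indices, min-max scaling is used as the normalization, and the function name `gaussian_mask` is hypothetical.

```python
import numpy as np


def gaussian_mask(h: int, w: int, center: tuple, sigma: float) -> np.ndarray:
    """Build an (h, w) Gaussian mask centered on `center` = (row, col), scaled to [0, 1]."""
    ci, cj = center
    # Steps 1 and 2: coordinate grids for height and width, shifted by the center pixel.
    rows = np.arange(h, dtype=float) - ci
    cols = np.arange(w, dtype=float) - cj
    yy, xx = np.meshgrid(rows, cols, indexing="ij")
    # Step 3: unnormalized Gaussian; the constant factor cancels in the normalization.
    mask = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    # Step 4: min-max normalization (an assumed choice of normalization).
    span = mask.max() - mask.min()
    if span > 0:
        mask = (mask - mask.min()) / span
    return mask
```

Multiplying this mask element-wise with a weighted feature map keeps the response near the target center and suppresses activations from other objects of the same class.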
4. Experiments and Results
We performed our experiments on the MS-COCO 2017 [lin2014microsoft] dataset with 5000 validation images. The models in our experiments are YOLOX-l (one-stage model) and Faster-RCNN (two-stage model). All experiments are conducted on an NVIDIA Tesla P100 GPU. G-CAME’s inference time depends on the number of feature maps in the selected layer. On YOLOX-l with 256 feature maps, G-CAME takes roughly 0.5 seconds per object.
4.1. Sanity check
To validate whether the saliency map is a faithful explanation, we perform a sanity check [adebayo2018sanity] with Cascading Randomization and Independent Randomization. In the Cascading Randomization approach, we randomly choose five convolution layers as the test layers. Then, for each layer between the selected layer and the top layer, we remove the pre-trained weights, reinitialize them with a normal distribution, and run G-CAME to get the explanation for the target object. In contrast, in Independent Randomization, we only reinitialize the weights of the selected layer and retain the other pre-trained weights. The sanity check results show that G-CAME is sensitive to model parameters and can produce valid results, as shown in Fig. 2.
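The difference between the two randomization schemes can be illustrated with a short sketch. It assumes the model's weights are given as a list of NumPy arrays ordered from the input side to the top (output) layer, and it uses a standard normal distribution for reinitialization; the function name `randomize_weights` and both of these choices are assumptions for illustration.

```python
import numpy as np


def randomize_weights(weights, selected, mode, rng):
    """Return a copy of `weights` with some layers reinitialized from a normal distribution.

    weights:  list of arrays ordered from the input side to the top layer
    selected: index of the selected test layer
    mode:     "cascading" reinitializes the selected layer and every layer above it;
              "independent" reinitializes only the selected layer
    """
    if mode == "cascading":
        targets = range(selected, len(weights))
    elif mode == "independent":
        targets = [selected]
    else:
        raise ValueError("mode must be 'cascading' or 'independent'")
    out = [w.copy() for w in weights]
    for i in targets:
        out[i] = rng.normal(size=weights[i].shape)
    return out
```

A faithful explainer should produce visibly different saliency maps once the weights it depends on are randomized in either scheme.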
4.2. Qualitative Evaluation
We performed a qualitative evaluation of G-CAME’s saliency maps in comparison with D-RISE. We use D-RISE’s default parameters [petsiuk2021black] for the size of each grid, the probability of each grid’s occurrence, and the number of samples for each image. For G-CAME, we choose the target layers as described in Sec. 3.1 to calculate the derivative.
Fig. 3 shows the results of G-CAME compared with GradCAM and D-RISE. GradCAM is only applicable for the classification task, as it shows the saliency maps for all objects in the same class. Considering XAI methods for object detectors, where G-CAME and D-RISE can deliver the explanations for a specific object, G-CAME can generate saliency maps where the random noises are significantly reduced in comparison with D-RISE.
4.3. Quantitative Localization Evaluation
We use two standard metrics, Pointing Game [zhang2018top] and Energy-based Pointing Game [wang2020score], to compare the correlation between an object’s saliency map and human-labeled ground truth. The results are shown in Table 1.
4.3.1. Pointing Game (PG)
To evaluate XAI methods via the PG metric, we first run the model on the dataset and get the bounding boxes that best match the ground truth for each class on each image. A hit is scored if the highest point of the saliency map lies inside the ground truth; otherwise, a miss is counted. The pointing game score for each image is calculated by
$$PG = \frac{\#Hits}{\#Hits + \#Misses} \tag{12}$$
A good explanation should achieve a high PG score.
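A minimal sketch of the PG computation is given below. It assumes each saliency map is a 2-D NumPy array, each ground-truth box is given as inclusive pixel coordinates `(x1, y1, x2, y2)`, and one box is paired with one saliency map; the function names `is_hit` and `pointing_game` are hypothetical.

```python
import numpy as np


def is_hit(saliency: np.ndarray, box: tuple) -> bool:
    """True if the highest-valued pixel of the saliency map lies inside the box."""
    row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
    x1, y1, x2, y2 = box
    return bool(x1 <= col <= x2 and y1 <= row <= y2)


def pointing_game(saliency_maps, boxes) -> float:
    """Fraction of (saliency map, box) pairs that score a hit: hits / (hits + misses)."""
    hits = sum(is_hit(s, b) for s, b in zip(saliency_maps, boxes))
    return hits / len(saliency_maps)
```

Note that this sketch checks only the first maximum returned by `np.argmax`; evaluating all pixels tied for the maximum would need a small extension.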
4.3.2. Energy-Based Pointing Game (EBPG)
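EBPG [wang2020score] calculates how much of the saliency map’s energy falls inside the bounding box. Similar to the PG score, a good explanation is considered to have a higher EBPG. The EBPG formula is defined as follows:
$$EBPG = \frac{\sum_{(i,j) \in bbox} S_{(i,j)}}{\sum_{(i,j) \in bbox} S_{(i,j)} + \sum_{(i,j) \notin bbox} S_{(i,j)}} \tag{13}$$
where $S_{(i,j)}$ is the saliency value at pixel $(i,j)$.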
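The EBPG ratio can be sketched in a few lines of NumPy. The sketch assumes a non-negative 2-D saliency map and a box given as inclusive integer pixel coordinates `(x1, y1, x2, y2)`; the function name `ebpg` is hypothetical.

```python
import numpy as np


def ebpg(saliency: np.ndarray, box: tuple) -> float:
    """Share of the saliency map's total energy that falls inside the box."""
    x1, y1, x2, y2 = box
    total = saliency.sum()
    if total == 0:
        return 0.0  # an all-zero map carries no energy
    inside = saliency[y1:y2 + 1, x1:x2 + 1].sum()
    return float(inside / total)
```

For a uniform 4×4 map and a box covering a 2×2 corner, the score is 4/16 = 0.25, which matches the intuition that a map spread evenly over the image earns a score equal to the box's area fraction.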
PG and EBPG results are reported in Table 1. Specifically, more than 65% of the energy of G-CAME’s saliency map falls into the ground-truth bounding box, compared with only 18.4% for D-RISE. In other words, G-CAME drastically reduces noise in the saliency map. In the PG evaluation, G-CAME also gives better results than D-RISE: 98% of the highest-value pixels lie inside the correct bounding box, while this number for D-RISE is 86%.
Method | D-RISE (All) | D-RISE (Tiny) | G-CAME (Ours) (All) | G-CAME (Ours) (Tiny) |
PG | 0.86 | 0.127 | 0.98 | 0.158 |
EBPG | 0.184 | 0.009 | 0.671 | 0.261 |

Method | D-RISE | G-CAME (Ours) |
Confidence Drop (%) | 42.3 | 36.8 |
Information Drop (%) | 31.58 | 29.15 |
Running time (s) | 252 | 0.435 |
4.3.3. Bias in Tiny Object Detection
Explaining tiny objects detected by the model can be a challenge for XAI methods. In particular, the saliency map may be biased toward the neighboring region. This issue can worsen when multiple tiny objects partially or fully overlap, because the saliency map stays in the same location for every object. In our experiments, we define a tiny object by the ratio of the predicted bounding box area to the input image area (640×640 in YOLOX). An object is considered tiny when this ratio is less than or equal to 0.005. In Fig. 4, we compare G-CAME with D-RISE in explaining tiny object predictions for two cases. In the first case (Fig. 4a), we test the performance of D-RISE and G-CAME in explaining two tiny objects of the same class. The result shows that D-RISE fails to distinguish the two “traffic lights”: their saliency maps are nearly identical. For the case of multiple overlapping objects of different classes (Fig. 4b), the saliency maps produced by D-RISE hardly focus on one specific target. The saliency corresponding to the “surfboard” even covers the “person”, and so does the explanation of the “person”. The cause may be the grid size in D-RISE, but changing to a much smaller grid size can make the detector unable to predict. In contrast, G-CAME clearly shows the target object’s localization in both cases and reduces the saliency map’s bias toward unrelated regions.
We also evaluated both methods on tiny object predictions only. The MS-COCO 2017 validation dataset has more than 8000 tiny objects, and the results are reported in Table 1. In the EBPG score, our method outperforms D-RISE, with more than 26% of the saliency map’s energy falling into the predicted box, while this figure for D-RISE is only 0.9%; that is, most of the energy in D-RISE’s explanation does not focus on the correct target. In the PG score, instead of evaluating one pixel, we assess all pixels having the same value as the pixel with the highest value. The result also shows that G-CAME’s explanation has better accuracy than D-RISE’s.
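The tiny-object criterion used above can be written as a short check. The threshold of 0.005 and the 640×640 input size come from the text; the function name `is_tiny` and the `(x1, y1, x2, y2)` box layout are assumptions for illustration.

```python
def is_tiny(box, image_size=(640, 640), threshold=0.005):
    """True if the box area is at most `threshold` times the image area."""
    x1, y1, x2, y2 = box
    box_area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    image_area = image_size[0] * image_size[1]
    return box_area / image_area <= threshold
```

For a 640×640 input, the threshold corresponds to a box area of at most 2048 pixels, so a 40×40 box counts as tiny while a 50×50 box does not.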
4.4. Quantitative Faithfulness Evaluation
Another essential aspect of an XAI method is the ability to ensure the explanation’s completeness and consistency in the model’s predictions. In this section, we employ the Confidence Drop and Information Drop scores to evaluate G-CAME and D-RISE on the YOLOX model with the MS-COCO 2017 dataset.
4.4.1. Confidence Drop
We employ the Average Drop metric to evaluate the confidence change [chattopadhay2018grad, fu2020axiom, ramaswamy2020ablation] in the model’s prediction for the target object when using the explanation as the input. In other words, when we remove these important regions, the confidence score of the target box should be dropped. The Average Drop is defined as:
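$$AD = \frac{1}{N} \sum_{i=1}^{N} \frac{\max\left(P_c(I_i) - P_c(\tilde{I}_i), 0\right)}{P_c(I_i)} \times 100 \tag{14}$$
where:
$$\tilde{I} = I \odot (1 - M) + \mu M \tag{15}$$
$$P_c(\tilde{I}) = IoU(L_i, L_j) \cdot p_c(L_j) \tag{16}$$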
Here, we tailor the original formula of Average Drop for the object detection model. In Eq. 15, we create a new input image $\tilde{I}$ masked by the explanation of G-CAME. $\mu$ is the mean value of the original image. For the mask $M$, we keep only the 20% of pixels with the most significant values in the original explanation and set the rest to 0. This minimizes the explanation’s noise, so that the saliency map focuses on the regions most influencing the prediction.
In Eq. 16, to compute the probability $P_c(\tilde{I})$, we first calculate the pair-wise IoU of each box $L_j$ predicted on the perturbed image with the box $L_i$ predicted on the original image and take the one with the highest value. After that, we multiply this IoU by the corresponding class score $p_c(L_j)$ of the box. In calculating $P_c(I)$, the IoU equals 1, so the value remains the original confidence score. Hence, if the explanation is faithful, the confidence drop should increase. However, removing several pixels can penalize methods that produce saliency maps with connected and coherent regions. Specifically, pixels representing the object’s edges are more meaningful than others in the middle [kapishnikov2019xrai]. For example, pixels representing the dog’s tail are easier to recognize than others lying on the dog’s body.
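The pieces of this metric can be sketched as follows. This is a hedged illustration, not the authors' evaluation code: it assumes the saliency map is scaled to [0, 1], that the important pixels are replaced by the image mean, and that boxes are `(x1, y1, x2, y2)` tuples. All function names are hypothetical.

```python
import numpy as np


def top_fraction_mask(saliency: np.ndarray, keep: float = 0.2) -> np.ndarray:
    """Keep the `keep` fraction of highest-valued pixels; set the rest to 0."""
    k = max(1, int(round(keep * saliency.size)))
    threshold = np.sort(saliency, axis=None)[-k]
    # Ties at the threshold are all kept, so slightly more than k pixels may survive.
    return np.where(saliency >= threshold, saliency, 0.0)


def perturb(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Blend the masked (important) pixels toward the image mean."""
    return image * (1.0 - mask) + image.mean() * mask


def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def matched_confidence(orig_box, new_boxes, new_scores) -> float:
    """Highest IoU with the original box, multiplied by that box's class score."""
    if not new_boxes:
        return 0.0
    ious = [iou(orig_box, b) for b in new_boxes]
    j = max(range(len(ious)), key=ious.__getitem__)
    return ious[j] * new_scores[j]


def average_drop(p_orig, p_pert) -> float:
    """Mean relative confidence drop in percent, with negative drops clipped to 0."""
    drops = [max(0.0, po - pp) / po for po, pp in zip(p_orig, p_pert)]
    return 100.0 * sum(drops) / len(drops)
```

In use, the detector would be run on the output of `perturb`, and its boxes and class scores passed to `matched_confidence` to obtain the perturbed confidence for each target object.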
4.4.2. Information Drop
In addition to the Confidence Drop score, we measure the faithfulness of the method via the Information Drop score. We compare the information level of the original image with that of a bokeh image, obtained by blurring the image while keeping the salient regions in focus. To measure the bokeh image’s information, we use the WebP [Webp] format and calculate the Information Drop score as the ratio of the compressed size of the bokeh image to that of the original image [kapishnikov2019xrai].
4.5. Evaluation
Table 1 highlights the strengths of G-CAME compared to D-RISE. D-RISE achieves a 42.3% Confidence Drop by spreading its saliency map across the image, leading to a significant but less targeted reduction in confidence. Conversely, G-CAME maintains focus on the target object, resulting in a lower confidence drop (36.8%) that signifies a precise and relevant explanation. Crucially, G-CAME outperforms D-RISE in Information Drop with 29.15% versus 31.58%, indicating superior preservation of the original image’s content. Additionally, our method offers a significant speed advantage, delivering explanations in under a second (0.435 s), as opposed to D-RISE’s runtime of roughly four minutes (252 s). These results demonstrate G-CAME’s efficiency in providing focused, relevant, and quick explanations for object detection models.
5. Conclusion
In this paper, we proposed G-CAME, a novel CAM-based XAI method that leverages the Gaussian kernel to explain one-stage and two-stage object detection models. The experimental results show that our method can plausibly explain the model’s predictions and reduce the bias in tiny object detection. Moreover, our method’s runtime is relatively short, overcoming the time constraint of existing region-based methods while also reducing the noise in the saliency map.
Acknowledgment
This work was partially supported by the NBIF Talent Recruitment Fund (TRF2003-001) and the UNB-FCS Startup Fund (22-23 START UP/ H CAO).