
Efficient and Concise Explanations for Object Detection with Gaussian-Class Activation Mapping Explainer

Abstract.

To address the challenges of providing quick and plausible explanations in Explainable AI (XAI) for object detection models, we introduce the Gaussian Class Activation Mapping Explainer (G-CAME). Our method efficiently generates concise saliency maps by utilizing activation maps from selected layers and applying a Gaussian kernel to emphasize critical image regions for the predicted object. Compared with other Region-based approaches, G-CAME significantly reduces explanation time to 0.5 seconds without compromising the quality. Our evaluation of G-CAME, using Faster-RCNN and YOLOX on the MS-COCO 2017 dataset, demonstrates its ability to offer highly plausible and faithful explanations, especially in reducing the bias on tiny object detection.

Keywords: Explainable AI, Object Detection, Class Activation Mapping
Khanh Nguyen$^{1}$, Hung Nguyen$^{1,2,*}$, Khang Nguyen$^{1}$, Binh Truong$^{1}$, Tuong Phan$^{1,3}$, Hung Cao$^{2}$

$^{1}$Quy Nhon AI, FPT Software, Vietnam
$^{2}$Analytics Everywhere Lab, University of New Brunswick, Canada
$^{3}$University of Waterloo, Canada

$^{*}$hung.ntt@unb.ca

1. Introduction

In object detection, Deep Neural Networks [girshick2014rich] have improved significantly with the adoption of Convolutional Neural Networks. However, the deeper the network is, the more difficult it is to understand, debug, or improve, which poses a potentially serious problem in critical areas [nguyen2023towards]. To help humans gain a thorough understanding of the model’s decisions, several Explainable Artificial Intelligence (XAI) methods that use saliency maps to highlight the important regions of input images have been introduced.

A common and simple way to explain the object detector is to disregard the model’s architecture and only consider the input and output. This approach aims to determine the importance of each region in the input image based on the change in the model’s output. For example, Detector Randomized Input Sampling for Explanation (D-RISE) [petsiuk2021black] estimates each region’s effect on the input image by creating thousands of perturbed images and feeding them into the model to obtain a score for each perturbation mask. Another method is Surrogate Object Detection Explainer (SODEx) [sejr2021surrogate], an upgrade of Local Interpretable Model-Agnostic Explanations (LIME) [ribeiro2016should], which employs the same technique as D-RISE to explain object detectors. Although the results of both SODEx and D-RISE are compelling, generating a large number of perturbations slows the explanation considerably.

Other approaches, such as Class Activation Mapping (CAM) [zhou2016learning] and GradCAM [selvaraju2017grad], use the activation maps of a specific layer in the model’s architecture as the main component to form the explanation. These methods are faster than the region-based methods mentioned above, but their explanations still contain some meaningless information because not all feature maps are related to the target object [zhang2021group]. Such methods can give a satisfactory result for the classification task. Still, they cannot be applied directly to the object detection task because they highlight all regions having the same target class and fail to focus on one specific region.

In this paper, we propose Gaussian Class Activation Mapping Explainer (G-CAME), which can explain the classification and localization of the target objects. Our method extends the applicability of CAM-based XAI to object detectors. By adding the Gaussian kernel as the weight for each pixel in the feature map, G-CAME’s final saliency map can explain each specific object. Our contributions can be summarized as follows:

  1. We propose the first CAM-based method tailored for object detection, G-CAME, which explains an object detector’s prediction as a saliency map for a specific target object. G-CAME produces explanations in a reasonably short time, overcoming the time constraints of existing methods such as D-RISE [petsiuk2021black] and SODEx [sejr2021surrogate].

  2. We qualitatively and quantitatively compare our method with D-RISE on two main types of object detectors, namely YOLOX [ge2021yolox] (one-stage detector) and Faster-RCNN [ren2015faster] (two-stage detector), and show that our method gives a less noisy, more accurate saliency map in a shorter time than D-RISE.

Our code is available at https://github.com/khanhnguyenuet/GCAME.

2. Explainable AI in Object Detection

Object detection, a field in computer vision (CV), involves models that are broadly classified into two categories: one-stage and two-stage models. One-stage models, such as the YOLO series [redmon2016you], SSD [liu2016ssd], and RetinaNet [lin2017focal], detect objects directly over a dense sampling of locations. In contrast, two-stage models like the R-CNN family [girshick2014rich], FPN [lin2017feature], and R-FCN [dai2016r], involve a two-phase process. Initially, these models select Regions of Interest (ROI) from the feature extraction stage, followed by classification based on each proposed ROI.

While several XAI methods have been applied to analyze deep CNN models in classification tasks, their applicability in object detection is comparatively limited due to constraints in flexibility, suitability, and computational efficiency [8689279].

This section discusses two XAI types: Region-based saliency methods and CAM-based saliency methods. These methods are evaluated for their applicability in both classification and object detection tasks. A significant gap in current XAI methods, particularly in object detection, is identified, laying the groundwork for the introduction of our method.

2.1. Region-based saliency methods

Region-based saliency methods use masks to isolate specific regions of an input image, assessing their impact on the output by processing the masked input through the model and quantifying each region’s influence. In classification, LIME [ribeiro2016should] and its extension, RISE [petsiuk2018rise], are notable examples, where the latter employs thousands of masks to generate a composite saliency map. Recent advancements have adapted these methods for object detection. SODEx [sejr2021surrogate] applies LIME to explain object detectors, modifying the metric to focus on target bounding boxes. D-RISE [petsiuk2021black] refines this by altering the computation of weighted scores for each random mask, specifically for object detection. D-CLOSE [truong2023towards] further utilizes multiple levels of segmentation on the image and combines them to deliver more concise and consistent explanations. Region-based methods offer an intuitive approach, as they do not require the end-user to have an in-depth understanding of the model’s architecture.

However, a notable challenge is the sensitivity of these explanations to changes in hyper-parameters, resulting in multiple potential explanations for a single object. Consequently, careful fine-tuning of hyper-parameters is essential to achieve a clear and satisfactory explanation. Additionally, a significant drawback of region-based methods is the considerable amount of time required to generate an explanation.

2.2. CAM-based methods

Conversely, CAM-based XAI requires a thorough understanding of the model’s architecture. Techniques such as CAM [zhou2016learning] and its successors, GradCAM [selvaraju2017grad], GradCAM++ [chattopadhay2018grad], and XGradCAM [fu2020axiom], are noteworthy for producing detailed saliency maps. These methods utilize partial derivatives of feature maps in selected layers relative to the target class score. While CAM-based methods are generally more efficient than Region-based methods [nguyen2021evaluation], their reliance on feature maps can result in less meaningful saliency maps. Additionally, these methods have primarily been developed for classification tasks, with no existing adaptations for object detection.

In light of these limitations, we introduce G-CAME, a novel CAM-based XAI method tailored for object detection. G-CAME is the first of its kind to offer stable and rapid explanations for both one-stage and two-stage object detection models, addressing the shortcomings of existing approaches.

3. Proposed method

For a given image $I$ of size $h$ by $w$, an object detector $f$, and a prediction $d$ that includes the bounding box and predicted class, we aim to provide a saliency map $S$ that explains why the model made that prediction. The saliency map $S$ has the same size as the input $I$. Each value $S_{(i,j)}$ indicates how strongly the pixel $(i,j)$ in $I$ influences $f$ to give the prediction $d$. We propose a new method that produces this saliency map in a white-box manner. Our method is inspired by GradCAM [selvaraju2017grad], which uses the class activation mapping technique to generate the explanation for the model’s prediction. The main idea of our method is to combine a normal distribution with the CAM-based approach to measure how one region in the input image affects the predicted output. Fig. 1 shows an overview of our method.

Figure 1. Overview of G-CAME method. We use the gradient-based technique to get the target object’s location and weight for each feature map. We multiply element-wise with Gaussian kernel for each weighted feature map to remove unrelated regions. After applying the Gaussian kernel, the output saliency map is created by a linear combination of all weighted feature maps.

Because their outputs differ, XAI methods designed for classification models cannot be applied directly to object detection models. In the classification task, the model gives a single prediction, the image’s label. In the object detection task, however, the model gives multiple boxes with corresponding labels and object probabilities. Most object detectors, such as YOLO [redmon2016you] and R-CNN [girshick2014rich], produce $N$ predicted bounding boxes in the format:

\begin{equation}
d_i = (x_1^i, y_1^i, x_2^i, y_2^i, p_{obj}^i, p_1^i, \ldots, p_C^i) \tag{1}
\end{equation}

The prediction is encoded as a vector $d_i$ that consists of:

  • Bounding box information: $(x_1^i, y_1^i, x_2^i, y_2^i)$ denotes the top-left and bottom-right corners of the predicted box.

  • Objectness probability score: $p_{obj}^i \in [0,1]$ denotes the probability of an object’s occurrence in the predicted box.

  • Class score information: $(p_1^i, \ldots, p_C^i)$ denotes the probabilities of the $C$ classes in the predicted box.
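To make the layout of Eq. 1 concrete, the following minimal sketch unpacks one flat prediction vector into its three parts. The `Detection` container and the `parse_detection` helper are hypothetical names introduced here for illustration; they are not taken from the G-CAME code, and real detectors may order or scale these fields differently.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Detection:
    """One prediction d_i laid out as in Eq. 1 (hypothetical container)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    objectness: float                       # p_obj in [0, 1]
    class_probs: List[float]                # (p_1, ..., p_C)

    @property
    def predicted_class(self) -> int:
        # Index of the most probable class.
        return max(range(len(self.class_probs)), key=self.class_probs.__getitem__)


def parse_detection(d: Sequence[float]) -> Detection:
    """Split a flat vector of length 5 + C into box, objectness, and class scores."""
    if len(d) < 6:
        raise ValueError("expected 4 box values, 1 objectness score, and at least 1 class score")
    x1, y1, x2, y2, p_obj = d[:5]
    return Detection((x1, y1, x2, y2), p_obj, list(d[5:]))


# Example with C = 3 classes (values are made up).
det = parse_detection([10.0, 20.0, 110.0, 220.0, 0.9, 0.1, 0.7, 0.2])
```

Explaining a single detection then amounts to picking one such vector and back-propagating from its class score.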

In almost all object detectors, such as Faster-RCNN [ren2015faster], YOLOX [ge2021yolox], the anchor boxes technique is widely used to detect bounding boxes. G-CAME utilizes this technique to find and estimate the region related to the predicted box. Our method can be divided into 4 phases (Fig. 1) as follows: 1) Choosing target layers, 2) Object Locating, 3) Weighting Feature Map, and 4) Masking Target Region.

3.1. Target layers selection

One-stage object detector (YOLOX). For a one-stage object detector, such as YOLOX, we choose the final convolution layer in each branch of the model as the target layer for calculating the derivative, because convolutional layers naturally retain spatial information that is lost in fully connected layers. Hence, the last convolutional layers are expected to have the best compromise between high-level semantics and detailed spatial information [selvaraju2017grad]. The neurons in these layers look for semantic class-specific information in the image.

Two-stage object detector (Faster-RCNN). Two-stage object detectors, such as Faster-RCNN, contain two phases. In the first stage, the image is passed through the stacked convolution layers of the backbone and the Feature Pyramid Network (FPN) [lin2017feature] to extract features; the FPN includes four branches to detect objects of different sizes. Subsequently, the Region Proposal Network (RPN) identifies potential object-containing regions, which are then resized uniformly via the Region of Interest (ROI) Pooling layer. For a two-stage object detector, we use the convolution layers in the FPN as the target layers because they are the last layers of the feature extractor that contain spatial information.

3.2. Object Localization with Gradient

Most detectors, such as Faster-RCNN [ren2015faster] and PAFNet [xin2021pafnet], use the anchor box technique to predict bounding boxes, whereas YOLOX [ge2021yolox] is an anchor-free detector. In the final feature map, each pixel predicts $N$ bounding boxes with the anchor box technique and one bounding box with the anchor-free technique. To find the pixel representing the box that we aim to explain, we take the derivative of the target box with respect to the final feature map to get the location map $G_k^{l(c)}$ as follows:

\begin{equation}
G_k^{l(c)} = \frac{\partial S^c}{\partial A_k^l} \tag{2}
\end{equation}

where $G_k^{l(c)}$ denotes the gradient map of layer $l$ for feature map $k$, and $\frac{\partial S^c}{\partial A_k^l}$ is the derivative of the target class score $S^c$ with respect to the feature map $A_k^l$. In the regression task of most one-stage object detectors, a $1\times 1$ convolution is used to predict the bounding box, so in the backward pass the gradient map $G$ has a value at only one pixel.

In a two-stage object detector, such as Faster-RCNN, the regression and classification tasks are in two separate branches, so we tailor G-CAME for two-stage models as follows. First, we calculate the partial derivative of the class score with respect to each feature map of the selected layers. Faster-RCNN has four branches for detecting objects, and we choose the last convolution layer of each branch to calculate the derivative. When we take the derivative of the class score with respect to the target layer, the gradient map $G_k^{l(c)}$ has a value at more than one pixel because anchor boxes are created in the next phase, namely the detecting phase. The ROI pooling layer replaces the $1\times 1$ convolution, and it is in a separate branch from the classification stage. Thus, we cannot get the pixel representing the object’s center directly from the gradient map. To solve this issue, we set the pixel with the highest value in the gradient map as the center of the Gaussian mask. We assume that the area around the highest-value pixel likely contains relevant features.
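The center selection described above can be sketched as a plain arg-max over a gradient map. This is only an illustration under stated assumptions: the gradient map is given as a single 2-D list of floats (how several feature maps are reduced to one map is not specified here), and `locate_center` is a hypothetical helper name.

```python
from typing import List, Tuple


def locate_center(grad_map: List[List[float]]) -> Tuple[int, int]:
    """Return (row, col) of the largest value in a 2-D gradient map."""
    best_value = float("-inf")
    best_pos = (0, 0)
    for i, row in enumerate(grad_map):
        for j, value in enumerate(row):
            if value > best_value:
                best_value, best_pos = value, (i, j)
    return best_pos


# Toy gradient map with several non-zero pixels, as in the two-stage case.
grad_map = [
    [0.0, 0.1, 0.0],
    [0.2, 0.9, 0.3],
    [0.0, 0.1, 0.0],
]
center = locate_center(grad_map)
```

When several pixels share the maximum, this sketch keeps the first one in row-major order; the source does not say how ties are resolved.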

3.3. Weighting Feature Map via Gradient-based method

Following GradCAM [selvaraju2017grad] for classification, we adopt a gradient-based method to obtain the weight of each feature map. As the values in the gradient map can be either positive or negative, we divide the $k$ feature maps into two parts ($k_1$ and $k_2$, with $k_1 + k_2 = k$): those with a positive gradient, $A_k^{c(+)}$, and those with a negative gradient, $A_k^{c(-)}$. The weight $\alpha_k^c$ of each feature map $k$ of target layer $l$ is the mean value of the gradient map $G_k^{l(c)}$. A negative $\alpha$ is considered to reduce the target score, so we sum the two parts separately and then subtract the negative part from the positive one (Eq. 5) to get a smoother saliency map. Finally, we apply the $ReLU$ function to remove the pixels that do not contribute to the prediction.

\begin{align}
A_{k_2}^{c(-)} &= \alpha_{k_2}^{c(-)} A_{k_2}^{c} \tag{3} \\
A_{k_1}^{c(+)} &= \alpha_{k_1}^{c(+)} A_{k_1}^{c} \tag{4} \\
L^{c}_{\text{CAM}} &= ReLU\bigg(\sum_{k_1} A_{k_1}^{c(+)} - \sum_{k_2} A_{k_2}^{c(-)}\bigg) \tag{5}
\end{align}

Because GradCAM can only explain classification models, it highlights all objects of the same class $c$. By detecting the target object’s location, we can tailor G-CAME to the object detection problem by explaining only one target object.
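The weighting of Eq. 3 to Eq. 5 can be sketched in a few lines. The sketch follows the equations literally as written (including the sign convention of Eq. 5) and uses plain Python lists to stay dependency-free; `weighted_cam` and `mean_value` are hypothetical helper names, and a feature map whose mean gradient is exactly zero is simply treated as contributing nothing.

```python
from typing import List

Map2D = List[List[float]]


def mean_value(m: Map2D) -> float:
    """Mean of all entries of a 2-D map (the weight alpha of one feature map)."""
    values = [v for row in m for v in row]
    return sum(values) / len(values)


def weighted_cam(feature_maps: List[Map2D], grad_maps: List[Map2D]) -> Map2D:
    """Combine feature maps following Eq. 3-5 as written."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    pos_sum = [[0.0] * w for _ in range(h)]
    neg_sum = [[0.0] * w for _ in range(h)]
    for a_k, g_k in zip(feature_maps, grad_maps):
        alpha = mean_value(g_k)
        # Positive-gradient maps go to the k1 part, the rest to the k2 part.
        target = pos_sum if alpha > 0 else neg_sum
        for i in range(h):
            for j in range(w):
                target[i][j] += alpha * a_k[i][j]  # Eq. 4 (k1) or Eq. 3 (k2)
    # Eq. 5: subtract the negative part from the positive part, then apply ReLU.
    return [[max(0.0, pos_sum[i][j] - neg_sum[i][j]) for j in range(w)]
            for i in range(h)]


# Two toy 1 x 2 feature maps: the first has a positive mean gradient, the second a negative one.
feature_maps = [[[1.0, 2.0]], [[4.0, 0.0]]]
grad_maps = [[[0.5, 0.5]], [[-0.25, -0.25]]]
cam = weighted_cam(feature_maps, grad_maps)
```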

3.4. Masking Target Region with Gaussian Distribution

To deal with the localization issue, we propose to use a Gaussian distribution to estimate the region around the object’s center. Because the gradient map shows the target object’s location, we estimate the object region around the pixel representing the object’s center by using a Gaussian mask as the weight for each pixel in the weighted feature map $k$. The Gaussian kernel is defined as:

\begin{equation}
G_{\sigma} = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{x^{2}+y^{2}}{2\sigma^{2}}\right) \tag{6}
\end{equation}

where $\sigma$ is the standard deviation of the values in the Gaussian kernel and controls the kernel size $\kappa$. $x$ and $y$ are two linear-space vectors filled with values in the range $[1,\kappa]$, one vertical and the other horizontal. The bigger $\sigma$ is, the larger the highlighted region. For each feature map $k$ in layer $l$, we apply the Gaussian kernel to get the region of the target object and then sum all these weighted feature maps. In general, we slightly adjust the weighted feature map (Eq. 5) to get the final saliency map, as shown in Eq. 7:

\begin{equation}
L^{c}_{\text{GCAME}} = ReLU\bigg(\sum_{k_1} G_{\sigma(k_1)} \odot A_{k_1}^{c(+)} - \sum_{k_2} G_{\sigma(k_2)} \odot A_{k_2}^{c(-)}\bigg) \tag{7}
\end{equation}

3.4.1. Choosing $\sigma$ for Gaussian mask

The Gaussian masks are applied to all feature maps, with the kernel size being the size of each feature map, and $\sigma$ is calculated as in Eq. 10.

\begin{align}
R &= \log\left|\frac{1}{Z}\sum_{i}\sum_{j} G_k^{l(c)}\right| \tag{8} \\
S &= \sqrt{\frac{H\times W}{h\times w}} \tag{9} \\
\sigma &= R\log{S} \times \frac{3}{\left\lfloor \frac{\sqrt{h\times w}-1}{2} \right\rfloor} \tag{10}
\end{align}

where $\sigma$ combines two terms. In the first term, we calculate the expansion factor, where $R$ represents the importance of the location map $G_k^{l(c)}$ and $S$ is the scale between the original image size ($H\times W$) and the feature map size ($h\times w$). We use the logarithm function to adjust the value of the first term so that it matches the size of the gradient map. For multi-scale object detectors, each scale level has a different $S$. In the second term, we choose the Gaussian kernel size based on the $3\sigma$-rule [pukelsheim1994three], as in Eq. 11, and take the inverse value.

\begin{equation}
\kappa = 2\times\lceil 3\sigma \rceil + 1 \tag{11}
\end{equation}
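As a rough numerical illustration of Eq. 8 to Eq. 11, the sketch below evaluates the formulas for one gradient map. Several points are assumptions rather than statements from the text: $Z$ is taken to be the number of pixels $h \times w$, the logarithms are natural logarithms, the mean gradient is non-zero, and the feature map is at least $3\times 3$ so that the floor term in Eq. 10 is non-zero. The function names are hypothetical, and no edge cases are handled.

```python
import math


def gaussian_sigma(grad_map, image_size):
    """Evaluate Eq. 8-10 for one gradient map.

    grad_map: 2-D list of floats with shape (h, w).
    image_size: (H, W) of the input image.
    Assumption: Z = h * w, and log is the natural logarithm.
    """
    h, w = len(grad_map), len(grad_map[0])
    H, W = image_size
    z = h * w
    mean_grad = sum(v for row in grad_map for v in row) / z
    r = math.log(abs(mean_grad))                         # Eq. 8
    s = math.sqrt((H * W) / (h * w))                     # Eq. 9
    half_width = math.floor((math.sqrt(h * w) - 1) / 2)  # floor term of Eq. 10
    return r * math.log(s) * 3 / half_width              # Eq. 10


def kernel_size(sigma):
    """Kernel size from the 3-sigma rule (Eq. 11)."""
    return 2 * math.ceil(3 * sigma) + 1


# Toy example: a 4 x 4 gradient map on a 64 x 64 image.
sigma = gaussian_sigma([[math.e] * 4 for _ in range(4)], (64, 64))
```

Note that $R$ in Eq. 8 is negative whenever the absolute mean gradient is below 1; because Eq. 6 only uses $\sigma^2$, the sign of $\sigma$ does not change the resulting mask.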

3.4.2. Gaussian mask generation

We generate each Gaussian mask with the following steps:

  1. Create a grid filled with values in the range $[0,w]$ for the width and $[0,h]$ for the height ($w$ and $h$ are the dimensions of the location map $G_k^{l(c)}$).

  2. Subtract from the grid the position $(i_t,j_t)$, where $(i_t,j_t)$ is the center pixel of the target object on the location map.

  3. Apply the Gaussian formula (Eq. 6), with $\sigma$ as the expansion factor from Eq. 10, to get the Gaussian distribution for all values in the grid.

  4. Normalize all values to the range $[0,1]$.

By normalizing all values to the range $[0,1]$, the Gaussian masks keep only the region related to the object we aim to explain and remove other unrelated regions in the weighted feature map.
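The four steps above can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes integer pixel indices $0,\ldots,h-1$ and $0,\ldots,w-1$ for the grid, a non-zero $\sigma$, and min-max scaling for the normalization step, none of which is spelled out in the text. `gaussian_mask` is a hypothetical helper name.

```python
import math


def gaussian_mask(h, w, center, sigma):
    """Build an h x w Gaussian mask centered on center = (i_t, j_t)."""
    i_t, j_t = center
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    mask = []
    for i in range(h):                  # step 1: grid over the height
        row = []
        for j in range(w):              # step 1: grid over the width
            dy, dx = i - i_t, j - j_t   # step 2: shift the grid to the object center
            # step 3: Gaussian formula of Eq. 6
            row.append(norm * math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2)))
        mask.append(row)
    lo = min(min(r) for r in mask)
    hi = max(max(r) for r in mask)
    if hi == lo:                        # constant mask, e.g. a 1 x 1 map
        return [[1.0] * w for _ in range(h)]
    # step 4: min-max normalization to [0, 1] (assumed normalization scheme)
    return [[(v - lo) / (hi - lo) for v in r] for r in mask]


mask = gaussian_mask(5, 5, center=(2, 2), sigma=1.0)
```

The mask equals 1 at the chosen center and decays toward 0 at the farthest pixels, so multiplying it element-wise with a weighted feature map suppresses activations far from the target object.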

4. Experiments and Results

We performed our experiments on the MS-COCO 2017 [lin2014microsoft] dataset with 5000 validation images. The models in our experiments are YOLOX-l (one-stage model) and Faster-RCNN (two-stage model). All experiments are conducted on an NVIDIA Tesla P100 GPU. G-CAME’s inference time depends on the number of feature maps in the selected layer $l$. With YOLOX-l and 256 feature maps, G-CAME takes roughly 0.5 seconds per object.

4.1. Sanity check

To validate whether the saliency map is a faithful explanation, we perform a sanity check [adebayo2018sanity] with Cascading Randomization and Independent Randomization. In the Cascading Randomization approach, we randomly choose five convolution layers as the test layers. Then, for each layer between the selected layer and the top layer, we remove the pre-trained weights, reinitialize them with a normal distribution, and run G-CAME to get the explanation for the target object. In contrast, in Independent Randomization, we reinitialize only the weights of the selected layer and retain the other pre-trained weights. The sanity check results show that G-CAME is sensitive to model parameters and can produce valid results, as shown in Fig. 2.
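The difference between the two randomization schemes comes down to which layers are reinitialized for a given test layer. The sketch below only selects those layers; it does not touch real model weights. It assumes the layer list is ordered from input to output and that "top layer" means the layer closest to the output; the layer names and the `layers_to_randomize` helper are hypothetical.

```python
from typing import List


def layers_to_randomize(layers: List[str], selected: str, cascading: bool) -> List[str]:
    """Return the layers whose weights are reinitialized for one sanity-check run.

    layers: layer names ordered from input to output (assumption).
    cascading=True: the selected layer and every layer above it.
    cascading=False: only the selected layer (independent randomization).
    """
    idx = layers.index(selected)
    return layers[idx:] if cascading else [selected]


# Hypothetical layer names, for illustration only.
layers = ["backbone.conv1", "backbone.conv2", "neck.conv", "head.cls_conv"]
cascading = layers_to_randomize(layers, "neck.conv", cascading=True)
independent = layers_to_randomize(layers, "neck.conv", cascading=False)
```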

Figure 2. The result of Cascading Randomization and Independent Randomization for five layers from top to bottom of the YOLOX model. Chosen layers in the head part do not include the layer in the regression branch. The result shows G-CAME is sensitive to the model’s parameters.

4.2. Qualitative Evaluation

We performed a qualitative evaluation of G-CAME’s saliency maps in comparison with D-RISE. We use D-RISE’s default parameters [petsiuk2021black], where each grid’s size is $16\times 16$, the probability of each grid’s occurrence is $0.5$, and the number of samples for each image is $4000$. For G-CAME, we choose the target layers as described in Sec. 3.1 to calculate the derivative.

Fig. 3 shows the results of G-CAME compared with GradCAM and D-RISE. GradCAM is only applicable to the classification task, as it shows the saliency maps for all objects in the same class. Among XAI methods for object detectors, both G-CAME and D-RISE can deliver explanations for a specific object, but G-CAME generates saliency maps with significantly less random noise than D-RISE.

Figure 3. Visualization results of GradCAM, D-RISE, and G-CAME on samples of MS-COCO 2017 dataset. G-CAME can generate the least noisy saliency maps for explaining a specific object.

4.3. Quantitative Localization Evaluation

We use two standard metrics, Pointing Game [zhang2018top] and Energy-based Pointing Game [wang2020score], to compare the correlation between an object’s saliency map and human-labeled ground truth. The results are shown in Table 1.

4.3.1. Pointing Game (PG)

To evaluate XAI methods via PG metric, firstly, we run the model on the dataset and get the bounding boxes that best match the ground truth for each class on each image. A hit𝑖𝑡hititalic_h italic_i italic_t is scored if the highest point of the saliency map lies inside the ground truth; otherwise, a miss𝑚𝑖𝑠𝑠missitalic_m italic_i italic_s italic_s is counted. The pointing game score for each image is calculated by

\begin{equation}
PG = \frac{\#Hits}{\#Hits + \#Misses} \tag{12}
\end{equation}

A good explanation should achieve a high PG score.
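The PG computation described above can be sketched as follows. This is a minimal illustration rather than the authors’ implementation: the function names, the `(x1, y1, x2, y2)` pixel box format, and the use of one saliency map per evaluated box are assumptions made for the example.

```python
import numpy as np

def pointing_game_hit(saliency, box):
    """Return True if the saliency maximum lies inside the box.

    Assumption: box = (x1, y1, x2, y2) in pixel coordinates, inclusive.
    """
    # Location (row, col) of the highest saliency value.
    row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
    x1, y1, x2, y2 = box
    return bool(x1 <= col <= x2 and y1 <= row <= y2)

def pointing_game_score(saliency_maps, boxes):
    """PG = #Hits / (#Hits + #Misses) over paired maps and boxes."""
    hits = sum(pointing_game_hit(s, b) for s, b in zip(saliency_maps, boxes))
    return hits / len(boxes)

# Toy example: one hit and one miss.
sal_a = np.zeros((8, 8))
sal_a[2, 3] = 1.0          # maximum at row 2, col 3
sal_b = np.zeros((8, 8))
sal_b[7, 7] = 1.0          # maximum at row 7, col 7
pg = pointing_game_score([sal_a, sal_b], [(2, 1, 5, 4), (0, 0, 3, 3)])
```

Here the first maximum falls inside its box and the second does not, so `pg` is 0.5.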

4.3.2. Energy-Based Pointing Game (EBPG)

EBPG [wang2020score] calculates how much the energy of the saliency map falls inside the bounding box. Similar to the PG score, a good explanation is considered to have a higher EBPG. EBPG formula is defined as follows:

\begin{equation}
EBPG = \frac{\sum L^{c}_{(i,j)\in bbox}}{\sum L^{c}_{(i,j)\in bbox} + \sum L^{c}_{(i,j)\notin bbox}} \tag{13}
\end{equation}
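As a rough sketch of the EBPG score, the fraction of saliency energy that falls inside the bounding box can be computed as below. The function name, the inclusive `(x1, y1, x2, y2)` box format, and the assumption of a non-negative saliency map are illustrative choices, not details taken from the paper.

```python
import numpy as np

def ebpg(saliency, box):
    """Energy inside the box divided by the total energy of the map.

    Assumptions: saliency is a non-negative 2D array and
    box = (x1, y1, x2, y2) in pixel coordinates, inclusive.
    """
    x1, y1, x2, y2 = box
    total = saliency.sum()
    if total == 0:
        return 0.0  # empty map: no energy anywhere
    inside = saliency[y1:y2 + 1, x1:x2 + 1].sum()
    return float(inside / total)

# Toy example: a uniform 4x4 map with a 2x2 box covering a quarter of it.
uniform = np.ones((4, 4))
score = ebpg(uniform, (0, 0, 1, 1))
```

For the uniform map, a quarter of the energy lies in the box, so `score` is 0.25.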

PG and EBPG results are reported in Table 1. Specifically, 67.1% of the energy of G-CAME’s saliency map falls into the ground truth bounding box, compared with only 18.4% for D-RISE. In other words, G-CAME drastically reduces noise in the saliency map. In the PG evaluation, G-CAME also gives better results than D-RISE: 98% of the highest-valued pixels lie inside the correct bounding box, while this number for D-RISE is 86%.

\begin{tabular}{lcc}
\hline
Method & D-RISE & G-CAME (Our) \\
\hline
PG\% $\uparrow$ (Overall $|$ Tiny object) & 0.86 $|$ 0.127 & \textbf{0.98} $|$ \textbf{0.158} \\
EBPG\% $\uparrow$ (Overall $|$ Tiny object) & 0.184 $|$ 0.009 & \textbf{0.671} $|$ \textbf{0.261} \\
Confidence Drop\% $\uparrow$ & \textbf{42.3} & 36.8 \\
Information Drop\% $\downarrow$ & 31.58 & \textbf{29.15} \\
Running time (s) $\downarrow$ & 252 & \textbf{0.435} \\
\hline
\end{tabular}

Table 1. Comparison of D-RISE and G-CAME (Our) on the MS-COCO 2017 validation dataset with the YOLOX model. Evaluation metrics include PG%, EBPG%, Confidence Drop, Information Drop, and Running time. Higher or lower scores are better as indicated by $\uparrow$/$\downarrow$. The best results are shown in bold.

4.3.3. Bias in Tiny Object Detection

Explaining tiny objects detected by the model can be a challenge for XAI methods. In particular, the saliency map may be biased toward the neighboring region. This issue can worsen when multiple tiny objects partially or fully overlap because the saliency map stays in the same location for every object. In our experiments, we define a tiny object by the ratio of the predicted bounding box area to the input image area (640×640 in YOLOX). An object is considered tiny when this ratio is less than or equal to 0.005.

In Fig. 4, we compare G-CAME with D-RISE in explaining tiny object predictions for two cases. In the first case (Fig. 4a), we test the performance of D-RISE and G-CAME in explaining two tiny objects of the same class. The result shows that D-RISE fails to distinguish the two “traffic lights”: their saliency maps are nearly identical. In the second case, where multiple objects of different classes overlap (Fig. 4b), the saliency maps produced by D-RISE hardly focus on one specific target. The saliency map for the “surfboard” even covers the “person”, and vice versa. The cause may be the grid size in D-RISE, but switching to a much smaller grid size can leave the detector unable to make a prediction. In contrast, G-CAME clearly localizes the target object in both cases and reduces the saliency map’s bias toward unrelated regions.

We also evaluate both methods quantitatively on tiny object predictions only. The MS-COCO 2017 validation dataset has more than 8000 tiny objects, and the results are reported in Table 1. For EBPG, our method outperforms D-RISE, with more than 26% of the saliency map’s energy falling into the predicted box, while this figure for D-RISE is only 0.9%; that is, most of the energy in D-RISE’s explanation does not focus on the correct target. For the PG score, instead of evaluating one pixel, we assess all pixels having the same value as the pixel with the highest value. The result also shows that G-CAME’s explanation has better accuracy than D-RISE’s.
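The tiny-object criterion used in this section (box area divided by input image area, with a threshold of 0.005 on a 640×640 input) can be written as a small check. The function name and the `(x1, y1, x2, y2)` box format are assumptions for illustration only.

```python
def is_tiny(box, image_size=(640, 640), threshold=0.005):
    """Return True if the box covers at most `threshold` of the image area.

    Assumptions: box = (x1, y1, x2, y2) in pixels and
    image_size = (width, height) of the detector input.
    """
    x1, y1, x2, y2 = box
    box_area = max(x2 - x1, 0) * max(y2 - y1, 0)
    image_area = image_size[0] * image_size[1]
    return box_area / image_area <= threshold

small = is_tiny((100, 100, 140, 140))   # 40x40 box: ratio of about 0.0039
large = is_tiny((100, 100, 200, 200))   # 100x100 box: ratio of about 0.0244
```

With the default 640×640 input, the threshold corresponds to a box area of 2048 pixels, so the 40×40 box counts as tiny and the 100×100 box does not.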

Figure 4. The saliency maps of D-RISE and G-CAME for tiny object prediction. We evaluate them in two cases: (a) multiple tiny objects from the same class lying close together and (b) multiple tiny objects from different classes lying close together. In both cases, G-CAME can clearly identify each object in its explanations.

4.4. Quantitative Faithfulness Evaluation

Another essential aspect of an XAI method is the ability to ensure the explanation’s completeness and consistency in the model’s predictions. In this section, we employ the Confidence Drop and Information Drop scores to evaluate G-CAME and D-RISE on the YOLOX model with the MS-COCO 2017 dataset.

4.4.1. Confidence Drop

We employ the Average Drop metric to evaluate the confidence change [chattopadhay2018grad, fu2020axiom, ramaswamy2020ablation] in the model’s prediction for the target object when the explanation is used to perturb the input. In other words, when we remove the important regions, the confidence score of the target box should drop. The Average Drop is defined as:

\begin{equation}
AD = \frac{1}{N}\sum_{i=1}^{N}\frac{\max(P_{c}(I_{i}) - P_{c}(\tilde{I_{i}}), 0)}{P_{c}(I_{i})} \times 100 \tag{14}
\end{equation}

where:

\begin{equation}
\tilde{I_{o}} = I \odot (1 - M_{o}) + \mu M_{o} \tag{15}
\end{equation}
\begin{equation}
P_{c}(\tilde{I}) = IoU(L_{i}, L_{j}) \cdot p_{c(L_{j})} \tag{16}
\end{equation}

Here, we tailor the original formula of Average Drop for the object detection model. In Eq. 15, we create a new input image masked by the explanation $M$ of G-CAME, where $\mu$ is the mean value of the original image. For $M$, we keep only the 20% of pixels with the highest values in the original explanation and set the rest to 0. This minimizes the explanation’s noise, so the saliency map focuses on the regions that most influence the prediction.

In Eq. 16, to compute the probability $P_{c}(\tilde{I})$, we first calculate the pair-wise $IoU$ of each box $L_{j}$ predicted on the perturbed image $\tilde{I}$ with the box $L_{i}$ predicted on the original image and take the one with the highest value. After that, we multiply this $IoU$ by the corresponding class score $p_{c(L_{j})}$ of the box. In calculating $P_{c}(I_{i})$, the $IoU$ equals 1, so the value remains the original confidence score. Hence, if the explanation is faithful, the confidence drop should be high. However, removing several pixels can penalize methods that produce saliency maps with connected and coherent regions. Specifically, pixels representing the object’s edges are more meaningful than others in the middle [kapishnikov2019xrai]. For example, pixels representing the dog’s tail are easier to recognize than others lying on the dog’s body.
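The steps behind Eqs. 14 to 16 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors’ code: all function names are hypothetical, the saliency map is assumed to be normalized to [0, 1], boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the detector itself is left out (its boxes and class scores on the perturbed image are passed in as plain lists).

```python
import numpy as np

def top_fraction_mask(saliency, keep=0.2):
    """Keep the top `keep` fraction of saliency values and zero the rest."""
    threshold = np.quantile(saliency, 1.0 - keep)
    return np.where(saliency >= threshold, saliency, 0.0)

def perturb_image(image, mask):
    """Eq. 15: I~ = I * (1 - M) + mu * M, with mu the mean of the image."""
    mu = image.mean()
    m = mask[..., None] if image.ndim == 3 else mask  # broadcast over channels
    return image * (1.0 - m) + mu * m

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detection_score(original_box, perturbed_boxes, perturbed_scores):
    """Eq. 16: IoU of the best-overlapping box times its class score."""
    if not perturbed_boxes:
        return 0.0  # nothing detected on the perturbed image
    ious = [iou(original_box, b) for b in perturbed_boxes]
    j = int(np.argmax(ious))
    return ious[j] * perturbed_scores[j]

def average_drop(original_scores, perturbed_scores):
    """Eq. 14: mean relative confidence drop, in percent."""
    p = np.asarray(original_scores, dtype=float)
    q = np.asarray(perturbed_scores, dtype=float)
    return float(np.mean(np.maximum(p - q, 0.0) / p) * 100.0)

# Toy example with made-up numbers.
saliency = (np.arange(10, dtype=float) / 9.0).reshape(2, 5)
mask = top_fraction_mask(saliency)          # keeps the 2 largest of 10 values
image = np.zeros((2, 5, 3))
perturbed = perturb_image(image, mask)
best = detection_score((0, 0, 10, 10), [(0, 0, 10, 10), (20, 20, 30, 30)], [0.9, 0.8])
ad = average_drop([0.8, 0.5], [0.4, 0.6])
```

In the toy example the first object loses half of its confidence and the second gains confidence, which the `max(., 0)` term clips to zero, so the average drop is 25%.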

4.4.2. Information Drop

In addition to the Confidence Drop score, we measure the faithfulness of the method via the Information Drop score. We construct a bokeh image by blurring the input image while keeping the salient regions in focus, and compare its information level with that of the original image. To measure the information, we use the WebP [Webp] format and calculate the Information Drop score as the ratio of the compressed size of the bokeh image to that of the original image [kapishnikov2019xrai].

4.5. Evaluation

Table 1 highlights the strengths of G-CAME compared to D-RISE. D-RISE achieves a 42.3% Confidence Drop by spreading its saliency map across the image, leading to a significant but less targeted reduction in confidence. Conversely, G-CAME maintains focus on the target object, resulting in a lower confidence drop (36.8%) that reflects a more precise and relevant explanation. Crucially, G-CAME outperforms D-RISE in Information Drop with 29.15% versus 31.58%, indicating superior preservation of the original image’s content. Additionally, our method offers a significant speed advantage, delivering explanations in under a second (0.435 s), as opposed to D-RISE’s runtime of about four minutes (252 s). These results demonstrate G-CAME’s efficiency in providing focused, relevant, and quick explanations for object detection models.

5. Conclusion

In this paper, we proposed G-CAME, a novel CAM-based XAI method leveraging the Gaussian kernel to explain one-stage and two-stage object detection models. The experimental results show that our method can plausibly explain the model’s predictions and reduce the bias in tiny object detection. Moreover, our method’s runtime is short (under one second per explanation), overcoming the time constraint of existing region-based methods while also reducing the noise in the saliency map.

Acknowledgment

This work was partially supported by the NBIF Talent Recruitment Fund (TRF2003-001) and the UNB-FCS Startup Fund (22-23 START UP/ H CAO).

\printbibliography[heading=subbibintoc]