1. Introduction
The number of aging infrastructures is increasing around the world [1,2], and techniques that help engineers maintain them efficiently are required. In maintenance work, engineers usually perform visual inspections of distresses that occur in infrastructures and record their progress as deterioration levels [3]. Specifically, engineers first capture distress images during on-site inspections and then determine the deterioration levels at post-inspection conferences using those images. To support such inspections, methods for estimating the deterioration level from distress images [4] and for automatic distress image capturing by robots [5] have been proposed. Notably, several skilled engineers must make the determinations because the deterioration level decided at the conference is the final judgment. Their burden is high because of the large number of distress images to be examined at the conference. Therefore, it is necessary to reduce this burden by accurately estimating the deterioration level with machine learning applied to distress images.
Convolutional neural networks (CNNs) [6] have improved performance in various image recognition tasks [7]. However, general CNNs for estimation tasks output only the estimated class probabilities and do not explain their estimation results. Because misjudgment of the deterioration levels may endanger people using the infrastructure, high reliability is required. When engineers refer to the results of deterioration level estimation, it is difficult for them to trust estimation results whose reasons are unknown. Therefore, methods that can explain the reasons for the estimation results are required. Recently, several methods have been proposed to improve estimation performance and interpretability using attention, which enables focusing on important features [8,9]. In image recognition, the attention branch network (ABN), which uses an attention map generated during CNN-based estimation both to improve performance and to explain the results, has been proposed [10]. ABN has been used in various fields, such as self-driving, deterioration level estimation, and control of household service robots [11,12,13,14].
Figure 1 shows an overview of ABN. The ABN model has a feature extractor in the shallow part near the input and two branches, called an attention branch and a perception branch, in the deep part near the output. The attention branch performs estimation before the perception branch using the feature maps output from the feature extractor. It also generates an attention map that represents the regions of interest in the estimation. The generated attention map is used in the attention mechanism to highlight important regions in the feature maps. The perception branch then performs the final estimation using the feature maps output from the attention mechanism. The attention map is a spatial annotation that enables focusing on the important regions in the feature maps. However, the estimation performance is degraded when the actual regions of distresses differ from the regions highlighted by the attention map [15]. The influence of an attention map that does not highlight important regions needs to be reduced based on the reliability of whether its highlighted regions correspond to actual distress regions. Recently, several studies have reported on the use of attention while considering its reliability [16,17]. For example, the literature [16] has reported the effectiveness of using only useful attention by considering the relationship between attention and query in image captioning. Moreover, it has been reported in [17] that attention incorporating the concept of uncertainty via a Bayesian framework is effective for disease risk prediction. Furthermore, the literature [18] has reported the effectiveness of learning from data with "noisy labels", i.e., labels of uncertain reliability, while considering the confidence in the labels. Therefore, in ABN-based estimation, it should likewise be effective to reduce the influence of highlighted regions that are irrelevant to the actual distress regions. In other words, the performance of deterioration level estimation is expected to improve by preferentially utilizing attention maps that highlight regions related to the distress.
To investigate the reliability of the attention map, we focus on the confidence in the estimated class probabilities corresponding to the attention map. In ABN-based estimation, the attention branch outputs the estimated class probabilities and the corresponding attention map representing the regions of interest. The confidence in these class probabilities can be used to investigate the reliability of the attention map. The entropy calculated from the estimated class probabilities has been widely used to estimate their uncertainty [19]: the smaller the uncertainty, the higher the confidence. An attention map corresponding to class probabilities with low confidence is expected to contain considerable noise, i.e., it does not accurately highlight the regions important for the estimation. Therefore, the confidence in the attention map can be calculated from the entropy of the estimated class probabilities, and the performance of deterioration level estimation can be improved by controlling the influence of the attention map on the feature maps according to this confidence.
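This entropy-based notion of confidence can be sketched in a few lines of Python. The function names are ours, not from the paper; the confidence is normalized by the maximum entropy $\log N$ so that it lies in $[0, 1]$, consistent with how the confidence is used later in Section 2.2:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence(probs):
    """Map entropy to a confidence in [0, 1]: 1 for a one-hot
    (fully certain) prediction, 0 for a uniform (maximally
    uncertain) one.  The maximum entropy over N classes is log N."""
    return 1.0 - entropy(probs) / math.log(len(probs))

# A peaked distribution gives high confidence; a uniform one gives zero.
peaked  = confidence([0.97, 0.01, 0.01, 0.01])   # high (close to 1)
uniform = confidence([0.25, 0.25, 0.25, 0.25])   # zero (up to rounding)
```

A peaked distribution means the attention branch is sure of its class, so the corresponding attention map is deemed reliable; a near-uniform distribution drives the confidence toward zero.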
In this paper, deterioration level estimation based on confidence-aware ABN (ConfABN), which can control the influence of an attention map on feature maps according to the confidence, is proposed for infrastructure inspection.
Figure 2 shows an overview of ConfABN. We improve the conventional attention mechanism of ABN so that an attention map with high confidence has a strong influence on the feature maps. Specifically, we input the entropy-based confidence calculated from the estimated class probabilities of the attention branch into our attention mechanism in addition to the original inputs. In our attention mechanism, the feature maps are multiplied by an attention map weighted according to the confidence, so that a reliable attention map is used strongly and an unreliable one weakly. This attention mechanism, which can consider the confidence, is called the confidence-aware attention mechanism and is the main contribution of this paper. The perception branch performs the final estimation using the feature maps obtained from the confidence-aware attention mechanism. Consequently, accurate deterioration level estimation can be realized by using the attention map while considering its confidence. ConfABN provides higher visual explainability for estimation results by presenting spatial attention maps than methods that employ a channel attention mechanism [20] to weight each channel of the feature maps, such as SENet [9]. Therefore, ConfABN is suitable for supporting the inspection of infrastructures that require high reliability. Furthermore, since ConfABN can provide the confidence in the attention map of the estimation results, it achieves higher explainability than previous attention-based methods such as ABN [10].
The remainder of this paper is organized as follows. Section 2 describes deterioration level estimation based on ConfABN. Section 3 presents experimental results that verify the effectiveness of the proposed method. Finally, Section 4 presents the conclusions.
2. Deterioration Level Estimation Based on ConfABN
In this section, the estimation of deterioration level based on ConfABN is explained. As shown in Figure 2, ConfABN consists of three modules: a feature extractor, an attention branch, and a perception branch. The feature extractor is constructed with convolutional layers and calculates feature maps $\mathbf{F} \in \mathbb{R}^{C \times H \times W}$ ($C$ being the number of channels of the feature maps) from an input distress image $\mathbf{X}$. Using the feature maps $\mathbf{F}$, the attention branch outputs an attention map $\mathbf{A}$ and estimated class probabilities $\mathbf{p}_{\mathrm{att}}$ used to train the ConfABN model. Then we calculate improved feature maps $\mathbf{F}'$ from the feature maps $\mathbf{F}$ and the attention map $\mathbf{A}$ based on the confidence-aware attention mechanism. The perception branch outputs the final estimated class probabilities $\mathbf{p}_{\mathrm{per}}$ using the improved feature maps $\mathbf{F}'$. It is worth noting that the feature extractor, attention branch, and perception branch are constructed by partitioning a CNN model commonly used in general classification tasks, such as ResNet [21], as described in detail in [10]. Sections 2.1, 2.2 and 2.3 explain the attention branch, the confidence-aware attention mechanism, and the perception branch, respectively. Furthermore, Section 2.4 describes the training of ConfABN.
2.1. Attention Branch
The feature maps $\mathbf{F}$ calculated by the feature extractor are input into the attention branch. The attention branch has multiple convolutional layers on the input side. On the output side, it has a global average pooling layer for calculating the class probabilities $\mathbf{p}_{\mathrm{att}}$ and a convolutional layer for calculating the attention map $\mathbf{A}$. The dimension of the output of the global average pooling layer is the number of classes. Since this layer has the softmax function as its activation function, it can output the estimated class probabilities $\mathbf{p}_{\mathrm{att}}$ of the attention branch. These estimated class probabilities are used in the confidence-aware attention mechanism, as described in detail in Section 2.2. Furthermore, they are used for the training of the attention branch, as described in Section 2.4. Notably, the layer for calculating the attention map $\mathbf{A}$ is a $1 \times 1$ convolutional layer (kernel size $1 \times 1$, single output channel) with the sigmoid function as its activation function. The obtained attention map $\mathbf{A}$ is used in the confidence-aware attention mechanism to improve the feature maps $\mathbf{F}$ while considering the confidence in the estimated class probabilities $\mathbf{p}_{\mathrm{att}}$.
The generation of the attention map in the attention branch is designed with reference to class activation mapping (CAM) [22], a method for visualizing the regions that a CNN model focuses on during the test phase. As explained in detail in [10], CAM cannot generate the attention map during training because it uses feature maps and the weights of fully connected layers obtained after training. In contrast, the attention branch uses a $1 \times 1$ convolutional layer to compute the attention map in a feedforward process and can therefore output the attention map even during training. However, in the early epochs of training, many ineffective attention maps are likely to be generated since the parameters of the model are not yet sufficiently optimized. The use of such ineffective attention maps in the attention mechanism is a problem of the original ABN. As presented in the next subsection, ConfABN enhances the usefulness of the attention branch by considering the confidence in the attention map during training.
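To make the $1 \times 1$ convolution concrete: at each spatial position it collapses the $C$ channel values into a single scalar via a weighted sum, and the sigmoid squashes that scalar into $(0, 1)$. A minimal pure-Python sketch follows; nested lists stand in for tensors, and the weights are toy hand-set values rather than learned parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_map_1x1(feature_maps, weights, bias=0.0):
    """1x1 convolution collapsing C-channel feature maps (C x H x W,
    nested lists) into a single-channel map in (0, 1): at each spatial
    position, a weighted sum over channels followed by a sigmoid."""
    C = len(feature_maps)
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sigmoid(sum(weights[c] * feature_maps[c][i][j]
                         for c in range(C)) + bias)
             for j in range(W)]
            for i in range(H)]

# Two channels over a 2x2 grid (toy values; in the model the
# weights are optimized during training).
F = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 1.0], [0.0, 0.0]]]
A = attention_map_1x1(F, weights=[2.0, -1.0])
```

Because the sigmoid output is bounded in $(0, 1)$, every spatial position receives a well-defined attention weight regardless of the magnitude of the input features.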
2.2. Confidence-Aware Attention Mechanism
The confidence-aware attention mechanism improves the feature maps $\mathbf{F}$ using the attention map $\mathbf{A}$ and the estimated class probabilities $\mathbf{p}_{\mathrm{att}}$ from the attention branch. In the attention mechanism of the original ABN, the attention map $\mathbf{A}$ is applied to the feature maps $\mathbf{F}$ using the following equation to calculate the improved feature maps $\mathbf{F}'$:
$$\mathbf{F}' = (\mathbb{1} + \mathbf{A}) \odot \mathbf{F},$$
where $\odot$ denotes the Hadamard product and $\mathbb{1}$ denotes the matrix whose elements are all 1 and whose size is equal to that of $\mathbf{A}$. In contrast, in the confidence-aware attention mechanism, the improved feature maps $\mathbf{F}'$ are calculated by applying the attention map $\mathbf{A}$ to the feature maps $\mathbf{F}$ as follows:
$$\mathbf{F}' = (\mathbb{1} + t\mathbf{A}) \odot \mathbf{F},$$
where
$$t = 1 - \frac{E}{\log N},$$
where $t$ ($0 \leq t \leq 1$) denotes the confidence calculated from the entropy $E$. Note that $E$ is calculated using the class probabilities $p_{\mathrm{att}}(n)$ ($n = 1, \ldots, N$; $N$ being the number of the deterioration levels) output from the attention branch as follows:
$$E = -\sum_{n=1}^{N} p_{\mathrm{att}}(n) \log p_{\mathrm{att}}(n).$$
The entropy becomes maximum when the probability is uniformly distributed, and its maximum value, $\log N$, is a constant that depends only on $N$. A large entropy reduces the confidence $t$, and thus the coefficient of the attention map becomes smaller. Therefore, an attention map that seems ineffective because of its low confidence has a smaller influence on the feature maps, whereas an attention map that seems effective because of its high confidence is used strongly. Consequently, the confidence-aware attention mechanism can consider the effectiveness of the attention map, which improves the performance of the perception branch.
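The confidence-weighted update can be sketched as follows, with nested lists standing in for tensors; the function name and toy values are ours:

```python
def confidence_aware_attention(F, A, t):
    """Improved feature maps F' = (1 + t*A) (Hadamard product) F:
    every channel of F is scaled element-wise by (1 + t * A).
    With t = 0 the attention map is ignored (F' = F); with t = 1
    this reduces to the original ABN mechanism F' = (1 + A) * F."""
    H, W = len(A), len(A[0])
    return [[[(1.0 + t * A[i][j]) * channel[i][j] for j in range(W)]
             for i in range(H)]
            for channel in F]

F = [[[1.0, 2.0], [3.0, 4.0]]]   # one channel, 2x2 (toy values)
A = [[1.0, 0.0], [0.5, 1.0]]     # attention map in [0, 1]

unreliable = confidence_aware_attention(F, A, t=0.0)  # map suppressed
reliable   = confidence_aware_attention(F, A, t=1.0)  # full ABN scaling
```

The additive form $(\mathbb{1} + t\mathbf{A})$ guarantees that the mechanism never zeroes out feature values: in the worst case ($t = 0$) the feature maps simply pass through unchanged.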
The confidence-aware attention mechanism can reduce the negative effects of the ineffective attention maps that are likely to be generated early in the training process. Furthermore, it can prevent the negative effects of ineffective attention maps in the test phase. Section 3.2 demonstrates that the distribution of confidence contains many small values in the early stages of training and that ineffective attention maps with small confidence are also generated in the test phase.
2.3. Perception Branch
The perception branch calculates the final estimated class probabilities using the improved feature maps as input. Specifically, in the perception branch, the improved feature maps are first propagated through multiple convolutional layers. Then, the output of the last convolutional layer is input into a global average pooling layer to obtain a feature vector. By inputting the feature vector into a fully connected layer with the softmax function as the activation function, the final estimated class probabilities for the deterioration level are output. Consequently, ConfABN achieves an accurate deterioration level estimation using the feature maps improved by the confidence-aware attention mechanism.
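The perception head just described (global average pooling, a fully connected layer, and softmax) can be sketched in pure Python; all names and toy values below are ours, not from the paper:

```python
import math

def global_average_pool(feature_maps):
    """Average each H x W channel to one scalar -> C-dim feature vector."""
    return [sum(v for row in ch for v in row) / (len(ch) * len(ch[0]))
            for ch in feature_maps]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def perception_head(feature_maps, fc_weights, fc_bias):
    """Pooled vector -> fully connected layer -> softmax, yielding the
    final class probabilities over the deterioration levels."""
    v = global_average_pool(feature_maps)
    logits = [sum(w * x for w, x in zip(row, v)) + b
              for row, b in zip(fc_weights, fc_bias)]
    return softmax(logits)

# Toy improved feature maps: two channels over a 2x2 grid,
# classified into three hypothetical deterioration levels.
F_improved = [[[1.0, 1.0], [1.0, 1.0]],
              [[0.0, 2.0], [2.0, 0.0]]]
probs = perception_head(F_improved,
                        fc_weights=[[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],
                        fc_bias=[0.0, 0.0, 0.0])
```

Global average pooling discards spatial layout, which is why the attention weighting must happen beforehand, on the feature maps themselves.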
2.4. Training of ConfABN
ConfABN is trained in an end-to-end manner using a loss function $L$ calculated from the estimated class probabilities $\mathbf{p}_{\mathrm{att}}$ and $\mathbf{p}_{\mathrm{per}}$. $L$ is defined by the following equation:
$$L = L_{\mathrm{per}} + \lambda L_{\mathrm{att}},$$
where $L_{\mathrm{att}}$ and $L_{\mathrm{per}}$ are the losses calculated by inputting $\mathbf{p}_{\mathrm{att}}$ and $\mathbf{p}_{\mathrm{per}}$ into the cross-entropy loss function, respectively:
$$L_{\mathrm{att}} = -\sum_{n=1}^{N} y_n \log p_{\mathrm{att}}(n), \quad L_{\mathrm{per}} = -\sum_{n=1}^{N} y_n \log p_{\mathrm{per}}(n),$$
where $y_n$ is 1 if class $n$ is equal to the ground truth and 0 otherwise, and $\lambda$ is a hyperparameter for adjusting the influence of $L_{\mathrm{att}}$. The end-to-end training of the feature extractor, attention branch, and perception branch can be performed using the loss function $L$. In other words, training ConfABN with $L$ realizes simultaneous optimization of the parameters of the model for attention map generation and for the final deterioration level estimation.
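A minimal sketch of this joint loss in Python; the function names are ours, and placing the hyperparameter on the attention-branch term is our reading (the weighting could equally be applied to the other term):

```python
import math

def cross_entropy(probs, true_class):
    """-sum_n y_n log p_n; with a one-hot label this reduces to -log p_true."""
    return -math.log(probs[true_class])

def confabn_loss(p_att, p_per, true_class, lam=1.0):
    """Joint loss L = L_per + lam * L_att over the two branches.
    Minimizing L trains the feature extractor and both branches
    end to end, since both terms backpropagate through it."""
    return cross_entropy(p_per, true_class) + lam * cross_entropy(p_att, true_class)

# Attention branch is unsure (0.5/0.5) while the perception branch
# leans toward the correct class (0.75); both contribute to the loss.
loss = confabn_loss(p_att=[0.5, 0.5], p_per=[0.25, 0.75], true_class=1)
```

Because both branch losses share the feature extractor, gradients from the perception branch also shape the features from which the attention map is generated.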