
This article has been accepted for publication in IEEE Access.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3233072


Road Crack Detection Using Deep Neural Network Based on Attention Mechanism and Residual Structure
PENG JING¹, HAIYANG YU¹, ZHIHUA HUA¹, SAIFEI XIE¹, CAOYUAN SONG¹
¹School of Surveying and Land Information Engineering, Henan Polytechnic University, Jiaozuo 454000, China
Corresponding author: Haiyang Yu (yuhaiyang@hpu.edu.cn)
This work was supported in part by the National Natural Science Foundation of China under Grant U1304402.

ABSTRACT Intelligent detection of road cracks is crucial for road maintenance and safety. Due to the
interference of illumination and different background factors, the road crack extraction results of existing
deep learning methods are incomplete, and the extraction accuracy is low. We designed a new network
model, called AR-UNet, which introduces a convolutional block attention module (CBAM) in the encoder
and decoder of U-Net to effectively extract global and local detail information. The input and output CBAM
features of the model are connected to increase the transmission path of features. The BasicBlock is adopted
to replace the convolutional layer of the original network to avoid network degradation caused by gradient
disappearance and network layer growth. We tested our method on DeepCrack, Crack Forest Dataset, and
our own labeled road image dataset (RID). The experimental results show that our method focuses more
on crack feature information and extracts cracks with higher integrity. The comparison with existing deep
learning methods also demonstrates the effectiveness of our proposed method. The code is available at:
https://github.com/18435398440/ARUnet.

INDEX TERMS Residual structure, attention mechanism, deep learning, crack detection

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION
CRACKS are the most common type of road disease. If road repair is not carried out in time, cracks
will seriously endanger traffic safety. Detecting and repairing cracks in time is therefore an
essential responsibility of the transportation department. In recent years, with the development
of image-based and computer-vision-based road crack detection methods [1], deep learning has been
widely used for crack detection [2][3][4]. Zhang et al. [5] were the first to use deep learning
for road crack extraction; they proposed and trained a supervised shallow neural network to detect
cracks. CrackForest [6] combined multi-level complementary features using structural information
in crack patches to detect and extract cracks. Yao et al. [7] proposed a convolutional neural
network for crack recognition, which suppressed the interference of background factors and
significantly improved detection accuracy. Liu et al. [8] proposed a pixel-level classification
network combining local and global information to obtain richer multi-scale feature information
and improve crack detection accuracy. Dorafshan et al. [9] reduced the interference of background
factors on crack extraction by connecting edge detectors and deep convolutional neural networks.
Li et al. [10] enhanced and extracted multi-scale crack features using dense connections; the
feature maps at different scales were then fused so that features at different levels complemented
each other during crack extraction. However, these methods struggle to extract fine cracks in
pavement images with many interfering factors.
Ronneberger et al. [11] proposed U-Net, a medical image segmentation method that captures
contextual semantics along a contracting path and determines location along a symmetric expanding
path. The encoder and decoder sub-networks of U-Net++ are connected by nested and dense skip
pathways [12] to reduce the semantic gap between the encoder and decoder feature maps, and its
Intersection over Union (IoU) is higher than that of the original U-Net. Cheng et al. [13] treated
the crack images as a whole; they also introduced a cost function based on distance transformation
to improve the detection performance of the network. Fan et al. [14] proposed U-HDN, an
encoder-decoder-based structured neural network that integrates crack context information into a
multi-dilation module to obtain more crack features. Drozdzal et al. [15] studied the importance
of skip connections and introduced short skip

connections in the encoder. The ResNet34 residual network [16] has also been used, with its
original convolutions replaced by dilated convolutions [17] to extract crack information and an
attention mechanism introduced to obtain the final crack detection results. However, these methods
have poor detection accuracy when many disturbing background factors are present.
U-Net is an encoder-decoder structure that can be trained end-to-end on relatively few images and
detects road cracks quickly. However, road images contain many distracting factors, and U-Net alone
cannot adequately extract the fine cracks in them. Introducing CBAM into U-Net makes the network
structure more complex and increases the number of layers, and the model then shows network
degradation. To solve these problems, the work in this paper focuses on the following aspects:
1) We design a new network model called AR-UNet by introducing the convolutional block attention
module (CBAM) into the U-Net neural network. The CBAM applies both global average pooling and
global max pooling along the channel and spatial dimensions of the input features to focus on more
global and local detail information. This improves the performance of the neural network in
detecting fine cracks.
2) CBAM’s input and output features are summed through shortcut connections to increase the
transmission path of crack features, so that the network model can learn more about crack
features.
3) BasicBlock replaces the convolutional layers of the U-Net network to avoid network degradation
due to the increase in the number of network layers, further improving the accuracy of crack
extraction.

II. RELATED WORK
Traditional road pavement crack detection falls mainly into the following categories: 1) manual
detection, 2) threshold methods, 3) wavelet transform, 4) morphological image processing and
classification, 5) path methods, and 6) edge detection methods.
In manual detection, a pavement investigator drives along the road and records the location of
cracks, their degree of damage, and their number. Such a method is detailed and comprehensive, but
it consumes large amounts of human and material resources and is inefficient.
Thresholding-based image segmentation methods originated early and are widely used. They detect
cracks by exploiting the fact that the gray value of crack pixels is lower than that of the
background [18]. Kirschke et al. [19] proposed a histogram-based threshold segmentation method,
which can only identify relatively obvious cracks. Removal algorithms [20] that use binary
segmentation, morphological operations, and removal of isolated points and regions tend to leave
gaps in the detected cracks. Segmentation with an improved adaptive iterative thresholding
algorithm [21] can also yield crack images. Zhang et al. [22] took advantage of the significant
difference between cracks and background to mark contours using FAST feature point recognition and
used PYNQ for crack identification. However, the accuracy of these methods is poor when there is a
lot of noise in the background.
Algorithms such as wavelet-based pavement crack detection [23][24] use the wavelet transform to
convert cracks and noise into different wavelet coefficients. These methods place high demands on
equipment and are prone to over-segmentation and to interference from external factors.
Histogram statistics and shape analysis algorithms [25], morphological image processing with
logistic regression statistical classification [26], and free-form path calculation methods [27]
combine brightness and connectivity to detect cracks. Their detection is not practical under
complex backgrounds with many interfering factors. The median filtering algorithm [28] enhances
grayscale pavement images using four structural element reconstructions and combines the
morphological gradient operator and the morphological closure operator to extract crack edges.
However, this method can only identify crack pixels with noticeable contrast changes in the crack
image, and its extraction accuracy is poor for cracks with inconspicuous features.
Shah and Wang et al. [29][30] studied crack segmentation based on edge detection, but the natural
properties of road diseases were not considered, and the algorithm’s applicability was less than
ideal. Edge-detection segmentation algorithms generally identify crack edges from local grayscale
and gradient information, which is only applicable to cracks with complete edge information.
Background regions with strong edge information are easily misjudged as crack points, and when
there is more noise, the effect of edge detection is poor.

III. METHOD
A. OVERALL NETWORK STRUCTURE
The U-Net neural network is divided into three parts: encoder, decoder, and prediction module. The
encoder reduces the image size and extracts the initial image features by convolution and max
pooling. The decoder obtains the deep features of the image by convolution (a ReLU function follows
each convolution). Finally, pixel classification is done by 1×1 convolution.
The established network structure is shown in Figure 1. It mainly consists of a feature extraction
network, residual modules, and CBAM modules. The BasicBlock module replaces the convolutional layer
of the U-Net network; it effectively mitigates model degradation and vanishing gradients as the
number of network layers increases. The network introduces CBAM and then sums the input and output
of CBAM; the resulting module is called Res-CBAM. Res-CBAM makes the network pay more attention to
crack information along the channel and spatial dimensions and assigns larger weights to the
corresponding features.
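
The shortcut idea behind BasicBlock, which Section III-G formalizes as Eq. (5), x_{l+1} = x_l + f(x_l, w_l), can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it assumes a single-channel feature map stored as nested lists, zero padding, a residual branch of the form conv3x3 -> ReLU -> conv3x3, and no activation after the sum. The helper names `conv3x3`, `relu`, and `basic_block` are illustrative only.

```python
def conv3x3(x, k):
    """3x3 convolution (cross-correlation) with zero padding on an H x W nested list."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for di in range(3):
                for dj in range(3):
                    ii, jj = i + di - 1, j + dj - 1
                    if 0 <= ii < h and 0 <= jj < w:  # values outside the map count as zero
                        s += k[di][dj] * x[ii][jj]
            out[i][j] = s
    return out


def relu(x):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in x]


def basic_block(x, k1, k2):
    """x_{l+1} = x_l + f(x_l, w_l), with f = conv3x3 -> ReLU -> conv3x3 (assumed layout)."""
    residual = conv3x3(relu(conv3x3(x, k1)), k2)
    return [[a + b for a, b in zip(row_x, row_r)] for row_x, row_r in zip(x, residual)]


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
zero = [[0.0] * 3 for _ in range(3)]
identity = [[0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0]]

same = basic_block(x, zero, zero)              # residual branch contributes nothing
doubled = basic_block(x, identity, identity)   # residual branch reproduces x
```

With all-zero kernels the block returns its input unchanged, so an added block can always fall back to the identity mapping. This is the property that lets deeper stacks avoid the degradation described above.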
FIGURE 1. The overall structure of the neural network.
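
The Res-CBAM module introduced in Section III-A (the CBAM output summed with the CBAM input) can be sketched in pure Python as below. This is a hedged illustration, not the authors' PyTorch code. It follows the channel and spatial attention of Eqs. (1) to (4) in Sections III-B to III-D and assumes that the two pooled maps in the spatial branch are concatenated into two channels before the convolution, as the text describes ("channel spliced"). Bias terms are omitted, and the function names and the toy weights are illustrative.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(F, W0, W1):
    """Eq. (3): M_c = sigmoid(MLP(avg-pooled F) + MLP(max-pooled F)); returns C weights."""
    def mlp(vec):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in W0]  # ReLU
        return [sum(w * h for w, h in zip(row, hidden)) for row in W1]

    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in F]
    mx = [max(max(row) for row in ch) for ch in F]
    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(F, kernel):
    """Eq. (4): pool over channels, splice the two maps, convolve, apply sigmoid.

    kernel has shape 2 x k x k (one k x k slice per pooled map); the paper uses k = 7.
    """
    C, H, W = len(F), len(F[0]), len(F[0][0])
    k = len(kernel[0])
    pad = k // 2
    avg = [[sum(F[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    mx = [[max(F[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    Ms = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for m, pooled in enumerate((avg, mx)):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < H and 0 <= jj < W:  # zero padding
                            s += kernel[m][di][dj] * pooled[ii][jj]
            Ms[i][j] = sigmoid(s)
    return Ms


def res_cbam(F, W0, W1, kernel):
    """Res-CBAM: the refined output F'' (Eqs. 1 and 2) summed with the CBAM input F."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    Mc = channel_attention(F, W0, W1)
    F1 = [[[Mc[c] * v for v in row] for row in F[c]] for c in range(C)]   # Eq. (1): F'
    Ms = spatial_attention(F1, kernel)
    return [[[F[c][i][j] + Ms[i][j] * F1[c][i][j] for j in range(W)]      # F + F''
             for i in range(H)]
            for c in range(C)]


# Toy input: C = 2 channels, 2 x 2 maps, compression rate r = 2, all-zero weights.
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.5, 0.5], [0.5, 0.5]]]
W0 = [[0.0, 0.0]]                                            # shape (C/r) x C
W1 = [[0.0], [0.0]]                                          # shape C x (C/r)
kernel = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]   # shape 2 x 7 x 7
out = res_cbam(F, W0, W1, kernel)
```

With all-zero weights every attention value is sigmoid(0) = 0.5, so F'' = 0.25 F and the module returns 1.25 F. The input always passes through the shortcut, whatever the attention branch learns.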

B. CONVOLUTIONAL BLOCK ATTENTION MODULE (CBAM)
CBAM is a lightweight module that contains spatial attention and channel attention. The module
derives attention weights sequentially along two independent dimensions, channel and space, and
then multiplies the output attention map with the input feature map for adaptive feature
refinement. Since CBAM is a lightweight, general-purpose module, it can be seamlessly integrated
into any CNN architecture and trained end-to-end with the underlying CNN. Compared with attention
modules that focus on only one dimension, CBAM attends to both and extracts more information about
the target.
As shown in Figure 2, given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, the
CBAM module computes the one-dimensional channel attention feature map
$M_c \in \mathbb{R}^{C \times 1 \times 1}$ and the two-dimensional spatial attention feature map
$M_s \in \mathbb{R}^{1 \times H \times W}$ in turn, and finally outputs features weighted along
both channel and space. The overall attention is calculated as follows:

$$F' = M_c(F) \times F \tag{1}$$

$$F'' = M_s(F') \times F' \tag{2}$$

where $F'$ denotes the input features after the channel attention operation and $F''$ is the final
refined output.

FIGURE 2. Convolutional Block Attention Module.

C. CHANNEL ATTENTION MODULE (CAM)
The structure of the channel attention module is shown in Figure 3. Two feature maps of size
$1 \times 1 \times C$ are obtained by feeding the input features into global max pooling and
global average pooling, respectively. Both are then passed through a two-layer fully connected
neural network: the first layer has $C/r$ neurons ($r$ is the compression rate) with ReLU as the
activation function, and the second layer has $C$ neurons. The two outputs of the fully connected
network are then summed and passed through the sigmoid activation function to generate the channel
attention features $M_c$. The channel attention is calculated as follows:
$$M_c(F) = \sigma\left(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\right) \tag{3}$$

where $\sigma$ denotes the sigmoid function, $W_0 \in \mathbb{R}^{C/r \times C}$, and
$W_1 \in \mathbb{R}^{C \times C/r}$.

FIGURE 3. Channel Attention Module.

D. SPATIAL ATTENTION MODULE (SAM)
The structure of the spatial attention module is shown in Figure 4. The input features $F'$ of
size $C \times H \times W$ are average-pooled and max-pooled along the channel axis to obtain
$F'_{avg}$ and $F'_{max}$. The two feature maps are then spliced (concatenated) along the channel
dimension. A 7×7 convolution compresses the result to $H \times W \times 1$, and the sigmoid
activation function generates $M_s$. Finally, $M_s$ is multiplied by the input feature map of the
module to obtain the final feature map. The spatial attention is calculated as follows:

$$M_s(F') = \sigma\left(f^{7 \times 7}\left(\left[F'_{avg} + F'_{max}\right]\right)\right) \tag{4}$$

where $\sigma$ denotes the sigmoid function and $f^{7 \times 7}$ denotes the convolution operation
with a filter size of 7 × 7.

FIGURE 4. Spatial Attention Module.

E. STRUCTURE DETAILS OF THE ENCODER
As shown in Figure 5, after two 3 × 3 convolution operations the input features enter the channel
attention of CBAM to obtain the channel attention weight $M_c$. $M_c$ is multiplied by the input
feature map to get the input features required by the spatial attention module. Next, the spatial
attention weight $M_s$ is obtained by the spatial attention operation. After the shortcut
connection joins the original input with the features weighted by $M_s$, the result enters 2×2 max
pooling to obtain the final feature map of size $C \times \frac{H}{2} \times \frac{W}{2}$.

FIGURE 5. Partial structure of the encoder.

F. STRUCTURE DETAILS OF THE DECODER
The residual-connected Res-CBAM is also introduced in the structure of the decoder, as shown in
Figure 6. The feature map of size $C \times H \times W$ is deconvolved, and the corresponding CBAM
input feature map of the encoder is copied, cropped, and concatenated with the deconvolved feature
map to obtain a feature map of size $C \times 2H \times 2W$. The concatenated feature map is fed
to the attention mechanism as its input feature map. The output feature map is connected with the
input feature map and then convolved with a 3 × 3 convolution kernel to obtain the final feature
map of size $\frac{C}{2} \times 2H \times 2W$.

FIGURE 6. Partial structure of the decoder.

G. RESIDUAL NETWORK
The residual network comes from the literature [31]. Typically, as the number of layers increases,
the training loss gradually decreases and then saturates; in practice, however, the training loss
increases when the network depth is increased further. This is not overfitting because, in
overfitting, the training loss continuously decreases.
The deeper the network is, the more difficult it is to train. Therefore, it is essential to
integrate shortcut connections
in U-Net networks to reduce network degradation. Since the original convolutional layer is
computationally time-consuming and unsuitable for pixel-level prediction, it is replaced by
BasicBlock, whose structure is shown in Figure 7.

FIGURE 7. The structure of BasicBlock.

After the input feature map is passed through two convolutional layers and the ReLU function, it
is summed with the original input features to obtain the final output feature map. A residual
block can be expressed as:

$$x_{l+1} = x_l + f(x_l, w_l) \tag{5}$$

The residual block is divided into two parts: the direct mapping part and the residual part.
$h(x_l)$ is the direct mapping, shown as the curve on the right in Figure 7; $f(x_l, w_l)$ is the
residual part, which consists of two convolution operations and corresponds to the convolutions on
the left in Figure 7.
The shortcut connections between the input and output feature maps transfer the crack information
extracted by the previous layer of the network to the next layer. Information loss is thereby
largely avoided, and the network degradation caused by increasing the number of neural network
layers is effectively prevented.

IV. EXPERIMENTS AND RESULTS
A. ROAD IMAGE DATA SET
The datasets used for the experiments are DeepCrack [32], Crack Forest Dataset [33], and our
annotated onboard road image dataset, which we named RID. DeepCrack contains 537 concrete pavement
images of 544 × 384 pixels with multi-scene and multi-scale pavement cracks. The Crack Forest
dataset consists of asphalt pavement images; it contains 118 images of 480 × 320 pixels with
background noise such as white markers and shadows. These two datasets have few images and are
augmented using rotate, flip, and mirror operations. After augmentation, 2148 and 708 images were
obtained from the DeepCrack and Crack Forest datasets, respectively. We then built a dataset of
548 images from road images acquired by a mobile LiDAR mapping system. The labels in all three
datasets were annotated manually. To validate the established neural network models, we selected
80% of each dataset as training data and 20% as test data.

B. EXPERIMENTAL SETTINGS
1) Analysis of Initial Learning Rate and Optimizers
In the first experiment, to obtain a suitable initial learning rate and optimization method, we
set different learning rates and optimizers and analyzed the training loss of the model. Figure
8(a) shows the results with the Adam optimizer: the training loss fluctuates strongly across the
three datasets, and the loss values are large. Figure 8(b) shows the results with the SGD
optimizer: the training loss values of the three datasets are small and stable. We therefore
choose SGD as the network optimizer. The learning rate is set to 1e-1 for training on the RID and
CrackForest datasets and to 3e-3 for training on the DeepCrack dataset, because the corresponding
loss values are the smallest.

FIGURE 8. Statistical results of training loss values with different learning rates. (a) Loss
values using the Adam optimizer. (b) Loss values using the SGD optimizer. In both panels the
horizontal axis is the learning rate (1e-5 to 5e-1), the vertical axis is the training loss, and
the curves are RID, DeepCrack, and CrackForest.
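
The dataset bookkeeping of Section IV-A can be checked with a short sketch. It is an illustration under stated assumptions rather than the authors' pipeline: the paper names rotate, flip, and mirror operations but not their parameters, so a 90-degree rotation, a top-bottom flip, and a left-right mirror are assumed here. That gives four variants per image, which matches 537 -> 2148 for DeepCrack. The Crack Forest count (118 -> 708) corresponds to six variants per image, so a different set of operations must have been used there. Rounding the training share down is also an assumption.

```python
def rotate90(img):
    """Rotate an H x W nested-list image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip(img):
    """Flip top to bottom."""
    return [list(row) for row in img[::-1]]


def mirror(img):
    """Mirror left to right."""
    return [list(row[::-1]) for row in img]


def augment(img):
    """Return the original image plus one rotated, one flipped, and one mirrored copy."""
    return [img, rotate90(img), flip(img), mirror(img)]


def split_counts(n_images, train_fraction=0.8):
    """Training and test image counts for an 80/20 split (training share rounded down)."""
    n_train = int(n_images * train_fraction)
    return n_train, n_images - n_train


img = [[1, 2],
       [3, 4]]
variants = augment(img)

deepcrack_total = 537 * len(variants)   # four variants per original image
splits = {name: split_counts(n)
          for name, n in {"DeepCrack": 2148, "Crack Forest": 708, "RID": 548}.items()}
```

Under these assumptions the splits come to 1718/430 images for DeepCrack, 566/142 for Crack Forest, and 438/110 for RID. The exact counts in the paper may differ by rounding.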
FIGURE 9. Experimental visualization results of three data sets. (Here "Res-CBAM" means only
Res-CBAM is introduced, "BasicBlock" means only BasicBlock is introduced, and "Res-CBAM+BasicBlock"
means both structures are introduced.)

2) Other Experimental Settings
We implement all tests with Python 3.6, PyTorch 1.10.1, and CUDA 11.1 and use an NVIDIA GeForce
RTX 2080 GPU for training. The model uses the SGD optimization method, updating the parameters on
randomly selected mini-batches of samples with the momentum set to 0.9. The ReLU activation
function suppresses vanishing gradients during training, which accelerates convergence of the
model and keeps training stable.

C. EXPERIMENTAL EVALUATION INDEXES
Segmentation accuracy is evaluated using four commonly used metrics: DICE (D), precision (P),
recall (R), and F1-score. DICE is the ratio of twice the area where the predicted and true results
intersect to the sum of the predicted and true areas; a perfect segmentation has a value of 1. The
F1-score balances precision and recall. The metrics are calculated as follows:

$$D = \frac{2 \times (R_{seg} \cap R_{gt})}{R_{seg} + R_{gt}} \tag{6}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{7}$$

$$P = \frac{TP}{TP + FP} \tag{8}$$

$$R = \frac{TP}{TP + FN} \tag{9}$$

Precision is the proportion of pixels detected as crack that are truly crack pixels, where TP is
the number of correctly classified crack pixels and FP is the number of background pixels
incorrectly classified as crack. Recall is the percentage of correctly detected crack pixels among
all crack pixels, where FN is the number of crack pixels incorrectly classified as background.

D. THE RESULTS OF ABLATION EXPERIMENTS
1) Visual Analysis of Experimental Results
To examine the effect of introducing Res-CBAM and BasicBlock into the neural network on crack
feature extraction, we carried out ablation experiments. The tests were done
TABLE 1. Results on different datasets ("+" means the structure is introduced and "-" means the structure is not introduced).

Dataset               U-Net  Res-CBAM  BasicBlock  D(%)   P(%)   R(%)   F1-score(%)
DeepCrack             +      -         -           65.39  64.20  70.63  67.26
DeepCrack             +      +         -           68.72  80.11  71.64  75.64
DeepCrack             +      -         +           83.91  84.05  83.29  83.67
DeepCrack             +      +         +           84.09  89.24  82.64  85.82
Crack Forest Dataset  +      -         -           58.57  61.74  54.85  54.85
Crack Forest Dataset  +      +         -           62.99  70.10  60.99  65.23
Crack Forest Dataset  +      -         +           66.25  57.85  80.31  67.25
Crack Forest Dataset  +      +         +           67.22  60.91  79.18  68.85
RID                   +      -         -           37.77  57.41  32.89  41.82
RID                   +      +         -           46.02  56.28  46.96  51.20
RID                   +      -         +           39.52  56.31  36.00  43.92
RID                   +      +         +           50.39  58.97  52.36  55.47
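
The D, P, R, and F1-score columns of Table 1 follow Eqs. (6) to (9) of Section IV-C. As a hedged illustration (not the authors' evaluation script), the sketch below computes the four metrics for one pair of binary masks stored as nested lists, with 1 marking crack pixels. The function names and the toy masks are assumptions. For a single pair of binary masks, Eq. (6) and Eq. (7) give the same number; the text does not state how scores are aggregated over images, so the D and F1 columns of the table need not coincide.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives over crack pixels (value 1)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    return tp, fp, fn


def segmentation_metrics(pred, truth):
    """DICE, precision, recall, and F1 for one predicted mask and its ground truth."""
    tp, fp, fn = confusion_counts(pred, truth)
    precision = tp / (tp + fp) if tp + fp else 0.0                      # Eq. (8)
    recall = tp / (tp + fn) if tp + fn else 0.0                         # Eq. (9)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                               # Eq. (7)
    # |R_seg| = tp + fp, |R_gt| = tp + fn, |R_seg ∩ R_gt| = tp
    dice = 2 * tp / ((tp + fp) + (tp + fn)) if tp + fp + fn else 1.0    # Eq. (6)
    return {"D": dice, "P": precision, "R": recall, "F1": f1}


pred = [[1, 1, 1, 1],
        [0, 0, 0, 0]]
truth = [[1, 1, 1, 0],
         [1, 1, 0, 0]]
scores = segmentation_metrics(pred, truth)
```

For these toy masks there are 3 true positives, 1 false positive, and 2 false negatives, giving P = 0.75 and R = 0.6.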

in each of the three datasets. Figure 9 shows the visualization results of the experiments. Rows
1-2 show the detection results on the DeepCrack dataset: crack extraction by the original neural
network is incomplete, and its accuracy is poor. After the introduction of Res-CBAM and
BasicBlock, the network model focuses more on the crack region, and crack completeness is higher.
Rows 3-4 show the results on the Crack Forest dataset, where the extracted cracks are more
realistic. Rows 5-6 show the results on RID, where the fine cracks are extracted more completely.

E. RESULTS OF ABLATION EXPERIMENTS
Results on DeepCrack. We explored the contribution of each component on DeepCrack’s test set. As
shown in Table 1, introducing Res-CBAM improved DICE from 65.39% to 68.72% and F1-score from
67.26% to 75.64%. Integrating BasicBlock alone into the original network improved DICE and
F1-score to 83.91% and 83.67%. With Res-CBAM and BasicBlock added together, DICE and F1-score
reached 84.09% and 85.82%, respectively. Improving the structure of the encoder and decoder thus
yields higher extraction accuracy than U-Net.
Results on the Crack Forest dataset. DICE and F1-score improve to 67.22% and 68.85%, respectively,
after introducing both Res-CBAM and BasicBlock into U-Net. Precision is highest when Res-CBAM
alone is introduced, and recall is highest when BasicBlock alone is introduced, but the F1-scores
of these two variants are lower than that of the network with both structures. The results on the
Crack Forest dataset show that introducing Res-CBAM and BasicBlock together effectively improves
the crack detection ability of U-Net.
Results on RID. The network achieves the best performance when both the attention and the residual
structure are introduced: DICE and F1-score reach 50.39% and 55.47%, respectively. However, this
performance is lower than on the other datasets, because the road image dataset (RID) has uneven
illumination and skewed shooting angles. In addition, the ground-truth labels of this dataset are
only one or a few pixels wide, which is one of the reasons for the low scores.

V. DISCUSSION
A. EFFECTIVENESS OF SHORTCUT CONNECTIONS
We further verified through ablation experiments whether adding shortcut connections in CBAM
positively affects the extraction of cracks. The experimental results are shown in Table 2. Adding
shortcut connections improved the crack extraction accuracy of the network, because the shortcut
connections increase the paths along which feature information propagates. The neural network
learned more global and local crack information, which supports the feasibility of our method.

TABLE 2. Test results for CBAM on the RID dataset with or without residual connections.

Methods   D(%)   P(%)   R(%)   F1-score(%)
CBAM      48.78  56.53  51.34  53.81
Res-CBAM  50.39  58.97  52.36  55.47

Since Res-CBAM plays an essential role in the network structure, its position may affect the
neural network performance. We compare two placements of Res-CBAM in the decoder, as shown in
Figure 10(a) and (b), and discuss the effects of introducing Res-CBAM at the convolution and at
the deconvolution (up-convolution). In the same experimental environment, the neural networks with
the two arrangement methods are tested separately. Table 3 summarizes the test results of the
different placements. The results show that the neural network with Res-CBAM introduced at the
convolution performs better, because the input features of Res-CBAM then include features from the
encoder, which makes the input information richer. With Res-CBAM at the position shown in Figure
10(b), the DICE and F1-scores are lower because some feature information is lost after the input
features undergo two convolution operations, which degrades the detection performance of the
network.

FIGURE 10. Arrangement of Res-CBAM at different positions in the decoder. (a) Introducing Res-CBAM
in convolution. (b) Introducing Res-CBAM in up-convolution.

TABLE 3. Test results of different position arrangement methods.

Position  D(%)   P(%)   R(%)   F1-score(%)
Up-conv   48.84  57.24  52.09  54.54
Conv      50.39  58.97  52.36  55.47

FIGURE 11. Training loss values on different data sets. (a) The training loss of DeepCrack
Dataset. (b) The training loss of Crack Forest Dataset. (c) The training loss of RID. ("U-Net"
indicates the original network, "Res-CBAM" indicates that the original network introduces
Res-CBAM, and "Res-CBAM+Bas" indicates that both structures are introduced.) In each panel the
horizontal axis is the epoch (0 to 50) and the vertical axis is the training loss.

B. NETWORK DEGRADATION IN TRAINING PROCESS
In addition, we verified the network degradation during training by ablation experiments,
recording the changes in the training loss on the three datasets. As shown in Figure 11(a), (b),
and (c), the U-Net with Res-CBAM introduced shows network degradation due to the increased number
of network layers. The figure shows that the loss values of the original U-Net are unstable and
fluctuate greatly during training, and the neural network converges slowly. After the introduction
of Res-CBAM, the neural network pays more attention to the crack features and converges faster.
However, due to the increase in network layers, the neural network performance
TABLE 4. Results of comparison with other deep learning algorithms.

                      Traditional Deep Learning Algorithms      Transformer Algorithm
Dataset               Methods    P(%)  R(%)  F1-score(%)        Methods    P(%)  R(%)  F1-score(%)
DeepCrack             SegNet     73.2  81.2  77.0               VIT        82.6  83.7  83.2
DeepCrack             DeepCrack  53.5  55.5  54.5               Swin-UNet  85.7  83.6  84.6
DeepCrack             RCF        60.1  71.3  65.2               TransUNet  86.2  84.4  85.3
DeepCrack             ours       88.9  85.7  87.2               ours       88.9  85.7  87.2
Crack Forest Dataset  SegNet     42.0  60.2  49.5               VIT        58.6  77.7  66.7
Crack Forest Dataset  DeepCrack  46.7  61.5  53.0               Swin-UNet  60.7  75.3  67.2
Crack Forest Dataset  RCF        41.5  49.5  45.2               TransUNet  63.8  79.8  70.9
Crack Forest Dataset  ours       63.2  81.2  71.1               ours       63.2  81.2  71.1
RID                   SegNet     37.2  51.6  43.2               VIT        48.6  51.7  50.1
RID                   DeepCrack  39.3  51.4  44.5               Swin-UNet  52.8  50.9  51.8
RID                   RCF        40.6  49.8  44.7               TransUNet  53.2  54.6  53.9
RID                   ours       58.9  52.3  55.4               ours       58.9  52.3  55.4

was slightly worse than the original network, and network degradation occurred. We therefore
connected the input and output features of CBAM and replaced the convolutional layers of the
original network with BasicBlock. The improved neural network converged faster and reached
higher accuracy.

C. COMPARISON WITH TRADITIONAL DEEP LEARNING ALGORITHMS
Table 4 compares our method with other commonly used methods. Our method is more accurate than
SegNet [34], RCF [35], and DeepCrack [32]. On the DeepCrack dataset, its F1-score is 10.2
percentage points higher than that of SegNet, and its precision and recall are 15.7 and 4.5
points higher, respectively. On the Crack Forest Dataset, the F1-score improves by 18.1 points
over DeepCrack, and the precision and recall improve by 16.5 and 19.7 points, respectively. On
the RID dataset, our network outperforms the other networks: compared with RCF, the F1-score
improves by 10.7 points, and the precision and recall improve by 18.3 and 2.5 points,
respectively. These results show that integrating CBAM and the residual structure into the
U-Net network improves its crack detection performance and accuracy.

D. COMPARISON WITH TRANSFORMER ALGORITHM
To further demonstrate the advantages of the proposed method, we also compare it with the
recently published Vision Transformer (VIT) [36], Swin-UNet [37], and TransUNet [38]
algorithms, over which it also shows some advantages. The comparison results are shown in
Table 4. On the DeepCrack dataset, our method's F1-score is 87.2%, and its precision and recall
are 88.9% and 85.7%, respectively. On the Crack Forest Dataset, the precision of our method is
0.6 percentage points lower than that of TransUNet, but its F1-score is 0.2 points higher. On
the RID dataset, our method also outperforms the other algorithms, with an F1-score of 55.4%.
Compared with the Transformer-based methods, our method integrates the channel and spatial
location information of cracks in the feature extraction stage, so the attention weights are
tilted toward cracks. Transformers focus more on global information and tend to neglect local
information. Because crack pixels make up only a small proportion of the image, ignoring local
information lowers detection accuracy.

VI. CONCLUSION
We introduced Res-CBAM and BasicBlock into the U-Net network to establish a neural network
model for crack detection. The experimental results show that the introduction of CBAM enhances
the attention of the neural network to the crack region, improves its ability to extract fine
cracks, and suppresses the interference of background factors. Meanwhile, the shortcut
connections of Res-CBAM and the replacement of the convolutional layers in the network
structure by BasicBlock ensure that crucial information is transmitted as efficiently as
possible and effectively suppress network degradation. The constructed neural network learns
more features about cracks and improves the ability of the model to detect fine cracks.
Compared with several other neural network methods, the neural network built in this study has
a significantly enhanced ability to extract cracks. Its accuracy and robustness were verified
through experiments on three different datasets.

REFERENCES
[1] R, Stefania C., and I Brilakis, “Synthetic structure of industrial plastics,” Journal of
Computing in Civil Engineering, vol. 31, no. 2, Mar. 2017.
[2] J Jong-Hyun, H Jo, and G Ditzler, “Convolutional neural networks for pavement roughness
assessment using calibration-free vehicle dynamics,” Computer-Aided Civil and Infrastructure
Engineering, vol. 35, no. 11, pp. 1209-1229, Mar. 2020, doi: 10.1111/mice.12546.
[3] H Y Ju, W Li, S Tighe, Z C Xu, J Z Zhai, “CrackU-net: a novel deep convolutional neural
network for pixelwise pavement crack detection,”
Structural Control and Health Monitoring, vol. 27, no. 8, Mar. 2020, doi: 10.1002/stc.2551.
[4] E. H. Miller, “Crack detection and segmentation using deep learning with 3D reality mesh
model for quantitative assessment and integrated visualization,” Journal of Computing in Civil
Engineering, vol. 34, no. 3, May 2020, doi: 10.1061/(ASCE)CP.1943-5487.0000890.
[5] L Zhang, F Yang, Y M Zhang, Daniel, “Road crack detection using deep convolutional neural
network,” in IEEE International Conference on Image Processing, Phoenix, AZ, USA, 2016,
doi: 10.1109/ICIP.2016.7533052.
[6] Y Shi, L M Cui, Z Q Qi, F Meng and Z S Chen, “Automatic road crack detection using random
structured forests,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 12,
pp. 3434-3445, May 2016, doi: 10.1109/TITS.2016.2552248.
[7] G Yao, F J Wei, J Y Qian and Z G Wu, “Crack Detection Of Concrete Surface Based On Newline
Convolutional Neural Networks,” in 2018 International Conference on Machine Learning and
Cybernetics, Chengdu, China, 2018, doi: 10.1109/ICMLC.2018.8527035.
[8] Z Liu, Y Cao, Y Wang and W Wang, “Computer vision-based concrete crack detection using
U-net fully convolutional networks,” Automation in Construction, vol. 104, pp. 129-139,
Aug. 2019, doi: 10.1016/j.autcon.2019.04.005.
[9] S Dorafshan, R J Thomas, and M Maguire, “Comparison of deep convolutional neural networks
and edge detectors for image-based crack detection in concrete,” Construction and Building
Materials, vol. 186, pp. 1031-1045, Oct. 2018, doi: 10.1016/j.conbuildmat.2018.08.011.
[10] H F Li, J P Zong, J J Nie, Z L Wu, H Y Han, “Pavement crack detection algorithm based on
densely connected and deeply supervised network,” IEEE Access, vol. 9, pp. 11835-11842,
Jan. 2021, doi: 10.1109/ACCESS.2021.3050401.
[11] O Ronneberger, P Fischer and T Brox, “U-net: Convolutional networks for biomedical image
segmentation,” in International Conference on Medical Image Computing and Computer-Assisted
Intervention, vol. 9351, Springer, Cham, pp. 234-241, Nov. 2015.
[12] Z W Zhou, M M R Siddiquee, N Tajbakhsh and J M Liang, “UNet++: A nested U-Net architecture
for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal
Learning for Clinical Decision Support, vol. 11045, Springer, Cham, pp. 3-11, Sept. 2018.
[13] J Cheng, W Xiong, W Chen, Y Gu and Y S Li, “Pixel-level crack detection using U-net,” in
TENCON 2018-2018 IEEE Region 10 Conference, pp. 0462-0466, Oct. 2018.
[14] Z Fan, C Li, Y Chen, J H Wei, G Loprencipe, X P Chen and P D Mascio, “Automatic crack
detection on road pavements using encoder-decoder architecture,” Materials, vol. 13, no. 13,
May 2020, doi: 10.3390/ma13132960.
[15] Z W Zhou, M M R Siddiquee, N Tajbakhsh and J M Liang, “The importance of skip connections
in biomedical image segmentation,” in Deep Learning and Data Labeling for Medical Applications,
vol. 10008, pp. 179-187, Springer, Cham, Sept. 2016.
[16] G Xu, C Liao and J Chen, “Extraction of apparent crack information of concrete based on
HU-ResNet,” Computer Engineering, vol. 46, no. 11, pp. 279-285, 2020.
[17] L F Li, N Wang, B Wu, and X Zhang, “Segmentation algorithm of bridge crack image based on
modified PSPNet,” Advances in Lasers and Optoelectronics, vol. 58, no. 22, pp. 101-109, 2021,
doi: 10.3788/LOP202158.2210001.
[18] S H Hanzaei, A Afshar and F Barazandeh, “Automatic detection and classification of the
ceramic tiles’ surface defects,” Pattern Recognition, vol. 66, pp. 174-189, Jun. 2017,
doi: 10.1016/j.patcog.2016.11.021.
[19] K R Kirschke and S A Velinsky, “Histogram-based approach for automated pavement-crack
sensing,” Journal of Transportation Engineering, vol. 118, no. 5, pp. 700-710, Sept. 1992,
doi: 10.1061/(ASCE)0733-947X(1992)118:5(700).
[20] W Huang and N Zhang, “A novel road crack detection and identification method using digital
image processing techniques,” in 2012 7th International Conference on Computing and Convergence
Technology (ICCCT), pp. 397-400, Seoul, Korea, Dec. 2012.
[21] Z W Zhou, M M R Siddiquee, N Tajbakhsh and J M Liang, “Research on crack detection method
of airport runway based on twice-threshold segmentation,” in 2015 Fifth International Conference
on Instrumentation and Measurement, Computer, Communication and Control (IMCCC), pp. 1716-1720,
Qinhuangdao, China, Sept. 2015, doi: 10.1109/IMCCC.2015.364.
[22] Y H Zhang, J Qin, Z L Guo, K C Jiang and S Y Cai, “Detection of road surface crack based on
PYNQ,” in 2020 IEEE International Conference on Mechatronics and Automation (ICMA), vol. 13,
no. 16, pp. 1150-1154, Beijing, China, Sept. 2020.
[23] B C Sun, Y J Qiu and S Q Liang, “Research on wavelet-based pavement crack identification,”
Journal of Chongqing Jiaotong University (Natural Science Edition), vol. 29, no. 01, pp. 69-72,
2010.
[24] P. Subirats, J. Dumoulin, V. Legeay, and D. Barba, “Automation of pavement surface crack
detection using the continuous wavelet transform,” in Proc. Int. Conf. Image Process. (ICIP),
pp. 3037-3040, Oct. 2006.
[25] Z G Xu, X M Zhao and H S Song, “Crack identification algorithm for asphalt pavement based
on histogram estimation and shape analysis,” Journal of Instrumentation, vol. 31, no. 10,
pp. 2260-2266, Oct. 2010.
[26] A Landstrom, M J Thurley, “Morphology-based crack detection for steel slabs,” IEEE Journal
of Selected Topics in Signal Processing, vol. 6, no. 7, pp. 866-875, Aug. 2012,
doi: 10.1109/JSTSP.2012.2212416.
[27] T S Nguyen, S Begot, F Duculty and M Avila, “Free-form anisotropy: A new method for crack
detection on pavement surface images,” in 2011 18th IEEE International Conference on Image
Processing, IEEE, pp. 1069-1072, Sept. 2011, doi: 10.1109/ICIP.2011.6115610.
[28] Y Maode, B Shaobo, X Kun and Y Y He, “Pavement crack detection and analysis for high-grade
highway,” in 2007 8th International Conference on Electronic Measurement and Instruments, IEEE,
Aug. 2007, doi: 10.1109/ICEMI.2007.4351202.
[29] S Shah, “Automatic cell segmentation using a shape-classification model in
immunohistochemically stained cytological images,” IEICE Transactions on Information and
Systems, vol. E91-D, no. 7, pp. 1955-1962, Jul. 2008, doi: 10.1093/ietisy/e91-d.7.1955.
[30] H Wang, N Zhu and Q Wang, “Segmentation of pavement cracks using differential box-counting
approach,” Journal of Harbin Institute of Technology, vol. 39, no. 1, pp. 142-144, 2007.
[31] K He, X Zhang, S Ren and J Sun, “Deep residual learning for image recognition,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE,
pp. 770-778, 2016.
[32] Y Liu, J Yao, X Lu, R P Xie and L Li, “DeepCrack: A deep hierarchical feature learning
architecture for crack segmentation,” Neurocomputing, vol. 338, pp. 139-153, Apr. 2019.
[33] Y Shi, L Cui, Z Qi, M Fan and Z S Chen, “Automatic road crack detection using random
structured forests,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 12,
pp. 3434-3445, May 2016, doi: 10.1109/TITS.2016.2552248.
[34] V Badrinarayanan, A Kendall and R Cipolla, “SegNet: A deep convolutional encoder-decoder
architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 39, no. 12, pp. 2481-2495, Dec. 2017, doi: 10.1109/TPAMI.2016.2644615.
[35] Y Liu, M M Cheng, X Hu, J W Bian, L Zhang, X Bai and J H Tang, “Richer convolutional
features for edge detection,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, IEEE, pp. 3000-3009, Oct. 2018, doi: 10.1109/TPAMI.2018.2878849.
[36] A Dosovitskiy, L Beyer, A Kolesnikov, D Weissenborn, X H Zhai, T Unterthiner, M Dehghani,
M Minderer, G Heigold, S Gelly, J Uszkoreit and N Houlsby, “An image is worth 16x16 words:
Transformers for image recognition at scale,” arXiv preprint, Jun. 2021,
doi: 10.48550/arXiv.2010.11929.
[37] H Cao, Y Y Wang, J Chen, D S Jiang, X P Zhang, Q Tian and M N Wang, “Swin-Unet: Unet-like
Pure Transformer for Medical Image Segmentation,” arXiv preprint, May 2021,
doi: 10.48550/arXiv.2105.05537.
[38] J Chen, Y Lu, Q Yu, X Luo, E Adeli, Y Wang, L Lu, A L Yuille and Y Zhou, “TransUNet:
Transformers Make Strong Encoders for Medical Image Segmentation,” arXiv preprint, Feb. 2021,
doi: 10.48550/arXiv.2102.04306.

PENG JING was born in Datong, Shanxi Province, China, in 1994. He is currently studying for a
master’s degree in the School of Surveying and Land Information Engineering, Henan Polytechnic
University, Jiaozuo. His current research interests mainly include deep learning object
detection and semantic segmentation.


HAIYANG YU was born in Linyi, Shandong, China, in 1978. He received the Ph.D. degree
from the China University of Geosciences. He is currently a Professor with the School of
Surveying and Land Information Engineering, Henan Polytechnic University, Jiaozuo. He is the
author or coauthor of more than 50 papers published in academic journals and conferences. His
main research interests include remote sensing theory and application and LiDAR data processing
and application.

ZHIHUA HUA was born in Zhoukou, Henan, China, in 1998. He is currently pursuing the master’s
degree with the School of Surveying and Land Information Engineering, Henan Polytechnic
University, Jiaozuo. His current research interests mainly include remote sensing image
processing and change detection.

SAIFEI XIE was born in Xuchang, Henan Province, China, in 2000. She is currently studying for a
master’s degree in the School of Surveying and Land Information Engineering, Henan Polytechnic
University, Jiaozuo. Her current research interests mainly include deep-learning-based point
cloud filtering.

CAOYUAN SONG was born in Xuchang, Henan Province, China, in 1997. She is currently studying for
a master’s degree in the School of Surveying and Land Information Engineering, Henan Polytechnic
University, Jiaozuo. Her current research interests mainly include deep-learning-based point
cloud filtering.


This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/