Improving the Efficiency of Encoder-Decoder Architecture for Pixel-Level Crack Detection
ABSTRACT Cracks are one of the most common categories of pavement distress and may potentially threaten road and highway safety. A reliable and efficient pixel-level method of crack detection is therefore necessary for real-time measurement of cracks. However, many existing encoder-decoder architectures for crack detection are time-consuming because the decoder module typically contains many convolutional layers and feature channels, so performance relies heavily on computing resources, which is a handicap in scenarios with limited resources. In this study, we propose a simple and effective method to boost the efficiency of encoder-decoder architectures for crack detection. We develop a switch module, called SWM, that predicts whether an image is positive (crack) or negative (non-crack) and then skips the decoder module to save computation time when it is negative. The method uses the encoder module as a fixed feature extractor and only needs to place a light-weight classifier head on the end of the encoder module to output the final class probability. We choose the classical UNet and DeepCrack as examples of encoder-decoder architectures to show how SWM is integrated into them to reduce computation complexity. Evaluations on the public CrackTree206 and AIMCrack datasets demonstrate that our method significantly boosts the efficiency of the encoder-decoder architectures in all tasks without affecting performance. SWM can also be easily embedded into other encoder-decoder architectures for further improvement. The source code is available at https://github.com/hanshenchen/crack-detection.
INDEX TERMS Computer vision, Deep learning, Encoder-decoder architecture, Crack detection
methods. Many studies have proposed various kinds of fully convolutional encoder-decoder networks [12,13,14] for crack detection. Typically, these networks contain an encoder module, which usually uses the backbone of a classification network to reduce the feature maps and capture higher semantic information, and a decoder module, which uses up-sampling layers to create the final pixel-wise prediction. Compared with a classification network, as illustrated in Figure 1, the encoder-decoder architecture has an additional decoder module and is therefore usually more complex. However, the pavement surface image, captured by a camera mounted on a vehicle, is mostly non-crack. As shown in Figure 2, even where there are cracks in the road, most areas of the road image are background or irrelevant objects. Therefore, it is pointless and time-consuming to forward the whole image through the decoder module for crack detection. For these reasons, the question arises: why not crop the pavement image and select the crack patches with a classification network before feeding them into the decoder module?
If the input patch is predicted as non-crack, we will skip the decoder module and create a negative map. Otherwise, we will feed it into the decoder module to create a crack map. This method saves significant time in real applications because most areas of the pavement image are non-crack.
Recent works [15,16] have also studied algorithmic efficiency, but they have mainly focused on reducing the model size by designing light-weight networks. Our work clearly differs from these approaches in that we only add an SWM to the original network, which does not change the inner structure of the network. Our method is also end-to-end, which differs from two-stage methods [17,18], in which the first stage detects basic visual units (objects, parts, patches, etc.) and the second stage uses a segmentation network to create the final map. The proposed method has several advantages:
1) The method is simple: we only add light-weight layers to the original network, and it has the potential to be integrated into many efficient encoder-decoder architectures to further reduce computation complexity.
2) The method is efficient and able to dramatically speed up inference while keeping segmentation accuracy unchanged.
3) Training is also simple: the encoder-decoder network can be trained as before and the SWM fine-tuned afterwards, or the whole network can be trained in a single stage using a multi-task loss.
To the best of our knowledge, this is the first method that dynamically skips the decoder module to save computation time in crack detection.
The rest of the paper is organized as follows. A brief
survey of related work is presented in the next section. Then,
the proposed approach is described in Section III, and the
evaluations and analyses are provided in Section IV. In the
last section, the conclusion and future work are presented.
FIGURE 1. The typical construction of the classification network and the
encoder-decoder network.
II. RELATED WORK
Some pixel-level pavement crack detection methods and
related studies on efficient encoder-decoder architectures are
briefly reviewed.
convolutional encoder-decoder network and showed that the backbone of the pre-trained ResNet-152 encoder is an optimal choice for detecting cracks. However, the inference time per image of the VGG-16 and ResNet-152 backbones is 4.34 and 4.64 seconds, respectively, which is not suitable for real-time applications.
Due to complicated pavement conditions, it is hard to find features that are effective for all the different pavements. To meet these challenges and to learn a strong representation, several combined multi-scale and multi-level methods have been proposed. In particular, the feature pyramid module [21] is used as a multi-scale feature extractor to capture rich contextual information at different resolutions, and the hierarchical convolutional neural network [22,23] is used to obtain sharp object boundaries. Besides, Mei and Gül [24] proposed a method based on the conditional Wasserstein generative adversarial network (cWGAN) and introduced a loss function with connectivity maps to overcome the issue of scattered output in deconvolution layers. The above studies considered vastly different crack datasets, and hence their results are not directly comparable.
Some studies have also focused on accelerating models for crack detection. Liu et al. [15] proposed a fast encoder-decoder network called FPCNet. Specifically, dilated convolutions with multiple rates were applied in the encoder sub-network to synthesize crack features of multiple context sizes, and a squeeze-and-excitation learning operation was introduced in the decoder sub-network. This design not only optimizes the multi-dilation (MD) features but also reduces the computing burden. Besides, Fei et al. [16] introduced an efficient deep network called CrackNet-V for automated pixel-level crack detection on 3D asphalt pavement images. CrackNet-V builds on previous work (CrackNet) to improve accuracy and computational efficiency, and it uses many small-filter (3×3) convolutional layers to increase the depth of the network without increasing the number of parameters. On a single Nvidia GTX 1080Ti GPU, CrackNet-V takes about 0.33 s for a single forward pass of a 512 × 256 image. However, our method achieves better efficiency with the same hardware configuration; more details are given in Section IV.
B. EFFICIENT ENCODER-DECODER ARCHITECTURES
In semantic segmentation, speed is one of the important factors for real-time applications. Most prior studies on computationally efficient encoder-decoder architectures focus on reducing the model size. For instance, ENet [25] uses a deeper encoder and a shallower decoder to create a quite shallow segmentation network with few parameters. Chaurasia et al. [26] proposed the LinkNet architecture, which utilizes ResNet-18 as its encoder and achieves a better mean intersection over union (mIoU) than ENet; however, ENet outperforms it in terms of computational efficiency. Recently, Mehta et al. [27] proposed an efficient spatial pyramid (ESP) module to reduce memory and time consumption while losing only 8% accuracy. Although this method leads to a significant speedup, it also sacrifices the high performance of previous networks. In ICNet [28], the authors developed a cascade network to efficiently utilize low-resolution semantic information along with details from high-resolution images. This work improves both the speed and accuracy of existing deep networks, but it may fail to obtain solid performance when the input data is low-resolution. Nekrasov et al. [29] adapted RefineNet [30] into a light-weight RefineNet, creating a more compact network by replacing 3×3 convolutions and mixing in other efficient backbones for real-time semantic segmentation.
Recently, decoding technology has gained greater attention, and new up-sampling methods [31,32,33] have been proposed to decrease the computation burden of the decoder module. However, these approaches still incur a certain computation complexity in the decoder module. In contrast to the approaches mentioned above, our method is simple and does not require huge efforts from human experts to design a light-weight network.
III. PROPOSED APPROACH
The goal of the current study is to boost algorithmic efficiency by skipping the redundant decoder when the input image is non-crack. The details of the switch module (SWM) are presented, and an example of how to integrate it into an encoder-decoder network is illustrated.
A. SWITCH MODULE
The switch module can be seen as a binary classification network. It infers whether the input image belongs to the negative class, with no crack, or contains a crack and is called positive. The backbone of an encoder module is commonly based on a classification network; the main difference between the two is that the encoder module removes the last several layers. Hence, we can take the removed layers back and place them after the output of the encoder to perform the binary classification. In other words, the encoder module can be seen as a feature extractor, and a light-weight classifier head can be placed on the end of the encoder module to output the final class probability. As illustrated in Figure 3, the SWM consists of a binary classifier head and a switch. The binary classifier head takes the features inferred by the encoder module and uses them to infer the binary class of the patch. The switch uses the predicted class to decide whether to pass the features to the decoder module or to directly output the negative map.
At test time, the input image is divided into non-overlapping patches, and each patch is fed to the encoder to extract features. The classifier head uses the inferred features to check whether the input patch has a crack or not. If the result is positive, as illustrated in Figure 3 (left), the features are fed into the decoder module to produce the crack map. Otherwise, as illustrated in Figure 3 (right), the switch skips the decoder and outputs the negative map. In the end, the generated patch maps are merged to obtain the final result. A more detailed description is given in Algorithm 1. Let $H \times W$ be the size of the patches extracted from the pavement image, $N$ be the number of image patches cropped from a pavement image, and
$Y_i$ be the crack map for the input patch $X_i$. Crack detection is a system that takes a pavement image $I$ and produces a crack map image $Y$ of the same size (width and height). The encoder module is represented as a function $F_e(X_i; W_e)$ that takes the extracted patch $X_i$ and produces feature maps; the decoder module is another function $F_d(L_f; W_d)$ that takes the feature maps $L_f$ and estimates the crack map; and $F_c(L_f; W_c)$ is the classifier head that takes the feature maps $L_f$ and predicts the positive probability.
FIGURE 3. Illustration of the encoder-decoder architecture with the SWM. The SWM is highlighted in the blue rectangle. Left: predicted positive; Right: predicted negative.
Algorithm 1: Inference computation through the encoder-decoder network with SWM
Input: a pavement image $I$, $I \in \mathbb{R}^{H \times W \times C}$
Output: the target output $Y$, $Y \in \mathbb{R}^{H \times W}$
Require: $W_e$, the weight matrices of the encoder module; $W_c$, the weight matrices of the classifier head; $W_d$, the weight matrices of the decoder module
Note: $L_f$ and $p_{cls}$ are temporary variables that store the feature maps and the class probability, respectively.
1: crop the pavement image $I$ into non-overlapping patches $\{X_i\}_{i=1}^{N}$
2: for $i = 1$ to $N$ do
3:   $L_f = F_e(X_i; W_e)$  /* infer the features $L_f$ of the input patch with the encoder module */
4:   $p_{cls} = F_c(L_f; W_c)$  /* infer the probability $p_{cls}$ of the positive class with the classifier head */
5:   if $p_{cls} < 0.5$ then
6:     $Y_i = \{0\}^{W \times H}$  /* create an empty mask */
7:   else
8:     $Y_i = F_d(L_f; W_d)$  /* infer the crack map $Y_i$ from the input features with the decoder module */
9:   end if
10: end for
11: $Y = \{Y_i\}_{i=1}^{N}$  /* assemble all crack maps into an image to obtain the final mask map */
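To make the control flow concrete, Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch only: encoder, classifier_head, and decoder are hypothetical callables standing in for $F_e$, $F_c$, and $F_d$ (e.g., compiled Keras sub-models), and the image height and width are assumed to be multiples of the patch size.

```python
# Illustrative NumPy sketch of Algorithm 1 (SWM-gated inference).
import numpy as np

def detect_cracks(image, encoder, classifier_head, decoder, patch=192):
    H, W = image.shape[:2]                # assumed divisible by `patch`
    Y = np.zeros((H, W), dtype=np.float32)
    for r in range(0, H, patch):          # step 1: non-overlapping crops
        for c in range(0, W, patch):
            X_i = image[r:r + patch, c:c + patch]
            L_f = encoder(X_i[None])                 # step 3: L_f = F_e(X_i; W_e)
            p_cls = float(classifier_head(L_f))      # step 4: p_cls = F_c(L_f; W_c)
            if p_cls < 0.5:                          # steps 5-6: negative -> empty mask
                Y_i = np.zeros((patch, patch), np.float32)
            else:                                    # step 8: Y_i = F_d(L_f; W_d)
                Y_i = np.asarray(decoder(L_f))[0, ..., 0]
            Y[r:r + patch, c:c + patch] = Y_i        # step 11: assemble the final map
    return Y
```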
B. TRAINING METHOD
Loss function: Crack detection can be seen as a pixel-wise binary classification problem. Hence, the binary cross-entropy loss function is employed for training the network. We denote the training dataset by $S = \{(X_n, Y_n, Z_n), n = 1, \ldots, N\}$, where $X_n = \{x_i^{(n)}, i = 1, \ldots, M\}$ is the input patch, $Y_n = \{y_i^{(n)}, i = 1, \ldots, M\}$, $y_i^{(n)} \in \{0, 1\}$ is the ground truth of the pixel-wise labels, and $Z_n = \{z^{(n)}\}$, $z^{(n)} \in \{0, 1\}$ is the ground-truth class of the patch. $M$ denotes the number of pixels in every image. The segmentation loss is then defined as:

$$L(\mathbf{W}_e, \mathbf{W}_d)_{seg} = -\sum_{n=1}^{N} \sum_{i=1}^{M} \left[ y_i^{(n)} \log F_d\left(F_e\left(x_i^{(n)}; \mathbf{W}_e\right), \mathbf{W}_d\right) + \left(1 - y_i^{(n)}\right) \log\left(1 - F_d\left(F_e\left(x_i^{(n)}; \mathbf{W}_e\right), \mathbf{W}_d\right)\right) \right] \quad (1)$$

where $\mathbf{W}_e$ and $\mathbf{W}_d$ denote the standard sets of parameters of the encoder module and the decoder module, which are trained with backpropagation, and $F_d(F_e(\mathbf{X}; \mathbf{W}_e), \mathbf{W}_d)$ represents the prediction probability produced by the decoder module. Simultaneously, the same binary cross-entropy loss function is used for SWM training and is formulated as:

$$L(\mathbf{W}_e, \mathbf{W}_c)_{cls} = -\sum_{i=1}^{N} \left[ z^{(i)} \log F_c\left(F_e\left(X_i; \mathbf{W}_e\right), \mathbf{W}_c\right) + \left(1 - z^{(i)}\right) \log\left(1 - F_c\left(F_e\left(X_i; \mathbf{W}_e\right), \mathbf{W}_c\right)\right) \right] \quad (2)$$

where $\mathbf{W}_c$ denotes the parameters of the classifier head and $F_c(F_e(X_i; \mathbf{W}_e), \mathbf{W}_c)$ represents the predicted probability of the positive class.
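The two objectives above, and the weighted sum used for joint training later in this section (which assigns the classification loss a weight of 0.05), can be expressed compactly. The sketch below uses standard Keras losses and illustrative tensor names, not the authors' released code.

```python
# Hedged TensorFlow sketch of Eqs. (1)-(2) and the joint objective.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def joint_loss(y_true_mask, y_pred_mask, z_true, z_pred, lam=0.05):
    l_seg = bce(y_true_mask, y_pred_mask)  # Eq. (1): pixel-wise BCE on the crack map
    l_cls = bce(z_true, z_pred)            # Eq. (2): patch-level BCE for the SWM
    return l_seg + lam * l_cls             # weight factor lam balances the two tasks
```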
FIGURE 4. The structure of the proposed UNet+SWM. The classifier head is highlighted in black.
FIGURE 5. The structure of the proposed DeepCrack+SWM. The classifier head is highlighted in black.
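Reading off Figures 4 and 5, the classifier head is a small stack of convolution, normalization, and pooling layers. A minimal Keras sketch consistent with that reading follows; the exact layer ordering and the global max-pooling at the end are our assumptions (Section III-C below describes the head only in prose), not released code.

```python
# Hypothetical Keras sketch of the SWM classifier head (see Figures 4-5).
from tensorflow.keras import layers

def classifier_head(encoder_features):
    # 3x3 conv with stride 2: refine features while halving spatial resolution
    x = layers.Conv2D(256, 3, strides=2, padding='same')(encoder_features)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # 3x3 conv to a single channel with sigmoid, as drawn in the figures
    x = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)
    # pool to one scalar p_cls per patch (our assumption: global max-pooling)
    return layers.GlobalMaxPooling2D()(x)
```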
Training Strategy: There are two strategies that can be used to train the network:
a) Joint training: The encoder-decoder network and SWM can be merged into a single network and trained end to end, which is known as joint training. Joint training is performed by merging the gradients computed by each loss on independent mini-batches. This allows us to train the encoder-decoder network and the classifier head with their own sets of training parameters. Besides, we observe that the semantic network requires more training steps than the classification task, and we thus assign a weight factor of 0.05 to the classification loss in its contribution to the final loss.
b) Fine-tuning: In this situation, the encoder can be considered a fixed feature extractor, which has proven effective at extracting generic features [34]. Therefore, the SWM can be fine-tuned in network training. More specifically, the encoder-decoder network is first trained independently. Subsequently, the branch of the classification network is trained with the target dataset while all the encoder weights are frozen during backpropagation. Fine-tuning converges quickly because the classifier head has only a few trainable layers.
C. ENCODER-DECODER ARCHITECTURES WITH SWM
Many kinds of encoder-decoder architectures, such as UNet [35], SegNet [36], PspNet [37], DeepCrack [23], and DeepLabV3+ [38], have been proposed and achieve decent performance in various applications. Here, UNet and DeepCrack are chosen as examples of encoder-decoder architectures to show how SWM is built and integrated into the architectures to reduce computation complexity. UNet introduces useful skip connections between encoder and decoder blocks with the same spatial resolution and achieves impressive performance in the image segmentation domain, such as biomedical image processing and remote sensing analysis. DeepCrack is a recently proposed method built on the architecture of SegNet for crack detection. This network fuses hierarchical convolutional features to fully exploit the multi-level and multi-scale features of objects, resulting in better performance. Figure 4 and Figure 5 illustrate the architecture of UNet with integrated SWM (called UNet+SWM) and DeepCrack with integrated SWM (called DeepCrack+SWM), respectively. These networks consist of four components: an encoder module, a decoder module, a switch, and a classifier head. Here, only the classifier head is introduced; for the other components, refer to the original papers [35][23].
As shown in Figure 4, reusing features across UNet makes the classifier head light-weight. The architecture of the encoder module is VGG-13 with the last four fully connected layers removed. Theoretically, we could adopt the removed layers as the classifier head, but the last layers of VGG-13 are traditional fully connected layers, which are prone to over-fitting [39] and require huge computational effort. Instead of the fully connected layers, we use convolutional and pooling layers to perform the binary classification. The features extracted by the encoder module are first down-sampled 2x in spatial resolution and non-linearly processed by 3 × 3 convolutions with a stride of 2 to refine the features. Then, we apply another 3 × 3 convolution followed by a Sigmoid activation layer on the low-level features to reduce the number of channels. Next, we apply a max-pooling layer, for faster convergence and better generalization [40], to output the final class probability. For DeepCrack, as shown in Figure 5, the classifier head architecture is similar but has fewer layers. Since the DeepCrack encoder module has five down-sampling units, the first down-sampling layer (max-pooling) is not needed. Both networks skip the decoder module if the input patch is predicted as negative. Otherwise, the encoder features are sent to the decoder module to produce the crack map.
D. MODEL COMPLEXITY
In the encoder-decoder network, suppose that the computation complexity of the encoder module is $O_e$ and that of the decoder module is $O_d$. The model complexity of the encoder-decoder network is:

$$O_{network} = O_e + O_d \quad (3)$$

When we introduce SWM, the model complexity of the encoder-decoder network with SWM is:

$$O_{network+SWM} = \begin{cases} O_e + O_d + O_c, & \text{positive sample} \\ O_e + O_c, & \text{negative sample} \end{cases} \quad (4)$$

where $O_c$ is the computation complexity of the classifier head. Considering that the computation complexity $O_c$ is much lower than $O_e$ and $O_d$, the additional $O_c$ is negligible. Suppose that the ratio of the number of positive samples to the total number of samples is $\mu$; the model complexity of the encoder-decoder network with SWM is then given by

$$O_{network+SWM} = O_e + \mu O_d \quad (5)$$

Hence, integrating SWM into the encoder-decoder network can significantly reduce the computational burden of the network, especially when $\mu$ is relatively small.
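To make Eq. (5) concrete, the per-module costs reported in Table 1 below can be plugged in directly; the crack ratio mu = 0.5 used here is purely illustrative.

```python
# Expected UNet cost under Eq. (5), using the GFLOPs from Table 1.
O_e, O_d, O_c = 19.17, 43.50, 0.17  # encoder, decoder, classifier head
mu = 0.5                            # illustrative fraction of positive patches
baseline = O_e + O_d                # Eq. (3): 62.67 GFLOPs
with_swm = O_e + mu * O_d + O_c     # Eq. (4) averaged over samples: ~41.09 GFLOPs
print(f"{baseline:.2f} vs {with_swm:.2f} GFLOPs")
```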
The computation complexities of each component in UNet+SWM and DeepCrack+SWM are shown in Table 1. The floating-point operations (FLOPs) indicate the amount of computation in a module, and the parameters represent the number of learnable parameters in a module. For a fair comparison, 192×192 is chosen as the input image resolution. The original UNet in our experiments has approximately 31.4M parameters and 62.67 GFLOPs, whereas the classifier head only adds 2.36M parameters and 0.17 GFLOPs. For DeepCrack+SWM, the encoder module is likewise a heavy computational load compared with the SWM. Hence, integrating SWM into the encoder-decoder networks increases the network complexity by only a small fraction. Moreover, the decoder modules of both UNet and DeepCrack have more FLOPs than their encoder modules; hence, a significant reduction in complexity can be achieved by simply skipping the redundant decoder on negative samples in crack detection. It should be mentioned that FLOPs is an indirect metric and only an approximation of the direct metric [41], which means that networks with similar FLOPs may have different speeds. We found that the pooling indices [36] cost a lot of runtime in up-sampling operations, although they account for only a small portion of the FLOPs in the decoder module of DeepCrack, which means that it will be significantly accelerated when we skip the decoder module.

TABLE 1. Comparison of the number of learnable parameters and FLOPs of the encoder module, the decoder module, and the classifier head.

Network     Module             Params (M)   FLOPs (G)
UNet        Encoder module     18.86        19.17
UNet        Decoder module     12.54        43.50
UNet        Classifier head    2.36         0.17
DeepCrack   Encoder module     14.73        22.56
DeepCrack   Decoder module     14.77        25.17
DeepCrack   Classifier head    1.18         0.085

The AIMCrack dataset consists of more images and has a higher ratio of negative patches to positive patches.

TABLE 2. Comparison of the two crack datasets. The number of patches is based on counting non-overlapping crops. The N/P ratio is the proportion of negative patches among all patches.

Dataset        Image resolution   Patch resolution   Negative patches   Positive patches   N/P ratio
CrackTree206   800×600            160×160            1996               2124               0.484
AIMCrack       1920×384           192×192            6473               4067               0.614

1 https://sites.google.com/site/qinzoucn.
2 The dataset can be shared by the authors if requested.
distance between the predicted and the ground-truth boundary. We compute:

1) $\text{Precision} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i}$

2) $\text{Recall} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FN_i}$

3) $\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

4) $\text{IoU} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i + FN_i}$

To evaluate the computation complexity, we used GPU inference time as the evaluation metric, reported on a computer with an Nvidia GTX 1080Ti GPU and an Intel i5 8500 CPU. The inference time was obtained from the forward passes of the images and averaged over all the results.
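These image-averaged metrics translate directly into NumPy; the sketch below assumes lists of binary prediction and ground-truth masks, one pair per test image (the small epsilon guarding empty denominators is ours).

```python
# NumPy transcription of the Precision/Recall/F1/IoU definitions above.
import numpy as np

def crack_metrics(preds, gts, eps=1e-12):
    prec, rec, iou = [], [], []
    for p, g in zip(preds, gts):
        tp = np.sum((p == 1) & (g == 1))
        fp = np.sum((p == 1) & (g == 0))
        fn = np.sum((p == 0) & (g == 1))
        prec.append(tp / (tp + fp + eps))
        rec.append(tp / (tp + fn + eps))
        iou.append(tp / (tp + fp + fn + eps))
    precision, recall = np.mean(prec), np.mean(rec)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1, np.mean(iou)
```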
C. EXPERIMENTAL SETTINGS
The experiments were conducted on an Nvidia GTX1080Ti
GPU with 11GB memory and the deep learning framework
Keras [44] with the TensorFlow [45] backend. The UNet encoder module was initialized with VGG-13 weights pre-trained on ImageNet, and the DeepCrack encoder module with pre-trained VGG-16 weights. Other module weights were randomly initialized using the Glorot normal initializer [46].
For UNet, the learning rate was initially set to 0.0005 and decreased by one order of magnitude after every 20 epochs. For DeepCrack, we trained with a starting learning rate of 0.001 and the same learning-rate schedule. All models were trained using the Adam [47] optimizer with a mini-batch size of 12. Training continued until the validation loss converged. Data augmentation was also used, including horizontal flipping, random brightness (+5%), and random contrast (+5%).
We found that both joint training and the fine-tuning method could be learned automatically, and each achieved similar results in testing. However, we recommend the fine-tuning method in practice because it does not require selecting the weight factor (a hyper-parameter) that balances the two task losses.
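The optimizer and schedule above map onto standard Keras facilities; a hedged sketch follows, in which model, patches, and masks are placeholders for UNet+SWM and its training data, and the epoch count is arbitrary.

```python
# Hedged Keras sketch of the UNet training configuration described above.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import LearningRateScheduler

def lr_schedule(epoch, lr):
    # start at 5e-4 and drop one order of magnitude every 20 epochs
    return 5e-4 * (0.1 ** (epoch // 20))

def train(model, patches, masks, epochs=100):
    model.compile(optimizer=Adam(learning_rate=5e-4), loss='binary_crossentropy')
    model.fit(patches, masks, batch_size=12, epochs=epochs,
              callbacks=[LearningRateScheduler(lr_schedule)])
```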
D. QUANTITATIVE RESULTS
To fairly compare the two methods, we used 0.3 as the prediction threshold to generate binary outputs. The performances of UNet, UNet+SWM, DeepCrack, and DeepCrack+SWM are reported in Table 3. We note that the performance of networks with integrated SWM closely matches that of the original networks while yielding faster processing speed. For CrackTree206, our method (UNet+SWM) runs approximately 30.7% faster than UNet, and DeepCrack+SWM runs approximately 62.9% faster than DeepCrack. On the AIMCrack dataset, the portion of positive patches is lower than in CrackTree206, and our method leads to a higher speedup: 39.1% faster than UNet and 86.8% faster than DeepCrack. The precision-recall (PR) curves and the average precision (AP) scores are provided in Figure 7. The PR curves of UNet and UNet+SWM are nearly identical, and the same is true for DeepCrack and DeepCrack+SWM. These results again indicate that integration with SWM maintains decent accuracy compared to the baseline network. We also note that UNet performs slightly better than DeepCrack on CrackTree206, whereas on the AIMCrack dataset DeepCrack performs better. The difference in performance may be related to the fact that the number of training samples is much smaller in CrackTree206 (126 images) than in AIMCrack (327 images). UNet is known to work well in data-limited scenarios [35], while DeepCrack is built on SegNet, which was primarily designed for semantic understanding of road/indoor scenes and demands more training data [36].
FIGURE 7. The precision-recall (PR) curves of UNet, UNet+SWM, DeepCrack, and DeepCrack+SWM on the two crack test sets. The top panel shows the PR curves on the CrackTree206 dataset; the bottom panel shows the PR curves on the AIMCrack dataset. Best viewed in color.
TABLE 3. The performances of UNet, UNet+SWM, DeepCrack, and DeepCrack+SWM on the two crack datasets. The reported performances are obtained at a threshold of 0.3. The inference times are evaluated using a mini-batch size of 1 and are shown in the last column.
E. QUALITATIVE RESULTS
Sample results from the CrackTree206 and AIMCrack datasets are shown in Figure 8 and Figure 9, respectively. The pavement images, the ground-truth annotations, the UNet segmentations along with our proposed UNet+SWM segmentations, and the DeepCrack segmentations along with our proposed DeepCrack+SWM segmentations are provided. The final detection results are shown in rows 4, 5, 7, and 8.
FIGURE 9. Examples of predicted results on the AIMCrack dataset. The first row lists the original images. The corresponding manually labeled cracks are shown in the second row. The third to fifth rows and the sixth to eighth rows show the detection results of the UNet methods and the DeepCrack methods, respectively. The yellow rectangles mark noise created by the original networks; this noise is removed by our method (with integrated SWM).
For the patches of each image, the corresponding classification results are shown in the third and sixth rows of Figure 8 and Figure 9. We highlight the skipped patches (predicted as negative by SWM) with colored boxes, in which green squares represent non-crack patches, blue squares represent crack patches that could not be detected even with a forward pass of the decoder module, and red squares indicate patches containing cracks that the classifier failed to detect. On closer observation, we again noticed that some false-negative (FN) samples are of such low contrast, thin width, and small size that they could not be detected even by the decoder module.
In the fifth row of Figure 8 and the fifth and eighth rows of Figure 9, the yellow rectangles show situations in which pixels that were not actual cracks were predicted as cracks by the original network, whereas our method (with integrated SWM) reduces this noise generation, which is one of our method's advantages. The reason is that when a sample is negative, our method skips the decoder and directly outputs the negative map, avoiding the noise created by the encoder and decoder modules. All qualitative results show that the skipped FN samples have little effect on performance, which further supports the quantitative analysis.
F. COMPARISON WITH OTHER METHODS
The performances of other methods are reported in Table 5 and Figure 10. CrackForest [4] uses traditional image processing to extract features and introduces random structured forests [48] as a classifier for crack detection. The results show that all of the deep-learning methods outperform CrackForest. Note that CrackForest was implemented in MATLAB (R2016a) on a CPU (i5 8500), unlike the other methods, which run on a GPU. LinkNet [26] is an efficient encoder-decoder network that uses ResNet-18 as its encoder and several fully convolutional layers in its decoder. LinkNet can run about 10%
faster than UNet+SWM, at the cost of sacrificing accuracy. ResUNet [49] is a shallower UNet-like model built with residual units [50]. This network cuts down the number of down-sampling and up-sampling units to reduce the computational cost and the number of parameters. The inference time of ResUNet is shorter than that of DeepCrack+SWM but longer than that of UNet+SWM. DenseNet67 [51] is a larger and deeper encoder-decoder network with dense connections [52] in the intermediate layers. On the CrackTree206 dataset, DenseNet67 outperforms all of the other methods but demands more computation time. However, on AIMCrack, UNet+SWM and DeepCrack+SWM perform better. DeepCrack+SWM uses an independent loss on each skip-layer and effectively suppresses noise generation, but costs a lot of runtime. UNet+SWM ranks second in both accuracy and efficiency on the two datasets. Comparing each method's speed and accuracy, UNet+SWM obtains a good balance between accuracy and efficiency for crack detection.

TABLE 5. Comparison of accuracies and inference times of various methods on the CrackTree206 and AIMCrack datasets.

Dataset        Method           IoU     F1      Time (ms)
CrackTree206   CrackForest      0.159   0.294   1545
               LinkNet          0.516   0.680   156
               ResUNet          0.666   0.794   181
               DenseNet67       0.705   0.824   356
               UNet+SWM         0.672   0.802   163
               DeepCrack+SWM    0.641   0.780   310
AIMCrack       CrackForest      0.146   0.289   2493
               LinkNet          0.314   0.453   156
               ResUNet          0.282   0.409   221
               DenseNet67       0.318   0.464   478
               UNet+SWM         0.339   0.477   174
               DeepCrack+SWM    0.343   0.485   273

G. DISCUSSION
We performed extensive experiments on pixel-level crack detection on the CrackTree206 and AIMCrack datasets and showed that UNet+SWM and DeepCrack+SWM deliver a significant leap in efficiency over the original encoder-decoder architectures without losing accuracy in any task. The improvement is more significant when most areas of the pavement do not have a crack. We also visualized the results and showed that integrating SWM into the encoder-decoder architectures does not affect the results in any task.
We also evaluated the performances of different methods on crack detection. Although some segmentation models have achieved decent performance in various applications, none of them was able to achieve state-of-the-art segmentation performance on both datasets. The reason for this phenomenon may be twofold: (i) compared with normal semantic images, cracks do not have a definite shape and usually have an extremely large aspect ratio (thin and long); besides, the amount of training data is limited. (ii) The cracks in the two datasets vary widely in shape and scale, with shadows and various illumination effects.
AIMCrack represents a more common scenario, in which the camera is far away from the object and the captured image contains irrelevant objects. Moreover, the AIMCrack dataset was created by extracting 527 frames (crack images) from 289 hours of collected road video. The frames of the video data must include many non-crack images, which means that in real-world applications, the inference time would be further decreased if we fed the road video (unselected frames) to our network.

V. CONCLUSION AND FUTURE WORK
We hypothesize that in pavement images most of the spatial area is non-crack, which is common in the task of crack detection. We propose a simple and effective switch module, termed SWM, that is integrated into the encoder-decoder architecture to dynamically skip the decoder module and directly output the result when there is no crack in the input image. This method can dramatically boost the baseline speed while preserving accuracy. We chose UNet and DeepCrack as examples of encoder-decoder architectures to show how SWM is built and integrated into the architectures to reduce computation complexity. Besides, the baseline network can be replaced with other encoder-decoder networks for better performance. Finally, we conducted extensive experiments to support the proposed method.
The proposed method of crack detection can also be extended to other class-imbalance domains: for example, estimating concrete bridge cracks at transportation hubs or detecting a tumor in medical images. Nevertheless, our method also has limitations. In particular, the SWM may not be able to be integrated into some non-encoder-decoder architectures (e.g., CrackNet-V [16]). Besides, the method cannot achieve notable speed acceleration in a scenario where the object in the image has a consistent ratio against the background (e.g., Cityscapes [53]).
In the future, we aim to adopt smaller architectures to further improve the system frame rate, or some state-of-the-art encoder-decoder networks to provide more reliable road segmentation. Furthermore, the current version of the proposed method can only be used for crack detection, and we will improve this method to simultaneously detect multiple defects on the road.
FIGURE 10. Examples of predicted results by different methods on the two datasets. The first two columns list the results on the CrackTree206 test
samples, and the last two columns list the results on the AIMCrack test samples.
[6] L. Zhang, F. Yang, Y. D. Zhang, and Y. J. Zhu, "Road crack detection using deep convolutional neural network," in 2016 IEEE Int. Conf. on Image Process. (ICIP), pp. 3708-3712, Sep 2016.
[7] Y. J. Cha, W. Choi, and O. Büyüköztürk, "Deep learning-based crack damage detection using convolutional neural networks," Comput. Aided Civ. and Infrastruct. Eng., vol. 32, no. 5, pp. 361-378, 2017.
[8] H. Maeda, Y. Sekimoto, T. Seto, T. Kashiyama, and H. Omata, "Road damage detection and classification using deep neural networks with smartphone images," Comput. Aided Civ. and Infrastruct. Eng., vol. 33, no. 12, pp. 1127-1141, 2018.
[9] L. Song and X. Wang, "Faster region convolutional neural network for automated pavement distress detection," Road Materials and Pavement Design, pp. 1-19, 2019.
[10] T. A. Carr, M. D. Jenkins, M. I. Iglesias, T. Buggy, and G. Morison, "Road crack detection using a single stage detector based deep neural network," in 2018 IEEE Workshop on Environmental, Energy, and Structural Monitoring Systems (EESMS), pp. 1-5, Jun 2018.
[11] S. Dorafshan, R. J. Thomas, and M. Maguire, "Comparison of deep convolutional neural networks and edge detectors for image-based crack detection in concrete," Construction and Building Materials, vol. 186, pp. 1031-1045, 2018.
[12] X. Yang, H. Li, Y. Yu, X. Luo, T. Huang, and X. Yang, "Automatic pixel-level crack detection and measurement using fully convolutional network," Comput. Aided Civ. and Infrastruct. Eng., vol. 33, no. 12, pp. 1090-1109, 2018.
[13] S. Bang, S. Park, H. Kim, and H. Kim, "Encoder-decoder network for pixel-level road crack detection in black-box images," Comput. Aided Civ. and Infrastruct. Eng., vol. 34, no. 8, pp. 713-727, 2019.
[14] Z. Fan, Y. Wu, J. Lu, and W. Li, "Automatic pavement crack detection based on structured prediction with the convolutional neural network," arXiv:1802.02208, 2018. [Online]. Available: https://arxiv.org/abs/1802.02208
[15] W. Liu, Y. Huang, Y. Li, and Q. Chen, "FPCNet: fast pavement crack detection network based on encoder-decoder architecture," arXiv:1907.02248, 2019. [Online]. Available: https://arxiv.org/abs/1907.02248
[16] Y. Fei, K. C. Wang, A. Zhang, C. Chen, J. Q. Li et al., "Pixel-level cracking detection on 3D asphalt pavement images through deep-learning-based CrackNet-V," IEEE Trans. Intell. Transp. Syst., pp. 1-12, 2019.
[17] S. Liu, X. Qi, J. Shi, H. Zhang, and J. Jia, "Multi-scale patch aggregation (MPA) for simultaneous detection and segmentation," in Proc. IEEE Conf. on Comput. Vision and Pattern Recognit. (CVPR), pp. 3141-3149, 2016.
[18] H. R. Roth, L. Lu, N. Lay, A. P. Harrison, A. Farag et al., "Spatial aggregation of holistically-nested convolutional neural networks for automated pancreas localization and segmentation," Medical Image Anal., pp. 94-107, 2018.
[19] A. Zhang, K. C. Wang, B. Li, E. Yang, X. Dai, Y. Peng et al., "Automated pixel-level pavement crack detection on 3D asphalt surfaces using a deep-learning network," Comput. Aided Civ. and Infrastruct. Eng., vol. 32, no. 10, pp. 805-819, 2017.
[20] C. V. Dung and L. D. Anh, "Autonomous concrete crack detection using deep fully convolutional neural network," Automat. in Construction, vol. 99, pp. 52-58, 2019.
[21] F. Yang, L. Zhang, S. Yu, D. Prokhorov, X. Mei, and H. Ling, "Feature pyramid and hierarchical boosting network for pavement crack detection," IEEE Trans. on Intell. Transp. Syst., pp. 1-11, April 2019.
[22] Y. Liu, J. Yao, X. Lu, R. Xie, and L. Li, "DeepCrack: a deep hierarchical feature learning architecture for crack segmentation," Neurocomputing, vol. 338, pp. 139-153, 2019.
[23] Q. Zou, Z. Zhang, Q. Li, X. Qi, Q. Wang, and S. Wang, "DeepCrack: learning hierarchical convolutional features for crack detection," IEEE Trans. on Image Process., vol. 28, no. 3, pp. 1498-1512, 2018.
[24] Q. Mei and M. Gül, "A conditional Wasserstein generative adversarial network for pixel-level crack detection using video extracted images," arXiv:1907.06014, 2019. [Online]. Available: https://arxiv.org/abs/1907.06014
[25] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "ENet: a deep neural network architecture for real-time semantic segmentation," arXiv:1606.02147, 2016. [Online]. Available: https://arxiv.org/abs/1606.02147
[26] A. Chaurasia and E. Culurciello, "LinkNet: exploiting encoder representations for efficient semantic segmentation," in 2017 IEEE Visual Commun. and Image Process. (VCIP), pp. 1-4, Dec 2017.
[27] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, "ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation," in Proc. Eur. Conf. on Comput. Vision (ECCV), pp. 552-568, 2018.
[28] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, "ICNet for real-time semantic segmentation on high-resolution images," in Proc. Eur. Conf. on Comput. Vision (ECCV), pp. 405-420, 2018.
[29] V. Nekrasov, C. Shen, and I. Reid, "Light-weight RefineNet for real-time semantic segmentation," in 29th British Machine Vision Conf. (BMVC), 2018. [Online]. Available: http://bmvc2018.org/contents/papers/0494.pdf
[30] G. Lin, A. Milan, C. Shen, and I. Reid, "RefineNet: multi-path refinement networks for high-resolution semantic segmentation," in Proc. IEEE Conf. on Comput. Vision and Pattern Recognit. (CVPR), pp. 1925-1934, 2017.
[31] Z. Tian, T. He, C. Shen, and Y. Yan, "Decoders matter for semantic segmentation: data-dependent decoding enables flexible feature aggregation," in Proc. IEEE Conf. on Comput. Vision and Pattern Recognit. (CVPR), pp. 3126-3135, 2019.
[32] Z. Wojna, V. Ferrari, S. Guadarrama, N. Silberman, L. C. Chen, A. Fathi, and J. Uijlings, "The devil is in the decoder: classification, regression and GANs," Int. J. of Comput. Vision, pp. 1-13, 2019.
[33] H. Wu, J. Zhang, K. Huang, K. Liang, and Y. Yu, "FastFCN: rethinking dilated convolution in the backbone for semantic segmentation," arXiv:1903.11816, 2019. [Online]. Available: https://arxiv.org/abs/1903.11816
[34] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. on Comput. Vision and Pattern Recognit. (CVPR), pp. 580-587, 2014.
[35] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: convolutional networks for biomedical image segmentation," in Int. Conf. on Medical Image Comput. and Computer-Assisted Intervention, pp. 234-241, Oct 2015.
[36] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: a deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. on Pattern Anal. and Machine Intelligence, vol. 39, no. 12, pp. 2481-2495, 2017.
[37] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proc. IEEE Conf. on Comput. Vision and Pattern Recognit. (CVPR), pp. 2881-2890, 2017.
[38] L. C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proc. Eur. Conf. on Comput. Vision (ECCV), 2018.
[39] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Inf. Process. Syst., pp. 1097-1105, 2012.
[40] D. Scherer, A. Müller, and S. Behnke, "Evaluation of pooling operations in convolutional architectures for object recognition," in Int. Conf. on Artif. Neural Networks, pp. 92-101, Sep 2010.
[41] N. Ma, X. Zhang, H. Zheng, and J. Sun, "ShuffleNet V2: practical guidelines for efficient CNN architecture design," in Proc. Eur. Conf. on Comput. Vision (ECCV), pp. 116-131, 2018.
[42] Q. Zou, Y. Cao, Q. Li, Q. Mao, and S. Wang, "CrackTree: automatic crack detection from pavement images," Pattern Recognition Letters, vol. 33, no. 3, pp. 227-238, 2012.
[43] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. of Comput. Vision, vol. 88, no. 2, pp. 303-338, 2010.
[44] F. Chollet et al., "Keras," 2015. [Online]. Available: https://github.com/fchollet/keras
[45] M. Abadi, A. Agarwal, P. Barham et al., "TensorFlow: large-scale machine learning on heterogeneous distributed systems," arXiv:1603.04467, 2016. [Online]. Available: http://arxiv.org/abs/1603.04467
[46] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. of the Thirteenth Int. Conf. on Artif. Intell. and Statistics, pp. 249-256, Mar 2010.