The end-to-end CNN model proposed in this paper is shown in Figure 3. The network contains a total of 28 layers, including 16 convolutional layers and 3 maxpooling layers, while the Atrous Spatial Pyramid Pooling (ASPP) module contains 10 convolutional layers. In the structure shown in Figure 3, the first three convolutional layers extract image features, and the resulting feature maps are fed into the ASPP module (whose structure is described in detail in Section 2.2.3) to extract multi-scale crack feature information. Furthermore, the depthwise separable convolutions contained in the ASPP module reduce the computational complexity of the model, making the network easier to optimize. In the last three convolutional layers of the network, we apply atrous convolution in place of maxpooling, which enlarges the receptive field while avoiding degradation of the image resolution. Finally, the network uses Softmax to classify the input pictures (cracks or backgrounds).
2.2.1. Atrous Convolution
Modern image classification networks integrate multi-scale context information through successive pooling and down-sampling layers, which causes a loss of detail information about object edges and degrades the image resolution [25,26]. To solve this problem, Yu et al. [27] proposed a novel convolution method, atrous convolution (also called dilated convolution). Atrous convolution can exponentially expand the receptive field without any loss of resolution, producing a denser feature map [27].
Let $F : \mathbb{Z}^2 \to \mathbb{R}$ be a discrete function. Let $\Omega_r = [-r, r]^2 \cap \mathbb{Z}^2$ and let $k : \Omega_r \to \mathbb{R}$ be a discrete filter of size $(2r + 1)^2$. Then, the discrete convolution operator $*$ can be defined as:

$$(F * k)(\mathbf{p}) = \sum_{\mathbf{s} + \mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})$$ (1)

We now generalize this operator. Let $l$ be an atrous factor, and let $*_l$ be defined as:

$$(F *_l k)(\mathbf{p}) = \sum_{\mathbf{s} + l\mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})$$ (2)

Here, $*_l$ is the atrous convolution or $l$-atrous convolution. The discrete convolution we are familiar with is simply the 1-atrous convolution.
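The definition of $*_l$ can be illustrated with a short NumPy sketch (didactic and unoptimized, not part of the proposed model; it computes the cross-correlation form used by most deep-learning frameworks, over the 'valid' region only):

```python
import numpy as np

def dilated_conv2d(F, k, l=1):
    """l-atrous (dilated) convolution: out(p) = sum over s + l*t = p of F(s) k(t).

    Sketch only: cross-correlation form (no filter flip), 'valid' region,
    stride 1. F is a 2-D input array, k an odd-sized 2-D filter.
    """
    kh, kw = k.shape
    # effective extent of the filter once l - 1 holes separate its taps
    eh, ew = (kh - 1) * l + 1, (kw - 1) * l + 1
    H, W = F.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sample the input with stride l under each filter tap
            out[i, j] = np.sum(F[i:i + eh:l, j:j + ew:l] * k)
    return out
```

With $l = 1$ this reduces to the familiar discrete convolution; with $l = 2$, a 3 × 3 filter covers a 5 × 5 window of the input without adding parameters.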
Atrous convolution enables an exponential expansion of the receptive field without any loss of resolution or coverage [27]. Let $F_0, F_1, \ldots, F_{n-1} : \mathbb{Z}^2 \to \mathbb{R}$ be discrete functions, and let $k_0, k_1, \ldots, k_{n-2} : \Omega_1 \to \mathbb{R}$ be discrete 3 × 3 filters. Consider applying the filters with exponentially increasing atrous factors:

$$F_{i+1} = F_i *_{2^i} k_i, \quad i = 0, 1, \ldots, n - 2$$ (3)

The receptive field of an element $\mathbf{p}$ in $F_{i+1}$ is defined as the set of elements in $F_0$ that modify the value of $F_{i+1}(\mathbf{p})$, and the size of the receptive field of $\mathbf{p}$ in $F_{i+1}$ is the number of these elements. Then, the receptive field of each element in $F_{i+1}$ is a $(2^{i+2} - 1) \times (2^{i+2} - 1)$ square; that is, the receptive field is an exponentially increasing square, as shown in Figure 4.
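This growth law can be reproduced with a few lines of Python (an illustrative check: a 3 × 3 filter at atrous factor $2^i$ extends the field side by $2 \cdot 2^i$):

```python
def receptive_fields(n_layers):
    """Receptive-field side length after each of n_layers stacked 3x3
    filters applied with atrous factors 1, 2, 4, ... (2**i at layer i)."""
    r, sizes = 1, []
    for i in range(n_layers):
        r += 2 * (2 ** i)  # a 3x3 kernel at atrous factor d widens the field by 2d
        sizes.append(r)
    return sizes
```

The sequence 3, 7, 15, 31, ... matches the closed form $2^{i+2} - 1$ stated above.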
In our crack detection task, special attention must be paid to the edge and texture information of the crack. However, successive pooling and down-sampling layers lead to the loss of object edge detail information and a reduced image resolution [25,26]. For example, after four stride-2 pooling layers, the information of any object smaller than 16 pixels would be lost. In the last few layers of the network, the resolution of the feature map is small, so using atrous convolution instead of maxpooling effectively preserves the resolution of the feature map and provides more feature information for the final classification, thereby yielding a higher accuracy. Therefore, atrous convolution with an atrous rate of 2 was introduced in the last three convolutional layers of the network. As a result, the loss of edge detail information and the reduction of the image resolution are avoided, while a larger receptive field is obtained.
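As a quick sanity check on this design choice, the effective extent of an atrous filter follows the standard formula $k + (k - 1)(r - 1)$ (illustrative helper, not from the paper):

```python
def effective_kernel(k, rate):
    """Effective extent of a k x k filter at atrous rate `rate`:
    rate - 1 zeros ("holes") are inserted between adjacent taps."""
    return k + (k - 1) * (rate - 1)

# A 3x3 filter at rate 2 covers a 5x5 window; with 'same' padding the
# feature map keeps its resolution, whereas a stride-2 maxpool layer
# would halve the spatial size.
```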
2.2.2. Depthwise Separable Convolution
The depthwise separable convolution was originally proposed by Laurent Sifre [28]. It subsequently became widely known and was applied in models such as Inception [29], Xception [30], and MobileNets [31].
The depthwise separable convolution factorizes the standard convolution into a depthwise convolution and a pointwise convolution. The depthwise convolution is applied independently to each channel of the input feature map, and the outputs of the depthwise convolution are then combined using a pointwise convolution. The depthwise separable convolution can greatly reduce the model parameters and computational complexity while maintaining similar or better performance [32].
Figure 5 shows how the standard convolution (Figure 5a) is decomposed into a depthwise convolution (Figure 5b) and a 1 × 1 pointwise convolution (Figure 5c).
The standard convolutional layer takes a feature map F of size $D_F \times D_F \times M$ as input and produces a feature map G of size $D_G \times D_G \times N$, where $D_F$ is the spatial width and height of the square input feature map, M is the number of input channels, $D_G$ is the spatial width and height of the square output feature map, and N is the number of output channels [31].

The standard convolutional layer is parameterized by a convolution kernel K of size $D_K \times D_K \times M \times N$, where $D_K$ is the spatial width and height of the kernel (which is assumed to be square), M is the number of input channels, and N is the number of output channels.

Assuming a stride of one and padding, so that $D_G = D_F$, the computational cost of the standard convolution is shown in Equation (4):

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$ (4)

As Equation (4) shows, the computational cost of the standard convolution depends multiplicatively on the output feature map size $D_F \times D_F$, the convolution kernel size $D_K \times D_K$, the number of input channels M, and the number of output channels N.
The depthwise separable convolution splits the traditional convolution into two layers: the first is the filter layer, in which a depthwise convolution performs spatial convolution on each channel of the input feature map; the second is the combined layer, in which a 1 × 1 pointwise convolution combines the outputs of the depthwise convolution.
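The two-step factorization can be sketched in NumPy (a didactic sketch with illustrative names, using 'valid' padding and stride 1, not the paper's implementation):

```python
import numpy as np

def depthwise_separable_conv(F, K_dw, K_pw):
    """Depthwise separable convolution sketch.

    F    : input feature map, shape (H, W, M)
    K_dw : depthwise 3x3 filters, shape (3, 3, M) -- one filter per channel
    K_pw : pointwise 1x1 filters, shape (M, N)
    Returns an array of shape (H - 2, W - 2, N).
    """
    H, W, M = F.shape
    # filter layer: spatial convolution applied independently per channel
    dw = np.zeros((H - 2, W - 2, M))
    for m in range(M):
        for i in range(H - 2):
            for j in range(W - 2):
                dw[i, j, m] = np.sum(F[i:i + 3, j:j + 3, m] * K_dw[:, :, m])
    # combined layer: 1x1 convolution linearly mixes the M channels into N
    return dw @ K_pw  # (H-2, W-2, M) @ (M, N) -> (H-2, W-2, N)
```

Note that spatial filtering and channel mixing are now learned separately, which is exactly where the parameter and computation savings come from.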
The computational cost of the depthwise convolution is shown in Equation (5):

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F$$ (5)

The computational cost of the 1 × 1 pointwise convolution is shown in Equation (6):

$$M \cdot N \cdot D_F \cdot D_F$$ (6)

The computational cost of the depthwise separable convolution, which is the sum of the costs of the depthwise convolution and the pointwise convolution, is shown in Equation (7):

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$ (7)

By decomposing the standard convolution into the filter layer and the combined layer, we obtain the reduction in computation shown in Equation (8):

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$ (8)
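The savings are easy to quantify with a small helper (illustrative names; it assumes stride one and padding so the output is also $D_F \times D_F$):

```python
def conv_costs(D_F, D_K, M, N):
    """Multiply-accumulate counts for standard vs. depthwise separable
    convolution, following the cost model above."""
    standard  = D_K * D_K * M * N * D_F * D_F   # standard convolution
    depthwise = D_K * D_K * M * D_F * D_F       # filter layer
    pointwise = M * N * D_F * D_F               # combined layer
    separable = depthwise + pointwise           # depthwise separable total
    ratio     = separable / standard            # equals 1/N + 1/D_K**2
    return standard, separable, ratio
```

For a typical 3 × 3 layer with many output channels, the ratio is dominated by $1/D_K^2$, i.e., roughly an 8–9× reduction.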
The standard convolution learns spatial information and cross-channel correlations simultaneously. The depthwise separable convolution instead first performs a depthwise convolution, which broadens the network so that it can extract more features, and then performs a 1 × 1 pointwise convolution that linearly combines the outputs of the depthwise convolution to generate new features. Compared to the traditional convolution, the computational cost and model complexity of the depthwise separable convolution are greatly reduced, making it a more efficient convolution method that can learn representative features from less data. These advantages of the depthwise separable convolution structure are especially important for our crack detection model: our task is to train the model from scratch on a crack dataset, and the depthwise separable convolution allows the network to learn a full feature representation of the data efficiently, thereby improving the model's crack detection performance.
2.2.3. ASPP Module
In deep neural networks, the size of the receptive field roughly indicates the extent to which context information is used [33]. However, Zhou et al. [34] showed that the empirical receptive field of a CNN is much smaller than its theoretical receptive field, especially in the higher layers. This greatly limits the ability of CNNs to use context information to make accurate predictions. Global average pooling [35] is an effective method for obtaining global context information and has been applied in image classification tasks [29,36] to address this problem.
However, directly fusing the information obtained by global average pooling into a single vector may cause the loss of sub-region context information [33]. He et al. [37] proposed hierarchical Spatial Pyramid Pooling (SPP), which fuses information and receptive fields from different sub-regions. Experiments show that multi-scale feature information fusion improves network accuracy. This is not simply because of the increase in parameters, but because multi-level pooling can effectively deal with object deformation and differences in spatial layout [38].
In bridge crack pictures, cracks usually occupy only a small part of the image. Therefore, in order to identify cracks accurately, it is necessary to make effective use of context information and to extract crack features precisely. Compared with the SPP structure in [37], the atrous convolution used by the ASPP structure avoids the loss of image detail caused by down-sampling, and therefore fits the needs of crack detection better. Consequently, the ASPP module, which is part of the DeepLabv2 [39] network proposed by the Google team in 2017, is introduced into our network. It performs parallel atrous sampling at different sampling rates on a given input, which is equivalent to capturing the context of an image at multiple scales, thereby obtaining multi-scale image feature information [32].
The structure of the ASPP module we used is shown in Figure 6. On the feature map extracted by the first few convolutional layers of the CNN in Figure 3, atrous convolutions with rates of 2, 4, and 8 are performed in parallel, and the feature map extracted at each atrous rate is further processed by a depthwise separable convolution in a separate branch. The multi-scale features are then merged with the globally pooled input feature map to produce the final result. Since the ASPP module probes the input feature map at multiple sampling rates, it can capture object and image context information at multiple scales [39], thus improving the accuracy of the network's predictions.
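The branch topology described above can be sketched at the shape level (a NumPy stand-in for illustration only: the per-branch atrous and depthwise separable convolutions are replaced by identity ops, since with 'same' padding they preserve the spatial size; only the merge structure is shown):

```python
import numpy as np

def aspp_sketch(x):
    """Shape-level sketch of the ASPP merge structure.

    x : input feature map of shape (H, W, C).
    Three atrous branches (rates 2, 4, 8) plus a global-average-pooling
    branch are concatenated along the channel axis -> (H, W, 4*C).
    """
    branches = []
    for rate in (2, 4, 8):
        # stand-in for: atrous conv at `rate` + depthwise separable conv,
        # both with 'same' padding so H x W is preserved
        branches.append(x)
    # global average pooling branch, broadcast back to the spatial grid
    gap = x.mean(axis=(0, 1), keepdims=True)      # (1, 1, C)
    branches.append(np.broadcast_to(gap, x.shape))
    return np.concatenate(branches, axis=-1)      # (H, W, 4*C)
```

The key point the sketch makes is that every branch keeps the input resolution, so the concatenated output mixes context from several receptive-field sizes without any down-sampling.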