1. Introduction
Electricity is one of the most essential resources of the modern world, and high-voltage transmission is what makes its practical use possible. In the transmission of high-voltage electricity, electrical insulators are used to support and separate electrical conductors without allowing current to pass through themselves. Insulators are usually exposed to strong electric fields and to severe weather conditions such as blazing sunlight, typhoons or hurricanes, thunderstorms, freezing rain, and snowstorms. This harsh environment makes insulators vulnerable to damage, which threatens the safety of power grid systems and the supply of electricity [1]. Therefore, it is necessary to develop effective methods for insulator defect detection to ensure safe and reliable electric power transmission [2].
The current defect detection methods can be divided into three categories: physical methods [3,4], traditional vision-based methods [5,6,7,8,9,10,11], and deep learning-based methods [12,13,14].
The physical methods mainly include the ultrasonic wave method [3] and the ultraviolet pulse method [4], both of which rely on manual operation. In [3], the authors proposed an ultrasonic wave method that detects failures by receiving and processing ultrasonic signals at 20–40 kHz and 80–140 kHz, because ultrasound at these frequencies is mostly emitted from the insulator failure location. In [4], the authors introduced an ultraviolet pulse method that identifies the defect location by detecting the discharge of defective insulators. However, in practical applications, the physical methods are easily affected by solar radiation, meteorological conditions, surrounding heat sources, distance, and other factors. In recent years, electric power inspection departments have begun to use UAVs (unmanned aerial vehicles) instead of manual methods to detect insulator defects.
Among the vision-based insulator defect detection methods, the most frequently used ones are HOG + SVM [10] and Haar + AdaBoost [11]. As traditional detection algorithms, they use a sliding window to select the region of interest, extract the features of each window, and then classify the feature samples to obtain the detection results. There are also methods based on contour features and gray similarity matching to classify intact and defective insulators [6]. For example, Zhang and Yang [2] proposed a recognition method for insulators with curved smooth and weak textured (CSWT) surfaces. Li et al. [7] used the improved MPEG-7 EHD (edge histogram method) technique to recognize insulators. Wu et al. [5] used the global minimization active contour model (GMAC) for insulator segmentation. However, these methods usually rely on accumulated experience to design image features, so they are time-consuming and limited in accuracy.
With the continuous improvement of computer performance, detection methods based on deep learning frameworks have been widely used. They can effectively compensate for the loss of feature information that occurs in hand-crafted image feature extraction and improve the efficiency of fault detection. A number of effective target detection algorithms have been proposed, such as Faster R-CNN (faster region-based convolutional neural network) [15], Yolo (you only look once) [16], SSD (single shot multibox detector) [17], and DCNN (dynamic convolutional neural network) [18]. In [14], the authors proposed a method for insulator defect detection based on R-FCN (region-based fully convolutional network). This is a two-stage object detection method based on RPN (region proposal network) and FCN (fully convolutional network) [19], with the RPN generating the region proposals and the FCN obtaining the feature maps. In [20], the proposed insulator defect detection method is based on an improved Faster R-CNN that uses ResNet101 as the backbone. However, the high-resolution insulator images taken by UAV consist mostly of irrelevant background, so the insulator itself contributes only a small part of the information in an image. In addition, the slender shape of insulators and the variety of defects lead to diverse and complex presentations of defects in images. Hence, the accuracy of the existing insulator detection methods still needs to be improved. The merits and demerits of the three method categories discussed above are summarized in Table 1.
Among these three categories, the physical methods are too labor-intensive to be practical for large-scale applications. The other two categories (i.e., traditional vision-based methods and deep learning-based methods) mainly depend on camera systems. Most of the above-mentioned detection methods are based on monocular vision. They have a relatively simple hardware structure and are easy to operate. However, these methods are disadvantaged by the instability of the available feature point extraction algorithms. Hence, extensive efforts based on multi-vision have been made in displacement, crack, and deformation detection. In [21], the authors proposed a method for measuring the distance of a given target using deep learning and binocular vision, in which a target detection network and geometric measurement theory were combined to obtain 3D (three-dimensional) target information. In [22], the authors presented a dynamic real-time method to detect surface deformation and full-field strain in recycled aggregate concrete-filled steel tubular columns. In [23], the authors proposed a four-ocular vision system for 3D reconstruction of large-scale concrete-filled steel tubes under complex testing conditions. Compared with the methods based on monocular vision, these multi-vision technologies consistently demonstrate high accuracy, stability, and feasibility. However, multi-vision methods usually require more expensive equipment than monocular vision methods, and multi-vision systems are usually heavier than monocular vision systems. These disadvantages make multi-vision methods less feasible than monocular vision methods for use on UAVs. Therefore, the methodology developed in this study is based on monocular vision.
To alleviate the low accuracy of the deep learning-based methods reviewed above, two methods inspired by Faster R-CNN are proposed for insulator defect detection. A series of advanced techniques are used to improve the network performance, including RoI Align (region of interest align), cascade regression, and FPN (feature pyramid network). In addition, a new detection pipeline is proposed to improve the detection accuracy, following the idea that reducing the interference of background information could improve the detection performance. The main contributions of this study are listed as follows.
A new detection network is proposed for the detection of insulator defects, and the method is verified to be effective on a self-collected dataset that has been made publicly available online. Our proposed method addresses the misalignment and heavy computational burden of Faster R-CNN, and advanced techniques including RoI Align, cascade regression, and FPN are used to improve the network performance.
A new two-stage defect detection pipeline is proposed. In the first stage, an encoder-decoder mask extraction network generates the insulator mask image to reduce the interference of background information. In the second stage, the Exact R-CNN (exact region-based convolutional neural network) is used to detect insulator defects. Compared with the traditional target detection method, this two-stage pipeline can largely eliminate the interference of background features and greatly improve the accuracy of target detection.
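The background-suppression idea behind the first stage can be illustrated with a minimal sketch. It assumes that the mask extraction network outputs a per-pixel foreground probability and that the mask is applied to the image by element-wise multiplication; the function name `apply_mask` and the 0.5 threshold are hypothetical choices for illustration, not details taken from this paper.

```python
import numpy as np


def apply_mask(image: np.ndarray, mask: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Zero out background pixels of an H x W x 3 image using an H x W mask.

    `mask` holds foreground probabilities; the threshold is an assumed value.
    """
    binary = (mask >= threshold).astype(image.dtype)
    # Broadcast the H x W mask over the colour channels.
    return image * binary[..., None]


# Toy example: a uniform grey image with a 2 x 2 "insulator" region.
image = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 0.9
masked = apply_mask(image, mask)
```

In this sketch only the pixels inside the mask survive, so a detector applied afterwards sees no background texture.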
4. Experiment
To examine the effectiveness of our proposed methods, 879 high-resolution insulator images taken by UAV were collected from the Internet, of which 660 were used as the training set and the other 219 as the test set. The resolution of the images in the dataset ranged from 3936 × 2624 to 7360 × 4912.
Figure 17 shows two examples of healthy insulators and insulators with defects. All the images were processed on an Intel i7-6700K (4.0 GHz) with a GTX 1080 and were compressed to 1024 × 1024. However, if the image size were adjusted directly, the aspect ratio of the original image would be distorted, resulting in the loss of texture, edge, and other information in the image. To avoid this problem, the image was first padded with black edges, with the padding length determined by the long side L and the short side S of the image, and then reduced to 1024 × 1024 by bilinear interpolation. In this way, the aspect ratio of the original image remains unchanged during scaling. For the encoder-decoder mask extraction network, four 256 × 256 image blocks containing all the image information are extracted from the 1024 × 1024 sample images.
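The padding step can be sketched as follows. This is an illustrative sketch only: it assumes that the black padding is split evenly between the two sides of the short dimension, which the text does not state explicitly, and the helper name `pad_to_square` is hypothetical. The subsequent bilinear reduction to 1024 × 1024 is not shown.

```python
import numpy as np


def pad_to_square(image: np.ndarray) -> np.ndarray:
    """Pad the short side of an image with black pixels until it is square.

    Assumption: the total padding (long side minus short side) is split
    as evenly as possible between the two opposite edges.
    """
    height, width = image.shape[:2]
    long_side = max(height, width)
    pad_h = long_side - height
    pad_w = long_side - width
    top, left = pad_h // 2, pad_w // 2
    padding = ((top, pad_h - top), (left, pad_w - left))
    # Leave any trailing axes (e.g. colour channels) unpadded.
    padding += ((0, 0),) * (image.ndim - 2)
    return np.pad(image, padding, mode="constant", constant_values=0)


# Toy image with the same 3:2 aspect ratio as a 3936 x 2624 photograph.
toy = np.ones((26, 39, 3), dtype=np.uint8)
square = pad_to_square(toy)
```

Because the padded image is square, scaling it to 1024 × 1024 changes both dimensions by the same factor, so the insulator keeps its original proportions.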
However, the number of images in our dataset was not enough to train a reliable deep learning-based model. Therefore, image augmentation technologies were used to expand the dataset, including flipping, random rotation, random scaling, and reversal. Figure 18 shows an example of these extended images. Finally, 6600 training images and 2190 test images were obtained. The augmented dataset provided enough images for training, which largely enhanced the generalization ability of our models. The augmented dataset used in this study is publicly available online [33]. The following strategies were also used for network training:
- (1) Warmup strategy: At the beginning of training, the learning rate was not set to its initial value directly but was increased from a small value to the set value by a constant amount at each iteration. This ramp avoids a sudden increase in the learning rate and allows healthy convergence at the start of training. After the warmup, the original learning rate schedule was resumed [44].
- (2) Hard negative mining strategy: In the training of the RPN, the negative samples that were easy to classify (GIoU < 0.3) were not included in the loss because they contribute little to it. Only the positive samples and the hard-to-distinguish negative samples with 0.3 < GIoU < 0.5 were included.
- (3) Sample balance strategy: The number of collected negative samples was far larger than the number of positive samples. Hensman et al. [45] found that the distribution of the training data has a large impact on CNN performance and that a balanced training set is optimal. Hence, the ratio of positive to negative samples used for training was kept at 1:1 to achieve better training performance.
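The three training strategies can be sketched together in a few lines. The sketch below is illustrative rather than the implementation used in this study: the function names, the starting learning rate, the assumption that positives are samples with GIoU of at least 0.5, and the use of random subsampling to reach the 1:1 ratio are all assumptions. Only the 0.3 and 0.5 GIoU bounds for hard negatives and the 1:1 ratio come from the text.

```python
import random


def warmup_lr(iteration, base_lr, warmup_iters, start_lr):
    """Linear warmup: add a constant increment per iteration up to base_lr."""
    if iteration >= warmup_iters:
        return base_lr
    step = (base_lr - start_lr) / warmup_iters
    return start_lr + step * iteration


def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose


def select_samples(anchors, gt_box, rng):
    """Keep positives and hard negatives only, then balance them 1:1."""
    positives, hard_negatives = [], []
    for box in anchors:
        score = giou(box, gt_box)
        if score >= 0.5:            # assumed positive threshold
            positives.append(box)
        elif score > 0.3:           # hard negative: 0.3 < GIoU < 0.5
            hard_negatives.append(box)
        # Easy negatives (GIoU <= 0.3) are dropped from the loss.
    count = min(len(positives), len(hard_negatives))
    return rng.sample(positives, count), rng.sample(hard_negatives, count)


gt = (0, 0, 10, 10)
anchors = [(0, 0, 10, 10), (0, 0, 10, 8), (0, 0, 10, 4), (0, 0, 10, 2), (20, 20, 30, 30)]
pos, neg = select_samples(anchors, gt, random.Random(0))
```

In this toy example two anchors qualify as positives and one as a hard negative, so one positive is dropped to keep the two groups the same size.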
The criteria defined in the PASCAL VOC 2012 competition were used to evaluate the predictions. When a prediction frames multiple insulator defects, it is assigned to the ground truth with the largest IoU. When multiple predictions frame the same insulator defect, they are sorted by confidence. Each prediction is compared with the ground truth by IoU and is considered a “match” when the IoU meets the 0.5 threshold. To avoid counting the same insulator defect more than once, only the matching prediction with the highest confidence is considered a true positive.
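The matching rule can be sketched as a greedy assignment in descending order of confidence. This is a simplified illustration and not the official PASCAL VOC evaluation code: the function names are hypothetical, and treating an IoU exactly equal to 0.5 as a match is an assumption.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def match_predictions(predictions, ground_truths, threshold=0.5):
    """Return one true-positive flag per prediction, in confidence order.

    predictions: list of (box, confidence) pairs.
    """
    ordered = sorted(predictions, key=lambda item: item[1], reverse=True)
    claimed = set()
    flags = []
    for box, _confidence in ordered:
        # Assign the prediction to the ground truth with the largest IoU.
        best_iou, best_index = 0.0, -1
        for index, gt_box in enumerate(ground_truths):
            value = iou(box, gt_box)
            if value > best_iou:
                best_iou, best_index = value, index
        if best_iou >= threshold and best_index not in claimed:
            claimed.add(best_index)     # highest-confidence match wins
            flags.append(True)
        else:
            flags.append(False)         # duplicate or poorly localized
    return flags


def precision_recall(flags, num_ground_truths):
    """Precision and recall from the true-positive flags."""
    true_positives = sum(flags)
    precision = true_positives / len(flags) if flags else 0.0
    recall = true_positives / num_ground_truths if num_ground_truths else 0.0
    return precision, recall


gts = [(0, 0, 10, 10)]
preds = [((1, 1, 10, 10), 0.8), ((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.7)]
flags = match_predictions(preds, gts)
precision, recall = precision_recall(flags, len(gts))
```

Here the second-ranked prediction overlaps the defect well (IoU of 0.81) but is counted as a false positive because a more confident prediction has already claimed the same ground truth.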
When comparing the performance of different algorithms on our test set, four comparison indexes were used: average precision (AP), precision (P), recall rate (R), and frames per second (FPS). P is the proportion of true positive predictions among all positive predictions. R is the proportion of positive samples that are correctly predicted. AP is the area under the precision-recall (PR) curve obtained by varying the confidence threshold. To show the performance of different algorithms more comprehensively, two additional metrics were added [46]: all true (AT) and all miss (AM). AT is the proportion of images in which all the insulator defects are detected correctly. AM is the proportion of images in which none of the insulator defects is detected. For the mask extraction network, the Dice coefficient was used to evaluate its performance, which is defined as

$$\mathrm{Dice} = \frac{2\,|A \cap B|}{|A| + |B|},$$

where $A$ is the ground truth area and $B$ is the area obtained by the mask extraction network.
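The Dice coefficient and the two image-level metrics can be computed as in the sketch below. It is illustrative only: the function names are hypothetical, returning 1.0 for two empty masks is a convention chosen here, and AT is computed from detection counts alone, without considering false positives, which is an assumption about how the metric is defined in [46].

```python
import numpy as np


def dice(gt_mask: np.ndarray, pred_mask: np.ndarray) -> float:
    """Dice coefficient: twice the overlap divided by the sum of both areas."""
    gt = gt_mask.astype(bool)
    pred = pred_mask.astype(bool)
    total = gt.sum() + pred.sum()
    if total == 0:
        return 1.0  # convention chosen here for two empty masks
    return float(2.0 * np.logical_and(gt, pred).sum() / total)


def all_true_all_miss(per_image):
    """AT and AM from (detected_defects, total_defects) pairs, one per image."""
    all_true = sum(1 for detected, total in per_image if detected == total)
    all_miss = sum(1 for detected, _total in per_image if detected == 0)
    return all_true / len(per_image), all_miss / len(per_image)


gt_mask = np.zeros((4, 4), dtype=np.uint8)
gt_mask[1:3, 1:3] = 1            # 4 ground-truth pixels
pred_mask = np.zeros((4, 4), dtype=np.uint8)
pred_mask[1:3, 1:4] = 1          # 6 predicted pixels, 4 of them overlapping
score = dice(gt_mask, pred_mask)

# Four images, each containing at least one defect.
at, am = all_true_all_miss([(2, 2), (0, 1), (1, 3), (1, 1)])
```

In the toy masks the overlap is 4 pixels and the two areas sum to 10 pixels, so the Dice coefficient is 0.8.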
5. Results and Discussions
To verify the effectiveness of our proposed methods, their performance was compared with five commonly used target detection methods, i.e., HOG + SVM [10], Haar + AdaBoost [11], Faster R-CNN [15], YoloV3 [24], and YoloV4 [47]. To show the performance of our proposed methods more comprehensively, the results under two different backbones, i.e., ResNet50 and MobileNetV2, were also compared. The results obtained with input images of different resolutions and with different network structures are also presented in this section for a deeper understanding of our methods.
5.1. Determination of the Number of Residual Blocks
To determine the number of residual blocks between the encoding and decoding stages of the U-Net in Figure 13, an analysis was conducted to examine the performance with different numbers of residual blocks. The results in Table 3 indicate that the Dice coefficient gradually increases with the number of residual blocks in the range of 0–3, but the increase slows down greatly when more blocks are added. As presented in Table 3, the Dice coefficients are similar when the number of blocks is higher than 3. Therefore, as shown in Figure 13, three residual blocks were used.
5.2. Qualitative Detection Results
Figure 19 shows the segmentation results of the mask extraction network mapped to the original image after denoising with a morphological filter, under the backbones of ResNet and MobileNetV2, respectively. The improved mask extraction network based on U-Net performs well under both backbones. In our experiments, the average Dice coefficient of the network is 0.97 with the ResNet backbone and 0.94 with the MobileNetV2 backbone.
Figure 20 shows the qualitative results of all the methods examined on our dataset, where the yellow bounding box is the ground truth and the red bounding box is the prediction. From Figure 20, the following comparisons can be drawn.
Compared with the traditional vision methods, the methods based on deep learning detect more accurately and produce fewer omissions and duplications. This is probably because traditional vision-based methods usually use hand-designed features to represent objects and only consider simple information (such as edge, color, or texture), while the deep learning-based methods automatically extract both low-level and high-level features to represent the object. Therefore, the traditional vision methods cannot effectively identify insulator defects against complex backgrounds.
Compared with Faster R-CNN and YoloV3, our proposed ERCN and CME-CNN have smaller prediction bias, higher positioning accuracy, and almost no omissions or duplications. This is probably because our methods are based on multi-level classification and regression. Each cascade stage improves the detection quality by increasing the GIoU threshold, and RoI Align is introduced to solve the misalignment problem, so that the re-classification and re-regression networks receive the correct features without position deviation. This improves both regression and classification. In addition, to improve the detection performance on small targets, FPN is introduced to combine semantic features of different levels. These techniques strengthen the extraction and processing of image details, making the extracted target features more complete and more accurate.
Compared with ERCN and YoloV4, there is almost no deviation between the prediction of CME-CNN and the ground truth, and the detection performance of CME-CNN is better. This is probably because generating the insulator mask image through the encoder-decoder mask extraction network simplifies the complex background and reduces the interference in detection. This makes the network focus more on learning the target characteristics and makes it more efficient.
From the illustrated qualitative comparison results, we can conclude that the methods proposed in this paper are more accurate than the other methods mentioned above. There is basically no offset between the prediction and the ground truth, and the predictions show basically no omissions or duplications. Our methods therefore capture the details of the image better and extract more comprehensive and accurate features of the detected target.
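The position deviation that RoI Align removes can be illustrated with a single sampling point. The sketch below is not the full RoI Align operator, which averages several sampled points per output bin; it only contrasts coordinate quantization, as used in RoI pooling, with the bilinear interpolation at fractional coordinates that RoI Align relies on. The function name and the toy feature map are hypothetical.

```python
import numpy as np


def bilinear_sample(feature: np.ndarray, y: float, x: float) -> float:
    """Sample a 2D feature map at a fractional (y, x) location."""
    height, width = feature.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return float((1 - dy) * top + dy * bottom)


feature = np.arange(16, dtype=float).reshape(4, 4)

# Quantization snaps the location (1.5, 2.5) to the cell (1, 2).
quantized = float(feature[int(1.5), int(2.5)])

# Bilinear sampling blends the four surrounding cells instead.
aligned = bilinear_sample(feature, 1.5, 2.5)
```

In this toy map the quantized lookup returns 6.0 while the interpolated value is 8.5, which shows how rounding a box coordinate shifts the features that the later classification and regression stages receive.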
5.3. Quantitative Detection Results
Table 4 gives the quantitative results of all the examined methods on our dataset. The results show that our proposed CME-CNN-ResNet50 has the best performance on five of the examined evaluation indexes (i.e., AP, P, R, AT, and AM). Although its FPS is not high, it has great application value in practice, because insulator defect detection using UAV requires high detection accuracy rather than high real-time performance. In addition, due to the introduction of depthwise separable convolution, ERCN-MobileNetV2 has the best FPS (i.e., 11.1). However, its detection performance is worse than that of CME-CNN-ResNet50. The FPS of CME-CNN-ResNet50 is 1.1, indicating that about one image can be processed per second. This should be satisfactory for insulator defect detection using UAV, which does not place high requirements on running time.
In general, the methods with higher detection accuracies usually need higher calculation costs [48]. For the insulator defect detection task in this study, the detection can be conducted off-line on an expensive high-performance server. When an insulator defect is detected, it is reported to the maintenance workers, who manually replace the defective insulator. In this situation, detection accuracy is the most important indicator, and computation time matters much less. For example, the FPS of CME-CNN-ResNet50 is 1.1 and the FPS of YoloV4 is 10.6, which indicates that CME-CNN-ResNet50 takes about 0.8 s longer per image than YoloV4. In the above-mentioned application scenario, saving 0.8 s makes no meaningful difference to maintenance work that requires extensive human labor. Therefore, in this study, we mainly focus on the detection accuracies of the examined methods.
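The 0.8 s figure can be checked with a short calculation, under the assumption that the processing time per image is simply the reciprocal of the reported FPS:

$$t_{\text{CME-CNN}} = \frac{1}{1.1} \approx 0.91\ \text{s}, \qquad t_{\text{YoloV4}} = \frac{1}{10.6} \approx 0.09\ \text{s}, \qquad t_{\text{CME-CNN}} - t_{\text{YoloV4}} \approx 0.81\ \text{s} \approx 0.8\ \text{s}.$$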
5.4. Ablation Study Results
To examine how much the above-mentioned techniques contributed to the detection performance, an ablation study was conducted to compare the detection performance with different techniques enabled. The results are given in Table 5, which shows that the use of cascade R-CNN increases the AP of the model by 1.9%, while the use of mask extraction increases the AP of the model by 4.2%. In addition, using both cascade R-CNN and mask extraction increases the R of the model by 5.8%. Figure 21 shows the influence of different techniques on the model performance. The use of cascade R-CNN reduces the offset between the prediction and the ground truth and makes the model prediction more accurate. The use of mask extraction makes the model more sensitive to insulator defects and improves the detection accuracy for small insulators, and the prediction has basically no offset from the ground truth.
5.5. Detection Performance on Images with Different Resolutions
The above experiments show that the methods proposed in this paper can obtain high detection performance on high-resolution images. However, high-resolution images are not always available in practical applications because of factors such as cost limitations and environmental conditions [49]. Therefore, an experiment was conducted to examine the influence of image resolution on the performance of CME-CNN-ResNet50, which performs the best on all the examined indexes except FPS in Table 4.
The compression method described in Section 4 was used to compress the images of our dataset to 512 × 512 and 256 × 256. Examples of the compressed images are shown in Figure 22. The compressed 512 × 512 images are regarded as medium-resolution images and the compressed 256 × 256 images as low-resolution images. The comparison results are shown in Table 6. The results show that the lower the resolution of the test images, the higher the FPS of the network. However, the lower resolution of the test images leads to lower values of the examined AP, P, R, AT, and AM. The AP on high-resolution images is 5.1% and 2.2% higher than the numbers on medium- and low-resolution images, respectively. This is mainly due to the information loss on small targets such as insulator defects in lower-resolution images.
5.6. Detection Performance When Using Different ResNet Structures
Besides image resolution, the ResNet structure also has an impact on the detection performance. ResNet-v2 was proposed in [50]. Different from ResNet-v1, ResNet-v2 puts the activation function on the residual branch, which speeds up computation in both forward and backward propagation. In this study, another experiment was conducted to compare the error rates of the model under different ResNet structures. Table 7 shows the error rates when using ResNet50-v1 or ResNet50-v2. The results show that the error rates of Faster R-CNN, ERCN, and CME-CNN with ResNet50-v2 are reduced by 1.7%, 1.9%, and 2.1%, respectively. This indicates that our proposed method can be further improved by using ResNet50-v2 instead.
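The structural difference between the two residual units can be sketched in a few lines. The sketch is a deliberate simplification and not the networks evaluated here: dense matrices stand in for the convolutions, batch normalization is omitted, and the function names are hypothetical.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


def residual_unit_v1(x, w1, w2):
    """v1 style: the activation is applied after the shortcut addition."""
    branch = relu(x @ w1) @ w2
    return relu(x + branch)


def residual_unit_v2(x, w1, w2):
    """v2 style: activations sit on the residual branch only,
    so the shortcut carries x through unchanged."""
    branch = relu(x) @ w1
    branch = relu(branch) @ w2
    return x + branch


x = np.array([-1.0, 2.0])
zeros = np.zeros((2, 2))

# With a zero residual branch, v2 is an exact identity mapping, v1 is not.
out_v1 = residual_unit_v1(x, zeros, zeros)
out_v2 = residual_unit_v2(x, zeros, zeros)
```

With zero weights the v2 unit returns the input unchanged, whereas the v1 unit clips the negative entry, which illustrates why the v2 shortcut passes signals through the network without modification.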
5.7. Novelty and Application of our Proposed Methods
Different from the traditional one-stage or two-stage detection methods, our proposed methods use multi-level classification and regression as well as information fusion, and perform detection on the mask image. On the one hand, multi-level classification and regression improves the quality of the prediction box by increasing the GIoU threshold. Our methods also use RoI Align so that the selected RoI receives the correct features without position bias, which leads to better regression and classification results. On the other hand, information fusion uses FPN to combine semantic features of different levels and enhance the extraction of low-level details. When concatenating the low-level and high-level features in U-Net, the SE block is used to assign different weights to each channel, so that different attention is given to low-level and high-level features. Residual blocks are also adopted to make the network learn deeper features. More importantly, the interference of the complex background is eliminated after obtaining the mask image, which makes the detection more targeted and makes the network focus on learning target features. Due to the introduction of depthwise separable convolution, our proposed methods can even be used for real-time detection.
Therefore, our CME-CNN-ResNet50 model can be used for high-precision inspection tasks such as insulator defect detection. In addition, with different backbones such as MobileNetV2 and ResNet50, our model offers a good trade-off between detection accuracy and running speed and can be applied to detection scenarios with different demands.
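The channel weighting performed by an SE (squeeze-and-excitation) block can be sketched as follows. This is a generic illustration of the mechanism, not the exact block used in our network: the layer sizes, the reduction to two hidden units, the random weights, and the function names are hypothetical.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def se_block(features, w_reduce, w_expand):
    """Reweight the channels of a C x H x W feature map.

    w_reduce has shape (C // r, C) and w_expand has shape (C, C // r),
    where r is the reduction ratio.
    """
    squeezed = features.mean(axis=(1, 2))            # squeeze: one value per channel
    hidden = np.maximum(w_reduce @ squeezed, 0.0)    # bottleneck layer with ReLU
    weights = sigmoid(w_expand @ hidden)             # excitation: weights in (0, 1)
    return features * weights[:, None, None], weights


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))
scaled, weights = se_block(features, rng.normal(size=(2, 4)), rng.normal(size=(4, 2)))

# With all-zero layers every channel receives the neutral weight 0.5.
neutral, neutral_weights = se_block(np.ones((4, 2, 2)), np.zeros((2, 4)), np.zeros((4, 2)))
```

Each channel is multiplied by a single learned scalar, so channels carrying low-level and high-level features can be emphasized or suppressed independently after concatenation.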
6. Conclusions
In this paper, two methods (i.e., ERCN and CME-CNN) based on Faster R-CNN are proposed for insulator defect detection in high-resolution aerial images. Our proposed methods with two different backbones (i.e., ResNet50 and MobileNetV2) are compared with five different detection methods (i.e., HOG+SVM, Haar+AdaBoost, Faster R-CNN, YoloV3, and YoloV4). The results show that our proposed CME-CNN-ResNet50 achieves the best average precision (AP) of 88.7%, which is 52.9%, 58.2%, 6.5%, 10.8%, and 1.6% higher than that of HOG+SVM, Haar+AdaBoost, Faster R-CNN-ResNet50, YoloV3, and YoloV4, respectively. The ablation study results show that the mask extraction proposed in CME-CNN increases the AP by 4.2% compared with not using it, and that using both cascade R-CNN and mask extraction increases the AP by 7.1%. The results also show that the performance of our proposed method can be further improved by using ResNet50-v2 instead. However, the proposed methods are based on insulator images captured in the daytime, while detecting insulator defects at night is still a challenging task. Therefore, part of our future work will focus on improving night detection performance by using image enhancement technologies [51]. Moreover, researchers have introduced the concepts of CornerNet [52] and ExtremeNet [53] into target detection in recent years, and satisfactory results have been reported. In our future work, we will also try to combine our proposed method with key point detection to see whether the detection performance of our network can be further improved.