1. Introduction
The quality of images—vital for the correct and comprehensive communication of their content to the outside world, especially in specialized fields such as aerospace, medicine, and electrical power—often plays an important role in their use [1]. High-voltage transmission lines cover large areas in complex and diverse geographical environments (mountains, basins, reservoirs, etc.). Affected by their structure, the natural environment, and other factors, high-voltage transmission lines are highly susceptible to defects such as insulator damage, anti-vibration hammer offsets, lightning rod breaks, spacer rod fractures, and tower rust. These failures can seriously endanger the safe operation of the power system [2].
With the continuous development of domestic UAV technology—particularly breakthroughs in cost and size reduction—UAVs have become widely used in industry [3]. Drones are also of great value for transmission line inspection; for example, they can acquire far more image data simultaneously than a human workforce. However, the images taken by a UAV are affected by geographical factors such as high altitude and windy conditions, which cause jitter and make it challenging to focus the optical sensors carried by the UAV, resulting in blurred images. In addition, the captured images are affected by occlusion, backlighting, and shooting angle, further reducing their quality. If such images are fed directly into transmission line fault detection and inspection models, the training of the models is seriously affected and the final recognition rate is low, posing a significant hidden danger to the safe operation of transmission lines. Therefore, studying the generation and processing of high-quality images of transmission lines is of great practical significance.
In recent years, with the rapid development of artificial intelligence, various deep learning-based generative models have achieved good results at both the theoretical and application levels. Currently, common image generation techniques include the autoregressive model [4], the variational auto-encoder (VAE) [5], flow-based models [6,7,8], and the generative adversarial network (GAN) [9]. Oord introduced the autoregressive model to the image generation task in 2016. The autoregressive model learns and predicts pixel by pixel, which gives image generation low computational parallelism and slow sampling. VAE has made notable achievements in some fields, but it is still limited by the high-dimensional characteristics of the probability density distribution of sample data, making it very challenging to learn a model that fits the data distribution. As one of the deep generative models with excellent development potential, the flow-based model produces images of quality not inferior to that of GAN models, but it is accompanied by complex computational problems when generating high-quality images. In short, the existing issues of these models have, to varying degrees, limited their widespread application in image generation.
GAN has a simple overall structure and can generate a large number of image data samples without constructing complex objective functions. Moreover, GAN continuously improves its network performance through the interplay between the generator and the discriminator, providing practical solutions to the sampling and training problems under high-dimensional probability density distributions, which has made GAN popular among researchers. However, some drawbacks have gradually emerged as GAN has been extended to practical engineering applications. The discriminator only judges whether its input comes from the real samples, without considering the diverse characteristics of the input data. Such behavior can seriously affect the training of the generator, preventing it from truly learning the accurate distribution of the input data and ultimately leading to poor diversity in the generated images. Additionally, classic GANs often struggle to capture image structure, texture, and detail, meaning they cannot directly produce many high-quality (i.e., high-resolution) images of different categories.
In response to these problems, researchers have proposed a large body of work based on improvements to GAN. For instance, the deep convolutional GAN (DCGAN) [10] introduced deep convolutional neural networks to GAN for the first time. The Wasserstein GAN (WGAN) [11] introduced a Lipschitz continuity constraint on the discriminative network. Information-maximizing generative adversarial nets (InfoGAN) [12] improved the model from an information-theoretic perspective. Conditional generative adversarial networks (CGAN) [13] introduced auxiliary conditional variables. In addition, quite a few researchers have used GAN to achieve excellent results in font restoration [14], image conversion [15], high-resolution image semantic segmentation [16], and other tasks.
In the image generation task, the matching-aware discriminator with manifold interpolation (GAN-INT-CLS) [17] can generate images at a resolution of 64 × 64. To further improve the resolution, the text-conditioned auxiliary classifier generative adversarial network (TAC-GAN) [18] applies an auxiliary classifier generative adversarial network (AC-GAN) [19] to the text-to-image (T2I) task. TAC-GAN feeds category labels and text description vectors into the generator as conditional information, while the discriminator distinguishes between real and synthetic images and assigns labels to them. However, the generated image resolution only increases to 128 × 128. Self-attention conditional generative adversarial networks (SA-CGAN) [20] improve the quality of CGAN-generated images by modeling the relationships between image parts; still, the semantic details of the pictures are not well represented when generating images with complex backgrounds. Isola et al., based on the GAN idea, proposed the pixel-to-pixel (Pix2pix) [15] network, which offers good generation quality. However, the scarcity of paired datasets leaves the Pix2pix model, as a supervised style-transfer network, with poor generalization and few application scenarios.
To solve these problems, this work proposes a critical-component image generation network for high-voltage transmission lines based on an improved generative adversarial network. The proposed model can be flexibly applied to the image generation task for critical components of high-voltage transmission lines in complex backgrounds, as well as to some other application scenarios. The CFM-GAN model consists of two generators and several discriminators. As shown in Figure 1, the model produces images in two stages. First, the global generator extracts high-level abstract semantic features, such as skeleton and texture, to obtain a low-resolution (LR) image. Second, the local generator extracts underlying basic features, such as image resolution and degree of fineness, to obtain a high-resolution (HR) image.
The main distinctions between CFM-GAN and conventional models are as follows. First, CFM-GAN uses a multilevel generator, which on the one hand ensures that CFM-GAN produces high-resolution, realistic images and on the other hand extends its applicability to areas such as image reconstruction, image synthesis, and image translation. Second, we use Monte Carlo search (MCS) [21] to sample the target image generated by the global generator several times and compute appropriate penalty values to obtain richer semantic information about the image. The penalty values then guide the local generator to produce a semantically richer target image, effectively preventing the mode collapse problem [22]. Third, after optimizing the generator structure, the discriminator needs a correspondingly improved adversarial capability. This paper replaces the traditional single-layer network with a multi-scale discriminator network of three identically designed layers, based on a parameter-sharing strategy [23] and the feature pyramid network (FPN) [24]. The three structurally identical discriminators are responsible for different levels of abstract semantics and jointly improve the discriminative ability. These three modifications enable the network to attend to the high-level semantics of the generated image while also considering its texture details, and they spare the model from complex loss-function design and post-processing while generating high-resolution images with rich semantic information. In addition, to address the small size of public datasets and the simple image backgrounds for critical components of high-voltage transmission lines, we constructed the KCIGD dataset from aerial images of crucial parts of high-voltage transmission lines taken by UAVs.
We compare and analyze CFM-GAN against mainstream image generation models and demonstrate that it can produce high-resolution, high-fidelity images. The main contributions of this work are listed below.
In the current work, we propose an improved generative adversarial network model (CFM-GAN), consisting of two generators and several discriminators, to generate images containing key components of high-voltage transmission lines. The model adopts a multilevel generator combining a global generator and a local generator, ensuring that CFM-GAN can produce high-resolution images with rich details and textures while broadening the model's application scenarios.
To effectively search the intermediate generation states of the generated image, we introduce a Monte Carlo search-based penalty mechanism into the generator to score these intermediate states. This mechanism guides the generator to acquire richer semantic information and enables the generated images to include more semantic details.
Based on a parameter-sharing mechanism, this paper proposes a multi-scale discriminator architecture. This structure determines whether the input picture is original or generated by using information from different levels of abstraction.
Considering that there is currently no complete, publicly available dataset of critical transmission line components, we establish a dataset named KCIGD for generating high-quality images of essential components of high-voltage transmission lines. In addition, we conducted comparative experiments between the CFM-GAN model and current mainstream models on the KCIGD dataset, demonstrating the effectiveness and extensibility of CFM-GAN.
The rest of this article is arranged as follows. Section 2 reviews related work on image generation. Section 3 introduces the GAN. The CFM-GAN model is elaborated in Section 4. Section 5 provides an experimental comparison to validate the model's effectiveness. The final section summarizes the paper and outlines our future research plans.
2. Related Work
Various deep learning-based generative models aim to produce image samples that the naked eye cannot distinguish from real ones. Development trends indicate that autoregressive, VAE, flow-based, and GAN models continue to develop and mature.
The autoregressive model was introduced into the image generation task by Oord in 2016. On this basis, Oord proposed the pixel recurrent neural network (Pixel RNN) [25] and the pixel convolutional neural network (Pixel CNN) [26]. Pixel RNN scans each pixel in the image to obtain the corresponding pixel values and then uses these values to predict the distribution of each pixel value in the image, thus achieving image generation. Pixel CNN captures the dependency distribution between image pixels through its network parameters, generates the pixels in turn along the two spatial dimensions, and uses masked convolution layers to learn the distribution of all pixels in an image in parallel, finally generating the image. Compared with Pixel RNN, Pixel CNN effectively reduces the number of network parameters and the computational complexity. Still, because it limits the size of the receptive field during image generation, the quality of images generated by Pixel CNN is not satisfactory. Parmar et al. [27] significantly broadened the scope of image generation by introducing the self-attention mechanism into Pixel CNN, considerably improving image quality. The sub-pixel network proposed by Menick [28] improves the resolution of generated images by changing the basic generation unit from pixels to image blocks. Chen et al. [29] addressed the short dependence range of recurrent neural networks and effectively improved the quality of generated images by introducing causal convolutions into the self-attention mechanism.
VAE learns the probability distribution of data mainly by the maximum likelihood criterion. Gregor et al. [30] combined a spatial attention mechanism with a sequential VAE to propose the deep recurrent attentive writer (DRAW) model, enhancing the quality of the resulting images. Wu et al. [31] integrated a multiscale residual module into an adversarial VAE model, effectively improving image generation capability. Parmar et al. [32] proposed a generative autoencoder called the dual contradistinctive generative autoencoder (DC-VAE). DC-VAE can synthesize images at different resolutions, and its dual contradistinctive loss function makes the generated images significantly better both qualitatively and quantitatively. Hou et al. [33] proposed an improved variational autoencoder that is forced to generate clear and realistic images through the deep feature consistency principle and the generative adversarial training mechanism. The introspective variational autoencoder (IntroVAE) [34] trains the VAE adversarially to distinguish original samples from generated images and presents excellent image generation ability; to ensure training stability, it also adopts hinge-loss terms for the generated samples. The Joint-VAE framework [35], a β-VAE over a joint distribution of continuous and discrete latent variables, can jointly learn the generative distributions of continuous and discrete latent variables based on supervised categorical labeling and, finally, categorize the generated images effectively. Moreover, since the encoder can effectively predict the labels, the framework can still perform conditional generation using labels even without a classifier.
The main ideas of generative flow-based models originally came from NICE [6], RealNVP [7], and GLOW [8]. RealNVP makes multiple improvements on NICE: it replaces the additive coupling layer with an affine coupling layer to improve the fitting ability of the transformation, introduces checkerboard-mask convolution to process image data, adopts a multi-scale structure to learn features at different scales, and randomly permutes the channels behind each coupling layer to effectively fuse information between channels and improve the model's performance. Building on RealNVP, the GLOW network further improves the random channel permutation by replacing it with a 1 × 1 convolution layer, effectively integrating information across channels and significantly enhancing the quality of image generation. Compared with GAN and VAE, generative flow-based models can generate higher-resolution images and accurately infer latent variables. In contrast to autoregressive models, the flow model allows parallel computation and efficient data sampling. However, the flow model is also accompanied by complex computational problems when generating high-quality images.
GAN has attracted the attention of many researchers because of its vast application potential. The fusion of GAN and CGAN (Fused-GAN) was presented by Bodla et al. [36]. Fused-GAN generates diverse, highly similar image samples by combining two generators: the first generator unconditionally generates feature maps, and the second uses the feature maps and conditional inputs to generate the final image samples. The authors of [37] constructed a cascaded GAN structure by introducing a Laplacian pyramid, progressively improving the resolution of the generated images by learning the residuals between adjacent layers; however, it still requires a deeper network structure and more computing resources. By adding latent codes and random noise to each network layer, the style-based generator architecture for generative adversarial networks (Style-GAN) of Karras et al. [38] can produce higher-quality images with more apparent details and style diversity than other GAN networks, but only for small image datasets; for complex image datasets (such as ImageNet), the resulting images may show artifacts or texture sticking. Self-attention generative adversarial nets (SAGAN) [20] address poor generation for images of complex structure by introducing a self-attention mechanism into the network for feature extraction and the learning of crucial image regions. The single-natural-image generative model (Sin-GAN) [39] performs multi-granularity learning by down-sampling a single image to different sizes, enabling the model to generate various high-quality images at arbitrary sizes.
Although previous studies have achieved good experimental results on different image generation tasks, the unique geographical locations of transmission lines make the image backgrounds obtained from UAV aerial photography extremely complex. This complex environment leads to poor compatibility between the above image generation models and images with transmission-line backgrounds. After all, not all scenes provide enough semantic information for model evaluation, and not all models can be matched to UAVs and similar equipment. As a result, the above generation models still have difficulty balancing the quality of the generated images against the generation speed. The CFM-GAN model proposed in this paper is also inspired by [40], which progressively generates transmission line images based on GAN networks. Our model further alleviates the low resolution and poor semantic detail of generated images by introducing a penalty mechanism that imposes high-resolution constraints and guides the generator with semantic information.
3. Basic Knowledge of GAN
The design of GAN is heavily influenced by two-person zero-sum game theory; the network consists of a generator model and a discriminator model. The generator generates image data that are as realistic as possible from a noise vector, while the discriminator distinguishes the generated data from the real image data as accurately as possible. The two compete and progress in this confrontation until they reach the optimal state of Nash equilibrium [41]. The objective loss function for generative adversarial network training is:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

where $G$ denotes the generative network, $D$ represents the discriminative network, $x$ indicates the sample data, and $z$ represents random noise. $p_{data}(x)$ and $p_z(z)$ denote the probability distribution of the sample data and the distribution of the noise from which the generated data are produced, respectively.
Furthermore, this paper considers that the GAN loss function encourages the generation of images with more color information, while traditional loss functions (such as the $L_1$ or $L_2$ distance between the original image and the resulting image) encourage the generation of images with more detailed information, such as image edges. The CFM-GAN model therefore combines the GAN loss function with a traditional loss function. It has been shown that the $L_1$ distance is more effective than the $L_2$ distance in reducing the blurring of the generated image [15]. Hence, we incorporate the $L_1$ distance into the CFM-GAN network model. The $L_1$ distance equation is as follows:

$$\mathcal{L}_{L_1}(G) = \mathbb{E}_{x, y, z}\left[\left\| y - G(x, z) \right\|_1\right]$$

The final loss function for CFM-GAN is expressed as:

$$\mathcal{L} = \mathcal{L}_{GAN}(G, D) + \lambda \mathcal{L}_{L_1}(G)$$

where $\lambda$ represents the weight assigned to the $L_1$ term.
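For illustration, a minimal PyTorch sketch of this combined objective is given below. The function names are ours, `discriminator` is assumed to output real/fake logits, and the $L_1$ weight is an illustrative value rather than the paper's tuned setting.

```python
import torch
import torch.nn as nn

# Minimal sketch of the combined GAN + L1 objective described above.
bce = nn.BCEWithLogitsLoss()  # adversarial term (discriminator outputs logits)
l1 = nn.L1Loss()              # reconstruction term
lambda_l1 = 10.0              # weight of the L1 term (illustrative value)

def generator_loss(discriminator, fake_img, real_img):
    # Adversarial part: the generator wants D(fake) to be judged real (label 1).
    pred_fake = discriminator(fake_img)
    adv = bce(pred_fake, torch.ones_like(pred_fake))
    # L1 part: pull the generated image toward the ground-truth image.
    rec = l1(fake_img, real_img)
    return adv + lambda_l1 * rec

def discriminator_loss(discriminator, fake_img, real_img):
    pred_real = discriminator(real_img)
    pred_fake = discriminator(fake_img.detach())  # do not backprop into G
    return 0.5 * (bce(pred_real, torch.ones_like(pred_real)) +
                  bce(pred_fake, torch.zeros_like(pred_fake)))
```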
5. Experiments and Analysis
Firstly, we need an image database of critical components of high-voltage transmission lines to help us analyze the performance of the CFM-GAN model. The only public dataset for critical components of high-voltage transmission lines is CPLID [45]. It is worth noting that the images in CPLID are obtained mainly by cropping, flipping, and stitching against the image background; thus, the backgrounds are simple and do not fully reflect the application of an image generation model in real scenarios. Therefore, this paper uses aerial video of a 500 kV high-voltage transmission line captured by a UAV in China as the data source and builds a dataset for generating images of key transmission line components. Transmission lines are generally located at high altitudes and are widely distributed across mountains, forests, rivers, lakes, fields, hills, and other areas; the resulting images therefore have complex backgrounds and contain more semantic detail, allowing the model's performance to be evaluated more effectively. Both the homemade dataset and the public CPLID dataset contain vital transmission line components such as insulators, anti-vibration hammers, spacer bars, lightning rods, and towers. We named the original image dataset KCIGD; its training and test sets contain 4200 and 700 original images, respectively.
The training in this study consists of two main stages: freezing and thawing. Because the image generation model is pre-trained, its parameters already provide a useful prior. During model training, to prevent the generator network parameters from drifting aimlessly, we first freeze and then thaw them. In the freezing stage, forward propagation is computed quickly using the pre-trained parameters. After the generator model is roughly trained, we thaw the generator parameters. During the thawing stage, the network guides the generator's generation direction according to the discriminator's scores, and the generator adjusts the optimization direction of its parameters accordingly. The network is thus slowly trained to generate images with as much semantic detail and as high a resolution as possible. All parameters of the generative adversarial network were optimized with the Adam optimizer. We trained for 200 epochs, keeping the learning rate constant at 0.0002 for the first 100 epochs and decaying it linearly to zero over the following 100. The initial weights were drawn from a Gaussian distribution with mean 0 and standard deviation 0.02. An NVIDIA RTX 2080 GPU with 8 GB of memory was used for training and testing; the operating system was Ubuntu 18.04 with 32 GB of memory, and all algorithms were built on PyTorch 1.4. The number of Monte Carlo searches in this study was 5, and the weight of the loss controlling the discriminator feature matching was 10.
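The optimizer and schedule described here can be sketched in PyTorch as follows. `G` is a stand-in generator, and the Adam betas are a common GAN default that the text above does not specify.

```python
import torch

# Sketch of the optimizer, learning-rate schedule, and freeze/thaw steps.
G = torch.nn.Sequential(torch.nn.Conv2d(3, 3, 3, padding=1))  # placeholder generator

def init_weights(m):
    # Gaussian initialization: mean 0, standard deviation 0.02.
    if isinstance(m, (torch.nn.Conv2d, torch.nn.ConvTranspose2d)):
        torch.nn.init.normal_(m.weight, mean=0.0, std=0.02)
G.apply(init_weights)

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

# Constant LR for the first 100 epochs, then linear decay to zero.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt_g, lambda epoch: 1.0 if epoch < 100 else (200 - epoch) / 100.0)

# Freezing stage: lock the pre-trained parameters for fast forward passes...
for p in G.parameters():
    p.requires_grad = False
# ...thawing stage: release them once the generator is roughly trained.
for p in G.parameters():
    p.requires_grad = True
```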
5.1. The Baselines
In this paper, to facilitate testing the performance of the CFM-GAN model, we compare and analyze CFM-GAN with the current mainstream image generation models.
VAE [46]: A popular generative model consisting of a differentiable encoder and a decoder. VAE trains the two networks by optimizing a variational bound on the log-likelihood of the data. Because of the added noise and the use of inappropriate element-wise distance measures (such as the squared error), the samples generated by VAE suffer from blurriness.
Cascaded refinement network (CRN) [47]: Unlike GAN-based methods for image generation, CRN does not rely on adversarial training. The CRN model uses an end-to-end convolutional network to generate images corresponding to input pixel-level semantic layout images. CRN creatively computes matching losses between images, i.e., different losses between the generated images and the semantically segmented images.
The combination of variational auto-encoders and generative adversarial nets (VAEGAN) [48]: In the VAE model, the objective for network optimization is the Euclidean distance between the initial image and the image reconstructed by the decoder. However, this loss value is not precisely inversely proportional to image quality, thus the decoded image is also delivered to a discriminator that assesses its generation quality, thereby using GAN to enhance the image generation of VAE.
Pix2pix [15]: The Pix2pix model is based on an adversarial loss. In short, the network learns a mapping from pixels x to pixels y and has achieved good results in tasks such as image translation and pixel transfer.
InsulatorGAN [49]: In this model, the insulator bounding box constrains the image generation process through a coarse-to-fine granularity model, which can detect insulator segments as accurately as possible.
5.2. Quantitative Evaluation
In this section, the quality of the resulting images is first evaluated using the inception score (IS) and Fréchet inception distance (FID), evaluation metrics specific to generative adversarial networks. Then, the pixel-level evaluation metrics [50] of structural similarity (SSIM), peak signal-to-noise ratio (PSNR), and sharpness difference (SD) are used to determine the degree of resemblance between the original and generated images.
5.2.1. Inception Score (IS) and Fréchet’s Inception Distance (FID)
The IS index calculates the distance between two probability distributions via the KL divergence, which reflects, to a certain extent, how well the probability distribution of the generated images fits that of the actual images. The higher the IS score, the clearer the generated images and the better their overall diversity. The IS is calculated as below.
$$\mathrm{IS} = \exp\left(\mathbb{E}_{x \sim p_g}\, D_{KL}\left(p(y \mid x) \,\|\, p(y)\right)\right)$$

where image $x$ is sampled from the distribution $p_g$ of the generated data, $p(y \mid x)$ is the conditional label distribution predicted by the classification network, $p(y)$ is the marginal label distribution, and $D_{KL}$ denotes the KL divergence, i.e., the relative entropy.
Since the IS score relies on an InceptionV3 network trained on the ImageNet dataset [51], and the transmission line components in the public CPLID dataset and our KCIGD dataset do not appear in ImageNet, we instead scored the CFM-GAN model with an AlexNet-based network.
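For reference, a minimal sketch of the IS computation from classifier softmax outputs is given below; the scorer network itself would be the AlexNet-based classifier mentioned above, and the function name is ours.

```python
import numpy as np

# Inception score from classifier softmax outputs `probs` (n_samples x n_classes).
def inception_score(probs, eps=1e-12):
    p_y = probs.mean(axis=0, keepdims=True)  # marginal label distribution p(y)
    # Per-sample KL( p(y|x) || p(y) ), then exponentiate the mean.
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```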
Fréchet inception distance (FID) calculates the Fréchet distance between the real and generated samples in a Gaussian feature-space distribution. A smaller FID score indicates that the generated images are closer to the original images, i.e., more realistic. The FID is calculated as follows.
$$\mathrm{FID} = \left\| \mu_r - \mu_g \right\|^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

where $\mu_r$ and $\mu_g$ represent the feature means of the original and generated images, $\Sigma_r$ and $\Sigma_g$ are the covariance matrices of the original and generated image feature vectors, and $\mathrm{Tr}(\cdot)$ is the trace operation from linear algebra. This indicator is also normally computed with the help of the InceptionV3 network, but, unlike IS, FID uses the Inception network only as a feature extractor. We likewise use AlexNet to extract features, map the feature map to a 1 × 1 × 4096 vector through the fully connected layer, and finally obtain a 4096 × 4096 covariance matrix.
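For illustration, a minimal sketch of this computation is given below, assuming NumPy/SciPy and a torchvision AlexNet truncated at its 4096-dimensional penultimate fully connected layer, as described above; the exact truncation point and helper names are our assumptions.

```python
import numpy as np
import torch
import torchvision
from scipy import linalg

# Feature extractor: AlexNet up to the 4096-d penultimate FC layer (illustrative).
alexnet = torchvision.models.alexnet(pretrained=True).eval()
feature_net = torch.nn.Sequential(
    alexnet.features, alexnet.avgpool, torch.nn.Flatten(),
    *list(alexnet.classifier.children())[:-2])

def fid_from_features(feats_real, feats_fake):
    """FID between two feature matrices of shape (n_samples, dim)."""
    mu_r, mu_g = feats_real.mean(0), feats_fake.mean(0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop numerical-noise imaginary parts
    diff = mu_r - mu_g
    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)
```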
Table 4 lists the experimental results of the comparative analysis of CFM-GAN and current mainstream models under the IS and FID metrics. The experiments indicate a higher score for the CFM-GAN model than for the other architectures, i.e., CFM-GAN slightly outperforms the alternatives in the quality and diversity of the resulting images.
5.2.2. Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Sharpness Difference (SD)
Peak signal-to-noise ratio (PSNR) evaluates an image by measuring the error between corresponding pixels of the original and generated images; its unit is dB. The greater the PSNR value, the less distorted the generated image and the better the result.
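For reference, the standard definition is

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)$$

where MAX is the maximum possible pixel value (255 for 8-bit images) and MSE is the mean squared error between the two images.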
SSIM measures the similarity between images from three aspects: brightness, contrast, and structure. The closer the value of SSIM is to 1, the more similar the processed image structure is to the original image, i.e., the better the resulting image is.
$$\mathrm{SSIM}(x, y) = \frac{\left(2\mu_x \mu_y + c_1\right)\left(2\sigma_{xy} + c_2\right)}{\left(\mu_x^2 + \mu_y^2 + c_1\right)\left(\sigma_x^2 + \sigma_y^2 + c_2\right)}$$

where $\mu_x$ and $\mu_y$ represent the means of the images $x$ and $y$, respectively; $\sigma_x$ and $\sigma_y$ represent the standard deviations of images $x$ and $y$; $\sigma_{xy}$ denotes the covariance of images $x$ and $y$; and $c_1$ and $c_2$ are constants.
This study calculates the sharpness loss between the generated and original images, drawing on the concept of gradient difference to measure the sharpness of the generated image:

$$\mathrm{SD} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\frac{1}{N}\sum_{i,j}\left|\left(\nabla_i y + \nabla_j y\right) - \left(\nabla_i \hat{y} + \nabla_j \hat{y}\right)\right|}\right)$$

where $\nabla_i$ and $\nabla_j$ denote the absolute pixel differences along the two spatial dimensions, $y$ and $\hat{y}$ are the original and generated images, $N$ is the number of pixels, and MAX is the maximum possible pixel value.
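As a usage sketch, the pixel-level metrics above can be computed with scikit-image (assuming a recent version where `structural_similarity` accepts `channel_axis`):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def pixel_metrics(original, generated):
    """PSNR and SSIM between two uint8 RGB images of equal size."""
    psnr = peak_signal_noise_ratio(original, generated, data_range=255)
    ssim = structural_similarity(original, generated,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim
```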
As seen in Table 4, the CFM-GAN score is higher than that of the other mainstream models. This result indicates that the CFM-GAN model can generate high-resolution images of crucial components of high-voltage transmission lines and can be applied to scenes with complex image backgrounds, such as mountains, lakes, and forests. The image quality generated by our model is also better than that of InsulatorGAN. Analyzing the reasons: on the one hand, MoCo plays a positive role in image feature extraction; on the other hand, InsulatorGAN uses its penalty mechanism only to constrain the generation of insulator boxes, whereas the penalty mechanism proposed in this study can mine the hidden spatial semantic information inside the images and guide the local generator to produce a more realistic image, making the final generated image more similar to the original. In addition, given the high numerical similarity between the generated and authentic images, the generated images can also be used to augment the image data of critical components of high-voltage transmission lines.
To fully evaluate the behavior of the models, we also performed speed tests on the different models. As shown in Table 5, the tested speed of the CFM-GAN model is lower than that of the current mainstream models. This is because the CFM-GAN model combines global and local generators, and the Monte Carlo search sampling process also takes significant time. However, a rate of 62 FPS is sufficient for real-world applications.
5.3. Generated Image Visualization
Figure 5 shows the experimental results. The first and second rows show the original images and the marked regions of the images, respectively. The third row shows the images generated according to the labeled regions of the tagged images. The results show that our model can naturally reconstruct images of the critical components of the transmission line, and the generated images contain the features of the known images, demonstrating a good generation effect.
The qualitative experimental results are presented in Figures 6 and 7. The resolution of the test images is 512 × 512. The images generated by CFM-GAN are clearer than the original images, and the information on the key components and background of the high-voltage transmission line in the generated images is richer. Mainstream models are prone to problems such as blurring and distortion when generating images. In contrast, the CFM-GAN model can generate high-resolution images with rich semantic detail even under complex image backgrounds, and the resulting images are highly similar to the originals.
5.4. Sensitivity Analysis
In this section, to determine the effects of the various components on the CFM-GAN model, we perform sensitivity analyses on the multi-level generator, the number of Monte Carlo searches introduced, the multi-scale discriminator, the number of iterations, and the minimum training dataset.
5.4.1. Two-Stage Generation
To verify the effect of the two-stage generator on the CFM-GAN model, we performed an experimental analysis on the KCIGD dataset with different numbers of generator stages. As shown in Table 6, a multi-stage generator produces images richer in semantic detail and of higher resolution than a single-stage generator. The effect improves because the multiple structures work independently: we fuse the image details that the local generator excels at with the high-level abstract semantic information extracted by the global generator, so that the multiple generators retain both the semantic and the texture information of the final generated image. Therefore, the multi-stage structure outperforms the standard single-layer global generator. In other words, the global/local distinction is not essential when using multi-stage generators; each additional generator simply enriches the image details the model can mine. However, considering the speed of image generation, the two-stage model performs best.
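To make the coarse-to-fine pipeline concrete, the following is a minimal PyTorch sketch of two-stage generation matching the description above; the layer configurations are illustrative stand-ins, not the actual CFM-GAN architecture.

```python
import torch
import torch.nn as nn

class GlobalGenerator(nn.Module):
    """Extracts high-level semantics and outputs a low-resolution image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 7, padding=3), nn.Tanh())
    def forward(self, x_lr):
        return self.net(x_lr)

class LocalGenerator(nn.Module):
    """Refines the upsampled coarse output into a high-resolution image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())
    def forward(self, x_hr, lr_up):
        # Fuse the HR input with the upsampled coarse result.
        return self.net(torch.cat([x_hr, lr_up], dim=1))

x_hr = torch.randn(1, 3, 512, 512)        # HR conditional input
x_lr = nn.functional.avg_pool2d(x_hr, 2)  # 256x256 coarse input
lr_img = GlobalGenerator()(x_lr)          # stage 1: LR image
lr_up = nn.functional.interpolate(lr_img, scale_factor=2)
hr_img = LocalGenerator()(x_hr, lr_up)    # stage 2: HR image
```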
5.4.2. Monte Carlo Search
This section examines the impact of the number of Monte Carlo searches on CFM-GAN's performance by comparing the image generation results for crucial power line components under various search counts. As shown in Table 7, introducing even a small number of Monte Carlo searches yields the most evident improvement in generation quality. As the number of searches increases further, the results improve slowly while the time cost rises rapidly. Weighing generation quality against time cost, the best balance is achieved at N = 5, i.e., when the number of Monte Carlo searches is 5.
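The following is a hypothetical PyTorch sketch of how such a Monte Carlo search penalty could be computed: the intermediate (coarse) result is completed N times with random noise, each rollout is scored by the discriminator, and the averaged score becomes the penalty. All names and the exact scoring rule are our assumptions for illustration.

```python
import torch

def mcs_penalty(local_G, D, coarse_img, n_search=5):
    scores = []
    for _ in range(n_search):
        noise = torch.randn_like(coarse_img)  # random completion seed
        sample = local_G(coarse_img, noise)   # one rollout of the HR image
        scores.append(D(sample).mean())       # discriminator realism score
    # Low average realism -> large penalty for the intermediate state.
    return 1.0 - torch.stack(scores).mean()
```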
5.4.3. Multi-Level Discriminator
To validate the effect of discriminators with multiple input scales on model performance, we compare the experimental results of introducing different numbers of discriminators on the KCIGD dataset. After obtaining a feature map of the same size as the original image (512 × 512), the feature maps are down-sampled by 2× (256 × 256), 4× (128 × 128), and 8× (64 × 64), and the feature maps of the different scales are fed into discriminators of the corresponding sizes. The single-stage discriminator takes only the original image as input; the two-stage discriminator takes the original image and the 2× down-sampled image; the three-stage discriminator takes the original image, the 2× down-sampled image, and the 4× down-sampled image; and the four-stage model additionally takes the 8× down-sampled result.
Table 8 shows that four discriminators slightly improve the pixel-accuracy metric over three but reduce the speed by 5 FPS, which is not worthwhile. Therefore, the model achieves its optimal balance with three discriminators. Compared with the original single-discriminator method, the effective improvement comes from the three-layer structure using the same convolution layers for feature extraction while operating simultaneously on input feature maps of different scales. The discriminator with the high input scale focuses on the details and texture information of the image, whereas the discriminator with the low input scale focuses on the high-level abstract semantic information. The three work together to judge the authenticity of the input image, finally urging the generator to generate more realistic critical-component images with more evident semantics.
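A minimal PyTorch sketch of this three-scale arrangement is shown below; the convolutional stacks are illustrative placeholders for the identically structured discriminators.

```python
import torch
import torch.nn as nn

def make_discriminator():
    # All scales share an identical design, as described above.
    return nn.Sequential(
        nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
        nn.Conv2d(128, 1, 4, padding=1))  # patch-level real/fake scores

class MultiScaleDiscriminator(nn.Module):
    def __init__(self, n_scales=3):
        super().__init__()
        self.discs = nn.ModuleList([make_discriminator() for _ in range(n_scales)])
        self.down = nn.AvgPool2d(3, stride=2, padding=1)  # halves the resolution

    def forward(self, img):
        outputs = []
        for d in self.discs:
            outputs.append(d(img))  # one verdict per scale
            img = self.down(img)    # feed the next discriminator a coarser view
        return outputs

scores = MultiScaleDiscriminator()(torch.randn(1, 3, 512, 512))
```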
5.4.4. Number of Epochs
In an image generation model, the choice of the iteration parameter is crucial. If the number of iterations is too low, the model cannot accurately reflect the actual distribution of the image sample space; conversely, if it is too high, the model overfits, harming its generalization ability. Therefore, we compared different iteration counts for the CFM-GAN model on the KCIGD dataset to obtain better model performance. As shown in Table 9, the generated images improve as the number of iterations increases. However, beyond 200 iterations, CFM-GAN's generation quality drops slightly due to overfitting. In conclusion, the model reaches its best balance at 200 iterations.
5.4.5. Minimum Training Data Experiment
To verify the impact of the KCIGD dataset size on the model's generalization performance, the CFM-GAN model was trained on sample sets of different sizes for comparative analysis. As shown in Table 10, the performance of the CFM-GAN model does not decrease significantly as the training set shrinks. This indicates that the CFM-GAN model is robust and can still extract critical information from images even on small datasets, overcoming to a certain extent the weak generalization ability of previous models.
5.5. Ablation Analysis
We performed ablation analyses on the CFM-GAN model to verify the effect of its various components. As indicated by Table 11, because the MoCo model plays an active role in the feature extraction of the global generator, the metrics of model B are better than those of model A. Model C has significantly better metrics than B, demonstrating that the two-stage generation model combining global and local generators can enhance the sharpness of the generated images. Model D introduces the multiscale discriminator, which drives the generator to produce more realistic transmission line component images and enhances the model's stability. The scores of model E show that the penalty mechanism significantly improves CFM-GAN's performance, mainly because it imposes sufficient semantic-information constraints on the intermediate states of the generator, making the resulting images more realistic.
5.6. Computational Complexity
By counting the network training parameters and training times of several mainstream models, we can comprehensively evaluate the models' space and time complexity. Table 12 shows that the training time and the network parameters of CFM-GAN are slightly higher than those of some other mainstream generation models. Still, the performance improvement brought by CFM-GAN is worth the cost.