4.1. Dataset
In our experiments, the main goal was to classify defective satellite cloud images and to identify the corresponding regions of noise points and lines. However, no publicly available dataset is specifically designed for recognizing defective satellite images. To address this gap, we built a dataset from satellite cloud imagery of the Fengyun-1 satellites (China’s first generation of sun-synchronous meteorological satellites), which can be downloaded from https://satellite.nsmc.org.cn/. The dataset consists of 20,000 images, each with a resolution of 224 × 224 pixels.
First, we split the dataset into training and testing subsets. The training dataset, used for model training, consists of 16,000 images, while the testing dataset, used for method evaluation, contains 4000 images. To ensure the authenticity of model evaluation, the testing dataset uses the original images, with noise manually annotated in the images, including ground truth for both detection and segmentation tasks. For the training dataset, noise is simulated by generating noise points and lines, with noise masks directly used as ground truth, enabling the rapid construction of datasets for detection and segmentation tasks. Additionally, for a more detailed evaluation, we divided the dataset into two subsets, each focusing on a specific aspect of the defects: noise points and noise lines. Both subsets contain an equal number of samples, with 10,000 images in each.
Figure 5 illustrates examples from the noise point dataset, showing satellite cloud images with noise points on the top and normal satellite cloud images on the bottom. The noise points appear as salt and pepper noise, characterized by the sudden appearance of black or white pixels. This type of low-grayscale noise can be caused by various factors such as image sensors, transmission channels, and decoding processes.
Figure 6 presents representative samples from the noise lines dataset, with the upper segment showing satellite cloud images with noise lines and the lower segment showing normal satellite cloud images. Noise lines in satellite images typically appear as black or white rectangular strips, either horizontally or vertically on the sides of the image. These artifacts are commonly caused by interference or faults in the imaging process.
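The simulated training noise described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' generator: the function names, the `density` and `thickness` parameters, and the random placement of the strip are all assumptions. The returned mask plays the role of the ground truth for the segmentation task.

```python
import random


def add_noise_points(image, density=0.02, rng=None):
    """Return (noisy_image, mask) with salt-and-pepper noise.

    image   : list of rows of grayscale values in 0..255
    density : fraction of pixels to corrupt (hypothetical value)
    """
    rng = rng or random.Random()
    noisy = [row[:] for row in image]            # copy, keep the input intact
    mask = [[0] * len(row) for row in image]
    for y, row in enumerate(noisy):
        for x in range(len(row)):
            if rng.random() < density:
                row[x] = rng.choice((0, 255))    # pepper (black) or salt (white)
                mask[y][x] = 1
    return noisy, mask


def add_noise_line(image, thickness=3, rng=None):
    """Return (noisy_image, mask) with one black or white strip.

    The strip is horizontal or vertical with equal probability and spans
    the full image; its position is chosen uniformly (an assumption).
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    noisy = [row[:] for row in image]
    mask = [[0] * width for _ in range(height)]
    value = rng.choice((0, 255))
    horizontal = rng.random() < 0.5
    limit = height if horizontal else width
    start = rng.randrange(0, limit - thickness + 1)
    for y in range(height):
        for x in range(width):
            position = y if horizontal else x
            if start <= position < start + thickness:
                noisy[y][x] = value
                mask[y][x] = 1
    return noisy, mask


# Usage on a flat gray 224 x 224 placeholder image.
rng = random.Random(0)
clean = [[128] * 224 for _ in range(224)]
point_image, point_mask = add_noise_points(clean, density=0.02, rng=rng)
line_image, line_mask = add_noise_line(clean, thickness=3, rng=rng)
```

Because the mask is produced together with the noise, no manual annotation is needed for the training images.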
4.2. Experimental Setup
4.2.1. Implementation Details
Our proposed method is implemented in the PyTorch framework. The experiments are conducted on a workstation with an Intel(R) Core(TM) i9-10920X CPU clocked at 3.50 GHz and two GeForce RTX 3090 24GB TURBO GPUs, which accelerate training. We optimize the model with stochastic gradient descent (SGD) and use a mini-batch size of 256 to balance computational efficiency and model convergence. The initial learning rate is set to and follows a linear decay schedule, progressively decreasing with each epoch until reaching . This decaying learning rate supports stable convergence during training. Cross-entropy is used as the loss function. The model is trained for 200 epochs.
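A linear decay schedule of the kind mentioned above can be sketched in a few lines. The start and end learning rates below are hypothetical placeholders, since the actual values are not available in this text; only the epoch count of 200 comes from the description above. The helper name `linear_decay_lr` is likewise an assumption.

```python
def linear_decay_lr(epoch, total_epochs, lr_start, lr_end):
    """Learning rate for a 0-based epoch index under linear decay."""
    if total_epochs <= 1:
        return lr_start
    # Clamp the epoch so out-of-range indices stay on the schedule ends.
    clamped = min(max(epoch, 0), total_epochs - 1)
    fraction = clamped / (total_epochs - 1)
    return lr_start + fraction * (lr_end - lr_start)


# Hypothetical values: the source does not preserve the actual rates.
LR_START, LR_END, EPOCHS = 0.1, 0.001, 200

schedule = [linear_decay_lr(e, EPOCHS, LR_START, LR_END) for e in range(EPOCHS)]
```

In a PyTorch training loop, the value for the current epoch would typically be written into each optimizer parameter group before that epoch starts.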
Below is a brief introduction to the parameter settings for the comparative methods. Among the classification methods, Logistic Regression uses the ‘lbfgs’ solver with the regularization strength set to 1. K-Nearest Neighbors sets the number of neighbors to 5, uses Minkowski as the distance metric, and uses ‘uniform’ weights. Decision Tree sets the minimum number of samples for splitting to 2 and the minimum number of samples per leaf node to 1. Random Forest sets the number of trees to 100 and uses ‘sqrt’ as the maximum number of features. Multilayer Perceptron uses a continuous learning rate decay and ‘relu’ as the activation function. AdaBoost uses the default decision tree as the base learner, with the learning rate set to 1 and the number of trees set to 50. Support Vector Machine uses the ‘rbf’ kernel function, with the regularization parameter set to 1. For the other deep learning classification algorithms, the parameters remain unchanged; for example, the four stages of the ResNet50 model contain 3, 4, 6, and 3 bottleneck blocks, respectively. For the segmentation algorithms, the parameters are set to default values. For instance, the U-Net network defaults to ResNet34 as the backbone network, sigmoid as the activation function, and ImageNet pretrained weights. The DeepLabV3 network uses ResNet50 as the backbone, with cross-entropy as the default loss function.
In the evaluation of noise image classification, we systematically compare the effectiveness of our proposed approach with a diverse range of ten classical and state-of-the-art techniques for classifying defects in satellite cloud images. This comprehensive analysis includes eight well-established traditional shallow methods: Logistic Regression [29], K-Nearest Neighbors [30], Naive Bayes Classification [31], Decision Tree [32], Random Forest [33], Multilayer Perceptron [34], AdaBoost [35], and Support Vector Machine [36]. Additionally, we incorporate two modern deep learning methods, namely, AlexNet [37] and ResNet50 [38]. In our assessment of noise region segmentation, we compare several prominent image segmentation models: Unet [18], Unet++ [19], DeepLabV3 [20], DeepLabV3+ [21], A2-FPN [23], and MACU-Net [22]. We examine their respective strengths, weaknesses, and performance across different noise types. This analysis aims to provide insights into their applicability and effectiveness in noise region segmentation tasks.
4.2.2. Evaluation Protocol
Five evaluation metrics are commonly used to assess the performance of satellite cloud image defect classification methods and noise region segmentation methods: accuracy, precision, recall, F1 score, and mIoU. The definitions of these metrics are as follows:
Accuracy reflects the overall effectiveness of a classification model. It is the proportion of correctly predicted instances among all instances. A high accuracy indicates that the model makes correct predictions across all classes, while a low accuracy indicates a higher rate of misclassification. Accuracy is defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN denote the numbers of True Positives, True Negatives, False Positives, and False Negatives, respectively.
Precision focuses on the correctness of the positive predictions made by a model. It quantifies the proportion of predicted positive instances that truly belong to the positive class. High precision indicates that when the model predicts a positive instance, the prediction is likely to be correct, which minimizes false positives. Precision is defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where TP and FP denote True Positives and False Positives. Precision is particularly important in situations where false positives carry significant consequences.
Recall measures how effectively a classification model captures all relevant instances of a specific class. It emphasizes the ability of the model to avoid missing positive instances, making it crucial in scenarios where false negatives (missed positives) have significant implications. A high recall indicates a model that is sensitive to the presence of positive instances. Recall is defined as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where FN denotes False Negatives. Recall is important when the cost of missing positive instances is high, and it shows how well the model identifies all relevant instances.
F1 score is a metric that balances precision and recall. It is particularly useful when the class distribution is uneven or when there is a trade-off between false positives and false negatives. The F1 score is the harmonic mean of precision and recall, offering a single value that reflects both the correctness of positive predictions and the model’s ability to capture all relevant instances. The F1 score is defined as follows:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
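As a small worked illustration of the four classification metrics, the sketch below computes them from confusion-matrix counts. The function name and the example counts are illustrative assumptions and are not taken from the experiments; the zero-denominator convention (returning 0.0) is also a choice made here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only: 120 noisy images (90 found, 30 missed)
# and 80 normal images (70 kept, 10 wrongly flagged as noisy).
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=70)
```

With these counts, precision (0.90) is higher than recall (0.75), and the F1 score falls between the two, closer to the lower value.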
mIoU, i.e., mean Intersection over Union, is a widely used evaluation metric in semantic segmentation tasks. It assesses the accuracy of pixel-wise classification by measuring the overlap between the predicted segmentation masks and the ground truth masks for each class. IoU is the ratio of the intersection area between the predicted and ground truth masks to their union area:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where TP is the number of true positive pixels (correctly classified pixels), FP is the number of false positive pixels (incorrectly classified pixels), and FN is the number of false negative pixels (pixels missed by the prediction). mIoU is obtained by summing the IoU of each class and dividing by the total number of classes:
$$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{IoU}_i$$
where N is the total number of classes and IoU_i is the IoU of class i. In semantic segmentation, a higher mIoU indicates better performance, meaning the model delineates the different objects or regions within an image more accurately.
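The IoU and mIoU computation can be illustrated on small label masks. This is a minimal sketch with hypothetical function names; the toy masks are made up for illustration. One convention here is an assumption not stated in the text: a class absent from both the prediction and the ground truth is skipped rather than counted in the average.

```python
def iou_for_class(pred, truth, cls):
    """IoU of one class between two label masks (lists of rows)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == cls and t == cls:
                tp += 1          # pixel correctly assigned to the class
            elif p == cls:
                fp += 1          # predicted as the class, but it is not
            elif t == cls:
                fn += 1          # belongs to the class, but was missed
    union = tp + fp + fn
    return tp / union if union else None   # None: class absent everywhere


def mean_iou(pred, truth, classes):
    """Average IoU over the classes that occur in pred or truth."""
    scores = [iou_for_class(pred, truth, c) for c in classes]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


# Toy 2 x 4 masks: 0 = background, 1 = noise.
truth = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
pred = [[0, 0, 0, 1],
        [0, 0, 1, 1]]

noise_iou = iou_for_class(pred, truth, 1)        # 3 / (3 + 0 + 1)
background_iou = iou_for_class(pred, truth, 0)   # 4 / (4 + 1 + 0)
miou = mean_iou(pred, truth, classes=(0, 1))
```

A single missed noise pixel lowers the IoU of both classes, because it is a false negative for the noise class and a false positive for the background class.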
4.3. Experimental Results
We present a quantitative comparison of our method with existing classification approaches on the noise point dataset and summarize the results in Table 1, which reports the accuracy, precision, recall, and F1 score of all methods on the test dataset. Several observations can be made from Table 1. First, deep learning methods outperform shallow methods, even when the shallow methods use deep features. For example, compared to logistic regression, ResNet50 shows improvements of 5.60%, 7.18%, 11.20%, and 5.99% in accuracy, precision, recall, and F1 score, respectively. Shallow methods learn the feature representation and the classification model independently, which limits their discriminative capacity for satellite cloud images containing noise points. In contrast, deep learning methods use convolutional neural networks to learn feature representation and classification jointly. Second, our method achieves better classification performance than the other methods on the noise point dataset. Compared to ResNet50, our approach improves accuracy, recall, and F1 score by 0.20%, 0.61%, and 0.21%, respectively. Compared to VGG16-AdvNet, it improves accuracy, precision, and recall by 6.8%, 2.08%, and 11.8%, respectively. We attribute this to the transformer network, whose self-attention mechanisms capture long-range dependencies and global contextual information, resulting in more effective feature learning across the entire image than with convolutional networks. Additionally, transformers use various spatial hierarchies, which further contributes to their advantage over convolutional networks. In summary, our method surpasses all baseline methods in noise point image recognition.
Figure 7 illustrates a selection of results from the noise point experiment. The first two images represent remote sensing satellite images with noise points, while the latter two are normal satellite images without noise. We compared the classification predictions of ResNet50, AlexNet, and our proposed model. In all four cases, our model achieved correct predictions, whereas ResNet50 only succeeded with the last image, and AlexNet misclassified all four images. Given the subtlety of the noise features in these images, these results highlight the robustness and superior performance of our model in detecting noise points.
Furthermore, we conduct a quantitative comparison of all methods on the noise line dataset and present the results in Table 2, including the accuracy, precision, recall, and F1 score on the test dataset. Several observations emerge from these findings. First, deep learning methods outperform shallow methods on noise lines, even when the shallow methods use deep features. For example, compared to logistic regression, ResNet50 shows improvements of 0.71%, 1.75%, and 0.73% in accuracy, recall, and F1 score, respectively. The improvement on noise lines is smaller than that on noise points, since the distinct characteristics of noise lines make it easier for shallow methods to perform well. Second, our method achieves better classification performance on noise lines than the other methods. For example, relative to ResNet50, our approach improves accuracy, recall, and F1 score by 0.20%, 0.41%, and 0.20%, respectively. Compared to MobileNetV2-Adv, it improves accuracy, recall, and F1 score by 4.05%, 8.2%, and 4.27%, respectively. We attribute this to the transformer network, which learns features across the entire image more effectively than convolutional networks, especially by capturing long-range dependencies and global contextual information through self-attention mechanisms.
Additionally, the improvement of our method over other methods is smaller for noise lines than for noise points, as the distinctiveness of noise lines allows for easier identification. As shown in Table 3, while the precision of our classification method is slightly lower than that of the best-performing method (though the difference is minimal), its recall significantly outperforms the latter. Our approach employs a two-stage framework. The first stage selects images that contain noise, while the second stage identifies the specific locations of noise points and lines within these images. Existing methods often suffer from low recall when precision is high in the first stage, resulting in the omission of many noisy images. In contrast, our method achieves a better balance between precision and recall, ensuring more comprehensive detection of noise. For example, if a pixel is misclassified as noise during the detection stage, the segmentation stage will not segment it, thereby preventing the error that could arise from instability in a single model. This two-stage design effectively mitigates the impact of minor accuracy losses on the overall performance. Moreover, in the field of meteorological image processing, noisy images can significantly affect subsequent tasks. Therefore, the model must aim to recall all potential noisy images. Our approach emphasizes identifying all noisy images, while the two-stage algorithm helps prevent errors caused by insufficient accuracy, resulting in better defect recognition performance. In summary, our method outperforms all comparison methods in noise line image recognition.
Figure 8 showcases a selection of results from the noise line experiment. The first two images represent remote sensing satellite images with noise lines, while the latter two are normal satellite images without noise lines. We compared the classification predictions of ResNet50, AlexNet, and our proposed model. Among these four images, our model achieved correct predictions across all cases. In contrast, AlexNet correctly classified only the first two images containing noise lines but misclassified the last two, while ResNet50 failed to correctly predict any of the images. These results highlight the superior performance of our model in detecting noise lines. Particularly for noise lines, which exhibit relatively continuous features and are more challenging to discern, our model demonstrates remarkable stability and precision. This performance advantage stems from the ability of our model to leverage self-attention mechanisms, capturing global contextual information and extracting key features associated with noise lines more effectively. In comparison, the predictions of AlexNet and ResNet50 exhibit significant inconsistencies, underscoring their limitations in handling such tasks.
We also experimented with replacing the DeiT model with the ViT [41] model; the results are shown in Table 3. The ViT model performed similarly to the DeiT model, and was even slightly inferior to it across the accuracy metrics. However, the parameter count and FLOPs of the ViT model are 4–5 times those of the DeiT model. In terms of parameter scale and inference time, the ViT model and other more complex Transformer models often fail to meet the efficiency requirements of real-world applications. Although these models may offer high performance in certain scenarios, their high computational cost and long inference time make them less practical for deployment. Therefore, considering both performance and efficiency, we adopted DeiT as the backbone for identifying images that contain noise. The DeiT model delivers high accuracy and also provides faster inference in resource-constrained environments, making it a practical choice for real-world use.
Additionally, we analyzed the FLOPs and parameter counts of the classification methods used, with the results shown in Table 4. This analysis covers only the deep learning models and excludes the traditional machine learning models. The table shows that our model has FLOPs and a parameter count close to the minimum, while achieving the shortest inference time. Our model is therefore lightweight while maintaining high accuracy and stability, making it well suited to resource-constrained environments. Furthermore, the reduced model complexity improves its scalability and deployment flexibility in practical applications with real-time processing demands.
4.4. Ablation Study
To validate the importance of self-attention in our model, we conducted ablation studies. Self-attention is the core component of the transformer architecture; when the self-attention layers are removed, our DeiT-based network essentially degrades to a simple Multilayer Perceptron (MLP). Our experimental results demonstrate that removing self-attention leads to significant performance degradation. The ablation results are shown in Table 5 and Table 6.
Specifically, for the noise line detection task, the complete DeiT architecture with self-attention achieves 99.25% accuracy, 99.40% precision, 99.10% recall, and 99.25% F1-score. In contrast, removing self-attention results in lower performance with 98.25% accuracy, 99.49% precision, 97.00% recall, and 98.23% F1-score. For the noise point detection task, the performance gap is even more pronounced. With self-attention, the model achieves 99.15% accuracy, 99.40% precision, 98.90% recall, and 99.15% F1-score, while without self-attention, these metrics drop to 94.30% accuracy, 97.74% precision, 90.70% recall, and 94.09% F1-score. These results confirm the crucial role of self-attention in capturing both global and local dependencies of noise patterns, particularly for the more challenging noise point detection task.
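As a quick consistency check on the ablation numbers quoted above, the F1 score can be recomputed from the stated precision and recall as their harmonic mean. The sketch below uses only the figures given in the text; the function and variable names are illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as quoted in the text above.
reported = {
    "noise lines, with self-attention": (99.40, 99.10, 99.25),
    "noise lines, without self-attention": (99.49, 97.00, 98.23),
    "noise points, with self-attention": (99.40, 98.90, 99.15),
    "noise points, without self-attention": (97.74, 90.70, 94.09),
}

recomputed = {
    name: f1_from_precision_recall(p, r) for name, (p, r, _) in reported.items()
}

for name, (_, _, f1) in reported.items():
    print(f"{name}: recomputed F1 = {recomputed[name]:.2f}, reported = {f1:.2f}")
```

All four recomputed values agree with the reported F1 scores to two decimal places, and the comparison makes clear that the F1 drop without self-attention is driven mainly by the lower recall.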
4.5. Performance on Different Subclasses
In this study, our primary objective was to evaluate the performance of various methods across two distinct classification tasks: noise point classification and noise line classification. We conducted a comprehensive analysis, utilizing bar charts to assess the performance of each method in every category.
Figure 9, Figure 10 and Figure 11 illustrate the precision, recall, and F1 score of all methods on noise point and normal images, respectively.
The results in Figure 9 indicate that all methods achieved higher precision on images containing noise points than on normal images. Conversely, Figure 10 shows that all methods achieved higher recall on normal images. Together, these findings suggest that, in noise point classification, most methods are conservative in labeling an image as noisy: images flagged as noisy are usually truly noisy, whereas some noisy images are misclassified as normal.
To balance the two metrics, we present the F1 score of all methods in Figure 11; as the harmonic mean of precision and recall, it summarizes both in a single value. The figures indicate that our method balances precision and recall, achieving high values on both. Additionally, our method consistently achieves high F1 scores for both noise point and normal images. In contrast, shallow methods such as KNN often struggle to perform well in both categories simultaneously. In conclusion, our method demonstrates the best overall performance across categories compared with the other methods.
We further analyzed the performance of the methods in classifying noise line images. The precision, recall, and F1 score of all methods on normal and noise line images are shown in Figure 12, Figure 13 and Figure 14, respectively. Three observations can be made from these figures. First, most shallow methods exhibit a gap in precision between the two categories. For example, the precision of the KNN method is 9.97% higher on noise line images than on normal images. In contrast, the ResNet50 method achieves similar precision values of 98.71% and 99.39% in the two categories, indicating that learning feature representations and classification models end-to-end yields more consistent precision. Second, most methods perform well in recall, indicating their effectiveness in recalling satellite cloud images with noise line features. Finally, our method achieves the highest F1 scores on both normal and noise line images. This can be attributed to the fact that noise lines typically cover larger areas, and our method uses the transformer network to learn long-range features, enabling an effective distinction between normal and noise line images. In conclusion, our method outperforms all other methods in classifying noise line images.
4.6. Comparison Results of Noise Region Segmentation
In this paper, we employ a two-level deep defect recognition framework to identify noise points and lines in meteorological satellite images. The framework begins with a transformer-based method for image classification. Subsequently, we use a pseudo-label-based training strategy and popular image segmentation models (e.g., Unet) to detect the regions containing noise points and lines. Combining image classification and image segmentation achieves a balance between processing efficiency and performance: the classification model filters out normal images at a low inference cost, so the segmentation model only needs to process the noisy images, which saves time and resources. The performance comparison of popular methods in identifying noise points and lines in images is shown in Table 7.
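The control flow of the two-level framework can be sketched as follows. This is only an outline of the filtering idea under stated assumptions: `classify` and `segment` are hypothetical callables standing in for the transformer classifier and the segmentation model, and the toy stand-ins below simply treat pure black or white pixels as noise.

```python
def two_stage_recognition(images, classify, segment):
    """Run segmentation only on images the classifier flags as noisy.

    classify(image) -> bool   (True means the image contains noise)
    segment(image)  -> mask   (per-pixel noise mask)
    Returns one entry per image: a mask, or None for images judged normal.
    """
    results = []
    for image in images:
        if classify(image):
            results.append(segment(image))   # stage 2: locate the noise
        else:
            results.append(None)             # normal image: stage 2 skipped
    return results


# Toy stand-ins for the two models (illustrative only).
def toy_classify(image):
    return any(value in (0, 255) for row in image for value in row)


def toy_segment(image):
    return [[1 if value in (0, 255) else 0 for value in row] for row in image]


clean = [[120, 130], [125, 128]]
noisy = [[120, 255], [0, 128]]
masks = two_stage_recognition([clean, noisy], toy_classify, toy_segment)
```

Since the segmentation model is never called on images judged normal, the overall cost depends mainly on the share of noisy images in the stream.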
Table 7 shows the performance of various methods in segmenting noise points. We observe that DeepLabV3 and DeepLabV3+ perform worse than the other methods. This is mainly due to their reliance on atrous convolutions for extracting contextual information from satellite images: because noise points are very small, these networks struggle to capture their features, leading to inaccurate detection. Conversely, encoder–decoder-based methods, such as Unet, demonstrate higher accuracy in segmenting noise regions. For example, Unet++ achieves an 11.26% improvement over DeepLabV3, which we attribute to the precise feature reconstruction capability of the encoder–decoder architecture. Moreover, the U-shaped architecture commonly adopted by encoder–decoder models incorporates skip connections that preserve fine details and help localize small structures, making it effective in detecting noise points in satellite images. For noise lines, also presented in Table 7, all methods achieve higher accuracy than for noise points. This is because noise lines are typically larger than noise points, which makes them easier for image segmentation models to detect. Overall, our pseudo-label-based training strategy enables the development of effective noise region segmentation models for accurately detecting noise pixels.
We present the segmentation results of various methods in Figure 15 and Figure 16. In Figure 15, which comprises six columns and six rows, the left three columns display the meteorological satellite images, the ground truth, and the predicted results, respectively. The right three columns follow the same layout. We observe that these methods may overlook small noise points, which are difficult to detect. Additionally, the accuracy on the right images is lower than on the left images. This disparity arises because background elements, such as clouds, in the right images can interfere with the model’s ability to detect noise points.
Figure 16 illustrates the visualization of certain methods for detecting noise lines. The left three columns exhibit satellite images, ground truth, and predicted results, while the right columns are similar to the left set. We notice that trained models can identify various types of noise lines, even those with complex structures. Although our created dataset predominantly comprises black or white noise, the model effectively discerns these intricate noise lines. Unlike noise points, detecting noise lines may only capture a portion of the area, leaving gaps in the results. Hence, future research could focus on designing post-processing methods to fully capture noise lines based on their horizontal or vertical characteristics. In summary, these visualizations affirm the efficacy of our training strategy in developing a robust model for segmenting noise regions and identifying defective pixels in satellite images.
Additionally, we analyzed the FLOPs and parameters of the semantic segmentation methods used, with the results presented in Table 8. It can be observed that the model with the largest number of parameters, Unet++, has a parameter count on the order of , while the model with the highest FLOPs, Unet, also has FLOPs on the order of . The models listed in the table were able to complete inference on large datasets within 10 s, indicating high efficiency.