1. Introduction
Remote sensing technology is an indispensable tool for the continuous and rapid acquisition of geometric and physical information about Earth’s surface [1]. With the rapid advancement of remote sensing technologies, optical remote sensing images (RSIs) have become the predominant medium for Earth observation [2]. However, clouds, which cover approximately 55% of Earth’s surface, markedly hinder the interpretation and utility of optical RSIs [3]. Consequently, removing clouds from optical RSIs remains a significant challenge [4].
Traditional cloud removal techniques rely primarily on local correlation among adjacent pixels, predicated on the assumption that cloud-covered areas share similar characteristics with their adjacent cloud-free regions [5]. Accordingly, these techniques use data from cloud-free regions to reconstruct or substitute the cloud-covered areas, employing methods such as interpolation [6], filtering [7], and exemplar-based strategies [8], among others. Zhu et al. [9] introduced a cloud removal approach based on a modified neighborhood similar pixel interpolator (NSPI). Similarly, Siravenha et al. [10] employed a high-boost filter together with homomorphic filtering to eliminate scattered clouds, while He et al. [11] proposed an image completion technique based on the analysis of similar patch offsets. These methods fundamentally assume a resemblance between the features of cloud-covered and cloud-free areas. However, they may introduce discontinuities or artifacts at the boundaries between the newly generated areas and the original regions, leading to unnatural effects in the restored images [12]. Moreover, these approaches often require manual parameter tuning to achieve optimal results, adding complexity and potentially leading to inconsistent performance [13].
The advent of machine learning techniques, including decision trees (DTs) [14], support vector machines (SVMs) [15], random forests (RFs) [16], and others, has significantly mitigated the constraints of conventional cloud removal methods. By leveraging extensive datasets comprising both cloudy and cloud-free imagery, machine learning models can discern intricate patterns and distinctions between these regions. Consequently, they are able to distinguish cloudy from cloud-free areas in images, facilitating effective cloud restoration. Lee et al. [17] introduced the multicategory support vector machine (MSVM) as a promising, efficient algorithm for cloud removal preprocessing. Hu et al. [18] developed a thin cloud removal algorithm for contaminated RSIs that combines the multidirectional dual-tree complex wavelet transform (M-DTCWT) with domain adaptation transfer least square support vector regression (T-LSSVR), preserving ground object details while eliminating thin clouds. Tahsin et al. [19] devised an optical cloud pixel recovery (OCPR) method using an RF trained on multi-parameter hydrological data to restore cloudy pixels in Landsat NDVI imagery. However, machine learning approaches typically do not derive high-level feature representations directly from raw data; instead, they depend on the expertise and prior knowledge of domain experts for feature selection. The effectiveness of image reconstruction is therefore strongly influenced by the quality and selection of these handcrafted features, introducing subjectivity and inherent limitations.
The recent integration of deep learning’s robust nonlinear modeling capabilities, particularly through convolutional neural networks (CNNs), has revolutionized cloud removal. These networks learn feature representations autonomously from raw data, eliminating the need for manual feature extraction or selection; this end-to-end learning paradigm greatly reduces manual intervention in cloud removal pipelines. Zhang et al. [20] introduced DeepGEE-S2CR, a method combining Google Earth Engine (GEE) data with a multi-level feature-connected CNN to efficiently remove clouds from Sentinel-2 imagery, using Sentinel-1 synthetic aperture radar imagery as supplementary data. Meanwhile, Ma et al. [21] proposed the cloud-enhancement GAN (Cloud-EGAN) strategy, which incorporates saliency and high-level feature enhancement modules within a cycle-consistent generative adversarial network (CycleGAN) framework. The widespread adoption of CNNs has substantially improved cloud removal models’ ability to discern the complex features and variations between cloud-covered and cloud-free regions, leading to better reconstruction of cloudy imagery.
In traditional CNN architectures, the model’s effective receptive field is constrained by the network’s depth and the size of its convolutional kernels, limiting its ability to capture comprehensive features and contextual information [22]. This restriction hampers the model’s capacity to process global information in images. To overcome this limitation, attention mechanisms [23] have been integrated into cloud removal models to strengthen their ability to aggregate global information, enabling dynamic focus adjustment across different image regions. Such enhancements allow these models to prioritize the reconstruction of cloudy areas, thereby significantly improving their contribution to image analysis. Xu et al. [24] introduced AMGAN-CR, which leverages attention maps together with attentive recurrent and residual networks, coupled with a reconstruction network, to tackle cloud removal. Wu et al. [25] proposed Cloudformer, a transformer-based model that integrates convolution and self-attention with locally enhanced positional encoding (LePE), handling cloud removal by extracting features at multiple scales and strengthening the positional encoding.
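As a standard, textbook illustration of this constraint (not taken from any cited model), the receptive field of a stack of convolutional layers grows only linearly with depth:

r_l = r_{l-1} + (k_l - 1) \prod_{i=1}^{l-1} s_i, \qquad r_0 = 1,

where r_l is the receptive field after layer l, k_l the kernel size of layer l, and s_i the stride of layer i. Ten stacked 3 × 3 convolutions with stride 1, for example, cover only a 21 × 21 pixel neighborhood, far smaller than a typical remote sensing tile.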
Despite the promising results achieved by existing deep learning cloud removal models, they still face significant limitations. Specifically, these models [26,27] rely primarily on a per-pixel objective during training, which neglects the global coherence between pixels and makes it difficult to integrate the reconstructed regions seamlessly with the surrounding cloud-free areas at the semantic level. To address this challenge, studies such as [28,29] have introduced mask techniques that distinguish cloud-covered areas from cloud-free regions more accurately, thereby guiding the detailed reconstruction of local features and transitions; however, the accuracy of these masks remains a critical factor limiting their performance. Additionally, the attention mechanisms currently employed in cloud removal tasks focus mainly on channel and spatial attention, largely overlooking the crucial role of frequency features in the image [30]. Frequency features, as another important dimension of image information, are essential for capturing both the global structure and local details of an image. Therefore, effectively integrating frequency features into cloud removal models to achieve a unified reconstruction of cloud-covered and cloud-free regions, and thereby enhance the quality and semantic consistency of the reconstructed images, remains a pressing research problem.
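To make the role of frequency information concrete, the following minimal sketch (our own illustration, not drawn from any cited work; the helper name split_frequencies and the cutoff parameter are ours, and a single-channel NumPy image is assumed) separates an image into low- and high-frequency components with the FFT. The low-pass component retains the overall scene content, while the high-pass residual carries edge contours and texture details:

```python
import numpy as np

def split_frequencies(img: np.ndarray, cutoff: float = 0.1):
    """Split a grayscale image into low- and high-frequency components via the FFT.

    `cutoff` is the low-pass radius expressed as a fraction of the image size.
    """
    h, w = img.shape
    spectrum = np.fft.fftshift(np.fft.fft2(img))      # centre the zero-frequency term

    # Circular low-pass mask around the centre of the spectrum
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    low_mask = radius <= cutoff * min(h, w)

    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real   # coarse scene content
    high = img - low                                                 # edges and textures
    return low, high
```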
To address these problems, we introduce a frequency-domain attention mechanism based on the fast Fourier transform (FFT) to enhance the spatial information processing and overall performance of cloud removal models, since frequency information is critical for cloud removal: low frequencies capture the overall content of an image, whereas high frequencies encode edge contours and texture details, making high-frequency information essential for restoring sharp structures. Moreover, precise boundary delineation is vital in cloud removal to accurately identify cloud-covered regions, enabling targeted reconstruction that preserves the integrity of cloud-free areas. We therefore propose a multi-stage reconstruction strategy to refine boundary features and design a collaborative optimization loss function that concentrates on the boundaries of cloud-covered areas while minimizing unnecessary reconstruction in cloud-free zones. The principal contributions are summarized as follows:
(1) We propose a frequency cloud removal module (FCRM) that recovers image details while preserving the original characteristics of non-cloud regions in the frequency domain. The FCRM uses frequency-domain attention to focus on the differences in frequency-domain information between cloudy and cloud-free images and thereby refine the boundary information of the image. In addition, it introduces a non-local attention block to capture local and non-local relationships and to strengthen contextual connections through global dependencies.
(2) We introduce a collaborative optimization loss function, consisting of a Charbonnier loss for global robustness, an edge loss for edge-preserving precision, and an FFT loss for frequency-aware adaptability, which penalizes boundary shifts while ensuring subject consistency and retaining intricate image details and textures (an illustrative sketch of such a composite objective is given after this list).
(3) The multi-stage frequency attention network (MFCRNet) is built on an encoder–decoder architecture specifically designed for reconstructing areas obscured by clouds. FCRM modules in the earlier layers enable meticulous cloud removal from the input images, and to minimize the information loss caused by up-sampling operations, a ResNet variant is applied directly to the input image in the N-th layer.
(4) A series of experiments are conducted on the RICE1, RICE2 [31], and T-Cloud [32] datasets, demonstrating the feasibility and superiority of the proposed method. It outperforms other cloud removal methods in both quantitative and qualitative assessments.
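As a rough illustration of how the three terms in contribution (2) could be combined (a PyTorch-style sketch under common formulations; the exact definitions, edge operator, and weights used in MFCRNet may differ, and the names and the weights lambda_edge and lambda_fft below are placeholders):

```python
import torch
import torch.nn.functional as F

def charbonnier(pred, target, eps=1e-3):
    # Charbonnier (differentiable L1-like) loss for global robustness
    return torch.mean(torch.sqrt((pred - target) ** 2 + eps ** 2))

def edge_loss(pred, target):
    # Edge-preserving term: Charbonnier distance between Laplacian responses
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                       device=pred.device).view(1, 1, 3, 3)
    c = pred.shape[1]
    lap = lap.repeat(c, 1, 1, 1)                      # depthwise kernel per channel
    e_pred = F.conv2d(pred, lap, padding=1, groups=c)
    e_target = F.conv2d(target, lap, padding=1, groups=c)
    return charbonnier(e_pred, e_target)

def fft_loss(pred, target):
    # Frequency-aware term: L1 distance between 2-D Fourier spectra
    return torch.mean(torch.abs(torch.fft.fft2(pred) - torch.fft.fft2(target)))

def collaborative_loss(pred, target, lambda_edge=0.05, lambda_fft=0.01):
    # lambda_edge and lambda_fft are placeholder weights, not the paper's values
    return (charbonnier(pred, target)
            + lambda_edge * edge_loss(pred, target)
            + lambda_fft * fft_loss(pred, target))
```

In a multi-stage network, such an objective would typically be evaluated on the output of each stage and summed, although the exact supervision scheme depends on the architecture.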