1. Introduction
The long-term remote sensing data record holds substantial significance in the field of geography. These data can be utilized to monitor continuous changes on the Earth's surface over time, extract temporal characteristics of geographical features, and address various geographical issues, such as trends in vegetation growth, phenological features, and land cover changes [1]. Landsat is one of the most widely used sources of satellite data for time-series analysis and land cover change mapping, offering relatively high spatial resolution and continuous global coverage for more than 50 years [2,3,4]. However, Landsat 7 imagery displays noticeable striping artifacts due to the permanent failure of its Scan Line Corrector (SLC) in May 2003, which leaves approximately 22% of each acquired scene unscanned [5]. This issue compromises the temporal consistency and data quality of Landsat time-series data. To address it, researchers have developed various approaches to reconstruct gap-free images [6,7,8,9,10,11]. Generally, these approaches can be divided into two main categories: single-source and multi-source techniques.
In single-source techniques, the interpolation of missing pixels depends on the data within the SLC-off image itself. Commonly used single-source methods include simple interpolation methods, Kriging-based methods, and segmentation model approaches. Simple interpolation methods (e.g., mean [12], bilinear [13], and bicubic interpolation [14]) are usually employed for narrow strips (1–2 pixels wide), offering rapid computation but limited reconstruction accuracy due to their reliance on adjacent pixels. Kriging-based methods make fuller use of spatial information: ordinary Kriging offers a statistically rigorous estimation of reflectance at unscanned locations [15], and Co-Kriging [16,17] incorporates secondary images to exploit spatial correlation; both, however, are hampered by their complexity and slow computation times [15,16,17,18,19]. The segmentation model approach builds a segment model from an SLC-on image, which is then applied to SLC-off images, using the available data points to fill the gaps [20]. For example, Maxwell [6] introduced a multi-scale segmentation model that is overlaid on an SLC-off image to extract consistent spectral data for filling in missing pixels. Marujo [21] refined this method by integrating linear operations to calculate pixel weights based on Maxwell's model [6], thus enhancing the accuracy of the interpolation, particularly in homogeneous landscapes [5]. Despite the demonstrated efficacy of these single-source techniques, their limited use of additional temporal information makes it difficult to predict data accurately across diverse land-use interfaces [17,19]. This underscores the importance of incorporating multi-temporal imagery into the prediction process to achieve more precise and reliable reconstructions.
In contrast, multi-source techniques estimate missing pixels using reference images from other dates or sensors [9,10,11]. Histogram matching methodologies (e.g., GHM, LLHM, and AWLHM), released by the USGS [5,22,23], perform a linear transformation between the target SLC-off image and a reference SLC-on image to calculate the missing values. While histogram matching achieves satisfactory filling results in homogeneous regions, it may not perform well with poor-quality images or scenes undergoing significant change. Chen et al. [24] proposed the Neighborhood Similar Pixel Interpolator (NSPI), which leverages the similarity information of adjacent pixels from SLC-off data series to estimate missing pixels, enhancing accuracy in such situations. However, due to changes in land cover, NSPI may exhibit inaccuracies in certain scenes and requires longer calculation times. The Geostatistical Neighborhood Similar Pixel Interpolator (GNSPI) [18] enhances NSPI by using both TM and SLC-off images as inputs and incorporating residual distributions of the missing values. This approach can process cloud-contaminated images and reduce edge defects [25]. Its drawbacks are longer computation times and the persistently poor temporal resolution of the filled images.
Besides the methods mentioned above, the Spatiotemporal Fusion (STF) method is currently the most widely applied multi-source technology, capable of effectively improving temporal resolution [8,26,27]. The STF method utilizes data from sensors with high temporal resolution, enabling relatively flexible restoration of missing data in large dynamic areas [28]. Existing STF methods can be classified into five categories [29] according to the specific techniques employed to connect coarse and fine images: weight-based, unmixing-based, learning-based, Bayesian-based, and hybrid methods. Weight function-based methods (such as STARFM [28], ESTARFM [30], STAARCH [31], and Fit-FC [32]), which estimate ideal pixel values by extracting weight functions, still have room for improvement in handling scenes with large changes. Unmixing-based methods (such as MMT [33], STDFA [34], U-STFM [35], and OB-STVIUM [36]) rely on linear spectral mixing theory to estimate high-spatial-resolution pixel values but are constrained by assumptions of surface invariance or linear change. Learning-based methods employ machine learning algorithms to model the relationships between observed coarse and fine image pairs and to predict unobserved fine images, using techniques such as dictionary-pair learning [37,38], machine learning [39], regression trees [40], and neural networks [41,42]. Bayesian-based methods, including STBDF [43] and the Bayesian Maximum Entropy method [44], apply Bayesian parameter estimation for probabilistic image fusion. Hybrid methods (e.g., FSDAF [45], VIPSTF [46], STRUM [47], and USTARFM [48]) integrate two or more techniques from the aforementioned four categories to enhance fusion performance. While these STF methods offer significant improvements in temporal resolution, they predominantly require clean, cloud-free input images for optimal reconstruction. This presents a challenge in cloud-prone regions, where acquiring completely unblemished images within a reasonable timeframe is often unfeasible [49]. The ROBOT algorithm addresses this by leveraging variations in a low-dimensional linear subspace to effectively approximate variations within time-series data [49], allowing the use of partially contaminated image pairs for satisfactory reconstruction. This makes it more suitable for the automated processing of large-scale Earth Observation data. Although STF algorithms can improve the temporal frequency of reconstructed fine-resolution images for any given date, they cannot utilize gap data containing missing pixels to reconstruct complete images. These algorithms therefore require seamless input images, meaning that SLC-off data cannot be effectively processed by fusion algorithms alone. This highlights the need for algorithms capable of reconstructing dense time-series data from SLC-off data.
It is worth mentioning that Liu [50] proposed a new paradigm for remote sensing data processing called the Seamless Data Cube (SDC), specifically designed to tackle missing values and outliers in the data. The SDC framework integrates conventional analysis-ready data processing with proven algorithms for missing-data reconstruction and multi-sensor data fusion [24,30,51], offering a novel approach to reconstructing SLC-off data. As evidenced in Table 1, from November 2011 to February 2013, Landsat 7 SLC-off data were the sole source available, leaving a significant temporal gap devoid of comparable data. Therefore, filling in the missing pixels of SLC-off data is imperative before employing the STF method to reconstruct comprehensive time-series images.
Therefore, this paper proposes an effective method, referred to as Improved ROBOT (IROBOT), to reconstruct dense time-series images from SLC-off data. The IROBOT method effectively addresses missing data both temporally and spatially, significantly enhancing the utility of SLC-off data. This study focuses on image quality restoration, data completion, and spatiotemporal fusion of Landsat 7 SLC-off data. The result is a seamless daily data cube that provides a comprehensive and continuous time series of surface reflectance data. The structure of this paper is organized as follows: Section 2 presents the study region and data used; Section 3 details the IROBOT methodology and experimental design; Section 4 discusses the results and analysis; and the final section provides a brief summary and conclusions.
2. Study Region and Data
Situated in the northwest suburb of Beijing, the study region is located at approximately 40.10° N and 116.33° E, covered by World Reference System 2 Path 123 and Row 32. Within this Path/Row, we selected an intensive study area, which covers 15 km × 15 km (500 × 500 Landsat pixels). The land cover types include buildings, forests, grasslands, farmlands, roads, and rivers. Characterized by a temperate continental climate, the study region shows distinct seasons with significant vegetation changes. To clearly demonstrate the changes in the reconstructed time series, we selected a square (outlined in red in Figure 3c) to display the NDVI series.
For the present work, the Landsat SR dataset and the SDC500 dataset [52] are used to repair missing data and reconstruct dense time-series data. The Landsat SR dataset contains atmospherically corrected surface reflectance derived with the Landsat Ecosystem Disturbance Adaptive Processing System (LEDAPS) algorithm (version 3.4.0). From this dataset, we chose the blue, green, red, and near-infrared bands, each with a 30 m spatial resolution and a 16-day temporal resolution (Landsat SR datasets are available for download at https://earthexplorer.usgs.gov/, accessed on 9 April 2024). The Landsat 7 data used in this paper are clear images with less than 30% cloud cover, with clouds and cloud shadows removed. All data have undergone preprocessing, including atmospheric correction, cloud removal, and cropping.
Figure 1 illustrates the 25 Landsat 7 ETM+ images acquired between 2011 and 2013. The first 10 images were acquired in 2011, the subsequent 6 in 2012, and the final 9 in 2013, with the specific dates detailed in Table 2. All remote sensing images in this article are displayed as true-color composites of the red, green, and blue bands.
The global land surface reflectance Seamless Data Cube at 500 m resolution (SDC500) is also used in this study. Corresponding to MODIS-Terra sensor bands 1 to 7, this dataset reduces noise and fills gaps in the temporal reflectance series of each pixel. From this dataset, we selected the blue, green, red, and near-infrared bands, with a spatial resolution of 500 m and a temporal resolution of 1 day, for the reconstruction process (the SDC500 dataset is available for download at https://data-starcloud.pcl.ac.cn/resource/27/, accessed on 9 April 2024).
Figure 2 illustrates the four images utilized for validation in this study, including Landsat 5 TM and Landsat 8 OLI images. The TM images were captured on 31 January 2011 (Figure 2a) and 7 May 2011 (Figure 2b), while the OLI images date from 1 September 2013 (Figure 2c) and 4 November 2013 (Figure 2d). These images distinctly illustrate significant surface reflectance changes in the region over a considerable time span. The region thus serves as an ideal area for testing algorithms over long temporal spans, as most existing methods face challenges in accurately reconstructing dense fine-resolution data over such extended periods.
3. Methodology
The ROBOT algorithm requires seamless data inputs, necessitating interpolation preprocessing for SLC-off data. To reconstruct daily time-series images at 30 m resolution during the SLC-off period (especially around the year 2012), the IROBOT method integrates the NSPI interpolation method with the ROBOT method.
3.1. The Neighborhood Similar Pixel (NSPI) Interpolation Method
Assuming that pixels of the same land cover class near a data gap have spectral characteristics and temporal change patterns similar to those of the missing target pixels, we can search for similar pixels near these gaps. Given a short time interval between the input and target scenes, we can select similar pixels from the input images and assume that these pixels also share spectral features with the missing target pixels in the target image [24]. Based on spectral similarity, Equation (1) defines the Root Mean Square Deviation (RMSD) between each common pixel and the target pixel, and similar pixels are chosen from among these common pixels.
In Equation (1), $L(x_k, y_k, b)$ represents the value of the $k$-th common pixel in the input image at position $(x_k, y_k)$ in band $b$, $L(x, y, b)$ represents the value in band $b$ at the target pixel $(x, y)$, and $n$ is the number of spectral bands. A higher RMSD indicates greater spectral dissimilarity. Gao [28] utilized the standard deviation of the pixels in the input image and the estimated number of land cover types to determine the threshold. Pixels exhibiting RMSD values below this threshold are considered similar. This study employs built-in functions to automatically extract masks and categorize land cover types into pure pixels and others. If the RMSD of the $k$-th common pixel satisfies Equation (2), it is chosen as a similar pixel.
Here, $\sigma(b)$ represents the standard deviation of band $b$ across the entire input image. The parameters align with those in NSPI [24]: the initial search window is set to 7 × 7 pixels, and the minimum number of similar pixels is 30. When the initial window does not contain the minimum number of similar pixels, the window expands until this minimum is satisfied. To balance computational efficiency, however, the maximum window size is capped at 17 × 17 pixels. If the minimum number of similar pixels is still not met when the window reaches this maximum, all available similar pixels are used.
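The adaptive window search described above can be sketched in Python as follows. This is an illustrative sketch, not the exact NSPI implementation: the combined-RMSD threshold (derived from per-band standard deviations and an assumed number of land cover classes) and the function names are simplifying assumptions.

```python
import numpy as np

def rmsd(img, yx_a, yx_b):
    """Root Mean Square Deviation between two pixels across all bands (Eq. (1))."""
    a = img[:, yx_a[0], yx_a[1]].astype(float)
    b = img[:, yx_b[0], yx_b[1]].astype(float)
    return np.sqrt(np.mean((a - b) ** 2))

def find_similar_pixels(img, valid, target, min_similar=30,
                        init_win=7, max_win=17, n_classes=4):
    """Expand the search window around `target` until enough similar
    common (valid) pixels are found. The threshold follows the NSPI-style
    rule sigma(b) * 2 / n_classes, combined across bands (an assumption)."""
    n_bands = img.shape[0]
    rows, cols = img.shape[1], img.shape[2]
    # Per-band standard deviation over the whole input image (Eq. (2) threshold).
    sigma = img.reshape(n_bands, -1).std(axis=1)
    threshold = np.sqrt(np.mean((sigma * 2.0 / n_classes) ** 2))
    win = init_win
    while True:
        half = win // 2
        r0, r1 = max(0, target[0] - half), min(rows, target[0] + half + 1)
        c0, c1 = max(0, target[1] - half), min(cols, target[1] + half + 1)
        similar = [(r, c) for r in range(r0, r1) for c in range(c0, c1)
                   if valid[r, c] and (r, c) != target
                   and rmsd(img, (r, c), target) <= threshold]
        if len(similar) >= min_similar or win >= max_win:
            return similar  # may hold fewer than min_similar at the max window
        win += 2  # grow the window and retry
```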
Higher spectral similarity and smaller distances should give a similar pixel more weight when predicting the target pixel. The weight $W_j$ determines the contribution of the $j$-th similar pixel to the target pixel's predicted value. Equation (3) calculates the geographic distance $D_j$ between the $j$-th similar pixel $(x_j, y_j)$ and the target pixel $(x, y)$. A comprehensive index $CD_j$, outlined in Equation (4), incorporates both the spectral similarity (Equation (1)) and the geographic distance. Similar pixels with larger $CD_j$ values contribute less to the calculated value of the target pixel. Therefore, we use the normalized reciprocal of $CD_j$ as the weight $W_j$ (as formulated in Equation (5)), where $W_j$ ranges from 0 to 1, ensuring that the cumulative weight of all similar pixels equals 1. Here, $m$ is the number of similar pixels.
Since similar pixels have the same or similar spectral values as the target pixel when observed simultaneously, we can use the information from these similar pixels in the target image to predict the target pixel. A higher weight attributed to a similar pixel indicates greater reliability. Hence, the target pixel is predicted as the weighted average of all similar pixels in the target image, as shown in Equation (6).
In instances where no similar pixel is selected, Inverse Distance Weighting (IDW) interpolation is employed to predict the target pixel’s value. This method gathers pixel values surrounding the target pixel, assigning higher weights to those closer to the target location. Following this interpolation process, the resulting seamless image is then ready for the subsequent phase of spatiotemporal fusion.
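The weighting steps (Equations (3)–(6)) and the IDW fallback can be sketched as below; the epsilon guard on zero RMSD and the IDW power of 2 are illustrative assumptions rather than the exact NSPI settings.

```python
import numpy as np

def nspi_predict(target_img, input_img, target, similar):
    """Weighted-average prediction of the missing target pixel from similar
    pixels in the target image (Eqs. (3)-(6)). `similar` holds (row, col)
    positions selected beforehand; a small epsilon guards zero RMSD."""
    ty, tx = target
    d = np.array([np.hypot(r - ty, c - tx) for r, c in similar])       # Eq. (3)
    s = np.array([np.sqrt(np.mean((input_img[:, r, c].astype(float)
                  - input_img[:, ty, tx]) ** 2)) for r, c in similar])  # Eq. (1)
    cd = (s + 1e-9) * d                                                 # Eq. (4)
    w = (1.0 / cd) / np.sum(1.0 / cd)                                   # Eq. (5)
    vals = np.stack([target_img[:, r, c].astype(float) for r, c in similar])
    return (w[:, None] * vals).sum(axis=0)                              # Eq. (6)

def idw_fill(target_img, valid, target, power=2):
    """Fallback: inverse distance weighting over valid pixels around the gap.
    The target position itself must be flagged invalid in `valid`."""
    ty, tx = target
    pts = np.argwhere(valid)
    d = np.hypot(pts[:, 0] - ty, pts[:, 1] - tx)
    w = 1.0 / d ** power
    w /= w.sum()
    vals = target_img[:, pts[:, 0], pts[:, 1]].astype(float)
    return (vals * w).sum(axis=1)
```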
3.2. The ROBOT Algorithm
For coarse-resolution images, let $\mathbf{c}_i$ denote the patch of the $i$-th input image pair represented as a flattened vector. There exists a sparse vector $\mathbf{a}$ satisfying Equation (7), $\mathbf{c}_p \approx \mathbf{C}\mathbf{a}$, where $\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_N]$ is the matrix of stacked coarse image patch vectors, $\mathbf{c}_p$ is the coarse patch on the prediction date, and $N$ is the number of input images. We can find a compact and sparse representation as in Equation (7) by solving the LASSO (Least Absolute Shrinkage and Selection Operator) problem (Equation (8)), $\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \lVert \mathbf{c}_p - \mathbf{C}\mathbf{a} \rVert_2^2 + \lambda \lVert \mathbf{a} \rVert_1$, where $\lambda$ is a regularization parameter.
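Equation (8) is a standard LASSO problem. As one possible solver (the original ROBOT implementation may use a different one), it can be minimized with a few iterations of the iterative soft-thresholding algorithm (ISTA):

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(C, c_p, lam=0.1, n_iter=500):
    """Solve min_a ||c_p - C a||_2^2 + lam * ||a||_1 (Eq. (8)) with ISTA.
    C: (pixels, N) matrix of stacked coarse patch vectors; c_p: target patch."""
    a = np.zeros(C.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(C, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = 2.0 * C.T @ (C @ a - c_p)            # gradient of the data term
        a = soft_threshold(a - step * grad, step * lam)
    return a
```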
After obtaining the value of $\hat{\mathbf{a}}$, the estimation of the fine-resolution image can be expressed in the same way as Equation (9), $\hat{\mathbf{f}}_p = \mathbf{F}\hat{\mathbf{a}}$, where $\mathbf{F} = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_N]$ is the matrix of stacked fine image patch vectors, $\mathbf{f}_i$ is the fine-resolution image patch of the $i$-th input image pair as a flattened vector, and $\hat{\mathbf{f}}_p$ is an estimate of the actual image patch $\mathbf{f}_p$.
Variations in sensor specifications and observational perspectives can lead to inconsistencies between fine-resolution and coarse-resolution images, so the prediction in Equation (9) may contain noise. To mitigate this issue, a regularization term is introduced to enhance temporal correlation, and the optimized formulation becomes Equation (10), $\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \lVert \mathbf{c}_p - \mathbf{C}\mathbf{a} \rVert_2^2 + \lambda \lVert \mathbf{a} \rVert_1 + \mu \lVert \mathbf{F}\mathbf{a} - \bar{\mathbf{f}} \rVert_2^2$. Here, $\bar{\mathbf{f}}$ represents the fine-resolution image closest to the predicted moment in time, $t_i$ is the time flag of the $i$-th input image pair, $t_p$ is the time flag of the prediction phase, and $\mu$ is a weighting parameter.
Considering the presence of an approximate residual $\mathbf{r}$ in the coarse-resolution images, defined in Equation (12) as $\mathbf{r} = \mathbf{c}_p - \mathbf{C}\hat{\mathbf{a}}$, it is necessary to distribute this residual into the fine-resolution prediction represented by the sparse vector. The residual is then integrated into the final prediction, as formulated in Equation (13), $\hat{\mathbf{f}}_p = \mathbf{F}\hat{\mathbf{a}} + \mathbf{r}^{\uparrow}$, where $\mathbf{r}^{\uparrow}$ is the residual distributed to the fine resolution.
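The residual step (Equations (12) and (13)) can be sketched as below; nearest-neighbor upsampling of the coarse residual is an assumed redistribution scheme used here for illustration only.

```python
import numpy as np

def predict_with_residual(F, C, a_hat, c_p, scale):
    """Final fine prediction (Eq. (13)): F a_hat plus the coarse residual
    r = c_p - C a_hat (Eq. (12)) redistributed to the fine grid.
    Patches are square; `scale` is the fine/coarse resolution ratio."""
    r = c_p - C @ a_hat                        # Eq. (12): coarse residual
    n = int(np.sqrt(r.size))                   # coarse patch is n x n
    # Nearest-neighbor upsampling of the residual to the fine grid.
    r_up = np.kron(r.reshape(n, n), np.ones((scale, scale)))
    return F @ a_hat + r_up.ravel()            # Eq. (13)
```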
3.3. The Improved ROBOT (IROBOT) Algorithm
In the ROBOT algorithm, a high-quality reference image $\bar{\mathbf{f}}$ (represented by Equation (11)) is required and is computed as the average image around $t_p$. This approach stabilizes the output during the reconstruction of time-series images. However, mosaic artifacts may emerge in the reconstructed images if the input images are of poor quality or limited in quantity. To address this, we modify $\bar{\mathbf{f}}$ to be a piecewise function, as shown in Equation (14):
When predicting the image at a time point $t_p$ within the time neighborhood of $t_i$, the predicted image should be more similar to the $i$-th input image than to the average of adjacent images, so $\bar{\mathbf{f}}$ is set to $\mathbf{f}_i$. The L1-norm constraint, controlled by the parameter $\lambda$, compresses the data and reduces model complexity. The L2-norm constraint, controlled by the parameter $\mu$, preserves most features of $\bar{\mathbf{f}}$ and generates a stable predicted image.
Additionally, when fewer images are available, we handle $\mu$ differently. If the time flag of the predicted image is not closely aligned with the input image time series, indicating significant temporal differences between the predicted image and $\bar{\mathbf{f}}$, the spatial structure of the predicted image relative to $\bar{\mathbf{f}}$ may change. In such cases, it is advisable to relinquish the L2-norm constraint and employ only the L1-norm constraint. This allows the most important features to be selected automatically from feature-rich datasets, generating a fine-resolution predicted image consistent with the selected features. We therefore devised a strategy of abandoning the L2-norm constraint (setting $\mu = 0$) when the number of available images is less than 7.
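The piecewise reference of Equation (14) and the small-sample strategy can be sketched as follows; the neighborhood radius and the default values of λ and μ are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def choose_reference(fine_stack, t_flags, t_p, radius=8):
    """Piecewise reference (Eq. (14), sketched): if the prediction date falls
    within the time neighborhood of an input image, use that image; otherwise
    fall back to the average of the temporally adjacent images.
    `radius` (days) is an assumed neighborhood size."""
    t_flags = np.asarray(t_flags)
    i = int(np.argmin(np.abs(t_flags - t_p)))
    if abs(t_flags[i] - t_p) <= radius:
        return fine_stack[i]
    lo, hi = max(0, i - 1), min(len(t_flags) - 1, i + 1)
    return fine_stack[lo:hi + 1].mean(axis=0)

def regularization_weights(n_images, lam=0.1, mu=0.05):
    """Drop the L2 constraint (mu = 0) when fewer than 7 images are
    available; lam and mu defaults are illustrative."""
    return (lam, 0.0) if n_images < 7 else (lam, mu)
```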
3.4. Experimental Design
Interpolating SLC-off data is a crucial step in reconstructing dense time series with the ROBOT algorithm. In our experiments, we primarily focused on comparing the effects of different interpolation methods on the reconstruction results. Specifically, we integrated two distinct interpolation techniques with the original ROBOT algorithm to serve as comparative algorithms. The Linear-ROBOT method identifies valid pixels before and after the target image's date and employs time-based linear interpolation to fill the invalid striped pixels before applying the original ROBOT algorithm [49] to reconstruct complete annual time-series data. The IDW-ROBOT method instead adopts inverse distance weighted interpolation to fill in missing values prior to reconstruction with the original ROBOT algorithm [49]. Both comparison methods use the same input images as IROBOT. We designed three experiments: Experiments I and II demonstrate the reconstruction effect, while Experiment III analyzes the continuity of the reconstructed time-series images. Each experiment presents the results of the proposed IROBOT method as well as those of the comparative methods.
Experiment I involves the recovery and reconstruction of Landsat 7 SLC-off data from 2011 (images shown in Figure 1). The input data for the IROBOT method consisted of SLC-off data (fine images) and SDC data (coarse images) from 2011, with the parameters $\lambda$ and $\mu$ held fixed. Clear Landsat 5 TM images, as shown in Figure 2a,b, are selected as the reference images for a quantitative evaluation of the reconstruction in the blue, green, red, and near-infrared (NIR) bands. The TM sensor can be regarded as homologous to the ETM+ sensor [53], so in this experiment the reconstructed image can be compared directly to the reference image without sensor transformation.
Experiment II encompasses the restoration and reconstruction of Landsat 7 SLC-off data captured in 2013. During this year, the Landsat 8 satellite could provide high-quality images as references. Considering the sensor disparities between ETM+ and OLI, the Landsat 7 reflectance data can be linearly transformed following the approach outlined by Roy [53,54] to ensure consistency between the images captured by the two sensors. The regression coefficients for the transformation from ETM+ to OLI are provided in Table 3, and the formula for converting the surface reflectance data from ETM+ to OLI, incorporating the scaling factors from the SR dataset, is given by Equation (15). The 9 Landsat 7 ETM+ images of 2013 in Figure 1 are transformed into OLI-consistent images via RMA regression. The fine-resolution inputs are therefore the converted Landsat 7 images, and the coarse-resolution inputs are the SDC images of 2013. These input images are used in the reconstruction process, following the procedures outlined in Experiment I. Clear Landsat 8 images are selected as the reference images (as shown in Figure 2c,d), and the restoration and reconstruction results of the methods on the four bands are evaluated.
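The band-wise conversion of Equation (15) can be sketched as below; the `slope` and `intercept` arguments stand in for the Table 3 coefficients (not reproduced here), and the 0.0001 scaling factor is an assumed SR scaling convention rather than a value taken from the paper.

```python
import numpy as np

def etm_to_oli(sr_etm, slope, intercept, scale=0.0001):
    """Convert ETM+ surface reflectance to OLI-equivalent reflectance with a
    per-band linear (RMA) model, OLI = intercept + slope * ETM+ (Eq. (15),
    sketched). `slope`/`intercept` are the Table 3 regression coefficients;
    `scale` is the SR dataset's DN-to-reflectance factor (assumed value)."""
    rho = sr_etm.astype(float) * scale      # scaled DN -> reflectance
    rho_oli = intercept + slope * rho       # per-band RMA regression
    return rho_oli / scale                  # back to the SR DN convention
```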
Experiment III reconstructs the daily Landsat 7 reflectance data for 2012. Because only 6 Landsat 7 images were available in 2012, we applied the strategy of abandoning the L2-norm parameter $\mu$ when the number of available images is less than 7. To enhance feature availability, the temporal input range was extended by roughly four months on either side of 2012, incorporating 5 images from August to December 2011 and 2 from January to April 2013. In this experiment, we reconstructed the time-series data using the IROBOT, Linear-ROBOT, and IDW-ROBOT methods; all algorithms used 13 Landsat images as fine-resolution input (including the 7 images from 2011 and 2013) and the SDC500 images of 2012 as coarse-resolution input. Moreover, the NDVI time series of the reconstructed results are calculated and presented as line charts.
3.5. Evaluation Metrics
In terms of qualitative assessment, visual inspection of the reconstructed images is conducted to assess their spatial continuity, and scatter plots comparing the reconstructed images with the reference images are also examined. For quantitative assessment, the Mean Absolute Error (MAE), bias (BIAS), Correlation Coefficient (CC), and Structural Similarity Index Metric (SSIM) are employed to compare the reconstructed images with the reference images. Values of MAE and BIAS close to 0, together with values of CC and SSIM close to 1, indicate better prediction quality.
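These metrics can be computed for a single band as in the following sketch; the global (single-window) SSIM with the usual K1 = 0.01 and K2 = 0.03 constants is a simplification of the windowed SSIM commonly used in practice.

```python
import numpy as np

def evaluate(pred, ref):
    """MAE, BIAS, CC, and a global SSIM between a predicted band and a
    reference band. The data range for SSIM is estimated from the reference."""
    p, r = pred.astype(float).ravel(), ref.astype(float).ravel()
    mae = np.mean(np.abs(p - r))
    bias = np.mean(p - r)
    cc = np.corrcoef(p, r)[0, 1]
    L = r.max() - r.min() or 1.0            # data range (guard against zero)
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mp, mr, vp, vr = p.mean(), r.mean(), p.var(), r.var()
    cov = ((p - mp) * (r - mr)).mean()
    ssim = ((2 * mp * mr + c1) * (2 * cov + c2)) / \
           ((mp ** 2 + mr ** 2 + c1) * (vp + vr + c2))
    return {"MAE": mae, "BIAS": bias, "CC": cc, "SSIM": ssim}
```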