1. Introduction
The Normalized Difference Vegetation Index (NDVI) is a widely used vegetation index (VI) derived from optical remote-sensing data to evaluate the biophysical or biochemical information related to vegetation growth [
1,
2,
3,
4,
5]. Large scale time series NDVI is generally used for assessment and monitoring of forest [
6,
7], grassland [
8,
9,
10], ecological environment [
11,
12], wildlife habitat disturbance [
13], and to estimate gross primary productivity [
14], biomass [
15,
16], and evapotranspiration [
17]. It can be calculated from the images acquired by sensors such as the Advanced Very High Resolution Radiometer (AVHRR), Moderate Resolution Imaging Spectroradiometer (MODIS), Medium Resolution Imaging Spectrometer (MERIS), Sea-Viewing Wide Field-of-View Sensor (SEAWIFS), or VEGETATION, with spatial resolution ranging from 250 m to a few kilometers [
1]. In applications such as crop monitoring, time series images acquired by high spatial-resolution sensors such as Landsat-OLI (30 m) and RapidEye (5 m) are required to provide spatial and temporal details. However, due to cost and technical limitations (trade-off between pixel spatial resolution and satellite temporal revisiting cycle) [
18] and cloud cover problems, it is difficult to acquire images with both high spatial resolution and high temporal frequency. Thus, spatio-temporal data fusion techniques have been developed as a feasible and less expensive way to acquire remote sensing time series data for land surface dynamics monitoring [
19,
20,
21,
22].
In general, generating fine-resolution NDVI time series images through spatio-temporal data fusion can be conducted in two ways [
23]: (i) conduct fusion of reflectance first and then calculate the NDVI (Blend-then-Index, BI); and (ii) calculate the NDVI first and then conduct fusion (Index-then-Blend, IB). The theoretical basis of the two ways are essentially the same, and some of the methods developed for reflectance images can also be used for NDVI images. However, the IB is preferred over BI due to less error propagation and fewer computational steps (blending one index band versus multiple reflectance bands) [
23].
A few categories of spatio-temporal data fusion approaches have been originally proposed to blend reflectance images including data-assimilation based algorithms, unmixing based methods, dictionary-pair learning based methods, and weighted function based methods [
22]. The data-assimilation based algorithms incorporate observation data with models and their uncertainties to minimize the residual errors [
24]. The advantage of data-assimilation based algorithms is that a complete time series of fine-resolution images, rather than single synthetic image, can be synthesized in the implementation. A recursive Kalman filter algorithm (KF) was implemented to produce frequent fine-resolution NDVI time series using available fine-resolution images and a time series of coarse-resolution NDVI images in previous studies [
25,
26]. The uncertainty of these algorithms is correlated to the number of available fine-resolution observations, and these algorithms suffer large uncertainties when the available fine-resolution images are limited in heterogeneous areas.
The unmixing based methods include the Multisensor Multiresolution Technique (MMT) [
27], the spatial temporal data fusion approach (STDFA) [
28], and the spatial and temporal reflectance unmixing model (STRUM) [
29]. These methods assume that remote sensing signal of coarse-resolution pixels is the weighted average of the mean reflectance of each class and a residual error. The weight of each class is its fractional cover within one coarse-resolution pixel, which can be calculated from a fine-resolution land-cover map. The mean reflectance of each land cover is estimated by solving a number of spectral unmixing model equations of coarse-resolution pixels, and the mean reflectance of each land cover is assigned to the fine-resolution pixels. The unmixing based concept can also be used for spatio-temporal fusion of NDVI images. Busetto et al. [
1] proposed an approach for the estimation of sub-pixel NDVI time series by combining fine- and coarse-resolution NDVI images based on an unmixing approach. This approach aims to estimate the mean NDVI of each land cover within one coarser-resolution pixel from daily MODIS data by solving the linear weighted equations using manually selected coarse-resolution pixels, and disaggregate MODIS NDVI using weights calculated by Gaussian functions of MODIS spectral dissimilarity index and the Euclidean spatial distance between the target and each pixel. Rao et al. [
30] proposed the NDVI Linear Mixing Growth Model (NDVI-LMGM) to produce high spatial resolution NDVI time-series data by using MODIS NDVI time series data and Landsat images. The NDVI-LMGM combines the NDVI linear mixing model with the NDVI linear growth model. It assumes that the short-term changes in NDVI can be interpreted as linear, and long-term NDVI changes can only be predicted reliably by several short-term predictions. The change rate of each land cover during the two dates on the TM/ETM+ pixel scale within the corresponding MODIS pixel can be estimated by solving the linear weighted equations. Another method, NDVI-Bayesian spatiotemporal fusion model (NDVI-BSFM) [
31], employed the Bayesian pixel unmixing process through incorporating the prior multi-year average MODIS NDVI from the first day of the year to the last day of the year for each land cover type, and it can predict a single Landsat-like NDVI image as well as build a Landsat-like NDVI time-series dataset. However, these methods are computationally intensive and rely heavily on the quality of MODIS NDVI data. Furthermore, the performance of these methods will be affected by the classification accuracies [
32], especially when there are land cover changes within a year. These methods are not applicable when reliable Land cover/Land use (LC/LU) ancillary information is not available. For example, in some cropland area, the crop types change every year due to the annual rotation of the crops. Thus, the LC/LU data are generally produced at the end of the growing season.
The dictionary-pair learning based methods, such as the sparse representation-based spatio-temporal reflectance fusion model (SPSTFM) [
33,
34] and the error-bound-regularized semi-coupled dictionary learning (EBSCDL) [
35], need one or two pairs of fine- and coarse-resolution images and one coarse-resolution image as input data. It builds relationships (dictionaries) between the features of fine-resolution images and their corresponding coarse-resolution images through dictionary-pair learning [
36,
37], and then applies the relationship (dictionary) to predict a fine-resolution image on the prediction date. The dictionary-pair learning based methods performed well in phenology change, but they still face challenges in heterogeneous regions with abrupt land cover type changes [
22], and they are computationally complex [
22,
33].
The weighted function based methods include the spatial and temporal adaptive reflectance fusion model (STARFM) [
19], and the enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM) [
18]. Due to simple input requirements (no ancillary land cover data or classification data required) and robustness, they are widely used in many applications [
11,
16,
23,
38,
39]. The input data of the STARFM algorithm are one or two pairs of fine- and coarse-resolution images acquired at the same time and one coarse-resolution image acquired at the prediction time. This algorithm assumes that the reflectance of a given coarse-resolution pixel can be aggregated from fine-resolution homogeneous pixels. The STARFM attempts to search neighboring similar homogeneous coarse-resolution pixels within a moving window and endows the weights of these similar pixels calculated according to the spectral difference between coarse-resolution and fine-resolution data, the temporal difference between the input MODIS data, and the distance to the central pixel. The reflectance of the central fine-resolution pixel is estimated from the neighboring similar homogeneous pixels. However, the STARFM algorithm is not applicable to heterogeneous areas such as croplands [
18]. Due to this limitation, an improved STARFM with help of unmixing-based method (USTARFM) was proposed [
40]. However, it still subjects the problem of land cover changes. The ESTARFM was proposed by Zhu et al. [
18] to enhance the ability of STARFM through the use of two pairs of fine-resolution and coarse-resolution images obtained on two dates. It is able to minimize the system biases between the sensors and is more suitable for heterogeneous areas by using two pairs of fine- and coarse-resolution images to detect land cover changes and it keeps more spatial details [
18]. However, the ESTARFM method assumes that there is no temporal variation in change rate of the vegetation during a period, which is not reasonable if it is long a period between the input images. Furthermore, the ESTARFM neglects the variation of the relationship between the fine- and coarse-resolution images caused by different acquisition dates. The spatial and temporal nonlocal filter-based data fusion method (STNLFFM) [
41] is a recently proposed method that can predict the fine-resolution reflectance accurately and robustly by introducing the idea of nonlocal filtering, for both heterogeneous landscapes and temporally dynamic areas. However, it is based on the assumption that the reflectance change rate is linear, which is not accurate over a long period. The flexible spatiotemporal data fusion (FSDAF) model [
22] is a method using one pair of fine- and coarse-resolution images and one coarse-resolution image acquired on the prediction date. This method integrates the unmixing-based methods, spatial interpolation, and STARFM into one framework. It performs better in predicting abrupt land cover changes than other methods that use only one pair of fine- and coarse-resolution images. However, this method is computationally expensive and the prediction accuracy greatly depends on the extent of land cover changes between the two dates of the input images.
In this study, a spatio-temporal vegetation index image fusion model (STVIFM) is proposed to generate NDVI time series images with high spatial and high temporal resolution in heterogeneous regions such as croplands more accurately and robustly. Different from the methods mentioned above, we aim to predict the NDVI change for each fine-resolution pixel by using a weighting system to disaggregate the total NDVI change within a moving window, which can be calculated from the coarse-resolution NDVI images acquired on two different dates. The proposed model employed: (1) a new weighting system; (2) a new method to obtain the relationship between the two resolution images; and (3) more reasonable assumptions on the NDVI change rate of non-evergreen vegetation, which considers the change rate variations at both spatial scale and temporal scale. This algorithm is tested by Landsat-OLI and MODIS data acquired in three study sites. The results generated by the STARFM, ESTARFM and FSDAF methods are validated by the real Landsat NDVI images or field measurements for three study sites and different growth stages of a cropland area.
3. Results of Data Fusion
3.1. Test Sites and Data
Three test sites of distinct geographic locations and climate zones were chosen to test the STVIFM algorithm. The first site (42°53′N 81°35′W, 12 km × 12 km) is located in the Mixedwood Plains Ecozone in southwestern Ontario, Canada, near the city of London. This area is constantly contaminated by cloud cover during the growing season. For example, from April to October, about 57% of Landsat-8 images contain cloud cover greater than 35%. The dominant crops are winter wheat, corn, and soybean. The winter wheat is generally seeded in October in the previous year and harvested in late July, whereas the corn and soybean are generally seeded in May and harvested in September or October. The second site (37°01′N 99°07′W, 24 km × 24 km) is located near Dodge City in Kansas, United States. This study site contains a large area of grassland as well as crops such as winter wheat. This area receives around 532 mm of rainfall with annual average temperature of 13.3 °C, and altitude between 700 and 800 m above sea level. The third site (44°13′N 87°53′E, 45 km × 45 km) is situated in the south of the Junggar Basin in Xinjiang, China, bordered by the Gurbantunggut Desert in the north. The dominant crop types in this area are cotton, corn, and winter wheat. The cotton and corn are planted in April and harvested in September, whereas wheat is generally seeded in October in the previous year and harvested in May. This area receives scarce rainfall and has long and cold winters with short and hot summer with sharp contrast between daytime and night temperature [
47].
Landsat-8 OLI data and Moderate Resolution Imaging Spectroradiometer (MODIS) surface reflectance products (MOD09GQ and MOD09Q1) were obtained. The Landsat-8 images, with 9 spectral bands, 16-day temporal frequency, and 30 m spatial resolution, were downloaded from the United States Geological Survey (USGS) (
http://landsat.usgs.gov/landsat8.php). MOD09GQ is daily reflectance product and MOD09Q1 is a level-3 eight-day composite product of MOD09GQ, which provides surface reflectance of Band 1 and Band 2 at 250 m resolution. Each MOD09Q1 pixel represents the best observation during an eight-day period [
48]. The eight-day MODIS reflectance products (MOD09Q1) were downloaded from the National Aeronautics and Space Administration (NASA) Reverb portal (
http://reverb.echo.nasa.gov/reverb/) for Ontario site due to the frequent cloud contamination of the daily reflectance products. The daily MODIS reflectance products MOD09GQ were downloaded for Kansas and Xinjiang sites. The MODIS data were re-projected and mosaicked using the MODIS re-projection tool (MRT). They were then resampled to 30 m resolution using the nearest neighbor method and geo-rectified to the corresponding Landsat images. To avoid the influence of clouds, the MODIS and Landsat-8 images were then clipped to the areas where there was no cloud presence. Finally, the NDVI was calculated.
The dates of cloud-free Landsat-8 OLI and the corresponding MODIS images acquired near the dates of the Landsat acquisitions are shown in
Table 1. The two pairs of MODIS and Landsat-8 NDVI images acquired before and after the prediction date, and one MODIS NDVI image acquired on the prediction date were used to predict the synthetic Landsat-like NDVI image. The Landsat-8 images acquired on the prediction date at each study site were used to validate the synthetic Landsat-like NDVI images.
Figure 3,
Figure 4 and
Figure 5 show the NDVI images obtained on three dates over three study sites. The sub-images (30 m resolution, 400 × 400 pixels) at the upper rows are Landsat-8 NDVI images (fine-resolution, 30 m) and lower rows are MODIS NDVI images (coarse-resolution, 250 m resampled to 30 m).
To assess the application of the proposed algorithm on time series data, a total of 18 cloud-free MODIS MOD09Q1 data and six Landsat-8 OLI data were acquired over London, Ontario throughout the growing season in 2014 (
Figure 6).
Figure 7 shows the six cloud-free NDVI images and corresponding MODIS NDVI images. The sub-images at the upper row are Landsat-8 NDVI images (fine-resolution, 30 m). The sub-images at the lower row are the eight-day MODIS NDVI images. From 6 May to 7 June, as the winter wheat grew, the NDVI increased greatly. On 10 August, the winter wheat had been harvested, and most of the land was covered by corn and soybean. Thus, there were great land cover changes from 7 June to 10 August. From 10 August to 26 August, a few winter wheat fields were covered by alfalfa, and the NDVI increased again. On 27 September, most corn and soybean were senescent and the NDVI decreased.
3.2. Selection of Window Size for Deriving the Coefficients
Linear regression analysis was conducted between the fine- and coarse-resolution image pairs (
Section 2.3.1) for different study sites using different sizes of moving window. The variations of the coefficient of determination (R
2),
a and
b with the increasing window size are shown in
Figure 8. It was illustrated that when 9 (the approximate size of one course-resolution pixel) was adopted as the window size, the correlation was much lower than for other choices. The reason for this is likely that there are errors introduced by rounding and the geometric correction process between fine-resolution and coarse-resolution images. In this way, the fine-resolution pixels may not be the pixels that are supposed to be within the original coarse-resolution pixel. With the increase of window size, the R
2 values increased for most image pairs and plateaued when the window size is 17, which contains 4 MODIS pixels. This should be impacted by the MODIS point spread function (SPF) [
49]. The value of
a ranges from 0.9 to 1.5 for different study sites, and
b ranges from −0.15 to 0.05 for different study sites. Even for the same study site,
a and
b vary on different dates. With the increase of the window size, the variations of
a and
b are slight and smooth. It can be believed that both
a and
b are not sensitive to the change of window size. However, with a larger window, the increase rate of R
2 becomes smaller, the sample points become less, and the significance of the correlation will be reduced. Accordingly, a 33 × 33 moving window (4 × 4 coarse-resolution pixels) was used to obtain the coefficients.
Besides the proposed method, the STARFM, ESTARFM and FSDAF methods were also used for comparison purpose. To select a reasonable window size for the algorithm implementation, the data fusion was conducted with different window sizes based on the STARFM, ESTARFM, FSDAF and STVIFM methods. Take the Ontario site as an example (
Table 2). For the STVIFM, the accuracy is the highest when the window size is 25, and the computation time increases with the increase of window size. For the ESTARFM, the accuracy is the highest when the window size is 33. For the STARFM and the FSDAF, the larger window size gives higher accuracy (higher R
2 and lower RMSE), but the increase of accuracy is very small. Therefore, we chose 33 as the window size (the same as the window size used in coefficients calculation) for the fusion models, after analyzing the R
2, RMSE, and computational efficiency of these algorithms using different window sizes.
3.3. Algorithm Tests in Regions with Different Landscapes
As the FSDAF only needs one pair of fine- and coarse-resolution images, the pair of images acquired on the date before and after the prediction date were used, respectively, for all three sites, and FSDAF_m and FSDAF_n were used to represent the two predictions hereafter. The performances of the four algorithms were assessed by visual comparison and regression analyses. The coefficient of determination, Mean Absolute Difference (MAD) and Mean Difference (MD) between the observed NDVI and predicted NDVI images were also calculated to assess the accuracies of the four algorithms.
Figure 9,
Figure 10,
Figure 11,
Figure 12,
Figure 13 and
Figure 14 show the original Landsat-8 OLI NDVI image and the synthetic NDVI images predicted by the four algorithms, and the scatter plots between the synthetic and the original NDVI values of Landsat OLI image acquired at
tp at the three study sites.
Table 3 shows the R
2, RMSE, MAD, MD and computation time of the four algorithms at different test sites. The performance of the STVIFM is better than the STARFM and ESTARFM methods, and similar with the best prediction of the FSDAF at all the three study sites according to R
2, RMSE, and MAD. In addition, the STVIFM is more computationally efficient than the ESTARFM and FSDAF when a window of 33 × 33 pixels was adopted.
The Ontario site has many small-area croplands (the width of the fields is less than 250 m), which resulted in many mixed pixels in the coarse-resolution image.
Figure 10 shows that the predicted NDVI images generated by the FSDAF_m and STVIFM are more similar to the original image. However, the small-area land cover changes shown in the red boxes in
Figure 10 were not accurately predicted by the FSDAF. Actually, the predicted NDVI image obtained by the FSDAF is more similar to the input fine-resolution image. As there are less land cover changes between 20 April and 6 May than between 6 May and 7 June at this site, the predicted result using the FSDAF (FSDAF_m) is more similar to the observed NDVI. The STARFM, ESTARFM, and STVIFM are all able to predict the land cover change by making use of two image pairs, whereas the ESTARFM overestimated the NDVI.
As indicated by the scatter plots shown in
Figure 11 and the assessment indices shown in
Table 3, the accuracy of the predicted NDVI image using the STVIFM is the best (R
2: 0.826, RMSE: 0.071) and slightly better than the accuracy of the predicted NDVI image using FSDAF_m (R
2: 0.824, RMSE: 0.075). The STARFM performed better than the ESTARFM and FSDAF_n in terms of the RMSE and MAD. Most NDVI values were overestimated by the ESTARFM and FSDAF_n. From the perspective of computational efficiency, the STVIFM consumed less time than the ESTARFM and the FSDAF. For the sub-image of 400 × 400 pixels, the ESTARFM and FSDAF consumed 52 s and 69 s, respectively, and the STVIFM consumed about 46 s.
The Kansas site is mostly covered by grassland and it is more homogeneous than the other two sites. The overall accuracy using the proposed STVIFM method (R
2: 0.711, RMSE: 0.076) is the best when compared with the STARFM and ESTARFM. The RMSE and MAD of the result produced by the STVIFM are comparable with that of the result produced by the FSDAF_m, but the R
2 of the former result is higher than the later one.
Figure 12 reveals that the ESTARFM, FSDAF_m, and STVIFM performed better in the cropland area shown in the red box, while all the methods seem to overestimate the NDVI in the grassland area. The RMSEs for these three methods are similar (RMSE: 0.077 vs. 0.075 vs. 0.076) but the R
2 for the STVIFM is the highest (R
2: 0.711). As the land cover for most areas barely changed from the date of the first image pair (3 May 2014) to the prediction date (19 May 2014), the NDVI predicted by the FSDAF_m shows higher accuracy than the NDVI predicted by the FSDAF_n. In terms of computational efficiency, the ESTARFM and FSDAF consumed 3 min 29 s and 9 min, respectively, whereas the STVIFM consumed 3 min 9 s for an image of 800 × 800 pixels.
At the Xinjiang site, the predicted NDVI image produced by the STVIFM shows good agreement with the observed NDVI (R
2: 0.891, RMSE: 0.065). Most of the fields were in the growing stage during this period and a few fields were in the senescent stage. The accuracy of the synthetic NDVI images produced by the ESTARFM and FSDAF_m are comparable but lower than that of the STVIFM (R
2: 0.820 vs. 0.812, RMSE: 0.082 vs. 0.085). As shown in the zoom-in area in the black box, the predicted NDVI image obtained by the STARFM shows “blurred” field boundaries, and the overall accuracy is much lower than the ESTARFM. The temporal NDVI changes of the senescent fields shown in the red box in
Figure 14 were more accurately captured using the STVIFM when compared with the STARFM and ESTARFM. For the FSDAF, the predicted NDVI image is more accurate using the image pair acquired on 27 May than using the image pair acquired on 28 June. Even though the time intervals between the prediction date (12 June 2014) and dates of the two base image pairs were the same, the land cover changed significantly from 12 June to 28 June. Therefore, the NDVI prediction using the image pair acquired on 27 May is more accurate than using the image pair acquired on 28 June. In terms of the computational efficiency, the ESTARFM and FSDAF consumed about 12 min and 40 min respectively for an image of 1500 × 1500 pixels, whereas the STVIFM consumed about 11 min.
When we compare the accuracies for the three study sites, it is obvious that both the Ontario site and Xinjiang site are more heterogeneous than the Kansas site and there are many crop fields with small areas. This is possibly the reason that the STARFM and ESTARFM performed better at Kansas site than Ontario site and Xinjiang site in terms of the RMSE and MAD. However, the STVIFM performed better for Ontario site and Xinjiang site than Kansas site. Therefore, it can be concluded that the STVIFM performs better than the STARFM and ESTARFM not only in homogeneous regions but also in heterogeneous regions.
3.4. Tests with Time Series Data
In this test, all six available Landsat NDVI images and the corresponding MODIS NDVI images acquired throughout the whole growing season over Ontario site were tested using the four methods for the four predictions shown in
Table 4. For the FSDAF, both the image pairs before and after the prediction date were used to implement this method. (a1)–(a6) and (b1)–(b6) are the images shown in
Figure 7.
The results were validated with the original Landsat NDVI images, and the assessment indices including R
2, RMSE, MAD, and AD for different methods were shown in
Figure 15. For the first prediction, the two Landsat images were acquired on 20 April and 7 June. The main crop was winter wheat, which was growing steadily during this period. There were some small-area land cover changes in images acquired on 6 May and 7 June (
Figure 7 (a3)). As presented in
Section 3.3 (Ontario site), the STVIFM performed the best when compared with the other three methods during this period.
For the second prediction, the two Landsat images were acquired on 6 May and 10 August and the prediction date was 7 June. From 6 May to 7 June, corn and soybeans were planted, then wheat was harvested and alfalfa was planted in the harvested wheat fields from 7 June to 10 August. The STVIFM performed better than the STARFM and ESTAFM (RMSE: 0.184 vs. 0.195 vs. 187). The predicted NDVI image generated by the FSDAF using the image pair acquired on 6 May shows much higher accuracy than using the image acquired on 10 August and the predicted NDVI generated by the other two methods.
For the third prediction, the two Landsat images were acquired on 6 June and 26 August and the prediction date was 10 August. As mentioned above, the wheat fields changed greatly from 6 June to 10 August, whereas the land cover seldom changed from 10 August to 26 August. The correlation of determination of the predicted NDVI using the STVIFM is higher than the STARFM and ESTARFM (R2: 0.579 vs. 0.550 vs. 0.552), whereas the RMSE is higher than the STARFM and ESTARFM (RMSE: 0.158 vs. 0.151 vs. 0.156). This may be because of the large land cover changes for wheat fields from 6 June to 26 August, and the NDVI for wheat field is near the valley of the NDVI profile on 10 August. As mentioned earlier, larger inaccuracy would be produced by the STVIFM if the peak or valley in the NDVI profile between the two dates of acquired Landsat images needs to be predicted, and the NDVI change is not captured by the NDVI difference of the two Landsat images. This inaccuracy can be reduced if more fine-resolution images can be acquired during this period. The prediction of the FSDAF using the image pair acquired on 26 August shows a higher accuracy than using the image pair acquired on the 6 June.
For the fourth prediction, the two Landsat images were acquired on 10 August and 27 September, and the prediction date was 26 August. Corn and soybeans were senescent during this period. The STVIFM performed better than the STARFM and ESTARFM (RMSE: 0.116 vs. 0.133 vs. 0.137). The accuracy of the image predicted by the FSDAF using image acquired on 27 September is higher than the accuracy of the NDVI predicted by the FSDAF using image acquired on 10 August in terms of the RMSE and MAD.
In addition, a total of 12 Landsat-like NDVI images were predicted using the six acquired Landsat and MODIS image pairs. By using more Landsat images, the prediction accuracy would be improved. However, there was no more available Landsat image to assess the prediction results. The NDVI time series were assessed by analyzing the temporal variations over the cornfield and winter wheat field and compare with the phenology information and photos collected in the field work. For the STARFM, ESTARFM and STVIFM, two temporally closest Landsat and MODIS NDVI image pairs and one MODIS NDVI image acquired between the two dates were used to predict the Landsat-like NDVI image each time. For the FSDAF, the MODIS NDVI image on the prediction date and one Landsat and MODIS NDVI image pair acquired closer to the prediction date was used to predict the Landsat-like NDVI image each time.
Figure 16 shows the temporal profiles of the average NDVI of an area (360 m × 360 m) from a healthy cornfield and an area (360 m × 360 m) from a healthy winter wheat field between DOY 113 and DOY 249, which were generated by the STARFM, ESTARFM, FSDAF, and the STVIFM. These two fields were surrounded by different crop types; therefore, the MODIS pixels contain mixed Landsat pixels. The average NDVI extracted from the original MODIS NDVI time series images and five Landsat-8 NDVI images are also presented as comparisons.
The corn and winter wheat show two distinct temporal patterns due to the difference of their growing seasons. The corn was generally seeded in May and harvested in October. The winter wheat was generally seeded in October of the previous year, started to ripen from the beginning of July and was harvested by the end of July, then alfalfa was planted. Before the emergence of corn, the MODIS NDVI shows higher and fluctuated values in the cornfield due to the influence of neighboring wheat fields. All the methods can generate reasonable predictions during this period. Between 7 June and 10 August, only two Landsat images were acquired, and the predictions of the STVIFM and FSDAF show large difference on 4 July (DOY 185). As the image pair closer to the prediction date was always used, the profile generated by the FSDAF shows peak and valley, but large inaccuracy was still produced on some dates when the land cover changed greatly. During this period, corn was in its growing stage whereas winter wheat was harvested and alfalfa was planted. From the general survey pictures collected in 2014, corn had reached to the stem elongation stage (BBCH-scale 33) and most fields were covered by corn leaves, whereas the color of wheat started to turn yellow (BBCH-scale 79). Therefore, the STVIFM seems to generate more reasonable temporal profiles for corn and winter wheat. The ESTARFM also shows reasonable temporal profiles for the two types of crops, however, the prediction of the land cover change for the wheat field is less accurate than the prediction produced by the STVIFM when compared with the nearby Landsat NDVI values.
4. Discussion
4.1. Advantages of the STVIFM
The algorithm tests at three study sites illustrated that the STVIFM algorithm performed better than the STARFM, ESTARFM and FSDAF at the three study sites. According to the results of the above tests, the performance of the FSDAF greatly depends on the degree of land cover change between the two dates of the input data as it uses only one pair of fine- and coarse-resolution images as input, which agrees with the findings stated in [
31]. It performs better than other methods which use one image pair [
22] and it is flexible when only one fine-resolution image can be acquired. However, it is less robust than methods using two fine-resolution images as inputs in the land cover change prediction. Even though the FSDAF can predict NDVI with higher accuracy when the time interval between the two dates of the input images are close enough or land covers are similar enough, the STVIFM still performs better than the FSDAF during the growing stage or senescent stage. In addition, the STVIFM is more computationally efficient and performs about three times faster than the FSDAF at Kansas site and Xinjiang site, where the sizes are 24 km × 24 km and 45 km × 45 km, respectively.
Since the inputs of the STVIFM are same as the inputs of the STARFM and ESTARFM, the theoretical comparisons are made between the three methods. Compared with the STARFM and ESTARFM algorithm, the STVIFM has made several improvements. Firstly, the STVIFM builds a relationship between the mean NDVI change of fine-resolution pixels and mean NDVI change of coarse-resolution pixels within a moving window. It attempts to detect the mean fine-resolution NDVI change calculated from the coarse-resolution NDVI images and to seek each fine-resolution pixel’s contribution to the total NDVI change by calculating the weight of each fine-resolution pixel. In contrast, the STARFM and ESTARFM build a relationship between the NDVI change of single fine-resolution pixels and single coarse-resolution pixels, therefore accurate geometric correction between the fine- and coarse-resolution images is required in order to achieve more accurate results [
36].
Secondly, the ESTARFM assumes that the relationships between the fine-resolution and coarse-resolution image pairs are the same on all dates. However, due to the difference of weather conditions, the relationship between the two images may be different on different dates. The STVIFM attempts to obtain the coefficients between the fine-resolution and coarse-resolution image pairs on different dates using linear regression analyses. However, it is difficult to obtain the coefficients between the images on the prediction date, due to the unavailability of the fine-resolution image. The STVIFM adopts the weights calculated from the correlation coefficients between the coarse-resolution images to obtain the coefficients between the fine- and coarse-resolution images on the prediction date.
Thirdly, each of the STVIFM, STARFM and ESTARFM applied a weighting system to calculate the NDVI of the central pixel, but the meaning of the weighting system for STVIFM and the weighting system for STARFM and ESTARFM are different. The weight in the STARFM or ESTARFM means the similarity between the central pixel and the surrounding similar pixels within the moving window, whereas the weight in the STVIFM means the variation in contributions of fine-resolution pixels to the total NDVI change within the moving window. The STVIFM considers the change rate variation at both spatial scale and temporal scale, which is more reasonable for non-evergreen vegetation. It attempts to calculate the spatial variation of NDVI change (spatial weight) of each fine-resolution pixel at any prediction date by incorporating the weights calculated based on one base fine-resolution image and the temporal NDVI change of the two fine-resolution images. These two elements are incorporated according to the land cover similarity between the prediction date and the two base dates. However, the ESTARFM assumes that the change rate is stable during a short period. This assumption is reasonable if the vegetation is evergreen or if the period between the two input image pairs is short enough (e.g., one day), but it would be unreasonable if the period is longer (e.g., more than 10 days) [
18].
Lastly, the two predictions obtained from the two base dates are combined using a temporal weight for the ESTARFM and a similarity weight for the STVIFM. The ESTARFM calculated the temporal weight using the mean absolute difference between two coarse-resolution pixels within the moving window [
18]. However, this is not the best selection to determine the land cover similarity in heterogeneous region. For instance, the mean absolute difference may be the same for area where has large land cover change (NDVI decrease mixed with NDVI increase), and area where has the same land cover but with NDVI decrease or increase. Therefore, the STVIFM adopts the correlation of determination for heterogeneous areas and the mean absolute difference for homogeneous areas to calculate the similarity weight.
Due to the advantages mentioned above, the STVIFM can make more accurate NDVI predictions in heterogeneous regions than the STARFM and ESTARFM when the land cover or NDVI changes were captured by the two pairs of fine- and coarse-resolution images. The accuracy improvements of the STVIFM aremore obvious for Ontario site and Xinjiang site, which are characterized by heterogeneous cropland areas. Accordingly, the STVIFM can generate more reasonable NDVI time series for winter wheat and corn, which have different growing season.
4.2. Limitations and Uncertainties of the STVIFM
In addition to the advantages mentioned above, it is worth noting that the STVIFM has its limitations. The following three aspects are the theoretical limitations of the STVIFM algorithm. Firstly, the STVIFM algorithm assumes that the NDVI is spatially additive. This linear assumption for the NDVI may lead to minor inaccuracies since the NDVI is not a linear combination of reflectance. Secondly, the relationship between the fine-resolution and coarse-resolution images acquired at tp is calculated from the relationship between the two input image pairs. The result obtained in this way may be slightly different from the real relationship. Additionally, the STVIFM adopts the same coefficients for the whole image, but the coefficients may vary at different locations. More efforts should be made in further work to obtain the coefficients using a more accurate way.
There are some practical limitations with the STVIFM. Firstly, the small-area (width or length is less than one coarse-resolution pixel) abrupt disturbances that occurred between tm and tp or between tp and tn, may not be accurately detected because the influence of other land covers in one coarse-resolution pixel. Secondly, if the dates tm and tn are in the growing and senescent period of the vegetation, respectively, and the NDVI change at tp is not captured by the images acquired at tm and tn, the performance of the STVIFM is slightly worse than the other methods since the NDVI change from tm to tn cannot reflect the NDVI change from tm to tp. In this case, more frequent high spatial resolution images that can cover the important vegetation phenology will be helpful. Another possible way to improve the prediction accuracy is to integrate the fusion model with a vegetation growth model for different types of vegetation.
4.3. Applications of the STVIFM
The STVIFM uses two pairs of Landsat-8 OLI and MODIS images acquired before and after the prediction date and one coarse-resolution image on the prediction date as inputs, to predict the fine-resolution NDVI image on the prediction date. This algorithm can be applied to regions with different landscapes such as grassland, forest and cropland areas. It can also be applied to other vegetation indices, but the thresholds may need to be adjusted accordingly. Besides the Landsat-8 OLI and MODIS data, other high spatial resolution data such as the SPOT, RapidEye, Sentinel-2, and high temporal frequency data such as AVHRR, MERIS can also be used. There are four parameters that could be set in the STVIFM, the window size for coefficients deriving and the window size for STVIFM implementation, the NDVI value for the maximum change rate of vegetation and the variance of the change rate index. The window size should be the odd rounding value of the integer multiple of the resolution ratio between the coarse- and fine-resolution images. The suggested window size for coefficients deriving and algorithm implementation is 25 or 33 for Landsat and MODIS data. However, for images with different spatial scales, the window size may need to be adjusted. Since the theoretical NDVI values for vegetation pixels range from 0 to 1, the median value 0.5 is suggested as the value of d, but the value can be adjusted according to the actual NDVI range of vegetation for special vegetation cover types. The suggested value for is 0.1–0.2 to obtain the change rate index with a dynamic range from 0 to 1. For the NDVI time series generation, different spatio-temporal data fusion methods may need to be incorporated to improve accuracy.