1. Introduction
Remote sensing satellites are important tools for monitoring processes such as vegetation and land cover changes on the Earth's surface [1,2,3]. Because of technological limitations in sensor design [4], compromises have to be made between spatial and temporal resolution. For example, the Moderate Resolution Imaging Spectroradiometer (MODIS) revisits the Earth once a day at a 500 m spatial resolution, whereas the Landsat Enhanced Thematic Mapper Plus (ETM+) has a 30 m spatial resolution but a revisit period of 16 days. This limitation restricts the application of remote sensing to problems that require images of both high spatial and high temporal resolution. Spatiotemporal reflectance fusion models [5] have therefore been developed to fuse image data from different sensors and obtain images of high spatiotemporal resolution.
The Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM) [6] is a pioneering fusion model based on a weighting method. It computes the center pixel at a prediction time as a weighted combination of neighboring pixels, with the weights determined by spectral difference, temporal difference and spatial distance. Building on STARFM, Zhu et al. [7] proposed the Enhanced Spatial and Temporal Adaptive Reflectance Fusion Model (ESTARFM) to predict the surface reflectance of heterogeneous landscapes. Another improvement of STARFM is the Spatial Temporal Adaptive Algorithm for mapping Reflectance Change (STAARCH) [8], which detects disturbance and reflectance changes using Tasseled Cap transformations. However, the performance of the weighting methods is constrained because the linear combination smooths out changing terrestrial content.
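As a concrete illustration of the weighting idea, the following Python sketch predicts a center pixel from its neighbors; the function names, the exact form of the combined weight and the prediction formula are simplified assumptions for illustration rather than the published STARFM algorithm.

```python
import numpy as np

def starfm_like_weights(spectral_diff, temporal_diff, distance):
    """Simplified STARFM-style weights from spectral, temporal and distance terms."""
    eps = 1e-6                                   # guard against division by zero
    combined = (spectral_diff + eps) * (temporal_diff + eps) * (1.0 + distance)
    w = 1.0 / combined                           # more similar / closer -> larger weight
    return w / w.sum()                           # normalize so weights sum to 1

def predict_center(landsat_base, modis_pred, modis_base, weights):
    """Weighted combination of neighboring pixels for the prediction date."""
    return np.sum(weights * (landsat_base + modis_pred - modis_base))
```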
Another family of reflectance fusion methods, known as dictionary learning methods, has been proposed to overcome this shortcoming of the weighting methods. Dictionary-based methods that use fixed, known dictionaries, such as wavelets and shearlets, have proven efficient in multisensor and multiresolution image fusion [9,10,11]. In remote sensing data analysis, Moigne et al. [12] and Czaja et al. [13] proposed image fusion methods based on wavelets and wavelet packets, respectively. The shearlet transform is used in the fusion algorithm of [14] because shearlets offer comparable optimality and geometrical properties. Exploiting the capability of dictionary learning and sparsity-based methods in super-resolution analysis, Huang et al. [15] proposed the Sparse-representation-based Spatiotemporal Reflectance Fusion Model (SPSTFM), which integrates sparse representation and reflectance fusion by establishing correspondences between structures in high resolution images and their corresponding low resolution images through a dictionary pair and sparse coding. SPSTFM assumes that high and low resolution images of the same area share the same sparse coefficients; such an assumption is, however, too restrictive [16]. To relax it, Wu et al. [17] proposed the Error-Bound-regularized Semi-Coupled Dictionary Learning (EBSCDL) model, which assumes that the representation coefficients of the image pair are related by a stable mapping and that the coefficients of the dictionary pair are subject to perturbations in the reconstruction step. Further attempts have been made to improve the performance of SCDL-based models. For example, Block Sparse Bayesian Learning for Semi-Coupled Dictionary Learning (BSBL-SCDL) [18] employs the structural sparsity of the sparse coefficients as a priori knowledge, and Compressed Sensing for Spatiotemporal Fusion (CSSF) [19] explicitly models the down-sampling process within a compressed sensing reconstruction framework. Compared with the weighting methods, the advantage of the dictionary-learning-based methods is that they retrieve the hidden relationship between image pairs in the sparse coding space and thus better capture structural changes.
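To show the shared-coefficient idea behind SPSTFM-style prediction, here is a minimal Python sketch: a low resolution patch is sparsely coded against the low resolution dictionary (via a plain ISTA loop), and the same coefficients are reused with the high resolution dictionary. The solver and parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ista_sparse_code(D, y, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||y - D a||_2^2 + lam*||a||_1 with ISTA (illustrative)."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - y) / L          # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return a

def predict_high_patch(D_low, D_high, y_low, lam=0.1):
    """Shared-coefficient prediction: code with D_low, reconstruct with D_high."""
    a = ista_sparse_code(D_low, y_low, lam)
    return D_high @ a
```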
Besides the aforementioned methods, other approaches have been employed to fuse multi-source data. Unmixing techniques have been suggested for spatiotemporal fusion because of their ability to reconstruct images with high spectral fidelity [20,21,22,23,24]. Considering the mixed-class spectra within a coarse pixel, Xu et al. [25] proposed the Class Regularized Spatial Unmixing (CRSU) model, which is based on the conventional spatial unmixing technique but is modified to include prior class spectra estimated from the known image pairs. To provide a formal statistical framework for fusion, Xue et al. [26] proposed Spatiotemporal Bayesian Data Fusion (STBDF), which uses the joint distribution to implicitly capture the temporal changes of images when estimating the high resolution image at a target point in time.
Because spectral bands share similar structures, structure information has also been employed in pan-sharpening and image fusion. Shi et al. [27], for example, proposed a learning interpolation method for pan-sharpening that expands the sketch information of the high-resolution panchromatic (PAN) image, which encodes its structure features. Glasner et al. [28] verified that many structures in a natural image recur at the same and at different scales. Inspired by this, Khateri et al. [29] proposed a self-learning approach that uses similar structures at different levels to pan-sharpen low resolution multi-spectral images. In multi-modality image fusion, Zhu et al. [30] proposed a method that decomposes images into cartoon and texture components and preserves the structure information of the two components using a spatial-domain method and sparse representation, respectively.
However, none of these spatiotemporal fusion methods consider the structure similarity between spectral bands in the fusion procedure. Although different bands have different reflectance ranges, their edge information is still similar [31]. Clearly, a reconstruction model can perform better if such information is used effectively to predict the unknown high resolution image. Otherwise, the dictionary pair obtained from the training image pair is inefficient for predicting the unknown images because information for the target time is lacking. This parallels the experience in machine learning that the ℓ1 norm is too restrictive when encoding unknown data in the prediction process, because it exploits only the sparsity structure of the dictionary [32,33]. The reconstruction model therefore needs a replacement for the ℓ1 norm to reduce the impact of insufficient information and to improve the representation ability of the dictionary pair.
We propose a new model in this paper to enhance spatiotemporal fusion performance. Our model uses the edge information in different bands via adaptive multi-band constraints to improve the reconstruction performance. To overcome the disadvantage of the ℓ1 norm, the nuclear norm is adopted as the regularization term to increase the efficiency of the learnt dictionary pair. The nuclear norm considers not only sparsity but also the correlation among dictionary atoms, producing coefficients that harmonize the sparse and collaborative representations adaptively [32,33].
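The full reconstruction model is given in Section 2; as a sketch of the key numerical building block, the proximal operator of the nuclear norm (singular value thresholding) is shown below in Python, since an ADMM-type solver for a nuclear-norm-regularized problem calls this operator at every iteration. The function name and the small example are illustrative assumptions.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: argmin_X 0.5*||X - M||_F^2 + tau*||X||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_thresh = np.maximum(s - tau, 0.0)        # soft-threshold the singular values
    return (U * s_thresh) @ Vt

# Small check: thresholding never increases the nuclear norm.
M = np.random.randn(8, 6)
X = svt(M, tau=0.5)
print(np.linalg.svd(X, compute_uv=False).sum() <=
      np.linalg.svd(M, compute_uv=False).sum())   # True
```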
Overall, the main contributions of this work can be summarized as follows.
The multi-band constraints are employed to reinforce the structure similarity of different bands in spatiotemporal fusion.
Considering that the degree of structure similarity differs between band pairs, adaptive regularization parameters are proposed to determine the importance of each multi-band constraint adaptively.
The nuclear norm is employed to replace the ℓ1 norm in the reconstruction model because it considers both the sparsity and the correlation of the dictionaries, thereby overcoming the disadvantage of the ℓ1 norm.
The remainder of this paper is organized as follows. Our method for spatiotemporal fusion, called the adaptive multi-band constraints fusion model (AMCFM), is proposed in Section 2. Section 3 discusses the experiments carried out to assess the effectiveness of the AMCFM and four state-of-the-art methods in terms of statistics and visual effects. We then conclude the paper with a summary and direction for future research in Section 4.
3. Experiments
The performance of our proposed method is compared with that of four state-of-the-art methods. ESTARFM [7] is a weighting method and CRSU [25] is an unmixing-based method. The other two, SPSTFM [15] and EBSCDL [17], are dictionary learning methods.
All programs are run on a Windows 10 system (Microsoft, Redmond, WA, USA) with an Intel Core i7-6700 3.40 GHz processor (Intel, Santa Clara, CA, USA). All the fusion algorithms are coded in Matlab 2015a (MathWorks, Natick, MA, USA) except ESTARFM, which is implemented in IDL 8.5 (Harris Geospatial Solutions, Broomfield, CO, USA).
3.1. Experimental Scheme
In this experiment, we use data acquired over the Boreal Ecosystem-Atmosphere Study (BOREAS) southern study area on 24 May, 11 July and 12 August 2001. The products from Landsat ETM+ and MODIS (MOD09GHK) are selected as the source data for fusion. The Landsat image of 11 July 2001 is set as the target image for prediction. All the data are registered for fine geographic calibration.
In the fusion process, we focus on three bands: NIR, red and green. The size of the test images is 300 × 300 pixels. Because the spatial resolutions of the two source images differ, we up-sample the MODIS images to the resolution of the Landsat images via bi-linear interpolation before the test.
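A minimal Python sketch of this preprocessing step is given below, assuming scipy's ndimage.zoom with order=1 as the bilinear interpolator; the band sizes are hypothetical placeholders.

```python
import numpy as np
from scipy.ndimage import zoom

def upsample_bilinear(modis_band, factor):
    """Up-sample a coarse MODIS band to the Landsat grid (order=1 is bilinear)."""
    return zoom(modis_band, factor, order=1)

# Hypothetical sizes: an 18 x 18 coarse band brought to the 300 x 300 test grid.
modis_band = np.random.rand(18, 18)
landsat_like = upsample_bilinear(modis_band, factor=300 / 18)
print(landsat_like.shape)   # (300, 300)
```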
3.2. Parameter Settings and Normalization
The parameters of AMCFM are set as follows: the dictionary size is 256, the patch size is 7 × 7, the overlap between patches is 4 pixels, and the number of training patches is 2000; the remaining regularization parameters defined in Section 2 are set to 0.15, 0, 0, 0 and 0.1, respectively. All the comparative methods keep their original parameter settings.
Normalization can reduce the computation time and has an effect on the fusion results. As a preprocessing step, the high and low resolution images are normalized as follows:

X̃ = (X − μ_X) / σ_X,

where μ_X is the mean value of image X and σ_X is the standard deviation of image X.
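A minimal sketch of this normalization (and of mapping a fused result back to the original scale, which is the usual post-processing step) could look as follows in Python; the helper names are assumptions.

```python
import numpy as np

def normalize(img):
    """Zero-mean, unit-variance normalization of one image band."""
    mu, sigma = img.mean(), img.std()
    return (img - mu) / sigma, mu, sigma

def denormalize(img_norm, mu, sigma):
    """Map a normalized (e.g., fused) band back to the original reflectance scale."""
    return img_norm * sigma + mu
```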
3.3. Quality Measurement of the Fusion Results
Several metrics have been used to evaluate the fusion results by different methods. These metrics can be classified into two types, namely the band quality metrics and the global quality metrics.
We employ three assessment metrics, namely the root mean square error (RMSE), average absolute difference (AAD) and correlation coefficient (CC) to assess the performance of the algorithms in each band. The ideal result is 0 for RMSE and AAD, while it is 1 for CC.
Three other metrics are adopted to evaluate the global performance: relative average spectral error (RASE) [40], Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [41] and Q4 [42]. The mean RMSE (mRMSE) over the three bands is also used as a global index. The ideal result is 0 for mRMSE, RASE and ERGAS, while it is 1 for Q4. It should be noted that Q4 is defined for four spectral bands; since we use three bands here, the real part of the quaternion is set to 0.
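For reference, a Python sketch of the band metrics and of a common form of ERGAS is given below; the exact definitions used in [40,41,42] may differ slightly, so treat this as an illustrative implementation rather than the evaluation code used here.

```python
import numpy as np

def band_metrics(pred, ref):
    """Per-band RMSE, AAD and CC between predicted and reference bands."""
    diff = pred - ref
    rmse = np.sqrt(np.mean(diff ** 2))
    aad = np.mean(np.abs(diff))
    cc = np.corrcoef(pred.ravel(), ref.ravel())[0, 1]
    return rmse, aad, cc

def ergas(pred_bands, ref_bands, res_ratio):
    """ERGAS for stacked bands (band, row, col); res_ratio = fine/coarse pixel-size ratio."""
    terms = [(band_metrics(p, r)[0] / r.mean()) ** 2
             for p, r in zip(pred_bands, ref_bands)]
    return 100.0 * res_ratio * np.sqrt(np.mean(terms))
```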
3.4. Results
Table 3, Table 4 and Table 5 report the band-wise quantitative results of these methods. All of the methods can reconstruct the target high resolution image. ESTARFM performs well in the red band. CRSU performs well in the red and green bands of image 2 but gives undesirable results in most other cases. SPSTFM and EBSCDL produce similar results, with EBSCDL of slightly higher quality for these three images. AMCFM and AMCFM-s produce the best results for the NIR band. Moreover, AMCFM achieves the best or second best result in almost all metrics, demonstrating its stable and effective performance.
The global metrics of the different methods are shown in Table 6, Table 7 and Table 8. AMCFM has the best global performance on all three images, except for Q4 on image 2 and ERGAS on image 3. For image 1, our proposed model performs noticeably well in all four metrics. The strong performance of AMCFM is attributed to its improved performance in the NIR band.
Figure 2 and Figure 3 compare the target (true) Landsat images with the images predicted by ESTARFM, CRSU, SPSTFM, EBSCDL, AMCFM and AMCFM-s. We use the NIR-red-green bands as the red-green-blue composite to display the images, with a 2% linear enhancement applied in ENVI 5.3 (Harris Geospatial Solutions, Broomfield, CO, USA).
All of these fusion algorithms are able to reconstruct the main structure and details of the target image. The colors produced by the dictionary learning methods appear visually closer to the true Landsat image than those of the weighting method and the unmixing-based method. The details captured by AMCFM are more prominent than those captured by SPSTFM and EBSCDL, as can be observed in the two-times enlarged red box in the images. Overall, our proposed method gives the best visual performance.
Figure 4, Figure 5 and Figure 6 display the 2D scatter plots of the NIR, red and green bands of image 1. ESTARFM performs slightly better than the other methods in the red band, which is consistent with the statistics in Table 3. In the NIR and green bands, however, the dictionary learning methods clearly outperform the weighting method and the unmixing-based method, whose scatter plots are more dispersed. The scatter plots of our proposed methods, AMCFM and AMCFM-s, are closer to the 1:1 line than those of the other methods, indicating that using the edge information indeed improves fusion performance, especially in the NIR band. In general, Figure 4, Figure 5 and Figure 6 show that our proposed methods reconstruct the images closest to the true Landsat image.
5. Conclusions and Future Work
In this paper, we have proposed a novel dictionary learning fusion model, called AMCFM. The model accounts for the structure similarity between bands via adaptive multi-band constraints, which essentially enforce the similarity of edge information across bands in high resolution patches to improve fusion performance. Moreover, unlike existing dictionary learning models that emphasize only sparsity, we use the nuclear norm as the regularization term to represent both sparsity and correlation. Our model can therefore reduce the impact of an inefficient dictionary pair and improve the representation ability of the dictionary pair. Compared with four state-of-the-art fusion methods in terms of both quantitative metrics and visual effects, the experimental results show that our proposed model improves image fusion. Although our model is slower than the other two dictionary learning methods in this empirical analysis because of the complexity of the optimization algorithm, its fusion results are indeed better. One may wonder whether a slight improvement justifies the increase in computational time. Our argument is that, on a theoretical basis, our model is more reasonable and appealing than SPSTFM and EBSCDL because it capitalizes on the structure information and the correlation of dictionaries for image fusion. These advantages will become more evident as structure similarity increases.
However, there remains room for improvement. Firstly, the squared-error (ℓ2 norm) loss term assumes that the noise is i.i.d. Gaussian. Other noise hypotheses, such as an i.i.d. Gaussian mixture or a non-i.i.d. noise structure, could be considered to improve the fusion results. Secondly, the computational cost of the proposed method is high because of the complexity of the ADMM algorithm. To reduce the computation time, an alternative approach could be designed to solve the reconstruction model more efficiently for practical applications. Finally, to analyze hyperspectral data efficiently, dimension reduction methods might need to be incorporated into the fusion process.