Exploiting Frequency Correlation for Hyperspectral Image Reconstruction
Abstract
Deep priors have emerged as potent methods in hyperspectral image (HSI) reconstruction. While most methods emphasize space-domain learning using image space priors like non-local similarity, frequency-domain learning using image frequency priors remains neglected, limiting the reconstruction capability of networks. In this paper, we first propose a Hyperspectral Frequency Correlation (HFC) prior rooted in in-depth statistical frequency analyses of existent HSI datasets. Leveraging the HFC prior, we subsequently establish the frequency domain learning composed of a Spectral-wise self-Attention of Frequency (SAF) and a Spectral-spatial Interaction of Frequency (SIF) targeting low-frequency and high-frequency components, respectively. The outputs of SAF and SIF are adaptively merged by a learnable gating filter, thus achieving a thorough exploitation of image frequency priors. Integrating the frequency domain learning and the existing space domain learning, we finally develop the Correlation-driven Mixing Domains Transformer (CMDT) for HSI reconstruction. Extensive experiments highlight that our method surpasses various state-of-the-art (SOTA) methods in reconstruction quality and computational efficiency.
Keywords: Hyperspectral image reconstruction · Hyperspectral frequency correlation prior · Frequency domain processing
1 Introduction
Hyperspectral images (HSIs) encompass detailed spectral bands and are thus widely used in image recognition [25, 41], tracking [27, 57], and image classification [5, 8, 70]. To capture HSIs, conventional spectrometers [33, 67] generally scan the scene along either the spatial dimension or the spectral dimension, requiring multiple exposures. Thus, these systems are unsuitable for measuring dynamic scenes. Built on compressive sensing theory [22], Coded Aperture Snapshot Spectral Imaging (CASSI) [4, 59, 63, 42] stands out as a promising solution. However, the bottleneck of CASSI lies in the limited capacity to reconstruct the underlying 3D HSI from the 2D measurement.
For the inverse reconstruction problem, HSI priors are essential to restrict the solution space. Traditional priors like sparsity [59] and low-rank [75] are hand-crafted and need manual parameter tweaking, thus lacking the flexibility to handle different scenes. With the evolution of deep learning [69, 64], deep priors have emerged as data-driven and flexible alternatives, leading to improved reconstruction quality [77, 26, 32].
Nonetheless, previous deep priors exploit only space-domain information. Existing space-domain models suffer from non-local issues caused by the limited receptive field of CNNs and the inflexible fixed shuffle patterns of transformers, which obstruct the learning of non-local similarity and frequency intricacies. Furthermore, according to research on the inherent nature of neural networks [71, 60, 49], the learning process of neural networks pays close attention to image frequency intricacies, which represent useful periodic patterns beyond space pixels, underscoring the promise of exploiting image frequency priors for HSI reconstruction. To harness the frequency information, HDNet [31] first uses the focal frequency loss for network guidance. However, given that the frequency loss predominantly originates from low-frequency components, the weighting scheme of the focal frequency loss inadvertently over-emphasizes low frequencies while under-emphasizing high frequencies, resulting in suboptimal results with insufficient texture details.
In this paper, we first conduct a series of frequency analyses of HSIs and obtain a pivotal observation: a pronounced spectral-spatial correlation between frequency components, i.e., tokens, is widely present in diverse HSIs and exhibits a descending trend from low-frequency to high-frequency tokens. Moreover, there exists a spectral/spatial correlation within one frequency token, which stems from the neighborhood similarity. Thus, we develop the Hyperspectral Frequency Correlation (HFC) prior by consolidating the two-fold frequency correlations of HSIs into one overarching theme.
Leveraging the HFC prior, we subsequently establish the frequency domain learning to thoroughly exploit frequency correlation for HSI reconstruction. Concretely, the frequency domain learning is composed of two modules: the Spectral-wise self-Attention of Frequency (SAF) and the Spectral-spatial Interaction of Frequency (SIF), which target the low-frequency tokens and the high-frequency tokens, respectively. For the low-frequency tokens, given the dominant spectral-spatial correlation between frequency tokens, the SAF is formulated based on the self-attention mechanism to model the long-term dependencies between various spectral-spatial frequency tokens. The high-frequency tokens exhibit subdued spectral-spatial correlation, as illustrated in the HFC prior, so it is inefficient to model the correlation between frequency tokens but of great significance to exploit the correlation within one frequency token. Thus, the SIF is established with a spatial token interaction followed by a spectral token evolution, achieving mixed-dimension exploration within one frequency token. As the dominant correlation varies across frequency components, a learnable gating filter is crafted to adaptively determine the weights of SAF and SIF and balance the two blocks, ensuring thorough frequency exploitation.
By integrating the frequency domain learning and space domain learning, we finally formulate the Correlation-driven Mixing Domains Transformer (CMDT) as the heart of deep priors and subsequently obtain a deep frequency unfolding framework for HSI reconstruction. Extensive experiments on simulation data and real data validate the superiority of our method over various SOTA methods on both image quality and computational efficiency as shown in Fig. 1.
Our contributions are summarized as follows:
- We establish the useful HFC prior rooted in the statistical frequency analyses of the HSI datasets.
- We formulate the frequency domain learning comprising SAF and SIF to comprehensively exploit frequency correlation.
- We establish the CMDT comprising the frequency domain learning and space domain learning to efficiently explore the correlation of HSIs in dual domains.
2 Related Work
2.1 Image Priors for HSI
Image priors are essential for high-quality HSI reconstruction. Traditional priors such as total variation [24], sparsity [59], and low-rank [75] are hand-crafted. For instance, GAP-TV [74] minimizes total variation to guarantee first-order smoothness. DeSCI [38] exploits repetitive patches under low-rank assumptions. However, these handcrafted priors require manual parameter tweaking and thus generalize poorly. Inspired by the success of deep learning, deep priors have arisen to learn data-driven priors from large datasets. Wang et al. [61] propose a deep spectral-spatial network to explore the image spectral-spatial correlation in the space domain. DNU [62] introduces a non-local network to exploit the non-local similarity in the space domain. However, these priors focus solely on image space priors and leave image frequency priors unexplored, resulting in suboptimal reconstruction.
2.2 Self-Attention Mechanism
The self-attention mechanism [58, 14, 76] has emerged as a powerful tool for capturing long-range interactions. Recently, self-attention mechanisms have showcased significant potential in image classification [5, 8], semantic segmentation [79, 13], image restoration [17, 37, 76, 68], etc. For HSI reconstruction, λ-Net [46] first explores self-similarity via a self-attention mechanism. MST [11] next calculates a spectral self-attention map to discern image spectral correlation. DAUHST [12] subsequently explores a shuffle strategy to calculate the image non-local correlation. However, these methods use self-attention only to extract space-domain image correlation and struggle to balance low computational cost against comprehensive modeling of space-domain long-range dependencies, thus limiting the potential of the self-attention mechanism for HSI reconstruction.
2.3 Spectral-spatial interaction
Given the limited capacity of self-attention mechanisms, some works [47, 35, 40] use the spectral-spatial interaction in the space domain to model the long-term dependency, which is mainly based on the common inductive biases like spatial non-local similarity and spectral dependency of space pixels. Concretely, SSIN [47] designs a spectral-spatial attention module for interaction. ESSINet [40] proposes an involution-2D operator to fuse the spectral and spatial features. In the frequency domain, [81] attempts to migrate a similar architecture to the frequencies. However, the under-explored inductive biases of frequency make the frequency learning process lack reasonable interpretability, thus suppressing the future potential of frequency-domain learning.
2.4 Learning in the Frequency Domain
Frequency analysis [50, 54] emphasizes signal frequency attributes. Numerous studies have probed the frequency traits of neural networks to explore network generalization. F-principle [71] elucidates that network learning inherently obeys an order from low to high frequencies, which is concordant with the human visual system. [60] emphasizes that generalization mainly originates from frequency learning, illustrating the importance of learning the frequency information of images. These insights have galvanized endeavors in frequency exploration. For instance, [34] proposes a focal frequency loss for network guidance. DASR [66] and SFNet [20] deploy filters to grasp frequency features from spatial images. [16] proposes a mathematical method of frequencies for HSI restoration. However, prevalent models [20, 31, 66, 72] either fail to explain how frequencies are extracted from space-domain images or lack adaptability in learning the frequencies of HSIs, underscoring the ongoing need for an interpretable and efficient frequency-domain network.
3 The Proposed Method
In this section, we first introduce the Hyperspectral Frequency Correlation (HFC) prior. Based on the HFC, we next formulate the novel frequency domain learning block for thorough frequency exploitation. Integrating the space domain learning, we subsequently develop the Correlation-driven Mixing Domains Transformer (CMDT) for comprehensive image prior exploration. By plugging CMDT into an unfolding architecture, we finally form a deep frequency unfolding framework for HSI reconstruction.
3.1 Hyperspectral Frequency Correlation Prior
Among various techniques for transforming image signals from the space domain to the frequency domain, Discrete Cosine Transform (DCT) generates a frequency coefficient matrix comprising entirely real values, making it convenient for frequency analysis. Therefore, we conduct a series of analyses on the spectrogram after the 2D-DCT.
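As an illustration of why the DCT is convenient here, the following minimal sketch computes an orthonormal 2D DCT-II of one toy spectral band using only the Python standard library. The function names `dct_1d` and `dct_2d`, the orthonormal scaling, and the 8x8 ramp image are assumptions made for this example; the paper does not specify which DCT normalization it uses.

```python
import math

def dct_1d(signal):
    """Orthonormal DCT-II of a 1D sequence (assumed normalization)."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(signal[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i in range(n))
        out.append(scale * total)
    return out

def dct_2d(image):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to (height, width)

# Toy band: a smooth 8x8 ramp standing in for one spectral band of an HSI.
band = [[0.1 * (r + c) for c in range(8)] for r in range(8)]
spectrum = dct_2d(band)

# All coefficients are real, and a smooth image concentrates its energy
# in the low-frequency (top-left) corner of the coefficient matrix.
total_energy = sum(v * v for row in spectrum for v in row)
low_energy = sum(spectrum[u][v] ** 2 for u in range(2) for v in range(2))
low_fraction = low_energy / total_energy
print(f"share of energy in the 2x2 low-frequency corner: {low_fraction:.4f}")
```

Because the coefficient matrix is real and ordered by horizontal and vertical frequency, neighboring entries correspond to similar frequencies, which is the arrangement the following analysis relies on.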
Spectral/spatial Correlation within One Frequency Token. Due to the unique arrangement of frequency in the spectrogram, the frequencies within a spatial neighborhood are of similar horizontal and vertical wavelengths, corresponding to the close positions in the DCT coefficient matrix. Thus, for the 2D-spectrogram of an HSI in one spectral band, a correlation emerges within one frequency token in a spatial neighborhood. Moreover, for a specific frequency component, a spectral correlation also exists within the frequency token in a spectral group. We term two correlations in both spatial neighborhood and spectral group spectral/spatial correlation within one frequency token.
Spectral-spatial Correlation between Frequency Tokens. The spectral-spatial correlation between frequency tokens is fundamentally inspired by the evident visual similarities across diverse spectra in the spectrograms of HSIs. To verify this correlation, we collect 24 HSI datasets [73, 2, 48, 3, 18, 19, 15, 7, 36, 28, 56, 6, 80, 1, 65, 53, 30, 29, 55], which contain 1029 HSIs with diverse spatial sizes and numbers of spectral bands, to conduct statistical analyses on the correlations between spectral-spatial token pairs. We choose the Pearson product-moment correlation coefficient [52] as the metric, thereby sidestepping the influence of the numerical data scale.
First, as shown in Fig. 3(a), by computing the correlation for the HSI data in the two domains, we derive two square correlation maps whose side length equals the number of spectral bands of the HSI. Each element of a map represents the degree of correlation between a pair of spatially vectorized spectral bands, each of length equal to the product of the height and width of the HSI. We further average all elements to obtain the average correlation. Subsequently, we present comprehensive statistics of the spectral correlation in the two domains over the 1029 HSIs, including histograms and probability distributions. It is observed that the frequency-domain spectral correlation is concentrated around 0.94, whereas the space-domain counterpart is more dispersed, with a significant probability mass around 0.4, highlighting the high stability and concentrated probability distribution of the spectral-spatial correlation in the frequency domain. Additionally, to examine the spectral correlation within specific frequency regions, we partition the HSI spectrogram into fixed-size spectral-spatial tokens and calculate the frequency spectral correlation between tokens across spectral bands. It should be emphasized that each frequency token embodies similar frequencies. Fig. 3(b) shows the descending trend of correlation from low-frequency to high-frequency tokens.
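The band-to-band correlation map described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the helper names `pearson`, `correlation_map`, and `average_correlation` are hypothetical, the toy data are made up, and averaging over all entries including the diagonal is an assumption. The same functions apply to either domain, depending on whether the flattened vectors hold pixels or DCT coefficients.

```python
import math

def pearson(a, b):
    """Pearson product-moment correlation of two equal-length vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_map(bands):
    """bands: list of flattened vectors, one per spectral band."""
    return [[pearson(bi, bj) for bj in bands] for bi in bands]

def average_correlation(cmap):
    """Mean over all entries of the map (diagonal included, by assumption)."""
    n = len(cmap)
    return sum(sum(row) for row in cmap) / (n * n)

# Toy data: band 2 is band 0 scaled by 10, so their correlation is 1
# regardless of numerical scale, which is why Pearson's r is a convenient metric.
bands = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 4.0, 3.0],
    [10.0, 20.0, 30.0, 40.0],
]
cmap = correlation_map(bands)
avg = average_correlation(cmap)
print(f"corr(band0, band1) = {cmap[0][1]:.2f}, average = {avg:.3f}")
```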
Distilling our analyses, the spectral-spatial correlation between frequency tokens can be concisely summarized as:
(i) The frequency spectral-spatial token pairs unveil a high spectral correlation marked by significant stability, which remains relatively immune to distant spectra when juxtaposed with the space counterpart.
(ii) As the frequency ascends, there is a descending trend of the correlation between frequency tokens.
The above two-fold correlations constitute the HFC prior. In the next section, we will leverage the HFC prior to establish the efficient frequency domain learning for comprehensive frequency exploitation.
3.2 Frequency Domain Learning
Inspired by the HFC prior, we formulate a novel frequency domain learning to exploit the image frequency information, which is composed of a correlation exploration block and a frequency-level gating block. The former consists of a Spectral-wise self-Attention of Frequency (SAF) and a Spectral-spatial Interaction of Frequency (SIF) to achieve thorough frequency exploitation. The latter provides dynamic gating capability to achieve the optimal balance between SAF and SIF.
Spectral-wise self-Attention of Frequency. The input of frequency domain learning is initially transformed into a frequency cube via 2D-DCTs. For the low-frequency tokens, given the pronounced spectral-spatial correlation between frequency tokens, we incorporate the HFC prior into the self-attention mechanism, designing the SAF to model the correlation between distant spectral-spatial frequency token pairs. As shown in Fig. 4, in SAF, the input spectrogram is first split into non-overlapping cubic patches. Each cube is mapped into a FreqQuery, a FreqKey, and a FreqValue via three learnable parameters. The output of SAF is then calculated as:
(1) |
(2) |
Here, learnable parameters embed the frequency position information, and a reshape operation transforms the resulting matrix into the output cube of SAF. In the implementation, the multi-head strategy is adopted to map features and exploit exhaustive information.
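For orientation, a generic scaled dot-product self-attention with an additive learnable position term can be written as below. This is a sketch under assumptions and not necessarily the exact form of Eqs. (1)-(2): the symbols $X_i$ (one flattened frequency cube), $W^{Q}, W^{K}, W^{V}$ (the three learnable projections), $P$ (the learnable position term), $d$ (the per-head feature dimension), and $Y_i$ (the output cube) are all introduced here for illustration.

```latex
\begin{aligned}
Q_i &= X_i W^{Q}, \qquad K_i = X_i W^{K}, \qquad V_i = X_i W^{V}, \\
A_i &= \operatorname{softmax}\!\left( \frac{Q_i K_i^{\top}}{\sqrt{d}} + P \right), \\
Y_i &= \operatorname{reshape}\left( A_i V_i \right).
\end{aligned}
```

A spectral-wise variant would instead form the attention map across the channel dimension, for example from $K_i^{\top} Q_i$, so that its size depends on the number of channels rather than on the number of frequency positions in a cube.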
Spectral-spatial Interaction of Frequency. For the high-frequency tokens, self-attention is inefficient because it relies on the correlation between tokens, which is weak among the less related high frequencies. Therefore, the spectral-spatial interaction of frequency (SIF) is designed to accurately model high frequencies with spectral/spatial correlation and frequency inductive biases. More concretely, the inductive bias of the spatial token interaction is the same phase distance between neighboring frequencies. The inductive bias of the spectral token evolution is the same structural representation of one frequency across different spectral bands. As shown in Fig. 4, the former employs a depth-wise layer whose number matches that of the input channels, thus building a bridge for frequency tokens in the spatial neighborhood to interact with each other. The latter leverages two layers, thus facilitating a single frequency to evolve across the spectral groups. The outputs of the two blocks effectively represent the frequency-level information in HSIs across both spatial and spectral dimensions, and GELU serves as the activation function. The complete SIF process is represented as follows:
(3) |
(4) |
(5) |
Learnable Gating Filter. Considering that the dominant correlation varies from low-frequency to high-frequency tokens, a learnable gating filter is introduced to adaptively determine the proportion of SAF and SIF, thereby ensuring that the network attends to all frequencies. The output is obtained as:
(6) |
Here, the gating filter is repeated along the channel dimension. Subsequently, we transform the merged frequency cube into a space-domain image through C inverse 2D-DCTs, which serves as the output of the frequency domain learning.
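One plausible reading of the gating step is a per-frequency convex blend of the two branch outputs, with a single 2D gate shared by all channels. The sketch below illustrates that reading only. The convex form `g * SAF + (1 - g) * SIF`, the sigmoid that keeps the gate in (0, 1), and the names `gated_merge` and `gate_logits` are assumptions for this example and are not taken from Eq. (6).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_merge(saf_out, sif_out, gate_logits):
    """Blend two (channels, height, width) cubes with one 2D gate.

    The same gate value is reused for every channel, which mimics
    repeating the filter along the channel dimension.
    """
    merged = []
    for saf_ch, sif_ch in zip(saf_out, sif_out):
        channel = []
        for saf_row, sif_row, gate_row in zip(saf_ch, sif_ch, gate_logits):
            channel.append([
                sigmoid(g) * a + (1.0 - sigmoid(g)) * b
                for a, b, g in zip(saf_row, sif_row, gate_row)
            ])
        merged.append(channel)
    return merged

# Toy cubes: SAF outputs all ones, SIF outputs all zeros, so the merged
# value at each frequency position directly shows the gate weight.
saf_out = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(2)]
sif_out = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]
gate_logits = [[20.0, 0.0], [0.0, -20.0]]  # high gate at the low-frequency corner
merged = gated_merge(saf_out, sif_out, gate_logits)
print(merged[0])
```

With the gate high at the top-left (low-frequency) position and low at the bottom-right (high-frequency) position, the merged output is dominated by SAF for low frequencies and by SIF for high frequencies.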
3.3 Correlation-driven Mixing Domains Transformer
To exploit the pixel-level information as a complement to frequency domain learning, the space domain learning, consisting of space-domain spatial-wise multi-head self-attention, is introduced to capture image local correlation. As depicted in Fig. 5, the input image is also split into non-overlapping spatial-spectral tokens, on which the self-attention is computed to produce the output of space domain learning. Mixing the proposed frequency domain learning and the space domain learning, the CMDT is built on the traditional transformer architecture, as shown in Fig. 5. Concretely, the CMDT comprises layer normalization, linear projection, mixing domains learning (the frequency domain learning plus the space domain learning), and a feed-forward network, thus ensuring the comprehensive exploitation of HSI correlation in dual domains.
3.4 Deep Frequency Unfolding Framework
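For convenience, given the sensing matrix $\mathbf{\Phi}$, the vectorized input HSI $\mathbf{x}$, and the imaging noise $\mathbf{n}$, the measurement $\mathbf{y}$ of the CASSI imaging model can be vectorized as:
$\mathbf{y} = \mathbf{\Phi}\mathbf{x} + \mathbf{n}$ (7)
As [78, 45] illustrate, the problem of reconstructing $\mathbf{x}$ from $\mathbf{y}$ can be split into two iterative processes as: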
(8) |
(9) |
where variable serves as an auxiliary variable, refers to the deep prior network, and are the iterative parameters. Detailed illustration of CASSI imaging model and the derivation of the unfolding framework can be found in [12, 61, 51]. Please note that the middle and final outputs of our method are in formulation of 3D cubes, i.e., and , which can be obtained from the vectorized and by reshape operation.
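For reference, a standard half-quadratic splitting of the reconstruction problem takes the form below. It is a sketch of the usual derivation rather than a verbatim restatement of Eqs. (8)-(9): the symbols $\mathbf{y}$ (measurement), $\mathbf{\Phi}$ (sensing matrix), $\mathbf{x}$ (vectorized HSI), $\mathbf{z}$ (auxiliary variable), $\mathcal{P}$ (deep prior network), and the iteration parameters $\mu$ and $\beta$ are assumed names.

```latex
\begin{aligned}
\mathbf{x}^{k+1}
  &= \arg\min_{\mathbf{x}} \; \tfrac{1}{2}\,\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2
     + \tfrac{\mu}{2}\,\|\mathbf{x} - \mathbf{z}^{k}\|_2^2
   = \left(\mathbf{\Phi}^{\top}\mathbf{\Phi} + \mu \mathbf{I}\right)^{-1}
     \left(\mathbf{\Phi}^{\top}\mathbf{y} + \mu\,\mathbf{z}^{k}\right), \\
\mathbf{z}^{k+1}
  &= \mathcal{P}\!\left(\mathbf{x}^{k+1}, \beta\right).
\end{aligned}
```

The first line is a least-squares problem with a closed-form solution, obtained by setting the gradient $-\mathbf{\Phi}^{\top}(\mathbf{y} - \mathbf{\Phi}\mathbf{x}) + \mu(\mathbf{x} - \mathbf{z}^{k})$ to zero. The second line replaces the proximal step of the prior with a learned network.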
By combining our CMDT with unfolding strategies, we architect the deep frequency unfolding framework with stages. As Fig. 5 shows, each stage contains a Data Module (DM) and a Prior Module (PM), corresponding to the calculation of the closed-form solution (Eq. 8) and the deep prior exploitation (Eq. 9). An Iteration Parameter Estimator (IPE) [12] is employed to explore the degradation patterns and ill-posedness degree caused by the mask-modulation and dispersion-integration, thus providing iteration parameters and .
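The mask-modulation and dispersion-integration mentioned above can be sketched with a toy forward model: each band is multiplied by the coded aperture, shifted along the width by an amount proportional to its band index, and all bands are summed on the detector. The function name `cassi_forward`, the `step` parameter, and the noise-free setting are assumptions for this illustration and are not taken from the paper.

```python
def cassi_forward(cube, mask, step=2):
    """Toy noise-free CASSI measurement.

    cube[l][r][c]: HSI with L bands, H rows, W columns.
    mask[r][c]:    coded aperture shared by all bands.
    step:          assumed dispersion shift in pixels between adjacent bands.
    Returns a 2D measurement of size H x (W + step * (L - 1)).
    """
    bands = len(cube)
    height = len(mask)
    width = len(mask[0])
    meas = [[0.0] * (width + step * (bands - 1)) for _ in range(height)]
    for l in range(bands):
        shift = step * l                      # dispersion: band-dependent shift
        for r in range(height):
            for c in range(width):
                # modulation by the mask, then integration on the detector
                meas[r][c + shift] += mask[r][c] * cube[l][r][c]
    return meas

# Two bands of a 1x2 scene; the mask blocks the second column.
cube = [[[1.0, 2.0]], [[3.0, 4.0]]]
mask = [[1.0, 0.0]]
measurement = cassi_forward(cube, mask, step=1)
print(measurement)
```

The overlap of shifted bands in the measurement is what makes the inverse problem ill-posed, which motivates estimating iteration parameters from the degradation pattern.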
Loss Function. Given the ground-truth HSI and the predicted HSI, the loss function is formulated as follows:
(10) |
Note that both the ground-truth HSI and the predicted HSI are in the space domain; that is, the loss function is calculated only in the space domain, because the input and output of the overall pipeline are space-domain HSIs.
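Since the implementation details state that the training objective is the Root Mean Square Error between reconstructed and ground-truth HSIs, the loss can be sketched on flattened cubes as follows. The function name `rmse_loss` and the plain-list inputs are assumptions for this illustration; an actual training pipeline would operate on tensors.

```python
import math

def rmse_loss(pred, target):
    """Root mean square error between two flattened space-domain HSIs."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of elements")
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared) / len(squared))

# Toy example: only the last of three voxels differs, by 2.
example_loss = rmse_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(f"RMSE = {example_loss:.4f}")
```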
4 Experiments
Algorithms | Params | FLOPs | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 | Avg |
TwIST [9] | | | | | | | | | | | | | |
GAP-TV [74] | | | | | | | | | | | | | |
GAP-Net [43] | | | | | | | | | | | | | |
DGSMP [32] | | | | | | | | | | | | | |
λ-Net [46] | | | | | | | | | | | | | |
HDNet [31] | | | | | | | | | | | | | |
MST-L [11] | | | | | | | | | | | | | |
CST-L [10] | | | | | | | | | | | | | |
DAUHST-9stg [12] | | | | | | | | | | | | | |
Ours-2stg | | | | | | | | | | | | | |
Ours-3stg | | | | | | | | | | | | | |
Ours-5stg | | | | | | | | | | | | | |
Ours-9stg | | | | | | | | | | | | | |
4.1 Experimental Setup
Datasets. We adopt CAVE [48] and KAIST [18] with 28 wavelengths from 450nm to 650nm for experiments. CAVE consists of 32 HSIs with spatial size . KAIST contains 30 HSIs of spatial size . We use the augmented CAVE for training, 10 scenes from KAIST for simulation testing, and 5 scenes from [44] for real testing.
Implementation Details. We implement our method by PyTorch. All methods are trained with Adam optimizer () using the Cosine Annealing scheme for 300 epochs on an RTX 3090 GPU. The initial learning rate is , batch size is 5. Some data augmentations including random rotation and flipping are used to enrich the diversity of training data. The kernel size of SAF is set to 8. The parameters of PM are shared between different stages. The training objective is to minimize the Root Mean Square Error (RMSE) between reconstructed and ground-truth (GT) HSIs. The source codes and pre-trained models will be released to be publicly available.
Compared Methods. We compare our method with two traditional methods (TwIST [9] and GAP-TV [74]), five end-to-end methods (λ-Net [46], GAP-Net [74], HDNet [31], MST [11], and CST [10]), and four deep unfolding methods (DGSMP [32], GAP-Net [43], DAUHST [12], and RDLUF [21]). Because RDLUF changes the sensing matrix, which should be fixed, a direct comparison with RDLUF is unfair. Thus, we conduct a further combination study for a fair comparison with RDLUF in the learnable sensing matrix part.
Evaluation Metrics. The reconstruction quality is evaluated by perceptual quality and numerical performance including PSNR, SSIM, and Frequency Domain Gap (FDG). The best numerical value in each metric is in bold form. The calculation of FDG is in the supplementary material.
Sigma | Metric | HDNet | DAUHST | Ours |
 | PSNR | 21.81 | 35.15 | 36.34 |
 | SSIM | 0.558 | 0.390 | 0.951 |
 | Preserve Degree | 63.51% | 96.72% | 97.42% |
 | PSNR | 15.13 | 27.06 | 31.40 |
 | SSIM | 0.208 | 0.726 | 0.865 |
 | Preserve Degree | 44.05% | 74.46% | 84.18% |
 | PSNR | 12.20 | 24.22 | 27.30 |
 | SSIM | 0.093 | 0.632 | 0.731 |
 | Preserve Degree | 35.52% | 66.64% | 73.19% |
4.2 Noise Injection Experiment
To assess noise robustness, we add zero-mean Gaussian noise with standard deviations ranging from 0.01 to 0.1 to the measurements and test the pre-trained models. For fairness in terms of computational cost and model characteristics, DAUHST and HDNet are chosen for comparison. Tab. 2 shows that our method attains the highest preserve degree at every noise level, exhibiting the best robustness to noise.
4.3 Simulation Results
Numerical Results. The results on 10 simulated scenes are presented in Tab. 1. We also provide a line graph in Fig. 1 for clear comparison, which shows that our model achieves the best balance between computational cost and performance. Compared to DAUHST, Ours-9stg achieves outstanding performance, i.e., 39.47 dB in PSNR, while Ours-5stg outperforms DAUHST-9stg by 0.6 dB in PSNR and requires only 65% of the FLOPs and 15% of the Params, verifying the efficiency of our method.
Baseline-1 | SAF | SIF | PSNR | SSIM | Params (M) | FLOPs (G) |
✓ | | | 36.02 | 0.952 | 0.55 | 16.30 |
✓ | ✓ | | 37.10 | 0.960 | 0.64 | 20.06 |
✓ | ✓ | ✓ | 37.44 | 0.963 | 0.90 | 21.39 |
Model | Baseline-3 | kernel-2 | kernel-4 | kernel-8 | kernel-16 |
PSNR | 34.00 | 34.25 | 34.57 | 34.61 | 34.28 |
SSIM | 0.930 | 0.934 | 0.939 | 0.940 | 0.936 |
Method | Baseline-2 | G-MSA | SW-MSA | S-MSA | HS-MSA | SAF |
PSNR | 32.79 | 33.63 | 33.75 | 33.82 | 34.05 | 34.61 |
SSIM | 0.904 | 0.920 | 0.924 | 0.926 | 0.930 | 0.940 |
Params (M) | 0.40 | 0.48 | 0.48 | 0.48 | 0.48 | 0.48 |
FLOPS (G) | 6.85 | 10.30 | 9.41 | 8.89 | 9.72 | 7.87 |
Complexity | - |
stages | Params (M) | FLOPs (G) | PSNR | SSIM |
2 | 1.75 | 21.39 | 37.64 | 0.966 |
2-shared | 0.90 | 21.39 | 37.44 | 0.963 |
3 | 2.60 | 31.56 | 38.05 | 0.967 |
3-shared | 0.90 | 31.56 | 38.16 | 0.968 |
5 | 4.31 | 51.90 | 38.42 | 0.971 |
5-shared | 0.90 | 51.90 | 38.96 | 0.974 |
Frequency Domain Gap. To show the efficacy of our method in learning image frequency information, we visualize the spectrogram residual maps of our method, HDNet, CST, and DAUHST to GT. As shown in Fig. 8, our method exhibits the smoothest surface, especially in the zoomed area, and has the lowest FDG, thus demonstrating the superior recovery of the HSI frequencies.
Perceptual Quality. For better vision, we exhibit the visual results in RGB format with CIE color as the mapping function. The right part in Fig. 6 shows that our method excels in preserving clear edges, particularly in the zoomed area. The left part in Fig. 6 depicts the spectral curves at index in RGB image, the highest correlation and the closest curve to GT further emphasize the efficacy of our method.
4.4 Real Experiment
We further evaluate the effectiveness of our method on real HSI data. Following the same settings as [44], we retrain our method of 2 stages with the real mask on the CAVE and KAIST datasets. Fig. 7 presents the visual comparison between our method and others. Compared to others, our method preserves more details with fewer artifacts, demonstrating the capability of our method to reconstruct the accurate details in HSIs.
4.5 Ablation Study
Break-down Ablation of Frequency Domain Learning. To verify the effectiveness of each component of the frequency domain learning, we present break-down ablation experiments on SAF and SIF. Baseline-1 is derived by removing SAF and SIF from our 2-stage method. As Tab. 4.3 shows, Baseline-1 achieves 36.02 dB. Adding SAF and then SIF yields improvements of 1.08 dB and 0.34 dB, respectively, showing the effectiveness of each component.
stage | Ours | RDLUF | Ours-plus |
PSNR | 38.16 | 37.56 | 38.20 |
SSIM | 0.967 | 0.963 | 0.972 |
FLOPs (G) | 31.56 | 62.34 | 32.29 |
Params (M) | 0.9 | 1.89 | 0.82 |
metric | HFC |
Inference Time (s) | 0.0592 |
Computation Effort (GFLOPs) | 0.498 |
Memory Cost (M) | 0.086 |
Ablation Study of Kernel Size in SAF. To obtain the optimal kernel size for SAF, which balances the number of similar frequencies and the correlation between frequency tokens, we conduct ablation experiments on SAF with various kernel sizes, i.e., 2, 4, 8, and 16. Baseline-3 is obtained by removing SIF and space domain learning from CMDT with 1 stage. As Tab. 4.3 illustrates, the kernel-8 model achieves the best performance of 34.61 dB in PSNR and is therefore adopted.
Ablation Study of Self-attention Mechanisms. To compare SAF with other self-attention mechanisms, we adopt G-MSA [23], Swin MSA (SW-MSA) [39], and HS-MSA [12] as competitors. Baseline-2 is obtained by removing CMDT from our 1-stage method. Tab. 4.3 shows that SAF yields the highest numerical performance with linear complexity, demonstrating the strong performance of self-attention in the frequency domain.
Ablation Study of Parameter Sharing. To validate the effects of parameter sharing, a series of experiments are conducted with various numbers of stages. As Tab. 4.3 shows, with 2 stages, the method without parameter sharing yields higher performance than its shared counterpart. However, with more stages, e.g., 3 and 5 stages, the method with parameter sharing outperforms its counterpart. This is primarily because, without parameter sharing, the number of parameters grows in proportion to the number of stages, leading to overfitting, especially when the data diversity is insufficient. Therefore, the method with parameter sharing exhibits promising reconstruction performance.
Learnable sensing matrix. Since RDLUF changes the forward process with a new learnable sensing matrix in each iteration, a direct performance comparison with our method is fundamentally unfair. For a fair comparison, we further conduct a combination experiment to validate the superiority of our CMDT, i.e., we replace the Mix transformer in RDLUF with CMDT and retrain the model with 3 stages. As shown in Tab. 4.5, the proposed CMDT offers a 0.64 dB increase in PSNR and a 0.009 increase in SSIM while requiring only 43.3% of the Params and 51.8% of the FLOPs of RDLUF.
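The relative figures quoted above follow directly from the entries of the comparison table. The short check below recomputes them from the tabulated values for RDLUF and Ours-plus; the dictionary layout and the helper name `percent_of` are assumptions for this illustration, and small differences in the last digit can arise from rounding of the table entries.

```python
# Values copied from the RDLUF comparison table (3 stages).
costs = {
    "RDLUF": {"psnr": 37.56, "ssim": 0.963, "flops_g": 62.34, "params_m": 1.89},
    "Ours-plus": {"psnr": 38.20, "ssim": 0.972, "flops_g": 32.29, "params_m": 0.82},
}

def percent_of(value, reference):
    """Express value as a percentage of reference."""
    return 100.0 * value / reference

ours, rdluf = costs["Ours-plus"], costs["RDLUF"]
psnr_gain = round(ours["psnr"] - rdluf["psnr"], 2)
ssim_gain = round(ours["ssim"] - rdluf["ssim"], 3)
params_pct = round(percent_of(ours["params_m"], rdluf["params_m"]), 1)
flops_pct = round(percent_of(ours["flops_g"], rdluf["flops_g"]), 1)

print(f"PSNR gain: {psnr_gain} dB, SSIM gain: {ssim_gain}")
print(f"Params: {params_pct}% of RDLUF, FLOPs: {flops_pct}% of RDLUF")
```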
Revisiting HFC. First, since our method is correlation-driven, we redraw the spectral correlation heatmap of the reconstructed HSI in the frequency domain. HDNet and DAUHST are chosen for comparison. As Fig. 9 depicts, our method gives the spectral correlation closest to that of the GT, especially in the bottom-left region, validating the accurate correlation reconstruction of our method. Second, to quantify the running cost of our HFC-based frequency domain learning network, we compute its inference time, computation effort, and memory cost, which are reported in Tab. 4.5 and remain low.
5 Discussion
5.1 Visualization of Learnable Gating Filter
To prove the gating capacity of the learnable gating filter, we draw the heatmaps of 5 increasing-frequency tokens named patch-1 to patch-5. As Fig. 10 shows, the higher the frequency tokens are, the lower the overall values in the corresponding patch district of the learnable gating filter are, thus verifying that the learnable filter makes the low-frequency tokens with dominant spectral-spatial correlation mostly come from the SAF and high-frequency tokens with dominant spectral/spatial correlation mostly come from the SIF.
5.2 Relation between attention map and correlation
To further explore the relation between the attention maps calculated in the method and the various correlations between spectral tokens, we visualize the heatmaps of the space domain learning and the SAF, respectively. As depicted in Fig. 11, the high correlation between frequency tokens in Fig. 3(a) yields broadly activated attention weights, while the low correlation between space tokens in Fig. 3(a) yields broadly non-activated attention weights. Concretely, highly correlated frequency tokens ensure that rich information from other relevant frequency tokens is comprehensively aggregated. Conversely, poorly correlated space tokens lead to inefficient utilization of other space tokens and strong interference from highly irrelevant tokens, thus harming the process of information extraction and evolution.
6 Conclusion
In this paper, we propose the HFC prior derived from the statistical analyses of existing HSI datasets. Leveraging the HFC prior, we formulate a frequency domain learning composed of a Spectral-wise self-Attention of Frequency and a Spectral-spatial Interaction of Frequency, which significantly enhances the HSI frequency exploration. By combining the frequency domain learning and the space domain learning, we build the Correlation-driven Mixing Domains Transformer as a deep prior for comprehensive prior exploitation. Plugging this deep prior into an unfolding architecture, we develop a correlation-driven mixing transformer unfolding framework for accurate HSI reconstruction. Experimental results reveal the superiority of our method in image quality and computational efficiency over the SOTA methods. Since the HFC widely exists in various HSIs, the HFC prior and the proposed method can be further leveraged in various HSI-based applications like spectral super-resolution, image classification, and biomedical analysis.
References
- [1] Andika, F., Rizkinia, M., Okuda, M.: A hyperspectral anomaly detection algorithm based on morphological profile and attribute filter with band selection and automatic determination of maximum area. Remote Sensing 12(20), 3387 (2020)
- [2] Arad, B., Ben-Shahar, O.: Sparse recovery of hyperspectral signal from natural rgb images. In: European Conference on Computer Vision. pp. 19–34 (2016)
- [3] Arad, B., Timofte, R., Yahel, R., Morag, N., Bernat, A., Cai, Y., Lin, J., Lin, Z., Wang, H., Zhang, Y., et al.: Ntire 2022 spectral recovery challenge and data set. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 863–881 (2022)
- [4] Arce, G.R., Brady, D.J., Carin, L., Arguello, H., Kittle, D.S.: Compressive coded aperture spectral imaging: An introduction. IEEE Signal Processing Magazine 31(1), 105–115 (2013)
- [5] Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., Schmid, C.: Vivit: A video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6836–6846 (2021)
- [6] Bandara, W.G.C., Patel, V.M.: Hypertransformer: A textural and spectral feature fusion transformer for pansharpening. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1767–1777 (2022)
- [7] Baumgardner, M.F., Biehl, L.L., Landgrebe, D.A.: 220 band aviris hyperspectral image data set: June 12, 1992 indian pine test site 3. Purdue University Research Repository 10(7), 991 (2015)
- [8] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10231–10241 (2021)
- [9] Bioucas-Dias, J.M., Figueiredo, M.A.: A new twist: Two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing 16(12), 2992–3004 (2007)
- [10] Cai, Y., Lin, J., Hu, X., Wang, H., Yuan, X., Zhang, Y., Timofte, R., Van Gool, L.: Coarse-to-fine sparse transformer for hyperspectral image reconstruction. In: European Conference on Computer Vision. pp. 686–704 (2022)
- [11] Cai, Y., Lin, J., Hu, X., Wang, H., Yuan, X., Zhang, Y., Timofte, R., Van Gool, L.: Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17502–17511 (2022)
- [12] Cai, Y., Lin, J., Wang, H., Yuan, X., Ding, H., Zhang, Y., Timofte, R., Gool, L.V.: Degradation-aware unfolding half-shuffle transformer for spectral compressive imaging. Advances in Neural Information Processing Systems 35, 37749–37761 (2022)
- [13] Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M.: Swin-unet: Unet-like pure transformer for medical image segmentation. In: European Conference on Computer Vision. pp. 205–218 (2022)
- [14] Cao, J., Wang, Q., Xian, Y., Li, Y., Ni, B., Pi, Z., Zhang, K., Zhang, Y., Timofte, R., Van Gool, L.: Ciaosr: Continuous implicit attention-in-attention network for arbitrary-scale image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1796–1807 (2023)
- [15] Chakrabarti, A., Zickler, T.: Statistics of real-world hyperspectral images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 193–200 (2011)
- [16] Chang, Y., Yan, L., Chen, B., Zhong, S., Tian, Y.: Hyperspectral image restoration: Where does the low-rank property exist. IEEE Transactions on Geoscience and Remote Sensing 59(8), 6869–6884 (2020)
- [17] Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., Ma, S., Xu, C., Xu, C., Gao, W.: Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12299–12310 (2021)
- [18] Choi, I., Jeon, D.S., Nam, G., Gutierrez, D., Kim, M.H.: High-quality hyperspectral reconstruction using a spectral prior. ACM Transactions on Graphics p. 1–13 (2017)
- [19] Christovam, L., Pessoa, G., Shimabukuro, M., Galo, T.: Land use and land cover classification using hyperspectral imagery: Evaluating the performance of spectral angle mapper, support vector machine and random forest. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 42, 1841–1847 (2019)
- [20] Cui, Y., Tao, Y., Bing, Z., Ren, W., Gao, X., Cao, X., Huang, K., Knoll, A.: Selective frequency network for image restoration. In: International Conference on Learning Representations (2022)
- [21] Dong, Y., Gao, D., Qiu, T., Li, Y., Yang, M., Shi, G.: Residual degradation learning unfolding framework with mixing priors across spectral and spatial for compressive spectral imaging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22262–22271 (2023)
- [22] Donoho, D.L.: Compressed sensing. IEEE Transactions on information theory 52(4), 1289–1306 (2006)
- [23] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
- [24] Eason, D.T., Andrews, M.: Total variation regularization via continuation to recover compressed hyperspectral images. IEEE Transactions on Image Processing 24(1), 284–293 (2014)
- [25] Fauvel, M., Tarabalka, Y., Benediktsson, J.A., Chanussot, J., Tilton, J.C.: Advances in spectral-spatial classification of hyperspectral images. Proceedings of the IEEE 101(3), 652–675 (2012)
- [26] Fu, Y., Zhang, T., Wang, L., Huang, H.: Coded hyperspectral image reconstruction using deep external and internal learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(7), 3404–3420 (2021)
- [27] Fu, Y., Zheng, Y., Sato, I., Sato, Y.: Exploiting spectral-spatial correlation for coded hyperspectral image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3727–3736 (2016)
- [28] Gong, Z., Zhong, P., Hu, W.: Diversity in machine learning. IEEE Access 7, 64323–64350 (2019)
- [29] Guillevic, P.C., Privette, J.L., Coudert, B., Palecki, M.A., Demarty, J., Ottlé, C., Augustine, J.A.: Land surface temperature product validation using noaa’s surface climate observation networks—scaling methodology for the visible infrared imager radiometer suite (viirs). Remote Sensing of Environment 124, 282–298 (2012)
- [30] He, W., Zhang, H., Zhang, L., Shen, H.: Hyperspectral image denoising via noise-adjusted iterative low-rank matrix approximation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 8(6), 3050–3061 (2015)
- [31] Hu, X., Cai, Y., Lin, J., Wang, H., Yuan, X., Zhang, Y., Timofte, R., Van Gool, L.: Hdnet: High-resolution dual-domain learning for spectral compressive imaging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17542–17551 (2022)
- [32] Huang, T., Dong, W., Yuan, X., Wu, J., Shi, G.: Deep gaussian scale mixture prior for spectral compressive imaging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16216–16225 (2021)
- [33] James, J.: Spectrograph design fundamentals. Cambridge University Press (2007)
- [34] Jiang, L., Dai, B., Wu, W., Loy, C.C.: Focal frequency loss for image reconstruction and synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13919–13929 (2021)
- [35] Jiang, Y., Zhou, H., Zhang, Z., Zhang, C., Zhang, K.: S2 moinet: Spectral-spatial multi-order interactions network for hyperspectral image classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2023)
- [36] Kalman, L.S., Bassett III, E.M.: Classification and material identification in an urban environment using hydice hyperspectral data. In: Imaging Spectrometry III. vol. 3118, pp. 57–68 (1997)
- [37] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1833–1844 (2021)
- [38] Liu, Y., Yuan, X., Suo, J., Brady, D.J., Dai, Q.: Rank minimization for snapshot compressive imaging. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(12), 2990–3006 (2018)
- [39] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10012–10022 (2021)
- [40] Lv, Z., Dong, X.M., Peng, J., Sun, W.: Essinet: Efficient spatial–spectral interaction network for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 60, 1–15 (2022)
- [41] Maggiori, E., Charpiat, G., Tarabalka, Y., Alliez, P.: Recurrent neural networks to correct satellite image classification maps. IEEE Transactions on Geoscience and Remote Sensing 55(9), 4962–4971 (2017)
- [42] Manakov, A., Restrepo, J., Klehm, O., Hegedus, R., Eisemann, E., Seidel, H.P., Ihrke, I.: A reconfigurable camera add-on for high dynamic range, multispectral, polarization, and light-field imaging. ACM Transactions on Graphics 32(4), 47–1 (2013)
- [43] Meng, Z., Jalali, S., Yuan, X.: Gap-net for snapshot compressive imaging. arXiv preprint arXiv:2012.08364 (2020)
- [44] Meng, Z., Ma, J., Yuan, X.: End-to-end low cost compressive spectral imaging with spatial-spectral self-attention. In: European Conference on Computer Vision. pp. 187–204 (2020)
- [45] Meng, Z., Yuan, X., Jalali, S.: Deep unfolding for snapshot compressive imaging. International Journal of Computer Vision 131(11), 2933–2958 (2023)
- [46] Miao, X., Yuan, X., Pu, Y., Athitsos, V.: λ-net: Reconstruct hyperspectral images from a snapshot measurement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4059–4069 (2019)
- [47] Nie, Z., Chen, L., Jeon, S., Yang, X.: Spectral-spatial interaction network for multispectral image and panchromatic image fusion. Remote Sensing 14(16), 4100 (2022)
- [48] Park, J.I., Lee, M.H., Grossberg, M.D., Nayar, S.K.: Multispectral imaging using multiplexed illumination. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1–8 (2007)
- [49] Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., Courville, A.: On the spectral bias of neural networks. In: International Conference on Machine Learning. pp. 5301–5310 (2019)
- [50] Rahimi, A., Recht, B.: Random features for large-scale kernel machines. Advances in Neural Information Processing Systems 20 (2007)
- [51] Roth, S., Black, M.J.: Fields of experts: A framework for learning image priors. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). vol. 2, pp. 860–867. IEEE (2005)
- [52] Sedgwick, P.: Pearson’s correlation coefficient. British Medical Journal 345 (2012)
- [53] Shuai, L., Guanglong, X.: Multi-dimensional convolutional network collaborative unmixing method for hyperspectral image mixed pixels. Acta Geodaetica et Cartographica Sinica 49(12), 1600 (2020)
- [54] Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., Ng, R.: Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems 33, 7537–7547 (2020)
- [55] Thompson, D.R., Boardman, J.W., Eastwood, M.L., Green, R.O.: A large airborne survey of earth’s visible-infrared spectral dimensionality. Optics Express 25(8), 9186–9195 (2017)
- [56] Uchaev, D., Uchaev, D.: Small sample hyperspectral image classification based on the random patches network and recursive filtering. Sensors 23(5), 2499 (2023)
- [57] Uzkent, B., Hoffman, M.J., Vodacek, A.: Real-time vehicle tracking in aerial video using hyperspectral features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 36–44 (2016)
- [58] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
- [59] Wagadarikar, A., John, R., Willett, R., Brady, D.: Single disperser design for coded aperture snapshot spectral imaging. Applied Optics 47(10), B44–B51 (2008)
- [60] Wang, H., Wu, X., Huang, Z., Xing, E.P.: High-frequency component helps explain the generalization of convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8684–8694 (2020)
- [61] Wang, L., Sun, C., Fu, Y., Kim, M.H., Huang, H.: Hyperspectral image reconstruction using a deep spatial-spectral prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8032–8041 (2019)
- [62] Wang, L., Sun, C., Zhang, M., Fu, Y., Huang, H.: Dnu: Deep non-local unrolling for computational spectral imaging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1661–1671 (2020)
- [63] Wang, L., Xiong, Z., Gao, D., Shi, G., Zeng, W., Wu, F.: High-speed hyperspectral video acquisition with a dual-camera architecture. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4942–4950 (2015)
- [64] Wang, L., Zhang, T., Fu, Y., Huang, H.: Hyperreconnet: Joint coded aperture optimization and image reconstruction for compressive hyperspectral imaging. IEEE Transactions on Image Processing 28(5), 2257–2270 (2018)
- [65] Wang, Q., Yuan, Z., Du, Q., Li, X.: Getnet: A general end-to-end two-dimensional cnn framework for hyperspectral image change detection. IEEE Transactions on Geoscience and Remote Sensing p. 3–13 (2019)
- [66] Wei, Y., Gu, S., Li, Y., Timofte, R., Jin, L., Song, H.: Unsupervised real-world image super resolution via domain-distance aware training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13385–13394 (2021)
- [67] Wolfe, W.L.: Introduction to imaging spectrometers, vol. 25. SPIE Press (1997)
- [68] Xia, B., Zhang, Y., Wang, Y., Tian, Y., Yang, W., Timofte, R., Van Gool, L.: Basic binary convolution unit for binarized image restoration network. In: International Conference on Learning Representations (2022)
- [69] Xiong, Z., Shi, Z., Li, H., Wang, L., Liu, D., Wu, F.: Hscnn: Cnn-based hyperspectral image recovery from spectrally undersampled projections. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. pp. 518–525 (2017)
- [70] Xu, Q., Shi, Y., Yuan, X., Zhu, X.X.: Universal domain adaptation for remote sensing image scene classification. IEEE Transactions on Geoscience and Remote Sensing 61, 1–15 (2023)
- [71] Xu, Z.Q.J., Zhang, Y., Xiao, Y.: Training behavior of deep neural network in frequency domain. In: International Conference on Neural Information Processing. pp. 264–274 (2019)
- [72] Ying, Y., Wang, J., Shi, Y., Yin, B.: Dual-domain feature learning and memory-enhanced unfolding network for spectral compressive imaging. In: IEEE International Conference on Multimedia and Expo. pp. 1589–1594 (2023)
- [73] Yokoya, N., Iwasaki, A.: Airborne hyperspectral data over chikusei. Space Appl. Lab., Univ. Tokyo, Tokyo, Japan, Tech. Rep. SAL-2016-05-27 5, 5 (2016)
- [74] Yuan, X.: Generalized alternating projection based total variation minimization for compressive sensing. In: IEEE International Conference on Image Processing. pp. 2539–2543 (2016)
- [75] Zha, Z., Wen, B., Yuan, X., Ravishankar, S., Zhou, J., Zhu, C.: Learning nonlocal sparse and low-rank models for image compressive sensing: Nonlocal sparse and low-rank modeling. IEEE Signal Processing Magazine 40(1), 32–44 (2023)
- [76] Zhang, J., Zhang, Y., Gu, J., Zhang, Y., Kong, L., Yuan, X.: Accurate image restoration with attention retractable transformer. In: International Conference on Learning Representations (2022)
- [77] Zhang, T., Fu, Y., Wang, L., Huang, H.: Hyperspectral image reconstruction using deep external and internal learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8559–8568 (2019)
- [78] Zhang, X., Zhang, Y., Xiong, R., Sun, Q., Zhang, J.: Herosnet: Hyperspectral explicable reconstruction and optimal sampling deep network for snapshot compressive imaging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17532–17541 (2022)
- [79] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6881–6890 (2021)
- [80] Zheng, Z., Zhong, Y., Ma, A., Zhang, L.: Fpga: Fast patch-free global learning framework for fully end-to-end hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 58(8), 5612–5626 (2020)
- [81] Zhou, M., Huang, J., Guo, C.L., Li, C.: Fourmer: An efficient global modeling paradigm for image restoration. In: International Conference on Machine Learning. pp. 42589–42601 (2023)