
Dual Hyperspectral Mamba for Efficient Spectral Compressive Imaging

Jiahua Dong1*, Hui Yin2*, Hongliu Li3, Wenbo Li4, Yulun Zhang5†,
Salman Khan1, 6, Fahad Shahbaz Khan1, 7
1Mohamed bin Zayed University of Artificial Intelligence   2Hunan University
3The Hong Kong Polytechnic University   4The Chinese University of Hong Kong
5Shanghai Jiao Tong University   6Australian National University   7Linköping University
*Equal Contributions.   †Corresponding Author.
Abstract

Deep unfolding methods have made impressive progress in restoring 3D hyperspectral images (HSIs) from 2D measurements through convolutional neural networks or Transformers in spectral compressive imaging. However, they cannot efficiently capture long-range dependencies using global receptive fields, which significantly limits their performance in HSI reconstruction. Moreover, these methods may suffer from local context neglect if we directly utilize Mamba to unfold a 2D feature map as a 1D sequence for modeling global long-range dependencies. To address these challenges, we propose a novel Dual Hyperspectral Mamba (DHM) to explore both global long-range dependencies and local contexts for efficient HSI reconstruction. After learning informative parameters to estimate degradation patterns of the CASSI system, we use them to scale the linear projection and provide the noise level for the denoiser (i.e., our proposed DHM). Specifically, our DHM consists of multiple dual hyperspectral S4 blocks (DHSBs) to restore original HSIs. Particularly, each DHSB contains a global hyperspectral S4 block (GHSB) to model long-range dependencies across the entire high-resolution HSIs using global receptive fields, and a local hyperspectral S4 block (LHSB) to address local context neglect by establishing structured state-space sequence (S4) models within local windows. Experiments verify the benefits of our DHM for HSI reconstruction. The source code and models will be available at https://github.com/JiahuaDong/DHM.

1 Introduction

Figure 1: Comparisons of PSNR-FLOPS between our DHM and SOTA models.

Unlike standard RGB images with only three spectral bands, hyperspectral images (HSIs) [54, 18, 45, 23] comprise multiple contiguous bands, providing detailed spectral information for each pixel. In recent decades, HSIs have achieved remarkable successes in a wide range of applications such as remote sensing [3, 47, 75], object detection [32, 55], vehicle tracking [62, 63, 21], and medical image analysis [1, 41, 50]. With the development of compressive sensing theory, the coded aperture snapshot spectral imaging (CASSI) [49, 22], one of the snapshot compressive imaging systems [74, 44, 58, 64], has shown impressive performance in capturing HSIs at video rate. The CASSI system modulates HSI signals at various wavelengths, and mixes all modulated spectra to output a 2D compressed measurement. Then, numerous HSI reconstruction methods [20, 78, 57] are developed to restore original HSIs from 2D compressed measurements (i.e., the CASSI inverse problem [8]).

Different from natural image restoration, HSI reconstruction deals with substantially degraded measurements caused by uncertain system noise and spectral compression [49, 14]. Thus, it is more challenging to learn underlying HSI properties than in natural image restoration. Generally, existing HSI reconstruction methods can be divided into four categories. To solve the CASSI inverse problem, model-based methods [68, 78, 60] depend heavily on hand-crafted image priors (e.g., low-rank [39] and sparsity [34]), and thus suffer from limited generalization capability. Some plug-and-play works [77, 80, 54] plug a pretrained denoiser into model-based methods [57, 73], while end-to-end algorithms [49, 27, 52] ignore the working mechanism of the CASSI system and instead model a brute-force projection from 2D compressed measurements to HSIs via convolutional neural networks (CNNs). Moreover, deep unfolding methods [79, 20, 66, 67] introduce a multi-stage unfolding framework to iteratively learn a linear projection and a denoiser. They possess the interpretability of model-based methods [72] as well as the powerful encoding capability of deep learning, thereby achieving state-of-the-art performance and leading the development of the HSI reconstruction task.

Many deep unfolding methods [79, 72] rely on CNNs as the denoiser to capture local contexts, showing significant limitations in exploiting the crucial global contexts for HSI reconstruction. To tackle this issue, some works employ Transformers [17] to model wide-range dependencies [14, 5, 4, 7], but their complexity is quadratic in the number of tokens. Therefore, there is a trade-off between computational complexity and effective receptive fields, hindering these methods from exploring long-range dependencies, especially in high-resolution HSIs. Recently, structured state space sequence (S4) models [25, 65, 46] have emerged as a promising backbone to address the limitations of Transformers and CNNs. Visual Mamba models [81, 69] then introduce a cross-scan module to apply S4 models to vision tasks by unfolding 2D features into 1D arrays along four directions. They can use global receptive fields to capture long-range contexts while reducing the quadratic complexity to linear. However, existing Mamba models [81, 69] face a crucial challenge of local context neglect when directly applied to high-resolution HSI reconstruction. Since Mamba unfolds a 2D feature map as a 1D sequence, spatially close pixels may end up at distant positions in the flattened sequence. The excessive distance among nearby pixels leads to the problem of local context neglect (i.e., significant loss of critical local textures), thereby degrading the performance of HSI reconstruction.
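The distance argument above can be made concrete with a small index calculation. The following is a minimal sketch, assuming a plain row-major scan (only one of the four scan directions) and non-overlapping square windows; the sizes and the helper names `flat_index` and `windowed_index` are illustrative assumptions, not taken from the paper.

```python
def flat_index(row, col, width):
    """Position of pixel (row, col) when an H x W map is flattened row by row."""
    return row * width + col


def windowed_index(row, col, width, win):
    """Position when the map is first split into non-overlapping win x win
    windows (visited in row-major order) and each window is flattened row by row."""
    windows_per_row = width // win
    window_id = (row // win) * windows_per_row + (col // win)
    inside = (row % win) * win + (col % win)
    return window_id * win * win + inside


H, W, win = 256, 256, 8      # illustrative sizes only
a, b = (10, 20), (11, 20)    # vertically adjacent pixels inside the same window

global_gap = abs(flat_index(*a, W) - flat_index(*b, W))
local_gap = abs(windowed_index(*a, W, win) - windowed_index(*b, W, win))
print(global_gap, local_gap)  # prints: 256 8
```

Under these assumptions, two vertically adjacent pixels are a full image row apart in the global sequence but only one window row apart when the sequence is built inside a local window.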

To resolve the above challenges, we develop a novel Dual Hyperspectral Mamba (DHM) for efficient HSI reconstruction. Our DHM relies on structured state-space sequence (S4) models to reconstruct HSIs from 2D degraded measurements, which can capture both global long-range dependencies and local contexts with linear computational complexity. It is the first attempt to address HSI reconstruction via S4 models in the field of hyperspectral compressive imaging. After learning informative parameters from the physical mask and degraded measurement of the CASSI system, we feed them into a multi-stage unfolding framework by scaling the linear projection and estimating the noise level for the denoiser (i.e., our proposed DHM). The core component of our DHM is the dual hyperspectral S4 block (DHSB), which is mainly composed of a global hyperspectral S4 block (GHSB) and a local hyperspectral S4 block (LHSB). More specifically, the GHSB focuses on understanding global long-range dependencies by modeling the discrete state-space equation on the entire high-resolution HSIs, which can effectively balance computational complexity and global receptive fields. Besides, the LHSB aims to surmount the challenge of local context neglect by constructing S4 models within different local windows. As shown in Fig. 1, experiments show that our DHM significantly surpasses existing HSI reconstruction methods. The novel contributions of our paper are listed as follows:

\bullet We propose a new Dual Hyperspectral Mamba (DHM) for HSI reconstruction, capable of capturing both global long-range dependencies and local contexts with linear computational complexity. To our best knowledge, our DHM is the first Mamba-based deep unfolding method for HSI reconstruction.

\bullet We develop a global hyperspectral S4 block (GHSB) to explore long-range dependencies across the entire high-resolution HSIs using global receptive fields, and design a local hyperspectral S4 block (LHSB) to tackle local context neglect by constructing S4 models within different local windows.

\bullet We conduct comprehensive experiments to illustrate that our DHM significantly surpasses SOTA deep unfolding methods, while requiring lower model size and computational complexity.

2 Related Work

Hyperspectral Image Reconstruction: Traditional model-based HSI reconstruction methods [21, 60, 68, 71, 78] utilize hand-crafted priors such as sparsity [34], total variation [71] and low-rank constraints to address the CASSI inverse problem. Unfortunately, they rely heavily on manual parameter tuning, leading to unsatisfactory reconstruction performance. In light of this, some plug-and-play methods [77, 54, 36] focus on integrating convex optimization with pretrained denoising networks for HSI reconstruction. They have limited generalization performance due to their overreliance on the pretrained denoiser. Besides, end-to-end (E2E) algorithms [5, 27, 49] rely on convolutional neural networks (CNNs) [13, 12] or Transformers [17] to learn a brute-force projection function for HSI restoration. They can improve the HSI reconstruction performance but lack robustness and interpretability. To address these limitations, deep unfolding methods [8, 20, 28, 42, 67] are developed to restore HSI cubes from 2D compressed measurements via a multi-stage framework, offering both interpretability and strong encoding ability. The methods in [42, 79, 67] employ CNNs to estimate degradation patterns, which limits their ability to explore long-range contexts. After Cai et al. [8] employed a Transformer to capture non-local dependencies, many Transformer-based methods [29, 15, 38, 14, 70] were proposed to design the denoisers. However, the above methods suffer from a trade-off between computational complexity and effective receptive fields, preventing them from understanding long-range dependencies with global receptive fields to achieve better HSI reconstruction performance.

State Space Models (SSMs) [25, 26, 59, 35] have attracted increasing attention recently due to their capability to scale linearly with sequence length in long-range dependency modeling. After the structured state space sequence (S4) model [25] showed impressive performance on long-range sequence modeling tasks, the S5 model [59] introduced an efficient parallel scan and a general MIMO SSM based on S4. Then [19, 46] were proposed to narrow the performance gap between Transformers and SSMs. Mamba [24], an enhanced SSM with efficient hardware design and a selective mechanism, has surpassed Transformers in natural language processing [30, 53]. Due to its ability to model long-range dependencies with linear complexity, Mamba has been widely applied to diverse vision tasks, such as image/video understanding [40, 76, 37] and biomedical image analysis [43]. However, these Mamba models [24, 76, 37, 40, 53] may face the challenge of local context neglect (i.e., substantial loss of critical local textures) when directly applied to the high-resolution HSI reconstruction task.
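The linear scaling mentioned above comes from the recurrent form of a discrete state-space model. The sketch below is a generic, time-invariant recurrence with a single input channel, written only to illustrate that one pass over the sequence suffices. It is not the selective parameterization of Mamba or the HSI-SSM of this paper, and the names `ssm_scan`, `A`, `B` and `C` are illustrative assumptions.

```python
import numpy as np


def ssm_scan(u, A, B, C):
    """Run h_k = A h_{k-1} + B u_k and y_k = C h_k over a 1D sequence u.

    The loop visits every element exactly once, so the cost grows
    linearly with the sequence length for a fixed state size.
    """
    h = np.zeros(A.shape[0])
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        h = A @ h + B * u_k   # state update
        y[k] = C @ h          # read-out
    return y


# Toy system: a two-dimensional state that halves at every step.
A = 0.5 * np.eye(2)
B = np.array([1.0, 0.0])
C = np.array([1.0, 0.0])
y = ssm_scan(np.array([1.0, 0.0, 0.0]), A, B, C)
```

With this toy system, an impulse at the first position decays geometrically along the sequence, which shows how the state carries information from early elements to later ones.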

3 The Proposed Model

3.1 The CASSI System

Figure 2: Our unfolding framework with $T$ iterative stages.

Degradation Model: In the coded aperture snapshot spectral imaging (CASSI) system [61, 49, 22], the camera captures the vectorized degraded measurement $\mathbf{Y}\in\mathbb{R}^{\xi}$, where $\xi=H(\delta_{s}(N_{\omega}-1)+W)$. $N_{\omega}$, $\delta_{s}$, $H$ and $W$ represent the number of wavelengths, the shifting step of dispersion, and the height and width of the hyperspectral images (HSIs), respectively. As introduced in [8], after vectorizing the shifted HSI signal as $\mathbf{X}\in\mathbb{R}^{\xi N_{\omega}}$, we express the degradation model of the CASSI system as follows:

$$\mathbf{Y}=\boldsymbol{\Psi}\mathbf{X}+\boldsymbol{\epsilon}, \tag{1}$$

where $\boldsymbol{\epsilon}\in\mathbb{R}^{\xi}$ is the vectorized imaging noise on $\mathbf{Y}$. $\boldsymbol{\Psi}\in\mathbb{R}^{\xi\times\xi N_{\omega}}$ indicates the sparse and fat sensing matrix, which is determined by the physical mask in the CASSI system [16, 31]. Given $\boldsymbol{\Psi}$ and $\mathbf{Y}$ in the CASSI system, the goal of HSI reconstruction is to restore the HSI signal $\mathbf{X}$ by removing the imaging noise $\boldsymbol{\epsilon}$.
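To make the shapes in Eq. (1) tangible, the following NumPy sketch simulates the forward model on a toy cube. It assumes the common CASSI construction in which every band is modulated by the same binary mask, band $n$ is shifted horizontally by $\delta_{s}n$ pixels, and the sensing matrix is a horizontal concatenation of diagonal blocks; the helper names `shift_cube` and `cassi_forward` and all sizes are illustrative assumptions rather than the authors' code.

```python
import numpy as np


def shift_cube(cube, step):
    """Place band n of an (H, W, N) cube at a horizontal offset of step * n."""
    H, W, N = cube.shape
    out = np.zeros((H, W + step * (N - 1), N), dtype=cube.dtype)
    for n in range(N):
        out[:, step * n: step * n + W, n] = cube[:, :, n]
    return out


def cassi_forward(shifted_x, shifted_mask, noise_std=0.0, rng=None):
    """Y = Psi X + eps, written as mask modulation followed by a sum over bands."""
    y = np.sum(shifted_mask * shifted_x, axis=2)
    if noise_std > 0:
        rng = np.random.default_rng() if rng is None else rng
        y = y + noise_std * rng.standard_normal(y.shape)
    return y


rng = np.random.default_rng(0)
H, W, N, step = 8, 8, 4, 2                       # toy sizes, not from the paper
hsi = rng.random((H, W, N))
mask = (rng.random((H, W)) > 0.5).astype(float)  # assumed binary coded aperture

shifted_x = shift_cube(hsi, step)
shifted_mask = shift_cube(np.repeat(mask[:, :, None], N, axis=2), step)
y = cassi_forward(shifted_x, shifted_mask)       # noise-free measurement

# Explicit fat sensing matrix Psi = [D_1, ..., D_N] with diagonal blocks D_n.
psi = np.hstack([np.diag(shifted_mask[:, :, n].ravel()) for n in range(N)])
x_vec = np.concatenate([shifted_x[:, :, n].ravel() for n in range(N)])
xi = H * (step * (N - 1) + W)
print(y.shape, psi.shape, np.allclose(psi @ x_vec, y.ravel()))
```

The explicit matrix is built here only to check the shapes $\xi$ and $\xi\times\xi N_{\omega}$; in practice the element-wise form in `cassi_forward` avoids ever materializing $\boldsymbol{\Psi}$.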

Estimation of Degradation Patterns: As analyzed in previous deep unfolding methods [8, 15, 14, 70], the estimation of degradation patterns is crucial to improve HSI reconstruction performance in the multi-stage unfolding framework, by adaptively scaling the linear projection and offering information about the imaging noise $\boldsymbol{\epsilon}$ for the denoiser. Thus, motivated by [8, 15], we use maximum a posteriori (MAP) theory to restore the original HSI signal $\mathbf{X}$ in Eq. (1) via optimizing the following energy function:

$$\widehat{\mathbf{X}}=\arg\min_{\mathbf{X}}\frac{1}{2}\|\mathbf{Y}-\boldsymbol{\Psi}\mathbf{X}\|^{2}+\lambda\mathcal{R}(\mathbf{X}), \tag{2}$$

where $\mathcal{R}(\mathbf{X})$ denotes the prior term about $\mathbf{X}$, and $\lambda$ is the hyperparameter to balance the importance of the prior term. In order to solve Eq. (2), we define an auxiliary variable $\mathbf{Z}=\mathbf{X}\in\mathbb{R}^{\xi N_{\omega}}$, and then utilize the half-quadratic splitting algorithm to minimize the following loss $\mathcal{L}_{\mathrm{HSI}}$:

$$\mathcal{L}_{\mathrm{HSI}}=\frac{1}{2}\|\mathbf{Y}-\boldsymbol{\Psi}\mathbf{X}\|^{2}+\lambda\mathcal{R}(\mathbf{Z})+\frac{\eta}{2}\|\mathbf{Z}-\mathbf{X}\|^{2}, \tag{3}$$

where $\eta$ is a penalty parameter. We decouple $\mathbf{X}$ and $\mathbf{Z}$ into two iterative subproblems to solve Eq. (3):

$$\mathbf{X}_{t}=\arg\min_{\mathbf{X}}\|\mathbf{Y}-\boldsymbol{\Psi}\mathbf{X}\|^{2}+\eta\|\mathbf{X}-\mathbf{Z}_{t-1}\|^{2},\quad \mathbf{Z}_{t}=\arg\min_{\mathbf{Z}}\frac{\eta}{2}\|\mathbf{Z}-\mathbf{X}_{t}\|^{2}+\lambda\mathcal{R}(\mathbf{Z}), \tag{4}$$

where $t=1,\cdots,T$ denotes the iterative stage index in the multi-stage unfolding framework, as shown in Fig. 2. Since the subproblem of solving $\mathbf{X}$ in Eq. (4) is a quadratic regularized least-squares problem, we can derive its closed-form solution as $\mathbf{X}_{t}=(\boldsymbol{\Psi}^{\top}\boldsymbol{\Psi}+\eta\mathbf{I})^{-1}(\boldsymbol{\Psi}^{\top}\mathbf{Y}+\eta\mathbf{Z}_{t-1})$. Considering the high computational overhead of $(\boldsymbol{\Psi}^{\top}\boldsymbol{\Psi}+\eta\mathbf{I})^{-1}$ brought by the fat sensing matrix $\boldsymbol{\Psi}\in\mathbb{R}^{\xi\times\xi N_{\omega}}$, we resort to the matrix inversion formula to simplify it: $(\boldsymbol{\Psi}^{\top}\boldsymbol{\Psi}+\eta\mathbf{I})^{-1}=\eta^{-1}\mathbf{I}-\eta^{-1}\boldsymbol{\Psi}^{\top}(\boldsymbol{\Psi}\eta^{-1}\boldsymbol{\Psi}^{\top}+\mathbf{I})^{-1}\boldsymbol{\Psi}\eta^{-1}$. As a result, we can reformulate the closed-form solution of $\mathbf{X}$ in Eq. (4) as follows:

$$\mathbf{X}_{t}=\mathbf{Z}_{t-1}+\eta^{-1}\boldsymbol{\Psi}^{\top}\mathbf{Y}-\eta^{-1}\boldsymbol{\Psi}^{\top}(\boldsymbol{\Psi}\eta^{-1}\boldsymbol{\Psi}^{\top}+\mathbf{I})^{-1}\boldsymbol{\Psi}\mathbf{Z}_{t-1}-\eta^{-2}\boldsymbol{\Psi}^{\top}(\boldsymbol{\Psi}\eta^{-1}\boldsymbol{\Psi}^{\top}+\mathbf{I})^{-1}\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}\mathbf{Y}. \tag{5}$$

As introduced in [4, 8], $\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}=\mathrm{diag}\{\psi_{1},\psi_{2},\cdots,\psi_{\xi}\}$ is a diagonal matrix in the CASSI system. After defining $\boldsymbol{\psi}=[\psi_{1},\psi_{2},\cdots,\psi_{\xi}]\in\mathbb{R}^{\xi}$, we plug $\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}=\mathrm{diag}\{\psi_{1},\psi_{2},\cdots,\psi_{\xi}\}$ into Eq. (5):

$$\mathbf{X}_{t}=\mathbf{Z}_{t-1}+\boldsymbol{\Psi}^{\top}((\mathbf{Y}-\boldsymbol{\Psi}\mathbf{Z}_{t-1})\otimes(\eta+\boldsymbol{\psi})^{-1})^{\top}, \tag{6}$$
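The simplification from the closed-form solution to Eq. (6) can be checked numerically. The sketch below is a toy verification, assuming a sensing matrix built from diagonal blocks so that $\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}$ is diagonal; the sizes, the value of $\eta$ and the variable names are illustrative assumptions, and the reciprocal in Eq. (6) is taken element-wise.

```python
import numpy as np

rng = np.random.default_rng(0)
xi, n_bands, eta = 6, 3, 0.7   # toy sizes and penalty, chosen for illustration

# Psi = [D_1, ..., D_N] with diagonal blocks, so Psi Psi^T is diagonal.
blocks = [np.diag(rng.random(xi)) for _ in range(n_bands)]
psi_mat = np.hstack(blocks)                 # shape (xi, xi * n_bands)
gram = psi_mat @ psi_mat.T                  # Psi Psi^T
psi_vec = np.diag(gram)                     # the vector psi used in Eq. (6)

y = rng.standard_normal(xi)
z_prev = rng.standard_normal(xi * n_bands)

# Closed-form solution: (Psi^T Psi + eta I)^{-1} (Psi^T Y + eta Z_{t-1}).
closed_form = np.linalg.solve(
    psi_mat.T @ psi_mat + eta * np.eye(xi * n_bands),
    psi_mat.T @ y + eta * z_prev,
)

# Eq. (6): one element-wise division and two products with Psi.
eq6 = z_prev + psi_mat.T @ ((y - psi_mat @ z_prev) / (eta + psi_vec))

print(np.allclose(closed_form, eq6))
```

The closed-form route solves a linear system of size $\xi N_{\omega}$, whereas Eq. (6) needs only element-wise operations on vectors of length $\xi$, which is the reason for applying the matrix inversion formula.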

where $\otimes$ is the element-wise multiplication. Since $\boldsymbol{\psi}$ is precomputed and stored in $\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}$, the value of $\eta$ in Eq. (6) can affect the output of each iterative stage in the multi-stage unfolding framework. To eliminate the negative influence of manually determining $\eta$, we set $\eta$ to be learnable in the multi-stage framework, and denote $\eta_{t}$ as the value of $\eta$ at the $t$-th iterative stage. Besides, we also define a learnable parameter $\lambda_{t}$ at the $t$-th stage, and express the subproblem of solving $\mathbf{Z}_{t}$ in Eq. (4) as:

$$\mathbf{Z}_{t}=\arg\min_{\mathbf{Z}}\frac{1}{2(\sqrt{\lambda_{t}/\eta_{t}})^{2}}\|\mathbf{Z}-\mathbf{X}_{t}\|^{2}+\mathcal{R}(\mathbf{Z}). \tag{7}$$
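Eq. (7) is a rescaling of the $\mathbf{Z}$-subproblem in Eq. (4). The short derivation below spells out the step, assuming $\lambda_{t}>0$ and $\eta_{t}>0$; the shorthand $\sigma_{t}$ is introduced here only for readability and is not notation from the paper.

```latex
\begin{aligned}
\mathbf{Z}_{t}
  &= \arg\min_{\mathbf{Z}} \frac{\eta_{t}}{2}\|\mathbf{Z}-\mathbf{X}_{t}\|^{2}
     + \lambda_{t}\mathcal{R}(\mathbf{Z}) \\
  &= \arg\min_{\mathbf{Z}} \frac{\eta_{t}}{2\lambda_{t}}\|\mathbf{Z}-\mathbf{X}_{t}\|^{2}
     + \mathcal{R}(\mathbf{Z})
     \qquad \text{(dividing by } \lambda_{t} > 0 \text{ does not change the minimizer)} \\
  &= \arg\min_{\mathbf{Z}} \frac{1}{2\sigma_{t}^{2}}\|\mathbf{Z}-\mathbf{X}_{t}\|^{2}
     + \mathcal{R}(\mathbf{Z}),
     \qquad \sigma_{t} = \sqrt{\lambda_{t}/\eta_{t}} .
\end{aligned}
```

The last line has the form of a MAP estimate for the observation model $\mathbf{X}_{t}=\mathbf{Z}+\mathbf{n}$ with $\mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma_{t}^{2}\mathbf{I})$ and a prior proportional to $\exp(-\mathcal{R}(\mathbf{Z}))$.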

In Eq. (7), the subproblem of solving $\mathbf{Z}_{t}$ is equivalent to denoising the image $\mathbf{X}_{t}$ with a Gaussian noise level of $\sqrt{\lambda_{t}/\eta_{t}}$, according to Bayesian probability [9]. Given $\boldsymbol{\eta}=[\eta_{1},\cdots,\eta_{T}]\in\mathbb{R}^{T}$ and $\boldsymbol{\rho}=[\eta_{1}/\lambda_{1},\cdots,\eta_{T}/\lambda_{T}]\in\mathbb{R}^{T}$, we can introduce the following iterative optimization scheme to estimate degradation patterns of the CASSI system and reconstruct the original HSI signal $\mathbf{X}$ in Eq. (1):

$$(\boldsymbol{\eta},\boldsymbol{\rho})=\Omega(\boldsymbol{\Psi},\mathbf{Y}),\quad \mathbf{X}_{t}=\mathcal{E}(\mathbf{Z}_{t-1},\boldsymbol{\Psi},\mathbf{Y},\eta_{t}),\quad \mathbf{Z}_{t}=\mathcal{D}(\mathbf{X}_{t},\rho_{t}), \tag{8}$$

where $\Omega(\cdot)$ is the parameter learner. $\mathcal{E}(\cdot)$ is equivalent to Eq. (6), which is a linear projection used for mapping $\mathbf{Z}_{t-1}$ to $\mathbf{X}_{t}$. $\mathcal{D}(\cdot)$ indicates the Gaussian denoiser to solve Eq. (7). As shown in Fig. 2, we depict our unfolding framework with $T$ iterative training stages to restore the original HSI signal $\mathbf{X}$ in Eq. (1). Specifically, we first concatenate the given sensing matrix $\boldsymbol{\Psi}$ and compressed measurement $\mathbf{Y}$, and input them into a convolution block to initialize $\mathbf{Z}_{0}$. At the $t$-th ($t=1,\cdots,T$) stage, the parameter learner $\Omega(\cdot)$ contains two degradation-aware blocks (DABs), an average pooling layer and three fully connected layers to encode $\mathbf{Z}_{0}$ and $\boldsymbol{\Psi}$, and then outputs the learnable parameters $(\boldsymbol{\eta},\boldsymbol{\rho})$. Each DAB has three convolution layers and two GELU functions. Then $\mathcal{E}(\cdot)$ and $\mathcal{D}(\cdot)$ use the parameters $(\boldsymbol{\eta},\boldsymbol{\rho})$ to iteratively update $\mathbf{X}_{t}$ and $\mathbf{Z}_{t}$ in Eq. (8) until the $T$-th stage. Particularly, $(\boldsymbol{\eta},\boldsymbol{\rho})$ learned by $\Omega(\cdot)$ can effectively scale the linear projection in Eq. (6), while offering an accurate noise level for the denoiser $\mathcal{D}(\cdot)$ to solve Eq. (7). In the CASSI system, they are essential to estimate the ill-posedness degree and degradation patterns, thereby substantially improving HSI reconstruction performance.
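The control flow of Eq. (8) can be sketched in a few lines of NumPy. This is only a structural illustration under loud assumptions: `placeholder_parameter_learner` returns fixed constants instead of a learned $\Omega(\cdot)$, `placeholder_denoiser` is a simple neighbour-averaging blend instead of the DHM, $\mathbf{Z}_{0}$ is initialized by re-applying the mask to the measurement instead of a convolution block, and $\boldsymbol{\Psi}$ is applied as a mask-and-sum over bands on arrays of shape $(H, W_{*}, N_{\omega})$. All function names are hypothetical.

```python
import numpy as np


def linear_projection(z_prev, y, shifted_mask, eta):
    """Eq. (6) on (H, W*, N) arrays: Psi sums masked bands, Psi^T re-applies the mask."""
    psi_vec = np.sum(shifted_mask ** 2, axis=2)            # diagonal of Psi Psi^T
    residual = y - np.sum(shifted_mask * z_prev, axis=2)   # Y - Psi Z_{t-1}
    return z_prev + shifted_mask * (residual / (eta + psi_vec))[:, :, None]


def placeholder_parameter_learner(num_stages):
    """Stand-in for Omega(.): fixed constants, NOT learned from Psi and Y."""
    return np.full(num_stages, 1.0), np.full(num_stages, 10.0)


def placeholder_denoiser(x, rho):
    """Stand-in for D(.): blends x with a 5-point spatial average, NOT the DHM."""
    blurred = (x + np.roll(x, 1, 0) + np.roll(x, -1, 0)
               + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 5.0
    keep = rho / (1.0 + rho)   # larger rho (lower noise) keeps more of x
    return keep * x + (1.0 - keep) * blurred


def unfold(y, shifted_mask, num_stages=3):
    """Alternate the projection and the denoiser for num_stages stages."""
    eta, rho = placeholder_parameter_learner(num_stages)
    z = shifted_mask * y[:, :, None]   # simple stand-in for the initialization of Z_0
    for t in range(num_stages):
        x = linear_projection(z, y, shifted_mask, eta[t])
        z = placeholder_denoiser(x, rho[t])
    return z


rng = np.random.default_rng(0)
shifted_mask = (rng.random((4, 6, 3)) > 0.5).astype(float)
y = rng.random((4, 6))
z_final = unfold(y, shifted_mask)
print(z_final.shape)
```

One property that carries over from Eq. (6) regardless of the denoiser: each call to `linear_projection` scales the measurement residual element-wise by $\eta/(\eta+\psi_{i})$, so the projected estimate never fits the measurement worse than its input.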

3.2 Dual Hyperspectral Mamba (DHM)

Generally, existing deep unfolding methods [7, 14, 70, 15] mainly utilize CNNs or Transformers to design the denoiser $\mathcal{D}(\cdot)$. However, these methods struggle to capture long-range dependencies using global receptive fields, thereby limiting their HSI reconstruction performance. Besides, directly applying Mamba to high-resolution HSI reconstruction suffers from local context neglect (i.e., substantial loss of critical local details). To resolve the above challenges, we develop a novel Dual Hyperspectral Mamba (DHM) as the denoiser $\mathcal{D}(\cdot)$ in Eq. (8). Our DHM uses global receptive fields to model long-range dependencies while tackling local context neglect via capturing local contexts.

Fig. 3a shows the architecture of our DHM (i.e., the denoiser 𝒟()𝒟\mathcal{D}(\cdot)caligraphic_D ( ⋅ )) at the t𝑡titalic_t-th (t=1,,T𝑡1𝑇t=1,\cdots,Titalic_t = 1 , ⋯ , italic_T) iterative stage in Fig. 2. Specifically, given the scalar ρtsubscript𝜌𝑡\rho_{t}italic_ρ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and 𝐗tH×W×Nωsubscript𝐗𝑡superscript𝐻subscript𝑊subscript𝑁𝜔\mathbf{X}_{t}\in\mathbb{R}^{H\times W_{*}\times N_{\omega}}bold_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT end_POSTSUPERSCRIPT at the t𝑡titalic_t-th stage, we first reshape ρtsubscript𝜌𝑡\rho_{t}italic_ρ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to H×Wsuperscript𝐻subscript𝑊\mathbb{R}^{H\times W_{*}}blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, and concatenate 𝐗tsubscript𝐗𝑡\mathbf{X}_{t}bold_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT with the reshaped ρtsubscript𝜌𝑡\rho_{t}italic_ρ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to extract shallow feature 𝐅sH×W×Csubscript𝐅𝑠superscript𝐻subscript𝑊𝐶\mathbf{F}_{s}\in\mathbb{R}^{H\times W_{*}\times C}bold_F start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT × italic_C end_POSTSUPERSCRIPT via a convolutional layer, where W=δs(Nω1)+Wsubscript𝑊subscript𝛿𝑠subscript𝑁𝜔1𝑊W_{*}=\delta_{s}(N_{\omega}-1)+Witalic_W start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT = italic_δ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_N start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT - 1 ) + italic_W, and C𝐶Citalic_C is the feature dimension. 
Then we forward 𝐅ssubscript𝐅𝑠\mathbf{F}_{s}bold_F start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT to the encoder, bottleneck and decoder to obtain the deep feature 𝐅dH×W×Csubscript𝐅𝑑superscript𝐻subscript𝑊𝐶\mathbf{F}_{d}\in\mathbb{R}^{H\times W_{*}\times C}bold_F start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT × italic_C end_POSTSUPERSCRIPT. The encoder and decoder comprise N1subscript𝑁1N_{1}italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT pairs of dual hyperspectral S4 block (DHSB) and the resizing module, while the bottleneck only has N2subscript𝑁2N_{2}italic_N start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT DHSBs. In Fig. 3a, we visualize the pipeline of our DHM when N1=2subscript𝑁12N_{1}=2italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 2 and N2=1subscript𝑁21N_{2}=1italic_N start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 1 for better demonstration. In Fig. 3b, the DHSB includes a global hyperspectral S4 block (GHSB), a local hyperspectral S4 block (LHSB), a gated feed-forward network (GFFN) and three layer normalization (LN). Fig. 3c presents the components of GHSB and LHSB, which are the two most important modules in our DHM. Apart from the reshaping operation, they have the same architectures. Particularly, the GHSB can use global receptive fields to model long-range dependencies, and the LHSB aims to address local context neglect by constructing structured state space sequence (S4) model within local windows. Besides, Fig. 3d shows the design of GFFN module. 
Then we perform a convolution operation on $\mathbf{F}_d$ to obtain $\mathbf{F}_z\in\mathbb{R}^{H\times W_*\times N_\omega}$. Finally, we sum $\mathbf{X}_t$ and $\mathbf{F}_z$ to generate the denoised image $\mathbf{Z}_t\in\mathbb{R}^{H\times W_*\times N_\omega}$ at the $t$-th iterative stage. In the following subsections, we introduce the detailed components of the GHSB and LHSB.
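To make the tensor shapes in this pipeline concrete, the following is a minimal NumPy sketch of the input and output handling of the denoiser: expanding $\rho_t$ to an $H\times W_*$ map, concatenating it with $\mathbf{X}_t$, and adding the residual. The sizes ($H=W=256$, $N_\omega=28$, $\delta_s=2$) are taken from the implementation details of this paper; the function names and the `body` stand-in for the convolutions and the encoder-bottleneck-decoder are hypothetical and only serve the illustration.

```python
import numpy as np

def padded_width(W, n_bands, shift_step):
    # W_* = delta_s * (N_omega - 1) + W
    return shift_step * (n_bands - 1) + W

def denoiser_io(X_t, rho_t, body):
    """Expand rho_t to an H x W_* map, concatenate, run `body`, add the residual."""
    H, W_star, _ = X_t.shape
    rho_map = np.full((H, W_star, 1), rho_t, dtype=X_t.dtype)
    net_in = np.concatenate([X_t, rho_map], axis=-1)  # H x W_* x (N_omega + 1)
    F_z = body(net_in)                                # expected: H x W_* x N_omega
    return X_t + F_z                                  # Z_t = X_t + F_z

H, W, n_bands, delta_s = 256, 256, 28, 2
W_star = padded_width(W, n_bands, delta_s)
X_t = np.random.default_rng(0).standard_normal((H, W_star, n_bands)).astype(np.float32)

# Hypothetical stand-in for conv + encoder/bottleneck/decoder + conv (not the real network).
body = lambda z: 0.1 * z[..., :n_bands]
Z_t = denoiser_io(X_t, 0.5, body)
print(W_star, Z_t.shape)
```

With these sizes the dispersed width is $W_*=2\cdot 27+256=310$, so the denoiser operates on $256\times 310$ maps rather than on the $256\times 256$ crop itself.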

Figure 3: Algorithmic pipeline of our DHM. (a) Architecture of our DHM at the $t$-th iterative stage. (b) Each DHSB is composed of a GHSB, an LHSB, a GFFN and three LN layers. (c) Diagram of the GHSB and LHSB modules. (d) Components of the GFFN. (e) Design of the HSI-SSM.

Global Hyperspectral S4 Block (GHSB) constructs an S4 model on the entire high-resolution HSI to capture global contexts using global receptive fields. As shown in Fig. 3c, we forward a given feature $\mathbf{F}_i\in\mathbb{R}^{H\times W_*\times D}$ into two branches, where $D\in\{C,2C,4C\}$ denotes the feature dimension at different levels of the encoder, bottleneck and decoder. Specifically, the upper branch encodes $\mathbf{F}_i$ to $\mathbf{F}_u=\boldsymbol{\sigma}(\mathbf{P}_u(\mathbf{W}_d(\mathbf{F}_i)))\in\mathbb{R}^{H\times W_*\times D}$ via a linear projection $\mathbf{P}_u(\cdot)$, a depth-wise convolution $\mathbf{W}_d(\cdot)$ and a SiLU activation function $\boldsymbol{\sigma}(\cdot)$.
Then we reshape $\mathbf{F}_u$ as $\mathbf{F}_s^g\in\mathbb{R}^{1\times H\times W_*\times D}$, and input it into $\mathrm{HSI}\text{-}\mathrm{SSM}(\cdot)$ to model long-range dependencies using global receptive fields. As a result, we can formulate the output feature $\mathbf{F}_o^g\in\mathbb{R}^{H\times W_*\times D}$ of the GHSB module as follows:

$$\mathbf{F}_o^g=\mathbf{P}_o\big(\mathrm{LN}(\mathrm{RS}(\mathrm{HSI}\text{-}\mathrm{SSM}(\mathbf{F}_s^g)))\otimes\mathbf{F}_l\big), \tag{9}$$

where $\otimes$ denotes the element-wise multiplication. $\mathbf{F}_l=\boldsymbol{\sigma}(\mathbf{P}_l(\mathbf{F}_i))\in\mathbb{R}^{H\times W_*\times D}$ denotes the output of the lower branch in Fig. 3c, and $\mathbf{P}_l(\cdot)$ is the linear mapping. $\mathrm{LN}(\cdot)$ is the layer normalization (LN), $\mathrm{RS}(\cdot)$ reshapes the given feature to $\mathbb{R}^{H\times W_*\times D}$, and $\mathbf{P}_o$ is the linear projection to obtain $\mathbf{F}_o^g$. Moreover, $\mathrm{HSI}\text{-}\mathrm{SSM}(\cdot)$ denotes the proposed hyperspectral image state space module (HSI-SSM).
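The gating in Eq. (9) can be illustrated with a short NumPy sketch. It is not the authors' implementation: the projections $\mathbf{P}_l$ and $\mathbf{P}_o$ are assumed to be bias-free matrices (`W_l`, `W_o`), the layer normalization is assumed to have no learnable affine parameters, and the output of the HSI-SSM is passed in as a precomputed array `ssm_out` rather than computed here.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    # Normalize over the channel axis; no learnable scale/shift (assumption).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gated_output(F_i, ssm_out, W_l, W_o):
    """Eq. (9): F_o = P_o( LN(RS(ssm_out)) * F_l ), with F_l = SiLU(P_l(F_i))."""
    H, W_star, D = F_i.shape
    F_l = silu(F_i @ W_l)                          # lower branch, H x W_* x D
    y = layer_norm(ssm_out.reshape(H, W_star, D))  # RS(.) followed by LN(.)
    return (y * F_l) @ W_o                         # element-wise gate, then P_o

rng = np.random.default_rng(0)
H, W_star, D = 4, 6, 8                             # toy sizes, not the paper's
F_i = rng.standard_normal((H, W_star, D))
ssm_out = rng.standard_normal((1, H * W_star, D))  # G x L x D with G = 1
W_l = rng.standard_normal((D, D))
W_o = rng.standard_normal((D, D))
F_o = gated_output(F_i, ssm_out, W_l, W_o)
print(F_o.shape)
```

The lower branch acts as a per-pixel, per-channel gate: wherever $\mathbf{F}_l$ is close to zero, the corresponding state-space response is suppressed before the output projection.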

HyperSpectral Image State Space Module (HSI-SSM) can model long-range cross-pixel interactions to explore global contexts of $\mathbf{F}_i$ using global receptive fields. As shown in Fig. 3e, given the input feature $\mathbf{F}_s^g\in\mathbb{R}^{1\times H\times W_*\times D}$, we unfold the entire hyperspectral image (HSI), which includes $H\times W_*$ pixels, into four one-dimensional sequences of length $HW_*$ by scanning these pixels along four distinct traversal paths: from the top-left to the bottom-right, from the top-right to the bottom-left, from the bottom-right to the top-left, and from the bottom-left to the top-right. We denote the four sequence features as $\{\mathbf{S}_u\in\mathbb{R}^{G\times L\times D}\}_{u=1}^{n_s}$, where $n_s=4$, $G=1$, and $L=HW_*$ denotes the sequence length in the GHSB.
Motivated by Mamba [24, 40, 69], we construct some enhanced discrete state space equations on the $u$-th ($u=1,\cdots,n_s$) sequence feature $\mathbf{S}_u$. Specifically, after defining the learnable variables $\mathbf{A}\in\mathbb{R}^{D\times D_s}$ and $\mathbf{E}\in\mathbb{R}^{G\times L\times D}$, we can formulate some continuous parameters such as $\mathbf{B}\in\mathbb{R}^{G\times L\times D_s}$, $\mathbf{C}\in\mathbb{R}^{G\times L\times D_s}$ and a timescale parameter $\triangle\in\mathbb{R}^{G\times L\times D}$ as:
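The four traversal paths can be sketched as follows. The text specifies the start and end corner of each path but not the order in which the pixels in between are visited; this sketch assumes a plain row-by-row traversal of a flipped copy of the map, which is only one possible reading. The function names `scan_four` and `merge_four` are hypothetical.

```python
import numpy as np

def scan_four(F):
    """F: (H, W, D) -> (4, H*W, D), one row-by-row sequence per start corner."""
    H, W, D = F.shape
    views = [
        F,               # top-left     -> bottom-right
        F[:, ::-1],      # top-right    -> bottom-left
        F[::-1, ::-1],   # bottom-right -> top-left
        F[::-1, :],      # bottom-left  -> top-right
    ]
    return np.stack([v.reshape(H * W, D) for v in views])

def merge_four(Y, H, W):
    """Undo each traversal and sum the four maps: (4, H*W, D) -> (H, W, D)."""
    D = Y.shape[-1]
    m = [y.reshape(H, W, D) for y in Y]
    return m[0] + m[1][:, ::-1] + m[2][::-1, ::-1] + m[3][::-1, :]

H, W = 2, 3
F = np.arange(H * W, dtype=float).reshape(H, W, 1)  # pixel index as the feature
S = scan_four(F)
print(S[..., 0])
```

Undoing each traversal before the summation keeps the four outputs aligned pixel by pixel, so that every location receives context from all four directions.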

$$\mathbf{B}=\mathbf{P}_b(\mathbf{S}_u),~~\mathbf{C}=\mathbf{P}_c(\mathbf{S}_u),~~\triangle=\tau_\triangle(\mathbf{E}+\mathbf{P}_\triangle(\mathbf{S}_u)), \tag{10}$$

where $D_s$ is the latent feature dimension, and $\tau_\triangle(\cdot)$ is the softplus activation function. $\mathbf{P}_b(\cdot)$, $\mathbf{P}_c(\cdot)$ and $\mathbf{P}_\triangle(\cdot)$ are the linear projection matrices. Inspired by the zero-order hold (ZOH) discretization rule [24], we reshape the parameter $\triangle$ as $\overline{\triangle}\in\mathbb{R}^{G\times L\times D\times 1}$, and utilize it to transform the continuous parameters $\mathbf{A}$ and $\mathbf{B}$ into the discrete parameters $\overline{\mathbf{A}}\in\mathbb{R}^{G\times L\times D\times D_s}$ and $\overline{\mathbf{B}}\in\mathbb{R}^{G\times L\times D\times D_s}$:

$$\overline{\mathbf{A}}=\exp(\overline{\triangle}\mathbf{A}),~\overline{\mathbf{B}}=(\overline{\triangle}\mathbf{A})^{-1}(\exp(\overline{\triangle}\mathbf{A})-\mathbf{I})\cdot\overline{\triangle}\mathbf{B}. \tag{11}$$
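As a rough illustration of Eq. (11), the sketch below applies the ZOH rule with NumPy broadcasting. It assumes that $\mathbf{A}$ acts element-wise (i.e., a diagonal state matrix per channel, as is common in Mamba-style models), so that the matrix inverse and exponential reduce to element-wise operations, and that its entries are negative and non-zero. The function names and toy sizes are hypothetical.

```python
import numpy as np

def softplus(x):
    # tau(x) = log(1 + exp(x)), computed stably
    return np.logaddexp(0.0, x)

def zoh_discretize(delta, A, B):
    """
    delta: (G, L, D)  positive step sizes
    A:     (D, Ds)    element-wise (diagonal) state matrix, non-zero entries (assumption)
    B:     (G, L, Ds)
    Returns A_bar and B_bar, both of shape (G, L, D, Ds).
    """
    dA = delta[..., None] * A                 # (G, L, D, Ds) by broadcasting
    A_bar = np.exp(dA)
    dB = delta[..., None] * B[:, :, None, :]  # (G, L, D, Ds)
    B_bar = np.expm1(dA) / dA * dB            # (dA)^-1 (exp(dA) - 1) * dB
    return A_bar, B_bar

rng = np.random.default_rng(0)
G, L, D, Ds = 1, 6, 4, 3
A = -np.exp(rng.standard_normal((D, Ds)))         # negative entries keep A_bar in (0, 1)
B = rng.standard_normal((G, L, Ds))
delta = softplus(rng.standard_normal((G, L, D)))  # stands in for tau(E + P(S_u)) in Eq. (10)
A_bar, B_bar = zoh_discretize(delta, A, B)
print(A_bar.shape, B_bar.shape)
```

Under the element-wise assumption the step size cancels in $\overline{\mathbf{B}}$, which equals $(\overline{\mathbf{A}}-1)/\mathbf{A}\cdot\mathbf{B}$; for small steps this approaches $\overline{\triangle}\mathbf{B}$.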

After obtaining the discrete $\overline{\mathbf{A}}$ and $\overline{\mathbf{B}}$ via Eq. (11), we reshape the parameter $\mathbf{C}$ as $\overline{\mathbf{C}}\in\mathbb{R}^{G\times L\times D_s\times 1}$, and formulate the semantic encoding of $\mathbf{S}_u$ in the form of recurrent neural networks (RNNs) to extract a new sequence feature $\mathbf{y}_k^u\in\mathbb{R}^{G\times L\times D}$. Then we denote $\mathbf{h}_{k-1}^u,\mathbf{h}_k^u\in\mathbb{R}^{G\times L\times D\times D_s}$ as the latent features of the $(k-1)$-th and $k$-th hidden states in the RNNs, and define $\mathbf{y}_k^u$ as follows:

$$\mathbf{h}_k^u=\overline{\mathbf{A}}\mathbf{h}_{k-1}^u+\overline{\mathbf{B}}\mathbf{S}_u,~~\mathbf{y}_k^u=\overline{\mathbf{C}}\mathbf{h}_k^u+\nu\cdot\mathbf{S}_u, \tag{12}$$

where $\nu$ denotes the scale parameter. Inspired by [24], we use the broadcasting mechanism to match the dimensions of different matrices for matrix multiplication operations in Eqs. (11) and (12). Then we merge all sequence features $\{\mathbf{y}_k^u\}_{u=1}^{n_s}$ to get the final output map $\mathbf{y}=\sum_{u=1}^{n_s}\mathbf{y}_k^u$ of the HSI-SSM. In the GHSB, we utilize the HSI-SSM to encode the entire high-resolution HSI in a recursive manner. It can explore long-range dependencies of the input feature $\mathbf{F}_i$ using global receptive fields.
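The recurrence in Eq. (12) can be written as a plain sequential loop, shown here as a minimal NumPy sketch for a single scanning direction. It assumes a zero initial hidden state and element-wise (diagonal) state updates, neither of which is stated explicitly in the text; efficient implementations typically replace the Python loop with a parallel scan. The function name `ssm_scan` is hypothetical.

```python
import numpy as np

def ssm_scan(S, A_bar, B_bar, C, nu):
    """
    S:            (G, L, D)      input sequence
    A_bar, B_bar: (G, L, D, Ds)  discretized parameters from Eq. (11)
    C:            (G, L, Ds)     output projection per step
    nu:           scalar or (D,) skip-connection scale
    Returns y of shape (G, L, D).
    """
    G, L, D = S.shape
    Ds = A_bar.shape[-1]
    h = np.zeros((G, D, Ds))                 # zero initial state (assumption)
    y = np.empty((G, L, D))
    for k in range(L):
        # h_k = A_bar_k * h_{k-1} + B_bar_k * x_k  (element-wise state update)
        h = A_bar[:, k] * h + B_bar[:, k] * S[:, k, :, None]
        # y_k = C_k h_k + nu * x_k  (contract over the state dimension Ds)
        y[:, k] = np.einsum("gds,gs->gd", h, C[:, k]) + nu * S[:, k]
    return y

# Toy check: constant decay 0.5 and unit input give a geometric partial sum.
S = np.ones((1, 4, 1))
A_bar = np.full((1, 4, 1, 1), 0.5)
B_bar = np.ones((1, 4, 1, 1))
C = np.ones((1, 4, 1))
y = ssm_scan(S, A_bar, B_bar, C, nu=0.0)
print(y[0, :, 0])
```

Each output depends on all earlier positions of the sequence through the hidden state, which is why scanning in four directions is needed for every pixel to see the whole image.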

Local Hyperspectral S4 Block (LHSB) aims to explore local contexts within position-specific windows. Different from the GHSB, which uses the HSI-SSM to unfold and scan the entire high-resolution HSI containing $H\times W_*$ pixels, the LHSB scans each local window of $N\times N$ pixels to capture local contexts. Specifically, as shown in Fig. 3c, after encoding the given feature $\mathbf{F}_i$ to $\mathbf{F}_u\in\mathbb{R}^{H\times W_*\times D}$ via the upper branch, we partition $\mathbf{F}_u$ into $\frac{H}{N}\times\frac{W_*}{N}$ non-overlapping windows, then reshape it as $\mathbf{F}_s^l\in\mathbb{R}^{\frac{HW_*}{N^2}\times N\times N\times D}$, and input $\mathbf{F}_s^l$ into the HSI-SSM, where $\frac{HW_*}{N^2}$ denotes the number of windows and each window includes $N^2$ pixels. In the HSI-SSM, we flatten each window of $N^2$ pixels and scan it along four distinct directions to obtain four sequence features $\{\mathbf{S}_u\in\mathbb{R}^{G\times L\times D}\}_{u=1}^{n_s}$. Note that we set $G=\frac{HW_*}{N^2}$ and $L=N^2$ in the LHSB, which differ from the settings in the GHSB.
After encoding each sequence in $\{\mathbf{S}_u\}_{u=1}^{n_s}$ in a recursive manner to get $\{\mathbf{y}_k^u\}_{u=1}^{n_s}$, we sum them to get the output map $\mathbf{y}\in\mathbb{R}^{G\times L\times D}$ of the HSI-SSM. The LHSB can capture local contexts of the HSI by encoding different local windows of the given feature $\mathbf{F}_i$ in a recursive manner. Thus, we formulate the final feature $\mathbf{F}_o^l\in\mathbb{R}^{H\times W_*\times D}$ output by the LHSB as follows:

$$\mathbf{F}_o^l=\mathbf{P}_o\big(\mathrm{LN}(\mathrm{RS}(\mathrm{HSI}\text{-}\mathrm{SSM}(\mathbf{F}_s^l)))\otimes\mathbf{F}_l\big). \tag{13}$$
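The window partition used by the LHSB, and its inverse $\mathrm{RS}(\cdot)$, can be sketched with NumPy reshapes. This is an illustrative sketch only: it requires $H$ and $W_*$ to be divisible by $N$, whereas how sizes that are not multiples of $N$ are handled (e.g., by padding) is not described in this section. The function names and toy sizes are hypothetical.

```python
import numpy as np

def window_partition(F, N):
    """(H, W, D) -> (H*W/N^2, N, N, D): non-overlapping N x N windows."""
    H, W, D = F.shape
    assert H % N == 0 and W % N == 0, "this sketch needs sizes divisible by N"
    x = F.reshape(H // N, N, W // N, N, D).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, N, N, D)

def window_merge(wins, H, W):
    """Inverse of window_partition: (H*W/N^2, N, N, D) -> (H, W, D)."""
    N, D = wins.shape[1], wins.shape[-1]
    x = wins.reshape(H // N, W // N, N, N, D).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, D)

H, W, D, N = 16, 24, 3, 8                 # toy sizes; N = 8 matches the paper's setting
F_u = np.random.default_rng(0).standard_normal((H, W, D))
F_s_l = window_partition(F_u, N)          # G = H*W/N^2 windows
seqs = F_s_l.reshape(F_s_l.shape[0], N * N, D)  # G x L x D with L = N^2
print(F_s_l.shape, seqs.shape)
```

Treating the window index as the group dimension $G$ lets the same HSI-SSM process all windows in one call, with each window scanned as an independent sequence of length $N^2$.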

Optimization: As shown in Fig. 2, we utilize $\mathcal{E}(\cdot)$ and $\mathcal{D}(\cdot)$ (i.e., our DHM) to iteratively update $\mathbf{X}_t$ and $\mathbf{Z}_t$ in Eq. (8) until the $T$-th stage. After getting $\mathbf{Z}_T$ at the $T$-th stage, we follow [14, 15] to train our DHM by minimizing the Charbonnier loss between the ground truth and the reconstructed HSI $\mathbf{Z}_T$.
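For reference, a common form of the Charbonnier loss is $\sqrt{(\mathbf{Z}_T-\mathbf{X}^{gt})^2+\epsilon^2}$ averaged over all entries, where $\mathbf{X}^{gt}$ denotes the ground truth. The sketch below uses this form; the value of $\epsilon$ and the mean reduction are assumptions, since the exact variant adopted from [14, 15] is not given here.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Smooth L1-like penalty: mean of sqrt(diff^2 + eps^2). eps is an assumed value."""
    diff = pred - target
    return float(np.mean(np.sqrt(diff * diff + eps * eps)))

Z_T = np.zeros((4, 4, 28))          # toy reconstructed cube
X_gt = np.full((4, 4, 28), 3.0)     # toy ground truth
print(charbonnier_loss(Z_T, X_gt, eps=4.0))
```

For residuals much larger than $\epsilon$ the loss behaves like the L1 distance, while near zero it stays differentiable, which is the usual motivation for preferring it over a plain L1 loss.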

4 Experiments

4.1 Implementation Details

For fair comparisons, we use exactly the same experimental configurations as existing HSI reconstruction methods [7, 70, 10, 14, 6, 27] to validate the effectiveness of our DHM. Following the settings of [27, 48, 6, 49], we perform spectral interpolation on the original HSIs and choose a wide spectral range from 450 nm to 650 nm for comparisons on both the simulation and real datasets. The simulation dataset is composed of two subsets: KAIST [11] and CAVE [56]. We employ the CAVE subset to train our DHM, and select 10 HSIs from KAIST to evaluate performance. Moreover, the real dataset [49] consists of five HSI cubes, which are captured by the practical CASSI system [49].

During training, we employ the Adam optimizer [33] to train all variants of our DHM on a single NVIDIA A100 GPU, where the initial learning rate is $1.0\times 10^{-3}$ and the number of training epochs is set to 300. Following [7, 70, 14, 27], we randomly crop HSI cubes to $256\times 256\times 28$ for the simulation dataset, and $660\times 660\times 28$ for the real dataset. The shifting step of dispersion in the CASSI system is set to $\delta_s=2$. Moreover, we set $C=28$, $N=8$, $N_1=2$, $N_2=1$ and $D=D_s$ in this paper. Motivated by baseline HSI reconstruction methods [14, 15], we share the network weights of our DHM across different stages, and use exactly the same data augmentation to train our DHM.

Table 1: Performance of our DHM and other comparison methods on the simulation dataset with 10 scenes (S1$\sim$S10). Each cell reports PSNR (dB) / SSIM.
| Comparison Methods | #Params | GFLOPs | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TwIST [2] | - | - | 25.16 / 0.700 | 23.02 / 0.604 | 21.40 / 0.711 | 30.19 / 0.851 | 21.41 / 0.635 | 20.95 / 0.644 | 22.20 / 0.643 | 21.82 / 0.650 | 22.42 / 0.690 | 22.67 / 0.569 | 23.12 / 0.669 |
| $\lambda$-Net [52] | 62.64M | 117.98 | 30.10 / 0.849 | 28.49 / 0.805 | 27.73 / 0.870 | 37.01 / 0.934 | 26.19 / 0.817 | 28.64 / 0.853 | 26.47 / 0.806 | 26.09 / 0.831 | 27.50 / 0.826 | 27.13 / 0.816 | 28.53 / 0.841 |
| DNU [67] | 1.19M | 163.48 | 31.72 / 0.863 | 31.13 / 0.846 | 29.99 / 0.845 | 35.34 / 0.908 | 29.03 / 0.833 | 30.87 / 0.887 | 28.99 / 0.839 | 30.13 / 0.885 | 31.03 / 0.876 | 29.14 / 0.849 | 30.74 / 0.863 |
| DIP-HSI [51] | 33.85M | 64.42 | 32.68 / 0.890 | 27.26 / 0.833 | 31.30 / 0.914 | 40.54 / 0.962 | 29.79 / 0.900 | 30.39 / 0.877 | 28.18 / 0.913 | 29.44 / 0.874 | 34.51 / 0.927 | 28.51 / 0.851 | 31.26 / 0.894 |
| DGSMP [29] | 3.76M | 646.65 | 33.26 / 0.915 | 32.09 / 0.898 | 33.06 / 0.925 | 40.54 / 0.964 | 28.86 / 0.882 | 33.08 / 0.937 | 30.74 / 0.886 | 31.55 / 0.923 | 31.66 / 0.911 | 31.44 / 0.925 | 32.63 / 0.917 |
| GAP-Net [48] | 4.27M | 78.58 | 33.74 / 0.911 | 33.26 / 0.900 | 34.28 / 0.929 | 41.03 / 0.967 | 31.44 / 0.919 | 32.40 / 0.925 | 32.27 / 0.902 | 30.46 / 0.905 | 33.51 / 0.915 | 30.24 / 0.895 | 33.26 / 0.917 |
| ADMM-Net [42] | 4.27M | 78.58 | 34.12 / 0.918 | 33.62 / 0.902 | 35.04 / 0.931 | 41.15 / 0.966 | 31.82 / 0.922 | 32.54 / 0.924 | 32.42 / 0.896 | 30.74 / 0.907 | 33.75 / 0.915 | 30.68 / 0.895 | 33.58 / 0.918 |
| HDNet [27] | 2.37M | 154.76 | 35.14 / 0.935 | 35.67 / 0.940 | 36.03 / 0.943 | 42.30 / 0.969 | 32.69 / 0.946 | 34.46 / 0.952 | 33.67 / 0.926 | 32.48 / 0.941 | 34.89 / 0.942 | 32.38 / 0.937 | 34.97 / 0.943 |
| MST-L [5] | 2.03M | 28.15 | 35.40 / 0.941 | 35.87 / 0.944 | 36.51 / 0.953 | 42.27 / 0.973 | 32.77 / 0.947 | 34.80 / 0.955 | 33.66 / 0.925 | 32.67 / 0.948 | 35.39 / 0.949 | 32.50 / 0.941 | 35.18 / 0.948 |
| MST++ [6] | 1.33M | 19.42 | 35.80 / 0.943 | 36.23 / 0.947 | 37.34 / 0.957 | 42.63 / 0.973 | 33.38 / 0.952 | 35.38 / 0.957 | 34.35 / 0.934 | 33.71 / 0.953 | 36.67 / 0.953 | 33.38 / 0.945 | 35.99 / 0.951 |
| CST-L [4] | 3.00M | 40.01 | 35.96 / 0.949 | 36.84 / 0.955 | 38.16 / 0.962 | 42.44 / 0.975 | 33.25 / 0.955 | 35.72 / 0.963 | 34.86 / 0.944 | 34.34 / 0.961 | 36.51 / 0.957 | 33.09 / 0.945 | 36.12 / 0.957 |
| BIRNAT [10] | 4.40M | 2122.66 | 36.79 / 0.951 | 37.89 / 0.957 | 40.61 / 0.971 | 46.94 / 0.985 | 35.42 / 0.964 | 35.30 / 0.959 | 36.58 / 0.955 | 33.96 / 0.956 | 39.47 / 0.970 | 32.80 / 0.938 | 37.58 / 0.960 |
| LDMUN [70] |  |  | 38.07 / 0.969 | 41.16 / 0.982 | 43.70 / 0.983 | 48.01 / 0.993 | 37.76 / 0.980 | 37.65 / 0.980 | 38.58 / 0.973 | 36.31 / 0.979 | 42.66 / 0.984 | 35.18 / 0.967 | 39.91 / 0.979 |
| DAUHST [8] | 6.15M | 79.50 | 37.25 / 0.958 | 39.02 / 0.967 | 41.05 / 0.971 | 46.15 / 0.983 | 35.80 / 0.969 | 37.08 / 0.970 | 37.57 / 0.963 | 35.10 / 0.966 | 40.02 / 0.970 | 34.59 / 0.956 | 38.36 / 0.967 |
| PADUT [38] | 5.38M | 90.46 | 37.36 / 0.962 | 40.43 / 0.978 | 42.38 / 0.979 | 46.62 / 0.990 | 36.26 / 0.974 | 37.27 / 0.974 | 37.83 / 0.966 | 35.33 / 0.974 | 40.86 / 0.978 | 34.55 / 0.963 | 38.89 / 0.974 |
| RDLUF [15] | 1.89M | 115.34 | 37.94 / 0.966 | 40.95 / 0.977 | 43.25 / 0.979 | 47.83 / 0.990 | 37.11 / 0.976 | 37.47 / 0.975 | 38.58 / 0.969 | 35.50 / 0.970 | 41.83 / 0.978 | 35.23 / 0.962 | 39.57 / 0.974 |
| DERNN (3stg) [14] | 0.65M | 27.41 | 37.54 / 0.964 | 39.23 / 0.973 | 42.01 / 0.979 | 47.08 / 0.992 | 36.03 / 0.973 | 36.82 / 0.974 | 37.34 / 0.966 | 35.04 / 0.971 | 40.97 / 0.978 | 34.39 / 0.960 | 38.65 / 0.973 |
| DERNN (5stg) [14] | 0.65M | 45.60 | 37.86 / 0.963 | 40.28 / 0.976 | 42.69 / 0.978 | 47.97 / 0.990 | 37.11 / 0.975 | 37.23 / 0.974 | 37.97 / 0.967 | 35.82 / 0.971 | 41.93 / 0.979 | 34.98 / 0.959 | 39.38 / 0.973 |
| DERNN (7stg) [14] | 0.65M | 63.80 | 37.91 / 0.964 | 40.75 / 0.978 | 42.95 / 0.978 | 47.51 / 0.990 | 37.81 / 0.978 | 37.37 / 0.975 | 38.49 / 0.970 | 35.83 / 0.971 | 42.47 / 0.980 | 35.04 / 0.961 | 39.61 / 0.974 |
| DERNN (9stg) [14] | 0.65M | 81.99 | 38.26 / 0.965 | 40.97 / 0.979 | 43.22 / 0.979 | 48.10 / 0.991 | 38.08 / 0.980 | 37.41 / 0.975 | 38.83 / 0.971 | 36.41 / 0.973 | 42.87 / 0.981 | 35.15 / 0.962 | 39.93 / 0.976 |
| DERNN (9stg) [14] | 1.09M | 134.18 | 38.49 / 0.968 | 41.27 / 0.980 | 43.97 / 0.980 | 48.61 / 0.992 | 38.29 / 0.981 | 37.81 / 0.977 | 39.30 / 0.973 | 36.51 / 0.974 | 43.38 / 0.983 | 35.61 / 0.966 | 40.33 / 0.977 |
| DHM-light (3stg) | 0.66M | 26.42 | 37.67 / 0.965 | 39.58 / 0.974 | 42.67 / 0.981 | 47.90 / 0.993 | 36.47 / 0.975 | 36.76 / 0.975 | 37.72 / 0.968 | 35.14 / 0.972 | 41.65 / 0.981 | 34.35 / 0.961 | 38.99 / 0.975 |
| DHM-light (5stg) | 0.66M | 43.96 | 38.17 / 0.971 | 40.91 / 0.981 | 43.78 / 0.983 | 47.18 / 0.993 | 37.41 / 0.980 | 37.51 / 0.978 | 38.78 / 0.973 | 35.83 / 0.977 | 43.26 / 0.985 | 35.28 / 0.968 | 39.81 / 0.979 |
| DHM-light (7stg) | 0.66M | 61.50 | 38.58 / 0.972 | 41.42 / 0.983 | 43.93 / 0.984 | 47.95 / 0.993 | 38.29 / 0.983 | 37.88 / 0.980 | 39.03 / 0.974 | 36.26 / 0.979 | 43.25 / 0.986 | 35.42 / 0.970 | 40.20 / 0.980 |
| DHM-light (9stg) | 0.66M | 79.04 | 38.78 / 0.972 | 41.44 / 0.983 | 44.07 / 0.984 | 48.16 / 0.994 | 38.32 / 0.983 | 37.45 / 0.980 | 39.22 / 0.976 | 36.37 / 0.980 | 43.75 / 0.987 | 35.73 / 0.972 | 40.33 / 0.981 |
| DHM (3stg) | 0.92M | 36.34 | 37.63 / 0.967 | 39.85 / 0.976 | 43.40 / 0.982 | 47.56 / 0.993 | 36.37 / 0.976 | 36.98 / 0.975 | 38.05 / 0.970 | 34.94 / 0.972 | 42.04 / 0.982 | 34.42 / 0.962 | 39.13 / 0.975 |
| DHM (5stg) | 0.92M | 60.50 | 38.48 / 0.972 | 41.14 / 0.982 | 44.10 / 0.984 | 48.03 / 0.993 | 37.82 / 0.981 | 37.95 / 0.979 | 39.21 / 0.975 | 36.34 / 0.978 | 43.31 / 0.986 | 35.20 / 0.967 | 40.16 / 0.980 |
| DHM (7stg) | 0.92M | 84.65 | 38.40 / 0.972 | 41.52 / 0.983 | 44.21 / 0.984 | 47.93 / 0.994 | 38.21 / 0.983 | 38.17 / 0.981 | 39.58 / 0.976 | 36.17 / 0.978 | 43.56 / 0.986 | 35.60 / 0.970 | 40.34 / 0.981 |
| DHM (9stg) | 0.92M | 108.80 | 38.50 / 0.972 | 41.64 / 0.984 | 44.37 / 0.985 | 48.13 / 0.994 | 38.33 / 0.983 | 38.27 / 0.982 | 39.70 / 0.977 | 36.52 / 0.980 | 43.89 / 0.988 | 35.75 / 0.971 | 40.50 / 0.982 |

4.2 Quantitative Performance Comparisons

As shown in Tab. 1, we present comprehensive quantitative comparisons between our DHM and SOTA HSI reconstruction methods on the simulation dataset with 10 scenes (S1$\sim$S10). From the results in Tab. 1, we observe that the proposed DHM (9stg) (i.e., our DHM at the $9$-th stage) achieves the best average HSI reconstruction performance (i.e., 40.50 dB in PSNR and 0.982 in SSIM). Our DHM (9stg) substantially surpasses existing methods [2, 39, 51, 4, 28], especially several recent SOTA comparison models (e.g., DAUHST [8], LDMUN [70], RDLUF-Mix [15], DERNN [14]) by $0.57\sim 2.14$ dB. Such improvements verify the effectiveness of our DHM in exploring long-range dependencies across the entire high-resolution HSIs using global receptive fields, while capturing local contexts within local windows. More importantly, our DHM outperforms existing methods with a smaller model size and lower computational cost. Compared with the larger SOTA DERNN (9stg) [14] variant with 1.09M parameters, our DHM (9stg) improves PSNR by 0.17 dB and SSIM by 0.005, while consuming only 84.40% (0.92M / 1.09M) of the parameters and 81.09% (108.80 / 134.18) of the GFLOPs. Moreover, we propose a light model (i.e., DHM-light) where each DHSB contains a single global hyperspectral S4 block (GHSB) and a GFFN. In Tab. 1, our DHM-light at the 3/5/7/9-th stage significantly improves over other comparison methods (e.g., DERNN [14]) with the same number of stages, while retaining a comparable model size and fewer GFLOPs. This illustrates the effectiveness of our DHM for the HSI reconstruction task.
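The ratios and gaps quoted above follow directly from the average column, parameter counts and GFLOPs in Tab. 1, and can be re-derived with a few lines of Python. The dictionary layout and variable names are only for illustration; the numbers are copied from Tab. 1.

```python
# Values copied from Tab. 1 (average PSNR/SSIM, parameters in millions, GFLOPs).
dhm_9stg = {"params_M": 0.92, "gflops": 108.80, "psnr": 40.50, "ssim": 0.982}
dernn_9stg_large = {"params_M": 1.09, "gflops": 134.18, "psnr": 40.33, "ssim": 0.977}
dernn_9stg_small_psnr = 39.93  # DERNN (9stg) with 0.65M parameters
dauhst_psnr = 38.36            # DAUHST

param_ratio = round(100 * dhm_9stg["params_M"] / dernn_9stg_large["params_M"], 2)
gflops_ratio = round(100 * dhm_9stg["gflops"] / dernn_9stg_large["gflops"], 2)
psnr_gain = round(dhm_9stg["psnr"] - dernn_9stg_large["psnr"], 2)
ssim_gain = round(dhm_9stg["ssim"] - dernn_9stg_large["ssim"], 3)
gap_min = round(dhm_9stg["psnr"] - dernn_9stg_small_psnr, 2)
gap_max = round(dhm_9stg["psnr"] - dauhst_psnr, 2)

print(param_ratio, gflops_ratio, psnr_gain, ssim_gain, gap_min, gap_max)
```

The lower end of the reported PSNR gap corresponds to the 0.65M-parameter DERNN (9stg) and the upper end to DAUHST, while the 0.17 dB figure refers to the larger 1.09M-parameter DERNN (9stg).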

Figure 4: Qualitative results on Scene 7 (S7) of the simulation dataset (zoom in for a better view).
Figure 5: Qualitative comparisons on Scene 4 (S4) of the real dataset (zoom in for a better view).

4.3 Qualitative Performance Comparisons

Simulation Dataset: As depicted in Fig. 4, we select 4 out of the 28 spectral channels to visualize qualitative comparisons of HSI reconstruction on Scene 7 (S7) of the simulation dataset. For better visibility, we zoom in on the regions within the yellow boxes of the original HSIs (bottom), and show the comparison of these regions in the top-right part. In Fig. 4, previous methods suffer from blotchy textures, distortions and blurring artifacts. In contrast, our DHM (9stg) can effectively restore HSIs with fewer artifacts and finer details. Besides, the spectral density curves corresponding to the green boxes in the top-left RGB image are depicted in the top-middle part. Our DHM (9stg) exhibits the best correlation with the ground truth, which illustrates the effectiveness of our DHM.

Real Dataset: To verify the superiority of our model in real HSI reconstruction, we follow [49, 8, 70, 14] to retrain our DHM-light (5stg) jointly on the KAIST [11] and CAVE [56] datasets. Besides, we inject 11-bit shot noise into the training samples to simulate real imaging scenarios. As shown in Fig. 5, our DHM-light (5stg) effectively restores the plant region corresponding to the yellow box. Compared with SOTA methods [8, 38, 14], our DHM-light (5stg) restores clearer contents and structural details with fewer artifacts, verifying the robustness of our model in real HSI restoration.
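To make the noise-injection step concrete, the sketch below shows one way shot noise could be added to a normalized measurement. The text only states that 11-bit shot noise is injected; the binomial photon-counting model, the quantum efficiency of 0.4, and the function name `add_shot_noise` are assumptions made for illustration and may differ from the actual training code.

```python
import numpy as np


def add_shot_noise(measurement, bit_depth=11, quantum_efficiency=0.4, rng=None):
    """Simulate shot noise on a measurement with values in [0, 1].

    Assumed model (not confirmed by the paper): photons arrive in proportion
    to the signal, and each photon is detected with probability
    `quantum_efficiency`, which yields binomially distributed counts.
    """
    rng = np.random.default_rng() if rng is None else rng
    levels = 2 ** bit_depth  # 2048 quantization levels for 11 bits
    clipped = np.clip(measurement, 0.0, None)
    # Expected number of incoming photons per pixel.
    photons = np.floor(clipped * levels / quantum_efficiency).astype(np.int64)
    # Number of photons actually detected by the sensor.
    electrons = rng.binomial(photons, quantum_efficiency)
    # Map the detected counts back to the normalized [0, 1] range.
    return electrons.astype(np.float32) / levels


# Usage on a synthetic flat measurement (illustrative data only).
clean = np.full((64, 64), 0.5, dtype=np.float32)
noisy = add_shot_noise(clean, rng=np.random.default_rng(0))
print(noisy.shape, noisy.dtype)
```

Under this model the noisy measurement keeps the mean of the clean one, while the variance grows with signal intensity, which is the defining property of shot noise.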

Table 2: Ablation studies (averaged PSNR and SSIM) of our DHM (5stg) on simulation dataset.
GHSB LHSB GFFN #Params GFLOPs PSNR SSIM
0.66M 43.96 39.81 0.979
0.66M 43.96 38.76 0.973
0.92M 60.26 39.93 0.979
0.92M 60.50 40.16 0.980
(a) Ablation experiments of the DHSB.
GHSB$\rightarrow$GA LHSB$\rightarrow$LA #Params GFLOPs PSNR SSIM
0.92M 60.50 40.16 0.980
0.79M 53.05 39.11 0.975
0.79M 53.05 40.08 0.980
0.65M 45.60 39.38 0.973
(b) Ablation experiments of alternative variants.
GS$\rightarrow$LS LS$\rightarrow$GS SPs #Params GFLOPs PSNR SSIM
4.59M 60.50 39.23 0.977
0.92M 60.50 40.12 0.980
4.59M 60.50 39.28 0.977
0.92M 60.50 40.16 0.980
(c) Ablation analysis of different block orders.
Variants $\boldsymbol{\eta}$ $\boldsymbol{\rho}$ #Params GFLOPs PSNR SSIM
Baseline 0.90M 59.11 39.86 0.979
DHM w/o $\boldsymbol{\eta}$ 0.92M 60.50 39.71 0.978
DHM w/o $\boldsymbol{\rho}$ 0.92M 60.42 39.92 0.979
DHM 0.92M 60.50 40.16 0.980
(d) Ablation analysis of learnable $(\boldsymbol{\eta},\boldsymbol{\rho})$.
Table 3: Ablation results of the DHSB. For each configuration, the upper and lower rows are PSNR and SSIM.
GHSB LHSB GFFN  #Params  GFLOPs  Metric  S1     S2     S3     S4     S5     S6     S7     S8     S9     S10    Avg
                0.66M    43.96   PSNR    38.17  40.91  43.78  47.18  37.41  37.51  38.78  35.83  43.26  35.28  39.81
                                 SSIM    0.971  0.981  0.983  0.993  0.980  0.978  0.973  0.977  0.985  0.968  0.979
                0.66M    43.96   PSNR    37.28  39.95  42.77  47.42  35.95  36.65  37.40  34.94  41.00  34.28  38.76
                                 SSIM    0.962  0.975  0.981  0.992  0.973  0.974  0.966  0.971  0.979  0.960  0.973
                0.92M    60.26   PSNR    38.42  40.75  43.97  47.65  37.79  37.47  38.95  35.96  43.18  35.13  39.93
                                 SSIM    0.972  0.981  0.984  0.993  0.981  0.978  0.974  0.976  0.985  0.967  0.979
                0.92M    60.50   PSNR    38.48  41.14  44.10  48.03  37.82  37.95  39.21  36.34  43.31  35.20  40.16
                                 SSIM    0.972  0.982  0.984  0.993  0.981  0.979  0.975  0.978  0.986  0.967  0.980

4.4 Ablation Studies

Figure 6: Visualization of $\mathbf{Z}_{t}$ at different stages on the Scene 5 (S5) of simulation dataset.

This subsection analyzes the effectiveness of all proposed modules on the simulation dataset, using our DHM (5stg) as an example. 1) DHSB: As shown in Tab. 2a, when we remove the GHSB or the LHSB, or replace the GFFN with a traditional feed-forward network (FFN) [8] in the DHSB, the performance of our DHM (5stg) decreases by $0.23\sim 1.40$ dB in PSNR and $0.001\sim 0.007$ in SSIM. Tab. 3 presents the ablation results of our DHM (5stg) on the 10 scenes (S1$\sim$S10) to verify the effectiveness of the DHSB. 2) Variants: In Tab. 2b, the PSNR of our model decreases by $0.08\sim 1.05$ dB when we replace the GHSB with non-local MSA [14] (GHSB$\rightarrow$GA) or substitute the LHSB with local MSA [14] (LHSB$\rightarrow$LA), where MSA is the multi-head self-attention [17]. This verifies the effectiveness of our DHM in using global receptive fields to model long-range dependencies while capturing local contexts. 3) Block Orders: In Tab. 2c, we perform ablation studies on shared parameters (SPs) across different stages and on the order of the GHSB and LHSB: from GHSB to LHSB (GS$\rightarrow$LS) or from LHSB to GHSB (LS$\rightarrow$GS). The setting adopted in our DHM achieves the highest PSNR (40.16 dB) among the compared configurations, which validates the effectiveness of our design. 4) Parameters: Tab. 2d shows ablation studies on the learnable parameters $(\boldsymbol{\eta},\boldsymbol{\rho})$. Removing $\boldsymbol{\eta}$ or $\boldsymbol{\rho}$ lowers the PSNR from 40.16 dB to 39.71 dB or 39.92 dB, respectively, which validates their effectiveness in estimating degradation patterns. Fig. 6 visualizes $\{\mathbf{Z}_{0},\mathbf{Z}_{3},\mathbf{Z}_{9}\}$ as examples to verify the effectiveness of our unfolding framework in HSI reconstruction, when we use our DHM (9stg) as the denoiser $\mathcal{D}(\cdot)$.
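The PSNR ranges quoted for Tab. 2a and Tab. 2b follow directly from the table entries. The short sketch below recomputes them from the averaged PSNR values listed in Tab. 2; the variable names and the helper `drop_range` are illustrative assumptions, and the ablated rows are listed in table order without assigning them to specific modules.

```python
# Averaged PSNR (dB) of the full DHM (5stg), as listed in Tab. 2.
full_psnr = 40.16

# PSNR of the three ablated configurations in Tab. 2a (table order).
tab2a_ablated = [39.81, 38.76, 39.93]
# PSNR of the three attention-based variants in Tab. 2b (table order).
tab2b_variants = [39.11, 40.08, 39.38]


def drop_range(full, ablated):
    """Return the smallest and largest PSNR drop relative to the full model."""
    drops = [round(full - value, 2) for value in ablated]
    return min(drops), max(drops)


dhsb_min, dhsb_max = drop_range(full_psnr, tab2a_ablated)
variant_min, variant_max = drop_range(full_psnr, tab2b_variants)

print(f"Tab. 2a PSNR drop: {dhsb_min:.2f} to {dhsb_max:.2f} dB")
print(f"Tab. 2b PSNR drop: {variant_min:.2f} to {variant_max:.2f} dB")
```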

5 Conclusion

In this paper, we propose a novel Dual Hyperspectral Mamba (DHM) to model both global and local dependencies for efficient HSI reconstruction. After estimating degradation patterns of the CASSI system via the learnable parameters, we utilize these parameters to scale the linear projection and provide the noise level for the denoiser (i.e., our DHM) in the multi-stage unfolding framework. Particularly, the proposed DHM mainly consists of a global hyperspectral S4 block (GHSB) and a local hyperspectral S4 block (LHSB). The GHSB explores long-range dependencies across the entire high-resolution HSIs using global receptive fields, while the LHSB constructs S4 models within different local windows to capture local contexts. We conduct extensive quantitative and qualitative comparison experiments on both the simulation and real datasets to demonstrate the effectiveness of our DHM.
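The difference between the global and local views can be illustrated by how a 2D feature map is turned into 1D sequences before a sequence model is applied. The sketch below is only a schematic illustration under stated assumptions: the row-major scan order, the non-overlapping windows, the window size of 2, and both function names are hypothetical choices, and the S4 computation itself is omitted.

```python
import numpy as np


def global_sequence(feat):
    """Flatten an (H, W, C) feature map into a single (H*W, C) sequence,
    so a sequence model sees every spatial position at once."""
    h, w, c = feat.shape
    return feat.reshape(h * w, c)


def local_window_sequences(feat, window):
    """Split an (H, W, C) feature map into non-overlapping window x window
    patches and flatten each patch into its own (window*window, C) sequence."""
    h, w, c = feat.shape
    assert h % window == 0 and w % window == 0, "H and W must be divisible by window"
    x = feat.reshape(h // window, window, w // window, window, c)
    # Bring the two block indices to the front, then the in-window indices.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window * window, c)


# Toy 4 x 4 single-channel feature map whose values equal their raster index.
feat = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
global_seq = global_sequence(feat)           # one sequence of length 16
local_seqs = local_window_sequences(feat, 2)  # four sequences of length 4

print(global_seq.shape, local_seqs.shape)
print(local_seqs[0, :, 0])  # positions that fall inside the first window
```

In the global view, spatially adjacent pixels in neighboring rows end up far apart in the sequence, whereas each local sequence keeps the pixels of one window together, which is the motivation for pairing the two blocks.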

References

  • [1] V. Backman, M. B. Wallace, L. Perelman, J. Arendt, R. Gurjar, M. Muller, Q. Zhang, G. Zonios, E. Kline, and T. McGillican. Detection of preinvasive cancer cells. Nature, 2000.
  • [2] José M. Bioucas-Dias and Mário A. T. Figueiredo. A new twist: Two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing, 16(12):2992–3004, 2007.
  • [3] M. Borengasser, W. S. Hungate, and R. Watkins. Hyperspectral remote sensing: principles and applications. CRC press, 2007.
  • [4] Yuanhao Cai, Jing Lin, Xiaowan Hu, Haoqian Wang, Xin Yuan, Yulun Zhang, Radu Timofte, and Luc Van Gool. Coarse-to-fine sparse transformer for hyperspectral image reconstruction. In European Conference on Computer Vision, pages 686–704. Springer, 2022.
  • [5] Yuanhao Cai, Jing Lin, Xiaowan Hu, Haoqian Wang, Xin Yuan, Yulun Zhang, Radu Timofte, and Luc Van Gool. Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17502–17511, 2022.
  • [6] Yuanhao Cai, Jing Lin, Zudi Lin, Haoqian Wang, Yulun Zhang, Hanspeter Pfister, Radu Timofte, and Luc Van Gool. Mst++: Multi-stage spectral-wise transformer for efficient spectral reconstruction. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 744–754, 2022.
  • [7] Yuanhao Cai, Jing Lin, Haoqian Wang, Xin Yuan, Henghui Ding, Yulun Zhang, Radu Timofte, and Luc V Gool. Degradation-aware unfolding half-shuffle transformer for spectral compressive imaging. Advances in Neural Information Processing Systems, 35:37749–37761, 2022.
  • [8] Yuanhao Cai, Jing Lin, Haoqian Wang, Xin Yuan, Henghui Ding, Yulun Zhang, Radu Timofte, and Luc Van Gool. Degradation-aware unfolding half-shuffle transformer for spectral compressive imaging. In Advances in Neural Information Processing Systems, 2022.
  • [9] Stanley H. Chan, Xiran Wang, and Omar A. Elgendy. Plug-and-play admm for image restoration: Fixed-point convergence and applications. IEEE Transactions on Computational Imaging, 3(1):84–98, 2017.
  • [10] Ziheng Cheng, Bo Chen, Ruiying Lu, Zhengjue Wang, Hao Zhang, Ziyi Meng, and Xin Yuan. Recurrent neural networks for snapshot compressive imaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):2264–2281, 2023.
  • [11] Inchang Choi, MH Kim, D Gutierrez, DS Jeon, and G Nam. High-quality hyperspectral reconstruction using a spectral prior. In Technical report, 2017.
  • [12] Jiahua Dong, Yang Cong, Gan Sun, Zhen Fang, and Zhengming Ding. Where and how to transfer: Knowledge aggregation-induced transferability perception for unsupervised domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(3):1664–1681, 2024.
  • [13] Jiahua Dong, Yang Cong, Gan Sun, Bineng Zhong, and Xiaowei Xu. What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4022–4031, June 2020.
  • [14] Yubo Dong, Dahua Gao, Yuyan Li, Guangming Shi, and Danhua Liu. Degradation estimation recurrent neural network with local and non-local priors for compressive spectral imaging. arXiv preprint arXiv:2311.08808, 2024.
  • [15] Yubo Dong, Dahua Gao, Tian Qiu, Yuyan Li, Minxi Yang, and Guangming Shi. Residual degradation learning unfolding framework with mixing priors across spectral and spatial for compressive spectral imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22262–22271, 2023.
  • [16] D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
  • [17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  • [18] Mathieu Fauvel, Yuliya Tarabalka, Jón Atli Benediktsson, Jocelyn Chanussot, and James C. Tilton. Advances in spectral-spatial classification of hyperspectral images. Proceedings of the IEEE, 101(3):652–675, 2013.
  • [19] Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022.
  • [20] Ying Fu, Zhiyuan Liang, and Shaodi You. Bidirectional 3d quasi-recurrent neural network for hyperspectral image super-resolution. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:2674–2688, 2021.
  • [21] Ying Fu, Yinqiang Zheng, Imari Sato, and Yoichi Sato. Exploiting spectral-spatial correlation for coded hyperspectral image restoration. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3727–3736, 2016.
  • [22] M. E. Gehm, R. John, D. J. Brady, R. M. Willett, and T. J. Schulz. Single-shot compressive spectral imaging with a dual-disperser architecture. Opt. Express, 15(21):14013–14027, Oct 2007.
  • [23] Alexander F. H. Goetz, Gregg Vane, Jerry E. Solomon, and Barrett N. Rock. Imaging spectrometry for earth remote sensing. Science, 228(4704):1147–1153, 1985.
  • [24] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  • [25] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
  • [26] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021.
  • [27] Xiaowan Hu, Yuanhao Cai, Jing Lin, Haoqian Wang, Xin Yuan, Yulun Zhang, Radu Timofte, and Luc Van Gool. Hdnet: High-resolution dual-domain learning for spectral compressive imaging. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17521–17530, 2022.
  • [28] Tao Huang, Weisheng Dong, Xin Yuan, Jinjian Wu, and Guangming Shi. Deep gaussian scale mixture prior for spectral compressive imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16216–16225, 2021.
  • [29] Tao Huang, Xin Yuan, Weisheng Dong, Jinjian Wu, and Guangming Shi. Deep gaussian scale mixture prior for image reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [30] Md Mohaiminul Islam, Mahmudul Hasan, Kishan Shamsundar Athrey, Tony Braskich, and Gedas Bertasius. Efficient movie scene detection using state-space transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18749–18758, 2023.
  • [31] Shirin Jalali and Xin Yuan. Snapshot compressed sensing: Performance bounds and algorithms. IEEE Transactions on Information Theory, 65(12):8005–8024, 2019.
  • [32] Min H. Kim, Todd Alan Harvey, David S. Kittle, Holly Rushmeier, Julie Dorsey, Richard O. Prum, and David J. Brady. 3d imaging spectroscopy for measuring hyperspectral patterns on solid objects. ACM Transactions on Graphics, 31(4), jul 2012.
  • [33] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [34] David Kittle, Kerkil Choi, Ashwin Wagadarikar, and David J. Brady. Multiframe image estimation for coded aperture snapshot spectral imagers. Appl. Opt., 49(36):6824–6833, Dec 2010.
  • [35] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • [36] Zeqiang Lai, Kaixuan Wei, and Ying Fu. Deep plug-and-play prior for hyperspectral image restoration. Neurocomputing, 481:281–293, 2022.
  • [37] Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. arXiv preprint arXiv:2403.06977, 2024.
  • [38] Miaoyu Li, Ying Fu, Ji Liu, and Yulun Zhang. Pixel adaptive deep unfolding transformer for hyperspectral image reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12959–12968, 2023.
  • [39] Yang Liu, Xin Yuan, Jinli Suo, David J. Brady, and Qionghai Dai. Rank minimization for snapshot compressive imaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(12):2990–3006, 2019.
  • [40] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024.
  • [41] Guolan Lu and Baowei Fei. Medical hyperspectral imaging: a review. Journal of Biomedical Optics, 2014.
  • [42] Jiawei Ma, Xiao-Yang Liu, Zheng Shou, and Xin Yuan. Deep tensor admm-net for snapshot compressive imaging. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 10222–10231, 2019.
  • [43] Jun Ma, Feifei Li, and Bo Wang. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.
  • [44] Xiao Ma, Xin Yuan, Chen Fu, and Gonzalo R. Arce. Led-based compressive spectral-temporal imaging. Opt. Express, 29(7):10698–10715, Mar 2021.
  • [45] Emmanuel Maggiori, Guillaume Charpiat, Yuliya Tarabalka, and Pierre Alliez. Recurrent neural networks to correct satellite image classification maps. IEEE Transactions on Geoscience and Remote Sensing, 55(9):4962–4971, 2017.
  • [46] Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. Long range language modeling via gated state spaces. arXiv preprint arXiv:2206.13947, 2022.
  • [47] F. Melgani and L. Bruzzone. Classification of hyperspectral remote sensing images with support vector machines. IEEE Transactions on Geoscience and Remote Sensing, 42(8):1778–1790, 2004.
  • [48] Ziyi Meng, Shirin Jalali, and Xin Yuan. Gap-net for snapshot compressive imaging. arXiv preprint arXiv:2012.08364, 2020.
  • [49] Ziyi Meng, Jiawei Ma, and Xin Yuan. End-to-end low cost compressive spectral imaging with spatial-spectral self-attention. In European conference on computer vision, pages 187–204. Springer, 2020.
  • [50] Ziyi Meng, Mu Qiao, Jiawei Ma, Zhenming Yu, Kun Xu, and Xin Yuan. Snapshot multispectral endomicroscopy. Optics Letters, 2020.
  • [51] Ziyi Meng, Zhenming Yu, Kun Xu, and Xin Yuan. Self-supervised neural networks for spectral snapshot compressive imaging. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2602–2611, 2021.
  • [52] Xin Miao, Xin Yuan, Yunchen Pu, and Vassilis Athitsos. λ-net: Reconstruct hyperspectral images from a snapshot measurement. In IEEE International Conference on Computer Vision, pages 4058–4068, Oct. 2019.
  • [53] Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, and Christopher Ré. S4nd: Modeling images and videos as multidimensional signals with state spaces. Advances in neural information processing systems, 35:2846–2861, 2022.
  • [54] Hien Van Nguyen, Amit Banerjee, and Rama Chellappa. Tracking via object reflectance using a hyperspectral video camera. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, pages 44–51, 2010.
  • [55] Zhihong Pan, G. Healey, M. Prasad, and B. Tromberg. Face recognition in hyperspectral images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12):1552–1560, 2003.
  • [56] Jong-Il Park, Moon-Hyun Lee, Michael D. Grossberg, and Shree K. Nayar. Multispectral imaging using multiplexed illumination. In ICCV, 2007.
  • [57] Mu Qiao, Ziyi Meng, Jiawei Ma, and Xin Yuan. Deep learning for video compressive sensing. Apl Photonics, 2020.
  • [58] Mu Qiao and Xin Yuan. Coded aperture compressive temporal imaging using complementary codes and untrained neural networks for high-quality reconstruction. Opt. Lett., 48(1):109–112, Jan 2023.
  • [59] Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022.
  • [60] Jin Tan, Yanting Ma, Hoover Rueda, Dror Baron, and Gonzalo R. Arce. Compressive hyperspectral imaging via approximate message passing. IEEE Journal of Selected Topics in Signal Processing, 10(2):389–401, 2016.
  • [61] Joel A. Tropp and Anna C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.
  • [62] Burak Uzkent, Matthew J. Hoffman, and Anthony Vodacek. Real-time vehicle tracking in aerial video using hyperspectral features. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1443–1451, 2016.
  • [63] Burak Uzkent, Aneesh Rangnekar, and Matthew J. Hoffman. Aerial vehicle tracking by adaptive fusion of hyperspectral likelihood maps. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 233–242, 2017.
  • [64] Ashwin Wagadarikar, Renu John, Rebecca Willett, and David Brady. Single disperser design for coded aperture snapshot spectral imaging. Appl. Opt., 47(10):B44–B51, Apr 2008.
  • [65] Jue Wang, Wentao Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, and Raffay Hamid. Selective structured state-spaces for long-form video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6387–6397, 2023.
  • [66] Lizhi Wang, Chen Sun, Ying Fu, Min H. Kim, and Hua Huang. Hyperspectral image reconstruction using a deep spatial-spectral prior. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8024–8033, 2019.
  • [67] Lizhi Wang, Chen Sun, Maoqing Zhang, Ying Fu, and Hua Huang. Dnu: Deep non-local unrolling for computational spectral imaging. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1658–1668, 2020.
  • [68] Lizhi Wang, Zhiwei Xiong, Guangming Shi, Feng Wu, and Wenjun Zeng. Adaptive nonlocal sparse representation for dual-camera compressive hyperspectral imaging. IEEE transactions on pattern analysis and machine intelligence, 39(10):2104–2111, 2016.
  • [69] Ziyang Wang, Jian-Qing Zheng, Yichi Zhang, Ge Cui, and Lei Li. Mamba-unet: Unet-like pure visual mamba for medical image segmentation. arXiv preprint arXiv:2402.05079, 2024.
  • [70] Zongliang Wu, Ruiying Lu, Ying Fu, and Xin Yuan. Latent diffusion prior enhanced deep unfolding for spectral image reconstruction. arXiv preprint arXiv:2311.14280, 2023.
  • [71] Xin Yuan. Generalized alternating projection based total variation minimization for compressive sensing. In 2016 IEEE International Conference on Image Processing (ICIP), pages 2539–2543, 2016.
  • [72] Xin Yuan, David J. Brady, and Aggelos K. Katsaggelos. Snapshot compressive imaging: Theory, algorithms, and applications. IEEE Signal Processing Magazine, 38(2):65–88, 2021.
  • [73] Xin Yuan, Yang Liu, Jinli Suo, and Qionghai Dai. Plug-and-play algorithms for large-scale snapshot compressive imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1447–1457, 2020.
  • [74] Xin Yuan, Tsung-Han Tsai, Ruoyu Zhu, Patrick Llull, David Brady, and Lawrence Carin. Compressive hyperspectral imaging with side information. IEEE Journal of Selected Topics in Signal Processing, 9(6):964–976, 2015.
  • [75] Yuan Yuan, Xiangtao Zheng, and Xiaoqiang Lu. Hyperspectral image superresolution by transfer learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(5):1963–1974, 2017.
  • [76] Yubiao Yue and Zhenzhang Li. Medmamba: Vision mamba for medical image classification. arXiv preprint arXiv:2403.03849, 2024.
  • [77] Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. Plug-and-play image restoration with deep denoiser prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6360–6376, 2021.
  • [78] Shipeng Zhang, Lizhi Wang, Ying Fu, Xiaoming Zhong, and Hua Huang. Computational hyperspectral imaging based on dimension-discriminative low-rank tensor recovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10183–10192, 2019.
  • [79] Xuanyu Zhang, Yongbing Zhang, Ruiqin Xiong, Qilin Sun, and Jian Zhang. Herosnet: Hyperspectral explicable reconstruction and optimal sampling deep network for snapshot compressive imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17532–17541, 2022.
  • [80] Siming Zheng, Yang Liu, Ziyi Meng, Mu Qiao, Zhishen Tong, Xiaoyu Yang, Shensheng Han, and Xin Yuan. Deep plug-and-play priors for spectral snapshot compressive imaging. Photon. Res., 9(2):B18–B29, Feb 2021.
  • [81] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417, 2024.