Dual Hyperspectral Mamba for Efficient Spectral Compressive Imaging

Jiahua Dong¹, Hui Yin^{2, *}, Hongliu Li³, Wenbo Li⁴, Yulun Zhang^{5, †},
Salman Khan^{1, 6}, Fahad Shahbaz Khan^{1, 7}
¹Mohamed bin Zayed University of Artificial Intelligence   ²Hunan University
³The Hong Kong Polytechnic University   ⁴The Chinese University of Hong Kong
⁵Shanghai Jiao Tong University   ⁶Australian National University   ⁷Linköping University
^*Equal Contributions.   ^†Corresponding Author.

Abstract

Deep unfolding methods have made impressive progress in restoring 3D hyperspectral images (HSIs) from 2D measurements through convolutional neural networks or Transformers in spectral compressive imaging. However, they cannot efficiently capture long-range dependencies using global receptive fields, which significantly limits their performance in HSI reconstruction. Moreover, these methods may suffer from local context neglect if we directly utilize Mamba to unfold a 2D feature map as a 1D sequence for modeling global long-range dependencies. To address these challenges, we propose a novel Dual Hyperspectral Mamba (DHM) to explore both global long-range dependencies and local contexts for efficient HSI reconstruction. After learning informative parameters to estimate degradation patterns of the CASSI system, we use them to scale the linear projection and provide the noise level to the denoiser (i.e., our proposed DHM). Specifically, our DHM consists of multiple dual hyperspectral S4 blocks (DHSBs) to restore original HSIs. Each DHSB contains a global hyperspectral S4 block (GHSB) to model long-range dependencies across the entire high-resolution HSI using global receptive fields, and a local hyperspectral S4 block (LHSB) to address local context neglect by establishing structured state-space sequence (S4) models within local windows. Experiments verify the benefits of our DHM for HSI reconstruction. The source code and models will be available at https://github.com/JiahuaDong/DHM.

1 Introduction

Figure 1: Comparisons of PSNR-FLOPS between our DHM and SOTA models.

Unlike standard RGB images with only three spectral bands, hyperspectral images (HSIs) [54, 18, 45, 23] comprise multiple contiguous bands, providing detailed spectral information for each pixel. In recent decades, HSIs have achieved remarkable successes in a wide range of applications such as remote sensing [3, 47, 75], object detection [32, 55], vehicle tracking [62, 63, 21], and medical image analysis [1, 41, 50]. With the development of compressive sensing theory, the coded aperture snapshot spectral imaging (CASSI) [49, 22], one of the snapshot compressive imaging systems [74, 44, 58, 64], has shown impressive performance in capturing HSIs at video rate. The CASSI system modulates HSI signals at various wavelengths, and mixes all modulated spectra to output a 2D compressed measurement. Then, numerous HSI reconstruction methods [20, 78, 57] are developed to restore original HSIs from 2D compressed measurements (i.e., the CASSI inverse problem [8]).

Different from natural image restoration, HSI reconstruction deals with substantially degraded measurements caused by uncertain system noise and spectral compression [49, 14]. Thus, it is more challenging to learn underlying HSI properties than natural image restoration. Generally, existing HSI reconstruction methods can be mainly divided into four categories. To solve the CASSI inverse problem, model-based methods [68, 78, 60] are heavily dependent on hand-crafted image priors (e.g., low-rank [39] and sparsity [34]), suffering from limited generalization capability. Some plug-and-play works [77, 80, 54] apply the pretrained denoiser into model-based methods [57, 73], while end-to-end algorithms [49, 27, 52] ignore the working mechanism of the CASSI system and instead model a brute-force projection from 2D compressed measurements to HSIs via convolutional neural networks (CNNs). Moreover, deep unfolding methods [79, 20, 66, 67] introduce a multi-stage unfolding framework to iteratively learn a linear projection and a denoiser. They possess the interpretability of model-based methods [72] as well as the powerful encoding capability of deep learning, thereby achieving state-of-the-art performance to lead the development of HSI reconstruction task.

Many deep unfolding methods [79, 72] rely on CNNs as the denoiser to capture local contexts, showing significant limitations in exploiting the crucial global contexts for HSI reconstruction. To tackle this issue, some works employ Transformers [17] to model wide-range dependencies [14, 5, 4, 7], but their complexity is quadratic in the number of tokens. Therefore, there is a trade-off between computation complexity and effective receptive fields, hindering these methods from exploring long-range dependencies, especially in high-resolution HSIs. Recently, structured state space sequence (S4) models [25, 65, 46] have emerged as a promising backbone to address the limitations of Transformers and CNNs. Visual Mamba models [81, 69] then introduce a cross-scan module to apply S4 models to vision tasks by unfolding 2D features into 1D sequences along four directions. This design can use global receptive fields to capture long-range contexts while reducing the quadratic complexity to linear. However, existing Mamba models [81, 69] face a crucial challenge of local context neglect when directly applied to high-resolution HSI reconstruction. Since Mamba unfolds a 2D feature map as a 1D sequence, spatially close pixels may end up being located at distant positions in the flattened sequences. The excessive distance among nearby pixels leads to the problem of local context neglect (i.e., significant loss of critical local textures), thereby degrading the performance of HSI reconstruction.
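A small index calculation makes the local context neglect concrete. The sketch below assumes a plain row-major flattening, a feature map that is 310 pixels wide (the measurement width implied by the settings in Sec. 4.1), and a local window of $8\times 8$ pixels. The helper name `flat_index` is illustrative, and feature maps inside the network may be downsampled, so the numbers are only indicative.

```python
def flat_index(row, col, width):
    """Position of pixel (row, col) in a row-major flattened sequence."""
    return row * width + col

# Two vertically adjacent pixels, (0, 0) and (1, 0).
full_width = 310    # assumed width of the full feature map
window_width = 8    # assumed width of one local window

gap_global = flat_index(1, 0, full_width) - flat_index(0, 0, full_width)
gap_window = flat_index(1, 0, window_width) - flat_index(0, 0, window_width)

print(gap_global, gap_window)
```

Under these assumptions, the two neighbouring pixels sit one full row apart in the global sequence but only one window row apart in the local sequence.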

To resolve the above challenges, we develop a novel Dual Hyperspectral Mamba (DHM) for efficient HSI reconstruction. Our DHM relies on structured state-space sequence (S4) models to reconstruct HSIs from 2D degraded measurements, which can capture both global long-range dependencies and local contexts with linear computation complexity. It is the first attempt to address HSI reconstruction via S4 models in the field of hyperspectral compressive imaging. After learning informative parameters from the physical mask and degraded measurement of the CASSI system, we feed them into the multi-stage unfolding framework by scaling the linear projection and estimating the noise level for the denoiser (i.e., our proposed DHM). The core component of our DHM is the dual hyperspectral S4 block (DHSB), which is mainly composed of a global hyperspectral S4 block (GHSB) and a local hyperspectral S4 block (LHSB). More specifically, the GHSB focuses on understanding global long-range dependencies by modeling discrete state-space equations on the entire high-resolution HSI, which can effectively balance computation complexity and global receptive fields. Besides, the LHSB aims to surmount the challenge of local context neglect by constructing S4 models within different local windows. As shown in Fig. 1, experiments show that our DHM significantly surpasses existing HSI reconstruction methods. The novel contributions of our paper are listed as follows:

$\bullet$ We propose a new Dual Hyperspectral Mamba (DHM) for HSI reconstruction, capable of capturing both global long-range dependencies and local contexts with linear computational complexity. To the best of our knowledge, our DHM is the first Mamba-based deep unfolding method for HSI reconstruction.

$\bullet$ We develop a global hyperspectral S4 block (GHSB) to explore long-range dependencies across the entire high-resolution HSI using global receptive fields, and design a local hyperspectral S4 block (LHSB) to tackle local context neglect by constructing S4 models within different local windows.

$\bullet$ We conduct comprehensive experiments to illustrate that our DHM significantly surpasses SOTA deep unfolding methods, while requiring a smaller model size and lower computational complexity.

2 Related Work

Hyperspectral Image Reconstruction: Traditional model-based HSI reconstruction methods [21, 60, 68, 71, 78] utilize hand-crafted priors such as sparsity [34], total variation [71] and low-rank constraints to address the CASSI inverse problem. Unfortunately, they rely heavily on manual parameter tuning, leading to unsatisfactory reconstruction performance. In light of this, some plug-and-play methods [77, 54, 36] focus on integrating convex optimization with pretrained denoising networks for HSI reconstruction. They have limited generalization performance due to their overreliance on the pretrained denoiser. Besides, end-to-end (E2E) algorithms [5, 27, 49] rely on convolutional neural networks (CNNs) [13, 12] or Transformers [17] to learn a brute-force projection function for HSI restoration. They can improve the HSI reconstruction performance but lack robustness and interpretability. To address these limitations, deep unfolding methods [8, 20, 28, 42, 67] are developed to restore HSI cubes from 2D compressed measurements via a multi-stage framework, combining interpretability with strong encoding ability. The methods in [42, 79, 67] employ CNNs to estimate degradation patterns, which limits their ability to explore long-range contexts. After Cai et al. [8] employ a Transformer to capture non-local dependencies, many Transformer-based methods [29, 15, 38, 14, 70] are proposed to design the denoisers. However, the above methods suffer from a trade-off between computation complexity and effective receptive fields, preventing them from understanding long-range dependencies with global receptive fields to achieve better HSI reconstruction performance.

State Space Models (SSMs) [25, 26, 59, 35] have attracted increasing attention recently due to their capability to scale linearly with sequence length in long-range dependency modeling. After the structured state space sequence (S4) model [25] shows impressive performance on long-range sequence modeling tasks, the S5 model [59] introduces an efficient parallel scan and a general MIMO SSM based on S4. Then [19, 46] are proposed to alleviate the performance gap between Transformers and SSMs. Mamba [24], an enhanced SSM with efficient hardware design and a selective mechanism, has surpassed Transformers in natural language processing [30, 53]. Due to its ability to model long-range dependencies with linear complexity, Mamba has been widely applied to diverse vision tasks, such as image/video understanding [40, 76, 37] and biomedical image analysis [43]. However, these Mamba models [24, 76, 37, 40, 53] may face the challenge of local context neglect (i.e., substantial loss of critical local textures) when directly applied to the high-resolution HSI reconstruction task.

3 The Proposed Model

3.1 The CASSI System

Degradation Model: In the coded aperture snapshot spectral imaging (CASSI) system [61, 49, 22], the camera can capture the vectorized degraded measurement $\mathbf{Y}\in\mathbb{R}^{\xi}$ , where $\xi=H(\delta_{s}(N_{\omega}-1)+W)$ . $N_{\omega},\delta_{s},H$ and $W$ represent the number of wavelengths, shifting step of dispersion, height and width in hyperspectral images (HSIs), respectively. As introduced in [8], after vectorizing the shifted HSI signal as $\mathbf{X}\in\mathbb{R}^{\xi N_{\omega}}$ , we express the degradation model of the CASSI system as follows:

$$\mathbf{Y}=\boldsymbol{\Psi}\mathbf{X}+\boldsymbol{\epsilon}, \tag{1}$$

where $\boldsymbol{\epsilon}\in\mathbb{R}^{\xi}$ is the vectorized imaging noise on $\mathbf{Y}$ . $\mathbf{\Psi}\in\mathbb{R}^{\xi\times\xi N_{\omega}}$ indicates the sparse and fat sensing matrix which is determined via the physical mask in the CASSI system [16, 31]. Given $\mathbf{\Psi}$ and $\mathbf{Y}$ in the CASSI system, the goal of HSI reconstruction is to restore HSI signal $\mathbf{X}$ by removing the imaging noise $\boldsymbol{\epsilon}$ .
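To make the measurement size $\xi=H(\delta_{s}(N_{\omega}-1)+W)$ concrete, the following minimal sketch simulates a noise-free measurement. It assumes a single coded mask shared by all bands and a dispersion that shifts band $n$ by $n\delta_{s}$ pixels along the width; the function name `cassi_forward` and these modelling choices are illustrative assumptions, not details taken from the text above.

```python
import numpy as np

def cassi_forward(hsi, mask, step=2):
    """Simulate a noise-free CASSI measurement (illustrative assumptions).

    hsi:  (H, W, N) hyperspectral cube
    mask: (H, W) coded aperture, assumed identical for every band
    step: dispersion shift in pixels between adjacent bands
    Returns a 2D measurement of shape (H, W + step * (N - 1)).
    """
    H, W, N = hsi.shape
    measurement = np.zeros((H, W + step * (N - 1)))
    for n in range(N):
        # Modulate band n with the mask, shift it by n * step columns, accumulate.
        measurement[:, n * step : n * step + W] += mask * hsi[:, :, n]
    return measurement

cube = np.ones((4, 5, 3))
mask = np.ones((4, 5))
y = cassi_forward(cube, mask, step=2)
print(y.shape)
```

Flattening `y` gives a vector whose length matches $\xi$ for the chosen $H$, $W$, $N_{\omega}$ and $\delta_{s}$.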

Estimation of Degradation Patterns: As analyzed in previous deep unfolding methods [8, 15, 14, 70], the estimation of degradation patterns is crucial to improve HSI reconstruction performance in the multi-stage unfolding framework, by adaptively scaling linear projection and offering information about imaging noise $\boldsymbol{\epsilon}$ for the denoiser. Thus, motivated by [8, 15], we use maximum a posteriori (MAP) theory to restore original HSI signal $\mathbf{X}$ in Eq. (1) via optimizing the following energy function:

$$\widehat{\mathbf{X}}=\arg\min_{\mathbf{X}}\frac{1}{2}\|\mathbf{Y}-\boldsymbol{\Psi}\mathbf{X}\|^{2}+\lambda\mathcal{R}(\mathbf{X}), \tag{2}$$

where $\mathcal{R}(\mathbf{X})$ denotes the prior term about $\mathbf{X}$ , and $\lambda$ is the hyperparameter to balance the importance of prior term. In order to solve Eq. (2), we define an auxiliary variable as $\mathbf{Z}=\mathbf{X}\in\mathbb{R}^{\xi N_{\omega}}$ , and then utilize the half-quadratic splitting algorithm to minimize the following loss $\mathcal{L}_{\mathrm{HSI}}$ :

$$\mathcal{L}_{\mathrm{HSI}}=\frac{1}{2}\|\mathbf{Y}-\boldsymbol{\Psi}\mathbf{X}\|^{2}+\lambda\mathcal{R}(\mathbf{Z})+\frac{\eta}{2}\|\mathbf{Z}-\mathbf{X}\|^{2}, \tag{3}$$

where $\eta$ is a penalty parameter. We decouple $\mathbf{X}$ and $\mathbf{Z}$ into two iterative subproblems to solve Eq. (3):

$$\mathbf{X}_{t}=\arg\min_{\mathbf{X}}\|\mathbf{Y}-\boldsymbol{\Psi}\mathbf{X}\|^{2}+\eta\|\mathbf{X}-\mathbf{Z}_{t-1}\|^{2},\quad\mathbf{Z}_{t}=\arg\min_{\mathbf{Z}}\frac{\eta}{2}\|\mathbf{Z}-\mathbf{X}_{t}\|^{2}+\lambda\mathcal{R}(\mathbf{Z}), \tag{4}$$

where $t=1,\cdots,T$ denotes the iterative stage index in the multi-stage unfolding framework, as shown in Fig. 2. Since the subproblem of solving $\mathbf{X}$ in Eq. (4) is a quadratic regularized least-squares problem, we can derive its closed-form solution as $\mathbf{X}_{t}=(\boldsymbol{\Psi}^{\top}\boldsymbol{\Psi}+\eta\mathbf{I})^{-1}(\boldsymbol{\Psi}^{\top}\mathbf{Y}+\eta\mathbf{Z}_{t-1})$. Considering the high computational overhead of $(\boldsymbol{\Psi}^{\top}\boldsymbol{\Psi}+\eta\mathbf{I})^{-1}$ brought by the fat sensing matrix $\boldsymbol{\Psi}\in\mathbb{R}^{\xi\times\xi N_{\omega}}$, we resort to the matrix inversion lemma to simplify it: $(\boldsymbol{\Psi}^{\top}\boldsymbol{\Psi}+\eta\mathbf{I})^{-1}={\eta}^{-1}\mathbf{I}-{\eta}^{-1}\boldsymbol{\Psi}^{\top}(\boldsymbol{\Psi}{\eta}^{-1}\boldsymbol{\Psi}^{\top}+\mathbf{I})^{-1}\boldsymbol{\Psi}\eta^{-1}$. As a result, we can reformulate the closed-form solution of $\mathbf{X}$ in Eq. (4) as follows:

$$\mathbf{X}_{t}=\mathbf{Z}_{t-1}+\eta^{-1}\boldsymbol{\Psi}^{\top}\mathbf{Y}-\eta^{-1}\boldsymbol{\Psi}^{\top}(\boldsymbol{\Psi}{\eta}^{-1}\boldsymbol{\Psi}^{\top}+\mathbf{I})^{-1}\boldsymbol{\Psi}\mathbf{Z}_{t-1}-\eta^{-2}\boldsymbol{\Psi}^{\top}(\boldsymbol{\Psi}{\eta}^{-1}\boldsymbol{\Psi}^{\top}+\mathbf{I})^{-1}\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}\mathbf{Y}. \tag{5}$$

As introduced in [4, 8], $\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}=\boldsymbol{\mathrm{diag}}\{\psi_{1},\psi_{2},\cdots,\psi_{\xi}\}$ is a diagonal matrix in the CASSI system. After defining $\boldsymbol{\psi}=[\psi_{1},\psi_{2},\cdots,\psi_{\xi}]\in\mathbb{R}^{\xi}$, we plug $\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}=\boldsymbol{\mathrm{diag}}\{\psi_{1},\psi_{2},\cdots,\psi_{\xi}\}$ into Eq. (5):

$$\mathbf{X}_{t}=\mathbf{Z}_{t-1}+\boldsymbol{\Psi}^{\top}((\mathbf{Y}-\boldsymbol{\Psi}\mathbf{Z}_{t-1})\otimes(\eta+\boldsymbol{\psi})^{-1})^{\top}, \tag{6}$$

where $\otimes$ is the element-wise multiplication. Since $\boldsymbol{\psi}$ is precomputed and stored in $\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}$ , the value of $\eta$ in Eq. (6) can affect the output of each iterative stage in the multi-stage unfolding framework. To eliminate negative influence of manually determining $\eta$ , we set $\eta$ to be learnable in the multi-stage framework, and denote $\eta_{t}$ as the value of $\eta$ at the $t$ -th iterative stage. Besides, we also define a learnable parameter $\lambda_{t}$ at the $t$ -th stage, and express the subproblem of solving $\mathbf{Z}_{t}$ in Eq. (4) as:

$$\mathbf{Z}_{t}=\arg\min_{\mathbf{Z}}\frac{1}{2(\sqrt{\lambda_{t}/\eta_{t}})^{2}}\|\mathbf{Z}-\mathbf{X}_{t}\|^{2}+\mathcal{R}(\mathbf{Z}). \tag{7}$$

In Eq. (7), the subproblem of solving $\mathbf{Z}_{t}$ is equivalent to denoising the image $\mathbf{X}_{t}$ with a Gaussian noise level of $\sqrt{\lambda_{t}/\eta_{t}}$, according to Bayesian probability [9]. Given $\boldsymbol{\eta}=[\eta_{1},\cdots,\eta_{T}]\in\mathbb{R}^{T}$ and $\boldsymbol{\rho}=[\eta_{1}/\lambda_{1},\cdots,\eta_{T}/\lambda_{T}]\in\mathbb{R}^{T}$, we can introduce the following iterative optimization scheme to estimate degradation patterns of the CASSI system and reconstruct the original HSI signal $\mathbf{X}$ in Eq. (1):

$$(\boldsymbol{\eta},\boldsymbol{\rho})=\Omega(\boldsymbol{\Psi},\mathbf{Y}),\quad\mathbf{X}_{t}=\mathcal{E}(\mathbf{Z}_{t-1},\boldsymbol{\Psi},\mathbf{Y},\eta_{t}),\quad\mathbf{Z}_{t}=\mathcal{D}(\mathbf{X}_{t},\rho_{t}), \tag{8}$$

where $\Omega(\cdot)$ is the parameter learner. $\mathcal{E}(\cdot)$ is equivalent to Eq. (6), which is a linear projection used for mapping $\mathbf{Z}_{t-1}$ to $\mathbf{X}_{t}$ . $\mathcal{D}(\cdot)$ indicates the Gaussian denoiser to solve Eq. (7). As shown in Fig. 2, we depict our unfolding framework with $T$ iterative training stages to restore original HSI signal $\mathbf{X}$ in Eq. (1). Specifically, we first concatenate the given sensing matrix $\boldsymbol{\Psi}$ and compressed measurement $\mathbf{Y}$ , and input it into a convolution block to initialize $\mathbf{Z}_{0}$ . At the $t$ -th ( $t=1,\cdots,T$ ) stage, the parameter learner $\Omega(\cdot)$ contains two degradation-aware blocks (DABs), an average pooling layer and three fully connected layers to encode $\mathbf{Z}_{0}$ and $\boldsymbol{\Psi}$ , and then outputs learnable parameters $(\boldsymbol{\eta},\boldsymbol{\rho})$ . The DAB has three convolution layers and two GELU functions. Then $\mathcal{E}(\cdot)$ and $\mathcal{D}(\cdot)$ use the parameters $(\boldsymbol{\eta},\boldsymbol{\rho})$ to iteratively update $\mathbf{X}_{t}$ and $\mathbf{Z}_{t}$ in Eq. (8) until the $T$ -th stage. Particularly, $(\boldsymbol{\eta},\boldsymbol{\rho})$ learned by $\Omega(\cdot)$ can effectively scale the linear projection in Eq. (6), while offering accurate noise level for the denoiser $\mathcal{D}(\cdot)$ to solve Eq. (7). In the CASSI system, they are essential to estimate the ill-posedness degree and degradation patterns, thereby substantially improving HSI reconstruction performance.
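The projection in Eq. (6) and the alternation in Eq. (8) can be sketched numerically. The example below is a minimal illustration under stated assumptions: the toy sensing matrix is a horizontal stack of random diagonal blocks (so that $\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}$ is diagonal), the values of $\eta_{t}$ and $\rho_{t}$ are fixed by hand rather than predicted by $\Omega(\cdot)$, and an identity function stands in for the denoiser $\mathcal{D}(\cdot)$. The names `project` and `unfold` are hypothetical.

```python
import numpy as np

def project(z_prev, Psi, psi, y, eta):
    """Linear projection of Eq. (6); psi holds the diagonal of Psi @ Psi.T."""
    return z_prev + Psi.T @ ((y - Psi @ z_prev) / (eta + psi))

def unfold(y, Psi, etas, rhos, denoiser, z0):
    """Alternate projection and denoising for T = len(etas) stages, as in Eq. (8)."""
    psi = np.diag(Psi @ Psi.T)  # computed once and reused in every stage
    z = z0
    for eta_t, rho_t in zip(etas, rhos):
        x = project(z, Psi, psi, y, eta_t)
        z = denoiser(x, rho_t)
    return z

# Toy sensing matrix (assumption): stacked diagonal blocks, one per band.
rng = np.random.default_rng(0)
xi, n_bands = 6, 3
Psi = np.hstack([np.diag(rng.random(xi) + 0.1) for _ in range(n_bands)])
y = rng.random(xi)
z0 = rng.random(xi * n_bands)

# Eq. (6) should agree with the direct closed-form solution of the X-subproblem.
eta = 0.7
x_direct = np.linalg.solve(
    Psi.T @ Psi + eta * np.eye(xi * n_bands), Psi.T @ y + eta * z0
)
x_eq6 = project(z0, Psi, np.diag(Psi @ Psi.T), y, eta)

# Identity placeholder for the denoiser D(.); hand-picked eta_t and rho_t.
z_T = unfold(y, Psi, etas=[0.7, 0.5, 0.3], rhos=[1.0, 1.0, 1.0],
             denoiser=lambda x, rho: x, z0=z0)
```

With the identity placeholder, each stage shrinks the data residual $\mathbf{Y}-\boldsymbol{\Psi}\mathbf{Z}$ elementwise by the factor $\eta_{t}/(\eta_{t}+\boldsymbol{\psi})$, which shows how a smaller $\eta_{t}$ enforces data fidelity more strongly.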

3.2 Dual Hyperspectral Mamba (DHM)

Generally, existing deep unfolding methods [7, 14, 70, 15] mainly utilize CNNs or Transformers to design the denoiser $\mathcal{D}(\cdot)$ . However, these methods struggle to capture long-range dependencies using global receptive fields, thereby limiting their HSI reconstruction performance. Besides, directly applying Mamba to high-resolution HSI reconstruction suffers from local context neglect (i.e., substantial loss of critical local details). To resolve the above challenges, we develop a novel Dual Hyperspectral Mamba (DHM) as the denoiser $\mathcal{D}(\cdot)$ in Eq. (8). Our DHM uses global receptive fields to model long-range dependencies while tackling local context neglect via capturing local contexts.

Fig. 3a shows the architecture of our DHM (i.e., the denoiser $\mathcal{D}(\cdot)$) at the $t$-th ($t=1,\cdots,T$) iterative stage in Fig. 2. Specifically, given the scalar $\rho_{t}$ and $\mathbf{X}_{t}\in\mathbb{R}^{H\times W_{*}\times N_{\omega}}$ at the $t$-th stage, we first reshape $\rho_{t}$ to $\mathbb{R}^{H\times W_{*}}$, and concatenate $\mathbf{X}_{t}$ with the reshaped $\rho_{t}$ to extract the shallow feature $\mathbf{F}_{s}\in\mathbb{R}^{H\times W_{*}\times C}$ via a convolutional layer, where $W_{*}=\delta_{s}(N_{\omega}-1)+W$, and $C$ is the feature dimension. Then we forward $\mathbf{F}_{s}$ to the encoder, bottleneck and decoder to obtain the deep feature $\mathbf{F}_{d}\in\mathbb{R}^{H\times W_{*}\times C}$. The encoder and decoder comprise $N_{1}$ pairs of a dual hyperspectral S4 block (DHSB) and a resizing module, while the bottleneck only has $N_{2}$ DHSBs. In Fig. 3a, we visualize the pipeline of our DHM when $N_{1}=2$ and $N_{2}=1$ for better demonstration. In Fig. 3b, the DHSB includes a global hyperspectral S4 block (GHSB), a local hyperspectral S4 block (LHSB), a gated feed-forward network (GFFN) and three layer normalization (LN) layers. Fig. 3c presents the components of the GHSB and LHSB, which are the two most important modules in our DHM. Apart from the reshaping operation, they have the same architecture. Particularly, the GHSB can use global receptive fields to model long-range dependencies, and the LHSB aims to address local context neglect by constructing structured state space sequence (S4) models within local windows. Besides, Fig. 3d shows the design of the GFFN module. Then we perform a convolution operation on $\mathbf{F}_{d}$ to obtain $\mathbf{F}_{z}\in\mathbb{R}^{H\times W_{*}\times N_{\omega}}$. Finally, we sum $\mathbf{X}_{t}$ and $\mathbf{F}_{z}$ to generate the denoised image $\mathbf{Z}_{t}\in\mathbb{R}^{H\times W_{*}\times N_{\omega}}$ at the $t$-th iterative stage. In the following subsections, we introduce the detailed components of the GHSB and LHSB.

Global Hyperspectral S4 Block (GHSB) constructs an S4 model on the entire high-resolution HSI to capture global contexts using global receptive fields. As shown in Fig. 3c, we forward a given feature $\mathbf{F}_{i}\in\mathbb{R}^{H\times W_{*}\times D}$ into two branches, where $D\in\{C,2C,4C\}$ denotes the feature dimension at different levels of the encoder, bottleneck and decoder. Specifically, the upper branch encodes $\mathbf{F}_{i}$ to $\mathbf{F}_{u}=\boldsymbol{\sigma}(\mathbf{P}_{u}(\mathbf{W}_{d}(\mathbf{F}_{i})))\in\mathbb{R}^{H\times W_{*}\times D}$ via a linear projection $\mathbf{P}_{u}(\cdot)$, a depth-wise convolution $\mathbf{W}_{d}(\cdot)$ and a SiLU activation function $\boldsymbol{\sigma}(\cdot)$. Then we reshape $\mathbf{F}_{u}$ as $\mathbf{F}_{s}^{g}\in\mathbb{R}^{1\times H\times W_{*}\times D}$, and input it into $\mathrm{HSI}\text{-}\mathrm{SSM}(\cdot)$ to model long-range dependencies using global receptive fields. As a result, we can formulate the output feature $\mathbf{F}_{o}^{g}\in\mathbb{R}^{H\times W_{*}\times D}$ of the GHSB module as follows:

$$\mathbf{F}_{o}^{g}=\mathbf{P}_{o}\big(\mathrm{LN}(\mathrm{RS}(\mathrm{HSI}\text{-}\mathrm{SSM}(\mathbf{F}_{s}^{g})))\otimes\mathbf{F}_{l}\big), \tag{9}$$

where $\otimes$ denotes the element-wise multiplication. $\mathbf{F}_{l}=\boldsymbol{\sigma}(\mathbf{P}_{l}(\mathbf{F}_{i}))\in\mathbb{R}^{H\times W_{*}\times D}$ denotes the output of the lower branch in Fig. 3c, and $\mathbf{P}_{l}(\cdot)$ is the linear mapping. $\mathrm{LN}(\cdot)$ is the layer normalization (LN), $\mathrm{RS}(\cdot)$ reshapes the given feature to $\mathbb{R}^{H\times W_{*}\times D}$, and $\mathbf{P}_{o}$ is the linear projection to obtain $\mathbf{F}_{o}^{g}$. Moreover, $\mathrm{HSI}\text{-}\mathrm{SSM}(\cdot)$ denotes the proposed hyperspectral image state space module (HSI-SSM).

HyperSpectral Image State Space Module (HSI-SSM) can model long-range cross-pixel interactions to explore global contexts of $\mathbf{F}_{i}$ using global receptive fields. As shown in Fig. 3e, given the input feature $\mathbf{F}_{s}^{g}\in\mathbb{R}^{1\times H\times W_{*}\times D}$, we unfold the entire hyperspectral image (HSI), which includes $H\times W_{*}$ pixels, into four one-dimensional sequences with a size of $HW_{*}$, by scanning these pixels along four distinct traversal paths: from the top-left to the bottom-right, from the top-right to the bottom-left, from the bottom-right to the top-left, and from the bottom-left to the top-right. We denote the four sequence features as $\{\mathbf{S}_{u}\in\mathbb{R}^{G\times L\times D}\}_{u=1}^{n_{s}}$, where $n_{s}=4,G=1$, and $L=HW_{*}$ denotes the sequence length in the GHSB. Motivated by Mamba [24, 40, 69], we construct some enhanced discrete state space equations on the $u$-th ($u=1,\cdots,n_{s}$) sequence feature $\mathbf{S}_{u}$. Specifically, after defining the learnable variables $\mathbf{A}\in\mathbb{R}^{D\times D_{s}}$ and $\mathbf{E}\in\mathbb{R}^{G\times L\times D}$, we can formulate some continuous parameters such as $\mathbf{B}\in\mathbb{R}^{G\times L\times D_{s}},\mathbf{C}\in\mathbb{R}^{G\times L\times D_{s}}$ and a timescale parameter $\triangle\in\mathbb{R}^{G\times L\times D}$ as:

$$\mathbf{B}=\mathbf{P}_{b}(\mathbf{S}_{u}),\quad\mathbf{C}=\mathbf{P}_{c}(\mathbf{S}_{u}),\quad\triangle=\tau_{\triangle}(\mathbf{E}+\mathbf{P}_{\triangle}(\mathbf{S}_{u})), \tag{10}$$

where $D_{s}$ is the latent feature dimension, and $\tau_{\triangle}(\cdot)$ is the softplus activation function. $\mathbf{P}_{b}(\cdot),\mathbf{P}_{c}(\cdot)$ and $\mathbf{P}_{\triangle}(\cdot)$ are the linear projection matrices. Inspired by the zero-order hold (ZOH) discretization rule [24], we reshape the parameter $\triangle$ as $\overline{\triangle}\in\mathbb{R}^{G\times L\times D\times 1}$ , and utilize it to transform the continuous parameters $\mathbf{A}$ and $\mathbf{B}$ into the discrete parameters $\overline{\mathbf{A}}\in\mathbb{R}^{G\times L\times D\times D_{s}}$ and $\overline{\mathbf{B}}\in\mathbb{R}^{G\times L\times D\times D_{s}}$ :

$$\overline{\mathbf{A}}=\exp(\overline{\triangle}\mathbf{A}),\quad\overline{\mathbf{B}}=(\overline{\triangle}\mathbf{A})^{-1}(\exp(\overline{\triangle}\mathbf{A})-\mathbf{I})\cdot\overline{\triangle}\mathbf{B}. \tag{11}$$
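As a reading aid, Eq. (11) can be written out entry by entry. The form below assumes, as in Mamba [24], that $\mathbf{A}$ acts elementwise (one scalar per channel $d$ and state index $s$), so that the inverse and the exponential reduce to scalar operations. The indices $d$ and $s$ are introduced here for illustration, and the group and position indices are omitted.

```latex
\overline{A}_{d,s} = \exp(\triangle_{d} A_{d,s}),
\qquad
\overline{B}_{d,s}
  = \frac{\exp(\triangle_{d} A_{d,s}) - 1}{\triangle_{d} A_{d,s}}\,\triangle_{d} B_{s}
  = \frac{\exp(\triangle_{d} A_{d,s}) - 1}{A_{d,s}}\, B_{s}.
% For a small timescale, \exp(x) - 1 \approx x gives
% \overline{A}_{d,s} \approx 1 + \triangle_{d} A_{d,s}
% and \overline{B}_{d,s} \approx \triangle_{d} B_{s}.
```

Under this reading, a small timescale $\triangle_{d}$ keeps most of the previous state and admits little of the current input, whereas a large $\triangle_{d}$ (with negative $A_{d,s}$) forgets the previous state quickly.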

After obtaining the discrete $\overline{\mathbf{A}}$ and $\overline{\mathbf{B}}$ via Eq. (11), we reshape the parameter $\mathbf{C}$ as $\overline{\mathbf{C}}\in\mathbb{R}^{G\times L\times D_{s}\times 1}$, and formulate the semantic encoding of $\mathbf{S}_{u}$ in the form of recurrent neural networks (RNNs) to extract a new sequence feature $\mathbf{y}_{k}^{u}\in\mathbb{R}^{G\times L\times D}$. Then we denote $\mathbf{h}_{k-1}^{u},\mathbf{h}_{k}^{u}\in\mathbb{R}^{G\times L\times D\times D_{s}}$ as the latent features of the $(k-1)$-th and $k$-th hidden states in the RNNs, and define $\mathbf{y}_{k}^{u}$ as follows:

$$\mathbf{h}_{k}^{u}=\overline{\mathbf{A}}\mathbf{h}_{k-1}^{u}+\overline{\mathbf{B}}\mathbf{S}_{u},\quad\mathbf{y}_{k}^{u}=\overline{\mathbf{C}}\mathbf{h}_{k}^{u}+\nu\cdot\mathbf{S}_{u}, \tag{12}$$

where $\nu$ denotes the scale parameter. Inspired by [24], we use the broadcasting mechanism to match the dimensions of different matrices for the matrix multiplication operations in Eqs. (11) and (12). Then we merge all sequence features $\{\mathbf{y}_{k}^{u}\}_{u=1}^{n_{s}}$ to get the final output map $\mathbf{y}=\sum_{u=1}^{n_{s}}\mathbf{y}_{k}^{u}$ of the HSI-SSM. In the GHSB, we utilize the HSI-SSM to encode the entire high-resolution HSI in a recursive manner. It can explore long-range dependencies of the input feature $\mathbf{F}_{i}$ using global receptive fields.
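The HSI-SSM pipeline (unfold along four paths, run the recurrence of Eq. (12) on each sequence, merge by summation) can be sketched in NumPy. This is a minimal illustration and not the authors' implementation. It assumes the four paths are row-major order, column-major order and their reversals, that Eq. (12) consumes the $k$-th token of $\mathbf{S}_{u}$ at step $k$, that the projections are shared across the four directions, and that the bias $\mathbf{E}$ is omitted. All function names and the weight matrices `Wb`, `Wc`, `Wd` are hypothetical.

```python
import numpy as np

def cross_scan(feat):
    """Unfold (G, h, w, D) into four (G, h*w, D) sequences (assumed paths)."""
    G, h, w, D = feat.shape
    row = feat.reshape(G, h * w, D)                        # row-major order
    col = feat.transpose(0, 2, 1, 3).reshape(G, h * w, D)  # column-major order
    return [row, col, row[:, ::-1], col[:, ::-1]]          # plus both reversals

def cross_merge(seqs, h, w):
    """Undo the four orderings and sum the results into (G, h, w, D)."""
    row, col, row_rev, col_rev = seqs
    G, L, D = row.shape
    rows = (row + row_rev[:, ::-1]).reshape(G, h, w, D)
    cols = (col + col_rev[:, ::-1]).reshape(G, w, h, D).transpose(0, 2, 1, 3)
    return rows + cols

def ssm_scan(S, A, B, C, delta, nu):
    """Eqs. (11)-(12) for one direction.
    S, delta: (G, L, D); A: (D, Ds); B, C: (G, L, Ds); nu: scalar or (D,)."""
    G, L, D = S.shape
    dA = delta[..., None] * A                      # (G, L, D, Ds) via broadcasting
    A_bar = np.exp(dA)                             # elementwise ZOH, needs dA != 0
    B_bar = (A_bar - 1.0) / dA * delta[..., None] * B[:, :, None, :]
    h = np.zeros((G, D, A.shape[1]))               # hidden state, starts at zero
    y = np.empty_like(S)
    for k in range(L):
        h = A_bar[:, k] * h + B_bar[:, k] * S[:, k, :, None]
        y[:, k] = np.einsum("gds,gs->gd", h, C[:, k]) + nu * S[:, k]
    return y

def softplus(x):
    return np.logaddexp(0.0, x)

def hsi_ssm(feat, A, Wb, Wc, Wd, nu=1.0):
    """Scan a (G, h, w, D) feature along four paths and sum the outputs."""
    G, h, w, D = feat.shape
    outs = []
    for S in cross_scan(feat):
        B, C = S @ Wb, S @ Wc          # input-dependent parameters as in Eq. (10)
        delta = softplus(S @ Wd)       # timescale; the bias E is omitted here
        outs.append(ssm_scan(S, A, B, C, delta, nu))
    return cross_merge(outs, h, w)
```

The explicit loop over `k` makes the linear cost in the sequence length visible; practical implementations typically replace it with a parallel scan.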

Local Hyperspectral S4 Block (LHSB) aims to explore local contexts within position-specific windows. Different from the GHSB, which uses the HSI-SSM to unfold and scan the entire high-resolution HSI containing $H\times W_{*}$ pixels, the LHSB scans each local window of $N\times N$ pixels to capture local contexts. Specifically, as shown in Fig. 3c, after encoding the given feature $\mathbf{F}_{i}$ to $\mathbf{F}_{u}\in\mathbb{R}^{H\times W_{*}\times D}$ via the upper branch, we partition $\mathbf{F}_{u}$ into $\nicefrac{{H}}{{N}}\times\nicefrac{{W_{*}}}{{N}}$ non-overlapping windows, then reshape it as $\mathbf{F}_{s}^{l}\in\mathbb{R}^{\nicefrac{{HW_{*}}}{{N^{2}}}\times N\times N\times D}$, and input $\mathbf{F}_{s}^{l}$ into the HSI-SSM, where $\nicefrac{{HW_{*}}}{{N^{2}}}$ denotes the number of windows and each window includes $N^{2}$ pixels. In the HSI-SSM, we flatten each window of $N^{2}$ pixels and scan it along four distinct directions to obtain four sequence features $\{\mathbf{S}_{u}\in\mathbb{R}^{G\times L\times D}\}_{u=1}^{n_{s}}$. Note that we set $G=\nicefrac{{HW_{*}}}{{N^{2}}}$ and $L=N^{2}$ in the LHSB, which differ from the values used in the GHSB. After encoding each sequence in $\{\mathbf{S}_{u}\}_{u=1}^{n_{s}}$ in a recursive manner to get $\{\mathbf{y}_{k}^{u}\}_{u=1}^{n_{s}}$, we sum them to get the output map $\mathbf{y}\in\mathbb{R}^{G\times L\times D}$ of the HSI-SSM. The LHSB can capture local contexts of the HSI by encoding different local windows of the given feature $\mathbf{F}_{i}$ in a recursive manner. Thus, we formulate the final feature $\mathbf{F}_{o}^{l}\in\mathbb{R}^{H\times W_{*}\times D}$ output by the LHSB as follows:

$$\mathbf{F}_{o}^{l}=\mathbf{P}_{o}\big(\mathrm{LN}(\mathrm{RS}(\mathrm{HSI}\text{-}\mathrm{SSM}(\mathbf{F}_{s}^{l})))\otimes\mathbf{F}_{l}\big). \tag{13}$$
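The window partition used by the LHSB, together with the inverse reshaping $\mathrm{RS}(\cdot)$, can be sketched as follows. The function names are hypothetical and the sketch assumes that $H$ and $W_{*}$ are multiples of $N$. The text does not state how other sizes are handled (for example, $W_{*}=310$ with $N=8$ under the settings of Sec. 4.1), so some padding or cropping would presumably be needed in practice.

```python
import numpy as np

def window_partition(feat, n):
    """Split (H, W, D) into (H*W/n^2, n, n, D) non-overlapping windows."""
    H, W, D = feat.shape
    x = feat.reshape(H // n, n, W // n, n, D)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, n, n, D)

def window_reverse(windows, H, W):
    """Inverse of window_partition: (G, n, n, D) back to (H, W, D)."""
    n, D = windows.shape[1], windows.shape[3]
    x = windows.reshape(H // n, W // n, n, n, D)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, D)

feat = np.arange(8 * 16 * 3, dtype=float).reshape(8, 16, 3)
windows = window_partition(feat, 4)       # G = 8 * 16 / 4^2 = 8 windows
restored = window_reverse(windows, 8, 16)
```

Each window then becomes an independent sequence of length $L=N^{2}$, so the recurrence never has to carry information across more than $N^{2}$ steps.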

Optimization: As shown in Fig. 2, we utilize $\mathcal{E}(\cdot)$ and $\mathcal{D}(\cdot)$ (i.e., our DHM) to iteratively update $\mathbf{X}_{t}$ and $\mathbf{Z}_{t}$ in Eq. (8) until the $T$-th stage. After getting $\mathbf{Z}_{T}$ at the $T$-th stage, we follow [14, 15] to train our DHM by minimizing the Charbonnier loss between the ground truth and the reconstructed HSI $\mathbf{Z}_{T}$.
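For reference, the Charbonnier loss is a smooth variant of the $\ell_{1}$ loss. The sketch below uses its common form $\sqrt{d^{2}+\varepsilon^{2}}$ averaged over all entries; the value $\varepsilon=10^{-3}$ and the mean reduction are assumptions, since the text does not specify them.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt(diff^2 + eps^2) over all entries (eps is an assumed value)."""
    diff = pred - target
    return np.mean(np.sqrt(diff * diff + eps * eps))

z_T = np.zeros((2, 3))
truth = np.full((2, 3), 5.0)
loss = charbonnier_loss(z_T, truth)
```

For errors much larger than $\varepsilon$ the loss behaves like the mean absolute error, while it stays differentiable at zero error.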

4 Experiments

4.1 Implementation Details

For fair comparisons, we use exactly the same experimental configurations as existing HSI reconstruction methods [7, 70, 10, 14, 6, 27] to validate the effectiveness of our DHM. Following the settings of [27, 48, 6, 49], we perform spectral interpolation on the original HSIs and choose a wide spectral range from 450 nm to 650 nm for comparisons on both the simulation and real datasets. The simulation dataset is composed of two subsets: KAIST [11] and CAVE [56]. We employ the CAVE subset to train our DHM, and select 10 HSIs from KAIST to evaluate performance. Moreover, the real dataset [49] consists of five HSI cubes, which are captured by a practical CASSI system [49].

During training, we employ the Adam optimizer [33] to train all variants of our DHM on a single NVIDIA A100 GPU, where the initial learning rate is $1.0\times 10^{-3}$ and the number of training epochs is set to 300. Following [7, 70, 14, 27], we randomly crop HSI cubes to $256\times 256\times 28$ for the simulation dataset, and $660\times 660\times 28$ for the real dataset. The shifting step of dispersion in the CASSI system is set to $\delta_{s}=2$. Moreover, we set $C=28,N=8,N_{1}=2,N_{2}=1$ and $D=D_{s}$ in this paper. Motivated by baseline HSI reconstruction methods [14, 15], we share the network weights of our DHM across different stages, and use exactly the same data augmentation to train our DHM.

Table 1: Performance of our DHM and other comparison methods on the simulation dataset with 10 scenes (S1$\sim$S10). Each cell reports PSNR / SSIM.

| Comparison Methods | #Params | GFLOPS | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TwIST [2] | – | – | 25.16 / 0.700 | 23.02 / 0.604 | 21.40 / 0.711 | 30.19 / 0.851 | 21.41 / 0.635 | 20.95 / 0.644 | 22.20 / 0.643 | 21.82 / 0.650 | 22.42 / 0.690 | 22.67 / 0.569 | 23.12 / 0.669 |
| $\lambda$-Net [52] | 62.64M | 117.98 | 30.10 / 0.849 | 28.49 / 0.805 | 27.73 / 0.870 | 37.01 / 0.934 | 26.19 / 0.817 | 28.64 / 0.853 | 26.47 / 0.806 | 26.09 / 0.831 | 27.50 / 0.826 | 27.13 / 0.816 | 28.53 / 0.841 |
| DNU [67] | 1.19M | 163.48 | 31.72 / 0.863 | 31.13 / 0.846 | 29.99 / 0.845 | 35.34 / 0.908 | 29.03 / 0.833 | 30.87 / 0.887 | 28.99 / 0.839 | 30.13 / 0.885 | 31.03 / 0.876 | 29.14 / 0.849 | 30.74 / 0.863 |
| DIP-HSI [51] | 33.85M | 64.42 | 32.68 / 0.890 | 27.26 / 0.833 | 31.30 / 0.914 | 40.54 / 0.962 | 29.79 / 0.900 | 30.39 / 0.877 | 28.18 / 0.913 | 29.44 / 0.874 | 34.51 / 0.927 | 28.51 / 0.851 | 31.26 / 0.894 |
| DGSMP [29] | 3.76M | 646.65 | 33.26 / 0.915 | 32.09 / 0.898 | 33.06 / 0.925 | 40.54 / 0.964 | 28.86 / 0.882 | 33.08 / 0.937 | 30.74 / 0.886 | 31.55 / 0.923 | 31.66 / 0.911 | 31.44 / 0.925 | 32.63 / 0.917 |
| GAP-Net [48] | 4.27M | 78.58 | 33.74 / 0.911 | 33.26 / 0.900 | 34.28 / 0.929 | 41.03 / 0.967 | 31.44 / 0.919 | 32.40 / 0.925 | 32.27 / 0.902 | 30.46 / 0.905 | 33.51 / 0.915 | 30.24 / 0.895 | 33.26 / 0.917 |
| ADMM-Net [42] | 4.27M | 78.58 | 34.12 / 0.918 | 33.62 / 0.902 | 35.04 / 0.931 | 41.15 / 0.966 | 31.82 / 0.922 | 32.54 / 0.924 | 32.42 / 0.896 | 30.74 / 0.907 | 33.75 / 0.915 | 30.68 / 0.895 | 33.58 / 0.918 |
| HDNet [27] | 2.37M | 154.76 | 35.14 / 0.935 | 35.67 / 0.940 | 36.03 / 0.943 | 42.30 / 0.969 | 32.69 / 0.946 | 34.46 / 0.952 | 33.67 / 0.926 | 32.48 / 0.941 | 34.89 / 0.942 | 32.38 / 0.937 | 34.97 / 0.943 |
| MST-L [5] | 2.03M | 28.15 | 35.40 / 0.941 | 35.87 / 0.944 | 36.51 / 0.953 | 42.27 / 0.973 | 32.77 / 0.947 | 34.80 / 0.955 | 33.66 / 0.925 | 32.67 / 0.948 | 35.39 / 0.949 | 32.50 / 0.941 | 35.18 / 0.948 |
| MST++ [6] | 1.33M | 19.42 | 35.80 / 0.943 | 36.23 / 0.947 | 37.34 / 0.957 | 42.63 / 0.973 | 33.38 / 0.952 | 35.38 / 0.957 | 34.35 / 0.934 | 33.71 / 0.953 | 36.67 / 0.953 | 33.38 / 0.945 | 35.99 / 0.951 |
| CST-L [4] | 3.00M | 40.01 | 35.96 / 0.949 | 36.84 / 0.955 | 38.16 / 0.962 | 42.44 / 0.975 | 33.25 / 0.955 | 35.72 / 0.963 | 34.86 / 0.944 | 34.34 / 0.961 | 36.51 / 0.957 | 33.09 / 0.945 | 36.12 / 0.957 |
| BIRNAT [10] | 4.40M | 2122.66 | 36.79 / 0.951 | 37.89 / 0.957 | 40.61 / 0.971 | 46.94 / 0.985 | 35.42 / 0.964 | 35.30 / 0.959 | 36.58 / 0.955 | 33.96 / 0.956 | 39.47 / 0.970 | 32.80 / 0.938 | 37.58 / 0.960 |
| LDMUN [70] | – | – | 38.07 / 0.969 | 41.16 / 0.982 | 43.70 / 0.983 | 48.01 / 0.993 | 37.76 / 0.980 | 37.65 / 0.980 | 38.58 / 0.973 | 36.31 / 0.979 | 42.66 / 0.984 | 35.18 / 0.967 | 39.91 / 0.979 |
| DAUHST [8] | 6.15M | 79.50 | 37.25 / 0.958 | 39.02 / 0.967 | 41.05 / 0.971 | 46.15 / 0.983 | 35.80 / 0.969 | 37.08 / 0.970 | 37.57 / 0.963 | 35.10 / 0.966 | 40.02 / 0.970 | 34.59 / 0.956 | 38.36 / 0.967 |
| PADUT [38] | 5.38M | 90.46 | 37.36 / 0.962 | 40.43 / 0.978 | 42.38 / 0.979 | 46.62 / 0.990 | 36.26 / 0.974 | 37.27 / 0.974 | 37.83 / 0.966 | 35.33 / 0.974 | 40.86 / 0.978 | 34.55 / 0.963 | 38.89 / 0.974 |

RDLUF [15]

1.89M

115.34

37.94

0.966

40.95

0.977

43.25

0.979

47.83

0.990

37.11

0.976

37.47

0.975

38.58

0.969

35.50

0.970

41.83

0.978

35.23

0.962

39.57

0.974

DERNN (3stg) [14]

0.65M

27.41

37.54

0.964

39.23

0.973

42.01

0.979

47.08

0.992

36.03

0.973

36.82

0.974

37.34

0.966

35.04

0.971

40.97

0.978

34.39

0.960

38.65

0.973

DERNN (5stg) [14]

0.65M

45.60

37.86

0.963

40.28

0.976

42.69

0.978

47.97

0.990

37.11

0.975

37.23

0.974

37.97

0.967

35.82

0.971

41.93

0.979

34.98

0.959

39.38

0.973

DERNN (7stg) [14]

0.65M

63.80

37.91

0.964

40.75

0.978

42.95

0.978

47.51

0.990

37.81

0.978

37.37

0.975

38.49

0.970

35.83

0.971

42.47

0.980

35.04

0.961

39.61

0.974

DERNN (9stg) [14]

0.65M

81.99

38.26

0.965

40.97

0.979

43.22

0.979

48.10

0.991

38.08

0.980

37.41

0.975

38.83

0.971

36.41

0.973

42.87

0.981

35.15

0.962

39.93

0.976

DERNN (9stg^∗) [14]

1.09M

134.18

38.49

0.968

41.27

0.980

43.97

0.980

48.61

0.992

38.29

0.981

37.81

0.977

39.30

0.973

36.51

0.974

43.38

0.983

35.61

0.966

40.33

0.977

DHM-light (3stg)

0.66M

26.42

37.67

0.965

39.58

0.974

42.67

0.981

47.90

0.993

36.47

0.975

36.76

0.975

37.72

0.968

35.14

0.972

41.65

0.981

34.35

0.961

38.99

0.975

DHM-light (5stg)

0.66M

43.96

38.17

0.971

40.91

0.981

43.78

0.983

47.18

0.993

37.41

0.980

37.51

0.978

38.78

0.973

35.83

0.977

43.26

0.985

35.28

0.968

39.81

0.979

DHM-light (7stg)

0.66M

61.50

38.58

0.972

41.42

0.983

43.93

0.984

47.95

0.993

38.29

0.983

37.88

0.980

39.03

0.974

36.26

0.979

43.25

0.986

35.42

0.970

40.20

0.980

DHM-light (9stg)

0.66M

79.04

38.78

0.972

41.44

0.983

44.07

0.984

48.16

0.994

38.32

0.983

37.45

0.980

39.22

0.976

36.37

0.980

43.75

0.987

35.73

0.972

40.33

0.981

DHM (3stg)

0.92M

36.34

37.63

0.967

39.85

0.976

43.40

0.982

47.56

0.993

36.37

0.976

36.98

0.975

38.05

0.970

34.94

0.972

42.04

0.982

34.42

0.962

39.13

0.975

DHM (5stg)

0.92M

60.50

38.48

0.972

41.14

0.982

44.10

0.984

48.03

0.993

37.82

0.981

37.95

0.979

39.21

0.975

36.34

0.978

43.31

0.986

35.20

0.967

40.16

0.980

DHM (7stg)

0.92M

84.65

38.40

0.972

41.52

0.983

44.21

0.984

47.93

0.994

38.21

0.983

38.17

0.981

39.58

0.976

36.17

0.978

43.56

0.986

35.60

0.970

40.34

0.981

DHM (9stg)

0.92M

108.80

38.50

0.972

41.64

0.984

44.37

0.985

48.13

0.994

38.33

0.983

38.27

0.982

39.70

0.977

36.52

0.980

43.89

0.988

35.75

0.971

40.50

0.982

4.2 Quantitative Performance Comparisons

As shown in Tab. 1, we present comprehensive quantitative comparisons between our DHM and SOTA HSI reconstruction methods on the simulation dataset with 10 scenes (S1 $\sim$ S10). From the results in Tab. 1, we observe that the proposed DHM (9stg) (i.e., our DHM unfolded with 9 stages) achieves the best average HSI reconstruction performance (i.e., 40.50 dB in PSNR and 0.982 in SSIM). Our DHM (9stg) substantially surpasses existing methods [2, 39, 51, 4, 28], and outperforms several recent SOTA comparison models (e.g., DAUHST [8], LDMUN [70], RDLUF-Mix [15], DERNN (9stg) [14]) by $0.57\sim 2.14$ dB in PSNR. Such improvements verify the effectiveness of our DHM in exploring long-range dependencies across the entire high-resolution HSIs using global receptive fields, while capturing local contexts within local windows. More importantly, our DHM achieves these gains with a smaller model size and lower computational cost. Compared with the SOTA DERNN (9stg$^{*}$) [14], our DHM (9stg) improves PSNR by 0.17 dB and SSIM by 0.005, while consuming only 84.40% (0.92M / 1.09M) of the parameters and 81.09% (108.80 / 134.18) of the GFLOPs. Moreover, we propose a light model (i.e., DHM-light) where each DHSB contains a single global hyperspectral S4 block (GHSB) and a GFFN. In Tab. 1, our DHM-light with 3/5/7/9 stages outperforms other comparison methods (e.g., DERNN [14]) with the same number of stages, while retaining a comparable model size and fewer GFLOPs. These results illustrate the effectiveness of our DHM for the HSI reconstruction task.
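The efficiency and accuracy figures quoted above follow directly from the Avg, #Params and GFLOPs columns of Tab. 1. The short Python sketch below reproduces the arithmetic; the dictionary layout and variable names are only illustrative.

```python
# Average PSNR/SSIM, parameters (M) and GFLOPs copied from Tab. 1.
dhm_9stg = {"params_m": 0.92, "gflops": 108.80, "psnr": 40.50, "ssim": 0.982}
dernn_9stg_star = {"params_m": 1.09, "gflops": 134.18, "psnr": 40.33, "ssim": 0.977}

# Relative cost of DHM (9stg) with respect to DERNN (9stg*), in percent.
param_ratio = round(100 * dhm_9stg["params_m"] / dernn_9stg_star["params_m"], 2)
gflops_ratio = round(100 * dhm_9stg["gflops"] / dernn_9stg_star["gflops"], 2)

# Accuracy gains of DHM (9stg) over DERNN (9stg*).
psnr_gain = round(dhm_9stg["psnr"] - dernn_9stg_star["psnr"], 2)
ssim_gain = round(dhm_9stg["ssim"] - dernn_9stg_star["ssim"], 3)

# Average PSNR of the recent SOTA models named in the text (Tab. 1, Avg column).
recent_sota_psnr = {"DAUHST": 38.36, "LDMUN": 39.91, "RDLUF": 39.57, "DERNN (9stg)": 39.93}
gains = {name: round(dhm_9stg["psnr"] - p, 2) for name, p in recent_sota_psnr.items()}

print(param_ratio, gflops_ratio, psnr_gain, ssim_gain)
print(min(gains.values()), max(gains.values()))
```

The smallest gain in this set is over DERNN (9stg) and the largest is over DAUHST, which gives the two ends of the quoted PSNR range.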

4.3 Qualitative Performance Comparisons

Simulation Dataset: As depicted in Fig. 4, we select 4 out of the 28 spectral channels to visualize qualitative comparisons of HSI reconstruction on Scene 7 (S7) of the simulation dataset. For better visibility, we zoom in on the regions within the yellow boxes of the original HSIs (bottom), and show the comparison of these regions in the top-right part. In Fig. 4, previous methods suffer from blotchy textures, distortions and blurring artifacts. In contrast, our DHM (9stg) restores HSIs with fewer artifacts and finer details. Besides, the spectral density curves corresponding to the green boxes in the top-left RGB image are depicted in the top-middle part. Our DHM (9stg) exhibits the highest correlation with the ground truth, which illustrates the effectiveness of our DHM.

Real Dataset: To verify the superiority of our model in real HSI reconstruction, we follow [49, 8, 70, 14] to retrain our DHM-light (5stg) on the joint KAIST [11] and CAVE [56] datasets. Besides, we inject 11-bit shot noise into the training samples to simulate real imaging scenarios. As shown in Fig. 5, our DHM-light (5stg) can effectively restore the plant region corresponding to the yellow box. Compared with SOTA methods [8, 38, 14], our DHM-light (5stg) restores clearer contents and structural details with fewer artifacts, verifying the robustness of our model in real HSI restoration.
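The exact noise model is not spelled out here, so the following is only a rough sketch of one common way to simulate shot noise at an 11-bit signal level: the normalized measurement is scaled to $2^{11}-1$ photon-count levels, Poisson-sampled, and rescaled. The function name `add_shot_noise` and the Poisson assumption are ours and may differ from the procedure in [49, 8, 70, 14].

```python
import numpy as np

def add_shot_noise(measurement, bits=11, rng=None):
    """Assumed shot-noise model: Poisson sampling at a `bits`-bit signal level.

    `measurement` is expected to be normalized to [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    levels = 2 ** bits - 1                      # 2047 levels for 11 bits
    clean = np.clip(measurement, 0.0, 1.0)
    counts = rng.poisson(clean * levels)        # signal-dependent photon counts
    return np.clip(counts, 0, levels) / levels  # back to [0, 1]

rng = np.random.default_rng(0)
y_clean = rng.random((4, 6))                    # stand-in for a normalized measurement
y_noisy = add_shot_noise(y_clean, bits=11, rng=rng)
print(float(np.abs(y_noisy - y_clean).mean()))
```

Because the Poisson variance equals its mean, brighter pixels receive larger absolute perturbations in this sketch, which is the signal-dependent behavior that distinguishes shot noise from additive Gaussian noise.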

Table 2: Ablation studies (averaged PSNR and SSIM) of our DHM (5stg) on simulation dataset.

GHSB	LHSB	GFFN	#Params	GFLOPs	PSNR	SSIM
✓		✓	0.66M	43.96	39.81	0.979
	✓	✓	0.66M	43.96	38.76	0.973
✓	✓		0.92M	60.26	39.93	0.979
✓	✓	✓	0.92M	60.50	40.16	0.980

(a) Ablation experiments of the DHSB.

GHSB $\rightarrow$ GA	LHSB $\rightarrow$ LA	#Params	GFLOPs	PSNR	SSIM
		0.92M	60.50	40.16	0.980
✓		0.79M	53.05	39.11	0.975
	✓	0.79M	53.05	40.08	0.980
✓	✓	0.65M	45.60	39.38	0.973

(b) Ablation experiments of alternative variants.

GS $\rightarrow$ LS	LS $\rightarrow$ GS	SPs	#Params	GFLOPs	PSNR	SSIM
	✓		4.59M	60.50	39.23	0.977
	✓	✓	0.92M	60.50	40.12	0.980
✓			4.59M	60.50	39.28	0.977
✓		✓	0.92M	60.50	40.16	0.980

(c) Ablation experiments of block orders and shared parameters (SPs).

Variants	$\boldsymbol{\eta}$	$\boldsymbol{\rho}$	#Params	GFLOPs	PSNR	SSIM
Baseline			0.90M	59.11	39.86	0.979
DHM w/o $\boldsymbol{\eta}$		✓	0.92M	60.50	39.71	0.978
DHM w/o $\boldsymbol{\rho}$	✓		0.92M	60.42	39.92	0.979
DHM	✓	✓	0.92M	60.50	40.16	0.980

(d) Ablation analysis of the learnable parameters $(\boldsymbol{\eta},\boldsymbol{\rho})$.

Table 3: Ablation results of the DHSB. Each cell reports PSNR (dB) / SSIM.

GHSB	LHSB	GFFN	#Params	GFLOPs	S1	S2	S3	S4	S5	S6	S7	S8	S9	S10	Avg
✓		✓	0.66M	43.96	38.17 / 0.971	40.91 / 0.981	43.78 / 0.983	47.18 / 0.993	37.41 / 0.980	37.51 / 0.978	38.78 / 0.973	35.83 / 0.977	43.26 / 0.985	35.28 / 0.968	39.81 / 0.979
	✓	✓	0.66M	43.96	37.28 / 0.962	39.95 / 0.975	42.77 / 0.981	47.42 / 0.992	35.95 / 0.973	36.65 / 0.974	37.40 / 0.966	34.94 / 0.971	41.00 / 0.979	34.28 / 0.960	38.76 / 0.973
✓	✓		0.92M	60.26	38.42 / 0.972	40.75 / 0.981	43.97 / 0.984	47.65 / 0.993	37.79 / 0.981	37.47 / 0.978	38.95 / 0.974	35.96 / 0.976	43.18 / 0.985	35.13 / 0.967	39.93 / 0.979
✓	✓	✓	0.92M	60.50	38.48 / 0.972	41.14 / 0.982	44.10 / 0.984	48.03 / 0.993	37.82 / 0.981	37.95 / 0.979	39.21 / 0.975	36.34 / 0.978	43.31 / 0.986	35.20 / 0.967	40.16 / 0.980

4.4 Ablation Studies

This subsection analyzes the effectiveness of all proposed modules on the simulation dataset using our DHM (5stg) as an example. 1) DHSB: As shown in Tab. 2a, when we remove the GHSB or the LHSB, or replace the GFFN with a traditional feed-forward network (FFN) [8] in the DHSB, the performance of our DHM (5stg) decreases by $0.23\sim 1.40$ dB in PSNR and $0.001\sim 0.007$ in SSIM. Tab. 3 presents ablation results of our DHM (5stg) on 10 scenes (S1 $\sim$ S10) to verify the effectiveness of the DHSB. 2) Variants: In Tab. 2b, the PSNR of our model decreases by $0.08\sim 1.05$ dB when we replace the GHSB with non-local MSA [14] (GHSB $\rightarrow$ GA) or substitute the LHSB with local MSA [14] (LHSB $\rightarrow$ LA), where MSA is the multi-head self-attention [17]. This verifies the effectiveness of our DHM in using global receptive fields to model long-range dependencies while capturing local contexts. 3) Block Orders: In Tab. 2c, we perform ablation studies on shared parameters (SPs) across different stages, and on the order of the GHSB and LHSB: from GHSB to LHSB (GS $\rightarrow$ LS) or from LHSB to GHSB (LS $\rightarrow$ GS). The ablation results validate the effectiveness of our DHM. 4) Parameters: Tab. 2d shows ablation studies on the learnable parameters $(\boldsymbol{\eta},\boldsymbol{\rho})$, which validate their effectiveness in estimating degradation patterns. Fig. 6 visualizes $\{\mathbf{Z}_{0},\mathbf{Z}_{3},\mathbf{Z}_{9}\}$ as examples to verify the effectiveness of our unfolding framework in HSI reconstruction, when we use our DHM (9stg) as the denoiser $\mathcal{D}(\cdot)$.
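The PSNR drops quoted for Tab. 2a and Tab. 2b can be recomputed from the reported averages, and the Avg entry of the full DHM (5stg) can be cross-checked against its per-scene PSNR values in Tab. 3. The sketch below does both; the variant labels are shorthand chosen here, and since the per-scene entries are rounded to two decimals, such a cross-check can differ from a reported average in the last digit for other rows.

```python
# Average PSNR (dB) of the full DHM (5stg), from Tab. 2.
full_psnr = 40.16

# Tab. 2a: variants of the DHSB (labels are shorthand for the table rows).
dhsb_variants = {"w/o LHSB": 39.81, "w/o GHSB": 38.76, "GFFN -> FFN": 39.93}
# Tab. 2b: S4 blocks replaced by attention.
attn_variants = {"GHSB -> GA": 39.11, "LHSB -> LA": 40.08, "both": 39.38}

dhsb_drops = {k: round(full_psnr - v, 2) for k, v in dhsb_variants.items()}
attn_drops = {k: round(full_psnr - v, 2) for k, v in attn_variants.items()}

# Tab. 3, last row: per-scene PSNR of the full DHM (5stg), S1 to S10.
full_per_scene = [38.48, 41.14, 44.10, 48.03, 37.82, 37.95, 39.21, 36.34, 43.31, 35.20]
full_mean = round(sum(full_per_scene) / len(full_per_scene), 2)

print(dhsb_drops)
print(attn_drops)
print(full_mean)
```

In these numbers, removing the GHSB costs the most in Tab. 2a (1.40 dB), and replacing the GHSB with non-local attention costs the most among the single replacements in Tab. 2b (1.05 dB).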

5 Conclusion

In this paper, we propose a novel Dual Hyperspectral Mamba (DHM) to model both global and local dependencies for efficient HSI reconstruction. After estimating degradation patterns of the CASSI system via the learnable parameters, we utilize these parameters to scale the linear projection and provide the noise level for the denoiser (i.e., our DHM) in the multi-stage unfolding framework. Particularly, the proposed DHM mainly consists of a global hyperspectral S4 block (GHSB) and a local hyperspectral S4 block (LHSB). The GHSB explores long-range dependencies across the entire high-resolution HSIs using global receptive fields, while the LHSB constructs S4 models within different local windows to capture local contexts. We conduct extensive quantitative and qualitative comparison experiments on both the simulation and real datasets to demonstrate the effectiveness of our DHM.

References

[1] V. Backman, M. B. Wallace, L. Perelman, J. Arendt, R. Gurjar, M. Muller, Q. Zhang, G. Zonios, E. Kline, and T. McGillican. Detection of preinvasive cancer cells. Nature, 2000.
[2] José M. Bioucas-Dias and Mário A. T. Figueiredo. A new twist: Two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing, 16(12):2992–3004, 2007.
[3] M. Borengasser, W. S. Hungate, and R. Watkins. Hyperspectral remote sensing: principles and applications. CRC press, 2007.
[4] Yuanhao Cai, Jing Lin, Xiaowan Hu, Haoqian Wang, Xin Yuan, Yulun Zhang, Radu Timofte, and Luc Van Gool. Coarse-to-fine sparse transformer for hyperspectral image reconstruction. In European Conference on Computer Vision, pages 686–704. Springer, 2022.
[5] Yuanhao Cai, Jing Lin, Xiaowan Hu, Haoqian Wang, Xin Yuan, Yulun Zhang, Radu Timofte, and Luc Van Gool. Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17502–17511, 2022.
[6] Yuanhao Cai, Jing Lin, Zudi Lin, Haoqian Wang, Yulun Zhang, Hanspeter Pfister, Radu Timofte, and Luc Van Gool. Mst++: Multi-stage spectral-wise transformer for efficient spectral reconstruction. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 744–754, 2022.
[7] Yuanhao Cai, Jing Lin, Haoqian Wang, Xin Yuan, Henghui Ding, Yulun Zhang, Radu Timofte, and Luc V Gool. Degradation-aware unfolding half-shuffle transformer for spectral compressive imaging. Advances in Neural Information Processing Systems, 35:37749–37761, 2022.
[8] Yuanhao Cai, Jing Lin, Haoqian Wang, Xin Yuan, Henghui Ding, Yulun Zhang, Radu Timofte, and Luc Van Gool. Degradation-aware unfolding half-shuffle transformer for spectral compressive imaging. In Advances in Neural Information Processing Systems, 2022.
[9] Stanley H. Chan, Xiran Wang, and Omar A. Elgendy. Plug-and-play admm for image restoration: Fixed-point convergence and applications. IEEE Transactions on Computational Imaging, 3(1):84–98, 2017.
[10] Ziheng Cheng, Bo Chen, Ruiying Lu, Zhengjue Wang, Hao Zhang, Ziyi Meng, and Xin Yuan. Recurrent neural networks for snapshot compressive imaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):2264–2281, 2023.
[11] Inchang Choi, MH Kim, D Gutierrez, DS Jeon, and G Nam. High-quality hyperspectral reconstruction using a spectral prior. In Technical report, 2017.
[12] Jiahua Dong, Yang Cong, Gan Sun, Zhen Fang, and Zhengming Ding. Where and how to transfer: Knowledge aggregation-induced transferability perception for unsupervised domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(3):1664–1681, 2024.
[13] Jiahua Dong, Yang Cong, Gan Sun, Bineng Zhong, and Xiaowei Xu. What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4022–4031, June 2020.
[14] Yubo Dong, Dahua Gao, Yuyan Li, Guangming Shi, and Danhua Liu. Degradation estimation recurrent neural network with local and non-local priors for compressive spectral imaging. arXiv preprint arXiv:2311.08808, 2024.
[15] Yubo Dong, Dahua Gao, Tian Qiu, Yuyan Li, Minxi Yang, and Guangming Shi. Residual degradation learning unfolding framework with mixing priors across spectral and spatial for compressive spectral imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22262–22271, 2023.
[16] D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[18] Mathieu Fauvel, Yuliya Tarabalka, Jón Atli Benediktsson, Jocelyn Chanussot, and James C. Tilton. Advances in spectral-spatial classification of hyperspectral images. Proceedings of the IEEE, 101(3):652–675, 2013.
[19] Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022.
[20] Ying Fu, Zhiyuan Liang, and Shaodi You. Bidirectional 3d quasi-recurrent neural network for hyperspectral image super-resolution. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:2674–2688, 2021.
[21] Ying Fu, Yinqiang Zheng, Imari Sato, and Yoichi Sato. Exploiting spectral-spatial correlation for coded hyperspectral image restoration. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3727–3736, 2016.
[22] M. E. Gehm, R. John, D. J. Brady, R. M. Willett, and T. J. Schulz. Single-shot compressive spectral imaging with a dual-disperser architecture. Opt. Express, 15(21):14013–14027, Oct 2007.
[23] Alexander F. H. Goetz, Gregg Vane, Jerry E. Solomon, and Barrett N. Rock. Imaging spectrometry for earth remote sensing. Science, 228(4704):1147–1153, 1985.
[24] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
[25] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
[26] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021.
[27] Xiaowan Hu, Yuanhao Cai, Jing Lin, Haoqian Wang, Xin Yuan, Yulun Zhang, Radu Timofte, and Luc Van Gool. Hdnet: High-resolution dual-domain learning for spectral compressive imaging. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17521–17530, 2022.
[28] Tao Huang, Weisheng Dong, Xin Yuan, Jinjian Wu, and Guangming Shi. Deep gaussian scale mixture prior for spectral compressive imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16216–16225, 2021.
[29] Tao Huang, Xin Yuan, Weisheng Dong, Jinjian Wu, and Guangming Shi. Deep gaussian scale mixture prior for image reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[30] Md Mohaiminul Islam, Mahmudul Hasan, Kishan Shamsundar Athrey, Tony Braskich, and Gedas Bertasius. Efficient movie scene detection using state-space transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18749–18758, 2023.
[31] Shirin Jalali and Xin Yuan. Snapshot compressed sensing: Performance bounds and algorithms. IEEE Transactions on Information Theory, 65(12):8005–8024, 2019.
[32] Min H. Kim, Todd Alan Harvey, David S. Kittle, Holly Rushmeier, Julie Dorsey, Richard O. Prum, and David J. Brady. 3d imaging spectroscopy for measuring hyperspectral patterns on solid objects. ACM Transactions on Graphics, 31(4), jul 2012.
[33] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[34] David Kittle, Kerkil Choi, Ashwin Wagadarikar, and David J. Brady. Multiframe image estimation for coded aperture snapshot spectral imagers. Appl. Opt., 49(36):6824–6833, Dec 2010.
[35] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
[36] Zeqiang Lai, Kaixuan Wei, and Ying Fu. Deep plug-and-play prior for hyperspectral image restoration. Neurocomputing, 481:281–293, 2022.
[37] Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. arXiv preprint arXiv:2403.06977, 2024.
[38] Miaoyu Li, Ying Fu, Ji Liu, and Yulun Zhang. Pixel adaptive deep unfolding transformer for hyperspectral image reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12959–12968, 2023.
[39] Yang Liu, Xin Yuan, Jinli Suo, David J. Brady, and Qionghai Dai. Rank minimization for snapshot compressive imaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(12):2990–3006, 2019.
[40] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024.
[41] Guolan Lu and Baowei Fei. Medical hyperspectral imaging: a review. Journal of Biomedical Optics, 2014.
[42] Jiawei Ma, Xiao-Yang Liu, Zheng Shou, and Xin Yuan. Deep tensor admm-net for snapshot compressive imaging. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 10222–10231, 2019.
[43] Jun Ma, Feifei Li, and Bo Wang. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.
[44] Xiao Ma, Xin Yuan, Chen Fu, and Gonzalo R. Arce. Led-based compressive spectral-temporal imaging. Opt. Express, 29(7):10698–10715, Mar 2021.
[45] Emmanuel Maggiori, Guillaume Charpiat, Yuliya Tarabalka, and Pierre Alliez. Recurrent neural networks to correct satellite image classification maps. IEEE Transactions on Geoscience and Remote Sensing, 55(9):4962–4971, 2017.
[46] Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. Long range language modeling via gated state spaces. arXiv preprint arXiv:2206.13947, 2022.
[47] F. Melgani and L. Bruzzone. Classification of hyperspectral remote sensing images with support vector machines. IEEE Transactions on Geoscience and Remote Sensing, 42(8):1778–1790, 2004.
[48] Ziyi Meng, Shirin Jalali, and Xin Yuan. Gap-net for snapshot compressive imaging. arXiv preprint arXiv:2012.08364, 2020.
[49] Ziyi Meng, Jiawei Ma, and Xin Yuan. End-to-end low cost compressive spectral imaging with spatial-spectral self-attention. In European conference on computer vision, pages 187–204. Springer, 2020.
[50] Ziyi Meng, Mu Qiao, Jiawei Ma, Zhenming Yu, Kun Xu, and Xin Yuan. Snapshot multispectral endomicroscopy. Optics Letters, 2020.
[51] Ziyi Meng, Zhenming Yu, Kun Xu, and Xin Yuan. Self-supervised neural networks for spectral snapshot compressive imaging. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2602–2611, 2021.
[52] Xin Miao, Xin Yuan, Yunchen Pu, and Vassilis Athitsos. lambda-net: Reconstruct hyperspectral images from a snapshot measurement. In IEEE International Conference on Computer Vision, pages 4058–4068, Oct. 2019.
[53] Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, and Christopher Ré. S4nd: Modeling images and videos as multidimensional signals with state spaces. Advances in neural information processing systems, 35:2846–2861, 2022.
[54] Hien Van Nguyen, Amit Banerjee, and Rama Chellappa. Tracking via object reflectance using a hyperspectral video camera. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, pages 44–51, 2010.
[55] Zhihong Pan, G. Healey, M. Prasad, and B. Tromberg. Face recognition in hyperspectral images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12):1552–1560, 2003.
[56] Jong-Il Park, Moon-Hyun Lee, Michael D. Grossberg, and Shree K. Nayar. Multispectral imaging using multiplexed illumination. In ICCV, 2007.
[57] Mu Qiao, Ziyi Meng, Jiawei Ma, and Xin Yuan. Deep learning for video compressive sensing. APL Photonics, 2020.
[58] Mu Qiao and Xin Yuan. Coded aperture compressive temporal imaging using complementary codes and untrained neural networks for high-quality reconstruction. Opt. Lett., 48(1):109–112, Jan 2023.
[59] Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022.
[60] Jin Tan, Yanting Ma, Hoover Rueda, Dror Baron, and Gonzalo R. Arce. Compressive hyperspectral imaging via approximate message passing. IEEE Journal of Selected Topics in Signal Processing, 10(2):389–401, 2016.
[61] Joel A. Tropp and Anna C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.
[62] Burak Uzkent, Matthew J. Hoffman, and Anthony Vodacek. Real-time vehicle tracking in aerial video using hyperspectral features. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1443–1451, 2016.
[63] Burak Uzkent, Aneesh Rangnekar, and Matthew J. Hoffman. Aerial vehicle tracking by adaptive fusion of hyperspectral likelihood maps. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 233–242, 2017.
[64] Ashwin Wagadarikar, Renu John, Rebecca Willett, and David Brady. Single disperser design for coded aperture snapshot spectral imaging. Appl. Opt., 47(10):B44–B51, Apr 2008.
[65] Jue Wang, Wentao Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, and Raffay Hamid. Selective structured state-spaces for long-form video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6387–6397, 2023.
[66] Lizhi Wang, Chen Sun, Ying Fu, Min H. Kim, and Hua Huang. Hyperspectral image reconstruction using a deep spatial-spectral prior. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8024–8033, 2019.
[67] Lizhi Wang, Chen Sun, Maoqing Zhang, Ying Fu, and Hua Huang. Dnu: Deep non-local unrolling for computational spectral imaging. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1658–1668, 2020.
[68] Lizhi Wang, Zhiwei Xiong, Guangming Shi, Feng Wu, and Wenjun Zeng. Adaptive nonlocal sparse representation for dual-camera compressive hyperspectral imaging. IEEE transactions on pattern analysis and machine intelligence, 39(10):2104–2111, 2016.
[69] Ziyang Wang, Jian-Qing Zheng, Yichi Zhang, Ge Cui, and Lei Li. Mamba-unet: Unet-like pure visual mamba for medical image segmentation. arXiv preprint arXiv:2402.05079, 2024.
[70] Zongliang Wu, Ruiying Lu, Ying Fu, and Xin Yuan. Latent diffusion prior enhanced deep unfolding for spectral image reconstruction. arXiv preprint arXiv:2311.14280, 2023.
[71] Xin Yuan. Generalized alternating projection based total variation minimization for compressive sensing. In 2016 IEEE International Conference on Image Processing (ICIP), pages 2539–2543, 2016.
[72] Xin Yuan, David J. Brady, and Aggelos K. Katsaggelos. Snapshot compressive imaging: Theory, algorithms, and applications. IEEE Signal Processing Magazine, 38(2):65–88, 2021.
[73] Xin Yuan, Yang Liu, Jinli Suo, and Qionghai Dai. Plug-and-play algorithms for large-scale snapshot compressive imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1447–1457, 2020.
[74] Xin Yuan, Tsung-Han Tsai, Ruoyu Zhu, Patrick Llull, David Brady, and Lawrence Carin. Compressive hyperspectral imaging with side information. IEEE Journal of Selected Topics in Signal Processing, 9(6):964–976, 2015.
[75] Yuan Yuan, Xiangtao Zheng, and Xiaoqiang Lu. Hyperspectral image superresolution by transfer learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(5):1963–1974, 2017.
[76] Yubiao Yue and Zhenzhang Li. Medmamba: Vision mamba for medical image classification. arXiv preprint arXiv:2403.03849, 2024.
[77] Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. Plug-and-play image restoration with deep denoiser prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6360–6376, 2021.
[78] Shipeng Zhang, Lizhi Wang, Ying Fu, Xiaoming Zhong, and Hua Huang. Computational hyperspectral imaging based on dimension-discriminative low-rank tensor recovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10183–10192, 2019.
[79] Xuanyu Zhang, Yongbing Zhang, Ruiqin Xiong, Qilin Sun, and Jian Zhang. Herosnet: Hyperspectral explicable reconstruction and optimal sampling deep network for snapshot compressive imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17532–17541, 2022.
[80] Siming Zheng, Yang Liu, Ziyi Meng, Mu Qiao, Zhishen Tong, Xiaoyu Yang, Shensheng Han, and Xin Yuan. Deep plug-and-play priors for spectral snapshot compressive imaging. Photon. Res., 9(2):B18–B29, Feb 2021.
[81] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417, 2024.