1 Department of Electronic Engineering, The Chinese University of Hong Kong (CUHK), Hong Kong, China
2 Shenzhen Research Institute, CUHK, Shenzhen, China
3 The University of Sydney, Sydney, NSW, Australia
4 City University of Hong Kong, Hong Kong, China
5 Centre for Artificial Intelligence and Robotics, HKISI-CAS, Hong Kong, China
6 University College London, London, UK
7 Qilu Hospital of Shandong University, Jinan, China
Email: b.long@link.cuhk.edu.hk, hlren@ee.cuhk.edu.hk

EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule Endoscopy

Long Bai 1,⋆, Qiaozhi Tan 1,2,⋆, Tong Chen 3,⋆, Wan Jun Nah 1,2, Yanheng Li 4, Zhicheng He 1,2, Sishen Yuan 1, Zhen Chen 5, Jinlin Wu 5, Mobarakol Islam 6, Zhen Li 7, Hongbin Liu 5, Hongliang Ren 1,2,†
⋆ Co-first authors. † Corresponding author.
Abstract

Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels remains underexplored. To tackle this, we introduce EndoUIC, a WCE unified illumination correction solution using an end-to-end promptable diffusion transformer (DFT) model. In our work, the illumination prompt module guides the model to adapt to different exposure levels and perform targeted image enhancement, while the Adaptive Prompt Integration (API) and Global Prompt Scanner (GPS) modules further boost the concurrent representation learning between the prompt parameters and features. In addition, the U-shaped restoration DFT model captures long-range dependencies and contextual information for unified illumination restoration. Moreover, we present a novel Capsule-endoscopy Exposure Correction (CEC) dataset, including ground-truth and corrupted image pairs annotated by expert photographers. Extensive experiments against a variety of state-of-the-art (SOTA) methods on four datasets showcase the effectiveness of our proposed method and components in WCE illumination restoration, and the additional downstream experiments further demonstrate its utility for clinical diagnosis and surgical assistance. The code and the proposed dataset are available at github.com/longbai1006/EndoUIC.

1 Introduction

Wireless Capsule Endoscopy (WCE) has revolutionized gastrointestinal (GI) diagnostics by offering a minimally invasive, painless way of examining the GI tract [32]. However, the effectiveness of WCE is often limited by factors such as battery capacity, camera performance, and the complexity of the GI tract [23, 22]. Uneven illumination within the tract can significantly degrade image quality, thus affecting the accuracy and efficiency of diagnosis, screening, and the provision of timely feedback [29]. Low-light image enhancement (LLIE) in WCE images has received considerable attention, and various solutions have been put forward to improve visibility in low-light areas [14, 16, 2, 17, 15, 29], whereas the challenge of overexposure remains less explored [27]. The complex and dynamic internal body environment also causes overexposure, which obscures critical details with excessive brightness, as the brightness levels often extend beyond the dynamic range that these techniques can adequately adjust [20, 33].

Some conventional approaches have been utilized to enhance the structure visibility in WCE images [20, 25]. However, compared to deep learning methods, they tend to be less adaptive, less content-aware, and require manual intervention. Subsequently, García-Vega et al. implemented a multi-stage structure-aware deep network for exposure correction (EC) [5], and employed CycleGAN [35] for EC dataset generation [4]. At present, solutions for WCE unified illumination adaptation are still underexplored, and an end-to-end architecture that unifies the illumination correction tasks is lacking. Furthermore, existing endoscopy EC datasets produced via generative models struggle to replicate the complexity encountered in real-world scenarios. This gap underscores the need for a unified light adaptation model capable of concurrently tackling EC and LLIE, which is crucial for the retention and enhancement of vital diagnostic details.

Denoising diffusion probabilistic models (DDPMs) have demonstrated outstanding performance in low-level vision tasks including denoising, super-resolution, and low-light enhancement, owing to their ability to model complex data distributions and incorporate conditional information effectively [6, 24]. Overexposed and underexposed images typically demand different parameter spaces and optimization trajectories, so directly training a diffusion model on both might not be the best approach. Contrastive learning methods have already been employed to learn varying image degradation types, but they require an additional network [10]. To this end, we introduce a set of learnable parameters that act as our prompt. These prompt parameters are optimized through an end-to-end process, learning to adjust the model’s prior for different image degradations. The prompt then steers the model within the parameter space toward the different low-level details essential for EC and LLIE. Thus, leveraging the task-specific knowledge acquired by the model, it dynamically adapts to the input data according to different brightness levels. Additionally, the prompt module is capable of learning different levels of illumination abnormality within a single degradation type. Thus, even if the input exhibits only one type of degradation, the model can still maintain effective restoration performance. Moreover, to address the issue of data scarcity, we have collected a WCE dataset and invited photography experts to annotate underexposed and overexposed images manually. Our contributions can be summarized as follows:

  • We propose EndoUIC - Endoscopic Unified Illumination Correction - a promptable diffusion model for unified WCE illumination correction. Specifically, the illumination prompt module is designed to navigate the diffusion model toward specific illumination conditions.

  • In our proposed framework, we embed a diffusion process within a U-shape transformer to perceive global illumination and multiscale contextual information, and utilize prompts to guide the illumination restoration procedure. Our prompt module contains an Adaptive Prompt Integration (API) module, which dynamically produces and integrates prompt parameters with feature representations. Additionally, we incorporate the Global Prompt Scanner (GPS) module to enhance the interaction between prompts and features.

  • To tackle the data shortage issue, we propose a novel WCE EC dataset, named Capsule-endoscopy Exposure Correction (CEC) dataset, with normal and wrongly exposed image pairs. Extensive comparison, ablation, and downstream experiments on four datasets demonstrate the superior effectiveness of our EndoUIC, showcasing its potential in clinical applications.

2 Methodology

2.1 Preliminaries

2.1.1 Visual Prompt Learning

Visual prompt learning introduces a set of learnable parameters that provide deep learning models with contextual information regarding the image degradation types in image restoration tasks [12, 18]. These prompts interact with the features of the input image, directing the model to adaptively adjust to different degradation types, thus restoring high-quality, clear images. This method enables a single unified model to address multiple image degradation challenges, enhancing the model’s generalization capabilities.

2.1.2 Pyramid Diffusion Models

PyDiff is an LLIE diffusion model that implements a pyramid diffusion strategy [34]. Unlike DDPMs, where the image resolution remains constant throughout the reverse process, PyDiff starts with a lower resolution and progressively increases it to a higher resolution in the diffusion process. The forward and reverse processes can be formulated with the given input $\mathbf{x}_0$, time step $t$, noise schedule $\{\alpha_t\}_{t=0}^{T}$, and scaling schedule $\{U_t\}_{t=0}^{T}$:

$$q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) = \mathcal{N}\left(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\left(\mathbf{x}_0 \downarrow_{U_t/U_{t-1}}\right),\ \left(1-\bar{\alpha}_t\right)\mathbf{I}\right) \tag{1}$$

$$p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right) = \begin{cases} \mathcal{N}\left(\mathbf{x}_{t-1};\ \frac{\sqrt{\alpha_{t-1}}\,(1-\alpha_t)}{1-\bar{\alpha}_t}\,\mathbf{y}_\theta\left(\mathbf{x}_t\right) + \frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t}\,\mathbf{x}_t,\ \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,(1-\alpha_t)\,\mathbf{I}\right), & \text{if } U_t = U_{t-1} \\ \mathcal{N}\left(\mathbf{x}_{t-1};\ \sqrt{\bar{\alpha}_{t-1}}\left(\mathbf{y}_\theta\left(\mathbf{x}_t\right) \uparrow_{U_t/U_{t-1}}\right),\ \left(1-\bar{\alpha}_{t-1}\right)\mathbf{I}\right), & \text{if } U_t > U_{t-1} \end{cases} \tag{2}$$

in which $\alpha_t \in (0,1)$ and $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$. As $t$ increases, the noise grows ($\alpha_t \geq \alpha_{t+1}$) while the resolution decreases ($U_t \leq U_{t+1}$). This approach speeds up sampling and improves image restoration quality by gradually refining image details as the resolution increases.
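The forward step in Eq. (1) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the average-pooling downsampler, the integer `factor` argument (standing in for the ratio $U_t/U_{t-1}$), and the toy schedule values are all assumptions made for illustration.

```python
import math
import random


def cumulative_alphas(alphas):
    """Return alpha_bar_t = alpha_1 * ... * alpha_t for every step t."""
    out, prod = [], 1.0
    for a in alphas:
        prod *= a
        out.append(prod)
    return out


def downsample(image, factor):
    """Average-pool a 2D list-of-lists image by an integer factor.

    Average pooling is an assumed choice of downsampling operator.
    """
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[i + di][j + dj] for di in range(factor) for dj in range(factor))
            / factor**2
            for j in range(0, w - factor + 1, factor)
        ]
        for i in range(0, h - factor + 1, factor)
    ]


def forward_sample(x0, alpha_bar_t, factor, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * down(x0), (1 - alpha_bar_t) I)."""
    mean_scale = math.sqrt(alpha_bar_t)
    std = math.sqrt(1.0 - alpha_bar_t)
    return [
        [mean_scale * v + std * rng.gauss(0.0, 1.0) for v in row]
        for row in downsample(x0, factor)
    ]


# Toy example: a hypothetical 4-step noise schedule and a 4x4 image.
alphas = [0.99, 0.98, 0.97, 0.96]
alpha_bar = cumulative_alphas(alphas)
x0 = [[float(4 * i + j) for j in range(4)] for i in range(4)]
xt = forward_sample(x0, alpha_bar[-1], factor=2, rng=random.Random(0))
```

Because every $\alpha_t$ lies in $(0,1)$, the cumulative product shrinks with $t$, so later steps carry less signal and more noise, while the downsampling reduces the number of pixels the network has to process at those steps.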

2.2 Proposed Method: EndoUIC

Our proposed EndoUIC framework is presented in Fig. 1, with the U-shape restoration DFT used to estimate the noise. The network is optimized with a simple $\mathcal{L}_1$ loss, and the noise sampling strategy follows [34].

Figure 1: Overview of our EndoUIC. Our network comprises the 4-level diffusion transformer (DFT), which is used to predict the noise. In each upsampling stage of the restoration DFT, the illumination prompt module is incorporated, which consists of an Adaptive Prompt Integration (API) block and a Global Prompt Scanner (GPS) block. ‘SFE’ and ‘OUT’ denote the shallow feature extractor and the output block, respectively.

2.2.1 Restoration DFT

Our network begins with a shallow feature extractor that transforms the image into the feature representation $R$. The features are then fed into the 4-level down-sampling transformer encoder and up-sampling transformer decoder, which is similar to the UNet structure [19]. Skip connections are applied at each level of the encoder-decoder. With $\mathcal{X}_0$ and $\mathcal{X}_2$ denoting the input and output features respectively, each transformer layer can be formulated as:

$$\mathcal{X}_1 = \mathrm{Attn}(\mathrm{Norm}(\mathcal{X}_0)), \quad \mathcal{X}_2 = \mathrm{FFN}(\mathrm{Norm}(\mathcal{X}_1)) \tag{3}$$

in which the time embedding $t$ is injected into $\mathcal{X}_1$. The decoder finally outputs a high-resolution image with normal illumination after propagating through the output block. Each up-sampling stage incorporates an illumination prompt module. First, the output of the prompt module is concatenated with the corresponding encoder’s skip connection output. The result is then passed through a $1\times1$ convolutional (Conv) layer before being input into the respective decoder.
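The fusion step described above (concatenate the prompt-module output with the encoder skip feature, then apply a $1\times1$ Conv) can be sketched in plain Python. This is only an illustration under stated assumptions: feature maps are nested lists shaped `[C][H][W]`, and the names `concat_channels`, `conv1x1`, and the weight values are hypothetical rather than taken from the released code.

```python
def concat_channels(a, b):
    """Concatenate two feature maps shaped [C][H][W] along the channel axis."""
    return a + b


def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution, i.e. a per-pixel linear map over channels.

    x is [C_in][H][W], weight is [C_out][C_in], bias is [C_out].
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][c] * x[c][i][j] for c in range(c_in))
                for j in range(w)
            ]
            for i in range(h)
        ]
        for o in range(len(weight))
    ]


# Toy example: one prompt channel and one skip channel fused into one channel.
prompt_out = [[[1.0, 1.0], [1.0, 1.0]]]
skip_feat = [[[1.0, 2.0], [3.0, 4.0]]]
fused = conv1x1(
    concat_channels(prompt_out, skip_feat),
    weight=[[0.5, 0.5]],  # hypothetical weights: average the two channels
    bias=[0.0],
)
```

A $1\times1$ kernel mixes channels without looking at neighbouring pixels, which is why it is a cheap way to bring the concatenated feature back to the channel width the decoder expects.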

2.2.2 Illumination Prompt Module

Our proposed illumination prompt module includes an Adaptive Prompt Integration (API) block and a Global Prompt Scanner (GPS) block. The prompt module dynamically adjusts the learning targets for different illumination conditions. Given learnable parameters $P$ as the prompts, the prompt module can be formulated as:

$$P' = \mathcal{F}_{API}(\mathcal{X}_I; P), \quad \mathcal{X}_O = \mathcal{F}_{GPS}(\mathcal{X}_I; P') \tag{4}$$

in which $\mathcal{X}_I$ and $\mathcal{X}_O$ denote the input and output features of the illumination prompt module, respectively.

2.2.3 Adaptive Prompt Integration (API)

The API module is designed to generate the prompt parameters $P$ and integrate $P$ with the adaptively learned feature maps. We first define a set of learnable parameters $P$, which are designed to embed different illumination conditions into the features. This design can efficiently capture long-range dependencies to perceive global illumination information and address local-region uneven illumination conditions. Thus, our method can also effectively learn the illumination representation of features.

To achieve this, we refrain from directly multiplying $P$ with the feature matrix, as this could diminish the correlation between $P$ and the features. Instead, we employ a self-adaptive dynamic feature space for efficient representation learning, and a dynamic selection mechanism is utilized with multi-size Conv kernels. To implement large Conv kernels, we decouple the large kernel into small kernels and combine them with a $1\times1$ layer by following [11].

As illustrated in the bottom left of Fig. 1, our dynamic kernel selection mechanism concatenates features from different receptive fields to obtain the combined feature $\mathcal{X}_{\mathcal{A}}$. We then utilize average and max pooling to extract spatial information from the feature space. The extracted features, after concatenation, are passed through a Conv layer, which expands the channel from $2$ to $\mathbf{N}$:

$$\mathcal{X}_{\mathcal{A}}' = \mathcal{F}_{2\rightarrow\mathbf{N}}\left[\mathrm{AvgPool}(\mathcal{X}_{\mathcal{A}}) \,\|\, \mathrm{MaxPool}(\mathcal{X}_{\mathcal{A}})\right] \tag{5}$$

where $[\cdot\|\cdot]$ denotes concatenation. Subsequently, we apply the Sigmoid activation $\sigma$ to obtain the weighted coefficient, which is then multiplied with $P$ using the following equation, enabling dynamic feature learning to weight $P$ adaptively:

$$P' = \mathcal{F}_{FCN}\left[\mathrm{Mean}(\sigma(\mathcal{X}_{\mathcal{A}}'))\right] \odot P \tag{6}$$

where $\odot$ denotes element-wise multiplication and $\mathcal{F}_{FCN}$ denotes the fully-connected layer. The obtained $\mathrm{Conv}_{3\times3}(P')$ is then propagated through the GPS module for further correlation learning of the feature maps and prompt parameters.
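To make Eqs. (5) and (6) concrete, the following is a rough standard-library sketch of the weighting path. Several details are assumptions rather than facts from the paper: pooling is taken over the channel axis (so the two pooled maps give the 2 input channels), the $2\rightarrow\mathbf{N}$ Conv is modelled as a $1\times1$ kernel, `Mean` is read as a spatial average, and each of the $\mathbf{N}$ prompt components is a flat vector. All function names are hypothetical.

```python
import math


def channel_pool(x):
    """Average- and max-pool x ([C][H][W]) over channels, giving [2][H][W]."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    avg = [[sum(x[k][i][j] for k in range(c)) / c for j in range(w)] for i in range(h)]
    mx = [[max(x[k][i][j] for k in range(c)) for j in range(w)] for i in range(h)]
    return [avg, mx]


def prompt_weights(x, conv_w, fc_w):
    """Compute F_FCN[Mean(sigmoid(F_{2->N}[avg || max]))], one weight per prompt.

    conv_w is [N][2] (assumed 1x1 Conv), fc_w is [N][N] (fully-connected layer).
    """
    avg, mx = channel_pool(x)
    h, w = len(avg), len(avg[0])
    n = len(conv_w)
    gate_means = []
    for o in range(n):
        total = 0.0
        for i in range(h):
            for j in range(w):
                z = conv_w[o][0] * avg[i][j] + conv_w[o][1] * mx[i][j]
                total += 1.0 / (1.0 + math.exp(-z))  # Sigmoid activation
        gate_means.append(total / (h * w))  # spatial mean (assumed)
    return [sum(fc_w[o][k] * gate_means[k] for k in range(n)) for o in range(n)]


def weight_prompts(prompts, weights):
    """Scale each prompt component by its weight (the element-wise product with P)."""
    return [[wgt * v for v in p] for p, wgt in zip(prompts, weights)]


# Toy example with N = 2 prompt components and a 2-channel 2x2 feature map.
features = [[[1.0, 2.0], [3.0, 4.0]], [[3.0, 2.0], [1.0, 0.0]]]
prompts = [[2.0, 4.0], [6.0, 8.0]]
zero_conv = [[0.0, 0.0], [0.0, 0.0]]  # zero weights make every gate equal 0.5
identity_fc = [[1.0, 0.0], [0.0, 1.0]]
weights = prompt_weights(features, zero_conv, identity_fc)
weighted = weight_prompts(prompts, weights)
```

The point of routing the features through pooling, a Sigmoid gate, and a fully-connected layer is that the prompt weights depend on the input image, so differently exposed inputs can emphasise different prompt components.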

2.2.4 Global Prompt Scanner

In the GPS module, the prompt parameter $P'$ is first concatenated with the input feature $\mathcal{X}_G$, utilizing $P'$ to guide the process of luminance restoration:

$$\mathcal{X}_P = [\mathcal{X}_G \,\|\, P'] \tag{7}$$

The selective-scan mechanism of VMamba [13], which captures long-range representations by scanning sequentially from four directions (top-left $\rightarrow$ bottom-right, bottom-right $\rightarrow$ top-left, top-right $\rightarrow$ bottom-left, bottom-left $\rightarrow$ top-right), has proven to be an effective approach for learning visual representations. To further enhance the global perception and foster the interaction between $\mathcal{X}_G$ and $P'$, we conduct the cross-scan on $\mathcal{X}_P$. In this case, the scans along the same dimension as the concatenation can effectively facilitate the interaction between $P'$ and $\mathcal{X}_G$, while the scans perpendicular to the concatenation dimension can promote the internal representation learning within $P'$ and $\mathcal{X}_G$ themselves.

The GPS module is presented in the bottom right of Fig. 1(c). Features that follow a skip connection are processed by a $1\times1$ Conv layer and a $3\times3$ Conv layer. The combined feature is then fused with features from higher spatial dimensions to facilitate the illumination restoration process of the overall model.
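The four scan directions mentioned above can be illustrated by flattening a small 2D grid in four orders. This sketch only shows the traversal orders; it does not implement the selective state-space update of VMamba. The specific traversals below (row-wise from the top-left, column-wise from the top-right, and their reverses) are one possible reading of the four corner-to-corner directions, and the function name `cross_scan` is hypothetical.

```python
def cross_scan(grid):
    """Return four 1D scan orders of a 2D grid given as a list of rows."""
    h, w = len(grid), len(grid[0])
    # Row by row, left to right: starts top-left, ends bottom-right.
    row_major = [grid[i][j] for i in range(h) for j in range(w)]
    # Column by column from the rightmost column, top to bottom:
    # starts top-right, ends bottom-left.
    col_major = [grid[i][j] for j in range(w - 1, -1, -1) for i in range(h)]
    return {
        "tl_to_br": row_major,
        "br_to_tl": row_major[::-1],
        "tr_to_bl": col_major,
        "bl_to_tr": col_major[::-1],
    }


# Toy example on a 2x3 grid of token indices.
scans = cross_scan([[1, 2, 3], [4, 5, 6]])
```

Every order visits each position exactly once, so each position sees context from a different side of the grid in each scan; combining the four sequences is what gives the scan its global receptive field.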

3 Experiments

3.1 Dataset

We conduct our experiments on two EC datasets and two LLIE datasets:

The Capsule-endoscopy Exposure Correction (CEC) dataset was collected with ANKON magnetically controlled WCEs from three patients. The training set includes 800 images from two patients, and the test set contains 200 images from one patient. Half of the images are overexposed and half are underexposed. Due to the limited working space and the dynamic, deformable in-vivo scenes, it is difficult to capture normal and corrupted image pairs directly. Instead, expert photographers adjusted the normal images into corrupted renditions (overexposure and underexposure). The images were adjusted using Adobe toolkits (Photoshop/Lightroom) and saved in RGB format through a systematic process. The dataset and ethical approval information will be released upon acceptance.

The Endo4IE dataset is a public synthetic EC dataset of conventional endoscopy [4]. It was created by initially selecting public images without exposure issues. Then, CycleGAN [35] was applied to generate paired overexposed and underexposed synthetic images, and MSE and SSIM metrics were used to filter and finalize a dataset of 985 underexposed and 1231 overexposed images.

Figure 2: Visualization results and error maps of our EndoUIC against SOTA methods on the CEC dataset. We present the images enhanced by SOTA exposure correction methods and heat maps of their reconstruction errors, with blue indicating lower errors and red denoting higher errors.

Kvasir-Capsule [21] and Red Lesion Endoscopy (RLE) [3] were originally two datasets used for WCE disease diagnosis. Bai et al. [2] curated images from these datasets and synthesized two datasets specifically tailored for WCE LLIE tasks by applying random Gamma correction and illumination reduction. Specifically, the Kvasir-Capsule dataset comprises 2000 training images and 400 test images. The RLE dataset contains 946 training images and 337 test images.
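As a rough illustration of how a low-light image can be synthesized by random Gamma correction and illumination reduction, the sketch below darkens a list of 8-bit pixel values. It is not the procedure of [2]: the parameter ranges, the order of operations, and the function names are assumptions chosen only to show the idea.

```python
import random


def darken(pixels, gamma, scale):
    """Apply gamma correction, then a global brightness reduction, to 8-bit values."""
    out = []
    for v in pixels:
        normalized = v / 255.0
        out.append(round(255.0 * scale * normalized**gamma))
    return out


def random_darken(pixels, rng, gamma_range=(1.5, 3.0), scale_range=(0.4, 0.8)):
    """Darken with randomly drawn parameters (the ranges are hypothetical)."""
    gamma = rng.uniform(*gamma_range)
    scale = rng.uniform(*scale_range)
    return darken(pixels, gamma, scale), gamma, scale


# Toy example on a short row of pixel intensities.
row = [0, 64, 128, 192, 255]
dark_row, used_gamma, used_scale = random_darken(row, random.Random(0))
```

With a gamma above 1 and a scale below 1, every output value is at most the input value, so the synthesized image is never brighter than the reference it is paired with.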

3.2 Implementation Details

Table 1: EC comparison against existing and SOTA methods on our CEC dataset.

| Methods | FECNet [8] | SID [7] | DRBN [7] | MIRv2 [30] | LLCaps [2] | PyDiff [34] | PromptIR [18] | LACT [1] | PIP [12] | EndoUIC |
|---|---|---|---|---|---|---|---|---|---|---|
| PSNR ↑ | 28.78 | 24.29 | 26.83 | 28.36 | 27.55 | 28.18 | 28.27 | 28.40 | 25.01 | 29.65 |
| SSIM ↑ | 92.61 | 85.69 | 90.50 | 93.58 | 85.95 | 95.79 | 83.14 | 93.09 | 70.09 | 96.80 |
| LPIPS ↓ | 0.1048 | 0.2111 | 0.1452 | 0.1080 | 0.2366 | 0.0941 | 0.0717 | 0.1103 | 0.1527 | 0.0655 |
Table 2: EC comparison against existing and SOTA methods on the Endo4IE dataset. ‘*’ means we use the results from the previous works.

| Methods | LMSPEC* [4] | LMSPEC+* [5] | FECNet [8] | MIRv2 [30] | LA-Net [27] | PyDiff [34] | PromptIR [18] | LACT [1] | PIP [12] | EndoUIC |
|---|---|---|---|---|---|---|---|---|---|---|
| PSNR ↑ | 23.97 | 23.62 | 24.72 | 23.85 | 23.51 | 24.73 | 23.73 | 22.92 | 25.28 | 25.49 |
| SSIM ↑ | 80.34 | 79.97 | 81.84 | 82.33 | 83.78 | 84.78 | 79.57 | 76.88 | 81.94 | 85.20 |
| LPIPS ↓ | - | - | 0.2031 | 0.2376 | 0.1186 | 0.2148 | 0.2396 | 0.2671 | 0.2150 | 0.1937 |

The performance of our proposed EndoUIC is compared with a variety of state-of-the-art (SOTA) LLIE and EC methodologies, which are listed in Tables 1, 2, and 3 and in the supplementary materials. For methods marked with ‘*’, we obtain results directly from previous works. For the remaining methods, we reproduce the results through their official repositories. We conduct our experiments in Python with PyTorch on NVIDIA A100 GPUs. We train our model with Adam for 1000 epochs. The learning rate is set to $10^{-4}$. We evaluate the image enhancement performance with Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) [31]. We also follow the previous work [2] to conduct a downstream medical diagnosis task on the RLE test set: red lesion segmentation. The UNet [19] is trained using Adam for 20 epochs with a learning rate of $10^{-4}$, and evaluated with mean Intersection over Union (mIoU).
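Two of the metrics above have simple closed forms and can be sketched with the standard library alone. PSNR is computed as $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$ and IoU as intersection over union of binary masks. The flat-list inputs, the averaging of IoU over samples, and the function names are simplifying assumptions for illustration; SSIM and LPIPS are omitted because they need considerably more machinery.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized flat images."""
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value**2 / mse)


def iou(pred_mask, true_mask):
    """Intersection over Union of two flat binary masks."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    return inter / union if union else 1.0


def mean_iou(mask_pairs):
    """Average IoU over (prediction, ground truth) pairs (assumed averaging scheme)."""
    return sum(iou(p, t) for p, t in mask_pairs) / len(mask_pairs)


# Toy example: every pixel is off by one intensity level, so MSE = 1.
example_psnr = psnr([0, 0, 0, 0], [1, 1, 1, 1])
example_iou = iou([1, 1, 0, 0], [1, 0, 1, 0])
```

Because PSNR is logarithmic in the mean squared error, a difference of a fraction of a dB between methods corresponds to a multiplicative, not additive, change in error.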

3.3 Results

As indicated in Tables 1 and 2, we first perform the endoscopy exposure correction experiment in comparison with the existing SOTA methods. Our approach surpasses various methods with different architectures (e.g., CNNs, Transformers, and DDPMs). It achieves the best PSNR and SSIM on both our CEC dataset and the public Endo4IE benchmark. Specifically, on the CEC and Endo4IE datasets, our method improves the PSNR by 0.87 dB and 0.21 dB respectively compared to the second-best methods. Our method also achieves SSIM of 96.80% and 85.20% respectively. The visualization of our image enhancement results and their corresponding error maps is presented in Fig. 2, where blue indicates fewer errors. Our method shows the lowest reconstruction errors among the compared methods.
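The PSNR margins quoted above follow directly from the PSNR rows of Tables 1 and 2. The short check below recomputes them; the helper name `margin_over_runner_up` is hypothetical, and the numbers are copied from the tables.

```python
# PSNR values (dB) copied from Table 1 (CEC) and Table 2 (Endo4IE).
cec_psnr = {
    "FECNet": 28.78, "SID": 24.29, "DRBN": 26.83, "MIRv2": 28.36, "LLCaps": 27.55,
    "PyDiff": 28.18, "PromptIR": 28.27, "LACT": 28.40, "PIP": 25.01, "EndoUIC": 29.65,
}
endo4ie_psnr = {
    "LMSPEC": 23.97, "LMSPEC+": 23.62, "FECNet": 24.72, "MIRv2": 23.85, "LA-Net": 23.51,
    "PyDiff": 24.73, "PromptIR": 23.73, "LACT": 22.92, "PIP": 25.28, "EndoUIC": 25.49,
}


def margin_over_runner_up(scores, ours="EndoUIC"):
    """Return our score minus the best competing score, rounded to 2 decimals."""
    best_other = max(v for k, v in scores.items() if k != ours)
    return round(scores[ours] - best_other, 2)


cec_margin = margin_over_runner_up(cec_psnr)          # runner-up: FECNet
endo4ie_margin = margin_over_runner_up(endo4ie_psnr)  # runner-up: PIP
```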

Furthermore, we also compare our method with SOTA LLIE techniques on two publicly available LLIE datasets, with the results presented in Table 3. We demonstrate that our method still achieves excellent results even when only one type of illumination degradation occurs. We also conduct a segmentation experiment on red lesions following LLCaps [2], and the superior results further prove the clinical applicability of our EndoUIC.

Finally, we deconstructed each module of EndoUIC and performed ablation studies, as shown in Table 4. We attempted to remove the API and GPS modules from the prompt architecture and reverted the DFT back to the UNet architecture. In all cases, we observed varying degrees of performance degradation, which further proves the effectiveness of the various components we proposed.

Table 3: LLIE comparison with existing and SOTA solutions on the Kvasir-Capsule [21] and RLE [3] datasets. The red lesion segmentation experiment is conducted on the RLE test set [3] by following the previous work [2]. LLCaps* means we use the results from the previous SOTA [2].

| Dataset | Metric | LLCaps* [2] | PIP [12] | CFWD [26] | Diff-LOL [9] | LA-Net [27] | CLE [28] | PyDiff [34] | PromptIR [18] | EndoUIC |
|---|---|---|---|---|---|---|---|---|---|---|
| Kvasir-Capsule | PSNR ↑ | 35.24 | 33.60 | 35.88 | 33.60 | 30.84 | 26.55 | 35.07 | 33.54 | 36.85 |
| Kvasir-Capsule | SSIM ↑ | 96.34 | 95.09 | 96.26 | 95.42 | 95.32 | 87.87 | 96.60 | 96.77 | 97.04 |
| Kvasir-Capsule | LPIPS ↓ | 0.0374 | 0.0302 | 0.0467 | 0.0847 | 0.0562 | 0.0829 | 0.0364 | 0.0377 | 0.0255 |
| RLE | PSNR ↑ | 33.18 | 28.60 | 30.14 | 28.46 | 25.92 | 26.20 | 33.21 | 32.07 | 33.50 |
| RLE | SSIM ↑ | 93.34 | 87.27 | 90.25 | 82.52 | 85.72 | 81.42 | 93.54 | 93.30 | 93.99 |
| RLE | LPIPS ↓ | 0.0721 | 0.0977 | 0.1088 | 0.1437 | 0.1491 | 0.1134 | 0.0774 | 0.0694 | 0.0658 |
| RLE Seg | mIoU ↑ | 66.47 | 59.46 | 51.47 | 62.46 | 52.57 | 45.33 | 62.56 | 59.92 | 68.97 |
Table 4: Ablation study of the proposed EndoUIC on the CEC dataset. Specifically, we (i) replace the restoration DFT with the original U-Net architecture, (ii) remove the API block, and (iii) remove the GPS block.

| Component / Metric | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Diffusion Trans | | | | | | | | |
| Prompt: API | | | | | | | | |
| Prompt: GPS | | | | | | | | |
| PSNR ↑ | 28.18 | 29.16 | 28.45 | 28.47 | 29.31 | 29.42 | 28.78 | 29.65 |
| SSIM ↑ | 95.79 | 95.81 | 96.49 | 96.60 | 94.92 | 95.73 | 96.21 | 96.80 |
| LPIPS ↓ | 0.0941 | 0.0735 | 0.858 | 0.0776 | 0.0710 | 0.0682 | 0.0727 | 0.0655 |

4 Conclusion

This paper presents EndoUIC, a promptable DFT model for unified illumination correction in WCE. The model’s ability to navigate through different regions of the parameter space allows for tailored adjustments that address the distinct challenges posed by overexposed and underexposed images. Furthermore, with the assistance of expert photographers, we build the CEC dataset, tailored for the EC task in WCE. Extensive experiments conducted across four datasets demonstrate that our EndoUIC surpasses existing SOTA techniques, validating its efficacy in performing endoscopic LLIE and EC tasks. Our approach can be combined with clinical endoscopy systems, with the potential to enhance the visualization, diagnosis, screening, and treatment of GI diseases.

Acknowledgement

This work was supported by HK RGC GRF 14203323 & 14211420, CRF C4026-21GF, Shenzhen-HK-Macau Technology Research Programme (Type C) STIC Grant 202108233000303, and Regional Joint Fund Project 2021B1515120035 (B.02.21.00101) of Guangdong Basic and Applied Research Fund.

References

  • [1] Baek, J.H., Kim, D., Choi, S.M., Lee, H.j., Kim, H., Koh, Y.J.: Luminance-aware color transform for multiple exposure correction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6156–6165 (2023)
  • [2] Bai, L., Chen, T., Wu, Y., Wang, A., Islam, M., Ren, H.: LLCaps: Learning to illuminate low-light capsule endoscopy with curved wavelet attention and reverse diffusion. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 34–44. Springer (2023)
  • [3] Coelho, P., Pereira, A., Leite, A., Salgado, M., Cunha, A.: A deep learning approach for red lesions detection in video capsule endoscopies. In: Image Analysis and Recognition: 15th International Conference, ICIAR 2018, Póvoa de Varzim, Portugal, June 27–29, 2018, Proceedings 15. pp. 553–561. Springer (2018)
  • [4] García-Vega, A., Espinosa, R., Ochoa-Ruiz, G., Bazin, T., Falcón-Morales, L., Lamarque, D., Daul, C.: A novel hybrid endoscopic dataset for evaluating machine learning-based photometric image enhancement models. In: Mexican International Conference on Artificial Intelligence. pp. 267–281. Springer (2022)
  • [5] García-Vega, A., Espinosa, R., Ramírez-Guzmán, L., Bazin, T., Falcón-Morales, L., Ochoa-Ruiz, G., Lamarque, D., Daul, C.: Multi-scale structural-aware exposure correction for endoscopic imaging. In: 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI). pp. 1–5. IEEE (2023)
  • [6] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
  • [7] Huang, J., Liu, Y., Fu, X., Zhou, M., Wang, Y., Zhao, F., Xiong, Z.: Exposure normalization and compensation for multiple-exposure correction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6043–6052 (2022)
  • [8] Huang, J., Liu, Y., Zhao, F., Yan, K., Zhang, J., Huang, Y., Zhou, M., Xiong, Z.: Deep Fourier-based exposure correction network with spatial-frequency interaction. In: European Conference on Computer Vision. pp. 163–180. Springer (2022)
  • [9] Jiang, H., Luo, A., Fan, H., Han, S., Liu, S.: Low-light image enhancement with wavelet-based diffusion models. ACM Transactions on Graphics 42(6), 1–14 (2023)
  • [10] Li, B., Liu, X., Hu, P., Wu, Z., Lv, J., Peng, X.: All-in-one image restoration for unknown corruption. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17452–17462 (2022)
  • [11] Li, Y., Hou, Q., Zheng, Z., Cheng, M.M., Yang, J., Li, X.: Large selective kernel network for remote sensing object detection. arXiv preprint arXiv:2303.09030 (2023)
  • [12] Li, Z., Lei, Y., Ma, C., Zhang, J., Shan, H.: Prompt-in-prompt learning for universal image restoration. arXiv preprint arXiv:2312.05038 (2023)
  • [13] Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Liu, Y.: VMamba: Visual state space model. arXiv preprint arXiv:2401.10166 (2024)
  • [14] Long, M., Li, Z., Xie, X., Li, G., Wang, Z.: Adaptive image enhancement based on guide image and fraction-power transformation for wireless capsule endoscopy. IEEE transactions on biomedical circuits and systems 12(5), 993–1003 (2018)
  • [15] Ma, Y., Liu, Y., Cheng, J., Zheng, Y., Ghahremani, M., Chen, H., Liu, J., Zhao, Y.: Cycle structure and illumination constrained GAN for medical image enhancement. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part II 23. pp. 667–677. Springer (2020)
  • [16] Moghtaderi, S., Yaghoobian, O., Wahid, K.A., Lukong, K.E.: Endoscopic image enhancement: Wavelet transform and guided filter decomposition-based fusion approach. Journal of Imaging 10(1), 28 (2024)
  • [17] Mou, E., Wang, H., Yang, M., Cao, E., Chen, Y., Ran, C., Pang, Y.: Global and local enhancement of low-light endoscopic images (2023)
  • [18] Potlapalli, V., Zamir, S.W., Khan, S., Khan, F.S.: PromptIR: Prompting for all-in-one blind image restoration. arXiv preprint arXiv:2306.13090 (2023)
  • [19] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)
  • [20] Rukundo, O., Pedersen, M., Hovde, Ø., et al.: Advanced image enhancement method for distant vessels and structures in capsule endoscopy. Computational and mathematical methods in medicine 2017 (2017)
  • [21] Smedsrud, P.H., Thambawita, V., Hicks, S.A., Gjestang, H., Nedrejord, O.O., Næss, E., Borgli, H., Jha, D., Berstad, T.J.D., Eskeland, S.L., et al.: Kvasir-Capsule, a video capsule endoscopy dataset. Scientific Data 8(1), 142 (2021)
  • [22] Tan, Q., Bai, L., Wang, G., Islam, M., Ren, H.: EndoOOD: Uncertainty-aware out-of-distribution detection in capsule endoscopy diagnosis. arXiv preprint arXiv:2402.11476 (2024)
  • [23] Wang, G., Bai, L., Wu, Y., Chen, T., Ren, H.: Rethinking exemplars for continual semantic segmentation in endoscopy scenes: Entropy-based mini-batch pseudo-replay. Computers in Biology and Medicine 165, 107412 (2023)
  • [24] Wang, L., Yang, Q., Wang, C., Wang, W., Pan, J., Su, Z.: Learning a coarse-to-fine diffusion transformer for image restoration. arXiv preprint arXiv:2308.08730 (2023)
  • [25] Wang, L., Wu, B., Wang, X., Zhu, Q., Xu, K.: Endoscopic image luminance enhancement based on the inverse square law for illuminance and Retinex. International Journal of Medical Robotics and Computer Assisted Surgery 18(4), e2396 (2022)
  • [26] Xue, M., He, J., He, Y., Liu, Z., Wang, W., Zhou, M.: Low-light image enhancement via CLIP-Fourier guided wavelet diffusion. arXiv preprint arXiv:2401.03788 (2024)
  • [27] Yang, K.F., Cheng, C., Zhao, S.X., Yan, H.M., Zhang, X.S., Li, Y.J.: Learning to adapt to light. International Journal of Computer Vision 131(4), 1022–1041 (2023)
  • [28] Yin, Y., Xu, D., Tan, C., Liu, P., Zhao, Y., Wei, Y.: CLE Diffusion: Controllable light enhancement diffusion model. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 8145–8156 (2023)
  • [29] Yue, G., Gao, J., Cong, R., Zhou, T., Li, L., Wang, T.: Deep pyramid network for low-light endoscopic image enhancement. IEEE Transactions on Circuits and Systems for Video Technology (2023)
  • [30] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., Shao, L.: Learning enriched features for fast image restoration and enhancement. IEEE transactions on pattern analysis and machine intelligence 45(2), 1934–1948 (2022)
  • [31] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)
  • [32] Zhang, Y., Bai, L., Liu, L., Ren, H., Meng, M.Q.H.: Deep reinforcement learning-based control for stomach coverage scanning of wireless capsule endoscopy. In: 2022 IEEE International Conference on Robotics and Biomimetics (ROBIO). pp. 1–6. IEEE (2022)
  • [33] Zhang, Y., Zhang, K., Ding, Y., Liu, S., Wang, M., Wang, X., Qin, Z., Zhang, X., Ma, T., et al.: Deep transfer learning from ordinary to capsule esophagogastroduodenoscopy for image quality controlling. Engineering Reports p. e12776 (2023)
  • [34] Zhou, D., Yang, Z., Yang, Y.: Pyramid diffusion models for low-light image enhancement. arXiv preprint arXiv:2305.10028 (2023)
  • [35] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE International Conference on Computer Vision (2017)

Supplementary Materials for “EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule Endoscopy”

Table 5: Full results of EC comparison with existing methods on our CEC dataset.
Models RetinexNet Zero-DCE RCTNet MSEC FECNet DRBN-ENC MIRNetv1 MIRNetv2 LA-Net
PSNR ↑ 19.05 10.30 18.90 7.54 28.78 26.83 27.62 28.36 16.58
SSIM ↑ 76.74 65.46 65.03 38.77 92.61 90.50 93.03 93.58 66.40
LPIPS ↓ 0.4522 0.5873 0.5847 0.6561 0.1048 0.1452 0.1252 0.1080 0.4233
Models LLCaps PyDiff PromptIR LACT PIP CLE PSENet DiT EndoUIC
PSNR ↑ 27.55 28.18 28.27 28.40 25.01 27.12 12.70 23.43 29.65
SSIM ↑ 85.95 95.79 83.14 93.09 70.09 69.99 64.74 91.18 96.80
LPIPS ↓ 0.2366 0.0941 0.0717 0.1103 0.1833 0.1527 0.4973 0.1560 0.0655
Table 6: Full results of EC comparison with existing methods on the Endo4IE dataset.
Models LMSPEC* LMSPEC+* RetinexNet Zero-DCE RCTNet MSEC FECNet DRBN-ENC MIRNetv2
PSNR ↑ 23.97 23.62 16.82 14.30 23.66 22.24 24.72 7.86 23.85
SSIM ↑ 23.62 79.97 68.78 68.15 79.59 76.05 81.84 5.14 82.33
LPIPS ↓ - - 0.3148 0.4840 0.3041 0.2508 0.2031 0.8516 0.2376
Models LA-Net LLCaps PyDiff PromptIR LACT PIP CLE PSENet EndoUIC
PSNR ↑ 23.51 22.85 24.73 23.73 22.92 25.28 21.20 17.49 25.49
SSIM ↑ 83.78 66.42 84.78 79.57 76.88 81.94 57.39 71.12 85.20
LPIPS ↓ 0.1186 0.3446 0.2148 0.2396 0.2671 0.2150 0.3816 0.2839 0.1937
Figure 3: The visualization for the EC task on the CEC dataset.
Figure 4: The visualization for the EC task on the Endo4IE dataset.
Figure 5: The visualization for the LLIE task on Kvasir-Capsule and RLE datasets.