1 Department of Electronic Engineering, The Chinese University of Hong Kong (CUHK), Hong Kong, China
2 Shenzhen Research Institute, CUHK, Shenzhen, China
3 The University of Sydney, Sydney, NSW, Australia
4 City University of Hong Kong, Hong Kong, China
5 Centre for Artificial Intelligence and Robotics, HKISI-CAS, Hong Kong, China
6 University College London, London, UK
7 Qilu Hospital of Shandong University, Jinan, China
Email: b.long@link.cuhk.edu.hk, hlren@ee.cuhk.edu.hk

EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule Endoscopy

Long Bai 1,⋆, Qiaozhi Tan 1,2,⋆, Tong Chen 3,⋆, Wan Jun Nah 1,2, Yanheng Li 4, Zhicheng He 1,2, Sishen Yuan 1, Zhen Chen 5, Jinlin Wu 5, Mobarakol Islam 6, Zhen Li 7, Hongbin Liu 5, Hongliang Ren 1,2,⋆⋆

⋆ Co-first authors. ⋆⋆ Corresponding author.

Abstract

Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels remains underexplored. To tackle this, we introduce EndoUIC, a WCE unified illumination correction solution using an end-to-end promptable diffusion transformer (DFT) model. In our work, the illumination prompt module guides the model to adapt to different exposure levels and perform targeted image enhancement, while the Adaptive Prompt Integration (API) and Global Prompt Scanner (GPS) modules further strengthen the joint representation learning between the prompt parameters and the features. In addition, the U-shaped restoration DFT model captures long-range dependencies and contextual information for unified illumination restoration. Moreover, we present a novel Capsule-endoscopy Exposure Correction (CEC) dataset, including ground-truth and corrupted image pairs annotated by expert photographers. Extensive experiments against a variety of state-of-the-art (SOTA) methods on four datasets showcase the effectiveness of our proposed method and components in WCE illumination restoration, and additional downstream experiments further demonstrate its utility for clinical diagnosis and surgical assistance. The code and the proposed dataset are available at github.com/longbai1006/EndoUIC.

1 Introduction

Wireless Capsule Endoscopy (WCE) has revolutionized gastrointestinal (GI) diagnostics by offering a minimally invasive, painless way to examine the GI tract [32]. However, the effectiveness of WCE is often constrained by factors such as limited battery capacity, camera performance, and the complexity of the GI tract [23, 22]. Uneven illumination within the tract can significantly degrade image quality, thus affecting the accuracy and efficiency of diagnosis, screening, and the provision of timely feedback [29]. While low-light image enhancement (LLIE) in WCE images has received considerable attention, leading to various strategies to improve visibility in low-light areas [14, 16], the challenge of overexposure remains less explored [27]. Various solutions [2, 17, 15, 29] have been put forward to enhance low-light WCE images. Nevertheless, the complex and dynamic internal body environment also causes overexposure, which obscures critical details with excessive brightness, as the brightness levels often extend beyond the dynamic range these techniques can adequately adjust [20, 33].

Some conventional approaches have been utilized to enhance structure visibility in WCE images [20, 25]. However, compared to deep learning methods, they tend to be less adaptive and less content-aware, and they require manual intervention. Subsequently, García-Vega et al. implemented a multi-stage structure-aware deep network for exposure correction (EC) [5], and employed CycleGAN [35] for EC dataset generation [4]. At present, WCE unified illumination adaptation remains underexplored, and no end-to-end architecture unifies the illumination correction tasks. Furthermore, existing endoscopy EC datasets produced via generative models struggle to replicate the complexity encountered in real-world scenarios. This gap underscores the need for a unified light adaptation model capable of concurrently tackling EC and LLIE, which is crucial for the retention and enhancement of vital diagnostic details.

Denoising diffusion probabilistic models (DDPMs) have demonstrated outstanding performance in low-level vision tasks including denoising, super-resolution, and low-light enhancement, owing to their ability to model complex data distributions and incorporate conditional information effectively [6, 24]. Overexposed and underexposed images typically demand different parameter spaces and optimization trajectories, so directly training a diffusion model on both might not be the best approach. Contrastive learning methods have already been employed to learn varying image degradation types, but they require an additional network [10]. To this end, we introduce a set of learnable parameters that act as our prompt. These prompt parameters are optimized through an end-to-end process, learning to adjust the model’s prior for different image degradations. They then steer the model within the parameter space toward the different low-level details essential for EC and LLIE. Thus, leveraging the task-specific knowledge acquired by the model, it dynamically adapts to the input data according to different brightness levels. Additionally, the prompt module is capable of learning different levels of illumination abnormality within a single degradation type. Thus, even if the input exhibits only one type of degradation, the model can still maintain effective restoration performance. Moreover, to address the issue of data scarcity, we have collected a WCE dataset and invited photography experts to annotate underexposed and overexposed images manually. Specifically, our contributions can be summarized as threefold:

– We propose EndoUIC (Endoscopic Unified Illumination Correction), a promptable diffusion model for unified WCE illumination correction. Specifically, the illumination prompt module is designed to navigate the diffusion model toward specific illumination conditions.

– In our proposed framework, we embed a diffusion process within a U-shape transformer to perceive global illumination and multiscale contextual information, and utilize prompts to guide the illumination restoration procedure. Our prompt module contains an Adaptive Prompt Integration (API) module, which dynamically produces and integrates prompt parameters with feature representations. Additionally, we incorporate the Global Prompt Scanner (GPS) module to enhance the interaction between prompts and features.

– To tackle the data shortage issue, we propose a novel WCE EC dataset, named Capsule-endoscopy Exposure Correction (CEC) dataset, with normal and wrongly exposed image pairs. Extensive comparison, ablation, and downstream experiments on four datasets demonstrate the superior effectiveness of our EndoUIC, showcasing its potential in clinical applications.

2 Methodology

2.1 Preliminaries

2.1.1 Visual Prompt Learning

Visual prompt learning introduces a set of learnable parameters that provide deep learning models with contextual information regarding the image degradation types in image restoration tasks [12, 18]. These prompts interact with the features of the input image, directing the model to adaptively adjust to different degradation types, thus restoring high-quality, clear images. This method enables a single unified model to address multiple image degradation challenges, enhancing the model’s generalization capabilities.

2.1.2 Pyramid Diffusion Models

The Pyramid Diffusion Model (PyDiff) is an LLIE diffusion model that implements a pyramid diffusion strategy [34]. Unlike DDPMs, where image resolution remains constant throughout the reverse process, PyDiff starts the reverse process at a lower resolution and progressively increases it. The forward and reverse processes can be formulated with the given input $\mathbf{x}_{0}$, time step $t$, noise schedule $\{\alpha_{t}\}_{t=0}^{T}$, and scaling schedule $\{U_{t}\}_{t=0}^{T}$:

$$q\left(\mathbf{x}_{t}\mid\mathbf{x}_{t-1}\right)=\mathcal{N}\left(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}}\left(\mathbf{x}_{0}\downarrow_{U_{t}/U_{t-1}}\right),\left(1-\bar{\alpha}_{t}\right)\mathbf{I}\right) \tag{1}$$

$$p_{\theta}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t}\right)=\begin{cases}\mathcal{N}\left(\mathbf{x}_{t-1};\frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_{t})}{1-\bar{\alpha}_{t}}\mathbf{y}_{\theta}\left(\mathbf{x}_{t}\right)+\frac{\sqrt{\alpha_{t}}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_{t}}\mathbf{x}_{t},\;\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}(1-\alpha_{t})\mathbf{I}\right),&\text{if }U_{t}=U_{t-1}\\ \mathcal{N}\left(\mathbf{x}_{t-1};\sqrt{\bar{\alpha}_{t-1}}\left(\mathbf{y}_{\theta}\left(\mathbf{x}_{t}\right)\uparrow_{U_{t}/U_{t-1}}\right),\;\left(1-\bar{\alpha}_{t-1}\right)\mathbf{I}\right),&\text{if }U_{t}>U_{t-1}\end{cases} \tag{2}$$

in which $\alpha_{t}\in(0,1)$ and $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$. The condition $\alpha_{t}\geq\alpha_{t+1}$ means that the noise grows with $t$, while $U_{t}\leq U_{t+1}$ means that the resolution decreases with $t$. This approach speeds up sampling and improves image restoration quality by gradually refining image details as the resolution increases.
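The forward process can be illustrated with a small NumPy sketch. It is not the PyDiff implementation: it assumes that the scaling schedule is stored as a cumulative integer factor relative to the input resolution (the hypothetical list `scale`), that downsampling is plain average pooling, and that the schedule values are arbitrary toy numbers.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool an (H, W, C) image by an integer factor."""
    h, w, c = x.shape
    assert h % factor == 0 and w % factor == 0
    return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def forward_sample(x0, t, alpha_bar, scale, rng):
    """Draw x_t ~ N(sqrt(abar_t) * downsampled x0, (1 - abar_t) * I)."""
    x_small = downsample(x0, scale[t])
    noise = rng.standard_normal(x_small.shape)
    return np.sqrt(alpha_bar[t]) * x_small + np.sqrt(1.0 - alpha_bar[t]) * noise

# Toy schedules (assumptions): alpha_t shrinks and the scale factor grows with t.
T = 4
alphas = np.array([1.0, 0.99, 0.95, 0.90, 0.80])
alpha_bar = np.cumprod(alphas)
scale = [1, 1, 2, 2, 4]

rng = np.random.default_rng(0)
x0 = rng.random((8, 8, 3))
shapes = [forward_sample(x0, t, alpha_bar, scale, rng).shape for t in range(T + 1)]
print(shapes)
```

In this toy setting the sample shrinks from 8×8 to 2×2 as the noise level rises, which is the reason the late (noisy) reverse steps are cheap to evaluate.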

2.2 Proposed Method: EndoUIC

Our proposed EndoUIC framework is presented in Fig. 1, with a U-shape restoration DFT that estimates the noise. The network is optimized with a simple $\mathcal{L}_{1}$ loss, and the noise sampling strategy follows [34].

Figure 1: Overview of our EndoUIC. Our network comprises a 4-level diffusion transformer (DFT), which is used to predict the noise. The illumination prompt module, which consists of the Adaptive Prompt Integration (API) and Global Prompt Scanner (GPS) blocks, is incorporated in each upsampling stage of the restoration DFT. ‘SFE’ and ‘OUT’ denote the shallow feature extractor and the output block, respectively.

2.2.1 Restoration DFT

Our network begins with a shallow feature extractor that transforms the image into the feature representation $R$ . The features are then fed into the 4-level down-sampling transformer encoder and up-sampling transformer decoder, which is similar to the UNet structure [19]. The skip connections are executed at each level of the encoder-decoder. With $\mathcal{X}_{0}$ and $\mathcal{X}_{2}$ denoting the input and output feature respectively, each transformer layer can be formulated:

$$\mathcal{X}_{1}=Attn(Norm(\mathcal{X}_{0})),\;\;\mathcal{X}_{2}=FFN(Norm(\mathcal{X}_{1})) \tag{3}$$

in which the time embedding $t$ is injected with $\mathcal{X}_{1}$ . The decoder finally outputs a high-resolution image with normal illumination after propagating through the output block. Each up-sampling stage incorporates an illumination prompt module. Firstly, the output of the prompt module is concatenated with the corresponding encoder’s skip connection output. Then it will be passed through a $1\times 1$ convolutional (Conv) layer before being input into the respective decoder.
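A single transformer layer of the form in Eq. (3) can be sketched in NumPy as follows. This is an illustration under several assumptions that the text does not specify: single-head attention, a ReLU feed-forward network, layer normalization without learned scale or shift, and additive injection of the time embedding into $\mathcal{X}_{1}$. Residual connections are omitted because Eq. (3) does not show them. All weight names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its channel dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over tokens."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def ffn(x, w1, w2):
    """Two-layer feed-forward network with ReLU (stand-in activation)."""
    return np.maximum(x @ w1, 0.0) @ w2

def transformer_layer(x0, t_emb, p):
    x1 = attention(layer_norm(x0), p["wq"], p["wk"], p["wv"])
    x1 = x1 + t_emb  # assumption: the time embedding is added to X_1
    return ffn(layer_norm(x1), p["w1"], p["w2"])

rng = np.random.default_rng(0)
tokens, dim, hidden = 16, 8, 32
params = {
    "wq": rng.standard_normal((dim, dim)) * 0.1,
    "wk": rng.standard_normal((dim, dim)) * 0.1,
    "wv": rng.standard_normal((dim, dim)) * 0.1,
    "w1": rng.standard_normal((dim, hidden)) * 0.1,
    "w2": rng.standard_normal((hidden, dim)) * 0.1,
}
x0 = rng.standard_normal((tokens, dim))
t_emb = rng.standard_normal(dim)
x2 = transformer_layer(x0, t_emb, params)
print(x2.shape)
```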

2.2.2 Illumination Prompt Module

Our proposed illumination prompt module includes an Adaptive Prompt Integration (API) block and a Global Prompt Scanner (GPS) block. The prompt module shall dynamically adjust the learning targets for different illumination conditions. Given learnable parameters $P$ as the prompts, the prompt module can be formulated:

$$P^{\prime}=\mathcal{F}_{API}(\mathcal{X}_{I};P),\;\;\mathcal{X}_{O}=\mathcal{F}_{GPS}(\mathcal{X}_{I};P^{\prime}) \tag{4}$$

in which $\mathcal{X}_{I}$ and $\mathcal{X}_{O}$ depict the input and output feature of the illumination prompt module, respectively.

2.2.3 Adaptive Prompt Integration (API)

The API module is designed to generate the prompt parameters $P$ and integrate $P$ with the adaptively learned feature maps. We first define a set of learnable parameters $P$ , which are designed to embed different illumination conditions into the features. This design can efficiently capture long-range dependencies to perceive global illumination information and address local-region uneven illumination conditions. Thus, our method can also effectively learn the illumination representation of features.

To achieve this, we refrain from directly multiplying $P$ with the feature matrix as this could diminish the correlation between $P$ and the features. Instead, we employ a self-adaptive dynamic feature space for efficient representation learning, and a dynamic selection mechanism is utilized with multi-size Conv kernels. To implement large Conv kernels, we decouple the large kernel into small kernels and combine them with a $1\times 1$ layer by following [11].
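To see why decoupling a large kernel can pay off, the short calculation below compares a stack of small kernels with a single dense kernel of the same receptive field. It assumes stride-1 depth-wise convolutions that may be dilated; the dilation and the specific (kernel, dilation) pairs are illustrative assumptions, not values reported in this paper.

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation
    return rf

def weights_per_channel(layers):
    """Number of depth-wise weights per channel in the stack."""
    return sum(kernel * kernel for kernel, _ in layers)

# Hypothetical decomposition: a 5x5 kernel followed by a 7x7 kernel with dilation 3.
stack = [(5, 1), (7, 3)]
rf = receptive_field(stack)
stack_weights = weights_per_channel(stack)
dense_weights = rf * rf  # one dense kernel covering the same field

print(rf, stack_weights, dense_weights)
```

Under these assumptions, the stacked kernels reach the same spatial extent as the dense kernel with far fewer weights per channel.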

As illustrated in the bottom-left of Fig. 1, our dynamic kernel selection mechanism concatenates features from different receptive fields to obtain the combined feature $\mathcal{X}_{A}$. We then apply average and max pooling to extract spatial information from the feature space. The pooled features are concatenated and passed through a Conv layer, which expands the channels from $2$ to $\mathbf{N}$:

$$\mathcal{X}_{A}^{\prime}=\mathcal{F}_{2\rightarrow\mathbf{N}}[AvgPool(\mathcal{X}_{A})\|MaxPool(\mathcal{X}_{A})] \tag{5}$$

where $[\cdot\|\cdot]$ denotes concatenation. Subsequently, we apply the Sigmoid activation $\sigma$ to obtain the weighted coefficient, which is then multiplied with $P$ using the following equation, enabling dynamic feature learning to weight $P$ adaptively:

$$P^{\prime}=\mathcal{F}_{FCN}[Mean(\sigma(\mathcal{X}_{A}^{\prime}))]\odot P \tag{6}$$

where $\odot$ denotes element-wise multiplication and $\mathcal{F}_{FCN}$ means the fully-connected layer. The obtained $Conv_{3\times 3}(P^{\prime})$ is then propagated through the GPS module for further correlation learning of the feature maps and prompt parameters.
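Eqs. (5) and (6) can be traced with the NumPy sketch below. Several details are assumptions made only for illustration: pooling is taken across channels (which yields the two maps that the $2\rightarrow\mathbf{N}$ layer expands), that layer is modeled as a $1\times 1$ convolution, $Mean$ is a spatial average, and the prompt $P$ is stored as $\mathbf{N}$ components of shape (C, H, W). The trailing $3\times 3$ convolution on $P^{\prime}$ is left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def api_weights(x_a, w_conv, w_fc):
    """Map a (C, H, W) feature to N prompt weights, following Eqs. (5)-(6)."""
    # Channel-wise average and max pooling give two spatial maps: (2, H, W).
    pooled = np.stack([x_a.mean(axis=0), x_a.max(axis=0)])
    # Assumed 1x1 convolution expanding 2 channels to N: (N, H, W).
    x_a_prime = np.einsum("nc,chw->nhw", w_conv, pooled)
    # Sigmoid gate, then spatial mean: one scalar per prompt component.
    gate = sigmoid(x_a_prime).mean(axis=(1, 2))
    # Fully connected layer on the N gate values.
    return w_fc @ gate

def adaptive_prompt(x_a, prompts, w_conv, w_fc):
    """Weight each of the N prompt components (N, C, H, W) element-wise."""
    weights = api_weights(x_a, w_conv, w_fc)
    return weights[:, None, None, None] * prompts

rng = np.random.default_rng(0)
C, H, W, N = 4, 6, 6, 3
x_a = rng.standard_normal((C, H, W))
prompts = rng.standard_normal((N, C, H, W))
w_conv = rng.standard_normal((N, 2))
w_fc = rng.standard_normal((N, N))
p_prime = adaptive_prompt(x_a, prompts, w_conv, w_fc)
print(p_prime.shape)
```

The point of the construction is that the weights depend on the input feature, so the same learnable $P$ is re-weighted differently for differently exposed inputs.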

2.2.4 Global Prompt Scanner

In the GPS module, the prompt parameter $P^{\prime}$ is firstly concatenated with the input feature $\mathcal{X}_{G}$ , utilizing $P^{\prime}$ to guide the process of luminance restoration:

$$\mathcal{X}_{P}=[\mathcal{X}_{G}\|P^{\prime}] \tag{7}$$

The selective-scan mechanism of VMamba [13], which captures long-range representations by scanning sequentially from four directions (top-left $\rightarrow$ bottom-right, bottom-right $\rightarrow$ top-left, top-right $\rightarrow$ bottom-left, bottom-left $\rightarrow$ top-right), has proven to be an effective approach for learning visual representations. To further enhance the global perception and foster the interaction between $\mathcal{X}_{G}$ and $P^{\prime}$, we conduct the cross-scan on $\mathcal{X}_{P}$. In this case, the scans along the concatenation dimension facilitate the interaction between $P^{\prime}$ and $\mathcal{X}_{G}$, while the scans perpendicular to the concatenation dimension promote the internal representation learning within $P^{\prime}$ and $\mathcal{X}_{G}$ themselves.
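The effect of the scan direction relative to the concatenation can be checked with a toy example. The sketch below only builds the four token orderings; it does not implement the selective state-space computation. It assumes that $\mathcal{X}_{G}$ and $P^{\prime}$ are concatenated along the width, and it realizes the four directions as row-major, column-major, and their reverses, which is one common way to implement a cross-scan and may differ from the authors' code.

```python
import numpy as np

def cross_scan(x):
    """Flatten an (H, W, C) map into four token sequences of shape (H*W, C)."""
    h, w = x.shape[:2]
    row_major = x.reshape(h * w, -1)                     # left-to-right, top-to-bottom
    col_major = np.swapaxes(x, 0, 1).reshape(h * w, -1)  # top-to-bottom, left-to-right
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

def crossings(seq):
    """Count how often consecutive tokens switch between feature and prompt."""
    labels = seq[:, 0]
    return int(np.sum(labels[1:] != labels[:-1]))

# Label feature tokens with 0 and prompt tokens with 1 (toy values).
x_g = np.zeros((2, 3, 1))
p_prime = np.ones((2, 2, 1))
x_p = np.concatenate([x_g, p_prime], axis=1)  # assumed: concatenation along width

scans = cross_scan(x_p)
row_crossings = crossings(scans[0])
col_crossings = crossings(scans[2])
print(row_crossings, col_crossings)
```

In this toy layout the row-wise scan alternates between feature and prompt tokens on every row, whereas the column-wise scan visits all feature tokens before any prompt token, so it mostly models each part on its own.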

The GPS module is presented in the bottom-right of Fig. 1(c). Features that follow a skip connection are processed by a $1\times 1$ Conv layer and a $3\times 3$ Conv layer. The combined feature is then fused with features from higher spatial dimensions to facilitate the illumination restoration process of the overall model.

3 Experiments

3.1 Dataset

We conduct our experiments on two EC datasets and two LLIE datasets:

The Capsule-endoscopy Exposure Correction (CEC) dataset was collected with ANKON magnetically controlled WCEs from three patients. The training set includes 800 images from two patients, and the test set contains 200 images from one patient. Half of the images are overexposed and half are underexposed. Due to the limited working space and the dynamic, deformable in-vivo scenes, it is difficult to capture paired normal and corrupted images directly. We therefore asked expert photographers to adjust the normal images into corrupted renditions (overexposed and underexposed). The images are adjusted using Adobe toolkits (Photoshop/Lightroom) and saved in RGB format through a systematic process. The dataset and ethical approval information will be released upon acceptance.

Endo4IE dataset is a public synthetic EC dataset of conventional endoscopy [4]. It was created by initially selecting public images without exposure issues. Then, CycleGAN [35] was applied to generate paired overexposed and underexposed synthetic images, and MSE and SSIM metrics were used to filter and finalize a dataset of $985$ underexposed and $1231$ overexposed images.

Kvasir-Capsule [21] and Red Lesion Endoscopy (RLE) [3] were originally built for WCE disease diagnosis. Bai et al. [2] curated images from these datasets and synthesized two datasets specifically tailored for WCE LLIE tasks by applying random Gamma correction and illumination reduction. Specifically, the Kvasir-Capsule dataset comprises 2000 training images and 400 test images. The RLE dataset contains 946 training images and 337 test images.
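The kind of degradation described above (random Gamma correction plus illumination reduction) can be sketched as follows. The function name, the parameter ranges, and the use of a single global gain are assumptions for illustration; the actual settings used in [2] are not stated here.

```python
import numpy as np

def synthesize_low_light(img, rng, gamma_range=(1.5, 3.0), gain_range=(0.3, 0.7)):
    """Darken an image with values in [0, 1].

    gamma > 1 compresses mid-tones toward black, and gain < 1 lowers the
    overall illumination. Both ranges are hypothetical.
    """
    gamma = rng.uniform(*gamma_range)
    gain = rng.uniform(*gain_range)
    dark = np.clip(gain * np.power(img, gamma), 0.0, 1.0)
    return dark, gamma, gain

rng = np.random.default_rng(0)
img = rng.random((4, 4, 3))
dark, gamma, gain = synthesize_low_light(img, rng)
print(round(float(img.mean()), 3), round(float(dark.mean()), 3))
```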

3.2 Implementation Details

Table 1: EC comparison against existing and SOTA methods on our CEC dataset.

Methods	FECNet [8]	SID [7]	DRBN [7]	MIRv2 [30]	LLCaps [2]	PyDiff [34]	PromptIR [18]	LACT [1]	PIP [12]	EndoUIC
PSNR $\uparrow$	28.78	24.29	26.83	28.36	27.55	28.18	28.27	28.40	25.01	29.65
SSIM $\uparrow$	92.61	85.69	90.50	93.58	85.95	95.79	83.14	93.09	70.09	96.80
LPIPS $\downarrow$	0.1048	0.2111	0.1452	0.1080	0.2366	0.0941	0.0717	0.1103	0.1527	0.0655

Table 2: EC comparison against existing and SOTA methods on the Endo4IE dataset. ‘*’ means we use the results from the previous works

Methods	LMSPEC* [4]	LMSPEC+* [5]	FECNet [8]	MIRv2 [30]	LA-Net [27]	PyDiff [34]	PromptIR [18]	LACT [1]	PIP [12]	EndoUIC
PSNR $\uparrow$	23.97	23.62	24.72	23.85	23.51	24.73	23.73	22.92	25.28	25.49
SSIM $\uparrow$	80.34	79.97	81.84	82.33	83.78	84.78	79.57	76.88	81.94	85.20
LPIPS $\downarrow$	-	-	0.2031	0.2376	0.1186	0.2148	0.2396	0.2671	0.2150	0.1937

The performance of our proposed EndoUIC is compared with a variety of state-of-the-art (SOTA) LLIE and EC methods, which are listed in Tables 1, 2, and 3 and in the supplementary materials. For methods marked with ‘*’, we take the results directly from previous works. For the remaining methods, we reproduce the results through their official repositories. We conduct our experiments with PyTorch on NVIDIA A100 GPUs. We train our model with Adam for $1000$ epochs. The learning rate is set to $10^{-4}$. We evaluate the image enhancement performance with Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) [31]. We also follow the previous work [2] to conduct a downstream medical diagnosis task, red lesion segmentation, on the RLE test set. The UNet [19] is trained using Adam for 20 epochs with a learning rate of $10^{-4}$, and evaluated with mean Intersection over Union (mIoU).
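For reference, two of the metrics above have simple closed forms. The sketch below gives minimal NumPy versions of PSNR and mIoU; it assumes images scaled to $[0, 1]$ and integer label maps, and it is not the evaluation code used for the reported numbers. SSIM and LPIPS are omitted because they require windowed statistics and a pretrained network, respectively.

```python
import numpy as np

def psnr(reference, restored, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB; higher is better."""
    diff = reference.astype(np.float64) - restored.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

reference = np.zeros((4, 4))
restored = np.full((4, 4), 0.1)
psnr_value = psnr(reference, restored)

pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
miou_value = mean_iou(pred, target, num_classes=2)
print(psnr_value, miou_value)
```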

3.3 Results

As indicated in Tables 1 and 2, we first perform the endoscopy exposure correction experiment in comparison with the existing SOTA methods. Our approach surpasses methods built on different architectures (e.g., CNNs, Transformers, and DDPMs). It achieves the best PSNR and SSIM on both our CEC dataset and the public Endo4IE benchmark. Specifically, on the CEC and Endo4IE datasets, our method improves the PSNR by 0.87 dB and 0.21 dB, respectively, compared to the second-best methods. Our method also achieves SSIM of 96.80% and 85.20%, respectively. The visualization of our image enhancement results and their corresponding error maps is presented in Fig. 2, where blue indicates fewer errors. Our method shows the lowest errors among the compared methods.
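The PSNR margins quoted above can be recomputed from the PSNR rows of Tables 1 and 2. The helper name below is hypothetical; the numbers are copied from the tables.

```python
# PSNR (dB) per method, copied from Table 1 (CEC) and Table 2 (Endo4IE).
cec_psnr = {
    "FECNet": 28.78, "SID": 24.29, "DRBN": 26.83, "MIRv2": 28.36,
    "LLCaps": 27.55, "PyDiff": 28.18, "PromptIR": 28.27, "LACT": 28.40,
    "PIP": 25.01, "EndoUIC": 29.65,
}
endo4ie_psnr = {
    "LMSPEC": 23.97, "LMSPEC+": 23.62, "FECNet": 24.72, "MIRv2": 23.85,
    "LA-Net": 23.51, "PyDiff": 24.73, "PromptIR": 23.73, "LACT": 22.92,
    "PIP": 25.28, "EndoUIC": 25.49,
}

def margin_over_runner_up(scores, ours="EndoUIC"):
    """Return the best competing method and our margin over it, in dB."""
    runner_up = max((k for k in scores if k != ours), key=scores.get)
    return runner_up, round(scores[ours] - scores[runner_up], 2)

cec_margin = margin_over_runner_up(cec_psnr)
endo4ie_margin = margin_over_runner_up(endo4ie_psnr)
print(cec_margin, endo4ie_margin)
```

The runner-up is FECNet on CEC and PIP on Endo4IE, which gives the 0.87 dB and 0.21 dB margins stated in the text.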

Furthermore, we also compare our method with SOTA LLIE techniques on two publicly available LLIE datasets, with the results presented in Table 3. We demonstrate that our method still achieves excellent results even when only one type of illumination degradation occurs. We also conduct a segmentation experiment on red lesions following LLCaps [2], and the superior results further prove the clinical applicability of our EndoUIC.

Finally, we deconstructed each module of EndoUIC and performed ablation studies, as shown in Table 4. We attempted to remove the API and GPS modules from the prompt architecture and reverted the DFT back to the UNet architecture. In all cases, we observed varying degrees of performance degradation, which further proves the effectiveness of the various components we proposed.

Table 3: LLIE comparison with existing and SOTA solutions on the Kvasir-Capsule [21] and RLE datasets [3]. The red lesion segmentation experiment is conducted on the RLE test set [3] by following the previous work [2]. LLCaps* means we use the results from the previous SOTA [2].

Dataset	Metric	LLCaps* [2]	PIP [12]	CFWD [26]	Diff-LOL [9]	LA-Net [27]	CLE [28]	PyDiff [34]	PromptIR [18]	EndoUIC
Kvasir-Capsule	PSNR $\uparrow$	35.24	33.60	35.88	33.60	30.84	26.55	35.07	33.54	36.85
Kvasir-Capsule	SSIM $\uparrow$	96.34	95.09	96.26	95.42	95.32	87.87	96.60	96.77	97.04
Kvasir-Capsule	LPIPS $\downarrow$	0.0374	0.0302	0.0467	0.0847	0.0562	0.0829	0.0364	0.0377	0.0255
RLE	PSNR $\uparrow$	33.18	28.60	30.14	28.46	25.92	26.20	33.21	32.07	33.50
RLE	SSIM $\uparrow$	93.34	87.27	90.25	82.52	85.72	81.42	93.54	93.30	93.99
RLE	LPIPS $\downarrow$	0.0721	0.0977	0.1088	0.1437	0.1491	0.1134	0.0774	0.0694	0.0658
RLE Seg	mIoU $\uparrow$	66.47	59.46	51.47	62.46	52.57	45.33	62.56	59.92	68.97

Table 4: Ablation study of the proposed EndoUIC on the CEC dataset. Specifically, we (i) replace the restoration DFT with the original U-Net architecture, (ii) remove the API block, and (iii) remove the GPS block.

Diffusion Trans		✗	✓	✗	✗	✓	✓	✗	✓
Prompt	API	✗	✗	✓	✗	✓	✗	✓	✓
Prompt	GPS	✗	✗	✗	✓	✗	✓	✓	✓
PSNR $\uparrow$		28.18	29.16	28.45	28.47	29.31	29.42	28.78	29.65
SSIM $\uparrow$		95.79	95.81	96.49	96.60	94.92	95.73	96.21	96.80
LPIPS $\downarrow$		0.0941	0.0735	0.858	0.0776	0.0710	0.0682	0.0727	0.0655

4 Conclusion

This paper presents EndoUIC, a promptable DFT model for unified illumination correction in WCE. The model’s ability to navigate through different regions of the parameter space allows for tailored adjustments that address the distinct challenges posed by either overexposed or underexposed images. Furthermore, with the assistance of expert photographers, we build the CEC dataset, tailored for the EC task in WCE. Extensive experiments conducted across four datasets demonstrate that our EndoUIC surpasses existing SOTA techniques, validating its efficacy in performing endoscopic LLIE and EC tasks. Our approach can be combined with clinical endoscopy systems and has the potential to improve the visualization, diagnosis, screening, and treatment of GI diseases.

Acknowledgement

This work was supported by HK RGC GRF 14203323 & 14211420, CRF C4026-21GF, Shenzhen-HK-Macau Technology Research Programme (Type C) STIC Grant 202108233000303, and Regional Joint Fund Project 2021B1515120035 (B.02.21.00101) of Guangdong Basic and Applied Research Fund.

References

[1] Baek, J.H., Kim, D., Choi, S.M., Lee, H.j., Kim, H., Koh, Y.J.: Luminance-aware color transform for multiple exposure correction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6156–6165 (2023)
[2] Bai, L., Chen, T., Wu, Y., Wang, A., Islam, M., Ren, H.: Llcaps: Learning to illuminate low-light capsule endoscopy with curved wavelet attention and reverse diffusion. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 34–44. Springer (2023)
[3] Coelho, P., Pereira, A., Leite, A., Salgado, M., Cunha, A.: A deep learning approach for red lesions detection in video capsule endoscopies. In: Image Analysis and Recognition: 15th International Conference, ICIAR 2018, Póvoa de Varzim, Portugal, June 27–29, 2018, Proceedings 15. pp. 553–561. Springer (2018)
[4] García-Vega, A., Espinosa, R., Ochoa-Ruiz, G., Bazin, T., Falcón-Morales, L., Lamarque, D., Daul, C.: A novel hybrid endoscopic dataset for evaluating machine learning-based photometric image enhancement models. In: Mexican International Conference on Artificial Intelligence. pp. 267–281. Springer (2022)
[5] García-Vega, A., Espinosa, R., Ramírez-Guzmán, L., Bazin, T., Falcón-Morales, L., Ochoa-Ruiz, G., Lamarque, D., Daul, C.: Multi-scale structural-aware exposure correction for endoscopic imaging. In: 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI). pp. 1–5. IEEE (2023)
[6] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
[7] Huang, J., Liu, Y., Fu, X., Zhou, M., Wang, Y., Zhao, F., Xiong, Z.: Exposure normalization and compensation for multiple-exposure correction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6043–6052 (2022)
[8] Huang, J., Liu, Y., Zhao, F., Yan, K., Zhang, J., Huang, Y., Zhou, M., Xiong, Z.: Deep fourier-based exposure correction network with spatial-frequency interaction. In: European Conference on Computer Vision. pp. 163–180. Springer (2022)
[9] Jiang, H., Luo, A., Fan, H., Han, S., Liu, S.: Low-light image enhancement with wavelet-based diffusion models. ACM Transactions on Graphics 42(6), 1–14 (2023)
[10] Li, B., Liu, X., Hu, P., Wu, Z., Lv, J., Peng, X.: All-in-one image restoration for unknown corruption. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 17452–17462 (June 2022)
[11] Li, Y., Hou, Q., Zheng, Z., Cheng, M.M., Yang, J., Li, X.: Large selective kernel network for remote sensing object detection. arXiv preprint arXiv:2303.09030 (2023)
[12] Li, Z., Lei, Y., Ma, C., Zhang, J., Shan, H.: Prompt-in-prompt learning for universal image restoration. arXiv preprint arXiv:2312.05038 (2023)
[13] Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Liu, Y.: Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166 (2024)
[14] Long, M., Li, Z., Xie, X., Li, G., Wang, Z.: Adaptive image enhancement based on guide image and fraction-power transformation for wireless capsule endoscopy. IEEE transactions on biomedical circuits and systems 12(5), 993–1003 (2018)
[15] Ma, Y., Liu, Y., Cheng, J., Zheng, Y., Ghahremani, M., Chen, H., Liu, J., Zhao, Y.: Cycle structure and illumination constrained gan for medical image enhancement. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part II 23. pp. 667–677. Springer (2020)
[16] Moghtaderi, S., Yaghoobian, O., Wahid, K.A., Lukong, K.E.: Endoscopic image enhancement: Wavelet transform and guided filter decomposition-based fusion approach. Journal of Imaging 10(1), 28 (2024)
[17] Mou, E., Wang, H., Yang, M., Cao, E., Chen, Y., Ran, C., Pang, Y.: Global and local enhancement of low-light endoscopic images (2023)
[18] Potlapalli, V., Zamir, S.W., Khan, S., Khan, F.S.: Promptir: Prompting for all-in-one blind image restoration. arXiv preprint arXiv:2306.13090 (2023)
[19] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)
[20] Rukundo, O., Pedersen, M., Hovde, Ø., et al.: Advanced image enhancement method for distant vessels and structures in capsule endoscopy. Computational and mathematical methods in medicine 2017 (2017)
[21] Smedsrud, P.H., Thambawita, V., Hicks, S.A., Gjestang, H., Nedrejord, O.O., Næss, E., Borgli, H., Jha, D., Berstad, T.J.D., Eskeland, S.L., et al.: Kvasir-capsule, a video capsule endoscopy dataset. Scientific Data 8(1), 142 (2021)
[22] Tan, Q., Bai, L., Wang, G., Islam, M., Ren, H.: Endoood: Uncertainty-aware out-of-distribution detection in capsule endoscopy diagnosis. arXiv preprint arXiv:2402.11476 (2024)
[23] Wang, G., Bai, L., Wu, Y., Chen, T., Ren, H.: Rethinking exemplars for continual semantic segmentation in endoscopy scenes: Entropy-based mini-batch pseudo-replay. Computers in Biology and Medicine 165, 107412 (2023)
[24] Wang, L., Yang, Q., Wang, C., Wang, W., Pan, J., Su, Z.: Learning a coarse-to-fine diffusion transformer for image restoration. arXiv preprint arXiv:2308.08730 (2023)
[25] Wang, L., Wu, B., Wang, X., Zhu, Q., Xu, K.: Endoscopic image luminance enhancement based on the inverse square law for illuminance and retinex. International Journal of Medical Robotics and Computer Assisted Surgery 18(4), e2396 (2022)
[26] Xue, M., He, J., He, Y., Liu, Z., Wang, W., Zhou, M.: Low-light image enhancement via clip-fourier guided wavelet diffusion. arXiv preprint arXiv:2401.03788 (2024)
[27] Yang, K.F., Cheng, C., Zhao, S.X., Yan, H.M., Zhang, X.S., Li, Y.J.: Learning to adapt to light. International Journal of Computer Vision 131(4), 1022–1041 (2023)
[28] Yin, Y., Xu, D., Tan, C., Liu, P., Zhao, Y., Wei, Y.: Cle diffusion: Controllable light enhancement diffusion model. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 8145–8156 (2023)
[29] Yue, G., Gao, J., Cong, R., Zhou, T., Li, L., Wang, T.: Deep pyramid network for low-light endoscopic image enhancement. IEEE Transactions on Circuits and Systems for Video Technology (2023)
[30] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., Shao, L.: Learning enriched features for fast image restoration and enhancement. IEEE transactions on pattern analysis and machine intelligence 45(2), 1934–1948 (2022)
[31] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)
[32] Zhang, Y., Bai, L., Liu, L., Ren, H., Meng, M.Q.H.: Deep reinforcement learning-based control for stomach coverage scanning of wireless capsule endoscopy. In: 2022 IEEE International Conference on Robotics and Biomimetics (ROBIO). pp. 01–06. IEEE (2022)
[33] Zhang, Y., Zhang, K., Ding, Y., Liu, S., Wang, M., Wang, X., Qin, Z., Zhang, X., Ma, T., et al.: Deep transfer learning from ordinary to capsule esophagogastroduodenoscopy for image quality controlling. Engineering Reports p. e12776 (2023)
[34] Zhou, D., Yang, Z., Yang, Y.: Pyramid diffusion models for low-light image enhancement. arXiv preprint arXiv:2305.10028 (2023)
[35] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE International Conference on Computer Vision (2017)

Supplementary Materials for “EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule Endoscopy”

Table 5: Full results of EC comparison with existing methods on our CEC dataset.

Models	RetinexNet	Zero-DCE	RCTNet	MSEC	FECNet	DRBN-ENC	MIRNetv1	MIRNetv2	LA-Net
PSNR $\uparrow$	19.05	10.30	18.90	7.54	28.78	26.83	27.62	28.36	16.58
SSIM $\uparrow$	76.74	65.46	65.03	38.77	92.61	90.50	93.03	93.58	66.40
LPIPS $\downarrow$	0.4522	0.5873	0.5847	0.6561	0.1048	0.1452	0.1252	0.1080	0.4233
Models	LLCaps	PyDiff	PromptIR	LACT	PIP	CLE	PSENet	DiT	EndoUIC
PSNR $\uparrow$	27.55	28.18	28.27	28.40	25.01	27.12	12.70	23.43	29.65
SSIM $\uparrow$	85.95	95.79	83.14	93.09	70.09	69.99	64.74	91.18	96.80
LPIPS $\downarrow$	0.2366	0.0941	0.0717	0.1103	0.1833	0.1527	0.4973	0.1560	0.0655

Table 6: Full results of EC comparison with existing methods on the Endo4IE dataset.

Models	LMSPEC*	LMSPEC+*	RetinexNet	Zero-DCE	RCTNet	MSEC	FECNet	DRBN-ENC	MIRNetv2
PSNR $\uparrow$	23.97	23.62	16.82	14.30	23.66	22.24	24.72	7.86	23.85
SSIM $\uparrow$	80.34	79.97	68.78	68.15	79.59	76.05	81.84	5.14	82.33
LPIPS $\downarrow$	-	-	0.3148	0.4840	0.3041	0.2508	0.2031	0.8516	0.2376
Models	LA-Net	LLCaps	PyDiff	PromptIR	LACT	PIP	CLE	PSENet	EndoUIC
PSNR $\uparrow$	23.51	22.85	24.73	23.73	22.92	25.28	21.20	17.49	25.49
SSIM $\uparrow$	83.78	66.42	84.78	79.57	76.88	81.94	57.39	71.12	85.20
LPIPS $\downarrow$	0.1186	0.3446	0.2148	0.2396	0.2671	0.2150	0.3816	0.2839	0.1937