SKIP and SKIP: Segmenting Medical Images with Prompts

Abstract

Most medical image lesion segmentation methods rely on accurate, manually produced pixel-level annotations of the original image for supervised learning. Recently, a series of weakly supervised or unsupervised methods have been proposed to reduce the dependence on pixel-level annotations. However, these methods are still essentially based on pixel-level annotation and ignore the image-level diagnostic results that already exist for the vast number of medical images. In this paper, we propose a dual U-shaped two-stage framework that utilizes image-level labels to prompt the segmentation. In the first stage, we pre-train a classification network with image-level labels, which is used to obtain hierarchical pyramid features and guide the learning of the downstream branches. In the second stage, we feed the hierarchical features obtained from the classification branch into the downstream branch through short and long skip connections and obtain the lesion masks under the supervision of pixel-level labels. Experiments show that our framework achieves better results than networks that use only pixel-level annotations.

Index Terms— Incomplete supervision, medical image segmentation, skip connection

1 INTRODUCTION

Medical image segmentation plays a vital role in clinical practice. Its purpose is to delineate the contours of organs, tissues, lesions, and other regions of interest in original medical images, such as CT, MRI, and pathological images. The segmentation of the lesion area is particularly important for image-guided radiotherapy. Benefiting from the development of deep learning, supervised learning methods based on convolutional neural networks (CNNs) have shown good performance in a series of medical image segmentation tasks. However, CNNs have difficulty capturing long-range dependencies between image pixels. With the advent of Transformers[1], researchers have migrated them to the vision domain and applied them to medical image segmentation[2]. However, most of these methods[3][4] still rely on accurate pixel-level annotations, which require annotators with a high level of medical expertise, and the annotation process is cumbersome and expensive.

Therefore, a series of recent methods have been proposed to use less accurately labeled data for medical image lesion segmentation[5]. These methods can be summarized as follows: 1) Incomplete Supervision: Only part of the existing training data have per-pixel labels, and it can be subdivided into semi-supervision, partial supervision, and domain-specific supervision. 2) Coarse Supervision[6]: All the training images are annotated, but the annotation of some images is coarse, and there are no pixel-by-pixel (PP) annotations. The labels can be divided into a) Image-Level: only category labels are provided for coarse-annotated images. b) Box-Level: Besides the class label, each image includes the object bounding box.

Currently, most research in this field focuses on incomplete supervision and centers on knowledge distillation[7] methods, which learn from a combination of PP-annotated data and the remaining unlabeled data. However, these works have strong limitations: as the proportion of unlabeled data increases, the performance of the model degrades significantly. Thus, to improve the accuracy and generalization of the model, the scale of labeled and unlabeled data must be increased simultaneously. Additionally, these efforts use samples without PP annotations inefficiently and ignore the fact that most clinically collected medical images carry useful knowledge, such as negative or positive diagnostic conclusions. Considering that most image data from the front line already have preliminary clinical imaging diagnoses, how to better utilize this existing knowledge deserves particular attention.

In this paper, we aim to establish an incompletely supervised learning model and propose a dual U-shaped two-stage framework named skip and skip (SKS). First, we train one branch of the framework using the image-level coarse-grained labels that previous work has ignored, enabling it to distinguish whether there is a lesion in the image and to extract the corresponding coarse-grained pyramid features. Subsequently, we perform fine-grained feature extraction based on fine-grained labels for lesion segmentation. Inspired by related modality interaction work [8, 9, 10, 11], we propose a novel module that combines the fine-grained annotation of medical images with the coarse-grained hidden knowledge; the coarse-grained features are used to prompt lesion segmentation, so that our framework achieves better results than previous methods that utilize only PP annotations. The main contributions of this paper are as follows: 1) A dual U-shaped medical image lesion segmentation framework (SKS) is proposed. It utilizes the coarse-grained knowledge contained in vast amounts of medical images, which has previously been overlooked. SKS achieves strong segmentation results with a small number of annotated samples. 2) We propose a three-layer pyramid structure that employs Swin Transformer V2 (Swin-T v2) as its backbone to extract both coarse- and fine-grained pyramid features from medical images within the same framework. 3) We implement three different skips that integrate coarse- and fine-grained features within our framework. Skipping these different features effectively prompts the lesion segmentation.

Fig. 1: Overview of SKS. This dual U-shaped two-stage framework is a multi-level pyramid structure containing a feature extraction module and a lesion segmentation module. In the former, the yellow part is the coarse-grained feature guidance branch and the green part is the fine-grained feature student branch.

2 METHODOLOGY

2.1 SKS Overview

To the best of our knowledge, SKS is the first framework that uses the existing coarse-grained diagnostic results of the medical image itself for lesion segmentation. Encouraged by the multi-task learning paradigm, we leverage the image features learned from the model’s coarse-grained predictions to guide the downstream branches through three skips for lesion segmentation. As depicted in Figure 1, the model includes two main parts: a feature extraction module, equipped with a coarse-grained image feature guidance branch and a fine-grained image feature student branch, and a lesion segmentation module. We assume that the input space is $\mathcal{X}=\{x_{1},x_{2},...,x_{n}\}$ , that $\mathcal{Y}_{cor}=\{y_{1}^{c},y_{2}^{c},...,y_{n}^{c}\}$ is the set of coarse-grained labels (i.e., the clinical diagnosis results at the image level), and that $\mathcal{Y}_{fine}=\{y_{1}^{f},y_{2}^{f},...,y_{n}^{f}\}$ is the set of fine-grained labels (i.e., the accurate annotations of the image lesions). Next, we introduce the details of our SKS framework and how the upstream branch guides the segmentation module to perform lesion segmentation through three skips.

2.2 Feature Extraction

The feature extraction module includes two branches with similar structures: the coarse-grained image feature guidance branch and the fine-grained image feature student branch. To acquire prior knowledge from the clinical image data itself, specifically the existing coarse-grained diagnosis results at the image level, we first feed the image into the coarse-grained image feature guidance branch. As shown in Figure 1, both feature extraction branches are pyramid structures. The height and width of the input image are denoted as $H$ and $W$ , respectively. We first divide the image into non-overlapping patches of size 4 $\times$ 4, so that each patch has a raw feature dimension of 4 $\times$ 4 $\times$ 3 = 48. Details of these two branches are shown in Figure 2(a). Following patch partitioning, a linear embedding layer maps the input to an arbitrary dimension $C$ before entering the pyramid feature extraction stage. Leveraging the advancements made by Swin-T v2, we employ a shifted window self-attention mechanism[12]: $Attention(Q,K,V)=SoftMax(\frac{QK^{T}}{\sqrt{d}}+B)V$ to extract hierarchical features from the image, where $Q,K,V\in\mathbbm{R}^{M^{2}\times d}$ are the query, key, and value matrices, $M^{2}$ is the number of patches in a window, $d$ is the dimension of the query or key, and $B\in\mathbbm{R}^{M^{2}\times M^{2}}$ is a relative position bias. Within each pyramid feature extraction stage, patch tokens are first processed by multiple Swin-T v2 blocks for representation learning. Then, a patch merging layer downsamples the tokens and increases the feature dimension by $2\times$ while generating the feature representation of the next level.
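
To make the tensor shapes in this section concrete, the following is a minimal NumPy sketch, not the authors' implementation. It follows the attention formula exactly as written above. The embedding dimension `C = 96`, the window size `M = 7`, the head dimension `d = 32`, the random weights, and the function names are all illustrative assumptions; normalization layers, multiple heads, and window shifting are omitted.

```python
import numpy as np

def patch_partition(image, patch=4):
    """Split an (H, W, 3) image into non-overlapping patch x patch tokens."""
    H, W, C = image.shape
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                 # (H/4, W/4, 4, 4, 3)
    return x.reshape(H // patch, W // patch, patch * patch * C)

def window_attention(Q, K, V, B):
    """softmax(Q K^T / sqrt(d) + B) V for one window of M^2 tokens."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + B              # (M^2, M^2)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # (M^2, d)

def patch_merging(x, W_reduce):
    """Concatenate each 2x2 token neighbourhood (4C) and project it to 2C."""
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    x = x.reshape(H // 2, W // 2, 4 * C)
    return x @ W_reduce.T                          # W_reduce has shape (2C, 4C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
tokens = patch_partition(image)                    # (56, 56, 48)

C = 96                                             # assumed embedding dimension
W_embed = rng.standard_normal((C, 48)) * 0.02      # stand-in for the linear embedding
embedded = tokens @ W_embed.T                      # (56, 56, C)

W_reduce = rng.standard_normal((2 * C, 4 * C)) * 0.02
merged = patch_merging(embedded, W_reduce)         # (28, 28, 2C)

M, d = 7, 32                                       # assumed window size and head dim
Q, K, V = (rng.standard_normal((M * M, d)) for _ in range(3))
B = rng.standard_normal((M * M, M * M))            # stand-in relative position bias
attended = window_attention(Q, K, V, B)            # (M^2, d)
```

In a real Swin-style block the bias `B` would be looked up from a learned table indexed by relative token offsets; here it is random only to show where it enters the formula.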

2.3 Connection’s Artful Leap

The SKS structure contains three skips, which we refer to as the feature share skip, the resolution compensation skip, and the prompt skip. As shown in Figure 1, the pyramid features $F_{pyra}^{i}=\{f_{1}^{i},f_{2}^{i},f_{3}^{i},f_{4}^{i}\}$ from the $i$ layer are fused with another set of pyramid features $F_{pyra}^{k}=\{f_{1}^{k},f_{2}^{k},f_{3}^{k},f_{4}^{k}\}$ . For an arbitrary $f_{\lambda}^{fuse}$ in the new feature $F_{fuse}$ , we have $f_{\lambda}^{fuse}=W\times(f_{\lambda}^{i}\oplus f_{\lambda}^{k})+b$ , where $W\in\mathbbm{R}^{C\times 2C}$ is a weight matrix (distinct from the image width $W$ in Section 2.2), $C$ represents the dimension of the $\lambda$ layer feature of the pyramid structure, $\oplus$ denotes concatenation, and $b\in\mathbbm{R}^{C\times 1}$ is the bias.
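
The fusion rule above amounts to a channel-wise concatenation followed by a linear projection back to $C$ channels. A minimal NumPy sketch for a single pyramid level is given below; the token count `N`, the dimension `C`, the random weights, and the function name `fuse` are illustrative assumptions.

```python
import numpy as np

def fuse(f_i, f_k, W, b):
    """Compute W (f_i concat f_k) + b independently for every token.

    f_i, f_k: (N, C) token features from the two pyramids at one level.
    W: (C, 2C) projection matrix, b: (C,) bias.
    """
    cat = np.concatenate([f_i, f_k], axis=-1)   # (N, 2C)
    return cat @ W.T + b                        # (N, C)

rng = np.random.default_rng(0)
N, C = 16, 8                                    # assumed sizes for illustration
f_i = rng.standard_normal((N, C))               # e.g. coarse-grained features
f_k = rng.standard_normal((N, C))               # e.g. fine-grained features
W = rng.standard_normal((C, 2 * C))
b = rng.standard_normal(C)

f_fuse = fuse(f_i, f_k, W, b)                   # same shape as each input
```

Because the output keeps the dimension $C$ of each input, a fused feature can replace either input at the same pyramid level without changing the shapes expected by later layers.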

Feature Share Skip (FSS): As mentioned in Section 2.2 above, in the pre-training stage, we train the coarse-grained branch on a large number of medical images with coarse-grained labels so that the branch can extract coarse-grained pyramid features of the image. In the training stage of the fine-grained branch, the coarse-grained pyramid features are fused into this branch through the FSS, enabling it to learn new feature representations while incorporating the coarse-grained features. As shown in Figure 1, when the fine-grained labels of the image are used to train the fine-grained branch, the coarse-grained pyramid features are injected into each layer of the fine-grained branch through the FSS, guiding this branch to learn the coarse- and fine-grained semantic information of the image simultaneously.

Resolution Compensation Skip (RCS): For the fused features, motivated by U-Net[13], we send them into the lesion segmentation module by the RCS. This operation enables the lesion segmentation module to obtain the high-resolution information contained in the high-level feature map during the up-sampling process so as to obtain the details of the image better and improve the lesion segmentation accuracy.

Prompt Skip: The coarse-grained pyramid features are also fed into the lesion segmentation module through a prompt skip. Through this skip, the lesion segmentation module can better analyze the image based on its coarse-grained features. Compared with previous works, this operation can greatly reduce the number of accurately annotated medical images required. The prompt and segmentation details are described below.

2.4 Prompt and Segmentation

Here, the lesion segmentation of medical images is carried out under the prompt of the upstream branch. As introduced in Section 2.3 above, for an image, the combined coarse- and fine-grained features obtained by the feature extraction module are sent to the lesion segmentation module through the RCS. The lesion segmentation module needs to up-sample the low-resolution feature map transmitted from upstream into a mask. Its details are shown in Figure 2(b). In contrast to the patch merging layer of the feature extraction module, we use the patch expanding layer[14] to $2\times$ up-sample the deep features while reducing the feature dimension to generate the feature representation of the next layer.
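
As a rough illustration of the patch expanding step, the sketch below follows the general recipe of a linear expansion followed by a rearrangement of channels into space. It is a NumPy approximation under stated assumptions, not the exact layer from [14]: the input size, the random weight `W_expand`, and the omission of normalization are all illustrative choices.

```python
import numpy as np

def patch_expanding(x, W_expand):
    """Up-sample an (H, W, C) token map to (2H, 2W, C/2).

    W_expand: (2C, C) linear expansion, after which every token is split
    into a 2x2 block of tokens with C/2 channels each.
    """
    H, W, C = x.shape
    x = x @ W_expand.T                     # (H, W, 2C)
    x = x.reshape(H, W, 2, 2, C // 2)      # channels -> (row, col, C/2)
    x = x.transpose(0, 2, 1, 3, 4)         # (H, 2, W, 2, C/2)
    return x.reshape(2 * H, 2 * W, C // 2)

rng = np.random.default_rng(0)
H, W, C = 7, 7, 192                        # assumed deep feature map size
deep = rng.standard_normal((H, W, C))
W_expand = rng.standard_normal((2 * C, C)) * 0.02

expanded = patch_expanding(deep, W_expand) # (14, 14, 96)
```

This is the mirror image of patch merging: merging trades spatial resolution for channels, while expanding trades channels for spatial resolution.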

In addition to being connected to the fine-grained feature student module through the RCS, the lesion segmentation module also receives the coarse-grained pyramid features from the coarse-grained branch through the prompt skip. After the supervised learning of the previous coarse-grained label, the coarse-grained branch can provide the lesion segmentation module with a prior coarse-grained feature representation. With the prompts of this feature representation, the lesion segmentation module directly obtains certain feature representations that previous segmentation models required learning through some samples. At the same time, the task branch of the model can focus on the segmentation of the lesions without needing to distinguish whether a medical image contains lesions, reducing the sample size of accurate annotations required to achieve the same segmentation effect.

3 EXPERIMENTS

3.1 Experiment Setup

Dataset Configuration: Considering the complexity and diversity of lesions at the liver site, we choose to evaluate our framework on the LITS dataset[15], which contains a total of 200 CT scans from various clinical sites. Each CT scan contains about 100-400 CT slices, and every slice has been manually annotated by experienced clinicians with the location of the liver and liver tumours. In this paper, we set the CT intensity window to [-200, 200] to extract the liver region. For each slice in the CT scan, we stack it with its upper and lower adjacent slices to form a single input $x_{i}$ , and the in-plane size of this 3D input is 224 $\times$ 224. We take whether a slice contains a lesion as its coarse-grained label and the precise annotation of the slice as its fine-grained label.
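
The data preparation described above can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: it assumes the window [-200, 200] is applied to raw CT intensities and rescaled to [0, 1], that the first and last slices reuse themselves as the missing neighbour, and that tumour pixels carry the label value 2 in the annotation masks. Resizing to 224 $\times$ 224 is omitted, and the synthetic volume only stands in for a real scan.

```python
import numpy as np

def window_ct(volume, lo=-200.0, hi=200.0):
    """Clip intensities to [lo, hi] and rescale them to [0, 1]."""
    v = np.clip(volume.astype(np.float32), lo, hi)
    return (v - lo) / (hi - lo)

def make_sample(volume, mask, i, lesion_label=2):
    """Build one training sample centred on slice i.

    Returns the 3-slice input, the coarse (image-level) label, and the
    fine (pixel-level) lesion mask.
    """
    last = volume.shape[0] - 1
    below, above = max(i - 1, 0), min(i + 1, last)   # replicate at the borders
    x = np.stack([volume[below], volume[i], volume[above]], axis=-1)
    y_fine = (mask[i] == lesion_label).astype(np.uint8)
    y_coarse = int(y_fine.any())                     # 1 if the slice has a lesion
    return x, y_coarse, y_fine

# Synthetic stand-in for a CT volume (slices, height, width) and its mask.
rng = np.random.default_rng(0)
raw = rng.integers(-1000, 1000, size=(5, 8, 8))
mask = np.zeros((5, 8, 8), dtype=np.uint8)
mask[2, 3:5, 3:5] = 2                                # a small lesion on slice 2

volume = window_ct(raw)
x_pos, c_pos, f_pos = make_sample(volume, mask, 2)
x_neg, c_neg, f_neg = make_sample(volume, mask, 0)
```

Deriving the coarse label from the mask is only a convenience for this sketch; in the setting the paper targets, image-level labels would come from existing diagnostic conclusions and would also be available for slices without pixel-level masks.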

Training Details: For evaluation on the LITS dataset, we first train the coarse-grained branch with 131 CT scans. As described in Section 2, we employ 3D slices and coarse-grained labels to carry out supervised training of this branch, enabling it to identify whether the abdominal CT has liver lesions and to extract the coarse-grained pyramid features. Since SKS aims to achieve the same segmentation effect with fewer samples, we randomly select 35 CT scans from the original dataset to train the fine-grained branch and the lesion segmentation branch for segmentation. We train all models with dice loss[16] and validate on 8 CT scans. For the Swin-T v2 backbone, the patch size is 4 $\times$ 4 and every layer in the pyramid structure has two Swin-T v2 blocks.
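
For reference, one common form of the dice loss is sketched below in NumPy. It uses squared terms in the denominator, which is one widely used variant; whether SKS uses exactly this variant, and the smoothing constant `eps`, are assumptions.

```python
import numpy as np

def dice_loss(prob, target, eps=1e-6):
    """Soft Dice loss: 1 - 2*sum(p*g) / (sum(p^2) + sum(g^2)).

    prob: predicted foreground probabilities in [0, 1].
    target: binary ground-truth mask of the same shape.
    """
    prob = prob.astype(np.float64)
    target = target.astype(np.float64)
    inter = np.sum(prob * target)
    denom = np.sum(prob ** 2) + np.sum(target ** 2)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

target = np.zeros((8, 8))
target[2:4, 2:4] = 1.0                 # 4 foreground pixels

perfect = target.copy()                # prediction identical to the mask
disjoint = np.zeros((8, 8))
disjoint[5:7, 5:7] = 1.0               # 4 pixels, no overlap with the mask

loss_perfect = dice_loss(perfect, target)
loss_disjoint = dice_loss(disjoint, target)
```

The loss approaches 0 for a perfect prediction and 1 when prediction and mask do not overlap, which makes it less sensitive than pixel-wise cross-entropy to the small foreground fraction typical of tumour masks.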

3.2 Experiment Evaluation

In this section, we show that SKS can effectively leverage coarse-grained knowledge from images and achieve strong performance with a small number of samples compared to previous supervised training methods. As shown in Table 1, with an upstream coarse-grained branch achieving an accuracy of 89% in diagnosing liver lesions, the SKS framework achieves a global DSC (Dice Similarity Coefficient) of 0.549 for liver tumor segmentation using only 35 CT scans with precise annotations. In contrast, U-Net and U-Net++[17] achieve DSCs of only 0.489 and 0.509 using the same 35 CT scans. Therefore, compared to traditional segmentation models, our SKS framework achieves a higher DSC and JC than the compared models in Table 1 with a small number of samples. The segmentation masks are visualized in Figure 3, which shows that SKS performs well on lesions of different sizes. We extract the liver region from the CT image and remove other unrelated organ regions to make the segmentation results clearer. In Figure 3(d), the tumor is too small and too similar to the background, so SKS is unable to recognize it.

Table 1: Comparison and ablation experiments on the LITS dataset. “JC” means the Jaccard Coefficient. “w/o” means without.

Method	DSC	JC	Precision	Recall
U-Net [13]	0.489	0.379	0.678	0.451
U-Net++ [17]	0.509	0.379	0.643	0.524
Att-UNet [18]	0.348	0.246	0.727	0.258
CE-Net [19]	0.401	0.286	0.354	0.592
CE-Net-OCT [19]	0.468	0.342	0.490	0.527
Channel-UNet [20]	0.524	0.392	0.508	0.591
SwinUnet [14]	0.475	0.355	0.365	0.952
w/o coarse-grain branch	0.420	0.311	0.686	0.347
w/o prompt skip	0.505	0.379	0.728	0.447
w/o RCS	0.467	0.331	0.400	0.672
SKS (ours)	0.549	0.456	0.542	0.580
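
The four metrics reported in Table 1 can be computed from binary masks as in the sketch below. It assumes that "global" means the counts are pooled over all evaluated pixels before the ratios are taken, and the function name and the zero-denominator convention are illustrative choices.

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Return DSC, JC, precision, and recall for two binary masks."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = int(np.sum(pred & target))      # lesion pixels correctly predicted
    fp = int(np.sum(pred & ~target))     # predicted lesion, actually background
    fn = int(np.sum(~pred & target))     # missed lesion pixels

    def ratio(num, den):
        return num / den if den > 0 else 0.0   # convention for empty masks

    return {
        "DSC": ratio(2 * tp, 2 * tp + fp + fn),
        "JC": ratio(tp, tp + fp + fn),
        "Precision": ratio(tp, tp + fp),
        "Recall": ratio(tp, tp + fn),
    }

# Toy example: 4 predicted pixels, 4 true pixels, 2 of them overlapping.
target = np.zeros((6, 6), dtype=np.uint8)
target[1:3, 1:3] = 1
pred = np.zeros((6, 6), dtype=np.uint8)
pred[2:4, 1:3] = 1

metrics = segmentation_metrics(pred, target)
```

In this toy case DSC, precision, and recall are all 0.5 and JC is 1/3. Pooling counts over a whole validation set generally gives different values than averaging per-scan scores, so the two should not be mixed when comparing methods.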

3.3 Ablation Study

We conduct ablation experiments on the key modules of the proposed SKS framework, and the results are presented in the lower part of Table 1. To evaluate the impact of the FSS, the entire coarse-grained branch is removed; the remaining parts of the framework then struggle to recognize lesions, as reflected by a DSC of 0.420 and a recall of 0.347, the lowest recall among the SKS variants. When only the FSS and RCS are present (i.e., without the prompt skip), our framework achieves a DSC of 0.505. Moreover, removing the RCS decreases the DSC to 0.467. The ablation studies demonstrate that our coarse-grained feature guidance branch and SKS structure can effectively extract coarse-grained knowledge from medical images, resulting in a better understanding of the images and a better segmentation effect with few samples compared with other methods.

4 CONCLUSION

In this paper, we propose a dual U-shaped two-stage framework for medical image segmentation. This framework extracts both fine-grained and coarse-grained knowledge from medical images through a multi-level pyramid structure and introduces a Connection’s Artful Leap to fuse the coarse-grained and fine-grained features, utilizing the coarse-grained knowledge to guide the downstream branch for lesion segmentation. Experiments show that SKS realizes a better segmentation effect with few samples. In the future, we will attempt to extend the SKS framework to more signal fields.

References

[1] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al., “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018.
[2] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint arXiv:2102.04306, 2021.
[3] Ailiang Lin, Bingzhi Chen, Jiayu Xu, Zheng Zhang, Guangming Lu, and David Zhang, “Ds-transunet: Dual swin transformer u-net for medical image segmentation,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–15, 2022.
[4] Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R Roth, and Daguang Xu, “Unetr: Transformers for 3d medical image segmentation,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, pp. 574–584.
[5] Wei Shen et al., “A survey on label-efficient deep image segmentation: Bridging the gap between weak supervision and dense prediction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[6] Zhi-Hua Zhou, “A brief introduction to weakly supervised learning,” National science review, vol. 5, no. 1, pp. 44–53, 2018.
[7] Antti Tarvainen and Harri Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” Advances in neural information processing systems, vol. 30, 2017.
[8] Jiawei Chen, Yue Jiang, Dingkang Yang, Mingcheng Li, Jinjie Wei, Ziyun Qian, and Lihua Zhang, “Can llms’ tuning methods work in medical multimodal domain?,” arXiv preprint arXiv:2403.06407, 2024.
[9] Jiawei Chen, Dingkang Yang, Yue Jiang, Mingcheng Li, Jinjie Wei, Xiaolu Hou, and Lihua Zhang, “Efficiency in focus: Layernorm as a catalyst for fine-tuning medical visual language pre-trained models,” arXiv preprint arXiv:2404.16385, 2024.
[10] Jiawei Chen, Dingkang Yang, Yue Jiang, Yuxuan Lei, and Lihua Zhang, “Miss: A generative pretraining and finetuning approach for med-vqa,” arXiv preprint arXiv:2401.05163, 2024.
[11] Yuxuan Lei, Dingkang Yang, Mingcheng Li, Shunli Wang, Jiawei Chen, and Lihua Zhang, “Text-oriented modality reinforcement network for multimodal sentiment analysis from unaligned multimodal sequences,” in CAAI International Conference on Artificial Intelligence. Springer, 2023, pp. 189–200.
[12] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022.
[13] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
[14] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” in European conference on computer vision. Springer, 2022, pp. 205–218.
[15] Patrick Bilic, Patrick Ferdinand Christ, Eugene Vorontsov, Grzegorz Chlebus, Hao Chen, et al., “The liver tumor segmentation benchmark (lits),” CoRR, vol. abs/1901.04056, 2019.
[16] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” CoRR, vol. abs/1606.04797, 2016.
[17] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang, “Unet++: Redesigning skip connections to exploit multiscale features in image segmentation,” IEEE transactions on medical imaging, vol. 39, no. 6, pp. 1856–1867, 2019.
[18] Ozan Oktay, Jo Schlemper, Loïc Le Folgoc, Matthew C. H. Lee, Mattias P. Heinrich, Kazunari Misawa, Kensaku Mori, Steven G. McDonagh, Nils Y. Hammerla, Bernhard Kainz, Ben Glocker, and Daniel Rueckert, “Attention u-net: Learning where to look for the pancreas,” CoRR, vol. abs/1804.03999, 2018.
[19] Zaiwang Gu, Jun Cheng, Huazhu Fu, Kang Zhou, Huaying Hao, Yitian Zhao, Tianyang Zhang, Shenghua Gao, and Jiang Liu, “Ce-net: Context encoder network for 2d medical image segmentation,” IEEE transactions on medical imaging, vol. 38, no. 10, pp. 2281–2292, 2019.
[20] Yilong Chen, Kai Wang, Xiangyun Liao, Yinling Qian, and Pheng Ann Heng, “Channel-unet: A spatial channel-wise convolutional neural network for liver and tumors segmentation,” Frontiers in Genetics, vol. 10, pp. 1110, 2019.