License: CC BY-NC-ND 4.0
arXiv:2403.10635v1 [cs.CV] 15 Mar 2024
1 Department of Computer Science, University of Sheffield, Sheffield, UK
2 Centre of Machine Intelligence, University of Sheffield, Sheffield, UK
3 Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, UK
4 Department of Clinical Radiology, Sheffield Teaching Hospitals, Sheffield, UK
5 INSIGNEO, Institute for in Silico Medicine, University of Sheffield, Sheffield, UK
6 Information School, University of Sheffield, Sheffield, UK
7 Department of Engineering Science, University of Oxford, Oxford, UK
8 Department of Computing, Imperial College London, London, UK
Email: {wenrui.fan, m.suvon, shuo.zhou, xianyuan.liu, s.alabed, v.osmani, a.j.swift, chen.chen2, h.lu(✉)}@sheffield.ac.uk

MeDSLIP: Medical Dual-Stream Language-Image Pre-training for Fine-grained Alignment

Wenrui Fan (1,2), Mohammod Naimul Islam Suvon (1,2), Shuo Zhou (1,2), Xianyuan Liu (1,2), Samer Alabed (3,4,5), Venet Osmani (6), Andrew Swift (3,4,5), Chen Chen (1,7,8), Haiping Lu(✉) (1,2,5)
Abstract

Vision-language pre-training (VLP) models have shown significant advancements in the medical domain. Yet, most VLP models align raw reports to images at a very coarse level, without modeling fine-grained relationships between anatomical and pathological concepts outlined in reports and the corresponding semantic counterparts in images. To address this problem, we propose a Medical Dual-Stream Language-Image Pre-training (MeDSLIP) framework. Specifically, MeDSLIP establishes vision-language fine-grained alignments via disentangling visual and textual representations into anatomy-relevant and pathology-relevant streams. Moreover, a novel vision-language Prototypical Contrastive Learning (ProtoCL) method is adopted in MeDSLIP to enhance the alignment within the anatomical and pathological streams. MeDSLIP further employs cross-stream Intra-image Contrastive Learning (ICL) to ensure the consistent coexistence of paired anatomical and pathological concepts within the same image. Such a cross-stream regularization encourages the model to exploit the synchrony between the two streams for more comprehensive representation learning. MeDSLIP is evaluated under zero-shot and supervised fine-tuning settings on three public datasets: NIH CXR14, RSNA Pneumonia, and SIIM-ACR Pneumothorax. Under these settings, MeDSLIP outperforms six leading CNN-based models on classification, grounding, and segmentation tasks.

Keywords:
Medical vision-language pre-training · contrastive learning

1 Introduction

Deep learning models have shown their power in chest X-ray (CXR) analysis, including disease classification [19] and semantic segmentation [3]. However, these models rely on a large number of labeled images for training, which limits their practical utility [17]. The success of pre-trained language models like BERT [23, 4] inspired medical vision-language pre-training (VLP) [2, 9, 18]. VLP models are pre-trained in a self-supervised way with language [16, 29], avoiding the need for labeled data. Medical VLP models have demonstrated their generalization to downstream medical diagnosis tasks [22, 25, 26]. Yet, most VLP models only align CXR images with raw medical reports coarsely, without modeling more fine-grained relationships between semantics in images and concepts in reports.

To achieve more fine-grained alignments, information entanglement presents a significant issue [25], arising from two causes, as shown in Fig. 1(a). Firstly, medical data usually encompasses information from two perspectives: anatomy and pathology. Pathological information is the clinical findings from CXRs, such as opacity. Anatomies describe anatomical structures and locations, such as ribs. Secondly, the co-occurrence of multiple concepts within one perspective leads to information entanglement as well. For instance, a patient could suffer from multiple diseases simultaneously. While a recent work, MedKLIP [25], has offered a promising approach to disentangling textual information into anatomical and pathological levels, the entanglement issue in image data remains unsolved.

To address information entanglement, we propose a novel Medical Dual-Stream Language-Image Pre-training (MeDSLIP) framework, shown in Fig. 1. It addresses the two causes of entanglement, thereby disentangling anatomical and pathological information and constructing the fine-grained alignment between vision and language. For information from various perspectives, MeDSLIP applies a dual-stream mechanism, as shown in Fig. 1(a). It disentangles both visual and textual data into anatomical and pathological streams. The vision-language alignments are established within each stream in a more fine-grained way after disentangling. To address information entangled within a single perspective, MeDSLIP uses a novel vision-language Prototypical Contrastive Learning (ProtoCL) [21, 15] method, coupled with triplet extraction. The information is disentangled by triplet extraction and then aggregated more efficiently and structurally by ProtoCL. ProtoCL further enhances the fine-grained alignment within each stream. Besides, as an extra benefit of ProtoCL, the learned language latent space is optimized to be more aligned with the real textual data distribution of the MIMIC-CXR [12] dataset. Furthermore, we encourage cross-stream information sharing through an intra-image contrastive learning (ICL) loss. It contrasts anatomy and pathology representations from the same CXR image. It further regularizes the training process by enforcing the consistent coexistence of paired anatomical and pathological concepts within the same image.

Figure 1: (a) Overview of MeDSLIP. MeDSLIP disentangles images and texts into anatomy and pathology streams to construct a fine-grained alignment. Colored boxes in CXR match highlights in the report: red for anatomy and blue for pathology, denoting information from the individual perspective. Variations in line style within the same color indicate detailed cross-perspective information. The combination of colored boxes and text illustrates two causes for entanglement. (b) Image encoding involves disentangling image representations and aligning them with top common text queries. ProtoCL and ICL losses are used for refining alignment, respectively. Existence predictors trained with BCE loss enable the model to make predictions in a zero-shot manner. (c) Triplets are extracted from reports and prompted (Sec. 2.1). Then, they are encoded by a text encoder, usually a frozen language model with a learnable linear layer.

In summary, our main contributions are four-fold: (1) We propose MeDSLIP, a novel medical dual-stream language-image pre-training framework. It disentangles information from different perspectives into corresponding streams, aligning visual and textual information in a more fine-grained way (Sec. 2.1-2.2). (2) We then introduce a novel vision-language prototypical contrastive learning (ProtoCL) paradigm, further enhancing fine-grained alignments. With triplet extraction, ProtoCL can disentangle the information from the same perspective, efficiently integrate this information, and optimize the language latent space (Sec. 2.3). (3) We further apply an intra-image contrastive learning (ICL) loss, which can enhance cross-stream information sharing and regularize the training process (Sec. 2.4). (4) We perform comprehensive experiments on three public datasets: NIH CXR14 [24], RSNA Pneumonia [20], and SIIM-ACR Pneumothorax [27] under zero-shot and supervised fine-tuning settings on three tasks: classification, grounding, and segmentation. The experimental results show that MeDSLIP outperforms six leading CNN-based VLP models across all evaluation tasks and achieves state-of-the-art performance (Sec. 3).

2 Methodology

Fig. 1 shows the architecture of MeDSLIP. It comprises two streams carrying anatomical and pathological information individually, where each stream contains image encoding and text encoding. A fine-grained alignment is achieved inside streams for cross-modal information alignment. Text data is processed by extracting (anatomy, pathology, existence) triplets from medical reports and enhanced with domain knowledge [25], as detailed in Sec. 2.1. Image data is processed by a disentanglement module into two separate streams, which is further discussed in Sec. 2.2. Then, the training in each stream is further refined using prototypical contrastive loss $\mathcal{L}_{ProtoCL}$, intra-image contrastive loss $\mathcal{L}_{ICL}$, and existence prediction loss $\mathcal{L}_{Exist}$, respectively (Sec. 2.2-2.4).

2.1 Dual-Stream Text Encoding

For text encoding, MeDSLIP uses a named entity recognition (NER) method, RadGraph [11], to extract (anatomy, pathology, existence) triplets from medical reports, as shown in Fig. 1(c). Anatomical words, denoting anatomical locations, are prompted by “it is located at [ANATOMY]” to specify locations of pathological terms. Pathological terms, indicating clinical observations from CXR images, are enhanced by domain knowledge and plain language to improve their significance and comprehensibility. For example, “collapse” is enhanced with “collapse lung refers to pneumothorax or atelectasis.” These enriched texts are then processed by a text encoder followed by a learnable linear layer to generate text embeddings $\mathbf{E}^{p}$ and $\mathbf{E}^{a}$. The text encoder is a frozen pre-trained medical language model. Existence labels $l^{(a,p)}$ are binary encoded, indicating whether a pathology is present at a specific anatomical location. After encoding all textual data in the dataset, the $n$ most frequently occurring anatomy embeddings and the $m$ most frequently occurring pathology embeddings are selected as the anatomy and pathology query sets.
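
The prompting and query-selection steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the knowledge dictionary, the function names, and the toy triplets are hypothetical, and the sketch ranks raw terms by mention count (including negated mentions), whereas the text describes selecting the most common embeddings.

```python
from collections import Counter

# Hypothetical knowledge table; only the "collapse" entry is quoted in the text.
PATHOLOGY_KNOWLEDGE = {
    "collapse": "collapse lung refers to pneumothorax or atelectasis.",
}


def anatomy_prompt(anatomy):
    """Wrap an anatomical term in the location prompt quoted in the text."""
    return f"it is located at {anatomy}"


def pathology_prompt(pathology):
    """Return the knowledge-enhanced description, or the bare term if none is known."""
    return PATHOLOGY_KNOWLEDGE.get(pathology, pathology)


def top_queries(triplets, n, m):
    """Pick the n most frequent anatomies and the m most frequent pathologies.

    Assumption: frequency is counted over all (anatomy, pathology, existence)
    triplets, regardless of the existence label.
    """
    anatomy_counts = Counter(anatomy for anatomy, _, _ in triplets)
    pathology_counts = Counter(pathology for _, pathology, _ in triplets)
    anatomies = [term for term, _ in anatomy_counts.most_common(n)]
    pathologies = [term for term, _ in pathology_counts.most_common(m)]
    return anatomies, pathologies


# Toy triplets standing in for RadGraph output.
triplets = [
    ("left lung", "collapse", 1),
    ("left lung", "opacity", 1),
    ("ribs", "deformity", 1),
    ("left lung", "opacity", 0),
]
anatomies, pathologies = top_queries(triplets, n=1, m=2)
```

In the full pipeline, the prompted strings would then be passed through the frozen language model and the learnable linear layer to obtain the query embeddings.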

2.2 Dual-Stream Image Encoding

A dual-stream mechanism is incorporated in image encoding to process anatomical and pathological information separately. A disentanglement module with a mask generator separates representations into two distinct streams. The query networks in the framework are multi-layer, multi-head cross-attention modules [18, 23, 25] to align representations with query embeddings. Their generated outputs encapsulate the visual features aligned to queried textual entities, with which existence predictors can predict the presence of queried entities, even in a zero-shot context. The attention maps from query networks play a crucial role in zero-shot grounding tasks. This dual-stream architecture not only enhances the model’s capabilities in zero-shot disease diagnosis but also supports localization without training supervision from annotated bounding boxes.
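
To make the role of the query networks concrete, the following sketch shows a single text query attending over image patch features. It is a simplified assumption-laden illustration: it uses one head, one layer, and no learned projections, whereas the query networks described above are multi-layer, multi-head modules. The names `cross_attend`, `query`, and `patches` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attend(query, patches):
    """Single-head scaled dot-product cross-attention.

    query:   one text query embedding (list of floats)
    patches: image patch features, one vector per spatial location
    Returns the attended visual feature and the attention weights; the
    weights over patches play the role of the attention map used for grounding.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, patch) / scale for patch in patches])
    dim = len(patches[0])
    attended = [sum(w * patch[d] for w, patch in zip(weights, patches)) for d in range(dim)]
    return attended, weights


query = [1.0, 0.0]
patches = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, weights = cross_attend(query, patches)
```

The attended feature would feed an existence predictor, while the weights indicate which image regions respond to the queried entity.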

2.3 Vision-Language Prototypical Contrastive Learning

To enhance information disentangling within a single stream, we introduce a vision-language prototypical contrastive learning (ProtoCL) approach, depicted in Fig. 2(b). Coupled with triplet extraction, it disentangles the information inside the anatomical or pathological stream. Simply extracting triplets turns a single sentence into multiple triplets, disrupting the one-to-one correspondence between image-text pairs and yielding multiple relevant text descriptions for a single CXR image. For instance, the report in Fig. 1(c) would yield two triplets: (nipples, opacity, exist) and (ribs, deformity, exist), each representing a specific piece of information. Without ProtoCL, this degrades vision-language contrastive learning, which can use only a fraction of the available information by randomly choosing one of the multiple text descriptions, as shown in Fig. 2(a). ProtoCL aggregates the information from all extracted positive text descriptions in a more structured and efficient way.

We illustrate ProtoCL using one of the anatomy representations output by the query network and its corresponding pathology embeddings. Assume there are $l$ positive pathological text embeddings $\mathbf{E}_{1}^{p+}, \mathbf{E}_{2}^{p+}, \cdots, \mathbf{E}_{l}^{p+}$ and $k$ sampled negative embeddings $\mathbf{E}_{1}^{p-}, \mathbf{E}_{2}^{p-}, \cdots, \mathbf{E}_{k}^{p-}$. The prototype $\mathbf{P}^{p}$ of the positive pathology embeddings is calculated by $\mathbf{P}^{p}=\frac{1}{l}\sum_{i=1}^{l}\mathbf{E}_{i}^{p+}$.
Then, the relationship between the anatomy representation $\mathbf{R}^{a}$ and the corresponding positive pathology embeddings $\mathbf{E}_{i}^{p+}$ simplifies to a one-to-one correspondence between $\mathbf{R}^{a}$ and the prototype $\mathbf{P}^{p}$. Therefore, the NCE loss [7] of prototypical contrastive learning $\mathcal{L}_{ProtoCL}^{a}$ can be calculated between the image representation, the prototype, and the negative samples by

$$\mathcal{L}_{ProtoCL}^{a}=-\mathbb{E}\left[\log\frac{\exp(\mathbf{R}^{a}\cdot\mathbf{P}^{p}/\tau)}{\sum_{i=1}^{k}\exp(\mathbf{R}^{a}\cdot\mathbf{E}_{i}^{p-})}\right], \tag{1}$$

where $\tau$ is a temperature hyperparameter.
Figure 2: Comparison between contrastive learning without (a) or with (b) prototypes, using pathology embeddings and anatomy representations as an example.
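
The prototype computation and the loss in Eq. (1) can be sketched for a single anatomy representation as follows. This is an illustrative sketch rather than the authors' code: it follows Eq. (1) literally (temperature in the numerator only, negatives only in the denominator), omits the expectation over a batch, and the default `tau=0.07` and all example vectors are assumptions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def prototype(positives):
    """Element-wise mean of the positive text embeddings."""
    count = len(positives)
    return [sum(column) / count for column in zip(*positives)]


def protocl_loss(representation, positives, negatives, tau=0.07):
    """Eq. (1) for one representation:
    -log( exp(R . P / tau) / sum_i exp(R . E_i^-) ).
    tau=0.07 is an assumed value; the text does not state it.
    """
    proto = prototype(positives)
    log_numerator = dot(representation, proto) / tau
    negative_scores = [dot(representation, e) for e in negatives]
    peak = max(negative_scores)
    # log-sum-exp for numerical stability
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in negative_scores))
    return -(log_numerator - log_denominator)


positives = [[1.0, 0.0], [0.6, 0.8]]   # l = 2 positive pathology embeddings
negatives = [[0.0, 1.0], [-1.0, 0.0]]  # k = 2 sampled negatives
proto = prototype(positives)

loss_aligned = protocl_loss([1.0, 0.0], positives, negatives)
loss_misaligned = protocl_loss([-1.0, 0.0], positives, negatives)
```

A representation that points toward the prototype receives a lower loss than one that points toward a negative sample, which is the behavior the objective is meant to reward.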

Language Latent Space Optimization. ProtoCL optimizes the language latent space by pulling image representations and prototypes together while pushing them away from negative samples. This clustering effect causes positive text embeddings to gather around their respective prototypes, so concepts that often appear together in medical reports become closer in the latent space. As a result, the distribution of the text latent space better reflects the real-world distribution of medical concepts.

2.4 Intra-Image Contrastive Learning

Intra-image contrastive learning (ICL) loss is used to enforce an existence regularization term for true pathological and anatomical pairs within each single image while minimizing the similarity of unrelated pair representations. It is defined as $\mathcal{L}_{ICL}=\mathcal{L}_{BCE}(\mathbf{S}^{R},\mathbf{L})$, where $\mathcal{L}_{BCE}$ is the binary cross-entropy loss function; $\mathbf{S}^{R}$ is a cosine similarity matrix whose element $\mathbf{S}^{R}_{i,j}=\langle\mathbf{R}^{p}_{i},\mathbf{R}^{a}_{j}\rangle$ indicates the possibility of the joint presence of pathology $i$ and anatomy $j$, as shown in Fig. 1(c). $\mathbf{L}\in\mathbb{R}^{m\times n}$ is the ground truth existence matrix. Such a cross-stream regularization encourages the model to exploit the synchrony between the two streams for more comprehensive representation learning.
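
A minimal sketch of the ICL loss for one image is given below. The text does not specify how cosine similarities in $[-1, 1]$ are mapped to probabilities before the binary cross-entropy, so the sigmoid used here is an assumption, as are the function names and the toy vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    inner = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return inner / (norm_u * norm_v)


def icl_loss(pathology_reps, anatomy_reps, labels):
    """Mean binary cross-entropy between sigmoid(S^R) and the existence matrix L.

    pathology_reps: m pathology representations R^p_i
    anatomy_reps:   n anatomy representations R^a_j
    labels:         m x n matrix with labels[i][j] in {0, 1}
    The sigmoid mapping is an assumption, not taken from the text.
    """
    total, count = 0.0, 0
    for i, rep_p in enumerate(pathology_reps):
        for j, rep_a in enumerate(anatomy_reps):
            prob = 1.0 / (1.0 + math.exp(-cosine(rep_p, rep_a)))
            target = labels[i][j]
            total -= target * math.log(prob) + (1 - target) * math.log(1.0 - prob)
            count += 1
    return total / count


pathology_reps = [[1.0, 0.0], [0.0, 1.0]]  # m = 2
anatomy_reps = [[1.0, 0.0], [-1.0, 0.0]]   # n = 2
consistent_labels = [[1, 0], [0, 0]]       # agrees with the similarity pattern
flipped_labels = [[0, 1], [0, 0]]          # contradicts it

loss_consistent = icl_loss(pathology_reps, anatomy_reps, consistent_labels)
loss_flipped = icl_loss(pathology_reps, anatomy_reps, flipped_labels)
```

The loss is lower when co-occurring pathology and anatomy pairs have similar representations, which is the cross-stream consistency the regularizer encourages.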

3 Experiments and Results

3.1 Implementation and Baselines

Implementation. MeDSLIP uses MIMIC-CXR [6, 12, 13, 14] for pre-training. We follow the same setting as the baseline, i.e., using ResNet-50 [8] for image encoding and Bio-ClinicalBERT [1] with a learnable linear layer for text encoding [25]. Pre-training uses a batch size of 64, an AdamW optimizer, and a cosine scheduler. MeDSLIP’s performance is evaluated on classification, grounding, and segmentation tasks using the NIH CXR14 [24], RSNA Pneumonia [20], and SIIM-ACR Pneumothorax [27] datasets under both zero-shot and fine-tuning settings. All experiments follow baseline settings unless specified. An upward arrow (↑) in a table means that higher values are better, and vice versa. The best result in each table is highlighted in bold, and the second best is underlined.

Baselines. We compare our method to six leading CNN-based models in the field: ConVIRT [29], GLoRIA [9], BioViL [2], CheXzero [22], MedKLIP [25], and CXR-CLIP [26]. For the first four, we refer to the results reported in [25] on the same set due to resource limits. We evaluate CXR-CLIP with the pre-trained weights and code of its official CNN version [26]. As CXR-CLIP does not claim grounding ability [26], and CXR14 [24] could be one of its pre-training datasets, we exclude results for these two groups of experiments. The strongest baseline, MedKLIP, was re-trained under the same setting as ours.

Table 1: Evaluation on zero-shot classification and grounding tasks with SOTA CNN-based models. For CXR14 [24], metrics refer to the macro average on the 14 diseases. All results are in a percentage format. PG is the pointing game score. Columns 3-5 report classification on CXR14 [24], columns 6-8 classification on RSNA Pneumonia, and columns 9-11 grounding on RSNA Pneumonia.

| Models | Venue | CXR14 AUC↑ | CXR14 F1↑ | CXR14 ACC↑ | Pneumonia AUC↑ | Pneumonia F1↑ | Pneumonia ACC↑ | Grounding Dice↑ | Grounding IoU↑ | Grounding PG↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| ConVIRT [29] | PMLR’22 | 61.01 | 16.28 | 71.02 | 80.42 | 58.42 | 76.11 | - | - | - |
| GLoRIA [9] | ICCV’21 | 66.10 | 17.32 | 77.00 | 71.45 | 49.01 | 71.29 | 34.68 | 21.82 | 76.07 |
| BioViL [2] | ECCV’22 | 69.12 | 19.31 | 79.16 | 82.80 | 58.33 | 76.69 | 43.86 | 30.29 | 83.42 |
| CheXzero [22] | Nat. BE’22 | 72.96 | 21.41 | 82.78 | 82.80 | 62.11 | 79.42 | - | - | - |
| CXR-CLIP [26] | MICCAI’23 | - | - | - | 82.91 | 57.57 | 72.93 | - | - | - |
| CXR-CLIP [26] | MICCAI’23 | - | - | - | 83.41 | 53.80 | 64.39 | - | - | - |
| MedKLIP [25] | ICCV’23 | 78.00 | 25.71 | 85.22 | 85.74 | 61.88 | 79.69 | 49.63 | 34.32 | 87.02 |
| MeDSLIP | Ours | 80.34 | 29.96 | 88.93 | 86.49 | 63.81 | 80.98 | 50.60 | 35.47 | 88.57 |

  • CXR-CLIP here is trained on MIMIC-CXR [12], CXR14 [24] and CheXpert [10].

3.2 Experiments and Results

Zero-shot Evaluation. Table 1 reports the evaluation of MeDSLIP on zero-shot classification and grounding. In classification, MeDSLIP is compared solely with CNN-based state-of-the-art models for a fair comparison, and it outperforms all of them. MeDSLIP achieves at least a 2.34% improvement in average AUC score over the baselines on the CXR14 [24] dataset. In addition to following the protocol of MedKLIP [25], we perform an extra experiment using all data in the CXR14 [24] dataset to prevent potential bias caused by the dataset split. In this setting, our method achieves an AUC score of 78.14%, outperforming MedKLIP [25] (76.15%). Although these values are lower than those in Table 1, MeDSLIP still outperforms the baseline by about 2%. In grounding, MeDSLIP continues to outperform the baselines, with improvements of at least 0.97% and 1.15% in Dice and IoU scores, respectively. The pointing game score also increases by 1.55%, further evidencing MeDSLIP’s advanced grounding abilities.
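
For readers unfamiliar with the grounding metrics, the following sketch shows how the pointing game, Dice, and IoU could be computed for one attention map and one ground-truth mask. It is a generic illustration of the metric definitions, not the evaluation code used in the experiments; the `threshold=0.5` default and the toy arrays are assumptions.

```python
def pointing_game_hit(attention, mask):
    """True if the highest-attention pixel lies inside the ground-truth mask."""
    positions = [(r, c) for r in range(len(attention)) for c in range(len(attention[0]))]
    best_row, best_col = max(positions, key=lambda rc: attention[rc[0]][rc[1]])
    return mask[best_row][best_col] == 1


def dice_and_iou(attention, mask, threshold=0.5):
    """Binarize the attention map at an assumed threshold, then compare with the mask."""
    pred = [[1 if value >= threshold else 0 for value in row] for row in attention]
    intersection = sum(p & g for pred_row, mask_row in zip(pred, mask)
                       for p, g in zip(pred_row, mask_row))
    pred_area = sum(sum(row) for row in pred)
    mask_area = sum(sum(row) for row in mask)
    union = pred_area + mask_area - intersection
    dice = 2 * intersection / (pred_area + mask_area) if pred_area + mask_area else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou


attention = [[0.1, 0.9],
             [0.6, 0.2]]
mask = [[0, 1],
        [0, 0]]

hit = pointing_game_hit(attention, mask)
dice, iou = dice_and_iou(attention, mask)
```

The pointing game score over a dataset is the fraction of images that count as hits. Unlike Dice and IoU, it does not depend on a binarization threshold.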

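Table 2: Evaluation on fine-tuning classification and segmentation on SIIM-ACR Pneumothorax [27]. AUC and Dice are reported in a percentage format. Column headers give the fine-tuning data ratio.

| Models | Classification AUC↑ (1%) | Classification AUC↑ (10%) | Classification AUC↑ (100%) | Segmentation Dice↑ (1%) | Segmentation Dice↑ (10%) | Segmentation Dice↑ (100%) |
|---|---|---|---|---|---|---|
| MedKLIP | 86.27 | 89.90 | 93.10 | 67.71 | 75.52 | 76.96 |
| MeDSLIP (Ours) | 88.06 | 90.96 | 93.83 | 70.39 | 77.53 | 77.75 |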

Fine-tuning Evaluation. We evaluate MeDSLIP on the SIIM-ACR Pneumothorax [27] dataset on classification and segmentation tasks under the fine-tuning setting. Here, we follow baseline protocols for end-to-end fine-tuning [25]. MeDSLIP is fine-tuned with 1%, 10%, and 100% of the training data of SIIM-ACR [27] in both classification and segmentation. The pre-trained MeDSLIP encoder and a ResUNet decoder [5] are used in segmentation. Table 2 shows that MeDSLIP surpasses the baseline in both tasks, most noticeably with limited training data: the performance gap between MeDSLIP and the baseline widens as the amount of training data decreases. This shows MeDSLIP’s efficiency and robustness in data-scarce conditions.

Language Latent Space Optimization. Cosine similarities $\mathbf{S}^{E}$ between anatomy embeddings $\mathbf{E}^{a}$ and pathology embeddings $\mathbf{E}^{p}$ are regarded as the learned language distribution. The real distribution is represented by the existence label matrix $\mathbf{L}$. We use the Kullback–Leibler (KL) divergence $D_{KL}(\mathbf{S}^{E},\mathbf{L})$ to measure alignment. The baseline without ProtoCL gets a larger score (782.15), while adding ProtoCL reduces the score to 749.03, indicating that ProtoCL can optimize the embeddings for a stronger alignment with the real textual distribution.
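
The idea behind this measurement can be sketched as follows. The text does not state how the two matrices are normalized or in which direction the divergence is taken, so everything in this sketch is an assumption: labels are normalized by their sum, similarities by a softmax over all entries, and the divergence is computed from the label distribution to the similarity distribution. The resulting values are therefore not comparable to the scores reported above.

```python
import math


def labels_to_distribution(labels):
    """Flatten the existence matrix and normalize it to sum to one (assumed scheme)."""
    flat = [float(value) for row in labels for value in row]
    total = sum(flat)
    return [value / total for value in flat]


def similarities_to_distribution(similarities):
    """Flatten the similarity matrix and apply a softmax over all entries (assumed scheme)."""
    flat = [value for row in similarities for value in row]
    peak = max(flat)
    exps = [math.exp(value - peak) for value in flat]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


labels = [[1, 0], [0, 1]]
aligned_sims = [[0.9, -0.2], [-0.1, 0.8]]     # high similarity where labels are 1
misaligned_sims = [[-0.2, 0.9], [0.8, -0.1]]  # high similarity where labels are 0

target = labels_to_distribution(labels)
kl_aligned = kl_divergence(target, similarities_to_distribution(aligned_sims))
kl_misaligned = kl_divergence(target, similarities_to_distribution(misaligned_sims))
```

Under these assumptions, embeddings whose similarity pattern follows the co-occurrence labels yield a smaller divergence, matching the direction of the comparison reported in the paragraph above.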

3.3 Ablation Study

Table 3: Ablation study under zero-shot setting: BL, PCL, DS, MG, ICL represent baseline (MedKLIP), ProtoCL loss, dual stream, mask generator in disentanglement module, and ICL loss. No MG means inputting the entangled representations into two streams directly. All results are in percentage format. Columns 3-5 report classification on NIH CXR14 [24], columns 6-8 classification on RSNA [20], and the last column grounding on RSNA [20].

| ID | Modules (BL, PCL, DS, MG, ICL) | CXR14 AUC↑ | CXR14 F1↑ | CXR14 ACC↑ | RSNA AUC↑ | RSNA F1↑ | RSNA ACC↑ | Grounding Pointing Game↑ |
|---|---|---|---|---|---|---|---|---|
| 1 | | 76.15 | 22.98 | 84.08 | 85.74 | 61.18 | 79.69 | 87.02 |
| 2 | | 77.24 | 23.42 | 86.68 | 85.77 | 62.57 | 79.20 | 87.69 |
| 3 | | 77.81 | 26.59 | 86.90 | 85.64 | 62.32 | 80.38 | 83.98 |
| 4 | | 78.09 | 26.09 | 87.66 | 86.85 | 62.83 | 80.49 | 84.38 |
| 5 | | 78.14 | 26.01 | 88.88 | 86.49 | 63.18 | 80.98 | 88.57 |

  • We only use the hyperparameter-agnostic pointing game accuracy [28] in the grounding task for a fair comparison unaffected by hyperparameter selection.

  • The study on CXR14 [24] uses the full dataset.

We conduct ablation studies to assess the impact of MeDSLIP and its three new modules compared to the baseline, MedKLIP [25]. Results are presented in Table 3, with the first column indicating the identifier for each experiment.

Prototypical Contrastive Learning. Experiments 1 and 2 show that the use of ProtoCL outperforms the random selection in the baseline in almost all zero-shot tasks with 1.09% improvement on AUC for CXR14 [24] classification and 0.67% on pointing game accuracy for grounding.

Dual-stream and Disentangling Module. Experiments 2, 4, and 5 show that the disentangling module is crucial for disentangling information from different perspectives, particularly in grounding tasks. Feeding pathological and anatomical information into the two streams without disentangling can degrade the model’s grounding ability, causing about a 4.2% drop in pointing game accuracy on RSNA [20].

Intra-image Contrastive Learning. Experiments 2, 3, and 5 show that ICL can boost the information sharing between two streams, which is beneficial to both classification and grounding tasks. The model with ICL loss can achieve a higher AUC score, showing improvements of 0.33% and 0.85% for CXR14 [24] and RSNA [20] classification, respectively. In addition, pointing game accuracy is improved by 4.59%.

4 Conclusion

In this paper, to address information entanglement in medical data, we propose MeDSLIP, a dual-stream language-image pre-training framework. The dual-stream mechanism disentangles anatomical and pathological information in the data and establishes a fine-grained vision-language alignment. Prototypical contrastive learning is introduced to efficiently use information from one perspective. An intra-image contrastive learning loss is proposed to regularize cross-stream information sharing. MeDSLIP is evaluated on classification, grounding, and segmentation tasks under zero-shot and fine-tuning settings. It outperforms all compared CNN-based models and achieves SOTA performance on all tasks. Future work includes integrating a transformer-based image encoder for comparison with transformer-based models, using both pathology and anatomy stream outputs in downstream tasks, and exploring the potential of using dual-stream predictions for generating medical reports.

References

  • [1] Alsentzer, E., Murphy, J.R., Boag, W., Weng, W.H., Jin, D., Naumann, T., McDermott, M.: Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323 (2019)
  • [2] Boecking, B., Usuyama, N., Bannur, S., Castro, D.C., Schwaighofer, A., Hyland, S., Wetscherek, M., Naumann, T., Nori, A., Alvarez-Valle, J., et al.: Making the most of text semantics to improve biomedical vision–language processing. In: European Conference on Computer Vision. pp. 1–21. Springer (2022)
  • [3] Chen, C., Qin, C., Qiu, H., Tarroni, G., Duan, J., Bai, W., Rueckert, D.: Deep learning for cardiac image segmentation: a review. Frontiers in Cardiovascular Medicine 7, 25 (2020)
  • [4] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [5] Diakogiannis, F.I., Waldner, F., Caccetta, P., Wu, C.: Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS Journal of Photogrammetry and Remote Sensing 162, 94–114 (2020)
  • [6] Goldberger, A.L., Amaral, L.A., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000)
  • [7] Gutmann, M.U., Hyvärinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13(11), 307–361 (2012), http://jmlr.org/papers/v13/gutmann12a.html
  • [8] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
  • [9] Huang, S.C., Shen, L., Lungren, M.P., Yeung, S.: Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3942–3951 (2021)
  • [10] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 590–597 (2019)
  • [11] Jain, S., Agrawal, A., Saporta, A., Truong, S.Q., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., Ng, A.Y., et al.: Radgraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463 (2021)
  • [12] Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Mark, R.G., Horng, S.: Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6(1), 317 (2019)
  • [13] Johnson, A.E., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019)
  • [14] Johnson, A.E., Pollard, T.J., Mark, R.G., Berkowitz, S.J., Horng, S.: MIMIC-CXR Database (version 2.0.0) (2019). https://doi.org/10.13026/C2JT1Q
  • [15] Li, J., Zhou, P., Xiong, C., Hoi, S.C.: Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966 (2020)
  • [16] Liu, C., Cheng, S., Chen, C., Qiao, M., Zhang, W., Shah, A., Bai, W., Arcucci, R.: M-flag: Medical vision-language pre-training with frozen language models and latent space geometry optimization. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 637–647. Springer (2023)
  • [17] Ouyang, C., Biffi, C., Chen, C., Kart, T., Qiu, H., Rueckert, D.: Self-supervised learning for few-shot medical image segmentation. IEEE Transactions on Medical Imaging 41(7), 1837–1848 (2022). https://doi.org/10.1109/TMI.2022.3150682
  • [18] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
  • [19] Sajed, S., Sanati, A., Garcia, J.E., Rostami, H., Keshavarz, A., Teixeira, A.: The effectiveness of deep learning vs. traditional methods for lung disease diagnosis using chest x-ray images: A systematic review. Applied Soft Computing 147, 110817 (2023). https://doi.org/10.1016/j.asoc.2023.110817, https://www.sciencedirect.com/science/article/pii/S1568494623008359
  • [20] Shih, G., Wu, C.C., Halabi, S.S., Kohli, M.D., Prevedello, L.M., Cook, T.S., Sharma, A., Amorosa, J.K., Arteaga, V., Galperin-Aizenberg, M., et al.: Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology: Artificial Intelligence 1(1), e180041 (2019)
  • [21] Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems 30 (2017)
  • [22] Tiu, E., Talius, E., Patel, P., Langlotz, C.P., Ng, A.Y., Rajpurkar, P.: Expert-level detection of pathologies from unannotated chest x-ray images via self-supervised learning. Nature Biomedical Engineering 6(12), 1399–1406 (2022)
  • [23] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  • [24] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2097–2106 (2017)
  • [25] Wu, C., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: Medklip: Medical knowledge enhanced language-image pre-training. Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
  • [26] You, K., Gu, J., Ham, J., Park, B., Kim, J., Hong, E.K., Baek, W., Roh, B.: Cxr-clip: Toward large scale chest x-ray language-image pre-training. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 101–111. Springer (2023)
  • [27] Zawacki, A., Wu, C., Shih, G., Elliott, J., Fomitchev, M., Hussain, M., ParasLakhani, Culliton, P., Bao, S.: Siim-acr pneumothorax segmentation (2019), https://kaggle.com/competitions/siim-acr-pneumothorax-segmentation
  • [28] Zhang, J., Bargal, S.A., Lin, Z., Brandt, J., Shen, X., Sclaroff, S.: Top-down neural attention by excitation backprop. International Journal of Computer Vision 126(10), 1084–1102 (2018)
  • [29] Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., Langlotz, C.P.: Contrastive learning of medical visual representations from paired images and text. In: Machine Learning for Healthcare Conference. pp. 2–25. PMLR (2022)