1 Sheikh Zayed Institute for Pediatric Surgical Innovation, Children’s National Hospital, Washington DC.
2 School of Medicine and Health Sciences, George Washington University, Washington DC.
3 NVIDIA

D-Rax: Domain-specific Radiologic assistant leveraging multi-modal data and eXpert model predictions

Hareem Nisar1    Syed Muhammad Anwar 1,2    Zhifan Jiang1    Abhijeet Parida 1    Vishwesh Nath 3    Holger R. Roth3    Marius George Linguraru 1,2
Abstract

Large vision language models (VLMs) have progressed rapidly from research to practical use in general-purpose applications. LLaVA-Med, a pioneering large language and vision assistant for biomedicine, can perform multi-modal biomedical image and data analysis to provide a natural language interface for radiologists. While it is highly generalizable and works with multi-modal data, it is currently limited by well-known challenges that exist in the large language model space. Hallucinations and imprecise responses can lead to misdiagnosis, which currently hinders the clinical adaptability of VLMs. To create precise, user-friendly models in healthcare, we propose D-Rax, a domain-specific, conversational, radiologic assistance tool that can be used to gain insights about a particular radiologic image. In this study, we enhance the conversational analysis of chest X-ray (CXR) images to support radiological reporting, offering comprehensive insights from medical imaging and aiding in the formulation of accurate diagnoses. D-Rax is achieved by fine-tuning the LLaVA-Med architecture on our curated enhanced instruction-following data, comprising images, instructions, and disease diagnosis and demographic predictions derived from MIMIC-CXR imaging data, CXR-related visual question answer (VQA) pairs, and predictive outcomes from multiple expert AI models. We observe statistically significant improvement in responses when evaluated for both open- and close-ended conversations. Leveraging the power of state-of-the-art diagnostic models combined with VLMs, D-Rax empowers clinicians to interact with medical images using natural language, which could potentially streamline their decision-making process, enhance diagnostic accuracy, and conserve their time.

Keywords:
Large vision language models Radiologic assistant Chest X-ray Expert models.

1 Introduction

Burnout in radiology is on the rise globally, leading to chronic job dissatisfaction and critical under-staffing [4]. Radiologists routinely spend extensive time meticulously analyzing medical images to identify pathologies and diagnose diseases, which is vital in guiding treatment decisions and ensuring appropriate patient care. The retrospective error rate among radiologic exams has been reported to be around 30% [17]. Cindy et al. [17] classify these errors as either cognitive, such as false initial assessment, framing bias (i.e., misinterpretation caused by choice of words), and premature closure of a case, or system-related, such as long working shifts, repetitive tasks, and lighting conditions. Many of these factors contribute to visual and mental fatigue for radiologists, further contributing to misdiagnosis and poor patient outcomes. Another challenge is miscommunication between radiologists, clinicians, and patients, often caused by inefficient reporting.

With the constant increase in workload in radiology departments [2], generative artificial intelligence (AI) can play a crucial role in reducing the burden and improving healthcare [21]. Recent large vision language models (VLMs) such as LLaVA-Med [18] have been created to assist clinicians in interpreting complex medical imaging and provide visual question answering (VQA) in natural language settings. Despite its enhanced capabilities for medical image analysis and interpretation, LLaVA-Med is highly generalized, cannot precisely answer specific questions [25], and suffers from hallucinations that can result in misdiagnosis. Another challenge in the integration and adoption of AI-driven technologies among healthcare professionals is the user-friendliness of the tool [8]. These clinical and technological challenges necessitate a “Radiology Assistant” tool that can facilitate report writing and provide a natural-language interface to discuss imaging features, pathological findings, and disease diagnosis with the radiologist.

To address these challenges, we propose a novel, domain-specific VLM, called D-Rax, which empowers radiologists to interact with images using natural language prompts and questions, similar to how they converse with colleagues. Furthermore, our model is equipped with the knowledge of identifying pathologies and diagnostic reasoning. D-Rax leverages established AI models [11] to incorporate expert model diagnostic predictions for multiple diseases, thus reducing the risk of missed findings and aiding in achieving more accurate diagnoses. To exemplify the utility of our domain-specialist VLM, we chose chest X-ray (CXR) images for this study. CXRs are among the most commonly performed imaging studies and play a crucial role in the diagnosis and management of a wide range of medical conditions, including respiratory diseases, cardiac abnormalities, and thoracic injuries. The novelty and contributions of this work can be summarized as:

  • Enhanced instruction-following training with expert model predictions. We introduce a novel, domain-specific, and multi-modal instruction-following training strategy enriched with multiple expert model predictions for large VLMs.

  • Expert-enhanced instruction-following data generation. We use MIMIC-CXR and Medical-Diff-VQA datasets to generate baseline instruction-following data for the design of conversational image analysis tools. State-of-the-art (SOTA) AI (expert) models are incorporated to add diagnostic and demographic predictions to the baseline, thus creating expert-enhanced visual instruction-following training data.

  • D-Rax. Our expert-enhanced instruction-following training leads to a more accurate radiologic assistance tool, demonstrated by comprehensive comparisons. The same training paradigm can potentially benefit other conversational AI tools.

2 Related Work

The introduction of foundational large VLMs has opened the floodgates for the design of complex multi-modal AI tools. Flamingo [1] is one of the earliest multi-modal VLMs that bridged the gap between image-only and text-only methods. It combines prompts and multi-line chains of thought to produce sensible outcomes. Another notable example is the Large Language and Vision Assistant (LLaVA) [20] model, which leverages a multi-modal architecture capable of processing both visual and textual information. Both of these VLM frameworks closely follow the Contrastive Language-Image Pre-training (CLIP) [22] model, a technique for associating images with corresponding textual descriptions. Such VLMs are widely adopted in the computer vision industry and are a gateway to many advances in biomedicine.

In the realm of biomedical VLMs, BiomedCLIP [27] is an important foundation model with vision-language processing capabilities, enabling several standard biomedical imaging tasks such as classification and visual question answering. LLaVA-Med [18], a specialized version of LLaVA, is tailored for biomedical applications, including radiology, to enable clinicians to interact with medical images in a conversational language setting, thereby facilitating more efficient radiological workflows. OphGLM [5] combined expert model deductions with large language models (LLMs) by generating a diagnostic report from retinal images. Most biomedical VLMs, however, are generalized and suffer from hallucinations, inaccurate diagnosis, and imprecise question answering. A domain-specialized tool in radiology can help overcome these challenges and provide accurate outcomes.

3 Methods

3.1 Data

Baseline Instruction-following Data

The multi-modal nature of our task requires both vision and language information. In this study, we use the MIMIC-CXR and Medical-Diff-VQA datasets to generate a baseline instruction-following dataset for our experiments. MIMIC-CXR [7, 14, 15] is a large open-access dataset of 377,110 CXRs with structured labels on cardiopulmonary conditions derived from 227,827 free-text radiology reports. Medical-Diff-VQA [9, 10, 14] is a derivative of the MIMIC-CXR dataset containing 700,703 question-answer (QA) pairs derived from CXRs. The questions are divided into seven categories: abnormality, presence, view, location, level, type, and difference. Each category can hold either open-ended questions such as ‘why, what, how’, etc. with dynamic natural language answers or close-ended questions such as ‘Is there’ with binary answers like ‘yes/no’. To limit the complexity of the evaluation, we did not focus on longitudinal changes; therefore, the difference QAs were removed from the current evaluation. As a result, only a single image per patient was extracted to form the test set. Table 1 summarizes the data distribution for the baseline dataset.

Table 1: Baseline instruction-following data - Summary of train and test datasets and percentage distribution of QA categories.
Split                   Subset      Total     Abnormality (%)  Presence (%)  View (%)  Location (%)  Level (%)  Type (%)
Train (129,232 images)  #QA Pairs   429,000   27.1             29.1          10.5      15.7          12.5       5.1
                        #Open       219,305   24.6             0             10.2      30.6          24.5       10.1
                        #Close      209,695   29.8             59.4          10.8      0             0          0
Test (4,190 images)     #QA Pairs   13,688    26.8             29.3          13.5      14            11.6       4.8
                        #Open       6,683     23.6             0             14        28.8          23.7       9.9
                        #Close      7,005     29.8             57.2          13        0             0          0
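
The filtering described above (removing difference questions and keeping a single image per patient for the test set) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the record fields (`patient_id`, `image_id`, `category`) and the rule of keeping the first image encountered for each patient are assumptions made for the example.

```python
from collections import Counter

# Hypothetical QA records; the field names are illustrative assumptions.
qa_records = [
    {"patient_id": "p1", "image_id": "img_a", "category": "presence"},
    {"patient_id": "p1", "image_id": "img_a", "category": "difference"},
    {"patient_id": "p1", "image_id": "img_b", "category": "view"},
    {"patient_id": "p2", "image_id": "img_c", "category": "abnormality"},
]


def drop_difference(records):
    """Remove QA pairs that belong to the 'difference' category."""
    return [r for r in records if r["category"] != "difference"]


def one_image_per_patient(records):
    """Keep QA pairs only for the first image seen per patient (assumed rule)."""
    chosen = {}
    for r in records:
        chosen.setdefault(r["patient_id"], r["image_id"])
    return [r for r in records if chosen[r["patient_id"]] == r["image_id"]]


def category_percentages(records):
    """Percentage of QA pairs per question category."""
    counts = Counter(r["category"] for r in records)
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}


test_set = one_image_per_patient(drop_difference(qa_records))
print(category_percentages(test_set))
```

The same `category_percentages` helper could be applied separately to open- and close-ended subsets to obtain a per-category breakdown of the kind reported in Table 1.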

Enhanced Expert Instruction-following Data

We enhanced the baseline dataset by incorporating MIMIC-CXR along with QA conversations and integrating expert model predictions using pre-trained models from the TorchXRayVision [3] model zoo. Expert predictions on the MIMIC-CXR dataset fall into one of four categories: disease, age, race, and view (Table 2). The outcomes of these SOTA AI model predictions are appended to the baseline dataset to create our expert-model enhanced instruction-following dataset. The medical conditions in the first category include cardiomegaly, atelectasis, pneumonia, infiltration, fracture, enlarged cardiomediastinum, lung opacity, pneumothorax, emphysema, hernia, lung lesion, pleural thickening, edema, effusion, fibrosis, nodule, mass, and consolidation.

Table 2: Expert-model enhanced instruction-following data - Details on the AI model, training dataset, and labels used for each category.
Expert Predictions   Model                 Dataset               Labels
Disease diagnosis    DenseNet121 [11]      MIMIC-CXR             CheXpert [13]
Patient age          Regression [12]       NIH ChestX-ray8 [24]  -
Patient race         Classifier [6]        MIMIC-CXR             Asian, Black, White
CXR view position    ChestViewSplit [26]   -                     Frontal, Lateral
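
As a rough illustration of how the four kinds of expert output in Table 2 might be turned into a single piece of text for the instruction data, consider the sketch below. The paper does not specify the template or any probability threshold, so the function name, the 0.5 threshold, and the sentence format are all assumptions.

```python
def format_expert_predictions(disease_probs, age, race, view, threshold=0.5):
    """Serialize expert model outputs into one text string.

    disease_probs: dict mapping disease name -> predicted probability.
    The 0.5 threshold and the wording of the template are assumptions;
    the source does not state how predictions are converted to text.
    """
    findings = sorted(name for name, p in disease_probs.items() if p >= threshold)
    disease_text = ", ".join(findings) if findings else "no finding above threshold"
    return (
        f"Expert model predictions: diseases: {disease_text}; "
        f"age: {round(age)}; race: {race}; view: {view}."
    )


# Hypothetical outputs for a single chest X-ray.
expert_text = format_expert_predictions(
    {"cardiomegaly": 0.81, "pneumonia": 0.12, "effusion": 0.66},
    age=63.4,
    race="White",
    view="Frontal",
)
print(expert_text)
```

Keeping the serialization in one function makes it easy to experiment with different templates without touching the rest of the data generation code.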

3.2 Domain Specific Radiologic Assistant Design

The original LLaVA-Med model was trained on 15 million figure-caption pairs from PubMed [27]. While this teaches the model the context of biomedical application, we argue that for the sensitive process of medical imaging diagnosis, it is beneficial to develop a domain-specific VLM. Therefore, we perform end-to-end instruction tuning by training our model with CXRs and VQA-derived instructions generated from the associated radiology reports. In the process, we generated novel and enhanced instruction-following data for CXRs by incorporating predictions from expert models (Figure 1).

Figure 1: Overview of the design of D-Rax, our expert vision language model. The training data are multimodal, comprising visual information (chest X-ray images) and textual information (VQA from radiology reports and expert model predictions).

Network Architecture: The definition of the expert VLM follows the network architecture proposed in [20]. We chose Llama2 [23] as our LLM due to the availability of pre-trained checkpoints, and specifically used the Llama2-7B model. The visual encoder was kept consistent as ViT-Large/14, a pre-trained CLIP model. Consider an input image $X_v$ and a series of questions and answers $(X_q^1, X_a^1, X_q^2, X_a^2, \ldots, X_q^T, X_a^T)$. First, $X_v$ is transformed into a set of visual features $Z_v$ by the CLIP model. For training with the instruction tuning data, the visual encoder is kept frozen and a trainable projection matrix $W$ is used to convert the visual features $Z_v$ into language embedding tokens $H_v = W \cdot Z_v$ that can be jointly used with the language embedding tokens $H_q$ of the questions $X_q$. The output of the model is $X_a$. Note that $H_q$ and $H_v$ are used as inputs to the LLM of the entire framework. For VLM training, we used the LLaVA-v1.5-7B [20] model as a baseline, and the model weights were initialized with Llama2-7B weights plus a delta from LLaVA training.
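
The projection step can be made concrete with a toy example. The sketch below uses plain Python lists and tiny dimensions purely for illustration; in the actual model the features come from the frozen CLIP encoder and $W$ is learned. Applying $W$ to each patch feature independently and placing the visual tokens before the question tokens are assumptions of this sketch.

```python
def matvec(W, z):
    """Multiply matrix W (d_lang x d_vis) by a single feature vector z (d_vis)."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) for row in W]


def project_visual_features(W, Z_v):
    """Compute H_v = W . Z_v, applying W to each visual feature vector in Z_v."""
    return [matvec(W, z) for z in Z_v]


def build_llm_input(H_v, H_q):
    """Concatenate visual tokens and question tokens along the sequence axis.

    Putting visual tokens first is an assumption of this sketch.
    """
    return H_v + H_q


# Toy sizes: 2 image patches with 3-dim features, projected to a 2-dim language space.
W = [[1, 0, 0],
     [0, 1, 0]]
Z_v = [[1, 2, 3],
       [4, 5, 6]]
H_q = [[0.5, 0.5]]  # one hypothetical question token embedding

H_v = project_visual_features(W, Z_v)
llm_input = build_llm_input(H_v, H_q)
print(H_v)
print(len(llm_input))
```

After projection, every visual token has the same dimensionality as a language token, which is what allows the two sequences to be fed jointly to the LLM.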

Expert Model Enhanced Instruction Tuning Data: Within the Medical-Diff-VQA data, for any given input image $X_v$, there exists a multi-turn conversation that pertains to different categories of questions related to abnormality, presence, view, location, level, and type such that $X_q$ belongs to a specific category. The dataset was enhanced with expert response $X_e$ to the questions which when updated reads as $(\{X_e, X_q^1\}: X_a^1, \{X_e, X_q^2\}: X_a^2, \ldots, \{X_e, X_q^T\}: X_a^T)$. Medical-Diff-VQA also provides QA pairs for difference (with a reference image), which we have not used in our experiments.
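
A minimal sketch of this enhancement step is given below: the same expert text $X_e$ is attached to every question $X_q^t$ of a conversation, while the answers $X_a^t$ are left unchanged. The dictionary keys, the newline separator, and the example strings are hypothetical and chosen only for illustration.

```python
def enhance_conversation(expert_text, qa_pairs):
    """Pair the expert response X_e with every question X_q^t.

    qa_pairs: list of (question, answer) tuples for one image.
    Returns one instruction/response record per turn; answers are unchanged.
    The record layout is an assumption of this sketch.
    """
    return [
        {"instruction": f"{expert_text}\n{question}", "response": answer}
        for question, answer in qa_pairs
    ]


# Hypothetical expert text and conversation for a single chest X-ray.
x_e = "Expert model predictions: diseases: cardiomegaly; view: Frontal."
conversation = [
    ("Is there cardiomegaly?", "yes"),
    ("Where is the effusion located?", "left lower lung"),
]

enhanced = enhance_conversation(x_e, conversation)
for turn in enhanced:
    print(turn)
```

Because the expert text is attached at every turn, each question can be answered with the expert context available even if turns are sampled or shuffled independently.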

End-to-end Training: The complete training of domain-specific VLMs (particularly LLaVA) involves two steps: (1) alignment to biomedical concepts using large-scale data from PubMed, including figures in published articles, captions, and inline references to figures, and (2) an instruction tuning step, where both the projection layer and the language model are updated. In our method, we perform the instruction fine-tuning with the multi-modal, expert-enhanced dataset for CXRs presented in Section 3.1. We also perform a set of experiments to establish the usefulness of employing expert model predictions to guide the radiology assistant’s answers. Finally, we evaluate the performance on open- and close-ended conversational questions to establish the efficacy of our proposed strategy.

3.3 Experiments

For visual question answering related to radiology, LLaVA-Med was finetuned and evaluated on the VQA-RAD and SLAKE datasets. However, the data used covers multiple modalities and is relatively small in size: VQA-RAD [16] has 315 radiology images and 3,515 QA pairs, and SLAKE [19] has 642 images and 7,000 QA pairs. For D-Rax, we utilize MIMIC, one of the largest domain-specific medical imaging datasets, and the associated VQA pairs. We further leverage expert models to provide more context and expert knowledge to the language model. We hypothesize that with expert model predictions, the trained radiologic assistant will have better outcomes in terms of reducing hallucinations and providing more precise and correct responses.

To establish the efficacy of utilizing model predictions related to abnormality, age, race, and view, we ran multiple experiments for end-to-end instruction tuning of various LLaVA models. In particular, we use the following pre-trained models: LLaVA, LLaVA-Med finetuned on VQA-RAD (LLaVA-Med-RAD), and LLaVA-Med finetuned on SLAKE (LLaVA-Med-SLAKE). VQA-RAD and SLAKE were selected because they are closely related to our research question; however, they cover a much broader scope with too few examples to enable the development of precise models. Overall, we performed six different experiments: end-to-end instruction fine-tuning with a model initialized with weights from each of the three aforementioned pre-trained models, performed once on the baseline dataset (images and VQA-derived instructions from MIMIC) and once on the enhanced dataset augmented with expert model predictions. The training was performed for a single epoch, with a learning rate of $2 \times 10^{-5}$ and an effective batch size of 8.
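
The six experiments form a simple grid of three initializations by two instruction datasets, sharing the training settings stated above. The sketch below only enumerates that grid; the dictionary keys are hypothetical and no training framework is implied.

```python
from itertools import product

# The three weight initializations and two instruction datasets named in the text.
INITIALIZATIONS = ["LLaVA", "LLaVA-Med-RAD", "LLaVA-Med-SLAKE"]
INSTRUCTION_DATA = ["basic", "enhanced"]

# Shared settings reported in the text; the key names are illustrative.
TRAINING_CONFIG = {"epochs": 1, "learning_rate": 2e-5, "effective_batch_size": 8}


def build_experiments():
    """Enumerate every (initialization, dataset) combination with shared settings."""
    return [
        {"init_weights": init, "instruction_data": data, **TRAINING_CONFIG}
        for init, data in product(INITIALIZATIONS, INSTRUCTION_DATA)
    ]


experiments = build_experiments()
for exp in experiments:
    print(exp)
```

Each basic/enhanced pair that shares an initialization corresponds to one paired comparison in the results.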

3.4 Evaluation

For performance evaluation, two metrics were utilized, accuracy and token recall, depending on the type of question evaluated. For close-ended questions, the task can be considered a classification, and hence we used accuracy. For open-ended questions, token recall measures the fraction of ground-truth tokens that are correctly generated by the trained model. Evaluating VLMs, particularly for open-ended questions, is still a difficult problem, and some approaches use OpenAI’s GPT-4 to evaluate the similarity between ground truth and predicted answers [18, 20]. Inference with the finetuned model required 20 GB of GPU memory and could generate answers for 10,000 questions per hour on a single NVIDIA H100 80 GB GPU.
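
The two metrics can be sketched as follows. The exact tokenization and answer normalization used in the study are not specified, so the lowercase whitespace tokenizer, the set-based token matching, and the normalized exact match for close-ended answers are assumptions of this example.

```python
def tokenize(text):
    """Lowercase and split on whitespace after removing commas and periods (assumed)."""
    return text.lower().replace(",", " ").replace(".", " ").split()


def token_recall(prediction, ground_truth):
    """Fraction of ground-truth tokens that also appear in the prediction."""
    gt_tokens = tokenize(ground_truth)
    if not gt_tokens:
        return 0.0
    pred_tokens = set(tokenize(prediction))
    hits = sum(1 for tok in gt_tokens if tok in pred_tokens)
    return hits / len(gt_tokens)


def closed_accuracy(predictions, ground_truths):
    """Share of close-ended answers that match the ground truth after normalization."""
    def normalize(answer):
        return answer.strip().lower().rstrip(".")

    pairs = list(zip(predictions, ground_truths))
    correct = sum(1 for p, g in pairs if normalize(p) == normalize(g))
    return correct / len(pairs)


# Hypothetical model outputs and references.
recall = token_recall(
    "pleural effusion and cardiomegaly",
    "cardiomegaly, pleural effusion, atelectasis",
)
accuracy = closed_accuracy(["Yes.", "no", "yes"], ["yes", "no", "no"])
print(recall, accuracy)
```

In this example three of the four ground-truth tokens are recovered, so the recall is 0.75, and two of the three close-ended answers match, giving an accuracy of about 0.67.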

4 Results

Performance of Enhanced Instruction

Figure 2 shows the qualitative evaluation of D-Rax by showing an example of conversations on a given CXR, as generated by VLMs trained on basic and expert-enhanced data.

Figure 2: Qualitative evaluation: conversations provided by VLMs trained on basic and expert enhanced data. The red arrow shows the area of the pleural effusion and the yellow arrows outline the lateral margins of the enlarged heart (cardiomegaly) provided by the radiologist, which were correctly identified by D-Rax.

The results of the quantitative evaluation (Table 3) indicate that the enhanced expert instruction training yields statistically significant improvements in model performance for abnormality and presence questions (both open- and close-ended). Meanwhile, for location, level, and type questions, where the expert model provides no explicit information, training on basic and enhanced data mostly yields similar performance, and even shows improvements when using the LLaVA-Med-RAD model as the base. Intriguingly, for the view questions, the expert model introduces different view information but does not affect the model’s capacity to derive correct answers from images and questions. Overall, expert model-enhanced instruction training enables higher performance without impeding the pre-trained model’s inherent ability to comprehend queries and images.

Table 3: Quantitative evaluation: token recall (%) for open-ended questions (O) and accuracy (%) for close-ended questions (C) are reported to show the performance of domain-specific VLM with basic and enhanced instruction tuning strategies across various question types. Each value is an average and standard deviation of three inferences. The asterisks show statistical significance across paired comparisons using the Wilcoxon signed rank test (* for p-value < 0.05 and ** for p-value < 0.001).
Metrics (%)        LLaVA                     LLaVA-Med-RAD             LLaVA-Med-SLAKE
Question Type      Basic       Enhanced      Basic       Enhanced      Basic       Enhanced
Abnormality (O)    40.6(0.5)   41.7(0.3)     39.8(0.5)   41.7(0.1)*    39.5(0.5)   42.0(0.7)**
            (C)    70.1(1.3)   71.5(0.3)*    70.3(0.9)   72.8(0.6)**   68.9(0.4)   71.8(1.1)**
Presence    (C)    76.1(0.2)   77.7(0.1)**   75.5(0.2)   77.6(0.4)**   75.0(0.3)   77.9(0.4)**
View        (O)    99.7(0.0)   99.7(0.0)     99.6(0.0)   99.6(0.0)     99.6(0.1)   99.6(0.1)
            (C)    99.0(0.2)   98.8(0.1)     98.9(0.2)   99.1(0.2)     98.8(0.1)   98.6(0.2)
Location    (O)    61.8(0.0)   61.6(0.4)     60.2(0.4)   61.6(0.6)*    60.3(0.2)   61.8(0.5)*
Level       (O)    59.1(0.8)   59.5(0.4)     58.8(0.5)   60.4(0.4)*    59.2(0.8)   60.0(0.9)
Type        (O)    60.6(1.0)   60.6(0.8)     58.9(1.0)   58.5(1.0)     58.1(0.2)   58.4(1.3)
Average     (O)    61.3        61.6          60.4        61.6**        60.4        61.7**
            (C)    77.3        78.6**        77.0        79.0**        76.3        78.8**

Comparison with Expert Models

D-Rax is not expected to outperform disease-specific expert models that are restricted to answering simple, close-ended (C) questions based on classification. Analysis of experiments on Abnormality (C) questions shows that the diagnostic accuracy of the expert models (70.4%) was comparable to that of the VLMs (p-value > 0.08), except in 1 of 18 inferences, where enhanced LLaVA-Med-RAD (Table 3) significantly outperformed the expert models (p-value = 0.01). However, identifying abnormalities is just one aspect of the VLM. VLMs can handle complex and nuanced questions, unlike expert models, which cannot understand natural language queries.

Ablation Studies

While we maintained the test set’s characteristics by extracting one image per patient (Section 3.1), the following ablation study (Table 4) shows the results evaluated on an extended test set, including all the images from each patient. The improved results demonstrate the robustness of the method when tested on larger data. However, since evaluation of the larger test data is computationally expensive, most results reported in the paper are on the smaller test set.

Table 4: Evaluation on an extended test set. The asterisks show statistical significance across paired comparisons using the Wilcoxon signed rank test (* for p-value < 0.05 and ** for p-value < 0.001).
Question Type     Abnormality        Presence   View           Location   Level    Type    Average
Metrics (%)       (O)      (C)       (C)        (O)     (C)    (O)        (O)      (O)     (O)      (C)
Test Set: 4,190 images, 13,688 QA pairs
LLaVA-Basic       40.6     70.1      76.1       99.7    99.0   61.8       59.1     60.6    61.3     77.3
LLaVA-Enhanced    41.7     71.5*     77.7**     99.7    98.8   61.6       59.5     60.6    61.6     78.6**
Extended Test Set: 32,205 images, 107,379 QA pairs
LLaVA-Basic       42.7     73.4      76.1       99.5    98.7   61.8       57.7     60.0    59.9     77.7
LLaVA-Enhanced    43.9*    75.4**    77.2**     99.5    98.7   61.9       58.7*    59.9    60.4*    78.9**
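
To give a sense of why the extended test set is more expensive to evaluate, the sketch below combines the QA pair counts from Table 4 with the throughput reported in Section 3.4 (10,000 questions per hour on a single GPU). It assumes that this throughput holds uniformly across question types and test sets, which the paper does not state, so the figures are rough estimates for a single inference pass of one model.

```python
QUESTIONS_PER_HOUR = 10_000  # throughput reported for a single NVIDIA H100 GPU


def estimate_inference_hours(num_questions, questions_per_hour=QUESTIONS_PER_HOUR):
    """Rough GPU-hours for one inference pass, assuming constant throughput."""
    return num_questions / questions_per_hour


standard_hours = estimate_inference_hours(13_688)    # standard test set
extended_hours = estimate_inference_hours(107_379)   # extended test set

print(f"standard: {standard_hours:.2f} h")
print(f"extended: {extended_hours:.2f} h")
print(f"ratio: {extended_hours / standard_hours:.1f}x")
```

Under this assumption, one pass over the extended set takes roughly 10.7 GPU-hours compared with about 1.4 for the standard set, and the cost grows further when each model is evaluated over three inferences.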

5 Discussion and Conclusion

Our goal in developing D-Rax, a domain-specific expert model-guided radiologic assistant, is to reduce the hallucinations and improve the precision observed in responses from VLMs. We achieve this goal by establishing a novel training paradigm incorporating predictions from expert models. Hence, in our target application of CXR analysis, we embed expert predictions for disease, age, race, and view with the VQA instructions generated from radiological reports. Our results support our hypotheses that (1) domain-specific knowledge extracted from clinical radiology reports, such as the use of MIMIC-CXR and Medical-Diff-VQA for CXR analysis, introduces a human factor into the model, resulting in reduced hallucinations and allowing the system to provide precise information; and (2) the addition of expert information from SOTA AI models yields statistically significant improvements, enhancing the accuracy of answering both open- and close-ended questions in a conversation. D-Rax has the potential to enable a natural flow of diagnostic reasoning, enhance communication among clinicians, provide clear and accessible information to patients, and ultimately improve clinical care.

References

  • [1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Han, S.C.T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo: a Visual Language Model for Few-Shot Learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
  • [2] Bruls, R.J., Kwee, R.M.: Workload for radiologists during on-call hours: dramatic increase in the past 15 years. Insights into Imaging 11, 1–7 (2020)
  • [3] Cohen, J.P., Viviano, J.D., Bertin, P., Morrison, P., Torabian, P., Guarrera, M., Lungren, M.P., Chaudhari, A., Brooks, R., Hashir, M., Bertrand, H.: TorchXRayVision: A library of chest X-ray datasets and models. In: Medical Imaging with Deep Learning (2022)
  • [4] Fawzy, N.A., Tahir, M.J., Saeed, A., Ghosheh, M.J., Alsheikh, T., Ahmed, A., Lee, K.Y., Yousaf, Z.: Incidence and factors associated with burnout in radiologists: A systematic review. European Journal of Radiology Open 11, 100530 (2023)
  • [5] Gao, W., Deng, Z., Niu, Z., Rong, F., Chen, C., Gong, Z., Zhang, W., Xiao, D., Li, F., Cao, Z., Ma, Z., Wei, W., Ma, L.: OphGLM: Training an ophthalmology large language-and-vision assistant based on instructions and dialogue (2023), https://arxiv.org/abs/2306.12174
  • [6] Gichoya, J.W., Banerjee, I., Bhimireddy, A.R., Burns, J.L., Celi, L.A., Chen, L.C., Correa, R., Dullerud, N., Ghassemi, M., Huang, S.C., Kuo, P.C., Lungren, M.P., Palmer, L.J., Price, B.J., Purkayastha, S., Pyrros, A.T., Oakden-Rayner, L., Okechukwu, C., Seyyed-Kalantari, L., Trivedi, H., Wang, R., Zaiman, Z., Zhang, H.: AI recognition of patient race in medical imaging: a modelling study. The Lancet Digital Health (2022)
  • [7] Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P., Mark, R., Mietus, J., Moody, G., Peng, C., Stanley, H.: PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000)
  • [8] Hemmer, P., Schemmer, M., Riefle, L., Rosellen, N., Vössing, M., Kühl, N.: Factors that influence the adoption of human-AI collaboration in clinical decision-making. In: Thirtieth European Conference on Information Systems (ECIS 2022) (2022)
  • [9] Hu, X., Gu, L., An, Q., Zhang, M., Liu, L., Kobayashi, K., Harada, T., Summers, R., Zhu, Y.: Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual Question Answering on Chest X-Ray Images. PhysioNet (2023)
  • [10] Hu, X., Gu, L., An, Q., Zhang, M., Liu, L., Kobayashi, K., Harada, T., Summers, R.M., Zhu, Y.: Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining pp. 4156–4165 (2023)
  • [11] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 2261–2269 (2017)
  • [12] Ieki, H.e.a.: Deep learning-based age estimation from chest X-rays indicates cardiovascular prognosis. Communications Medicine (2022)
  • [13] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI conference on artificial intelligence. pp. 590–597 (2019)
  • [14] Johnson, A., Lungren, M., Peng, Y., Lu, Z., Mark, R., Berkowitz, S., Horng, S.: MIMIC-CXR-JPG - chest radiographs with structured labels. PhysioNet (2019)
  • [15] Johnson, A.E.W., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. PysioNet (2019)
  • [16] Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific data 5(1), 1–10 (2018)
  • [17] Lee, C.S., Nagy, P.G., Weaver, S.J., Newman-Toker, D.E.: Cognitive and system factors contributing to diagnostic errors in radiology. American Journal of Roentgenology 201(3), 611–617 (2013)
  • [18] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day (2023)
  • [19] Liu, B., Zhan, L.M., Xu, L., Ma, L., Yang, Y., Wu, X.M.: Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1650–1654. IEEE (2021)
  • [20] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual Instruction Tuning (2023)
  • [21] Mukherjee, P., Hou, B., Lanfredi, R.B., Summers, R.M.: Feasibility of using the privacy-preserving large language model vicuna for labeling radiology reports. Radiology 309 (2023)
  • [22] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision (2021)
  • [23] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: Llama: Open and efficient foundation language models (2023)
  • [24] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2097–2106 (2017)
  • [25] Wu, C., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data (2023)
  • [26] Yi, X.: chestviewsplit. https://github.com/xinario/chestViewSplit
  • [27] Zhang, S., Xu, Y., Usuyama, N., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., Wong, C., Lungren, M.P., Naumann, T., Poon, H.: Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing (2023)