2 School of Medicine and Health Sciences, George Washington University, Washington DC.
3 NVIDIA
D-Rax: Domain-specific Radiologic assistant leveraging multi-modal data and eXpert model predictions
Abstract
Large vision language models (VLMs) have progressed incredibly from research to applicability for general-purpose use cases. LLaVA-Med, a pioneering large language and vision assistant for biomedicine, can perform multi-modal biomedical image and data analysis to provide a natural language interface for radiologists. While it is highly generalizable and works with multi-modal data, it is currently limited by well-known challenges that exist in the large language model space. Hallucinations and imprecision in responses can lead to misdiagnosis which currently hinder the clinical adaptability of VLMs. To create precise, user-friendly models in healthcare, we propose D-Rax – a domain-specific, conversational, radiologic assistance tool that can be used to gain insights about a particular radiologic image. In this study, we enhance the conversational analysis of chest X-ray (CXR) images to support radiological reporting, offering comprehensive insights from medical imaging and aiding in the formulation of accurate diagnosis. D-Rax is achieved by fine-tuning the LLaVA-Med architecture on our curated enhanced instruction-following data, comprising of images, instructions, as well as disease diagnosis and demographic predictions derived from MIMIC-CXR imaging data, CXR-related visual question answer (VQA) pairs, and predictive outcomes from multiple expert AI models. We observe statistically significant improvement in responses when evaluated for both open and close-ended conversations. Leveraging the power of state-of-the-art diagnostic models combined with VLMs, D-Rax empowers clinicians to interact with medical images using natural language, which could potentially streamline their decision-making process, enhance diagnostic accuracy, and conserve their time.
Keywords:
Large vision language models Radiologic assistant Chest X-ray Expert models.1 Introduction
Burnout in radiology is on the rise globally leading to chronic job dissatisfaction and critical under-staffing [4]. Radiologists routinely spend extensive time meticulously analyzing medical images to identify pathologies and diagnose diseases, which is vital in guiding treatment decisions and ensuring appropriate patient care. The retrospective error rate among radiologic exams has been reported to be around 30% [17]. Cindy et al. [17] assess these errors to be either cognitive, like false initial assessment, framing bias (i.e., misinterpretation caused by choice of words), and premature closure of a case, or system-related errors, such as long working shifts, repetitive tasks, and lighting conditions. Many of these factors contribute to visual and mental fatigue for radiologists, further contributing to misdiagnosis and poor patient outcomes. Another challenge is miscommunication between radiologists, clinicians, and patients, often caused by inefficient reporting.
With the constant increase in workload in radiology departments [2], generative artificial intelligence (AI) can play a crucial role in reducing the burden and improving healthcare [21]. Recent large vision language models (VLMs) such as LLaVA-Med [18] have been created to assist clinicians in interpreting complex medical imaging and provide visual question answering (VQA) in natural language settings. Despite its enhanced capabilities for medical image analysis and interpretation, LLaVA-Med is highly generalized and cannot precisely answer specific questions [25], as well as suffers from hallucinations that can result in misdiagnosis. Another challenge in the integration and adoption of AI-driven technologies among healthcare professionals is user-friendliness of the tool [8]. These clinical and technological challenges necessitate a “Radiology Assistant” tool that can facilitate report writing and provide a natural-language interface to discuss imaging features, pathological findings, and disease diagnosis with the radiologist.
To address these challenges, we propose a novel, domain-specific VLM, called D-Rax, which empowers radiologists to interact with images using natural language prompts and questions, similar to how they converse with colleagues. Furthermore, our model is equipped with the knowledge of identifying pathologies and diagnostic reasoning. D-Rax leverages established AI models [11] to incorporate expert model diagnostic predictions for multiple diseases, thus reducing the risk of missed findings and aiding in achieving more accurate diagnoses. To exemplify the utility of our domain-specialist VLM, we chose chest X-ray (CXR) images for this study. CXRs are among the most commonly performed imaging studies and play a crucial role in the diagnosis and management of a wide range of medical conditions, including respiratory diseases, cardiac abnormalities, and thoracic injuries. The novelty and contributions of this work can be summarized as:
-
Enhanced instruction-following training with expert model predictions. We introduce a novel, domain-specific, and multi-modal instruction-following training strategy enriched with multiple expert model predictions for large VLMs.
-
Expert-enhanced instruction-following data generation. We use MIMIC-CXR and Medical-Diff-VQA datasets to generate baseline instruction-following data for the design of conversational image analysis tools. State-of-the-art (SOTA) AI (expert) models are incorporated to add diagnostic and demographic predictions to the baseline, thus creating expert-enhanced visual instruction-following training data.
-
D-Rax. Our expert-enhanced instruction-following training leads to a more accurate radiologic assistance tool, demonstrated by comprehensive comparisons. The same training paradigm can potentially benefit other conversational AI tools.
2 Related Work
The introduction of foundational large VLMs has flooded the gates for the design of complex multi-modal AI tools. Flamingo [1] is one of the earliest multi-modal VLMs that bridged the gap between image-only and text-only methods. It combines prompts and multi-line chains of thought to produce sensible outcomes. Another notable example is the Large Language and Vision Assistant (LLaVA) [20] model that leverages a multi-modal architecture capable of processing both visual and textual information. Both of these VLM frameworks closely follow the technicalities from the Contrastive Language-Image Pre-training (CLIP) [22] model, which is a technique to associate images with corresponding textual descriptions. Such VLMs are widely adopted in the computer vision industry and are a gateway to many advances in biomedicine.
In the realm of biomedical VLMs, BioMedClip [27] is an important foundation model, with vision-language processing capabilities, enabling several standard biomedical imaging tasks such as classification and visual question-answering. LLaVA-Med [18], a specialized version of LLaVA, is tailored for biomedical applications, including radiology, to enable clinicians to interact with medical images in a conversational language setting, thereby facilitating more efficient radiological workflows. OphGLM [5] combined expert model deductions with large language models (LLM) by generating a diagnostic report from retinal images. Most biomedical VLMs, however, are generalized and suffer from hallucinations, inaccurate diagnosis, and imprecise question answering. A domain-specialized tool in radiology can help overcome these challenges and provide accurate outcomes.
3 Methods
3.1 Data
Baseline Instruction-following Data
The multi-modal nature of our task requires both vision and language information. In this study, we use the MIMIC-CXR and Medical-Diff-VQA datasets to generate a baseline instruction-following dataset for our experiments. MIMIC-CXR [7, 14, 15] is a large open-access dataset of CXRs with structured labels on cardiopulmonary conditions derived from free-text radiology reports. Medical-Diff-VQA [9, 10, 14] is a derivative of the MIMIC-CXR dataset containing question-answer (QA) pairs derived from CXRs. The questions are divided into seven categories: abnormality, presence, view, location, level, type, and difference. Each category can hold either open-ended questions such as ‘why, what, how’, etc. with dynamic natural language answers or close-ended questions such as ‘Is there’ with binary answers like ‘yes/no’. To limit the complexity of the evaluation, we did not focus on longitudinal changes, therefore the difference QAs were removed from the current evaluation. As a result, only a single image per patient was extracted to form the test set. Table 1 summarizes the data distribution for the baseline dataset.
Total | Abnormality | Presence | View | Location | Level | Type | ||
---|---|---|---|---|---|---|---|---|
Train | #QA Pairs | |||||||
#Open | ||||||||
images | #Close | |||||||
Test | #QA Pairs | |||||||
#Open | ||||||||
images | #Close |
Enhanced Expert Instruction-following Data
We enhanced the baseline dataset by incorporating MIMIC-CXR along with QA conversations and integrating expert model predictions using pre-trained models from the TorchXRayVision [3] model zoo. Expert predictions on the MIMIC-CXR dataset fall into one of the four categories - diseases, age, race, and view (Table 2). The outcomes of these SOTA AI model predictions are appended to the baseline dataset to create our expert-model enhanced instruction-following dataset. The medical conditions in the first category include cardiomegaly, atelectasis, pneumonia, infiltration, fracture, enlarged cardio mediastinum, lung opacity, pneumothorax, emphysema, hernia, lung lesion, pleural thickening, edema, effusion, fibrosis, nodule, mass, and consolidation.
3.2 Domain Specific Radiologic Assistant Design
The original LLaVA-Med model was trained on 15 million figure-caption pairs from PubMed [27]. While this teaches the model the context of biomedical application, we argue that for the sensitive process of medical imaging diagnosis, it is beneficial to develop a domain-specific VLM. Therefore, we perform end-to-end instruction tuning by training our model with CXRs and VQA-derived instructions generated from the associated radiology reports. In the process, we generated novel and enhanced instruction-following data for CXRs by incorporating predictions from expert models (Figure 1).
Network Architecture: The definition of the expert VLM model follows the network architecture proposed in [20]. We chose Llama2 [23] as our LLM due to the availability of the pre-trained checkpoints and particularly used the Llama2-7B model. The visual encoder was kept consistent as ViT-Large/14 which is a pre-trained CLIP model. For any given input image and a series of question and answer defined as . First, is transformed into a set of visual features by the CLIP model. For training with the instruction tuning data, the visual encoder is kept frozen and a trainable projection matrix is used to convert the visual features into language embedding tokens that can be jointly used with the language embedding tokens of the questions . The output of the model is . Please note that and are used as inputs to the LLM model of the entire framework. For VLM training, we used the LLaVA-v1.5-7B [20] model as a baseline and the model weights were initialized with Llama2-7B weights added with a delta from LLaVA training.
Expert Model Enhanced Instruction Tuning Data: Within the Medical-Diff-VQA data, for any given input image , there exists a multi-turn conversation that pertains to different categories of questions related to abnormality, presence, view, location, level, and type such that belongs to a specific category. The dataset was enhanced with expert response to the questions which when updated reads as . Medical-Diff-VQA also provides QA pairs for difference (with a reference image), which we have not used in our experiments.
End-to-end Training: The complete training of domain-specific VLMs (particularly LLaVA) involves two steps: (1) concept alignment to biomedical concepts from large data from PubMed including figures in published articles, captions, and inline references to figures, and (2) an instruction tuning step, where both the projection layer and the language model are updated. In our method, we perform the instruction fine tuning with the multi-modal, expert-enhanced dataset for CXRs presented in Section 3.1. We also perform a set of experiments to establish the usefulness of employing expert model predictions to guide the radiology assistant’s answers. Finally, we evaluate the performance of open- and close-ended conversational questions to establish the efficacy of our proposed strategy.
3.3 Experiments
For visual question answers related to radiology, LLaVA-Med was finetuned and evaluated on the VQA-RAD and SLAKE datasets. However, the data used covers multiple modalities and is relatively small in size, for instance: VAQ-RAD [16] has 315 radiology images and 3,515 QA pairs, and SLAKE [19] has 642 images and 7,000 QA pairs. For D-Rax, we utilize MIMIC, one of the largest domain-specific medical imaging data, and the associated VQA pairs. We further leverage expert models to provide more context and expert knowledge of the language model. We hypothesize that with expert model prediction, the trained radiologic assistant will have better outcomes in terms of reducing hallucinations and providing more precise and correct responses.
To establish the efficacy of utilizing model predictions related to abnormality, age, race, and view, we ran multiple experiments for end-to-end instruction tuning of various LLaVA models. In particular, we use the following pre-trained models: LLaVA, LLaVA-Med finetuned on VQA-RAD (LLaVA-Med-RAD), and LLaVA-Med finetuned on SLAKE (LLaVA-Med-SLAKE). VQA-RAD and SLAKE are selected since they are closely related to our research question, however, the data represented has a much larger scope with fewer examples to enable the development of precise models. Overall we performed six different experiments, including end-to-end instruction fine-tuning with a model initialized with weights from the three aforementioned pre-trained models. For each of these model initializations, the instruction fine-tuning was performed both for the baseline dataset (images and VQA-derived instructions from MIMIC) and an enhanced dataset with augmentation of expert model predictions. The training was performed for a single epoch, with a learning rate of and an effective batch size of .
3.4 Evaluation
For performance evaluation, two metrics were utilized: accuracy and token recall, depending on the type of questions evaluated. For close-ended questions, the task can be considered as a classification, and hence we used accuracy. For open-ended questions, token recall measures the ratio of tokens correctly generated by the trained model according to the ground truth. Evaluating VLMs, particularly for open-ended questions is still a difficult problem and some approaches try to use OpenAI’s GPT-4 to evaluate the similarity between ground truth and predicted answers [18, 20]. The inference of the finetuned model required 20G of GPU memory and could generate answers for questions per hour on a single NVIDIA H100 80G GPU.
4 Results
Performance of Enhanced Instruction
Figure 2 shows the qualitative evaluation of D-Rax by showing an example of conversations on a given CXR, as generated by VLMs trained on basic and expert-enhanced data.
The results from quantitative evaluation (Table 3) indicates that the enhanced expert instruction training allows for statistically significant improvements in the model performance for abnormality and presence questions (both open and closed-ended). Meanwhile, for location, level, and type questions, where the expert model provides no explicit information, training on both basic and enhanced data mostly yields similar performance and even showcases improvements when using the LLaVA-Med-RAD model as the base. Intriguingly, in addressing the view questions, the expert model introduces different view information but does not affect the model’s capacity to derive correct answers from images and questions. Overall, expert model-enhanced instruction training enables higher performance without impeding the pre-trained model’s inherent ability to comprehend queries and images.
Metrics (%) | LLaVA | LLaVA-Med-RAD | LLaVA-Med-SLAKE | ||||
---|---|---|---|---|---|---|---|
Question Type | Basic | Enhanced | Basic | Enhanced | Basic | Enhanced | |
Abnormality | (O) | ||||||
(C) | |||||||
Presence | (C) | ||||||
View | (O) | ||||||
(C) | |||||||
Location | (O) | ||||||
Level | (O) | ||||||
Type | (O) | ||||||
Average | (O) | ||||||
(C) |
Comparison with Expert Models
D-Rax is not expected to outperform disease-specific expert models that are restricted to answering simple and close-ended (C) questions based on classification. Analysis of experiments on Abnormality (C) questions shows that the diagnostic accuracy of expert models of 70.4% was comparable to VLMs (p-value ), except in 1/18 inferences where enhanced LLaVA-Med-RAD (Table 3) outperformed significantly the expert models (p-value ). However, identifying abnormalities is just one aspect of the VLM. VLMs can handle complex and nuanced questions, unlike expert models which cannot understand natural language queries.
Ablation Studies
While we maintained the test set’s characteristics by extracting one image per patient (Section 3.1), the following ablation study (Table 4) shows the results evaluated on an extended test set, including all the images from each patient. The improved results demonstrate the robustness of the method when tested on larger data. However, since evaluation of the larger test data is computationally expensive, most results reported in the paper are on the smaller test set.
Question Type | Abnormality | Presence | View | Location | Level | Type | Average | |||
Metrics (%) | (O) | (C) | (C) | (O) | (C) | (O) | (O) | (O) | (O) | (C) |
Test Set images QA pairs | ||||||||||
LLaVA-Basic | ||||||||||
LLaVA-Enhanced | ||||||||||
Extended Test Set images QA pairs | ||||||||||
LLaVA-Basic | ||||||||||
LLaVA-Enhanced |
5 Discussion and Conclusion
Our goal for developing D-Rax, a domain-specific expert model-guided radiologic assistant, is to reduce the hallucinations and improve the precision observed in responses from VLMs. We achieve this goal by establishing a novel training paradigm incorporating predictions from expert models. Hence, in our target application of CXR analysis, we embed expert predictions for disease, age, race, and view with the VQA instructions generated from radiological reports. Our results validate our hypothesis that (1) domain-specific knowledge, such as the use of MIMIC-CXR and Medical-Diff-VQA for CXR analysis, extracted from clinical radiology reports introduces a human factor into the model resulting in reduced hallucinations and allowing the system to provide precise information; and (2) addition of expert information from SOTA AI models generates statistically significant improved outcomes, enhancing accuracy of answering both open and close-ended questions in a conversation. D-Rax has the potential to enable a natural flow of diagnostic reasoning, enhance communication among clinicians, provide clear and accessible information to patients, and ultimately improve clinical care.
References
- [1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Han, S.C.T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo: a Visual Language Model for Few-Shot Learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
- [2] Bruls, R.J., Kwee, R.M.: Workload for radiologists during on-call hours: dramatic increase in the past 15 years. Insights into Imaging 11, 1–7 (2020)
- [3] Cohen, J.P., Viviano, J.D., Bertin, P., Morrison, P., Torabian, P., Guarrera, M., Lungren, M.P., Chaudhari, A., Brooks, R., Hashir, M., Bertrand, H.: TorchXRayVision: A library of chest X-ray datasets and models. In: Medical Imaging with Deep Learning (2022)
- [4] Fawzy, N.A., Tahir, M.J., Saeed, A., Ghosheh, M.J., Alsheikh, T., Ahmed, A., Lee, K.Y., Yousaf, Z.: Incidence and factors associated with burnout in radiologists: A systematic review. European Journal of Radiology Open 11, 100530 (2023)
- [5] Gao, W., Deng, Z., Niu, Z., Rong, F., Chen, C., Gong, Z., Zhang, W., Xiao, D., Li, F., Cao, Z., Ma, Z., Wei, W., Ma, L.: Ophglm: Training an ophthalmology large language-and-vision assistant based on instructions and dialogue (2023), https://arxiv.org/abs/2306.12174
- [6] Gichoya, J.W., Banerjee, I., Bhimireddy, A.R., Burns, J.L., Celi, L.A., Chen, L.C., Correa, R., Dullerud, N., Ghassemi, M., Huang, S.C., Kuo, P.C., Lungren, M.P., Palmer, L.J., Price, B.J., Purkayastha, S., Pyrros, A.T., Oakden-Rayner, L., Okechukwu, C., Seyyed-Kalantari, L., Trivedi, H., Wang, R., Zaiman, Z., Zhang, H.: Ai recognition of patient race in medical imaging: a modelling study. The Lancet Digital Health (2022)
- [7] Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P., Mark, R., Mietus, J., Moody, G., Peng, C., Stanley, H.: PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000)
- [8] Hemmer, P., Schemmer, M., Riefle, L., Rosellen, N., Vössing, M., Kühl, N.: Factors that influence the adoption of human-AI collaboration in clinical decision-making. In: Thirtieth European Conference on Information Systems (ECIS 2022) (2022)
- [9] Hu, X., Gu, L., An, Q., Zhang, M., Liu, L., Kobayashi, K., Harada, T., Summers, R., Zhu, Y.: Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual Question Answering on Chest X-Ray Images. PhysioNet (2023)
- [10] Hu, X., Gu, L., An, Q., Zhang, M., Liu, L., Kobayashi, K., Harada, T., Summers, R.M., Zhu, Y.: Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining pp. 4156–4165 (2023)
- [11] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 2261–2269 (2017)
- [12] Ieki, H.e.a.: Deep learning-based age estimation from chest X-rays indicates cardiovascular prognosis. Communications Medicine (2022)
- [13] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI conference on artificial intelligence. pp. 590–597 (2019)
- [14] Johnson, A., Lungren, M., Peng, Y., Lu, Z., Mark, R., Berkowitz, S., Horng, S.: MIMIC-CXR-JPG - chest radiographs with structured labels. PhysioNet (2019)
- [15] Johnson, A.E.W., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. PysioNet (2019)
- [16] Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific data 5(1), 1–10 (2018)
- [17] Lee, C.S., Nagy, P.G., Weaver, S.J., Newman-Toker, D.E.: Cognitive and system factors contributing to diagnostic errors in radiology. American Journal of Roentgenology 201(3), 611–617 (2013)
- [18] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day (2023)
- [19] Liu, B., Zhan, L.M., Xu, L., Ma, L., Yang, Y., Wu, X.M.: Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1650–1654. IEEE (2021)
- [20] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual Instruction Tuning (2023)
- [21] Mukherjee, P., Hou, B., Lanfredi, R.B., Summers, R.M.: Feasibility of using the privacy-preserving large language model vicuna for labeling radiology reports. Radiology 309 (2023)
- [22] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision (2021)
- [23] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: Llama: Open and efficient foundation language models (2023)
- [24] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2097–2106 (2017)
- [25] Wu, C., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data (2023)
- [26] Yi, X.: chestviewsplit. https://github.com/xinario/chestViewSplit
- [27] Zhang, S., Xu, Y., Usuyama, N., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., Wong, C., Lungren, M.P., Naumann, T., Poon, H.: Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing (2023)