Abstract
Large vision-language models (VLMs) have progressed remarkably from research prototypes to general-purpose applications. LLaVA-Med, a pioneering large language-and-vision assistant for biomedicine, can perform multi-modal biomedical image and data analysis, providing a natural-language interface for radiologists. While it is highly generalizable and works with multi-modal data, it is currently limited by well-known challenges in the large language model space: hallucinations and imprecision in responses can lead to misdiagnosis, which hinders the clinical adoption of VLMs. To create precise, user-friendly models for healthcare, we propose D-Rax, a domain-specific, conversational, radiologic assistance tool that can be used to gain insights about a particular radiologic image. In this study, we enhance the conversational analysis of chest X-ray (CXR) images to support radiological reporting, offering comprehensive insights from medical imaging and aiding in the formulation of accurate diagnoses. D-Rax is built by fine-tuning the LLaVA-Med architecture on our curated, enhanced instruction-following data, comprising images, instructions, and disease diagnosis and demographic predictions derived from MIMIC-CXR imaging data, CXR-related visual question answering (VQA) pairs, and predictive outcomes from multiple expert AI models. We observe statistically significant improvements in responses when evaluated on both open- and closed-ended conversations. By leveraging the power of state-of-the-art diagnostic models combined with VLMs, D-Rax empowers clinicians to interact with medical images using natural language, which could streamline their decision-making process, enhance diagnostic accuracy, and conserve their time.
H. Nisar, S. M. Anwar and Z. Jiang—These authors contributed equally.
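As a rough illustration only (the paper's actual data schema is not given here, so all field names, paths, and values below are assumptions), the abstract's description of enhanced instruction-following data suggests records that pair a CXR image with a VQA-style instruction and expert-model predictions, in the general shape of LLaVA-style conversation data:

```python
# Hypothetical sketch: assemble one enhanced instruction-following record
# combining an image reference, a VQA pair, and expert AI model outputs.
# All names, paths, and fields are illustrative assumptions, not the
# authors' actual schema.
import json

def build_record(image_path, question, answer, expert_preds):
    """Combine an image reference, a VQA pair, and expert-model
    predictions into a single LLaVA-style conversation record."""
    # Flatten the expert-model outputs into a textual context string.
    context = "; ".join(f"{model}: {pred}" for model, pred in expert_preds.items())
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": f"<image>\nExpert model predictions: {context}\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }

record = build_record(
    image_path="mimic-cxr/p10/s5001/view1.jpg",  # hypothetical path
    question="Is there evidence of pleural effusion?",
    answer="Yes, a small left-sided pleural effusion is present.",
    expert_preds={"diagnosis": "pleural effusion", "age": "63", "view": "PA"},
)
print(json.dumps(record, indent=2))
```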
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nisar, H. et al. (2025). D-Rax: Domain-Specific Radiologic Assistant Leveraging Multi-modal Data and eXpert Model Predictions. In: Deng, Z., et al. Foundation Models for General Medical AI. MedAGI 2024. Lecture Notes in Computer Science, vol 15184. Springer, Cham. https://doi.org/10.1007/978-3-031-73471-7_10
DOI: https://doi.org/10.1007/978-3-031-73471-7_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73470-0
Online ISBN: 978-3-031-73471-7