HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Chen, Junying; Gui, Chi; Ouyang, Ruyi; Gao, Anningzhe; Chen, Shunian; Chen, Guiming Hardy; Wang, Xidong; Zhang, Ruifei; Cai, Zhenyang; Ji, Ke; Yu, Guangjun; Wan, Xiang; Wang, Benyou

arXiv:2406.19280 (cs)

[Submitted on 27 Jun 2024 (v1), last revised 30 Sep 2024 (this version, v4)]

Abstract: The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2406.19280 [cs.CV]
	(or arXiv:2406.19280v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.19280

Submission history

From: Junying Chen
[v1] Thu, 27 Jun 2024 15:50:41 UTC (1,545 KB)
[v2] Sun, 15 Sep 2024 07:25:49 UTC (1,546 KB)
[v3] Wed, 25 Sep 2024 13:36:27 UTC (1,546 KB)
[v4] Mon, 30 Sep 2024 06:45:16 UTC (1,545 KB)
