LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Chen, Gongwei; Shen, Leyang; Shao, Rui; Deng, Xiang; Nie, Liqiang

arXiv:2311.11860 (cs)

[Submitted on 20 Nov 2023 (v1), last revised 26 Nov 2023 (this version, v2)]

Abstract: Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction of and reasoning over visual knowledge. To address this issue, we devise a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels. 1) Progressive incorporation of fine-grained spatial-aware visual knowledge. We design a vision aggregator that works together with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation, we devise a dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This progressive incorporation scheme lets the two kinds of VL tasks benefit each other. 2) Soft prompting of high-level semantic visual evidence. We supply the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the potential influence of imperfect predicted tags, we propose a soft prompting method that embeds a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model (e.g., improvements of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP, and 5% accuracy on RefCOCOg over Kosmos-2).
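The soft prompting idea in point 2) can be illustrated with a small sketch. This is not the authors' implementation: the placeholder name `<soft>`, the instruction wording, the embedding size, and all function names below are assumptions made for illustration. The sketch only shows the general mechanism, in which predicted image tags are placed in a text instruction and one position in the embedded sequence is filled by a learnable vector rather than a vocabulary embedding.

```python
import random

EMBED_DIM = 4           # assumed toy embedding size
SOFT_TOKEN = "<soft>"   # hypothetical placeholder for the learnable token


def make_embedding_table(vocab, dim, seed=0):
    """Build a fixed (frozen) toy embedding table for ordinary tokens."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in vocab}


def build_prompt(tags):
    """Place predicted tags in a hypothetical instruction template."""
    prefix = ["According", "to", "the", SOFT_TOKEN, "hint", ":"]
    suffix = ["answer", "the", "question"]
    return prefix + list(tags) + suffix


def embed_prompt(tokens, table, soft_vector):
    """Embed tokens, substituting the learnable vector at the soft-token slot."""
    embedded = []
    for tok in tokens:
        if tok == SOFT_TOKEN:
            embedded.append(list(soft_vector))  # trainable in a real model
        else:
            embedded.append(list(table[tok]))   # frozen vocabulary embedding
    return embedded


tags = ["dog", "frisbee"]  # illustrative tags, possibly imperfect
tokens = build_prompt(tags)
vocab = sorted(set(tokens) - {SOFT_TOKEN})
table = make_embedding_table(vocab, EMBED_DIM)
soft_vector = [0.0] * EMBED_DIM  # would be updated by gradient descent
embedded = embed_prompt(tokens, table, soft_vector)
```

In a real model the soft vector would be a trainable parameter updated during instruction tuning, so the model could learn how much weight to give the tag hint; here it is a plain list so the substitution step is easy to inspect.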

Comments:	Technical Report. Project page: this https URL Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2311.11860 [cs.CV]
	(or arXiv:2311.11860v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.11860

Submission history

From: Rui Shao
[v1] Mon, 20 Nov 2023 15:56:44 UTC (6,646 KB)
[v2] Sun, 26 Nov 2023 10:10:55 UTC (7,296 KB)
