Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Lin, Weifeng; Wei, Xinyu; An, Ruichuan; Gao, Peng; Zou, Bocheng; Luo, Yulin; Huang, Siyuan; Zhang, Shanghang; Li, Hongsheng

Computer Science > Computer Vision and Pattern Recognition

arXiv:2403.20271 (cs)

[Submitted on 29 Mar 2024 (v1), last revised 1 Apr 2024 (this version, v2)]

Abstract: The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility in usage and depth of response. In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose SPHINX-V, a new end-to-end trained MLLM that connects a vision encoder, a visual prompt encoder, and an LLM to handle various visual prompts (points, bounding boxes, and free-form shapes) together with language understanding. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench. MDVP-Data is a multi-domain dataset containing 1.6M unique image-visual prompt-text instruction-following samples, including natural images, document images, OCR images, mobile screenshots, web screenshots, and multi-panel images. Furthermore, we present MDVP-Bench, a comprehensive and challenging benchmark to assess a model's capability in understanding visual prompting instructions. Our experiments demonstrate SPHINX-V's impressive multimodal interaction capabilities through visual prompting, revealing significant improvements in detailed pixel-level description and question-answering abilities.

Comments:	16 pages, 7 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2403.20271 [cs.CV]
	(or arXiv:2403.20271v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.20271

Submission history

From: Weifeng Lin
[v1] Fri, 29 Mar 2024 16:26:20 UTC (19,128 KB)
[v2] Mon, 1 Apr 2024 03:25:30 UTC (19,008 KB)
