metadata

license: cc-by-nc-4.0
task_categories:
  - text-generation
  - image-to-text
  - summarization
  - question-answering
language:
  - en

🎨 Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility in usage and depth of response. Therefore, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.

Training and Evaluation Dataset Card

MDVP-Data is a comprehensive dataset for multi-domain visual-prompt instruction tuning. This dataset encompasses data for both point-level and region-level understanding, designed to enhance a model’s comprehension ability and robustness.
We also introduce MDVP-Bench, a challenging benchmark designed to evaluate tasks that require a combination of detailed description referrals, inter-relationship analysis, and complex reasoning.

Paper and Code

Project Page: Draw-and-Understand
Paper: https://arxiv.org/abs/2403.20271
Code: https://github.com/AFeng-x/Draw-and-Understand

License

Attribution-NonCommercial 4.0 International
It should abide by the policy of OpenAI: https://openai.com/policies/terms-of-use.

Citations

@misc{lin2024drawandunderstand,
      title={Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want}, 
      author={Weifeng Lin and Xinyu Wei and Ruichuan An and Peng Gao and Bocheng Zou and Yulin Luo and Siyuan Huang and Shanghang Zhang and Hongsheng Li},
      year={2024},
      eprint={2403.20271},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}