What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Zeng, Yan; Zhang, Hanbo; Zheng, Jiani; Xia, Jiangnan; Wei, Guoqiang; Wei, Yang; Zhang, Yuchen; Kong, Tao

arXiv:2307.02469 (cs)

[Submitted on 5 Jul 2023 (v1), last revised 30 Jul 2023 (this version, v2)]

Abstract: Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models heavily relies on design choices such as network structures, training data, and training strategies, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models. We implement over 20 variants with controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute the first, to our best knowledge, comprehensive evaluation set including both image and video tasks through crowd-sourcing. Based on our findings, we present Lynx, which performs the most accurate multi-modal understanding while keeping the best multi-modal generation ability compared to existing open-sourced GPT4-style models.

Comments:	32 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2307.02469 [cs.CV]
	(or arXiv:2307.02469v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2307.02469

Submission history

From: Yan Zeng
[v1] Wed, 5 Jul 2023 17:44:28 UTC (38,351 KB)
[v2] Sun, 30 Jul 2023 13:20:39 UTC (38,351 KB)
