AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Zhan, Jun; Dai, Junqi; Ye, Jiasheng; Zhou, Yunhua; Zhang, Dong; Liu, Zhigeng; Zhang, Xin; Yuan, Ruibin; Zhang, Ge; Li, Linyang; Yan, Hang; Fu, Jie; Gui, Tao; Sun, Tianxiang; Jiang, Yugang; Qiu, Xipeng

arXiv:2402.12226 (cs)

[Submitted on 19 Feb 2024 (v1), last revised 7 Mar 2024 (this version, v3)]

Abstract: We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown in this https URL
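
As a rough illustration of what "data-level preprocessing" with discrete representations could look like, the sketch below maps per-modality codebook indices into one shared integer vocabulary, so that a standard next-token LLM sees a single flat sequence. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `Modality`, `build_offsets`, `encode_segment`, and `interleave`, the start/end marker layout, and all vocabulary and codebook sizes are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modality:
    name: str
    codebook_size: int  # number of discrete codes a (hypothetical) tokenizer emits


def build_offsets(text_vocab_size, modalities):
    """Give each modality a contiguous id range placed after the text vocabulary.

    Assumed layout per modality: [start marker, end marker, code 0 .. code N-1].
    Returns the lookup table and the total size of the extended vocabulary.
    """
    table = {}
    next_id = text_vocab_size
    for m in modalities:
        table[m.name] = {
            "start": next_id,
            "end": next_id + 1,
            "base": next_id + 2,
            "size": m.codebook_size,
        }
        next_id += 2 + m.codebook_size
    return table, next_id


def encode_segment(table, modality, codes):
    """Wrap one modality's codebook indices in its markers, shifted into the shared id space."""
    entry = table[modality]
    for c in codes:
        if not 0 <= c < entry["size"]:
            raise ValueError(f"code {c} is outside the {modality} codebook")
    return [entry["start"]] + [entry["base"] + c for c in codes] + [entry["end"]]


def interleave(table, segments):
    """Flatten ("text", ids) and (modality, codes) segments into one token sequence."""
    sequence = []
    for kind, ids in segments:
        if kind == "text":
            sequence.extend(ids)  # text ids are already in the base vocabulary
        else:
            sequence.extend(encode_segment(table, kind, ids))
    return sequence


# Illustrative sizes only (assumptions, not values from the paper).
table, vocab_size = build_offsets(
    text_vocab_size=32000,
    modalities=[Modality("image", 8192), Modality("speech", 1024)],
)
tokens = interleave(table, [("text", [1, 2]), ("speech", [3])])
```

Under this assumed layout, adding a modality only appends a new id range to the vocabulary, which is one way to read the abstract's analogy with incorporating a new language; the model architecture itself is untouched.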

Comments:	28 pages, 16 figures, under review, work in progress
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2402.12226 [cs.CL]
	(or arXiv:2402.12226v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2402.12226

Submission history

From: Jun Zhan
[v1] Mon, 19 Feb 2024 15:33:10 UTC (2,877 KB)
[v2] Mon, 26 Feb 2024 15:24:20 UTC (2,952 KB)
[v3] Thu, 7 Mar 2024 06:31:46 UTC (2,953 KB)
