OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models

Bai, Jinze; Men, Rui; Yang, Hao; Ren, Xuancheng; Dang, Kai; Zhang, Yichang; Zhou, Xiaohuan; Wang, Peng; Tan, Sinan; Yang, An; Cui, Zeyu; Han, Yu; Bai, Shuai; Ge, Wenbin; Ma, Jianxin; Lin, Junyang; Zhou, Jingren; Zhou, Chang


arXiv:2212.04408 (cs)

[Submitted on 8 Dec 2022]

Abstract: Generalist models, which can perform diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Although they are hoped to offer an alternative path toward general-purpose AI, existing generalist models are still at an early stage, with limited modality and task coverage. To enable multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively, even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly diverse example tasks in OFASys, with which we also develop a first-of-its-kind single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves, on average, 95% of the performance of 15 task-finetuned models with only 16% of their parameters, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at this https URL
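
The idea of declaring a multi-modal task in one line and deriving a plan from it can be illustrated with a toy parser. This is a minimal sketch, not OFASys code: the `[MODALITY:name]` slot syntax, the `->` separator between inputs and targets, and the names `Slot`, `parse_instruction`, and `plan` are all assumptions made for illustration, and the abstract does not specify the actual instruction format.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical slot syntax: [MODALITY:field_name], e.g. [IMAGE:img].
SLOT_PATTERN = re.compile(r"\[([A-Z]+):(\w+)\]")


@dataclass(frozen=True)
class Slot:
    modality: str  # e.g. "image", "text"
    name: str      # name of the data field bound to this slot
    side: str      # "input" or "target"


def parse_instruction(instruction: str) -> List[Slot]:
    """Split a one-line instruction into typed input and target slots."""
    if instruction.count("->") != 1:
        raise ValueError("expected exactly one '->' separating inputs from targets")
    source, target = instruction.split("->")
    slots = []
    for side, text in (("input", source), ("target", target)):
        for modality, name in SLOT_PATTERN.findall(text):
            slots.append(Slot(modality.lower(), name, side))
    return slots


def plan(instruction: str) -> Dict[str, List[str]]:
    """Build a toy 'task plan': which modalities to encode and which to decode."""
    slots = parse_instruction(instruction)
    return {
        "encode": [s.modality for s in slots if s.side == "input"],
        "decode": [s.modality for s in slots if s.side == "target"],
    }


# A captioning task declared in a single line (illustrative only).
captioning = "[IMAGE:img] what does the image describe? -> [TEXT:caption]"
print(plan(captioning))
```

The point of the sketch is the decoupling the abstract describes: the task is stated only in terms of modalities and data fields, while the choice of encoders and decoders is left to whatever consumes the plan.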

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2212.04408 [cs.CV]
	(or arXiv:2212.04408v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2212.04408

Submission history

From: Xuancheng Ren
[v1] Thu, 8 Dec 2022 17:07:09 UTC (3,638 KB)
