Scaling Vision-Language Models with Sparse Mixture of Experts

Shen, Sheng; Yao, Zhewei; Li, Chunyuan; Darrell, Trevor; Keutzer, Kurt; He, Yuxiong

arXiv:2303.07226 (cs)

[Submitted on 13 Mar 2023]

Abstract: The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these models become larger and more complex, they also become more challenging to train and deploy. One approach to addressing this challenge is the use of sparsely-gated mixture-of-experts (MoE) techniques, which divide the model into smaller, specialized sub-models that can jointly solve a task. In this paper, we explore the effectiveness of MoE in scaling vision-language models, demonstrating its potential to achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost. Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling VLMs. We hope our work will inspire further research into the use of MoE for scaling large-scale vision-language models and other multimodal machine learning applications.
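
As a rough illustration of the sparsely-gated MoE idea mentioned in the abstract, the sketch below routes one token vector to the top-k of several experts and mixes their outputs. It is not the paper's implementation: the class name `SparseMoE`, the single linear map standing in for each expert, the random initialization, and the choice to apply softmax over only the selected logits are all assumptions made for this example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class SparseMoE:
    """Hypothetical top-k gated mixture-of-experts layer (illustrative only)."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one row of weights per expert, producing one logit each.
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Each expert is a dim x dim linear map, a stand-in for a full FFN.
        self.experts = [[[rng.gauss(0, 0.1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def route(self, x):
        """Return (expert_index, weight) pairs for the top-k experts."""
        logits = matvec(self.router, x)
        top = sorted(range(len(logits)), key=lambda i: logits[i],
                     reverse=True)[:self.top_k]
        # Assumption: normalize over the selected logits only.
        weights = softmax([logits[i] for i in top])
        return list(zip(top, weights))

    def forward(self, x):
        """Run only the selected experts and sum their weighted outputs."""
        out = [0.0] * len(x)
        for idx, w in self.route(x):
            y = matvec(self.experts[idx], x)
            out = [o + w * yi for o, yi in zip(out, y)]
        return out


layer = SparseMoE(dim=4, num_experts=8, top_k=2)
token = [0.5, -1.0, 0.25, 2.0]
print(layer.route(token))    # which experts were chosen, and their weights
print(layer.forward(token))  # mixed output, same dimension as the input
```

With `num_experts=8` and `top_k=2`, the layer stores eight experts but evaluates only two per token, which is the sense in which a sparse model can hold more parameters than a dense model at a similar per-token compute cost.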

Comments:	Preprint
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2303.07226 [cs.CV]
	(or arXiv:2303.07226v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2303.07226

Submission history

From: Sheng Shen
[v1] Mon, 13 Mar 2023 16:00:31 UTC (13,289 KB)
