Deep Multimodal Neural Architecture Search

Yu, Zhou; Cui, Yuhao; Yu, Jun; Wang, Meng; Tao, Dacheng; Tian, Qi

arXiv:2004.12070 (cs)

[Submitted on 25 Apr 2020 (v1), last revised 11 Oct 2020 (this version, v2)]

Abstract: Designing effective neural networks is fundamentally important in deep multimodal learning. Most existing works focus on a single task and design neural architectures manually, which are highly task-specific and hard to generalize to different tasks. In this paper, we devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks. Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone, where each encoder or decoder block corresponds to an operation searched from a predefined operation pool. On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks. By using a gradient-based NAS algorithm, the optimal architectures for different tasks are learned efficiently. Extensive ablation studies, comprehensive analysis, and comparative experimental results show that the obtained MMnasNet significantly outperforms existing state-of-the-art approaches across three multimodal learning tasks (over five datasets), including visual question answering, image-text matching, and visual grounding.
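
The abstract does not spell out the search procedure, so the following is only a minimal sketch of the general idea behind gradient-based NAS over an operation pool, assuming a softmax-based continuous relaxation of the kind commonly used in such methods. The operation names, the `mixed_block` and `derive_block` helpers, and the toy list-based features are hypothetical placeholders, not the MMnas primitives or the authors' code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical stand-ins for a pool of primitive operations.
# Each one maps a feature vector (list of floats) to a feature vector.
OPERATION_POOL = {
    "identity": lambda x: list(x),
    "double": lambda x: [2.0 * v for v in x],
    "negate": lambda x: [-v for v in x],
}

def mixed_block(x, arch_logits):
    """During search: output the softmax-weighted sum of every candidate op.

    Because the weights are a smooth function of arch_logits, the logits
    could be trained by gradient descent together with the network weights.
    """
    weights = softmax(arch_logits)
    out = [0.0] * len(x)
    for w, op in zip(weights, OPERATION_POOL.values()):
        y = op(x)
        out = [o + w * v for o, v in zip(out, y)]
    return out

def derive_block(arch_logits):
    """After search: keep only the operation with the largest logit."""
    names = list(OPERATION_POOL)
    best = max(range(len(names)), key=lambda i: arch_logits[i])
    return names[best]
```

In this sketch, each encoder or decoder block would own one `arch_logits` vector; calling `derive_block` on every block after the search yields a discrete architecture, to which a task-specific head could then be attached.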

Comments:	Accepted to ACM MM 2020, code available at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2004.12070 [cs.CV]
	(or arXiv:2004.12070v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2004.12070

Submission history

From: Zhou Yu [view email]
[v1] Sat, 25 Apr 2020 07:00:32 UTC (2,185 KB)
[v2] Sun, 11 Oct 2020 03:28:08 UTC (2,062 KB)
