A Closer Look at the Robustness of Vision-and-Language Pre-trained Models

Li, Linjie; Gan, Zhe; Liu, Jingjing

arXiv:2012.08673 (cs)

[Submitted on 15 Dec 2020 (v1), last revised 30 Mar 2021 (this version, v2)]

Abstract: Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level. Although these models achieve impressive performance on standard tasks, it remains unclear how robust they are. To investigate, we conduct a host of thorough evaluations on existing pre-trained models over 4 different types of V+L-specific model robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Interestingly, with standard model finetuning alone, pre-trained V+L models already exhibit better robustness than many task-specific state-of-the-art methods. To further enhance model robustness, we propose Mango, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models. Unlike previous studies, which focus on one specific type of robustness, Mango is task-agnostic and enables a universal performance lift for pre-trained models over diverse tasks designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that Mango achieves a new state of the art on 7 out of 9 robustness benchmarks, surpassing existing methods by a significant margin. As the first comprehensive study on V+L robustness, this work puts the robustness of pre-trained models into sharper focus and points to new directions for future study.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2012.08673 [cs.CV]
	(or arXiv:2012.08673v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2012.08673

Submission history

From: Linjie Li
[v1] Tue, 15 Dec 2020 23:41:42 UTC (2,595 KB)
[v2] Tue, 30 Mar 2021 23:51:50 UTC (2,638 KB)
