Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

Nguyen, Duy-Kien; Okatani, Takayuki

arXiv:1804.00775 (cs)

[Submitted on 3 Apr 2018 (v1), last revised 1 Dec 2018 (this version, v2)]

Abstract: A key solution to visual question answering (VQA) exists in how to fuse visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the two modalities contributes to boost accuracy of prediction of answers. Specifically, we present a simple architecture that is fully symmetric between visual and language representations, in which each question word attends on image regions and each image region attends on question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state-of-the-art on VQA and VQA 2.0 despite its small size. We also present qualitative evaluation, demonstrating how the proposed attention mechanism can generate reasonable attention maps on images and questions, which leads to the correct answer prediction.
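
The symmetric attention described in the abstract (each question word attends on image regions, and each image region attends on question words) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no equations, so the plain dot-product affinity, the function names `softmax` and `co_attention`, and the toy inputs are all assumptions made for illustration, and the learned parameters, fusion step, and layer stacking of the actual model are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def co_attention(words, regions):
    """Illustrative symmetric co-attention (assumed form, not the paper's).

    words:   list of N question-word vectors, each of dimension d
    regions: list of T image-region vectors, each of dimension d
    Returns (attended_regions, attended_words):
      attended_regions[i] is the region summary seen by word i
      attended_words[j]   is the word summary seen by region j
    """
    d = len(words[0])
    # One shared affinity matrix: affinity[i][j] = <word_i, region_j>.
    # Assumption: plain dot product, with no learned weights.
    affinity = [
        [sum(w * r for w, r in zip(word, region)) for region in regions]
        for word in words
    ]
    # Each word attends over all regions (normalize along rows).
    word_to_region = [softmax(row) for row in affinity]
    # Each region attends over all words (normalize along columns).
    region_to_word = [
        softmax([affinity[i][j] for i in range(len(words))])
        for j in range(len(regions))
    ]
    attended_regions = [
        [sum(a * regions[j][k] for j, a in enumerate(weights)) for k in range(d)]
        for weights in word_to_region
    ]
    attended_words = [
        [sum(a * words[i][k] for i, a in enumerate(weights)) for k in range(d)]
        for weights in region_to_word
    ]
    return attended_regions, attended_words


# Toy usage: 2 question words and 3 image regions in a 2-dimensional space.
toy_words = [[1.0, 0.0], [0.0, 1.0]]
toy_regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_attended_regions, toy_attended_words = co_attention(toy_words, toy_regions)
```

Both attention directions are derived from the same affinity matrix, which is what makes the sketch symmetric between the two modalities. Stacking, as the abstract describes, would feed fused outputs of one such step into the next; that part is not shown here.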

Comments:	In Proceeding of CVPR'2018
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:1804.00775 [cs.CV]
	(or arXiv:1804.00775v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1804.00775

Submission history

From: Duy-Kien Nguyen
[v1] Tue, 3 Apr 2018 01:24:23 UTC (765 KB)
[v2] Sat, 1 Dec 2018 08:12:22 UTC (2,480 KB)
