Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

Fan, Chenyou; Zhang, Xiaofan; Zhang, Shu; Wang, Wensheng; Zhang, Chi; Huang, Heng

arXiv:1904.04357 (cs)

[Submitted on 8 Apr 2019]

Abstract: We propose a new end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory that effectively learns global context information from appearance and motion features; 2) a redesigned question memory that helps capture the complex semantics of the question and highlights the queried subjects; and 3) a new multimodal fusion layer that performs multi-step reasoning by attending to relevant visual and textual hints with self-updated attention. The model first generates global context-aware visual and textual features by letting the current inputs interact with the memory contents. It then fuses the multimodal visual and textual representations through attention to infer the correct answer. Multiple reasoning cycles can be run to iteratively refine the attention weights over the multimodal data and improve the final representation of the QA pair. Experimental results show that the approach achieves state-of-the-art performance on four VideoQA benchmark datasets.
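The multi-step reasoning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the abstract gives no equations, so the dot-product scoring, the tanh state update, and the names `attend` and `multistep_fusion` are all assumptions made for illustration, and the learned memories and projections of the actual framework are omitted. The sketch only shows how a shared state can re-weight visual and textual features over several cycles.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(features, query):
    """Dot-product attention: return (weights, weighted sum of features)."""
    weights = softmax([dot(f, query) for f in features])
    dim = len(query)
    summary = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, summary


def multistep_fusion(visual, textual, steps=3):
    """Hypothetical multi-cycle fusion (an assumption, not the paper's equations).

    visual, textual: lists of equal-length feature vectors.
    Returns the final fused state and the attention weights of the last cycle.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    dim = len(visual[0])
    state = [0.0] * dim  # zero state gives uniform attention in the first cycle
    for _ in range(steps):
        # Attend to each modality using the current state as the query.
        v_weights, v_summary = attend(visual, state)
        t_weights, t_summary = attend(textual, state)
        # Assumed update rule: squash the sum of state and attended summaries.
        state = [math.tanh(s + v + t) for s, v, t in zip(state, v_summary, t_summary)]
    return state, v_weights, t_weights


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]   # toy per-frame features
    textual = [[1.0, 0.0], [1.0, 0.0]]  # toy per-word features
    state, v_w, t_w = multistep_fusion(visual, textual, steps=3)
    print(state, v_w, t_w)
```

In this toy input the textual features point along the first dimension, so after the first cycle the state does too, and later cycles shift visual attention toward the first frame. That is the sense in which attention weights are refined iteratively in this sketch.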

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:1904.04357 [cs.CV]
	(or arXiv:1904.04357v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1904.04357

Submission history

From: Chenyou Fan
[v1] Mon, 8 Apr 2019 21:10:16 UTC (1,073 KB)
