Multimodal Representations for Teacher-Guided Compositional Visual Reasoning

Aissa, Wafa; Ferecatu, Marin; Crucianu, Michel


arXiv:2310.15585 (cs)

[Submitted on 24 Oct 2023]


Authors: Wafa Aissa (CEDRIC - VERTIGO), Marin Ferecatu (CEDRIC - VERTIGO), Michel Crucianu (CEDRIC - VERTIGO)


Abstract: Neural Module Networks (NMNs) are a compelling method for visual question answering: they translate a question into a program consisting of a series of reasoning sub-tasks that are executed sequentially on the image to produce an answer. NMNs are more explainable than integrated models, allowing a better understanding of the underlying reasoning process. To improve the effectiveness of NMNs, we propose to exploit features obtained by a large-scale cross-modal encoder. In addition, the current training approach for NMNs relies on propagating module outputs to subsequent modules, which leads to the accumulation of prediction errors and the generation of false answers. To mitigate this, we introduce an NMN learning strategy involving scheduled teacher guidance. Initially, the model is fully guided by the ground-truth intermediate outputs, but it gradually transitions to autonomous behavior as training progresses. This reduces error accumulation, thus improving training efficiency and final performance. We demonstrate that by incorporating cross-modal features and employing more effective training techniques for NMNs, we achieve a favorable balance between performance and transparency in the reasoning process.
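
The scheduled teacher guidance described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the shape of the schedule, so the linear decay used here is an assumption, and the names teacher_guidance_prob and run_program, as well as the module call signature, are hypothetical. The sketch only shows the general idea that each module's input is either the ground-truth intermediate output or the previous module's own prediction, with the probability of guidance shrinking as training progresses.

```python
import random


def teacher_guidance_prob(step: int, total_steps: int) -> float:
    """Probability of feeding the ground-truth intermediate output.

    ASSUMPTION: linear decay from 1.0 (fully guided) to 0.0 (autonomous).
    The abstract only says the transition is gradual.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress


def run_program(modules, image_features, ground_truth, p_teacher, rng):
    """Execute a sequence of modules on the image features.

    Hypothetical interface: each module is a callable taking
    (image_features, previous_output) and returning its prediction.
    After each step, the value passed on to the next module is the
    ground-truth intermediate output with probability p_teacher,
    otherwise the module's own prediction.
    """
    predictions = []
    previous = None
    for module, truth in zip(modules, ground_truth):
        prediction = module(image_features, previous)
        predictions.append(prediction)
        previous = truth if rng.random() < p_teacher else prediction
    return predictions
```

With p_teacher equal to 1.0 every module receives the ground-truth output of its predecessor, so an early mistake cannot propagate; with p_teacher equal to 0.0 the program runs exactly as it would at inference time. In a training loop, one would recompute p_teacher from the current step before each call to run_program.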

Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2310.15585 [cs.CL]
	(or arXiv:2310.15585v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.15585
Journal reference:	Advanced Concepts for Intelligent Vision Systems, 21st International Conference (ACIVS 2023), Aug 2023, Kumamoto, Japan

Submission history

From: Wafa Aissa [via CCSD proxy]
[v1] Tue, 24 Oct 2023 07:51:08 UTC (847 KB)
