A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 110-120

https://doi.org/10.1145/3539618.3591629

Published: 18 July 2023

Abstract

Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answering a question about an image whose answer does not lie in the image. This paper presents a new pipeline for KI-VQA tasks, consisting of a retriever and a reader. First, we introduce DEDR, a symmetric dual encoding dense retrieval framework in which documents and queries are encoded into a shared embedding space using uni-modal (textual) and multi-modal encoders. We introduce an iterative knowledge distillation approach that bridges the gap between the representation spaces in these two encoders. Extensive evaluation on two well-established KI-VQA datasets, i.e., OK-VQA and FVQA, suggests that DEDR outperforms state-of-the-art baselines by 11.6% and 30.9% on OK-VQA and FVQA, respectively.
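The retrieval idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the textual and multi-modal encoder outputs are combined, so concatenation is an assumption here, as are the inner-product scoring, the KL-divergence distillation loss, and every function name (`combine`, `retrieve`, `distillation_loss`). The vectors are toy values standing in for real encoder outputs.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product, used here as the relevance score in the shared space."""
    return sum(x * y for x, y in zip(a, b))


def combine(text_vec: Sequence[float], mm_vec: Sequence[float]) -> Vector:
    """ASSUMPTION: join the textual and multi-modal embeddings by concatenation.

    The same function is applied to queries and documents, which is what
    makes this toy setup symmetric.
    """
    return list(text_vec) + list(mm_vec)


def retrieve(query_vec: Vector, doc_vecs: List[Vector], k: int) -> List[Tuple[int, float]]:
    """Return the k highest-scoring (document index, score) pairs."""
    scored = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_scores: Sequence[float], student_scores: Sequence[float]) -> float:
    """KL(teacher || student) over the score distributions for one query.

    ASSUMPTION: a generic distillation objective, not necessarily the one
    used in the paper.
    """
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy data: each pair is (textual embedding, multi-modal embedding).
query = combine([1.0, 0.0], [0.0, 1.0])
docs = [
    combine([1.0, 0.0], [0.0, 1.0]),  # matches the query in both halves
    combine([0.0, 1.0], [1.0, 0.0]),  # matches in neither half
    combine([1.0, 0.0], [1.0, 0.0]),  # matches in the textual half only
]
top = retrieve(query, docs, k=2)
print(top)  # [(0, 2.0), (2, 1.0)]
```

One plausible way to make such a scheme iterative is to alternate which encoder acts as teacher and which as student between rounds, minimizing `distillation_loss` each time; the abstract does not spell out the schedule, so that too is an assumption.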

Utilizing the passages retrieved by DEDR, we further introduce MM-FiD, an encoder-decoder multi-modal fusion-in-decoder model, for generating a textual answer for KI-VQA tasks. MM-FiD encodes the question, the image, and each retrieved passage separately and uses all passages jointly in its decoder. Compared to competitive baselines in the literature, this approach leads to 5.5% and 8.5% improvements in terms of question answering accuracy on OK-VQA and FVQA, respectively.
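The fusion-in-decoder pattern described above can be sketched as follows. This is a toy illustration under stated assumptions, not MM-FiD itself: `toy_encode` is a hypothetical stand-in for a real multi-modal encoder, the `<img_*>` strings are placeholders for image features, and pairing the question and image with each passage before encoding is one reading of the abstract. The point is only the data flow: inputs are encoded independently, then the encoder states are concatenated so that a single decoder attention step sees every passage at once.

```python
import math
from typing import List

Vector = List[float]


def toy_encode(tokens: List[str], dim: int = 4) -> List[Vector]:
    """HYPOTHETICAL encoder: one deterministic vector per input token."""
    return [
        [((sum(map(ord, tok)) + 31 * j) % 7) / 7.0 for j in range(dim)]
        for tok in tokens
    ]


def encode_separately(
    question: List[str], image_tokens: List[str], passages: List[List[str]]
) -> List[List[Vector]]:
    """Encode each (question, image, passage) input independently of the others."""
    return [toy_encode(question + image_tokens + passage) for passage in passages]


def fuse(encoded: List[List[Vector]]) -> List[Vector]:
    """Concatenate the per-passage encoder states along the sequence axis."""
    return [state for states in encoded for state in states]


def cross_attend(decoder_query: Vector, memory: List[Vector]) -> Vector:
    """One scaled dot-product attention step over the fused memory."""
    scale = math.sqrt(len(decoder_query))
    logits = [sum(q * m for q, m in zip(decoder_query, state)) / scale for state in memory]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(memory[0])
    return [sum(w * state[j] for w, state in zip(weights, memory)) for j in range(dim)]


question = ["what", "sport", "is", "this"]
image_tokens = ["<img_0>", "<img_1>"]  # placeholders for image features
passages = [
    ["tennis", "uses", "rackets"],
    ["soccer", "uses", "a", "ball"],
]

encoded = encode_separately(question, image_tokens, passages)
memory = fuse(encoded)  # the decoder attends over all passages jointly
context = cross_attend([1.0, 0.0, 0.0, 0.0], memory)
print(len(encoded), len(memory), len(context))  # 2 19 4
```

Because each passage is encoded on its own, encoder cost grows linearly with the number of retrieved passages, while the decoder can still combine evidence across them.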

Supplemental Material

Presentation video of the paper "A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering." The paper introduces a pipeline for Knowledge-Intensive Visual Question Answering (KI-VQA) tasks, comprising a retriever (DEDR) and a reader (MM-FiD). DEDR is a dual encoding dense retrieval framework that uses both textual and multi-modal encoders and applies iterative knowledge distillation to align their representation spaces. Evaluation on the OK-VQA and FVQA datasets suggests that DEDR outperforms state-of-the-art baselines by 11.6% and 30.9%, respectively. MM-FiD, an encoder-decoder multi-modal fusion-in-decoder model, uses the retrieved passages to generate textual answers and improves question answering accuracy over competitive baselines by 5.5% on OK-VQA and 8.5% on FVQA.

