DOI: 10.1145/3581783.3612516

Breaking the Barrier Between Pre-training and Fine-tuning: A Hybrid Prompting Model for Knowledge-Based VQA

Published: 27 October 2023

Abstract

Considerable performance gains on knowledge-based visual question answering (VQA) have been achieved by visual-language pre-training models under the pre-training-then-fine-tuning paradigm. However, because the objectives of the pre-training and fine-tuning stages differ, an evident barrier prevents the cross-modal comprehension ability developed during pre-training from fully benefiting the fine-tuning task. To break this barrier, in this paper we propose a novel hybrid prompting model for knowledge-based VQA that inherits and unifies the pre-training and fine-tuning tasks under a shared objective. Specifically, based on a static declaration prompt, we formulate the fine-tuning objective as masked language modeling, consistent with pre-training, so that the capabilities of the pre-training task are inherited, while the top-t relevant knowledge entries are selected by dense retrieval. In addition, a dynamic knowledge prompt is learned from the retrieved knowledge; it not only alleviates the input-length constraint of visual-language pre-trained models but also helps supply answer features during fine-tuning. Combining and unifying the aims of the two stages fully exploits the abilities of both pre-training and fine-tuning for answer prediction. We evaluate the proposed model on the OK-VQA dataset, and the results show that our model outperforms state-of-the-art methods based on visual-language pre-training models by a noticeable margin and even exceeds the large-scale language model GPT-3, demonstrating the benefit of the hybrid prompts and the advantage of unifying pre-training and fine-tuning.
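
To make the static declaration prompt concrete: the question is rewritten as a declarative sentence containing a [MASK] token, and candidate answers are scored by a masked language model at that position, so fine-tuning reuses the pre-training objective. Below is a minimal, illustrative sketch using a generic text-only MLM (bert-base-uncased) in place of the paper's visual-language backbone; the function name and template are ours, not the authors' code.

```python
# Hedged sketch of declaration-prompt MLM answer scoring (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def score_candidates(declaration: str, candidates: list[str]) -> dict[str, float]:
    """Score single-token answer candidates at the [MASK] position.

    `declaration` is the question rewritten as a statement, e.g.
    "What fruit is on the table?" -> "The fruit on the table is [MASK]."
    """
    inputs = tokenizer(declaration, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]   # (vocab_size,)
    log_probs = logits.log_softmax(dim=-1)
    scores = {}
    for cand in candidates:
        ids = tokenizer(cand, add_special_tokens=False)["input_ids"]
        if len(ids) == 1:  # keep the sketch to single-token answers
            scores[cand] = log_probs[ids[0]].item()
    return scores

print(score_candidates("The fruit on the table is [MASK].", ["banana", "apple", "chair"]))
```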
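
The top-t knowledge selection the abstract mentions can be sketched as standard dense retrieval: embed the query and each knowledge sentence, rank by inner product, and keep the t best-scoring entries. The mean-pooled BERT encoder below is a stand-in assumption; the paper's actual retriever and knowledge source may differ.

```python
# Hedged sketch of top-t dense knowledge retrieval (encoder is a stand-in).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state        # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, L, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)      # mean over real tokens
    return torch.nn.functional.normalize(pooled, dim=-1)

def retrieve_top_t(query: str, knowledge: list[str], t: int = 3) -> list[str]:
    sims = embed([query]) @ embed(knowledge).T         # (1, N) cosine similarities
    top = sims[0].topk(min(t, len(knowledge))).indices
    return [knowledge[i] for i in top.tolist()]

facts = ["Bananas are yellow when ripe.",
         "A table is a piece of furniture.",
         "Apples grow on trees."]
print(retrieve_top_t("What fruit is on the table?", facts, t=2))
```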
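
Finally, one plausible reading of the dynamic knowledge prompt (our assumption, not a design detail stated in the abstract) is that the retrieved knowledge embeddings are compressed into a fixed, small set of soft prompt vectors prepended to the backbone's token embeddings; because the prompt length is constant, the input length no longer grows with the amount of retrieved knowledge.

```python
# Hedged sketch of a dynamic knowledge prompt: learnable queries attend
# over retrieved-knowledge embeddings to produce a fixed number of soft
# prompt vectors (an illustration, not the authors' implementation).
import torch
import torch.nn as nn

class DynamicKnowledgePrompt(nn.Module):
    def __init__(self, know_dim: int, model_dim: int, n_prompts: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_prompts, model_dim))
        self.attn = nn.MultiheadAttention(model_dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(know_dim, model_dim)

    def forward(self, know_emb: torch.Tensor, token_emb: torch.Tensor) -> torch.Tensor:
        """know_emb: (B, t, know_dim) retrieved knowledge; token_emb: (B, L, model_dim)."""
        kv = self.proj(know_emb)                          # (B, t, model_dim)
        q = self.queries.unsqueeze(0).expand(know_emb.size(0), -1, -1)
        prompts, _ = self.attn(q, kv, kv)                 # (B, n_prompts, model_dim)
        return torch.cat([prompts, token_emb], dim=1)     # prepend soft prompts

# Toy shapes: 2 questions, t=5 retrieved facts, 10-token inputs.
layer = DynamicKnowledgePrompt(know_dim=768, model_dim=768, n_prompts=4)
out = layer(torch.randn(2, 5, 768), torch.randn(2, 10, 768))
print(out.shape)  # torch.Size([2, 14, 768]) -- grows by n_prompts, not by t
```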


Cited By

  • (2024) Semantic Alignment for Multimodal Large Language Models. In Proceedings of the 32nd ACM International Conference on Multimedia. DOI: 10.1145/3664647.3681014, 3489-3498. Online publication date: 28 Oct 2024.
  • (2024) Enhancing GPT-3.5 for Knowledge-Based VQA with In-Context Prompt Learning and Image Captioning. In 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC). DOI: 10.1109/SMC54092.2024.10832004, 4030-4035. Online publication date: 6 Oct 2024.



        Published In

        MM '23: Proceedings of the 31st ACM International Conference on Multimedia
        October 2023
        9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 27 October 2023


        Author Tags

        1. knowledge integration
        2. multi-modal fusion
        3. visual question answering

        Qualifiers

        • Research-article

        Funding Sources

        • NSFC
        • National Key R&D Program of China
        • R&D Program of Beijing Municipal Education Commission

        Conference

        MM '23
        MM '23: The 31st ACM International Conference on Multimedia
        October 29 - November 3, 2023
        Ottawa ON, Canada

        Acceptance Rates

        Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

• Downloads (last 12 months): 99
• Downloads (last 6 weeks): 9

Reflects downloads up to 08 Feb 2025.

