DOI: 10.1145/3474085.3475492
Research Article

From Superficial to Deep: Language Bias driven Curriculum Learning for Visual Question Answering

Published: 17 October 2021

Abstract

Most Visual Question Answering (VQA) models suffer from language bias when learning to answer a given question, and consequently fail to exploit visual and textual knowledge together. Based on the observation that VQA samples with different levels of language bias contribute differently to answer prediction, this paper overcomes the language prior problem with a novel Language Bias driven Curriculum Learning (LBCL) approach, which employs an easy-to-hard learning strategy guided by a new difficulty metric, the Visual Sensitive Coefficient (VSC). Specifically, in the initial training stage the VQA model mainly learns the superficial textual correlations between questions and answers (the easy concept) from more-biased examples, and in subsequent stages it progressively focuses on multimodal reasoning (the hard concept) from less-biased examples. The curriculum selection of examples at each stage is governed by the proposed VSC, which evaluates how strongly language bias drives each VQA sample. Furthermore, to avoid catastrophic forgetting of previously learned concepts during the multi-stage procedure, we integrate knowledge distillation into the curriculum learning framework. Extensive experiments show that LBCL can be applied to common VQA baseline models, and achieves remarkably better performance on the VQA-CP v1 and v2 datasets, with an overall 20% accuracy boost over baseline models.
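
To make the training recipe concrete, the following minimal PyTorch-style sketch illustrates the three ingredients the abstract describes: a per-sample difficulty score in the spirit of the Visual Sensitive Coefficient, an easy-to-hard multi-stage curriculum, and a knowledge-distillation loss against the previous stage's model. The exact VSC formula is defined in the paper, not in this abstract; here it is approximated by how much the ground-truth answer's probability changes when the image features are blanked out, and the `model(image_feats, question)` signature, stage count, and loss weights are all illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def visual_sensitive_coefficient(model, image_feats, question, answer_idx):
    """Illustrative proxy for the Visual Sensitive Coefficient (VSC).

    Measures how much the ground-truth answer's probability depends on the
    image: a low score means the question alone suffices (a more-biased,
    "easy" sample); a high score means visual evidence is needed (a
    less-biased, "hard" sample). `model` is assumed to map
    (image_feats, question) to a 1-D vector of answer logits.
    """
    with torch.no_grad():
        p_full = F.softmax(model(image_feats, question), dim=-1)[answer_idx]
        # Blank the image to approximate a question-only prediction.
        blank = torch.zeros_like(image_feats)
        p_blind = F.softmax(model(blank, question), dim=-1)[answer_idx]
    return (p_full - p_blind).item()

def curriculum_stages(samples, vsc_scores, num_stages=3):
    """Easy-to-hard schedule: stage 1 trains on the most-biased (lowest
    VSC) samples so the model first absorbs superficial question-answer
    correlations; later stages cumulatively add less-biased samples to
    force multimodal reasoning."""
    order = sorted(range(len(samples)), key=lambda i: vsc_scores[i])
    per_stage = len(order) // num_stages
    for s in range(1, num_stages + 1):
        yield [samples[i] for i in order[: s * per_stage]]

def stage_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style knowledge distillation against a frozen copy of the
    previous stage's model, so concepts learned earlier in the curriculum
    are not catastrophically forgotten."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```

In this sketch each stage continues from the previous stage's weights while distilling from a frozen copy of them; how LBCL actually weights the hard and soft terms, how many stages it uses, and how it partitions samples by VSC are details specified in the paper itself.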

Supplementary Material

ZIP File (mfp1779aux.zip)
A PDF containing a supplemental experiment on small-scale datasets and a case study of the proposed Language Bias driven Curriculum Learning for Visual Question Answering.


Cited By

  • (2024) Universal Relocalizer for Weakly Supervised Referring Expression Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications 20(7), 1-23. https://doi.org/10.1145/3656045. Online publication date: 16-May-2024.
  • (2024) Robust Visual Question Answering: Datasets, Methods, and Future Challenges. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(8), 5575-5594. https://doi.org/10.1109/TPAMI.2024.3366154. Online publication date: Aug-2024.
  • (2023) Overcoming Language Bias in Remote Sensing Visual Question Answering Via Adversarial Training. IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium, 2235-2238. https://doi.org/10.1109/IGARSS52108.2023.10282946. Online publication date: 16-Jul-2023.
  • (2023) Distance Metric Learning-optimized Attention Mechanism for Visual Question Answering. 2023 9th International Conference on Information, Cybernetics, and Computational Social Systems (ICCSS), 91-96. https://doi.org/10.1109/ICCSS58421.2023.10270653. Online publication date: 2-Jun-2023.


Information

    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 October 2021


    Author Tags

    1. curriculum learning
    2. language bias
    3. visual question answering

    Qualifiers

    • Research-article

    Funding Sources

    • Natural Science Foundation of Hunan Province
    • National Natural Science Foundation of China

    Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%


    Article Metrics

    • Downloads (Last 12 months)78
    • Downloads (Last 6 weeks)5
    Reflects downloads up to 12 Sep 2024
