Abstract
Multimodal reasoning is a challenging task that requires models to integrate information across multiple modalities, such as vision and language, to answer questions. Existing approaches have made progress by combining the language and visual modalities in a two-stage reasoning framework that separates rationale generation from answer inference. However, these approaches often fall short because the generated rationales are of inadequate quality. In this work, we examine the importance of rationales in model reasoning. We observe that when rationales are completely accurate, the model’s accuracy improves significantly, highlighting the need for high-quality rationale generation. Motivated by this, we propose MC-CoT, a self-consistency training strategy that generates multiple rationales and answers and then selects the most accurate among them through a voting process. This approach not only enhances the quality of the generated rationales but also yields more accurate and robust answers. Extensive experiments demonstrate that our approach significantly improves model performance across various benchmarks. Remarkably, even smaller base models, when equipped with our proposed approach, can achieve results comparable to those of larger models, illustrating the potential of harnessing the power of rationales for improved multimodal reasoning. The code is available at github.com/chengtan9907/mc-cot.
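To make the voting step concrete, the sketch below is a minimal Python illustration of answer-level majority voting in the spirit of self-consistency; it is not the authors' released implementation, and the function names (sample_rationale_and_answer, self_consistent_predict) and the number of samples k are illustrative assumptions. MC-CoT applies such voting as a training strategy, whereas this sketch only shows the inference-time analogue of reducing several sampled rationale-answer pairs to a single prediction.

from collections import Counter
import random

def sample_rationale_and_answer(question, image, rng):
    # Placeholder for one stochastic forward pass of a two-stage model:
    # stage 1 generates a rationale, stage 2 infers an answer from it.
    # Dummy outputs are used here purely to illustrate the interface.
    rationale = f"rationale sampled with noise {rng.random():.3f}"
    answer = rng.choice(["A", "B", "C", "D"])
    return rationale, answer

def self_consistent_predict(question, image, k=5, seed=0):
    # Sample k rationale-answer pairs and return the majority-vote answer
    # together with one rationale that supports it.
    rng = random.Random(seed)
    pairs = [sample_rationale_and_answer(question, image, rng) for _ in range(k)]
    votes = Counter(answer for _, answer in pairs)
    best_answer, _ = votes.most_common(1)[0]
    best_rationale = next(r for r, a in pairs if a == best_answer)
    return best_rationale, best_answer

if __name__ == "__main__":
    rationale, answer = self_consistent_predict("Which option is correct?", image=None, k=7)
    print(answer, "|", rationale)

In the paper, the voted outputs serve as higher-quality training targets for rationale generation rather than only as an inference-time aggregation; see the paper for the exact procedure.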
C. Tan, J. Wei and Z. Gao—Equal contribution.
Acknowledgement
This work was supported by the Science & Technology Innovation 2030 Major Program (Project No. 2021ZD0150100), the National Natural Science Foundation of China (Project No. U21A20427), Project No. WU2022A009 from the Center of Synthetic Biology and Integrated Bioengineering of Westlake University, Project No. WU2023C019 from the Westlake University Industries of the Future Research, and Project No. 23-407-3-29 from the Shenyang Science and Technology Program. Finally, we thank the Westlake University HPC Center for providing part of the computational resources.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tan, C. et al. (2025). Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-consistency Training. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15098. Springer, Cham. https://doi.org/10.1007/978-3-031-73661-2_17
DOI: https://doi.org/10.1007/978-3-031-73661-2_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73660-5
Online ISBN: 978-3-031-73661-2
eBook Packages: Computer Science, Computer Science (R0)