Abstract
By combining the natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities. However, the generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes of and relationships between objects. To address these issues, we introduce ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), a novel framework that uses fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is achieved efficiently using much cheaper human evaluations instead of full supervision, as well as automated methods. We show the effectiveness of our approach through a variety of evaluation methods and benchmarks. Additionally, we release our human annotations (https://github.com/amazon-science/vigor), comprising 15,440 image and generated-text pairs with fine-grained evaluations, to support related research in the community.
S. Yan and M. Bai contributed equally.
S. Yan's work was done during an internship at AWS AI.
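The abstract describes two ingredients: a fine-grained reward model trained on inexpensive per-sentence human (and automated) evaluations of generated text, and a fine-tuning stage that uses those rewards to improve the LVLM's visual grounding. The sketch below is a minimal toy illustration of that idea only, not the authors' implementation; the class name FineGrainedRewardModel, the feature dimensions, the regression loss, and the REINFORCE-style update are all assumptions made to keep the example self-contained and runnable, and a real system would plug in an actual LVLM together with its vision and text features.

```python
# Illustrative sketch (NOT the authors' code): a per-sentence reward model
# scores how well each sentence of a generated caption is grounded in the
# image, and the resulting rewards weight a REINFORCE-style loss used to
# fine-tune the captioning model. Sizes are toy values; weights are random.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FineGrainedRewardModel(nn.Module):
    """Scores each candidate sentence against pooled image features."""

    def __init__(self, img_dim=256, txt_dim=256, hidden=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, img_feat, sent_feats):
        # img_feat: (B, img_dim); sent_feats: (B, S, txt_dim) -> scores (B, S)
        img = self.img_proj(img_feat).unsqueeze(1).expand(-1, sent_feats.size(1), -1)
        txt = self.txt_proj(sent_feats)
        return self.scorer(torch.cat([img, txt], dim=-1)).squeeze(-1)


def reward_model_loss(pred_rewards, human_labels):
    """Regress per-sentence scores toward fine-grained human judgments
    (e.g. +1 well grounded, -1 hallucinated); other losses are possible."""
    return F.mse_loss(pred_rewards, human_labels)


def policy_finetune_loss(sent_logprobs, pred_rewards, baseline=0.0):
    """REINFORCE-style objective: up-weight sentences the reward model
    considers well grounded, down-weight the rest."""
    advantages = (pred_rewards - baseline).detach()
    return -(advantages * sent_logprobs).mean()


if __name__ == "__main__":
    B, S = 2, 4  # batch of 2 images, 4 sentences per generated caption
    reward_model = FineGrainedRewardModel()

    # Stage 1: train the reward model on fine-grained human evaluations.
    img_feat = torch.randn(B, 256)        # stand-in for frozen vision features
    sent_feats = torch.randn(B, S, 256)   # stand-in for sentence embeddings
    human_labels = torch.randint(0, 2, (B, S)).float() * 2 - 1  # +1 / -1
    rm_loss = reward_model_loss(reward_model(img_feat, sent_feats), human_labels)

    # Stage 2: use predicted per-sentence rewards to fine-tune the LVLM.
    sent_logprobs = torch.randn(B, S, requires_grad=True)  # stand-in log-probs
    with torch.no_grad():
        rewards = reward_model(img_feat, sent_feats)
    ft_loss = policy_finetune_loss(sent_logprobs, rewards)
    print(f"reward-model loss {rm_loss.item():.3f}, fine-tune loss {ft_loss.item():.3f}")
```

In practice the per-sentence granularity is the point: a single caption-level reward cannot indicate which part of a long generation hallucinated an object, whereas sentence-level (or finer) rewards give the fine-tuning stage a much denser training signal.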