DOI: 10.1007/978-3-031-72673-6_26
Article

FineMatch: Aspect-Based Fine-Grained Image and Text Mismatch Detection and Correction

Published: 22 October 2024

Abstract

Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite their impressive ability to perform complex reasoning, current VLMs often struggle to capture compositional information effectively and precisely on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark focusing on text and image mismatch detection and correction. The benchmark introduces a novel task for boosting and evaluating VLMs’ compositionality in aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine each aspect’s class, and propose corrections for an image-text pair that may contain between 0 and 3 mismatches. To evaluate model performance on this new task, we propose a new evaluation metric, ITM-IoU, which our experiments show correlates highly with human evaluation. In addition, we provide a comprehensive experimental analysis of existing mainstream VLMs, covering both fully supervised learning and in-context learning settings. We find that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models with strong multimodal in-context learning abilities (e.g., GPT-4V, Gemini Pro Vision) are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction. Resources are available at https://hanghuacs.github.io/finematch/.
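
The task description above lends itself to a simple data model: each image-text pair is annotated with zero to three mismatches, each consisting of an aspect phrase, an aspect class, and a proposed correction. The sketch below is a minimal illustration of how such annotations could be represented and scored with an IoU-style agreement measure; it is not the paper's exact ITM-IoU formulation, and the aspect class labels and strict exact-match criterion used here are assumptions for the example only.

```python
# Minimal sketch (assumptions: exact-match agreement, placeholder aspect classes).
# The paper defines the actual ITM-IoU metric and aspect taxonomy.
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    phrase: str        # mismatched aspect phrase found in the caption
    aspect_class: str  # placeholder labels, e.g. "attribute", "number"
    correction: str    # proposed corrected phrase


def itm_iou(predicted: list[Mismatch], gold: list[Mismatch]) -> float:
    """IoU-style agreement between predicted and gold mismatch sets (simplified)."""
    if not predicted and not gold:
        return 1.0  # a pair may legitimately contain zero mismatches
    pred_set, gold_set = set(predicted), set(gold)
    return len(pred_set & gold_set) / len(pred_set | gold_set)


# Example: the model recovers one of two gold mismatches and adds no spurious ones.
gold = [
    Mismatch("red car", "attribute", "blue car"),
    Mismatch("two dogs", "number", "three dogs"),
]
pred = [Mismatch("red car", "attribute", "blue car")]
print(itm_iou(pred, gold))  # 0.5
```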

Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part IX
Sep 2024
597 pages
ISBN: 978-3-031-72672-9
DOI: 10.1007/978-3-031-72673-6
  • Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 22 October 2024

Author Tags

  1. Pre-trained Vision-Language Models
  2. Aspect-based Image and Text Analysis
  3. Compositionality

Qualifiers

  • Article
