DOI: 10.1007/978-3-031-72673-6_26
Article

FineMatch: Aspect-Based Fine-Grained Image and Text Mismatch Detection and Correction

Published: 22 October 2024

Abstract

Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite their impressive ability to perform complex reasoning, current VLMs often struggle to capture compositional information effectively and precisely on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark focusing on text and image mismatch detection and correction. The benchmark introduces a novel task for boosting and evaluating VLMs’ compositionality in aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine each aspect’s class, and propose corrections for an image-text pair that may contain between 0 and 3 mismatches. To evaluate model performance on this new task, we propose a new evaluation metric, ITM-IoU, which our experiments show correlates highly with human evaluation. In addition, we provide a comprehensive experimental analysis of existing mainstream VLMs, covering both fully supervised learning and in-context learning settings. We find that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models with strong multimodal in-context learning abilities (e.g., GPT-4V, Gemini Pro Vision) are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction. Resources are available at https://hanghuacs.github.io/finematch/.
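
The task description above lends itself to a simple data model: each image-text pair is annotated with zero to three mismatches, each consisting of an aspect phrase, an aspect class, and a proposed correction. The sketch below is a minimal illustration of how such annotations could be represented and scored with an IoU-style agreement measure; it is not the paper's exact ITM-IoU formulation, and the aspect class labels and strict exact-match criterion used here are assumptions for the example only.

```python
# Minimal sketch (assumptions: exact-match agreement, placeholder aspect classes).
# The paper defines the actual ITM-IoU metric and aspect taxonomy.
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    phrase: str        # mismatched aspect phrase found in the caption
    aspect_class: str  # placeholder labels, e.g. "attribute", "number"
    correction: str    # proposed corrected phrase


def itm_iou(predicted: list[Mismatch], gold: list[Mismatch]) -> float:
    """IoU-style agreement between predicted and gold mismatch sets (simplified)."""
    if not predicted and not gold:
        return 1.0  # a pair may legitimately contain zero mismatches
    pred_set, gold_set = set(predicted), set(gold)
    return len(pred_set & gold_set) / len(pred_set | gold_set)


# Example: the model recovers one of two gold mismatches and adds no spurious ones.
gold = [
    Mismatch("red car", "attribute", "blue car"),
    Mismatch("two dogs", "number", "three dogs"),
]
pred = [Mismatch("red car", "attribute", "blue car")]
print(itm_iou(pred, gold))  # 0.5
```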

Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part IX
Sep 2024
597 pages
ISBN: 978-3-031-72672-9
DOI: 10.1007/978-3-031-72673-6
  • Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 22 October 2024

Author Tags

  1. Pre-trained Vision-Language Models
  2. Aspect-based Image and Text Analysis
  3. Compositionality

Qualifiers

  • Article
