
Breaking Through the Noisy Correspondence: A Robust Model for Image-Text Matching

Published: 19 August 2024

Abstract

Unleashing the power of image-text matching in real-world applications is hampered by noisy correspondence. Manually curating high-quality datasets is expensive and time-consuming, and datasets generated with diffusion models are not well enough aligned. The most promising alternative is to collect image-text pairs from the Internet, but doing so inevitably introduces noisy correspondence. To reduce its negative impact, we propose a novel model that transforms the noisy-correspondence filtering problem into a similarity distribution modeling problem by exploiting the powerful capabilities of pre-trained models. Specifically, we fit a Gaussian Mixture Model to the similarity scores produced by CLIP, separating them into a clean distribution and a noisy distribution, which filters out most of the noisy correspondence in the dataset. We then fine-tune the model on the relatively clean data. To further reduce the impact of the unfiltered noisy correspondence, i.e., the small region where the two distributions overlap, we propose a distribution-sensitive dynamic margin ranking loss that further increases the distance between the two distributions during fine-tuning. Through continuous iteration, the noisy correspondence gradually decreases and model performance gradually improves. Extensive experiments demonstrate the effectiveness and robustness of our model even under high noise rates.
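The filtering step the abstract describes can be sketched as a small, self-contained example: fit a two-component one-dimensional Gaussian mixture to per-pair CLIP similarity scores with EM, then keep the pairs assigned to the higher-mean ("clean") component. This is an illustrative sketch on synthetic scores, not the authors' implementation; the function name and the score distributions are assumptions.

```python
import numpy as np

def fit_two_component_gmm(s, n_iter=100):
    """Fit a 1-D two-component Gaussian mixture to similarity scores via EM
    and return, for each score, the posterior probability of belonging to
    the higher-mean ("clean") component."""
    # Initialise the means at low/high quantiles so each component gets mass.
    mu = np.array([np.quantile(s, 0.1), np.quantile(s, 0.9)])
    var = np.array([s.var(), s.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each score.
        logp = (-0.5 * (s[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        logp -= logp.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, variances, and mixing weights.
        nk = r.sum(axis=0) + 1e-12
        mu = (r * s[:, None]).sum(axis=0) / nk
        var = (r * (s[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(s)
    clean = int(np.argmax(mu))  # clean pairs score higher on average
    return r[:, clean]

# Synthetic CLIP-style similarities (assumed values): matched pairs score
# high on average, mismatched (noisy) pairs score low.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.30, 0.03, 800),   # clean pairs
                         rng.normal(0.15, 0.03, 200)])  # noisy pairs

p_clean = fit_two_component_gmm(scores)
keep = p_clean > 0.5  # retain only pairs the mixture deems clean
print(f"clean kept: {keep[:800].mean():.2f}, noisy kept: {keep[800:].mean():.2f}")
```

The posterior `p_clean` could also drive the distribution-sensitive dynamic margin the abstract mentions, e.g. by scaling a per-pair margin in a triplet ranking loss, though the exact form of that loss is not reproduced here.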



Published In

ACM Transactions on Information Systems, Volume 42, Issue 6
November 2024
467 pages
EISSN: 1558-2868
DOI: 10.1145/3618085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 August 2024
    Online AM: 29 April 2024
    Accepted: 15 April 2024
    Revised: 20 March 2024
    Received: 25 September 2023
    Published in TOIS Volume 42, Issue 6


    Author Tags

1. Cross-modal retrieval
    2. image-text matching
    3. noisy correspondence
    4. similarity distribution modeling

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Shandong Provincial Natural Science Foundation
    • Science and Technology Innovation Program for Distinguished Young Scholars of Shandong Province Higher Education Institutions
