
Breaking Through the Noisy Correspondence: A Robust Model for Image-Text Matching

Published: 19 August 2024

Abstract

Unleashing the power of image-text matching in real-world applications is hampered by noisy correspondence. Manually curating high-quality datasets is expensive and time-consuming, and datasets generated with diffusion models are not well enough aligned. The most promising alternative is to collect image-text pairs from the Internet, but doing so inevitably introduces noisy correspondence. To reduce its negative impact, we propose a novel model that transforms the noisy-correspondence filtering problem into a similarity distribution modeling problem by exploiting the powerful capabilities of pre-trained models. Specifically, we fit a Gaussian Mixture Model to the similarity scores produced by CLIP, separating them into a clean distribution and a noisy distribution, which filters out most of the noisy correspondence in the dataset. We then fine-tune the model on the relatively clean data. To further reduce the impact of the unfiltered noisy correspondence, i.e., the small region where the two distributions overlap, we propose a distribution-sensitive dynamic margin ranking loss that further increases the distance between the two distributions during fine-tuning. Through continuous iteration, the noisy correspondence gradually decreases and model performance gradually improves. Extensive experiments demonstrate the effectiveness and robustness of our model even under high noise rates.
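The filtering step the abstract describes can be sketched as a small, self-contained example: fit a two-component one-dimensional Gaussian mixture to per-pair CLIP similarity scores with EM, then keep the pairs assigned to the higher-mean ("clean") component. This is an illustrative sketch on synthetic scores, not the authors' implementation; the function name and the score distributions are assumptions.

```python
import numpy as np

def fit_two_component_gmm(s, n_iter=100):
    """Fit a 1-D two-component Gaussian mixture to similarity scores via EM
    and return, for each score, the posterior probability of belonging to
    the higher-mean ("clean") component."""
    # Initialise the means at low/high quantiles so each component gets mass.
    mu = np.array([np.quantile(s, 0.1), np.quantile(s, 0.9)])
    var = np.array([s.var(), s.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each score.
        logp = (-0.5 * (s[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        logp -= logp.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, variances, and mixing weights.
        nk = r.sum(axis=0) + 1e-12
        mu = (r * s[:, None]).sum(axis=0) / nk
        var = (r * (s[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(s)
    clean = int(np.argmax(mu))  # clean pairs score higher on average
    return r[:, clean]

# Synthetic CLIP-style similarities (assumed values): matched pairs score
# high on average, mismatched (noisy) pairs score low.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.30, 0.03, 800),   # clean pairs
                         rng.normal(0.15, 0.03, 200)])  # noisy pairs

p_clean = fit_two_component_gmm(scores)
keep = p_clean > 0.5  # retain only pairs the mixture deems clean
print(f"clean kept: {keep[:800].mean():.2f}, noisy kept: {keep[800:].mean():.2f}")
```

The posterior `p_clean` could also drive the distribution-sensitive dynamic margin the abstract mentions, e.g. by scaling a per-pair margin in a triplet ranking loss, though the exact form of that loss is not reproduced here.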



Published In

ACM Transactions on Information Systems, Volume 42, Issue 6
November 2024
467 pages
EISSN: 1558-2868
DOI: 10.1145/3618085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 August 2024
    Online AM: 29 April 2024
    Accepted: 15 April 2024
    Revised: 20 March 2024
    Received: 25 September 2023
    Published in TOIS Volume 42, Issue 6


    Author Tags

1. Cross-modal retrieval
    2. image-text matching
    3. noisy correspondence
    4. similarity distribution modeling

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Shandong Provincial Natural Science Foundation
    • Science and Technology Innovation Program for Distinguished Young Scholars of Shandong Province Higher Education Institutions
