Incomplete Cross-Modal Retrieval with Deep Correlation Transfer

Published: 11 January 2024

Abstract

Most cross-modal retrieval methods assume that multi-modal training data are complete and in one-to-one correspondence. In the real world, however, multi-modal data often suffer from missing modality information due to uncertainty in data collection and storage, which limits the practical applicability of existing cross-modal retrieval methods. Although some solutions generate the missing modality data from a single pseudo sample, the limited semantic information a single sample provides can lead to incomplete semantic restoration and sub-optimal retrieval results. To address this challenge, this article proposes an Incomplete Cross-Modal Retrieval with Deep Correlation Transfer (ICMR-DCT) method that robustly models incomplete multi-modal data and dynamically captures adjacency semantic correlation for cross-modal retrieval. Specifically, we construct an intra-modal graph attention-based auto-encoder that learns modality-invariant representations by performing semantic reconstruction through intra-modality adjacency correlation mining. We then design dual cross-modal alignment constraints that project multi-modal representations into a common semantic space, bridging the heterogeneous modality gap and enhancing the discriminability of the common representation. We further introduce semantic preservation to enhance adjacency semantic information and achieve cross-modal semantic correlation. Moreover, we propose a nearest-neighbor weighting integration strategy with cross-modal correlation transfer that generates the missing modality data according to inter-modality mapping relations and the adjacency correlations between each sample and its neighbors, improving the robustness of our method to incomplete multi-modal training data. Extensive experiments on three widely used benchmark datasets demonstrate the superior performance of our method in cross-modal retrieval under both complete and incomplete retrieval scenarios. The datasets and source code used in this work are available at https://github.com/shidan0122/DCT.git.
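
The abstract only names the nearest-neighbor weighting integration strategy; the details live in the paper and the linked repository. As a rough illustration of the idea, the sketch below imputes a sample's missing modality as a similarity-weighted average of the counterpart features of its k nearest complete neighbors in the observed modality. This is a minimal NumPy sketch under our own assumptions, not the authors' implementation: the function name impute_missing_modality, the cosine-similarity neighbor search, and the softmax temperature tau are all hypothetical choices.

```python
import numpy as np

def impute_missing_modality(obs_feats, anchor_obs, anchor_missing, k=5, tau=1.0):
    """Impute missing-modality features for incomplete samples (illustrative sketch).

    obs_feats      : (n, d_obs)  observed-modality features of incomplete samples
    anchor_obs     : (m, d_obs)  same-modality features of complete pairs
    anchor_missing : (m, d_miss) counterpart features of those complete pairs
    Returns an (n, d_miss) array of imputed features.
    """
    # Cosine similarity between each incomplete sample and the complete anchors.
    a = obs_feats / (np.linalg.norm(obs_feats, axis=1, keepdims=True) + 1e-12)
    b = anchor_obs / (np.linalg.norm(anchor_obs, axis=1, keepdims=True) + 1e-12)
    sim = a @ b.T  # (n, m)

    imputed = np.empty((obs_feats.shape[0], anchor_missing.shape[1]))
    for i, row in enumerate(sim):
        nn = np.argsort(row)[-k:]        # k most similar complete anchors
        w = np.exp(row[nn] / tau)
        w /= w.sum()                     # similarity-based neighbor weights
        # Correlation transfer: weighted average of the neighbors'
        # counterpart (missing-modality) features.
        imputed[i] = w @ anchor_missing[nn]
    return imputed

# Toy usage: impute pseudo image features for 4 text-only samples
# from 10 complete image-text anchor pairs.
rng = np.random.default_rng(0)
texts = rng.normal(size=(4, 64))
anchor_txt = rng.normal(size=(10, 64))
anchor_img = rng.normal(size=(10, 128))
pseudo_imgs = impute_missing_modality(texts, anchor_txt, anchor_img, k=3)
print(pseudo_imgs.shape)  # (4, 128)
```

In the full method, the neighbor search and transfer would presumably operate on the learned common-space representations produced by the graph attention auto-encoders rather than on raw features, which is why a weighted combination of several neighbors can restore richer semantics than a single pseudo sample.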


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 5
May 2024, 650 pages
EISSN: 1551-6865
DOI: 10.1145/3613634
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 11 January 2024
Online AM: 13 December 2023
Accepted: 10 December 2023
Revised: 01 December 2023
Received: 20 April 2023
Published in TOMM Volume 20, Issue 5

Author Tags

1. Incomplete cross-modal retrieval
2. adjacency semantic correlation
3. robustness
4. graph attention

Qualifiers

• Research-article

Funding Sources

• National Natural Science Foundation of China
• Natural Science Foundation of Shandong Province
• Taishan Scholar Foundation of Shandong Province
• CCF-Baidu Open Fund
