DOI: 10.1145/3474085.3475670
Research article

MCCN: Multimodal Coordinated Clustering Network for Large-Scale Cross-modal Retrieval

Published: 17 October 2021

Abstract

Cross-modal retrieval is an important multimedia research area that takes one type of data as the query to retrieve relevant data of another type. Most existing methods follow the paradigm of pair-wise and class-level learning to generate a common embedding space in which the similarity of heterogeneous multimodal samples can be calculated. However, whereas large-scale cross-modal retrieval applications often need to handle multiple modalities, previous studies have mainly focused on two (e.g., text-image or text-video). In addition, for large-scale cross-modal retrieval with modality diversity, another important problem is that the available training data are considerably modality-imbalanced. In this paper, we focus on the challenging problem of modality-imbalanced cross-modal retrieval and propose a Multimodal Coordinated Clustering Network (MCCN) consisting of two modules: a Multimodal Coordinated Embedding (MCE) module to alleviate the imbalanced training data and a Multimodal Contrastive Clustering (MCC) module to tackle the imbalanced optimization. The MCE module develops a data-driven approach that coordinates multiple modalities via a multimodal semantic graph to generate modality-balanced training samples. The MCC module learns class prototypes as anchors that preserve pair-wise and class-level similarities across modalities for intra-class compactness and inter-class separation, and further introduces intra-class and inter-class margins to enhance optimization flexibility. Experiments on benchmark multimodal datasets verify the effectiveness of the proposed method.
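To make the prototype-anchored idea in the abstract concrete, the sketch below shows a toy margin-based prototype loss: each sample is pulled toward its class prototype up to an intra-class margin and pushed away from other prototypes by an inter-class margin. This is a minimal illustration under stated assumptions, not the paper's exact formulation; the function name, the Euclidean distance, and the margin values are all illustrative choices.

```python
import numpy as np

def prototype_contrastive_loss(embeddings, labels, prototypes,
                               intra_margin=0.2, inter_margin=0.5):
    """Toy prototype-anchored loss (illustrative, not MCCN's exact objective).

    Pulls each embedding to within `intra_margin` of its class prototype
    (intra-class compactness) and pushes it at least `inter_margin` away
    from every other prototype (inter-class separation).
    """
    loss = 0.0
    for x, y in zip(embeddings, labels):
        # Euclidean distance from the sample to every class prototype.
        d = np.linalg.norm(prototypes - x, axis=1)
        # Intra-class term: only distances beyond the margin are penalized.
        loss += max(d[y] - intra_margin, 0.0)
        # Inter-class terms: only prototypes closer than the margin are penalized.
        for k in range(len(prototypes)):
            if k != y:
                loss += max(inter_margin - d[k], 0.0)
    return loss / len(embeddings)
```

With samples sitting exactly on their prototypes and well-separated classes, both hinge terms vanish and the loss is zero; a sample drifting toward another class's prototype incurs a positive penalty. The margins give the optimizer slack, so already-compact, already-separated samples contribute no gradient.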

Supplementary Material

MP4 File (MM21-fp2734.mp4)
Video presentation.




    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. cross-modal retrieval
    2. multimodal contrastive clustering
    3. multimodal coordinated embedding
    4. prototype learning

    Qualifiers

    • Research-article

    Funding Sources

    • NSFC
    • Ministry of Science & Technology of China

    Conference

    MM '21
    Sponsor:
    MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

    • Multimodal Coordinated Representation Learning Based on Evidence Theory. 2024 27th International Conference on Information Fusion (FUSION), 1-6 (8 Jul 2024). DOI: 10.23919/FUSION59988.2024.10706295
    • ERL-MR: Harnessing the Power of Euler Feature Representations for Balanced Multi-modal Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 4591-4600 (28 Oct 2024). DOI: 10.1145/3664647.3681215
    • Incomplete Cross-Modal Retrieval with Deep Correlation Transfer. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(5), 1-21 (11 Jan 2024). DOI: 10.1145/3637442
    • Contrastive Incomplete Cross-Modal Hashing. IEEE Transactions on Knowledge and Data Engineering, 36(11), 5823-5834 (Nov 2024). DOI: 10.1109/TKDE.2024.3410388
    • Pretrained models for cross-modal retrieval: experiments and improvements. Signal, Image and Video Processing, 18(5), 4915-4923 (6 Apr 2024). DOI: 10.1007/s11760-024-03126-z
    • Semantic deep learning and adaptive clustering for handling multimodal multimedia information retrieval. Multimedia Tools and Applications (25 May 2024). DOI: 10.1007/s11042-024-19312-7
    • Adaptive Marginalized Semantic Hashing for Unpaired Cross-Modal Retrieval. IEEE Transactions on Multimedia, 25, 9082-9095 (15 Feb 2023). DOI: 10.1109/TMM.2023.3245400
    • Graph Embedding Contrastive Multi-Modal Representation Learning for Clustering. IEEE Transactions on Image Processing, 32, 1170-1183 (2023). DOI: 10.1109/TIP.2023.3240863
    • Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval. 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), 913-917 (6 Nov 2023). DOI: 10.1109/ICTAI59109.2023.00137