DOI: 10.1145/3474085.3475670
Research article

MCCN: Multimodal Coordinated Clustering Network for Large-Scale Cross-modal Retrieval

Published: 17 October 2021

Abstract

Cross-modal retrieval is an important multimedia research area that takes one type of data as the query to retrieve relevant data of another type. Most existing methods follow the paradigm of pair-wise and class-level learning to generate a common embedding space in which the similarity of heterogeneous multimodal samples can be calculated. However, whereas large-scale cross-modal retrieval applications often need to handle multiple modalities, previous studies have mainly focused on two (e.g., text-image or text-video). In addition, for large-scale cross-modal retrieval with modality diversity, another important problem is that the available training data are considerably modality-imbalanced. In this paper, we focus on the challenging problem of modality-imbalanced cross-modal retrieval and propose a Multimodal Coordinated Clustering Network (MCCN) consisting of two modules: a Multimodal Coordinated Embedding (MCE) module to alleviate the imbalanced training data and a Multimodal Contrastive Clustering (MCC) module to tackle the imbalanced optimization. The MCE module develops a data-driven approach that coordinates multiple modalities via a multimodal semantic graph to generate modality-balanced training samples. The MCC module learns class prototypes as anchors that preserve pair-wise and class-level similarities across modalities for intra-class compactness and inter-class separation, and further introduces intra-class and inter-class margins to enhance optimization flexibility. Experiments on benchmark multimodal datasets verify the effectiveness of the proposed method.
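To make the prototype-anchored idea in the abstract concrete, the sketch below shows a toy margin-based prototype loss: each sample is pulled toward its class prototype up to an intra-class margin and pushed away from other prototypes by an inter-class margin. This is a minimal illustration under stated assumptions, not the paper's exact formulation; the function name, the Euclidean distance, and the margin values are all illustrative choices.

```python
import numpy as np

def prototype_contrastive_loss(embeddings, labels, prototypes,
                               intra_margin=0.2, inter_margin=0.5):
    """Toy prototype-anchored loss (illustrative, not MCCN's exact objective).

    Pulls each embedding to within `intra_margin` of its class prototype
    (intra-class compactness) and pushes it at least `inter_margin` away
    from every other prototype (inter-class separation).
    """
    loss = 0.0
    for x, y in zip(embeddings, labels):
        # Euclidean distance from the sample to every class prototype.
        d = np.linalg.norm(prototypes - x, axis=1)
        # Intra-class term: only distances beyond the margin are penalized.
        loss += max(d[y] - intra_margin, 0.0)
        # Inter-class terms: only prototypes closer than the margin are penalized.
        for k in range(len(prototypes)):
            if k != y:
                loss += max(inter_margin - d[k], 0.0)
    return loss / len(embeddings)
```

With samples sitting exactly on their prototypes and well-separated classes, both hinge terms vanish and the loss is zero; a sample drifting toward another class's prototype incurs a positive penalty. The margins give the optimizer slack, so already-compact, already-separated samples contribute no gradient.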

Supplementary Material

MP4 File (MM21-fp2734.mp4)
Video presentation.




    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. cross-modal retrieval
    2. multimodal contrastive clustering
    3. multimodal coordinated embedding
    4. prototype learning

    Qualifiers

    • Research-article

    Funding Sources

    • NSFC
    • Ministry of Science & Technology of China

    Conference

    MM '21
    Sponsor:
    MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

    • Multimodal Coordinated Representation Learning Based on Evidence Theory. 2024 27th International Conference on Information Fusion (FUSION), 1-6 (8 Jul 2024). DOI: 10.23919/FUSION59988.2024.10706295
    • ERL-MR: Harnessing the Power of Euler Feature Representations for Balanced Multi-modal Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 4591-4600 (28 Oct 2024). DOI: 10.1145/3664647.3681215
    • Incomplete Cross-Modal Retrieval with Deep Correlation Transfer. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(5), 1-21 (11 Jan 2024). DOI: 10.1145/3637442
    • Contrastive Incomplete Cross-Modal Hashing. IEEE Transactions on Knowledge and Data Engineering, 36(11), 5823-5834 (Nov 2024). DOI: 10.1109/TKDE.2024.3410388
    • Pretrained models for cross-modal retrieval: experiments and improvements. Signal, Image and Video Processing, 18(5), 4915-4923 (6 Apr 2024). DOI: 10.1007/s11760-024-03126-z
    • Semantic deep learning and adaptive clustering for handling multimodal multimedia information retrieval. Multimedia Tools and Applications (25 May 2024). DOI: 10.1007/s11042-024-19312-7
    • Adaptive Marginalized Semantic Hashing for Unpaired Cross-Modal Retrieval. IEEE Transactions on Multimedia, 25, 9082-9095 (15 Feb 2023). DOI: 10.1109/TMM.2023.3245400
    • Graph Embedding Contrastive Multi-Modal Representation Learning for Clustering. IEEE Transactions on Image Processing, 32, 1170-1183 (2023). DOI: 10.1109/TIP.2023.3240863
    • Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval. 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), 913-917 (6 Nov 2023). DOI: 10.1109/ICTAI59109.2023.00137