DOI: 10.1145/3474085.3475286
Research article

Joint-teaching: Learning to Refine Knowledge for Resource-constrained Unsupervised Cross-modal Retrieval

Published: 17 October 2021

Abstract

Cross-modal retrieval has received considerable attention because it enables users to search for desired information across diverse modalities. Existing retrieval methods achieve good performance mainly by relying on complex deep neural networks and high-quality supervision signals, which hinders their development and deployment in real-world, resource-constrained settings. In this paper, we propose an effective unsupervised learning framework named JOint-teachinG (JOG) to pursue a high-performance yet lightweight cross-modal retrieval model. The key idea is to utilize the knowledge of a pre-trained model (a.k.a. the "teacher") to endow the to-be-learned model (a.k.a. the "student") with strong feature learning ability and predictive power. Considering that a teacher model serving the same task as the student is not always available, we resort to a cross-task teacher and leverage its transferable knowledge to guide student learning. To eliminate the noise that inevitably arises in the distilled knowledge because of the task discrepancy, an online knowledge-refinement strategy is designed to progressively improve the quality of the cross-task knowledge in a joint-teaching manner, in which a peer student is engaged. In addition, the proposed JOG learns to represent the original high-dimensional data with compact binary codes to accelerate query processing, further facilitating resource-limited retrieval. Extensive experiments demonstrate that the proposed method yields promising results with various network structures on widely used benchmarks. This work is a pioneering effort toward resource-constrained cross-modal retrieval, with strong potential for on-device deployment, and we hope it paves the way for further study.
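
The abstract describes three interacting components: distillation from a pre-trained cross-task teacher, online refinement of the teacher's noisy knowledge with the help of a peer student, and compact binary hash codes for fast retrieval. The PyTorch sketch below is only a minimal illustration of how one such joint-teaching training step could be wired together; every name here (LightweightHashEncoder, joint_teaching_step, the blending weight alpha, the particular loss terms) is a hypothetical choice made for exposition and is not taken from the paper.

```python
# Illustrative sketch, not the paper's implementation: module names, the
# similarity-blending rule, and the loss weights are assumptions.
import torch
import torch.nn.functional as F
from torch import nn


class LightweightHashEncoder(nn.Module):
    """Small encoder mapping pre-extracted features to K relaxed binary bits."""

    def __init__(self, in_dim: int, code_len: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, code_len))

    def forward(self, x):
        # tanh keeps codes in (-1, 1) during training; sign() binarizes at inference.
        return torch.tanh(self.net(x))


def pairwise_cosine(a, b):
    """Cosine-similarity matrix between two batches of codes."""
    return F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()


def joint_teaching_step(img_feat, txt_feat, teacher_sim, student_a, student_b,
                        alpha=0.5):
    """One hypothetical joint-teaching step: the cross-task teacher's noisy
    similarity matrix is blended with the peer student's current estimate
    before it supervises the other student's hash codes."""
    # Peer student B only provides a refinement signal (no gradients through B).
    with torch.no_grad():
        code_img_b = student_b["img"](img_feat)
        code_txt_b = student_b["txt"](txt_feat)
        peer_sim = pairwise_cosine(code_img_b, code_txt_b)
    refined_sim = alpha * teacher_sim + (1.0 - alpha) * peer_sim

    # Student A is trained so its cross-modal code similarity matches the refined target.
    code_img_a = student_a["img"](img_feat)
    code_txt_a = student_a["txt"](txt_feat)
    distill_loss = F.mse_loss(pairwise_cosine(code_img_a, code_txt_a), refined_sim)

    # Quantization penalty pushes the relaxed codes toward {-1, +1}.
    quant_loss = sum(((c.abs() - 1.0) ** 2).mean() for c in (code_img_a, code_txt_a))
    return distill_loss + 0.1 * quant_loss
```

In a full training loop the two students would periodically swap roles, each refining the teacher signal for the other, and alpha could be decayed as the students become more reliable than the noisy cross-task teacher; those scheduling details are omitted from this sketch.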

Supplementary Material

MP4 File (MM21-fp0653.mp4)
Presentation video.




    Information & Contributors

    Information

    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 October 2021


    Author Tags

    1. cross-modal retrieval
    2. knowledge distillation
    3. noise refinery
    4. unsupervised learning

    Qualifiers

    • Research-article

    Conference

MM '21: ACM Multimedia Conference
October 20–24, 2021
Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 37
    • Downloads (last 6 weeks): 5
    Reflects downloads up to 24 Dec 2024

    Citations

Cited By

    • (2024) Privacy-Enhanced Prototype-Based Federated Cross-Modal Hashing for Cross-Modal Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20(9), 1–19. https://doi.org/10.1145/3674507. Online publication date: 25-Jun-2024.
    • (2024) Unsupervised Dual Hashing Coding (UDC) on Semantic Tagging and Sample Content for Cross-Modal Retrieval. IEEE Transactions on Multimedia 26, 9109–9120. https://doi.org/10.1109/TMM.2024.3385986. Online publication date: 2024.
    • (2024) Structures Aware Fine-Grained Contrastive Adversarial Hashing for Cross-Media Retrieval. IEEE Transactions on Knowledge and Data Engineering 36(7), 3514–3528. https://doi.org/10.1109/TKDE.2024.3356258. Online publication date: Jul-2024.
    • (2024) Unsupervised NIR-VIS Face Recognition via Homogeneous-to-Heterogeneous Learning and Residual-Invariant Enhancement. IEEE Transactions on Information Forensics and Security 19, 2112–2126. https://doi.org/10.1109/TIFS.2023.3346176. Online publication date: 2024.
    • (2024) Multi-Layer Probabilistic Association Reasoning Network for Image-Text Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34(10), 9706–9717. https://doi.org/10.1109/TCSVT.2024.3394551. Online publication date: Oct-2024.
    • (2024) Multiple Information Embedded Hashing for Large-Scale Cross-Modal Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34(6), 5118–5131. https://doi.org/10.1109/TCSVT.2023.3340102. Online publication date: Jun-2024.
    • (2024) Joint Semantic Preserving Sparse Hashing for Cross-Modal Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34(4), 2989–3002. https://doi.org/10.1109/TCSVT.2023.3307608. Online publication date: Apr-2024.
    • (2024) A Web Knowledge-Driven Multimodal Retrieval Method in Computational Social Systems: Unsupervised and Robust Graph Convolutional Hashing. IEEE Transactions on Computational Social Systems 11(3), 3146–3156. https://doi.org/10.1109/TCSS.2022.3216621. Online publication date: Jun-2024.
    • (2024) Weighted cross-modal hashing with label enhancement. Knowledge-Based Systems 293(C). https://doi.org/10.1016/j.knosys.2024.111657. Online publication date: 7-Jun-2024.
    • (2024) De-biased knowledge distillation framework based on knowledge infusion and label de-biasing techniques. Journal of Electronic Science and Technology, article 100278. https://doi.org/10.1016/j.jnlest.2024.100278. Online publication date: Aug-2024.
