Abstract
As one of the most active research topics in multimedia information retrieval, cross-modal hashing has drawn widespread attention over the past decades. A key challenge in this task is to minimize the semantic gap between heterogeneous data and to accurately measure the similarity of cross-modal data. A common paradigm for tackling this problem is to map the features of multi-modal data into a common space. However, such approaches lack inter-modal information interaction and may not achieve satisfactory results. To overcome this limitation, we propose a novel text-assisted attention-based cross-modal hashing (TAACH) method. First, TAACH relies on LabelNet supervision to guide the learning of hash functions for each modality. In addition, a novel text-assisted attention mechanism densely integrates text features into image features, capturing their spatial correlation and enhancing the consistency between image and text knowledge. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed TAACH, which achieves competitive performance compared with state-of-the-art methods. The source code is available at https://github.com/SWU-CS-MediaLab/TAACH.
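To make the text-assisted attention idea concrete, the sketch below is a minimal PyTorch illustration, not the authors' released implementation (see the GitHub link above for that); the module name TextAssistedAttention and all dimensions are our own assumptions. It shows one way a text embedding can attend over the spatial positions of an image feature map:

import torch
import torch.nn as nn

class TextAssistedAttention(nn.Module):
    # Hypothetical sketch: the text embedding produces attention weights
    # over the spatial positions of the image feature map, so text
    # information is injected into the image representation before hashing.
    def __init__(self, img_dim=512, txt_dim=512, emb_dim=256):
        super().__init__()
        self.img_proj = nn.Conv2d(img_dim, emb_dim, kernel_size=1)  # image map -> shared space
        self.txt_proj = nn.Linear(txt_dim, emb_dim)                 # text vector -> shared space

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, C, H, W) convolutional feature map
        # txt_feat: (B, D) global text embedding
        q = self.txt_proj(txt_feat).unsqueeze(1)    # (B, 1, E) text query
        k = self.img_proj(img_feat).flatten(2)      # (B, E, H*W) spatial keys
        attn = torch.softmax(torch.bmm(q, k) / k.size(1) ** 0.5, dim=-1)  # (B, 1, H*W)
        v = img_feat.flatten(2)                     # (B, C, H*W) spatial values
        return torch.bmm(v, attn.transpose(1, 2)).squeeze(-1)  # (B, C) text-attended image feature

# Usage (shapes only): a 7x7 conv feature map attended by a 512-d text vector
# out = TextAssistedAttention()(torch.randn(4, 512, 7, 7), torch.randn(4, 512))

Conditioning the spatial attention weights on the text embedding gives the two modalities a direct point of interaction before hashing, in contrast to mapping each modality into the common space independently.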
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities, China (SWU-KT22032).
Author information
Contributions
Xiang Yuan contributed to the conception of the research, software, investigation, implementation of the experiments, and writing of the manuscript. Shihao Shan contributed to the methodology and software. Yuwen Huo contributed to the revision and editing of the manuscript. Junkai Jiang contributed to the methodology and software. Song Wu contributed to the research conception, methodology, and software, and wrote, reviewed, and edited the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests pertinent to the subject matter of this study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yuan, X., Shan, S., Huo, Y. et al. Text-assisted attention-based cross-modal hashing. Int J Multimed Info Retr 13, 3 (2024). https://doi.org/10.1007/s13735-023-00311-7