Abstract
Multimedia devices and technologies have developed rapidly in recent years, and retrieving interesting, highly relevant information from massive multimedia data has become an urgent and challenging problem. To obtain more accurate retrieval results, researchers naturally turn to finer-grained features for evaluating the similarity among multimedia samples. In this paper, we propose a Deep Attentional Fine-grained Similarity Network (DAFSN) for cross-modal retrieval, which is optimized in an adversarial learning manner. The DAFSN model consists of two subnetworks: an attentional fine-grained similarity network for aligned representation learning and a modal discriminative network. The former adopts a bi-directional Long Short-Term Memory (Bi-LSTM) network and a pre-trained Inception-v3 model to extract text and image features, respectively. In aligned representation learning, we consider not only the sentence-level pair-matching constraint but also the fine-grained similarity between the word-level features of a text description and the sub-regional features of an image. The modal discriminative network aims to minimize the "heterogeneity gap" between text and image features in an adversarial manner. Experiments on several widely used datasets verify the performance of the proposed DAFSN; the results show that DAFSN achieves better retrieval performance under the MAP metric. Result analyses and visual comparisons are also presented in the experimental section.
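The core mechanism the abstract describes, scoring an image-text pair by attending each word-level text feature to the sub-regional image features, can be illustrated with a minimal PyTorch sketch. This is our own reconstruction for illustration, not the paper's implementation: the function name, the smoothing temperature `gamma`, and the mean pooling over per-word scores are all assumptions.

```python
import torch
import torch.nn.functional as F

def fine_grained_similarity(words, regions, gamma=5.0):
    """Word-region attention similarity (illustrative sketch; `gamma` is an
    assumed smoothing temperature, not a value taken from the paper).

    words:   (T, d) word-level text features, e.g. Bi-LSTM hidden states
    regions: (R, d) sub-regional image features, e.g. an Inception-v3 feature map
    """
    words = F.normalize(words, dim=-1)
    regions = F.normalize(regions, dim=-1)
    # Cosine similarity between every word and every image region: (T, R)
    sim = words @ regions.t()
    # Attend each word to the regions it matches best
    attn = F.softmax(gamma * sim, dim=-1)            # (T, R)
    attended = attn @ regions                        # (T, d) region context per word
    # Per-word relevance, pooled into one sentence-image score
    word_scores = F.cosine_similarity(words, attended, dim=-1)  # (T,)
    return word_scores.mean()

# Example: 12 words, 64 image regions, 256-dimensional features
score = fine_grained_similarity(torch.randn(12, 256), torch.randn(64, 256))
```

A score of this form can complement a sentence-level pair-matching loss, since it rewards pairs whose individual words find well-matched image regions rather than only globally similar embeddings.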
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grants 61771145 and 61371148.
Cite this article
Cheng, Q., Gu, X. Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval. Multimed Tools Appl 79, 31401–31428 (2020). https://doi.org/10.1007/s11042-020-09450-z