Abstract
Multimedia devices and technologies have developed rapidly in recent years, and retrieving interesting, highly relevant information from massive multimedia data has become an urgent and challenging problem. To obtain more accurate retrieval results, researchers naturally turn to finer-grained features for evaluating the similarity among multimedia samples. In this paper, we propose a Deep Attentional Fine-grained Similarity Network (DAFSN) for cross-modal retrieval, which is optimized in an adversarial learning manner. The DAFSN model consists of two subnetworks: an attentional fine-grained similarity network for aligned representation learning and a modal discriminative network. The former adopts a bi-directional Long Short-Term Memory (Bi-LSTM) network and a pre-trained Inception-v3 model to extract text and image features, respectively. In aligned representation learning, we consider not only the sentence-level pair-matching constraint but also the fine-grained similarity between the word-level features of a text description and the sub-regional features of an image. The modal discriminative network aims to minimize the "heterogeneity gap" between text and image features in an adversarial manner. Experiments on several widely used datasets verify the performance of the proposed DAFSN; the results show that DAFSN achieves better retrieval performance under the MAP metric. Result analyses and visual comparisons are also presented in the experimental section.
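The core mechanism the abstract describes, scoring an image-text pair by attending each word-level text feature to the sub-regional image features, can be illustrated with a minimal PyTorch sketch. This is our own reconstruction for illustration, not the paper's implementation: the function name, the smoothing temperature `gamma`, and the mean pooling over per-word scores are all assumptions.

```python
import torch
import torch.nn.functional as F

def fine_grained_similarity(words, regions, gamma=5.0):
    """Word-region attention similarity (illustrative sketch; `gamma` is an
    assumed smoothing temperature, not a value taken from the paper).

    words:   (T, d) word-level text features, e.g. Bi-LSTM hidden states
    regions: (R, d) sub-regional image features, e.g. an Inception-v3 feature map
    """
    words = F.normalize(words, dim=-1)
    regions = F.normalize(regions, dim=-1)
    # Cosine similarity between every word and every image region: (T, R)
    sim = words @ regions.t()
    # Attend each word to the regions it matches best
    attn = F.softmax(gamma * sim, dim=-1)            # (T, R)
    attended = attn @ regions                        # (T, d) region context per word
    # Per-word relevance, pooled into one sentence-image score
    word_scores = F.cosine_similarity(words, attended, dim=-1)  # (T,)
    return word_scores.mean()

# Example: 12 words, 64 image regions, 256-dimensional features
score = fine_grained_similarity(torch.randn(12, 256), torch.randn(64, 256))
```

A score of this form can complement a sentence-level pair-matching loss, since it rewards pairs whose individual words find well-matched image regions rather than only globally similar embeddings.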
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grants 61771145 and 61371148.
Cite this article
Cheng, Q., Gu, X. Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval. Multimed Tools Appl 79, 31401–31428 (2020). https://doi.org/10.1007/s11042-020-09450-z