
Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval

Published in Multimedia Tools and Applications

Abstract

Recent years have witnessed the rapid development of multimedia devices and multimedia technologies. Retrieving interesting and highly relevant information from massive multimedia data has therefore become an urgent and challenging problem. To obtain more accurate retrieval results, researchers naturally turn to finer-grained features for evaluating the similarity among multimedia samples. In this paper, we propose a Deep Attentional Fine-grained Similarity Network (DAFSN) for cross-modal retrieval, which is optimized in an adversarial learning manner. The DAFSN model consists of two subnetworks: an attentional fine-grained similarity network for aligned representation learning and a modal discriminative network. The former adopts a bi-directional Long Short-Term Memory (Bi-LSTM) network and a pre-trained Inception-v3 model to extract text features and image features. In aligned representation learning, we consider not only the sentence-level pair-matching constraint but also the fine-grained similarity between word-level features of the text description and sub-regional features of the image. The modal discriminative network aims to minimize the "heterogeneity gap" between text features and image features in an adversarial manner. We conduct experiments on several widely used datasets to verify the performance of the proposed DAFSN. The experimental results show that DAFSN achieves better retrieval results under the MAP metric. Result analyses and visual comparisons are also presented in the experimental section.
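To make the two ideas in the abstract concrete, the sketch below gives a minimal PyTorch-style rendering of (a) a word-to-region attention similarity between text and image features and (b) a modal discriminator trained adversarially against the representation network. This is an illustrative sketch only: the function and module names, dimensions, and losses are hypothetical and are not taken from the paper's actual DAFSN architecture.

```python
# Minimal sketch (assumed names/shapes), not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def attentional_fine_grained_similarity(words, regions):
    """Score one image-text pair from word-level and region-level features.

    words:   (T, d) word features, e.g. from a Bi-LSTM over the sentence
    regions: (R, d) sub-region features, e.g. from a pre-trained CNN backbone
    """
    words = F.normalize(words, dim=-1)
    regions = F.normalize(regions, dim=-1)
    # Word-to-region attention: each word attends over all image regions.
    attn = F.softmax(words @ regions.t(), dim=-1)   # (T, R)
    attended = attn @ regions                       # (T, d) region context per word
    # Fine-grained similarity: mean cosine similarity between each word
    # and its attended region context.
    return F.cosine_similarity(words, attended, dim=-1).mean()

class ModalDiscriminator(nn.Module):
    """Predicts whether an aligned embedding came from the image or the text branch.

    Training it adversarially against the representation network pushes the two
    modalities toward a common distribution, shrinking the "heterogeneity gap".
    """
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats):
        return self.net(feats)  # logits: >0 read as "image", <0 as "text"

# Usage with random stand-in features (T=12 words, R=64 regions, d=512 dims):
if __name__ == "__main__":
    words, regions = torch.randn(12, 512), torch.randn(64, 512)
    print("pair similarity:", attentional_fine_grained_similarity(words, regions).item())
    disc = ModalDiscriminator(512)
    logits = disc(torch.randn(8, 512))
    # The discriminator minimizes this BCE loss on true modality labels, while
    # the representation network is updated to maximize it (adversarial objective).
    labels = torch.randint(0, 2, (8, 1)).float()
    print("discriminator loss:", F.binary_cross_entropy_with_logits(logits, labels).item())
```

In such a setup the pair similarity would typically feed a sentence-level matching loss, while the discriminator loss supplies the adversarial signal that aligns the image and text feature distributions.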



Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants 61771145 and 61371148.

Author information


Corresponding author

Correspondence to Xiaodong Gu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Cheng, Q., Gu, X. Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval. Multimed Tools Appl 79, 31401–31428 (2020). https://doi.org/10.1007/s11042-020-09450-z
