
Stacked cross-modal feature consolidation attention networks for image captioning

Published in Multimedia Tools and Applications

Abstract

The attention-enriched encoder-decoder framework has recently attracted great interest in image captioning due to its strong progress. Many visual attention models directly leverage meaningful regions to generate image descriptions. However, a direct transition from visual space to text is not enough to generate fine-grained captions. This paper exploits a feature-compounding approach that brings together high-level semantic concepts and visual information about the image context in a fully end-to-end manner. We therefore propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning, in which cross-modal features are consolidated simultaneously through a novel compounding function in a multi-step reasoning fashion. In addition, we jointly employ spatial information and context-aware attributes (CAA) as the principal components of the proposed compounding function, where the CAA provides a concise, context-sensitive semantic representation. To better exploit the potential of the consolidated features, we propose an SCFC-LSTM as the caption generator, which leverages discriminative semantic information throughout the caption generation process. Experimental results indicate that the proposed SCFC outperforms various state-of-the-art image captioning methods in terms of popular metrics on the MSCOCO and Flickr30K datasets.
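
As a concrete illustration of the stacked consolidation idea, the sketch below shows one plausible PyTorch formulation: region features and attribute embeddings are each attended with the decoder state as a query, the two attended contexts are compounded by a learned gate, and the step is repeated (stacked) so each pass refines the query for the next. All module names, dimensions, and the specific gating function are illustrative assumptions, not the paper's exact equations.

```python
# Hypothetical sketch of stacked cross-modal consolidation; names, dimensions,
# and the gating-based compounding function are assumptions, not the authors'
# published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalConsolidation(nn.Module):
    """One consolidation step: attend regions and attributes with the decoder
    state as the query, then compound the two attended contexts."""

    def __init__(self, d_region, d_attr, d_hidden, d_joint=512):
        super().__init__()
        # Project both modalities and the decoder state into a shared space.
        self.proj_region = nn.Linear(d_region, d_joint)
        self.proj_attr = nn.Linear(d_attr, d_joint)
        self.proj_hidden = nn.Linear(d_hidden, d_joint)
        # Scalar attention scores per region / per attribute.
        self.score_region = nn.Linear(d_joint, 1)
        self.score_attr = nn.Linear(d_joint, 1)
        # Compounding: gate how much each modality contributes.
        self.gate = nn.Linear(2 * d_joint, d_joint)

    def forward(self, regions, attrs, h):
        # regions: (B, R, d_region), attrs: (B, A, d_attr), h: (B, d_hidden)
        hq = self.proj_hidden(h).unsqueeze(1)                       # (B, 1, d_joint)

        r = self.proj_region(regions)                               # (B, R, d_joint)
        alpha = F.softmax(self.score_region(torch.tanh(r + hq)), dim=1)
        v_ctx = (alpha * r).sum(dim=1)                              # attended visual context

        a = self.proj_attr(attrs)                                   # (B, A, d_joint)
        beta = F.softmax(self.score_attr(torch.tanh(a + hq)), dim=1)
        a_ctx = (beta * a).sum(dim=1)                               # attended attribute context

        # Compound the two contexts with a learned gate (one plausible choice).
        g = torch.sigmoid(self.gate(torch.cat([v_ctx, a_ctx], dim=-1)))
        return g * v_ctx + (1.0 - g) * a_ctx                        # consolidated feature


class StackedSCFC(nn.Module):
    """Stacks several consolidation steps; each step refines the query state."""

    def __init__(self, d_region, d_attr, d_hidden, d_joint=512, n_steps=3):
        super().__init__()
        self.steps = nn.ModuleList(
            CrossModalConsolidation(d_region, d_attr, d_hidden, d_joint)
            for _ in range(n_steps)
        )
        self.update = nn.Linear(d_joint, d_hidden)

    def forward(self, regions, attrs, h):
        for step in self.steps:
            c = step(regions, attrs, h)
            h = torch.tanh(self.update(c)) + h    # refine the query for the next step
        return c, h                               # final consolidated feature + refined state


if __name__ == "__main__":
    B, R, A = 2, 36, 10                           # batch, regions, attributes (assumed sizes)
    regions = torch.randn(B, R, 2048)             # e.g. detector region features
    attrs = torch.randn(B, A, 300)                # e.g. attribute word embeddings
    h = torch.randn(B, 512)                       # decoder LSTM hidden state
    scfc = StackedSCFC(d_region=2048, d_attr=300, d_hidden=512)
    c, h_new = scfc(regions, attrs, h)
    print(c.shape, h_new.shape)                   # (2, 512) (2, 512)
```

In such a design, the consolidated feature c would be fed to the caption generator at each decoding step; how the paper's SCFC-LSTM actually consumes it is described in the full text.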


Data availability statement

All datasets used in this study are well-known benchmarks freely available on the Internet.


Author information

Corresponding author

Correspondence to Mohsen Ebrahimi Moghaddam.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pourkeshavarz, M., Nabavi, S., Moghaddam, M.E. et al. Stacked cross-modal feature consolidation attention networks for image captioning. Multimed Tools Appl 83, 12209–12233 (2024). https://doi.org/10.1007/s11042-023-15869-x
