Abstract
The attention-enriched encoder-decoder framework has recently attracted considerable interest in image captioning owing to its strong performance. Many visual attention models directly leverage salient image regions to generate descriptions. However, a direct mapping from visual space to text is insufficient for producing fine-grained captions. This paper exploits a feature-compounding approach that brings together high-level semantic concepts and visual information about the contextual environment in a fully end-to-end manner. To this end, we propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning, which consolidates cross-modal features through a novel compounding function in a multi-step reasoning fashion. In addition, we jointly employ spatial information and context-aware attributes (CAA) as the principal components of the compounding function, where the CAA provides a concise, context-sensitive semantic representation. To better exploit the potential of the consolidated features, we further propose an SCFC-LSTM caption generator that can leverage discriminative semantic information throughout the caption generation process. Experimental results show that the proposed SCFC outperforms various state-of-the-art image captioning models on popular metrics over the MSCOCO and Flickr30K datasets.
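To make the overall idea concrete, the sketch below shows how a stacked cross-modal consolidation module of this kind could be structured in PyTorch. It is a minimal illustration under our own assumptions, not the paper's actual formulation: the gated fusion used as the compounding function, the residual multi-step update, and all names (SCFCBlock, StackedSCFC, dim, steps) are hypothetical stand-ins for the components the abstract names (region features, CAA features, and a state refined over several reasoning steps).

```python
import torch
import torch.nn as nn


class SCFCBlock(nn.Module):
    """One consolidation step: attends over spatial region features and
    context-aware attribute (CAA) features, then compounds the two.
    The gated fusion here is an illustrative stand-in for the paper's
    compounding function."""

    def __init__(self, dim):
        super().__init__()
        self.visual_att = nn.Linear(dim * 2, 1)  # scores over regions
        self.attr_att = nn.Linear(dim * 2, 1)    # scores over attributes
        self.gate = nn.Linear(dim * 2, dim)      # balances the two modalities

    def forward(self, regions, attrs, state):
        # regions: (B, R, D) spatial features; attrs: (B, A, D) CAA embeddings
        # state:   (B, D) current consolidated representation
        def attend(feats, scorer):
            s = state.unsqueeze(1).expand_as(feats)
            w = torch.softmax(scorer(torch.cat([feats, s], dim=-1)), dim=1)
            return (w * feats).sum(dim=1)  # (B, D) attended summary

        v = attend(regions, self.visual_att)
        a = attend(attrs, self.attr_att)
        g = torch.sigmoid(self.gate(torch.cat([v, a], dim=-1)))
        return g * v + (1 - g) * a  # compounded cross-modal feature


class StackedSCFC(nn.Module):
    """Stacks several consolidation blocks for multi-step reasoning."""

    def __init__(self, dim, steps=2):
        super().__init__()
        self.blocks = nn.ModuleList([SCFCBlock(dim) for _ in range(steps)])

    def forward(self, regions, attrs, state):
        for block in self.blocks:
            state = state + block(regions, attrs, state)  # stepwise refinement
        return state
```

In the full model, the consolidated state would condition an SCFC-LSTM decoder at each word-generation step; here the residual update simply stands in for the multi-step reasoning described above.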
Data availability statement
All datasets used in this study are well-known benchmarks freely available on the Internet.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Pourkeshavarz, M., Nabavi, S., Moghaddam, M.E. et al. Stacked cross-modal feature consolidation attention networks for image captioning. Multimed Tools Appl 83, 12209–12233 (2024). https://doi.org/10.1007/s11042-023-15869-x