Abstract
The attention-enriched encoder-decoder framework has recently attracted considerable interest in image captioning owing to its strong performance. Many visual attention models directly leverage salient image regions to generate descriptions. However, a direct mapping from visual space to text is insufficient for producing fine-grained captions. This paper exploits a feature-compounding approach that brings together high-level semantic concepts and visual information about the contextual environment in a fully end-to-end manner. To this end, we propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning, which consolidates cross-modal features through a novel compounding function in a multi-step reasoning fashion. In addition, we jointly employ spatial information and context-aware attributes (CAA) as the principal components of the compounding function, where the CAA provides a concise, context-sensitive semantic representation. To better exploit the potential of the consolidated features, we further propose an SCFC-LSTM caption generator that can leverage discriminative semantic information throughout the caption generation process. Experimental results show that the proposed SCFC outperforms various state-of-the-art image captioning models on popular metrics over the MSCOCO and Flickr30K datasets.
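To make the overall idea concrete, the sketch below shows how a stacked cross-modal consolidation module of this kind could be structured in PyTorch. It is a minimal illustration under our own assumptions, not the paper's actual formulation: the gated fusion used as the compounding function, the residual multi-step update, and all names (SCFCBlock, StackedSCFC, dim, steps) are hypothetical stand-ins for the components the abstract names (region features, CAA features, and a state refined over several reasoning steps).

```python
import torch
import torch.nn as nn


class SCFCBlock(nn.Module):
    """One consolidation step: attends over spatial region features and
    context-aware attribute (CAA) features, then compounds the two.
    The gated fusion here is an illustrative stand-in for the paper's
    compounding function."""

    def __init__(self, dim):
        super().__init__()
        self.visual_att = nn.Linear(dim * 2, 1)  # scores over regions
        self.attr_att = nn.Linear(dim * 2, 1)    # scores over attributes
        self.gate = nn.Linear(dim * 2, dim)      # balances the two modalities

    def forward(self, regions, attrs, state):
        # regions: (B, R, D) spatial features; attrs: (B, A, D) CAA embeddings
        # state:   (B, D) current consolidated representation
        def attend(feats, scorer):
            s = state.unsqueeze(1).expand_as(feats)
            w = torch.softmax(scorer(torch.cat([feats, s], dim=-1)), dim=1)
            return (w * feats).sum(dim=1)  # (B, D) attended summary

        v = attend(regions, self.visual_att)
        a = attend(attrs, self.attr_att)
        g = torch.sigmoid(self.gate(torch.cat([v, a], dim=-1)))
        return g * v + (1 - g) * a  # compounded cross-modal feature


class StackedSCFC(nn.Module):
    """Stacks several consolidation blocks for multi-step reasoning."""

    def __init__(self, dim, steps=2):
        super().__init__()
        self.blocks = nn.ModuleList([SCFCBlock(dim) for _ in range(steps)])

    def forward(self, regions, attrs, state):
        for block in self.blocks:
            state = state + block(regions, attrs, state)  # stepwise refinement
        return state
```

In the full model, the consolidated state would condition an SCFC-LSTM decoder at each word-generation step; here the residual update simply stands in for the multi-step reasoning described above.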
Data availability statement
All datasets used in this study are well-known benchmarks freely available on the Internet.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Pourkeshavarz, M., Nabavi, S., Moghaddam, M.E. et al. Stacked cross-modal feature consolidation attention networks for image captioning. Multimed Tools Appl 83, 12209–12233 (2024). https://doi.org/10.1007/s11042-023-15869-x