Abstract
In this paper, we propose a novel encoding framework that learns multi-scale context information of the visual scene for the image captioning task. The devised multi-scale context information comprises spatial, semantic, and instance-level features of an input image. We draw spatial features from early convolutional layers, and obtain multi-scale semantic features by employing a feature pyramid network on top of a deep convolutional neural network. We then concatenate the spatial and multi-scale semantic features to harvest fine-to-coarse details of the visual scene. Further, instance-level features are captured by applying a bilinear interpolation technique to the fused representation, preserving object-level semantics of the image. We exploit an attention mechanism on the attained features to guide the caption decoding module. In addition, we explore various combinations of encoding techniques to acquire global and local features of an image. The efficacy of the proposed approaches is demonstrated on the COCO dataset.
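The encoding pipeline outlined above (early-layer spatial features, FPN-based multi-scale semantic features, concatenation, bilinear interpolation for instance-level cues, followed by attention) can be illustrated with a minimal PyTorch sketch. All module names, channel sizes, the torchvision ResNet-101/FPN components, and the fixed 7×7 interpolation grid standing in for detector-based object regions are assumptions made for illustration, not the authors' exact implementation.

```python
# A minimal sketch of the multi-scale context encoder described in the abstract.
# Assumptions (not the authors' exact design): a torchvision ResNet-101 backbone,
# a 512-d output space, and a fixed 7x7 grid in place of detector-based regions.
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet101
from torchvision.ops import FeaturePyramidNetwork


class MultiScaleContextEncoder(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        backbone = resnet101(pretrained=True)
        # Early convolutional stages supply fine-grained spatial features.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)
        self.layer2 = backbone.layer2
        self.layer3 = backbone.layer3
        self.layer4 = backbone.layer4
        # A feature pyramid over the deeper stages yields multi-scale semantic features.
        self.fpn = FeaturePyramidNetwork([512, 1024, 2048], d_model)
        self.spatial_proj = nn.Conv2d(256, d_model, kernel_size=1)
        self.fuse = nn.Conv2d(2 * d_model, d_model, kernel_size=1)

    def forward(self, images):
        c1 = self.stem(images)            # spatial features from early layers
        c2 = self.layer2(c1)
        c3 = self.layer3(c2)
        c4 = self.layer4(c3)
        pyramid = self.fpn(OrderedDict(p2=c2, p3=c3, p4=c4))
        # Upsample every pyramid level to the spatial map's resolution and average them.
        sem = torch.stack([
            F.interpolate(p, size=c1.shape[-2:], mode="bilinear", align_corners=False)
            for p in pyramid.values()
        ]).mean(dim=0)
        # Concatenate spatial and multi-scale semantic features (fine-to-coarse fusion).
        fused = self.fuse(torch.cat([self.spatial_proj(c1), sem], dim=1))
        # Bilinear interpolation to a fixed grid stands in for the instance-level step;
        # a detector with RoIAlign would supply true object regions.
        inst = F.interpolate(fused, size=(7, 7), mode="bilinear", align_corners=False)
        return inst.flatten(2).transpose(1, 2)   # (B, 49, d_model) regions for attention
```

An attention-based caption decoder (for example, an LSTM attending over the returned region features at each word step) would consume this output; that component is omitted from the sketch.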
Ethics declarations
Conflict of Interests
Jeripothula Prudviraj, Yenduri Sravani, and C. Krishna Mohan declare that they have no conflict of interest.
Cite this article
Prudviraj, J., Sravani, Y. & Mohan, C.K. Incorporating attentive multi-scale context information for image captioning. Multimed Tools Appl 82, 10017–10037 (2023). https://doi.org/10.1007/s11042-021-11895-9