Abstract
Automatic image captioning, a highly challenging research problem, aims to understand the contents of a complex scene and describe them in human-understandable natural language. Most recent solutions are holistic: the scene is described as a whole, which can lose the important semantic relationships among objects in the scene. We propose Dense-CaptionNet, a region-based deep architecture for fine-grained description of image semantics, which localizes and describes each object/region in the image separately and thereby generates a more detailed description of the scene. The proposed network contains three components that work together to produce this fine-grained description. The first module generates region descriptions and object relationships, while the second generates the attributes of the objects present in the scene. The textual outputs of these two modules are concatenated and fed as input to the sentence generation module, which follows an encoder-decoder formulation to produce a grammatically correct, single-sentence, fine-grained description of the whole scene. Dense-CaptionNet is trained and tested on the Visual Genome, MSCOCO, and IAPR TC-12 datasets. The results establish a new state of the art compared with existing top-performing methods, e.g., Up-Down-Captioner; Show, Attend and Tell; Semstyle; and Neural Talk, especially on complex scenes. The implementation has been shared on GitHub for other researchers: http://bit.ly/2VIhfrf
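As a rough illustration of the pipeline described above, the following minimal PyTorch sketch concatenates textual region descriptions and object attributes and passes them through an LSTM-based encoder-decoder to score a target caption. This is not the authors' released implementation (see the GitHub link above); the class SentenceGenerator, the helper build_input, the toy vocabulary, and all layer sizes are illustrative assumptions.

# Minimal sketch (not the authors' code) of the Dense-CaptionNet pipeline described
# in the abstract: textual region descriptions and object attributes are concatenated
# and fed to an encoder-decoder that emits a single fine-grained sentence.
# All module and variable names here are illustrative assumptions.

import torch
import torch.nn as nn


class SentenceGenerator(nn.Module):
    """Encoder-decoder over word indices (assumed LSTM-based)."""

    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        # Encode the concatenated region-description + attribute tokens.
        _, state = self.encoder(self.embed(src_ids))
        # Decode the target caption conditioned on the encoder state (teacher forcing).
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)  # (batch, tgt_len, vocab_size) logits


def build_input(region_descriptions, attributes, word2id, max_len=60):
    """Concatenate the two text streams and map words to indices (<pad>=0, <unk>=1 assumed)."""
    words = " ".join(region_descriptions + attributes).lower().split()[:max_len]
    ids = [word2id.get(w, 1) for w in words]
    ids += [0] * (max_len - len(ids))
    return torch.tensor([ids])


if __name__ == "__main__":
    # Toy vocabulary and inputs, purely for illustration.
    vocab = ["<pad>", "<unk>", "a", "man", "riding", "brown", "horse", "on", "beach"]
    word2id = {w: i for i, w in enumerate(vocab)}
    src = build_input(["a man riding a horse", "horse on a beach"], ["brown horse"], word2id)
    tgt = torch.tensor([[2, 3, 4, 2, 5, 6, 7, 2, 8]])  # "a man riding a brown horse on a beach"
    model = SentenceGenerator(vocab_size=len(vocab))
    logits = model(src, tgt)
    print(logits.shape)  # torch.Size([1, 9, 9])

In the actual system, the encoder input would come from the region-description and attribute modules, and inference would decode token by token (e.g., greedy or beam search) rather than using teacher forcing as shown here.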
References
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L. Bottom-up and top-down attention for image captioning and visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2018. p. 6.
Bai S, An S. A survey on automatic image caption generation. Neurocomputing 2018.
Banerjee S, Lavie A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; 2005. p. 65–72.
Bashir R, Shahzad M, Fraz M. Vr-proud: vehicle re-identification using progressive unsupervised deep architecture. Pattern Recogn 2019;90:52–65.
Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 1994;5(2):157–166.
Chen H, Ding G, Lin Z, Guo Y, Shan C, Han J. Image captioning with memorized knowledge. Cogn Comput 2019:1–14. https://doi.org/10.1007/s12559-019-09656-w.
Datta R, Li J, Wang JZ. Content-based image retrieval: approaches and trends of the new age. Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval. ACM; 2005. p. 253–262.
Ding G, Chen M, Zhao S, Chen H, Han J, Liu Q. Neural image caption generation with weighted training and reference. Cogn Comput 2019;11(6):763–777.
Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T. Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 2625–2634.
Farhadi A, Hejrati M, Sadeghi MA, Young P, Rashtchian C, Hockenmaier J, Forsyth D. Every picture tells a story: generating sentences from images. European Conference on Computer Vision. Springer; 2010. p. 15–29.
Grubinger M, Clough P, Müller H, Deselaers T. The iapr tc-12 benchmark: a new evaluation resource for visual information systems. International Workshop OntoImage; 2006. p. 13–23.
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–1780.
Jaderberg M, Simonyan K, Zisserman A. Spatial transformer networks. Advances in neural information processing systems; 2015. p. 2017–2025.
Johnson J, Karpathy A, Fei-Fei L. Densecap: fully convolutional localization networks for dense captioning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. p. 4565–4574.
Karpathy A, Fei-Fei L. Deep visual-semantic alignments for generating image descriptions. IEEE Trans Pattern Anal Mach Intell 2017;39(4):664–676. https://doi.org/10.1109/TPAMI.2016.2598339.
Khurram I, Fraz MM, Shahzad M. Detailed sentence generation architecture for image semantics description. International Symposium on Visual Computing. Springer; 2018. p. 423–432.
Kingma DP, Ba J. Adam: a method for stochastic optimization. International Conference on Learning Representations; 2015. p. 1–13.
Kolb P. Disco: a multilingual database of distributionally similar words. Proceedings of KONVENS-2008. Berlin; 2008. p. 156.
Krause J, Stark M, Deng J, Fei-Fei L. 3d object representations for fine-grained categorization. IEEE International Conference on Computer Vision Workshops (ICCVW). IEEE; 2013. p. 554–561.
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li LJ, Shamma DA, Bernstein M, Fei-Fei L. Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 2017;123(1):32–73.
Lin CY. Rouge: a package for automatic evaluation of summaries. Text Summarization Branches Out: Proceedings of the ACL-04 Workshop; 2004. p. 74–81.
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL. Microsoft COCO: common objects in context. European Conference on Computer Vision. Springer; 2014. p. 740–755.
Liu C, Sun F, Wang C, Wang F, Yuille A. Mat: a multimodal attentive translator for image captioning. Proceedings of the Twenty-sixth International Joint Conference on Artificial Intelligence (IJCAI); 2017. p. 4033–4039.
Liu X, Deng Z. Segmentation of drivable road using deep fully convolutional residual network with pyramid pooling. Cogn Comput 2018;10(2):272–281.
Lu J, Xiong C, Parikh D, Socher R. Knowing when to look: adaptive attention via a visual sentinel for image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2017. p. 2.
Luong T, Pham H, Manning CD. Effective approaches to attention-based neural machine translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing; 2015. p. 1412–1421. Association for Computational Linguistics, Lisbon, Portugal. https://aclweb.org/anthology/D/D15/D15-1166.
Manning CD. Part-of-speech tagging from 97% to 100%: is it time for some linguistics?. International Conference on Intelligent Text Processing and Computational Linguistics. Springer; 2011. p. 171–189.
Mathews A, Xie L, He X. Semstyle: learning to generate stylised image captions using unaligned text. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 8591–8600.
Maynord M, Bhattacharya S, Aha DW. Image surveillance assistant. Applications of Computer Vision Workshops (WACVW). IEEE; 2016. p. 1–7.
Nganji JT, Brayshaw M, Tompsett B. Describing and assessing image descriptions for visually impaired web users with idat. Proceedings of the Third International Conference on Intelligent Human Computer Interaction (IHCI 2011). Springer; 2013. p. 27–37.
Papineni K, Roukos S, Ward T, Zhu WJ. BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics; 2002. p. 311–318. Association for Computational Linguistics.
Park CC, Kim B, Kim G. Attend to you: personalized image captioning with context sequence memory networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 6432–6440.
Poria S, Chaturvedi I, Cambria E, Hussain A. Convolutional MKL based multimodal emotion recognition and sentiment analysis. 2016 IEEE 16th International Conference on Data Mining (ICDM); 2016. p. 439–448, https://doi.org/10.1109/ICDM.2016.0055.
Ren S, He K, Girshick R, Sun J. Faster r-CNN: towards real-time object detection with region proposal networks. Advances in neural information processing systems; 2015. p. 91–99.
Ren Z, Wang X, Zhang N, Lv X, Li LJ. Deep reinforcement learning-based image captioning with embedding reward. IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017.
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M. Imagenet large scale visual recognition challenge. Int J Comput Vis 2015;115(3):211–252.
Saez D. Correcting image orientation using convolutional neural networks; 2017. https://d4nst.github.io/2017/01/12/image-orientation/.
Shen J, Liu G, Chen J, Fang Y, Xie J, Yu Y, Yan S. Unified structured learning for simultaneous human pose estimation and garment attribute classification. IEEE Trans Image Process 2014;23(11):4786–4798.
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR); 2014.
Spratling MW. A hierarchical predictive coding model of object recognition in natural images. Cogn Comput 2017;9(2):151–167.
Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. Advances in neural information processing systems; 2014. p. 3104–3112.
Vinyals O, Toshev A, Bengio S, Erhan D. Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans Pattern Anal Mach Intell 2017;39(4):652–663. https://doi.org/10.1109/TPAMI.2016.2587640.
Wen TH, Gasic M, Mrksic N, Su PH, Vandyke D, Young S. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. Proceedings of Empirical Methods in Natural Language Processing; 2015. p. 583–593.
Xiao X, Wang L, Ding K, Xiang S, Pan C. Dense semantic embedding network for image captioning. Pattern Recogn 2019.
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y. Show, attend and tell: neural image caption generation with visual attention. International Conference on Machine Learning; 2015. p. 2048–2057.
Yang Z, Yuan Y, Wu Y, Cohen WW, Salakhutdinov RR. Review networks for caption generation. Advances in neural information processing systems; 2016. p. 2361–2369.
Zhang L, Sung F, Liu F, Xiang T, Gong S, Yang Y, Hospedales TM. Actor-critic sequence training for image captioning. Neural Information Processing Systems (NIPS) Workshop on Visually-Grounded Interaction and Language; 2017.
Zhong G, Yan S, Huang K, Cai Y, Dong J. Reducing and stretching deep convolutional activation features for accurate image classification. Cogn Comput 2018;10(1):179–186.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.