
Bidirectional transformer with knowledge graph for video captioning

Published in Multimedia Tools and Applications

Abstract

Models based on the transformer architecture have risen to prominence for video captioning. However, most models improve only the encoder or only the decoder, because improving both simultaneously can amplify the shortcomings of either side. Building on the transformer architecture, we connect a bidirectional decoder to an encoder that integrates fine-grained spatio-temporal features, objects, and the relationships between objects in the video. Experiments show that improvements to the encoder amplify the information leakage of the bidirectional decoder and further degrade results. To tackle this problem, we generate pseudo reverse captions and propose a Bidirectional Transformer with Knowledge Graph (BTKG), which feeds the outputs of two encoders into the forward and backward decoders of the bidirectional decoder, respectively. In addition, we make fine-grained improvements inside the different encoders according to four modal features of the video. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of BTKG, which achieves state-of-the-art performance on significant metrics. Moreover, the sentences generated by BTKG contain scene words and modifiers, which are more in line with human language habits. Code is available at https://github.com/nickchen121/BTKG.
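As a concrete illustration of the architecture described above, the following is a minimal sketch of a bidirectional captioner in which two encoder streams feed the forward and backward decoders separately. It assumes a PyTorch implementation; the module names, feature dimensions, and the use of a flipped token sequence as the pseudo reverse caption are illustrative assumptions, not the authors' released code (see the linked repository for that).

```python
# Minimal sketch (not the authors' released code): a bidirectional captioner in
# which two encoder streams feed a forward and a backward decoder, respectively.
# Module names, dimensions, and the pseudo-reverse-caption handling are
# illustrative assumptions.
import torch
import torch.nn as nn


class BidirectionalCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Two encoders, e.g. one over spatio-temporal frame features and one
        # over object/relation (knowledge-graph) embeddings.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.visual_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.graph_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # The forward decoder reads one encoder's memory, the backward decoder
        # reads the other, so each direction sees a different view of the video.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.fwd_decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.bwd_decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, graph_feats, fwd_tokens, bwd_tokens):
        # fwd_tokens: the caption left-to-right; bwd_tokens: its pseudo reverse.
        mem_visual = self.visual_encoder(visual_feats)
        mem_graph = self.graph_encoder(graph_feats)
        seq_len = fwd_tokens.size(1)
        # Standard causal mask so each position attends only to earlier tokens.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        fwd_hidden = self.fwd_decoder(self.embed(fwd_tokens), mem_visual, tgt_mask=causal)
        bwd_hidden = self.bwd_decoder(self.embed(bwd_tokens), mem_graph, tgt_mask=causal)
        return self.out(fwd_hidden), self.out(bwd_hidden)


# Toy usage with random tensors (batch of 2 videos).
model = BidirectionalCaptioner(vocab_size=1000)
visual = torch.randn(2, 20, 512)    # (batch, frames, d_model)
graph = torch.randn(2, 10, 512)     # (batch, graph nodes, d_model)
caption = torch.randint(0, 1000, (2, 12))
logits_fwd, logits_bwd = model(visual, graph, caption, caption.flip(1))
```

Only the split routing of the two encoder outputs described in the abstract is shown here; training losses, the construction of pseudo reverse captions, and any interaction between the two decoder directions are omitted.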


Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments to improve the paper. The research is partially supported by the National Natural Science Foundation (NNSF) of China (No. 61877031) and Jiangxi Normal University Graduate Innovation Fund (YJS2022029).

Author information

Corresponding author

Correspondence to Youde Chen.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhong, M., Chen, Y., Zhang, H. et al. Bidirectional transformer with knowledge graph for video captioning. Multimed Tools Appl 83, 58309–58328 (2024). https://doi.org/10.1007/s11042-023-17822-4


  • DOI: https://doi.org/10.1007/s11042-023-17822-4
