
Bidirectional transformer with knowledge graph for video captioning

Published in Multimedia Tools and Applications

Abstract

Models based on the transformer architecture have risen to prominence for video captioning. However, most models improve only the encoder or only the decoder, because improving both simultaneously can amplify the shortcomings of either side. Building on the transformer architecture, we connect a bidirectional decoder to an encoder that integrates fine-grained spatio-temporal features, objects, and the relationships between objects in the video. Experiments show that improvements to the encoder amplify the information leakage of the bidirectional decoder and further degrade results. To tackle this problem, we generate pseudo reverse captions and propose a Bidirectional Transformer with Knowledge Graph (BTKG), which feeds the outputs of two encoders into the forward and backward decoders of the bidirectional decoder, respectively. In addition, we make fine-grained improvements inside the different encoders according to four modal features of the video. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of BTKG, which achieves state-of-the-art performance on significant metrics. Moreover, the sentences generated by BTKG contain scene words and modifiers, which are more in line with human language habits. Code is available at https://github.com/nickchen121/BTKG.
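As a concrete illustration of the architecture described above, the following is a minimal sketch of a bidirectional captioner in which two encoder streams feed the forward and backward decoders separately. It assumes a PyTorch implementation; the module names, feature dimensions, and the use of a flipped token sequence as the pseudo reverse caption are illustrative assumptions, not the authors' released code (see the linked repository for that).

```python
# Minimal sketch (not the authors' released code): a bidirectional captioner in
# which two encoder streams feed a forward and a backward decoder, respectively.
# Module names, dimensions, and the pseudo-reverse-caption handling are
# illustrative assumptions.
import torch
import torch.nn as nn


class BidirectionalCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Two encoders, e.g. one over spatio-temporal frame features and one
        # over object/relation (knowledge-graph) embeddings.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.visual_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.graph_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # The forward decoder reads one encoder's memory, the backward decoder
        # reads the other, so each direction sees a different view of the video.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.fwd_decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.bwd_decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, graph_feats, fwd_tokens, bwd_tokens):
        # fwd_tokens: the caption left-to-right; bwd_tokens: its pseudo reverse.
        mem_visual = self.visual_encoder(visual_feats)
        mem_graph = self.graph_encoder(graph_feats)
        seq_len = fwd_tokens.size(1)
        # Standard causal mask so each position attends only to earlier tokens.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        fwd_hidden = self.fwd_decoder(self.embed(fwd_tokens), mem_visual, tgt_mask=causal)
        bwd_hidden = self.bwd_decoder(self.embed(bwd_tokens), mem_graph, tgt_mask=causal)
        return self.out(fwd_hidden), self.out(bwd_hidden)


# Toy usage with random tensors (batch of 2 videos).
model = BidirectionalCaptioner(vocab_size=1000)
visual = torch.randn(2, 20, 512)    # (batch, frames, d_model)
graph = torch.randn(2, 10, 512)     # (batch, graph nodes, d_model)
caption = torch.randint(0, 1000, (2, 12))
logits_fwd, logits_bwd = model(visual, graph, caption, caption.flip(1))
```

Only the split routing of the two encoder outputs described in the abstract is shown here; training losses, the construction of pseudo reverse captions, and any interaction between the two decoder directions are omitted.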


Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments to improve the paper. The research is partially supported by the National Natural Science Foundation (NNSF) of China (No. 61877031) and Jiangxi Normal University Graduate Innovation Fund (YJS2022029).

Author information

Corresponding author

Correspondence to Youde Chen.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhong, M., Chen, Y., Zhang, H. et al. Bidirectional transformer with knowledge graph for video captioning. Multimed Tools Appl 83, 58309–58328 (2024). https://doi.org/10.1007/s11042-023-17822-4


  • DOI: https://doi.org/10.1007/s11042-023-17822-4
