CAM-RNN: Co-Attention Model Based RNN for Video Captioning

Published: 01 November 2019 in IEEE Transactions on Image Processing, vol. 28, no. 11. Publisher: IEEE Press.

Abstract

Video captioning is a technique that bridges vision and language, for which both visual information and text information are important. Typical approaches are based on the recurrent neural network (RNN), where the video caption is generated word by word and the current word is predicted from the visual content and the previously generated words. However, when predicting the current word, much of the visual content is uncorrelated with it, and some of the previously generated words provide little information, which may interfere with generating a correct caption. Motivated by this observation, we attempt to exploit the visual and text features that are most correlated with the caption. In this paper, a co-attention model based recurrent neural network (CAM-RNN) is proposed, where the CAM encodes the visual and text features and the RNN works as the decoder to generate the video caption. Specifically, the CAM is composed of a visual attention module, a text attention module, and a balancing gate. During generation, the visual attention module adaptively attends to the salient regions in each frame and to the frames most correlated with the caption. The text attention module automatically focuses on the most relevant previously generated words or phrases. Moreover, between the two attention modules, a balancing gate is designed to regulate the influence of the visual and text features when generating the caption. Extensive experiments conducted on four popular datasets, MSVD, Charades, MSR-VTT, and MPII-MD, demonstrate the effectiveness of the proposed approach.
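
To make the described mechanism concrete, the sketch below illustrates one decoding step of a co-attention module with a balancing gate in PyTorch. It is a minimal sketch under stated assumptions, not the paper's implementation: the layer names, dimensions, additive score functions, and the scalar sigmoid gate are illustrative choices, and only frame-level visual attention is shown (the paper additionally attends to salient regions within each frame).

# Minimal sketch of a co-attention step with a balancing gate (illustrative only;
# the paper's actual formulation may differ).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttentionStep(nn.Module):
    """One decoding step: attend over frame features and over previously
    generated word embeddings, then blend the two contexts with a gate."""

    def __init__(self, visual_dim, word_dim, hidden_dim):
        super().__init__()
        self.vis_score = nn.Linear(visual_dim + hidden_dim, 1)        # visual attention scores
        self.txt_score = nn.Linear(word_dim + hidden_dim, 1)          # text attention scores
        self.gate = nn.Linear(visual_dim + word_dim + hidden_dim, 1)  # balancing gate
        self.vis_proj = nn.Linear(visual_dim, hidden_dim)
        self.txt_proj = nn.Linear(word_dim, hidden_dim)

    def forward(self, frame_feats, prev_word_embs, hidden):
        # frame_feats:    (batch, n_frames, visual_dim)  -- per-frame CNN features
        # prev_word_embs: (batch, n_words,  word_dim)    -- embeddings of words generated so far
        # hidden:         (batch, hidden_dim)            -- current decoder RNN state
        b, n_f, _ = frame_feats.shape
        n_w = prev_word_embs.shape[1]

        # Visual attention: score each frame against the decoder state.
        h_v = hidden.unsqueeze(1).expand(b, n_f, -1)
        alpha = F.softmax(self.vis_score(torch.cat([frame_feats, h_v], -1)).squeeze(-1), dim=-1)
        vis_ctx = torch.bmm(alpha.unsqueeze(1), frame_feats).squeeze(1)

        # Text attention: score each previously generated word against the decoder state.
        h_t = hidden.unsqueeze(1).expand(b, n_w, -1)
        beta = F.softmax(self.txt_score(torch.cat([prev_word_embs, h_t], -1)).squeeze(-1), dim=-1)
        txt_ctx = torch.bmm(beta.unsqueeze(1), prev_word_embs).squeeze(1)

        # Balancing gate: a scalar in (0, 1) that weighs visual versus text context.
        g = torch.sigmoid(self.gate(torch.cat([vis_ctx, txt_ctx, hidden], -1)))
        context = g * self.vis_proj(vis_ctx) + (1 - g) * self.txt_proj(txt_ctx)
        return context, alpha, beta


if __name__ == "__main__":
    # Toy usage with random tensors (2 videos, 30 frames, 5 previous words).
    step = CoAttentionStep(visual_dim=2048, word_dim=300, hidden_dim=512)
    ctx, alpha, beta = step(torch.randn(2, 30, 2048), torch.randn(2, 5, 300), torch.randn(2, 512))
    print(ctx.shape)  # torch.Size([2, 512])

In a full decoder, the blended context would be fed, together with the previous word embedding, into the RNN (e.g., an LSTM) to update the hidden state and predict the next word.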
