
Video Description: A Survey of Methods, Datasets, and Evaluation Metrics

Published: 16 October 2019

Abstract

    Video description is the automatic generation of natural language sentences that describe the contents of a given video. It has applications in human-robot interaction, assistance for the visually impaired, and video subtitling. The past few years have seen a surge of research in this area due to the unprecedented success of deep learning in computer vision and natural language processing. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, creating the need for a comprehensive survey to focus research efforts in this flourishing new direction. This article fills that gap by surveying state-of-the-art approaches with a focus on deep learning models; comparing benchmark datasets in terms of their domains, number of classes, and repository size; and identifying the pros and cons of various evaluation metrics, such as SPICE, CIDEr, ROUGE, BLEU, METEOR, and WMD. Classical video description approaches combined subject, object, and verb detection with template-based language models to generate sentences. However, the release of large datasets revealed that these methods cannot cope with the diversity of unconstrained open-domain videos. The classical approaches were followed by a brief era of statistical methods, which were soon replaced by deep learning, the current state of the art in video description. Our survey shows that, despite fast-paced developments, video description research is still in its infancy for the following reasons. First, analyzing video description models is challenging because it is difficult to ascertain how much the visual features and the adopted language model each contribute to the accuracy of, or errors in, the final description. Second, existing datasets offer neither adequate visual diversity nor sufficiently complex linguistic structures. Finally, current evaluation metrics fall short of measuring the agreement between machine-generated descriptions and those written by humans. We conclude our survey by listing promising directions for future research.
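
The abstract contrasts n-gram-based metrics (BLEU, METEOR, ROUGE, CIDEr) with more semantic ones (SPICE, WMD). To make the n-gram matching behind the first group concrete, below is a minimal sentence-level BLEU-4 sketch in Python with clipped counts, add-one smoothing, and a brevity penalty. It is illustrative only: the function names, smoothing choice, and example captions are assumptions for this sketch, not code from the surveyed works, and published results rely on the official captioning evaluation toolkits.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(candidate, reference, max_n=4):
    """BLEU-4 of one generated caption against a single reference caption."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: each candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        matched = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Add-one smoothing so one empty n-gram order does not zero the score.
        log_precisions.append(math.log((matched + 1) / (total + 1)))
    # Brevity penalty discourages trivially short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical machine-generated and human reference captions.
    generated = "a man is slicing an onion in the kitchen"
    reference = "a man slices an onion in a kitchen"
    print(f"BLEU-4 = {sentence_bleu(generated, reference):.3f}")
```

METEOR and CIDEr build on the same matching idea, adding stem/synonym matching and TF-IDF weighting of n-grams, respectively, which is part of why the metrics correlate differently with human judgments.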

        Published In

        ACM Computing Surveys, Volume 52, Issue 6
        November 2020
        806 pages
        ISSN: 0360-0300
        EISSN: 1557-7341
        DOI: 10.1145/3368196
        Editor: Sartaj Sahni
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 16 October 2019
        Accepted: 01 August 2019
        Received: 01 March 2019
        Published in CSUR Volume 52, Issue 6

        Author Tags

        1. Video description
        2. language in vision
        3. video captioning
        4. video to text

        Qualifiers

        • Survey
        • Research
        • Refereed

        Funding Sources

        • ARC Discovery
        • Army Research Office
