
Video Description: A Survey of Methods, Datasets, and Evaluation Metrics

Published: 16 October 2019

Abstract

    Video description is the automatic generation of natural language sentences that describe the contents of a given video. It has applications in human-robot interaction, assistance for the visually impaired, and video subtitling. The past few years have seen a surge of research in this area due to the unprecedented success of deep learning in computer vision and natural language processing. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, creating the need for a comprehensive survey to focus research efforts in this flourishing new direction. This article fills that gap by surveying state-of-the-art approaches with a focus on deep learning models; comparing benchmark datasets in terms of their domains, number of classes, and repository size; and identifying the pros and cons of various evaluation metrics, such as SPICE, CIDEr, ROUGE, BLEU, METEOR, and WMD. Classical video description approaches combined subject, object, and verb detection with template-based language models to generate sentences. However, the release of large datasets revealed that these methods cannot cope with the diversity of unconstrained open-domain videos. The classical approaches were followed by a brief era of statistical methods, which were soon replaced by deep learning, the current state of the art in video description. Our survey shows that, despite fast-paced developments, video description research is still in its infancy for the following reasons. First, analyzing video description models is challenging because it is difficult to ascertain how much the visual features and the adopted language model each contribute to the accuracy of, or errors in, the final description. Second, existing datasets offer neither adequate visual diversity nor sufficiently complex linguistic structures. Finally, current evaluation metrics fall short of measuring the agreement between machine-generated descriptions and those written by humans. We conclude our survey by listing promising directions for future research.
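
The abstract contrasts n-gram-based metrics (BLEU, METEOR, ROUGE, CIDEr) with more semantic ones (SPICE, WMD). To make the n-gram matching behind the first group concrete, below is a minimal sentence-level BLEU-4 sketch in Python with clipped counts, add-one smoothing, and a brevity penalty. It is illustrative only: the function names, smoothing choice, and example captions are assumptions for this sketch, not code from the surveyed works, and published results rely on the official captioning evaluation toolkits.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(candidate, reference, max_n=4):
    """BLEU-4 of one generated caption against a single reference caption."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: each candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        matched = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Add-one smoothing so one empty n-gram order does not zero the score.
        log_precisions.append(math.log((matched + 1) / (total + 1)))
    # Brevity penalty discourages trivially short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical machine-generated and human reference captions.
    generated = "a man is slicing an onion in the kitchen"
    reference = "a man slices an onion in a kitchen"
    print(f"BLEU-4 = {sentence_bleu(generated, reference):.3f}")
```

METEOR and CIDEr build on the same matching idea, adding stem/synonym matching and TF-IDF weighting of n-grams, respectively, which is part of why the metrics correlate differently with human judgments.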

        Published In

        ACM Computing Surveys, Volume 52, Issue 6
        November 2020
        806 pages
        ISSN: 0360-0300
        EISSN: 1557-7341
        DOI: 10.1145/3368196
        Editor: Sartaj Sahni
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 16 October 2019
        Accepted: 01 August 2019
        Received: 01 March 2019
        Published in CSUR Volume 52, Issue 6

        Author Tags

        1. Video description
        2. language in vision
        3. video captioning
        4. video to text

        Qualifiers

        • Survey
        • Research
        • Refereed

        Funding Sources

        • ARC Discovery
        • Army Research Office
