DOI: 10.1145/3163080.3163108

Amalgamation of Video Description and Multiple Object Localization using single Deep Learning Model

Published: 27 November 2017

Abstract

Automatically describing the content of a video is a fundamental problem in artificial intelligence that joins computer vision and natural language processing. In this paper, we propose a single system that carries out video analysis (object detection and captioning) with reduced time and memory complexity. This system uses YOLO (You Only Look Once) as its base model. Moreover, to highlight the importance of transfer learning in the development of the proposed system, two further approaches are discussed. The first uses two discrete models: one to extract a continuous bag of words from the frames, and a language model to generate captions from those words. VGG-16 (Visual Geometry Group) is used as the base image encoder to compare the two approaches, while an LSTM serves as the base language model. The dataset used is the Microsoft Research Video Description Corpus, manually modified to serve the purpose of training the proposed system. The second approach, which uses transfer learning, proves to be the better approach for developing the proposed system.
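For concreteness, the two-stage baseline described in the abstract (a pretrained VGG-16 image encoder feeding an LSTM language model) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: it assumes TensorFlow/Keras, and names such as VOCAB_SIZE, MAX_CAPTION_LEN, and EMBED_DIM are placeholder hyperparameters.

import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras import layers, Model

VOCAB_SIZE = 5000       # assumed vocabulary size
MAX_CAPTION_LEN = 20    # assumed maximum caption length
EMBED_DIM = 256         # assumed embedding width

# Stage 1: frozen, ImageNet-pretrained VGG-16 (transfer learning)
# turns each 224x224 frame into a 512-d feature vector.
encoder = VGG16(weights="imagenet", include_top=False, pooling="avg")
encoder.trainable = False

# Stage 2: an LSTM language model that predicts the next caption word
# from the frame feature plus the caption prefix decoded so far.
feat_in = layers.Input(shape=(512,), name="frame_feature")
feat = layers.Dense(EMBED_DIM, activation="relu")(feat_in)

words_in = layers.Input(shape=(MAX_CAPTION_LEN,), name="caption_prefix")
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(words_in)
seq = layers.LSTM(EMBED_DIM)(emb)

merged = layers.add([feat, seq])
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

captioner = Model([feat_in, words_in], next_word)
captioner.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Example: encode a short clip of 8 frames (placeholder pixel data).
frames = np.random.randint(0, 256, (8, 224, 224, 3)).astype("float32")
features = encoder.predict(preprocess_input(frames))  # shape (8, 512)

At inference time a caption is grown greedily: start from a begin-of-sentence token, repeatedly call the captioner on the frame feature and the current prefix, and append the highest-probability word. The single YOLO-based model that the paper favors replaces this two-model pipeline with one shared network, which is where the reported savings in time and memory come from.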




        Information

        Published In

        ICSPS 2017: Proceedings of the 9th International Conference on Signal Processing Systems
        November 2017
        237 pages
        ISBN:9781450353847
        DOI:10.1145/3163080

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. Inverted Y-Shape Model
        2. Microsoft Research Video Description Corpus
        3. VGG-16
        4. Video Caption Generation
        5. Video Object Detection
        6. YOLO

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        ICSPS 2017

        Acceptance Rates

        Overall Acceptance Rate 46 of 83 submissions, 55%



        Cited By
        • (2024) "Combinatorial Analysis of Deep Learning and Machine Learning Video Captioning Studies: A Systematic Literature Review". IEEE Access 12, 35048-35080. DOI: 10.1109/ACCESS.2024.3357980
        • (2023) "Deep convolutional neural network with Kalman filter based objected tracking and detection in underwater communications". Wireless Networks 30(6), 5571-5588. DOI: 10.1007/s11276-023-03290-z. Online: 22 Mar 2023
        • (2023) "Detection of Moving Object Using Modified Fuzzy C-Means Clustering from the Complex and Non-stationary Background Scenes". Advances in Data Science and Artificial Intelligence, 247-259. DOI: 10.1007/978-3-031-16178-0_18. Online: 14 May 2023
        • (2021) "YOLO fish detection with Euclidean tracking in fish farms". Journal of Ambient Intelligence and Humanized Computing. DOI: 10.1007/s12652-020-02847-6. Online: 3 Jan 2021
        • (2020) "Detecting Abnormal Fish Behavior Using Motion Trajectories In Ubiquitous Environments". Procedia Computer Science 175, 141-148. DOI: 10.1016/j.procs.2020.07.023
        • (2020) "MSR-YOLO: Method to Enhance Fish Detection and Tracking in Fish Farms". Procedia Computer Science 170, 539-546. DOI: 10.1016/j.procs.2020.03.123
        • (2019) "Online Video Summarization: Predicting Future to Better Summarize Present". 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 471-480. DOI: 10.1109/WACV.2019.00056
