DOI: 10.1145/3163080.3163108

Amalgamation of Video Description and Multiple Object Localization using single Deep Learning Model

Published: 27 November 2017

Abstract

Automatically describing the content of a video is a fundamental problem in artificial intelligence that joins computer vision and natural language processing. In this paper, we propose a single system that carries out video analysis (object detection and captioning) with reduced time and memory complexity. This system uses YOLO (You Only Look Once) as its base model. Moreover, to highlight the importance of transfer learning in the development of the proposed system, two further approaches are discussed. The first uses two discrete models: one to extract a continuous bag of words from the frames, and a language model to generate captions from those words. VGG-16 (Visual Geometry Group) is used as the base image encoder to compare the two approaches, while an LSTM serves as the base language model. The dataset used is the Microsoft Research Video Description Corpus, manually modified to serve the purpose of training the proposed system. The second approach, which uses transfer learning, proves to be the better approach for developing the proposed system.
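For concreteness, the two-stage baseline described in the abstract (a pretrained VGG-16 image encoder feeding an LSTM language model) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: it assumes TensorFlow/Keras, and names such as VOCAB_SIZE, MAX_CAPTION_LEN, and EMBED_DIM are placeholder hyperparameters.

import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras import layers, Model

VOCAB_SIZE = 5000       # assumed vocabulary size
MAX_CAPTION_LEN = 20    # assumed maximum caption length
EMBED_DIM = 256         # assumed embedding width

# Stage 1: frozen, ImageNet-pretrained VGG-16 (transfer learning)
# turns each 224x224 frame into a 512-d feature vector.
encoder = VGG16(weights="imagenet", include_top=False, pooling="avg")
encoder.trainable = False

# Stage 2: an LSTM language model that predicts the next caption word
# from the frame feature plus the caption prefix decoded so far.
feat_in = layers.Input(shape=(512,), name="frame_feature")
feat = layers.Dense(EMBED_DIM, activation="relu")(feat_in)

words_in = layers.Input(shape=(MAX_CAPTION_LEN,), name="caption_prefix")
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(words_in)
seq = layers.LSTM(EMBED_DIM)(emb)

merged = layers.add([feat, seq])
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

captioner = Model([feat_in, words_in], next_word)
captioner.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Example: encode a short clip of 8 frames (placeholder pixel data).
frames = np.random.randint(0, 256, (8, 224, 224, 3)).astype("float32")
features = encoder.predict(preprocess_input(frames))  # shape (8, 512)

At inference time a caption is grown greedily: start from a begin-of-sentence token, repeatedly call the captioner on the frame feature and the current prefix, and append the highest-probability word. The single YOLO-based model that the paper favors replaces this two-model pipeline with one shared network, which is where the reported savings in time and memory come from.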




        Information

        Published In

        ICSPS 2017: Proceedings of the 9th International Conference on Signal Processing Systems
        November 2017
        237 pages
        ISBN:9781450353847
        DOI:10.1145/3163080

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. Inverted Y-Shape Model
        2. Microsoft Research Video Description Corpus
        3. VGG-16
        4. Video Caption Generation
        5. Video Object Detection
        6. YOLO

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        ICSPS 2017

        Acceptance Rates

        Overall Acceptance Rate 46 of 83 submissions, 55%



        Cited By
        • (2024) "Combinatorial Analysis of Deep Learning and Machine Learning Video Captioning Studies: A Systematic Literature Review". IEEE Access 12, 35048-35080. DOI: 10.1109/ACCESS.2024.3357980
        • (2023) "Deep convolutional neural network with Kalman filter based objected tracking and detection in underwater communications". Wireless Networks 30(6), 5571-5588. DOI: 10.1007/s11276-023-03290-z. Online: 22 Mar 2023
        • (2023) "Detection of Moving Object Using Modified Fuzzy C-Means Clustering from the Complex and Non-stationary Background Scenes". Advances in Data Science and Artificial Intelligence, 247-259. DOI: 10.1007/978-3-031-16178-0_18. Online: 14 May 2023
        • (2021) "YOLO fish detection with Euclidean tracking in fish farms". Journal of Ambient Intelligence and Humanized Computing. DOI: 10.1007/s12652-020-02847-6. Online: 3 Jan 2021
        • (2020) "Detecting Abnormal Fish Behavior Using Motion Trajectories In Ubiquitous Environments". Procedia Computer Science 175, 141-148. DOI: 10.1016/j.procs.2020.07.023
        • (2020) "MSR-YOLO: Method to Enhance Fish Detection and Tracking in Fish Farms". Procedia Computer Science 170, 539-546. DOI: 10.1016/j.procs.2020.03.123
        • (2019) "Online Video Summarization: Predicting Future to Better Summarize Present". 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 471-480. DOI: 10.1109/WACV.2019.00056
