DOI: 10.1145/3265987.3265988

Multi-task Joint Learning for Videos in the Wild

Published: 15 October 2018
Abstract

    Most conventional state-of-the-art methods for video analysis achieve outstanding performance by combining two or more different inputs, e.g. an RGB image, a motion image, or an audio signal, in a two-stream manner. Although these approaches perform well, they implicitly treat every input feature as equally important for classifying a video. This obscures the nature of each class: every class depends on different levels of information from different features. To incorporate the nature of each class, we present class-nature-specific fusion, which combines the features with different weights per class to obtain the optimal result for each class. In this work, we first represent each frame-level video feature as a spectral image, so that convolutional neural networks (CNNs) can be trained on both the RGB and audio features. We then revise the conventional two-stream fusion method into a class-nature-specific one by combining the features with different weights for different classes. We evaluate our method on the Comprehensive Video Understanding in the Wild dataset to understand how each class reacts to each feature in wild videos. Our experimental results not only show an advantage over conventional two-stream fusion, but also illustrate the per-class correlation between the two features, the RGB and the audio signal.
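    To make the fusion idea concrete, here is a minimal sketch of class-nature-specific fusion in PyTorch. It is an illustration under stated assumptions, not the paper's published implementation: the module name, the sigmoid parameterization of learnable per-class weights, the convex combination of the two streams' class scores, and the class count of 500 are all hypothetical choices; the abstract only specifies that the streams are combined with different weights for different classes.

    ```python
    import torch
    import torch.nn as nn

    class ClassNatureSpecificFusion(nn.Module):
        """Fuse per-class scores from an RGB stream and an audio stream
        using one learnable weight per class (illustrative sketch)."""

        def __init__(self, num_classes: int):
            super().__init__()
            # One raw weight per class; a sigmoid maps it into (0, 1).
            # Initialized at zero, so fusion starts as the 50/50 average
            # used by conventional two-stream fusion.
            self.raw_weight = nn.Parameter(torch.zeros(num_classes))

        def forward(self, rgb_scores: torch.Tensor,
                    audio_scores: torch.Tensor) -> torch.Tensor:
            # rgb_scores, audio_scores: (batch, num_classes) class scores
            # from the RGB CNN and the audio (spectral-image) CNN.
            w = torch.sigmoid(self.raw_weight)  # per-class weight in (0, 1)
            return w * rgb_scores + (1.0 - w) * audio_scores

    # Hypothetical usage: 500 classes, batch of 4 videos.
    fusion = ClassNatureSpecificFusion(num_classes=500)
    rgb = torch.randn(4, 500)     # scores from the RGB stream
    audio = torch.randn(4, 500)   # scores from the audio stream
    fused = fusion(rgb, audio)    # (4, 500) class-specific fused scores
    ```

    Conventional two-stream fusion corresponds to a single weight shared by all classes; learning one weight per class lets audio-dominated classes and visually dominated classes weight the two streams differently.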


    Cited By

    • (2023) Identifying Auxiliary or Adversarial Tasks Using Necessary Condition Analysis for Adversarial Multi-task Video Understanding. Computer Vision – ECCV 2022 Workshops, pp. 317-333. DOI: 10.1007/978-3-031-25075-0_23. Online publication date: 19-Feb-2023.
    • (2019) Video Multitask Transformer Network. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1553-1561. DOI: 10.1109/ICCVW.2019.00194. Online publication date: Oct-2019.



      Published In

      CoVieW'18: Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild
      October 2018
      45 pages
      ISBN:9781450359764
      DOI:10.1145/3265987
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. joint learning
      2. video understanding

      Qualifiers

      • Research-article

      Funding Sources

      • Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT (NRF-2017M3C4A7069370)

      Conference

      MM '18: ACM Multimedia Conference
      October 22, 2018
      Seoul, Republic of Korea
