DOI: 10.1145/3265987.3265988

Multi-task Joint Learning for Videos in the Wild

Published: 15 October 2018
Abstract

    Most conventional state-of-the-art methods for video analysis achieve outstanding performance by combining two or more different inputs, e.g. an RGB image, a motion image, or an audio signal, in a two-stream manner. Although these approaches perform well, they implicitly treat every input feature as equally important for classifying a video. This obscures the nature of each class: every class depends on different levels of information from different features. To incorporate the nature of each class, we present class-nature-specific fusion, which combines the features with different weights per class to obtain the optimal result for each class. In this work, we first represent each frame-level video feature as a spectral image, so that convolutional neural networks (CNNs) can be trained on both the RGB and audio features. We then revise the conventional two-stream fusion method into a class-nature-specific one by combining the features with different weights for different classes. We evaluate our method on the Comprehensive Video Understanding in the Wild dataset to understand how each class reacts to each feature in wild videos. Our experimental results not only show an advantage over conventional two-stream fusion, but also illustrate the per-class correlation between the two features, the RGB and the audio signal.
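    To make the fusion idea concrete, here is a minimal sketch of class-nature-specific fusion in PyTorch. It is an illustration under stated assumptions, not the paper's published implementation: the module name, the sigmoid parameterization of learnable per-class weights, the convex combination of the two streams' class scores, and the class count of 500 are all hypothetical choices; the abstract only specifies that the streams are combined with different weights for different classes.

    ```python
    import torch
    import torch.nn as nn

    class ClassNatureSpecificFusion(nn.Module):
        """Fuse per-class scores from an RGB stream and an audio stream
        using one learnable weight per class (illustrative sketch)."""

        def __init__(self, num_classes: int):
            super().__init__()
            # One raw weight per class; a sigmoid maps it into (0, 1).
            # Initialized at zero, so fusion starts as the 50/50 average
            # used by conventional two-stream fusion.
            self.raw_weight = nn.Parameter(torch.zeros(num_classes))

        def forward(self, rgb_scores: torch.Tensor,
                    audio_scores: torch.Tensor) -> torch.Tensor:
            # rgb_scores, audio_scores: (batch, num_classes) class scores
            # from the RGB CNN and the audio (spectral-image) CNN.
            w = torch.sigmoid(self.raw_weight)  # per-class weight in (0, 1)
            return w * rgb_scores + (1.0 - w) * audio_scores

    # Hypothetical usage: 500 classes, batch of 4 videos.
    fusion = ClassNatureSpecificFusion(num_classes=500)
    rgb = torch.randn(4, 500)     # scores from the RGB stream
    audio = torch.randn(4, 500)   # scores from the audio stream
    fused = fusion(rgb, audio)    # (4, 500) class-specific fused scores
    ```

    Conventional two-stream fusion corresponds to a single weight shared by all classes; learning one weight per class lets audio-dominated classes and visually dominated classes weight the two streams differently.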


    Cited By

    • (2023) Identifying Auxiliary or Adversarial Tasks Using Necessary Condition Analysis for Adversarial Multi-task Video Understanding. Computer Vision – ECCV 2022 Workshops, pp. 317-333. DOI: 10.1007/978-3-031-25075-0_23. Online publication date: 19-Feb-2023.
    • (2019) Video Multitask Transformer Network. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1553-1561. DOI: 10.1109/ICCVW.2019.00194. Online publication date: Oct-2019.



      Published In

      CoVieW'18: Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild
      October 2018
      45 pages
      ISBN:9781450359764
      DOI:10.1145/3265987
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. joint learning
      2. video understanding

      Qualifiers

      • Research-article

      Funding Sources

      • Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT (NRF-2017M3C4A7069370)

      Conference

      MM '18: ACM Multimedia Conference
      October 22, 2018
      Seoul, Republic of Korea
