DOI: 10.1145/2964284.2964297

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification

Published: 01 October 2016

Abstract

This paper presents a novel framework to combine multiple layers and modalities of deep neural networks for video classification. We first propose a multilayer strategy to simultaneously capture a variety of levels of abstraction and invariance in a network, where the convolutional and fully connected layers are effectively represented by our proposed feature aggregation methods. We further introduce a multimodal scheme that includes four highly complementary modalities to extract diverse static and dynamic cues at multiple temporal scales. In particular, to model long-term temporal information, we propose a new structure, FC-RNN, which effectively transforms pre-trained fully connected layers into recurrent layers. A robust boosting model is then introduced to optimize the fusion of multiple layers and modalities in a unified way. In extensive experiments, we achieve state-of-the-art results on two public benchmark datasets: UCF101 and HMDB51.
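The abstract only sketches the FC-RNN idea. As one reading of that description, below is a minimal PyTorch-style sketch of turning a pre-trained fully connected layer into a recurrent layer by reusing its weights and bias for the input-to-hidden transform and adding a single new hidden-to-hidden matrix. The class name FCRNN, the ReLU activation, and the zero initial state are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class FCRNN(nn.Module):
    """Hypothetical sketch of FC-RNN: a pre-trained fc layer supplies the
    input-to-hidden transform; only a new hidden-to-hidden matrix is added
    to make the layer recurrent."""

    def __init__(self, pretrained_fc: nn.Linear):
        super().__init__()
        self.fc = pretrained_fc  # reuse pre-trained weights W_fc and bias b_fc
        h = pretrained_fc.out_features
        # The only parameters trained from scratch (assumed bias-free here).
        self.w_hh = nn.Linear(h, h, bias=False)
        self.act = nn.ReLU()  # activation choice is an assumption

    def forward(self, x_seq: torch.Tensor) -> torch.Tensor:
        # x_seq: (time, batch, in_features) -> returns (time, batch, hidden)
        steps, batch, _ = x_seq.shape
        h_t = x_seq.new_zeros(batch, self.fc.out_features)
        outs = []
        for t in range(steps):
            # h_t = act(W_fc @ x_t + b_fc + W_hh @ h_{t-1})
            h_t = self.act(self.fc(x_seq[t]) + self.w_hh(h_t))
            outs.append(h_t)
        return torch.stack(outs)
```

Under this reading, the appeal of the construction is that the pre-trained representation is preserved and far fewer new parameters are introduced than with a randomly initialized LSTM or GRU; treat the details above as an interpretation of the abstract, not the paper's exact method.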

Published In

MM '16: Proceedings of the 24th ACM International Conference on Multimedia
October 2016
1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. boosting
  2. cnn
  3. deep neural networks
  4. fusion
  5. rnn
  6. video classification

Qualifiers

  • Research-article

Conference

MM '16: ACM Multimedia Conference
October 15-19, 2016
Amsterdam, The Netherlands

Acceptance Rates

MM '16 paper acceptance rate: 52 of 237 submissions (22%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
