DOI: 10.1145/2964284.2964297

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification

Published: 01 October 2016

Abstract

This paper presents a novel framework to combine multiple layers and modalities of deep neural networks for video classification. We first propose a multilayer strategy to simultaneously capture a variety of levels of abstraction and invariance in a network, where the convolutional and fully connected layers are effectively represented by our proposed feature aggregation methods. We further introduce a multimodal scheme that includes four highly complementary modalities to extract diverse static and dynamic cues at multiple temporal scales. In particular, to model long-term temporal information, we propose a new structure, FC-RNN, which effectively transforms pre-trained fully connected layers into recurrent layers. A robust boosting model is then introduced to optimize the fusion of multiple layers and modalities in a unified way. In extensive experiments, we achieve state-of-the-art results on two public benchmark datasets: UCF101 and HMDB51.
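The abstract only sketches the FC-RNN idea. As one reading of that description, below is a minimal PyTorch-style sketch of turning a pre-trained fully connected layer into a recurrent layer by reusing its weights and bias for the input-to-hidden transform and adding a single new hidden-to-hidden matrix. The class name FCRNN, the ReLU activation, and the zero initial state are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class FCRNN(nn.Module):
    """Hypothetical sketch of FC-RNN: a pre-trained fc layer supplies the
    input-to-hidden transform; only a new hidden-to-hidden matrix is added
    to make the layer recurrent."""

    def __init__(self, pretrained_fc: nn.Linear):
        super().__init__()
        self.fc = pretrained_fc  # reuse pre-trained weights W_fc and bias b_fc
        h = pretrained_fc.out_features
        # The only parameters trained from scratch (assumed bias-free here).
        self.w_hh = nn.Linear(h, h, bias=False)
        self.act = nn.ReLU()  # activation choice is an assumption

    def forward(self, x_seq: torch.Tensor) -> torch.Tensor:
        # x_seq: (time, batch, in_features) -> returns (time, batch, hidden)
        steps, batch, _ = x_seq.shape
        h_t = x_seq.new_zeros(batch, self.fc.out_features)
        outs = []
        for t in range(steps):
            # h_t = act(W_fc @ x_t + b_fc + W_hh @ h_{t-1})
            h_t = self.act(self.fc(x_seq[t]) + self.w_hh(h_t))
            outs.append(h_t)
        return torch.stack(outs)
```

Under this reading, the appeal of the construction is that the pre-trained representation is preserved and far fewer new parameters are introduced than with a randomly initialized LSTM or GRU; treat the details above as an interpretation of the abstract, not the paper's exact method.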

Published In

MM '16: Proceedings of the 24th ACM International Conference on Multimedia
October 2016
1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. boosting
  2. cnn
  3. deep neural networks
  4. fusion
  5. rnn
  6. video classification

Qualifiers

  • Research-article

Conference

MM '16: ACM Multimedia Conference
October 15-19, 2016
Amsterdam, The Netherlands

Acceptance Rates

MM '16 paper acceptance rate: 52 of 237 submissions (22%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
