DOI: 10.1145/2522848.2531745

Combining modality specific deep neural networks for emotion recognition in video

Published: 09 December 2013
Abstract

In this paper we present the techniques used for the University of Montréal team's submissions to the 2013 Emotion Recognition in the Wild Challenge. The challenge is to classify the emotions expressed by the primary human subject in short video clips extracted from feature-length movies. This involves analyzing video clips of acted scenes lasting approximately one to two seconds, including the audio track, which may contain human voices as well as background music. Our approach combines multiple deep neural networks for different data modalities, including: (1) a deep convolutional neural network for the analysis of facial expressions within video frames; (2) a deep belief net to capture audio information; (3) a deep autoencoder to model the spatio-temporal information produced by the human actions depicted within the entire scene; and (4) a shallow network architecture focused on extracted features of the mouth of the primary human subject in the scene. We discuss each of these techniques, their performance characteristics, and different strategies to aggregate their predictions. Our best single model was a convolutional neural network trained to predict emotions from static frames using two large data sets, the Toronto Face Database and our own set of face images harvested from Google image search, followed by a per-frame aggregation strategy that used the challenge training data. This yielded a test set accuracy of 35.58%. Using our best strategy for aggregating our top-performing models into a single predictor, we achieved an accuracy of 41.03% on the challenge test set. Both results compare favorably to the challenge baseline test set accuracy of 27.56%.
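As a rough illustration of the aggregation ideas mentioned above, the sketch below first pools per-frame predictions into a single clip-level distribution and then takes a weighted average of the clip-level distributions produced by the modality-specific networks. It is a minimal sketch under stated assumptions: the function names, the fixed weights, and the randomly generated probabilities are illustrative placeholders, not the aggregation procedures or values used in the paper.

# Minimal sketch of two aggregation ideas the abstract mentions:
# (1) pooling per-frame CNN predictions into a clip-level prediction, and
# (2) late fusion of clip-level predictions from several modality-specific models.
# Model names, weights, and probabilities are hypothetical placeholders.
import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def pool_frames(frame_probs):
    """Average per-frame class probabilities (n_frames x n_classes) into one
    clip-level distribution; simple averaging is used here for illustration."""
    return frame_probs.mean(axis=0)

def fuse_models(per_model_probs, weights):
    """Weighted average of clip-level distributions from different modalities."""
    total = sum(weights.values())
    return sum((w / total) * per_model_probs[name] for name, w in weights.items())

# Hypothetical outputs for one clip.
rng = np.random.default_rng(0)
clip_probs = {
    "frame_cnn":   pool_frames(rng.dirichlet(np.ones(7), size=30)),  # CNN on 30 face frames
    "audio_dbn":   rng.dirichlet(np.ones(7)),                        # deep belief net on audio
    "activity_ae": rng.dirichlet(np.ones(7)),                        # autoencoder on scene motion
    "mouth_net":   rng.dirichlet(np.ones(7)),                        # shallow net on mouth features
}
weights = {"frame_cnn": 0.5, "audio_dbn": 0.2, "activity_ae": 0.15, "mouth_net": 0.15}

fused = fuse_models(clip_probs, weights)
print("predicted emotion:", EMOTIONS[int(np.argmax(fused))])

In practice such combination weights would be chosen using held-out challenge data rather than fixed by hand.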





      Published In

      ICMI '13: Proceedings of the 15th ACM on International conference on multimodal interaction
      December 2013
      630 pages
      ISBN: 9781450321297
      DOI: 10.1145/2522848


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tag

      • emotion recognition

      Qualifiers

      • Research-article

      Conference

      ICMI '13

      Acceptance Rates

      ICMI '13 paper acceptance rate: 49 of 133 submissions, 37%
      Overall acceptance rate: 453 of 1,080 submissions, 42%

