DOI: 10.1145/2522848.2531745

Combining modality specific deep neural networks for emotion recognition in video

Published: 09 December 2013
Abstract

In this paper we present the techniques used for the University of Montréal team's submissions to the 2013 Emotion Recognition in the Wild Challenge. The challenge is to classify the emotions expressed by the primary human subject in short video clips extracted from feature-length movies. This involves analyzing video clips of acted scenes lasting approximately one to two seconds, including the audio track, which may contain human voices as well as background music. Our approach combines multiple deep neural networks for different data modalities, including: (1) a deep convolutional neural network for the analysis of facial expressions within video frames; (2) a deep belief net to capture audio information; (3) a deep autoencoder to model the spatio-temporal information produced by the human actions depicted within the entire scene; and (4) a shallow network architecture focused on extracted features of the mouth of the primary human subject in the scene. We discuss each of these techniques, their performance characteristics, and different strategies to aggregate their predictions. Our best single model was a convolutional neural network trained to predict emotions from static frames using two large data sets, the Toronto Face Database and our own set of face images harvested from Google image search, followed by a per-frame aggregation strategy that used the challenge training data. This yielded a test set accuracy of 35.58%. Using our best strategy for aggregating our top-performing models into a single predictor, we achieved an accuracy of 41.03% on the challenge test set. Both results compare favorably to the challenge baseline test set accuracy of 27.56%.
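As a rough illustration of the aggregation ideas mentioned above, the sketch below first pools per-frame predictions into a single clip-level distribution and then takes a weighted average of the clip-level distributions produced by the modality-specific networks. It is a minimal sketch under stated assumptions: the function names, the fixed weights, and the randomly generated probabilities are illustrative placeholders, not the aggregation procedures or values used in the paper.

# Minimal sketch of two aggregation ideas the abstract mentions:
# (1) pooling per-frame CNN predictions into a clip-level prediction, and
# (2) late fusion of clip-level predictions from several modality-specific models.
# Model names, weights, and probabilities are hypothetical placeholders.
import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def pool_frames(frame_probs):
    """Average per-frame class probabilities (n_frames x n_classes) into one
    clip-level distribution; simple averaging is used here for illustration."""
    return frame_probs.mean(axis=0)

def fuse_models(per_model_probs, weights):
    """Weighted average of clip-level distributions from different modalities."""
    total = sum(weights.values())
    return sum((w / total) * per_model_probs[name] for name, w in weights.items())

# Hypothetical outputs for one clip.
rng = np.random.default_rng(0)
clip_probs = {
    "frame_cnn":   pool_frames(rng.dirichlet(np.ones(7), size=30)),  # CNN on 30 face frames
    "audio_dbn":   rng.dirichlet(np.ones(7)),                        # deep belief net on audio
    "activity_ae": rng.dirichlet(np.ones(7)),                        # autoencoder on scene motion
    "mouth_net":   rng.dirichlet(np.ones(7)),                        # shallow net on mouth features
}
weights = {"frame_cnn": 0.5, "audio_dbn": 0.2, "activity_ae": 0.15, "mouth_net": 0.15}

fused = fuse_models(clip_probs, weights)
print("predicted emotion:", EMOTIONS[int(np.argmax(fused))])

In practice such combination weights would be chosen using held-out challenge data rather than fixed by hand.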





      Published In

      ICMI '13: Proceedings of the 15th ACM on International conference on multimodal interaction
      December 2013
      630 pages
      ISBN: 9781450321297
      DOI: 10.1145/2522848


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tag

      • emotion recognition

      Qualifiers

      • Research-article

      Conference

      ICMI '13

      Acceptance Rates

      ICMI '13 paper acceptance rate: 49 of 133 submissions, 37%
      Overall acceptance rate: 453 of 1,080 submissions, 42%

