DOI: 10.1145/3242969.3264992

Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Published: 02 October 2018

Abstract

The central difficulty of emotion recognition in the wild (EmotiW) is training a robust model that copes with diverse scenarios and anomalies. The Audio-Video Sub-challenge of EmotiW provides short audio-video clips annotated with emotion labels, and the task is to predict the label of each video. To improve emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework that captures emotional information along spatial and temporal dimensions from two complementary sources: facial images and audio. The framework consists of two parts, a facial image model and an audio model. For the facial image model, three spatio-temporal network architectures are employed to extract discriminative features of different emotions from facial expression images. First, high-level spatial features are extracted from the frames of each video by pre-trained convolutional neural networks (CNNs), namely VGG-Face and ResNet-50. The per-frame features are then fed sequentially into a Bi-directional Long Short-Term Memory (BLSTM) network to capture the dynamic variation of facial appearance across a video. In addition to this CNN-RNN structure, a deep 3-Dimensional Convolutional Neural Network (3D CNN), which extends the 2D convolution kernel to 3D, is applied to capture the evolving emotional information encoded in multiple adjacent frames. For the audio model, spectrogram images obtained by preprocessing the speech signal are modeled with a VGG-BLSTM framework to characterize affective fluctuation more effectively. Finally, a fusion strategy over the score matrices of the different spatio-temporal networks combines their complementary strengths and boosts recognition performance. Extensive experiments show that the proposed MSFF achieves an overall accuracy of 60.64%, a large improvement over the baseline that also outperforms the result of the winning team of the 2017 challenge.
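
To make the described pipeline concrete, the sketch below shows how per-frame CNN features can be aggregated by a BLSTM into video-level emotion scores, and how the score matrices of several such models can be fused at decision level. It is a minimal PyTorch-style illustration under assumed details (seven emotion classes, mean pooling over time, softmax score fusion, and the hypothetical names CnnBlstmClassifier and fuse_scores); it is not the authors' implementation.

import torch
import torch.nn as nn

NUM_CLASSES = 7  # assumed number of EmotiW emotion labels

class CnnBlstmClassifier(nn.Module):
    """Per-frame CNN features -> BLSTM over time -> video-level class scores (sketch)."""

    def __init__(self, cnn_backbone: nn.Module, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.backbone = cnn_backbone  # e.g. a pre-trained VGG-Face or ResNet-50 trunk returning (N, feat_dim)
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, NUM_CLASSES)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))  # (batch*time, feat_dim)
        feats = feats.view(b, t, -1)                 # (batch, time, feat_dim)
        seq, _ = self.blstm(feats)                   # (batch, time, 2*hidden)
        return self.head(seq.mean(dim=1))            # mean pooling over time -> (batch, NUM_CLASSES)

def fuse_scores(score_matrices, weights):
    """Weighted sum of per-model score matrices, each of shape (num_videos, NUM_CLASSES)."""
    fused = sum(w * torch.softmax(s, dim=1) for w, s in zip(weights, score_matrices))
    return fused.argmax(dim=1)  # predicted label index per video

# Hypothetical usage: fuse the four branches with weights chosen on the validation set, e.g.
# preds = fuse_scores([s_vgg_blstm, s_resnet_blstm, s_3dcnn, s_audio], [0.3, 0.3, 0.2, 0.2])

In this sketch, score matrices from the VGG-Face-BLSTM, ResNet-50-BLSTM, 3D CNN, and audio VGG-BLSTM branches would be passed to fuse_scores; the specific fusion weights above are illustrative only.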



Published In

ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
October 2018
687 pages
ISBN:9781450356923
DOI:10.1145/3242969

Sponsors

  • SIGCHI: ACM Special Interest Group on Computer-Human Interaction

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. 3D convolutional neural networks (3D CNN)
  2. convolutional neural networks (CNN)
  3. emotion recognition
  4. long short-term memory (LSTM)
  5. spatio-temporal information

Qualifiers

  • Research-article

Conference

ICMI '18
Sponsor:
  • SIGCHI

Acceptance Rates

ICMI '18 paper acceptance rate: 63 of 149 submissions (42%)
Overall acceptance rate: 453 of 1,080 submissions (42%)



Cited By

  • (2024) Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting. Proceedings of the 32nd ACM International Conference on Multimedia, 5722-5731. https://doi.org/10.1145/3664647.3681583. Online publication date: 28-Oct-2024.
  • (2024) Facial Expression Recognition in Video Using 3D-CNN Deep Features Discrimination. 2024 3rd International Conference for Innovation in Technology (INOCON), 1-6. https://doi.org/10.1109/INOCON60754.2024.10512101. Online publication date: 1-Mar-2024.
  • (2024) A cheating detection system in online exams through real-time facial emotion recognition of students. 2024 International Conference on Computing, Internet of Things and Microwave Systems (ICCIMS), 1-5. https://doi.org/10.1109/ICCIMS61672.2024.10690812. Online publication date: 29-Jul-2024.
  • (2024) CDGT: Constructing diverse graph transformers for emotion recognition from facial videos. Neural Networks, 179, 106573. https://doi.org/10.1016/j.neunet.2024.106573. Online publication date: Nov-2024.
  • (2024) Dual-STI: Dual-path Spatial-Temporal Interaction Learning for Dynamic Facial Expression Recognition. Information Sciences, 120953. https://doi.org/10.1016/j.ins.2024.120953. Online publication date: Jun-2024.
  • (2024) Multi-geometry embedded transformer for facial expression recognition in videos. Expert Systems with Applications, 249, 123635. https://doi.org/10.1016/j.eswa.2024.123635. Online publication date: Sep-2024.
  • (2024) A joint local spatial and global temporal CNN-Transformer for dynamic facial expression recognition. Applied Soft Computing, 161, 111680. https://doi.org/10.1016/j.asoc.2024.111680. Online publication date: Aug-2024.
  • (2024) A defensive attention mechanism to detect deepfake content across multiple modalities. Multimedia Systems, 30(1). https://doi.org/10.1007/s00530-023-01248-x. Online publication date: 3-Feb-2024.
  • (2023) Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild. Sensors, 23(4), 1834. https://doi.org/10.3390/s23041834. Online publication date: 7-Feb-2023.
  • (2023) Freq-HD: An Interpretable Frequency-based High-Dynamics Affective Clip Selection Method for in-the-Wild Facial Expression Recognition in Videos. Proceedings of the 31st ACM International Conference on Multimedia, 843-852. https://doi.org/10.1145/3581783.3611972. Online publication date: 26-Oct-2023.
