DOI: 10.1145/3347320.3357690

Efficient Spatial Temporal Convolutional Features for Audiovisual Continuous Affect Recognition

Published: 15 October 2019

Abstract

Affective dimension prediction from multi-modal data is becoming an increasingly attractive research field in artificial intelligence (AI) and human-computer interaction (HCI). Previous work has shown that discriminative features from multiple modalities are important for accurately recognizing emotional states, and deep representations have recently proved effective for this task. To investigate new deep spatial-temporal features and evaluate their effectiveness for affective dimension recognition, in this paper we propose: (1) combining a pre-trained 2D-CNN and a 1D-CNN to learn deep spatial-temporal features from video frames and audio spectrograms; and (2) a Spatial-Temporal Graph Convolutional Network (ST-GCN) adapted to the facial landmark graph. To evaluate the effectiveness of the proposed spatial-temporal features for affective dimension prediction, we propose a Deep Bidirectional Long Short-Term Memory (DBLSTM) model for single-modality, early-fusion, and late-fusion predictions. For the liking dimension, we use the text modality. Experimental results on the AVEC 2019 CES dataset show that the proposed spatial-temporal features and recognition model obtain promising results. On the development set, the concordance correlation coefficient (CCC) reaches 0.724 for arousal and 0.705 for valence; on the test set, the CCC is 0.513 for arousal and 0.515 for valence, outperforming the baseline system with corresponding CCCs of 0.355 and 0.468 for arousal and valence, respectively.
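For reference, the concordance correlation coefficient reported above is the standard evaluation metric of the AVEC challenges. For a predicted trace $x$ and a gold-standard trace $y$ it is defined as

$$\mathrm{CCC} = \frac{2\,\rho\,\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2},$$

where $\rho$ is the Pearson correlation between $x$ and $y$, and $\mu$ and $\sigma^2$ denote their means and variances. Unlike plain correlation, CCC equals 1 only when the prediction also matches the gold standard in scale and offset, which is why it is preferred for continuous affect traces.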

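To make the recognition stage concrete, the following is a minimal PyTorch sketch of a DBLSTM regressor of the kind the abstract describes, trained with a 1 − CCC objective that is common in this line of work. The layer sizes, the choice of objective, the early-fusion concatenation, and the dummy tensors are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: a deep bidirectional LSTM (DBLSTM) regressor mapping per-frame
# feature sequences to a continuous affect dimension (e.g. arousal).
import torch
import torch.nn as nn


class DBLSTMRegressor(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 128, num_layers: int = 2):
        super().__init__()
        # Stacked bidirectional LSTM over the temporal feature sequence.
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Per-timestep linear head producing one affect value.
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim) -> (batch, time) predictions.
        out, _ = self.lstm(x)
        return self.head(out).squeeze(-1)


def ccc_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """1 - CCC, a common training objective for continuous affect recognition."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    pred_var, gold_var = pred.var(unbiased=False), gold.var(unbiased=False)
    cov = ((pred - pred_mean) * (gold - gold_mean)).mean()
    ccc = 2 * cov / (pred_var + gold_var + (pred_mean - gold_mean) ** 2)
    return 1 - ccc


# Early fusion: concatenate per-frame audio and video features, then regress.
audio_feats = torch.randn(4, 100, 256)   # (batch, time, dim) dummy audio features
video_feats = torch.randn(4, 100, 512)   # dummy video features
labels = torch.rand(4, 100)              # dummy gold arousal trace

model = DBLSTMRegressor(input_dim=256 + 512)
pred = model(torch.cat([audio_feats, video_feats], dim=-1))
loss = ccc_loss(pred, labels)
loss.backward()
```

Late fusion would instead train one such model per modality and combine their per-frame outputs, e.g. with a learned or CCC-weighted average; the sketch above covers the single-modality and early-fusion cases.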


Published In

AVEC '19: Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop
October 2019
96 pages
ISBN:9781450369138
DOI:10.1145/3347320
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. affective dimension
  2. deep spatial-temporal feature
  3. emotion
  4. multimodal

Qualifiers

  • Research-article

Conference

MM '19

Acceptance Rates

Overall Acceptance Rate 52 of 98 submissions, 53%


Cited By

  • (2024) COLD Fusion: Calibrated and Ordinal Latent Distribution Fusion for Uncertainty-Aware Multimodal Emotion Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(2):805-822, Feb 2024. https://doi.org/10.1109/TPAMI.2023.3325770
  • (2024) Incongruity-Aware Cross-Modal Attention for Audio-Visual Fusion in Dimensional Emotion Recognition. IEEE Journal of Selected Topics in Signal Processing 18(3):444-458, Apr 2024. https://doi.org/10.1109/JSTSP.2024.3422823
  • (2024) Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey. IEEE Access 12:59399-59430, 2024. https://doi.org/10.1109/ACCESS.2024.3391817
  • (2024) A multimodal fusion-based deep learning framework combined with local-global contextual TCNs for continuous emotion recognition from videos. Applied Intelligence 54(4):3040-3057, Feb 2024. https://doi.org/10.1007/s10489-024-05329-w
  • (2023) Graph-Based Facial Affect Analysis: A Review. IEEE Transactions on Affective Computing 14(4):2657-2677, Oct 2023. https://doi.org/10.1109/TAFFC.2022.3215918
  • (2023) Modeling Multiple Temporal Scales of Full-Body Movements for Emotion Classification. IEEE Transactions on Affective Computing 14(2):1070-1081, Apr 2023. https://doi.org/10.1109/TAFFC.2021.3095425
  • (2023) An End-to-End Mandarin Audio-Visual Speech Recognition Model with a Feature Enhancement Module. 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 572-577, Oct 2023. https://doi.org/10.1109/SMC53992.2023.10394108
  • (2023) Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition. 2023 IEEE International Conference on Multimedia and Expo (ICME), 642-647, Jul 2023. https://doi.org/10.1109/ICME55011.2023.00116
  • (2023) Hand Gesture and Audio Recognition System Using Neural Networks. 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1-6, Jul 2023. https://doi.org/10.1109/ICCCNT56998.2023.10306722
  • (2022) Emotion Recognition from Physiological Channels Using Graph Neural Network. Sensors 22(8):2980, Apr 2022. https://doi.org/10.3390/s22082980
