DOI: 10.1145/3382507.3417960

Implicit Knowledge Injectable Cross Attention Audiovisual Model for Group Emotion Recognition

Published: 22 October 2020

Abstract

Audio-video group emotion recognition is a challenging task because it is difficult to gather a broad range of potential information and obtain meaningful emotional representations from it. Humans understand emotions easily because they can associate implicit contextual knowledge (contained in memory) with the explicit information they see and hear directly. This paper proposes an end-to-end architecture, the implicit knowledge injectable cross attention audiovisual deep neural network (K-injection audiovisual network), that imitates this intuition. The K-injection audiovisual network is used to train an audiovisual model that not only obtains audiovisual representations of group emotions through an explicit feature-based cross attention audiovisual subnetwork (audiovisual subnetwork), but also absorbs implicit knowledge of emotions through two implicit knowledge-based injection subnetworks (K-injection subnetworks). The model is trained with both explicit features and implicit knowledge, yet it can make inferences using only the explicit features. We define region-of-interest (ROI) visual features and Mel-spectrogram audio features as explicit features, since they are directly present in the raw audio-video data. In contrast, we define the linguistic and acoustic emotional representations that do not exist in the audio-video data as implicit knowledge. The implicit knowledge distilled by adapting video situation descriptions and basic acoustic features (MFCCs, pitch, and energy) to the linguistic and acoustic K-injection subnetworks is defined as linguistic and acoustic knowledge, respectively. Compared with the testing-set baseline accuracy of 47.88%, the audiovisual models trained with the linguistic, acoustic, and linguistic-acoustic K-injection subnetworks achieved an average overall accuracy of 66.40%.
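
For intuition, the cross-attention fusion of the explicit features can be sketched roughly as below. This is a minimal PyTorch sketch, not the authors' implementation: the feature dimension, number of heads, mean pooling, and the number of emotion classes are illustrative assumptions.

# Hedged sketch of cross-attention fusion of explicit features (not the paper's exact model).
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Toy fusion of ROI visual features and Mel-spectrogram audio features."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 3):
        super().__init__()
        # Visual tokens attend over audio tokens and vice versa ("cross" attention).
        self.vis_to_aud = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.aud_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (batch, Tv, dim) ROI features; audio: (batch, Ta, dim) Mel-spectrogram frames.
        v_att, _ = self.vis_to_aud(query=visual, key=audio, value=audio)
        a_att, _ = self.aud_to_vis(query=audio, key=visual, value=visual)
        # Pool each attended sequence and concatenate into one audiovisual embedding.
        fused = torch.cat([v_att.mean(dim=1), a_att.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # group-emotion logits


# Example with random explicit features: 16 ROI tokens and 32 audio frames per clip.
model = CrossAttentionFusion()
logits = model(torch.randn(2, 16, 256), torch.randn(2, 32, 256))
print(logits.shape)  # torch.Size([2, 3])

Because the attention weights are computed per query token and per head, each modality can weight the other modality's frames differently across the sequence, which is the behaviour the cross attention audiovisual subnetwork relies on.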

Supplementary Material

MP4 File (3382507.3417960.mp4)
This presentation video introduces the paper "Implicit Knowledge Injectable Cross Attention Audiovisual Model for Group Emotion Recognition". The contributions of this paper can be summarized as follows: 1) we propose an end-to-end architecture that can not only obtain audiovisual representations from the video directly, but can also absorb implicit knowledge of emotions hidden in the video; 2) we apply a multi-head cross attention network as the audiovisual subnetwork, which dynamically integrates multimodal features at the sequence level; 3) we jointly train the audiovisual subnetwork with two knowledge-based injection subnetworks to transfer emotional knowledge distilled from a unimodal model into another model; 4) the audiovisual model achieved an overall accuracy of 66.40%, compared with the testing-set baseline of 47.88%.
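
The joint-training idea in contribution 3) can be illustrated with the following hedged sketch: a frozen knowledge encoder (linguistic or acoustic) produces an implicit emotional embedding that is only available during training, and the audiovisual model is pulled toward it with a distillation term alongside the usual classification loss. The function and argument names, the cosine-based distillation term, and the assumption that the audiovisual model returns (logits, embedding) are all illustrative, not the paper's exact formulation.

# Hedged sketch of one joint training step with knowledge injection (illustrative only).
import torch
import torch.nn.functional as F


def joint_training_step(av_model, knowledge_encoder, optimizer,
                        visual, audio, knowledge_inputs, labels,
                        distill_weight: float = 0.5) -> float:
    av_model.train()
    with torch.no_grad():
        # Implicit knowledge embedding; assumed unavailable at inference time.
        k_emb = knowledge_encoder(knowledge_inputs)

    # Assumed interface: the audiovisual model returns logits and an embedding
    # of the same size as the knowledge embedding.
    logits, av_emb = av_model(visual, audio)
    ce_loss = F.cross_entropy(logits, labels)                               # supervised emotion loss
    kd_loss = 1.0 - F.cosine_similarity(av_emb, k_emb, dim=-1).mean()       # knowledge-injection term

    loss = ce_loss + distill_weight * kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# At inference only the explicit audiovisual branch is used:
# logits, _ = av_model(visual, audio)

Because the knowledge encoder only appears inside the training step, the trained audiovisual model can make predictions from the explicit ROI and Mel-spectrogram features alone, matching the train-with-knowledge, infer-without-knowledge setup described in the abstract.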




      Published In

      ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction
      October 2020
      920 pages
      ISBN:9781450375818
      DOI:10.1145/3382507
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 22 October 2020


      Author Tags

      1. affective computing
      2. machine learning for multimodal interaction
      3. multimodal fusion and representation

      Qualifiers

      • Research-article

      Conference

      ICMI '20
      Sponsor:
      ICMI '20: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
      October 25 - 29, 2020
      Virtual Event, Netherlands

      Acceptance Rates

      Overall Acceptance Rate 453 of 1,080 submissions, 42%


      Cited By

      • (2024) An Audiovisual Correlation Matching Method Based on Fine-Grained Emotion and Feature Fusion. Sensors 24(17), 5681. https://doi.org/10.3390/s24175681. Online publication date: 31-Aug-2024
      • (2024) Analysis of Learner’s Emotional Engagement in Online Learning Using Machine Learning Adam Robust Optimization Algorithm. Scientific Programming 2024(1). https://doi.org/10.1155/2024/8886197. Online publication date: 5-Jun-2024
      • (2024) A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations. IEEE Transactions on Multimedia 26, 776-788. https://doi.org/10.1109/TMM.2023.3271019. Online publication date: 2024
      • (2024) Incongruity-Aware Cross-Modal Attention for Audio-Visual Fusion in Dimensional Emotion Recognition. IEEE Journal of Selected Topics in Signal Processing 18(3), 444-458. https://doi.org/10.1109/JSTSP.2024.3422823. Online publication date: Apr-2024
      • (2024) Multi-grained fusion network with self-distillation for aspect-based multimodal sentiment analysis. Knowledge-Based Systems 293(C). https://doi.org/10.1016/j.knosys.2024.111724. Online publication date: 7-Jun-2024
      • (2024) Exploring contactless techniques in multimodal emotion recognition: insights into diverse applications, challenges, solutions, and prospects. Multimedia Systems 30(3). https://doi.org/10.1007/s00530-024-01302-2. Online publication date: 6-Apr-2024
      • (2024) Fusing Multimodal Streams for Improved Group Emotion Recognition in Videos. Pattern Recognition, 403-418. https://doi.org/10.1007/978-3-031-78305-0_26. Online publication date: 4-Dec-2024
      • (2024) A Spatial-Temporal Graph Convolutional Network for Video-Based Group Emotion Recognition. Pattern Recognition, 339-354. https://doi.org/10.1007/978-3-031-78201-5_22. Online publication date: 2-Dec-2024
      • (2023) Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features. Proceedings of the 25th International Conference on Multimodal Interaction, 750-754. https://doi.org/10.1145/3577190.3616546. Online publication date: 9-Oct-2023
      • (2023) Audio-Visual Group-based Emotion Recognition using Local and Global Feature Aggregation based Multi-Task Learning. Proceedings of the 25th International Conference on Multimodal Interaction, 741-745. https://doi.org/10.1145/3577190.3616544. Online publication date: 9-Oct-2023
