DOI: 10.1145/3382507.3417960

Implicit Knowledge Injectable Cross Attention Audiovisual Model for Group Emotion Recognition

Published: 22 October 2020

Abstract

Audio-video group emotion recognition is a challenging task because it is difficult to gather a broad range of potential information and obtain meaningful emotional representations from it. Humans understand emotions easily because they can associate implicit contextual knowledge (contained in memory) with the explicit information they see and hear directly. This paper proposes an end-to-end architecture, the implicit knowledge injectable cross attention audiovisual deep neural network (K-injection audiovisual network), that imitates this intuition. The K-injection audiovisual network is used to train an audiovisual model that not only obtains audiovisual representations of group emotions through an explicit feature-based cross attention audiovisual subnetwork (audiovisual subnetwork), but also absorbs implicit knowledge of emotions through two implicit knowledge-based injection subnetworks (K-injection subnetworks). The model is trained with both explicit features and implicit knowledge, yet it can make inferences using only the explicit features. We define region-of-interest (ROI) visual features and Mel-spectrogram audio features as explicit features, since they are directly present in the raw audio-video data. In contrast, we define the linguistic and acoustic emotional representations that do not exist in the audio-video data as implicit knowledge. The implicit knowledge distilled by adapting video situation descriptions and basic acoustic features (MFCCs, pitch, and energy) to the linguistic and acoustic K-injection subnetworks is defined as linguistic and acoustic knowledge, respectively. Compared with the testing-set baseline accuracy of 47.88%, the audiovisual models trained with the linguistic, acoustic, and linguistic-acoustic K-injection subnetworks achieved an average overall accuracy of 66.40%.
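
For intuition, the cross-attention fusion of the explicit features can be sketched roughly as below. This is a minimal PyTorch sketch, not the authors' implementation: the feature dimension, number of heads, mean pooling, and the number of emotion classes are illustrative assumptions.

# Hedged sketch of cross-attention fusion of explicit features (not the paper's exact model).
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Toy fusion of ROI visual features and Mel-spectrogram audio features."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 3):
        super().__init__()
        # Visual tokens attend over audio tokens and vice versa ("cross" attention).
        self.vis_to_aud = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.aud_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (batch, Tv, dim) ROI features; audio: (batch, Ta, dim) Mel-spectrogram frames.
        v_att, _ = self.vis_to_aud(query=visual, key=audio, value=audio)
        a_att, _ = self.aud_to_vis(query=audio, key=visual, value=visual)
        # Pool each attended sequence and concatenate into one audiovisual embedding.
        fused = torch.cat([v_att.mean(dim=1), a_att.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # group-emotion logits


# Example with random explicit features: 16 ROI tokens and 32 audio frames per clip.
model = CrossAttentionFusion()
logits = model(torch.randn(2, 16, 256), torch.randn(2, 32, 256))
print(logits.shape)  # torch.Size([2, 3])

Because the attention weights are computed per query token and per head, each modality can weight the other modality's frames differently across the sequence, which is the behaviour the cross attention audiovisual subnetwork relies on.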

Supplementary Material

MP4 File (3382507.3417960.mp4)
This presentation video introduces the paper "Implicit Knowledge Injectable Cross Attention Audiovisual Model for Group Emotion Recognition". The contributions of this paper can be summarized as follows: 1) we propose an end-to-end architecture that can not only obtain audiovisual representations from the video directly, but can also absorb implicit knowledge of emotions hidden in the video; 2) we apply a multi-head cross attention network as the audiovisual subnetwork, which dynamically integrates multimodal features at the sequence level; 3) we jointly train the audiovisual subnetwork with two knowledge-based injection subnetworks to transfer emotional knowledge distilled from a unimodal model into another model; 4) the audiovisual model achieved an overall accuracy of 66.40%, compared with the testing-set baseline of 47.88%.
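
The joint-training idea in contribution 3) can be illustrated with the following hedged sketch: a frozen knowledge encoder (linguistic or acoustic) produces an implicit emotional embedding that is only available during training, and the audiovisual model is pulled toward it with a distillation term alongside the usual classification loss. The function and argument names, the cosine-based distillation term, and the assumption that the audiovisual model returns (logits, embedding) are all illustrative, not the paper's exact formulation.

# Hedged sketch of one joint training step with knowledge injection (illustrative only).
import torch
import torch.nn.functional as F


def joint_training_step(av_model, knowledge_encoder, optimizer,
                        visual, audio, knowledge_inputs, labels,
                        distill_weight: float = 0.5) -> float:
    av_model.train()
    with torch.no_grad():
        # Implicit knowledge embedding; assumed unavailable at inference time.
        k_emb = knowledge_encoder(knowledge_inputs)

    # Assumed interface: the audiovisual model returns logits and an embedding
    # of the same size as the knowledge embedding.
    logits, av_emb = av_model(visual, audio)
    ce_loss = F.cross_entropy(logits, labels)                               # supervised emotion loss
    kd_loss = 1.0 - F.cosine_similarity(av_emb, k_emb, dim=-1).mean()       # knowledge-injection term

    loss = ce_loss + distill_weight * kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# At inference only the explicit audiovisual branch is used:
# logits, _ = av_model(visual, audio)

Because the knowledge encoder only appears inside the training step, the trained audiovisual model can make predictions from the explicit ROI and Mel-spectrogram features alone, matching the train-with-knowledge, infer-without-knowledge setup described in the abstract.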




      Published In

      ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction
      October 2020
      920 pages
      ISBN:9781450375818
      DOI:10.1145/3382507
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 22 October 2020


      Author Tags

      1. affective computing
      2. machine learning for multimodal interaction
      3. multimodal fusion and representation

      Qualifiers

      • Research-article

      Conference

      ICMI '20
      Sponsor:
      ICMI '20: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
      October 25 - 29, 2020
      Virtual Event, Netherlands

      Acceptance Rates

      Overall Acceptance Rate 453 of 1,080 submissions, 42%


      Cited By

      • (2024) An Audiovisual Correlation Matching Method Based on Fine-Grained Emotion and Feature Fusion. Sensors 24(17), 5681. https://doi.org/10.3390/s24175681. Online publication date: 31-Aug-2024
      • (2024) Analysis of Learner’s Emotional Engagement in Online Learning Using Machine Learning Adam Robust Optimization Algorithm. Scientific Programming 2024(1). https://doi.org/10.1155/2024/8886197. Online publication date: 5-Jun-2024
      • (2024) A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations. IEEE Transactions on Multimedia 26, 776-788. https://doi.org/10.1109/TMM.2023.3271019. Online publication date: 2024
      • (2024) Incongruity-Aware Cross-Modal Attention for Audio-Visual Fusion in Dimensional Emotion Recognition. IEEE Journal of Selected Topics in Signal Processing 18(3), 444-458. https://doi.org/10.1109/JSTSP.2024.3422823. Online publication date: Apr-2024
      • (2024) Multi-grained fusion network with self-distillation for aspect-based multimodal sentiment analysis. Knowledge-Based Systems 293(C). https://doi.org/10.1016/j.knosys.2024.111724. Online publication date: 7-Jun-2024
      • (2024) Exploring contactless techniques in multimodal emotion recognition: insights into diverse applications, challenges, solutions, and prospects. Multimedia Systems 30(3). https://doi.org/10.1007/s00530-024-01302-2. Online publication date: 6-Apr-2024
      • (2024) Fusing Multimodal Streams for Improved Group Emotion Recognition in Videos. Pattern Recognition, 403-418. https://doi.org/10.1007/978-3-031-78305-0_26. Online publication date: 4-Dec-2024
      • (2024) A Spatial-Temporal Graph Convolutional Network for Video-Based Group Emotion Recognition. Pattern Recognition, 339-354. https://doi.org/10.1007/978-3-031-78201-5_22. Online publication date: 2-Dec-2024
      • (2023) Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features. Proceedings of the 25th International Conference on Multimodal Interaction, 750-754. https://doi.org/10.1145/3577190.3616546. Online publication date: 9-Oct-2023
      • (2023) Audio-Visual Group-based Emotion Recognition using Local and Global Feature Aggregation based Multi-Task Learning. Proceedings of the 25th International Conference on Multimodal Interaction, 741-745. https://doi.org/10.1145/3577190.3616544. Online publication date: 9-Oct-2023
