DOI: 10.1145/3240508.3240578
Research article · Open access

Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

Published: 15 October 2018

Abstract

Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
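To make the cross-modal distillation idea concrete, the following is a minimal sketch of the teacher-student training loop the abstract describes: a face-emotion teacher produces soft emotion predictions on frames of a talking-face clip, and an audio student is trained to match those predictions from the corresponding speech spectrogram, so no emotion labels for the audio are ever used. This is an illustrative PyTorch sketch only; the network architecture, number of emotion classes, temperature value, and all function names are assumptions for exposition and not the authors' released code.

```python
# Illustrative sketch of cross-modal distillation (face teacher -> voice student).
# All architectural choices and hyperparameters below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 8   # hypothetical number of emotion classes
TEMPERATURE = 2.0  # assumed softening temperature for distillation

class AudioStudent(nn.Module):
    """Toy CNN over log-mel spectrograms, standing in for the speech student."""
    def __init__(self, num_classes=NUM_EMOTIONS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, spectrogram):
        x = self.features(spectrogram).flatten(1)
        return self.classifier(x)

def distillation_step(teacher, student, faces, spectrograms, optimizer):
    """One training step: the face teacher supervises the voice student.

    `faces` and `spectrograms` come from the same talking-face clips,
    so no labelled audio is required.
    """
    with torch.no_grad():
        soft_targets = F.softmax(teacher(faces) / TEMPERATURE, dim=1)
    student_logits = student(spectrograms)
    # KL divergence between softened teacher and student distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / TEMPERATURE, dim=1),
        soft_targets,
        reduction="batchmean",
    ) * TEMPERATURE ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the teacher would be the face-emotion network trained on labelled images, and the student's penultimate activations serve as the speech emotion embedding evaluated on the external benchmarks.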


Information

Published In

MM '18: Proceedings of the 26th ACM international conference on Multimedia
October 2018
2167 pages
ISBN:9781450356657
DOI:10.1145/3240508
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2018


Author Tags

  1. cross-modal transfer
  2. speech emotion recognition

Qualifiers

  • Research-article

Funding Sources

  • EPSRC

Conference

MM '18: ACM Multimedia Conference
October 22 - 26, 2018
Seoul, Republic of Korea

Acceptance Rates

MM '18 paper acceptance rate: 209 of 757 submissions (28%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 413
  • Downloads (last 6 weeks): 55
Reflects downloads up to 10 Feb 2025

Cited By
  • (2025) CMDistill: Cross-Modal Distillation Framework for AAV Image Object Detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 18, 1395-1409. DOI: 10.1109/JSTARS.2024.3479717. Online publication date: 2025.
  • (2025) ProxyLabel: A framework to evaluate techniques for survey fatigue reduction leveraging auxiliary modalities. Expert Systems with Applications, 265, 125913. DOI: 10.1016/j.eswa.2024.125913. Online publication date: Mar-2025.
  • (2024) A study on automatic identification of students' emotional states using convolutional neural networks. Applied Mathematics and Nonlinear Sciences, 9(1). DOI: 10.2478/amns-2024-3430. Online publication date: 25-Nov-2024.
  • (2024) The Role of Coherent Robot Behavior and Embodiment in Emotion Perception and Recognition During Human-Robot Interaction: Experimental Study. JMIR Human Factors, 11, e45494. DOI: 10.2196/45494. Online publication date: 26-Jan-2024.
  • (2024) Air Traffic Flow Prediction with Spatiotemporal Knowledge Distillation Network. Journal of Advanced Transportation, 2024(1). DOI: 10.1155/2024/4349402. Online publication date: 15-May-2024.
  • (2024) FedCMD: A Federated Cross-modal Knowledge Distillation for Drivers' Emotion Recognition. ACM Transactions on Intelligent Systems and Technology, 15(3), 1-27. DOI: 10.1145/3650040. Online publication date: 1-Mar-2024.
  • (2024) Utilizing Greedy Nature for Multimodal Conditional Image Synthesis in Transformers. IEEE Transactions on Multimedia, 26, 2354-2366. DOI: 10.1109/TMM.2023.3295094. Online publication date: 1-Jan-2024.
  • (2024) A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations. IEEE Transactions on Multimedia, 26, 776-788. DOI: 10.1109/TMM.2023.3271019. Online publication date: 1-Jan-2024.
  • (2024) Cross-Layer Contrastive Learning of Latent Semantics for Facial Expression Recognition. IEEE Transactions on Image Processing, 33, 2514-2529. DOI: 10.1109/TIP.2024.3378459. Online publication date: 2024.
  • (2024) FG-AGR: Fine-Grained Associative Graph Representation for Facial Expression Recognition in the Wild. IEEE Transactions on Circuits and Systems for Video Technology, 34(2), 882-896. DOI: 10.1109/TCSVT.2023.3237006. Online publication date: Feb-2024.
