DOI: 10.1145/3240508.3240578
Research article · Open access

Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

Published: 15 October 2018

Abstract

Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
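To make the cross-modal distillation idea concrete, the following is a minimal sketch of the teacher-student training loop the abstract describes: a face-emotion teacher produces soft emotion predictions on frames of a talking-face clip, and an audio student is trained to match those predictions from the corresponding speech spectrogram, so no emotion labels for the audio are ever used. This is an illustrative PyTorch sketch only; the network architecture, number of emotion classes, temperature value, and all function names are assumptions for exposition and not the authors' released code.

```python
# Illustrative sketch of cross-modal distillation (face teacher -> voice student).
# All architectural choices and hyperparameters below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 8   # hypothetical number of emotion classes
TEMPERATURE = 2.0  # assumed softening temperature for distillation

class AudioStudent(nn.Module):
    """Toy CNN over log-mel spectrograms, standing in for the speech student."""
    def __init__(self, num_classes=NUM_EMOTIONS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, spectrogram):
        x = self.features(spectrogram).flatten(1)
        return self.classifier(x)

def distillation_step(teacher, student, faces, spectrograms, optimizer):
    """One training step: the face teacher supervises the voice student.

    `faces` and `spectrograms` come from the same talking-face clips,
    so no labelled audio is required.
    """
    with torch.no_grad():
        soft_targets = F.softmax(teacher(faces) / TEMPERATURE, dim=1)
    student_logits = student(spectrograms)
    # KL divergence between softened teacher and student distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / TEMPERATURE, dim=1),
        soft_targets,
        reduction="batchmean",
    ) * TEMPERATURE ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the teacher would be the face-emotion network trained on labelled images, and the student's penultimate activations serve as the speech emotion embedding evaluated on the external benchmarks.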


Information

Published In

MM '18: Proceedings of the 26th ACM international conference on Multimedia
October 2018
2167 pages
ISBN:9781450356657
DOI:10.1145/3240508
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2018


Author Tags

  1. cross-modal transfer
  2. speech emotion recognition

Qualifiers

  • Research-article

Funding Sources

  • EPSRC

Conference

MM '18: ACM Multimedia Conference
October 22 - 26, 2018
Seoul, Republic of Korea

Acceptance Rates

MM '18 paper acceptance rate: 209 of 757 submissions (28%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 413
  • Downloads (last 6 weeks): 55
Reflects downloads up to 10 Feb 2025

Cited By
  • (2025) CMDistill: Cross-Modal Distillation Framework for AAV Image Object Detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 18, 1395-1409. DOI: 10.1109/JSTARS.2024.3479717. Online publication date: 2025.
  • (2025) ProxyLabel: A framework to evaluate techniques for survey fatigue reduction leveraging auxiliary modalities. Expert Systems with Applications, 265, 125913. DOI: 10.1016/j.eswa.2024.125913. Online publication date: Mar-2025.
  • (2024) A study on automatic identification of students' emotional states using convolutional neural networks. Applied Mathematics and Nonlinear Sciences, 9(1). DOI: 10.2478/amns-2024-3430. Online publication date: 25-Nov-2024.
  • (2024) The Role of Coherent Robot Behavior and Embodiment in Emotion Perception and Recognition During Human-Robot Interaction: Experimental Study. JMIR Human Factors, 11, e45494. DOI: 10.2196/45494. Online publication date: 26-Jan-2024.
  • (2024) Air Traffic Flow Prediction with Spatiotemporal Knowledge Distillation Network. Journal of Advanced Transportation, 2024(1). DOI: 10.1155/2024/4349402. Online publication date: 15-May-2024.
  • (2024) FedCMD: A Federated Cross-modal Knowledge Distillation for Drivers' Emotion Recognition. ACM Transactions on Intelligent Systems and Technology, 15(3), 1-27. DOI: 10.1145/3650040. Online publication date: 1-Mar-2024.
  • (2024) Utilizing Greedy Nature for Multimodal Conditional Image Synthesis in Transformers. IEEE Transactions on Multimedia, 26, 2354-2366. DOI: 10.1109/TMM.2023.3295094. Online publication date: 1-Jan-2024.
  • (2024) A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations. IEEE Transactions on Multimedia, 26, 776-788. DOI: 10.1109/TMM.2023.3271019. Online publication date: 1-Jan-2024.
  • (2024) Cross-Layer Contrastive Learning of Latent Semantics for Facial Expression Recognition. IEEE Transactions on Image Processing, 33, 2514-2529. DOI: 10.1109/TIP.2024.3378459. Online publication date: 2024.
  • (2024) FG-AGR: Fine-Grained Associative Graph Representation for Facial Expression Recognition in the Wild. IEEE Transactions on Circuits and Systems for Video Technology, 34(2), 882-896. DOI: 10.1109/TCSVT.2023.3237006. Online publication date: Feb-2024.
