DOI: 10.1145/3382507.3418813
Research Article · Public Access

Speaker-Invariant Adversarial Domain Adaptation for Emotion Recognition

Published: 22 October 2020

Abstract

Automatic emotion recognition methods are sensitive to variations across datasets, and their performance drops when they are evaluated across corpora. Domain adaptation techniques such as the Domain-Adversarial Neural Network (DANN) can mitigate this problem. However, although the DANN can detect and remove the bias between corpora, the bias between speakers remains and reduces performance. In this paper, we propose the Speaker-Invariant Domain-Adversarial Neural Network (SIDANN) to reduce both the domain bias and the speaker bias. Specifically, building on the DANN, we add a speaker discriminator with a gradient reversal layer (GRL) to unlearn information that encodes speakers' individual characteristics. Our experiments with multimodal data (speech, vision, and text) and cross-domain evaluation show that the proposed SIDANN outperforms the DANN model (+5.6% and +2.8% on average for detecting arousal and valence, respectively), suggesting that the SIDANN adapts across domains better than the DANN. In addition, a modality-contribution analysis shows that acoustic features are the most informative for arousal detection, while lexical features perform best for valence detection.
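The mechanism behind both discriminators is the gradient reversal layer: it acts as the identity in the forward pass and negates (and scales) the gradient in the backward pass, so the shared encoder is pushed to discard whatever information the attached discriminator exploits. The following is a minimal PyTorch sketch of that idea as the abstract describes it; the module sizes, the names (SIDANNSketch, n_speakers), and the equal loss weighting are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negated, scaled gradient on the way back.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class SIDANNSketch(nn.Module):
    # Hypothetical sizes: a fused multimodal feature vector passes through a
    # shared encoder; one head classifies emotion, and two adversarial heads
    # sit behind GRLs and try to recover the corpus and the speaker.
    def __init__(self, feat_dim=128, n_domains=2, n_speakers=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.emotion_head = nn.Linear(64, 2)           # e.g., low vs. high arousal
        self.domain_head = nn.Linear(64, n_domains)    # source vs. target corpus
        self.speaker_head = nn.Linear(64, n_speakers)  # speaker identity

    def forward(self, x, lambd=1.0):
        h = self.encoder(x)
        return (self.emotion_head(h),
                self.domain_head(grad_reverse(h, lambd)),
                self.speaker_head(grad_reverse(h, lambd)))

Training would minimize the emotion loss plus the two adversarial cross-entropy losses; because of the GRLs, the discriminators improve at their tasks while the encoder is driven toward domain- and speaker-invariant features. (In the original DANN recipe, the GRL weight lambda is ramped up from 0 during training; applying the same schedule to the speaker branch is an assumption here.)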

Supplementary Material

MP4 File (3382507.3418813.mp4)
Presentation video for "Speaker-Invariant Adversarial Domain Adaptation for Emotion Recognition," summarizing the proposed SIDANN model.

Published In

ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction
October 2020, 920 pages
ISBN: 9781450375818
DOI: 10.1145/3382507

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. domain adaptation
  2. emotion recognition
  3. multimodal learning
  4. neural networks

Conference

ICMI '20: International Conference on Multimodal Interaction
October 25-29, 2020
Virtual Event, Netherlands

Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)

