DOI: 10.1145/3382507.3418813
Research Article · Public Access

Speaker-Invariant Adversarial Domain Adaptation for Emotion Recognition

Published: 22 October 2020

Abstract

Automatic emotion recognition methods are sensitive to variations across datasets, and their performance drops when they are evaluated across corpora. Domain adaptation techniques such as the Domain-Adversarial Neural Network (DANN) can mitigate this problem. However, although the DANN can detect and remove the bias between corpora, the bias between speakers remains and reduces performance. In this paper, we propose the Speaker-Invariant Domain-Adversarial Neural Network (SIDANN) to reduce both the domain bias and the speaker bias. Specifically, building on the DANN, we add a speaker discriminator with a gradient reversal layer (GRL) to unlearn information that encodes speakers' individual characteristics. Our experiments with multimodal data (speech, vision, and text) and cross-domain evaluation show that the proposed SIDANN outperforms the DANN model (+5.6% and +2.8% on average for detecting arousal and valence, respectively), suggesting that the SIDANN adapts across domains better than the DANN. In addition, a modality-contribution analysis shows that acoustic features are the most informative for arousal detection, while lexical features perform best for valence detection.
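The mechanism behind both discriminators is the gradient reversal layer: it acts as the identity in the forward pass and negates (and scales) the gradient in the backward pass, so the shared encoder is pushed to discard whatever information the attached discriminator exploits. The following is a minimal PyTorch sketch of that idea as the abstract describes it; the module sizes, the names (SIDANNSketch, n_speakers), and the equal loss weighting are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negated, scaled gradient on the way back.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class SIDANNSketch(nn.Module):
    # Hypothetical sizes: a fused multimodal feature vector passes through a
    # shared encoder; one head classifies emotion, and two adversarial heads
    # sit behind GRLs and try to recover the corpus and the speaker.
    def __init__(self, feat_dim=128, n_domains=2, n_speakers=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.emotion_head = nn.Linear(64, 2)           # e.g., low vs. high arousal
        self.domain_head = nn.Linear(64, n_domains)    # source vs. target corpus
        self.speaker_head = nn.Linear(64, n_speakers)  # speaker identity

    def forward(self, x, lambd=1.0):
        h = self.encoder(x)
        return (self.emotion_head(h),
                self.domain_head(grad_reverse(h, lambd)),
                self.speaker_head(grad_reverse(h, lambd)))

Training would minimize the emotion loss plus the two adversarial cross-entropy losses; because of the GRLs, the discriminators improve at their tasks while the encoder is driven toward domain- and speaker-invariant features. (In the original DANN recipe, the GRL weight lambda is ramped up from 0 during training; applying the same schedule to the speaker branch is an assumption here.)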

Supplementary Material

MP4 File (3382507.3418813.mp4)
Presentation video for "Speaker-Invariant Adversarial Domain Adaptation for Emotion Recognition," summarizing the proposed SIDANN model.

Published In

ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction
October 2020, 920 pages
ISBN: 9781450375818
DOI: 10.1145/3382507

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. domain adaptation
  2. emotion recognition
  3. multimodal learning
  4. neural networks

Conference

ICMI '20: International Conference on Multimodal Interaction
October 25-29, 2020
Virtual Event, Netherlands

Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)

