DOI: 10.1145/3529466.3529488
Research Article

Speech Emotion Recognition Exploiting ASR-based and Phonological Knowledge Representations

Published: 04 June 2022

Abstract

Speech emotion recognition (SER) is a challenging problem due to the scarcity of labeled emotion data. This paper addresses the problem from two directions. First, we exploit two levels of speech representations for the SER task: automatic speech recognition (ASR)-based representations and phonological knowledge representations. Second, we use transfer learning, pre-training models on large corpora from non-SER tasks and transferring that knowledge to SER. Our system is divided into two parts: a two-representation learning module and an SER module. We fuse acoustic features with the ASR-based and phonological knowledge representations, both extracted from pre-trained models, and use the fused features for SER training. A novel multi-task learning approach is then proposed in which a shared-encoder, multi-decoder model learns the phonological knowledge representations. The Conformer structure is introduced for the SER task, and our study indicates that it is effective for SER. Experimental results on IEMOCAP show that the proposed method achieves 77.35% weighted accuracy and 77.99% unweighted accuracy.
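
As a rough illustration of the pipeline the abstract describes, the sketch below fuses frame-level acoustic features with ASR-based and phonological knowledge representations and classifies emotion with a Conformer encoder. It is not the authors' code: the module name, feature dimensions, mean pooling, and the use of torchaudio's Conformer implementation are all illustrative assumptions.

# Minimal sketch (not the authors' implementation): fuse acoustic features with
# ASR-based and phonological representations, then classify emotion with a
# Conformer encoder. All dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn
from torchaudio.models import Conformer


class FusionSERModel(nn.Module):
    def __init__(self, acoustic_dim=80, asr_dim=256, phono_dim=128,
                 fused_dim=256, num_emotions=4):
        super().__init__()
        # Project the concatenated representations to a common fused dimension.
        self.fusion = nn.Linear(acoustic_dim + asr_dim + phono_dim, fused_dim)
        # Conformer encoder standing in for the paper's SER module.
        self.encoder = Conformer(input_dim=fused_dim, num_heads=4, ffn_dim=512,
                                 num_layers=4, depthwise_conv_kernel_size=31)
        self.classifier = nn.Linear(fused_dim, num_emotions)

    def forward(self, acoustic, asr_repr, phono_repr, lengths):
        # All inputs are frame-aligned: (batch, time, dim).
        fused = self.fusion(torch.cat([acoustic, asr_repr, phono_repr], dim=-1))
        encoded, out_lengths = self.encoder(fused, lengths)
        # Mean-pool over valid frames, then predict the emotion class.
        mask = (torch.arange(encoded.size(1), device=lengths.device)[None, :]
                < out_lengths[:, None]).unsqueeze(-1).float()
        pooled = (encoded * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        return self.classifier(pooled)


if __name__ == "__main__":
    model = FusionSERModel()
    B, T = 2, 120
    logits = model(torch.randn(B, T, 80), torch.randn(B, T, 256),
                   torch.randn(B, T, 128), torch.tensor([120, 95]))
    print(logits.shape)  # torch.Size([2, 4])

In this sketch the pre-trained ASR and phonological encoders are assumed to run upstream and produce frame-aligned representations; the multi-task shared-encoder/multi-decoder training described in the abstract is not shown.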

Supplementary Material

Presentation slides (p216-supplement.pptx)


Published In

ICIAI '22: Proceedings of the 2022 6th International Conference on Innovation in Artificial Intelligence
March 2022
240 pages
ISBN:9781450395502
DOI:10.1145/3529466

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Multi-task learning
  2. Speech emotion recognition
  3. Transfer learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Science and Technology Innovation Foundation of Shenzhen

Conference

ICIAI 2022

