DOI: 10.1145/3529466.3529488
Research Article

Speech Emotion Recognition Exploiting ASR-based and Phonological Knowledge Representations

Published: 04 June 2022

Abstract

Speech emotion recognition (SER) is a challenging problem due to the scarcity of labeled emotion data. This paper addresses the problem from two directions. First, we exploit two levels of speech representations for the SER task: automatic speech recognition (ASR)-based representations and phonological knowledge representations. Second, we use transfer learning, pre-training models on large corpora from non-SER tasks and transferring that knowledge to SER. Our system is divided into two parts: a two-representation learning module and an SER module. We fuse acoustic features with the ASR-based and phonological knowledge representations, both extracted from pre-trained models, and use the fused features for SER training. A novel multi-task learning approach is then proposed in which a shared-encoder, multi-decoder model learns the phonological knowledge representations. The Conformer structure is introduced for the SER task, and our study indicates that it is effective for SER. Experimental results on IEMOCAP show that the proposed method achieves 77.35% weighted accuracy and 77.99% unweighted accuracy.
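
As a rough illustration of the pipeline the abstract describes, the sketch below fuses frame-level acoustic features with ASR-based and phonological knowledge representations and classifies emotion with a Conformer encoder. It is not the authors' code: the module name, feature dimensions, mean pooling, and the use of torchaudio's Conformer implementation are all illustrative assumptions.

# Minimal sketch (not the authors' implementation): fuse acoustic features with
# ASR-based and phonological representations, then classify emotion with a
# Conformer encoder. All dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn
from torchaudio.models import Conformer


class FusionSERModel(nn.Module):
    def __init__(self, acoustic_dim=80, asr_dim=256, phono_dim=128,
                 fused_dim=256, num_emotions=4):
        super().__init__()
        # Project the concatenated representations to a common fused dimension.
        self.fusion = nn.Linear(acoustic_dim + asr_dim + phono_dim, fused_dim)
        # Conformer encoder standing in for the paper's SER module.
        self.encoder = Conformer(input_dim=fused_dim, num_heads=4, ffn_dim=512,
                                 num_layers=4, depthwise_conv_kernel_size=31)
        self.classifier = nn.Linear(fused_dim, num_emotions)

    def forward(self, acoustic, asr_repr, phono_repr, lengths):
        # All inputs are frame-aligned: (batch, time, dim).
        fused = self.fusion(torch.cat([acoustic, asr_repr, phono_repr], dim=-1))
        encoded, out_lengths = self.encoder(fused, lengths)
        # Mean-pool over valid frames, then predict the emotion class.
        mask = (torch.arange(encoded.size(1), device=lengths.device)[None, :]
                < out_lengths[:, None]).unsqueeze(-1).float()
        pooled = (encoded * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        return self.classifier(pooled)


if __name__ == "__main__":
    model = FusionSERModel()
    B, T = 2, 120
    logits = model(torch.randn(B, T, 80), torch.randn(B, T, 256),
                   torch.randn(B, T, 128), torch.tensor([120, 95]))
    print(logits.shape)  # torch.Size([2, 4])

In this sketch the pre-trained ASR and phonological encoders are assumed to run upstream and produce frame-aligned representations; the multi-task shared-encoder/multi-decoder training described in the abstract is not shown.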

Supplementary Material

Presentation slides (p216-supplement.pptx)


Published In

ICIAI '22: Proceedings of the 2022 6th International Conference on Innovation in Artificial Intelligence
March 2022
240 pages
ISBN:9781450395502
DOI:10.1145/3529466

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Multi-task learning
  2. Speech emotion recognition
  3. Transfer learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Science and Technology Innovation Foundation of Shenzhen

Conference

ICIAI 2022

