Hybrid data augmentation and deep attention-based dilated convolutional-recurrent neural networks for speech emotion recognition

Published: 15 November 2023
  Abstract

    Recently, speech emotion recognition (SER) has become an active research area in speech processing, particularly with the advent of deep learning (DL). Numerous DL-based methods have been proposed for SER; however, most existing DL-based models are complex and require large amounts of data to achieve good performance. In this study, a new framework of deep attention-based dilated convolutional-recurrent neural networks, coupled with a hybrid data augmentation method, was proposed to address SER tasks. The hybrid data augmentation method is an upsampling technique that generates additional speech samples using both traditional signal-processing and generative adversarial network approaches. By leveraging convolutional and recurrent neural networks in dilated form together with an attention mechanism, the proposed DL framework extracts high-level representations from three-dimensional log-Mel spectrogram features. The dilated convolutional neural networks provide larger receptive fields, whereas the dilated recurrent neural networks capture complex temporal dependencies and alleviate the vanishing and exploding gradient issues. Furthermore, the loss function is reconfigured by combining the SoftMax loss with center-based losses to classify the emotional states. The proposed framework was implemented in the Python programming language with the TensorFlow deep learning library. To validate it, the EmoDB and ERC benchmark datasets, which are imbalanced and/or small, were employed. The experimental results indicate that the proposed framework outperforms related state-of-the-art methods, yielding the highest unweighted recall rates of 88.03 ± 1.39% and 66.56 ± 0.67% for the EmoDB and ERC datasets, respectively.
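
    As a rough illustration of the traditional half of the hybrid data augmentation, the sketch below generates additional copies of a speech waveform by pitch shifting, time stretching, and adding low-level noise. The use of librosa and the parameter ranges are assumptions made for illustration; the GAN-based upsampling half of the hybrid method is not shown here.

```python
import numpy as np
import librosa

def augment_waveform(y, sr=16000, rng=None):
    """Generate extra copies of a speech waveform y (float array) by
    pitch shifting, time stretching, and additive noise.
    Parameter ranges are illustrative assumptions, not the paper's settings."""
    rng = rng or np.random.default_rng()
    augmented = []
    # Pitch shift by a random number of semitones.
    augmented.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0)))
    # Time stretch by a random rate around 1.0 (pitch is preserved).
    augmented.append(librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1)))
    # Add low-amplitude Gaussian noise.
    augmented.append(y + 0.005 * rng.standard_normal(len(y)).astype(y.dtype))
    return augmented
```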
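
    The three-dimensional log-Mel spectrogram features can be read as a static log-Mel spectrogram stacked with its first- and second-order deltas as three channels. The following is a minimal sketch of such an extraction, assuming librosa and typical parameter values rather than the paper's exact configuration.

```python
import numpy as np
import librosa

def log_mel_3d(path, sr=16000, n_mels=64, n_fft=1024, hop_length=256):
    """Return an (n_mels, frames, 3) array: static log-Mel spectrogram plus
    its delta and delta-delta as two extra channels. All parameter values
    here are illustrative assumptions."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                  # static channel
    delta1 = librosa.feature.delta(log_mel, order=1)    # first derivative
    delta2 = librosa.feature.delta(log_mel, order=2)    # second derivative
    return np.stack([log_mel, delta1, delta2], axis=-1)
```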
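
    The core architecture combines dilated convolutions, recurrent layers, and attention pooling over time. A highly simplified TensorFlow/Keras sketch is given below; the layer counts, sizes, and dilation rates are assumptions, and a standard bidirectional LSTM stands in for the paper's dilated recurrent layers.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_attention_dilated_crnn(input_shape=(64, 300, 3), num_classes=7):
    """Simplified attention-based dilated convolutional-recurrent model.
    Shapes and hyperparameters are illustrative assumptions."""
    inputs = layers.Input(shape=input_shape)                 # (mel bins, frames, 3)
    x = layers.Conv2D(32, 3, padding="same", dilation_rate=1, activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(64, 3, padding="same", dilation_rate=2, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    # Treat the time frames as the sequence axis for the recurrent layers.
    x = layers.Permute((2, 1, 3))(x)                         # (frames, mel bins, channels)
    x = layers.Reshape((-1, x.shape[2] * x.shape[3]))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # Simple additive attention pooling over the time axis.
    scores = layers.Dense(1, activation="tanh")(x)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])
    outputs = layers.Dense(num_classes, activation="softmax")(context)
    return models.Model(inputs, outputs)
```

    The model can be compiled and trained like any other Keras classifier; the attention weights determine how much each time frame contributes to the utterance-level emotion decision.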
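
    The classification objective combines the SoftMax (cross-entropy) loss with a center-based loss that pulls embeddings of the same emotion class toward a learned class center. A minimal TensorFlow sketch of such a combined loss follows; the weighting factor and the center update rate are assumptions.

```python
import tensorflow as tf

class SoftmaxCenterLoss(tf.keras.layers.Layer):
    """Combined SoftMax (cross-entropy) and center loss.
    `alpha` (center-loss weight) and `center_lr` (center update rate)
    are illustrative assumptions."""
    def __init__(self, num_classes, feat_dim, alpha=0.5, center_lr=0.5, **kwargs):
        super().__init__(**kwargs)
        self.alpha = alpha
        self.center_lr = center_lr
        self.centers = self.add_weight(name="centers",
                                       shape=(num_classes, feat_dim),
                                       initializer="zeros", trainable=False)

    def call(self, features, labels, logits):
        # Cross-entropy (SoftMax) term on the classifier logits; labels are integer class ids.
        ce = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))
        # Center term: squared distance between each embedding and its class center.
        centers_batch = tf.gather(self.centers, labels)
        center = tf.reduce_mean(tf.reduce_sum(tf.square(features - centers_batch), axis=1))
        # Move the selected class centers toward the current batch embeddings.
        update = self.center_lr * (centers_batch - features)
        self.centers.assign(tf.tensor_scatter_nd_sub(
            self.centers, tf.expand_dims(labels, 1), update))
        return ce + self.alpha * center
```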

    Highlights

    Create a new hybrid data augmentation method to generate speech signals.
    Propose mADCRNN with attention-based dilated CNNs and dilated LSTMs.
    Evaluate model parameters and data augmentation methods for SER.
    Achieve good unweighted recall rates on benchmark SER datasets.


    Cited By

    • (2024) Electroencephalogram-based emotion recognition using factorization temporal separable convolution network. Engineering Applications of Artificial Intelligence, 133(PA). https://doi.org/10.1016/j.engappai.2024.108011 (online publication date: 1-Jul-2024).

    Published In

    Expert Systems with Applications: An International Journal, Volume 230, Issue C
    Nov 2023
    1487 pages

    Publisher

    Pergamon Press, Inc.

    United States

    Author Tags

    1. Speech emotion recognition
    2. Mel spectrogram features
    3. Generative adversarial networks
    4. Attention mechanism
    5. Dilated convolutional neural networks
    6. Dilated recurrent neural networks
    7. Long short-term memory
    8. Hybrid data augmentation
    9. Short-time Fourier transform

    Qualifiers

    • Research-article
