Hybrid data augmentation and deep attention-based dilated convolutional-recurrent neural networks for speech emotion recognition

Published: 15 November 2023
  Abstract

    Recently, speech emotion recognition (SER) has become an active research area in speech processing, particularly with the advent of deep learning (DL). Numerous DL-based methods have been proposed for SER; however, most existing DL-based models are complex and require large amounts of data to achieve good performance. In this study, a new framework of deep attention-based dilated convolutional-recurrent neural networks, coupled with a hybrid data augmentation method, was proposed to address SER tasks. The hybrid data augmentation method is an upsampling technique that generates additional speech samples using both traditional signal-processing and generative adversarial network approaches. By leveraging convolutional and recurrent neural networks in dilated form together with an attention mechanism, the proposed DL framework extracts high-level representations from three-dimensional log-Mel spectrogram features. The dilated convolutional neural networks provide larger receptive fields, whereas the dilated recurrent neural networks capture complex temporal dependencies and alleviate the vanishing and exploding gradient issues. Furthermore, the loss function is reconfigured by combining the SoftMax loss with center-based losses to classify the emotional states. The proposed framework was implemented in the Python programming language with the TensorFlow deep learning library. To validate it, the EmoDB and ERC benchmark datasets, which are imbalanced and/or small, were employed. The experimental results indicate that the proposed framework outperforms related state-of-the-art methods, yielding the highest unweighted recall rates of 88.03 ± 1.39% and 66.56 ± 0.67% for the EmoDB and ERC datasets, respectively.
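
    As a rough illustration of the traditional half of the hybrid data augmentation, the sketch below generates additional copies of a speech waveform by pitch shifting, time stretching, and adding low-level noise. The use of librosa and the parameter ranges are assumptions made for illustration; the GAN-based upsampling half of the hybrid method is not shown here.

```python
import numpy as np
import librosa

def augment_waveform(y, sr=16000, rng=None):
    """Generate extra copies of a speech waveform y (float array) by
    pitch shifting, time stretching, and additive noise.
    Parameter ranges are illustrative assumptions, not the paper's settings."""
    rng = rng or np.random.default_rng()
    augmented = []
    # Pitch shift by a random number of semitones.
    augmented.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0)))
    # Time stretch by a random rate around 1.0 (pitch is preserved).
    augmented.append(librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1)))
    # Add low-amplitude Gaussian noise.
    augmented.append(y + 0.005 * rng.standard_normal(len(y)).astype(y.dtype))
    return augmented
```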
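
    The three-dimensional log-Mel spectrogram features can be read as a static log-Mel spectrogram stacked with its first- and second-order deltas as three channels. The following is a minimal sketch of such an extraction, assuming librosa and typical parameter values rather than the paper's exact configuration.

```python
import numpy as np
import librosa

def log_mel_3d(path, sr=16000, n_mels=64, n_fft=1024, hop_length=256):
    """Return an (n_mels, frames, 3) array: static log-Mel spectrogram plus
    its delta and delta-delta as two extra channels. All parameter values
    here are illustrative assumptions."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                  # static channel
    delta1 = librosa.feature.delta(log_mel, order=1)    # first derivative
    delta2 = librosa.feature.delta(log_mel, order=2)    # second derivative
    return np.stack([log_mel, delta1, delta2], axis=-1)
```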
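
    The core architecture combines dilated convolutions, recurrent layers, and attention pooling over time. A highly simplified TensorFlow/Keras sketch is given below; the layer counts, sizes, and dilation rates are assumptions, and a standard bidirectional LSTM stands in for the paper's dilated recurrent layers.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_attention_dilated_crnn(input_shape=(64, 300, 3), num_classes=7):
    """Simplified attention-based dilated convolutional-recurrent model.
    Shapes and hyperparameters are illustrative assumptions."""
    inputs = layers.Input(shape=input_shape)                 # (mel bins, frames, 3)
    x = layers.Conv2D(32, 3, padding="same", dilation_rate=1, activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(64, 3, padding="same", dilation_rate=2, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    # Treat the time frames as the sequence axis for the recurrent layers.
    x = layers.Permute((2, 1, 3))(x)                         # (frames, mel bins, channels)
    x = layers.Reshape((-1, x.shape[2] * x.shape[3]))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # Simple additive attention pooling over the time axis.
    scores = layers.Dense(1, activation="tanh")(x)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])
    outputs = layers.Dense(num_classes, activation="softmax")(context)
    return models.Model(inputs, outputs)
```

    The model can be compiled and trained like any other Keras classifier; the attention weights determine how much each time frame contributes to the utterance-level emotion decision.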
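
    The classification objective combines the SoftMax (cross-entropy) loss with a center-based loss that pulls embeddings of the same emotion class toward a learned class center. A minimal TensorFlow sketch of such a combined loss follows; the weighting factor and the center update rate are assumptions.

```python
import tensorflow as tf

class SoftmaxCenterLoss(tf.keras.layers.Layer):
    """Combined SoftMax (cross-entropy) and center loss.
    `alpha` (center-loss weight) and `center_lr` (center update rate)
    are illustrative assumptions."""
    def __init__(self, num_classes, feat_dim, alpha=0.5, center_lr=0.5, **kwargs):
        super().__init__(**kwargs)
        self.alpha = alpha
        self.center_lr = center_lr
        self.centers = self.add_weight(name="centers",
                                       shape=(num_classes, feat_dim),
                                       initializer="zeros", trainable=False)

    def call(self, features, labels, logits):
        # Cross-entropy (SoftMax) term on the classifier logits; labels are integer class ids.
        ce = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))
        # Center term: squared distance between each embedding and its class center.
        centers_batch = tf.gather(self.centers, labels)
        center = tf.reduce_mean(tf.reduce_sum(tf.square(features - centers_batch), axis=1))
        # Move the selected class centers toward the current batch embeddings.
        update = self.center_lr * (centers_batch - features)
        self.centers.assign(tf.tensor_scatter_nd_sub(
            self.centers, tf.expand_dims(labels, 1), update))
        return ce + self.alpha * center
```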

    Highlights

    Create a new hybrid data augmentation method to generate speech signals.
    Propose mADCRNN with attention-based dilated CNNs and dilated LSTMs.
    Evaluate model parameters and data augmentation methods for SER.
    Achieve good unweighted recall rates on benchmark SER datasets.


    Cited By

    • (2024) Electroencephalogram-based emotion recognition using factorization temporal separable convolution network. Engineering Applications of Artificial Intelligence, 133(PA). https://doi.org/10.1016/j.engappai.2024.108011 (online publication date: 1-Jul-2024).

    Published In

    Expert Systems with Applications: An International Journal, Volume 230, Issue C
    Nov 2023
    1487 pages

    Publisher

    Pergamon Press, Inc.

    United States

    Author Tags

    1. Speech emotion recognition
    2. Mel spectrogram features
    3. Generative adversarial networks
    4. Attention mechanism
    5. Dilated convolutional neural networks
    6. Dilated recurrent neural networks
    7. Long short-term memory
    8. Hybrid data augmentation
    9. Short-time Fourier transform

    Qualifiers

    • Research-article
