
Volume 7, Issue 9, September – 2022 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Speech Emotion Recognition using Deep Learning


Akash Raghav, Dr. C. Lakshmi
Computer Science & Engineering, SRM IST
Chennai, India

Abstract:- The goal of this project is to detect the speaker's emotions while he or she speaks. Speech produced under a condition of fear, rage, or delight, for example, becomes very loud and fast, with a larger and more varied pitch range, whereas in a moment of grief or tiredness speech is slow and low-pitched. Voice and speech patterns can therefore be used to detect human emotions, which can help improve human-machine interactions. We present deep neural network (CNN), Support Vector Machine, and MLP classification models based on acoustic features of emotional speech, namely the Mel Frequency Cepstral Coefficients (MFCC). The models are trained on eight emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprise). Using the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset as well as the TESS (Toronto Emotional Speech Set) dataset, we found that the proposed approach achieves accuracies of 86 percent, 84 percent, and 82 percent for the eight emotions using the CNN, MLP, and SVM classifiers, respectively.

I. INTRODUCTION

Spoken language is the most common method of human interaction, the cornerstone of information sharing, and has been an important aspect of society since the dawn of time. Emotions, on the other hand, can be traced back to basic instinct before the emergence of modern spoken language, and can be regarded as one of the first forms of natural communication. Emotion analysis also has a variety of practical applications in many sectors, such as Business Process Outsourcing (BPO) centres and call centres, where it is essential for determining client satisfaction.

Each of us displays more than one basic emotion at a time, and we believe that recognizing which emotions are present and in what proportions they are mixed is exceedingly strenuous for both the speaker and the listener. Given this, we decided to create a model that recognizes only the more salient emotions in the audio track. Various methods, such as computer vision and text analytics, have been tried to classify feelings by machine. The purpose of this study is to employ Mel-frequency cepstral coefficients (MFCC) computed from pure audio data.

Speech is one of the most basic human capacities, allowing us to connect with one another, express ourselves, and, most importantly, giving us a sense of self. It is one of the most important aspects of mental and physical health. Because emotions can influence how we act and interpret events, a speaker's speech transmits both its linguistic meaning and the emotion with which it is delivered.

II. LITERATURE REVIEW

Many classification algorithms have been presented in this field of research in recent years; however, for the sake of this paper, we only looked at work done on RAVDESS. Iqbal et al. [6] developed a granular classification technique that merges Gradient Boosting, KNN, and Support Vector Machine on the RAVDESS dataset used in this study, achieving roughly 40% to 80% overall accuracy depending on the task. The proposed classifiers performed differently on different datasets. They created three datasets: one with only male recordings, one with exclusively female recordings, and one with both male and female recordings. On RAVDESS (male), the Support Vector Machine and KNN show 100 percent accuracy for both the angry and neutral classes, whereas Gradient Boosting outperforms the Support Vector Machine and KNN algorithms for happiness and sadness. The Support Vector Machine on RAVDESS (female) achieves 100 percent accuracy for anger, similar to the male counterpart. Except for sadness, the Support Vector Machine has a decent overall performance. KNN also performs well for anger and neutral, scoring 87 percent and 100 percent, respectively.

Gradient Boosting performs poorly for anger and neutral. In comparison to the other classifiers, KNN performs poorly for happiness and sadness. SVM and KNN perform far better than Gradient Boosting for anger and neutral in the mixed male and female sample, but for both happiness and sadness KNN's performance is poor. With the exception of SVM, average classifier performance is better on the male dataset than on the female dataset. SVM has a higher accuracy on the combined database than on the gender-based datasets.

Another technique, developed by Jannat et al. [7], achieved 66.41 percent accuracy on audio data alone and above 90% accuracy when integrating audio and visual data; faces and audio waveforms, in particular, are present in the preprocessed image data.

Xinzhou Xu et al. [3] modified the Spectral Regression model by combining Extreme Learning Machines (ELMs) and Subspace Learning (SL) in order to overcome the drawbacks of spectral regression-based Graph Embedding (GE) and of ELM. In Speech Emotion Recognition (SER), the GSR model is used to correctly describe the relationships among the data; for this purpose, many embedded graphs were created. The impact and practicality of the strategies were determined by a demonstration over four speech emotion corpora, in comparison with past methods such as ELM and Subspace Learning (SL) methodologies. Exploring the embedded graphs at a deeper level can help the system produce better results. Only Least-Square Regression and l2-norm minimization were used in the regression stage.

To detect depression from speech, Zhaocheng Huang et al. [4] deploy a heterogeneous token-based method, in which acoustic regions and abrupt shifts are determined both individually and jointly at the junctions between distinct embedding methods. Contributions to the detection of depression, as well as of numerous health issues that may impair voice production, were utilised. Landmarks are used to retrieve data particular to each type of articulation at a given point in time. The system is a mix of the two token types; LWs and AWs hold a wide range of information. LWs capture the sudden variations in speech articulation at the current moment, while each AW holds a part of the acoustic space in a single token per frame. The hybrid combination of the LWs and AWs allows numerous aspects to be explored, including articulatory dysfunction as well as traditional acoustic features.

For cross-corpus speech emotion recognition, Peng Song [5] proposes the Transfer Linear Subspace Learning (TLSL) paradigm; the TULSL, TSLSL, and TLSL methods were all evaluated. The goal of TLSL is to extract robust feature representations from the corpora into a learned, estimated subspace. TLSL improves on the transfer learning algorithms currently in use, which merely seek the most transferable feature components. TLSL achieves better results than the six baseline approaches with statistical significance, and TSLSL achieves even better results than TULSL; in fact, all transfer learning techniques are more accurate than traditional learning procedures. Good transfer learning approaches based on feature transformation, such as TLDA, TPCA, TNMF, and TCA, are greatly outperformed by TLSL. One of the major drawbacks of these earlier transfer learning methods is that they focus on finding the transferable components of the features while ignoring the less informative sections; when it comes to transfer learning results, the less informative parts are also important. TLSL is used for cross-corpus identification of speech emotion.

Jun Deng et al. [6] focused on unsupervised learning with autoencoders for speech emotion recognition. To combine generative and discriminative training, partially supervised learning approaches designed for situations with unlabeled data were applied. Five databases covering various situations were used to test the procedure successively. In settings with a reduced number of labeled instances, the suggested approach improves recognition performance by acquiring prior knowledge from unlabeled data. These techniques can deal with a wide range of problems and incorporate knowledge from different fields into the classifiers, resulting in excellent performance. This shows that the model can distinguish speech emotions from a mix of labeled and unlabeled data. The residual neural network revealed that dense architectures enable the classifier to extract complex structures, as in image processing.

Ying Qin et al. [7] presented a fully automated assessment system built on narrative speech from Cantonese-speaking persons with aphasia (PWA). Experiments based on the recommended text features may be able to detect linguistic impairment in aphasic speech. A Siamese network learned text features that were highly correlated with the AQ scores. The confusion network was built on an improved representation of the ASR output, which increased the robustness of the text features. There is a pressing need to improve the ASR's performance on aphasic speech in order to obtain more robust features. To apply the proposed methodology more widely, databases of disordered speech in other languages are required. As clinical practice shows, the most desirable goal is automatic classification of aphasia variants, which necessitates a significant amount of data collection.

III. IMPLEMENTATION DETAILS

A. Methodology

Fig. 1: System Design Flowchart

The emotion recognition classification models shown here use a deep learning strategy that employs CNN, SVM, and MLP classifiers. MFCC, also known as the "spectrum of a spectrum," is the single feature required to train the models.

MFCC has proven to be one of the best sound-parameterization methods for automatic speech recognition tasks. It is a variant of the Mel-frequency cepstrum (MFC), whose coefficients have been widely employed because of their ability to convey the amplitude spectrum of a sound wave in a compact vectorial form.

To obtain statistically stable waves, the audio files are divided into frames, normally determined by the size of a fixed window. The amplitude spectra are then mapped onto the reduced "Mel" frequency scale. This procedure is carried out so that the frequencies are represented in a way that closely matches how the human auditory system perceives the wave.
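The paper does not state which Mel-scale convention it uses; one widely used formulation (for example, the HTK-style convention implemented in common audio toolkits) maps a frequency f in hertz to the Mel scale as

m = 2595 * log10(1 + f / 700)

so that equal steps in m correspond roughly to equal steps in perceived pitch, which is what allows the compressed spectrum to approximate the human auditory system.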

A total of 40 features were retrieved from each audio file. To build the feature vector, each audio file was transformed into a floating-point time series, and the time series was then converted into an MFCC sequence. Finally, the MFCC array is transposed along the horizontal (time) axis and the arithmetic mean is calculated, giving one 40-value vector per file.
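As a concrete illustration of this feature-extraction step, the following minimal Python sketch (assuming the librosa library; the function name and file path are illustrative, not taken from the paper) loads an audio file as a floating-point time series, computes a 40-coefficient MFCC sequence, and averages it over time to obtain one 40x1 feature vector:

import numpy as np
import librosa

def extract_mfcc_features(path, n_mfcc=40):
    """Load one audio file and return a 40-value MFCC feature vector."""
    # Load the recording as a floating-point time series (native sample rate).
    signal, sample_rate = librosa.load(path, sr=None)
    # Compute the MFCC sequence: shape (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    # Transpose to (n_frames, n_mfcc) and take the arithmetic mean over frames,
    # giving one fixed-length 40-dimensional vector per file.
    return np.mean(mfcc.T, axis=0)

# Example usage (path is illustrative):
# features = extract_mfcc_features("Actor_01/03-01-05-01-01-01-01.wav")
# print(features.shape)  # (40,)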
B. Dataset
This task's dataset consists of 5252 samples collected from the following sources:
• RAVDESS – Ryerson Audio-Visual Database of Emotional Speech and Song
• TESS – Toronto Emotional Speech Set

The sample consists of:
RAVDESS – 1440 speech files and 1012 song files. The dataset consists of recordings of 24 professional actors (12 female and 12 male) who speak with a neutral North American accent. Happy, angry, calm, sad, fearful, disgust, and surprise expressions can be found in the speech recordings, whereas happy, angry, calm, fearful, and sad emotions can be found in the song recordings.
TESS – 2800 files. This dataset consists of two actresses, aged 26 and 64, who recited a set of around 200 target phrases in the carrier phrase "Say the word _____"; each of the seven emotions (anger, neutral, fear, pleasant surprise, disgust, sadness, and happiness) was recorded, giving 2800 stimuli in all. The two actresses are from the Toronto area, are native English speakers with a university education and a musical background, and their audiometric thresholds are within the normal range, according to audiometric testing.

The classes the model aims to predict are:
0 - Neutral
1 - Calm
2 - Happy
3 - Sad
4 - Angry
5 - Fearful
6 - Disgust
7 - Surprised

There is no calm class in TESS, hence the combined dataset is skewed: there is less data for that particular class, as evidenced by the classification report.

• Identifiers in the RAVDESS file name:
Modality – 01 = Full-AV, 02 = Video-only, 03 = Audio-only
Vocal Channel – 01 = Speech, 02 = Song
Emotion – 01 = Neutral, 02 = Calm, 03 = Happy, 04 = Sad, 05 = Angry, 06 = Fearful, 07 = Disgust, 08 = Surprised
Emotional Intensity – 01 = Normal, 02 = Strong
Statement – 01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door"
Repetition – 01 = 1st repetition, 02 = 2nd repetition
Actor – 01 to 24 (odd-numbered actors are male, even-numbered actors are female)
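As an illustration of how these identifier fields can be read off a RAVDESS file name, the short Python sketch below (the helper name and example file name are illustrative assumptions, not code from the paper) splits the seven two-digit fields and maps the third one to the emotion label used as the training target:

# RAVDESS file names encode seven two-digit fields separated by hyphens, e.g.
# "03-01-06-01-02-01-12.wav" = audio-only, speech, fearful, normal intensity,
# statement 2, repetition 1, actor 12 (even number -> female).
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def parse_ravdess_filename(filename):
    """Return a dict of the identifier fields encoded in a RAVDESS file name."""
    stem = filename.rsplit(".", 1)[0]          # drop the ".wav" extension
    fields = stem.split("-")                   # seven two-digit codes
    modality, channel, emotion, intensity, statement, repetition, actor = fields
    return {
        "modality": modality,                  # 01 full-AV, 02 video, 03 audio
        "vocal_channel": channel,              # 01 speech, 02 song
        "emotion": EMOTIONS[emotion],
        "intensity": intensity,                # 01 normal, 02 strong
        "statement": statement,
        "repetition": repetition,
        "actor": int(actor),
        "gender": "male" if int(actor) % 2 == 1 else "female",
    }

# Example: parse_ravdess_filename("03-01-06-01-02-01-12.wav")["emotion"] -> "fearful"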

C. Algorithms
• The classification task's deep neural network (CNN) is shown operationally in Fig. 1. For each audio file given as input, the network works on a 40-value feature vector; the 40 values are the compressed numerical representation of a two-second audio frame. A set of 40x1 training vectors was therefore used to run one round of a 1D CNN with a ReLU activation function, a 20% dropout, and a 2 x 2 max-pooling function. The ReLU (rectified linear unit) is formalised as g(z) = max(0, z); representing the hidden units with this function allows large values to pass through in the event of activation. Pooling lets the model focus primarily on the most relevant characteristics of each piece of input, resulting in position-invariant results. We repeated the procedure, this time adjusting the kernel size. After that, another dropout was applied, and the result was flattened to make it compatible with the subsequent layers. Finally, using a softmax activation function on one Dense (fully connected) layer, we calculated a probability distribution over the encoded classes:
Neutral - 0
Calm - 1
Happy - 2
Sad - 3
Angry - 4
Fearful - 5
Disgust - 6
Surprised - 7
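The paper does not publish its code or exact layer sizes, so the following Keras sketch is only one plausible reading of the description above (the filter counts, kernel sizes, and optimizer are assumptions): a 1D CNN over the 40x1 MFCC vector with ReLU activations, 20% dropout, max-pooling, a second convolution with a different kernel size, a flatten step, and a final Dense layer with softmax over the eight classes.

from tensorflow.keras import layers, models

def build_cnn(num_classes=8, n_features=40):
    """One plausible 1D CNN matching the description; layer sizes are assumptions."""
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),                   # 40x1 MFCC vector
        layers.Conv1D(64, kernel_size=5, activation="relu"),   # ReLU activation
        layers.Dropout(0.2),                                    # 20% dropout
        layers.MaxPooling1D(pool_size=2),                       # max-pooling
        layers.Conv1D(128, kernel_size=3, activation="relu"),   # adjusted kernel size
        layers.Dropout(0.2),
        layers.Flatten(),                                       # flatten for the dense layer
        layers.Dense(num_classes, activation="softmax"),        # class probabilities
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_cnn()
# model.fit(X_train[..., None], y_train, epochs=200,
#           validation_data=(X_test[..., None], y_test))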
Table 1: Model Summary

• The Multilayer Perceptron (MLP) is a feed-forward artificial neural network (ANN). During training, MLP uses back-propagation as a supervised learning strategy. MLP is distinguished from a linear perceptron by its multiple layers and non-linear activations, which give it the ability to classify data that is not linearly separable.

Table 2: MLP model result on the test set

• The Support Vector Machine (SVM) is a supervised machine learning technique for solving classification and regression problems; it is, however, mostly used to tackle classification questions. Every data item is represented as a point in n-dimensional space (where n denotes the number of features), with the value of each feature being the value of a particular coordinate. To prevent attributes with larger numeric ranges from dominating those with smaller ranges, the data may be scaled before it is used with an SVM classifier; scaling also helps to prevent some arithmetic issues during the calculation.
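For the MLP and SVM baselines just described, a minimal scikit-learn sketch might look as follows (the hyper-parameters are illustrative assumptions, not values reported in the paper); note the feature scaling applied before the SVM, as discussed above, and the per-class report that corresponds to the precision, recall, and F1 values in Tables 2 and 3.

from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# X_train, X_test: arrays of 40-dimensional MFCC vectors; y_train, y_test: labels 0-7.
# (These arrays are assumed to have been built with the feature-extraction step above.)

# MLP baseline: a feed-forward network trained with back-propagation.
mlp = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)

# SVM baseline: scale features first so large-range attributes do not dominate.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
svm.fit(X_train, y_train)

# Per-class precision, recall, and F1 on the test set.
print(classification_report(y_test, mlp.predict(X_test)))
print(classification_report(y_test, svm.predict(X_test)))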

Table 3: SVM model result on the test set

IV. RESULT AND DISCUSSION

On the RAVDESS and TESS datasets, the evaluation results show that the model is effective when compared against the baselines and the state of the art.

For each of the emotional classes, Table 4 shows the precision, recall, and F1 values obtained. The results show that precision and recall are very well balanced, allowing us to obtain F1 values that are evenly dispersed around the value 0.85 for practically all classes. The model's robustness is demonstrated by the limited range of F1 values, which efficiently classify emotions into eight separate categories. The model is less accurate on the classes "Calm" and "Disgust," which is understandable given that these are the most difficult to discern not only from speech but also from facial expressions or written language, as indicated in the Introduction.

Table 4: CNN model result on the test set

We decided to examine the findings obtained from two additional methods, namely the SVM and MLP classifiers, to determine the efficacy of the emotion classification presented in this study.

On all classes, our model's F1 values outperform the baselines and the competition, as shown in Table 5. However, it is important to note that the performance loss is minor and was incurred in order to avoid overfitting. It is common knowledge that as the number of classes increases, the classification task becomes more complex and less accurate.

Fig. 2: Cost Function

Fig. 3: Accuracy

Table 5: Each class's F1-score compared to the baselines (SVM, MLP)

Nonetheless, the CNN-MFCC model presented here achieves an F1 score that is comparable on the two tasks we were given. Figures 2 and 3 provide another indicator of model reliability. Up to the 200th epoch, the value of the loss (the model's error) on both the test and training sets tends to decrease. From the 100th epoch onward, the decline is less pronounced, but it is still noticeable.

In Fig. 3, the average value of accuracy across all classes is shown, which, in contrast to the loss, increases as the number of epochs grows. These figures are nearly identical between the training and test datasets, demonstrating that the model was not overfitted during training. The results are consistent with the F1 scores discussed previously.
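As a sketch of how the curves in Figs. 2 and 3 can be produced, assuming the model was trained with Keras as in the earlier sketch (the History object and variable names are assumptions, not artifacts from the paper), the per-epoch loss and accuracy for the training and test sets can be plotted as follows:

import matplotlib.pyplot as plt

# history = model.fit(X_train[..., None], y_train, epochs=200,
#                     validation_data=(X_test[..., None], y_test))

def plot_history(history):
    """Plot loss (cost function) and accuracy per epoch for the train/test sets."""
    for metric, title in [("loss", "Cost Function"), ("accuracy", "Accuracy")]:
        plt.figure()
        plt.plot(history.history[metric], label="train")
        plt.plot(history.history["val_" + metric], label="test")
        plt.xlabel("epoch")
        plt.ylabel(metric)
        plt.title(title)
        plt.legend()
    plt.show()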
V. CONCLUSION AND FURTHER ENHANCEMENTS

In this paper, we used audio recordings from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Toronto Emotional Speech Set (TESS) to present an architecture based on deep neural networks for emotion categorization. The model is trained to classify eight different emotions (neutral, calm, happy, sad, angry, fearful, disgusted, surprised) and obtained an overall F1 score of 0.85, with the best results in the Happy class (0.90) and the worst results in the Calm class (0.77). To obtain this outcome, we extracted the MFCC features (the spectrum of a spectrum) from the audio recordings used for training. Using 1D CNNs, max-pooling operations, and Dense layers, we trained a deep neural network to consistently predict the probability distribution over the annotation classes from the aforementioned representation of the input data. The method was put to the test using data from the RAVDESS dataset. As a baseline for our task, we employed an MLP classifier trained on the same dataset, which achieved an average F1 score of 0.84 across the eight classes. Following the MLP classifier, we trained an SVM classifier that scored 0.82 on the F1 scale. Our final option was a deep learning model with an F1 score of 0.86 on the test set. These positive results show that deep neural network-based techniques provide an outstanding foundation for solving this problem and are generic enough to perform correctly in a real-world application setting. Earlier versions of this paper solely used the RAVDESS dataset, with TESS being added afterwards.

Additionally, previous versions of this research used audio features extracted from the video files in the RAVDESS dataset. Because that component of the pipeline was shuffling very similar files into the training and test sets, causing overfitting, it was removed, which improved the model's accuracy.

REFERENCES

[1] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, "Deep Learning-Based Classification of Hyperspectral Data," pp. 1–14, 2014.
[2] R. Jannat, I. Tynes, L. L. Lime, J. Adorno, and S. Canavan, "Ubiquitous emotion recognition using audio and video data," in Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, ACM, 2018, pp. 956–959.
[3] X. Xu, J. Deng, E. Coutinho, C. Wu, and L. Zhao, "Connecting Subspace Learning and Extreme Learning Machine in Speech Emotion Recognition," IEEE, vol. XX, no. XX, pp. 1–13, 2018.
[4] B. Logan et al., "Mel frequency cepstral coefficients for music modeling," in ISMIR, 2000, vol. 270, pp. 1–11.
[5] Z. Huang, J. Epps, D. Joachim, and V. Sethu, "Natural Language Processing Methods for Acoustic and Landmark Event-based Features in Speech-based Depression Detection," IEEE J. Sel. Top. Signal Process., vol. PP, no. c, p. 1, 2019.
[6] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[7] J. Deng, X. Xu, Z. Zhang, et al., "Semi-Supervised Autoencoders for Speech Emotion Recognition," vol. XX, no. XX, pp. 1–13, 2017.
