Speech Emotion Recognition Using Deep Learning
Abstract:- The goal of this project is to detect a speaker's emotions while he or she speaks. Speech produced in a state of fear, rage, or delight, for example, becomes very loud and fast, with a larger and more varied pitch range, whereas in a moment of grief or tiredness speech is slow and low-pitched. Voice and speech patterns can therefore be used to detect human emotions, which can help improve human-machine interaction. We present classification models based on a deep neural network (CNN), a Support Vector Machine, and an MLP classifier, trained on auditory features extracted from emotional speech, namely Mel Frequency Cepstral Coefficients (MFCC). The models have been taught eight different emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprise). Using the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset as well as the TESS (Toronto Emotional Speech Set) dataset, we found that the proposed approach achieves accuracies of 86 percent, 84 percent, and 82 percent for the eight emotions using the CNN, the MLP classifier, and the SVM classifier, respectively.
I. INTRODUCTION

Spoken language is the most common method of human interaction; it is the cornerstone of information sharing and has been an important aspect of society since the dawn of time. Emotions, on the other hand, can be traced back to basic instinct before the emergence of modern spoken language, and can be regarded as one of the first forms of natural communication. Emotion recognition also has a variety of practical applications in many sectors, such as Business Process Outsourcing (BPO) centers and call centers, where analyzing emotion is essential for determining client satisfaction.

Each of us displays more than one basic emotion at a time, and we believe that recognizing which emotions are present, and in what proportions they are mixed, is exceedingly strenuous for both the speaker and the listener. For this reason, we decided to create a model that recognizes only the most prominent emotions in the audio track. Various methods, such as computer vision or text analytics, have been tried to let a machine classify feelings. The purpose of this study is to employ Mel-frequency cepstral coefficients (MFCC) with pure audio data.

Speech is one of the most basic human capacities, allowing us to connect with one another, express ourselves, and, most importantly, giving us a feeling of self. It is one of the most important aspects of mental and physical health. Because emotions can influence how we act and interpret events, a speaker's speech transmits both its linguistic meaning and the emotion with which it is delivered.

II. LITERATURE REVIEW

Many classification algorithms have been presented in this field of research in recent years; however, for the purposes of this paper we only looked at work done on RAVDESS. Iqbal et al. [6] developed a granular classification technique that merges Gradient Boosting, KNN, and Support Vector Machine (SVM) and applied it to the RAVDESS dataset used in this study, obtaining roughly 40% to 80% overall accuracy depending on the task. The proposed classifiers performed differently on different datasets: the authors created three datasets, one with only male recordings, one with exclusively female recordings, and one with both male and female recordings. On RAVDESS (male), SVM and KNN show 100 percent accuracy for both the angry and neutral classes, whereas Gradient Boosting outperforms SVM and KNN for happiness and sadness. On RAVDESS (female), SVM achieves 100 percent accuracy for anger, similar to the male counterpart. Except for sadness, SVM has decent overall performance. KNN also performs well for anger and neutral, scoring 87 percent and 100 percent, respectively.

Gradient Boosting performs poorly for anger and neutral, and in comparison to the other classifiers, KNN performs poorly for happiness and sadness. In the mixed male-and-female sample, SVM and KNN perform far better than Gradient Boosting for anger and neutral, while KNN's performance for both happiness and sadness remains poor. With the exception of SVM, average classifier performance is better on the male dataset than on the female dataset, and SVM has higher accuracy on the combined dataset than on either gender-based dataset.

Another technique, developed by Jannat et al. [7], achieved 66.41 percent accuracy on audio data alone and above 90% accuracy when integrating audio and visual data; in particular, both faces and audio waveforms are present in the preprocessed image data.

Xinzhou Xu et al. [3] modified the Spectral Regression model by combining Extreme Learning Machines (ELMs) with Subspace Learning (SL) in order to overcome the drawbacks of spectral-regression-based Graph Embedding (GE) and ELM. In Speech Emotion Recognition (SER), the relationships among data points have to be described correctly by the GSR model, and for this purpose many embedded graphs were created. The impact and practicality of these strategies were demonstrated on four speech emotion corpora in comparison with earlier ELM and Subspace Learning methods. Exploring the embedded graphs at a deeper level can help the system produce better results.
A. Methodology
To obtain statistically steady waves, the audio files are divided into frames, the length of which is normally determined by the size of a set window. The amplitude spectra are standardized by mapping the frequency axis onto the Mel scale; this is done so that frequencies are weighted in a way that closely follows the human auditory system and the wave can be represented precisely.

A total of 40 features are retrieved from each audio file. To build the feature vector, each audio file is transformed into a floating-point time series, which is then converted into an MFCC sequence. The MFCC array is transposed along the horizontal (time) axis and the arithmetic mean is calculated, yielding one 40-value vector per file.
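The paper does not specify the extraction library, but assuming a standard audio toolkit such as librosa, the per-file feature construction described above might look like the following sketch (the function name and parameters are illustrative, not taken from the paper):

```python
import numpy as np
import librosa

def extract_mfcc_features(path, n_mfcc=40):
    # Load the audio file as a floating-point time series.
    signal, sample_rate = librosa.load(path, sr=None)
    # Convert the time series into an MFCC sequence (n_mfcc coefficients per frame).
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    # Transpose so time runs along the first axis, then take the arithmetic
    # mean over frames to obtain a single fixed-length vector per file.
    return np.mean(mfcc.T, axis=0)  # shape: (n_mfcc,)
```

Averaging over frames is what reduces a variable-length recording to the fixed 40x1 vector that the classifiers consume.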
B. Dataset
This task's dataset consists of 5252 samples collected from the following sources:
RAVDESS – Ryerson Audio-Visual Database of Emotional Speech and Song
TESS – Toronto Emotional Speech Set

The sample consists of:
RAVDESS – 1440 speech files and 1012 song files. The dataset consists of recordings of 24 professional actors (12 female and 12 male) who speak with a neutral North American accent. Happy, angry, calm, sad, fearful, disgust, and surprise expressions can be found in the speech files, whereas happy, angry, calm, fearful, and sad emotions can be found in the song files.
TESS – 2800 files. This dataset consists of two actresses, aged 26 and 64, who recited a set of 200 target words in the carrier phrase "Say the word _____"; each of the seven emotions (anger, neutral, fear, pleasant surprise, disgust, sadness, and happiness) was recorded for the full set, giving 2800 stimuli in all. The two actresses were chosen from the Toronto area. Both are native English speakers with a university education and a musical background, and both actresses' audiometric thresholds are within the normal range, according to audiometric testing.

Each RAVDESS recording is identified by the following codes:
Emotion – 01 = Neutral, 02 = Calm, 03 = Happy, 04 = Sad, 05 = Angry, 06 = Fearful, 07 = Disgust, 08 = Surprised
Emotional intensity – 01 = Normal, 02 = Strong
Statement – 01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door"
Repetition – 01 = 1st repetition, 02 = 2nd repetition
Actor – 01 to 24 (odd-numbered actors are male, even-numbered actors are female)
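The emotion label and speaker gender used in this work can be recovered from these codes. The sketch below assumes the standard RAVDESS file naming scheme (hyphen-separated two-digit fields, with the emotion code in the third position and the actor ID last); the helper name is illustrative rather than taken from the paper:

```python
# Decode the RAVDESS identifier codes listed above (assumed filename layout).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def decode_ravdess_filename(filename):
    # e.g. "03-01-06-01-02-01-12.wav" -> ("fearful", 12, "female")
    parts = filename.split(".")[0].split("-")
    emotion = EMOTIONS[parts[2]]            # third field: emotion code
    actor = int(parts[-1])                  # last field: actor 01 to 24
    gender = "male" if actor % 2 == 1 else "female"  # odd = male, even = female
    return emotion, actor, gender
```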
C. Algorithms
The deep neural network (CNN) used for the classification task is shown operationally in Fig. 1. For every audio file given as input, the network works on a 40-value feature vector; the 40 values are a compressed numerical representation of a two-second audio frame. Accordingly, the set of 40x1 training vectors was run through one round of a 1D CNN using a ReLU activation function, a 20% dropout, and a 2 x 2 max-pooling function.

The ReLU (rectified linear unit) is formalised as g(z) = max(0, z); representing the hidden units with this function lets large values pass through when a unit is activated. Pooling lets the model focus primarily on the most relevant characteristics of each piece of input, resulting in position-invariant results. We repeated the procedure, this time adjusting the kernel size. After that, another dropout was applied, and the result was flattened to make it compatible with the subsequent layers. Finally, we calculated a probability distribution over the properly encoded classes:

Neutral – 0
Calm – 1
Happy – 2
Sad – 3
Angry – 4
Fearful – 5
Disgust – 6
Surprised – 7
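The exact filter counts and kernel sizes are not stated above, so the following Keras sketch fills them in with illustrative values; only the 40x1 input, the ReLU activations, the 20% dropouts, the max-pooling, the second convolution with a different kernel size, the flattening, and the eight-way softmax come from the description above:

```python
from tensorflow.keras import layers, models

def build_cnn(n_features=40, n_classes=8):
    # 1D CNN over the 40x1 MFCC feature vector described in Section A.
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),   # filters/kernel assumed
        layers.Dropout(0.2),                        # 20% dropout
        layers.MaxPooling1D(pool_size=2),           # max-pooling
        layers.Conv1D(128, kernel_size=3, padding="same", activation="relu"),  # repeated conv, different kernel
        layers.Dropout(0.2),                        # second dropout
        layers.Flatten(),                           # flatten for the dense layer
        layers.Dense(n_classes, activation="softmax"),  # probability distribution over the 8 classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With integer class labels 0-7 as encoded above, such a model could be fitted with model.fit(X_train, y_train, ...), where X_train has shape (num_samples, 40, 1).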
Fig. 3: Accuracy
REFERENCES