Research paper
Social Signal Processing
Emotion recognition through voice analysis
Master CIS 2014-2015

880243 Social Signal Processing
Elka Popova [ANR 216553]
Ilona Isaeva [ANR 291124]
prof. dr. E.O. Postma, dr. M. Postma

Abstract
One of the most important pieces of information that speech acoustics provide is the expression of emotions. The purpose of this research is to identify the pitch differences between two basic emotions: anger and joy. To answer this question, vocal data were collected from a small group of participants. Results from Friedman's Two-Way Analysis of Variance by Ranks revealed differences in pitch levels, as well as in jitter (rap), between the expression of anger and joy.

Introduction
It is well known that speech is an acoustically rich signal that provides a great deal of information about the speaker during vocal interaction. The expression and recognition of emotions are extremely important steps in the human communication process, and for this reason voice recognition is useful for detecting and identifying specific affective characteristics of the speakers. Moreover, it has been scientifically shown that basic acoustic features are an indicator of a speaker's vocal profile.

The human voice is a reliable source of emotional signalling. Thus, the capability of recognizing vocal emotional expressions in speech is crucial for creating a more detailed "decoding" of the message, which leads to a better understanding of the expresser's social signals. Previous studies argue that the six basic emotions (sadness, anger, surprise, disgust, fear and happiness) are very well recognized from prosody and voice quality (Pell & Kotz, 2011).

In linguistics, prosody includes the intonation, stress and rhythm of speech, whereas voice quality refers to pitch, energy and tempo. In this research paper the attention will mainly be devoted to pitch analysis. According to the article "Intonation and Emotion: Influence of Pitch Levels and Contour Type on Creating Emotions", intonation and certain pitch levels indicate the true emotions people are expressing while talking (Rodero, 2010). To illustrate, the majority of people speak in an uncharacteristically high-pitched voice when they are excited, affected or overwhelmed. In contrast, a low-pitched voice expresses neutral feelings, calmness, sadness and boredom.
Moreover, jitter is another acoustic characteristic that plays a crucial role in the identification of particular voices. There are different kinds of jitter (absolute, relative, rap and ppq5), but the methodological part of this paper will mainly focus on analyzing jitter (rap). Jitter (rap) is defined as the relative average perturbation: the average absolute difference between a period and the average of it and its two neighbours, divided by the average period (Farrus, Hernando & Ejarque, 2006). In other words, jitter is an acoustic characteristic of a voice signal that quantifies the cycle-to-cycle variation of the fundamental frequency. It is mainly measured on long sustained vowels, and significant differences can be detected between different speaking styles.
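To make this definition concrete, the rap measure can be sketched as follows. This is an illustrative stdlib-only implementation of the formula above, not Praat's implementation, and the period values in the example are hypothetical.

```python
def jitter_rap(periods):
    """Relative Average Perturbation: the mean absolute difference between
    each period and the average of it and its two neighbours, divided by
    the mean period. `periods` are pitch-period durations (any unit)."""
    n = len(periods)
    if n < 3:
        raise ValueError("need at least three periods")
    perturbations = [
        abs(periods[i] - (periods[i - 1] + periods[i] + periods[i + 1]) / 3)
        for i in range(1, n - 1)
    ]
    mean_period = sum(periods) / n
    return (sum(perturbations) / (n - 2)) / mean_period

# Perfectly regular voicing yields zero rap jitter; irregular voicing does not.
print(jitter_rap([5.0] * 10))          # -> 0.0
print(jitter_rap([5.0, 6.0] * 5) > 0)  # -> True
```

Because the perturbations are divided by the mean period, the measure is relative: it does not depend on the unit in which the periods are expressed.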
All vocal features used for emotion recognition are influenced by gender, culture and affective state. In most cases, it is challenging to distinguish between two emotions that convey high intensity, such as anger and happiness. Studying the relationship between speech and emotional states is difficult, and progress depends on finding forms of description that apply to those states (Cowie, 2000).
However, authors have not paid full attention to comparing the two basic emotions of anger and happiness. Therefore, the aim of this research is to analyze them by posing the question: what are the voice pitch differences between expressing joy and anger? Based on this research question, one hypothesis is formulated: voice pitch increases when expressing an emotion of joy.
There is a lot of scientific literature on the topic, but there is also a lot of individual variation in emotion recognition through voice analysis. This is the reason behind the decision to use a within-participant comparison.

Method

Participants
In order to collect vocal recordings, we asked 18 participants (10 females, 8 males) to voluntarily take part in the experiment. The participants were chosen on a random basis, as they were students encountered on campus. All of them were above 18 years of age and were promptly informed about the conditions of the experiment and the way their data would be used. However, age was not used as a variable in this research.

Design
A within-subject design was chosen for this research, as the subjects had to participate in both conditions. Each of the participants read out loud two short sentences, the same for every participant (happiness: "I always love spending time with you"; anger: "Get out of my sight, I don't want to see you again"). The participant was asked to imagine a situation where a person dear to them was in front of them and to read out the first sentence, which contained a positive message and provoked positive emotions in the participant while reading it. This indicated the emotion of joy.
Then, we asked the participant to imagine a situation where a person they despise was in front of them and to read out the second sentence, which contained a negative message and provoked negative emotions in the participant while reading it. This indicated the emotion of anger.

Instrumentation
Every sentence was recorded with a mobile device. The analysis was carried out only with the permission of the participant. Since most mobile devices record in the .m4a format, a conversion of the files was necessary. After compiling the corpus, we converted each of the recordings into the .wav format so that PRAAT could recognise it.

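A minimal sketch of how such a batch conversion could be scripted, assuming the ffmpeg command-line tool is available; the file name is hypothetical. The function only builds the command, which keeps the conversion step explicit and easy to log.

```python
from pathlib import Path
import subprocess

def conversion_command(src: Path) -> list:
    """Build the ffmpeg invocation that converts one .m4a recording
    to the .wav format that PRAAT can read."""
    dst = src.with_suffix(".wav")
    return ["ffmpeg", "-i", str(src), str(dst)]

cmd = conversion_command(Path("participant01_joy.m4a"))
print(cmd)  # -> ['ffmpeg', '-i', 'participant01_joy.m4a', 'participant01_joy.wav']
# To actually run the conversion: subprocess.run(cmd, check=True)
```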
Preprocessing
The obtained results were analysed with behavioural statistical methods. PRAAT was used for analysing the recordings and SPSS was used for the data analysis. For each of the recordings, data on maximum and minimum pitch were extracted, as well as on jitter rap (Relative Average Perturbation).
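PRAAT's pitch tracker is based on autocorrelation; the stdlib-only sketch below illustrates the principle on a synthetic tone. It is not Praat's actual algorithm, and the sample rate and pitch search range (75-500 Hz) are illustrative choices.

```python
import math

def estimate_pitch(samples, sample_rate, fmin=75.0, fmax=500.0):
    """Crude autocorrelation pitch estimate: the best lag is the shift at
    which the signal correlates most strongly with a copy of itself."""
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag

# A synthetic 200 Hz tone sampled at 8 kHz is recovered correctly:
sr = 8000
tone = [math.sin(2 * math.pi * 200 * t / sr) for t in range(2000)]
print(estimate_pitch(tone, sr))  # -> 200.0
```

Real speech requires windowing, voicing detection and interpolation on top of this idea, which is why a dedicated tool such as PRAAT was used for the actual analysis.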
Results
This analysis comprises three dependent variables (min pitch, max pitch and jitter) and two independent variables (gender and emotion). Firstly, we calculated whether the variables are normally distributed. There are 36 valid cases and 0 missing. None of the dependent variables was found to be normally distributed: min pitch (M = 118.72, SD = 41.28, Zskewness = 1.38, Zkurtosis = -1.21), max pitch (M = 279.23, SD = 84.60, Zskewness = -.49, Zkurtosis = -1.24) and jitter (M = .01, SD = .004, Zskewness = 1.90, Zkurtosis = -.56).
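The normality screen above relies on skewness and kurtosis z-scores, i.e. the statistic divided by its standard error. A stdlib sketch of the skewness part, using the common sqrt(6/n) approximation to the standard error (SPSS reports an exact version) and hypothetical pitch values:

```python
import math

def skewness_z(values):
    """Skewness z-score: sample skewness divided by the approximate
    standard error sqrt(6/n). |z| > 1.96 suggests non-normality at
    the 5% level."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return (m3 / m2 ** 1.5) / math.sqrt(6 / n)

print(skewness_z([1, 2, 3]))                  # symmetric -> 0.0
print(skewness_z([100] * 10 + [300]) > 1.96)  # right-skewed -> True
```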
The descriptive statistics showed that males express joy [min pitch (M = 83.06, SD = 9.93), max pitch (M = 200.02, SD = 40.66)] with a lower pitch than anger [min pitch (M = 108.55, SD = 37.81), max pitch (M = 236.64, SD = 60.33)]. The opposite is observed for females, where the expression of joy [min pitch (M = 154.92, SD = 29.45), max pitch (M = 381.83, SD = 37.50)] has a higher pitch than anger [min pitch (M = 130.81, SD = 44.08), max pitch (M = 297.18, SD = 57.19)]. From Figure 1, it can also be observed that on average males use a lower pitch than females for both emotions.

Figure 1: Pitch levels according to emotion and gender
There are two independent variables, Emotion (anger, joy) and Gender (male, female), and three dependent variables (max pitch, min pitch and jitter rap). Due to the small sample size (36) and the non-normal distribution of the variables, it was decided to perform a non-parametric test. We used a related-samples non-parametric test, where SPSS determines the right type of test according to the variables, computed at the 95% confidence level.
Friedman's Two-Way Analysis of Variance by Ranks is an alternative to the factorial ANOVA, which would have been used if the variables had been normally distributed. Friedman's test ranks variables according to their mean per related group. However, the only data we need from Friedman's test to evaluate our hypothesis are the chi-square, the degrees of freedom and the significance level. From the test performed, we can conclude that there was a large statistically significant difference in pitch depending on which type of emotion was vocally expressed, χ2(4) = 140.308, p < .001. Therefore, we reject the null hypothesis and retain the alternative hypothesis.
Conclusion and Discussion
Based on the results above, it can be concluded that there were statistically significant differences between the pitch levels for anger and joy. The data support the hypothesis that "voice pitch increases when expressing an emotion of joy". However, our research established that this is only valid for females, as males increase their voice pitch when they express anger. Despite the large numerical differences in pitch levels among emotions and genders, these numbers may be biased by the small sample size. It is therefore recommended that such an analysis be performed with an increased sample size of at least 50 participants.

Research implementation
This research generated useful insight into the vocal properties of males and females when expressing the emotions of anger and joy. Such research can be applied in plenty of areas. For instance, in eHealth, patients with heart conditions (and previous cardiac arrests) may benefit from sensors detecting and recording their voice activities. When the pitch of an angry voice reaches a critical level (for males, approximately 500 Hz) for a certain period of time, necessary interventions can be made in order to decrease blood pressure in a timely manner. Such preventive actions may prove useful especially for patients who live alone or do not have access to immediate healthcare.
The findings on joy pitch levels for women can be used in advertising, for instance by creative agencies. If people react to joyful voices by mirroring the emotion, the advertised product is likely to generate higher revenue. When mirroring joy, the customer generates higher levels of endorphins, which may lead to spontaneous purchase decisions.

Limitations
There are some limitations in this study that cannot be ignored. The main limitation concerns the empirical part of the study: the data analysis was conducted on a small number of participants (only 9 participants per emotion).
Another limitation relates to the fact that none of the participants was an actor, so their vocal recordings were absolutely genuine. However, it is much more difficult to analyze genuine vocal expressions than posed ones. It is known that when people know their voice is being recorded, they feel stressed or unable to show their true emotions. A third limitation is that only the difference between anger and happiness was taken into account. More significant differences could be measured if more than two basic emotions were compared with one another. Addressing these issues could raise the accuracy of the results.

Future research
Voice analysis is useful for a great variety of other scientific fields, such as healthcare, affective gaming, computer science, education, telecommunications and security. All of these fields could benefit from future research aimed at improving human-computer interaction. Future studies may focus on measuring and comparing vocal differences between basic emotions other than the two examined in this research. Scientists could also focus on analyzing voice parameters in order to detect a speaker's age, culture or ethnicity.
For other future research, it could also be interesting to detect people's personalities (extrovert/introvert) according to their vocal cues. This would be useful in education: teachers would be able to understand a student's personality and help them release stress, or motivate them to participate more in different school activities. Moreover, the voice could be an indicator of intention and predict people's actions in a situation. In security, voice detection machines could be useful to predict criminals' moves while they are being interrogated.
Bibliography
Bachorowski, J. (1999). Vocal expression and perception of emotion. Current Directions in Psychological Science, 8(2), 53-57.
Bachorowski, J., & Owren, M. (1995). Vocal expression of emotion: Acoustic properties of speech are associated with emotional intensity and context. Psychological Science, 6(4), 219-224.
Cowie, R. (2000). Emotional states expressed in speech. ITRW on Speech and Emotion, Newcastle, 5-7.
Farrus, M., Hernando, J., & Ejarque, P. (2006). Jitter and shimmer measurements for speaker recognition. TALP Research Center, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya.
Gorisch, J., Wells, B., & Brown, G.J. (2011). Pitch contour matching and interactional alignment across turns: An acoustic investigation. Language and Speech, 55(1), 57-76.
Henton, C. (1995). Pitch dynamism in female and male speech. Language & Communication, 15(1), 43-61.
Pell, M.D., & Kotz, S.A. (2011). On the time course of vocal emotion recognition. PLoS ONE, 6(11).
Simon-Thomas, E.R., Keltner, D.J., Sauter, D., Sinicropi-Yao, L., & Abramson, A. (2009). The voice conveys specific emotions: Evidence from vocal burst displays. Emotion, 9(6), 838-846.
Sobin, C., & Alpert, M. (1999). Emotion in speech: The acoustic attributes of fear, anger, sadness and joy. Journal of Psycholinguistic Research, 28(4).