
Emotion recognition through voice analysis

Research paper for Social Signal Processing (880243), Master CIS 2014-2015
Elka Popova [ANR 216553], Ilona Isaeva [ANR 291124]
prof. dr. E. O. Postma, dr. M. Postma

Abstract

One of the most important pieces of information that speech acoustics provide is the expression of emotion. The purpose of this research is to identify the pitch differences between two basic emotions: anger and joy. To answer this question, vocal data were collected from a small group of participants. Results from Friedman's Two-Way Analysis of Variance by Ranks revealed differences in pitch levels, as well as in jitter (rap), between the expression of anger and joy.
Introduction

It is well known that speech is an acoustically rich signal that provides a great deal of information about the speaker during vocal interaction. The expression and recognition of emotions are extremely important steps in the human communication process, and for this reason voice analysis is useful for detecting and identifying specific affective characteristics of speakers. Moreover, basic acoustic features have been shown to be an indicator of a speaker's vocal profile.

The human voice is a reliable source of emotional signalling. Thus, the capability of recognizing vocal emotional expressions in speech is crucial for creating a more detailed "decoding" of the message, which leads to a better understanding of the expresser's social signals. Previous research argues that the six basic emotions, which are sadness, anger, surprise, disgust, fear and happiness, are recognized very well from prosody and voice quality (Pell & Kotz, 2011).

In linguistics, prosody includes the intonation, stress and rhythm of speech, whereas voice quality refers to pitch, energy and tempo. In this research paper the attention is mainly directed to pitch analysis. According to the article "Intonation and Emotion: Influence of Pitch Levels and Contour Type on Creating Emotions", intonation and certain pitch levels indicate the true emotions people are expressing while talking (Rodero, 2010). To illustrate, the majority of people speak in an uncharacteristically high-pitched voice when they are excited, affected or overwhelmed. In contrast, a low-pitched voice expresses neutral feelings, calmness, sadness and boredom.

Moreover, jitter is another acoustic characteristic that plays a crucial role in the identification of particular voices. There are different kinds of jitter (absolute, relative, rap and ppq5), but the methodological part of this paper focuses on jitter (rap). Jitter (rap) is defined as the relative average perturbation: the average absolute difference between a period and the average of that period and its two neighbours, divided by the average period (Farrús, Hernando & Ejarque, 2006). In other words, jitter is an acoustic characteristic of the voice signal that quantifies the cycle-to-cycle variation of the fundamental frequency. It is mainly measured on long sustained vowels, and significant differences can be detected between different speaking styles.

All the vocal features used for emotion recognition are influenced by gender, culture and affective state. In most cases, it is challenging to distinguish between two emotions that convey high intensity, such as anger and happiness. Studying the relationship between speech and emotional states is difficult, and progress depends on finding forms of description that apply to those states (Cowie, 2000).
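To make the jitter (rap) measure above concrete, the short sketch below computes the relative average perturbation from a sequence of consecutive glottal period durations. This is only an illustrative reconstruction of the standard formula; the function name and the synthetic 120 Hz example signal are our own and do not come from the study.

```python
import numpy as np

def jitter_rap(periods):
    """Relative Average Perturbation (jitter rap).

    periods : consecutive glottal period durations in seconds.
    Returns a dimensionless value, often reported as a percentage.
    """
    t = np.asarray(periods, dtype=float)
    if t.size < 3:
        raise ValueError("RAP needs at least three consecutive periods")
    # Mean of each interior period with its two neighbours.
    three_point_mean = (t[:-2] + t[1:-1] + t[2:]) / 3.0
    # Average absolute deviation of each period from that local mean.
    perturbation = np.mean(np.abs(t[1:-1] - three_point_mean))
    # Normalise by the mean period to obtain a relative measure.
    return perturbation / np.mean(t)

# Illustrative use on a slightly perturbed 120 Hz voice (hypothetical data).
rng = np.random.default_rng(0)
periods = 1 / 120 + rng.normal(0, 5e-5, size=200)
print(f"jitter (rap) = {jitter_rap(periods):.4%}")
```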
However, previous authors did not pay full attention to comparing the two basic emotions of anger and happiness. The aim of this research is therefore to analyse them by posing the question: what are the voice pitch differences between expressing joy and anger? Based on this research question, one hypothesis is formulated: voice pitch increases when expressing an emotion of joy. There is a great deal of scientific literature on the topic, but there is also considerable individual variation in emotion recognition through voice analysis. This is the reason behind the decision to use a within-participant comparison.

Method

Participants

In order to collect vocal recordings, we asked 18 participants (10 females, 8 males) to take part in the experiment voluntarily. The participants were chosen on a random basis, as they were students encountered on campus. All of them were above 18 years of age and were promptly informed about the conditions of the experiment and the way their data would be used. However, age was not used as a variable in this research.

Design

A within-subject design was chosen for this research, as the subjects had to participate in both conditions. Each participant read out loud two short sentences, identical for every participant (happiness: "I always love spending time with you"; anger: "Get out of my sight, I don't want to see you again"). The participant was first asked to imagine that a person dear to him was in front of him and to read out the first sentence, which contained a positive message and provoked positive emotions while it was being read. This indicated the emotion of joy. Then we asked the participant to imagine that a person he despises was in front of him and to read out the second sentence, which contained a negative message and provoked negative emotions while it was being read. This indicated the emotion of anger.

Instrumentation

Every sentence was recorded with a mobile device, and the analysis was carried out only with the permission of the participant. Since most mobile devices record in the .m4a format, a conversion of the files was necessary. After compiling the corpus, we converted each recording into a .wav file so that PRAAT could recognise it.

Preprocessing

The obtained recordings were analysed with behavioural statistical methods. PRAAT was used to process the recordings and SPSS was used for data analysis. For each recording, data on maximum and minimum pitch were extracted, as well as jitter (rap, Relative Average Perturbation).
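The paper does not specify whether the PRAAT measurements were automated. Purely as an illustration of this preprocessing step, the sketch below shows how the same minimum pitch, maximum pitch and jitter (rap) values could be extracted with Parselmouth, a Python interface to Praat. The file name, the 75-600 Hz pitch range and the jitter query parameters are assumptions on our part, not settings reported in the paper.

```python
# Illustrative sketch only: extraction of min/max pitch and jitter (rap)
# with Parselmouth. File name and analysis parameters are assumed.
import parselmouth
from parselmouth.praat import call

def pitch_and_jitter(wav_path, floor=75.0, ceiling=600.0):
    snd = parselmouth.Sound(wav_path)

    # Pitch track (time step 0.0 lets Praat choose it automatically).
    pitch = call(snd, "To Pitch", 0.0, floor, ceiling)
    min_pitch = call(pitch, "Get minimum", 0, 0, "Hertz", "Parabolic")
    max_pitch = call(pitch, "Get maximum", 0, 0, "Hertz", "Parabolic")

    # Glottal pulse sequence, then Praat's "Get jitter (rap)" query
    # (time range, shortest/longest period, maximum period factor).
    pulses = call(snd, "To PointProcess (periodic, cc)", floor, ceiling)
    rap = call(pulses, "Get jitter (rap)", 0, 0, 0.0001, 0.02, 1.3)

    return {"min_pitch": min_pitch, "max_pitch": max_pitch, "jitter_rap": rap}

# Hypothetical usage for one participant and one emotion condition:
# print(pitch_and_jitter("participant01_joy.wav"))
```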
Results

This analysis comprises three dependent variables (min pitch, max pitch and jitter) and two independent variables (gender and emotion). We first checked whether the variables were normally distributed. There are 36 valid cases and no missing cases. None of the dependent variables was found to be normally distributed: min pitch (M = 118.72, SD = 41.28, Zskewness = 1.38, Zkurtosis = -1.21), max pitch (M = 279.23, SD = 84.60, Zskewness = -.49, Zkurtosis = -1.24), and jitter (M = .01, SD = .004, Zskewness = 1.90, Zkurtosis = -.56).

The descriptive statistics showed that males express joy [min pitch (M = 83.06, SD = 9.93), max pitch (M = 200.02, SD = 40.66)] with a lower pitch than anger [min pitch (M = 108.55, SD = 37.81), max pitch (M = 236.64, SD = 60.33)]. The opposite is observed with females, where the expression of joy [min pitch (M = 154.92, SD = 29.45), max pitch (M = 381.83, SD = 37.50)] has a higher pitch than anger [min pitch (M = 130.81, SD = 44.08), max pitch (M = 297.18, SD = 57.19)]. From Figure 1, it can also be observed that, on average, males use a lower pitch than females for both emotions.

[Figure 1: Pitch levels according to emotion and gender]

There are two independent variables, Emotion (anger, joy) and Gender (male, female), and three dependent variables (max pitch, min pitch and jitter rap). Due to the small sample size (36) and the non-normal distribution of the variables, it was decided to perform a non-parametric test. We used a related-samples non-parametric test, in which SPSS determines the appropriate test for the variables entered, computed at a 95% confidence level. Friedman's Two-Way Analysis of Variance by Ranks is an alternative to the factorial ANOVA, which would have been used had the variables been normally distributed. Friedman's test ranks the variables according to their mean per related group. The only values needed from Friedman's test to evaluate our hypothesis are the chi-square statistic, the degrees of freedom and the significance level. From the test performed, we can conclude that there was a large, statistically significant difference in pitch depending on which type of emotion was vocally expressed, χ2(4) = 140.308, p < .001. Therefore, we reject the null hypothesis and retain the alternative hypothesis.
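The Friedman test itself was run in SPSS. Purely as an illustration of the same procedure outside SPSS, the sketch below applies scipy.stats.friedmanchisquare to related measurements per participant. The data are randomly generated placeholders, and the particular set of related variables is an assumption on our part, since the paper does not list exactly which columns were ranked; the output will therefore not reproduce the reported χ2(4) = 140.308.

```python
# Illustrative sketch: a Friedman test on related measurements per participant.
# All values below are made-up example data, not the study's measurements.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(1)
n_participants = 18

# Hypothetical per-participant measurements (one array per related condition):
min_pitch_joy   = rng.normal(120, 40, n_participants)
max_pitch_joy   = rng.normal(290, 80, n_participants)
min_pitch_anger = rng.normal(120, 40, n_participants)
max_pitch_anger = rng.normal(270, 70, n_participants)
jitter_rap      = rng.normal(0.01, 0.004, n_participants)

# friedmanchisquare takes each related measurement as a separate argument and
# returns the chi-square statistic and its p-value (df = k - 1 = 4 here).
stat, p = friedmanchisquare(min_pitch_joy, max_pitch_joy,
                            min_pitch_anger, max_pitch_anger, jitter_rap)
print(f"Friedman chi-square(4) = {stat:.3f}, p = {p:.4f}")
```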
Conclusion and Discussion

Based on the results above, it is concluded that there were statistically significant differences between the pitch levels for anger and joy. The data support the hypothesis that "voice pitch increases when expressing an emotion of joy". However, our research established that this is only valid for females, as males increase their voice pitch when they express anger. Despite the large numerical differences in pitch levels among emotions and genders, these numbers may be biased by the small sample size. It is therefore recommended that such an analysis be performed with an increased sample size of at least 50 participants.

Research implementation

This research generated useful insight into the vocal properties of males and females when expressing emotions of anger and joy. Such research could be applied in many areas. For instance, in eHealth, patients with heart conditions (and previous cardiac arrests) may benefit from sensors detecting and recording their voice activity. When the pitch of an angry voice reaches a critical level (for males approximately 500 Hz) for a certain period of time, the necessary interventions can be made to decrease blood pressure in a timely manner. Such preventive actions may prove useful, especially for patients who live alone or do not have access to immediate healthcare. The findings on joy pitch levels for women could be used in advertising, for instance by creative agencies. If people react to joyful voices by mirroring the emotion, the advertised product is likely to generate higher revenue; when mirroring joy, the customer generates higher levels of endorphins, which may lead to spontaneous purchase decisions.

Limitations

There are some limitations in this study that cannot be ignored. The main limitation concerns the empirical part of the study: the data analysis was conducted on a small number of participants (only 9 participants per emotion). Another limitation may relate to the fact that none of the participants was an actor, so their vocal recordings were genuine; however, it is much more difficult to analyze genuine vocal expressions than posed ones. It is known that when people are aware their voice is being recorded, they feel stressed or unable to show their true emotions. A third limitation is that only the difference between anger and happiness was taken into account; significant differences could be measured if more than two basic emotions were compared with one another. Addressing these points could raise the accuracy of the results.

Future research

Voice analysis is useful for a great variety of other fields, such as healthcare, affective gaming, computer science, education, telecommunication and security. All of these fields could benefit from future research aimed at improving human-computer interaction. Future studies may focus on measuring and comparing vocal differences between emotions other than the two basic emotions examined in this research. Researchers could also focus on analyzing voice parameters in order to detect a speaker's age, culture or ethnicity. It could also be interesting to detect people's personalities (extrovert/introvert) from their vocal cues. This would be useful in education: teachers would be able to understand a student's personality and help them release stress, or motivate them to participate more in different school activities. Moreover, the voice could be an indicator of intention and help predict people's actions in a situation. In security, voice detection systems could be useful for predicting criminals' moves during interrogation.

Bibliography

Bachorowski, J. (1999). Vocal expression and perception of emotion. Current Directions in Psychological Science, 8(2), 53-57.

Bachorowski, J., & Owren, M. (1995). Vocal expression of emotion: Acoustic properties of speech are associated with emotional intensity and context. Psychological Science, 6(4), 219-224.

Cowie, R. (2000). Emotional states expressed in speech. ITRW on Speech and Emotion, Newcastle, 5-7.

Farrús, M., Hernando, J., & Ejarque, P. (2006). Jitter and shimmer measurements for speaker recognition. TALP Research Center, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya.

Gorisch, J., Wells, B., & Brown, G. J. (2011). Pitch contour matching and interactional alignment across turns: An acoustic investigation. Language and Speech, 55(1), 57-76.

Henton, C. (1995). Pitch dynamism in female and male speech. Language & Communication, 15(1), 43-61.

Pell, M. D., & Kotz, S. A. (2011). On the time course of vocal emotion recognition. PLoS ONE, 6(11).

Simon-Thomas, E. R., Keltner, D. J., Sauter, D., Sinicropi-Yao, L., & Abramson, A. (2009). The voice conveys specific emotions: Evidence from vocal burst displays. Emotion, 9(6), 838-846.

Sobin, C., & Alpert, M. (1999). Emotion in speech: The acoustic attributes of fear, anger, sadness and joy. Journal of Psycholinguistic Research, 28(4).