Discriminating Neutral and Emotional Speech using Neural Networks

Sudarsana Reddy Kadiri 1, P. Gangamohan 2 and B. Yegnanarayana 3
Speech and Vision Laboratory, Language Technologies Research Center,
International Institute of Information Technology-Hyderabad, India.
1 sudarsanareddy.kadiri@research.iiit.ac.in, 2 gangamohan.p@students.iiit.ac.in, 3 yegna@iiit.ac.in

D S Sharma, R Sangal and J D Pawar. Proc. of the 11th Intl. Conference on Natural Language Processing, pages 214–221, Goa, India, December 2014. (c) 2014 NLP Association of India (NLPAI).

Abstract

In this paper, we address the issue of speaker-specific emotion detection (neutral vs emotion) from speech signals, with models for neutral speech as reference. As emotional speech is produced by the human speech production mechanism, the emotion information is expected to lie in the features of both the excitation source and the vocal tract system. The Linear Prediction residual is used as the excitation source component and Linear Prediction Coefficients as the vocal tract system component. A pitch synchronous analysis is performed. Separate Autoassociative Neural Network models are developed to capture the information specific to neutral speech from the excitation and the vocal tract system components. Experimental results show that the excitation source carries more information than the vocal tract system. The accuracy of neutral vs emotion classification using excitation source information is 91%, which is 8% higher than the accuracy obtained using vocal tract system information. The Berlin EMO-DB database is used in this study. It is observed that the proposed emotion detection system provides an improvement of approximately 10% using excitation source features and 3% using vocal tract system features over a recently proposed emotion detection method which uses energy and pitch contour modeling with functional data analysis.

Keywords: Excitation Source, Vocal Tract System, Linear Prediction (LP) Analysis, Autoassociative Neural Network.

1 Introduction

Speech is produced by the human speech production mechanism, and it carries the signature of the speaker, message, language, dialect, age, gender, context, culture, and the state of the speaker, such as emotions or expressive states. Extraction of these elements of information from the speech signal depends on identification and extraction of relevant acoustic parameters. Information present in the speech signal, including the emotional state of a speaker, has an impact on the performance of speech systems (Athanaselis et al., 2005).

In this study, emotion detection refers to identification of whether speech is neutral or emotional. Emotion recognition refers to determining the category of emotion, i.e., anger, happy, sad, etc. The focus of this study is on detecting the presence of an emotional state of a speaker using reference models for neutral speech. Motivated by a broad range of commercial applications, automatic emotion recognition from speech has gained increasing research attention over the past few years. Applications of emotion recognition systems include health care, call centre services, and the development of speech systems such as automatic speech recognizers (ASR) to improve the performance of dialogue systems (Athanaselis et al., 2005; Mehu and Scherer, 2012; Cowie et al., 2001; Morrison et al., 2007).

Extraction of features from the speech signal that characterize the emotion content of speech, and at the same time do not depend on the lexical content, is an important issue in emotion recognition (Schuller et al., 2010; Luengo et al., 2010; Scherer, 2003; Williams and Stevens, 1972; Murray and Arnott, 1993; Lee and Narayanan, 2005).
From (Schuller et al., 2010; Hassan and Damper, 2012; Schuller et al., 2013; Schuller et al., 2011), it is observed that there is no clear understanding of what type of features can be used for the emotion recognition task. The brute force approach involves extracting as many features as possible and using these in the experiments, sometimes with feature selection mechanisms to choose an appropriate subset of features (Schuller et al., 2013; Schuller et al., 2011; Schuller et al., 2009; Zeng et al., 2009). These features can be broadly classified as prosodic features (pitch, intensity, duration), voice quality features (jitter, shimmer, harmonic to noise ratio (HNR)), spectral features (Mel Frequency Cepstral Coefficients (MFCCs), Linear Prediction Cepstral Coefficients (LPCCs)), and their statistics such as mean, variance, minimum, maximum and range (Zeng et al., 2009; Schuller et al., 2011; Schuller et al., 2009; Eyben et al., 2012). A limitation of this approach is the assumption that every segment in the utterance is equally important. Studies have shown that emotional information is not uniformly distributed in time (Jeon et al., 2011; Lee et al., 2011; Shami and Verhelst, 2007).

In (Busso et al., 2009; Bulut and Narayanan, 2008; Arias et al., 2014; Arias et al., 2013; Busso et al., 2007), the authors observed that robust neutral speech models can be useful in contrasting different emotions expressed in speech. An emotion detection study was made by modeling acoustic spectral features of neutral speech with HMMs (Busso et al., 2007). In (Busso et al., 2009), the authors used the pitch features of neutral speech to discriminate the emotions using the Kullback-Leibler distance. It was observed that gross pitch contour statistics such as mean, minimum, maximum and range are more prominent than pitch shape. Recently, emotion detection has been performed using functional data analysis (FDA) (Arias et al., 2014; Arias et al., 2013). In this approach, the pitch and energy contours of neutral speech utterances are modeled using FDA. In testing, the pitch and energy contours are projected onto the reference bases, and their projections are used to discriminate neutral and emotional speech. Similar studies were made to model the shape of the pitch contour of emotional speech by analyzing the rising and falling movements (Astrid and Sendlmeier, 2010). One limitation of the studies (Arias et al., 2014; Arias et al., 2013) is that all the utterances should be temporally aligned with Dynamic Time Warping, which may not be realistic in most situations.

Here, we propose an approach based on AANN (Yegnanarayana and Kishore, 2002) to detect whether a given utterance is neutral or emotional speech. The detection of emotional segments or emotion events may help current approaches in automatic emotion recognition. This approach avoids the interrelations among the lexical content used, language and emotional state across varying acoustic features.
The discrimination capabilities of AANN models have been exploited in various areas of speech, such as speaker identification, speaker verification, speaker recognition, language identification, throat microphone processing, and audio clip classification (Reddy et al., 2010; Murty and Yegnanarayana, 2006; Yegnanarayana et al., 2001; Mary and Yegnanarayana, 2008; Bajpai and Yegnanarayana, 2008; Shahina and Yegnanarayana, 2007).

The present work builds on our previous work (Gangamohan et al., 2013) on capturing the deviations of emotional speech from neutral speech. In that paper (Gangamohan et al., 2013), it was shown that the excitation source features extracted in the high signal to noise ratio (SNR) regions of the speech signal (around the glottal closure) capture the deviations of emotional speech from neutral speech. This paper presents a framework to characterize the high SNR regions of the speech signal using knowledge of the speech production mechanism. In (Reddy et al., 2010; Murty and Yegnanarayana, 2006; B. Yegnanarayana and S. R. Mahadeva Prasanna and K. Sreenivasa Rao, 2002), the authors showed the importance of processing the high SNR regions of the speech signal for various applications such as speaker recognition (Reddy et al., 2010; Murty and Yegnanarayana, 2006), speech enhancement (B. Yegnanarayana and S. R. Mahadeva Prasanna and K. Sreenivasa Rao, 2002), and emotion analysis (Gangamohan et al., 2013). Hence, in this study, our focus is on the processing of high SNR regions of speech.

The remaining part of the paper is organized as follows: Section 2 describes the basis for the present study. The databases used and the feature extraction procedures are described in Section 3. In Sections 4 and 5, the AANN models for capturing the excitation source and vocal tract system information are described. Emotion detection experiments and a discussion of the results are given in Section 6. Finally, Section 7 gives a summary and the scope for further study.

2 Basis for the Present Study

Speech production characteristics change while producing emotional speech, and the changes are mostly in the excitation component. The changes are not sustainable for longer periods, and hence are not likely to be present throughout; this is due to the extra effort needed to produce emotional speech. The primary effect is on the source of excitation, due to the pressure from the lungs and the vibration of the vocal folds. Moreover, the changes in production can be effected only in some selected voiced sounds. Hence, some neutral speech segments are also present in emotional speech. While most changes from a perception point of view take place at the suprasegmental level (pitch, intensity and duration), it is less likely that significant changes take place at the segmental level (vocal tract resonances). Changes at the suprasegmental level are mostly learnt features (acquired over a period of time). It is difficult to find consistent suprasegmental features which can form a separate group for each emotion. In this study, changes in the subsegmental features are examined for discriminating neutral and emotional speech of a speaker (speaker-specific) using AANN models. The features are referred to as subsegmental features, since we consider only 1-5 ms around the glottal closure of the voiced excitation for deriving them.

3 Emotion Speech Databases and Feature Extraction

Two types of databases (semi-natural and simulated) are used for discrimination of neutral and emotional speech.
3.1 Emotion Speech Databases

The semi-natural database was collected from 2 female and 5 male speakers of the Telugu language. Utterances were recorded in 4 emotions (angry, happy, neutral and sad), and the database is named the IIIT-H Telugu emotion database (Gangamohan et al., 2013). Speakers were asked to script the text themselves by recalling past memories and situations which made them emotional. The lexical content is different for each speaker and for each emotion. The data was collected in 2 or 3 sessions for each speaker, and consists of around 200 utterances. The complete database was evaluated in perceptual listening tests for recognizability of emotions by 10 listeners. A total of 130 utterances were selected, of which anger, happy, neutral and sad account for 35, 27, 34 and 34 utterances, respectively.

To test the effectiveness of language independent emotion detection, the Berlin emotion speech database (EMO-DB) (Burkhardt et al., 2005) is chosen. Ten professional native German actors (5 male and 5 female) were asked to speak 10 emotionally neutral sentences in 7 different emotions, namely anger, happy, neutral, sad, fear, disgust and boredom, in one or more sessions. The whole database was evaluated in a perception test by 20 listeners; utterances with a recognition rate better than 80% and naturalness better than 60% were retained for analysis. A total of 535 good utterances were selected, of which anger, happy, neutral, sad, fear, disgust and boredom account for 127, 71, 79, 62, 69, 46 and 81 utterances, respectively.

3.2 Feature Extraction

Features related to the excitation source and the vocal tract system components of the speech signal are used in this study. The major source of excitation of the vocal tract system is the vocal fold vibration at the glottis. The instant of significant excitation is due to the sharp closure of the vocal folds, and it is almost impulse-like. Hence, the high SNR regions of speech are present around the Glottal Closure Instants (GCIs). By extracting the GCIs from the signal, it is possible to focus the analysis around the GCIs to further extract the information from both the excitation source and the vocal tract system. The features investigated for the detection of emotions are the Linear Prediction (LP) residual for the excitation source and weighted Linear Prediction Cepstral Coefficients (wLPCCs) for the vocal tract system component, both extracted around the GCIs of the speech signal. For this purpose, we use two signal processing methods: the recently proposed Zero Frequency Filtering (ZFF) method (Murty and Yegnanarayana, 2008) for extraction of GCIs, and LP analysis (Makhoul, 1975) for extraction of the LP residual and wLPCCs.

3.2.1 Zero Frequency Filtering (ZFF) Method

The motivation behind this method is that the effect of the impulse-like excitation is reflected across all frequencies of the speech signal, including zero frequency (0 Hz). The method obtains the zero frequency filtered signal by passing the speech signal through a cascade of two 0 Hz resonators to get the epoch locations. The instants of negative-to-positive zero crossings (NPZCs) of the ZFF signal correspond to the instants of significant excitation, i.e., epochs or Glottal Closure Instants (GCIs), in voiced speech (Murty and Yegnanarayana, 2008). This method is also useful for detecting voiced and unvoiced regions. Because of the significant contribution of the impulse-like excitation, the ZFF signal energy is high in voiced regions (Dhananjaya and Yegnanarayana, 2010). In this paper, we considered only voiced regions for analysis.
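As a concrete illustration of the ZFF procedure described above, the following is a minimal NumPy sketch: the signal is differenced, passed through a cascade of two 0 Hz resonators (four cumulative integrations), the slowly varying trend is removed by subtracting a local mean, and the negative-to-positive zero crossings of the resulting ZFF signal are taken as GCIs. The function name, the 10 ms trend-removal window and the three trend-removal passes are illustrative choices, not values taken from the paper.

```python
import numpy as np

def zff_gcis(s, fs=8000, win_ms=10.0, passes=3):
    """Sketch of zero-frequency filtering (ZFF) for epoch (GCI) extraction.

    s      : 1-D speech signal
    fs     : sampling rate in Hz (the paper works at 8 kHz)
    win_ms : trend-removal window, roughly 1-2 average pitch periods (assumed)
    passes : number of trend-removal passes (an implementation choice)
    """
    # 1. Difference the signal to suppress any DC and low-frequency bias.
    x = np.diff(s.astype(np.float64), prepend=s[0])

    # 2. Cascade of two 0 Hz resonators: each resonator has a double pole at
    #    z = 1, so the cascade amounts to integrating the signal four times.
    y = x.copy()
    for _ in range(4):
        y = np.cumsum(y)

    # 3. Remove the slowly growing trend by repeatedly subtracting a local mean.
    n = int(round(win_ms * 1e-3 * fs)) | 1          # odd window length
    kernel = np.ones(n) / n
    for _ in range(passes):
        y = y - np.convolve(y, kernel, mode="same")

    # 4. GCIs (epochs) are the negative-to-positive zero crossings of the
    #    zero-frequency filtered signal.
    gcis = np.flatnonzero((y[:-1] < 0) & (y[1:] >= 0)) + 1
    return y, gcis
```

The frame energy of the returned ZFF signal can additionally be thresholded to restrict the analysis to voiced regions, as done in this study.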
3.2.2 Linear Prediction (LP) Analysis

Both the excitation source and the vocal tract system play a role in the production characteristics of speech. LP analysis with a proper LP order gives the excitation source component (the LP residual) and the vocal tract system component (through the LPCs). In the LP residual, the region around the GCI within each pitch period is used for processing the high SNR regions of speech (Reddy et al., 2010). For deriving the LP residual and LPCs, a 10th order LP analysis is used on the signal sampled at 8 kHz. Two pitch periods of the signal are used for deriving the residual, and a 4 ms segment (i.e., 32 samples) of the LP residual is chosen around each epoch to extract the information from the excitation source component. The vocal tract system characteristics around each GCI are represented by a 15-dimensional wLPCC vector derived from the 10 LPCs.
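To make this step concrete, the sketch below derives 10th-order LPCs per analysis frame around each GCI (autocorrelation method), obtains the LP residual by inverse filtering, keeps a 32-sample (4 ms at 8 kHz) residual segment centred on the GCI, and converts the LPCs to 15 cepstral coefficients. It is a sketch under stated assumptions, not the authors' implementation: the fixed 20 ms frame stands in for the two-pitch-period frame used in the paper, the linear lifter used to weight the LPCCs is an assumed choice (the exact weighting is not specified here), and the helper names are ours.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_coefficients(frame, order=10):
    """Autocorrelation-method LPCs: returns a with A(z) = 1 + sum a_k z^-k."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode="full")[len(w) - 1:]
    r[0] = r[0] * (1.0 + 1e-9) + 1e-12              # tiny regularization
    a = solve_toeplitz(r[:order], -r[1:order + 1])  # normal equations
    return np.concatenate(([1.0], a))

def lpcc_from_lpc(a, n_ceps=15):
    """Standard LPC-to-cepstrum recursion for A(z) = 1 + sum a_k z^-k."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc -= (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]

def features_around_gcis(s, gcis, fs=8000, order=10,
                         frame_ms=20.0, seg_len=32, n_ceps=15):
    """4 ms (32-sample) LP residual segments and wLPCC vectors around GCIs."""
    half = int(round(frame_ms * 1e-3 * fs)) // 2    # stand-in for two pitch periods
    lifter = np.arange(1, n_ceps + 1)               # assumed linear weighting
    residual_segs, wlpccs = [], []
    for g in gcis:
        if g - half < 0 or g + half > len(s):
            continue
        frame = s[g - half:g + half].astype(np.float64)
        a = lp_coefficients(frame, order)
        e = lfilter(a, [1.0], frame)                # LP residual (inverse filter)
        mid = len(e) // 2
        residual_segs.append(e[mid - seg_len // 2:mid + seg_len // 2])
        wlpccs.append(lifter * lpcc_from_lpc(a, n_ceps))
    return np.asarray(residual_segs), np.asarray(wlpccs)
```

In this framework, the residual segments feed the excitation source AANN of Section 4, and the wLPCC vectors feed the vocal tract system AANN of Section 5.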
4 AANN Models for Capturing the Excitation Source Information

An Autoassociative Neural Network (AANN) is a feedforward neural network model which performs an identity mapping (Yegnanarayana and Kishore, 2002; Yegnanarayana, 1999). Once the AANN model is trained, it should be able to reproduce the input at the output with minimum error, if the input is from the same system. The AANN model consists of one input layer, one or more hidden layers and one output layer (Haykin, 1999). The units in the input and output layers are linear, whereas the units in the hidden layers are nonlinear. The AANN is expected to capture the information specific to neutral speech present in the samples of the LP residual.

A five layer neural network architecture (Fig. 1) is considered for the study. The structure of the network, 33L 80N xN 80N 33L, is chosen for extracting the neutral speech information using the 4 ms LP residual around each GCI. Here L refers to linear units, and N refers to units with a nonlinear (tanh) output function. Here x refers to the number of units in the compression layer, which is varied to study its effect on the model's ability to capture the neutral speech specific information. The sizes of the input and output layers are fixed by the number of residual samples (around each GCI) used to train and test the models. The hidden layers provide flexibility for mapping and compression.

[Figure 1: Structure of the AANN model, showing the input layer, the hidden and compression layers, the output layer, and the type of activation function (L or N) for each layer.]

The network is trained for 200 iterations. The training error plots are shown in Fig. 2(a) for different values of the number of units (x) in the compression layer. From Fig. 2(a), it is observed that the error decreases with the number of iterations, and hence the network is able to capture the neutral speech information of a speaker in the residual. It can also be observed that the decrease in error is larger as the number of units in the compression layer increases. Even if the error decreases, the generalizing ability may be poor beyond a certain limit on the number of units in the compression layer (Reddy et al., 2010).

[Figure 2: Training error as a function of iteration number, for (a) excitation source models and (b) vocal tract system models. Here x indicates the number of nodes in the compression layer.]

5 AANN Models for Capturing the Vocal Tract System Information

A 5-layer AANN model with the structure 15L 40N xN 40N 15L is used for extracting the neutral speech specific information from the 15-dimensional wLPCC vectors derived by LP analysis over a two pitch period segment around each GCI. The model is expected to capture the distribution of the feature vectors of neutral speech of a speaker. The training error plots for the neutral speech of a speaker, for x = 1, 6 and 12 units in the compression layer, are shown in Fig. 2(b). It is observed that the information in the distribution of the feature vectors is captured. The ability of the model to capture the neutral speech information can be determined through the emotion detection experiments described in Sec. 6.

6 Emotion Detection Experiments

In order to assess the ability of the AANN models to capture neutral speech information for emotion detection, we used all the speech samples from the two types of databases described in Sec. 3.1. The speech samples were picked randomly during training, and the experiments were conducted in a lexically independent way. For the EMO-DB database, a universal background model (UBM) is built from 10 speakers (5 male and 5 female) using 15 s of neutral speech data from each speaker. We have used the data of all 10 speakers for the emotion detection experiments. Approximately 20 s of neutral speech data from a speaker is used to train over the UBM to build the speaker-specific neutral speech AANN models, using 200 iterations. For testing, an emotional speech utterance is presented to the neutral speech AANN model, and the mean squared error between the output and the input, normalized by the magnitude of the input, is computed.

Fig. 3(a) shows the plots of the normalized errors obtained from the neutral speech AANN models of a speaker, using excitation source information (LP residual), at each GCI. The solid line is the output error of the model for neutral speech of the same speaker. The emotional speech test utterance is fed to the neutral speech AANN models, and the resulting error is shown by the dotted lines. The plots correspond to three different cases, i.e., 1, 6 and 12 units in the middle compression layer. It can be seen that the solid line (neutral speech) has the lowest normalized error values for most of the frames (from GCI index 1 to 150). As the number of units in the middle layer increases, the error for the neutral speech decreases and the error for emotional speech increases. Similar observations can be made from Fig. 3(b), which shows the error plots for an emotional test utterance tested against neutral speech models using vocal tract system information (wLPCCs).

[Figure 3: Normalized errors obtained from AANN models of various architectures, using (a) excitation source and (b) vocal tract system information. In each plot, the solid and dotted lines correspond to the normalized error curves of the neutral and emotional utterance, respectively.]

Since the AANN models are built on neutral speech, the error range is expected to discriminate neutral and emotional speech. It is observed that the network gives lower error values when the test utterance is neutral and higher error values when the test utterance is emotional. Using a threshold on the averaged normalized error value (averaged over all the frames of an utterance), emotion detection studies are performed. The averaged normalized error is given by

$$\frac{1}{l} \sum_{i=1}^{l} \frac{\| y_i - z_i \|^2}{\| y_i \|^2} \qquad (1)$$

where y_i is the input feature vector of the model, z_i is the output given by the model, and l is the number of frames of the test utterance.
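The sketch below ties Sections 4-6 together in PyTorch: a five-layer AANN of the 33L 80N xN 80N 33L form with linear input and output units and tanh hidden units, trained to reproduce neutral-speech feature vectors, followed by the averaged normalized error of Eq. (1) and the threshold decision. It is one possible realization rather than the authors' implementation: the Adam optimizer and learning rate are assumptions, and the UBM initialization described above is omitted for brevity.

```python
import torch
import torch.nn as nn

class AANN(nn.Module):
    """Five-layer autoassociative network, e.g. 33L 80N xN 80N 33L."""
    def __init__(self, in_dim=33, expand=80, compress=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, expand), nn.Tanh(),    # first hidden layer (N)
            nn.Linear(expand, compress), nn.Tanh(),  # compression layer (xN)
            nn.Linear(compress, expand), nn.Tanh(),  # third hidden layer (N)
            nn.Linear(expand, in_dim),               # linear output layer (L)
        )

    def forward(self, x):
        return self.net(x)

def train_neutral_model(neutral_frames, in_dim=33, compress=12,
                        iterations=200, lr=1e-3):
    """Train an AANN to reproduce neutral-speech feature vectors.

    The paper trains for 200 iterations; the optimizer and learning rate
    here are assumptions, and the UBM initialization step is omitted.
    """
    model = AANN(in_dim=in_dim, compress=compress)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    x = torch.as_tensor(neutral_frames, dtype=torch.float32)
    for _ in range(iterations):
        opt.zero_grad()
        loss = loss_fn(model(x), x)
        loss.backward()
        opt.step()
    return model

@torch.no_grad()
def averaged_normalized_error(model, frames):
    """Eq. (1): mean over frames of ||y_i - z_i||^2 / ||y_i||^2."""
    y = torch.as_tensor(frames, dtype=torch.float32)
    z = model(y)
    err = ((y - z) ** 2).sum(dim=1) / (y ** 2).sum(dim=1).clamp_min(1e-12)
    return err.mean().item()

# Decision rule: flag an utterance as emotional when its averaged normalized
# error against the speaker's neutral model exceeds a tuned threshold.
# is_emotional = averaged_normalized_error(model, test_frames) > threshold
```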
The results of emotion detection (neutral vs emotion) using the excitation source information and the vocal tract system information for EMO-DB are shown in Table 1. To test the effectiveness of language independent emotion detection, a similar study is performed on the IIIT-H Telugu emotion database, and the results are shown in Table 2. The accuracy for the EMO-DB database using excitation source information is 91%, which is nearly 8% higher than that for the vocal tract system information. This is because, in the production of emotional speech, the primary effect is on the excitation source component, such as the pressure from the lungs and the changes in the vocal fold vibration. Similar observations can be made for the IIIT-H Telugu emotion database. For both databases, this indicates that the excitation source carries more information than the vocal tract system, and the performance is consistent across speakers. It is observed that the proposed excitation source and vocal tract system features with AANN models provide improvements of approximately 10% and 3%, respectively, over the recently proposed emotion detection method (Arias et al., 2014; Arias et al., 2013), which uses energy and pitch contour modeling with functional data analysis (accuracy of 80.4%), for the EMO-DB database.

From Tables 1 and 2, it is observed that the higher activated emotion states (anger, happy, disgust and fear) are more discriminable than the lower activated emotion states (sad and boredom). This is in conformity with the studies reported in (Jeon et al., 2011; Lee et al., 2011; Shami and Verhelst, 2007), where acoustic features generally discriminate effectively between high activated and low activated emotions. The accuracies for the lower activation states are low because some of the speech segments in these emotion signals are closer to neutral speech. Thus, the AANN models are able to discriminate higher activated and lower activated states using the neutral speech reference model. This study can be extended by training models with the lower activation states and testing with the higher activation states, and vice versa. It is also noted (from Fig. 3) that not all frames of an utterance are equally important in taking the decision, i.e., emotion information may not be uniformly distributed in time, and hence automatic use of the high confidence frames may improve the accuracy. Also, the proposed emotion detection system can be evaluated using other spectral parameters such as variants of LPCs, MFCCs and MSFs (Schuller et al., 2013; Schuller et al., 2011; Schuller et al., 2009; Zeng et al., 2009; Ayadi et al., 2011; Eyben et al., 2012). The advantage of the present study is that it is independent of the lexical content used, language and channel.

Table 1: Results for neutral vs emotion detection for the Berlin EMO-DB database (in percentage).

                        Excitation Source   Vocal Tract System
Neutral vs Anger              100                 100
Neutral vs Happy              100                 100
Neutral vs Sad                73.50               62.93
Neutral vs Disgust            100                 81.24
Neutral vs Fear               100                 91.32
Neutral vs Boredom            72.72               64.02
Neutral vs Emotion            91.03               83.25

Table 2: Results for neutral vs emotion detection for the IIIT-H Telugu emotion database (in percentage).

                        Excitation Source   Vocal Tract System
Neutral vs Anger              100                 92.86
Neutral vs Happy              100                 96.43
Neutral vs Sad                 82                 75.4
Neutral vs Emotion             94                 88.23
7 Summary

In this paper, we have demonstrated the significance of pitch synchronous analysis of speech using AANN models for discriminating neutral and emotional speech. We have shown that the excitation source information of neutral speech is captured using the 4 ms LP residual around each GCI, and that the vocal tract system information of neutral speech is captured using 15-dimensional wLPCC vectors derived from 10 LPCs around each GCI. Emotion detection (neutral vs emotion) experiments were conducted using two databases: the IIIT-H Telugu emotion database and the well known Berlin EMO-DB. The results show that the excitation source component carries more information than the vocal tract system. Further, this work can be extended to discrimination among emotions, such as discriminating anger and happy by training anger models and testing with happy utterances, and vice versa. It is also important to develop speaker-specific models by determining suitable AANN architectures for individual speakers and emotions. It is also necessary to explore methods to combine the evidence from the excitation source and the vocal tract system for emotion detection.

References

J. P. Arias, C. Busso, and N. B. Yoma. 2013. Energy and F0 contour modeling with functional data analysis for emotional speech detection. In INTERSPEECH, pages 2871–2875, August.

Juan Pablo Arias, Carlos Busso, and Nestor Becerra Yoma. 2014. Shape-based modeling of the fundamental frequency contour for emotion detection in speech. Computer Speech and Language, 28(1):278–294.

Paeschke Astrid and W. F. Sendlmeier. 2010. Prosodic characteristics of emotional speech: Measurements of fundamental frequency movements. In SpeechEmotion, pages 75–80.

Theologos Athanaselis, Stelios Bakamidis, Ioannis Dologlou, Roddy Cowie, Ellen Douglas-Cowie, and Cate Cox. 2005. ASR for emotional speech: Clarifying the issues and enhancing performance. Neural Networks, 18(4):437–444.

Moataz El Ayadi, Mohamed S. Kamel, and Fakhri Karray. 2011. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3):572–587.

B. Yegnanarayana, S. R. Mahadeva Prasanna, and K. Sreenivasa Rao. 2002. Speech enhancement using excitation source information. In ICASSP, volume 1, pages 541–544.

A. Bajpai and B. Yegnanarayana. 2008. Combining evidence from subsegmental and segmental features for audio clip classification. In TENCON IEEE Conference, pages 1–5, November.

Murtaza Bulut and Shrikanth Narayanan. 2008. On the robustness of overall F0-only modifications to the perception of emotions in speech. J. Acoust. Soc. Am., 123(6):4547–4558, June.

Felix Burkhardt, Astrid Paeschke, M. Rolfes, Walter F. Sendlmeier, and Benjamin Weiss. 2005. A database of German emotional speech. In INTERSPEECH, pages 1517–1520.

C. Busso, S. Lee, and S. S. Narayanan. 2007. Using neutral speech models for emotional speech analysis. In INTERSPEECH, pages 2225–2228, August.

Carlos Busso, Sungbok Lee, and Shrikanth Narayanan. 2009. Analysis of emotionally salient aspects of fundamental frequency for emotion detection. IEEE Transactions on Audio, Speech & Language Processing, 17(4):582–596.

R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor. 2001. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1):32–80.
N. Dhananjaya and B. Yegnanarayana. 2010. Voiced/nonvoiced detection based on robustness of voiced epochs. IEEE Signal Processing Letters, 17(3):273–276, March.

Florian Eyben, Anton Batliner, and Björn Schuller. 2012. Towards a standard set of acoustic features for the processing of emotion in speech. Proceedings of Meetings on Acoustics, 16.

P. Gangamohan, Sudarsana Reddy Kadiri, and B. Yegnanarayana. 2013. Analysis of emotional speech at subsegmental level. In INTERSPEECH, pages 1916–1920, August.

Ali Hassan and Robert I. Damper. 2012. Classification of emotional speech using 3DEC hierarchical classifier. Speech Communication, 54(7):903–916.

Simon Haykin. 1999. Neural Networks: A Comprehensive Foundation. Prentice-Hall International, New Jersey, USA.

Je Hun Jeon, Rui Xia, and Yang Liu. 2011. Sentence level emotion recognition based on decisions from subsentence segments. In ICASSP, pages 4940–4943.

Chul Min Lee and Shrikanth S. Narayanan. 2005. Toward detecting emotions in spoken dialogs. IEEE Trans. Audio, Speech, and Language Processing, 13(2):293–303.

Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan. 2011. Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 53(9-10):1162–1171.

Iker Luengo, Eva Navas, and Inmaculada Hernáez. 2010. Feature analysis and evaluation for automatic emotion identification in speech. IEEE Transactions on Multimedia, 12(6):490–501.

J. Makhoul. 1975. Linear prediction: A tutorial review. Proc. IEEE, 63:561–580, April.

Leena Mary and B. Yegnanarayana. 2008. Extraction and representation of prosodic features for language and speaker recognition. Speech Communication, 50(10):782–796.

Marc Mehu and Klaus R. Scherer. 2012. A psychoethological approach to social signal processing. Cognitive Processing, 13(2).

Carl E. Williams and Kenneth N. Stevens. 1972. Emotions and speech: Some acoustical correlates. J. Acoust. Soc. Am., 52(4B):1238–1250.

Donn Morrison, Ruili Wang, and Liyanage C. De Silva. 2007. Ensemble methods for spoken emotion recognition in call-centres. Speech Communication, 49(2):98–112.

B. Yegnanarayana and S. P. Kishore. 2002. AANN: an alternative to GMM for pattern recognition. Neural Networks, 15:459–469, April.

Iain R. Murray and John L. Arnott. 1993. Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. J. Acoust. Soc. Am., 93(2):1097–1108.

K. Sri Rama Murty and B. Yegnanarayana. 2006. Combining evidence from residual phase and MFCC features for speaker recognition. IEEE Signal Processing Letters, 13(1):52–55, January.

K. Sri Rama Murty and B. Yegnanarayana. 2008. Epoch extraction from speech signals. IEEE Trans. Audio, Speech, and Language Processing, 16(8):1602–1613, November.

Sri Harish Reddy, Kishore Prahallad, Suryakanth V. Gangashetty, and B. Yegnanarayana. 2010. Significance of pitch synchronous analysis for speaker recognition using AANN models. In INTERSPEECH, pages 669–672.

Klaus R. Scherer. 2003. Vocal communication of emotion: A review of research paradigms. Speech Communication, 40(1-2):227–256.

Björn Schuller, Stefan Steidl, and Anton Batliner. 2009. The INTERSPEECH 2009 emotion challenge. In INTERSPEECH, pages 312–315.

Björn Schuller, Bogdan Vlasenko, Florian Eyben, Martin Wöllmer, André Stuhlsatz, Andreas Wendemuth, and Gerhard Rigoll. 2010. Cross-corpus acoustic emotion recognition: Variances and strategies. IEEE Trans. Affective Computing, 1(2):119–131.
Björn Schuller, Anton Batliner, Stefan Steidl, and Dino Seppi. 2011. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Communication, 53(9-10):1062–1087.

Björn Schuller, Stefan Steidl, Anton Batliner, Felix Burkhardt, Laurence Devillers, Christian A. Müller, and Shrikanth Narayanan. 2013. Paralinguistics in speech and language: state-of-the-art and the challenge. Computer Speech & Language, 27(1):4–39.

A. Shahina and B. Yegnanarayana. 2007. Mapping speech spectra from throat microphone to close-speaking microphone: A neural network approach. EURASIP J. Adv. Sig. Proc., 2007.

Mohammad Shami and Werner Verhelst. 2007. An evaluation of the robustness of existing supervised machine learning approaches to the classification of emotions in speech. Speech Communication, 49(3):201–212.

B. Yegnanarayana, K. Sharat Reddy, and Kishore Prahallad. 2001. Source and system features for speaker recognition using AANN models. In Proc. Int. Conf. Acoustics, Speech and Signal Processing, pages 491–494, Salt Lake City, Utah, USA, May.

B. Yegnanarayana. 1999. Artificial Neural Networks. Prentice-Hall of India, New Delhi.

Zhihong Zeng, Maja Pantic, Glenn I. Roisman, and Thomas S. Huang. 2009. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell., 31(1):39–58.