Naturalistic Audio-Visual Emotion Database
Sudarsana Reddy Kadiri1, P. Gangamohan2, V.K. Mittal3 and B. Yegnanarayana4
Speech and Vision Laboratory,
Language Technologies Research Center,
International Institute of Information Technology-Hyderabad, India.
1sudarsanareddy.kadiri@research.iiit.ac.in, 2gangamohan.p@students.iiit.ac.in,
3vinay.mittal@iiit.ac.in, 4yegna@iiit.ac.in
Abstract
The progress in research areas like emotion recognition, identification, synthesis, etc., relies heavily on the development and structure of the underlying database. This paper addresses some of the key issues in the development of emotion databases. A new audio-visual emotion (AVE) database is developed. The database consists of audio, video and audio-visual clips sourced from TV broadcasts such as movies and soap-operas in the English language. The data clips are manually segregated in an emotion- and speaker-specific way. This database is developed to address emotion recognition in actual human interaction. The database is structured in such a way that it may be useful in a variety of applications, such as emotion analysis based on speaker or gender, and emotion identification in multiple emotive dialogue scenarios.
Keywords: Emotion analysis, Emotion recognition, Expressive synthesis, Simulated parallel database, Semi-natural database, Audio-visual data.
1 Introduction
Emotion databases provide an important experimental foundation for analysis when researchers aim at building emotion-aware speech systems (M. Gnjatovic et al., 2010). The basic requirement of a database for studies in emotion analysis, identification, classification and synthesis is guided primarily by its suitability to the chosen application. The emotion databases developed by different research groups can be categorized as simulated, semi-natural and natural databases (E. Douglas-Cowie et al., 2003; B. Schuller et al., 2011; D. Ververidis et al., 2003; S. G. Koolagudi et al., 2012).
Simulated parallel emotion corpora are recorded from speakers (artists) by prompting them to enact emotions through specified text in a given language. The simulated parallel emotion corpora reported in (Zhihong Zeng et al., 2009; F. Burkhardt et al., 2005; I. S. Engberg et al., 1997; B. Schuller et al., 2010; S. G. Koolagudi et al., 2009) were collected from speakers by asking them to emote the same text in different emotions. Their main disadvantage is that the deliberately enacted emotions are quite at variance with natural ‘spontaneous’ emotions, and at times they are also out of context (D. Ververidis et al., 2003; D. Erickson et al., 2006).
A semi-natural corpus is a kind of enacted corpus where the context is given to the speakers. A semi-natural emotion database in the German language was developed by asking speakers to enact scripted scenarios eliciting each emotion (R. Banse et al., 1996; I. Sneddon et al., 2012). Similar semi-natural databases in the English and Hebrew languages were reported in (E. Douglas-Cowie et al., 2000; N. Amir et al., 2000), respectively.
The third kind of emotion database is the natural database, where recordings do not involve any prompting or obvious eliciting of emotional responses. Sources for such natural situations include talk shows, interviews, panel discussions, group interactions, etc., in TV broadcasts. The Belfast natural database in the English language was developed by segmenting 10-60 second long audio-visual clips from TV talk shows (E. Douglas-Cowie et al., 2000). Similar databases were developed in Korean (Zhihong Zeng et al., 2009), German and English, such as FAU Aibo (Steidl, S et al., 2009), USC-IEMOCAP (C. Busso et al., 2008), (S. Chung et al., ), etc. The Geneva airport lost-luggage study database was developed by videotaping the interviews of passengers at lost-luggage counters (K. Scherer et al., 1997). The “Vera am
Mittag” German audio-visual emotion database (M. Grimm et al., 2008) was developed by segmenting audio-visual clips from the talk show “Vera am Mittag”. More details of the various types of databases, their issues and other important aspects are given in (E. Douglas-Cowie et al., 2003; B. Schuller et al., 2011; D. Ververidis et al., 2003; Koelstra, S et al., 2012).
Simulated parallel emotion speech corpora are mainly used in the area of emotion conversion (M. Schroder et al., 2001; I. Murray et al., 1993; H. Kawanami et al., 2003). In these cases, Gaussian Mixture Models (GMMs) and Artificial Neural Networks (ANNs) are used for developing the systems. The output speech from these systems is unnatural and carries the constraints of a parallel speech corpus. Also, emotion recognition systems based on these parallel emotion speech corpora are not reliable in real-life scenarios. Semi-natural emotion databases also involve prompting, but they are useful for developing emotion recognition systems because the recorded utterances have context. The requirements are sentences with context, and artists with good performance skills. For the collection of natural databases from TV talk shows and interactive sessions, the main difficulty is to label the emotion or expressive state of the dialogue. Also, it is possible that emotional states like extreme anger, sadness, fear, etc., may not occur in such TV broadcasts. Therefore, in the “Vera am Mittag” emotion corpus (M. Grimm et al., 2008), the annotation of the utterances was described by three basic primitives: valence (positive or negative), activation (calm or excited) and dominance (weak or strong).
Speakers involved in TV broadcasts like talk shows (M. Grimm et al., 2008), interviews, panel discussions, group interactions, etc., control their emotions/expressive states, i.e., they cannot express the feelings that occur in natural communication among humans. There is always a trade-off between the controllability and naturalness of the interaction (M. Grimm et al., 2008).
In this paper, we describe an audio-visual emotion database named IIIT-H AVE, developed at the Speech and Vision Laboratory, IIIT Hyderabad. We decided to use TV broadcasts such as movies and soap-operas for data collection because, even though the emotions produced are enacted, they are more generic and closer to natural communication.
The remaining part of the paper is organized as follows. Section 2 describes the challenges involved in the collection of emotion data for different applications. Section 3 describes the data collection, recording parameters and the various stages involved. In Section 4, the structure of the database is described, and in Section 5, issues encountered, proposed solutions and limitations are reported. In Section 6, possible applications of the proposed database are discussed briefly. Finally, Section 7 gives a summary and the scope of future work.
2 Challenges involved in the collection of emotion data for different applications
In order to develop high-quality text-to-emotional-speech synthesis systems, large natural databases of each target emotion are required (M. Schroder et al., 2001). But it is impractical to develop a large natural emotion database with spontaneity (naturalness). Hence, emotion conversion systems are adopted as a post-processing block for speech synthesis from neutral text. In this approach, a large database of neutral speech is used by a text-to-speech (TTS) system to generate neutral speech first, which is then fed to an emotion conversion system where the input neutral speech is converted to the desired emotional speech. Since the emotional speech is produced by emotion conversion systems, it is reasonable to use an enacted parallel corpus (D. Erro et al., 2010).
Although it is practically reliable to use a simulated parallel corpus for emotion synthesis systems, it does not serve the purpose of developing emotion classification systems because it consists of enacted speech. The original state of the speaker might be different as well. Most of the time, semi-natural and close-to-natural databases are used for developing emotion recognition systems.
The problem with semi-natural databases is whether the produced emotion is real or merely produced for the purpose of emotion data collection, since the speakers know that they are being recorded.
Ideally, natural databases with multiple speakers, styles and contextual information are required to design emotion recognition systems for realistic applications. Natural databases are mostly collected from talk shows and interactive sessions in TV broadcasts, call centers, interactions with robots, conversations in public places, etc. The main difficulty is to identify and label the emotion or expressive state of the dialogue. Emotive states like extreme anger, sadness, fear, etc., may sometimes not occur in such TV broadcasts, because the expression of emotion is a continuum in nature. Therefore, for natural emotion corpora, the annotation of the utterances has mostly been described by three basic primitives or dimensions: valence, activation and dominance, because the labelling of naturalistic emotions is highly subjective and the categorization of emotions is always debatable (M. Grimm et al., 2008; K. P. Truong et al., 2012). Other difficulties with natural databases are overlapping data from multiple speakers in audio or video or both, background noise or music, etc. Establishing good ground truth for natural emotion databases is a difficult task, as there are inconsistencies in the annotation. Databases with good emotion labels/annotations would be helpful for emotion recognition tasks.
There are some challenges involved in the collection of audio-visual data of naturally occurring emotions. Different people annotate different emotions/expressive states for the same data (audio-visual clips). Also, there is a possibility of inconsistency in annotation done by the same person. It is impossible to define strict boundaries for the occurrence of an emotion, as the presence of emotion is a continuum in speech. Also, emotion depends on the semantic and contextual information.
3 Data collection
The objective of this audio-visual emotion data collection is to have an emotion-annotated database with adequate context and a large number of speakers. We have chosen English movies and soap operas from TV broadcasts as the source for data collection.
3.1 Selection of sources
We began by watching a range of source videos over a period of time, and eventually identified a few sources that were potentially useful. For example, if the story of a source had some drama revolving around a group of characters, then it was considered a useful source to yield good clips of emotional content. This collection of source videos is named the raw data.
3.2 Emotive and Expressive states
The emotive and expressive states are chosen based on the examples derived from the selected sources. It is also observed that communication among people always exhibits expressions. The extreme cases of these expressions lead to different emotions. We have identified 7 basic emotions (anger, disgust, fear, happy, neutral, sad and surprise) and 6 expressive states (confusion, excited, interested, relaxed, sarcastic and worried) (K. Scherer et al., 2003; R. Cowie et al., 2003). The list of emotive and expressive states considered is shown in Table 1.
Table 1: List of emotive and expressive states.

Emotive states    Expressive states
1. Anger          1. Confusion
2. Disgust        2. Excited
3. Fear           3. Interested
4. Happy          4. Relaxed
5. Neutral        5. Sarcastic
6. Sad            6. Worried
7. Surprise
3.3 Segregation of raw data
Segregation of audio-visual clip segments (or specific scene selection) from the chosen source video is carried out on the basis of the perceived significance of the emotion/expressive state. The duration of such audio-visual clips ranges from 0.5-30 seconds, with the average being around 5 seconds. The criteria adopted for selecting a ‘good source clip’ are the following:

• The audio-visual clip has no background music or noise.

• Only one actor is speaking at a time.

There were 6 subjects, each a research scholar, involved in the segregation of the source videos. The basic challenge was to annotate the segregated clips. Subjects were asked to label each clip with one of the emotion/expressive states, and also to specify a confidence level.

If a particular soap-opera has many episodes, then during segregation of the clips, the prominent characters of that soap-opera are also labelled with speaker numbers.
3.4 Recording quality
From the segregated audio-visual clips, the audio and video streams are extracted. The video files are MPEG-4 coded image sequences, mostly with a frame size of 1280×720 pixels and a frame rate of 24 fps. Files are in .avi and .wmv formats. All the extracted audio wave files have sampling rates of 44.1/48 kHz and are in stereo/mono mode. The audio data is downsampled to 16 kHz.
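As a minimal sketch of this preprocessing step (assuming the librosa and soundfile Python packages; the file names are placeholders following the labelling scheme of Section 4, not actual database files):

import librosa
import soundfile as sf

# Load the extracted audio track as mono and resample it to 16 kHz,
# mirroring the downsampling applied to the database audio.
audio, sr = librosa.load("EN_MV_123_FS2_AN_9_106_AX.wav", sr=16000, mono=True)
sf.write("EN_MV_123_FS2_AN_9_106_AX_16k.wav", audio, sr)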
4 Structure of database
This database consists of segregated emotion clips in three formats, namely audio, video, and audio-visual. It has 1176 clips in each format. For ease of usage, a consistent structure is maintained for labelling the database, which is explained as follows.
Figure 1: Labelling structure of the segregated clips.

Label template: XX_XX_XXX_XXX_XX_X_XXX_XX

Languages: English - EN, Hindi - HI, Telugu - TU
Sources: Movies - MV, Serials - SL, Group interactions - GI, Dyadic interactions - DI, Reality shows - RS
File No.: 001, 002, ...
Gender and Speaker No.: MS1, FS2, ...
Emotion/Expressive states: Anger - AN, Disgust - DI, Fear - FR, Happy - HA, Neutral - NU, Sad - SD, Surprise - SU, Confusion - CF, Excited - EX, Interested - IN, Relaxed - RL, Sarcastic - SC, Worried - WR
Confidence: 2 to 9
Clip No.: 001, 002, ...
Type of file: Audio - AX, Video - VX, Audio-Video - AV
4.1 Labelling the raw data
The raw videos selected from the chosen source are labelled with a string of 9 characters as follows: “XX_XX_XXX”, where

• Characters 1-2 refer to the language code [for example, the database collected in the English language is coded as ‘EN’].

• Characters 4-5 refer to the kind of source [for example, the code ‘MV’ specifies the source video as a movie].

• Characters 7-9 refer to the source video number.

Example: EN_MV_123
4.2 Labelling of segregated data
Each segregated clip of a source video includes the raw label of the source video along with labels specifying the gender, speaker, emotion category, confidence score and the type of file. The labelling scheme of segregated clips is as follows: “XX_XX_XXX_XXX_XX_X_XXX_XX”, where

• The initial characters 1-9 are the same as the label of the raw source video.

• Character 11 refers to the gender (M/F).

• Characters 12-13 refer to the speaker number. Speaker numbering is kept consistent for all episodes of a particular soap-opera or movie [for example, the code ‘FS2’ represents female speaker number 2].

• Characters 15-16 refer to the emotion category [for example, the code ‘AN’ indicates that the clip is in the anger state].

• Character 18 refers to the confidence score [range 2 to 9, 9 being highest].

• Characters 20-22 refer to the clip number for a particular source video.

• Characters 24-25 refer to the type of clip [for example, the codes ‘AX’, ‘VX’ and ‘AV’ specify the clip in audio, video and audio-visual formats, respectively].

Example: EN_MV_123_FS2_AN_9_106_AV
More details of the labelling structure are given in Fig. 1.
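Since the labels have fixed character positions, they are straightforward to parse programmatically. The following is an illustrative sketch of ours (not a tool distributed with the database), using the codes of Fig. 1; the field names are our own choices:

import re

# Illustrative parser for segregated-clip labels such as
# "EN_MV_123_FS2_AN_9_106_AV". The stated scheme uses a
# single-digit speaker number (characters 12-13, e.g. 'S2').
LABEL_RE = re.compile(
    r"^(?P<lang>[A-Z]{2})_(?P<source>[A-Z]{2})_(?P<file_no>\d{3})_"
    r"(?P<gender>[MF])(?P<speaker>S\d)_(?P<emotion>[A-Z]{2})_"
    r"(?P<confidence>[2-9])_(?P<clip_no>\d{3})_(?P<ftype>AX|VX|AV)$"
)

def parse_label(label):
    """Split a segregated-clip label into its named fields."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError("not a valid segregated-clip label: " + label)
    return match.groupdict()

# Example: yields lang='EN', source='MV', file_no='123', gender='F',
# speaker='S2', emotion='AN', confidence='9', clip_no='106', ftype='AV'.
fields = parse_label("EN_MV_123_FS2_AN_9_106_AV")

The extracted fields can then be used to group clips per emotion, gender, speaker, or their combinations, as described next.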
The data can be sub-structured as per emotion, gender and speaker. It also allows further levels of sub-structuring as per speaker-emotion and gender-emotion categories. The database consists of 1176 labelled clips, of which 741 clips are of male and 435 clips are of female speakers. The statistics of the data per emotion and per speaker are given in Tables 2 and 3, respectively.
The database also contains clips with multiple emotions (one emotion followed by another) occurring continuously in a sentence, for example, anger followed by frustration, or excitement followed by anger or happiness.
Table 2: Number of clips per emotion/expressive state, with the confidence score in each column (CX, 2 to 9, 9 being highest).

Emotion          C9   C8   C7   C6   C5   C4   C3   C2   Total
1. Anger          5   24   60   44   27    7    2    2     171
2. Disgust        6   35   18    8    8    -    -    -      75
3. Fear           4    5    6    1    -    -    -    -      16
4. Happy          3   27   50   19   21    9    -    -     130
5. Sad            6   17   41   19   17   16    8    -     117
6. Surprise       5   11   28   27    8   12    2    -      93
7. Neutral        4   34   90   31   14    1    -    -     174
8. Confusion      1    2    4    4    -    -    -    -      11
9. Excited        5   19   77   28   11    8    1    -     149
10. Interested    4   46   11    5    4    1    -    -      71
11. Relaxed       5    2    4    3    -    -    -    -      14
12. Sarcastic     3    9    8    8    8    2    1    -      39
13. Worried      19   33    7   11    4    1    0    -      75
Table 3: Number of clips per speaker (SX).

Speakers       S1   S2   S3   S4   S5   S6   S7   S8   S9
No. of clips   11   86   16   24   33   45   24   12    3

Speakers      S10  S11  S12  S13  S14  S15  S16  S17  S18
No. of clips  125   99   30   30   78  132   18   54   60
These are named Multiple emotion Files (M-files). There are 41 such files in this database. The labelling scheme of M-files is as follows: “XX_XX_XXX_XXX_XXX_X_XXX_XX”, where

• The initial characters 1-13 are the same as the label of the segregated data.

• Characters 15-17 refer to the starting emotion clip number.

• Character 19, ‘M’, marks the file as an M-file.

• Characters 21-23 refer to the ending emotion clip number.

• Characters 25-26 refer to the type of clip [for example, the codes ‘AX’, ‘VX’ and ‘AV’ specify the clip in audio, video and audio-visual formats, respectively].

Example: EN_MV_123_FS2_106_M_107_AV
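The M-file labels can be parsed in the same illustrative way (again a sketch of ours, not database tooling):

import re

# Illustrative pattern for M-file labels such as "EN_MV_123_FS2_106_M_107_AV":
# characters 15-17 and 21-23 carry the starting and ending clip numbers.
M_LABEL_RE = re.compile(
    r"^(?P<lang>[A-Z]{2})_(?P<source>[A-Z]{2})_(?P<file_no>\d{3})_"
    r"(?P<gender>[MF])(?P<speaker>S\d)_(?P<start_clip>\d{3})_M_"
    r"(?P<end_clip>\d{3})_(?P<ftype>AX|VX|AV)$"
)

fields = M_LABEL_RE.match("EN_MV_123_FS2_106_M_107_AV").groupdict()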
To analyze the inter-evaluator agreement, the Fleiss Kappa statistic was computed (Fleiss J et al., 1981). The result for the entire database is 0.31. Since the emotional content of the database mainly spans the target emotions (see Table 1), the Kappa statistic was also calculated for the emotional states alone, and it turns out to be 0.41. These levels of agreement, which are considered fair agreement, are expected, since people have different perceptions and interpretations of emotions, and these values are consistent with the agreement levels reported in previous work (Steidl, S et al., 2009; C. Busso et al., 2008; M. Grimm et al., 2008).
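For reference, with N clips, n raters per clip, k categories, and n_ij the number of raters who assigned clip i to category j, Fleiss' Kappa is defined as (Fleiss J et al., 1981):

\[
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}, \qquad
\bar{P} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n(n-1)} \Big( \sum_{j=1}^{k} n_{ij}^{2} - n \Big), \qquad
\bar{P}_e = \sum_{j=1}^{k} \Big( \frac{1}{Nn} \sum_{i=1}^{N} n_{ij} \Big)^{2}
\]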
The database is also labelled using a dimensional approach with two dimensions (primitives), namely arousal and valence. The labelling structure is the same as described in Fig. 1, except that characters 15-16 refer to the two dimensions. The codes using the two primitives, arousal (active-A, passive-P) and valence (positive-P, negative-N), form four combinations, namely AP (active-positive), AN (active-negative), PP (passive-positive) and PN (passive-negative). The neutral samples are labelled NN (neutral-neutral). The statistics of the data as per the dimensions are shown in Table 4.
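As an illustration only (this mapping is our own, following the common circumplex view of emotion; the database's AP/AN/PP/PN/NN labels were assigned by annotators and are not derived from the categorical labels), some categorical states sit roughly in these quadrants:

# Illustrative only: a rough circumplex-style placement of some categorical
# states in the arousal-valence quadrants. This is an assumption for
# illustration, not the annotation procedure used for the database.
QUADRANT = {
    "HA": "AP",  # happy    -> active-positive
    "EX": "AP",  # excited  -> active-positive
    "AN": "AN",  # anger    -> active-negative (note: the code 'AN' is reused)
    "FR": "AN",  # fear     -> active-negative
    "SD": "PN",  # sad      -> passive-negative
    "RL": "PP",  # relaxed  -> passive-positive
    "NU": "NN",  # neutral  -> neutral-neutral
}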
5 Issues in data collection
Table 4: Number of clips as per the two dimensions - arousal (active/passive) and valence (positive/negative).

           Active   Passive
Positive      309       198
Negative      245       209

The ambiguity in annotating the emotion is indicated by specifying the confidence scores. There are two reasons for the ambiguity in annotating the emotions. One of them is the occurrence of mixed emotions in a sentence. For example, there is a possibility of combinations like surprise-happy, frustration-anger, anger-sad, etc., occurring in the dialogue at the same time. For these cases, the subjects were asked to annotate the clip with multiple emotions along with a confidence score for each. If there exist two sub-dialogues in a dialogue, each corresponding to a different emotion, then they are segregated separately as M-files. If there is only one dialogue with mixed emotions, the emotion with the maximum confidence is selected. These kinds of clips with an entire dialogue are considered special cases.
The second reason for ambiguity is the unsustainability of an emotion throughout the dialogue. In natural communication among human beings, emotion, being non-normal (emotional) speech, may not be sustained for the duration of an entire dialogue. The emotion is mostly expressed in some segments of the dialogue, such as at the end or at the beginning, with the rest of the dialogue being neutral. Hence, the corresponding emotion is given in the annotation.
We have also given a confidence score for each audio-visual clip. It indicates the degree of confidence that the labelled emotion is actually present in the clip. Since the confidence score is given by only one person, the clips with lower confidence scores and ambiguities can be better used after performing subjective evaluation. Some of the clips also have abrupt cut-offs due to interruptions by other actors before the completion of the dialogue. Although this database is more generic and closer to natural spontaneous communication, it is still from an enacted source.
6 Possible applications
Due to the variety in this database, applications like speaker-dependent and speaker-independent, as well as gender-dependent and gender-independent, emotion recognition can be studied in audio-alone, video-alone and audio-visual modes. Identification of the non-sustained regions in an entire dialogue will be an interesting research problem. The clips with multiple emotions can also be used to study how an individual can vary his/her emotive state in a dialogue. The perceptual evaluation of these clips with only audio, only video and audio-visual analysis can also be performed. The subjective scores with only audio can be used as ground truth for the evaluation of emotion recognition systems based on audio.
7 Summary
In this paper, we have described the audio-visual emotion data collection, segregation and labelling of audio-video clips from movies and soap-operas in TV broadcasts. It is assumed that generic and natural communication among humans is closely reflected in these sources. The data is collected in three modes: audio, video and audio-visual. The labelling of gender, speaker and emotion is described. Issues in special cases like multiple emotions and the non-sustainability of emotions in a dialogue are addressed. The database is still limited in the number of clips. Data with a sufficient number of clips covering many other cases needs to be developed. In order to standardize the data and to know the perception of emotions by human beings, subjective evaluation needs to be carried out in all three modes (audio, video and audio-visual).
Acknowledgement
The authors would like to thank all the members of the Speech and Vision Lab, especially B. Rambabu, M. Vasudha, K. Anusha, Karthik, Sivanand and Ch. Nivedita, for spending their valuable time in the collection of the IIIT-H AVE database.
References
M. Gnjatovic, D. Rosner, “Inducing Genuine Emotions in Simulated Speech-Based Human-Machine Interaction: The NIMITEK Corpus,” IEEE Transactions on Affective Computing, vol. 1, no. 2, pp. 132-144, July-Dec. 2010.
E. Douglas-Cowie, N. Campbell, R. Cowie, and P. Roach, “Emotional speech: Towards a new generation of databases,” Speech Communication, vol. 40, pp. 33-60, 2003.
B. Schuller, A. Batliner, S. Steidl, and D. Seppi, “Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge,” Speech Communication, vol. 53, pp. 1062-1087, 2011.
D. Ververidis, C. Kotropoulos, “A review of emotional speech databases,” in 9th Panhellenic Conf. on Informatics, November 2003, Thessaloniki, Greece, pp. 560-574.
S. G. Koolagudi, K. S. Rao, “Emotion recognition from speech: a review,” International Journal of Speech Technology, vol. 15, no. 2, pp. 99-117, 2012.
Zhihong Zeng, M. Pantic, G. I. Roisman, T. S. Huang, “A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39-58, Jan. 2009.
F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier,
and B. Weiss, “A database of German emotional
speech,” in Proc. Interspeech, Lisbon, Portugal, pp.
1517-1520, 2005.
I. S. Engberg, A. V. Hansen, O. Andersen, and P.
Dalsgaard, “Design, recording and verification of
a Danish emotional speech database,” in Proc. Eurospeech, Vol. 4, pp. 1695-1698, 1997.
B. Schuller, B. Vlasenko, F. Eyben, M. Wollmer, A. Stuhlsatz, A. Wendemuth, G. Rigoll, “Cross-Corpus Acoustic Emotion Recognition: Variances and Strategies,” IEEE Transactions on Affective Computing, vol. 1, no. 2, pp. 119-131, July-Dec. 2010.
S. G. Koolagudi, S. Maity, V. A. Kumar, S. Chakrabarti, K. S. Rao, “IITKGP-SESC: speech database for emotion analysis,” in LNCS Communications in Computer and Information Science, Berlin: Springer, August 2009.
Steidl, S., “Automatic classification of emotion-related user states in spontaneous children's speech,” Studien zur Mustererkennung, Bd. 28, ISBN 978-3-8325-2145-5, 2009.
C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,” J. Language Resour. Eval., vol. 42, pp. 335-359, 2008.
S. Chung, “Expression and perception of emotion
extracted from the spontaneous speech in Korean
and English,” Ph.D. dissertation, Sorbonne Nouvelle
University, Paris, France.
K. Scherer and G. Ceschi, “Lost luggage emotion: A
field study of emotion-antecedent appraisal,” Motivation and Emotion, Vol. 21, pp. 211-235, 1997.
M. Grimm, K. Kroschel, and S. Narayanan, “The Vera am Mittag German audio-visual emotional speech database,” in Proc. IEEE Int. Conf. Multimedia and Expo (ICME), Hannover, Germany, pp. 865-868, Jun. 2008.
Koelstra, S., Muhl, C., Soleymani, M., Jong-Seok Lee, Yazdani, A., Ebrahimi, T., Pun, T., Nijholt, A., Patras, I., “DEAP: A Database for Emotion Analysis Using Physiological Signals,” IEEE Trans. on Affective Computing, vol. 3, no. 1, pp. 18-31, Jan.-Mar. 2012.
M. Schroder, “Emotional speech synthesis: a review,” in Proc. Eurospeech, vol. 1, pp. 561-564, Aalborg, Denmark, 2001.
I. Murray and J. L. Arnott, “Toward the simulation of
emotion in synthetic speech: A review of the literature on human vocal emotion,” J. Acoust. Soc.
Amer., pp. 1097-1108, 1993.
D. Erickson, K. Yoshida, C. Menezes, A. Fujino, T. Mochida, and Y. Shibuya, “Exploratory study of some acoustic and articulatory characteristics of sad speech,” Phonetica, vol. 63, pp. 1-5, 2006.
H. Kawanami, Y. Iwami, T. Toda, H. Saruwatari, and
K. Shikano, “GMM-based voice conversion applied
to emotional speech synthesis,” in Proc. Eurospeech,
pp. 2401-2404, Geneva, Switzerland, 2003.
R. Banse and K. Scherer, “Acoustic profiles in vocal emotion expression,” Journal of Personality and Social Psychology, vol. 70, no. 3, pp. 614-636, 1996.
D. Erro, E. Navas, I. Hernaez, and I. Saratxaga, “Emotion conversion based on prosodic unit selection,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 5, pp. 974-983, Jul. 2010.
I. Sneddon, M. McRorie, G. McKeown, J. Hanratty, “The Belfast Induced Natural Emotion Database,” IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 32-41, Jan.-March 2012.
E. Douglas-Cowie, R. Cowie, and M. Schroeder, “A new emotion database: Considerations, sources and scope,” in Proc. ISCA ITRW on Speech and Emotion, Newcastle, pp. 39-44, Sep. 2000.
N. Amir, S. Ron, and N. Laor, “Analysis of an emotional speech corpus in Hebrew based on objective criteria,” in Proc. ISCA ITRW on Speech and Emotion, Newcastle, pp. 29-33, Sep. 2000.
C. M. Lee and S. S. Narayanan, “Toward detecting
emotions in spoken dialogs,” IEEE Trans. Audio,
Speech, Lang. Process., vol. 13, no. 2, pp. 293-303,
Mar. 2005.
K. P. Truong, D. A. van Leeuwen, and F. M. G. de Jong, “Speech-based recognition of self-reported and observed emotion in a dimensional space,” Speech Communication, vol. 54, no. 9, pp. 1049-1063, Nov. 2012.
K. Scherer, “Vocal communication of emotion: A review of research paradigms,” Speech Communication, vol. 40, pp. 227-256, 2003.
R. Cowie and R. R. Cornelius, “Describing the emotional states that are expressed in speech,” Speech Communication, vol. 40, pp. 2-32, 2003.
Fleiss J, “Statistical methods for rates and proportions”.
New York, NY, USA: John Wiley & Sons, 1981.