Neurocomputing 545 (2023) 126271
Whose emotion matters? Speaking activity localisation without prior
knowledge
Hugo Carneiro *, Cornelius Weber, Stefan Wermter
University of Hamburg, Department of Informatics, Vogt-Koelln-Str. 30, Hamburg 22527, Germany
Article info
Article history:
Received 2 November 2022
Revised 10 March 2023
Accepted 22 April 2023
Available online 03 May 2023
Keywords:
Multimodality
Active speaker detection
Emotion recognition
Forced alignment
Abstract
The task of emotion recognition in conversations (ERC) benefits from the availability of multiple modalities, as provided, for example, in the video-based Multimodal EmotionLines Dataset (MELD). However, only a few research approaches use both acoustic and visual information from the MELD videos. There are two reasons for this: First, label-to-video alignments in MELD are noisy, making those videos an unreliable source of emotional speech data. Second, conversations can involve several people in the same scene, which requires the localisation of the utterance source. In this paper, we introduce MELD with Fixed Audiovisual Information via Realignment (MELD-FAIR). Using recent active speaker detection and automatic speech recognition models, we are able to realign the videos of MELD and capture the facial expressions from speakers in 96.92% of the utterances provided in MELD. Experiments with a self-supervised voice recognition model indicate that the realigned MELD-FAIR videos more closely match the transcribed utterances given in the MELD dataset. Finally, we devise a model for emotion recognition in conversations trained on the realigned MELD-FAIR videos, which outperforms state-of-the-art models for ERC based on vision alone. This indicates that localising the source of speaking activities is indeed effective for extracting facial expressions from the uttering speakers and that faces provide more informative visual cues than the visual features state-of-the-art models have been using so far. The MELD-FAIR realignment data, and the code of the realignment procedure and of the emotion recognition model, are available at https://github.com/knowledgetechnologyuhh/MELD-FAIR.

© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
Emotion recognition in conversations (ERC) is a task that
involves recognising the emotion of interlocutors in a dialogue.
Challenges of this task include the modelling of the conversational
context and how the emotion of the interlocutors may change
depending on that context, which is called emotion shift [31].
ERC can prove helpful in real-world scenarios in which people
are talking with each other, for example, in human-robot interaction applications [17,22,39]. However, most ERC datasets are
exclusively based on text transcriptions of conversations
[18,26,42] or are restricted to dyadic interactions in very controlled
environments [5,29].
Poria et al. [31] published the first large-scale multimodal ERC
dataset with several interlocutors, the Multimodal EmotionLines
Dataset (MELD). The dataset consists of videos extracted from
the Friends TV series. Each video is cut to match a single utterance,
and the videos are organised into dialogues and utterances, with
each dialogue having one or more utterances. Together with the
acoustic and visual information provided by the videos, the text
transcription of every utterance and the speaker label are also
provided.
Many approaches have been proposed to tackle the task of ERC
in MELD. Even though MELD was created to be a multimodal dataset, most of the approaches rely exclusively on textual information
[13,24,25,33,36,44]. Using the visual modality is difficult due to
frequent misalignments between video cuts and the expected corresponding utterances (see Fig. 1 for an example). This is likely a consequence of the automatic generation of the video cuts with the Gentle¹ transcription alignment tool.
For some years, there has been a demand for more reliable information from the visual modality, given the frequent problems of video-text synchronisation². Video cuts and utterance transcriptions can be misaligned in a variety of ways. Fig. 1a presents two cases of misalignment. In case I, the utterance appears within the
first half of the video cut and another person's utterance is falsely assigned to the same cut. In case II, the utterance starts being spoken in the video cut assigned to the preceding utterance and continues through the first half of the video cut assigned to that target utterance. Fig. 1b depicts the corrected alignment between the video cuts and their corresponding utterance transcriptions. This is a result of our dataset refinement procedure (cf. Section 3).

Facial expressions and speech signals provide relevant information regarding the emotion of a person. However, the noticeable number of mismatches between video cuts and the corresponding utterance transcriptions hindered the use of those modalities for some years, with information from the visual modality being disregarded even by the dataset creators, who stated that video-based speaker localisation was still an open problem [31]. Accordingly, the dataset itself offers no information on the location of the faces of the uttering speakers.

Speech data from the videos of MELD has been used for ERC since the work of Poria et al. [31], albeit rarely. However, without proper alignment correction of the videos, audio samples used for this task can include speech from other speakers with different emotions. In contrast, interest in the use of visual information from the MELD videos has arisen only quite recently [9,19,21,27,41]. However, alongside the problems that arise from the lack of proper realignment, the proposed solutions do not take into account the necessity of localising the source of the speaking activity in a particular scene or frame, which, in turn, is useful for extracting the emotional facial expressions of the uttering speaker. The added information from the acoustic and visual modalities has improved ERC compared to models that use information obtained exclusively from utterance transcriptions. However, those improvements are limited because of the unreliability of those modalities.

Fig. 1. Example of misaligned video cuts provided in the MELD dataset, and their corresponding correction. The different colours in the utterances represent the different speakers in the video cuts.

* Corresponding author.
E-mail addresses: hugo.carneiro@uni-hamburg.de (H. Carneiro), cornelius.weber@uni-hamburg.de (C. Weber), stefan.wermter@uni-hamburg.de (S. Wermter).
¹ https://lowerquality.com/gentle/
² https://github.com/declare-lab/MELD/issues/9
https://doi.org/10.1016/j.neucom.2023.126271
Recent advances in active speaker detection (ASD) in the wild [2,3,6,30,37,43] indicate the capability of audiovisual neural models to localise sources of speaking activity in videos given the faces of the people as well as the audio of a scene. Localising the active speaker can enable more reliable emotion recognition from video in MELD. State-of-the-art ASD models can be very precise in determining who among multiple people is speaking, especially if there are at least a few seconds of continuous speaking activity. Multi-party scenarios can still present challenges in accurately localising the source of some particular speaking activity. These challenges include:

i) the partial occlusion of the speaker's face by objects or other people;
ii) the presence of other people in the same scene moving their mouths, even though they are not actively speaking;
iii) interfering noise, such as background chatting or, in the case of TV sitcoms, laugh tracks; and
iv) the active speaker not being in the main focus of the scene, with the speaker's face having a considerably smaller size and resolution than those of other, non-speaking people.

Most in-the-wild ASD models were trained on AVA-ActiveSpeaker, a dataset containing videos in a large variety of resolutions [32]. The videos of AVA-ActiveSpeaker contain scenes with multiple people speaking with each other, which is similar to the conversational scenes in the videos of MELD. Fig. 2 displays examples of conversational scenes present in the videos of the AVA-ActiveSpeaker dataset.

The first contribution of this paper is a new method to extract the position of the faces of active speakers, which can be useful for tasks in which facial information may provide additional relevant information but, for some reason, the face position is not given in the dataset. The procedure can be used on any dataset of humans speaking that lacks annotation concerning the visual modality, e.g., the position of the speaker's face. The second contribution of this paper is the evaluation of this procedure on the MELD dataset, and the consequent development of a refined version of MELD, named MELD with Fixed Audiovisual Information via Realignment (MELD-FAIR). Finally, to assess the applicability of the extraction of the faces of active speakers for the task of ERC, we propose an emotion recognition model whose outstanding performance on the visual data indicates that the faces extracted from the active speakers indeed provide an informative visual cue for the task of ERC.

The paper is structured as follows. Section 2 offers a brief overview and some specific details of the MELD dataset. Section 3 describes the dataset refinement procedure, which consists of correcting the alignment between video cuts and the corresponding utterances, and determining the position of the face of the uttering speaker in each frame of the newly produced video cuts. Section 4 provides a quantitative analysis of the resulting dataset, comparing it with the characteristics of the original dataset provided in Section 2. That section also reports experiments that evaluate how well the resulting dataset applies to the task of emotion recognition. Section 5 discusses the results.

2. The MELD dataset

MELD contains scenes from various episodes of the Friends TV series. Those scenes are denoted as dialogues, and each dialogue is organised as a sequence of utterances. For every utterance, there is a corresponding dataset entry containing the speaker's identity, emotion and sentiment. The annotated emotion can be either one of Ekman's universal emotions (joy, sadness, fear, anger, disgust, and surprise), or neutral if no particular emotion was noticed by the dataset annotators.

MELD is split into three sets, denoted train, dev, and test. Each data record in those splits contains the following information: the utterance, its speaker, the emotion perceived in that utterance, the corresponding sentiment, a dialogue identifier, an utterance identifier, the season and episode of Friends in which that scene happened, a time stamp determining where that scene starts, as well as one determining where it ends. For every split, a dataset record can be uniquely identified by its dialogue identifier and its utterance identifier.

Table 1 presents an excerpt of a conversation, containing a sequence of contiguous data records and corresponding labels for the uttering speaker, his or her emotion, and the corresponding sentiment.
Fig. 2. Examples of conversational scenes from AVA-ActiveSpeaker videos. Green boxes identify those who are speaking, whereas red boxes mark silent people.
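As an illustration of the record structure, the last utterance of the excerpt in Tables 1 and 2 can be represented as a keyed record; the field names below are ours for illustration, not the dataset's CSV headers:

```python
# Illustrative sketch of a MELD record (field names are ours, not the
# dataset's CSV headers), using utterance U10 from Tables 1 and 2.
record = {
    "dialogue_id": 0,            # Dia D0
    "utterance_id": 10,          # Utt U10
    "utterance": "No, don’t. I beg of you!",
    "speaker": "Chandler",
    "emotion": "fear",
    "sentiment": "negative",
    "season": 8, "episode": 21,  # S8 E21
    "start_time": "0:17:02.856",
    "end_time": "0:17:04.858",
}

# Within a split, (dialogue identifier, utterance identifier) is unique,
# so the pair can serve as the key of an index over the split's records.
split_index = {(record["dialogue_id"], record["utterance_id"]): record}
```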
Table 1
Excerpt of a dyadic conversation from the MELD train split with corresponding speaker, emotion, and sentiment information.

Dia  Utt  Utterance                                                                               Speaker      Emotion  Sentiment
D0   U5   Now you’ll be heading a whole division, so you’ll have a lot of duties.                 Interviewer  neutral  neutral
D0   U6   I see.                                                                                  Chandler     neutral  neutral
D0   U7   But there’ll be perhaps 30 people under you, so you can dump a certain amount on them.  Interviewer  neutral  neutral
D0   U8   Good to know.                                                                           Chandler     neutral  neutral
D0   U9   We can go into detail.                                                                  Interviewer  neutral  neutral
D0   U10  No, don’t. I beg of you!                                                                Chandler     fear     negative
The misalignment of the video cuts can produce overlaps, which are indicated by the start and end time stamps. Table 2 indicates that two videos of consecutive utterances present an overlap due to a wrongly executed alignment process.

3. Dataset refinement procedure

The extraction of emotional speech and emotional facial expressions depends on having audio samples that match the spoken utterance closely enough and on being capable of localising the uttering speaker in a scene, particularly that person's face. To meet both requirements, the dataset refinement procedure is divided into two parts, with each part addressing one requirement. First, the videos of MELD are realigned, such that their audios match the target utterance closely enough, as indicated in the flow chart in Fig. 3a. Next, with the videos properly realigned, the faces of the people in the scene are extracted and organised into sequences. Then, given the extracted sequences of faces and the scene audio, an ASD model determines which of these sequences corresponds to the uttering speaker (see the flow chart in Fig. 3b). The resulting set, including the realigned audio from all videos and the sequence of facial expressions of the person vocalising in each of those videos, constitutes a refined version of MELD called MELD-FAIR.

Fig. 3. Major steps of the dataset refinement procedure. Orange arrows indicate the application of a model, and blue arrows represent information flow.

3.1. Video realignment

Each video in the MELD dataset corresponds to a particular utterance, which, in turn, belongs to a sequence of utterances, also called a dialogue. Videos that are misaligned with their corresponding utterances are a consequence of a mistaken determination of where the boundaries of those utterances lie within their respective dialogues. A considerable number of misaligned videos may prevent the proper identification of the source of speaking activity, especially because the speaking activity can happen partially in the target video and partially in the video that precedes or follows it in the dialogue (see the example depicted in Fig. 1).

The realignment of the videos takes into account that videos belonging to the same dialogue are organised sequentially. First, the audio signal of every dataset video is extracted. Next, for every split r ∈ {train, dev, test} and dialogue d, the audio signals a_{r,d,u} corresponding to each utterance u belonging to dialogue d are concatenated in order. Existing overlaps, such as the one indicated in Table 2, are removed by truncating the audio signals that lead to those overlaps. Silence blocks are added between consecutive video cuts if there is a time difference between the end time stamp of a video and the start time stamp of the following one. The length of a silence block is equal to the corresponding time difference, but long silence blocks are capped at 250 ms. Due to a few videos whose length is much longer than their corresponding utterances, video lengths are also capped at 45 s. This affects two of altogether 13,708 videos (see Appendix A for an indication of the videos affected by the 45-s cap).

Table 2
Additional information in MELD about the utterances presented in Table 1. Overlaps between video cuts due to a mistaken determination of the start and end times of an utterance are marked with an asterisk.

Dia  Utt  Season  Episode  Start time     End time
D0   U5   S8      E21      0:16:41.126    0:16:44.337
D0   U6   S8      E21      0:16:48.800    0:16:51.886*
D0   U7   S8      E21      0:16:48.800*   0:16:54.514
D0   U8   S8      E21      0:16:59.477    0:17:00.478
D0   U9   S8      E21      0:17:00.478    0:17:02.719
D0   U10  S8      E21      0:17:02.856    0:17:04.858

Fig. 4 presents a graphical representation of the concatenation of audio signals. Each box labelled U5 to U10 represents the audio signal of an utterance. The lengths of those boxes are proportional to the durations of the utterances presented in Tables 1 and 2, which can be inferred from their start and end time stamps. The label used in each box corresponds to the utterance identifier given in Table 2. The gaps between the boxes are proportional to the distance between the end of an utterance and the beginning of the following one. The figure presents an example of an overlap being removed by altering the start time of utterance U7. It also shows the insertion of silence blocks where the gaps between utterances lie, and the subsequent capping of silence block lengths to 250 ms.

Fig. 4. Schematic representation of the concatenation of the audio signals of a dialogue. First, the audios of all utterances of a given dialogue are concatenated, with silence blocks inserted wherever adequate. Next, the lengths of the silence blocks are reduced to a minimum length that still allows for the identification of individual blocks of consecutive utterances (e.g., utterances U6 and U7, and U8 and U9).
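The concatenation bookkeeping can be sketched as follows. This is a minimal illustration under simplifying assumptions (function and constant names are ours): utterance spans are given in seconds, an overlapping cut's start is moved forward to the previous cut's end (as done for U7 in Fig. 4), silence gaps are capped at 250 ms, and cut lengths are capped at 45 s.

```python
MAX_SILENCE_S = 0.250  # cap for inserted silence blocks
MAX_CUT_S = 45.0       # cap for overlong video cuts

def concatenation_plan(spans):
    """Turn (start, end) utterance spans into a sequence of
    ("speech" | "silence", duration) segments for one dialogue."""
    plan = []
    prev_end = None
    for start, end in spans:
        if prev_end is not None:
            if start < prev_end:
                start = prev_end  # overlap: move the later cut's start forward
            else:
                # gap: insert a silence block, capped at 250 ms
                plan.append(("silence", min(start - prev_end, MAX_SILENCE_S)))
        plan.append(("speech", min(end - start, MAX_CUT_S)))
        prev_end = end
    return plan

# U6, U7, and U8 from Table 2 (converted to seconds): U6/U7 overlap,
# and the U7/U8 gap is longer than 250 ms.
plan = concatenation_plan([(1008.800, 1011.886),
                           (1008.800, 1014.514),
                           (1019.477, 1020.478)])
```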
The utterance transcriptions are concatenated as well. Prior to their concatenation, all punctuation marks in each transcription are removed, and a start-of-sequence and an end-of-sequence token are appended to the respective ends of every utterance transcription within a dialogue. With the audio signals and the transcriptions properly concatenated, the text of the concatenated transcription is aligned to the concatenated audio through forced alignment using connectionist-temporal-classification (CTC) segmentation [23]. Given a speech audio signal, CTC segmentation uses frame-based character posterior probabilities generated by a CTC-based end-to-end network. From these character-level probabilities, maximum joint probabilities are computed via dynamic programming. These maximum joint probabilities indicate how likely a given excerpt from the dialogue transcription is aligned to a particular slice of the speech audio signal. After the maximum joint probability for the alignment of the complete dialogue transcription to the whole speech audio signal is computed, the character-wise alignment is obtained by backtracking from the most probable temporal position of the last character in the transcription. The CTC-based end-to-end network used to generate the character-level probabilities had to be pretrained on already aligned data, for which the Wav2Vec2 [4] automatic speech recognition transformer model³ [40] was used.
The video realignment procedure is executed for each dialogue in the dataset. Most of the processing time is dedicated to the generation of frame-based character posterior probabilities by the CTC-based end-to-end network and the subsequent computation of maximum joint probabilities. The former is run on graphics processing units (GPUs) with high parallelisation capabilities. The latter involves a dynamic programming algorithm whose processing time is proportional to the number of audio frames of the whole dialogue and to the square of the number of characters of the concatenated utterance transcriptions.
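The dynamic programme can be illustrated with a toy aligner. This is not the actual CTC-segmentation implementation used in the paper (which additionally handles character repetitions and transition scores and operates on Wav2Vec2 posteriors); it is a minimal monotonic aligner over hypothetical log posteriors, where each frame emits either a blank or the next target character:

```python
import math

def forced_align(log_probs, targets, blank=0):
    """Toy monotonic forced aligner in the spirit of CTC segmentation:
    log_probs[t][k] is the log posterior of symbol k at frame t; returns,
    for each target symbol, the frame at which it is emitted, chosen by
    maximising the joint probability via dynamic programming."""
    T, N = len(log_probs), len(targets)
    NEG = -math.inf
    # dp[t][i]: best log probability of having emitted the first i targets
    # after consuming the first t frames (non-emitting frames take blank)
    dp = [[NEG] * (N + 1) for _ in range(T + 1)]
    back = [[0] * (N + 1) for _ in range(T + 1)]
    dp[0][0] = 0.0
    for t in range(1, T + 1):
        for i in range(N + 1):
            stay = dp[t - 1][i] + log_probs[t - 1][blank]
            move = dp[t - 1][i - 1] + log_probs[t - 1][targets[i - 1]] if i else NEG
            dp[t][i], back[t][i] = (move, 1) if move > stay else (stay, 0)
    # backtrack from the most probable position of the last target symbol
    frames, t, i = [], T, N
    while i > 0:
        if back[t][i]:
            frames.append(t - 1)
            i -= 1
        t -= 1
    return frames[::-1]

# Posteriors sharply peaked at: frame 0 -> symbol 1, frame 2 -> symbol 2.
log_probs = [[-3.0, -0.1, -3.0],
             [-0.1, -3.0, -3.0],
             [-3.0, -3.0, -0.1],
             [-0.1, -3.0, -3.0]]
frames = forced_align(log_probs, [1, 2])  # -> [0, 2]
```

The quadratic character term mentioned above comes from the real algorithm's handling of partial alignments; the toy above is linear in the number of characters for simplicity.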
3.2. Uttering speaker localisation
With videos that very likely contain the part of a scene in which
a given utterance is said, it is possible to localise the source of the
speaking activity, i.e., the person who spoke the utterance. Fig. 3b
schematically represents the process of extracting the speech
audio as well as face images of the uttering speaker from a video.
As a first step, an efficient face detection model with sample and
computation redistribution (SCRFD-10GF) [15] is used to detect
all faces in every frame of those videos. Faces detected this way are subsequently extracted and organised into ordered groups, creating several sequences of faces. Each face is identified by the video frame from which it is extracted and by an identifier of the sequence it belongs to. For the organisation of the faces into sequences, faces detected in consecutive frames are considered to belong to the same sequence if the intersection-over-union (IoU) ratio between their areas is greater than a given threshold θ. If there is more than one pair of faces extracted from consecutive frames that satisfies this condition, the face pair with the highest IoU ratio is considered as belonging to the same sequence.
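The IoU-based linking of face detections into sequences can be sketched as follows; the box format (x1, y1, x2, y2), the threshold value, and the greedy matching order are illustrative assumptions:

```python
IOU_THRESHOLD = 0.5  # illustrative value for the threshold

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def link(tracks, detections):
    """Greedily extend tracks with the highest-IoU detection
    of the next frame; unmatched detections start new tracks."""
    pairs = [(iou(trk[-1], det), t, d)
             for t, trk in enumerate(tracks) for d, det in enumerate(detections)]
    used_t, used_d = set(), set()
    for score, t, d in sorted(pairs, reverse=True):
        if score > IOU_THRESHOLD and t not in used_t and d not in used_d:
            tracks[t].append(detections[d])
            used_t.add(t)
            used_d.add(d)
    for d, det in enumerate(detections):
        if d not in used_d:
            tracks.append([det])
    return tracks

# A slightly shifted face continues the existing track; a distant face
# starts a new one.
tracks = link([[(0, 0, 10, 10)]], [(1, 0, 11, 10), (100, 100, 110, 110)])
```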
Each face sequence and the corresponding slice of the speech
signal is then sent to TalkNet-ASD [37], an audiovisual ASD model,
to determine whether that face sequence presents some indication
of speaking activity that resembles that slice of the speech signal.
Fig. 5 shows a sketch of TalkNet-ASD’s architecture. TalkNet-ASD
uses a visual temporal encoder (VTE) to learn long-term representations of facial expression dynamics, and an audio temporal encoder (ATE) to learn audio content representations from the temporal
dynamics [37].
Fig. 5. TalkNet-ASD architecture.

VTE consists of a front end, where video frame streams are encoded into sequences of frame-based embeddings, and a visual temporal network, whose aim is to represent the temporal content in a long-term spatiotemporal structure [37]. Its front end is based on the vision module introduced in [1], consisting of a 3D convolution layer with a filter width of 5 frames followed by a 2D 18-layer residual network. Given an input with dimensions T_v × C × W × H, where T_v is the number of frames, and C, W and H are the number of channels, width and height of each frame, the front end yields a tensor with dimensions T_v × (W/32) × (H/32) × 512, which is subsequently average-pooled in both its spatial dimensions, thus producing a feature vector with 512 dimensions for each input frame. Similarly to the visual model of Afouras et al. [1], TalkNet-ASD receives a sequence of greyscale images, which means that the number of channels C in each frame is 1. TalkNet-ASD's visual temporal network (V-TCN) consists of a 5-block residual network followed by a sequence of two 1D convolution layers. The residual blocks consist of a 1D depth-separable convolution layer followed by rectified linear units and batch normalisation layers. The residual network is responsible for obtaining a representation of the temporal content. The representation consists of a tensor with dimensions T_v × 512. The sequence of 1D convolution layers finally reduces the dimensionality of this tensor, yielding a visual embedding F_v of dimensions T_v × 128, i.e., 128 dimensions for every input frame.

³ More specifically, the Wav2Vec2 Large (LV-60) model pretrained and fine-tuned on 960 h of speech audio from Libri-Light and Librispeech (see the list of pretrained models at https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec).

The speech signal is first encoded as a sequence of overlapping audio frames, each one characterised by a 13-dimensional vector of Mel-frequency cepstral coefficients (MFCCs) based on a window size of 25 ms and a window step of 10 ms. This means that, given a sequence of T_a audio frames, ATE receives as input a tensor with dimensions 1 × 13 × T_a. ATE consists of a 2D 34-layer residual network with squeeze-and-excitation (SE) modules [20]. The number of channels in each block of the ResNet34 network is also reduced to one quarter of the number in each block of the original ResNet with 34 layers, similarly to the Thin ResNet34 introduced by Chung et al. [10]. The output of the audio encoder is an audio embedding F_a of dimensions (T_a/4) × 128. The dimensions of F_a and F_v, the embeddings output by both encoders, match when the number of audio frames is equal to four times the number of visual frames (or face crops). The matching of their dimensions is a necessary feature for the subsequent attention mechanism. A direct implication of the number of audio frames being four times the number of video frames is that each video frame corresponds to roughly 40 ms of the video (or 25 fps), since the length of the window step between consecutive overlapping audio frames is 10 ms.

With the motivation of audiovisual synchronisation working as an informative cue for speaking activities, TalkNet-ASD contains a cross-attention subnetwork that receives F_a and F_v as inputs, and outputs an audio attention feature F_a→v and a video attention feature F_v→a. F_a→v is obtained through the application of F_v as the target sequence to generate the query Q_v in the attention layer and F_a as the source sequence to generate the key K_a and value V_a. F_v→a is obtained through an analogous process. Next, F_a→v and F_v→a are concatenated into a single audiovisual attention feature vector F_av, which is sent to a self-attention subnetwork whose aim is to model audiovisual utterance-level information, and in this way distinguish between speaking and non-speaking frames. Both the cross-attention and self-attention subnetworks contain one transformer layer with eight attention heads each [40].

Tao et al. [37] offer a practical implementation of TalkNet-ASD⁴, which we apply to the facial expression and emotional speech data extracted from the realigned MELD videos. In that implementation, each of the face tracks of a given person and the corresponding audio frame sequence are split into blocks and sent to TalkNet-ASD to determine in which frames that given person is actively speaking. Each of those blocks corresponds to a video sequence of up to φ video frames. Several values of φ are used in the implementation, namely 25, 50, 75, 100, 125, and 150, as a means to guarantee a more reliable result. A given value of φ implies that φ face images and 4φ audio frames in each block are used as input to the TalkNet-ASD model. TalkNet-ASD yields φ scores s_{i,j,φ} per block, indicating whether a given person p_j is detected as actively speaking in frame f_i in that block composed of φ video frames. After getting all scores for every frame, with all different possible values of φ, a resulting score s_{i,j} is obtained by averaging the scores s_{i,j,φ}. A score s_{i,j} > 0 indicates that person p_j is predicted as actively speaking in frame f_i. Fig. 6 provides two examples of the application of TalkNet-ASD to the videos of MELD. In both examples, the uttering speakers are marked with green boxes around their faces.

After TalkNet-ASD has generated scores for each face track, the scores are grouped based on their respective tracks to determine which faces belong to the same person. However, if two face tracks contain faces from the same frame and both tracks have detected speaking activity, this can result in a "false positive", where one of the tracks belongs to someone who is not actively speaking. Face tracks that provide conflicting information on the active speaker are called conflicting face tracks. To reduce false positives, the face tracks are grouped based on the camera cut in which they appear. Each group contains a set of face tracks where each track has a conflicting track within the same set. A heuristic is used to eliminate conflicting face tracks according to three criteria:

i) reduce the number of conflicting face tracks to zero;
ii) maximise the total number of faces associated with speaking activity for all non-conflicting face tracks within a set; and
iii) minimise the number of face tracks, provided the first two criteria are met.

The last criterion is due to the low likelihood of a single person having their extracted face sequence appear in several face tracks within the same set. After eliminating the conflicting tracks, the remaining non-conflicting tracks are grouped together and ordered based on their associated frame numbers. The procedure then outputs the resulting sequence of faces, which is associated with the active speaker.
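The multi-scale score averaging can be sketched as follows; the data layout is our assumption for illustration, not TalkNet-ASD's actual interface:

```python
from statistics import mean

BLOCK_SIZES = (25, 50, 75, 100, 125, 150)  # values of phi

def frame_decisions(scores_by_phi):
    """scores_by_phi maps phi -> {(frame, person): score}; the final
    score per (frame, person) is the mean over all block sizes, and a
    positive mean marks the person as actively speaking in that frame."""
    merged = {}
    for phi in BLOCK_SIZES:
        for key, score in scores_by_phi.get(phi, {}).items():
            merged.setdefault(key, []).append(score)
    return {key: mean(vals) > 0 for key, vals in merged.items()}

# Hypothetical scores for one frame/person pair across three block sizes:
decisions = frame_decisions({25: {(0, "A"): 1.2},
                             50: {(0, "A"): -0.4},
                             75: {(0, "A"): 0.5}})
```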
The utterance speaker localisation procedure is executed for every realigned MELD video, each of which is assigned to one particular utterance in the dataset. Most of the time consumption derives from the frame-wise face extraction and from the detection of speaking activity in every sequence of facial expressions previously extracted and organised.
⁴ https://github.com/TaoRuijie/TalkNet-ASD/blob/main/demoTalkNet.py

Fig. 6. Examples of conversational scenes from MELD videos. Green boxes identify those who are speaking, whereas red boxes mark silent people.

4. Assessment of the MELD-FAIR dataset

To assess the applicability of MELD-FAIR in ERC, it is important to determine whether the distribution of its data after the dataset refinement procedure is kept similar to that of the original dataset. Two criteria can be used to evaluate whether the data distribution was kept similar to its original distribution. Specific steps of the dataset refinement depend on the target uttering speaker; thus, it is desirable that the proportion of utterances in MELD-FAIR assigned to a given speaker remains close to its original proportion in MELD. Similarly, the proportion of utterances assigned to a given emotion should also be kept close to its original proportion, so as not to alter the task. Moreover, because MELD was built for emotion recognition in conversational contexts, it is worthwhile to determine the portion of dialogues in which the data of at least one utterance was removed during the dataset refinement process.

After assessing whether most of the original utterances are kept in MELD-FAIR, and whether its data distribution is nearly unaltered, it is worthwhile analysing whether the video realignment produces refined speech signals that actually correspond to the speakers provided by the dataset. The retention of many original utterances and the proper correspondence between the speech signals and the expected speakers are indications that the acoustic data is reliable and therefore useful for an application in ERC. Finally, to determine the reliability of the process of localising the uttering speaker, we propose using an emotion recognition model trained on MELD-FAIR and comparing its performance to existing ERC approaches trained on the original version of MELD that use information from the visual and/or acoustic modalities. A superior performance of our emotion recognition model would indicate that the emotional facial expressions extracted by the uttering speaker localisation procedure are indeed useful for emotion recognition applications.
4.1. Properties of the MELD-FAIR dataset

The process of dataset refinement consists of two steps, video realignment and utterance source localisation. These refining steps may lead to some utterances of the original MELD dataset not having corresponding audiovisual data in MELD-FAIR. This may happen for two main reasons. First, the video realignment step may produce an empty video for a given utterance if the CTC segmentation algorithm determines that, in the most likely alignment, the target utterance is aligned to a very small slice of the dialogue audio. Second, even when new video cuts are produced in the video realignment step, no uttering speaker may be located in the scene.
Tables 3 and 4 present the number of dataset records for which there are corresponding audiovisual data in MELD-FAIR, alongside the number of records in the original version of the dataset. Table 3 presents the distribution of records according to the annotated emotion and dataset split, and Table 4 according to the uttering speaker and dataset split. Tables 3 and 4 show that the dev and test splits each lost approximately 2.5% of their records in the dataset refinement process. For the train split, the loss was also relatively small, with the audiovisual data of MELD-FAIR covering 96.7% of the utterances of the original MELD dataset.
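The retention rates above follow directly from the split totals of Tables 3 and 4:

```python
# Records with audiovisual data in MELD-FAIR vs. records in the
# original MELD, per split (the totals of Tables 3 and 4).
splits = {"train": (9658, 9989), "dev": (1081, 1109), "test": (2547, 2610)}
for name, (kept, total) in splits.items():
    print(f"{name}: {100 * kept / total:.1f}% retained")
# train retains ~96.7%; dev and test each lose ~2.5%
```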
Table 3
Distribution of emotion annotations in the MELD-FAIR dataset. The numbers of original dataset records for each emotion and split are given inside parentheses.

Emotion     train          dev           test          Total
neutral     4537 (4710)    461 (470)     1226 (1256)   6224 (6436)
joy         1683 (1743)    160 (163)     389 (402)     2232 (2308)
surprise    1158 (1205)    140 (150)     270 (281)     1568 (1636)
sadness     670 (683)      109 (111)     207 (208)     986 (1002)
fear        261 (268)      39 (40)       49 (50)       349 (358)
anger       1082 (1109)    150 (153)     339 (345)     1571 (1607)
disgust     267 (271)      22 (22)       67 (68)       356 (361)
Total       9658 (9989)    1081 (1109)   2547 (2610)   13286 (13708)
Table 4
Distribution of uttering speakers in the MELD-FAIR dataset. The numbers of original dataset records for each speaker and split are given inside parentheses.

Speaker     train          dev           test          Total
Rachel      1392 (1435)    158 (164)     350 (356)     1900 (1955)
Monica      1253 (1299)    130 (137)     338 (346)     1721 (1782)
Phoebe      1269 (1321)    183 (185)     277 (291)     1729 (1797)
Joey        1456 (1509)    146 (149)     399 (411)     2001 (2069)
Chandler    1243 (1283)    100 (101)     374 (379)     1717 (1763)
Ross        1410 (1459)    211 (217)     368 (373)     1989 (2049)
others      1635 (1683)    153 (156)     441 (454)     2229 (2293)
Total       9658 (9989)    1081 (1109)   2547 (2610)   13286 (13708)
signal. Finally, a fully connected layer outputs a prediction regarding the expected speaker of the speech signal from this feature
vector.
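The encoder–pooling–classifier pipeline of the speaker identification model described in Section 4.2.1 can be sketched as follows. The per-frame input dimension and the stand-in linear encoder are assumptions for illustration (the paper uses a ResNet34-based encoder); only the shapes match the description:

```python
import torch

class SpeakerIdentifier(torch.nn.Module):
    """Sketch: an encoder (a ResNet34 in the paper, stubbed here as a
    linear layer over per-frame features) yields an embedding F_a of
    shape (T_a, 512); temporal max pooling collapses it to a single
    512-d vector, and a fully connected layer predicts one of the six
    main characters."""

    def __init__(self, frame_dim=128, num_speakers=6):
        super().__init__()
        self.encoder = torch.nn.Linear(frame_dim, 512)   # stand-in encoder
        self.classifier = torch.nn.Linear(512, num_speakers)

    def forward(self, frames):          # frames: (T_a, frame_dim)
        f_a = self.encoder(frames)      # F_a: (T_a, 512)
        pooled, _ = f_a.max(dim=0)      # temporal max pooling -> (512,)
        return self.classifier(pooled)  # logits over the six speakers

logits = SpeakerIdentifier()(torch.randn(100, 128))
```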
The data distribution was kept nearly unaltered. For instance, the largest change in the data distribution occurred in the fraction of records assigned to the neutral emotion in the train split. Out of the original 4710 train-split records assigned to the neutral emotion, the dataset refinement procedure was unable to retrieve corresponding audiovisual data for only 173 records. This corresponds to 3.67% of those records, and to 1.73% of all records in the train split. These records, each corresponding to one utterance, are well dispersed throughout the whole dataset. As a consequence, the fraction of dialogues that lost at least one of their utterances in the dataset refinement procedure is moderately higher: 222 of the 1038 dialogues of the train split, i.e., 21.4%, contain at least one utterance with no corresponding audiovisual data in MELD-FAIR. For the dev and test splits, this reduction was lower: 19 of the 114 dialogues of the dev split, i.e., 16.7%, have utterances with no corresponding audiovisual data in MELD-FAIR, as do 49 of the 280 dialogues of the test split, i.e., 17.5%.
4.2.2. Data augmentation
Following Tao et al. [37], negative sampling is used to augment the available speech data. In negative-sampling augmentation, a sample is combined with interfering data from the same batch that shares its label, i.e., both the original and the interfering speech signal are expected to have been uttered by the same speaker. After randomly selecting interfering data with these characteristics, the original audio track and that of the interfering data are combined into a mixture of both. By exploiting the in-domain noise and the interfering speech signals from the training set itself, this approach presents three advantages over traditional augmentation through the addition of white noise:

i) the interference data is not artificially generated;
ii) no data outside the training set is needed for the audio augmentation; and
iii) by using audio samples from the same speaker, the interference introduced by the data augmentation accentuates the characteristics of that speaker's voice.
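A minimal sketch of this mixing step, assuming 1-D waveform tensors; this is an illustration of the idea, not the authors' exact implementation:

```python
import torch

def negative_sample_mix(waveform, interferer):
    """Mix an utterance with an interfering clip from the same speaker.

    Both arguments are 1-D audio tensors. The interferer is circularly
    padded or trimmed to the length of the original before the two are
    summed, yielding the augmented mixture."""
    n = waveform.shape[0]
    if interferer.shape[0] < n:
        reps = -(-n // interferer.shape[0])  # ceiling division
        interferer = interferer.repeat(reps)  # circular padding
    interferer = interferer[:n]               # trim to original length
    return waveform + interferer
```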
4.2. Assessment of the video realignment
Due to the lack of annotated ground-truth start and end time stamps for each utterance, a self-supervised form of assessing the robustness of the video realignment procedure was devised. A video correctly realigned to its corresponding utterance is expected to have most of its audio content comprised of a speech signal uttered by the speaker annotated in the corresponding dataset record. This would allow training a speaker identification model on the speech signals of the realigned videos of the train split so that it generalises and correctly identifies the speakers from the speech signals of the realigned videos of the remaining splits. However, such a model requires a given speaker to appear in a reasonable number of MELD records in all dataset splits, and only six speakers appear consistently throughout all MELD splits. These are the six main characters: Rachel, Monica, Phoebe, Joey, Chandler, and Ross. The remaining speakers appear rarely, making it highly unlikely that the speaker identification model could learn to generalise well from their speech.
With a 50% chance, an audio sample is selected to be augmented this way, meaning that within a batch, roughly half of the samples are augmented. Interfering audio samples selected this way are either circularly padded or trimmed to match the size of the original audio sample. A single batch typically contains audio samples of very different lengths. So that all audio samples in a batch have the same size, they are either circularly padded or trimmed to a length equal to the average length of the original audio samples. This guarantees that the model is trained with samples of a reasonable size, and that at least half of the samples in a batch consist of unpadded continuous audio.
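The batch length normalisation described above could look as follows; a sketch under the assumption that samples arrive as a list of 1-D tensors:

```python
import torch

def normalise_batch_lengths(samples):
    """Circularly pad or trim 1-D audio tensors to the batch's mean length,
    so that all samples in the batch can be stacked into one tensor."""
    target = int(sum(s.shape[0] for s in samples) / len(samples))
    out = []
    for s in samples:
        if s.shape[0] < target:
            reps = -(-target // s.shape[0])  # ceiling division
            s = s.repeat(reps)               # circular padding
        out.append(s[:target])               # trim to the mean length
    return torch.stack(out)
```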
4.2.3. Training procedure
To train the speaker identification model, audio tracks are randomly sampled such that there are roughly the same number of audio samples for each class (the six main characters). Audio samples are augmented according to the aforementioned procedure. The model is trained by minimising a cross-entropy loss function using an Adam optimiser with an initial learning rate of 1e-4, halved every ten epochs. Batches of size 64 are used. Training runs until there is a sequence of 30 epochs with no improvement in the weighted F1 score on the dev split.
4.2.1. Model
A speaker identification model is used to assess whether the speech audio in a given realigned video actually matches the speaker annotated in the corresponding MELD record. The model is composed of an encoder part followed by a classifier part. Based on TalkNet-ASD's ATE, a traditional ResNet34 is used as the encoder. This encoder produces an embedding F_a of dimensions T_a × 512, where T_a is the number of audio frames corresponding to the speech signal. Then, via temporal max pooling, a 512-dimensional feature vector is obtained for the whole speech
Fig. 7. Confusion matrices of the speaker identification model in MELD’s and MELD-FAIR’s test splits.
Fig. 8. ERC model.
4.3.2. Data augmentation
Audio samples are augmented through the same data augmentation procedure described in Section 4.2. Face crops are augmented by performing one of the following operations: a random horizontal flip, a random crop of an area with at least 70% of the dimensions of the original face crop, or a random rotation of up to 15 degrees clockwise or counterclockwise. Afterwards, the face crop is resized to 112 × 112 pixels. To keep the direction in which the speaker's head is facing consistent, the random parameters of the data augmentation procedure are applied to the sequence of faces as a whole, not to each face separately.
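One way to realise this sequence-consistent augmentation in plain PyTorch is sketched below. The rotation branch is omitted for brevity, and interpreting "70%" as a per-side scale factor is an assumption; the key point is that the random parameters are drawn once per sequence:

```python
import random
import torch
import torch.nn.functional as F

def augment_face_sequence(faces, out_size=112):
    """Apply ONE randomly drawn augmentation, with the SAME parameters,
    to every face crop in an utterance's sequence, then resize each crop
    to out_size x out_size. `faces` is a list of (C, H, W) tensors."""
    op = random.choice(["flip", "crop"])  # rotation branch is analogous
    if op == "crop":
        _, h, w = faces[0].shape
        scale = random.uniform(0.7, 1.0)          # keep >= 70% per side
        nh, nw = int(h * scale), int(w * scale)
        top = random.randint(0, h - nh)
        left = random.randint(0, w - nw)
    out = []
    for face in faces:
        if op == "flip":
            face = torch.flip(face, dims=[-1])    # horizontal flip
        else:
            face = face[:, top:top + nh, left:left + nw]
        face = F.interpolate(face.unsqueeze(0), size=(out_size, out_size),
                             mode="bilinear", align_corners=False)
        out.append(face.squeeze(0))
    return out
```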
4.2.4. Results and analysis
Fig. 7 presents the confusion matrices obtained when evaluating the speaker identification model on MELD's and MELD-FAIR's test splits. The comparison shows how well the speaker identification model generalises what it learned about each character's voice from the data of the original MELD dataset (Fig. 7a) and from that of its refined version, MELD-FAIR (Fig. 7b). The model trained on MELD-FAIR achieved a weighted F1 score of 78.32% on that dataset's test split, whereas the model trained on the original MELD achieved a weighted F1 score of 67.07% on the corresponding test split. The confusion matrices and the weighted F1 scores indicate that the video realignment leads to cuts that better match the expected speaker, which, in turn, makes it highly likely that the audio contents of those cuts closely match the corresponding utterances whose transcriptions are given in the dataset.
4.3.3. Training procedure
Since the distribution of emotion labels is similar in every split of MELD-FAIR, no weighted random sampling is performed in the training of the ER model. Instead, for every record in the train split, which represents a single utterance, a sequence of 15 consecutive face crops is selected as input for the video stream, and the complete utterance audio is provided as input for the audio stream. If the sequence of faces corresponding to the uttering speaker has fewer than 15 face crops, the sequence is circularly padded. If it has more than 15 face crops, a subsequence of 15 consecutive face crops is randomly selected. The model is trained by minimising a cross-entropy loss function using an Adam optimiser with an initial learning rate of 1e-4, halved every ten epochs. Batches of size 64 are used. Training runs until there is a sequence of 30 epochs with no improvement in the weighted F1 score on the dev split.
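The face-window selection described above amounts to circular padding for short sequences and a randomly placed window for long ones; a small sketch:

```python
import random

def select_face_window(face_indices, window=15):
    """Select the 15 consecutive face crops fed to the video stream.

    Sequences shorter than the window are circularly padded; longer
    ones contribute a randomly placed window of consecutive crops."""
    n = len(face_indices)
    if n < window:
        return [face_indices[i % n] for i in range(window)]  # circular pad
    start = random.randint(0, n - window)                    # random window
    return face_indices[start:start + window]
```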
4.3. Application in ERC
4.3.1. Model
We have devised an emotion recognition (ER) model to assess whether MELD-FAIR actually contains visual and acoustic information from which emotional characteristics can be retrieved. Fig. 8 presents the architecture of the ER model. For the encoding of the visual and acoustic inputs, TalkNet-ASD's VTE and ATE have been modified to produce vector representations with 512 dimensions. The VTE's sequence of 1D convolution layers has been removed, since its main purpose is to reduce the dimensionality of the feature vectors, and V-TCN already yields vector representations with 512 dimensions. For TalkNet-ASD's ATE to produce 512-dimensional feature vectors, its Thin ResNet34 backbone has been replaced with a traditional ResNet34. Also, we keep the face crops with their original colour channels for the task of emotion recognition, so that changes in skin colour due to emotional reactions, e.g., blushing, can be considered by the ER model. The embeddings output by the VTE and ATE are then max-pooled along the temporal dimension into feature vectors F_v and F_a, with 512 dimensions each. These vectors are concatenated and subsequently passed to a self-attention layer. Finally, a fully connected layer yields a prediction for the emotion of the uttering speaker given the output of the self-attention layer.
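The fusion stage of this architecture (pooling, concatenation, self-attention, classification) can be sketched as below. The encoders are omitted and the attention head count is an assumption; only the tensor shapes follow the description:

```python
import torch

class ERFusionHead(torch.nn.Module):
    """Sketch of the fusion stage: the VTE and ATE embeddings are
    max-pooled over time into F_v and F_a (512-d each), concatenated,
    passed through self-attention, and classified into one of MELD's
    seven emotions. The head count of 4 is an assumption."""

    def __init__(self, dim=512, num_emotions=7):
        super().__init__()
        self.attention = torch.nn.MultiheadAttention(embed_dim=2 * dim,
                                                     num_heads=4)
        self.classifier = torch.nn.Linear(2 * dim, num_emotions)

    def forward(self, vte_out, ate_out):
        # vte_out: (T_v, 512) visual embedding, ate_out: (T_a, 512) audio
        f_v, _ = vte_out.max(dim=0)                       # temporal max pool
        f_a, _ = ate_out.max(dim=0)
        fused = torch.cat([f_v, f_a]).reshape(1, 1, -1)   # (seq, batch, 1024)
        attended, _ = self.attention(fused, fused, fused)  # self-attention
        return self.classifier(attended.reshape(-1))       # emotion logits

logits = ERFusionHead()(torch.randn(15, 512), torch.randn(100, 512))
```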
4.3.4. Experimental results
Three variations of the ER model were implemented and trained from scratch. One incorporated inputs from both the acoustic and visual streams, while the other two were ablations, each containing only one of the input streams. Table 5 presents the weighted F1 score achieved by each variation, the number of training epochs each variation took to reach its best performance, the average training time per batch, and the number of batches used in each training epoch. The training times presented in Table 5 were measured on a single NVIDIA GeForce GTX 1080 Ti.
4.3.5. Comparison with the state of the art
To evaluate the benefits of our refinement procedure for the
task of ERC with MELD-FAIR, we compare the performance of our
ER model to existing approaches that use information from the
original MELD videos in ERC, and not only from the utterance transcriptions provided in the dataset.
DialogueRNN [28]5 is a baseline approach which models the context of a conversation by tracking the states of individual parties
Table 5
Comparison of ER model variations.

Modalities       Weighted F1 score (%)   Number of training epochs   Avg. training time (seconds per batch)   Number of batches per epoch
Vision           35.58                   15                          1.164                                    151
Audio            40.54                   18                          0.287                                    151
Audio + Vision   39.81                   19                          1.211                                    151
5 Although DialogueRNN was originally proposed in [28], its first application to ERC on the MELD dataset was in [31].
within that conversation. The model determines the emotion of a given utterance according to three aspects: its speaker, the context from preceding utterances, and the emotions of those utterances. DialogueRNN models these aspects using three gated recurrent units (GRUs) [7], each responsible for a particular aspect.
CT + EmbraceNet [41] is a pioneering ERC model in its use of visual information from the MELD videos. Although DialogueRNN predates it, DialogueRNN uses information solely from the acoustic and textual modalities. CT + EmbraceNet uses crossmodal transformers (CTs) [38] to enrich the information from one modality with information from another, and in this way learns correlations across pairs of modalities. EmbraceNet [8] was used to carefully handle the crossmodal information in the feature vectors produced by the crossmodal transformers, and to prevent performance degradation due to the partial absence of data.
EmoCaps [27] uses transformer-based encoders to extract emotion feature vectors from the visual, acoustic, and textual modalities. The authors also use BERT [11] to extract a text feature vector from every utterance. By concatenating an utterance's feature vector with the corresponding emotion vectors of each modality, the authors create a vector representation of that utterance. Then, through a Bi-LSTM [14,16] and a classification sub-network, EmoCaps predicts the emotion of every utterance in a dialogue.
MMGCN [21] uses a multimodal graph where each node represents a given modality in a particular utterance. Nodes of this graph are connected if they share either the same modality or the same utterance. Each MMGCN node is initialised with a concatenation of two elements: a context-aware feature encoding of the corresponding modality and utterance, and an embedding of the speaker of that particular utterance. MMGCN thereby injects speaker information into the graph construction. It encodes the multimodal contextual information through a multilayered deep spectral-domain graph convolutional network.
MM-DFN [19], similarly to MMGCN, uses a multimodal graph with the same structure to characterise the relations between all modalities within a given uttering event, and between the utterances within a dialogue. MM-DFN introduces graph-based dynamic fusion modules, stacked in layers, to fuse multimodal context features dynamically and sequentially. These modules aggregate both inter- and intra-modality contextual information in a specific semantic space at each layer. This differs from MMGCN, which aggregates contextual information in a single semantic space, leading to a gradual accumulation of redundant information. By modelling the contextual information in different semantic spaces, MM-DFN benefits from a reduction in the accumulation of redundant information, as well as from an enhancement of the complementarity between the modalities.
M2FNet [9] is the current state-of-the-art model for ERC on MELD.6 Its main characteristics are:
Table 6
Weighted F1 scores for ERC in MELD test split using visual and acoustic data.

Model             Vision   Audio   Audio + Vision
DialogueRNN       N/A      44.3    N/A
CT + EmbraceNet   31.4     32.1    N/A
EmoCaps           31.26    31.26   N/A
MMGCN             33.27    42.63   N/A
MM-DFN            32.34    42.72   44.67
M2FNet            32.44    39.63   35.74
Ours              35.58    40.54   39.81
ii) the use of one stack of transformer encoders for each modality, as a means to learn inter-utterance context on a modality level; and
iii) a multi-head attention fusion module to better incorporate
those modalities, especially the visual and acoustic ones.
It is worth noting that all multimodal approaches to ERC in MELD use context from the dialogue in some form. Since we are interested in extracting the most useful information from the visual and acoustic modalities, we rely solely on the utterance level. This way, we can guarantee that the performance achieved is a direct consequence of the video realignment and the utterance source localisation, and not of some other part of the dialogue.
Table 6 compares the performance of the ER model proposed here with those of ablated versions of all multimodal approaches to ERC in MELD. The values presented in the table were taken from the literature. Some table cells are empty because either one modality was not used (e.g., Poria et al. [31] do not use information from the visual modality in their implementation of DialogueRNN) or the authors did not consider the combination of the visual and acoustic modalities in their ablation studies (as in [21] and [27]). Table 6 shows that our ER model achieves a higher weighted F1 score than state-of-the-art approaches when restricted to the visual modality. It is worth noting that our ER model outperforms state-of-the-art approaches even though it does not use temporal visual context on a dialogue level. This indicates that the combination of video realignment and active speaker detection can indeed yield sequences of facial expressions which, in turn, provide the ER model with more information on the uttering speaker's emotion than the feature extraction procedures used in the other approaches.
The performance of our ER model when restricted to the acoustic modality is higher than that of M2FNet (the current state-of-the-art approach for ERC in MELD) and EmoCaps. Its performance, however, is lower than those of DialogueRNN, MMGCN, and MM-DFN. These models have in common the use of utterance-level feature vectors extracted with OpenSMILE [12,35] as input for the audio stream. EmoCaps also uses these; however, its multimodal representation favours the textual modality, since it uses both the utterance feature vector yielded by BERT and an emotion feature vector for the textual modality in its multimodal utterance representation, whereas a single emotion feature vector represents each of the remaining modalities. Also, EmoCaps's weighted F1 scores in both modalities correspond to those of a model that outputs neutral for every input. M2FNet, on the other hand, uses a novel feature extractor module based on the triplet loss [34] to fetch deep features from acoustic and visual contents.
i) a visual feature extractor that provides a visual representation based on the faces of the people in a scene as well as
on the scene as a whole;
5. Discussion and conclusion
6 Although M2FNet's performance values seem lower than those of other models in Table 6, this is because most of its contribution in ERC comes from the textual modality, which is not included in Table 6. We decided not to include the performance of those models with the text modality unablated because the main objective of this paper is to present a way of extracting useful information from the visual and acoustic modalities, since those are quite unreliable in MELD. In contrast, the text transcriptions are very reliable and do not require extensive refinement.
Connectionist-temporal-classification segmentation and active speaker detection allowed us to refine MELD, a widely used multimodal dataset for emotion recognition in multi-party conversational scenarios, making it possible to better align its audiovisual
Acknowledgement

The authors acknowledge partial support from the German Research Foundation DFG under project CML (TRR 169).

Appendix A. Problematic cases of MELD

MELD presents a variety of problematic cases beyond the misalignment between the videos and the utterance transcriptions. These comprise multiple other problems which raised errors during the data refinement process. Table 7 offers an extensive list of such cases, identified by the split, dialogue ID, and utterance ID of each case.

Table 7
List of existing problematic cases in MELD.

Split   Dia ID   Utt ID   Problem
train   125      3        Corrupted video file
dev     110      7        Non-existent video file
test    38       4        Very long video (> 45 s), incompatible with its utterance transcription
test    220      0        Very long video (> 45 s), incompatible with its utterance transcription
train   309      0        Utterance transcription contains not only the utterance but also a description within parentheses
train   404      15       Utterance transcription contains not only the utterance but also a description within parentheses
train   736      4        Utterance transcription contains not only the utterance but also a description within parentheses
train   832      3        Utterance transcription contains not only the utterance but also a description within parentheses
train   1018     2        Utterance transcription contains not only the utterance but also a description within parentheses
dev     108      0        Utterance transcription contains not only the utterance but also a description within parentheses
test    128      2        Utterance transcription contains not only the utterance but also a description within parentheses
train   65       3        Utterance transcription contains not only the utterance but also a description within brackets
train   761      1        Utterance transcription contains not only the utterance but also a description within brackets
test    86       3        Utterance transcription contains not only the utterance but also a description within brackets
train   739      14       No utterance, just a description within parentheses
train   849      3        No utterance, just a description within parentheses
train   111      N/A      Utterances not chronologically ordered
train   446      19       Should be the first utterance of dialogue 447
References
[1] T. Afouras, J.S. Chung, A. Senior, O. Vinyals, A. Zisserman, Deep audio-visual
speech recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2018), https://doi.org/10.1109/TPAMI.2018.2889052.
[2] J.L. Alcázar, F. Caba, L. Mai, F. Perazzi, J.-Y. Lee, P. Arbelaez, and B. Ghanem.
Active speakers in context. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pages 12465–12474, June
2020.
[3] J.L. Alcázar, F. Caba, A.K. Thabet, and B. Ghanem. MAAS: Multi-modal
assignation for active speaker detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision (ICCV), pages 265–274, Oct.
2021.
[4] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli. Wav2Vec 2.0: A framework for
self-supervised learning of speech representations. In H. Larochelle, M.
Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural
Information Processing Systems, volume 33, pages 12449–12460. Curran
Associates, Inc., 2020.
[5] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S.
Lee, S.S. Narayanan, IEMOCAP: interactive emotional dyadic motion capture
database, Language Resour. Evaluat. 42 (2008) 335–359, https://doi.org/10.1007/s10579-008-9076-6.
[6] H. Carneiro, C. Weber, and S. Wermter. FaVoA: Face-voice association favours
ambiguous speaker detection. In I. Farkaš, P. Masulli, S. Otte, and S. Wermter,
editors, Artificial Neural Networks and Machine Learning – ICANN 2021, pages
439–450, Cham, 2021. Springer International Publishing.
[7] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of
neural machine translation: Encoder-decoder approaches. In Proceedings of
SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical
Translation, pages 103–111, Doha, Qatar, Oct. 2014. Association for
Computational Linguistics. https://doi.org/10.3115/v1/W14-4012.
[8] J.-H. Choi, J.-S. Lee, EmbraceNet: A robust deep learning architecture for
multimodal classification, Inform. Fusion 51 (2019) 259–270, https://doi.org/10.1016/j.inffus.2019.02.010.
[9] V. Chudasama, P. Kar, A. Gudmalwar, N. Shah, P. Wasnik, and N. Onoe. M2FNet:
Multi-modal fusion network for emotion recognition in conversation. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops, pages 4652–4661, June 2022.
[10] J.S. Chung, J. Huh, and S. Mun. Delving into VoxCeleb: Environment invariant
speaker recognition. In K. Lee, T. Koshinaka, and K. Shinoda, editors, Odyssey
2020: The Speaker and Language Recognition Workshop, 1–5 November 2020,
Tokyo, Japan, pages 349–356. ISCA, 2020. https://doi.org/10.21437/Odyssey.2020-49.
[11] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In J. Burstein, C. Doran,
and T. Solorio, editors, Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7,
2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for
Computational Linguistics, 2019. https://doi.org/10.18653/v1/n19-1423.
[12] F. Eyben, M. Wöllmer, and B. Schuller. OpenSMILE: The Munich versatile and
fast open-source audio feature extractor. In Proceedings of the 18th ACM
International Conference on Multimedia, MM ’10, page 1459–1462, New York,
NY, USA, 2010. Association for Computing Machinery. https://doi.org/10.1145/1873951.1874246.
[13] D. Ghosal, N. Majumder, A. Gelbukh, R. Mihalcea, and S. Poria. COSMIC:
COmmonSense knowledge for eMotion Identification in Conversations. In
Findings of the Association for Computational Linguistics: EMNLP 2020, pages
2470–2481, Online, Nov. 2020. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.findings-emnlp.224.
[14] A. Graves, J. Schmidhuber, Framewise phoneme classification with
bidirectional LSTM and other neural network architectures, Neural Networks
18 (5) (2005) 602–610, https://doi.org/10.1016/j.neunet.2005.06.042.
[15] J. Guo, J. Deng, A. Lattas, and S. Zafeiriou. Sample and computation
redistribution for efficient face detection. In International Conference on
Learning Representations, pages 1–17, 2022. https://openreview.net/forum?
id=RhB1AdoFfGE.
data with the corresponding utterance transcriptions, as well as to obtain reliable face crops of the uttering speaker in nearly every scene. The comparison with state-of-the-art approaches also indicates that those face crops provide more precise information on the emotion of the uttering speaker than the most recent approaches. The reliable extraction of the speakers' face crops from well-realigned videos accounts for the high performance of the vision-only version of our emotion recognition model, which outperforms competing approaches by more than 2.3%. The relatively simple architecture of our emotion recognition model, as well as its restriction to working on the utterance level, i.e., without contextual information from the whole dialogue, indicates that much of its high performance is due to the improvement in the information from MELD's visual modality.
Furthermore, researchers on emotion recognition in multi-party conversational scenarios can benefit from MELD-FAIR, the refined version of MELD delivered with this publication. More generally, with recent advancements in deep learning, creating such a dataset automatically comes within sight: automatic speech recognition enables automatic text transcription, while automatic lip reading, which requires active speaker detection, could verify its correctness, and vice versa.
CRediT authorship contribution statement
Hugo Carneiro: Writing - original draft, Conceptualization, Methodology, Software, Data curation. Cornelius Weber: Writing - review & editing, Conceptualization. Stefan Wermter: Writing - review & editing, Supervision.
Data availability
The data produced in this research can be accessed in a GitHub
repository whose URL is provided in the manuscript’s abstract.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared
to influence the work reported in this paper.
[16] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8)
(Nov. 1997) 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735.
[17] E. Hornecker, A. Krummheuer, A. Bischof, M. Rehm, Beyond dyadic HRI:
Building robots for society, Interactions 29 (3) (May 2022) 48–53, https://doi.org/10.1145/3526119.
[18] C.-C. Hsu, S.-Y. Chen, C.-C. Kuo, T.-H. Huang, and L.-W. Ku. EmotionLines: An
emotion corpus of multi-party conversations. In Proceedings of the Eleventh
International Conference on Language Resources and Evaluation (LREC 2018),
pages 1597–1601, Miyazaki, Japan, May 2018. European Language Resources
Association (ELRA).
[19] D. Hu, X. Hou, L. Wei, L. Jiang, and Y. Mo. MM-DFN: Multimodal dynamic
fusion network for emotion recognition in conversations. In ICASSP 2022–
2022 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 7037–7041, 2022. https://doi.org/10.1109/ICASSP43922.2022.9747397.
[20] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 7132–7141,
2018. https://doi.org/10.1109/CVPR.2018.00745.
[21] J. Hu, Y. Liu, J. Zhao, and Q. Jin. MMGCN: Multimodal fusion via deep graph
convolution network for emotion recognition in conversation. In Proceedings
of the 59th Annual Meeting of the Association for Computational Linguistics
and the 11th International Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 5666–5675, Online, Aug. 2021. Association for
Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.440.
[22] A.L. Krummheuer, M. Rehm, and K. Rodil. Triadic human-robot interaction.
Distributed agency and memory in robot assisted interactions. In Companion
of the 2020 ACM/IEEE International Conference on Human-Robot Interaction,
HRI ’20, page 317–319, New York, NY, USA, 2020. Association for Computing
Machinery. https://doi.org/10.1145/3371382.3378269.
[23] L. Kürzinger, D. Winkelbauer, L. Li, T. Watzel, G. Rigoll, CTC-segmentation of
large corpora for German end-to-end speech recognition, in: A. Karpov, R.
Potapova (Eds.), Speech and Computer, Springer International Publishing,
Cham, 2020, pp. 267–278, https://doi.org/10.1007/978-3-030-60276-5_27.
[24] B. Lee and Y.S. Choi. Graph based network with contextualized representations
of turns in dialogue. In Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, pages 443–455, Online and Punta
Cana, Dominican Republic, Nov. 2021. Association for Computational
Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.36.
[25] J. Lee and W. Lee. CoMPM: Context modeling with speaker’s pre-trained
memory tracking for emotion recognition in conversation. In M. Carpuat, M. de
Marneffe, and I.V.M. Ruíz, editors, Proceedings of the 2022 Conference of the
North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL 2022, Seattle, WA, United States, July
10–15, 2022, pages 5669–5679. Association for Computational Linguistics,
2022. https://doi.org/10.18653/v1/2022.naacl-main.416.
[26] Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu. DailyDialog: A manually labelled
multi-turn dialogue dataset. In Proceedings of the Eighth International Joint
Conference on Natural Language Processing (Volume 1: Long Papers), pages
986–995, Taipei, Taiwan, Nov. 2017. Asian Federation of Natural Language
Processing.
[27] Z. Li, F. Tang, M. Zhao, Y. Zhu, EmoCaps: Emotion capsule based model for
conversational emotion recognition, in: Findings of the Association for
Computational Linguistics: ACL 2022, Dublin, Ireland, May 2022, pp. 1610–1618, Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-acl.126.
[28] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria.
DialogueRNN: An attentive RNN for emotion detection in conversations.
Proceedings of the AAAI Conference on Artificial Intelligence, 33 (01): 6818–
6825, Jul 2019. https://doi.org/10.1609/aaai.v33i01.33016818.
[29] G. McKeown, M. Valstar, R. Cowie, M. Pantic, M. Schroder, The SEMAINE
database: Annotated multimodal records of emotionally colored conversations
between a person and a limited agent, IEEE Trans. Affect. Comput. 3 (1) (2012)
5–17, https://doi.org/10.1109/T-AFFC.2011.20.
[30] K. Min, S. Roy, S. Tripathi, T. Guha, S. Majumdar, Learning long-term spatial-temporal graphs for active speaker detection, in: S. Avidan, G. Brostow, M. Cissé, G.M. Farinella, T. Hassner (Eds.), Computer Vision – ECCV 2022, Springer Nature Switzerland, Cham, 2022, pp. 371–387, https://doi.org/10.1007/978-3-031-19833-5_22.
[31] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, R. Mihalcea, MELD: A multimodal multi-party dataset for emotion recognition in conversations, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 527–536, https://doi.org/10.18653/v1/P19-1050.
[32] J. Roth, S. Chaudhuri, O. Klejch, R. Marvin, A. Gallagher, L. Kaver, S. Ramaswamy, A. Stopczynski, C. Schmid, Z. Xi, C. Pantofaru, AVA active speaker: An audio-visual dataset for active speaker detection, in: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 4492–4496, https://doi.org/10.1109/ICASSP40776.2020.9053900.
[33] P. Saxena, Y.J. Huang, S. Kurohashi, Static and dynamic speaker modeling based on graph neural network for emotion recognition in conversation, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, Association for Computational Linguistics, Hybrid: Seattle, Washington + Online, 2022, pp. 247–253, https://doi.org/10.18653/v1/2022.naacl-srw.31.
[34] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823, https://doi.org/10.1109/CVPR.2015.7298682.
[35] B. Schuller, A. Batliner, S. Steidl, D. Seppi, Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge, Speech Commun. 53 (9) (2011) 1062–1087, https://doi.org/10.1016/j.specom.2011.01.011.
[36] X. Song, L. Zang, R. Zhang, S. Hu, L. Huang, EmotionFlow: Capture the dialogue level emotion transitions, in: ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8542–8546, https://doi.org/10.1109/ICASSP43922.2022.9746464.
[37] R. Tao, Z. Pan, R.K. Das, X. Qian, M.Z. Shou, H. Li, Is someone speaking? Exploring long-term temporal features for audio-visual active speaker detection, in: Proceedings of the 29th ACM International Conference on Multimedia, MM '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 3927–3935, https://doi.org/10.1145/3474085.3475587.
[38] Y.-H.H. Tsai, S. Bai, P.P. Liang, J.Z. Kolter, L.-P. Morency, R. Salakhutdinov, Multimodal transformer for unaligned multimodal language sequences, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 6558–6569, https://doi.org/10.18653/v1/P19-1656.
[39] D. Utami, T. Bickmore, Collaborative user responses in multiparty interaction with a couples counselor robot, in: 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2019, pp. 294–303, https://doi.org/10.1109/HRI.2019.8673177.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 6000–6010, ISBN 9781510860964.
[41] B. Xie, M. Sidulova, C.H. Park, Robust multimodal emotion recognition from conversation with transformer-based crossmodality fusion, Sensors 21 (14) (2021), https://doi.org/10.3390/s21144913.
[42] S. Zahiri, J.D. Choi, Emotion detection on TV show transcripts with sequence-based convolutional neural networks, in: Proceedings of the AAAI Workshop on Affective Content Analysis, AFFCON'18, New Orleans, LA, 2018, pp. 44–51.
[43] Y. Zhang, S. Liang, S. Yang, X. Liu, Z. Wu, S. Shan, X. Chen, UniCon: Unified context network for robust active speaker detection, in: Proceedings of the 29th ACM International Conference on Multimedia, MM '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 3964–3972, https://doi.org/10.1145/3474085.3475275.
[44] L. Zhu, G. Pergola, L. Gui, D. Zhou, Y. He, Topic-driven and knowledge-aware transformer for dialogue emotion detection, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 1571–1582, https://doi.org/10.18653/v1/2021.acl-long.125.
Hugo Carneiro is a postdoctoral research associate at the Knowledge Technology Institute, University of Hamburg. He received his MSc and DSc in systems engineering and computer science from COPPE, Federal University of Rio de Janeiro. His research interests are in crossmodal learning, affective computing, computational linguistics, language representation, knowledge representation and explainable AI.
Cornelius Weber graduated in physics at Universität Bielefeld, Germany, and received his PhD in computer science at Technische Universität Berlin. He subsequently held positions as a Postdoctoral Fellow in Brain and Cognitive Sciences at the University of Rochester, USA; Research Scientist in Hybrid Intelligent Systems at the University of Sunderland, UK; and Junior Fellow at the Frankfurt Institute for Advanced Studies, Germany. Currently he is Lab Manager at the Knowledge Technology Institute, Universität Hamburg. His interests are in computational neuroscience, development of visual feature detectors, neural models of representations and transformations, reinforcement learning and robot control, grounded language learning, human-robot interaction and related applications in social assistive robotics.
Stefan Wermter is Full Professor at the University of
Hamburg, Germany, and Director of the Knowledge
Technology Institute. Currently, he is a co-coordinator
of the International Collaborative Research Centre on
Crossmodal Learning (TRR-169) and coordinator of the
Doctoral Network on Transparent Interpretable Robots
(TRAIL). His main research interests are in the fields of
neural networks, hybrid knowledge technology, cognitive robotics and human-robot interaction. He is an
Associate Editor of Connection Science and International
Journal for Hybrid Intelligent Systems. He is on the
Editorial Board of the journals Cognitive Systems
Research, Cognitive Computation and Journal of Computational Intelligence. He is
serving as President of the European Neural Network Society.