Neurocomputing 545 (2023) 126271

Whose emotion matters? Speaking activity localisation without prior knowledge

Hugo Carneiro ⇑, Cornelius Weber, Stefan Wermter
University of Hamburg, Department of Informatics, Vogt-Koelln-Str. 30, Hamburg 22527, Germany
⇑ Corresponding author. E-mail addresses: hugo.carneiro@uni-hamburg.de (H. Carneiro), cornelius.weber@uni-hamburg.de (C. Weber), stefan.wermter@uni-hamburg.de (S. Wermter).

Article history: Received 2 November 2022; Revised 10 March 2023; Accepted 22 April 2023; Available online 3 May 2023.

Keywords: Multimodality; Active speaker detection; Emotion recognition; Forced alignment

Abstract: The task of emotion recognition in conversations (ERC) benefits from the availability of multiple modalities, as provided, for example, in the video-based Multimodal EmotionLines Dataset (MELD). However, only a few research approaches use both acoustic and visual information from the MELD videos. There are two reasons for this: first, label-to-video alignments in MELD are noisy, making those videos an unreliable source of emotional speech data; second, conversations can involve several people in the same scene, which requires the localisation of the utterance source. In this paper, we introduce MELD with Fixed Audiovisual Information via Realignment (MELD-FAIR). By using recent active speaker detection and automatic speech recognition models, we are able to realign the videos of MELD and capture the facial expressions from speakers in 96.92% of the utterances provided in MELD. Experiments with a self-supervised voice recognition model indicate that the realigned MELD-FAIR videos more closely match the transcribed utterances given in the MELD dataset. Finally, we devise a model for emotion recognition in conversations trained on the realigned MELD-FAIR videos, which outperforms state-of-the-art models for ERC based on vision alone. This indicates that localising the source of speaking activities is indeed effective for extracting facial expressions from the uttering speakers and that faces provide more informative visual cues than the visual features state-of-the-art models have been using so far. The MELD-FAIR realignment data, and the code of the realignment procedure and of the emotion recognition, are available at https://github.com/knowledgetechnologyuhh/MELD-FAIR.

https://doi.org/10.1016/j.neucom.2023.126271
© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Emotion recognition in conversations (ERC) is a task that involves recognising the emotion of interlocutors in a dialogue. Challenges of this task include the modelling of the conversational context and of how the emotion of the interlocutors may change depending on that context, which is called emotion shift [31]. ERC can prove helpful in real-world scenarios in which people are talking with each other, for example, in human-robot interaction applications [17,22,39]. However, most ERC datasets are exclusively based on text transcriptions of conversations [18,26,42] or are restricted to dyadic interactions in very controlled environments [5,29]. Poria et al. [31] published the first large-scale multimodal ERC dataset with several interlocutors, the Multimodal EmotionLines Dataset (MELD). The dataset consists of videos extracted from the Friends TV series. Each video is cut to match a single utterance, and the videos are organised into dialogues and utterances, with each dialogue having one or more utterances.
Together with the acoustic and visual information provided by the videos, the text transcription of every utterance and the speaker label are also provided. Many approaches have been proposed to tackle the task of ERC in MELD. Even though MELD was created to be a multimodal dataset, most of the approaches rely exclusively on textual information [13,24,25,33,36,44]. Using the visual modality is difficult due to frequent misalignments between video cuts and the expected corresponding utterances (see Fig. 1 for an example). This is likely a consequence of an automatic generation of the video cuts with the Gentle transcription alignment tool (https://lowerquality.com/gentle/). For some years, there has been a demand for more reliable information from the visual modality, given the frequent problems of video-text synchronisation (see, e.g., https://github.com/declare-lab/MELD/issues/9).

Video cuts and utterance transcriptions can be misaligned in a variety of ways. Fig. 1a presents two cases of misalignment. In case I, the utterance appears within the first half of the video cut and another person's utterance is falsely assigned to the same cut. In case II, the utterance starts being spoken in the video cut assigned to the preceding utterance and continues through the first half of the video cut assigned to that target utterance. Fig. 1b depicts the corrected alignment between the video cuts and their corresponding utterance transcriptions. This is a result of our dataset refinement procedure (cf. Section 3).

Fig. 1. Example of misaligned video cuts provided in the MELD dataset, and their corresponding correction. The different colours in the utterances represent the different speakers in the video cuts.

Facial expressions and speech signals provide relevant information regarding the emotion of a person.
However, the noticeable number of mismatching cases between video cuts and the corresponding utterance transcriptions hindered the use of those modalities for some years, with information from the visual modality being disregarded even by the dataset creators, who stated that video-based speaker localisation was still an open problem [31]. Accordingly, in the dataset itself, no information on the location of the face of the uttering speakers is offered.

Speech data from the videos of MELD has been used for ERC since the work of Poria et al. [31], even though quite rarely. However, without the proper alignment correction of the videos, audio samples used for this task can include speech from other speakers with different emotions. In contrast, only quite recently has there been some interest in the use of visual information from the MELD videos [9,19,21,27,41]. However, alongside the problems that arise with the lack of proper realignments, the proposed solutions do not take into account the necessity to localise the source of the speaking activity in a particular scene or frame, which, in turn, is needed to extract the emotional facial expressions of the uttering speaker. The added information from acoustic and visual modalities has improved ERC compared to models that use information obtained exclusively from utterance transcriptions. However, those improvements are limited because of the unreliability of those modalities.

Recent advances in active speaker detection (ASD) in the wild [2,3,6,30,37,43] indicate the capability of audiovisual neural models to localise sources of speaking activity in videos given the faces of the people as well as the audio of a scene. Localising the active speaker can enable more reliable emotion recognition from video in MELD. State-of-the-art ASD models can be very precise in determining who among multiple people is speaking, especially if there are at least a few seconds of continuous speaking activity. Multiparty scenarios can still present challenges in accurately localising the source of some particular speaking activity. These challenges include: i) the partial occlusion of the speaker's face by objects or other people; ii) the presence of other people in the same scene moving their mouths, even though they are not actively speaking; iii) interfering noise, such as background chatting or, in the case of TV sitcoms, laugh tracks; and iv) the active speaker not being in the main focus of the scene, with the speaker's face appearing at a considerably smaller size and resolution than those of other, non-speaking people.

Most in-the-wild ASD models were trained on AVA-ActiveSpeaker, a dataset containing videos in a large variety of resolutions [32]. The videos of AVA-ActiveSpeaker contain scenes with multiple people speaking with each other, which is similar to the conversational scenes in the videos of MELD. Fig. 2 displays examples of conversational scenes present in the videos of the AVA-ActiveSpeaker dataset.

Fig. 2. Examples of conversational scenes from AVA-ActiveSpeaker videos. Green boxes identify those who are speaking, whereas red boxes mark silent people.

The first contribution of this paper is a new method to extract the position of the faces of active speakers, which can be useful for tasks in which facial information may provide additional relevant information but the face position is, for some reason, not given in the dataset. The procedure can be used in any dataset with humans speaking that lacks annotation concerning the visual modality, e.g., the position of the speaker's face. The second contribution of this paper is the evaluation of this procedure on the MELD dataset, and the consequent development of a refined version of MELD, named MELD with Fixed Audiovisual Information via Realignment (MELD-FAIR). Finally, to assess the applicability of the extraction of the faces of active speakers for the task of ERC, we propose an emotion recognition model whose outstanding performance on the visual data indicates that the faces extracted from the active speakers indeed provide an informative visual cue for the task of ERC.

The paper is structured as follows. Section 2 offers a brief overview and some specific details on the MELD dataset. Section 3 describes the procedure of dataset refinement, which consists of correcting the alignment between video cuts and the corresponding utterances, and determining the position of the face of the uttering speaker in each frame of the newly produced video cut. Section 4 provides a quantitative analysis of the resulting dataset, comparing it with the characteristics of the original dataset provided in Section 2. In that same section, experiments are also provided as a means to evaluate how well the resulting dataset applies to the task of emotion recognition. Section 5 discusses the results.

2. The MELD dataset

MELD contains scenes from various episodes of the Friends TV series. Those scenes are denoted as dialogues, and each dialogue is organised as a sequence of utterances. For every utterance, there is a corresponding dataset entry containing the speaker's identity, emotion and sentiment. The annotated emotion can be either one of Ekman's universal emotions (joy, sadness, fear, anger, disgust, and surprise), or neutral if no particular emotion was noticed by the dataset annotators. MELD is split into three sets, denoted train, dev, and test.

Each data record in those splits contains the following information: the utterance, its speaker, the emotion perceived in that utterance, the corresponding sentiment, a dialogue identifier, an utterance identifier, the season and episode of Friends in which that scene happened, a time stamp determining where that scene starts, as well as one determining where that scene ends. For every split, a dataset record can be uniquely identified by its dialogue identifier and its utterance identifier.

Table 1 presents an excerpt of a conversation, containing a sequence of contiguous data records and corresponding labels for the uttering speaker, his or her emotion, and the corresponding sentiment. The misalignment of the video cuts can produce overlaps, which are indicated by the start and end time stamps. Table 2 indicates that two videos of consecutive utterances present an overlap due to a wrongly executed alignment process.

Table 1. Excerpt of a dyadic conversation from the MELD train split with corresponding speaker, emotion, and sentiment information.

Dia | Utt | Utterance | Speaker | Emotion | Sentiment
D0 | U5 | Now you'll be heading a whole division, so you'll have a lot of duties. | Interviewer | neutral | neutral
D0 | U6 | I see. | Chandler | neutral | neutral
D0 | U7 | But there'll be perhaps 30 people under you, so you can dump a certain amount on them. | Interviewer | neutral | neutral
D0 | U8 | Good to know. | Chandler | neutral | neutral
D0 | U9 | We can go into detail. | Interviewer | neutral | neutral
D0 | U10 | No, don't. I beg of you! | Chandler | fear | negative

Table 2. Additional information in MELD about the utterances presented in Table 1. The time stamps marked with * overlap: the video cut of U7 starts before the video cut of U6 ends, due to a mistaken determination of the start and end times of an utterance.

Dia | Utt | Season | Episode | Start time | End time
D0 | U5 | S8 | E21 | 0:16:41.126 | 0:16:44.337
D0 | U6 | S8 | E21 | 0:16:48.800 | 0:16:51.886*
D0 | U7 | S8 | E21 | 0:16:48.800* | 0:16:54.514
D0 | U8 | S8 | E21 | 0:16:59.477 | 0:17:00.478
D0 | U9 | S8 | E21 | 0:17:00.478 | 0:17:02.719
D0 | U10 | S8 | E21 | 0:17:02.856 | 0:17:04.858

3. Dataset refinement procedure

The extraction of emotional speech and emotional facial expressions depends on having audio samples that match closely enough the utterance being said and on being capable of localising the uttering speaker in a scene, particularly that person's face. To meet both requirements, the dataset refinement procedure is divided into two parts, with each part addressing one requirement. First, the videos of MELD are realigned, such that their audios match closely enough the target utterance, as indicated in the flow chart in Fig. 3a. Next, with the videos properly realigned, the faces of the people in the scene are extracted and organised into sequences. Then, given the extracted sequences of faces and the scene audio, an ASD model determines which of these sequences corresponds to the uttering speaker (see the flow chart in Fig. 3b). The resulting set, including the realigned audio from all videos and the sequence of facial expressions of the person vocalising in each of those videos, constitutes a refined version of MELD called MELD-FAIR.

Fig. 3. Major steps of the dataset refinement procedure. Orange arrows indicate the application of a model, and blue arrows represent information flow.

3.1. Video realignment

Each video in the MELD dataset corresponds to a particular utterance, which, in turn, belongs to a sequence of utterances, also called a dialogue. Videos that are misaligned to their corresponding utterances are a consequence of the mistaken determination of where the boundaries of those particular utterances lie within their respective dialogues. A considerable number of misaligned videos may prevent the proper identification of the source of speaking activity, especially because sometimes the speaking activity might happen partially in the target video and partially in the one that precedes or in the one that follows it in the dialogue (see the example depicted in Fig. 1).

The realignment of the videos takes into account that videos that belong to the same dialogue are organised sequentially. First, the audio signal of every dataset video is extracted. Next, for every split r ∈ {train, dev, test} and dialogue d, the audio signals a_{r,d,u} corresponding to each utterance u belonging to dialogue d are concatenated in order. Existing overlaps, such as the one indicated in Table 2, are removed by truncating the audio signals that lead to those overlaps. Silence blocks are added between consecutive video cuts if there is a time difference between the end time stamp of a video and the start time stamp of the following one. The length of a silence block is equal to the corresponding time difference, but long silence blocks are capped at 250 ms. Due to a few videos whose length is much longer than their corresponding utterances, video lengths are also capped at 45 s. This affects two of altogether 13,708 videos (see Appendix A for an indication of the videos affected by the capping at 45 s).

Fig. 4 presents a graphical representation of the concatenation of audio signals. Each box labelled U5 to U10 represents the audio signal of an utterance. The lengths of those boxes are proportional to the durations of the utterances presented in Tables 1 and 2, which can be inferred from their start and end time stamps. The label used in each box corresponds to the utterance identifier given in Table 2. The gaps between the boxes are proportional to the distance between the end of an utterance and the beginning of the following one. The figure presents an example of an overlap being removed by altering the start time of utterance U7. It also shows the insertion of silence blocks where the gaps between utterances lie, and the subsequent capping of silence block lengths to 250 ms.

Fig. 4. Schematic representation of the concatenation of the audio signals of a dialogue. First, the audios of all utterances of a given dialogue are concatenated, with silence blocks inserted wherever it is adequate. Next, the lengths of the silence blocks are reduced to a minimum length that still allows for the identification of individual blocks of consecutive utterances (e.g., utterances U6 and U7, and U8 and U9).
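The concatenation just described can be summarised in a short sketch. The fragment below only illustrates the logic of this step (truncating the head of an overlapping cut, inserting silence blocks capped at 250 ms, and capping utterance audio at 45 s); the function and variable names, the assumed 16 kHz sampling rate, and the NumPy waveform representation are ours and not taken from the MELD-FAIR code base.

```python
import numpy as np

SILENCE_CAP_S = 0.25   # silence blocks are capped at 250 ms
MAX_UTT_LEN_S = 45.0   # utterance audio is capped at 45 s
SR = 16_000            # assumed sampling rate

def concatenate_dialogue_audio(utterances):
    """Concatenate the per-utterance audio of one dialogue.

    `utterances` is an ordered list of dicts with keys 'start' and 'end'
    (seconds, as in Table 2) and 'audio' (1-D NumPy waveform).  Overlaps
    are removed by truncating the head of the overlapping cut, and gaps
    are bridged with silence blocks of at most 250 ms.
    """
    pieces = []
    prev_end = None
    for utt in utterances:
        audio = utt['audio'][: int(MAX_UTT_LEN_S * SR)]   # cap overly long cuts
        if prev_end is not None:
            gap = utt['start'] - prev_end
            if gap < 0:
                # the cut starts before the previous one ends: drop the overlap
                audio = audio[int(-gap * SR):]
            elif gap > 0:
                # insert a silence block no longer than SILENCE_CAP_S
                pieces.append(np.zeros(int(min(gap, SILENCE_CAP_S) * SR),
                                       dtype=audio.dtype))
        pieces.append(audio)
        prev_end = utt['end']
    return np.concatenate(pieces)
```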
The utterance transcriptions are concatenated as well. Prior to their concatenation, all punctuation marks in each transcription are removed, and a start-of-sequence and an end-of-sequence token are appended to the respective ends of every utterance transcription within a dialogue. With the audio signals and the transcriptions properly concatenated, the text of the concatenated transcription is aligned to the concatenated audio through forced alignment using connectionist-temporal-classification (CTC) segmentation [23]. Given a speech audio signal, CTC segmentation uses frame-based character posterior probabilities generated by a CTC-based end-to-end network. From these character-level probabilities, maximum joint probabilities are computed via dynamic programming. These maximum joint probabilities indicate how likely it is that a given excerpt from the dialogue transcription aligns with a particular slice of the speech audio signal. After the maximum joint probability for the alignment of the complete dialogue transcription to the whole speech audio signal is computed, the character-wise alignment is obtained by backtracking from the most probable temporal position of the last character in the transcription. The CTC-based end-to-end network used to generate the character-level probabilities had to be pretrained on already aligned data, for which the Wav2Vec2 [4] automatic speech recognition transformer model [40] was used (more specifically, the Wav2Vec2 Large (LV-60) model pretrained and fine-tuned on 960 h of speech audio from Libri-Light and Librispeech; see the list of pretrained models at https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec).

The video realignment procedure is executed for each dialogue in the dataset. Most of the processing time is dedicated to the generation of frame-based character posterior probabilities by the CTC-based end-to-end network and the subsequent computation of maximum joint probabilities. The former is run on graphical processing units (GPUs) with high parallelisation capabilities. The latter involves a dynamic programming algorithm whose processing time is proportional to the number of audio frames of the whole dialogue and to the square of the number of characters of the concatenated utterance transcriptions.
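To make the dynamic-programming step more tangible, the toy sketch below performs a Viterbi-style CTC forced alignment over frame-wise character log-posteriors and backtracks to recover a frame index for every character. It is a strong simplification of the CTC-segmentation algorithm of Kürzinger et al. [23], which additionally tolerates partial alignments and derives per-utterance confidence scores; all names here are illustrative.

```python
import numpy as np

def ctc_forced_align(log_probs, targets, blank=0):
    """Toy Viterbi-style CTC forced alignment.

    log_probs: (T, V) array of frame-wise log posteriors from a CTC model.
    targets:   list of character indices of the transcription.
    Returns the frame at which each character starts on the best path
    (assumes T is large enough to emit every character).
    """
    T = log_probs.shape[0]
    ext = [blank]
    for c in targets:                    # interleave characters with blanks
        ext += [c, blank]
    S = len(ext)
    dp = np.full((T, S), -np.inf)
    step = np.zeros((T, S), dtype=int)   # how far the best path advanced
    dp[0, 0] = log_probs[0, blank]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [dp[t - 1, s]]                       # stay in the same state
            if s >= 1:
                cands.append(dp[t - 1, s - 1])           # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(dp[t - 1, s - 2])           # skip a blank
            best = int(np.argmax(cands))
            dp[t, s] = cands[best] + log_probs[t, ext[s]]
            step[t, s] = best
    # backtrack from the most probable final state (last char or trailing blank)
    s = S - 1 if dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    entry = {}
    for t in range(T - 1, -1, -1):
        entry[s] = t                     # ends up holding the earliest frame of s
        if t > 0:
            s -= step[t, s]
    return [entry[i] for i in range(1, S, 2)]   # characters sit at odd positions
```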
3.2. Uttering speaker localisation

With videos that very likely contain the part of a scene in which a given utterance is said, it is possible to localise the source of the speaking activity, i.e., the person who spoke the utterance. Fig. 3b schematically represents the process of extracting the speech audio as well as face images of the uttering speaker from a video. As a first step, an efficient face detection model with sample and computation redistribution (SCRFD-10GF) [15] is used to detect all faces in every frame of those videos. Faces detected this way are subsequently extracted and organised into ordered groups, creating several sequences of faces. Each face is identified by the video frame from which it is extracted, and by an identifier of the sequence it belongs to.

For the organisation of the faces into sequences, faces detected in consecutive frames are considered to belong to the same sequence if the intersection-over-union (IoU) ratio between their areas is greater than a given threshold h. In case there is more than one pair of faces extracted from consecutive frames that satisfies this condition, the face pair with the highest IoU ratio is considered as belonging to the same sequence.
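A minimal sketch of this greedy IoU-based linking is given below. It assumes face boxes in (x1, y1, x2, y2) pixel coordinates and a generic threshold value; the function names and the treatment of unmatched detections (they simply start new sequences) are our illustrative choices, not the MELD-FAIR implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) face boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / (union + 1e-9)

def build_face_tracks(detections_per_frame, threshold=0.5):
    """Link per-frame face detections into sequences (face tracks).

    `detections_per_frame` maps a frame index to a list of face boxes.
    A detection continues a track that matched in the previous frame if
    their IoU exceeds `threshold`; among several candidates, the pair
    with the highest IoU wins.  Returns a list of tracks, each a list of
    (frame_index, box) tuples.
    """
    tracks, active = [], []              # `active`: track indices from the previous frame
    for frame_idx in sorted(detections_per_frame):
        boxes = detections_per_frame[frame_idx]
        pairs = sorted(((iou(tracks[t][-1][1], b), t, i)
                        for t in active for i, b in enumerate(boxes)), reverse=True)
        used_tracks, used_boxes, next_active = set(), set(), []
        for score, t, i in pairs:
            if score <= threshold or t in used_tracks or i in used_boxes:
                continue
            tracks[t].append((frame_idx, boxes[i]))
            used_tracks.add(t)
            used_boxes.add(i)
            next_active.append(t)
        for i, b in enumerate(boxes):    # unmatched faces start new sequences
            if i not in used_boxes:
                tracks.append([(frame_idx, b)])
                next_active.append(len(tracks) - 1)
        active = next_active
    return tracks
```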
Each face sequence and the corresponding slice of the speech signal is then sent to TalkNet-ASD [37], an audiovisual ASD model, to determine whether that face sequence presents some indication of speaking activity that resembles that slice of the speech signal. Fig. 5 shows a sketch of TalkNet-ASD's architecture.

Fig. 5. TalkNet-ASD architecture.

TalkNet-ASD uses a visual temporal encoder (VTE) to learn long-term representations of facial expression dynamics, and an audio temporal encoder (ATE) to learn audio content representations from the temporal dynamics [37]. VTE consists of a front end, where video frame streams are encoded into sequences of frame-based embeddings, and a visual temporal network, whose aim is to represent the temporal content in a long-term spatiotemporal structure [37]. Its front end is based on the vision module introduced in [1], consisting of a 3D convolution layer with a filter width of 5 frames followed by a 2D 18-layer residual network. Given an input with dimensions T_v × C × W × H, where T_v is the number of frames, and C, W and H are the number of channels, width and height of each frame, the front end yields a tensor with dimensions T_v × W/32 × H/32 × 512, which is subsequently average-pooled in both its spatial dimensions, thus producing a feature vector with 512 dimensions for each input frame. Similarly to the visual model of Afouras et al. [1], TalkNet-ASD receives a sequence of greyscale images, which means that the number of channels C in each frame is 1. TalkNet-ASD's visual temporal network (V-TCN) consists of a 5-block residual network followed by a sequence of two 1D convolution layers. The residual blocks consist of a 1D depth-separable convolution layer followed by rectified linear units and batch normalisation layers. The residual network is responsible for obtaining a representation of the temporal content. The representation consists of a tensor with dimensions T_v × 512. The sequence of 1D convolution layers finally reduces the dimensionality of this tensor, yielding a visual embedding F_v of dimensions T_v × 128, i.e., 128 dimensions for every input frame.

The speech signal is first encoded as a sequence of overlapping audio frames, each one characterised by a 13-dimensional vector of Mel-frequency cepstral coefficients (MFCCs) based on a window size of 25 ms and a window step of 10 ms. This means that given a sequence of T_a audio frames, ATE receives as input a tensor with dimensions 1 × 13 × T_a. ATE consists of a 2D 34-layer residual network with squeeze-and-excitation (SE) modules [20]. The number of channels in each block of the ResNet34 network is also reduced to one quarter of the number in each block of the original ResNet with 34 layers, similarly to the Thin ResNet34 introduced by Chung et al. [10]. The output of the audio encoder is an audio embedding F_a of dimensions T_a/4 × 128.

The dimensions of F_a and F_v, the embeddings output by both encoders, match when the number of audio frames is equal to four times the number of visual frames (or face crops). The matching of their dimensions is a necessary feature for the subsequent attention mechanism. A direct implication of the number of audio frames being four times the number of video frames is that each video frame corresponds to roughly 40 ms of the video (or 25 fps), since the length of the window step between consecutive overlapping audio frames is 10 ms.

With the motivation of audiovisual synchronisation working as an informative cue for speaking activities, TalkNet-ASD contains a cross-attention subnetwork that receives F_a and F_v as inputs, and outputs an audio attention feature F_{a→v} and a video attention feature F_{v→a}. F_{a→v} is obtained through the application of F_v as the target sequence to generate the query Q_v in the attention layer and F_a as the source sequence to generate the key K_a and the value V_a. F_{v→a} is obtained through an analogous process. Next, F_{a→v} and F_{v→a} are concatenated into a single audiovisual attention feature vector F_{av}, which is sent to a self-attention subnetwork whose aim is to model audiovisual utterance-level information, and this way distinguish between speaking and non-speaking frames. Both cross-attention and self-attention subnetworks contain one transformer layer with eight attention heads each [40].

Tao et al. [37] offer a practical implementation of TalkNet-ASD (https://github.com/TaoRuijie/TalkNet-ASD/blob/main/demoTalkNet.py), which we apply to the facial expression and emotional speech data extracted from the realigned MELD videos. In that implementation, each of the face tracks of a given person and the corresponding audio frame sequence are split into blocks and sent to TalkNet-ASD to determine in which frames that given person is actively speaking. Each of those blocks corresponds to a video sequence of up to φ video frames. Several values for φ are used in the implementation, namely 25, 50, 75, 100, 125, and 150, as a means to guarantee a more reliable result. A given value of φ implies that φ face images and 4φ audio frames in each block are used as input to the TalkNet-ASD model. TalkNet-ASD yields φ scores s_{i,j,φ} per block, indicating whether a given person p_j is detected as actively speaking in frame f_i in that block composed of φ video frames. After getting all scores for every frame, with all different possible values of φ, a resulting score s_{i,j} is obtained by averaging the scores s_{i,j,φ}. A score s_{i,j} > 0 indicates that person p_j is predicted as actively speaking in frame f_i.

Fig. 6 provides two examples of the application of TalkNet-ASD to the videos of MELD. In both examples, the uttering speakers are marked with green boxes around their faces.

Fig. 6. Examples of conversational scenes from MELD videos. Green boxes identify those who are speaking, whereas red boxes mark silent people.

After TalkNet-ASD has generated scores for each face track, the scores are grouped based on their respective tracks to determine which faces belong to the same person. However, if two face tracks have faces from the same frame and both tracks have detected speaking activity, this can result in a "false positive", where one of the tracks belongs to someone who is not actively speaking. These face tracks that provide conflicting information on the active speaker are called conflicting face tracks. To reduce false positives, the face tracks are grouped based on the camera cut where they appear. Each group contains a set of face tracks where each track has a conflicting track within the same set. A heuristic is used to eliminate conflicting face tracks according to three criteria: i) reduce the number of conflicting face tracks to zero; ii) maximise the total number of faces associated with speaking activity for all non-conflicting face tracks within a set; and iii) minimise the number of face tracks, provided the first two criteria are met. The last criterion is due to the low likelihood of a single person having their extracted face sequence appearing in several face tracks within the same set. After eliminating the conflicting tracks, the remaining non-conflicting tracks are grouped together and ordered based on their associated frame number. The procedure then outputs the resulting sequence of faces, which is associated with the active speaker.

The uttering speaker localisation procedure is executed for every realigned MELD video, each of which is assigned to one particular utterance in the dataset. Most of the processing time derives from the frame-wise face extraction and from the detection of speaking activity in every sequence of facial expressions previously extracted and organised.

4. Assessment of the MELD-FAIR dataset

To assess the applicability of MELD-FAIR in ERC, it is important to determine whether the distribution of its data after the dataset refinement procedure is kept similar to that of the original dataset. Two criteria can be used to evaluate whether the data distribution was kept similar to its original distribution. Specific steps of the dataset refinement depend on the target uttering speaker, thus it is desirable that the proportion of utterances in MELD-FAIR assigned to a given speaker remains close to its original proportion in MELD. Similarly, the proportion of utterances assigned to a given emotion should also be kept close to its original proportion, so as not to alter the task. Moreover, because MELD was built for emotion recognition in conversational contexts, it is worthwhile to determine the portion of dialogues in which the data of at least one utterance was removed during the dataset refinement process.

After assessing whether most of the original utterances are kept in MELD-FAIR, and whether its data distribution is nearly unaltered, it is worthwhile analysing whether the video realignment produces refined speech signals that actually correspond to the speakers provided by the dataset. The retention of many original utterances and the proper correspondence between the speech signals and the expected speakers are indications that the acoustic data is reliable and therefore useful for an application in ERC. Finally, to determine the reliability of the process of localising the uttering speaker, we propose using an emotion recognition model trained on MELD-FAIR and comparing its performance to existing ERC approaches trained on the original version of MELD that use information from visual and/or acoustic modalities.
A superior performance of our emotion recognition model would indicate that the emotional facial expressions extracted by the uttering speaker localisation procedure are indeed useful for emotion recognition applications. 4.1. Properties of the MELD-FAIR dataset The process of dataset refinement consists of two steps, video realignment and utterance source localisation. These refining steps may eventually lead to some utterances of the original MELD dataset not having corresponding audiovisual data in MELD-FAIR. This may happen due to two main reasons. First, the video realignment step may produce an empty video for a given utterance in case the CTC segmentation algorithm determines that in the most likely alignment, ui is aligned to a very small slice of the dialogue audio. Second, even when new video cuts are produced in the video realignment step, no uttering speaker may be located in the scene. Tables 3 and 4 present the number of dataset records for which there are corresponding audiovisual data in MELD-FAIR, alongside the number of dataset records in its original version. Table 3 presents the dataset record distribution according to the annotated emotion and dataset split, and Table 4 presents the dataset record distribution according to the utterance speaker and dataset split. Tables 3 and 4 show that the dev and test splits each lost approximately 2.5% of their records in the dataset refinement process. Regarding the train split, data loss due to the dataset refinement was also relatively small, with the audiovisual data of MELD-FAIR corresponding to 96.7% of the utterances of the original MELD dataset. Table 3 Distribution of emotion annotations in the MELD-FAIR dataset. The numbers of original dataset records for each emotion and split are given inside parentheses. Emotion neutral joy surprise sadness fear anger disgust train 4537 1683 1158 670 261 1082 267 dev (4710) (1743) (1205) (683) (268) (1109) (271) 9658 (9989) 461 160 140 109 39 150 22 Total test (470) (163) (150) (111) (40) (153) (22) 1226 389 270 207 49 339 67 1081 (1109) (1256) (402) (281) (208) (50) (345) (68) 6224 (6436) 2232 (2308) 1568 (1636) 986 (1002) 349 (358) 1571 (1607) 356 (361) 2547 (2610) 13286 (13708) Table 4 Distribution of uttering speakers in the MELD-FAIR dataset. The numbers of original dataset records for each speaker and split are given inside parentheses. Speaker Rachel Monica Phoebe Joey Chandler Ross others train 1392 1253 1269 1456 1243 1410 1635 (1435) (1299) (1321) (1509) (1283) (1459) (1683) 9658 (9989) dev 158 130 183 146 100 211 153 test (164) (137) (185) (149) (101) (217) (156) 1081 (1109) 7 350 338 277 399 374 368 441 (356) (346) (291) (411) (379) (373) (454) 2547 (2610) Total 1900 1721 1729 2001 1717 1989 2229 (1955) (1782) (1797) (2069) (1763) (2049) (2293) 13286 (13708) H. Carneiro, C. Weber and S. Wermter Neurocomputing 545 (2023) 126271 signal. Finally, a fully connected layer outputs a prediction regarding the expected speaker of the speech signal from this feature vector. The data distribution was kept nearly unaltered. For instance, the largest data distribution difference occurred in the fraction of dataset records assigned to the neutral emotion in the train split. Out of the original 4710 records in the train split that were assigned to the neutral emotion, the dataset refinement procedure was unable to retrieve corresponding audiovisual data for only 173 records. This corresponds to 3.67% of those records, and to 1.73% of all records in the train split. 
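The retention figures quoted above follow directly from the record counts in Tables 3 and 4; the short check below merely restates that arithmetic.

```python
# Record counts taken from Tables 3 and 4 (original MELD vs. MELD-FAIR).
total_orig, total_fair = 13_708, 13_286
train_orig, train_fair = 9_989, 9_658
neutral_train_orig, neutral_train_fair = 4_710, 4_537

print(f"overall retention: {total_fair / total_orig:.2%}")     # ~96.92%
print(f"train retention:   {train_fair / train_orig:.2%}")     # ~96.7%
lost_neutral = neutral_train_orig - neutral_train_fair         # 173 records
print(f"lost neutral: {lost_neutral / neutral_train_orig:.2%} of neutral records,"
      f" {lost_neutral / train_orig:.2%} of the whole train split")  # ~3.67% / ~1.73%
```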
These dataset records, which correspond to one utterance each, are well dispersed throughout the whole dataset. As a consequence, the fraction of dialogues which lost at least one of their utterances in the dataset refinement procedure is moderately higher. 222 of the 1038 dialogues of the train split contain at least one utterance with no corresponding audiovisual data in MELD-FAIR, which represents 21.4% of the dialogues in that split. For the dev and test splits, this reduction was lower. 19 of the 114 dialogues of the dev split, i.e., 16.7%, have utterances with no corresponding audiovisual data in MELD-FAIR, and for the test split, 49 of its 280 dialogues, i.e., 17.5%. 4.2.2. Data augmentation Following the steps of Tao et al. [37], negative sampling is used to augment the available speech data. In negative sampling augmentation, data is augmented by combining it with some other interfering data within the same batch that effectively shares the same label as the original data, i.e., it is expected that both the original and the interfering speech signal have been uttered by the same speaker. Through randomly selecting interfering data that has those characteristics, an interference is made by combining the original audio tracks and those of the interfering data, thus coming up with a mixture of both. By benefitting from the indomain noise and the interfering speech signals from the training set itself, this approach presents three advantages in comparison to traditional augmentation through the addition of white noise: i) the interference data is not artificially generated; ii) there is no need for data outside the training set for the audio augmentation; and iii) by using audio samples from the same speaker, the interference provided in the data augmentation accentuates the characteristics of that speaker’s voice. 4.2. Assessment of the video realignment Due to the lack of an annotation of the correct start and end time stamps of each utterance, a self-supervised form of assessing the robustness of the video realignment procedure was devised. A video correctly realigned to its corresponding utterance is expected to have most of its audio content comprised of a speech signal uttered by the speaker annotated in the corresponding dataset record. This would allow training a speaker identification model with the speech signals of the realigned videos of the train split so that it generalises and correctly identifies the speakers from speech signals of the realigned videos of the remaining splits. However, the model would require a given speaker to appear in a reasonable number of MELD records in all dataset splits, but only six speakers appear consistently throughout all MELD splits. These are the six main characters: Rachel, Monica, Phoebe, Joey, Chandler, and Ross. The remaining speakers appear rarely, indicating that it is highly unlikely that the speaker identification model could learn to generalise well from their speech. With a 50% chance, an audio sample is selected to be augmented this way, which means that within a batch, roughly half of its samples are augmented. Audio samples selected this way are either circularly padded or trimmed to match the size of the original audio sample. A single batch typically has audio samples of very different sizes. 
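A compact sketch of this negative-sampling augmentation is given below, assuming waveforms as 1-D NumPy arrays and a batch given as plain Python lists. The 50% selection probability, the same-speaker constraint, and the circular padding or trimming of the interfering sample follow the description above; the simple additive mixing and all names are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng()

def fit_length(x, n):
    """Circularly pad or trim a 1-D signal to exactly n samples."""
    if len(x) >= n:
        return x[:n]
    reps = int(np.ceil(n / len(x)))
    return np.tile(x, reps)[:n]

def negative_sampling_augment(batch_audio, batch_speakers, p=0.5):
    """Mix roughly half of the batch with same-speaker interference.

    Each selected sample is combined with another sample of the batch
    that carries the same speaker label; the interfering signal is first
    circularly padded or trimmed to the length of the original sample.
    """
    augmented = []
    for i, (audio, speaker) in enumerate(zip(batch_audio, batch_speakers)):
        candidates = [j for j, s in enumerate(batch_speakers)
                      if s == speaker and j != i]
        if candidates and rng.random() < p:
            j = int(rng.choice(candidates))
            interference = fit_length(batch_audio[j], len(audio))
            augmented.append(audio + interference)   # in-domain interference
        else:
            augmented.append(audio)
    return augmented
```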
In order to let all audio samples in the same batch have the same size, they are either circularly padded or trimmed so that every audio sample in the same batch have a length equal to the average of the lengths of the original audio samples. This way, it is guaranteed that the model is trained with samples of a reasonable size, and that at least half of the samples of a batch consists of unpadded continuous audio samples. 4.2.3. Training procedure To train the speaker identification model, audio tracks are randomly sampled, such that there be roughly the same number of audio samples for each class (the six main characters). Audio samples are augmented according to the aforementioned procedure. The model is trained by minimising a cross-entropy loss function using an ADAM optimiser with an initial learning rate of 1e-4, whose value is decreased in half every ten epochs. Batches of size 64 are used in the model training. The training procedure is kept running until there is a sequence of 30 epochs with no improvement in the weighted F1 score of the dev split. 4.2.1. Model A speaker identification model is used to assess whether the speech audio in a given realigned video actually matches the speaker annotated in the corresponding MELD record. The speaker identification model is composed of an encoder part followed by a classifier part. Based on TalkNet-ASD’s ATE, a traditional ResNet34 is used as the encoder. This encoder produces an embedding F a of dimensions T4a  512, where T a is the number of audio frames corresponding to the speech signal. Then, via temporal max pooling, a 512-dimensional feature vector is obtained for the whole speech Fig. 7. Confusion matrices of the speaker identification model in MELD’s and MELD-FAIR’s test splits. 8 Neurocomputing 545 (2023) 126271 H. Carneiro, C. Weber and S. Wermter Fig. 8. ERC model. 4.3.2. Data augmentation Audio samples are augmented through the same data augmentation procedure described in Section 4.2. Face crops are augmented by performing one of the following operations: random horizontal flip, random crop of an area with at least 70% the dimension of the original face crop, or a random rotation up to 15 degrees clockwise or counterclockwise. Afterwards, the face crop is resized to 112  112 pixels. In order to keep consistency in the direction the speaker’s head is looking to, the random characteristics of the data augmentation procedure are applied to the sequence of faces as a whole, and not to each face separately. 4.2.4. Results and analysis Fig. 7 presents the confusion matrices obtained when evaluating the speaker identification model in MELD’s and MELD-FAIR’s test splits. A comparison is presented on how well the speaker identification model can generalise what it learned from each character’s voice from the data of the original MELD dataset (Fig. 7a) and from that of its refined version, MELD-FAIR (Fig. 7b). The speaker identification model subjected to MELD-FAIR achieved a weighted F1 score of 78.32% in that dataset’s test split, whereas the speaker identification model subjected to the original MELD achieved a weighted F1 score of 67.07% in the corresponding test split. The confusion matrices and the weighted F1 scores indicate that the video realignment leads to cuts that better match the expected speaker, which, in turn, indicates that it is highly likely that the audio contents of those cuts closely match the corresponding utterances whose transcriptions are given in the dataset. 4.3.3. 
Training procedure Since the distribution of emotion labels is similar in every split of MELD-FAIR, no weighted random sampling in the training of the ER model is performed. Instead, for every record in the train split representing a single utterance, a sequence of 15 consecutive face crops is selected as input for the video stream, and the complete utterance audio is provided as input for the audio stream. In case the sequence of faces corresponding to the uttering speaker has less than 15 face crops, then the sequence is circularly padded. If the sequence of faces has more than 15 face crops, then a subsequence of 15 consecutive face crops is randomly selected. The model is trained by minimising a cross-entropy loss function using an ADAM optimiser with an initial learning rate of 1e-4, whose value is decreased in half every ten epochs. Batches of size 64 are used in the model training. The training procedure is kept running until there is a sequence of 30 epochs with no improvement in the weighted F1 score of the dev split. 4.3. Application in ERC 4.3.1. Model We have devised an emotion recognition (ER) model to assess whether MELD-FAIR actually has visual and acoustic information from which emotional characteristics can be retrieved. Fig. 8 presents the architecture of the ER model. For the encoding of the visual and acoustic inputs, TalkNet-ASD’s VTE and ATE have been modified to enable them to produce vector representations with 512 dimensions. VTE has been modified by having its sequence of 1D convolution layers removed, since its main application is to reduce the dimensionality of the feature vectors, and V-TCN already yields vector representations with 512 dimensions. For TalkNet-ASD’s ATE to produce 512-dimensional feature vectors, its Thin ResNet34 backbone has been changed for a traditional ResNet34. Also, we keep the face crops with their original colour channels for the task of emotion recognition. This way, changes in skin colour due to some emotional reactions, e.g., blushing, can be considered by the ER model. The embeddings output by VTE and ATE are then max-pooled in the temporal dimension into feature vectors F v and F a , with 512 dimensions each. These vectors are concatenated and subsequently sent to a self-attention layer. Finally, a fully connected layer yields a prediction for the emotion of the uttering speaker given the output of the self-attention layer. 4.3.4. Experimental results Three variations of the ER model were implemented and trained from scratch. One incorporated inputs coming from both the acoustic and visual streams, while the other two variations were ablations, each containing only one of the input streams. Table 5 presents the weighted F1 score achieved by each variation, the number of training epochs it took for every variation to reach its best performance, the average training time per batch, and the number of batches used in each training epoch. The training times presented in Table 5 were achieved using a single NVIDIA GeForce GTX 1080 Ti. 4.3.5. Comparison with the state of the art To evaluate the benefits of our refinement procedure for the task of ERC with MELD-FAIR, we compare the performance of our ER model to existing approaches that use information from the original MELD videos in ERC, and not only from the utterance transcriptions provided in the dataset. DialogueRNN [28]5 is a baseline approach which models the context of a conversation by tracking the states of individual parties Table 5 Comparison of ER model variations. 
Modalities Vision Audio Audio + Vision Weighted F1 score (%) Number of training epochs Avg. training time (seconds per batch) Number of batches per epoch 35.58 15 1.164 151 40.54 18 0.287 151 39.81 19 1.211 151 5 Although DialogueRNN was originally proposed in [28], its first application to ERC in the MELD dataset was in [31]. 9 H. Carneiro, C. Weber and S. Wermter Neurocomputing 545 (2023) 126271 within that conversation. The model determines the emotion of a given utterance according to three aspects: its speaker, the context from preceding utterances, and the emotion thereof. DialogueRNN models these aspects by using three gated recurrent units (GRUs) [7], each responsible for a particular aspect. CT + EmbraceNet [41] is a pioneering ERC model in using visual information from the MELD videos. Although DialogueRNN predates it, the former uses solely information from the acoustic and textual modalities. This approach uses crossmodal transformers (CTs) [38] to enrich the information from one modality by taking into account information from another modality, and this way learn existing correlated information across pairs of modalities. EmbraceNet [8] was used to carefully deal with the crossmodal information in the feature vectors produced by the crossmodal transformers, and to prevent performance degradation due to the partial absence of data. EmoCaps [27] uses transformer-based encoders to extract emotion feature vectors from the visual, acoustic and textual modalities. The authors also use BERT [11] to extract text feature vectors from every utterance. By concatenating an utterance feature vector with the corresponding emotion vectors of each modality, the authors create a vector representation for that utterance. Then, through the use of a Bi-LSTM [14,16] and a classification subnetwork, EmoCaps predicts the emotion from every utterance in a dialogue. MMGCN [21] uses a multimodal graph, where each node represents a given modality in some particular utterance. Nodes of this graph are connected if they share either the same modality or the same utterance. Each MMGCN node is initialised with a concatenation of two elements: a context-aware feature encoding of the corresponding modality and utterance, and an embedding of the speaker of that particular utterance. MMGCN leverages speaker embedding to inject speaker information into the graph construction. MMGCN encodes the multimodal contextual information through the use of a multilayered deep spectral-domain graph convolutional network. MM-DFN [19], similarly to MMGCN, uses a multimodal graph with the same structure to characterise the relations between all modalities within a given uttering event, and of every utterance within a dialogue. MM-DFN introduces graph-based dynamic fusion modules, which are stacked in layers, to fuse multimodal context features dynamically and sequentially. These modules aggregate both inter- and intra-modality contextual information in a specific semantic space at each layer. It differs from MMGCN, which aggregates contextual information in a single semantic space. This leads to a gradual accumulation of redundant information. By modelling the contextual information in different semantic spaces, MM-DFN benefits from a reduction in the accumulation of redundant information, as well as from an enhancement in the complementarity between the modalities. M2FNet [9] is the current state-of-the-art model in ERC in MELD6. Its main characteristics are. 
Table 6 Weighted F1 scores for ERC in MELD test split using visual and acoustic data. Model Vision Audio Audio + Vision DialogueRNN CT + EmbraceNet EmoCaps MMGCN MM-DFN M2FNet N/A 31.4 31.26 33.27 32.34 32.44 44.3 32.1 31.26 42.63 42.72 39.63 N/A N/A N/A N/A 44.67 35.74 Ours 35.58 40.54 39.81 ii) the use of one stack of transformer encoders for each modality, as a means to learn inter-utterance context on a modality level; and iii) a multi-head attention fusion module to better incorporate those modalities, especially the visual and acoustic ones. It is worth noticing that all multimodal approaches to ERC in MELD use context from the dialogue in some form. Since we are interested in extracting the most useful information from the visual and the acoustic modalities, we rely solely on the utterance level. This way, we can guarantee that the performance achieved is a direct consequence of the video realignment and the utterance source localisation, and not from some other part of the dialogue. Table 6 compares the performance of the ER model proposed here with those of ablated versions of all multimodal approaches to ERC in MELD. The values presented in the table were extracted from the literature. Some table cells appear empty because either one modality was not used (e.g., Poria et al. [31] do not use information from the visual modality in their implementation of DialogueRNN), or the authors did not consider the combination of vision and acoustic modalities in their ablation studies (as in [21] and in [27]). Table 6 shows that our ER model achieves a higher weighted F1 score than state-of-the-art approaches when restricted to the visual modality. It is worth noticing that our ER model outperforms state-of-the-art approaches, even though it does not use temporal visual context on a dialogue level. This indicates that the combination of video realignment and active speaker detection can indeed yield sequences of facial expressions which, in turn, provide the ER model with more information on the uttering speaker’s emotion than the feature extraction procedures used in the other approaches. The performance of our ER model when restricted to the acoustic modality is higher than M2FNet (current state-of-the-art approach for ERC in MELD) and EmoCaps. Its performance, however, is lower than those of DialogueRNN, MMGCN and MM-DFN. These models have in common the use of utterance-level feature vectors extracted from OpenSMILE [12,35] as input for the audio stream. EmoCaps also uses these, however, its multimodal representation favours the textual modality since it uses both the utterance feature vector yielded by BERT and an emotion feature vector for the textual modality in its multimodal utterance representation, whereas only a single emotion feature vector is used to represent each of the remaining modalities. Also, EmoCaps’s weighted F1 scores in both modalities correspond to that of a model that outputs neutral for every input. M2FNet, on the other hand, uses a novel feature extractor module based on the triplet loss [34] to fetch deep features from acoustic and visual contents. i) a visual feature extractor that provides a visual representation based on the faces of the people in a scene as well as on the scene as a whole; 5. Discussion and conclusion 6 Although M2FNet’s performance values seem lower than those of other models in Table 6, this is due to most of the contribution in ERC coming from the textual modality, which was not included in Table 6. 
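For reference, the sketch below outlines the utterance-level fusion used by our ER model, whose scores appear in Tables 5 and 6 (Section 4.3.1): temporal max pooling of the visual and acoustic embeddings, concatenation, a self-attention layer, and a fully connected classifier over the seven emotion classes, optimised with Adam at an initial learning rate of 1e-4 that is halved every ten epochs (Section 4.3.3). The encoders are abstracted away as precomputed feature tensors, and details the text does not fix, such as the number of attention heads and treating the concatenated vector as a single attention token, are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ERFusionHead(nn.Module):
    """Fuse 512-d visual and acoustic utterance embeddings and classify emotion."""

    def __init__(self, dim=512, n_heads=8, n_classes=7):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(2 * dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (B, T_v, 512) from the visual encoder (face-crop sequence)
        # aud_feats: (B, T_a/4, 512) from the audio encoder (MFCC stream)
        f_v = vis_feats.max(dim=1).values                  # temporal max pooling
        f_a = aud_feats.max(dim=1).values
        f_av = torch.cat([f_v, f_a], dim=-1).unsqueeze(1)  # (B, 1, 1024)
        attended, _ = self.self_attention(f_av, f_av, f_av)
        return self.classifier(attended.squeeze(1))        # emotion logits

model = ERFusionHead()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=10, gamma=0.5)
```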
We decide not to include the performance of those models when the text modality is not ablated because the main objective of this paper is to present a way of extracting useful information from the visual and acoustic modalities, since those are quite unreliable in MELD. In contrast, the text transcriptions are very reliable and do not require an extensive refinement. Connectionist-temporal-classification segmentation and active speaker detection allowed us to refine MELD, a largely-used multimodal dataset for emotion recognition in multi-party conversational scenarios, making it possible to better align its audiovisual 10 Neurocomputing 545 (2023) 126271 H. Carneiro, C. Weber and S. Wermter Table 7 List of existing problematic cases in MELD. Utt ID Acknowledgement Split Dia ID train 125 3 Corrupted video file dev 110 7 Non-existent video file test 38 4 Very long video (> 45 s), incompatible with its utterance transcription test 220 0 train train train train train dev test 309 404 736 832 1018 108 128 0 15 4 3 2 0 2 Utterance transcription contains not only the utterance but also a description within parentheses train train test 65 761 86 3 1 3 Utterance transcription contains not only the utterance but also a description within brackets train train 739 849 14 3 No utterance. Just a description within parentheses train 111 N/A Utterances not chronologically ordered train 446 19 Should be the first utterance of dialogue 447 The authors acknowledge partial support from the German Research Foundation DFG under project CML (TRR 169). Problem Appendix A. Problematic cases of MELD MELD presents a variety of problematic cases beyond the misalignment between the videos and the utterance transcriptions. These comprise multiple other problems which raised errors during the processing of data refinement. Table 7 offers an extensive list of such cases, identified by the split, dialogue id and utterance id of each case. References [1] T. Afouras, J.S. Chung, A. Senior, O. Vinyals, A. Zisserman, Deep audio-visual speech recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2018), https://doi. org/10.1109/TPAMI.2018.2889052. [2] J.L. Alcázar, F. Caba, L. Mai, F. Perazzi, J.-Y. Lee, P. Arbelaez, and B. Ghanem. Active speakers in context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12465–12474, June 2020. [3] J.L. Alcázar, F. Caba, A.K. Thabet, and B. Ghanem. MAAS: Multi-modal assignation for active speaker detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 265–274, Oct. 2021. [4] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli. Wav2Vec 2.0: A framework for self-supervised learning of speech representations. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12449–12460. Curran Associates, Inc., 2020. [5] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, S.S. Narayanan, IEMOCAP: interactive emotional dyadic motion capture database, Language Resour. Evaluat. 42 (2008) 335–359, https://doi.org/ 10.1007/s10579-008-9076-6. [6] H. Carneiro, C. Weber, and S. Wermter. FaVoA: Face-voice association favours ambiguous speaker detection. In I. Farkaš, P. Masulli, S. Otte, and S. Wermter, editors, Artificial Neural Networks and Machine Learning – ICANN 2021, pages 439–450, Cham, 2021. Springer International Publishing [7] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. 
On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, Oct. 2014. Association for Computational Linguistics. https://doi/org/10.3115/v1/W14-4012. [8] J.-H. Choi, J.-S. Lee, EmbraceNet: A robust deep learning architecture for multimodal classification, Inform. Fusion 51 (2019) 259–270, https://doi.org/ 10.1016/j.inffus.2019.02.010. [9] V. Chudasama, P. Kar, A. Gudmalwar, N. Shah, P. Wasnik, and N. Onoe. M2FNet: Multi-modal fusion network for emotion recognition in conversation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 4652–4661, June 2022. [10] J.S. Chung, J. Huh, and S. Mun. Delving into VoxCeleb: Environment invariant speaker recognition. In K. Lee, T. Koshinaka, and K. Shinoda, editors, Odyssey 2020: The Speaker and Language Recognition Workshop, 1–5 November 2020, Tokyo, Japan, pages 349–356. ISCA, 2020. https://doi/org/10.21437/Odyssey. 2020–49. [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019. https://doi/org/10.18653/v1/n19-1423. [12] F. Eyben, M. Wöllmer, and B. Schuller. OpenSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, MM ’10, page 1459–1462, New York, NY, USA, 2010. Association for Computing Machinery. https://doi/org/10.1145/ 1873951.1874246. [13] D. Ghosal, N. Majumder, A. Gelbukh, R. Mihalcea, and S. Poria. COSMIC: COmmonSense knowledge for eMotion Identification in Conversations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2470–2481, Online, Nov. 2020. Association for Computational Linguistics. https://doi/org/10.18653/v1/2020.findings-emnlp.224. [14] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18 (5) (2005) 602–610, https://doi.org/10.1016/j.neunet.2005.06.042. [15] J. Guo, J. Deng, A. Lattas, and S. Zafeiriou. Sample and computation redistribution for efficient face detection. In International Conference on Learning Representations, pages 1–17, 2022. https://openreview.net/forum? id=RhB1AdoFfGE. data with the corresponding utterance transcriptions, as well as to obtain reliable face crops of the uttering speaker of nearly every scene. The comparison with state-of-the-art approaches also indicates that those face crops provide more precise information on the emotion of the uttering speaker than the most recent approaches. The reliable extraction of the speakers’ face crops from wellrealigned videos accounts for the high performance of the visiononly version of our emotion recognition model, which outperforms other competing approaches by more 2.3%. 
References

[1] T. Afouras, J.S. Chung, A. Senior, O. Vinyals, A. Zisserman, Deep audio-visual speech recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2018), https://doi.org/10.1109/TPAMI.2018.2889052.
[2] J.L. Alcázar, F. Caba, L. Mai, F. Perazzi, J.-Y. Lee, P. Arbelaez, and B. Ghanem. Active speakers in context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12465–12474, June 2020.
[3] J.L. Alcázar, F. Caba, A.K. Thabet, and B. Ghanem. MAAS: Multi-modal assignation for active speaker detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 265–274, Oct. 2021.
[4] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli. Wav2Vec 2.0: A framework for self-supervised learning of speech representations. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12449–12460. Curran Associates, Inc., 2020.
[5] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, S.S. Narayanan, IEMOCAP: interactive emotional dyadic motion capture database, Language Resour. Evaluat. 42 (2008) 335–359, https://doi.org/10.1007/s10579-008-9076-6.
[6] H. Carneiro, C. Weber, and S. Wermter. FaVoA: Face-voice association favours ambiguous speaker detection. In I. Farkaš, P. Masulli, S. Otte, and S. Wermter, editors, Artificial Neural Networks and Machine Learning – ICANN 2021, pages 439–450, Cham, 2021. Springer International Publishing.
[7] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, Oct. 2014. Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-4012.
[8] J.-H. Choi, J.-S. Lee, EmbraceNet: A robust deep learning architecture for multimodal classification, Inform. Fusion 51 (2019) 259–270, https://doi.org/10.1016/j.inffus.2019.02.010.
[9] V. Chudasama, P. Kar, A. Gudmalwar, N. Shah, P. Wasnik, and N. Onoe. M2FNet: Multi-modal fusion network for emotion recognition in conversation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 4652–4661, June 2022.
[10] J.S. Chung, J. Huh, and S. Mun. Delving into VoxCeleb: Environment invariant speaker recognition. In K. Lee, T. Koshinaka, and K. Shinoda, editors, Odyssey 2020: The Speaker and Language Recognition Workshop, 1–5 November 2020, Tokyo, Japan, pages 349–356. ISCA, 2020. https://doi.org/10.21437/Odyssey.2020-49.
[11] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019. https://doi.org/10.18653/v1/n19-1423.
[12] F. Eyben, M. Wöllmer, and B. Schuller. OpenSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, MM '10, page 1459–1462, New York, NY, USA, 2010. Association for Computing Machinery. https://doi.org/10.1145/1873951.1874246.
[13] D. Ghosal, N. Majumder, A. Gelbukh, R. Mihalcea, and S. Poria. COSMIC: COmmonSense knowledge for eMotion Identification in Conversations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2470–2481, Online, Nov. 2020. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.224.
[14] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18 (5) (2005) 602–610, https://doi.org/10.1016/j.neunet.2005.06.042.
[15] J. Guo, J. Deng, A. Lattas, and S. Zafeiriou. Sample and computation redistribution for efficient face detection. In International Conference on Learning Representations, pages 1–17, 2022. https://openreview.net/forum?id=RhB1AdoFfGE.
[16] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (Nov. 1997) 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735.
[17] E. Hornecker, A. Krummheuer, A. Bischof, M. Rehm, Beyond dyadic HRI: Building robots for society, Interactions 29 (3) (May 2022) 48–53, https://doi.org/10.1145/3526119.
[18] C.-C. Hsu, S.-Y. Chen, C.-C. Kuo, T.-H. Huang, and L.-W. Ku. EmotionLines: An emotion corpus of multi-party conversations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 1597–1601, Miyazaki, Japan, May 2018. European Language Resources Association (ELRA).
[19] D. Hu, X. Hou, L. Wei, L. Jiang, and Y. Mo. MM-DFN: Multimodal dynamic fusion network for emotion recognition in conversations. In ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7037–7041, 2022. https://doi.org/10.1109/ICASSP43922.2022.9747397.
[20] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018. https://doi.org/10.1109/CVPR.2018.00745.
[21] J. Hu, Y. Liu, J. Zhao, and Q. Jin. MMGCN: Multimodal fusion via deep graph convolution network for emotion recognition in conversation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5666–5675, Online, Aug. 2021. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.440.
[22] A.L. Krummheuer, M. Rehm, and K. Rodil. Triadic human-robot interaction. Distributed agency and memory in robot assisted interactions. In Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, HRI '20, page 317–319, New York, NY, USA, 2020. Association for Computing Machinery. https://doi.org/10.1145/3371382.3378269.
[23] L. Kürzinger, D. Winkelbauer, L. Li, T. Watzel, G. Rigoll, CTC-segmentation of large corpora for German end-to-end speech recognition, in: A. Karpov, R. Potapova (Eds.), Speech and Computer, Springer International Publishing, Cham, 2020, pp. 267–278, https://doi.org/10.1007/978-3-030-60276-5_27.
[24] B. Lee and Y.S. Choi. Graph based network with contextualized representations of turns in dialogue. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 443–455, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.36.
[25] J. Lee and W. Lee. CoMPM: Context modeling with speaker's pre-trained memory tracking for emotion recognition in conversation. In M. Carpuat, M. de Marneffe, and I.V.M. Ruíz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10–15, 2022, pages 5669–5679. Association for Computational Linguistics, 2022. https://doi.org/10.18653/v1/2022.naacl-main.416.
[26] Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan, Nov. 2017. Asian Federation of Natural Language Processing.
[27] Z. Li, F. Tang, M. Zhao, Y. Zhu, EmoCaps: Emotion capsule based model for conversational emotion recognition, in: Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 2022, pp. 1610–1618, Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-acl.126.
[28] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria. DialogueRNN: An attentive RNN for emotion detection in conversations. Proceedings of the AAAI Conference on Artificial Intelligence, 33 (01): 6818–6825, Jul 2019. https://doi.org/10.1609/aaai.v33i01.33016818.
[29] G. McKeown, M. Valstar, R. Cowie, M. Pantic, M. Schroder, The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent, IEEE Trans. Affect. Comput. 3 (1) (2012) 5–17, https://doi.org/10.1109/T-AFFC.2011.20.
[30] K. Min, S. Roy, S. Tripathi, T. Guha, and S. Majumdar. Learning long-term spatial-temporal graphs for active speaker detection. In S. Avidan, G. Brostow, M. Cissé, G.M. Farinella, and T. Hassner, editors, Computer Vision – ECCV 2022, pages 371–387, Cham, 2022. Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-19833-5_22.
[31] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 527–536, Florence, Italy, July 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1050.
[32] J. Roth, S. Chaudhuri, O. Klejch, R. Marvin, A. Gallagher, L. Kaver, S. Ramaswamy, A. Stopczynski, C. Schmid, Z. Xi, and C. Pantofaru. AVA active speaker: An audio-visual dataset for active speaker detection. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4492–4496, 2020. https://doi.org/10.1109/ICASSP40776.2020.9053900.
[33] P. Saxena, Y.J. Huang, and S. Kurohashi. Static and dynamic speaker modeling based on graph neural network for emotion recognition in conversation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, pages 247–253, Hybrid: Seattle, Washington + Online, July 2022. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-srw.31.
[34] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015. https://doi.org/10.1109/CVPR.2015.7298682.
[35] B. Schuller, A. Batliner, S. Steidl, D. Seppi, Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge, Speech Commun. 53 (9) (2011) 1062–1087, https://doi.org/10.1016/j.specom.2011.01.011.
[36] X. Song, L. Zang, R. Zhang, S. Hu, and L. Huang. Emotionflow: Capture the dialogue level emotion transitions. In ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8542–8546, 2022. https://doi.org/10.1109/ICASSP43922.2022.9746464.
[37] R. Tao, Z. Pan, R.K. Das, X. Qian, M.Z. Shou, and H. Li. Is someone speaking? Exploring long-term temporal features for audio-visual active speaker detection. In Proceedings of the 29th ACM International Conference on Multimedia, MM '21, page 3927–3935, New York, NY, USA, 2021. Association for Computing Machinery. https://doi.org/10.1145/3474085.3475587.
[38] Y.-H.H. Tsai, S. Bai, P.P. Liang, J.Z. Kolter, L.-P. Morency, and R. Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6558–6569, Florence, Italy, July 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1656.
[39] D. Utami and T. Bickmore. Collaborative user responses in multiparty interaction with a couples counselor robot. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 294–303, 2019. https://doi.org/10.1109/HRI.2019.8673177.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
[41] B. Xie, M. Sidulova, C.H. Park, Robust multimodal emotion recognition from conversation with transformer-based crossmodality fusion, Sensors 21 (14) (2021), https://doi.org/10.3390/s21144913.
[42] S. Zahiri and J.D. Choi. Emotion detection on TV show transcripts with sequence-based convolutional neural networks. In Proceedings of the AAAI Workshop on Affective Content Analysis, AFFCON'18, pages 44–51, New Orleans, LA, 2018.
[43] Y. Zhang, S. Liang, S. Yang, X. Liu, Z. Wu, S. Shan, and X. Chen. UniCon: Unified context network for robust active speaker detection. In Proceedings of the 29th ACM International Conference on Multimedia, MM '21, pages 3964–3972, New York, NY, USA, 2021. Association for Computing Machinery. https://doi.org/10.1145/3474085.3475275.
[44] L. Zhu, G. Pergola, L. Gui, D. Zhou, and Y. He. Topic-driven and knowledge-aware transformer for dialogue emotion detection. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1571–1582, Online, Aug. 2021. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.125.
Hugo Carneiro is a postdoctoral research associate at the Knowledge Technology Institute, University of Hamburg. He received his MSc and DSc in systems engineering and computer science from COPPE, Federal University of Rio de Janeiro. His research interests are in crossmodal learning, affective computing, computational linguistics, language representation, knowledge representation and explainable AI.

Cornelius Weber graduated in physics at Universität Bielefeld, Germany, and received his PhD in computer science at Technische Universität Berlin. He was subsequently a Postdoctoral Fellow in Brain and Cognitive Sciences at the University of Rochester, USA, a Research Scientist in Hybrid Intelligent Systems at the University of Sunderland, UK, and a Junior Fellow at the Frankfurt Institute for Advanced Studies, Germany. Currently he is Lab Manager at the Knowledge Technology Institute, Universität Hamburg. His interests are in computational neuroscience, development of visual feature detectors, neural models of representations and transformations, reinforcement learning and robot control, grounded language learning, human-robot interaction and related applications in social assistive robotics.

Stefan Wermter is Full Professor at the University of Hamburg, Germany, and Director of the Knowledge Technology Institute. Currently, he is a co-coordinator of the International Collaborative Research Centre on Crossmodal Learning (TRR 169) and coordinator of the Doctoral Network on Transparent Interpretable Robots (TRAIL). His main research interests are in the fields of neural networks, hybrid knowledge technology, cognitive robotics and human-robot interaction. He is an Associate Editor of Connection Science and the International Journal for Hybrid Intelligent Systems. He is on the Editorial Board of the journals Cognitive Systems Research, Cognitive Computation and Journal of Computational Intelligence. He is serving as the President of the European Neural Network Society.