MULTIMODAL EMOTION RECOGNITION WITH HIGH-LEVEL SPEECH AND TEXT FEATURES

Mariana Rodrigues Makiuchi, Kuniaki Uto, Koichi Shinoda

Tokyo Institute of Technology



Copyright 2021 IEEE. Published in the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2021), scheduled for 14-18 December 2021 in Cartagena, Colombia. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

ABSTRACT

Automatic emotion recognition is one of the central concerns of the Human-Computer Interaction field, as it can bridge the gap between humans and machines. Current works train deep learning models on low-level data representations to solve the emotion recognition task. Since emotion datasets often have a limited amount of data, these approaches may suffer from overfitting, and they may learn based on superficial cues. To address these issues, we propose a novel cross-representation speech model, inspired by disentanglement representation learning, to perform emotion recognition on wav2vec 2.0 speech features. We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models. We further combine the speech-based and text-based results with a score fusion approach. Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem, and it surpasses current works on speech-only, text-only, and multimodal emotion recognition.

Index Terms— Emotion recognition, disentanglement representation learning, deep learning, multimodality, wav2vec 2.0

1. INTRODUCTION

Correctly perceiving other people's emotions is one of the key components of good interpersonal communication. Emotions make conversation more natural. They can add or remove ambiguity, and they can change the meaning of what is being communicated altogether. Due to the importance of emotions in human-to-human conversation, automatic emotion recognition has been one of the main concerns of the Human-Computer Interaction (HCI) field for decades [1].

Many studies have proposed emotion recognition methods based on para-linguistic features or on transcribed text data [2, 3, 4]. However, depending on paralanguage, the semantics of the linguistic communication may change, and vice versa. Thus, spoken text may have different interpretations depending on the speech intonation. Additionally, similar speech intonations may be used to convey different emotions, which can only be discerned by understanding the linguistic factor of communication. Therefore, relying solely on linguistic or para-linguistic information may not be enough to correctly recognize emotions in conversation.

One of the most fundamental challenges in emotion recognition is the definition of features that can capture emotion cues in the data. There is no agreement on which feature set is the most powerful for distinguishing between emotions [5, 6, 7], and, in speech emotion recognition, this challenge is further aggravated by the acoustic variability introduced by speakers, speaking styles, and speaking rates.

Most current studies propose training deep learning models to extract those feature sets from the data [8, 9]. Although these approaches have yielded satisfactory results, two problems remain. First, the training may easily lead to overfitting, since these models are usually trained from scratch on low-level data representations, and emotion recognition datasets are known to have a limited amount of data. Second, it is known that deep learning architectures may learn from superficial cues [10], which makes us question whether current models can actually capture emotion information in the data.

We address these issues by challenging the low-level data representations commonly used in speech-based and text-based emotion recognition studies, and by incorporating high-level features into our method. We define as low-level the features obtained from feature engineering, and we define as high-level the generic features extracted with deep learning approaches.

We propose a cross-representation model for speech emotion recognition, in which we aim to reconstruct low-level mel-spectrogram speech representations from high-level wav2vec 2.0 ones, thus leveraging both representations. We choose wav2vec features because they contain rich prosodic information [11]. Additionally, since we would like to capture a generic representation of emotion from speech, our model uses disentanglement representation learning techniques to eliminate speaker identity and phonetic variations in the data. In our method, we also define a CNN-based model for text-based emotion recognition on features acquired from Transformer-based models. We believe these features are better than the commonly used word2vec and GloVe features due to the Transformer's ability to model long contextual information in a sentence, which is necessary for emotion recognition. Finally, we combine the results from the speech-based and the text-based models to obtain the multimodal emotion recognition results.
2. RELATED WORKS

2.1. Wav2vec 2.0

Wav2vec 2.0 [12] is a framework for obtaining speech representations via self-supervision. The wav2vec model is trained on large amounts of unlabelled speech data and is then fine-tuned on labelled data for Automatic Speech Recognition (ASR). Wav2vec is composed of a feature encoder and a context network. The encoder takes a raw waveform as input and outputs a sequence of features with a stride of 20 ms and a receptive field of 25 ms. These features encode the speech's local information, and they have a size of 768 and 1024 for the "base" and "large" versions of wav2vec, respectively. The feature sequence is then fed to the Transformer-based context network, which outputs a contextualized representation of speech. In the "base" and "large" wav2vec, there are 12 and 24 Transformer blocks, respectively. Although wav2vec 2.0 representations were originally applied to ASR, other tasks, such as speech emotion recognition, can also benefit from them [11, 13, 14].

2.2. Speech Emotion Recognition

In its early stages, most works on Speech Emotion Recognition (SER) proposed solutions based on Hidden Markov Models (HMM) [15], Support Vector Machines (SVM) [4, 16], or Gaussian Mixture Models (GMM) [5]. However, given the superior performance of deep learning on many speech-related tasks [17, 18], deep learning approaches to SER became predominant.

A problem characteristic of SER is the definition of appropriate features to represent emotion in speech [19]. Previous studies have attempted to extract emotion information from Mel-frequency cepstral coefficients (MFCC), pitch, and energy [5, 6, 7]. However, recent work showed that combining the local and contextualized outputs of a pre-trained wav2vec 2.0 model with a weighted sum with learnable weights yields better speech emotion recognition results [11].

2.3. Text Emotion Recognition

Current Text Emotion Recognition (TER) works use either features from an ASR model trained from scratch [9], or Word2Vec or GloVe [3] features. Such works yield good results, but, given the outstanding performance of Transformer-based models on various NLP tasks [20, 21, 22], it is natural to question the representations commonly used for the TER task.

2.4. Disentanglement Representation Learning

Disentanglement representation learning aims to separate the underlying factors of variation in the data [23]. The idea is that, by disentangling these factors, we can discard the factors that are uninformative for the task we would like to solve, while keeping the relevant ones. Disentanglement has been applied to image [24], video [25], and speech [26] applications. In speech-related works, it has been applied mainly to speech conversion and prosody transfer tasks [27, 28].

AutoVC [27] is an autoencoder that extracts a speaker-independent representation of speech content for speech conversion. A mel-spectrogram is fed to the model's encoder, and the decoder reconstructs the spectrogram from the encoder's output and a speaker identity embedding. By controlling the size of the encoder's bottleneck, speaker identity information is eliminated at the bottleneck. SpeechFlow [26] builds upon AutoVC to disentangle speech into pitch, rhythm, and content features.

Even though these works show impressive results in speech conversion, only a few works attempt to disentangle speech for SER. [29] proposed an autoencoder to disentangle speech style and speaker identity from i-vectors and x-vectors, and used the speech style embedding for SER. [30] used adversarial training to disentangle speech features into speaker identity and emotion features. These methods hold similarities with the SER method proposed in this paper, but our approach differs from previous works in that we explicitly eliminate speaker identity information from speech to obtain emotion features, and we perform experiments on the disentanglement properties of these features.

3. METHODOLOGY

We propose a model to perform SER and a model to perform TER. The SER model takes as input wav2vec features, a mel-spectrogram, speaker identity embeddings, and a phone sequence. All these features are extracted from the same speech segment, and the model outputs the probabilities of each emotion class for the speech segment. The TER model takes as input text features extracted from an utterance's transcript and outputs the probabilities of each emotion class for the utterance. The SER and TER results are combined via score fusion to obtain the multimodal emotion class probabilities. Our proposed method, including the SER model, the TER model, and the fusion approach, is depicted in Figure 1.
[Figure 1 (block diagram): the SER model with its Encoder (E), Decoder (D), Phone Encoder (Ep), and Classifier (C), where the encoder applies a weighted average over the wav2vec features, BLSTM layers, and a downsampler, and the decoder upsamples and reconstructs an [80, 96] mel-spectrogram; the TER model, built from 1D CNN layers, mean pooling, and a linear layer over the [N', L] text features; and the score fusion that produces the multimodal emotion class probabilities.]
Fig. 1. Proposed method depicting the SER and TER models and the fusion approach. 1D ConvNorm layers are defined as a 1D CNN layer followed by batch normalization, and BLSTM layers are bidirectional LSTMs. The number of layers is shown as a prefix to each block name. The number of filters F and the kernel size K of each ConvNorm and CNN layer are shown as the block name's suffix FxK. The number of neurons in LSTM, BLSTM, and Linear layers is shown as the block name's suffix.
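To make the encoder path of Figure 1 concrete, the following PyTorch sketch (an illustration under assumed details, not the authors' exact implementation) shows the learnable weighted average over the 25 wav2vec feature sets, the frame-wise concatenation with the 256-dimensional speaker embedding, the 2-layer BLSTM of size d, and the temporal downsampling by a factor f that produces the [2d, 96/f] bottleneck described later in Section 3.1.3.

```python
import torch
import torch.nn as nn

class BottleneckEncoder(nn.Module):
    """Sketch of the encoder path of Figure 1 (assumed details)."""
    def __init__(self, d=8, f=48, feat_dim=1024, spk_dim=256, num_layers=25):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(num_layers))   # trainable weights of Eq. (1)
        self.blstm = nn.LSTM(feat_dim + spk_dim, d, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.f = f

    def forward(self, h, spk):
        # h: [B, 25, 96, 1024] wav2vec features, spk: [B, 256] speaker embedding
        w = self.alpha / self.alpha.sum()
        h_avg = (w.view(1, -1, 1, 1) * h).sum(dim=1)          # weighted average -> [B, 96, 1024]
        spk = spk.unsqueeze(1).expand(-1, h_avg.size(1), -1)  # repeat the embedding frame by frame
        x, _ = self.blstm(torch.cat([h_avg, spk], dim=-1))    # -> [B, 96, 2d]
        return x[:, ::self.f]                                 # keep every f-th frame -> [B, 96/f, 2d]

def upsample(codes, f):
    """Decoder-side upsampling: repeat each kept frame f times along the time axis."""
    return codes.repeat_interleave(f, dim=1)
```

With the "Small" configuration reported later (d = 8, f = 48), a 96-frame segment is compressed to a 2 x 16 bottleneck before the decoder upsamples it back.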

3.1. Speech Emotion Recognition

We propose an encoder-decoder model that takes wav2vec 2.0 features as input and reconstructs the corresponding mel-frequency spectrogram. Our model has four main components: an encoder, a decoder, a phone encoder, and a classifier. The model is trained on speech segments of 96 frames, which is about 2 seconds of speech. These segments are randomly cropped from the speech utterances during training.

Our SER model is similar to AutoVC [27]. However, our model differs in three aspects. First, wav2vec features are the acoustic input to our encoder. Second, we add an emotion classifier and an emotion loss to our method. Third, we define a phone encoder, whose output is fed to the decoder.

3.1.1. Wav2vec 2.0 Feature Extraction

We extract the wav2vec features from a "large" wav2vec 2.0 model pre-trained on 60k hours of unlabelled speech data from the LibriVox dataset (https://huggingface.co/facebook/wav2vec2-large-lv60). We take the features from the feature encoder's output and from the outputs of all the 24 Transformer layers in the context network. Thus, for each speech frame, there are 25 1024-dimensional wav2vec features.

3.1.2. Speaker Identity Feature Extraction

We extract speaker identity features with Resemblyzer [31] (we use the code at https://github.com/resemble-ai/Resemblyzer), which is pre-trained on LibriSpeech [32], VoxCeleb1 [33], and VoxCeleb2 [34]. For each utterance, a 256-dimensional embedding is obtained to represent the speaker identity. For each speaker, we extract speaker identity features from 100 randomly selected utterances and take their average as the final identity embedding representing the speaker.
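Both extraction steps can be sketched as follows, assuming the HuggingFace transformers library for the wav2vec 2.0 checkpoint above and the resemblyzer package for the speaker encoder; the function names are illustrative only.

```python
import numpy as np
import torch
from resemblyzer import VoiceEncoder, preprocess_wav
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

W2V = "facebook/wav2vec2-large-lv60"
w2v_extractor = Wav2Vec2FeatureExtractor.from_pretrained(W2V)
w2v = Wav2Vec2Model.from_pretrained(W2V, output_hidden_states=True).eval()
spk_encoder = VoiceEncoder()   # pre-trained speaker encoder, 256-dim utterance embeddings

def wav2vec_layerwise_features(waveform_16khz):
    """Return a [25, T, 1024] tensor: the (projected) feature-encoder output plus
    the outputs of the 24 Transformer blocks, one 1024-dim vector per 20 ms frame."""
    inputs = w2v_extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = w2v(inputs.input_values)
    return torch.stack(out.hidden_states, dim=0).squeeze(1)

def speaker_identity_embedding(wav_paths):
    """Average the utterance-level embeddings of (up to) 100 randomly selected
    utterances of one speaker into a single 256-dim identity embedding."""
    embeds = [spk_encoder.embed_utterance(preprocess_wav(p)) for p in wav_paths]
    return np.mean(embeds, axis=0)
```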
3.1.3. Encoder

A weighted average h_avg, with trainable weights α, over the 25 wav2vec 2.0 features h_i is computed as described in [11]:

    h_{avg} = \frac{\sum_{i=1}^{25} \alpha_i h_i}{\sum_{i=1}^{25} \alpha_i}.    (1)

h_avg is then concatenated frame by frame with the 256-dimensional speaker identity embedding.

The BLSTM layers in the encoder have d neurons, and their output has a size of [2d, 96], since we concatenate the layers' outputs in the forward and backward directions. d determines the size of the bottleneck, as it reduces the size of the features in the channel dimension. The downsampler operation [27] takes as input an array of size 2d for each speech frame and returns the arrays taken every f frames. Thus, this operation reduces the temporal dimension of the feature array by the downsampling factor f. The encoder outputs a feature array of size [2d, 96/f], in which d and f control the bottleneck dimension. By controlling the size of the bottleneck, we aim to obtain a disentangled speech representation that contains emotion information but does not contain speaker identity or phonetic information.

3.1.4. Decoder

At the decoder, the encoder's features are upsampled so that their size is the same as before the downsampling operation, by repeating each feature f times in the temporal dimension. Since the encoder's features contain only emotion information, the decoder takes as input not only the encoder's output but also the speaker identity embeddings and the phone sequence embeddings in order to reconstruct the spectrogram.

The output of the decoder's linear layer is a feature array of size 80 for each speech frame, which represents a mel-spectrogram of the speech segment. These features are compared with the ground-truth mel-spectrogram by means of a reconstruction loss L_{r1}, which is used to update all the model's parameters. We also compute the reconstruction loss L_{r2} between the decoder's output and the same ground-truth mel-spectrogram. L_{r1} and L_{r2} are computed as

    L_r = \frac{1}{M} \sum_{k=1}^{M} (x_k - y_k)^2,    (2)

in which M is the training batch size, x_k is the k-th feature element in the batch outputted by the model, and y_k is the corresponding ground truth for x_k.

3.1.5. Phone Encoder

The phone encoder takes as input a sequence of phone embeddings and outputs a representation of the whole phone sequence. We follow two steps to obtain these phone embeddings. First, for each utterance, we extract the phone alignment information from the speech signal and its corresponding text transcript using the Gentle aligner (https://github.com/lowerquality/gentle). Second, we obtain the phone sequence from the phone alignment information by determining the longest phone for each frame. We assign an id number to each phone, and we also assign ids to silence, to unidentified phones, and to each special token in the dataset (e.g. "[LAUGHTER]"), totalling 128 distinct phone ids represented as one-hot embeddings.

3.1.6. Classifier

The classifier encourages the encoder's output to contain emotion information. We compute the cross-entropy loss L_e between the emotion label c and the softmax of the logits z outputted by the classifier as

    L_e(z, c) = -\log\left(\frac{\exp(z[c])}{\sum_j \exp(z[j])}\right).    (3)

3.1.7. Training and inference

The objective function to be minimized during training is the sum of L_{r1}, L_{r2}, and L_e.

During inference, the softmax function is applied to the emotion class logits outputted by the model to obtain the emotion class probabilities. The class with the highest probability is selected as the final classification result. The model takes as input features from a speech segment of 96 frames. Thus, to obtain an utterance-level prediction, we first compute the emotion class probabilities for every 96 consecutive frames in an utterance (without overlap), zero-padding as necessary, and we take the average of these segment-level probabilities as the final utterance-level probability.

3.2. Text Emotion Recognition

[35] shows that the TER task can benefit from processing the embeddings of all text tokens before performing the emotion classification. Inspired by these results, we propose a CNN-based model to process all the token embeddings in an utterance, extracted with Transformer-based models.

We extract a text representation of shape [N, L] for each utterance with pre-trained Transformer-based models, in which N is the number of tokens in the utterance excluding special tokens, and L is the size of each token's feature. We zero-pad the text representation so that the input to the TER model has size [N', L], in which N' is the maximum number of tokens found in an utterance of the dataset. These text features are processed with the TER model illustrated in Figure 1, which is trained on the cross-entropy loss defined in Equation (3). As in our SER model, during inference, the softmax function is applied to the TER model's logits to obtain the emotion class probabilities. However, differently from the SER model, the output of the TER model represents the utterance-level emotion classification result.
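For instance, the [N', L] input described above can be produced with the HuggingFace transformers library roughly as follows. This is a sketch: the bert-large-uncased checkpoint is our assumption for the "large", 1024-dimensional "BERT uncased" extractor evaluated in Section 7, and n_max stands for N'.

```python
import torch
from transformers import AutoModel, AutoTokenizer

NAME = "bert-large-uncased"  # assumed checkpoint for the "BERT uncased" extractor
tokenizer = AutoTokenizer.from_pretrained(NAME)
bert = AutoModel.from_pretrained(NAME).eval()

def utterance_text_features(transcript, n_max):
    """Return an [N', L] tensor of token embeddings (special tokens removed, zero-padded to N' rows)."""
    enc = tokenizer(transcript, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]                      # [num_tokens, 1024]
    special = torch.tensor(tokenizer.get_special_tokens_mask(
        enc["input_ids"][0].tolist(), already_has_special_tokens=True)).bool()
    tokens = hidden[~special]                                          # drop [CLS] and [SEP]
    padded = torch.zeros(n_max, tokens.size(1))
    padded[: min(n_max, tokens.size(0))] = tokens[:n_max]
    return padded
```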
3.3. Multimodal Emotion Recognition

The speech-based utterance-level probabilities p_s and the probabilities p_t outputted by the text model for the same utterance are combined as

    p_f = w_1 \cdot p_s + w_2 \cdot p_t,    (4)

in which p_f is the fused probability, and w_1 and w_2 are fixed weights assigned to the speech and text modalities, respectively. The weights determine the degree of contribution of each data modality to the fused probability, and the emotion classification result for an utterance corresponds to the emotion class with the highest fused probability.
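A compact sketch of this decision rule, assuming a trained SER model ser_model that maps one 96-frame segment's inputs to the 4 emotion logits and an utterance-level text probability vector p_t:

```python
import torch
import torch.nn.functional as F

def utterance_speech_probs(ser_model, segments):
    """Average the segment-level class probabilities (96-frame segments, zero-padded)
    into the utterance-level probabilities p_s, as in Section 3.1.7."""
    with torch.no_grad():
        probs = [F.softmax(ser_model(seg), dim=-1) for seg in segments]
    return torch.stack(probs).mean(dim=0)

def fuse_scores(p_s, p_t, w1=0.6, w2=1.0):
    """Eq. (4): fixed-weight score fusion; w1 = 0.6, w2 = 1 is the best setting reported in Section 8."""
    p_f = w1 * p_s + w2 * p_t
    return p_f, int(p_f.argmax())
```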
4. DATASET

We utilize the Interactive Emotional Dyadic Motion Capture (IEMOCAP) [36] dataset to evaluate our method. Given the amount of data and the phonetic and semantic diversity of its utterances, the IEMOCAP dataset is considered well suited for speech-based and text-based emotion recognition.

There are 10 actors in this dataset, whose interactions are organized in 5 dyadic sessions, each with a unique pair of male and female actors. The dataset contains approximately 12 hours of audiovisual data, which is segmented into speech turns (or utterances). Each utterance is labelled by three annotators.

Following previous works [3, 8, 9], we consider only the utterances that are given the same label by at least two annotators, and we merge the utterances labelled as "Happy" and "Excited" into the "Happy" category. We further select only the utterances with the labels "Angry", "Neutral", "Sad", and "Happy", resulting in 5,531 utterances, or approximately 7 hours of data. We utilize only the speech data, the transcripts, and the labels.

5. TRAINING CONFIGURATION

We perform a leave-one-session-out cross-validation in all our experiments. We report our results in terms of Weighted Accuracy (WA) and Unweighted Accuracy (UA). WA is equivalent to the average recall over all the emotion classes, and UA is the fraction of samples correctly classified.

All models are implemented in PyTorch, and, in every training experiment, we use the Adam optimizer with a learning rate of 10^-4 and the default exponential decay rates for the moment estimates. The SER models are trained with a batch size of 2 for 1 million iterations. The TER models are trained with a batch size of 4 for 412,800 iterations.

6. SPEECH EMOTION RECOGNITION

6.1. Emotion Recognition Experiments

We perform SER with two bottleneck configurations for the encoder, "Small" and "Large", which have bottleneck dimensions d of 8 and 128 and downsampling factors f of 48 and 2, respectively. The utterance-level SER results are presented in Table 1.

Table 1. UA (%) results for the SER task on a model with a small bottleneck and a model with a large bottleneck.

Model    Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Avg ± std
Small    67.9     71.7     67.3     72.2     71.6     70.1 ± 2.3
Large    59.6     73.6     66.0     71.3     70.0     68.1 ± 5.5

Table 1 indicates that the "Small" configuration performs better on the SER task than the "Large" model. We compare the results obtained with the "Small" model with the current state of the art in Table 2.

Table 2. Comparison of our SER results with current works in terms of UA (%) and WA (%).

Model                   UA     WA
GRU+Context [37]        68.3   66.9
Self-Attn+LSTM [3]      55.6   -
BLSTM+Self-Attn [9]     57.0   55.7
Transformer [38]        -      64.9
CNN+Feat-Attn [39]      66.7   -
wav2vec+CNN [11]        -      67.9
Ours (Small)            70.1   70.7

We further evaluate whether inputting the wav2vec embeddings is advantageous for SER. We train our model with the same parameters as the "Small" configuration, but with a mel-spectrogram as input instead of the wav2vec features. This model achieved a UA of 50.4% on the 5-fold cross-validation, which is 19.7% worse, in terms of absolute accuracy, than the model with wav2vec features as input. Therefore, we can conclude that the learned weighted average of the wav2vec embeddings is a better representation of speech for SER on the IEMOCAP dataset than the traditional mel-frequency spectrograms.

6.2. Disentanglement Experiments

We train 4-linear-layer classifiers (with, from input to output, 2048, 1024, 1024, and 8 neurons) on the obtained speech representations to solve the speaker identification task. Our goal is to see whether the obtained emotion features contain speaker identity information. Ideally, we would like our features to be speaker independent and to hold a generic emotion representation that can be used across speakers.

We train the classifiers on a 5-fold cross-validation, but we define the folds differently for this experiment. We randomly separate 80% of each speaker's data for training and the remaining 20% for testing. The folds have speaker-dependent train and test sets, and each of them contains the data of only 4 sessions. We train the classifiers with a cross-entropy loss. Table 3 summarizes the speaker identity recognition results.

Table 3. UA (%) results for the speaker identification task on features extracted with the "Small" and "Large" models.

Model    Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Avg ± std
Small    18.4     20.6     19.4     15.9     13.6     17.6 ± 2.8
Large    25.6     19.8     24.5     23.4     21.4     22.9 ± 2.3

Table 3 suggests that the features extracted with the "Large" model contain more information about speaker identity than those from the "Small" model. Overall, from the results in Tables 3 and 1, we can see that the features extracted with the "Small" model achieve a better SER accuracy and a worse speaker identification accuracy than the features extracted with the "Large" model. This result suggests that the bottleneck size can lead to a disentanglement of factors in speech, which makes the SER task easier.
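A sketch of such a probing classifier is given below. The layer widths follow the text above; the ReLU activations and the flattened input size are assumptions of this sketch, and the 8 output units correspond to the 8 speakers appearing in the 4 training sessions of each fold.

```python
import torch.nn as nn

def speaker_probe(in_features, num_speakers=8):
    """4 linear layers with 2048, 1024, 1024, and 8 output neurons, trained with
    cross-entropy on the (flattened) encoder representations to probe for speaker identity."""
    return nn.Sequential(
        nn.Linear(in_features, 2048), nn.ReLU(),
        nn.Linear(2048, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, num_speakers),
    )
```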
6.3. Discussion

We believe that our SER model achieves better results than previous methods due to three factors. First, we use high-level speech representations as the input to our model, which, apart from our work, is done only by [11] and [38]. Second, we are careful in analyzing the type of information encoded in the features obtained by our model, which gives the features a certain level of disentanglement, as shown in Section 6.2. Third, our model can leverage both high-level and low-level features, since it is trained to reconstruct spectrograms from wav2vec features.

7. TEXT EMOTION RECOGNITION

We train the TER model with input features extracted from different Transformer-based models (the pre-trained models can be found at https://huggingface.co/models). We use the "large" version of all these models, which output a 1024-dimensional feature for each token. The TER results are shown in Table 4 for the different feature extractors, and we compare our best results with the current state of the art in Table 5.

Table 4. 5-fold cross-validation UA (%) results for TER on input features extracted from different models. (c = "cased", u = "uncased", uwm = "uncased with whole word masking")

Model      Avg ± std
ALBERT     62.3 ± 2.3
BERTc      65.5 ± 3.3
BERTu      66.1 ± 2.1
BERTuwm    65.8 ± 2.6
ELECTRA    56.6 ± 3.8
RoBERTa    64.1 ± 3.5
XLNetc     58.1 ± 3.6

Table 5. Comparison of our TER results with current works in terms of UA (%) and WA (%).

Model                     UA     WA
BERT+Attn+Context [40]    71.9   71.2
BERT+Attn [40]            64.8   62.9
BLSTM+Self-Attn [9]       63.6   63.7
Self-Attn+LSTM [3]        65.9   -
Ours (BERT uncased)       66.1   67.0

From Table 5, we can see that our method achieves better results than previous works, except for [40], which uses context information (i.e., features from succeeding and preceding utterances). Our model differs from previous works in that we do not use recurrent neural networks or self-attention, and we benefit from the text representations learned by Transformer-based models trained on large text corpora. We believe our model achieved good results due to BERT's deep features and to the ability of our model's 1D CNN layers to extract temporal information from the sequence of token features.

8. MULTIMODAL EMOTION RECOGNITION

We combine the results from our best speech and text models by experimenting with different weight values w_1 and w_2. Our best multimodal results are obtained with w_1 = 0.6 and w_2 = 1, and they are reported in Table 6.

Table 6. Comparison of our multimodal results with current works in terms of UA (%) and WA (%).

Model                       UA     WA
BERT+Attn+Context [40]      76.1   77.4
LAS-ASR [8]                 66.0   64.0
ASR-SER [9]                 69.7   68.6
CMA+Raw waveform [3]        72.8   -
Ours (w1 = 0.6, w2 = 1)     73.0   73.5

This result shows that, when combining the speech and the text results, it is better to give less importance to the speech model's result, even though its accuracy is higher than the text model's. We believe this may be related to the confidence with which the TER and the SER models obtain their scores, but further investigation is required.

By comparing our results in Tables 1, 4, and 6, we can conclude that the emotion recognition task benefits from combining different types of data, since our multimodal result is better than our speech-only and text-only results. Our multimodal approach gives better results than current works except for [40], which uses context information. We attribute our good results to the fact that our unimodal models outperform other unimodal models, and not to our choice of fusion method. We believe we could achieve better results with a more sophisticated fusion approach or by jointly training the speech and text modalities.
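The weight search mentioned at the start of this section is straightforward; the sketch below assumes utterance-level probability matrices ps and pt and the ground-truth labels of a held-out fold, and the grid itself is an illustrative choice.

```python
import itertools
import numpy as np

def search_fusion_weights(ps, pt, labels, grid=np.arange(0.0, 1.01, 0.1)):
    """Try fixed (w1, w2) pairs on held-out data and keep the most accurate one.
    The paper reports w1 = 0.6, w2 = 1 as its best setting."""
    best_acc, best_w = 0.0, None
    for w1, w2 in itertools.product(grid, repeat=2):
        pred = np.argmax(w1 * ps + w2 * pt, axis=1)
        acc = float((pred == labels).mean())
        if acc > best_acc:
            best_acc, best_w = acc, (w1, w2)
    return best_w, best_acc
```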
9. CONCLUSION

We proposed a cross-representation encoder-decoder model inspired by disentanglement representation learning to perform SER. Our model leverages both high-level wav2vec features and low-level mel-frequency spectrograms, and it achieves an accuracy of 70.1% on the IEMOCAP dataset. We also used a CNN-based model that processes token embeddings extracted with pre-trained Transformer-based models to perform TER, achieving an accuracy of 66.1% on the same dataset. We further combined the speech-based and the text-based results via score fusion, achieving an accuracy of 73.0%. Our speech-only, text-only, and multimodal results surpass those of current works, showing that emotion recognition can benefit from disentanglement representation learning, high-level data representations, and multimodality.

10. REFERENCES

[1] Stephen E. Levinson, "Continuously variable duration hidden Markov models for automatic speech recognition," Computer Speech & Language, vol. 1, no. 1, pp. 29-45, 1986.

[2] Yequan Wang, Aixin Sun, Jialong Han, Ying Liu, and Xiaoyan Zhu, "Sentiment analysis by capsules," in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1165-1174.

[3] D. N. Krishna and Ankita Patil, "Multimodal emotion recognition using cross-modal attention and 1D convolutional neural networks," Proc. Interspeech 2020, pp. 4243-4247, 2020.

[4] Yixiong Pan, Peipei Shen, and Liping Shen, "Speech emotion recognition using support vector machine," International Journal of Smart Home, vol. 6, no. 2, pp. 101-108, 2012.

[5] Aditya Bihar Kandali, Aurobinda Routray, and Tapan Kumar Basu, "Emotion recognition from Assamese speeches using MFCC features and GMM classifier," in TENCON 2008-2008 IEEE Region 10 Conference. IEEE, 2008, pp. 1-5.

[6] Kun Han, Dong Yu, and Ivan Tashev, "Speech emotion recognition using deep neural network and extreme learning machine," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[7] Florian Eyben, Martin Wöllmer, and Björn Schuller, "openSMILE: The Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459-1462.

[8] Sung-Lin Yeh, Yun-Shao Lin, and Chi-Chun Lee, "Speech representation learning for emotion recognition using end-to-end ASR with factorized adaptation," Proc. Interspeech 2020, pp. 536-540, 2020.

[9] Han Feng, Sei Ueno, and Tatsuya Kawahara, "End-to-end speech emotion recognition combined with acoustic-to-word ASR model," Proc. Interspeech 2020, pp. 501-505, 2020.

[10] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.

[11] Leonardo Pepino, Pablo Riera, and Luciana Ferrer, "Emotion recognition from speech using wav2vec 2.0 embeddings," Proc. Interspeech 2021, pp. 3400-3404, 2021.

[12] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, 2020.

[13] Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, and Suranga Nanayakkara, "Jointly fine-tuning "BERT-like" self supervised models to improve multimodal speech emotion recognition," in Proc. Interspeech 2020, 2020, pp. 3755-3759.

[14] Manon Macary, Marie Tahon, Yannick Estève, and Anthony Rousseau, "On the use of self-supervised pre-trained acoustic and linguistic features for continuous speech emotion recognition," in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 373-380.

[15] Tin Lay Nwe, Say Wei Foo, and Liyanage C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41, no. 4, pp. 603-623, 2003.

[16] Yi-Lin Lin and Gang Wei, "Speech emotion recognition based on HMM and SVM," in 2005 International Conference on Machine Learning and Cybernetics. IEEE, 2005, vol. 8, pp. 4898-4901.

[17] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960-4964.

[18] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," in 9th ISCA Speech Synthesis Workshop, 2016, pp. 125-125.

[19] Moataz El Ayadi, Mohamed S. Kamel, and Fakhri Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, no. 3, pp. 572-587, 2011.

[20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171-4186.

[21] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," Advances in Neural Information Processing Systems, vol. 32, pp. 5753-5763, 2019.

[22] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning, "ELECTRA: Pre-training text encoders as discriminators rather than generators," in International Conference on Learning Representations, 2020.

[23] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.

[24] Yu Liu, Fangyin Wei, Jing Shao, Lu Sheng, Junjie Yan, and Xiaogang Wang, "Exploring disentangled feature representation beyond face identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2080-2089.

[25] Emily L. Denton et al., "Unsupervised learning of disentangled representations from video," in Advances in Neural Information Processing Systems, 2017, pp. 4414-4423.

[26] Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson, and David Cox, "Unsupervised speech decomposition via triple information bottleneck," in International Conference on Machine Learning. PMLR, 2020, pp. 7836-7846.

[27] Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson, "AutoVC: Zero-shot voice style transfer with only autoencoder loss," International Conference on Machine Learning, 2019.

[28] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, and Rif A. Saurous, "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," Proceedings of ICML, 2018.

[29] Jennifer Williams and Simon King, "Disentangling style factors from speaker representations," Proc. Interspeech 2019, pp. 3945-3949, 2019.

[30] Haoqi Li, Ming Tu, Jing Huang, Shrikanth Narayanan, and Panayiotis Georgiou, "Speaker-invariant affective representation learning via adversarial training," in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7144-7148.

[31] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879-4883.

[32] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206-5210.

[33] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," Proc. Interspeech 2017, 2017.

[34] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," Proc. Interspeech 2018, 2018.

[35] Leonardo Pepino, Pablo Riera, Luciana Ferrer, and Agustín Gravano, "Fusion approaches for emotion recognition from speech using acoustic and text-based features," in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6484-6488.

[36] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335, 2008.

[37] Srividya Tirunellai Rajamani, Kumar T. Rajamani, Adria Mallol-Ragolta, Shuo Liu, and Björn Schuller, "A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6294-6298.

[38] Ruixiong Zhang, Haiwei Wu, Wubo Li, Dongwei Jiang, Wei Zou, and Xiangang Li, "Transformer based unsupervised pre-training for acoustic representation learning," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6933-6937.

[39] Shuiyang Mao, P. C. Ching, C.-C. Jay Kuo, and Tan Lee, "Advancing multiple instance learning with attention modeling for categorical speech emotion recognition," Proc. Interspeech 2020, pp. 2357-2361, 2020.

[40] Wen Wu, Chao Zhang, and Philip C. Woodland, "Emotion recognition by fusing time synchronous and time asynchronous representations," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6269-6273.
