Multimodal Emotion Recognition With High-Level Speech and Text Features
Mariana Rodrigues Makiuchi, Kuniaki Uto, Koichi Shinoda
Tokyo Institute of Technology
Fig. 1. Proposed method depicting the SER and TER models and the fusion approach. 1D ConvNorm layers are defined as a 1D CNN layer followed by batch normalization, and BLSTM layers are bidirectional LSTMs. The number of layers is shown as a prefix of each block name. The number of filters F and the kernel size K of each ConvNorm and CNN layer are shown as the suffix FxK of the block name. The number of neurons in LSTM, BLSTM, and Linear layers is shown as the block name's suffix.
4. DATASET

We utilize the Interactive Emotional Dyadic Motion Capture (IEMOCAP) [36] dataset to evaluate our method. Given the amount of data and the phonetic and semantic diversity of its utterances, the IEMOCAP dataset is considered well-suited for speech-based and text-based emotion recognition.

There are 10 actors in this dataset, whose interactions are organized in 5 dyadic sessions, each with a unique pair of male and female actors. The dataset contains approximately 12 hours of audiovisual data, which is segmented into speech turns (or utterances). Each utterance is labelled by three annotators.

Following previous works [3, 8, 9], we consider only the utterances which are given the same label by at least two annotators, and we merge the utterances labelled as "Happy" and "Excited" into the "Happy" category. We further select only the utterances with the labels "Angry", "Neutral", "Sad" and "Happy", resulting in 5,531 utterances, which is approximately 7 hours of data. We utilize only the speech data, the transcripts, and the labels.
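The selection procedure above can be summarized with a minimal sketch. This is only an illustration: it assumes each utterance record carries its three annotator votes, and the data structures and function names are ours, not the authors' code.

from collections import Counter

TARGET_LABELS = {"Angry", "Neutral", "Sad", "Happy"}

def final_label(votes):
    # Keep an utterance only if at least two of its three annotators agree.
    label, count = Counter(votes).most_common(1)[0]
    if count < 2:
        return None
    return "Happy" if label == "Excited" else label  # merge Excited into Happy

def select_utterances(utterances):
    # utt is assumed to look like {"votes": [...], "wav": ..., "text": ...}
    kept = []
    for utt in utterances:
        label = final_label(utt["votes"])
        if label in TARGET_LABELS:
            kept.append({**utt, "label": label})
    return kept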
5. TRAINING CONFIGURATION

We perform a leave-one-session-out cross-validation in all our experiments. We report our results in terms of Weighted Accuracy (WA) and Unweighted Accuracy (UA). UA is equivalent to the average recall over all the emotion classes, and WA is the fraction of samples correctly classified.
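For concreteness, both metrics can be computed per fold as in the following sketch (assuming scikit-learn is available; the function name is ours):

from sklearn.metrics import accuracy_score, recall_score

def ua_wa(y_true, y_pred):
    # UA: recall averaged over the emotion classes; WA: fraction of correctly classified samples.
    ua = 100.0 * recall_score(y_true, y_pred, average="macro")
    wa = 100.0 * accuracy_score(y_true, y_pred)
    return ua, wa

# Example: ua_wa(["Sad", "Happy", "Angry"], ["Sad", "Happy", "Neutral"])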
All models are implemented in PyTorch, and, in every training experiment, we use the Adam optimizer with a learning rate of 10^-4 and the default exponential decay rates for the moment estimates. The SER models are trained with a batch size of 2 for 1 million iterations. The TER models are trained with a batch size of 4 for 412,800 iterations.
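A minimal PyTorch sketch of this optimizer setup is given below. Only the hyperparameters stated above are taken from the text; the model and the data are stand-ins, not the paper's network or loader.

import torch
import torch.nn as nn

model = nn.Linear(1024, 4)            # stand-in for the SER/TER network
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,                          # learning rate 10^-4
    betas=(0.9, 0.999),               # PyTorch's default decay rates of the moment estimates
)
criterion = nn.CrossEntropyLoss()

for step in range(1_000_000):         # SER: batch size 2 for 1M steps; TER: batch size 4 for 412,800 steps
    x = torch.randn(2, 1024)          # dummy batch of features
    y = torch.randint(0, 4, (2,))     # dummy emotion labels
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()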
6. SPEECH EMOTION RECOGNITION

6.1. Emotion Recognition Experiments

We perform SER with two bottleneck configurations for the encoder, "Small" and "Large", which have respective bottleneck dimension d equal to 8 and 128, and respective downsampling factor f set as 48 and 2. The utterance-level SER results are presented in Table 1.

Table 1 indicates that the "Small" configuration performs better in the SER task when compared to the "Large" model. We compare the results obtained with the "Small" model with the current state-of-the-art in Table 2.

Table 2. Comparison of our SER results with current works in terms of UA (%) and WA (%).
Model                  UA     WA
GRU+Context [37]       68.3   66.9
Self-Attn+LSTM [3]     55.6   -
BLSTM+Self-Attn [9]    57.0   55.7
Transformer [38]       -      64.9
CNN+Feat-Attn [39]     66.7   -
wav2vec+CNN [11]       -      67.9
Ours (Small)           70.1   70.7

We further evaluate whether inputting the wav2vec embeddings is advantageous to SER. We train our model with the same parameters as the "Small" configuration, but with a mel-spectrogram as input instead of the wav2vec features. This model achieves a UA of 50.4% on the 5-fold cross-validation, which is 19.7 percentage points lower than that of the model with wav2vec features as input. Therefore, we can conclude that the learned weighted average of the wav2vec embeddings is a better representation of speech for SER on the IEMOCAP dataset than the traditional mel-frequency spectrograms.
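The learned weighted average of the wav2vec embeddings mentioned above can be sketched as follows. This assumes 25 hidden states of a wav2vec 2.0 "large" model stacked as (layers, features, frames) and softmax-normalized layer weights; the exact normalization used in the paper may differ.

import torch
import torch.nn as nn

class LayerWeightedAverage(nn.Module):
    def __init__(self, num_layers: int = 25):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))   # one learnable weight per layer

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, feature_dim, time), e.g. (25, 1024, 96)
        w = torch.softmax(self.weights, dim=0)                  # normalized layer weights
        return (w[:, None, None] * hidden_states).sum(dim=0)    # (feature_dim, time)

avg = LayerWeightedAverage()
features = avg(torch.randn(25, 1024, 96))   # -> torch.Size([1024, 96])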
6.2. Disentanglement Experiments

We train 4-linear-layer classifiers (with, from the input to the output, 2048, 1024, 1024, and 8 neurons) on the obtained speech representations to solve the speaker identification task. Our goal is to see whether the obtained emotion features contain speaker identity information. Ideally, we would like our features to be speaker-independent and to hold a generic emotion representation that could be used across speakers.

We train the classifiers on a 5-fold cross-validation, but we define the folds differently for this experiment. We randomly separate 80% of each speaker's data for training and the remaining 20% for test. The folds have speaker-dependent train and test sets, and each of them contains the data of only 4 sessions. We train the classifiers with a cross-entropy loss. Table 3 summarizes the speaker identity recognition results.

Table 3. UA (%) results for the speaker identification task on features extracted with the "Small" and "Large" models.
Model    Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Avg ± std
Small    18.4     20.6     19.4     15.9     13.6     17.6 ± 2.8
Large    25.6     19.8     24.5     23.4     21.4     22.9 ± 2.3

Table 3 suggests that the features extracted with the "Large" model contain more information about speaker identity than the ones from the "Small" model. Overall, from the results in Tables 3 and 1, we can see that the features extracted with the "Small" model achieve a better SER accuracy and a worse speaker identity accuracy than the features extracted with the "Large" model. This result suggests that the bottleneck size can lead to a disentanglement of factors in speech, which makes the SER task easier.
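A sketch of the speaker-identification probe described above is given below. The input dimension depends on the bottleneck configuration and is a placeholder here, and the ReLU activations are our assumption; only the layer widths come from the text.

import torch
import torch.nn as nn

def make_speaker_probe(feat_dim: int, num_speakers: int = 8) -> nn.Sequential:
    # Four linear layers with 2048, 1024, 1024, and num_speakers output units.
    return nn.Sequential(
        nn.Linear(feat_dim, 2048), nn.ReLU(),
        nn.Linear(2048, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, num_speakers),   # 8 speakers = 4 sessions per fold
    )

probe = make_speaker_probe(feat_dim=512)     # hypothetical feature size
logits = probe(torch.randn(16, 512))         # (batch, num_speakers), trained with cross-entropy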
6.3. Discussion

We believe that our SER model achieves better results than previous methods due to three factors. First, we use high-level speech representations as the input to our model, which, apart from our work, is only done by [11] and [38]. Second, we are careful in analyzing the type of information encoded in the features obtained by our model, which gives the features a certain level of disentanglement, as shown in Section 6.2. Third, our model can leverage both high-level and low-level features, since it is trained to reconstruct spectrograms from wav2vec features.

7. TEXT EMOTION RECOGNITION

We train the TER model with input features extracted from different Transformer-based models (the trained models can be found at https://huggingface.co/models). We use the "large" version of all these models, which output a 1024-dimensional feature for each token. The TER results are shown in Table 4 for different feature extractors, and we compare our best results with the current state-of-the-art in Table 5.

Table 4. 5-fold cross-validation UA (%) results for TER on input features extracted from different models. (c = "cased", u = "uncased", uwm = "uncased with whole word masking")
Model      Avg ± std
ALBERT     62.3 ± 2.3
BERTc      65.5 ± 3.3
BERTu      66.1 ± 2.1
BERTuwm    65.8 ± 2.6
ELECTRA    56.6 ± 3.8
RoBERTa    64.1 ± 3.5
XLNetc     58.1 ± 3.6

Table 5. Comparison of our TER results with current works in terms of UA (%) and WA (%).
Model                     UA     WA
BERT+Attn+Context [40]    71.9   71.2
BERT+Attn [40]            64.8   62.9
BLSTM+Self-Attn [9]       63.6   63.7
Self-Attn+LSTM [3]        65.9   -
Ours (BERT uncased)       66.1   67.0

From Table 5, we can see that our method achieves better results than previous works, except for [40], which uses context information (i.e., features from succeeding and preceding utterances). Our model differs from previous works in that we do not use recurrent neural networks or self-attention, and we benefit from the text representations learned by Transformer-based models trained on large text corpora. We believe our model achieved good results due to BERT's deep features and to the ability of our model's 1D CNN layers to extract temporal information from the sequence of token features.
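As an illustration of this pipeline, the sketch below extracts 1024-dimensional token embeddings with a pre-trained bert-large-uncased model from the Hugging Face hub and feeds them to a small 1D CNN head. The CNN layer sizes and paddings are illustrative only, not the paper's exact configuration.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
bert = AutoModel.from_pretrained("bert-large-uncased")

class TERHead(nn.Module):
    def __init__(self, in_dim: int = 1024, num_classes: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, 128, kernel_size=8, padding=4), nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=4, padding=2), nn.ReLU(),
        )
        self.fc = nn.Linear(64, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, 1024); Conv1d expects (batch, channels, seq_len)
        x = self.conv(tokens.transpose(1, 2))
        x = x.mean(dim=2)                      # mean pooling over the token axis
        return self.fc(x)

with torch.no_grad():
    enc = tokenizer("I am so happy today", return_tensors="pt")
    tokens = bert(**enc).last_hidden_state     # (1, seq_len, 1024)
logits = TERHead()(tokens)                     # (1, 4) emotion logits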
8. MULTIMODAL EMOTION RECOGNITION

We combine the results from our best speech and text models by experimenting with different weight values w1 and w2. Our best multimodal results are obtained when w1 = 0.6 and w2 = 1, and they are reported in Table 6.

Table 6. Comparison of our multimodal results with current works in terms of UA (%) and WA (%).
Model                       UA     WA
BERT+Attn+Context [40]      76.1   77.4
LAS-ASR [8]                 66.0   64.0
ASR-SER [9]                 69.7   68.6
CMA+Raw waveform [3]        72.8   -
Ours (w1 = 0.6, w2 = 1)     73.0   73.5

This result shows that, when combining the speech and the text results, it is better to give less importance to the speech model's result, even though its accuracy is higher than the text model's. We believe this may be related to the confidence with which the TER and the SER models obtain their scores, but further investigation is required.
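A minimal sketch of this score fusion is given below. It assumes the fused score is a weighted sum of the two models' class probabilities; whether the paper combines probabilities or raw logits is not restated here, so take this as illustrative.

import torch

def fuse_scores(speech_logits: torch.Tensor, text_logits: torch.Tensor,
                w1: float = 0.6, w2: float = 1.0) -> torch.Tensor:
    p_speech = torch.softmax(speech_logits, dim=-1)   # SER class probabilities
    p_text = torch.softmax(text_logits, dim=-1)       # TER class probabilities
    fused = w1 * p_speech + w2 * p_text               # weighted score fusion
    return fused.argmax(dim=-1)                       # predicted emotion class

pred = fuse_scores(torch.randn(8, 4), torch.randn(8, 4))   # batch of 8 utterances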
By comparing our results in Tables 1, 4, and 6, we can conclude that the emotion recognition task benefits from combining different types of data, since our multimodal result is better than our speech-only and text-only results. Our multimodal approach gives better results than current works except for [40], which uses context information. We attribute our good results to the fact that our unimodal models outperform other unimodal models, and not to our choice of fusion method. We believe we could achieve better results with a more sophisticated fusion approach or by jointly training the speech and text modalities.
9. CONCLUSION

We proposed a cross-representation encoder-decoder model inspired by disentangled representation learning to perform SER. Our model leverages both high-level wav2vec features and low-level mel-frequency spectrograms, and it achieves an accuracy of 70.1% on the IEMOCAP dataset. We also used a CNN-based model that processes token embeddings extracted with pre-trained Transformer-based models to perform TER, achieving an accuracy of 66.1% on the same dataset. We further combined the speech-based and the text-based results via score fusion, achieving an accuracy of 73.0%. Our speech-only, text-only, and multimodal results surpass those of current works, showing that emotion recognition can benefit from disentangled representation learning, high-level data representations, and multimodality.
[4] Yixiong Pan, Peipei Shen, and Liping Shen, "Speech emotion recognition using support vector machine," International Journal of Smart Home, vol. 6, no. 2, pp. 101–108, 2012.

[5] Aditya Bihar Kandali, Aurobinda Routray, and Tapan Kumar Basu, "Emotion recognition from assamese speeches using mfcc features and gmm classifier," in TENCON 2008 - 2008 IEEE Region 10 Conference. IEEE, 2008, pp. 1–5.

[6] Kun Han, Dong Yu, and Ivan Tashev, "Speech emotion recognition using deep neural network and extreme learning machine," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[7] Florian Eyben, Martin Wöllmer, and Björn Schuller, "Opensmile: the munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.

[8] Sung-Lin Yeh, Yun-Shao Lin, and Chi-Chun Lee, "Speech representation learning for emotion recognition using end-to-end asr with factorized adaptation," Proc. Interspeech 2020, pp. 536–540, 2020.

[9] Han Feng, Sei Ueno, and Tatsuya Kawahara, "End-to-end speech emotion recognition combined with acoustic-to-word asr model," Proc. Interspeech 2020, pp. 501–505, 2020.

[10] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.

[11] Leonardo Pepino, Pablo Riera, and Luciana Ferrer, "Emotion recognition from speech using wav2vec 2.0 embeddings," Proc. Interspeech 2021, pp. 3400–3404, 2021.

[15] Tin Lay Nwe, Say Wei Foo, and Liyanage C De Silva, "Speech emotion recognition using hidden markov models," Speech Communication, vol. 41, no. 4, pp. 603–623, 2003.

[16] Yi-Lin Lin and Gang Wei, "Speech emotion recognition based on hmm and svm," in 2005 International Conference on Machine Learning and Cybernetics. IEEE, 2005, vol. 8, pp. 4898–4901.

[17] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.

[18] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "Wavenet: A generative model for raw audio," in 9th ISCA Speech Synthesis Workshop, 2016, pp. 125–125.

[19] Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.

[20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.

[21] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le, "Xlnet: Generalized autoregressive pretraining for language understanding," Advances in Neural Information Processing Systems, vol. 32, pp. 5753–5763, 2019.

[22] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning, "Electra: Pre-training text encoders as discriminators rather than generators," in International Conference on Learning Representations, 2020.

[23] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.

[24] Yu Liu, Fangyin Wei, Jing Shao, Lu Sheng, Junjie Yan, and Xiaogang Wang, "Exploring disentangled feature representation beyond face identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2080–2089.

[25] Emily L Denton et al., "Unsupervised learning of disentangled representations from video," in Advances in Neural Information Processing Systems, 2017, pp. 4414–4423.

[26] Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson, and David Cox, "Unsupervised speech decomposition via triple information bottleneck," in International Conference on Machine Learning. PMLR, 2020, pp. 7836–7846.

[27] Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson, "Autovc: Zero-shot voice style transfer with only autoencoder loss," International Conference on Machine Learning, 2019.

[28] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J Weiss, Rob Clark, and Rif A Saurous, "Towards end-to-end prosody transfer for expressive speech synthesis with tacotron," Proceedings of ICML, 2018.

[29] Jennifer Williams and Simon King, "Disentangling style factors from speaker representations," Proc. Interspeech 2019, pp. 3945–3949, 2019.

[30] Haoqi Li, Ming Tu, Jing Huang, Shrikanth Narayanan, and Panayiotis Georgiou, "Speaker-invariant affective representation learning via adversarial training," in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7144–7148.

[31] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879–4883.

[32] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: an asr corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.

[33] A. Nagrani, J. S. Chung, and A. Zisserman, "Voxceleb: a large-scale speaker identification dataset," Proc. Interspeech 2017, 2017.

[34] J. S. Chung, A. Nagrani, and A. Zisserman, "Voxceleb2: Deep speaker recognition," Proc. Interspeech 2018, 2018.

[35] Leonardo Pepino, Pablo Riera, Luciana Ferrer, and Agustín Gravano, "Fusion approaches for emotion recognition from speech using acoustic and text-based features," in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6484–6488.

[36] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan, "Iemocap: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335, 2008.

[37] Srividya Tirunellai Rajamani, Kumar T Rajamani, Adria Mallol-Ragolta, Shuo Liu, and Björn Schuller, "A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6294–6298.

[38] Ruixiong Zhang, Haiwei Wu, Wubo Li, Dongwei Jiang, Wei Zou, and Xiangang Li, "Transformer based unsupervised pre-training for acoustic representation learning," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6933–6937.

[39] Shuiyang Mao, PC Ching, C-C Jay Kuo, and Tan Lee, "Advancing multiple instance learning with attention modeling for categorical speech emotion recognition," Proc. Interspeech 2020, pp. 2357–2361, 2020.

[40] Wen Wu, Chao Zhang, and Philip C. Woodland, "Emotion recognition by fusing time synchronous and time asynchronous representations," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6269–6273.