Enhancing Apparent Personality Trait Analysis with Cross-Modal Embeddings

Ádám Fodor, Rachid R. Saboundji, András Lőrincz
Department of Artificial Intelligence, Faculty of Informatics, Eötvös Loránd University, Budapest, Hungary
foauaai@inf.elte.hu, sxdj3m@inf.elte.hu, lorincz@inf.elte.hu

Abstract

Automatic personality trait assessment is essential for high-quality human-machine interactions. Systems capable of human behavior analysis could be used for self-driving cars, medical research, and surveillance, among many others. We present a multimodal deep neural network with a Siamese extension for apparent personality trait prediction, trained on short video recordings and exploiting modality-invariant embeddings. Acoustic, visual, and textual information are utilized to reach high-performance solutions in this task. Due to the highly centralized target distribution of the analyzed dataset, changes in the third decimal digit are relevant. Our proposed method addresses the challenge of under-represented extreme values, achieves an average MAE improvement of 0.0033, and shows a clear advantage over the baseline multimodal DNN without the introduced module.

1 Introduction

Prediction of personality traits is an important task: it helps to predict the decision-making patterns of people with stable personality traits in diverse situations and to detect changes due to, e.g., stress, drinking, or drugs. One of the most studied models for describing personality is the Big Five personality traits [2]. The theory identifies five factors: EXTraversion, NEUroticism, AGReeableness, CONscientiousness, and OPEnness. Each personality trait represents a range bounded by two extremes, e.g., for extraversion, the two ends are extreme extraversion and extreme introversion.

Audio-visual personality trait prediction has attracted considerable interest [16] due to the high-quality databases released in the ChaLearn challenges, i.e., First Impressions V1 and V2 [11]. In this study, we used the extended and revised dataset (V2). The dataset contains 10,000 video clips extracted from more than 3,000 different YouTube high-definition videos of people mostly facing and speaking to a camera.

Although multimodal systems offer advantages compared to monomodal systems, they raise several challenges as well. For example, one faces the problems of selecting the modalities to be included in the multimodal system, deriving the architecture to fuse them, and attenuating errors from noisy, missing, or underrepresented data. One specific characteristic of the First Impressions V2 database is its unbalanced data distribution with fewer extreme samples. These extreme examples, however, are more significant and take priority in several use cases, including medication.

Multimodal fusion approaches often neglect complex intra- and inter-modal dependencies and lack robustness in the case of noisy or missing modalities [27]. Due to these challenges, an increasing number of studies have been conducted to transfer knowledge across domains or modalities [13, 21]. Embedding methods have proven useful for handling the aforementioned inter-dependencies. It has been found that the similarity and correlation of semantic information retrieved from real data can be represented using deep metric learning in an embedded feature space [9, 8].

Our contributions are listed below:

  1. We propose a general-purpose learning framework to extract modality-invariant embeddings from multiple information sources with a Siamese network, emphasizing extreme examples and implicitly improving the multimodal fusion process.

  2. We extended the Multi-Similarity loss [22] to handle multiple apparent personality trait class labels simultaneously, in addition to using various input modalities. The problem with non-extreme examples is that one or more modalities contain inadequate information to aid the deep embedding process. To overcome this issue, we modified the sample selection of the so-called “online hard example mining” procedure of the triplet loss evaluation and put the emphasis on the extreme samples, as detailed later in the paper.

  3. Although samples having lower or higher personality trait values are less frequent in the database, high-quality prediction of their values is desired in various situations. We show that cross-modal embedding enhances the prediction of the Big Five personality traits in the extreme cases.

The paper is organized as follows. Section 2 reviews the related works. The preprocessing steps, baseline, and the proposed method are detailed in Section 3. The experimental setup, dataset introduction, and the implementation details are described in Section 4. Our results, together with the discussion are presented in Section 5. We conclude in Section 6.

2 Related works

Multimodal information has been widely used in various domains, ranging from semantic indexing and multimedia event detection to video situation understanding, among many others. To merge such sources of information, fusion strategies have been derived to harness complementary information from single modalities. Such strategies are classified into three categories: model-level fusion, feature-level fusion, and decision-level fusion [29].

Human behavior monitoring and evaluation rely heavily on multimodal information fusion. Busso et al. [1] paired facial expressions with audio information, yielding better prediction for emotion recognition. Wimmer et al. [24] studied feature-level fusion of low-level audio and video descriptors. Contextual long-range information was later leveraged through the introduction of BLSTMs by Wöllmer et al. [25]. With the emergence of deep learning, more sophisticated methods were adopted, e.g., by Ngiam et al. [17], who suggested a bi-modal deep auto-encoder to extract shared representations from the input modalities. However, these approaches hardly consider complex intra- and inter-modal interactions and lack robustness in the case of noisy or missing modalities [28, 17].

Embedding methods have been proven useful for integrating such inter-dependencies. Han, J. et al. [7] used triplet loss to distill discriminative representations in the speech modality. Tsai et al. [20] proposed a model that factorizes learned multimodal representations into two sets of independent generative and discriminative factors. Recently, Han et al. [8] introduced a novel learning framework to leverage information from auxiliary modalities for emotion recognition, using triplet loss to produce modality-invariant emotion embedding in a latent space.

There are recent surveys on personality trait detection that can orient the interested reader [3, 10]. Here, we mention the works related to the ChaLearn First Impressions challenges. Kaya et al. [14], the winner of the ChaLearn: First Impressions V1 competition, used visual, audio, and scene features in their system trained end-to-end. Kampman et al. [12] performed an ablation study by combining audio, video, and text information in a tri-modal stacked CNN architecture. More recently, Zhang et al. [30] studied the feasibility of merging apparent personality and emotion estimations within a single deep neural network in a multi-task learning framework. An apparent problem with this approach is that the standard deviation of the estimations, when trained on the ChaLearn First Impressions dataset, is much narrower than that of the original data. The phenomenon is called the “regression-to-the-mean problem”, where the prediction of extreme values becomes severely constrained. Li et al. [16] considered this issue and proposed a classification-regression model in which the final regression is guided by the learned classification features, and introduced a new objective function (called Bell loss) to ease the aforementioned problem.

Figure 1: Pipeline of the proposed method for enhanced Big Five personality trait prediction. Visual, acoustic, and textual information are processed with modality-specific subnetworks. The hidden representations are projected into a shared embedding space with a Siamese network to exploit mutual information of different information sources implicitly. The shared embedding space of the 128D auxiliary vectors is illustrated by colored circles in 2D. The extracted multimodal hidden representations and the cross-modal embeddings are fused before the final Big Five prediction. The training procedure consists of multiple learning stages (LS). FC: fully-connected, Bi-GRU: bidirectional gated recurrent unit, $\oplus$: concatenation operator. The numbers within blocks indicate the number of hidden units used. Multiple values imply stacked layers.

3 Methodology

In this work, we propose a multimodal deep neural network that combines features from visual, acoustic, and textual clues to predict apparent Big-Five personality traits using short video clips from the ChaLearn challenges. The pipeline is depicted in Figure 1.

For the audio signals, we use standard acoustic features that can be generated by OpenSMILE [6] (see Section 3.1). For the visual feed, most of the frames contain redundant information, so we subsample the frames. Since annotated transcripts are noisy, we adopt a non-contextual word-level representation for capturing the semantic meaning.

We aim to create a shared coordinate space by transforming the audio, video, and text descriptors into a semantically relevant form using a Siamese network. Triplet-based loss functions are designed to pull positive examples as close as possible to the so-called anchor sample and to push negative examples away from it beyond a given threshold. The terms embedded vector and auxiliary vector are used interchangeably for the outputs of the Siamese network. Higher-precision estimation of the extremes is one of our goals, and we expect multimodal data enrichment to be advantageous for each trait. We use a DNN that combines tri-modal features along with the embedded vectors to predict apparent personality traits from the short video clips.

3.1 Data preprocessing

Audio Features

For acoustic features, we used a de facto standard preset called the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [5]. This feature set contains the F0 semitone, loudness, spectral flux, MFCC, jitter, shimmer, F1, F2, F3, alpha ratio, Hammarberg index, and slope V0 features. Furthermore, many statistical functions are applied to these low-level descriptors considering voiced and unvoiced regions, resulting in 88 features for every sample. The audio signals were extracted from the videos using FFmpeg with a sampling frequency of 44,100 Hz. Then, the eGeMAPS features were generated with OpenSMILE. Min-max normalization was applied as a preprocessing step to rescale the variables into the range [0, 1].
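The min-max rescaling step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the per-feature minima and maxima are fitted on the training set and that unseen values are clipped to [0, 1], neither of which is stated in the text. The helper names `fit_min_max` and `apply_min_max` are hypothetical.

```python
def fit_min_max(rows):
    """Return per-feature minima and maxima over a list of feature vectors."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(row, mins, maxs):
    """Rescale one feature vector into [0, 1]; constant features map to 0."""
    scaled = []
    for value, lo, hi in zip(row, mins, maxs):
        if hi == lo:
            scaled.append(0.0)
        else:
            # Assumption: clip values outside the fitted range into [0, 1].
            scaled.append(min(1.0, max(0.0, (value - lo) / (hi - lo))))
    return scaled


# Toy example with 3 features instead of the 88 eGeMAPS features.
train = [[1.0, 10.0, 5.0], [3.0, 20.0, 5.0], [2.0, 40.0, 5.0]]
mins, maxs = fit_min_max(train)
normalized = [apply_min_max(r, mins, maxs) for r in train]
```

Fitting the statistics once and reusing them for validation and test samples keeps all splits on the same scale.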

Visual Features

We subsampled the video: only 6 frames are selected to reduce the overall complexity and the redundancy of successive, similar frames. The choice of 6 is arbitrary and does not affect the outcome significantly. Pixel values lie in the range [0, 255]. Images are resized to $140 \times 248$ pixels to preserve the original aspect ratio, then the same random $128 \times 128$ pixel spatial crop is applied to all frames of a sample. We employed the same augmentation techniques on every frame of a single clip (with 0.5 probability) during training to preserve the relative similarity between video frames. Data augmentation consists of random flip, random hue ($\pm 0.15$), brightness ($\pm 0.2$), saturation (between 0.8 and 1.2), and contrast (between 0.8 and 1.2). The augmentations on hue and brightness are additive, while those on saturation and contrast are multiplicative. At validation and test time, a center crop was applied. Finally, the frames are scaled to the range [-1, 1].
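The idea of drawing augmentation parameters once per clip and reusing them for every frame can be sketched as below. This is a simplified illustration under stated assumptions: frames are toy single-channel nested lists, the flip is taken to be horizontal, and the colour jitter (hue, brightness, saturation, contrast) is omitted. All function names are hypothetical.

```python
import random


def sample_clip_params(height, width, crop, rng):
    """Draw one set of augmentation parameters for a whole clip."""
    return {
        "top": rng.randint(0, height - crop),
        "left": rng.randint(0, width - crop),
        "flip": rng.random() < 0.5,  # applied with probability 0.5
    }


def crop_and_flip(frame, params, crop):
    """Apply the shared crop (and optional horizontal flip) to one frame."""
    top, left = params["top"], params["left"]
    rows = [row[left:left + crop] for row in frame[top:top + crop]]
    if params["flip"]:
        rows = [row[::-1] for row in rows]
    return rows


def scale_pixels(frame):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return [[p / 127.5 - 1.0 for p in row] for row in frame]


rng = random.Random(0)
H, W, CROP, FRAMES = 140, 248, 128, 6
# Toy single-channel frames; real frames would carry three colour channels.
clip = [[[(r + c + f) % 256 for c in range(W)] for r in range(H)]
        for f in range(FRAMES)]
params = sample_clip_params(H, W, CROP, rng)  # sampled once per clip
augmented = [scale_pixels(crop_and_flip(fr, params, CROP)) for fr in clip]
```

Because `params` is sampled a single time, every frame of the clip receives the identical crop window and flip decision, which preserves the relative similarity between frames.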

Textual Features

GloVe uses unsupervised learning to obtain non-contextual vector representations of words. These vectors are meant to encode semantic information, such that similar words (e.g., synonyms) have similar embedding vectors. We used pre-trained embeddings (Wikipedia 2014 and Gigaword 5), which capture the overall meaning of a sentence with less memory and faster than contextual models (like BERT) do. The transcripts are tokenized with SpaCy. All special characters, digits, URLs, and emails are filtered out. Every token is converted to its corresponding GloVe vector before being fed to the textual subnetwork.

3.2 Multimodal information fusion

Visual, acoustic, and textual high-level attributes are combined via a model-level fusion approach. In the first learning stage, the modality-specific subnetworks are trained separately on the regression task, using ground truth annotations. Afterwards, they are used as feature extractors, and their parameters are frozen during further training. In the second learning stage, the tri-modal feature vectors are concatenated and used as the input of a fully-connected network.

Acoustic subnetwork

The 88-dimensional acoustic feature vector $x_A \in \mathbb{R}^N$ is the input of the audio subnetwork $f_A: \mathbb{R}^N \rightarrow \mathbb{R}^Q$, which is a fully-connected shallow network with two hidden layers.

Visual subnetwork

Using the video samples $x_V \in \mathbb{R}^{F \times H \times W \times C}$, where $F$ is the number of frames, $H$ and $W$ are the height and width spatial dimensions, and $C$ is the number of channels, a feature extractor $f_V: \mathbb{R}^{F \times H \times W \times C} \rightarrow \mathbb{R}^Q$ is trained. We chose ResNet-50 for our visual backbone. For every frame, a 2048-dimensional feature vector is extracted. Average pooling was applied to the time dimension, followed by a fully-connected layer.

Textual subnetwork

The textual subnetwork input is $x_T \in \mathbb{R}^{K \times G}$, where $K$ is the maximum sequence length and $G$ is the dimension of the GloVe embeddings. A bidirectional gated recurrent unit (Bi-GRU) with attention mechanism, $f_T: \mathbb{R}^{K \times G} \rightarrow \mathbb{R}^Q$, is trained as a feature extractor.

The $x_A$ audio feature vector, the corresponding $x_V$ RGB frames, and the $x_T$ GloVe vectors are fed into their modality-specific subnetworks, producing the $h_A$, $h_V$, $h_T \in \mathbb{R}^Q$ hidden representations, respectively:

$$h_A = f_A(x_A),\; h_V = f_V(x_V),\; h_T = f_T(x_T) \tag{1}$$

Let us define $p: \mathbb{R}^{(\cdot)} \rightarrow \mathbb{R}^5$, a linear mapping function that estimates the five personality attributes from a given hidden representation. For monomodal subnetworks, the process can be formalized as follows:

$$\hat{y} = p(h_A),\; \hat{y} = p(h_V),\; \hat{y} = p(h_T) \tag{2}$$

The network parameters are optimized with Bell loss, following the work of [16]. The loss function is shaped like an inverted bell and is applied to address the regression-to-the-mean problem [26], which is particularly problematic in our case, where the ground truth scores closely follow a Gaussian distribution. The Bell loss is defined as:

$$\mathcal{L}_{bell} = \frac{1}{5n}\sum_{i=1}^{n}\sum_{j=1}^{5}\gamma\Big(1 - e^{-\frac{(y_{ij}-\hat{y}_{ij})^{2}}{2\sigma^{2}}}\Big), \tag{3}$$

where $n$ is the number of samples, $y_{ij}$ and $\hat{y}_{ij}$ are the ground truth and the prediction for the $i$th sample and the $j$th trait, respectively, $\sigma$ is the derivation parameter, and $\gamma$ is a scale parameter. The $\sigma$ controls the amplitude of variation, and $\gamma$ makes the loss function consistent with the other loss functions used, such as the classical Mean Absolute Error (MAE) and Mean Squared Error (MSE):

$$\mathcal{L}_{mae} = \frac{1}{5n}\sum_{i=1}^{n}\sum_{j=1}^{5}|y_{ij}-\hat{y}_{ij}|,\; \mathcal{L}_{mse} = \frac{1}{5n}\sum_{i=1}^{n}\sum_{j=1}^{5}\big(y_{ij}-\hat{y}_{ij}\big)^{2} \tag{4}$$

As the empirical results in [16] showed, the Bell loss has difficulties at the beginning of the optimization and performs well at later optimization stages. To avoid this issue, the sum of $\mathcal{L}_{mae}$ and $\mathcal{L}_{mse}$ guides the stochastic gradient descent algorithm in the earlier stages by producing a higher gradient. We trained the modality-specific subnetworks with $\mathcal{L}$, which is the sum of the $\mathcal{L}_{mae}$, $\mathcal{L}_{mse}$, and $\mathcal{L}_{bell}$ loss functions introduced in Equations 3 and 4.
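The combined objective of Equations 3 and 4 can be written out as a short reference sketch in plain Python. The values chosen for `sigma` and `gamma` below are arbitrary placeholders for illustration, not the settings used in the experiments, and the function names are hypothetical.

```python
import math


def _pairs(y_true, y_pred):
    """Yield (target, prediction) pairs over all samples and traits."""
    for targets, preds in zip(y_true, y_pred):
        for t, p in zip(targets, preds):
            yield t, p


def mae_loss(y_true, y_pred):
    count = len(y_true) * len(y_true[0])
    return sum(abs(t - p) for t, p in _pairs(y_true, y_pred)) / count


def mse_loss(y_true, y_pred):
    count = len(y_true) * len(y_true[0])
    return sum((t - p) ** 2 for t, p in _pairs(y_true, y_pred)) / count


def bell_loss(y_true, y_pred, sigma, gamma):
    """Inverted-bell loss: close to 0 for small errors, saturates at gamma."""
    count = len(y_true) * len(y_true[0])
    total = sum(
        gamma * (1.0 - math.exp(-((t - p) ** 2) / (2.0 * sigma ** 2)))
        for t, p in _pairs(y_true, y_pred)
    )
    return total / count


def combined_loss(y_true, y_pred, sigma, gamma):
    """Sum of the MAE, MSE, and Bell terms."""
    return (mae_loss(y_true, y_pred) + mse_loss(y_true, y_pred)
            + bell_loss(y_true, y_pred, sigma, gamma))


# One sample with five trait scores; sigma and gamma are placeholder values.
y_true = [[0.5, 0.5, 0.5, 0.5, 0.5]]
y_pred = [[0.6, 0.6, 0.6, 0.6, 0.6]]
loss = combined_loss(y_true, y_pred, sigma=0.5, gamma=1.0)
```

The Bell term is bounded above by `gamma`, so for large errors its gradient vanishes, which is consistent with the MAE and MSE terms being needed early in training.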

Baseline multimodal network

In the second learning stage, the parameters of the $f_A$ acoustic, $f_V$ visual, and $f_T$ textual subnetworks are not updated. To leverage the supplementary information of multiple modalities, we concatenated the $h_A$, $h_V$, and $h_T$ hidden representations and performed model-level fusion. A fully-connected shallow network $M_1: \mathbb{R}^{3Q} \rightarrow \mathbb{R}^O$ followed by $p$ is applied to obtain the personality trait prediction, formally defined as:

$$\hat{y} = p\Big(M_1\big(h_A \oplus h_V \oplus h_T\big)\Big), \tag{5}$$

where $\oplus$ is the concatenation operator.

3.3 Cross-modal deep metric learning

In the following paragraphs, we describe the metric learning framework. We can leverage complementary information from different modalities efficiently using a Siamese network. Using the cross-modal embedding, we make the proposed model more robust to noise, so a more accurate prediction can be achieved. We aim to train a cross-modal Siamese network $S: \mathbb{R}^Q \rightarrow \mathbb{R}^E$ on the hidden representations of the modality-specific nets, which projects the multimodal descriptors into a shared coordinate space $\mathbb{R}^E$:

$$e_A = S(h_A),\; e_V = S(h_V),\; e_T = S(h_T) \tag{6}$$

where $S$ is a Siamese network, and $e_A$, $e_V$, and $e_T$ are the projected $E$-dimensional embeddings of the $h_A$, $h_V$, and $h_T$ hidden representations, respectively.

We aim to create a common cross-modal embedding space by transforming tri-modal descriptors in a semantically relevant way. For training the Siamese network, we chose the current state-of-the-art, triplet-based multi-similarity (MS) loss function [22], which requires an anchor, a positive, and a negative example to form positive and negative pairs within a mini-batch. It can jointly measure the self-similarity and relative similarities of a pair, which allows it to collect informative pairs by implementing iterative pair mining and weighting. Deep metric learning requires class labels for training, and the MS loss was proposed and tested for only one modality, the single RGB texture.

Triplet generation

Using inputs and the corresponding class labels, we can form triplets $\{e, e^{+}, e^{-}\}$. Examples from the same class $\{e, e^{+}\}$ are determined as positive pairs $\in \mathcal{P}$, while samples belonging to different classes $\{e, e^{-}\}$ are the negative pairs $\in \mathcal{N}$. The Big Five annotations of the First Impressions V2 dataset are continuous variables. We define personality classes in Section 4.4 because it is a database-specific modification.

We applied MS loss, which is defined as a pair weighting problem, and empirical results show that it is superior to other commonly used loss functions, namely the contrastive loss, triplet loss, binomial deviance loss, and lifted structure loss. To compute a cross-modal MS loss, first, the $e_A$ audio, $e_V$ visual, and $e_T$ textual embeddings are combined to form a triple-sized batch of embeddings denoted as $\{e_A, e_V, e_T\}$, then the similarity metrics are calculated using mixed embeddings of different modalities.

Similarity is defined between two embeddings $e_1$ and $e_2$ as the dot product of the vectors considering only the $j$th personality trait, denoted as $D_{e_1,e_2}^{j} = \langle S(e_1), S(e_2) \rangle$. MS consists of two parts: mining and weighting. Both schemes are integrated into a single loss function, which is defined as follows:

$$\mathcal{L}_{MS} = \frac{1}{5n}\sum_{i=1}^{n}\sum_{j=1}^{5}\Bigg\{\frac{1}{\alpha}\log\Big[1 + \sum_{k \in \mathcal{P}_i^j} e^{-\alpha(D_{ik}^{j} - \lambda)}\Big] + \frac{1}{\beta}\log\Big[1 + \sum_{k \in \mathcal{N}_i^j} e^{\beta(D_{ik}^{j} - \lambda)}\Big]\Bigg\}, \tag{7}$$

where $D$ is the similarity matrix within a triple-size mini-batch, $D_{ik}^{j}$ is the similarity of two embeddings $i$ and $k$, and $\mathcal{P}^{j}$ and $\mathcal{N}^{j}$ are the sets of positive and negative examples considering only the $j$th trait class labels, respectively. $\alpha$, $\beta$, and $\lambda$ are fixed hyper-parameters.

We calculated a mean, trait-wise multi-similarity loss, considering all 5 target variables per sample within a mini-batch. In the case of non-extreme examples, one or more modalities contain inadequate information to aid the deep embedding process, so we modified the online hard sample mining process to consider only extreme samples as anchors. In the third learning stage, the $S$ embedding network is trained with the trait-wise $\mathcal{L}_{MS}$ using the modified mining procedure. The Siamese network outputs auxiliary vectors that can help the evaluation due to its specific modality mixing mechanism.
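The weighting part of the trait-wise loss in Equation 7, with anchors restricted to selected samples, can be sketched as follows. This is a hedged illustration and not the authors' implementation: the pair-mining step of the MS loss is omitted, the loss is averaged over the anchor terms actually evaluated (an assumption), the hyper-parameter values are arbitrary, and `anchor_mask` is a hypothetical way to mark which (sample, trait) entries count as extreme.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def trait_wise_ms_loss(embeddings, labels, alpha, beta, lam, anchor_mask=None):
    """Mean trait-wise multi-similarity loss over the permitted anchors.

    embeddings:  list of vectors (the mixed, triple-sized batch)
    labels:      labels[i][j] is the class of embedding i for trait j
    anchor_mask: optional booleans, anchor_mask[i][j] is True if embedding i
                 may serve as an anchor for trait j
    """
    n, traits = len(embeddings), len(labels[0])
    sim = [[dot(embeddings[i], embeddings[k]) for k in range(n)] for i in range(n)]
    total, terms = 0.0, 0
    for i in range(n):
        for j in range(traits):
            if anchor_mask is not None and not anchor_mask[i][j]:
                continue  # non-extreme samples are skipped as anchors
            pos = [sim[i][k] for k in range(n)
                   if k != i and labels[k][j] == labels[i][j]]
            neg = [sim[i][k] for k in range(n) if labels[k][j] != labels[i][j]]
            pos_term = math.log(1.0 + sum(math.exp(-alpha * (s - lam)) for s in pos)) / alpha
            neg_term = math.log(1.0 + sum(math.exp(beta * (s - lam)) for s in neg)) / beta
            total += pos_term + neg_term
            terms += 1
    return total / terms if terms else 0.0


# Toy batch: four unit-length embeddings and two traits (the paper uses five).
emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
labels_good = [[0, 0], [0, 0], [1, 1], [1, 1]]   # classes match the geometry
labels_bad = [[0, 0], [1, 1], [0, 0], [1, 1]]    # classes contradict it
good = trait_wise_ms_loss(emb, labels_good, alpha=2.0, beta=10.0, lam=0.5)
bad = trait_wise_ms_loss(emb, labels_bad, alpha=2.0, beta=10.0, lam=0.5)
```

In this toy batch, `good` is smaller than `bad`, since same-class embeddings coincide and different-class embeddings are orthogonal only under `labels_good`.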

3.4 Fused model

Our method combines the multimodal regression network and the cross-modal Siamese network in the fourth (and final) learning stage. The cross-modal embeddings ($e_{A}, e_{V}, e_{T}$) are complementary to the hidden representations of the modality-specific subnetworks, and all of them contribute to the final prediction of $p$.

Model-level fusion is applied, similarly as before: the embeddings are concatenated to the previously fused features, then a fully-connected shallow network $M_{2}:\mathbb{R}^{3H+3E}\rightarrow\mathbb{R}^{O}$ and $p$ are applied to get the prediction of the Big Five traits. It is formally defined as:

$$\hat{y}=p\bigg(M_{2}\Big(M_{1}\big(h_{A}\oplus h_{V}\oplus h_{T}\big)\oplus e_{A}\oplus e_{V}\oplus e_{T}\Big)\bigg), \tag{8}$$
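The data flow of Equation 8 can be traced with a small shape-checking sketch. It rests on several assumptions: each of $M_{1}$, $M_{2}$ and $p$ is reduced to a single dense layer with random weights, $p$ is taken to be a sigmoid layer with five outputs, and the input width of $M_{2}$ follows Equation 8 read literally (the $M_{1}$ output concatenated with the three embeddings). The sizes 256, 512 and 128 are the ones reported in Section 4.2. The names `dense`, `init` and `fused_forward` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, EMB, OUT, TRAITS = 256, 128, 512, 5   # sizes reported in Section 4.2

def init(n_in, n_out):
    # He normal initialization for a dense layer (weights, zero bias).
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out)

def dense(x, params, activation):
    w, b = params
    z = x @ w + b
    if activation == "relu":
        return np.maximum(z, 0.0)
    return 1.0 / (1.0 + np.exp(-z))          # sigmoid

m1 = init(3 * FEAT, OUT)        # stand-in for M_1
m2 = init(OUT + 3 * EMB, OUT)   # stand-in for M_2 (input width is an assumption)
p = init(OUT, TRAITS)           # stand-in for the prediction layer p

def fused_forward(h_a, h_v, h_t, e_a, e_v, e_t):
    fused = dense(np.concatenate([h_a, h_v, h_t], axis=1), m1, "relu")
    joint = np.concatenate([fused, e_a, e_v, e_t], axis=1)
    return dense(dense(joint, m2, "relu"), p, "sigmoid")

batch = 4
hidden = [rng.normal(size=(batch, FEAT)) for _ in range(3)]      # h_A, h_V, h_T
embeddings = [rng.normal(size=(batch, EMB)) for _ in range(3)]   # e_A, e_V, e_T
y_hat = fused_forward(*hidden, *embeddings)
print(y_hat.shape)
```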

4 Experiments

In the following paragraphs, we introduce the dataset used for the experiments and specify the input and hidden dimensions, as well as the hyperparameters fixed during the network implementation. Then the evaluation metric, personality trait class definitions, visualization, and the results are presented.

4.1 Database

We used the ChaLearn: First Impressions V2 database for our experiments because it is the largest publicly available in-the-wild dataset in this subfield.

Figure 2: Examples from the First Impressions V2 dataset. For each video, the ground truth Big Five scores are provided. For each trait, the first two samples instantiate the high extremes, and the last two examples demonstrate the low extremes of a given trait.

The dataset contains 15-second-long videos, which are collected automatically. Transcripts of the video clips are generated by the cloud transcription service Rev. The clips are annotated by Amazon Mechanical Turk (AMT) workers using a special interface [18]. The annotations were registered as pairwise comparisons, and the votes were then converted to cardinal values by fitting a BTL (Bradley-Terry-Luce) model with maximum likelihood estimation. Values are scaled, so every video sample has five continuous trait scores between 0 and 1. Each trait represents a range bounded by two extremes. For example, for extraversion, the two polar ends are extreme extraversion and extreme introversion, which can be described with the words “friendly” and “reserved”, respectively. A few examples from the dataset focusing on the extreme poles per trait are depicted in Figure 2.
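To make the conversion from pairwise votes to cardinal scores concrete, the following sketch fits a BTL model with a standard minorization-maximization update for its likelihood. It is only an illustration: the optimizer actually used by the dataset creators is not stated here, and the min-max rescaling of log-strengths to $[0, 1]$, the function names `fit_btl` and `scale_to_unit`, and the toy votes are assumptions.

```python
import numpy as np

def fit_btl(n_items, comparisons, n_iter=200):
    """Maximum likelihood BTL strengths from (winner, loser) pairs.

    Assumes every item wins at least once and takes part in a comparison.
    """
    wins = np.zeros(n_items)
    pair_counts = np.zeros((n_items, n_items))
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[winner, loser] += 1
        pair_counts[loser, winner] += 1
    p = np.full(n_items, 1.0 / n_items)
    for _ in range(n_iter):
        # MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j), then renormalize.
        denom = (pair_counts / (p[:, None] + p[None, :])).sum(axis=1)
        p = wins / denom
        p /= p.sum()
    return p

def scale_to_unit(strengths):
    # Min-max scaling of log-strengths (an assumed scaling choice).
    log_s = np.log(strengths)
    return (log_s - log_s.min()) / (log_s.max() - log_s.min())

# Toy votes: clip 0 usually beats clip 1, which usually beats clip 2.
votes = ([(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)]
         + [(0, 2)] * 3 + [(2, 0)])
strengths = fit_btl(3, votes)
scores = scale_to_unit(strengths)
print(strengths, scores)
```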

Creators of this dataset rely on the perception of human subjects watching the videos. It is a different task than evaluating real personality traits with experts, but equally useful in the context of human interaction.

One peculiarity of this dataset is that the target variables have an unbalanced distribution. The regression-to-the-mean problem is emphasized because the scores follow a Gaussian distribution, and the optimization process likely produces predictions near the mean of the ground truth values to minimize the loss. We alleviated this problem with the Bell loss [16], which is similar to the Mean Squared Error but can produce higher gradients when the prediction is closer to the ground truth.

4.2 Experimental Setup

Our experiments are conducted with TensorFlow on a single GeForce RTX 2080 Ti GPU.

The training process is performed in multiple learning stages. The weights are not modified after a stage is finished. We used the Adam [15] optimizer with an initial learning rate of $0.001$ and a polynomial decay schedule throughout all experiments. Following the work of [16], we set the parameters of the Bell loss to $\sigma=9$ and $\gamma=300$. In the first, second, and fourth learning stages, $\mathcal{L}$ was used as the loss function (Section 3.2).

For reduced complexity, we define $Q=256$ and $O=512$ in Section 3. All three modality-specific networks produce 256-dimensional feature vectors, and following a concatenation, shared dense networks produce 512-dimensional vectors in the baseline and the proposed fused model as well.

For the acoustic representation, 88-dimensional eGeMAPS vectors are used ($N=88$). We fed a mini-batch of 128 vectors to the audio subnetwork and tuned the two fully-connected layers for 100 epochs with early stopping.

After the frame selection (6 frames per clip) and augmentation techniques, $6\times 128\times 128\times 3$ input features are fed to the visual subnetwork ($F=6$, $H=W=128$, $C=3$). We trained it from scratch with a mini-batch of 22 video sequences for 80 epochs. Dropout with a $0.5$ rate was applied before the fully-connected layer as an extra regularization.

For semantic word representation, we used 300-dimensional GloVe embeddings. We empirically set the sequence length to 80. After converting every token to its corresponding GloVe vector, an $80\times 300$ matrix is produced for every sample ($K=80$, $G=300$). For the textual subnetwork, we used $0.5$ for the Bi-GRU input dropout rate. We also applied a simplified attention mechanism [19] and tuned the subnetwork for 50 epochs.
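The token-to-matrix step can be illustrated as follows. This is a sketch under assumptions the text does not specify: transcripts are truncated or zero-padded at the end to 80 tokens, tokens are lowercased, and out-of-vocabulary tokens map to a zero vector. The dictionary `toy_vectors` is a random stand-in for real GloVe vectors, and `tokens_to_matrix` is a hypothetical helper name.

```python
import numpy as np

SEQ_LEN, EMB_DIM = 80, 300   # K and G in the text

def tokens_to_matrix(tokens, vectors, seq_len=SEQ_LEN, emb_dim=EMB_DIM):
    """Build a (seq_len, emb_dim) matrix from a token list.

    Truncation, end padding with zeros and zero vectors for unknown
    tokens are assumed conventions.
    """
    matrix = np.zeros((seq_len, emb_dim), dtype=np.float32)
    for row, token in enumerate(tokens[:seq_len]):
        vec = vectors.get(token.lower())
        if vec is not None:
            matrix[row] = vec
    return matrix

rng = np.random.default_rng(0)
# Random stand-ins for pretrained GloVe vectors.
toy_vectors = {w: rng.normal(size=EMB_DIM).astype(np.float32)
               for w in ["hello", "everyone", "welcome"]}
sample = tokens_to_matrix("Hello everyone and welcome".split(), toy_vectors)
print(sample.shape)
```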

The Siamese network consists of two fully-connected hidden layers with 200 neurons each, and a linear dense output layer with 128 units. Dropout with a $0.5$ rate was applied after the first hidden layer. In the third learning stage, we used $\mathcal{L}_{MS}$ as the loss function (Equation 7). We used ReLU as the activation function and Kaiming/He normal initialization, in addition to $0.0005$ weight decay in every dense layer, except within the Siamese network, where weight decay is not used.

4.3 Evaluation metrics

During the ChaLearn challenge, $R_{acc}$ (“1-Mean Absolute Error”) was the performance metric, so many publications employed it. It is defined as follows:

$$R_{acc}=1-\frac{1}{5n}\sum_{i=1}^{n}\sum_{j=1}^{5}|y_{ij}-\hat{y}_{ij}|, \tag{9}$$

where $n$ is the number of samples, and $y_{ij}$ and $\hat{y}_{ij}$ are the ground truth and the prediction of the $i$th sample and $j$th trait.
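Equation 9 translates directly into a few lines of NumPy. The sketch below assumes that labels and predictions are stored as arrays of shape (samples, 5); the function names and the toy values are illustrative only.

```python
import numpy as np

def r_acc(y_true, y_pred):
    """Overall 1 - MAE over all samples and all five traits (Equation 9)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.mean(np.abs(y_true - y_pred))

def r_acc_per_trait(y_true, y_pred):
    """Trait-wise 1 - MAE; its mean equals the overall score."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.mean(np.abs(y_true - y_pred), axis=0)

# Toy example: every prediction is off by 0.1.
y_true = np.array([[0.5, 0.6, 0.4, 0.7, 0.5],
                   [0.3, 0.4, 0.6, 0.5, 0.7]])
y_pred = y_true + 0.1
print(r_acc(y_true, y_pred), r_acc_per_trait(y_true, y_pred))
```

Because every trait has the same number of samples, the mean of the trait-wise scores coincides with the overall value.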

4.4 Personality trait class definition

The annotation of the First Impressions V2 dataset consists of 5 continuous variables. Samples can be grouped into different classes by splitting the [0, 1] interval into equal, smaller intervals. In this work, we aim to differentiate extreme examples from ordinary samples based on the ground truth values. We determine 4 classes per trait, and we focus on the two extremes, which can be monitored in various clinical sessions later on: the low-extreme and high-extreme classes, which are labeled as C1 and C4, respectively.

Figure 3: Personality trait class definitions. Continuous ground truth values are segmented into 4 classes. The thresholds are determined using the mean and standard deviation calculated on the train set trait-wise. Samples from C1 and C4 are the low extremes and high extremes, respectively.

However, in our case the ground truth follows a Gaussian distribution, and splitting the [0, 1] interval into equal parts would lead to an undesirably unbalanced number of extreme samples. To address this issue, we can create more balanced classes by determining the following segmentation thresholds: scores in the range $[0,\bar{t}-\sigma_{t})$ belong to the low-extreme class (C1), values in the ranges $[\bar{t}-\sigma_{t},\bar{t})$ and $[\bar{t},\bar{t}+\sigma_{t})$ are labeled as ordinary (C2, C3), and samples in $[\bar{t}+\sigma_{t},1]$ are the high extremes (C4), where $\bar{t}$ and $\sigma_{t}$ are the mean and standard deviation calculated over all training samples of personality trait $t$. Figure 3 demonstrates the class definitions on the histograms of the train and test sets.
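The segmentation above can be sketched with `np.digitize`, whose default half-open bins match the interval definitions. The example assumes score arrays of shape (samples, 5) and the population standard deviation; the synthetic Gaussian scores and the helper names `fit_thresholds` and `assign_classes` are hypothetical.

```python
import numpy as np

def fit_thresholds(train_scores):
    """Per-trait thresholds (mean - std, mean, mean + std) from the train set."""
    mean = train_scores.mean(axis=0)
    std = train_scores.std(axis=0)   # population std is an assumption
    return np.stack([mean - std, mean, mean + std], axis=1)   # (traits, 3)

def assign_classes(scores, thresholds):
    """Map scores to classes 1..4 (C1..C4) trait by trait."""
    classes = np.empty(scores.shape, dtype=int)
    for t in range(scores.shape[1]):
        # digitize returns 0..3 for the four half-open intervals.
        classes[:, t] = np.digitize(scores[:, t], thresholds[t]) + 1
    return classes

rng = np.random.default_rng(0)
# Synthetic stand-in for the training labels.
train = rng.normal(0.5, 0.15, size=(6000, 5)).clip(0.0, 1.0)
thresholds = fit_thresholds(train)
train_classes = assign_classes(train, thresholds)
low_fraction = (train_classes == 1).mean()
print(thresholds[0], low_fraction)
```

For Gaussian scores, roughly 16% of the samples fall into each extreme class, which is far more than the outermost bins of an equal-width split of [0, 1] would contain.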

4.5 Visualization of Cross-Modal Embeddings

We transformed the acoustic, visual, and textual features to a shared coordinate space with a Siamese network. Figure 4 visualizes a two-component Principal Component Analysis (PCA) calculated on the multimodal inputs; the plots use only the NEUroticism ground truth values and trait classes.

Figure 4: Visualization of a 2-component PCA of cross- and multimodal embeddings of the “test” set (a), showing NEUroticism ground truth values and class labels. The audio, video and text modalities are drawn with circles, squares and crosses, respectively. The four personality classes are represented with colors, where blue is the low extreme (C1) and red is the high extreme class (C4). In (b) and (c), we emphasize embeddings within the two extreme poles of NEUroticism.

The test set contains 2000 samples, so considering all three modalities, 6000 embeddings are available. We randomly subsampled them to avoid highly overlapping markers and an overcrowded visualization, while preserving the modality and class balance within the subset: 25 embeddings are selected for every class per modality, so subplot (a) contains 300 transformed embeddings. The figure shows that even using only two components, the two polar ends of a personality trait are successfully separated. However, there is a continuous transition between trait classes, especially in the case of C2 and C3: the ground truth values are around the mean, and there are hardly any perceivable clues in the available inputs that would make these samples more separable.
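The balanced subsampling and the two-component projection can be sketched as follows. The embeddings and class labels here are random stand-ins, so the output carries no information about the real data. Fitting the PCA on all 6000 embeddings before selecting the subset is an assumption, and `balanced_subsample` and `pca_2d` are hypothetical helper names.

```python
import numpy as np

def balanced_subsample(modality, classes, per_group=25, seed=0):
    """Pick per_group indices for every (modality, class) combination."""
    rng = np.random.default_rng(seed)
    chosen = []
    for m in np.unique(modality):
        for c in np.unique(classes):
            idx = np.flatnonzero((modality == m) & (classes == c))
            chosen.append(rng.choice(idx, size=per_group, replace=False))
    return np.concatenate(chosen)

def pca_2d(x):
    """Project rows of x onto the first two principal components via SVD."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(1)
n_samples, emb_dim = 2000, 128
embeddings = rng.normal(size=(3 * n_samples, emb_dim))    # random stand-ins
modality = np.repeat(np.arange(3), n_samples)             # audio, video, text
classes = np.tile(rng.integers(1, 5, size=n_samples), 3)  # C1..C4 per sample
subset = balanced_subsample(modality, classes)
points = pca_2d(embeddings)[subset]
print(points.shape)   # 3 modalities x 4 classes x 25 = 300 points
```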

5 Results

We performed an ablation study with the used modalities to measure the added value of each information source. For the sake of comparison, a prior model obtained directly from the training labels (by averaging) on this dataset was capable of obtaining an $R_{acc}$ close to $0.88$ at the test stage [4] due to the highly centralized distribution. Consequently, changes in the third digit are relevant. Table 1 indicates that the video modality contains the most information, with an average score of $0.9074$. Apparent personality traits can be determined accurately using only a single frame: the $0.9056$ score over the test set supports the statement that trait assignment among human observers can be as fast as 100 ms [23]. The bi-modal systems produce a clear performance jump in every single case compared to the monomodal configurations. Furthermore, the “Audio + Video + Text” model achieved the best average result, as expected: the different modalities supplement each other.

Thus, we can fairly compare the proposed method to the “Audio + Video + Text” baseline. Table 1 shows that our method performs better overall: the cross-modal embeddings improve the average score from $0.9094$ to $0.9127$.

| Modality | EXT | NEU | AGR | CON | OPE | Avg |
|---|---|---|---|---|---|---|
| Audio | 0.8947 | 0.8955 | 0.9016 | 0.8916 | 0.9007 | 0.8968 |
| Scene | 0.9065 | 0.8990 | 0.9065 | 0.9110 | 0.9048 | 0.9056 |
| Video | 0.9086 | 0.9016 | 0.9072 | 0.9132 | 0.9065 | 0.9074 |
| Text | 0.8837 | 0.8853 | 0.8982 | 0.8841 | 0.8900 | 0.8882 |
| Audio + Video | 0.9097 | 0.9041 | 0.9088 | 0.9143 | 0.9074 | 0.9089 |
| Audio + Text | 0.8958 | 0.8965 | 0.9023 | 0.8952 | 0.9016 | 0.8983 |
| Text + Video | 0.9105 | 0.9041 | 0.9080 | 0.9140 | 0.9073 | 0.9088 |
| Audio + Video + Text | 0.9108 | 0.9083 | 0.9108 | 0.9103 | 0.9069 | 0.9094 |
| Ours | 0.9142 | 0.9112 | 0.9127 | 0.9154 | 0.9102 | 0.9127 |

Table 1: Comparison of the performance ($R_{acc}$) of the network trained with different data modalities.

We also evaluated the trained system using only extreme samples. We subsampled the test set, so the subset only contained examples from C1 and C4, trait-wise. In Table 2, the “All” column values are produced on the whole test set by the baseline and our method, respectively. In the case of “Low” and “High” columns, the corresponding personality trait classes are C1 and C4. The results indicate that we can enhance the prediction of high extreme values at the expense of low extreme prediction in most cases. However, focusing on CONscientiousness, enhanced quality of both low and high extreme predictions can be observed.

| Trait | Baseline All | Baseline Low | Baseline High | Ours All | Ours Low | Ours High |
|---|---|---|---|---|---|---|
| EXT | 0.9108 | 0.8870 | 0.8739 | 0.9142 | 0.8841 | 0.8768 |
| NEU | 0.9083 | 0.8730 | 0.8739 | 0.9112 | 0.8722 | 0.8742 |
| AGR | 0.9108 | 0.8590 | 0.8626 | 0.9127 | 0.8573 | 0.8644 |
| CON | 0.9103 | 0.8731 | 0.8832 | 0.9154 | 0.8753 | 0.8891 |
| OPE | 0.9069 | 0.8691 | 0.8702 | 0.9102 | 0.8684 | 0.8794 |

Table 2: Network performance ($R_{acc}$) on all samples, low and high extreme examples. Baseline: Audio + Video + Text. Ours: Audio + Video + Text fused with cross-modal embeddings.
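The differences behind these observations can be recomputed from Table 2 with a short script. The numbers are copied from the table; only the variable names are introduced here.

```python
import numpy as np

traits = ["EXT", "NEU", "AGR", "CON", "OPE"]
# Columns: All, Low, High (values copied from Table 2).
baseline = np.array([[0.9108, 0.8870, 0.8739],
                     [0.9083, 0.8730, 0.8739],
                     [0.9108, 0.8590, 0.8626],
                     [0.9103, 0.8731, 0.8832],
                     [0.9069, 0.8691, 0.8702]])
ours = np.array([[0.9142, 0.8841, 0.8768],
                 [0.9112, 0.8722, 0.8742],
                 [0.9127, 0.8573, 0.8644],
                 [0.9154, 0.8753, 0.8891],
                 [0.9102, 0.8684, 0.8794]])

delta = ours - baseline
for trait, (d_all, d_low, d_high) in zip(traits, delta):
    print(f"{trait}: all {d_all:+.4f}, low {d_low:+.4f}, high {d_high:+.4f}")
print(f"mean improvement on all samples: {delta[:, 0].mean():.4f}")
```

The high-extreme score improves for all five traits, the low-extreme score drops for four of them, and CON is the only trait where both extremes improve. The mean improvement on all samples is about 0.0033.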

6 Conclusions

In this article, we proposed a general-purpose learning framework for the Big Five personality trait prediction, which deals with multimodal data. We used the currently largest publicly available dataset in our experiments, the ChaLearn: First Impressions V2, and we created modality invariant embeddings to make the different input modalities supplement each other.

An ablation study has demonstrated the added value of the different modalities, as well as of the proposed extension. We applied a modified multi-similarity constraint over acoustic, visual, and textual representations to implicitly exploit the mutual information. Experiments show that we achieved higher overall prediction accuracy, surpassing the performance of baseline multimodal configurations. In addition, we evaluated the proposed method on extreme examples, where it produced the desired results in some cases.

To the best of our knowledge, this is the first work that introduces cross-modal embeddings for personality trait prediction. The proposed learning framework is far from perfect and could be further developed, which is planned for future work. The feature extraction part could be improved to produce more diverse and descriptive representations. Probabilities could be utilized within the triplet constraint to properly consider the uncertainty around trait class segmentation thresholds. The multiple learning phases could be combined to form an end-to-end training process for better usability.

Acknowledgments

The research was supported by the Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program. ÁF and RRS were supported in part through grants EFOP-3.6.3-VEKOP-16-2017-00001 and EFOP-3.6.3-VEKOP-16-2017-00002, respectively. AL was supported by the project “Application Domain Specific Highly Reliable IT Solutions” implemented with the support provided by the National Research, Development and Innovation Fund of Hungary and financed under the Thematic Excellence Programme no. 2020-4.1.1.-TKP2020 (National Challenges Subprogramme) funding scheme.

References

  • [1] B. Carlos, D. Zhigang, Y. Serdar, B. Murtaza, L. Chulmin, K. Abe, L. Sungbok, N. Ulrich, and N. Shrikanth. Analysis of emotion recognition using facial expressions, speech and multimodal information. In Proceedings of the 6th International Conference on Multimodal Interfaces, pages 205–211, 2014.
  • [2] J. M. Digman. Personality structure: Emergence of the five-factor model. Annual Review of Psychology, 41(1):417–440, 1990.
  • [3] H. J. Escalante, H. Kaya, A. Salah, S. Escalera, Y. Gucluturk, U. Guclu, X. Baró, I. Guyon, J. Junior, M. Madadi, S. Ayache, E. Viegas, F. Gürpınar, A. Wicaksana, C. Liem, M. Gerven, and R. Lier. Explaining first impressions: Modeling, recognizing, and explaining apparent personality from videos. IEEE Transactions on Affective Computing, pages 1–18, 2018.
  • [4] H. J. Escalante, H. Kaya, A. Salah, S. Escalera, Y. Güçlütürk, U. Güçlü, X. Baró, I. Guyon, J. C. S. Jacques Junior, M. Madadi, S. Ayache, E. Viegas, F. Gurpinar, A. S. Wicaksana, C. Liem, M. A. J. Van Gerven, and R. Van Lier. Modeling, recognizing, and explaining apparent personality from videos. IEEE Transactions on Affective Computing, 2020.
  • [5] F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, P. Laukka, S. Narayanan, and K. Truong. The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2):190–202, 2015.
  • [6] F. Eyben, M. Wöllmer, and B. Schuller. Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia, pages 1459–1462, 2010.
  • [7] J. Han, Z. Zhang, G. Keren, and B. Schuller. Emotion recognition in speech with latent discriminative representations learning. Acta Acustica united with Acustica, 104(5):737–740, 2018.
  • [8] J. Han, Z. Zhang, Z. Ren, and B. W. Schuller. Emobed: Strengthening monomodal emotion recognition via training with crossmodal emotion embeddings. IEEE Transactions on Affective Computing, 2019.
  • [9] E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92, 2015.
  • [10] J. C. S. Jacques Junior, Y. Güçlütürk, M. Pérez, U. Güçlü, C. Andujar, X. Baró, H. J. Escalante, I. Guyon, M. A. J. Van Gerven, and R. Van Lier. First impressions: A survey on vision-based apparent personality trait analysis. IEEE Transactions on Affective Computing, pages 1–20, 2019.
  • [11] J. J. Junior, Y. Güçlütürk, M. Pérez, U. Güçlü, X. Baró, H. Escalante, I. Guyon, M. V. Gerven, R. Lier, and S. Escalera. First impressions: A survey on vision-based apparent personality trait analysis. arXiv: Computer Vision and Pattern Recognition, 2018.
  • [12] O. Kampman, E. J. Barezi, D. Bertero, and P. Fung. Investigating audio, visual, and text fusion methods for end-to-end automatic personality prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 606–611, 2018.
  • [13] C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan. Learning consistent feature representation for cross-modal multimedia retrieval. IEEE Transactions on Multimedia, 17(3):370–381, 2015.
  • [14] H. Kaya, F. Gurpinar, and A. A. Salah. Multimodal score fusion and decision trees for explainable automatic job candidate screening from video cvs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–9, 2017.
  • [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint, arXiv:1412.6980, 2014.
  • [16] Y. Li, J. Wan, Q. Miao, S. Escalera, H. Fang, H. Chen, X. Qi, and G. Guo. Cr-net: A deep classification-regression network for multimodal apparent personality analysis. International Journal of Computer Vision, pages 1–18, 2020.
  • [17] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 689–696, 2011.
  • [18] V. Ponce-López, B. Chen, M. Oliu, C. Corneanu, A. Clapés, I. Guyon, X. Baró, H. J. Escalante, and S. Escalera. Chalearn lap 2016: First round challenge on first impressions - dataset and results. In Computer Vision – ECCV 2016 Workshops, Lecture Notes in Computer Science, pages 400–418, 2016.
  • [19] C. Raffel and D. P. W. Ellis. Feed-forward networks with attention can solve some long-term memory problems. arXiv preprint, arXiv:1512.08756, 2016.
  • [20] Y.-H. H. Tsai, P. P. Liang, A. Zadeh, L.-P. Morency, and R. Salakhutdinov. Learning factorized multimodal representations. In International Conference on Learning Representations (ICLR), pages 1–20, 2019.
  • [21] B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen. Adversarial cross-modal retrieval. In Proceedings of the 25th ACM international conference on Multimedia, pages 154–162, 2017.
  • [22] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5022–5030, 2019.
  • [23] J. Willis and A. Todorov. First impressions making up your mind after a 100-ms exposure to a face. Psychological Science, 17(7):592–598, 2006.
  • [24] M. Wimmer, B. Schuller, D. Arsic, G. Rigoll, and B. Radig. Low-level fusion of audio, video feature for multi-modal emotion recognition. In Proceedings of the Third International Conference on Computer Vision Theory and Applications (VISAPP), pages 145–151, 2008.
  • [25] M. Wöllmer, A. Metallinou, F. Eyben, B. Schuller, and S. S. Narayanan. Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional lstm modeling. In 11th Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 1–4, 2010.
  • [26] W. Xintao, Y. Ke, D. Chao, and L. C. Chen. Recovering realistic texture in image super-resolution by deep spatial feature transform. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [27] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1103–1114, 2017.
  • [28] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1103–1114, 2017.
  • [29] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE transactions on pattern analysis and machine intelligence, 31(1):39–58, 2008.
  • [30] L. Zhang, S. Peng, and S. Winkler. PersEmoN: A deep network for joint analysis of apparent personality, emotion and their relationship. IEEE Transactions on Affective Computing, pages 1–10, 2019.