Multi-Level Attention Network Using Text, Audio and Video For Depression Prediction
Anupama Ray, IBM Research, India (anupamar@in.ibm.com)
Siddharth Kumar, IIIT Sricity, India (siddharth.k16@iiits.in)
Rutvik Reddy, IIIT Sricity, India (rutvikreddy.v16@iiits.in)
ABSTRACT

Depression has been the leading cause of mental-health illness worldwide. Major depressive disorder (MDD) is a common mental health disorder that affects people both psychologically and physically and could lead to loss of lives. Due to the lack of diagnostic tests and the subjectivity involved in detecting depression, there is growing interest in using behavioural cues to automate depression diagnosis and stage prediction. The absence of labelled behavioural datasets for such problems and the huge amount of variation possible in behaviour make the problem more challenging. This paper presents a novel multi-level attention based network for multimodal depression prediction that fuses features from audio, video and text modalities while learning the intra- and inter-modality relevance. The multi-level attention reinforces overall learning by selecting the most influential features within each modality for the decision making. We perform exhaustive experimentation to create different regression models for the audio, video and text modalities. Several fusion models with different configurations are constructed to understand the impact of each feature and modality. We outperform the current baseline by 17.52% in terms of root mean squared error.

CCS CONCEPTS
• Computing methodologies → Machine Learning; Neural networks.

KEYWORDS
attention networks; long short term memory; depression prediction; multimodal learning

ACM Reference Format:
Anupama Ray, Siddharth Kumar, Rutvik Reddy, Prerana Mukherjee, and Ritu Garg. 2019. Multi-level Attention network using text, audio and video for Depression Prediction. In 9th International Audio/Visual Emotion Challenge and Workshop (AVEC '19), October 21, 2019, Nice, France. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3347320.3357697

1 INTRODUCTION

Depression is one of the common mental health disorders; according to the WHO, 300 million people around the world have depression¹. It is a leading cause of mental disability, has tremendous psychological and pharmacological effects, and can in the worst case lead to suicide. A big barrier to effective treatment of MDD and its care is inaccurate assessment, due to the subjectivity involved in the assessment procedure. Most assessment procedures rely on questionnaires such as the Patient Health Questionnaire Depression Scale (PHQ), the Hamilton Depression Rating Scale (HDRS), or the Beck Depression Inventory (BDI). All of these questionnaires used in screening rely on the patient's responses, which are often not very reliable due to the subjective issues of the individual. The symptoms of MDD are covert: some individuals complain a lot in general even without having mild depression, whereas the most severely depressed patients may not speak much in the screening test. It is therefore very challenging to diagnose early depression, and people are often misdiagnosed and prescribed antidepressants. Unlike physical ailments, there are no straightforward diagnostic tests for depression, and clinicians have to routinely screen individuals to determine the type of depression, whether clinical or chronic. Studies have shown that around 70% of sufferers from MDD have consulted a medical practitioner [6]. Most practitioners follow the gold-standard Patient Health Questionnaire [24], which has questions to check for symptoms such as fatigue, sleep struggles, appetite issues etc. Diagnosis is based on the judgement of the practitioner (which could be biased by past education or past experience). Often there are false positives or false negatives with the PHQ screening, which lead to misjudgement in diagnosis.

¹ https://www.verywellmind.com/depression-statistics-everyone-should-know-4159056

The huge need for depression detection and the challenges involved have motivated the affective computing research community to use behavioural cues to learn to predict depression, post-traumatic stress disorder, and related mental disorders [40]. Behavioural cues such as facial expressions and prosodic features from speech have proven to be excellent features for depression prediction [9, 34].

In this paper, we present a novel framework that invokes an attention mechanism at several layers to identify and extract important features from different modalities to predict the level of depression. The network uses several low-level and mid-level features from both the audio and video modalities, as well as sentence embeddings on the speech-to-text output of the participants. We show that attention
at different levels gives us the ratio of importance of each feature and modality, leading to better results. We perform several experiments on each feature from the different modalities and combine several modalities. Our best performing network is the all-feature fusion network, which outperforms the baseline by 17.52%. The individual feature-based attention network outperforms the baseline by 20.5%, and the attention based text model outperforms the state-of-the-art by 8.95%, the state-of-the-art network being an attention based text transcription network as well [32].

1.1 Contributions

The key contributions of this work are as follows:
• Attention based fusion network: We present a novel feature fusion framework that utilizes several layers of attention to understand the importance of intra-modality and inter-modality features for the prediction of depression levels.
• The proposed approach outperforms the baseline fusion network by - on root mean square error.
• An improved attention based network trained on all three modalities which outperforms the baseline by 17.52%.

The remainder of the paper is organized as follows: Section 2 presents the state-of-the-art methods for depression classification. We present a brief overview of the proposed multi-level attention network in Section 3, followed by a brief description of the dataset used. In Section 4, the detailed methodology for each model built on individual features or fusion is described. Section 5 explains the results of all the models and presents all ablation studies, followed by discussions and future work in Section 7.

2 RELATED WORK

In this section, we briefly review the various works done in the context of distress analysis using multimodal inputs such as text, speech, facial emotions and multimodal sentiment analysis.

2.1 Depression Detection from Speech

Speech, more specifically non-verbal paralinguistic cues, has gained significant popularity in distress prediction and similar tasks for two main reasons. First, clinicians use speech traits such as diminished prosody, less or monotonous verbal activity production and energy in speech as important markers in the diagnosis of distress. Second, speech being an easy signal to record (non-invasive and non-intrusive) makes it the best candidate for automation tasks [10]. Cummins et al. [10] provide an exhaustive review of depression and suicide risk assessment using speech analysis. They investigate the usage of vocal biomarkers to associate clinical scores with signs of depression. In [38], the authors perform distress assessment on speech signals to infer the emotional information expressed while speaking. This amounts to quantifying various expressions such as anger, valence, arousal, dominance etc. In [27], the authors provide a comparative study of the effect of noise and reverberation on depression prediction using mel-frequency cepstral coefficients (MFCCs) and damped oscillator cepstral coefficients (DOCCs). Cummins et al. [11] investigate changes in the spectral and energy densities of speech signals for depression prediction. They analyze the acoustic variability in terms of weighted variance, trajectory of speech signals and volume to measure depression. The cross-cultural and cross-linguistic characteristics of depressed speech using vocal biomarkers are studied in [1]. In [43], the authors study the neurocognitive changes influencing dialogue delivery and semantics. Semantic features are encoded using a sparse lexical embedding space, and context is drawn from the subject's past clinical history.

2.2 Facial Emotion Analysis

Although the inherent relationship between verbal content and mental illness level is more prominent, visual features also play a pivotal role in establishing the deep association of depression with facial emotions. It has been observed that patients suffering from depression often have distorted facial expressions, e.g. eyebrow twitching, dull smile, frowning faces, aggressive looks, restricted lip movements, reduced eye blinks etc. With the quantum of proliferating video data and the availability of high-end built-in cameras in wearables and surveillance sectors, analyzing facial emotions and sentiments is a growing trend in the vision community. In [31], the authors utilize convolutional multiple kernel learning approaches for emotion recognition and sentiment analysis in videos. Dalili et al. conducted a thorough meta-analysis of the association between facial emotion recognition and depression [12]. In [29], the authors rely on temporal LSTM based techniques for capturing contextual information from videos in a sentiment analysis task. Valstar et al. introduced the Facial Expression Recognition and Analysis challenge (FERA 2017) dataset [41] to estimate head pose movements and identify the corresponding action units, which requires quantifying facial expressions in challenging scenarios. Ebrahimi et al. introduced the Emotion Recognition in the Wild (EmotiW) Challenge dataset [14] and utilize a hybrid convolutional neural network-recurrent neural network (CNN-RNN) framework for facial expression analysis. These datasets have been crucial in advancing the state-of-the-art in facial expression recognition and distress prediction. In [4], the authors present OpenFace, an open-source interactive tool for estimating facial behaviour. It is a widely popular tool, and in this paper we use the features extracted from OpenFace as the low-level video features. OpenFace gives us features for face landmark regions, head pose estimation and eye gaze estimation, and converts them into reliable facial action units. In [26], the authors present a meta-analysis of facial gestures to identify schizophrenia-based event triggers. In [20], the authors provide a meta-analysis on attention deficit hyperactivity disorder (ADHD) and the dysregulation of children's emotions, and attempt to establish a coherent link between ADHD and emotional reactions.

2.3 Depression Quotient Detection in Text

Along with video and audio, the verbal content of what a person speaks is critically important for diagnosing depression and stress. With the surge of social media usage, there is a lot of textual data flowing in from social media, which has given researchers the opportunity to analyze distress from text. Such data could help in sentiment analysis and provide insights into sudden aberrations in the personality traits of a user as reflected in one's posts. In [37], the authors leverage social media platforms to detect depression by harnessing social media data. They categorize the tweets gathered
from the Twitter API into depression and non-depression data. They extract various feature groups correlated to six depression levels and further utilize multimodal depressive dictionary learning for online behaviour prediction of the Twitter user base. In [7], the authors inspect tweet content ranging from common themes to trivial mentions of depression to classify them into relevant categories of distress disorder levels. In [22], the authors present a sentiment analysis framework using social media data and mine patterns based on emotion theory concepts and natural language processing techniques. Ansari et al. propose a Markovian model to detect depression utilizing content ratings provided by human subjects [3]. The users are presented with a series of contents and asked to rate them; based on the reactions captured and the tendency to skip content, a depression level is associated with the event. In [23], the authors examine the onset of triggered events for mental illness, specifically stress and depression, based on social media data encompassing different ethnic backgrounds. Tong et al. [39] utilize a novel classifier, inverse boosting pruning trees, to mine online social behaviour, which enables depression detection at early stages. In [21], the authors adopt clustering techniques to quantify anxiety and depression indices on questionnaire textual data, and further investigate the correlation amongst anxiety, depression and social data. For most social media data, the text analysis is done on short text, and these classifiers do not work well in a conversational setting such as the counselling/screening sessions considered here.

In contrast to these works, we fuse information from all three modalities and learn the attention weights to find the ratios of importance of each modality. The work by Qureshi et al. [32] is the closest to ours, in terms of using a subset of the dataset we use and applying attention at one layer. By using multiple layers of attention at several levels, we obtain much better results than them, and the network is computationally less expensive due to the attention operations, thus minimizing the test time of the framework.

3 PROPOSED ATTENTION NETWORK FOR HYBRID FUSION

A block diagram of the proposed multi-layer attention network is shown in Figure 1. The attention layer over each modality teaches the network to attend to the most important features within the modality to create the context feature for that modality. The context features of each modality are passed through two layers of feedforward networks, and the outputs of these 3 feedforward networks are fused in another stacked BLSTM. The outputs of the 3 feedforward networks contain the most important features per modality and are fused to form another concatenation vector with an attention layer on top of it. The output of this attention layer is multiplied by the output of the stacked Bi-LSTM and passed through the regressor. The loss of the regressor is back-propagated to train the weights learned at each level of the network, ensuring end-to-end training.
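To make the mechanism concrete, the following is a minimal PyTorch sketch of the multi-level attention idea described above, assuming a simple additive (feedforward-scored) attention and 128-dimensional modality projections. The stacked-BLSTM fusion branch that is multiplied with the attention output in the full model is omitted for brevity, so this is an illustration under stated assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Scores each element of a sequence and returns a weighted context vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                            # x: (batch, steps, dim)
        weights = F.softmax(self.score(x), dim=1)    # (batch, steps, 1)
        context = (weights * x).sum(dim=1)           # (batch, dim)
        return context, weights

class MultiLevelAttentionFusion(nn.Module):
    """Intra-modality attention -> per-modality feedforward -> inter-modality
    attention -> regressor. A simplified sketch; the 128-unit projections and
    single regression head are assumptions."""
    def __init__(self, text_dim, audio_dim, video_dim, hidden=128):
        super().__init__()
        dims = {'text': text_dim, 'audio': audio_dim, 'video': video_dim}
        self.intra = nn.ModuleDict({m: AdditiveAttention(d) for m, d in dims.items()})
        self.ff = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
            for m, d in dims.items()
        })
        self.inter = AdditiveAttention(hidden)
        self.regressor = nn.Linear(hidden, 1)        # predicts the PHQ score

    def forward(self, text_seq, audio_seq, video_seq):
        feats = {'text': text_seq, 'audio': audio_seq, 'video': video_seq}
        contexts = []
        for m, seq in feats.items():
            ctx, _ = self.intra[m](seq)              # attend within the modality
            contexts.append(self.ff[m](ctx))         # two feedforward layers per modality
        stacked = torch.stack(contexts, dim=1)       # (batch, 3, hidden)
        fused, modality_weights = self.inter(stacked)  # attend across modalities
        return self.regressor(fused), modality_weights
```

In this sketch, the weights returned by the inter-modality attention are what would provide the per-modality importance ratios reported later for the final fused model.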
words while training a neural network for language modeling or other predictions. We used the pretrained Universal Sentence Encoder [8] to get sentence embeddings. To obtain tensors of constant size, we zero-pad shorter sentences and keep a constant number of timesteps of 400. The length of each sentence embedding vector is 512, making the final array dimension (400, 512). We used a 2-layer stacked Bidirectional Long Short-Term Memory (BLSTM) network with the sentence embeddings as input and PHQ scores as output to train a regression model on the speech transcriptions. Each BLSTM layer has 200 hidden units, wherein the output of each hidden unit of the forward layer of the first BLSTM layer is connected to the input of the corresponding forward hidden unit of the second layer. The same connections are built for each hidden unit in the backward layers to create the stacking. The two layers of BLSTM give an output of (batchsize, 400) at each timestep, and this is sent as input to a feedforward network for regression. We kept the number of nodes in the feedforward layers as (500, 100, 60, 1) and used rectified linear units as the activation function.
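A minimal PyTorch sketch of this text regression model follows. The layer sizes match the description above; how the BLSTM sequence output is reduced to a single vector before the feedforward head is not fully specified, so taking the final timestep here is an assumption, as are training details such as the loss and optimizer.

```python
import torch
import torch.nn as nn

class TextRegressor(nn.Module):
    """Universal Sentence Encoder embeddings (padded to 400 x 512) ->
    2-layer stacked BiLSTM with 200 units -> feedforward head (500, 100, 60, 1)
    with ReLU activations. Reduction to the last timestep is an assumption."""
    def __init__(self, emb_dim=512, hidden=200):
        super().__init__()
        self.blstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 500), nn.ReLU(),
            nn.Linear(500, 100), nn.ReLU(),
            nn.Linear(100, 60), nn.ReLU(),
            nn.Linear(60, 1),
        )

    def forward(self, x):             # x: (batch, 400, 512) sentence embeddings
        out, _ = self.blstm(x)        # (batch, 400, 400); 400 = 2 * 200 hidden units
        return self.head(out[:, -1])  # regress the PHQ score from the final step

# Usage sketch:
# model = TextRegressor()
# phq = model(torch.randn(8, 400, 512))   # -> (8, 1) predicted scores
```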
4.2 Audio

For the audio modality we created models using different audio features (low-level features as well as their functionals). As functionals, the arithmetic mean and coefficient of variation are applied on the low-level features and used as a knowledge abstraction on top of them [36]. The vocal timbre is encoded by low-level descriptor features such as Mel-Frequency Cepstral Coefficients (MFCC) [16], and studies [5, 16] show that the lower order MFCCs are more important for affect/emotion prediction and paralinguistic voice analysis tasks. The extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) contains 88 features, which include the GeMAPS as well as spectral features and their functionals. The GeMAPS feature set consists of frequency related features (pitch, jitter, formants), energy related features (shimmer, loudness, harmonic-to-noise ratio), spectral parameters (alpha ratio, Hammarberg ratio, spectral slope 0-500 Hz and 500-1500 Hz, formant 1, 2, 3 relative energy), the harmonic differences H1-H2 and H1-A3, their functionals, and six temporal features related to speech rate [15]. Apart from these low-level features, a high-dimensional deep representation of the audio sample is extracted by passing the audio through a Deep Spectrum and a VGG network. This feature is referred to as the deep densenet feature in the rest of the paper.

For the audio features, only the spans where the participant was speaking were considered in our experimentation. Each of these features was available as part of the challenge data, and they have different sampling rates. The functional audio and deep densenet features are sampled at 1 Hz, whereas the Bag-of-AudioWords (BoAW) [35] is sampled at 10 Hz and the low-level audio descriptors are sampled at 100 Hz. The length of the low-level MFCC features and the low-level eGeMAPS is 39 and 23 respectively, and the total number of time steps is 140500 for both. For the functionals, however, the lengths are 78 and 88 respectively, with 1300 and 1410
timesteps. The BoAW-MFCC and BoAW-eGeMAPS features are of length 100 with 14050 timesteps each. The deep densenet features are 1920-dimensional with 1415 timesteps.

For the individual audio modality, we trained another stacked BLSTM network with two layers, each having 200 hidden units. We take the last layer output and pass it to a multi-layer perceptron with layers of (500, 100, 60, 1) nodes and Rectified Linear Units as the activation function.
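As a concrete illustration of the functionals mentioned in this subsection, the arithmetic mean and coefficient of variation over a frame-wise low-level descriptor matrix can be computed as in the sketch below. The epsilon guard and the windowing into 1 Hz segments are assumptions, since the challenge provides these features precomputed.

```python
import numpy as np

def functionals(lld, eps=1e-8):
    """Arithmetic mean and coefficient of variation of low-level descriptors.

    lld: array of shape (timesteps, n_features), e.g. frame-wise MFCCs.
    Returns a vector of shape (2 * n_features,): [means, coefficients of variation].
    """
    mean = lld.mean(axis=0)
    std = lld.std(axis=0)
    coef_var = std / (np.abs(mean) + eps)   # guard against division by zero
    return np.concatenate([mean, coef_var])

# Example: 39-dim frame-wise MFCCs -> 78-dim functional vector,
# matching the 78-dimensional functional MFCC length mentioned above.
mfcc_frames = np.random.randn(1000, 39)
print(functionals(mfcc_frames).shape)       # (78,)
```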
4.3 Visual Modality

For the video features available in the challenge dataset, we experimented with both the low-level features and the functionals of the low-level video descriptors provided. We observed similar performance when comparing the low-level descriptors with their functionals. Since deep LSTM networks can also learn similar properties from the data (like functionals and more abstract information), we chose to use the low-level descriptors, as they carry more information than their mean and standard deviation. The low-level descriptor features for pose, gaze and facial action units (FAUs) are sampled at 10 Hz. The lengths of these features are 6, 8 and 35 respectively, all with 15000 timesteps. The Bag-of-VisualWords (BoVW) is also provided in the challenge data and has a length of 100 with 15000 timesteps. We use these features to train an individual model per feature, each having a single layer of 200 BLSTM hidden units followed by maxpooling, and then learn a regressor. We experimented with the sum of all outputs, the mean of the outputs and maxpooling as three alternatives, but maxpooling worked best, so we use maxpooling over the LSTM outputs.
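The per-feature video models described above can be sketched as follows in PyTorch; the selectable pooling mirrors the sum/mean/max comparison, and the single linear regression head is a simplifying assumption.

```python
import torch
import torch.nn as nn

class VisualFeatureRegressor(nn.Module):
    """Per-feature video model sketch: one BLSTM layer (200 units), temporal
    pooling over the LSTM outputs, then a regressor. Max pooling worked best
    in the experiments described above."""
    def __init__(self, feat_dim, hidden=200, pooling="max"):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.pooling = pooling
        self.regressor = nn.Linear(2 * hidden, 1)

    def forward(self, x):                 # x: (batch, timesteps, feat_dim)
        out, _ = self.blstm(x)            # (batch, timesteps, 400)
        if self.pooling == "max":
            pooled = out.max(dim=1).values
        elif self.pooling == "mean":
            pooled = out.mean(dim=1)
        else:                             # "sum"
            pooled = out.sum(dim=1)
        return self.regressor(pooled)

# One model per feature stream, e.g. pose (6), gaze (8), FAUs (35), BoVW (100).
gaze_model = VisualFeatureRegressor(feat_dim=8)
```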
4.4 Fusion of Modalities

Standard procedures of early fusion are computationally expensive and can lead to overfitting when trained using neural networks. Thus, late fusion and hybrid fusion models have become more prevalent. We propose a multi-layer attention based network that learns the importance of each feature and weighs them accordingly, leading to better early fusion and prediction. Such an attention network gives us insights into which features in a modality are more influential in learning. It also gives an understanding of the ratio of contribution of each modality towards the prediction.

Towards the fusion, we did several experiments within each modality and across modalities. First, we fuse the low-level descriptors of the video modality. We take the gaze, pose and facial action unit features, pass them through a single layer of 200 BLSTM cells and apply attention over them. The output of the attention layer is passed through another BLSTM layer with 200 cells. We take the global max pool of this LSTM output and pass it through a feed-forward network with 128 hidden units. We call this fusion model the videoLLD-fused model in Table 2. Second, we combine the low-level video features with the BoVWs and use a similar network of a 200 hidden unit BLSTM layer followed by attention and another BLSTM, which is then passed through a feed-forward layer for regression. We call this fusion model video-BoVW fused in Table 2.

The third fusion model is created using the attention vector output from the video modality and the output of the text modality. These two outputs are combined and passed through a stacked BLSTM and an attention layer prior to the video regressor. This fused model is referred to as Video-text fusion in Table 2.

The fourth fusion model uses the audio and text modalities together. Again we take the output of the attention layer at each modality and build a hybrid fused network by passing them through two routes. In the first route, the outputs of attention are concatenated and passed through attention followed by a feed-forward layer, and the regression loss is propagated. In the second route, the outputs of both attention layers are passed through a stacked BLSTM of 2 layers, each with 200 cells. An attention layer is applied on top of the stacked BLSTM layers, and its output is fed to a feed-forward network of 128 hidden units. This network is called Audio-Text fused in Table 2 and naturally performs better than the standalone audio model due to the use of the text features, which led to the best results.

Our fifth fusion model uses the video and text modalities together, and here we again use an attention layer at each sub-modality of the video inputs and then combine them with the text modality using another attention network over video and text. Quite surprisingly, the results of this fusion are very similar to the audio and text modality fusion, and the learning curve also ended up being quite similar. This network, called Video-Text fused in Table 2, has one route that runs through a Bi-LSTM network of 200 units for each sub-modality; then, for each time-step, the outputs are fused together using an attention layer over all the incoming video features (gaze, pose, AUs and BoVW), which goes through another Bi-LSTM with 200 units to extract the contextual information from within the fused features.

Our sixth and final fused model uses all the modalities together. We use the attention based visual modalities to obtain a 128-unit vector, we use the attention based audio modalities to obtain a 128-unit vector, and we extract the information from the transcript modality to derive a 128-unit vector. We again use another attention layer over these 3 modalities (video, audio and text) to fuse them together and regress for the PHQ8 score. There were several challenges in training this fused model. We hypothesise that the error function has several local minima, which makes it more difficult to reach the global minimum. On testing the model with individual modalities, we observed that both the video feature model and the audio feature model have a much steeper descent than the ASR (text) model; on fusion, the model often got stuck in the minima of the video and audio features, which are both quite close. To mitigate this and "nudge" the model towards a minimum that follows the path of the minimum reached by the ASR transcripts, we multiply the final outputs of the attention layer element-wise with a variable vector initialised with values in the reciprocal ratios of the RMSE loss of each individual modality, in order to prioritize the text modality initially. This led to a stable decline of the training and validation loss, more stable than for the individual modalities, and the final attention scores are indicative of the contributions of each individual modality. Upon convergence, the attention ratios were [0.21262352, 0.21262285, 0.57475364] for video, audio and text respectively.
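One way to read the "nudge" described above is as a trainable per-modality scaling vector initialised from the reciprocals of each modality's standalone RMSE, as in the sketch below. The module and its name are illustrative, the normalisation of the initial values and the exact point at which the scaling is applied are assumptions, and the RMSE values in the example are placeholders rather than the paper's numbers.

```python
import torch
import torch.nn as nn

class ModalityNudge(nn.Module):
    """Sketch of the nudge: the per-modality attention outputs are multiplied
    element-wise by a trainable vector initialised with the reciprocal ratios
    of each modality's individual RMSE, so the (stronger) text modality
    dominates early in training. Normalising the initial values is an assumption."""
    def __init__(self, modality_rmse):                 # e.g. [video, audio, text] RMSEs
        super().__init__()
        inv = torch.tensor([1.0 / r for r in modality_rmse])
        init = inv / inv.sum()                         # normalised reciprocal ratios
        self.scale = nn.Parameter(init)                # trainable, one weight per modality

    def forward(self, attended):                       # attended: (batch, 3, dim)
        return attended * self.scale.view(1, -1, 1)    # element-wise per-modality scaling

# Example with placeholder per-modality RMSEs (video, audio, text):
nudge = ModalityNudge([7.0, 7.0, 5.0])
out = nudge(torch.randn(4, 3, 128))
```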
Table 1: Regression of PHQ score in terms of RMSE and MAE for each feature within each modality