Multi-Level Attention Network Using Text, Audio and Video For Depression Prediction
Anupama Ray, IBM Research, India (anupamar@in.ibm.com)
Siddharth Kumar, IIIT Sricity, India (siddharth.k16@iiits.in)
Rutvik Reddy, IIIT Sricity, India (rutvikreddy.v16@iiits.in)
ABSTRACT

Depression has been the leading cause of mental-health illness worldwide. Major depressive disorder (MDD) is a common mental health disorder that affects people both psychologically and physically and could lead to loss of lives. Due to the lack of diagnostic tests and the subjectivity involved in detecting depression, there is growing interest in using behavioural cues to automate depression diagnosis and stage prediction. The absence of labelled behavioural datasets for such problems and the huge amount of variation possible in behaviour make the problem more challenging. This paper presents a novel multi-level attention based network for multimodal depression prediction that fuses features from audio, video and text modalities while learning the intra- and inter-modality relevance. The multi-level attention reinforces overall learning by selecting the most influential features within each modality for the decision making. We perform exhaustive experimentation to create different regression models for the audio, video and text modalities. Several fusion models with different configurations are constructed to understand the impact of each feature and modality. We outperform the current baseline by 17.52% in terms of root mean squared error.

CCS CONCEPTS
• Computing methodologies → Machine Learning; Neural networks.

KEYWORDS
attention networks; long short term memory; depression prediction; multimodal learning

ACM Reference Format:
Anupama Ray, Siddharth Kumar, Rutvik Reddy, Prerana Mukherjee, and Ritu Garg. 2019. Multi-level Attention network using text, audio and video for Depression Prediction. In 9th International Audio/Visual Emotion Challenge and Workshop (AVEC '19), October 21, 2019, Nice, France. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3347320.3357697

1 INTRODUCTION

Depression is one of the common mental health disorders; according to the WHO, 300 million people around the world have depression¹. It is a leading cause of mental disability, has tremendous psychological and pharmacological effects, and can in the worst case lead to suicide. A big barrier to effective treatment of MDD and its care is inaccurate assessment, due to the subjectivity involved in the assessment procedure. Most assessment procedures rely on questionnaires such as the Patient Health Questionnaire Depression Scale (PHQ), the Hamilton Depression Rating Scale (HDRS), or the Beck Depression Inventory (BDI). All of these questionnaires used in screening rely on the patient's responses, which are often not very reliable due to the subjective issues of the individual. The symptoms of MDD are covert: some individuals complain a lot in general even without having mild depression, whereas the most severely depressed patients may not speak much in the screening test. It is therefore very challenging to diagnose early depression, and people are often misdiagnosed and prescribed antidepressants. Unlike physical ailments, there are no straightforward diagnostic tests for depression, and clinicians have to routinely screen individuals to determine the type of depression, whether clinical or chronic. Studies have shown that around 70% of sufferers from MDD have consulted a medical practitioner [6]. Most practitioners follow the gold-standard Patient Health Questionnaire [24], which has questions to check for symptoms such as fatigue, sleep struggles, appetite issues etc. Diagnosis is based on the judgement of the practitioner (which could be biased by past education or past experience). Often there are false positives or false negatives with the PHQ screening, which lead to misjudgement in diagnosis.

¹ https://www.verywellmind.com/depression-statistics-everyone-should-know-4159056

The huge need for depression detection and the challenges involved have motivated the affective computing research community to use behavioural cues to learn to predict depression, post-traumatic stress disorder, and related mental disorders [40]. Behavioural cues such as facial expressions and prosodic features from speech have proven to be excellent features for depression prediction [9, 34].

In this paper, we present a novel framework that invokes an attention mechanism at several layers to identify and extract important features from different modalities to predict the level of depression. The network uses several low-level and mid-level features from both the audio and video modalities, as well as sentence embeddings on the speech-to-text output of the participants. We show that attention
at different levels gives us the ratio of importance of each feature and modality, leading to better results. We perform several experiments on each feature from the different modalities and combine several modalities. Our best performing network is the all-feature fusion network, which outperforms the baseline by 17.52%. The individual feature-based attention network outperforms the baseline by 20.5%, and the attention based text model outperforms the state-of-the-art by 8.95%, the state-of-the-art network being an attention based text transcription network as well [32].

1.1 Contributions

The key contributions of this work are as follows:
• Attention based fusion network: We present a novel feature fusion framework that utilizes several layers of attention to understand the importance of intra-modality and inter-modality features for the prediction of depression levels.
• The proposed approach outperforms the baseline fusion network by - on root mean square error.
• An improved attention based network trained on all three modalities which outperforms the baseline by 17.52%.

The remainder of the paper is organized as follows: Section 2 presents the state-of-the-art methods for depression classification. We present a brief overview of the proposed multi-level attention network in Section 3, followed by a brief description of the dataset used. In Section 4, the detailed methodology for each model built on individual features or fusion is described. Section 5 explains the results of all the models and presents all ablation studies, followed by discussions and future work in Section 7.

2 RELATED WORK

In this section, we briefly review the various works done in the context of distress analysis using multimodal inputs such as text, speech, facial emotions and multimodal sentiment analysis.

2.1 Depression Detection from Speech

Speech, more specifically non-verbal paralinguistic cues, has gained significant popularity in distress prediction and similar tasks for two main reasons. First, clinicians use speech traits such as diminished prosody, less or monotonous verbal activity production and energy in speech as important markers in the diagnosis of distress. Second, speech being an easy signal to record (non-invasive and non-intrusive) makes it the best candidate for automation tasks [10]. Cummins et al. [10] provide an exhaustive review of depression and suicide risk assessment using speech analysis. They investigate the usage of vocal biomarkers to associate clinical scores with signs of depression. In [38], the authors perform distress assessment on speech signals to infer the emotional information expressed while speaking. This amounts to quantifying various expressions such as anger, valence, arousal, dominance etc. In [27], the authors provide a comparative study of the effect of noise and reverberation on depression prediction using mel-frequency cepstral coefficients (MFCCs) and damped oscillator cepstral coefficients (DOCCs). Cummins et al. [11] investigate changes in the spectral and energy densities of speech signals for depression prediction. They analyze the acoustic variability in terms of weighted variance, trajectory of speech signals and volume to measure depression. The cross-cultural and cross-linguistic characteristics of depressed speech using vocal biomarkers are studied in [1]. In [43], the authors study the neurocognitive changes influencing dialogue delivery and semantics. Semantic features are encoded using a sparse lexical embedding space, and context is drawn from the subject's past clinical history.

2.2 Facial Emotion Analysis

Although the inherent relationship between verbal content and mental illness level is more prominent, visual features also play a pivotal role in establishing the deep association of depression with facial emotions. It has been observed that patients suffering from depression often have distorted facial expressions, e.g. eyebrow twitching, dull smile, frowning faces, aggressive looks, restricted lip movements, reduced eye blinks etc. With the quantum of proliferating video data and the availability of high-end built-in cameras in wearables and surveillance sectors, analyzing facial emotions and sentiments is a growing trend in the vision community. In [31], the authors utilize convolutional multiple kernel learning approaches for emotion recognition and sentiment analysis in videos. Dalili et al. conducted a thorough meta-analysis of the association between facial emotion recognition and depression [12]. In [29], the authors rely on temporal LSTM based techniques for capturing contextual information from videos in a sentiment analysis task. Valstar et al. introduced the Facial Expression Recognition and Analysis challenge (FERA 2017) dataset [41] to estimate head pose movements and identify the corresponding action units, which requires quantifying facial expressions in challenging scenarios. Ebrahimi et al. introduced the Emotion Recognition in the Wild (EmotiW) Challenge dataset [14] and utilize a hybrid convolutional neural network-recurrent neural network (CNN-RNN) framework for facial expression analysis. These datasets have been crucial in advancing the state-of-the-art in facial expression recognition and distress prediction. In [4], the authors present OpenFace, an open-source interactive tool for estimating facial behaviour. It is a widely popular tool, and in this paper we use the features extracted from OpenFace as the low-level video features. OpenFace gives us features for face landmark regions, head pose estimation and eye gaze estimation, and converts them into reliable facial action units. In [26], the authors present a meta-analysis of facial gestures to identify schizophrenia-based event triggers. In [20], the authors provide a meta-analysis on attention deficit hyperactivity disorder (ADHD) and the dysregulation of children's emotions, and attempt to establish a coherent link between ADHD and emotional reactions.

2.3 Depression Quotient Detection in Text

Along with video and audio, the verbal content of what a person speaks is critically important for diagnosing depression and stress. With the surge of social media usage, there is a lot of textual data flowing in from social media, which has given researchers the opportunity to analyze distress from text. Such data could help in sentiment analysis and provide insights into sudden aberrations in the personality traits of a user as reflected in one's posts. In [37], the authors leverage social media platforms to detect depression by harnessing social media data. They categorize the tweets gathered
from the Twitter API into depression and non-depression data. They extract various feature groups correlated to six depression levels and further utilize multimodal depressive dictionary learning for online behaviour prediction of the Twitter user base. In [7], the authors inspect tweet content ranging from common themes to trivial mentions of depression to classify them into relevant categories of distress disorder levels. In [22], the authors present a sentiment analysis framework using social media data and mine patterns based on emotion theory concepts and natural language processing techniques. Ansari et al. propose a Markovian model to detect depression utilizing content ratings provided by human subjects [3]. The users are presented with a series of contents and asked to rate them; based on the reactions captured and the tendency to skip content, a depression level is associated with the event. In [23], the authors examine the onset of triggered events for mental illness, specifically stress and depression, based on social media data encompassing different ethnic backgrounds. Tong et al. [39] utilize a novel classifier, inverse boosting pruning trees, to mine online social behaviour, which enables depression detection at early stages. In [21], the authors adopt clustering techniques to quantify anxiety and depression indices on questionnaire textual data, and further investigate the correlation amongst anxiety, depression and social data. For most social media data, the text analysis is done on short text, and these classifiers do not work well in a conversational setting such as the counselling/screening sessions considered here.

In contrast to these works, we fuse information from all three modalities and learn the attention weights to find the ratios of importance of each modality. The work by Qureshi et al. [32] is the closest to ours, in terms of using a subset of the dataset we use and applying attention at one layer. By using multiple layers of attention at several levels, we obtain much better results than them, and the network is computationally less expensive due to the attention operations, thus minimizing the test time of the framework.

3 PROPOSED ATTENTION NETWORK FOR HYBRID FUSION

A block diagram of the proposed multi-layer attention network is shown in Figure 1. The attention layer over each modality teaches the network to attend to the most important features within the modality to create the context feature for that modality. The context features of each modality are passed through two layers of feedforward networks, and the outputs of these 3 feedforward networks are fused in another stacked BLSTM. The outputs of the 3 feedforward networks contain the most important features per modality and are fused to form another concatenation vector with an attention layer on top of it. The output of this attention layer is multiplied by the output of the stacked Bi-LSTM and passed through the regressor. The loss of the regressor is back-propagated to train the weights learned at each level of the network, ensuring end-to-end training.
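To make the mechanism concrete, the following is a minimal PyTorch sketch of the multi-level attention idea described above, assuming a simple additive (feedforward-scored) attention and 128-dimensional modality projections. The stacked-BLSTM fusion branch that is multiplied with the attention output in the full model is omitted for brevity, so this is an illustration under stated assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Scores each element of a sequence and returns a weighted context vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                            # x: (batch, steps, dim)
        weights = F.softmax(self.score(x), dim=1)    # (batch, steps, 1)
        context = (weights * x).sum(dim=1)           # (batch, dim)
        return context, weights

class MultiLevelAttentionFusion(nn.Module):
    """Intra-modality attention -> per-modality feedforward -> inter-modality
    attention -> regressor. A simplified sketch; the 128-unit projections and
    single regression head are assumptions."""
    def __init__(self, text_dim, audio_dim, video_dim, hidden=128):
        super().__init__()
        dims = {'text': text_dim, 'audio': audio_dim, 'video': video_dim}
        self.intra = nn.ModuleDict({m: AdditiveAttention(d) for m, d in dims.items()})
        self.ff = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
            for m, d in dims.items()
        })
        self.inter = AdditiveAttention(hidden)
        self.regressor = nn.Linear(hidden, 1)        # predicts the PHQ score

    def forward(self, text_seq, audio_seq, video_seq):
        feats = {'text': text_seq, 'audio': audio_seq, 'video': video_seq}
        contexts = []
        for m, seq in feats.items():
            ctx, _ = self.intra[m](seq)              # attend within the modality
            contexts.append(self.ff[m](ctx))         # two feedforward layers per modality
        stacked = torch.stack(contexts, dim=1)       # (batch, 3, hidden)
        fused, modality_weights = self.inter(stacked)  # attend across modalities
        return self.regressor(fused), modality_weights
```

In this sketch, the weights returned by the inter-modality attention are what would provide the per-modality importance ratios reported later for the final fused model.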
words while training a neural network for language modeling or other predictions. We used the pretrained Universal Sentence Encoder [8] to get sentence embeddings. To obtain tensors of constant size, we zero-pad shorter sentences and keep a constant number of timesteps of 400. The length of each sentence embedding vector is 512, making the final array dimension (400, 512). We used a 2-layer stacked Bidirectional Long Short-Term Memory (BLSTM) network with the sentence embeddings as input and PHQ scores as output to train a regression model on the speech transcriptions. Each BLSTM layer has 200 hidden units, wherein the output of each hidden unit of the forward layer of the first BLSTM layer is connected to the input of the corresponding forward hidden unit of the second layer. The same connections are built for each hidden unit in the backward layers to create the stacking. The two layers of BLSTM give an output of (batchsize, 400) at each timestep, and this is sent as input to a feedforward network for regression. We kept the number of nodes in the feedforward layers as (500, 100, 60, 1) and used rectified linear units as the activation function.
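A minimal PyTorch sketch of this text regression model follows. The layer sizes match the description above; how the BLSTM sequence output is reduced to a single vector before the feedforward head is not fully specified, so taking the final timestep here is an assumption, as are training details such as the loss and optimizer.

```python
import torch
import torch.nn as nn

class TextRegressor(nn.Module):
    """Universal Sentence Encoder embeddings (padded to 400 x 512) ->
    2-layer stacked BiLSTM with 200 units -> feedforward head (500, 100, 60, 1)
    with ReLU activations. Reduction to the last timestep is an assumption."""
    def __init__(self, emb_dim=512, hidden=200):
        super().__init__()
        self.blstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 500), nn.ReLU(),
            nn.Linear(500, 100), nn.ReLU(),
            nn.Linear(100, 60), nn.ReLU(),
            nn.Linear(60, 1),
        )

    def forward(self, x):             # x: (batch, 400, 512) sentence embeddings
        out, _ = self.blstm(x)        # (batch, 400, 400); 400 = 2 * 200 hidden units
        return self.head(out[:, -1])  # regress the PHQ score from the final step

# Usage sketch:
# model = TextRegressor()
# phq = model(torch.randn(8, 400, 512))   # -> (8, 1) predicted scores
```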
4.2 Audio

For the audio modality we created models using different audio features (low-level features as well as their functionals). As functionals, the arithmetic mean and coefficient of variation are applied on the low-level features and used as a knowledge abstraction on top of them [36]. The vocal timbre is encoded by low-level descriptor features such as Mel-Frequency Cepstral Coefficients (MFCC) [16], and studies [5, 16] show that the lower order MFCCs are more important for affect/emotion prediction and paralinguistic voice analysis tasks. The extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) contains 88 features, which include the GeMAPS as well as spectral features and their functionals. The GeMAPS feature set consists of frequency related features (pitch, jitter, formants), energy related features (shimmer, loudness, harmonic-to-noise ratio), spectral parameters (alpha ratio, Hammarberg ratio, spectral slope 0-500 Hz and 500-1500 Hz, formant 1, 2, 3 relative energy), the harmonic differences H1-H2 and H1-A3, their functionals, and six temporal features related to speech rate [15]. Apart from these low-level features, a high-dimensional deep representation of the audio sample is extracted by passing the audio through a Deep Spectrum and a VGG network. This feature is referred to as the deep densenet feature in the rest of the paper.

For the audio features, only the spans where the participant was speaking were considered in our experimentation. Each of these features was available as part of the challenge data, and they have different sampling rates. The functional audio and deep densenet features are sampled at 1 Hz, whereas the Bag-of-AudioWords (BoAW) [35] is sampled at 10 Hz and the low-level audio descriptors are sampled at 100 Hz. The length of the low-level MFCC features and the low-level eGeMAPS is 39 and 23 respectively, and the total number of time steps is 140500 for both. For the functionals, however, the lengths are 78 and 88 respectively, with 1300 and 1410
timesteps. The BoAW-MFCC and BoAW-eGeMAPS features are of length 100 with 14050 timesteps each. The deep densenet features are 1920-dimensional with 1415 timesteps.

For the individual audio modality, we trained another stacked BLSTM network with two layers, each having 200 hidden units. We take the last layer output and pass it to a multi-layer perceptron with layers of (500, 100, 60, 1) nodes and Rectified Linear Units as the activation function.
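As a concrete illustration of the functionals mentioned in this subsection, the arithmetic mean and coefficient of variation over a frame-wise low-level descriptor matrix can be computed as in the sketch below. The epsilon guard and the windowing into 1 Hz segments are assumptions, since the challenge provides these features precomputed.

```python
import numpy as np

def functionals(lld, eps=1e-8):
    """Arithmetic mean and coefficient of variation of low-level descriptors.

    lld: array of shape (timesteps, n_features), e.g. frame-wise MFCCs.
    Returns a vector of shape (2 * n_features,): [means, coefficients of variation].
    """
    mean = lld.mean(axis=0)
    std = lld.std(axis=0)
    coef_var = std / (np.abs(mean) + eps)   # guard against division by zero
    return np.concatenate([mean, coef_var])

# Example: 39-dim frame-wise MFCCs -> 78-dim functional vector,
# matching the 78-dimensional functional MFCC length mentioned above.
mfcc_frames = np.random.randn(1000, 39)
print(functionals(mfcc_frames).shape)       # (78,)
```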
4.3 Visual Modality

For the video features available in the challenge dataset, we experimented with both the low-level features and the functionals of the low-level video descriptors provided. We observed similar performance when comparing the low-level descriptors with their functionals. Since deep LSTM networks can also learn similar properties from the data (like functionals and more abstract information), we chose to use the low-level descriptors, as they carry more information than their mean and standard deviation. The low-level descriptor features for pose, gaze and facial action units (FAUs) are sampled at 10 Hz. The lengths of these features are 6, 8 and 35 respectively, all with 15000 timesteps. The Bag-of-VisualWords (BoVW) is also provided in the challenge data and has a length of 100 with 15000 timesteps. We use these features to train an individual model per feature, each having a single layer of 200 BLSTM hidden units followed by maxpooling, and then learn a regressor. We experimented with the sum of all outputs, the mean of the outputs and maxpooling as three alternatives, but maxpooling worked best, so we use maxpooling over the LSTM outputs.
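The per-feature video models described above can be sketched as follows in PyTorch; the selectable pooling mirrors the sum/mean/max comparison, and the single linear regression head is a simplifying assumption.

```python
import torch
import torch.nn as nn

class VisualFeatureRegressor(nn.Module):
    """Per-feature video model sketch: one BLSTM layer (200 units), temporal
    pooling over the LSTM outputs, then a regressor. Max pooling worked best
    in the experiments described above."""
    def __init__(self, feat_dim, hidden=200, pooling="max"):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.pooling = pooling
        self.regressor = nn.Linear(2 * hidden, 1)

    def forward(self, x):                 # x: (batch, timesteps, feat_dim)
        out, _ = self.blstm(x)            # (batch, timesteps, 400)
        if self.pooling == "max":
            pooled = out.max(dim=1).values
        elif self.pooling == "mean":
            pooled = out.mean(dim=1)
        else:                             # "sum"
            pooled = out.sum(dim=1)
        return self.regressor(pooled)

# One model per feature stream, e.g. pose (6), gaze (8), FAUs (35), BoVW (100).
gaze_model = VisualFeatureRegressor(feat_dim=8)
```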
4.4 Fusion of Modalities

Standard procedures of early fusion are computationally expensive and can lead to overfitting when trained using neural networks. Thus, late fusion and hybrid fusion models have become more prevalent. We propose a multi-layer attention based network that learns the importance of each feature and weighs them accordingly, leading to better early fusion and prediction. Such an attention network gives us insights into which features in a modality are more influential in learning. It also gives an understanding of the ratio of contribution of each modality towards the prediction.

Towards the fusion, we did several experiments within each modality and across modalities. First, we fuse the low-level descriptors of the video modality. We take the gaze, pose and facial action unit features, pass them through a single layer of 200 BLSTM cells and apply attention over them. The output of the attention layer is passed through another BLSTM layer with 200 cells. We take the global max pool of this LSTM output and pass it through a feed-forward network with 128 hidden units. We call this fusion model the videoLLD-fused model in Table 2. Second, we combine the low-level video features with the BoVWs and use a similar network of a 200 hidden unit BLSTM layer followed by attention and another BLSTM, which is then passed through a feed-forward layer for regression. We call this fusion model video-BoVW fused in Table 2.

The third fusion model is created using the attention vector output from the video modality and the output of the text modality. These two outputs are combined and passed through a stacked BLSTM and an attention layer prior to the video regressor. This fused model is referred to as Video-text fusion in Table 2.

The fourth fusion model uses the audio and text modalities together. Again we take the output of the attention layer at each modality and build a hybrid fused network by passing them through two routes. In the first route, the outputs of attention are concatenated and passed through attention followed by a feed-forward layer, and the regression loss is propagated. In the second route, the outputs of both attention layers are passed through a stacked BLSTM of 2 layers, each with 200 cells. An attention layer is applied on top of the stacked BLSTM layers, and its output is fed to a feed-forward network of 128 hidden units. This network is called Audio-Text fused in Table 2 and naturally performs better than the standalone audio model due to the use of the text features, which led to the best results.

Our fifth fusion model uses the video and text modalities together, and here we again use an attention layer at each sub-modality of the video inputs and then combine them with the text modality using another attention network over video and text. Quite surprisingly, the results of this fusion are very similar to the audio and text modality fusion, and the learning curve also ended up being quite similar. This network, called Video-Text fused in Table 2, has one route that runs through a Bi-LSTM network of 200 units for each sub-modality; then, for each time-step, the outputs are fused together using an attention layer over all the incoming video features (gaze, pose, AUs and BoVW), which goes through another Bi-LSTM with 200 units to extract the contextual information from within the fused features.

Our sixth and final fused model uses all the modalities together. We use the attention based visual modalities to obtain a 128-unit vector, we use the attention based audio modalities to obtain a 128-unit vector, and we extract the information from the transcript modality to derive a 128-unit vector. We again use another attention layer over these 3 modalities (video, audio and text) to fuse them together and regress for the PHQ8 score. There were several challenges in training this fused model. We hypothesise that the error function has several local minima, which makes it more difficult to reach the global minimum. On testing the model with individual modalities, we observed that both the video feature model and the audio feature model have a much steeper descent than the ASR (text) model; on fusion, the model often got stuck in the minima of the video and audio features, which are both quite close. To mitigate this and "nudge" the model towards a minimum that follows the path of the minimum reached by the ASR transcripts, we multiply the final outputs of the attention layer element-wise with a variable vector initialised with values in the reciprocal ratios of the RMSE loss of each individual modality, in order to prioritize the text modality initially. This led to a stable decline of the training and validation loss, more stable than for the individual modalities, and the final attention scores are indicative of the contributions of each individual modality. Upon convergence, the attention ratios were [0.21262352, 0.21262285, 0.57475364] for video, audio and text respectively.
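One way to read the "nudge" described above is as a trainable per-modality scaling vector initialised from the reciprocals of each modality's standalone RMSE, as in the sketch below. The module and its name are illustrative, the normalisation of the initial values and the exact point at which the scaling is applied are assumptions, and the RMSE values in the example are placeholders rather than the paper's numbers.

```python
import torch
import torch.nn as nn

class ModalityNudge(nn.Module):
    """Sketch of the nudge: the per-modality attention outputs are multiplied
    element-wise by a trainable vector initialised with the reciprocal ratios
    of each modality's individual RMSE, so the (stronger) text modality
    dominates early in training. Normalising the initial values is an assumption."""
    def __init__(self, modality_rmse):                 # e.g. [video, audio, text] RMSEs
        super().__init__()
        inv = torch.tensor([1.0 / r for r in modality_rmse])
        init = inv / inv.sum()                         # normalised reciprocal ratios
        self.scale = nn.Parameter(init)                # trainable, one weight per modality

    def forward(self, attended):                       # attended: (batch, 3, dim)
        return attended * self.scale.view(1, -1, 1)    # element-wise per-modality scaling

# Example with placeholder per-modality RMSEs (video, audio, text):
nudge = ModalityNudge([7.0, 7.0, 5.0])
out = nudge(torch.randn(4, 3, 128))
```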
Table 1: Regression of PHQ score in terms of RMSE and MAE for each feature within each modality