Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3774–3788, November 16–20, 2020. © 2020 Association for Computational Linguistics
Do Models of Mental Health Based on Social Media Data Generalize?
Keith Harrigian, Carlos Aguirre, Mark Dredze
Johns Hopkins University
kharrigian@jhu.edu, caguirr4@jhu.edu, mdredze@cs.jhu.edu
Abstract
Proxy-based methods for annotating mental
health status in social media have grown popu-
lar in computational research due to their abil-
ity to gather large training samples. However,
an emerging body of literature has raised new
concerns regarding the validity of these types
of methods for use in clinical applications. To
further understand the robustness of distantly
supervised mental health models, we explore
the generalization ability of machine learning
classifiers trained to detect depression in indi-
viduals across multiple social media platforms.
Our experiments not only reveal that substan-
tial loss occurs when transferring between plat-
forms, but also that there exist several unreli-
able confounding factors that may enable re-
searchers to overestimate classification perfor-
mance. Based on these results, we enumer-
ate recommendations for future mental health
dataset construction.
1 Introduction
In the last decade, there has been substantial growth
in the area of digital psychiatry. Automated meth-
ods using natural language processing have been
able to detect mental health disorders based on a
person’s language in a variety of data types, such as
social media (Mowery et al., 2016; Morales et al.,
2017), speech (Iter et al., 2018) and other writings
(Kayi et al., 2017; Just et al., 2019). As in-person
clinical visits are made increasingly difficult by so-
cioeconomic barriers and public-health crises, such
as COVID-19, tools for measuring mental wellness
using implicit signal become more important than
ever (Abdel-Rahman, 2019; Bojdani et al., 2020).
Early work in this area leveraged traditional hu-
man subject studies in which individuals with clini-
cally validated psychiatric diagnoses volunteered
their language data to train classifiers and perform
quantitative analyses (Rude et al., 2004; Jarrold
et al., 2010). In an effort to model larger, more di-
verse populations with less overhead, a substantial
portion of research in the last decade has instead
explored data annotated via automated mechanisms
(Coppersmith et al., 2015a; Winata et al., 2018).
Studies leveraging proxy-based annotations have
supported their design by demonstrating alignment
with existing psychological theory regarding lan-
guage usage by individuals living with a mental
health disorder (Cavazos-Rehg et al., 2016; Vedula
and Parthasarathy, 2017). For example, feature
analyses have highlighted higher amounts of neg-
ative affect and increased personal pronoun preva-
lence amongst depressed individuals (Park et al.,
2012; De Choudhury et al., 2013). Given these con-
sistencies, the field has largely turned its attention
toward optimizing predictive power via state of the
art models (Orabi et al., 2018; Song et al., 2018).
The ultimate goal of these efforts has been
threefold—to better personalize psychiatric care,
to enable early intervention, and to monitor
population-level health outcomes in real time.
Nonetheless, research has largely trudged forward
without stopping to ask one critical question: do
models of mental health conditions trained on au-
tomatically annotated social media data actually
generalize to new data platforms and populations?
Typically, the answer is no—or at least not with-
out modification. Performance loss is to be ex-
pected in a variety of scenarios due to underly-
ing distributional shifts, e.g. domain transfer (Shi-
modaira, 2000; Subbaswamy and Saria, 2020). Ac-
cordingly, substantial effort has been devoted to de-
veloping computational methods for domain adap-
tation (Imran et al., 2016; Chu and Wang, 2018).
Outcomes from this work often provide a solid
foundation for use across multiple natural language
processing tasks (Daume III and Marcu, 2006).
However, it is unclear to what extent factors spe-
cific to mental health require tailored intervention.

In this study, we demonstrate that at a baseline,
proxy-based models of mental health status do not
transfer well to other datasets annotated via auto-
mated mechanisms. Supported by five widely used
datasets for predicting depression in social media
users from both Reddit and Twitter, we present
a combination of qualitative and quantitative ex-
periments to identify troublesome confounds that
lead to poor predictive generalization in the mental
health research space. We then enumerate evidence-
based recommendations for future mental health
dataset construction.
Ethical Considerations. Given the sensitive na-
ture of data containing mental health status of in-
dividuals, additional precautions based on guid-
ance from Benton et al. (2017a) were taken dur-
ing all data collection and analysis procedures.
Data sourced from external research groups was re-
trieved according to each dataset’s respective data
usage policy. The research was deemed exempt
from review by our Institutional Review Board
(IRB) under 45 CFR § 46.104.
2 Domain Adaptation in Mental Health
Domain adaptation (or “transfer”) of statistical clas-
sifiers is a well-studied computational problem
with high relevance across several areas of natu-
ral language processing (Jiang, 2008; Peng and
Dredze, 2017). It is particularly useful in situations
where acquiring ample training data for a target
application is intractable (e.g. monetary, time con-
straints) or impossible (e.g. privacy constraints)
(Rieman et al., 2017). For example, in the sub-field
of machine translation, significant effort is devoted
to finding ways to effectively use large corpora of
formal parallel text to train models for application
in domains with informal and dynamic language,
such as social media and conversational speech
(Wang et al., 2017; Murakami et al., 2019).
Traditional challenges encountered when trans-
ferring models between domains include variance
in source and target class distributions (Japkowicz
and Stephen, 2002), semantic misalignment (Wu
and Huang, 2016), and sparse vocabulary overlap
(Stojanov et al., 2019). Fortunately, once these
issues are identified, it is typically possible to de-
crease the transfer performance gap via methods
such as structural correspondence learning, feature
subspace mapping, and adversarial training (Blitzer
et al., 2006; Bach et al., 2016; Tzeng et al., 2017).
Domain adaptation is of particular interest in
the mental health space, where there exist numer-
ous complexities in obtaining a sufficient sample
of training data. For instance, the sensitive na-
ture of mental health data necessitates extra care
when creating and supporting new datasets (Benton
et al., 2017a). Additionally, behavioral disorders
are known to display variable clinical presentations
amongst different populations, which can make
identification of ground truth difficult (De Choud-
hury et al., 2017; Arseniev-Koehler et al., 2018).
The latter point highlights the presence of label
noise inherent in mental health data (Mitchell et al.,
2009; Shing et al., 2018). This facet serves as one
of two primary issues unique to this research space
that may hinder attempts at domain transfer. In-
deed, prior work found that diverse and sometimes
conflicting views humans have regarding suicidal
ideation can make obtaining reliable gold-standard
labels fundamentally challenging and lead to degra-
dation in model performance (Liu et al., 2017).
Sampling-related biases present the other main
area of concern for successful domain transfer by
mental health classifiers. Attributes such as per-
sonality, gender, age, and disorder co-morbidity
have been found to significantly affect the presen-
tation of mental health disorders in language data
(Cummins et al., 2015; Preotiuc-Pietro et al., 2015).
Moreover, the proxy-based annotation mechanisms
used to label large social media data sets with
mental health status invite the introduction of self-
disclosure bias into the modeling task (Amir et al.,
2019). Specifically, labels sourced from popula-
tions of individuals who self-disclose certain at-
tributes may contain activity-level and thematic
biases that cause poor generalization in larger pop-
ulations (Lippincott and Carrell, 2018).
Research leveraging text data for mental health
status classification has primarily only considered
a constrained form of domain transfer. In a within-
subject analysis, Ireland and Iserman (2018) ex-
amined differences in language usage by Reddit
users who had posted in an anxiety support forum
within and outside mental health forums. Simi-
larly, Wolohan et al. (2018) explored the predictive
power of models trained to detect depression within
Reddit users as a function of access to text from
explicit mental health related subreddits. Both stud-
ies highlighted a mitigation of overt mental health
discussion outside of the support forums, but still
detected linguistic nuances in individuals with an
affiliation to the mental health forums.

Dataset | Platform | Years | Size (Individuals) | Annotation Mechanism
CLPsych | Twitter | 2011-2014 | Control: 477; Depression: 477 | Regular expressions; Manual verification; Age- & gender-matched controls
Multi-Task Learning | Twitter | 2013-2016 | Control: 1,400; Depression: 1,400 | Regular expressions; Manual verification; Age- & gender-matched controls
RSDD | Reddit | 2008-2017 | Control: 107,274; Depression: 9,210 | Regular expressions; Manual verification; Subreddit-based controls
SMHD | Reddit | 2010-2018 | Control: 127,251; Depression: 14,139 | Regular expressions; Subreddit-based controls
Topic-Restricted Text | Reddit | 2014-2020 | Control: 7,016; Depression: 6,853 | Community participation
Table 1: Summary statistics for each dataset. All datasets leverage proxy-based annotations. Distribution over
time and sample size varies significantly between datasets.
Shen et al. (2018) attempted to use transfer learn-
ing with large amounts of English Twitter data
annotated with individual-level depression labels
to improve predictive performance of depression
classifiers in Chinese Weibo data. Using the En-
glish and Chinese versions of the Linguistic Inquiry
and Word Count tool (LIWC) (Pennebaker et al.,
2001; Huang et al., 2012) in conjunction with other
modalities of social data (e.g. profile metadata, im-
ages), the authors showed that signal from Twitter
was useful for classification on Weibo.
Recent work from Ernala et al. (2019) was the
first to explore some of the aforementioned difficulties
with domain transfer in the mental health space.
Multiple different annotation mechanisms were
used to train Twitter-based models for identifying
schizophrenia and then applied to Facebook data
from an independent population of clinically diag-
nosed schizophrenia patients. Three different types
of proxy signals with varying degrees of manual
supervision were each found to generalize poorly
to the clinical population. While the authors’ anal-
ysis suggested the domains were similar enough
to justify transfer attempts, only limited post-hoc
analysis of the data platform effect was carried out.
Thus, it remains unclear to what extent the annota-
tion methodologies as opposed to platform effects
(or other confounds) caused the degradation.
3 Data
We select depression classification as our task be-
cause it is perhaps the most widely studied, has
multiple datasets from different platforms, and is
of critical importance to society. Estimated to
affect 4.4% of the global population, depression
presents a significant economic burden and remains
the most common psychiatric disorder associated
with deaths by suicide (Hawton et al., 2013; Organi-
zation et al., 2017). Occupying a lion’s share of the
computational literature, depression classification
is a critical first target for evaluating generalization
of mental health models in social media (Chancel-
lor and De Choudhury, 2020).
To quantify the nature of domain transfer loss,
we consider five datasets. Datasets were selected
based on their common adoption in the literature
(Preotiuc-Pietro et al., 2015; Gamaarachchige and
Inkpen, 2019) and their use of proxy-based anno-
tations (Coppersmith et al., 2014). We use two
Twitter—CLPsych 2015 Shared Task (Coppersmith
et al., 2015b), Multi-Task Learning (Benton et al.,
2017b)—and three Reddit datasets—RSDD (Yates
et al., 2017), SMHD (Cohan et al., 2018), and Topic-
Restricted Text (Wolohan et al., 2018). Table 1
presents summary statistics. Construction details
are in Appendix A as a courtesy to the reader.
3.1 Mitigating Bias
Each dataset was curated in part by a system of sim-
ple rules (e.g. matches to “I was diagnosed with
depression,” participation in a depression support
forum). While these heuristics are useful for iden-
tifying candidates to include within each dataset,
they also risk introducing bias that may render the
modeling task trivial. For example, individuals
who disclose a depression diagnosis are likely to
also share their experience with other psychiatric
conditions (Benton et al., 2017b), while language
used in dedicated mental-health subreddits system-
atically differs from the rest of Reddit (De Choud-
hury and De, 2014; Ireland and Iserman, 2018).
To encourage our mental health classifiers to
learn subtle linguistic nuances that cannot be easily
captured using straightforward logic, we make ef-
forts to exclude unambiguous mental health content
from all training and evaluation procedures. In line

with prior work, we discard posts that include men-
tions of clinically-defined psychiatric conditions,
adopting the list of mental health terms enumerated
by Cohan et al. (2018) as a reference. This list
(N=458) extends work from Yates et al. (2017) by
including disorders tangential to depression, com-
mon misspellings, and colloquial references.
As is standard for mental health modeling, we
also discard posts made in subreddits dedicated to
providing mental health support (Yates et al., 2017;
Cohan et al., 2018; Wolohan et al., 2018). Since
new subreddits are created daily and our version
of the Topic-Restricted Text dataset contains posts
made after collection of RSDD and SMHD, we cre-
ate an updated list of mental health support subred-
dits. To do so, we examine the empirical distribu-
tion of posts amongst subreddits within the Topic-
Restricted Text dataset and rank each subreddit S
based on pointwise mutual information (PMI) for
the depression group D, log(p(S|D)/p(S)). We
manually examined the top 1000 subreddits based
on PMI and identified all subreddits whose descrip-
tion affirmed an association to mental health.
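To make the ranking step concrete, the sketch below computes log(p(S|D)/p(S)) from per-post (subreddit, label) records; the record format and the additive smoothing constant are our own illustrative assumptions rather than details of the released code.

```python
import math
from collections import Counter

def rank_subreddits_by_pmi(posts, smoothing=1e-12):
    """Rank subreddits by PMI with the depression group.

    `posts` is an iterable of (subreddit, is_depression) pairs, one per post.
    Returns subreddits sorted by log(p(S|D) / p(S)), highest first.
    """
    total, depression = Counter(), Counter()
    n_total, n_depression = 0, 0
    for subreddit, is_depression in posts:
        total[subreddit] += 1
        n_total += 1
        if is_depression:
            depression[subreddit] += 1
            n_depression += 1
    scores = {}
    for subreddit, count in total.items():
        p_s = count / n_total
        p_s_given_d = depression[subreddit] / max(n_depression, 1)
        scores[subreddit] = math.log((p_s_given_d + smoothing) / (p_s + smoothing))
    return sorted(scores, key=scores.get, reverse=True)

# The top-ranked communities (e.g. the first 1,000) can then be reviewed manually.
```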
Our list (N=242) expands existing resources
from Yates et al. (2017) and Cohan et al. (2018) by
providing 162 additional mental health subreddits,
many of which were actually created before the
collection of RSDD and SMHD.1 While this step
diminishes the risk of mental health content saturat-
ing the Topic-Restricted Text dataset, the list’s ex-
pansion beyond that of the RSDD and SMHD lists
suggests that the former two Reddit datasets may
indeed still have overt mental health content. We
explore how different degrees of subreddit-based
filtering may affect generalization in §6.4.
4 Models
We begin by training classification models for pre-
dicting depression on each dataset. All classifica-
tion experiments leverage the same training proce-
dure and features (see Appendix D for details). As
a classifier, we use l2-regularized logistic regres-
sion. Despite our model’s relative simplicity, we
are able to achieve respectable within-domain clas-
sification performance while maintaining an ability
to interpret learned parameters. Logistic regression
has served as a difficult benchmark to beat given access to appropriate engineered features for prior mental health studies (Benton et al., 2017b).
1Subreddits and code are made available to other researchers: https://github.com/kharrigian/emnlp-2020-mental-health-generalization
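For reference, a minimal version of such a classifier can be written with scikit-learn as below; the TF-IDF features are only a stand-in for the engineered feature set described in Appendix D, and the hyperparameters shown are illustrative defaults rather than tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# One document per individual: their concatenated (filtered) post history.
depression_clf = Pipeline([
    ("features", TfidfVectorizer(min_df=5, max_features=50000)),
    ("clf", LogisticRegression(penalty="l2", C=1.0, solver="liblinear", max_iter=1000)),
])

# depression_clf.fit(train_texts, train_labels)
# risk_scores = depression_clf.predict_proba(test_texts)[:, 1]
```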
4.1 Model Validation
To validate our modeling framework against prior
work, we first establish within-domain predictive
baselines. This step also allows us to contextualize
performance by estimating the intrinsic difficulty
of modeling each dataset (DeMasi et al., 2017).
Methods. We use train/development/test splits
if they have been established by the dataset distrib-
utor; otherwise, we sample 20% from the available
data to be used as a held-out test set and then create
an additional 80/20 train/dev split using the remain-
ing data. For each dataset, we use an independent
grid search to select regularization strength C that
maximizes F1 in the dataset’s development split
(see Appendix E). We use a binarization threshold
of 0.5 (noninclusive) for all datasets.
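A minimal sketch of this selection procedure follows; the candidate grid is an illustrative choice rather than the grid specified in Appendix E.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def select_regularization(X_train, y_train, X_dev, y_dev,
                          c_grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Pick the regularization strength C that maximizes F1 on the development split."""
    best_c, best_f1 = None, -1.0
    for c in c_grid:
        model = LogisticRegression(penalty="l2", C=c, solver="liblinear", max_iter=1000)
        model.fit(X_train, y_train)
        # Binarize predicted probabilities at a strict (non-inclusive) 0.5 threshold.
        preds = (model.predict_proba(X_dev)[:, 1] > 0.5).astype(int)
        score = f1_score(y_dev, preds)
        if score > best_f1:
            best_c, best_f1 = c, score
    return best_c, best_f1
```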
Results. We report test set F1 for each dataset in
the bottom row of Table 2. Our models perform on
par with prior research for the two Twitter datasets
and the Topic-Restricted Text dataset. Results for
RSDD and SMHD improve upon their respective
baseline models, but are inferior to neural methods.
5 Transfer Experiments
We conduct a series of experiments to measure
the generalization of models between depression
datasets and explain sources of model degradation.
5.1 Cross-domain Transfer
Task formulation and dataset design remain a signif-
icant source of nuance across prior studies for men-
tal health status prediction (Morales et al., 2017;
Chancellor and De Choudhury, 2020). As such,
we hypothesize that standardizing training settings
(e.g. class balance, sample size) will account for
discrepancies in cross-domain performance.
Methods. We consider two experimental de-
signs. In the first experiment (†), we downsample
all datasets to have the same training/development
size of the smallest class in the smallest dataset
(i.e. CLPsych). In the second experiment (††),
we balance class distributions independently for
each dataset based on the dataset’s smaller class,
but allow sample size to vary between datasets.
The former experiment allows us to establish eq-
uitable baselines between datasets, while the latter
experiment enables us to explore whether access to
additional training data ameliorates transfer loss.

Train Data | Test: CLPsych († / ††) | Test: Multi-Task († / ††) | Test: RSDD († / ††) | Test: SMHD († / ††) | Test: Topic-Restricted Text († / ††)
CLPsych | .774 ± .009 / .774 ± .009 | .635 ± .054 / .635 ± .054 | .169 ± .011 / .169 ± .011 | .064 ± .006 / .064 ± .006 | .638 ± .034 / .638 ± .034
Multi-Task | .533 ± .111 / .739 ± .004 | .802 ± .018 / .830 ± .005 | .149 ± .001 / .149 ± .001 | .054 ± .000 / .054 ± .001 | .648 ± .007 / .655 ± .011
RSDD | .247 ± .034 / .284 ± .046 | .338 ± .041 / .407 ± .051 | .338 ± .010 / .405 ± .003 | — | .487 ± .046 / .434 ± .003
SMHD | .335 ± .048 / .355 ± .028 | .543 ± .040 / .464 ± .028 | — | .186 ± .007 / .212 ± .006 | .626 ± .011 / .631 ± .007
Topic-Restricted Text | .624 ± .018 / .668 ± .008 | .516 ± .060 / .648 ± .026 | .173 ± .017 / .218 ± .004 | .105 ± .014 / .106 ± .008 | .686 ± .007 / .735 ± .002
Baseline | .77 | .82 | .59 | .38 | .75
Table 2: F1 score (µ ± σ) for the Balanced & Downsampled (†) and Balanced (††) cross-domain transfer experiments. Baselines described in §4.1, which preserve class imbalance during training, are presented in the bottom row. Increasing dataset size (10x in some cases) does not unanimously improve transfer. Because RSDD and SMHD share a substantial number of individuals, transfer between the two is not evaluated (—).
For both experiments, we start by combining
training and development splits. Then, for each
dataset, we sample from the combined splits based
on the parameters of the experiment and split the re-
sulting sample into 5 class-stratified folds. We train
5 classifiers per dataset, using 4 folds for training
each time, and apply the classifiers to each dataset’s
test set. Since a substantial portion of individuals
in SMHD are part of RSDD, we refrain from con-
ducting experiments between the two datasets.
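For a single source dataset, the evaluation loop can be sketched as follows; the data structures and names are hypothetical, and feature extraction is assumed to have already produced numeric matrices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def cross_domain_f1(source, targets, seed=42):
    """Train 5 classifiers on 4/5 folds of the sampled source data and score each target test set.

    `source` holds a feature matrix "X" and label array "y" from the combined
    train+dev splits; `targets` maps dataset names to {"X_test", "y_test"}.
    Returns {dataset: (mean F1, std F1)} across the 5 folds.
    """
    results = {name: [] for name in targets}
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, _ in skf.split(source["X"], source["y"]):
        model = LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000)
        model.fit(source["X"][train_idx], source["y"][train_idx])
        for name, data in targets.items():
            preds = (model.predict_proba(data["X_test"])[:, 1] > 0.5).astype(int)
            results[name].append(f1_score(data["y_test"], preds))
    return {name: (float(np.mean(v)), float(np.std(v))) for name, v in results.items()}
```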
Results. We report F1 score (µ ± σ) for both
experiments in Table 2. In line with existing re-
search, within-domain training outperforms cross-
domain training in each of our datasets for both
sampling settings. While additional samples avail-
able for training in the second experiment mod-
erately improve within-domain performance, they
are not uniformly helpful for mitigating transfer
loss to other datasets. Models generally outper-
form a random classifier at ranking depression risk
in cross-domain transfer scenarios. However, some
models are poorly calibrated for new domains and
consequently obtain low F1 scores (e.g. CLPsych
→ SMHD). Addressing miscalibration in domain
transfer scenarios remains an open research ques-
tion (Pampari and Ermon, 2020; Park et al., 2020).
We find that models trained on Twitter data trans-
fer to Reddit data better than models in the reverse
direction. Not surprisingly, given their overlap in
training samples, models trained on the SMHD
and RSDD datasets transfer to other domains in
an equitable manner, trading improvements with
each other across transfer settings. These results
indicate that sample size and class balance are not
solely responsible for generalization loss.
5.2 Temporal Transfer
Typical sources of transfer loss concern differences
in features between domains (Blitzer et al., 2007;
Ben-David et al., 2010). However, other factors
may govern model degradation for depression clas-
sification. One such cause of loss is temporal mis-
alignment between the datasets (Table 1). Prior
work has shown that language dynamics may hin-
der models upon deployment (Dredze et al., 2016;
Huang and Paul, 2018). In social media, where
users adopt new linguistic norms rapidly, perfor-
mance may be more volatile (Brigadir et al., 2015).
5.2.1 Class Misalignment
As an exercise to understand whether temporal
artifacts are present in the datasets, we first con-
sider training and evaluating single-domain models
with a temporal misalignment between the control
and depression groups. By training on mutually-
exclusive time periods for each class, we hypothe-
size the classifier will not only able to learn how to
distinguish between groups, but also to distinguish
between time periods. If this hypothesis holds true,
we expect performance metrics to be artificially in-
flated when a temporal exclusivity per class exists.
Methods. We split each dataset into one year
periods based on the calendar year. For each year,
we identify individuals in the Twitter datasets with
at least 200 posts and individuals in the Reddit
datasets with at least 100 posts.2
We balance
the number of individuals across time periods and
groups within each dataset, but allow this sample
size to vary across datasets. To account for growth
in post frequency over time (which increases the
number of documents that generate individual fea-
ture vectors), we perform additional post-level sam-
pling. We randomly select 200 posts per year in the
Twitter datasets and 100 posts per year in the Red-
dit datasets. Samples of individuals within each
time period are additionally separated into 5 strati-
fied folds. Folds are established so that individuals
in the training data of one time period are never
present in the test data of another time period.
2We use 2x more posts in the Twitter data to account for
posts in the Reddit datasets having roughly twice as many
words as Tweets do on average.

Figure 1: Temporal-transfer results. (Left) Average within-domain F1 score as a function of training and evaluation periods. Predictive performance tends to be better for more recent temporal splits regardless of training period. (Right) Percent difference in F1 score relative to the within-domain, no-latency model. Models trained on Twitter data benefit most from temporal alignment. Performance suffers when applying new models to old data.
To evaluate the degree to which temporal effects
are present, we sample groups from all possible
combinations of time periods. For example, in one
setting, both the control and depression groups are
sampled from 2013; in another setting, the control
group is sampled from 2013, while the depression
group is sampled from 2015. For each combination,
we use 4 of the stratified folds for training and use
the remaining fold for evaluation, and then repeat
the process for all folds. We compare performance
when classes are sampled from the same time pe-
riod against performance when classes are sampled
from mutually exclusive time periods.
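A sketch of how a single misaligned sample might be drawn is shown below; the nested data structure is a hypothetical convenience, not the format used in our pipeline.

```python
import random

def sample_misaligned_classes(users_by_year, control_year, depression_year, n_per_class, seed=0):
    """Draw control users from one calendar year and depression users from another.

    `users_by_year[year][label]` is a list of (user_id, posts) pairs for that period,
    with label 0 for control and 1 for depression. Classes are returned in equal size.
    """
    rng = random.Random(seed)
    controls = rng.sample(users_by_year[control_year][0], n_per_class)
    depressed = rng.sample(users_by_year[depression_year][1], n_per_class)
    sample = [(posts, 0) for _, posts in controls] + [(posts, 1) for _, posts in depressed]
    rng.shuffle(sample)
    return sample
```

Setting control_year equal to depression_year recovers the temporally-aligned condition used for comparison.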
Results. We achieve a 3-22% increase in F1
across all datasets when classes are sampled from
mutually exclusive time periods instead of being
temporally-aligned. The improvement suggests
that temporal artifacts exist, as the classifier is able
to not only identify signal relevant to classifying
depression, but also to classifying data from dif-
ferent periods of time. This result highlights the
importance of sampling classes evenly over time.
5.2.2 Latency
We now measure the effect temporal artifacts have
on cross-domain performance. We hypothesize
model degradation scales with deployment latency.
Methods. We use the same data sampling mech-
anism described in §5.2.1. However, we now only
consider the case in which control and depression
groups are sampled from the same time period. As
before, we train a classifier on 4 of the 5 stratified
folds for a time period in one dataset. We then
evaluate within-domain performance using the re-
maining fold and cross-domain performance using
one fold from each time period in the other datasets.
We assume ground truth is consistent over multiple
time periods; given the episodic nature of depres-
sion, we recognize this may promote pessimistic
results for some periods (Tsakalidis et al., 2018).
Results. Examining within-domain results in
Figure 1 (left), predictive performance tends to be
better for more recent temporal splits regardless of
training period. Classifiers trained on old data (rela-
tive to the evaluation period) tend to perform on par
with aligned regimens, while classifiers trained on
new data show linear losses over time. Losses are
significant after 2-3 years depending on the dataset.
Though some trends do emerge, cross-domain
performance as a function of temporal latency is
relatively variable. Visualized in Figure 1 (right),
models trained on the Twitter datasets benefit most
from temporal alignment in cross-domain settings.
Models trained on Topic-Restricted Text show sig-
nificant drop offs in predictive performance when
applied to older samples within all Reddit datasets.
While models trained on RSDD perform better on
Topic-Restricted Text as latency is reduced, models
trained on SMHD do not exhibit the same trend.
6 Post-hoc Analysis
In the previous section, we identified the degree
to which loss occurs under a variety of domain
transfer settings. However, these settings do not
account for all performance disparities. In this sec-
tion, we measure differences between the datasets
to understand the source of loss.

6.1 Vocabulary Overlap
Traditionally, different feature vocabularies ac-
count for domain transfer loss (Serra et al., 2017;
Chen and Gomes, 2019; Stojanov et al., 2019).
Therefore, we hypothesize that limited feature over-
lap and poor vocabulary alignment across datasets
could hinder cross-domain generalization.
Methods. We explore this phenomenon by com-
puting the Jaccard Similarity (JS) of vocabularies
between each dataset. We examine correlations
between JS and F1 scores from the cross-domain
transfer experiments discussed in §5.1.
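Both quantities are straightforward to compute; a small sketch (with hypothetical container names) is given below.

```python
from itertools import combinations
from scipy.stats import pearsonr

def jaccard_similarity(vocab_a, vocab_b):
    """Jaccard similarity between two feature vocabularies (sets of n-grams)."""
    vocab_a, vocab_b = set(vocab_a), set(vocab_b)
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

def vocab_overlap_vs_transfer(vocabularies, transfer_f1):
    """Correlate pairwise vocabulary overlap with cross-domain F1.

    `vocabularies` maps dataset names to their vocabularies; `transfer_f1[(a, b)]`
    is the F1 obtained when training on dataset `a` and evaluating on dataset `b`.
    """
    similarities, scores = [], []
    for a, b in combinations(sorted(vocabularies), 2):
        js = jaccard_similarity(vocabularies[a], vocabularies[b])
        for pair in ((a, b), (b, a)):
            if pair in transfer_f1:
                similarities.append(js)
                scores.append(transfer_f1[pair])
    return pearsonr(similarities, scores)
```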
Results. We find the minimum similarity oc-
curs between the CLPsych and RSDD datasets
(JS = 0.10) while the maximum occurs between
the Topic-Restricted Text and SMHD datasets
(JS = 0.65).3 Only a weak correlation between
similarity and performance exists (Pearson ρ <
0.18), suggesting poor generalization is not solely
due to differences in vocabulary.
6.2 Topical Alignment
Our classification models leverage reduced fea-
ture representations in the form of LDA topic-
distributions (Blei et al., 2003) and mean-pooled
pre-trained GloVe embeddings (Pennington et al.,
2014). Designed to capture and reflect seman-
tics, we hypothesized these low-dimensional fea-
tures would mitigate transfer loss due to poor vo-
cabulary alignment. Lacking support from our
cross-domain transfer results, we look closer at
the themes present within each dataset.
Methods. We identify the unigrams that are
most unique to each dataset and group. For
each dataset, we use scores assigned by our KL-
divergence-based feature selection method (see Ap-
pendix D) to rank the most informative features per
class (Chang et al., 2012). We jointly examine the
top-500 most informative unigrams per class, not-
ing high-level themes common across the datasets.
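One common instantiation of such a score ranks each term by its contribution to the KL divergence between the class-conditional and overall unigram distributions; the sketch below follows that form, though the exact scoring used in Appendix D may differ in its details.

```python
import numpy as np

def kl_informative_terms(counts_by_class, vocab, top_k=500, smoothing=0.01):
    """Rank unigrams by their contribution to KL(p(w|class) || p(w)).

    `counts_by_class[c]` is a numpy vector of word counts for class c, aligned to `vocab`.
    Returns the top_k terms per class.
    """
    totals = sum(counts_by_class.values())
    p_overall = (totals + smoothing) / (totals + smoothing).sum()
    rankings = {}
    for c, counts in counts_by_class.items():
        p_class = (counts + smoothing) / (counts + smoothing).sum()
        scores = p_class * np.log(p_class / p_overall)  # per-term KL contribution
        top_idx = np.argsort(scores)[::-1][:top_k]
        rankings[c] = [vocab[i] for i in top_idx]
    return rankings
```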
Results. With respect to similarities, we note
that words used in discussion about gender and sex-
uality are strongly associated with each of the de-
pression groups (e.g. ‘cis’, ‘homophobia’, ‘mascu-
line’), likely a reflection of marginalized groups be-
ing at higher risk of depression (Budge et al., 2013).
Also ubiquitous amongst each of the datasets are
references to self-injurious behavior (e.g. ‘wrists’, ‘self-harm’, ‘hotline’). Increased emoji usage and references to athletics (‘nbafinals’, ‘scorer’) are strong indicators of the control group in each dataset, as well as terms reflecting current events.
3JS is moderately deflated in RSDD due to the dataset’s large vocabulary, causing SMHD and Topic-Restricted Text to have the highest similarity instead of SMHD and RSDD.
With respect to differences, associations between
word usage and depression are subjectively easier
to interpret within the Reddit datasets. For example,
discussion of mental-health treatment (e.g. ‘coun-
selor’, ‘therapy’, ‘wellbutrin’) and familial and in-
timate relationships (‘brother-in-law’, ‘soulmate’)
are prominent within the Reddit datasets. In con-
trast, language associated with depression within
the Twitter datasets tends to reflect slightly more
nuanced elements of the condition—e.g. social
inequity (‘sexism’, ‘#yesallwomen’) and fantasy
(‘fanfics’, ‘cosplay’, ‘villians’). These themes align
with empirical findings that women are at a higher
risk of depression (Kessler, 2003) and depressed
individuals often find solace in niche subcultures
(Blanco and Barnett, 2014; Bowes et al., 2015).
Additionally, we find several temporally-
isolated references within the Twitter datasets
(e.g. ‘#RIPRobinWilliams’, ‘#SDCC’). In the
Multi-task Learning dataset, we also see several
terms using non-American English (e.g. ‘colour’,
‘favourite’) which may represent a geographic im-
balance amongst the sampled individuals.
6.3 Stability of LIWC
The Linguistic Inquiry and Word Count (LIWC)
dictionary has been an effective tool for measur-
ing linguistic nuances of mental health disorders
regardless of textual formality (Mowery et al.,
2016; Turcan and McKeown, 2019). Our version
of the dictionary (2007) maps approximately 12k
words to 64 dimensions (e.g. negative emotion,
leisure) that have been empirically validated to
capture an individual’s social and psychological
states (Tausczik and Pennebaker, 2010).4 A single
LIWC feature value represents the proportion of
words used across an individual’s post history that
match the given LIWC dimension. In the same way
that we expect semantic distributions (§6.2) to ame-
liorate transfer loss, we hypothesize that models
trained on this representation will be more robust
when vocabulary overlap is sparse.
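The feature computation itself is a simple lexicon lookup; a minimal sketch is shown below, assuming a pre-loaded mapping from words to LIWC dimensions (the 2007 dictionary also uses wildcard prefixes, which this sketch ignores).

```python
from collections import Counter

def liwc_proportions(tokens, liwc_lexicon, dimensions):
    """Compute LIWC features for one individual: the proportion of their tokens
    (across all posts) that match each LIWC dimension.

    `liwc_lexicon` maps a lowercased word to the set of dimensions it belongs to.
    """
    counts = Counter()
    for tok in tokens:
        for dim in liwc_lexicon.get(tok.lower(), ()):
            counts[dim] += 1
    n = max(len(tokens), 1)
    return {dim: counts[dim] / n for dim in dimensions}
```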
Methods. We explore this hypothesis from three angles: 1) We perform cross-domain transfer experiments using LIWC as the only feature set provided for training and evaluation; 2) We fit LIWC-based classifiers 100 times per dataset using random 70% samples and examine correlations of the learned coefficients; 3) We compute the average feature value of each LIWC dimension per class and measure the difference between classes.
4The 2007 version of LIWC has a high similarity with the 2015 version amongst dimensions most strongly associated with depression (Pennebaker et al., 2015).
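The second analysis (coefficient stability) might be implemented roughly as follows; how the per-run coefficients are aggregated before correlating datasets is our own assumption.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression

def liwc_coefficient_samples(X_liwc, y, n_runs=100, frac=0.7, seed=42):
    """Fit a LIWC-only classifier on repeated random 70% samples of one dataset
    and stack the learned coefficient vectors (n_runs x 64 LIWC dimensions)."""
    rng = np.random.default_rng(seed)
    n = X_liwc.shape[0]
    coefs = []
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        model = LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000)
        model.fit(X_liwc[idx], y[idx])
        coefs.append(model.coef_.ravel())
    return np.vstack(coefs)

def coefficient_stability(coefs_a, coefs_b):
    """Spearman correlation between the mean coefficient vectors of two datasets."""
    rho, _ = spearmanr(coefs_a.mean(axis=0), coefs_b.mean(axis=0))
    return rho
```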
Results. We note that domain-transfer experi-
ments using LIWC as the only feature set main-
tain high degrees of transfer loss while sacrific-
ing within-domain performance. Moreover, cor-
relations between coefficients of models between
datasets are relatively low across all comparisons,
maxing out at a Spearman R value of 0.338 for the
comparison between the RSDD and SMHD datasets, which, as noted earlier, have significant user overlap. In general, LIWC coefficients tend to be more
correlated within platforms than between them.
Examination of the underlying class differences
provides insight into linguistic differences between
each dataset’s depression group. In line with
prior work, function word use, first-person pro-
noun use, and cognitive mechanisms are more com-
mon within the depression group of each dataset,
though their relative prevalence varies. Conversa-
tion regarding relativity (i.e. space, motion, time)
is strongly associated with the control groups in
the Twitter data, but is more associated with the
depression groups in the Reddit data. Anger and
perceptual topics are more prevalent within the de-
pression groups for Twitter than Reddit.
6.4 Self-disclosure Bias
In the aforementioned analysis, posts from men-
tal health subreddits and those including mental
health terms were excluded. Nonetheless, individ-
uals within each of the depression groups for the
Reddit datasets displayed language that was unam-
biguously associated with seeking support or shar-
ing personal experience with mental health issues.
Accordingly, we hypothesize that existing filters
are unable to remove confounds in individuals who
disclose a depression diagnosis on Reddit.
Methods. To measure this effect, we examine
differences in the distribution of subreddits that
individuals in the depression group of the Topic-
Restricted Text data post in relative to individuals
in the control group. Specifically, we fit a logistic
regression model mapping the subreddit distribu-
tion of individuals’ posts to their mental health
status after applying each subreddit filter list (e.g.
RSDD, SMHD, Ours). We compare predictive per-
formance of these models and the learned coef-
ficient weights to understand the effect of filter-
ing. As a baseline, we maintain posts from the
r/depression subreddit in the feature set. Then, in
sequence of coverage from least to most, we apply
subreddit filters from RSDD, SMHD, and our study,
and measure classification performance. For each
filter, we examine the learned coefficient weights
to develop a sense for the personality and interests
of individuals in the depression group.
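A rough sketch of this probe, with hypothetical input containers, is given below.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def subreddit_confound_probe(user_subreddit_counts, labels, filter_list, seed=42):
    """Predict depression status from the distribution of subreddits each user posts in,
    after removing any subreddit that appears on `filter_list` (e.g. RSDD, SMHD, or ours).

    `user_subreddit_counts` is a list of {subreddit: post count} dicts, one per user.
    Returns development-set F1 and the fitted model (inspect `model.coef_`).
    """
    excluded = set(filter_list)
    filtered = [{s: c for s, c in counts.items() if s not in excluded}
                for counts in user_subreddit_counts]
    X = DictVectorizer().fit_transform(filtered)
    X_tr, X_dev, y_tr, y_dev = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=seed)
    model = LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000)
    model.fit(X_tr, y_tr)
    return f1_score(y_dev, model.predict(X_dev)), model
```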
Results. The baseline F1 score in the devel-
opment set maxes out at 0.83, representing the
fact that several individuals in the control group
had posted in the r/depression subreddit at some
point in their history, but were not labeled as
having depression due to the sole use of recent
original posts by the automatic annotation proce-
dure. Performance degrades with the expansion
of excluded subreddits from each filter, settling
at an F1 of 0.72. Coefficients from the model
highlight subreddits related to themes of sexual-
ity (r/bisexual, r/actuallesbians), gender (r/ftm),
personality (r/introvert, r/INFP), drugs (r/Trees,
r/LSD), and relationships (r/MakeNewFriendsHere,
r/BreakUps) as being predictive of depression.
The strong classification performance achieved
after our filtering measures is evidence that distri-
butional differences in online interaction remain in
the “cleaned” Topic-Restricted Text dataset. As our
subreddit list is more robust than both the RSDD
and SMHD lists, there is reason to believe simi-
lar confounds exist in these datasets. The coeffi-
cient analysis provides a window into the types of
themes that could incorrectly confuse a classifica-
tion model during generalization attempts.
7 Recommendations
We have demonstrated that issues of transfer loss
persist in the mental health space, at least for the
proxy-based social media datasets considered in
our study. Importantly, we identified confounds
that emerge as a result of each dataset’s respective
design. Critically, existing datasets have flaws that
make them difficult to use for constructing models
for new data types and populations.
Topical Alignment. Researchers must account
for self-disclosure bias and confounds of personal-
ity when curating new datasets. First discussed in
§6.2, models trained on the Reddit datasets learn
dependencies between support-driven topics, such
as medication usage and relationship advice, and

depression. In contrast, models trained on the
Twitter datasets identify the same correlations be-
tween sexuality, gender, and depression that Reddit-
based models detect, but also learn about the recre-
ational outlets (i.e. fantasy) and social concerns (i.e.
racism, sexism) common in depressed individuals.
We hypothesize that semantic divergences reflect
self-disclosure bias and differences in platform in-
teraction patterns (Malik et al., 2015; Shelton et al.,
2015). Twitter’s status- and reply-based structure
serves as a place for individuals to share personal
thoughts and experiences in reaction to their daily
life. Meanwhile, Reddit’s community-based fo-
rums require active engagement with specific top-
ics and may silo individuals who wish to discuss
their mental health beyond defined areas. The latter
gains support from our analysis of subreddit distri-
butions in the Topic-Restricted Text data (§6.4).
Topical nuances in language may appropriately
reflect elements of identity associated with mental
health disorders (i.e. traumatic experiences, cop-
ing mechanisms). However, if not contextualized
during model training, this type of signal has the
potential to raise several false alarms upon appli-
cation to new populations. Accordingly, we urge
researchers to minimize the presence of overt topi-
cal disparities between classes in their datasets.
Mitigating Temporal Artifacts. Researchers
must take steps to remove temporal artifacts in new
datasets. Experiments conducted in §5.2 reveal
that group-based temporal alignment and latency
between model training and deployment can have a
significant effect on predictive performance. Vari-
ability of performance over time is surprising, as
there is no clinical evidence to suggest that the un-
derlying symptoms of depression (on a population
level) change over time (APA, 2013).
We hypothesize two reasons for this observa-
tion. First, since depression presents in an episodic
manner, we may expect data closest to the date of
annotation to be the most predictive of an individ-
ual’s labeled mental status (Melartin et al., 2004).
If most posts used for annotation occurred in recent
time windows, then it is possible that content in
older posts is less relevant to the depressive state
of individuals in our data sets. Second, and more
problematic, is the possibility that signal used by
our classifiers is only a spurious correlation.
At a bare minimum, our results highlight the im-
portance of sampling classification groups so that
post volume is equal over time. Discrepancies may
wrongly suggest that temporal artifacts are useful
for detecting mental health disorders. Going fur-
ther, researchers should remove temporally-specific
references and minimize highly-dynamic language
in their datasets. Avenues for accomplishing the lat-
ter include using NER to redact n-grams that serve
as spurious correlations (Ritter et al., 2011) and
leveraging adversarial training to evaluate the de-
gree to which mental health signal may be learned
without a notion for time (Tzeng et al., 2017).
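As one concrete (and deliberately simple) illustration of the NER-based redaction idea, the sketch below replaces generic entity spans with placeholder tokens before feature extraction. It uses an off-the-shelf spaCy model purely as a stand-in; Ritter et al. (2011) describe a Twitter-specific tagger, and the choice of entity types to redact is ours.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # generic NER model; a social-media-specific tagger is preferable
REDACT_TYPES = {"PERSON", "ORG", "EVENT", "DATE", "GPE"}

def redact_temporal_entities(text):
    """Replace likely temporally-specific entity mentions with placeholder tokens."""
    doc = nlp(text)
    pieces, last = [], 0
    for ent in doc.ents:
        if ent.label_ in REDACT_TYPES:
            pieces.append(text[last:ent.start_char])
            pieces.append(f"<{ent.label_}>")
            last = ent.end_char
    pieces.append(text[last:])
    return "".join(pieces)
```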
8 Limitations and Future Work
Though our study provides a robust perspective to-
ward understanding generalization capabilities of
mental health classifiers for social media, we rec-
ognize that more learning opportunities exist. Our
study only considers a handful of datasets, two plat-
forms, a single mental health disorder, and homo-
geneous annotation mechanisms. Still unexplored,
in large part due to the precautions necessary for
securing sensitive mental health data, is how well
models trained on data from actual clinical popula-
tions generalize to proxy-based datasets and other
clinical populations. While high co-morbidity rates
between depression and other mental health dis-
orders may allow us to infer model behavior for
alternative conditions, we also recognize that pre-
sentations of different psychiatric disorders can be
quite variable and warrant their own research (Ben-
ton et al., 2017b; Arseniev-Koehler et al., 2018).
Another limitation in our work is the lack of
depression to control group matches from original
reference material. Preotiuc-Pietro et al. (2015) and
De Choudhury et al. (2017) demonstrate that men-
tal health disorders such as depression can have
variable presentations based on demographic at-
tributes. The attributes used to construct our Twitter
datasets originally were inferred via now-outdated
text-based models. Accordingly, demographic in-
ference errors may be propagated to and correlated
with depression classification errors. Moreover,
these attributes were not considered within the con-
struction of any of the Reddit datasets we explored.
The effect of demographics on generalization re-
mains a valuable insight for future exploration.
Finally, our attempts at domain transfer are con-
strained. Namely, we do not invoke explicit do-
main adaptation methods (Peng and Dredze, 2017;
Li et al., 2018; Huang and Paul, 2019). Moving
forward, we plan to explore algorithmic strategies
to mitigate the biases discovered in this study.

References
Omar Abdel-Rahman. 2019. Socioeconomic predic-
tors of suicide risk among cancer patients in the
united states: A population-based study. Cancer epi-
demiology, 63:101601.
Silvio Amir, Mark Dredze, and John W Ayers. 2019.
Mental health surveillance over social media with
digital cohorts. In Proceedings of the Sixth Work-
shop on Computational Linguistics and Clinical Psy-
chology, pages 114–120.
American Psychiatric Association APA. 2013. Diag-
nostic and statistical manual of mental disorders
(DSM-5®). American Psychiatric Pub.
Alina Arseniev-Koehler, Sharon Mozgai, and Stefan
Scherer. 2018. What type of happiness are you
looking for?-a closer look at detecting mental health
from language. In Proceedings of the Fifth Work-
shop on Computational Linguistics and Clinical Psy-
chology: From Keyboard to Clinic, pages 1–12.
Ngo Xuan Bach, Vu Thanh Hai, and Tu Minh Phuong.
2016. Cross-domain sentiment classification with
word embeddings and canonical correlation analy-
sis. In Proceedings of the Seventh Symposium on
Information and Communication Technology, pages
159–166.
Shai Ben-David, John Blitzer, Koby Crammer, Alex
Kulesza, Fernando Pereira, and Jennifer Wortman
Vaughan. 2010. A theory of learning from different
domains. Machine learning, 79(1-2):151–175.
Adrian Benton, Glen Coppersmith, and Mark Dredze.
2017a. Ethical research protocols for social media
health research. In Proceedings of the First ACL
Workshop on Ethics in Natural Language Process-
ing, pages 94–102.
Adrian Benton, Margaret Mitchell, and Dirk Hovy.
2017b. Multitask learning for mental health condi-
tions with limited social media data. In Conference
of the European Chapter of the Association for Com-
putational Linguistics (EACL). Association for Com-
putational Linguistics.
Joel A Blanco and Lynn A Barnett. 2014. The effects
of depression on leisure: Varying relationships be-
tween enjoyment, sociability, participation, and de-
sired outcomes in college students. Leisure Sciences,
36(5):458–478.
David M Blei, Andrew Y Ng, and Michael I Jordan.
2003. Latent dirichlet allocation. Journal of ma-
chine Learning research, 3(Jan):993–1022.
John Blitzer, Mark Dredze, and Fernando Pereira. 2007.
Biographies, bollywood, boom-boxes and blenders:
Domain adaptation for sentiment classification. In
Proceedings of the 45th annual meeting of the asso-
ciation of computational linguistics, pages 440–447.
John Blitzer, Ryan McDonald, and Fernando Pereira.
2006. Domain adaptation with structural correspon-
dence learning. In Proceedings of the 2006 confer-
ence on empirical methods in natural language pro-
cessing, pages 120–128.
Ermal Bojdani, Aishwarya Rajagopalan, Anderson
Chen, Priya Gearin, William Olcott, Vikram
Shankar, Alesia Cloutier, Haley Solomon, Nida Z
Naqvir, Nicolas Batty, et al. 2020. Covid-19 pan-
demic: Impact on psychiatric care in the united
states, a review. Psychiatry Research, page 113069.
Lucy Bowes, Rebecca Carnegie, Rebecca Pearson,
Becky Mars, Lucy Biddle, Barbara Maughan, Glyn
Lewis, Charles Fernyhough, and Jon Heron. 2015.
Risk of depression and self-harm in teenagers iden-
tifying with goth subculture: a longitudinal cohort
study. The Lancet Psychiatry, 2(9):793–800.
Igor Brigadir, Derek Greene, and Pádraig Cunningham.
2015. Analyzing discourse communities with dis-
tributional semantic models. In Proceedings of the
ACM Web Science Conference, pages 1–10.
Stephanie L Budge, Jill L Adelson, and Kimberly AS
Howard. 2013. Anxiety and depression in transgen-
der individuals: the roles of transition status, loss,
social support, and coping. Journal of consulting
and clinical psychology, 81(3):545.
Patricia A Cavazos-Rehg, Melissa J Krauss, Shaina
Sowles, Sarah Connolly, Carlos Rosas, Meghana
Bharadwaj, and Laura J Bierut. 2016. A content
analysis of depression-related tweets. Computers in
human behavior, 54:351–357.
Stevie Chancellor and Munmun De Choudhury. 2020.
Methods in predictive techniques for mental health
status on social media: a critical review. NPJ digital
medicine, 3(1):1–11.
Hau-wen Chang, Dongwon Lee, Mohammed Eltaher,
and Jeongkyu Lee. 2012. @ phillies tweeting from
philly? predicting twitter user locations with spa-
tial word usage. In 2012 IEEE/ACM International
Conference on Advances in Social Networks Analy-
sis and Mining, pages 111–118. IEEE.
Di Chen and Carla P Gomes. 2019. Bias reduction via
end-to-end shift learning: Application to citizen sci-
ence. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 33, pages 493–500.
Chenhui Chu and Rui Wang. 2018. A survey of do-
main adaptation for neural machine translation. In
Proceedings of the 27th International Conference on
Computational Linguistics, pages 1304–1319.
Arman Cohan, Bart Desmet, Andrew Yates, Luca Sol-
daini, Sean MacAvaney, and Nazli Goharian. 2018.
Smhd: a large-scale resource for exploring online
language usage for multiple mental health condi-
tions. In Proceedings of the 27th International Con-
ference on Computational Linguistics, pages 1485–
1497.

Glen Coppersmith, Mark Dredze, and Craig Harman.
2014. Quantifying mental health signals in twitter.
In Proceedings of the workshop on computational
linguistics and clinical psychology: From linguistic
signal to clinical reality, pages 51–60.
Glen Coppersmith, Mark Dredze, Craig Harman, and
Kristy Hollingshead. 2015a. From adhd to sad: An-
alyzing the language of mental health on twitter
through self-reported diagnoses. In Proceedings of
the 2nd Workshop on Computational Linguistics and
Clinical Psychology: From Linguistic Signal to Clin-
ical Reality, pages 1–10.
Glen Coppersmith, Mark Dredze, Craig Harman,
Kristy Hollingshead, and Margaret Mitchell. 2015b.
Clpsych 2015 shared task: Depression and ptsd on
twitter. In Proceedings of the 2nd Workshop on
Computational Linguistics and Clinical Psychology:
From Linguistic Signal to Clinical Reality, pages 31–
39.
Nicholas Cummins, Stefan Scherer, Jarek Krajewski,
Sebastian Schnieder, Julien Epps, and Thomas F
Quatieri. 2015. A review of depression and suicide
risk assessment using speech analysis. Speech Com-
munication, 71:10–49.
Hal Daume III and Daniel Marcu. 2006. Domain adap-
tation for statistical classifiers. Journal of artificial
Intelligence research, 26:101–126.
Munmun De Choudhury and Sushovan De. 2014. Men-
tal health discourse on reddit: Self-disclosure, so-
cial support, and anonymity. In Eighth international
AAAI conference on weblogs and social media.
Munmun De Choudhury, Michael Gamon, Scott
Counts, and Eric Horvitz. 2013. Predicting depres-
sion via social media. In Seventh international AAAI
conference on weblogs and social media.
Munmun De Choudhury, Sanket S Sharma, Tomaz
Logar, Wouter Eekhout, and René Clausen Nielsen.
2017. Gender and cross-cultural differences in so-
cial media disclosures of mental illness. In Pro-
ceedings of the 2017 ACM conference on computer
supported cooperative work and social computing,
pages 353–369.
Orianna DeMasi, Konrad Kording, and Benjamin
Recht. 2017. Meaningless comparisons lead to false
optimism in medical machine learning. PloS one,
12(9):e0184604.
Mark Dredze, Miles Osborne, and Prabhanjan Kam-
badur. 2016. Geolocation for twitter: Timing mat-
ters. In Proceedings of the 2016 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, pages 1064–1069.
Sindhu Kiranmai Ernala, Michael L Birnbaum,
Kristin A Candan, Asra F Rizvi, William A Ster-
ling, John M Kane, and Munmun De Choudhury.
2019. Methodological gaps in predicting mental
health states from social media: triangulating diag-
nostic signals. In Proceedings of the 2019 CHI Con-
ference on Human Factors in Computing Systems,
pages 1–16.
Prasadith Kirinde Gamaarachchige and Diana Inkpen.
2019. Multi-task, multi-channel, multi-input learn-
ing for mental illness detection using social media
text. In Proceedings of the Tenth International Work-
shop on Health Text Mining and Information Analy-
sis (LOUHI 2019), pages 54–64.
Keith Hawton, Carolina Casañas i Comabella, Camilla
Haw, and Kate Saunders. 2013. Risk factors for sui-
cide in individuals with depression: a systematic re-
view. Journal of affective disorders, 147(1-3):17–
28.
Chin-Lan Huang, Cindy K Chung, Natalie Hui, Yi-
Cheng Lin, Yi-Tai Seih, Ben CP Lam, Wei-Chuan
Chen, Michael H Bond, and James W Pennebaker.
2012. The development of the chinese linguistic in-
quiry and word count dictionary. Chinese Journal of
Psychology.
Xiaolei Huang and Michael Paul. 2018. Examining
temporality in document classification. In Proceed-
ings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short Pa-
pers), pages 694–699.
Xiaolei Huang and Michael Paul. 2019. Neural tem-
porality adaptation for document classification: Di-
achronic word embeddings and domain adaptation
models. In Proceedings of the 57th Annual Meet-
ing of the Association for Computational Linguistics,
pages 4113–4123.
Muhammad Imran, Prasenjit Mitra, and Jaideep Srivas-
tava. 2016. Cross-language domain adaptation for
classifying crisis-related short messages. In 13th In-
ternational Conference on Information Systems for
Crisis Response and Management, ISCRAM 2016.
Information Systems for Crisis Response and Man-
agement, ISCRAM.
Molly Ireland and Micah Iserman. 2018. Within and
between-person differences in language used across
anxiety support and neutral reddit communities. In
Proceedings of the Fifth Workshop on Computa-
tional Linguistics and Clinical Psychology: From
Keyboard to Clinic, pages 182–193.
Dan Iter, Jong Yoon, and Dan Jurafsky. 2018. Auto-
matic detection of incoherent speech for diagnosing
schizophrenia. In Proceedings of the Fifth Workshop
on Computational Linguistics and Clinical Psychol-
ogy: From Keyboard to Clinic, pages 136–146.
Nathalie Japkowicz and Shaju Stephen. 2002. The
class imbalance problem: A systematic study. In-
telligent data analysis, 6(5):429–449.
William L Jarrold, Bart Peintner, Eric Yeh, Ruth Kras-
now, Harold S Javitz, and Gary E Swan. 2010. Lan-
guage analytics for assessing brain health: Cogni-
tive impairment, depression and pre-symptomatic

alzheimer’s disease. In International Conference on
Brain Informatics, pages 299–307. Springer.
Jing Jiang. 2008. A literature survey on domain adap-
tation of statistical classifiers. URL: http://sifaka.cs.uiuc.edu/jiang4/domainadaptation/survey, 3:1–12.
Sandra Just, Erik Haegert, Nora Koránová, Anna-Lena
Bröcker, Ivan Nenchev, Jakob Funcke, Christiane
Montag, and Manfred Stede. 2019. Coherence mod-
els in schizophrenia. In Proceedings of the Sixth
Workshop on Computational Linguistics and Clini-
cal Psychology, pages 126–136.
Efsun Sarioglu Kayi, Mona Diab, Luca Pauselli,
Michael Compton, and Glen Coppersmith. 2017.
Predictive linguistic features of schizophrenia. In
Proceedings of the 6th Joint Conference on Lexical
and Computational Semantics (*SEM 2017), pages
241–250.
Ronald C Kessler. 2003. Epidemiology of women and
depression. Journal of affective disorders, 74(1):5–
13.
Hongmin Li, Doina Caragea, Cornelia Caragea, and
Nic Herndon. 2018. Disaster response aided by
tweet classification with a domain adaptation ap-
proach. Journal of Contingencies and Crisis Man-
agement, 26(1):16–27.
Tom Lippincott and Annabelle Carrell. 2018. Obser-
vational comparison of geo-tagged and randomly-
drawn tweets. In Proceedings of the Second Work-
shop on Computational Modeling of People’s Opin-
ions, Personality, and Emotions in Social Media,
pages 50–55.
Tong Liu, Qijin Cheng, Christopher M Homan, and
Vincent Silenzio. 2017. Learning from various la-
beling strategies for suicide-related messages on so-
cial media: An experimental study. arXiv preprint
arXiv:1701.08796.
Momin M Malik, Hemank Lamba, Constantine Nakos,
and Jürgen Pfeffer. 2015. Population bias in geo-
tagged tweets. In Ninth international AAAI confer-
ence on web and social media.
Tarja K Melartin, Heikki J Rytsala, Ulla S Leskela,
Paula S Lestela-Mielonen, T Petteri Sokero, and
Erkki T Isometsa. 2004. Severity and comorbidity
predict episode duration and recurrence of dsm-iv
major depressive disorder. Journal of Clinical Psy-
chiatry, 65(6):810–819.
Alex J Mitchell, Amol Vaze, and Sanjay Rao. 2009.
Clinical diagnosis of depression in primary care: a
meta-analysis. The Lancet, 374(9690):609–619.
Michelle Morales, Stefan Scherer, and Rivka Levitan.
2017. A cross-modal review of indicators for depres-
sion detection systems. In Proceedings of the Fourth
Workshop on Computational Linguistics and Clini-
cal Psychology—From Linguistic Signal to Clinical
Reality, pages 1–12.
Danielle L Mowery, Y Albert Park, Craig Bryan, and
Mike Conway. 2016. Towards automatically classi-
fying depressive symptoms from twitter data for pop-
ulation health. In Proceedings of the Workshop on
Computational Modeling of People’s Opinions, Per-
sonality, and Emotions in Social Media (PEOPLES),
pages 182–191.
Soichiro Murakami, Makoto Morishita, Tsutomu Hi-
rao, and Masaaki Nagata. 2019. Ntt’s machine
translation systems for wmt19 robustness task. In
Proceedings of the Fourth Conference on Machine
Translation (Volume 2: Shared Task Papers, Day 1),
pages 544–551.
Brendan O’Connor, Michel Krieger, and David Ahn.
2010. Tweetmotif: Exploratory search and topic
summarization for twitter. In Fourth International
AAAI Conference on Weblogs and Social Media.
Ahmed Husseini Orabi, Prasadith Buddhitha, Mah-
moud Husseini Orabi, and Diana Inkpen. 2018.
Deep learning for depression detection of twitter
users. In Proceedings of the Fifth Workshop on
Computational Linguistics and Clinical Psychology:
From Keyboard to Clinic, pages 88–97.
World Health Organization. 2017. Depression and
other common mental disorders: global health es-
timates. Technical report, World Health Organiza-
tion.
Anusri Pampari and Stefano Ermon. 2020. Unsu-
pervised calibration under covariate shift. arXiv
preprint arXiv:2006.16405.
Minsu Park, Chiyoung Cha, and Meeyoung Cha. 2012.
Depressive moods of users portrayed in twitter.
In Proceedings of the ACM SIGKDD Workshop
on healthcare informatics (HI-KDD), volume 2012,
pages 1–8.
Sangdon Park, Osbert Bastani, James Weimer, and In-
sup Lee. 2020. Calibrated prediction with covariate
shift via unsupervised domain adaptation. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Nanyun Peng and Mark Dredze. 2017. Multi-task do-
main adaptation for sequence tagging. In Proceed-
ings of the 2nd Workshop on Representation Learn-
ing for NLP, pages 91–100.
James W Pennebaker, Ryan L Boyd, Kayla Jordan, and
Kate Blackburn. 2015. The development and psy-
chometric properties of liwc2015. Technical report.
James W Pennebaker, Martha E Francis, and Roger J
Booth. 2001. Linguistic inquiry and word count:
Liwc 2001. Mahway: Lawrence Erlbaum Asso-
ciates, 71(2001):2001.
Jeffrey Pennington, Richard Socher, and Christopher D
Manning. 2014. Glove: Global vectors for word rep-
resentation. In Proceedings of the 2014 conference
on empirical methods in natural language process-
ing (EMNLP), pages 1532–1543.

Daniel Preotiuc-Pietro, Johannes Eichstaedt, Gregory
Park, Maarten Sap, Laura Smith, Victoria Tobolsky,
H Andrew Schwartz, and Lyle Ungar. 2015. The
role of personality, age, and gender in tweeting about
mental illness. In Proceedings of the 2nd workshop
on computational linguistics and clinical psychol-
ogy: From linguistic signal to clinical reality, pages
21–30.
Juan Ramos et al. 2003. Using tf-idf to determine word
relevance in document queries. In Proceedings of
the first instructional conference on machine learn-
ing, volume 242, pages 133–142. Piscataway, NJ.
Daniel Rieman, Kokil Jaidka, H Andrew Schwartz, and
Lyle Ungar. 2017. Domain adaptation from user-
level facebook models to county-level twitter predic-
tions. In Proceedings of the Eighth International
Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 764–773.
Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011.
Named entity recognition in tweets: an experimental
study. In Proceedings of the conference on empiri-
cal methods in natural language processing, pages
1524–1534. Association for Computational Linguis-
tics.
Stephanie Rude, Eva-Maria Gortner, and James Pen-
nebaker. 2004. Language use of depressed and
depression-vulnerable college students. Cognition
& Emotion, 18(8):1121–1133.
H Andrew Schwartz, Johannes C Eichstaedt, Mar-
garet L Kern, Lukasz Dziurzynski, Stephanie M Ra-
mones, Megha Agrawal, Achal Shah, Michal Kosin-
ski, David Stillwell, Martin EP Seligman, et al. 2013.
Personality, gender, and age in the language of social
media: The open-vocabulary approach. PloS one,
8(9):e73791.
Joan Serra, Ilias Leontiadis, Dimitris Spathis, Gianluca
Stringhini, Jeremy Blackburn, and Athena Vakali.
2017. Class-based prediction errors to detect hate
speech with out-of-vocabulary words. In Proceed-
ings of the First Workshop on Abusive Language On-
line, pages 36–40.
Martin Shelton, Katherine Lo, and Bonnie Nardi. 2015.
Online media forums as separate social lives: A qual-
itative study of disclosure within and beyond reddit.
iConference 2015 Proceedings.
Tiancheng Shen, Jia Jia, Guangyao Shen, Fuli Feng,
Xiangnan He, Huanbo Luan, Jie Tang, Thanassis
Tiropanis, Tat Seng Chua, and Wendy Hall. 2018.
Cross-domain depression detection via harvesting
social media. International Joint Conferences on Ar-
tificial Intelligence.
Hidetoshi Shimodaira. 2000. Improving predictive in-
ference under covariate shift by weighting the log-
likelihood function. Journal of statistical planning
and inference, 90(2):227–244.
Han-Chin Shing, Suraj Nair, Ayah Zirikly, Meir
Friedenberg, Hal Daumé III, and Philip Resnik.
2018. Expert, crowdsourced, and machine assess-
ment of suicide risk via online postings. In Proceed-
ings of the Fifth Workshop on Computational Lin-
guistics and Clinical Psychology: From Keyboard
to Clinic, pages 25–36.
Hoyun Song, Jinseon You, Jin-Woo Chung, and Jong C Park. 2018. Feature attention network: Interpretable
depression detection from social media. In Proceed-
ings of the 32nd Pacific Asia Conference on Lan-
guage, Information and Computation.
Petar Stojanov, Ahmed Hassan Awadallah, Paul Ben-
nett, and Saghar Hosseini. 2019. On domain transfer
when predicting intent in text.
Adarsh Subbaswamy and Suchi Saria. 2020. From de-
velopment to deployment: dataset shift, causality,
and shift-stable models in health ai. Biostatistics,
21(2):345–352.
Yla R Tausczik and James W Pennebaker. 2010. The
psychological meaning of words: Liwc and comput-
erized text analysis methods. Journal of language
and social psychology, 29(1):24–54.
Adam Tsakalidis, Maria Liakata, Theo Damoulas, and
Alexandra I Cristea. 2018. Can we assess mental
health through social media and smart devices? ad-
dressing bias in methodology and evaluation. In
Joint European Conference on Machine Learning
and Knowledge Discovery in Databases, pages 407–
423. Springer.
Elsbeth Turcan and Kathleen McKeown. 2019. Dread-
dit: A reddit dataset for stress analysis in social me-
dia. EMNLP-IJCNLP 2019, page 97.
Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor
Darrell. 2017. Adversarial discriminative domain
adaptation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
7167–7176.
Nikhita Vedula and Srinivasan Parthasarathy. 2017.
Emotional and linguistic cues of depression from so-
cial media. In Proceedings of the 2017 International
Conference on Digital Health, pages 127–136.
Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen,
and Eiichiro Sumita. 2017. Instance weighting for
neural machine translation domain adaptation. In
Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing, pages
1482–1488.
Genta Indra Winata, Onno Pepijn Kampman, and Pas-
cale Fung. 2018. Attention-based lstm for psycho-
logical stress detection from spoken language us-
ing distant supervision. In 2018 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), pages 6204–6208. IEEE.

JT Wolohan, Misato Hiraga, Atreyee Mukherjee, Zee-
shan Ali Sayyed, and Matthew Millard. 2018. De-
tecting linguistic traces of depression in topic-
restricted text: attending to self-stigmatized depres-
sion with nlp. In Proceedings of the First Interna-
tional Workshop on Language Cognition and Com-
putational Models, pages 11–21.
Fangzhao Wu and Yongfeng Huang. 2016. Sentiment
domain adaptation with multiple sources. In Pro-
ceedings of the 54th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1: Long
Papers), pages 301–310.
Andrew Yates, Arman Cohan, and Nazli Goharian.
2017. Depression and self-harm risk assessment in
online forums. In Proceedings of the 2017 Con-
ference on Empirical Methods in Natural Language
Processing, pages 2968–2978.
A Data
CLPsych 2015 Shared Task. This Twitter dataset
was constructed using regular expressions match-
ing phrases similar to “I was diagnosed with depres-
sion” by Coppersmith et al. (2015b). The authors
manually verified the authenticity of each candi-
date self-disclosure and then sampled an age- and
gender-matched “control” population using tweet-
based inferences (Schwartz et al., 2013). To approx-
imate the depression-control pairs in the original
dataset, which has since been anonymized, we sam-
pled from the full set of available control group
candidates based on their inferred demographics.
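For illustration only, a self-disclosure pattern of this general form could be written as in the sketch below; the actual expressions used to construct the dataset are more extensive and are not reproduced here, so both the pattern and the function name are purely illustrative.

import re

# Illustrative only: a loose pattern for candidate diagnosis self-disclosures.
SELF_DISCLOSURE = re.compile(
    r"\bi\s+(?:was|am|have\s+been)\s+diagnosed\s+with\s+depression\b",
    flags=re.IGNORECASE,
)

def is_candidate(tweet: str) -> bool:
    # Flag a tweet as a candidate self-disclosure for manual verification.
    return SELF_DISCLOSURE.search(tweet) is not None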
Multi-Task Learning. Compiled by Benton
et al. in 2017, this Twitter dataset for multiple men-
tal health disorders was constructed in the same
manner as the CLPsych 2015 Shared Task.5 Al-
though depression-control linkages remain in our
version of the dataset, we only use them to iso-
late an appropriate control group for the depression
group. Individuals who were annotated as part of
both the Multi-Task Learning and the CLPsych
2015 Shared Task data were removed from the
CLPsych data (55 depression, 0 control).
RSDD. The Reddit Self-reported Depression Diagnosis (RSDD) dataset is a Reddit-based data asset in which individuals who self-disclosed that they were living with depression were identified via regular expressions and manually verified, much like the two aforementioned Twitter datasets (Yates et al., 2017). Individuals selected for the control group were required neither to have posted in any of 24 mental health related subreddits nor to have used any of 19 mental health terms.
5 While we were able to reproduce the class distributions of the dataset described in Benton et al. (2017b), we identified discrepancies between the dates that tweets in this version of the dataset were posted relative to the dates that the original component datasets were published.
To align the theme of language generated by in-
dividuals across classification groups, each individ-
ual in the depression group was greedily matched
with 12 individuals from the candidate control
pool based on Hellinger distance between each
individual’s post distribution over subreddits. To
preserve privacy of individuals within the dataset,
usernames were anonymized and post metadata
was redacted. Accordingly, linkages between each
individual within the depression group and their re-
spective control group pairs could not be recreated.
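For concreteness, a minimal sketch of this style of matching is given below; it assumes each individual is represented as a normalized post distribution over subreddits and simplifies the greedy procedure (it does not remove already-matched controls from the candidate pool), so it should not be read as the exact RSDD construction code.

import numpy as np

def hellinger(p: np.ndarray, q: np.ndarray) -> float:
    # Hellinger distance between two discrete probability distributions.
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def nearest_controls(target: np.ndarray, candidates: np.ndarray, k: int = 12) -> np.ndarray:
    # Indices of the k candidate controls whose subreddit post distributions
    # are closest to the target individual's distribution. A full greedy
    # matching would also remove matched controls from the candidate pool.
    distances = np.array([hellinger(target, c) for c in candidates])
    return np.argsort(distances)[:k]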
SMHD. The Self-Reported Mental Health Diagnoses (SMHD) dataset was constructed in a similar manner to RSDD, albeit expanded to cover nine conditions, with more precise regular expressions and a more conservative term/subreddit filter set (Cohan et al., 2018). As
depression group and their controls were not pre-
served in our version of the dataset nor could they
be readily reproduced. A substantial portion of in-
dividuals in SMHD are also part of RSDD; for this
reason, we refrain from conducting domain transfer
experiments between the two datasets.
Topic-Restricted Text. To expand the scope
of our analysis, we follow methods described in
Wolohan et al. (2018) to curate an additional Reddit
dataset in which annotations are assigned based on
community participation and explicit mental health
signal is removed (hence “topic-restricted text”).
Per the original paper, individuals who initiated
one of 10k recent posts in r/depression were con-
sidered members of the depression group, while
individuals who initiated one of 10k recent posts
in r/AskReddit (but not in the recent r/depression
query) were considered to be members of the con-
trol group. Due to the anonymous nature of the
RSDD and SMHD datasets, we were unable to de-
termine if any individuals found within the Topic-
Restricted Text dataset were also in RSDD or
SMHD.
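A minimal sketch of this labeling rule is shown below, assuming the author sets from the two subreddit queries have already been collected; the function and variable names are illustrative rather than those used in our pipeline.

def assign_labels(depression_authors: set, askreddit_authors: set):
    # Community-participation labels: authors of recent r/depression
    # submissions form the depression group; authors of recent r/AskReddit
    # submissions who did not appear in the r/depression query form the
    # control group.
    depression_group = set(depression_authors)
    control_group = set(askreddit_authors) - depression_group
    return depression_group, control_group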
B Temporal Filtering
To limit the introduction of temporal artifacts into
the classification process, all datasets were trun-
cated in time so that at least 100 unique data points
(e.g. Tweets, Reddit comments) were present in
the first and final month across individuals in both classes. Date ranges selected based on this criterion are presented in Table 1.
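As an illustration, this criterion could be implemented as follows; the sketch assumes each data point has been reduced to a class label and a month string, and is our simplified reading of the rule rather than the exact filtering code.

from collections import Counter

def truncation_range(posts, classes=("depression", "control"), minimum=100):
    # posts: list of (class_label, "YYYY-MM") pairs, one per data point.
    counts = Counter(posts)
    months = sorted({month for _, month in posts})
    valid = [m for m in months
             if all(counts[(c, m)] >= minimum for c in classes)]
    # Keep only data between the earliest and latest months in which both
    # classes have at least `minimum` data points.
    return (valid[0], valid[-1]) if valid else None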
C Tokenization
To maintain our ability to interpret results consis-
tently, the same preprocessing pipeline was applied
across all datasets. Text within both Tweets and
Reddit comments was tokenized using a modified
version of the Twokenizer (O’Connor et al., 2010).
English contractions were expanded, while specific
retweet tokens, username mentions, URLs, and nu-
meric values were replaced by generic tokens. As
pronoun usage tends to differ in individuals living
with depression (Vedula and Parthasarathy, 2017),
we removed any English pronouns from our stop
word set.6 Case was standardized across all tokens,
with a single flag included if an entire post was
made in uppercase letters.
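A simplified stand-in for this pipeline is sketched below; the modified Twokenizer and contraction expansion are omitted (whitespace splitting is used in their place), and the regular expressions are illustrative rather than the ones we use.

import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)?\b")
RETWEET = re.compile(r"\bRT\b")

def preprocess(post: str) -> list:
    # Replace retweet markers, URLs, mentions, and numbers with generic
    # tokens, lowercase the text, and append a flag if the original post
    # was written entirely in uppercase.
    flags = ["<upper_flag>"] if post.isupper() else []
    text = RETWEET.sub(" <retweet> ", post)
    text = text.lower()
    text = URL.sub(" <url> ", text)
    text = MENTION.sub(" <user> ", text)
    text = NUMBER.sub(" <numeric> ", text)
    return text.split() + flags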
D Features
Text from all documents for an individual is concatenated and tokenized using the procedure described in Appendix C. The vocabulary of each training procedure is fixed to a maximum of 100,000 unigrams selected based on
KL-divergence of the class-unigram distribution
with the class-distribution of stop words (Chang
et al., 2012). This reduced bag-of-words repre-
sentation is then used to generate the following
additional feature dimensions: a 50-dimensional
LDA topic distribution (Blei et al., 2003), a 64-
dimensional LIWC category distribution (Tausczik
and Pennebaker, 2010), and a 200-dimensional
mean-pooled vector of pretrained GloVe embed-
dings (Pennington et al., 2014). The reduced bag-
of-words representation is transformed using TF-
IDF weighting (Ramos et al., 2003).7
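A minimal sketch of how such a feature matrix could be assembled with scikit-learn and pretrained GloVe vectors is given below. It assumes a precomputed user-by-unigram count matrix over the reduced vocabulary and a token-to-vector GloVe lookup, and omits the LIWC category features; it is not the exact pipeline used here.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfTransformer

def build_features(X_counts, user_tokens, glove, dim=200):
    # X_counts: (n_users, vocab_size) unigram count matrix over the reduced
    # vocabulary; user_tokens: one token list per user; glove: dict mapping
    # token -> pretrained 200-dimensional GloVe vector. In practice, the LDA
    # and TF-IDF transformations are fit on training data only (footnote 7).
    lda = LatentDirichletAllocation(n_components=50, random_state=42)
    topics = lda.fit_transform(X_counts)                          # 50-d topic distribution
    tfidf = TfidfTransformer().fit_transform(X_counts).toarray()  # weighted bag of words
    pooled = np.vstack([                                          # mean-pooled GloVe vectors
        np.mean([glove[t] for t in toks if t in glove] or [np.zeros(dim)], axis=0)
        for toks in user_tokens
    ])
    return np.hstack([tfidf, topics, pooled])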
E Hyperparameter Selection
Each model is trained using a hyperparameter grid search over the regularization strengths {1e-3, 1e-2, 1e-1, 1, 10, 100, 1e3, 1e4, 1e5}. Hyperparameters are selected to maximize the F1 score within the development split of each dataset.
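For illustration, this selection loop could be implemented as follows; the sketch assumes an l2-regularized logistic regression as the underlying classifier (one plausible instantiation, not a restatement of the specific models used) and pre-computed train/development splits.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Grid of regularization strengths; LogisticRegression's C parameter is the
# inverse of the regularization strength.
STRENGTHS = [1e-3, 1e-2, 1e-1, 1, 10, 100, 1e3, 1e4, 1e5]

def select_model(X_train, y_train, X_dev, y_dev):
    # Fit one model per grid point and keep whichever scores best on the
    # development split.
    best_f1, best_model = -1.0, None
    for strength in STRENGTHS:
        model = LogisticRegression(C=1.0 / strength, max_iter=1000)
        model.fit(X_train, y_train)
        score = f1_score(y_dev, model.predict(X_dev))
        if score > best_f1:
            best_f1, best_model = score, model
    return best_model, best_f1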
6English Stop Words (nltk.org)
7All data-specific feature transformations (e.g. LDA, TF-
IDF) are learned without access to development or test data.
We use Scikit-learn implementations of LDA and TF-IDF.