Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3774–3788
November 16 - 20, 2020. © 2020 Association for Computational Linguistics
Keith Harrigian, Carlos Aguirre, Mark Dredze
Johns Hopkins University
Proxy-based methods for annotating mental
health status in social media have grown popu-
lar in computational research due to their abil-
ity to gather large training samples. However,
an emerging body of literature has raised new
concerns regarding the validity of these types
of methods for use in clinical applications. To
further understand the robustness of distantly
supervised mental health models, we explore
the generalization ability of machine learning
classifiers trained to detect depression in indi-
viduals across multiple social media platforms.
Our experiments not only reveal that substan-
tial loss occurs when transferring between plat-
forms, but also that there exist several unreli-
able confounding factors that may enable re-
searchers to overestimate classification perfor-
mance. Based on these results, we enumer-
ate recommendations for future mental health
dataset construction.
1 Introduction
In the last decade, there has been substantial growth
in the area of digital psychiatry. Automated meth-
ods using natural language processing have been
able to detect mental health disorders based on a
person’s language in a variety of data types, such as
social media (Mowery et al., 2016; Morales et al.,
2017), speech (Iter et al., 2018) and other writings
(Kayi et al., 2017; Just et al., 2019). As in-person
clinical visits are made increasingly difficult by so-
cioeconomic barriers and public-health crises, such
as COVID-19, tools for measuring mental wellness
using implicit signal become more important than
ever (Abdel-Rahman, 2019; Bojdani et al., 2020).
Early work in this area leveraged traditional hu-
man subject studies in which individuals with clini-
cally validated psychiatric diagnoses volunteered
their language data to train classifiers and perform
quantitative analyses (Rude et al., 2004; Jarrold
et al., 2010). In an effort to model larger, more di-
verse populations with less overhead, a substantial
portion of research in the last decade has instead
explored data annotated via automated mechanisms
(Coppersmith et al., 2015a; Winata et al., 2018).
Studies leveraging proxy-based annotations have
supported their design by demonstrating alignment
with existing psychological theory regarding lan-
guage usage by individuals living with a mental
health disorder (Cavazos-Rehg et al., 2016; Vedula
and Parthasarathy, 2017). For example, feature
analyses have highlighted higher amounts of neg-
ative affect and increased personal pronoun preva-
lence amongst depressed individuals (Park et al.,
2012; De Choudhury et al., 2013). Given these con-
sistencies, the field has largely turned its attention
toward optimizing predictive power via state of the
art models (Orabi et al., 2018; Song et al., 2018).
The ultimate goal of these efforts has been
threefold—to better personalize psychiatric care,
to enable early intervention, and to monitor
population-level health outcomes in real time.
Nonetheless, research has largely trudged forward
without stopping to ask one critical question: do
models of mental health conditions trained on au-
tomatically annotated social media data actually
generalize to new data platforms and populations?
Typically, the answer is no—or at least not with-
out modification. Performance loss is to be ex-
pected in a variety of scenarios due to underly-
ing distributional shifts, e.g. domain transfer (Shi-
modaira, 2000; Subbaswamy and Saria, 2020). Ac-
cordingly, substantial effort has been devoted to de-
veloping computational methods for domain adap-
tation (Imran et al., 2016; Chu and Wang, 2018).
Outcomes from this work often provide a solid
foundation for use across multiple natural language
processing tasks (Daume III and Marcu, 2006).
However, it is unclear to what extent factors spe-
cific to mental health require tailored intervention.

In this study, we demonstrate that at a baseline,
proxy-based models of mental health status do not
transfer well to other datasets annotated via auto-
mated mechanisms. Supported by five widely used
datasets for predicting depression in social media
users from both Reddit and Twitter, we present
a combination of qualitative and quantitative ex-
periments to identify troublesome confounds that
lead to poor predictive generalization in the mental
health research space. We then enumerate evidence-
based recommendations for future mental health
dataset construction.
Ethical Considerations. Given the sensitive na-
ture of data containing mental health status of in-
dividuals, additional precautions based on guid-
ance from Benton et al. (2017a) were taken dur-
ing all data collection and analysis procedures.
Data sourced from external research groups was re-
trieved according to each dataset’s respective data
usage policy. The research was deemed exempt
from review by our Institutional Review Board
(IRB) under 45 CFR § 46.104.
2 Domain Adaptation in Mental Health
Domain adaptation (or “transfer”) of statistical clas-
sifiers is a well-studied computational problem
with high relevance across several areas of natu-
ral language processing (Jiang, 2008; Peng and
Dredze, 2017). It is particularly useful in situations
where acquiring ample training data for a target
application is intractable (e.g. monetary, time con-
straints) or impossible (e.g. privacy constraints)
(Rieman et al., 2017). For example, in the sub-field
of machine translation, significant effort is devoted
to finding ways to effectively use large corpora of
formal parallel text to train models for application
in domains with informal and dynamic language,
such as social media and conversational speech
(Wang et al., 2017; Murakami et al., 2019).
Traditional challenges encountered when trans-
ferring models between domains include variance
in source and target class distributions (Japkowicz
and Stephen, 2002), semantic misalignment (Wu
and Huang, 2016), and sparse vocabulary overlap
(Stojanov et al., 2019). Fortunately, once these
issues are identified, it is typically possible to de-
crease the transfer performance gap via methods
such as structural correspondence learning, feature
subspace mapping, and adversarial training (Blitzer
et al., 2006; Bach et al., 2016; Tzeng et al., 2017).
Domain adaptation is of particular interest in
the mental health space, where there exist numer-
ous complexities in obtaining a sufficient sample
of training data. For instance, the sensitive na-
ture of mental health data necessitates extra care
when creating and supporting new datasets (Benton
et al., 2017a). Additionally, behavioral disorders
are known to display variable clinical presentations
amongst different populations, which can make
identification of ground truth difficult (De Choud-
hury et al., 2017; Arseniev-Koehler et al., 2018).
The latter point highlights the presence of label
noise inherent in mental health data (Mitchell et al.,
2009; Shing et al., 2018). This facet serves as one
of two primary issues unique to this research space
that may hinder attempts at domain transfer. In-
deed, prior work found that diverse and sometimes
conflicting views humans have regarding suicidal
ideation can make obtaining reliable gold-standard
labels fundamentally challenging and lead to degra-
dation in model performance (Liu et al., 2017).
Sampling-related biases present the other main
area of concern for successful domain transfer by
mental health classifiers. Attributes such as per-
sonality, gender, age, and disorder co-morbidity
have been found to significantly affect the presen-
tation of mental health disorders in language data
(Cummins et al., 2015; Preotiuc-Pietro et al., 2015).
Moreover, the proxy-based annotation mechanisms
used to label large social media data sets with
mental health status invite the introduction of self-
disclosure bias into the modeling task (Amir et al.,
2019). Specifically, labels sourced from popula-
tions of individuals who self-disclose certain at-
tributes may contain activity-level and thematic
biases that cause poor generalization in larger pop-
ulations (Lippincott and Carrell, 2018).
Research leveraging text data for mental health
status classification has primarily considered
a constrained form of domain transfer. In a within-
subject analysis, Ireland and Iserman (2018) ex-
amined differences in language usage by Reddit
users who had posted in an anxiety support forum
within and outside mental health forums. Simi-
larly, Wolohan et al. (2018) explored the predictive
power of models trained to detect depression within
Reddit users as a function of access to text from
explicit mental health related subreddits. Both stud-
ies highlighted a mitigation of overt mental health
discussion outside of the support forums, but still
detected linguistic nuances in individuals with an
affiliation to the mental health forums.

Dataset                 Size (Individuals)    Annotation Mechanism
CLPsych                 Control: 477          Regular expressions; Manual verification;
                        Depression: 477       Age- & gender-matched controls
Multi-Task Learning     Control: 1,400        Regular expressions; Manual verification;
                        Depression: 1,400     Age- & gender-matched controls
RSDD                    Control: 107,274      Regular expressions; Manual verification;
                        Depression: 9,210     Subreddit-based controls
SMHD                    Control: 127,251      Regular expressions;
                        Depression: 14,139    Subreddit-based controls
Topic-Restricted Text   Control: 7,016        Community participation
                        Depression: 6,853
Table 1: Summary statistics for each dataset. All datasets leverage proxy-based annotations. Distribution over
time and sample size varies significantly between datasets.
Shen et al. (2018) attempted to use transfer learn-
ing with large amounts of English Twitter data
annotated with individual-level depression labels
to improve predictive performance of depression
classifiers in Chinese Weibo data. Using the En-
glish and Chinese versions of the Linguistic Inquiry
and Word Count tool (LIWC) (Pennebaker et al.,
2001; Huang et al., 2012) in conjunction with other
modalities of social data (e.g. profile metadata, im-
ages), the authors showed that signal from Twitter
was useful for classification on Weibo.
Recent work from Ernala et al. (2019) was the
first to explore some of the aforementioned difficulties
with domain transfer in the mental health space.
Multiple different annotation mechanisms were
used to train Twitter-based models for identifying
schizophrenia and then applied to Facebook data
from an independent population of clinically diag-
nosed schizophrenia patients. Three different types
of proxy signals with varying degrees of manual
supervision were each found to generalize poorly
to the clinical population. While the authors’ anal-
ysis suggested the domains were similar enough
to justify transfer attempts, only limited post-hoc
analysis of the data platform effect was carried out.
Thus, it remains unclear to what extent the annota-
tion methodologies as opposed to platform effects
(or other confounds) caused the degradation.
3 Data
We select depression classification as our task be-
cause it is perhaps the most widely studied, has
multiple datasets from different platforms, and is
of critical importance to society. Estimated to
affect 4.4% of the global population, depression
presents a significant economic burden and remains
the most common psychiatric disorder associated
with deaths by suicide (Hawton et al., 2013; Organi-
zation et al., 2017). Occupying a lion’s share of the
computational literature, depression classification
is a critical first target for evaluating generalization
of mental health models in social media (Chancel-
lor and De Choudhury, 2020).
To quantify the nature of domain transfer loss,
we consider five datasets. Datasets were selected
based on their common adoption in the literature
(Preotiuc-Pietro et al., 2015; Gamaarachchige and
Inkpen, 2019) and their use of proxy-based anno-
tations (Coppersmith et al., 2014). We use two
Twitter—CLPsych 2015 Shared Task (Coppersmith
et al., 2015b), Multi-Task Learning (Benton et al.,
2017b)—and three Reddit datasets—RSDD (Yates
et al., 2017), SMHD (Cohan et al., 2018), and Topic-
Restricted Text (Wolohan et al., 2018). Table 1
presents summary statistics. Construction details
are in Appendix A as a courtesy to the reader.
3.1 Mitigating Bias
Each dataset was curated in part by a system of sim-
ple rules (e.g. matches to “I was diagnosed with
depression,” participation in a depression support
forum). While these heuristics are useful for iden-
tifying candidates to include within each dataset,
they also risk introducing bias that may render the
modeling task trivial. For example, individuals
who disclose a depression diagnosis are likely to
also share their experience with other psychiatric
conditions (Benton et al., 2017b), while language
used in dedicated mental-health subreddits system-
atically differs from the rest of Reddit (De Choud-
hury and De, 2014; Ireland and Iserman, 2018).
To encourage our mental health classifiers to
learn subtle linguistic nuances that cannot be easily
captured using straightforward logic, we make ef-
forts to exclude unambiguous mental health content
from all training and evaluation procedures. In line

with prior work, we discard posts that include men-
tions of clinically-defined psychiatric conditions,
adopting the list of mental health terms enumerated
by Cohan et al. (2018) as a reference. This list
(N=458) extends work from Yates et al. (2017) by
including disorders tangential to depression, com-
mon misspellings, and colloquial references.
As is standard for mental health modeling, we
also discard posts made in subreddits dedicated to
providing mental health support (Yates et al., 2017;
Cohan et al., 2018; Wolohan et al., 2018). Since
new subreddits are created daily and our version
of the Topic-Restricted Text dataset contains posts
made after collection of RSDD and SMHD, we cre-
ate an updated list of mental health support subred-
dits. To do so, we examine the empirical distribu-
tion of posts amongst subreddits within the Topic-
Restricted Text dataset and rank each subreddit S
based on pointwise mutual information (PMI) for
the depression group D, log(p(S|D)/p(S)). We
manually examined the top 1000 subreddits based
on PMI and identified all subreddits whose descrip-
tion affirmed an association to mental health.
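The PMI ranking described above can be sketched as follows. This is a minimal illustration rather than the released code: the function name, the `(subreddit, in_depression_group)` input format, and the toy counts in the usage note are assumptions made for the example.

```python
import math
from collections import Counter


def rank_subreddits_by_pmi(posts):
    """Rank subreddits by log(p(S|D) / p(S)) for the depression group D.

    `posts` is an iterable of (subreddit, in_depression_group) pairs,
    one pair per post (an assumed input format).
    """
    total = Counter()      # posts per subreddit over all individuals
    depressed = Counter()  # posts per subreddit from the depression group
    for subreddit, in_depression_group in posts:
        total[subreddit] += 1
        if in_depression_group:
            depressed[subreddit] += 1

    n_total = sum(total.values())
    n_depressed = sum(depressed.values())

    scores = {}
    for subreddit, count in depressed.items():
        p_s_given_d = count / n_depressed
        p_s = total[subreddit] / n_total
        scores[subreddit] = math.log(p_s_given_d / p_s)

    # Highest PMI first; the top of this list is what would be reviewed manually.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Subreddits with no posts from the depression group have an undefined (negative infinite) score and are omitted from the ranking. On a toy sample where a support forum receives 3 of 4 depression-group posts but only 4 of 9 posts overall, that forum ranks first with a positive score.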
Our list (N=242) expands existing resources
from Yates et al. (2017) and Cohan et al. (2018) by
providing 162 additional mental health subreddits,
many of which were actually created before the
collection of RSDD and SMHD.1 While this step
diminishes the risk of mental health content saturat-
ing the Topic-Restricted Text dataset, the list’s ex-
pansion beyond that of the RSDD and SMHD lists
suggests that the former two Reddit datasets may
indeed still have overt mental health content. We
explore how different degrees of subreddit-based
filtering may affect generalization in §6.4.
4 Models
We begin by training classification models for pre-
dicting depression on each dataset. All classifica-
tion experiments leverage the same training proce-
dure and features (see Appendix D for details). As
a classifier, we use l2-regularized logistic regression.
Despite our model’s relative simplicity, we
are able to achieve respectable within-domain clas-
sification performance while maintaining an ability
to interpret learned parameters. Logistic regression
has served as a difficult benchmark to beat given
access to appropriate engineered features for prior
mental health studies (Benton et al., 2017b).
1Subreddits and code are made available to other re-
4.1 Model Validation
To validate our modeling framework against prior
work, we first establish within-domain predictive
baselines. This step also allows us to contextualize
performance by estimating the intrinsic difficulty
of modeling each dataset (DeMasi et al., 2017).
Methods. We use train/development/test splits
if they have been established by the dataset distrib-
utor; otherwise, we sample 20% from the available
data to be used as a held-out test set and then create
an additional 80/20 train/dev split using the remain-
ing data. For each dataset, we use an independent
grid search to select regularization strength C that
maximizes F1 in the dataset’s development split
(see Appendix E). We use a binarization threshold
of 0.5 (noninclusive) for all datasets.
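The selection step can be illustrated with a short sketch. It assumes that development-set probabilities have already been produced by one fitted model per candidate value of C; the function names and the dictionary input are hypothetical. The strict inequality mirrors the noninclusive threshold.

```python
def f1_at_threshold(probabilities, labels, threshold=0.5):
    """F1 for the positive class; a score must strictly exceed the threshold."""
    predictions = [int(p > threshold) for p in probabilities]
    tp = sum(1 for y_hat, y in zip(predictions, labels) if y_hat == 1 and y == 1)
    fp = sum(1 for y_hat, y in zip(predictions, labels) if y_hat == 1 and y == 0)
    fn = sum(1 for y_hat, y in zip(predictions, labels) if y_hat == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_regularization_strength(dev_probabilities_by_c, dev_labels):
    """Return the candidate C whose development-set F1 is highest.

    `dev_probabilities_by_c` maps each candidate C to that model's
    predicted probabilities on the development split (assumed format).
    """
    return max(
        dev_probabilities_by_c,
        key=lambda c: f1_at_threshold(dev_probabilities_by_c[c], dev_labels),
    )
```

Note that an individual scored at exactly 0.5 is assigned to the control class under this rule.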
Results. We report test set F1 for each dataset in
the bottom row of Table 2. Our models perform on
par with prior research for the two Twitter datasets
and the Topic-Restricted Text dataset. Results for
RSDD and SMHD improve upon their respective
baseline models, but are inferior to neural methods.
5 Transfer Experiments
We conduct a series of experiments to measure
the generalization of models between depression
datasets and explain sources of model degradation.
5.1 Cross-domain Transfer
Task formulation and dataset design remain a signif-
icant source of nuance across prior studies for men-
tal health status prediction (Morales et al., 2017;
Chancellor and De Choudhury, 2020). As such,
we hypothesize that standardizing training settings
(e.g. class balance, sample size) will account for
discrepancies in cross-domain performance.
Methods. We consider two experimental de-
signs. In the first experiment (†), we downsample
all datasets to have the same training/development
size of the smallest class in the smallest dataset
(i.e. CLPsych). In the second experiment (††),
we balance class distributions independently for
each dataset based on the dataset’s smaller class,
but allow sample size to vary between datasets.
The former experiment allows us to establish eq-
uitable baselines between datasets, while the latter
experiment enables us to explore whether access to
additional training data ameliorates transfer loss.
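The two sampling designs differ only in whether a common per-class cap is applied. A minimal sketch under that reading is given below; the function name, the `per_class_size` parameter, and the fixed seed are assumptions for illustration.

```python
import random


def balance_classes(individuals, labels, per_class_size=None, seed=0):
    """Return a class-balanced sample of (individual, label) pairs.

    per_class_size=None balances on the dataset's own smaller class,
    as in the second design. Passing a fixed size additionally
    downsamples every class to that size, as in the first design.
    """
    rng = random.Random(seed)
    by_class = {}
    for individual, label in zip(individuals, labels):
        by_class.setdefault(label, []).append(individual)

    smallest = min(len(members) for members in by_class.values())
    n = smallest if per_class_size is None else min(per_class_size, smallest)

    sample = []
    for label in sorted(by_class):
        # Sample without replacement so no individual appears twice.
        sample.extend((member, label) for member in rng.sample(by_class[label], n))
    return sample
```

For the first design, `per_class_size` would be set once from the smallest class of the smallest dataset and reused for every dataset.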

                                                Test Data
Train Data                   CLPsych      Multi-Task   RSDD         SMHD         Topic-Restricted
                                          Learning                               Text
CLPsych                 †    .774 ± .009  .635 ± .054  .169 ± .011  .064 ± .006  .638 ± .034
                        ††   .774 ± .009  .635 ± .054  .169 ± .011  .064 ± .006  .638 ± .034
Multi-Task Learning     †    .533 ± .111  .802 ± .018  .149 ± .001  .054 ± .000  .648 ± .007
                        ††   .739 ± .004  .830 ± .005  .149 ± .001  .054 ± .001  .655 ± .011
RSDD                    †    .247 ± .034  .338 ± .041  .338 ± .010  -            .487 ± .046
                        ††   .284 ± .046  .407 ± .051  .405 ± .003  -            .434 ± .003
SMHD                    †    .335 ± .048  .543 ± .040  -            .186 ± .007  .626 ± .011
                        ††   .355 ± .028  .464 ± .028  -            .212 ± .006  .631 ± .007
Topic-Restricted Text   †    .624 ± .018  .516 ± .060  .173 ± .017  .105 ± .014  .686 ± .007
                        ††   .668 ± .008  .648 ± .026  .218 ± .004  .106 ± .008  .735 ± .002
Table 2: F1 score (µ ± σ) for the Balanced & Downsampled (†) and Balanced (††) cross-domain transfer exper-
iments. Baselines described in §4.1, which preserve class imbalance during training, are presented in the bottom
row. Increasing dataset size (10x in some cases) does not unanimously improve transfer.
For both experiments, we start by combining
training and development splits. Then, for each
dataset, we sample from the combined splits based
on the parameters of the experiment and split the re-
sulting sample into 5 class-stratified folds. We train
5 classifiers per dataset, using 4 folds for training
each time, and apply the classifiers to each dataset’s
test set. Since a substantial portion of individuals
in SMHD are part of RSDD, we refrain from con-
ducting experiments between the two datasets.
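The fold construction can be sketched without any external library. The helper names below are hypothetical, and the round-robin assignment is one simple way to obtain class-stratified folds, not necessarily the one used in our code.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Split example indices into n_folds folds with similar class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal each class out round-robin so every fold gets a near-equal share.
        for position, index in enumerate(members):
            folds[position % n_folds].append(index)
    return folds


def training_sets(folds):
    """Yield one training index list per classifier, each built from all but one fold."""
    for held_out in range(len(folds)):
        yield [i for k, fold in enumerate(folds) if k != held_out for i in fold]
```

With five folds this yields five training sets, each covering four folds; the resulting five classifiers are then applied to a dataset's test set, which gives the mean and standard deviation reported per cell.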
Results. We report F1 score (µ ± σ) for both
experiments in Table 2. In line with existing re-
search, within-domain training outperforms cross-
domain training in each of our datasets for both
sampling settings. While additional samples avail-
able for training in the second experiment mod-
erately improve within-domain performance, they
are not uniformly helpful for mitigating transfer
loss to other datasets. Models generally outper-
form a random classifier at ranking depression risk
in cross-domain transfer scenarios. However, some
models are poorly calibrated for new domains and
consequently obtain low F1 scores (e.g. CLPsych
→ SMHD). Addressing miscalibration in domain
transfer scenarios remains an open research ques-
tion (Pampari and Ermon, 2020; Park et al., 2020).
We find that models trained on Twitter data trans-
fer to Reddit data better than models in the reverse
direction. Not surprisingly, given their overlap in
training samples, models trained on the SMHD
and RSDD datasets transfer to other domains in
an equitable manner, trading improvements with
each other across transfer settings. These results
indicate that sample size and class balance are not
solely responsible for generalization loss.
5.2 Temporal Transfer
Typical sources of transfer loss concern differences
in features between domains (Blitzer et al., 2007;
Ben-David et al., 2010). However, other factors
may govern model degradation for depression clas-
sification. One such cause of loss is temporal mis-
alignment between the datasets (Table 1). Prior
work has shown that language dynamics may hin-
der models upon deployment (Dredze et al., 2016;
Huang and Paul, 2018). In social media, where
users adopt new linguistic norms rapidly, perfor-
mance may be more volatile (Brigadir et al., 2015).
5.2.1 Class Misalignment
As an exercise to understand whether temporal
artifacts are present in the datasets, we first con-
sider training and evaluating single-domain models
with a temporal misalignment between the control
and depression groups. By training on mutually-exclusive
time periods for each class, we hypothesize
the classifier will not only be able to learn how to
distinguish between groups, but also to distinguish
between time periods. If this hypothesis holds true,
we expect performance metrics to be artificially in-
flated when a temporal exclusivity per class exists.
Methods. We split each dataset into one year
periods based on the calendar year. For each year,
we identify individuals in the Twitter datasets with
at least 200 posts and individuals in the Reddit
datasets with at least 100 posts.2
We balance
the number of individuals across time periods and
groups within each dataset, but allow this sample
size to vary across datasets. To account for growth
in post frequency over time (which increases the
number of documents that generate individual fea-
ture vectors), we perform additional post-level sam-
pling. We randomly select 200 posts per year in the
Twitter datasets and 100 posts per year in the Red-
dit datasets. Samples of individuals within each
time period are additionally separated into 5 strati-
fied folds. Folds are established so that individuals
in the training data of one time period are never
present in the test data of another time period.
2We use 2x more posts in the Twitter data to account for
posts in the Reddit datasets having roughly twice as many
words as Tweets do on average.

[Figure 1. Left panels: average F1 by Train Year and Test Year. Right panels: F1 percent difference from within domain/within time versus Latency (yrs.), by Source Data.]
Figure 1: Temporal-transfer results. (Left) Average within-domain F1 score as a function of training and evalua-
tion periods. Predictive performance tends to be better for more recent temporal splits regardless of training period.
(Right) Percent difference in F1 score relative to within-domain, no-latency model. Models trained on Twitter data
benefit most from temporal alignment. Performance suffers when applying new models to old data.
To evaluate the degree to which temporal effects
are present, we sample groups from all possible
combinations of time periods. For example, in one
setting, both the control and depression groups are
sampled from 2013; in another setting, the control
group is sampled from 2013, while the depression
group is sampled from 2015. For each combination,
we use 4 of the stratified folds for training and use
the remaining fold for evaluation, and then repeat
the process for all folds. We compare performance
when classes are sampled from the same time pe-
riod against performance when classes are sampled
from mutually exclusive time periods.
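A minimal sketch of the sampling and of the enumeration of class/time-period settings follows. The `(user_id, year, text)` tuple format and both function names are assumptions for illustration; `posts_per_year` would be 200 for Twitter and 100 for Reddit.

```python
import random
from collections import defaultdict
from itertools import product


def sample_posts_by_year(posts, posts_per_year, seed=0):
    """Keep individuals with enough posts in a year and sample a fixed number.

    `posts` is a list of (user_id, year, text) tuples (assumed format).
    Returns {year: {user_id: [text, ...]}}.
    """
    rng = random.Random(seed)
    grouped = defaultdict(lambda: defaultdict(list))
    for user_id, year, text in posts:
        grouped[year][user_id].append(text)

    sampled = {}
    for year, users in grouped.items():
        sampled[year] = {
            user_id: rng.sample(texts, posts_per_year)
            for user_id, texts in users.items()
            if len(texts) >= posts_per_year  # minimum-activity filter
        }
    return sampled


def class_period_settings(years):
    """Enumerate every (control year, depression year) combination."""
    return [
        {"control": c, "depression": d, "aligned": c == d}
        for c, d in product(years, repeat=2)
    ]
```

Settings with `aligned` set to true correspond to temporally-aligned classes; all others draw the two classes from mutually exclusive periods.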
Results. We achieve a 3-22% increase in F1
across all datasets when classes are sampled from
mutually exclusive time periods instead of being
temporally-aligned. The improvement suggests
that temporal artifacts exist, as the classifier is able
to not only identify signal relevant to classifying
depression, but also to classifying data from dif-
ferent periods of time. This result highlights the
importance of sampling classes evenly over time.
5.2.2 Latency
We now measure the effect temporal artifacts have
on cross-domain performance. We hypothesize
model degradation scales with deployment latency.
Methods. We use the same data sampling mech-
anism described in §5.2.1. However, we now only
consider the case in which control and depression
groups are sampled from the same time period. As
before, we train a classifier on 4 of the 5 stratified
folds for a time period in one dataset. We then
evaluate within-domain performance using the re-
maining fold and cross-domain performance using
one fold from each time period in the other datasets.
We assume ground truth is consistent over multiple
time periods; given the episodic nature of depres-
sion, we recognize this may promote pessimistic
results for some periods (Tsakalidis et al., 2018).
Results. Examining within-domain results in
Figure 1 (left), predictive performance tends to be
better for more recent temporal splits regardless of
training period. Classifiers trained on old data (rela-
tive to the evaluation period) tend to perform on par
with aligned regimens, while classifiers trained on
new data show linear losses over time. Losses are
significant after 2-3 years depending on the dataset.
Though some trends do emerge, cross-domain
performance as a function of temporal latency is
relatively variable. Visualized in Figure 1 (right),
models trained on the Twitter datasets benefit most
from temporal alignment in cross-domain settings.
Models trained on Topic-Restricted Text show sig-
nificant drop offs in predictive performance when
applied to older samples within all Reddit datasets.
While models trained on RSDD perform better on
Topic-Restricted Text as latency is reduced, models
trained on SMHD do not exhibit the same trend.
6 Post-hoc Analysis
In the previous section, we identified the degree
to which loss occurs under a variety of domain
transfer settings. However, these settings do not
account for all performance disparities. In this sec-
tion, we measure differences between the datasets
to understand the source of loss.

6.1 Vocabulary Overlap
Traditionally, different feature vocabularies ac-
count for domain transfer loss (Serra et al., 2017;
Chen and Gomes, 2019; Stojanov et al., 2019).
Therefore, we hypothesize that limited feature over-
lap and poor vocabulary alignment across datasets
could hinder cross-domain generalization.
Methods. We explore this phenomenon by com-
puting the Jaccard Similarity (JS) of vocabularies
between each dataset. We examine correlations
between JS and F1 scores from the cross-domain
transfer experiments discussed in §5.1.
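Both quantities are simple to compute. The sketch below uses hypothetical function names and treats a vocabulary as any collection of feature strings.

```python
import math


def jaccard_similarity(vocab_a, vocab_b):
    """Size of the vocabulary intersection divided by the size of the union."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance_x = sum((xi - mean_x) ** 2 for xi in x)
    variance_y = sum((yi - mean_y) ** 2 for yi in y)
    return covariance / math.sqrt(variance_x * variance_y)
```

In this setting, `x` would hold the JS value for each ordered pair of datasets and `y` the corresponding cross-domain F1 score.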
Results. We find the minimum similarity oc-
curs between the CLPsych and RSDD datasets
(JS = 0.10) while the maximum occurs between
the Topic-Restricted Text and SMHD datasets
(JS = 0.65).3 Only a weak correlation between
similarity and performance exists (Pearson ρ <
0.18), suggesting poor generalization is not solely
due to differences in vocabulary.
6.2 Topical Alignment
Our classification models leverage reduced fea-
ture representations in the form of LDA topic-
distributions (Blei et al., 2003) and mean-pooled
pre-trained GloVe embeddings (Pennington et al.,
2014). Designed to capture and reflect seman-
tics, we hypothesized these low-dimensional fea-
tures would mitigate transfer loss due to poor vo-
cabulary alignment. Lacking support from our
cross-domain transfer results, we look closer at
the themes present within each dataset.
Methods. We identify the unigrams that are
most unique to each dataset and group. For
each dataset, we use scores assigned by our KL-
divergence-based feature selection method (see Ap-
pendix D) to rank the most informative features per
class (Chang et al., 2012). We jointly examine the
top-500 most informative unigrams per class, not-
ing high-level themes common across the datasets.
Results. With respect to similarities, we note
that words used in discussion about gender and sex-
uality are strongly associated with each of the de-
pression groups (e.g. ‘cis’, ‘homophobia’, ‘mascu-
line’), likely a reflection of marginalized groups be-
ing at higher risk of depression (Budge et al., 2013).
Also ubiquitous amongst each of the datasets are
references to self-injurious behavior (e.g. ‘wrists’,
‘self-harm’, ‘hotline’). Increased emoji usage and
references to athletics (‘nbafinals’, ‘scorer’) are
strong indicators of the control group in each
dataset, as well as terms reflecting current events.
3JS is moderately deflated in RSDD due to the dataset’s
large vocabulary, causing SMHD and Topic-Restricted Text to
have the highest similarity instead of SMHD and RSDD.
With respect to differences, associations between
word usage and depression are subjectively easier
to interpret within the Reddit datasets. For example,
discussion of mental-health treatment (e.g. ‘coun-
selor’, ‘therapy’, ‘wellbutrin’) and familial and in-
timate relationships (‘brother-in-law’, ‘soulmate’)
are prominent within the Reddit datasets. In con-
trast, language associated with depression within
the Twitter datasets tends to reflect slightly more
nuanced elements of the condition—e.g. social
inequity (‘sexism’, ‘#yesallwomen’) and fantasy
(‘fanfics’, ‘cosplay’, ‘villians’). These themes align
with empirical findings that women are at a higher
risk of depression (Kessler, 2003) and depressed
individuals often find solace in niche subcultures
(Blanco and Barnett, 2014; Bowes et al., 2015).
Additionally, we find several temporally-
isolated references within the Twitter datasets
(e.g. ‘#RIPRobinWilliams’, ‘#SDCC’). In the
Multi-Task Learning dataset, we also see several
terms using non-American English (e.g. ‘colour’,
‘favourite’) which may represent a geographic im-
balance amongst the sampled individuals.
6.3 Stability of LIWC
The Linguistic Inquiry and Word Count (LIWC)
dictionary has been an effective tool for measuring
linguistic nuances of mental health disorders
regardless of textual formality (Mowery et al.,
2016; Turcan and McKeown, 2019). Our version
of the dictionary (2007) maps approximately 12k
words to 64 dimensions (e.g. negative emotion,
leisure) that have been empirically validated to
capture an individual’s social and psychological
states (Tausczik and Pennebaker, 2010).4 A single
LIWC feature value represents the proportion of
words used across an individual’s post history that
match the given LIWC dimension. In the same way
that we expect semantic distributions (§6.2) to ame-
liorate transfer loss, we hypothesize that models
trained on this representation will be more robust
when vocabulary overlap is sparse.
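The feature computation can be sketched with a toy lexicon. The lexicon format, the function name, and the use of exact token matching are assumptions for illustration; the licensed LIWC dictionary and its matching rules are not reproduced here.

```python
def liwc_proportions(tokens, lexicon):
    """Proportion of an individual's tokens that fall in each dimension.

    `tokens` is the tokenized post history of one individual and
    `lexicon` maps a dimension name to a set of words (toy format).
    """
    counts = {dimension: 0 for dimension in lexicon}
    for token in tokens:
        for dimension, words in lexicon.items():
            if token in words:  # a token may count toward several dimensions
                counts[dimension] += 1
    total = len(tokens)
    return {
        dimension: (count / total if total else 0.0)
        for dimension, count in counts.items()
    }
```

For the seven-token history "i feel sad and i feel tired" and a toy lexicon with a negative emotion dimension {sad, tired} and a first-person dimension {i}, both feature values are 2/7.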
4The 2007 version of LIWC has a high similarity with the
2015 version amongst dimensions most strongly associated
with depression (Pennebaker et al., 2015).
Methods. We explore this hypothesis from three
angles: 1) We perform cross-domain transfer exper-
iments using LIWC as the only feature set provided
for training and evaluation; 2) We fit LIWC-based
classifiers 100 times per dataset using random 70%
samples and examine correlations of the learned co-
efficients; 3) We compute the average feature value
of each LIWC dimension per class and measure the
difference between classes.
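The second angle (repeated fits on random 70% samples, followed by a correlation of coefficients) might look like the sketch below. Several pieces are assumptions rather than details from the text: the classifier is a bare-bones logistic regression trained by gradient descent without an intercept, coefficients are averaged over runs before comparison, and the toy data generator and all function names are hypothetical.

```python
import math
import random

def fit_logistic(X, y, lr=0.5, epochs=200):
    """Batch gradient descent for logistic regression (no intercept, no regularization)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(row):
                grad[j] += (p - label) * xj
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def mean_resampled_coefs(X, y, n_runs=100, frac=0.7, seed=0):
    """Average the coefficients from n_runs fits on random frac-sized subsamples."""
    rng = random.Random(seed)
    k = max(1, int(frac * len(X)))
    total = [0.0] * len(X[0])
    for _ in range(n_runs):
        idx = rng.sample(range(len(X)), k)  # sample without replacement
        w = fit_logistic([X[i] for i in idx], [y[i] for i in idx])
        total = [t + wi for t, wi in zip(total, w)]
    return [t / n_runs for t in total]

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)

def toy_dataset(seed, n=30):
    """Hypothetical stand-in for one dataset with three LIWC-style features."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        # Feature 0 is shifted up for label 1, feature 1 shifted down, feature 2 is noise.
        X.append([rng.random() + 0.5 * label, rng.random() - 0.5 * label, rng.random()])
        y.append(label)
    return X, y

coefs_a = mean_resampled_coefs(*toy_dataset(1), n_runs=20)
coefs_b = mean_resampled_coefs(*toy_dataset(2), n_runs=20)
rho = spearman(coefs_a, coefs_b)
```

Rank correlation is used here because it compares the ordering of feature importance across datasets rather than raw coefficient magnitudes, which depend on feature scale and sample size.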
Results. Domain-transfer experiments using
LIWC as the only feature set maintain high
degrees of transfer loss while sacrificing
within-domain performance. Moreover, correlations
between model coefficients learned on different
datasets are relatively low across all comparisons,
peaking at a Spearman correlation of 0.338 for
the comparison between the RSDD and SMHD
datasets, which already have significant user
overlap. In general, LIWC coefficients tend to be
more correlated within platforms than between them.
Examination of the underlying class differences
provides insight into linguistic differences between
each dataset’s depression group. In line with
prior work, function word use, first-person pro-
noun use, and cognitive mechanisms are more com-
mon within the depression group of each dataset,
though their relative prevalence varies. Conversa-
tion regarding relativity (i.e. space, motion, time)
is strongly associated with the control groups in
the Twitter data, but is more associated with the
depression groups in the Reddit data. Anger and
perceptual topics are more prevalent within the de-
pression groups for Twitter than Reddit.
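The class-difference comparison discussed above can be sketched as a per-dimension difference of class means. The text does not specify the exact difference measure, so the raw difference (depression minus control) is an assumption, as are the function name `class_differences` and the toy values, which are made up for illustration and are not results from the study.

```python
from statistics import mean

def class_differences(features, labels):
    """Per-dimension mean for each class and the (depression - control) difference.

    features: list of dicts mapping dimension name -> proportion.
    labels:   list of 1 (depression) / 0 (control), aligned with features.
    Raises statistics.StatisticsError if either class is empty.
    """
    result = {}
    for dim in sorted(features[0]):
        dep = mean(f[dim] for f, y in zip(features, labels) if y == 1)
        ctl = mean(f[dim] for f, y in zip(features, labels) if y == 0)
        result[dim] = {"depression": dep, "control": ctl, "difference": dep - ctl}
    return result

# Made-up illustrative values only.
toy_features = [
    {"i": 0.10, "relativ": 0.20},
    {"i": 0.14, "relativ": 0.10},
    {"i": 0.06, "relativ": 0.30},
]
toy_labels = [1, 1, 0]
diffs = class_differences(toy_features, toy_labels)
```

Running the same function on each dataset and comparing the sign of `difference` per dimension is one way to surface cases where a dimension is associated with opposite classes on different platforms.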
6.4 Self-disclosure Bias
In the aforementioned analysis, posts from men-
tal health subreddits and those including mental
health terms were excluded. Nonetheless, individ-
uals within each of the depression groups for the
Reddit datasets displayed language that was unam-
biguously associated with seeking support or shar-
ing personal experience with mental health issues.
Accordingly, we hypothesize that existing filters
are unable to remove confounds in individuals who
disclose a depression diagnosis on Reddit.
Methods. To measure this effect, we examine
differences in the distribution of subreddits that
individuals in the depression group of the Topic-
Restricted Text data post in relative to individuals
in the control group. Specifically, we fit a logistic
regression model mapping the subreddit distribu-
tion of individuals’ posts to their mental health
status after applying each subreddit filter list (e.g.
RSDD, SMHD, Ours). We compare predictive
performance of these models and the learned
coefficient weights to understand the effect of
filtering. As a baseline, we maintain posts from the
r/depression subreddit in the feature set. Then, in
order of increasing coverage, we apply the
subreddit filters from RSDD, SMHD, and our study,
and measure classification performance. For each
filter, we examine the learned coefficient weights
to develop a sense of the personality and interests
of individuals in the depression group.
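The feature construction for this experiment can be sketched as below: each individual is represented by the normalized distribution of subreddits they post in, recomputed after each filter list is applied. The filter contents shown are placeholders (the actual RSDD, SMHD, and study-specific lists are not reproduced here), and the function and variable names are hypothetical.

```python
from collections import Counter

def subreddit_distribution(post_subreddits, excluded=frozenset()):
    """Share of a user's posts per subreddit after dropping excluded subreddits.

    Names are compared case-insensitively. Returns {} if nothing remains.
    """
    excluded = {s.lower() for s in excluded}
    kept = [s.lower() for s in post_subreddits if s.lower() not in excluded]
    counts = Counter(kept)
    total = sum(counts.values())
    return {sub: n / total for sub, n in counts.items()} if total else {}

# Placeholder filter lists ordered from least to most coverage (contents are assumptions).
FILTERS = [
    ("baseline", set()),
    ("narrow", {"depression"}),
    ("wider", {"depression", "anxiety"}),
]

user_posts = ["depression", "aww", "aww", "Anxiety"]
features_by_filter = {
    name: subreddit_distribution(user_posts, excluded) for name, excluded in FILTERS
}
```

Each resulting dictionary would serve as one row of input to the logistic regression, so that a drop in classification performance between consecutive filters reflects how much signal the newly excluded subreddits carried.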
Results. The baseline F1 score in the devel-
opment set peaks at 0.83, reflecting the
fact that several individuals in the control group
had posted in the r/depression subreddit at some
point in their history, but were not labeled as
having depression due to the sole use of recent
original posts by the automatic annotation proce-
dure. Performance degrades with the expansion
of excluded subreddits from each filter, settling
at an F1 of 0.72. Coefficients from the model
highlight subreddits related to themes of sexual-
ity (r/bisexual, r/actuallesbians), gender (r/ftm),
personality (r/introvert, r/INFP), drugs (r/Trees,
r/LSD), and relationships (r/MakeNewFriendsHere,
r/BreakUps) as being predictive of depression.
The strong classification performance achieved
after our filtering measures is evidence that distri-
butional differences in online interaction remain in
the “cleaned” Topic-Restricted Text dataset. As our
subreddit list is more robust than both the RSDD
and SMHD lists, there is reason to believe simi-
lar confounds exist in these datasets. The coeffi-
cient analysis provides a window into the types of
themes that could mislead a classification
model during generalization attempts.
7 Recommendations
We have demonstrated that issues of transfer loss
persist in the mental health space, at least for the
proxy-based social media datasets considered in
our study. Importantly, we identified confounds
that emerge as a result of each dataset’s respective
design. Critically, existing datasets have flaws that
make them difficult to use for constructing models
for new data types and populations.
Topical Alignment. Researchers must account
for self-disclosure bias and confounds of personal-
ity when curating new datasets. First discussed in
§6.2, models trained on the Reddit datasets learn
dependencies between support-driven topics, such
as medication usage and relationship advice, and
depression. In contrast, models trained on the
Twitter datasets identify the same correlations
between sexuality, gender, and depression that
Reddit-based models detect, but also learn about
the recreational outlets (e.g. fantasy) and social
concerns (e.g. racism, sexism) common in
depressed individuals.
We hypothesize that semantic divergences reflect
self-disclosure bias and differences in platform in-
teraction patterns (Malik et al., 2015; Shelton et al.,
2015). Twitter’s status- and reply-based structure
serves as a place for individuals to share personal
thoughts and experiences in reaction to their daily
life. Meanwhile, Reddit’s community-based fo-
rums require active engagement with specific top-
ics and may silo individuals who wish to discuss
their mental health beyond defined areas. The latter
gains support from our analysis of subreddit distri-
butions in the Topic-Restricted Text data (§6.4).
Topical nuances in language may appropriately
reflect elements of identity associated with mental
health disorders (e.g. traumatic experiences, cop-
ing mechanisms). However, if not contextualized
during model training, this type of signal has the
potential to raise several false alarms upon appli-
cation to new populations. Accordingly, we urge
researchers to minimize the presence of overt topi-
cal disparities between classes in their datasets.
Mitigating Temporal Artifacts. Researchers
must take steps to remove temporal artifacts in new
datasets. Experiments conducted in §5.2 reveal
that group-based temporal alignment and latency
between model training and deployment can have a
significant effect on predictive performance. Vari-
ability of performance over time is surprising, as
there is no clinical evidence to suggest that the un-
derlying symptoms of depression (on a population
level) change over time (APA, 2013).
We hypothesize two reasons for this observa-
tion. First, since depression presents in an episodic
manner, we may expect data closest to the date of
annotation to be the most predictive of an individ-
ual’s labeled mental status (Melartin et al., 2004).
If most posts used for annotation occurred in recent
time windows, then it is possible that content in
older posts is less relevant to the depressive state
of individuals in our datasets. Second, and more
problematic, is the possibility that signal used by
our classifiers is only a spurious correlation.
At a bare minimum, our results highlight the im-
portance of sampling classification groups so that
post volume is equal over time. Discrepancies may
wrongly suggest that temporal artifacts are useful
for detecting mental health disorders. Going fur-
ther, researchers should remove temporally-specific
references and minimize highly-dynamic language
in their datasets. Avenues for accomplishing the lat-
ter include using NER to redact n-grams that serve
as spurious correlations (Ritter et al., 2011) and
leveraging adversarial training to evaluate the de-
gree to which mental health signal may be learned
without a notion of time (Tzeng et al., 2017).
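The minimum recommendation above (equal post volume over time across classification groups) could be approximated by downsampling within time bins. The sketch below is one possible reading, with several assumptions: bins are calendar months, balancing is done at the post level rather than the individual level, months in which only one group posted are dropped, and the function name `balance_post_volume` and the tuple layout are hypothetical.

```python
import random
from collections import defaultdict
from datetime import datetime

def balance_post_volume(posts, seed=0):
    """Downsample so both groups contribute the same number of posts in every month.

    posts: list of (group, timestamp, text), with group "depression" or "control".
    Months where one group has no posts contribute nothing to the output.
    """
    rng = random.Random(seed)
    bins = defaultdict(lambda: defaultdict(list))
    for post in posts:
        group, ts, _ = post
        bins[(ts.year, ts.month)][group].append(post)
    balanced = []
    for key in sorted(bins):
        dep = bins[key]["depression"]
        ctl = bins[key]["control"]
        k = min(len(dep), len(ctl))  # size of the smaller group in this month
        balanced.extend(rng.sample(dep, k))
        balanced.extend(rng.sample(ctl, k))
    return balanced

posts = [
    ("depression", datetime(2015, 1, 3), "a"),
    ("depression", datetime(2015, 1, 9), "b"),
    ("depression", datetime(2015, 1, 20), "c"),
    ("control", datetime(2015, 1, 5), "d"),
    ("depression", datetime(2015, 2, 1), "e"),
]
balanced = balance_post_volume(posts)
```

After balancing, a classifier can no longer separate the groups simply by detecting which time period a post comes from, although temporally specific vocabulary within a month would still need to be handled separately.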
8 Limitations and Future Work
Though our study provides a robust perspective to-
ward understanding generalization capabilities of
mental health classifiers for social media, we rec-
ognize that more learning opportunities exist. Our
study only considers a handful of datasets, two plat-
forms, a single mental health disorder, and homo-
geneous annotation mechanisms. Still unexplored,
in large part due to the precautions necessary for
securing sensitive mental health data, is how well
models trained on data from actual clinical popula-
tions generalize to proxy-based datasets and other
clinical populations. While high co-morbidity rates
between depression and other mental health dis-
orders may allow us to infer model behavior for
alternative conditions, we also recognize that pre-
sentations of different psychiatric disorders can be
quite variable and warrant their own research (Ben-
ton et al., 2017b; Arseniev-Koehler et al., 2018).
Another limitation of our work is the lack of
depression-to-control group matches from the original
reference material. Preotiuc-Pietro et al. (2015) and
De Choudhury et al. (2017) demonstrate that men-
tal health disorders such as depression can have
variable presentations based on demographic at-
tributes. The attributes used to construct our Twitter
datasets were originally inferred via now-outdated
text-based models. Accordingly, demographic in-
ference errors may be propagated to and correlated
with depression classification errors. Moreover,
these attributes were not considered within the con-
struction of any of the Reddit datasets we explored.
The effect of demographics on generalization re-
mains a valuable insight for future exploration.
Finally, our attempts at domain transfer are con-
strained. Namely, we do not invoke explicit do-
main adaptation methods (Peng and Dredze, 2017;
Li et al., 2018; Huang and Paul, 2019). Moving
forward, we plan to explore algorithmic strategies
to mitigate the biases discovered in this study.

