Abstract
Social media has recently been used to identify and study a small cohort of Twitter users whose pregnancies with birth defect outcomes (the leading cause of infant mortality) could be observed via their publicly available tweets. In this study, we exploit social media on a larger scale by developing natural language processing (NLP) methods to automatically detect, among thousands of users, a cohort of mothers reporting that their child has a birth defect. We used 22,999 annotated tweets to train and evaluate supervised machine learning algorithms (feature-engineered and deep learning-based classifiers) that automatically distinguish tweets referring to the user's pregnancy outcome from tweets that merely mention birth defects. Because 90% of the tweets merely mention birth defects, we experimented with under-sampling and over-sampling approaches to address this class imbalance. An SVM classifier achieved the best performance for the two positive classes: an F1-score of 0.65 for the "defect" class and 0.51 for the "possible defect" class. We deployed the classifier on 20,457 unlabeled tweets that mention birth defects, which helped identify 542 additional users for potential inclusion in our cohort. Contributions of this study include (1) NLP methods for automatically detecting tweets by users reporting their birth defect outcomes, (2) findings that an SVM classifier can outperform a deep neural network-based classifier for highly imbalanced social media data, (3) evidence that automatic classification can be used to identify additional users for potential inclusion in our cohort, and (4) a publicly available corpus for training and evaluating supervised machine learning algorithms.
Introduction
Despite the fact that birth defects are the leading cause of infant mortality in the United States,1 methods for observing pregnancies with birth defect outcomes remain limited (e.g., clinical trials,2,3 animal studies,4 pregnancy exposure registries5,6,7,8). Considering that 40% of US adults between ages 18 and 29 use Twitter,9 in recent work,10 we sought to identify a cohort of women whose pregnancies with birth defect outcomes could be observed via their publicly available tweets. In that work, we identified a small cohort in a database containing the timelines (the publicly available tweets posted by a user over time) of more than 100,000 users automatically identified via their public announcements of pregnancy on Twitter.11 Then, we used their timelines to conduct an observational case-control study,12 in which we compared select risk factors13 among the women reporting a birth defect outcome (cases) and users for whom we did not detect a birth defect outcome, selected from the same database (controls). The study found that reports of medication exposure14 were statistically significantly greater among the cases than the controls.
More generally, our recent work demonstrates several valuable ways in which our social media-mining approach could complement existing methods for studying birth defects, most of which continue to have unknown etiologies.15 Our pipeline11 begins by collecting all the publicly available tweets of women who have announced their pregnancy on Twitter, which enables the use of social media for selecting internal comparator groups and provides a unique opportunity to explore unknown risk factors among the chatter. Furthermore, the users we studied were posting tweets not only during pregnancy; most were posting tweets leading up to their pregnancy, before they could have been aware they were pregnant. Thus, our social media-mining approach can be used to observe risk factors in the periconceptional period and early in the first trimester. Because our pipeline continues to collect the tweets that users post after pregnancy, social media also provides a means of long-term follow-up after birth.
While promising, our preliminary studies also demonstrate the limitations of manual methods for identifying cohorts on Twitter. After collecting tweets that mention birth defects from among the more than 400 million tweets in our database, we manually annotated ~17,000 of the retrieved tweets for whether they refer to the user's birth defect outcome (true positives) or merely mention birth defects (false positives). Then, we analyzed ~650 timelines of the users who posted true positives and identified 195 women for inclusion in our cohort.10 This cohort was sufficient for the purpose of studying birth defects in general,12 but, as with many pregnancy exposure registries,16 it did not include enough specific cases for studying select birth defects, especially those that are more rare, such as encephalocele, biliary atresia, bladder exstrophy, and trisomy 13. However, despite the relatively small cohort initially identified and studied, new users are constantly being added to our database.11 In fact, the total number of users and tweets in our database has already more than doubled since our preliminary studies, so methods capable of automatically identifying cohorts among this growing population are needed in order to exploit social media data for studying birth defects on this larger scale. The development of such methods may be particularly valuable for studying birth defects that are more rare in the general population because of the limited availability of data from traditional sources.
As the first step in automatically identifying cohorts on social media, the primary objective of the present study is to develop methods for automatically detecting tweets by mothers reporting that their child has a birth defect. We took a tweet-level approach to identifying users (as opposed to topic modeling,17 for example) because we found that, among the thousands of tweets in users' timelines, tweets related to their birth defect outcomes appeared to be sparse. Given that the prevalence of birth defect outcomes in the United States is only 3%18 (as a comparison, postpartum depression19 affects up to 15% of mothers20), automatically detecting such rare events on social media is difficult.21 In particular, the high degree of class imbalance among tweets that mention birth defects presents challenges for training machine learning algorithms and applying them in practice.22,23,24 Among the 22,999 tweets retrieved using an extensive lexicon of birth defects,10 only ~10% of them were actually labeled "defect" or "possible defect" (true positives) by the annotators, and 90% were "non-defect" tweets (false positives). These three classes are summarized as follows:
- Defect (+): The tweet refers to the user's child and indicates that he or she has the birth defect mentioned in the tweet (e.g., My little miracle, we are so blessed to have you #hypoplasticleftheartsyndrome #hlhs).
- Possible Defect (?): The tweet is ambiguous about whether someone is the user's child and/or has the birth defect mentioned in the tweet (e.g., Scheduling an appointment for [name] to confirm a craniosynostosis diagnosis).
- Non-defect (−): The tweet does not refer to someone who is or may be the user's child and has or may have the birth defect mentioned in the tweet (e.g., Sally Phillips: My son has Down's syndrome – but I wouldn't want to live in a world without it via @[username]).
In addition to the class imbalance problem, the sample tweets above illustrate further challenges, posed by the text itself, of automatically detecting tweets by mothers reporting that their child has a birth defect, including the various ways users may colloquially refer to their child on social media (e.g., my little miracle). The tweet also may not explicitly encode that the user's child has a birth defect, but rather imply it using nontraditional textual representations afforded by social media (e.g., #hypoplasticleftheartsyndrome #hlhs). Even when tweets do contain a common reference to a child (e.g., my son) and explicitly state that the child has a birth defect (e.g., has Down's syndrome), they may be "reported speech"; that is, they present what was said by others (e.g., Sally Phillips). Finally, some tweets do not provide the context on which a full understanding of whether or not the user's child has a birth defect depends, perhaps because of the character limit imposed by Twitter or the shared background knowledge among users interacting in a social community. For this reason, we created the "possible defect" class, which introduces the challenge of multiclass classification and, moreover, of modeling the nuances that distinguish "possible defect" tweets.
In this study, we preprocessed the 22,999 annotated tweets and used them in experiments to train and evaluate supervised machine learning algorithms, including Multinomial Naïve Bayes (NB), support vector machine (SVM), and long short-term memory neural network (LSTM) classifiers. In our experiments, we performed various under-sampling and over-sampling methods to address the class imbalance problem. In this paper, we present (1) benchmark results of natural language processing and supervised classification methods, (2) a comparison of feature-engineered and deep learning-based classifiers trained on imbalanced, under-sampled, and over-sampled social media data, (3) the predictions of our automatic pre-filtering, preprocessing, and classification system deployed on ~600 million unlabeled tweets, and (4) an error analysis that could inform strategies for improving classification performance using the training set of 14,716 annotated tweets (Supplementary Data 1) and development set of 3681 annotated tweets (Supplementary Data 2) that we have made publicly available in the supplementary information files.
Results
Classification
The classifiers were evaluated on a held-out test set: a random sample of 20% of the annotated corpus (4602 tweets), stratified based on the natural, imbalanced distribution of "defect," "possible defect," and "non-defect" tweets that would be automatically detected by the lexicon-based retrieval10 in practice. Figure 1 presents the performance of the NB, SVM, and LSTM classifiers, as measured by F1-score, for the "defect," "possible defect," and "non-defect" classes.
In Fig. 1, the classifiers were trained on the original, imbalanced data (14,716 tweets), with only word n-grams (n = 1, 2, and 3) as features for the NB and SVM classifiers, and word embeddings trained on tweets for the LSTM classifier. This training set was obtained by splitting the remaining 80% of the 22,999 annotated tweets into 80% (training) and 20% (development) stratified random sets. The SVM and LSTM classifiers outperformed the NB classifier for the "defect" and "non-defect" classes, but the NB classifier outperformed the LSTM classifier for the "possible defect" class. The SVM classifier outperformed the LSTM classifier for the "defect" and "possible defect" classes, and achieved similar performance for the "non-defect" class.
To address the class imbalance problem at the data level, the classifiers were trained on variations of the original, imbalanced data. Table 1 compares the performance of the classifiers when they were trained on the original data with their performance when they were trained using various under-sampling and over-sampling approaches. For the NB classifier, neither under-sampling nor over-sampling improved performance for the "defect" and "possible defect" classes. For the SVM classifier, neither under-sampling nor over-sampling improved performance for any of the classes. For the LSTM classifier, neither under-sampling nor over-sampling improved performance for the "defect" or "non-defect" classes. Although most of the under-sampling and over-sampling approaches did improve the performance of the LSTM classifier for the "possible defect" class, our similarity-based under-sampling methods did not outperform random under-sampling. Overall, the SVM classifier outperformed the NB and LSTM classifiers for the "defect" and "possible defect" classes, so we decided to pursue the SVM classifier that was trained on the original, imbalanced data.
In addition to n-grams, we explored additional features for the SVM classifier. Table 2 presents the precision, recall, and F1-score of the classifier with the addition of word clusters and word/character lengths to n-gram features. Based on the default implementation of a paired t-test in Weka's experiment environment, the additional features significantly improved the F1-score for the "defect" class (P < 0.05). The ablation study presented in Table 2 indicates, however, that word/character lengths did not contribute to the improved performance. Thus, it was word clusters, in particular, that increased the classifier's overall performance, though by increasing recall at the expense of precision. Although the addition of word clusters significantly improved the classifier's performance, (1) using only word clusters or (2) leaving out n-grams significantly decreased the F1-score for all three classes (P < 0.05), indicating, perhaps unsurprisingly, that the classifier's performance can be attributed primarily to the information provided by n-grams.
Discussion
F1-scores of 0.65 for the "defect" (+) class and 0.51 for the "possible defect" (?) class represent a promising benchmark performance for automatic classification of highly imbalanced Twitter data.25 Given that birth defects are rare events and, thus, allow for some degree of manual analysis, we believe this performance provides a strong basis for facilitating cohort selection among the users identified via automatic classification. In general, a higher F1-score for the majority class is typical; in this case, it reflects that we have modeled the detection of birth defect outcomes for the data on which the classifier would be applied in practice. If we artificially balance the test set, the F1-scores for the SVM classifier improve to 0.76 for the "defect" class and 0.59 for the "possible defect" class. The comparative performance of the SVM and LSTM classifiers used in this study is consistent with our recent findings that SVM-based classifiers can outperform deep neural network-based classifiers for mining health-related events in imbalanced social media data.26 However, recent advances in language representation models27 may help improve the performance of deep neural networks for this task. Given that our under-sampling and over-sampling approaches generally did not improve performance for the LSTM classifier, we conclude that data-level approaches for addressing class imbalance with convolutional neural networks28 may not generalize to recurrent neural networks (e.g., LSTMs).
In order to demonstrate that automatic classification can be used to identify additional users for potential inclusion in our cohort, we implemented the SVM classifier in Python and deployed it on unlabeled tweets that were automatically collected11 since retrieving the tweets used for training and evaluating the classifiers in this study. Approximately 600 million additional tweets had been collected at the time we deployed the classifier, and our automatic rule-based retrieval10 detected 20,457 of them that mention birth defects, which were then preprocessed prior to automatic classification. The classifier predicted 678 tweets as "defect," 359 tweets as "possible defect," and 19,420 tweets as "non-defect." Among the 7781 users who posted these 20,457 tweets, 414 users posted "defect" tweets, 185 users posted "possible defect" tweets (without also posting a "defect" tweet), and 7182 users posted "non-defect" tweets (without also posting a "defect" or "possible defect" tweet).
Among the 599 users who posted "defect" or "possible defect" tweets, 542 were new to our database; that is, they were not included among the 646 users whose timelines we analyzed to identify an initial cohort in our previous work.10 The addition of these users demonstrates that automatic classification indeed enables the long-term collection of social media data for large-scale epidemiological research of birth defects, including rare and more common ones. For example, given that congenital heart defects (CHDs) are the most prevalent birth defects reported among our initial cohort,10 social media data could soon enable a more thorough study of CHDs, which are also the most prevalent birth defects in the general population,29 are the leading cause of infant mortality due to birth defects,30 and have a largely unknown etiology.31
The identification of 542 additional users also demonstrates the utility of balancing precision (0.62) and recall (0.68) in tuning the classifier for the "defect" class. On the one hand, precision is important for automated cohort selection because only users actually reporting a birth defect outcome should be included among the cases for epidemiological analysis. For detecting cohorts reporting common events, such as pregnancy,11 a classifier can be optimized for precision and still be used to identify more than 100,000 users on social media. On the other hand, given that birth defects are rare events, optimizing the classifier for precision at the expense of recall would result in the automatic identification of very few users for inclusion in the cohort. However, considering that more than 7000 users were automatically identified as posting "non-defect" tweets, trading off precision for recall would require further reviewing an increasingly impractical number of users. Deploying the SVM classifier demonstrates that a balance of precision and recall can facilitate cohort selection for rare events on social media. Figures 2 and 3 provide the precision-recall curves for the "defect" and "possible defect" classes, respectively. (SVM classifiers do not provide probability estimates by default. To generate probability estimates for the precision-recall curves, we used an extension of the SVM classifier provided in the LibSVM library in Weka.32) The area under the curve is 0.70 for the "defect" class and 0.54 for the "possible defect" class.
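For readers reproducing such curves outside Weka, scikit-learn's SVC wraps the same LibSVM probability estimates (enabled with probability=True). The sketch below is illustrative only: it uses toy binary data in place of our vectorized tweets, with class 1 standing in for the "defect" class.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy imbalanced data standing in for the vectorized tweets
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# RBF kernel with c = 100 and minority-class weighting, as in the study
clf = SVC(kernel="rbf", C=100, class_weight="balanced", probability=True)
clf.fit(X_tr, y_tr)

# Precision-recall curve and area under it for the positive class
prec, rec, _ = precision_recall_curve(y_te, clf.predict_proba(X_te)[:, 1])
print("PR AUC:", auc(rec, prec))
```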
Considering that automatic classification of three classes may be more difficult than two, especially since the "defect" and "possible defect" classes each represent only 5% of the annotated data, we performed an additional experiment in which the "defect" and "possible defect" tweets were collapsed into a single positive class. With two classes, the SVM classifier achieved a precision of 0.69, a recall of 0.60, and an F1-score of 0.64 for the positive class. More specifically, the number of true positives (i.e., "defect" or "possible defect" tweets correctly classified) increased from 272 to 289; the number of false positives (i.e., "non-defect" tweets classified as "defect" or "possible defect") decreased from 149 to 129; and the number of false negatives (i.e., "defect" or "possible defect" tweets classified as "non-defect") increased from 177 to 190. While the precision for both the "defect" and "possible defect" classes benefited from collapsing the classes, the improved precision came at the expense of recall for the "defect" class. As shown in the confusion matrix in Table 3 and discussed in the error analysis that follows, the primary source of confusion for automatic classification of the three classes was false-negative "possible defect" tweets, and many of these errors seem to have been absorbed into the single positive class.
We conducted an error analysis of the SVM classifier trained on the original training set with n-grams, word clusters, and word/character lengths as features, focusing on false-negative "defect" and "possible defect" tweets in the test set of 4602 tweets; more specifically, the 177 "defect" and "possible defect" tweets misclassified as "non-defect." Our analysis of the 113 false-negative "possible defect" tweets (from the 240 actual "possible defect" tweets in the test set), the classifier's primary confusion, reveals that the majority of them mention the name of a person. For example, in Table 4, 1 is annotated as "possible defect" because the name in the tweet is a personal deictic expression; that is, without additional contextual information (possibly supplied in the user's timeline), the tweet is ambiguous as to whether the referent of the name is the user's child. We attempted to model names as a generalized representation (i.e., substituting a name with "_name_") in automatic preprocessing; however, our analysis reveals that, among the false-negative "possible defect" tweets that mention a name, more than half of the names were not detected in preprocessing and, thus, not normalized. Many of the other false-negative "possible defect" tweets similarly contain a personal deictic expression, such as he in 2 and this little boy in 3, or they omit an explicit lexical reference entirely, as in 4. Techniques for enhancing the representation of personal deixis, a characteristic feature of "possible defect" tweets, may improve the classifier's recall for this class.
Tweet 1 explicitly states that the person referred to by name has craniosynostosis (was diagnosed with), and 2 explicitly states that he has a cleft palate (has), but most of the false-negative "possible defect" tweets do not as explicitly encode this information in the text, as in 3 and 4. A human can reasonably infer that this little boy in 3 has hydrocephalus, and that the child implicitly referred to in 4 was born with a hole in her heart, but, without explicitly encoding this relationship between the person and the birth defect, the textual surface of the tweets may more closely resemble "non-defect" tweets that merely mention birth defects. Similarly, some of the false-negative "possible defect" tweets do not explicitly state that the child has a birth defect because, as in 5, they are ambiguous as to whether the child actually has the birth defect (could be a club foot). Figure 4 summarizes our analysis of errors for the false-negative "possible defect" and "defect" tweets in our held-out test set. As the proportions of errors indicate, many of the tweets contain multiple possible sources of error.
Most of the 64 false-negative "defect" tweets (from the 239 actual "defect" tweets in the test set) also imply having a birth defect. In addition, many of them, while referring to the user's child, do not contain a common reference, such as my child. Some of the tweets, such as 6, explicitly refer to the user's child (my daughter), but obscure this "defect" information by interrupting the pattern with a modifier (my unborn daughter). Many of the false negatives do not contain a possessive determiner (e.g., my, our) explicitly indicating that the user is referring to her child. Although 7 does contain the uninterrupted pattern, my girl, it is textually represented as a hashtag and, thus, modeled as part of a single token: lovemygirl. In 8, the first-person pronoun, me, suggests that the user is referring to her daughter, but daughter itself occurs with a determiner, a, that does not express a reference to the user's child. Furthermore, many of the tweets do not use common "child" words. For example, 9 contains my, but the user refers to her child as clubfoot cutie. In 10, the user entirely omits an explicit reference to her child, though humans can infer that termination is a nominalization of the verb phrase, terminating our child.
Given the low prevalence of birth defects in the general population, some of the birth defects included in our rule-based approach to collecting tweets10 are rarely mentioned, or not mentioned at all, in the 22,999 tweets used for training and evaluating supervised classification. This underrepresentation could lead to overfitting the model to the birth defects mentioned more frequently in the annotated corpus. In our preprocessing, we made an effort to overcome the limited range of birth defects mentioned in the corpus by normalizing the span of the tweet that contains the birth defect. Without "hiding" the birth defects, the performance of the SVM classifier actually improves slightly, to an F1-score of 0.68 for the "defect" class and 0.52 for the "possible defect" class; however, this improvement is likely the result of modeling particular birth defects as features and learning incidental associations with the classes. The classification performance reported in the "Results" section is intended to provide a more realistic representation of how the model would generalize in practice.
The tweet-level classification that we presented in this paper is the first step towards automatically identifying large cohorts of Twitter users whose pregnancies with birth defect outcomes could be observed via their publicly available tweets. To advance the large-scale use of social media as a complementary data source for studying birth defects, future work will focus on automating user-level analyses, in particular, (1) determining whether users identified as posting "possible defect" tweets have a child with a birth defect, and (2) verifying the availability of a user's tweets during the pregnancy with a birth defect outcome. The first task would involve, for example, automatically resolving whether a personal deictic expression (e.g., a person's name or a personal pronoun such as she) in a "possible defect" tweet is referring to the user's child, based on contextual tweets in the user's timeline. For the second task, we will further develop our system33 that automatically identifies a user's prenatal period.
Methods
Data collection
To collect tweets for training and evaluating supervised machine learning algorithms, we first compiled a lexicon of birth defects (the Penn Social Media Lexicon of Birth Defects),34,35,36,37,38 which consists of ~725 single-word and multi-word terms. To enhance the retrieval of posts that mention birth defects on social media, we semi-automatically generated lexical variants of these terms (e.g., misspellings, inflections) using an automatic, data-centric spelling variant generator.39 Updated versions of the Penn Social Media Lexicon of Birth Defects (Supplementary Data 3) and its lexical variants (Supplementary Data 4) are available in the supplementary information files. To retrieve tweets containing (variants of) the terms, we implemented hand-crafted regular expressions for mining more than 400 million tweets, posted by more than 100,000 users, in our database.11 We post-processed the matching tweets by removing retweets and tweets in which the regular expressions matched user names or URLs. Using the current version of the query, we have retrieved a total of 23,680 tweets, including the tweets retrieved for our feasibility study,10 which describes our data collection methods in more detail. This study received an exempt determination by the Institutional Review Board of the University of Pennsylvania, as it does not meet the definition of "human subject" according to 45 CFR § 46.102(f).
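As a hedged illustration of this retrieval step, the sketch below compiles lexicon terms into word-boundary regular expressions and filters tweets; the sample terms are a tiny placeholder for the ~725-term lexicon and its variants, and the filtering heuristics are simplified assumptions rather than our exact post-processing rules.

```python
import re

# Placeholder sample of lexicon terms (the real lexicon has ~725 terms plus variants)
lexicon = ["spina bifida", "cleft lip", "cleft palate", "anencephaly"]
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in lexicon) + r")\b",
    re.IGNORECASE)

def mentions_birth_defect(tweet: str) -> bool:
    """Return True if the tweet mentions a lexicon term outside of handles/URLs."""
    if tweet.lower().startswith("rt @"):                    # drop retweets
        return False
    # Blank out user names and URLs so matches inside them do not count
    stripped = re.sub(r"@\w+|https?://\S+", " ", tweet)
    return bool(pattern.search(stripped))

print(mentions_birth_defect("Our baby was born with a cleft lip but is doing great"))
```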
Annotation
Two professional annotators annotated 23,680 tweets, posted by 7797 Twitter users, with overlapping annotations for 22,408 tweets. Annotation guidelines, available in the Supplementary material of our related work,10 were developed to help the annotators distinguish "defect," "possible defect," and "non-defect" tweets. Table 4 includes examples of "defect" and "possible defect" tweets, and our feasibility study10 provides additional examples and a discussion of the three classes. Inter-annotator agreement was κ = 0.86 (Cohen's kappa), considered "almost perfect agreement."40 The tweets on which the annotators disagreed were excluded from the final data set, which consists of 22,999 annotated tweets: 1192 (5.18%) "defect" tweets, 1196 (5.20%) "possible defect" tweets, and 20,611 (89.62%) "non-defect" tweets. The "possible defect" class is a "placeholder" class indicating that analysis of the user's timeline is needed to contextually determine whether the user is the mother of a child with a birth defect. Automating the analysis of user timelines, however, is beyond the scope of this paper and will be explored in future work. For now, we focus on the task of automatically classifying tweets, as described next.
Classification
We used the data set of 22,999 annotated tweets in experiments to train and evaluate supervised machine learning algorithms. For the classifiers, we used (1) the default implementation of Multinomial Naïve Bayes (NB)41 in Weka 3.8.2, (2) the WLSVM Weka integration42 of the LibSVM43 implementation of Support Vector Machine (SVM), and (3) a Long Short-Term Memory Neural Network (LSTM),44 implemented in Keras running on top of TensorFlow. We split the data into 80% (training) and 20% (test) random sets, stratified based on the natural, imbalanced distribution of "defect," "possible defect," and "non-defect" tweets that would be automatically detected by the rule-based retrieval10 in practice. To optimize training, we further split the training set into 80% (training) and 20% (development) stratified random sets. Thus, our training set consists of 14,716 tweets, our development set consists of 3681 tweets, and our test set consists of 4602 tweets. In the remainder of this subsection, we describe the data preprocessing, feature sets, classifier parameters, and data imbalance approaches used in supervised classification.
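For readers working outside Weka, the nested stratified splits can be reproduced along these lines in Python; the file and column names here are illustrative assumptions, not artifacts of the study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file with one tweet per row and a "label" column taking the
# values "defect", "possible defect", and "non-defect"
tweets = pd.read_csv("annotated_tweets.csv")

# 80/20 split into train+dev and test, preserving the natural class distribution
train_dev, test = train_test_split(
    tweets, test_size=0.20, stratify=tweets["label"], random_state=42)

# Further 80/20 split of train+dev into train and dev, again stratified
train, dev = train_test_split(
    train_dev, test_size=0.20, stratify=train_dev["label"], random_state=42)

print(len(train), len(dev), len(test))  # ~14,716 / 3681 / 4602 tweets
```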
Preprocessing and features
For the NB and SVM classifiers, we performed standard text preprocessing by lowercasing and stemming45 the tweets. We also removed non-alphabetical characters (e.g., hashtags, ellipses, UTF-8 special characters), and normalized user names (i.e., strings beginning with "@") as "_username_" and URLs as "_url_". We did not remove stop words because, based on a preliminary feature evaluation using information gain,46 we found that commonly used stop words (e.g., my, she, was, with, have) occur in n-grams (contiguous sequences of n words) that are highly informative for distinguishing the three classes (e.g., my baby, daughter with, she, was born, may have). We also noticed in the feature ranking that some of the n-grams were semantically similar to n-grams that were not ranked as high (e.g., our son, child with, he). Given the context of this data set, we assumed that the ranking of these n-grams does not represent a meaningful semantic distinction but is merely a byproduct of the various linguistic ways in which users refer to their children; thus, we reduced the semantic feature space by normalizing particular first-person pronouns (e.g., my, our) as "_fppron_", references to a child (e.g., son, daughter, child) as "_child_", and third-person pronouns (e.g., she, he) as "_tppron_".
The information gain feature evaluation also suggested that, in the raw data, specific birth defect terms were informative linguistic features for distinguishing the classes. As Table 5 shows, such terms were mentioned markedly more frequently in "non-defect" tweets, suggesting that they tend to play a role in linguistically characterizing the "non-defect" class: an artifact of the imbalanced data set, but not an accurate representation of human decision making for a machine learning model. In addition, the birth defect terms in Table 5 represent some of the more frequently mentioned birth defects on Twitter, while most of the others are relatively sparse. In order to avoid overfitting the machine learning model to birth defects that are mentioned relatively frequently and tend to accompany "non-defect" tweets, we normalized the specific birth defect mentioned in the tweet (that is, the span of the tweet matched by the regular expressions in the lexicon-based retrieval10) as "_malformation_".
Finally, we also normalized people's names as "_name_" in automatically preprocessing the tweets. During manual annotation, we found that, oftentimes, "possible defect" tweets are characterized by an ambiguous referent of a person's name; that is, we do not know whether the name refers to the user's child. However, because of the heterogeneity of specific names across tweets, names would not otherwise be modeled as (part of) salient n-grams. To automatically detect names in the tweets, as a preliminary approach for experimentation, we compiled a lexicon, available in the supplementary information files (Supplementary Data 5), of the 200 most popular boys' names and the top 200 girls' names in the United States between 2010 and the present, as identified by the Social Security Administration.47 Following preprocessing, for the NB and SVM classifiers, we converted the tweets to word vectors and used Weka's default N-Gram Tokenizer to extract unigrams (n = 1), bigrams (n = 2), and trigrams (n = 3) as features.
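As a rough illustration of these normalization steps chained together, consider the sketch below; the name lexicon and birth defect pattern are tiny placeholders for Supplementary Data 5 and the lexicon-based retrieval patterns, and the exact regular expressions are assumptions rather than the study's implementation.

```python
import re
from nltk.stem import PorterStemmer  # Porter stemming, as cited in the text

stemmer = PorterStemmer()
NAME_LEXICON = {"emma", "liam", "olivia"}  # placeholder for Supplementary Data 5
DEFECT_PATTERN = re.compile(r"spina bifida|cleft (?:lip|palate)", re.I)  # placeholder

def preprocess(tweet: str) -> str:
    t = tweet.lower()
    t = re.sub(r"@\w+", "_username_", t)                        # user names
    t = re.sub(r"https?://\S+", "_url_", t)                     # URLs
    t = DEFECT_PATTERN.sub("_malformation_", t)                 # birth defect span
    t = re.sub(r"\b(my|our)\b", "_fppron_", t)                  # first-person pronouns
    t = re.sub(r"\b(son|daughter|child|baby)\b", "_child_", t)  # child references
    t = re.sub(r"\b(she|he)\b", "_tppron_", t)                  # third-person pronouns
    tokens = re.findall(r"[a-z_]+", t)                          # drop non-alphabetical chars
    tokens = ["_name_" if w in NAME_LEXICON else w for w in tokens]
    return " ".join(w if w.startswith("_") else stemmer.stem(w) for w in tokens)

print(preprocess("My daughter Emma was born with spina bifida @doctor"))
# _fppron_ _child_ _name_ wa born with _malformation_ _username_
```

N-grams with n = 1-3 could then be extracted from the normalized text with, for example, scikit-learn's CountVectorizer(ngram_range=(1, 3)) in place of Weka's N-Gram Tokenizer.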
In addition to n-grams, we explored word clusters as features for the SVM classifier. Word clusters provide generalized representations of semantically similar terms, in that terms appearing in similar collocate contexts, such as misspellings, are represented by the same cluster. The clusters are generated via a two-step process: first, vector representations of the words (i.e., word embeddings) are learned from large, unlabeled social media data, and then, the vectors are grouped via a standard clustering algorithm. We have found generalized representations via cluster numbers to be useful in our past work.48 For the present work, we used the Twitter word clusters made publicly available by Owoputi et al.49 Finally, we also used the tweets' lengths in characters and words as features.
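For illustration, the sketch below maps tokens to cluster identifiers from the publicly released 50mpaths2 file (linked in the "Code availability" section), which, to our understanding, is distributed as tab-separated lines of the form cluster-bit-string, word, count; treat the format details as an assumption.

```python
def load_word_clusters(path: str = "50mpaths2") -> dict:
    """Load the Owoputi et al. Twitter word clusters into a word -> cluster-id map."""
    clusters = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            cluster_id, word, _count = line.rstrip("\n").split("\t")
            clusters[word] = cluster_id
    return clusters

clusters = load_word_clusters()
tokens = "my baby girl was diagnosed".split()
# Unknown tokens fall back to an out-of-vocabulary marker
cluster_features = [clusters.get(tok, "OOV") for tok in tokens]
print(cluster_features)
```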
For the LSTM classifier, features were learned using the publicly available GloVe word vectors trained on 2 billion tweets.50 Any word in our corpus not found in the vocabulary of the pretrained vectors was represented with a zero vector and updated during training. We preprocessed the raw tweets in our corpus with the same operations and sequence used when creating the vectors. Specifically, we first normalized all user names (<USER>) and URLs (<URL>). Next, we inserted spaces before and after each slash character, and normalized all numbers (<NUMBER>). Then, we normalized all repetitions of the same punctuation (<REPEAT>) following the occurrence of the first one. Similarly, we deleted all letters repeated more than three times, and marked the occurrence of elongated words (<ELONG>). Finally, we replaced hashtag (#) characters (<HASHTAG>). We tokenized the tweets with the PTBTokenizer, and lowercased the tokens.
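The reference preprocessing script distributed with the GloVe Twitter vectors is written in Ruby; an approximate Python port of the operations described above might look as follows, with the regular expressions as illustrative assumptions rather than exact equivalents.

```python
import re

def glove_preprocess(text: str) -> str:
    """Approximate the GloVe Twitter normalization conventions."""
    text = re.sub(r"@\w+", "<USER>", text)                         # user names
    text = re.sub(r"https?://\S+", "<URL>", text)                  # URLs
    text = text.replace("/", " / ")                                # space out slashes
    text = re.sub(r"[-+]?[.\d]*\d+[:,.\d]*", "<NUMBER>", text)     # numbers
    text = re.sub(r"([!?.]){2,}", r"\1 <REPEAT>", text)            # repeated punctuation
    text = re.sub(r"\b(\S*?)(\w)\2{2,}\b", r"\1\2 <ELONG>", text)  # elongated words
    text = re.sub(r"#(\S+)", r"<HASHTAG> \1", text)                # hashtags
    return text

print(glove_preprocess("@mom soooo happy!!! #blessed http://t.co/x 3/4"))
```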
Classifier parameters
For the SVM classifier,51 we used the radial basis function (RBF) kernel. Upon tuning over the development set, we set the cost parameter at c = 100 and, to address the class imbalance problem, assigned higher weights to the minority classes. We scaled the feature vectors before applying the SVM for classification. For the LSTM classifier, we used 100-dimensional word embeddings pretrained on 2 billion tweets. The LSTM neural network consists of four sequentially connected layers: (1) an input layer that passes the word embedding vectors to the second layer, (2) an LSTM layer with 128 dimensions and a tanh activation function, (3) a dropout layer, and (4) a fully connected layer with three dimensions and a softmax function. We passed only the vector computed by the last hidden state of the LSTM layer to the third layer, ignoring the vectors computed by the other hidden states of the LSTM. Figure 5 illustrates the architecture of the LSTM classifier. We set the length of a sequence to 800 tokens and padded the right side of the sequence with zeros when it consisted of fewer than 800 tokens. We applied a dropout function with a probability of 0.5 during training in order to avoid overfitting. We trained the LSTM classifier for 15 iterations and fine-tuned the word embeddings during this stage.
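A minimal Keras sketch of this architecture follows, assuming a vocabulary size and optimizer that the paper does not specify; in the study, the embedding layer would be initialized from the 100-dimensional GloVe Twitter vectors rather than trained from scratch.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM

VOCAB_SIZE = 50_000  # assumption; set from the corpus vocabulary in practice

model = Sequential([
    Input(shape=(800,)),            # sequences right-padded with zeros to 800 tokens
    Embedding(VOCAB_SIZE, 100),     # 100-d embeddings, fine-tuned during training
    LSTM(128, activation="tanh"),   # only the last hidden state is passed onward
    Dropout(0.5),                   # dropout probability of 0.5
    Dense(3, activation="softmax"), # defect / possible defect / non-defect
])
model.compile(optimizer="adam",     # optimizer is an assumption
              loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```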
Data-level approaches to class imbalance
In addition to assigning higher weights to the minority classes with the SVM classifier, we performed under-sampling of the majority ("non-defect") class using several methods to address the class imbalance problem at the data level: (1) removal of majority class instances that were lexically similar to other instances in the training data, (2) removal of majority class instances that were lexically similar to minority class instances in the development set that were being misclassified as "non-defect" in the preliminary experiments, and (3) random under-sampling of the majority class. During annotation, we found a number of cases in which the same or a similar "non-defect" tweet was posted numerous times (e.g., fundraisers, news headlines, advertisements). Thus, our assumption underlying method (1) was that we could reduce the imbalance without losing linguistic information about the majority class. For method (2), we attempted to strategically remove "non-defect" tweets that were close to the linguistic boundary of the minority classes.
To compute lexical similarities between tweets, we employed the Levenshtein ratio (LR) method, calculated as \(\mathrm{LR} = \frac{\mathrm{lensum} - \mathrm{lendist}}{\mathrm{lensum}}\), where lensum is the sum of the lengths of the two strings, and lendist is the Levenshtein distance between the two strings, that is, the number of deletions, insertions, or substitutions required to transform one string into the other. The equation results in a value between 0 (completely dissimilar strings) and 1 (identical strings). When performing the under-sampling, we removed tweets that produced LR scores above a predefined threshold (k); varying the value of k allowed us to choose the size of the under-sampled training sets. We controlled methods (1) and (2), the similarity-based under-sampling methods, by employing method (3), random under-sampling of the majority class, based on the final size of the training sets for methods (1) and (2).
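A direct sketch of this measure and the threshold-based filtering, following the formula above; the greedy filtering loop is our illustrative assumption, and the study's exact procedure may differ.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (deletions, insertions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_ratio(a: str, b: str) -> float:
    """LR = (lensum - lendist) / lensum, in [0, 1]."""
    lensum = len(a) + len(b)
    return (lensum - levenshtein_distance(a, b)) / lensum if lensum else 1.0

def undersample(tweets: list, k: float = 0.8) -> list:
    """Greedily drop tweets whose LR to an already-kept tweet exceeds threshold k."""
    kept = []
    for t in tweets:
        if all(levenshtein_ratio(t, u) <= k for u in kept):
            kept.append(t)
    return kept
```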
We also performed over-sampling of the minority ("defect" and "possible defect") classes. We balanced the training set by (4) over-sampling the minority classes with replacement (i.e., multiplying the original sets of "defect" and "possible defect" tweets until they were nearly equal in number to the "non-defect" tweets), and (5) utilizing the Synthetic Minority Over-sampling Technique (SMOTE).52 With SMOTE, we nearly balanced the training set by creating artificial instances of "defect" and "possible defect" tweets in vector space. We used the default settings for the SMOTE filter in Weka. We performed all of the other over-sampling and under-sampling methods for all of the classifiers, but applied SMOTE only for the NB and SVM classifiers.
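Outside Weka, both strategies are available in the imbalanced-learn package; the sketch below demonstrates them on synthetic data standing in for the vectorized tweets, with class proportions mimicking the ~5%/5%/90% distribution of the corpus.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE, RandomOverSampler
from sklearn.datasets import make_classification

# Toy three-class data with roughly the 5%/5%/90% imbalance of the corpus
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.05, 0.05, 0.90], random_state=42)

# (4) Replicate minority-class instances with replacement until classes balance
X_dup, y_dup = RandomOverSampler(random_state=42).fit_resample(X, y)

# (5) Synthesize artificial minority-class instances in vector space
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)

print(Counter(y), Counter(y_dup), Counter(y_smote))
```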
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The annotated data that was used for training (Supplementary Data 1) and development (Supplementary Data 2) of machine learning algorithms in this study is available in the supplementary information files of this article. To download the tweets, a Python script is available at: https://bitbucket.org/pennhlp/twitter_data_download/src/master/.
Code availability
The machine learning experiments were performed using open source software, Weka 3.8.2, following preprocessing in Python. The Python code that was subsequently prepared to train and evaluate the SVM classifier for deployment in this study is available at: https://bitbucket.org/pennhlp/birthdefectsclassifier/src/master/. The Penn Social Media Lexicon of Birth Defects (Supplementary Data 3), the lexical variants (Supplementary Data 4), and a file containing a list of 400 names (Supplementary Data 5), which are used for preprocessing in the Python script, are available in the supplementary information files of this article. The word clusters that are used as features in the SVM classifier are available at: http://www.cs.cmu.edu/~ark/TweetNLP/clusters/50mpaths2.
References
Mathews, T. J., MacDorman, M. F. & Thoma, M. E. Infant mortality statistics from the 2013 period linked birth/infant death data set. Natl Vital. Stat. Rep. 64, 2000–2013 (2015).
Blehar, M. C. et al. Enrolling pregnant women: issues in clinical research. Women's Health Issues 23, e39–e45 (2013).
Hartman, R. I. & Kimball, A. B. Performing research in pregnancy: challenges and perspectives. Clin. Dermatol. 34, 410–415 (2016).
Ward, R. M. Difficulties in the study of adverse fetal and neonatal effects of drug therapy during pregnancy. Semin. Perinatol. 25, 191–195 (2001).
Kennedy, D. L., Uhl, K. & Kweder, S. L. Pregnancy exposure registries. Drug Saf. 27, 215–228 (2004).
US Department of Health and Human Services, Food and Drug Administration. Reviewer Guidance: Evaluating the Risks of Drug Exposure in Human Pregnancies. (US Department of Health and Human Services, Food and Drug Administration, 2005). https://www.fda.gov/downloads/Drugs/%E2%80%A6/Guidances/ucm071645.pdf.
Sinclair, S. et al. Advantages and problems with pregnancy registries: observations and surprises throughout the life of the International Lamotrigine Pregnancy Registry. Pharmacoepidemiol. Drug Saf. 23, 779–786 (2014).
Gliklich, R. E., Dreyer, N. A. & Leavy, M. B. Registries for evaluating patient outcomes: a user's guide 3rd edn (Agency for Healthcare Research and Quality, Rockville, MD, 2014).
Smith, A. & Anderson, M. Social media use in 2018. http://www.pewinternet.org/2018/03/01/social-media-use-in-2018/ (2018).
Klein, A. Z., Sarker, A., Cai, H., Weissenbacher, D. & Gonzalez-Hernandez, G. Social media mining for birth defects research: a rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter. J. Biomed. Inform. 87, 68–78 (2018).
Sarker, A. et al. Discovering cohorts of pregnant women from social media for safety surveillance and analysis. J. Med. Internet Res. 19, e361 (2017).
Golder, S. et al. Pharmacoepidemiologic evaluation of birth defects from health-related postings in social media during pregnancy. Drug Saf. 42, 389–400 (2019).
Harris, B. S. et al. Risk factors for birth defects. Obstet. Gynecol. Surv. 72, 123–135 (2017).
Klein, A., Sarker, A., Rouhizadeh, M., O'Connor, K. & Gonzalez, G. Detecting personal medication intake in Twitter: an annotated corpus and baseline classification system. In Proc. BioNLP 2017 Workshop 136–142 (Association for Computational Linguistics, 2017).
Feldkamp, M. L., Carey, J. C., Byrne, J. L. B., Krikov, S. & Botto, L. D. Etiology and clinical presentation of birth defects: population based study. BMJ 357, j2249 (2017).
Gelperin, K. et al. A systematic review of pregnancy exposure registries: examination of protocol-specified pregnancy outcomes, target sample size, and comparator selection. Pharmacoepidemiol. Drug Saf. 26, 208–214 (2017).
Hong, L. & Davison, B. D. Empirical study of topic modeling in Twitter. In Proc. 1st Workshop on Social Media Analytics (SOMA) 80–88 (2010).
Rynn, L., Cragan, J. & Correa, A. Update on overall prevalence of major birth defects: Atlanta, Georgia, 1978–2005. MMWR Morb. Mortal. Wkly. Rep. 57, 1–5 (2008).
De Choudhury, M., Counts, S. & Horvitz, E. Predicting postpartum changes in emotion and behavior via social media. In Proc. SIGCHI Conference on Human Factors in Computing Systems 3267–3276 (ACM, 2013).
Pearlstein, T., Howard, M., Salisbury, A. & Zlotnick, C. Postpartum depression. Am. J. Obstet. Gynecol. 200, 357–364 (2009).
Haixiang, G. et al. Learning from class-imbalanced data: reviews of methods and applications. Expert Syst. Appl. 73, 220–239 (2017).
Japkowicz, N. & Stephen, S. The class imbalance problem: a systematic study. Intell. Data Anal. 6, 429–449 (2002).
Chawla, N. V., Japkowicz, N. & Kotcz, A. Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 6, 1–6 (2004).
Wang, S. et al. Training deep neural networks on imbalanced data sets. In Proc. International Joint Conference on Neural Networks 4368–4374 (IEEE, 2016).
Sarker, A. & Gonzalez, G. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. J. Biomed. Inform. 53, 196–207 (2015).
Sarker, A. et al. Data and systems for medication-related text classification and concept normalization from Twitter: insights from Social Media Mining for Health (SMM4H) 2017 shared task. J. Am. Med. Inform. Assoc. 25, 1274–1283 (2018).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://arxiv.org/abs/1810.04805 (2018).
Buda, M., Maki, A. & Mazurowski, M. A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249–259 (2018).
Mai, C. T. et al. Population-based birth defects data in the United States, 2008–2012: presentation of state-specific data and descriptive brief on variability of prevalence. Birth Defects Res. A Clin. Mol. Teratol. 103, 972–993 (2015).
Yang, Q. et al. Racial differences in infant mortality attributable to birth defects in the United States, 1989–2002. Birth Defects Res. A Clin. Mol. Teratol. 76, 706–713 (2006).
Jenkins, K. J. et al. Noninherited risk factors and congenital cardiovascular defects: current knowledge. Circulation 115, 2995–3014 (2007).
Wu, T. F., Lin, C. J. & Weng, R. C. Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Res. 5, 975–1005 (2004).
Rouhizadeh, M., Magge, A., Klein, A., Sarker, A. & Gonzalez, G. A rule-based approach to determining pregnancy timeframe from contextual social media postings. In Proc. 8th International Conference on Digital Health 16–20 (ACM, 2018).
National Birth Defects Prevention Network. Guidelines for conducting birth defects surveillance. (National Birth Defects Prevention Network, 2004). https://www.nbdpn.org/docs/NBDPN_Guidelines2012.pdf.
Metropolitan Atlanta Congenital Defects Program. Executive summary. Birth Defects Res. A Clin. Mol. Teratol. 79, 66–93 (2007).
Fornoff, J. E. & Shen, T. Birth defects and other adverse pregnancy outcomes in Illinois 2005–2009: a report on county-specific prevalence (2013). http://www.dph.illinois.gov/sites/default/files/publications/ers14-03-birth-defects-inillinois-2005-2009-041516.pdf.
EUROCAT. Guide 1.4: Instruction for the registration of congenital anomalies. (EUROCAT, 2013) http://www.eurocat-network.eu/content/Section%203.3-%2027_Oct2016.pdf.
US National Library of Medicine. UMLS Reference Manual. (US National Library of Medicine, 2009) https://www.ncbi.nlm.nih.gov/books/NBK9676/pdf/Bookshelf_NBK9676.pdf.
Sarker, A. & Gonzalez-Hernandez, G. An unsupervised and customizable misspelling generator for mining noisy health-related text sources. J. Biomed. Inform. 88, 98–107 (2018).
Viera, A. J. & Garrett, J. M. Understanding interobserver agreement: the kappa statistic. Fam. Med. 37, 360–363 (2005).
McCallum, A. & Nigam, K. A comparison of event models for naïve Bayes text classification. In Proc. AAAI-98 Learning for Text Categorization Workshop 41–48 (AAAI, 1998).
El-Manzalawy, Y. & Honavar, V. WLSVM: integrating LibSVM into Weka environment. http://www.cs.iastate.edu/~yasser/wlsvm (2005).
Chang, C. & Lin, C. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27 (2011).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Porter, M. F. An algorithm for suffix stripping. Program 14, 130–137 (1980).
Yang, Y. & Pedersen, J. O. A comparative study on feature selection in text categorization. In Proc. 14th International Conference on Machine Learning 412–420 (Morgan Kaufmann Publishers Inc., 1997).
Social Security Administration. Top Names of the Period 2010–2017. (Social Security Administration) https://www.ssa.gov/oact/babynames/decades/names2010s.html.
Sarker, A. et al. Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter. Drug Saf. 39, 231–240 (2016).
Owoputi, O., O'Connor, B., Dyer, C., Gimpel, K. & Schneider, N. Part-of-speech tagging for Twitter: word clusters and other advances (2012). http://www.cs.cmu.edu/~ark/TweetNLP/owoputi+etal.tr12.pdf.
Pennington, J., Socher, R. & Manning, C. D. GloVe: global vectors for word representation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1532–1543 (2014).
Hsu, C., Chang, C. & Lin, C. A practical guide to support vector classification. https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).
Acknowledgements
This work was funded by the National Institutes of Health (NIH) National Library of Medicine (NLM) award number R01LM011176. The content is solely the responsibility of the authors, and does not necessarily represent the views of the NIH or NLM. The authors would like to acknowledge Karen O'Connor and Alexis Upshur for their annotation efforts, and Haitao Cai for implementing database queries, contributing to the preprocessing code, and deploying the classifier.
Author information
Contributions
A.Z.K. preprocessed the data and performed the machine learning experiments in Weka for the Multinomial Naïve Bayes and SVM classifiers, conceptualized the application of the under-sampling and over-sampling approaches, performed the error analysis, and wrote the majority of the manuscript. A.S. contributed to preprocessing and preparing the training, development, and test sets, designed and deployed the implementation of the similarity-based under-sampling approaches, implemented the word clusters and the tweets' word/character lengths as features, wrote the descriptions of these features and how the similarity-based under-sampled training sets were developed, and prepared the Python script to train and evaluate the SVM classifier for deployment. For the deep learning-based classifier, D.W. preprocessed the data and performed the machine learning experiments, and wrote the descriptions of the preprocessing, features, and classifier parameters. G.G.H. provided the overall study design and guidance for mining social media for birth defect outcomes, and edited the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Klein, A.Z., Sarker, A., Weissenbacher, D. et al. Towards scaling Twitter for digital epidemiology of birth defects. npj Digit. Med. 2, 96 (2019). https://doi.org/10.1038/s41746-019-0170-5