-
The Resume Paradox: Greater Language Differences, Smaller Pay Gaps
Authors:
Joshua R. Minot,
Marc Maier,
Bradford Demarest,
Nicholas Cheney,
Christopher M. Danforth,
Peter Sheridan Dodds,
Morgan R. Frank
Abstract:
Over the past decade, the gender pay gap has remained steady with women earning 84 cents for every dollar earned by men on average. Many studies explain this gap through demand-side bias in the labor market represented through employers' job postings. However, few studies analyze potential bias from the worker supply-side. Here, we analyze the language in millions of US workers' resumes to investigate how differences in workers' self-representation by gender compare to differences in earnings. Across US occupations, language differences between male and female resumes correspond to 11% of the variation in gender pay gap. This suggests that females' resumes that are semantically similar to males' resumes may have greater wage parity. However, surprisingly, occupations with greater language differences between male and female resumes have lower gender pay gaps. A doubling of the language difference between female and male resumes results in an annual wage increase of $2,797 for the average female worker. This result holds with controls for gender-biases of resume text and we find that per-word bias poorly describes the variance in wage gap. The results demonstrate that textual data and self-representation are valuable factors for improving worker representations and understanding employment inequities.
Submitted 17 July, 2023;
originally announced July 2023.
-
A blind spot for large language models: Supradiegetic linguistic information
Authors:
Julia Witte Zimmerman,
Denis Hudon,
Kathryn Cramer,
Jonathan St. Onge,
Mikaela Fudolig,
Milo Z. Trujillo,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Large Language Models (LLMs) like ChatGPT reflect profound changes in the field of Artificial Intelligence, achieving a linguistic fluency that is impressively, even shockingly, human-like. The extent of their current and potential capabilities is an active area of investigation by no means limited to scientific researchers. It is common for people to frame the training data for LLMs as "text" or even "language". We examine the details of this framing using ideas from several areas, including linguistics, embodied cognition, cognitive science, mathematics, and history. We propose that considering what it is like to be an LLM like ChatGPT, as Nagel might have put it, can help us gain insight into its capabilities in general, and in particular, that its exposure to linguistic training data can be productively reframed as exposure to the diegetic information encoded in language, and its deficits can be reframed as ignorance of extradiegetic information, including supradiegetic linguistic information. Supradiegetic linguistic information consists of those arbitrary aspects of the physical form of language that are not derivable from the one-dimensional relations of context -- frequency, adjacency, proximity, co-occurrence -- that LLMs like ChatGPT have access to. Roughly speaking, the diegetic portion of a word can be thought of as its function, its meaning, as the information in a theoretical vector in a word embedding, while the supradiegetic portion of the word can be thought of as its form, like the shapes of its letters or the sounds of its syllables. We use these concepts to investigate why LLMs like ChatGPT have trouble handling palindromes, the visual characteristics of symbols, translating Sumerian cuneiform, and continuing integer sequences.
Submitted 16 May, 2024; v1 submitted 11 June, 2023;
originally announced June 2023.
-
Park visitation and walkshed demographics in the United States
Authors:
Kelsey Linnell,
Mikaela Fudolig,
Laura Bloomfield,
Thomas McAndrew,
Taylor H. Ricketts,
Jarlath P. M. O'Neil-Dunne,
Peter Sheridan Dodds,
Christopher M. Danforth
Abstract:
A large and growing body of research demonstrates the value of local parks to mental and physical well-being. Recently, researchers have begun using passive digital data sources to investigate equity in usage; exactly who is benefiting from parks? Early studies suggest that park visitation differs according to demographic features, and that the demographic composition of a park's surrounding neighborhood may be related to the utilization a park receives. Employing a data set of park visitations generated by observations of roughly 50 million mobile devices in the US in 2019, we assess the ability of the demographic composition of a park's walkshed to predict its yearly visitation. Predictive models are constructed using Support Vector Regression, LASSO, Elastic Net, and Random Forests. Surprisingly, our results suggest that the demographic composition of a park's walkshed demonstrates little to no utility for predicting visitation.
Submitted 20 May, 2023;
originally announced May 2023.
-
An assessment of measuring local levels of homelessness through proxy social media signals
Authors:
Yoshi Meke Bird,
Sarah E. Grobe,
Michael V. Arnold,
Sean P. Rogers,
Mikaela I. Fudolig,
Julia Witte Zimmerman,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Recent studies suggest social media activity can function as a proxy for measures of state-level public health, detectable through natural language processing. We present results of our efforts to apply this approach to estimate homelessness at the state level throughout the US during the period 2010-2019 and 2022 using a dataset of roughly 1 million geotagged tweets containing the substring "homeless." Correlations between homelessness-related tweet counts and ranked per capita homelessness volume, but not general-population densities, suggest a relationship between the likelihood of Twitter users to personally encounter or observe homelessness in their everyday lives and their likelihood to communicate about it online. An increase to the log-odds of "homeless" appearing in an English-language tweet, as well as an acceleration in the increase in average tweet sentiment, suggest that tweets about homelessness are also affected by trends at the nation-scale. Additionally, changes to the lexical content of tweets over time suggest that reversals to the polarity of national or state-level trends may be detectable through an increase in political or service-sector language over the semantics of charity or direct appeals. An analysis of user account type also revealed changes to Twitter-use patterns by accounts authored by individuals versus entities that may provide an additional signal to confirm changes to homelessness density in a given jurisdiction. While a computational approach to social media analysis may provide a low-cost, real-time dataset rich with information about nationwide and localized impacts of homelessness and homelessness policy, we find that practical issues abound, limiting the potential of social media as a proxy to complement other measures of homelessness.
Submitted 15 May, 2023;
originally announced May 2023.
-
Curating corpora with classifiers: A case study of clean energy sentiment online
Authors:
Michael V. Arnold,
Peter Sheridan Dodds,
Christopher M. Danforth
Abstract:
Well curated, large-scale corpora of social media posts containing broad public opinion offer an alternative data source to complement traditional surveys. While surveys are effective at collecting representative samples and are capable of achieving high accuracy, they can be both expensive to run and lag public opinion by days or weeks. Both of these drawbacks could be overcome with a real-time, high volume data stream and fast analysis pipeline. A central challenge in orchestrating such a data pipeline is devising an effective method for rapidly selecting the best corpus of relevant documents for analysis. Querying with keywords alone often includes irrelevant documents that are not easily disambiguated with bag-of-words natural language processing methods. Here, we explore methods of corpus curation to filter irrelevant tweets using pre-trained transformer-based models, fine-tuned for our binary classification task on hand-labeled tweets. We are able to achieve F1 scores of up to 0.95. The low cost and high performance of fine-tuning such a model suggests that our approach could be of broad benefit as a pre-processing step for social media datasets with uncertain corpus boundaries.
Submitted 9 May, 2023; v1 submitted 4 May, 2023;
originally announced May 2023.
-
A decomposition of book structure through ousiometric fluctuations in cumulative word-time
Authors:
Mikaela Irene Fudolig,
Thayer Alshaabi,
Kathryn Cramer,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
While quantitative methods have been used to examine changes in word usage in books, studies have focused on overall trends, such as the shapes of narratives, which are independent of book length. We instead look at how words change over the course of a book as a function of the number of words, rather than the fraction of the book, completed at any given point; we define this measure as "cumulative word-time". Using ousiometrics, a reinterpretation of the valence-arousal-dominance framework of meaning obtained from semantic differentials, we convert text into time series of power and danger scores in cumulative word-time. Each time series is then decomposed using empirical mode decomposition into a sum of constituent oscillatory modes and a non-oscillatory trend. By comparing the decomposition of the original power and danger time series with those derived from shuffled text, we find that shorter books exhibit only a general trend, while longer books have fluctuations in addition to the general trend. These fluctuations typically have a period of a few thousand words regardless of the book length or library classification code, but vary depending on the content and structure of the book. Our findings suggest that, in the ousiometric sense, longer books are not expanded versions of shorter books, but are more similar in structure to a concatenation of shorter texts. Further, they are consistent with editorial practices that require longer texts to be broken down into sections, such as chapters. Our method also provides a data-driven denoising approach that works for texts of various lengths, in contrast to the more traditional approach of using large window sizes that may inadvertently smooth out relevant information, especially for shorter texts. These results open up avenues for future work in computational literary analysis, particularly the measurement of a basic unit of narrative.
Submitted 11 May, 2023; v1 submitted 19 August, 2022;
originally announced August 2022.
-
Ousiometrics and Telegnomics: The essence of meaning conforms to a two-dimensional powerful-weak and dangerous-safe framework with diverse corpora presenting a safety bias
Authors:
P. S. Dodds,
T. Alshaabi,
M. I. Fudolig,
J. W. Zimmerman,
J. Lovato,
S. Beaulieu,
J. R. Minot,
M. V. Arnold,
A. J. Reagan,
C. M. Danforth
Abstract:
We define 'ousiometrics' to be the study of essential meaning in whatever context that meaningful signals are communicated, and 'telegnomics' as the study of remotely sensed knowledge. From work emerging through the middle of the 20th century, the essence of meaning has become generally accepted as being well captured by the three orthogonal dimensions of evaluation, potency, and activation (EPA). By re-examining first types and then tokens for the English language, and through the use of automatically annotated histograms -- 'ousiograms' -- we find here that: 1. The essence of meaning conveyed by words is instead best described by a compass-like power-danger (PD) framework, and 2. Analysis of a disparate collection of large-scale English language corpora -- literature, news, Wikipedia, talk radio, and social media -- shows that natural language exhibits a systematic bias toward safe, low danger words -- a reinterpretation of the Pollyanna principle's positivity bias for written expression. To help justify our choice of dimension names and to help address the problems with representing observed ousiometric dimensions by bipolar adjective pairs, we introduce and explore 'synousionyms' and 'antousionyms' -- ousiometric counterparts of synonyms and antonyms. We further show that the PD framework revises the circumplex model of affect as a more general model of state of mind. Finally, we use our findings to construct and test a prototype 'ousiometer', a telegnomic instrument that measures ousiometric time series for temporal corpora. We contend that our power-danger ousiometric framework provides a complement for entropy-based measurements, and may be of value for the study of a wide variety of communication across biological and artificial life.
Submitted 29 March, 2023; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Sentiment and structure in word co-occurrence networks on Twitter
Authors:
Mikaela Irene Fudolig,
Thayer Alshaabi,
Michael V. Arnold,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
We explore the relationship between context and happiness scores in political tweets using word co-occurrence networks, where nodes in the network are the words, and the weight of an edge is the number of tweets in the corpus for which the two connected words co-occur. In particular, we consider tweets with hashtags #imwithher and #crookedhillary, both relating to Hillary Clinton's presidential bid in 2016. We then analyze the network properties in conjunction with the word scores by comparing with null models to separate the effects of the network structure and the score distribution. Neutral words are found to be dominant and most words, regardless of polarity, tend to co-occur with neutral words. We do not observe any score homophily among positive and negative words. However, when we perform network backboning, community detection results in word groupings with meaningful narratives, and the happiness scores of the words in each group correspond to its respective theme. Thus, although we observe no clear relationship between happiness scores and co-occurrence at the node or edge level, a community-centric approach can isolate themes of competing sentiments in a corpus.
Submitted 1 October, 2021;
originally announced October 2021.
-
Augmenting semantic lexicons using word embeddings and transfer learning
Authors:
Thayer Alshaabi,
Colin M. Van Oort,
Mikaela Irene Fudolig,
Michael V. Arnold,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Sentiment-aware intelligent systems are essential to a wide array of applications. These systems are driven by language models which broadly fall into two paradigms: Lexicon-based and contextual. Although recent contextual models are increasingly dominant, we still see demand for lexicon-based models because of their interpretability and ease of use. For example, lexicon-based models allow researchers to readily determine which words and phrases contribute most to a change in measured sentiment. A challenge for any lexicon-based approach is that the lexicon needs to be routinely expanded with new words and expressions. Here, we propose two models for automatic lexicon expansion. Our first model establishes a baseline employing a simple and shallow neural network initialized with pre-trained word embeddings using a non-contextual approach. Our second model improves upon our baseline, featuring a deep Transformer-based network that brings to bear word definitions to estimate their lexical polarity. Our evaluation shows that both models are able to score new words with a similar accuracy to reviewers from Amazon Mechanical Turk, but at a fraction of the cost.
Submitted 2 November, 2021; v1 submitted 18 September, 2021;
originally announced September 2021.
-
Blending search queries with social media data to improve forecasts of economic indicators
Authors:
Yi Li,
Asieh Ahani,
Haimao Zhan,
Kevin Foley,
Thayer Alshaabi,
Kelsey Linnell,
Peter Sheridan Dodds,
Christopher M. Danforth,
Adam Fox
Abstract:
The forecasting of political, economic, and public health indicators using internet activity has demonstrated mixed results. For example, while some measures of explicitly surveyed public opinion correlate well with social media proxies, the opportunity for profitable investment strategies to be driven solely by sentiment extracted from social media appears to have expired. Nevertheless, the internet's space of potentially predictive input signals is combinatorially vast and will continue to invite careful exploration. Here, we combine unemployment related search data from Google Trends with economic language on Twitter to attempt to nowcast and forecast: 1. State and national unemployment claims for the US, and 2. Consumer confidence in G7 countries. Building off of a recently developed search-query-based model, we show that incorporating Twitter data improves forecasting of unemployment claims, while the original method remains marginally better at nowcasting. Enriching the input signal with temporal statistical features (e.g., moving average and rate of change) further reduces errors, and improves the predictive utility of the proposed method when applied to other economic indices, such as consumer confidence.
Submitted 9 July, 2021;
originally announced July 2021.
-
Computational Paremiology: Charting the temporal, ecological dynamics of proverb use in books, news articles, and tweets
Authors:
E. Davis,
C. M. Danforth,
W. Mieder,
P. S. Dodds
Abstract:
Proverbs are an essential component of language and culture, and though much attention has been paid to their history and currency, there has been comparatively little quantitative work on changes in the frequency with which they are used over time. With wider availability of large corpora reflecting many diverse genres of documents, it is now possible to take a broad and dynamic view of the importance of the proverb. Here, we measure temporal changes in the relevance of proverbs within three corpora, differing in kind, scale, and time frame: Millions of books over centuries; hundreds of millions of news articles over twenty years; and billions of tweets over a decade. We find that proverbs present heavy-tailed frequency-of-usage rank distributions in each venue; exhibit trends reflecting the cultural dynamics of the eras covered; and have evolved into contemporary forms on social media.
Submitted 10 July, 2021;
originally announced July 2021.
-
Say Their Names: Resurgence in the collective attention toward Black victims of fatal police violence following the death of George Floyd
Authors:
Henry H. Wu,
Ryan J. Gallagher,
Thayer Alshaabi,
Jane L. Adams,
Joshua R. Minot,
Michael V. Arnold,
Brooke Foucault Welles,
Randall Harp,
Peter Sheridan Dodds,
Christopher M. Danforth
Abstract:
The murder of George Floyd by police in May 2020 sparked international protests and renewed attention in the Black Lives Matter movement. Here, we characterize ways in which the online activity following George Floyd's death was unparalleled in its volume and intensity, including setting records for activity on Twitter, prompting the saddest day in the platform's history, and causing George Floyd's name to appear among the ten most frequently used phrases in a day, where he is the only individual to have ever received that level of attention who was not known to the public earlier that same week. Further, we find this attention extended beyond George Floyd and that more Black victims of fatal police violence received attention following his death than during other past moments in Black Lives Matter's history. We place that attention within the context of prior online racial justice activism by showing how the names of Black victims of police violence have been lifted and memorialized over the last 12 years on Twitter. Our results suggest that the 2020 wave of attention to the Black Lives Matter movement centered past instances of police violence in an unprecedented way, demonstrating the impact of the movement's rhetorical strategy to "say their names."
Submitted 18 June, 2021;
originally announced June 2021.
-
Sirius: Visualization of Mixed Features as a Mutual Information Network Graph
Authors:
Jane L. Adams,
Todd F. Deluca,
Christopher M. Danforth,
Peter S. Dodds,
Yuhang Zheng,
Konstantinos Anastasakis,
Boyoon Choi,
Allison Min,
Michael M. Bessey
Abstract:
Data scientists across disciplines are increasingly in need of exploratory analysis tools for data sets with a high volume of features of mixed data type (quantitative continuous and discrete categorical). We introduce Sirius, a novel visualization package for researchers to explore feature relationships among mixed data types using mutual information. The visualization of feature relationships aids data scientists in finding meaningful dependence among features prior to the development of predictive modeling pipelines, which can inform downstream analysis such as feature selection, feature extraction, and early detection of potential proxy variables. Using an information theoretic approach, Sirius supports network visualization of heterogeneous data sets (consisting of continuous and discrete data types), and provides a user interface for exploring feature pairs with locally significant mutual information scores. Mutual information algorithm and bivariate chart types are assigned on a data type pairing basis (continuous-continuous, discrete-discrete, and discrete-continuous). We show how this tool can be used for tasks such as hypothesis confirmation, identification of predictive features, suggestions for feature extraction, or early warning of data abnormalities. The accompanying website for this paper can be accessed at https://sirius.universalities.com/. All code and supplemental materials can be accessed at https://osf.io/pdm9r/.
Submitted 13 August, 2022; v1 submitted 9 June, 2021;
originally announced June 2021.
-
Quantifying language changes surrounding mental health on Twitter
Authors:
Anne Marie Stupinski,
Thayer Alshaabi,
Michael V. Arnold,
Jane Lydia Adams,
Joshua R. Minot,
Matthew Price,
Peter Sheridan Dodds,
Christopher M. Danforth
Abstract:
Mental health challenges are thought to afflict around 10% of the global population each year, with many going untreated due to stigma and limited access to services. Here, we explore trends in words and phrases related to mental health through a collection of 1-, 2-, and 3-grams parsed from a data stream of roughly 10% of all English tweets since 2012. We examine temporal dynamics of mental health language, finding that the popularity of the phrase 'mental health' increased by nearly two orders of magnitude between 2012 and 2018. We observe that mentions of 'mental health' spike annually and reliably due to mental health awareness campaigns, as well as unpredictably in response to mass shootings, celebrities dying by suicide, and popular fictional stories portraying suicide. We find that the level of positivity of messages containing 'mental health', while stable through the growth period, has declined recently. Finally, we use the ratio of original tweets to retweets to quantify the fraction of appearances of mental health language due to social amplification. Since 2015, mentions of mental health have become increasingly due to retweets, suggesting that stigma associated with discussion of mental health on Twitter has diminished with time.
Submitted 2 June, 2021;
originally announced June 2021.
-
The incel lexicon: Deciphering the emergent cryptolect of a global misogynistic community
Authors:
Kelly Gothard,
David Rushing Dewhurst,
Joshua R. Minot,
Jane Lydia Adams,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Evolving out of a gender-neutral framing of an involuntary celibate identity, the concept of 'incels' has come to refer to an online community of men who bear antipathy towards themselves, women, and society-at-large for their perceived inability to find and maintain sexual relationships. By exploring incel language use on Reddit, a global online message board, we contextualize the incel community's online expressions of misogyny and real-world acts of violence perpetrated against women. After assembling around three million comments from incel-themed Reddit channels, we analyze the temporal dynamics of a data-driven rank ordering of the glossary of phrases belonging to an emergent incel lexicon. Our study reveals the generation and normalization of an extensive coded misogynist vocabulary in service of the group's identity.
Submitted 25 May, 2021;
originally announced May 2021.
-
Interpretable bias mitigation for textual data: Reducing gender bias in patient notes while maintaining classification performance
Authors:
Joshua R. Minot,
Nicholas Cheney,
Marc Maier,
Danne C. Elbers,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Medical systems in general, and patient treatment decisions and outcomes in particular, are affected by bias based on gender and other demographic elements. As language models are increasingly applied to medicine, there is a growing interest in building algorithmic fairness into processes impacting patient care. Much of the work addressing this question has focused on biases encoded in language models -- statistical estimates of the relationships between concepts derived from distant reading of corpora. Building on this work, we investigate how word choices made by healthcare practitioners and language models interact with regards to bias. We identify and remove gendered language from two clinical-note datasets and describe a new debiasing procedure using BERT-based gender classifiers. We show minimal degradation in health condition classification tasks for low- to medium-levels of bias removal via data augmentation. Finally, we compare the bias semantically encoded in the language models with the bias empirically observed in health records. This work outlines an interpretable approach for using data augmentation to identify and reduce the potential for bias in natural language processing pipelines.
Submitted 9 March, 2021;
originally announced March 2021.
-
Probability-turbulence divergence: A tunable allotaxonometric instrument for comparing heavy-tailed categorical distributions
Authors:
P. S. Dodds,
J. R. Minot,
M. V. Arnold,
T. Alshaabi,
J. L. Adams,
D. R. Dewhurst,
A. J. Reagan,
C. M. Danforth
Abstract:
Real-world complex systems often comprise many distinct types of elements as well as many more types of networked interactions between elements. When the relative abundances of types can be measured well, we further observe heavy-tailed categorical distributions for type frequencies. For the comparison of type frequency distributions of two systems or a system with itself at different points in time -- a facet of allotaxonometry -- a great range of probability divergences are available. Here, we introduce and explore `probability-turbulence divergence', a tunable, straightforward, and interpretable instrument for comparing normalizable categorical frequency distributions. We model probability-turbulence divergence (PTD) after rank-turbulence divergence (RTD). While probability-turbulence divergence is more limited in application than rank-turbulence divergence, it is more sensitive to changes in type frequency. We build allotaxonographs to display probability turbulence, incorporating a way to visually accommodate zero probabilities for `exclusive types' which are types that appear in only one system. We explore comparisons of example distributions taken from literature, social media, and ecology. We show how probability-turbulence divergence either explicitly or functionally generalizes many existing kinds of distances and measures, including, as special cases, $L^{(p)}$ norms, the Sørensen-Dice coefficient (the $F_1$ statistic), and the Hellinger distance. We discuss similarities with the generalized entropies of Rényi and Tsallis, and the diversity indices (or Hill numbers) from ecology. We close with thoughts on open problems concerning the optimization of the tuning of rank- and probability-turbulence divergence.
Submitted 29 August, 2020;
originally announced August 2020.
-
Long-term word frequency dynamics derived from Twitter are corrupted: A bespoke approach to detecting and removing pathologies in ensembles of time series
Authors:
P. S. Dodds,
J. R. Minot,
M. V. Arnold,
T. Alshaabi,
J. L. Adams,
D. R. Dewhurst,
A. J. Reagan,
C. M. Danforth
Abstract:
Maintaining the integrity of long-term data collection is an essential scientific practice. As a field evolves, so too will that field's measurement instruments and data storage systems, as they are invented, improved upon, and made obsolete. For data streams generated by opaque sociotechnical systems which may have episodic and unknown internal rule changes, detecting and accounting for shifts in historical datasets requires vigilance and creative analysis. Here, we show that around 10% of day-scale word usage frequency time series for Twitter collected in real time for a set of roughly 10,000 frequently used words for over 10 years come from tweets with, in effect, corrupted language labels. We describe how we uncovered problematic signals while comparing word usage over varying time frames. We locate time points where Twitter switched on or off different kinds of language identification algorithms, and where data formats may have changed. We then show how we create a statistic for identifying and removing words with pathological time series. While our resulting process for removing `bad' time series from ensembles of time series is particular, the approach leading to its construction may be generalizable.
Submitted 27 August, 2020; v1 submitted 25 August, 2020;
originally announced August 2020.
-
Computational timeline reconstruction of the stories surrounding Trump: Story turbulence, narrative control, and collective chronopathy
Authors:
P. S. Dodds,
J. R. Minot,
M. V. Arnold,
T. Alshaabi,
J. L. Adams,
A. J. Reagan,
C. M. Danforth
Abstract:
Measuring the specific kind, temporal ordering, diversity, and turnover rate of stories surrounding any given subject is essential to developing a complete reckoning of that subject's historical impact. Here, we use Twitter as a distributed news and opinion aggregation source to identify and track the dynamics of the dominant day-scale stories around Donald Trump, the 45th President of the United States. Working with a data set comprising around 20 billion 1-grams, we first compare each day's 1-gram and 2-gram usage frequencies to those of a year before, to create day- and week-scale timelines for Trump stories for 2016 through 2020. We measure Trump's narrative control, the extent to which stories have been about Trump or put forward by Trump. We then quantify story turbulence and collective chronopathy -- the rate at which a population's stories for a subject seem to change over time. We show that 2017 was the most turbulent overall year for Trump. In 2020, story generation slowed dramatically during the first two major waves of the COVID-19 pandemic, with rapid turnover returning first with the Black Lives Matter protests following George Floyd's murder and then later by events leading up to and following the 2020 US presidential election, including the storming of the US Capitol six days into 2021. Trump story turnover for 2 months during the COVID-19 pandemic was on par with that of 3 days in September 2017. Our methods may be applied to any well-discussed phenomenon, and have potential to enable the computational aspects of journalism, history, and biography.
Submitted 30 September, 2022; v1 submitted 17 August, 2020;
originally announced August 2020.
-
Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts
Authors:
Ryan J. Gallagher,
Morgan R. Frank,
Lewis Mitchell,
Aaron J. Schwartz,
Andrew J. Reagan,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content. However, collapsing the texts' rich stories into a single number is often conceptually perilous, and it is difficult to confidently interpret interesting or unexpected textual patterns without looming concerns about data artifacts or measurement validity. To better capture fine-grained differences between texts, we introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts for any measure that can be formulated as a weighted average. We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback-Leibler and Jensen-Shannon divergences. Through several case studies, we demonstrate how generalized word shift graphs can be flexibly applied across domains for diagnostic investigation, hypothesis generation, and substantive interpretation. By providing a detailed lens into textual shifts between corpora, generalized word shift graphs help computational social scientists, digital humanists, and other text analysis practitioners fashion more robust scientific narratives.
Submitted 5 August, 2020;
originally announced August 2020.
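The abstract above notes that measures such as the Jensen-Shannon divergence can be written as a sum of per-word terms, which is what lets a word shift graph rank individual words. The following is a minimal sketch of that decomposition, not the authors' implementation; the function name `jsd_contributions` and the token-list inputs are assumptions made for illustration.

```python
import math
from collections import Counter


def jsd_contributions(tokens_a, tokens_b):
    """Return each word's additive contribution to the Jensen-Shannon
    divergence (base 2) between two tokenized texts.

    Hypothetical helper for illustration only; inputs are lists of tokens.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both texts must contain at least one token")

    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())

    contributions = {}
    for word in set(counts_a) | set(counts_b):
        p = counts_a[word] / total_a      # relative frequency in text A
        q = counts_b[word] / total_b      # relative frequency in text B
        m = 0.5 * (p + q)                 # frequency in the equal-weight mixture
        term = 0.0
        if p > 0:
            term += 0.5 * p * math.log2(p / m)
        if q > 0:
            term += 0.5 * q * math.log2(q / m)
        contributions[word] = term
    return contributions


# Toy example: rank words by how much they drive the divergence.
text_a = "the park was green and quiet".split()
text_b = "the street was loud and grey".split()
ranked = sorted(jsd_contributions(text_a, text_b).items(),
                key=lambda item: item[1], reverse=True)
total_jsd = sum(value for _, value in ranked)
```

Summing the returned values gives the total divergence, so the sorted list shows which words account for most of the difference between the two texts.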
-
Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter
Authors:
Thayer Alshaabi,
Jane L. Adams,
Michael V. Arnold,
Joshua R. Minot,
David R. Dewhurst,
Andrew J. Reagan,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
In real-time, social media data strongly imprints world events, popular culture, and day-to-day conversations by millions of ordinary people at a scale that is scarcely conventionalized and recorded. Vitally, and absent from many standard corpora such as books and news archives, sharing and commenting mechanisms are native to social media platforms, enabling us to quantify social amplification (i.e., popularity) of trending storylines and contemporary cultural phenomena. Here, we describe Storywrangler, a natural language processing instrument designed to carry out an ongoing, day-scale curation of over 100 billion tweets containing roughly 1 trillion 1-grams from 2008 to 2021. For each day, we break tweets into unigrams, bigrams, and trigrams spanning over 100 languages. We track n-gram usage frequencies, and generate Zipf distributions, for words, hashtags, handles, numerals, symbols, and emojis. We make the data set available through an interactive time series viewer, and as downloadable time series and daily distributions. Although Storywrangler leverages Twitter data, our method of extracting and tracking dynamic changes of n-grams can be extended to any similar social media platform. We showcase a few examples of the many possible avenues of study we aim to enable including how social amplification can be visualized through 'contagiograms'. We also present some example case studies that bridge n-gram time series with disparate data sources to explore sociotechnical dynamics of famous individuals, box office success, and social unrest.
Submitted 16 July, 2021; v1 submitted 25 July, 2020;
originally announced July 2020.
-
Local information sources received the most attention from Puerto Ricans during the aftermath of Hurricane María
Authors:
Benjamin Freixas Emery,
Meredith T. Niles,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
In September 2017, Hurricane María made landfall across the Caribbean region as a category 4 storm. In the aftermath, many residents of Puerto Rico were without power or clean running water for nearly a year. Using both English and Spanish tweets from September 16 to October 15 2017, we investigate discussion of María both on and off the island, constructing a proxy for the temporal network of communication between victims of the hurricane and others. We use information theoretic tools to compare the lexical divergence of different subgroups within the network. Lastly, we quantify temporal changes in user prominence throughout the event. We find at the global level that Spanish tweets more often contained messages of hope and a focus on those helping. At the local level, we find that information propagating among Puerto Ricans most often originated from sources local to the island, such as journalists and politicians. Critically, content from these accounts overshadows content from celebrities, global news networks, and the like for the large majority of the time period studied. Our findings reveal insight into ways social media campaigns could be deployed to disseminate relief information during similar events in the future.
Submitted 17 July, 2020;
originally announced July 2020.
-
Gauging the happiness benefit of US urban parks through Twitter
Authors:
A. J. Schwartz,
P. S. Dodds,
J. P. M. O'Neil-Dunne,
T. H. Ricketts,
C. M. Danforth
Abstract:
The relationship between nature contact and mental well-being has received increasing attention in recent years. While a body of evidence has accumulated demonstrating a positive relationship between time in nature and mental well-being, there have been few studies comparing this relationship in different locations over long periods of time. In this study, we estimate a happiness benefit, the difference in expressed happiness between in- and out-of-park tweets, for the 25 largest cities in the US by population. People write happier words during park visits when compared with non-park user tweets collected around the same time. While the words people write are happier in parks on average and in most cities, we find considerable variation across cities. Tweets are happier in parks at all times of the day, week, and year, not just during the weekend or summer vacation. Across all cities, we find that the happiness benefit is highest in parks larger than 100 acres. Overall, our study suggests the happiness benefit associated with park visitation is on par with US holidays such as Thanksgiving and New Year's Day.
Submitted 18 June, 2020;
originally announced June 2020.
-
The sociospatial factors of death: Analyzing effects of geospatially-distributed variables in a Bayesian mortality model for Hong Kong
Authors:
Thayer Alshaabi,
David Rushing Dewhurst,
James P. Bagrow,
Peter Sheridan Dodds,
Christopher M. Danforth
Abstract:
Human mortality is in part a function of multiple socioeconomic factors that differ both spatially and temporally. Adjusting for other covariates, the human lifespan is positively associated with household wealth. However, the extent to which mortality in a geographical region is a function of socioeconomic factors in both that region and its neighbors is unclear. There is also little information on the temporal components of this relationship. Using the districts of Hong Kong over multiple census years as a case study, we demonstrate that there are differences in how wealth indicator variables are associated with longevity in (a) areas that are affluent but neighbored by socially deprived districts versus (b) wealthy areas surrounded by similarly wealthy districts. We also show that the inclusion of spatially-distributed variables reduces uncertainty in mortality rate predictions in each census year when compared with a baseline model. Our results suggest that geographic mortality models should incorporate nonlocal information (e.g., spatial neighbors) to lower the variance of their mortality estimates, and point to a more in-depth analysis of sociospatial spillover effects on mortality rates.
Submitted 25 January, 2021; v1 submitted 15 June, 2020;
originally announced June 2020.
-
Ratioing the President: An exploration of public engagement with Obama and Trump on Twitter
Authors:
Joshua R. Minot,
Michael V. Arnold,
Thayer Alshaabi,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
The past decade has witnessed a marked increase in the use of social media by politicians, most notably exemplified by the 45th President of the United States (POTUS), Donald Trump. On Twitter, POTUS messages consistently attract high levels of engagement as measured by likes, retweets, and replies. Here, we quantify the balance of these activities, also known as "ratios", and study their dynamics as a proxy for collective political engagement in response to presidential communications. We find that raw activity counts increase during the period leading up to the 2016 election, accompanied by a regime change in the ratio of retweets-to-replies connected to the transition between campaigning and governing. For the Trump account, we find words related to fake news and the Mueller inquiry are more common in tweets with a high number of replies relative to retweets. Finally, we find that Barack Obama consistently received a higher retweet-to-reply ratio than Donald Trump. These results suggest Trump's Twitter posts are more often controversial and subject to enduring engagement as a given news cycle unfolds.
Submitted 5 June, 2020;
originally announced June 2020.
-
Divergent modes of online collective attention to the COVID-19 pandemic are associated with future caseload variance
Authors:
David Rushing Dewhurst,
Thayer Alshaabi,
Michael V. Arnold,
Joshua R. Minot,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Using a random 10% sample of tweets authored from 2019-09-01 through 2020-04-30, we analyze the dynamic behavior of words (1-grams) used on Twitter to describe the ongoing COVID-19 pandemic. Across 24 languages, we find two distinct dynamic regimes: One characterizing the rise and subsequent collapse in collective attention to the initial Coronavirus outbreak in late January, and a second that represents March COVID-19-related discourse. Aggregating countries by dominant language use, we find that volatility in the first dynamic regime is associated with future volatility in new cases of COVID-19 roughly three weeks (average 22.49 $\pm$ 3.26 days) later. Our results suggest that surveillance of change in usage of epidemiology-related words on social media may be useful in forecasting later change in disease case numbers, but we emphasize that our current findings are not causal or necessarily predictive.
Submitted 19 May, 2020; v1 submitted 7 April, 2020;
originally announced April 2020.
-
Hurricanes and hashtags: Characterizing online collective attention for natural disasters
Authors:
Michael V. Arnold,
David Rushing Dewhurst,
Thayer Alshaabi,
Joshua R. Minot,
Jane L. Adams,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
We study collective attention paid towards hurricanes through the lens of $n$-grams on Twitter, a social media platform with global reach. Using hurricane name mentions as a proxy for awareness, we find that the exogenous temporal dynamics are remarkably similar across storms, but that overall collective attention varies widely even among storms causing comparable deaths and damage. We construct `hurricane attention maps' and observe that hurricanes causing deaths on (or economic damage to) the continental United States generate substantially more attention in English language tweets than those that do not. We find that a hurricane's Saffir-Simpson wind scale category assignment is strongly associated with the amount of attention it receives. Higher category storms receive higher proportional increases of attention per proportional increases in number of deaths or dollars of damage, than lower category storms. The most damaging and deadly storms of the 2010s, Hurricanes Harvey and Maria, generated the most attention and were remembered the longest, respectively. On average, a category 5 storm receives 4.6 times more attention than a category 1 storm causing the same number of deaths and economic damage.
Submitted 31 March, 2020;
originally announced March 2020.
-
How the world's collective attention is being paid to a pandemic: COVID-19 related n-gram time series for 24 languages on Twitter
Authors:
T. Alshaabi,
J. R. Minot,
M. V. Arnold,
J. L. Adams,
D. R. Dewhurst,
A. J. Reagan,
R. Muhamad,
C. M. Danforth,
P. S. Dodds
Abstract:
In confronting the global spread of the coronavirus disease COVID-19 pandemic we must have coordinated medical, operational, and political responses. In all efforts, data is crucial. Fundamentally, and in the possible absence of a vaccine for 12 to 18 months, we need universal, well-documented testing for both the presence of the disease as well as confirmed recovery through serological tests for antibodies, and we need to track major socioeconomic indices. But we also need auxiliary data of all kinds, including data related to how populations are talking about the unfolding pandemic through news and stories. To in part help on the social media side, we curate a set of 2000 day-scale time series of 1- and 2-grams across 24 languages on Twitter that are most 'important' for April 2020 with respect to April 2019. We determine importance through our allotaxonometric instrument, rank-turbulence divergence. We make some basic observations about some of the time series, including a comparison to numbers of confirmed deaths due to COVID-19 over time. We broadly observe across all languages a peak for the language-specific word for 'virus' in January 2020 followed by a decline through February and then a surge through March and April. The world's collective attention dropped away while the virus spread out from China. We host the time series on Gitlab, updating them on a daily basis while relevant. Our main intent is for other researchers to use these time series to enhance whatever analyses that may be of use during the pandemic as well as for retrospective investigations.
Submitted 6 January, 2021; v1 submitted 27 March, 2020;
originally announced March 2020.
-
The growing amplification of social media: Measuring temporal and social contagion dynamics for over 150 languages on Twitter for 2009-2020
Authors:
Thayer Alshaabi,
David R. Dewhurst,
Joshua R. Minot,
Michael V. Arnold,
Jane L. Adams,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, and Portuguese being the most dominant. To quantify social spreading in each language over time, we compute the 'contagion ratio': The balance of retweets to organic messages. We find that for the most common languages on Twitter there is a growing tendency, though not universal, to retweet rather than share new content. By the end of 2019, the contagion ratios for half of the top 30 languages, including English and Spanish, had reached above 1 -- the naive contagion threshold. In 2019, the top 5 languages with the highest average daily ratios were, in order, Thai (7.3), Hindi, Tamil, Urdu, and Catalan, while the bottom 5 were Russian, Swedish, Esperanto, Cebuano, and Finnish (0.26). Further, we show that over time, the contagion ratios for most common languages are growing more strongly than those of rare languages.
Submitted 8 March, 2021; v1 submitted 7 March, 2020;
originally announced March 2020.
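The 'contagion ratio' described in the abstract above is the balance of retweets to organic messages, with 1 as the naive contagion threshold. A minimal sketch of that arithmetic follows; the function name and the daily counts are hypothetical and are not data from the paper.

```python
def contagion_ratio(retweets: int, organic: int) -> float:
    """Ratio of retweeted messages to organic (original) messages."""
    if organic <= 0:
        raise ValueError("organic message count must be positive")
    return retweets / organic


# Hypothetical daily message counts per language (illustrative only).
daily_counts = {
    "lang_a": {"retweets": 730, "organic": 100},
    "lang_b": {"retweets": 26, "organic": 100},
}

ratios = {
    lang: contagion_ratio(c["retweets"], c["organic"])
    for lang, c in daily_counts.items()
}

# A ratio above 1 means more retweeted than organic messages that day.
above_threshold = {lang: r > 1.0 for lang, r in ratios.items()}
```

Under this definition, a language with a ratio above 1 has more amplified content than new content on that day.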
-
Fame and Ultrafame: Measuring and comparing daily levels of `being talked about' for United States' presidents, their rivals, God, countries, and K-pop
Authors:
Peter Sheridan Dodds,
Joshua R. Minot,
Michael V. Arnold,
Thayer Alshaabi,
Jane Lydia Adams,
David Rushing Dewhurst,
Andrew J. Reagan,
Christopher M. Danforth
Abstract:
When building a global brand of any kind -- a political actor, clothing style, or belief system -- developing widespread awareness is a primary goal. Short of knowing any of the stories or products of a brand, being talked about in whatever fashion -- raw fame -- is, as Oscar Wilde would have it, better than not being talked about at all. Here, we measure, examine, and contrast the day-to-day raw fame dynamics on Twitter for US Presidents and major US Presidential candidates from 2008 to 2020: Barack Obama, John McCain, Mitt Romney, Hillary Clinton, Donald Trump, and Joe Biden. We assign ``lexical fame'' to be the number and (Zipfian) rank of the (lowercased) mentions made for each individual across all languages. We show that all five political figures have at some point reached extraordinary volume levels of what we define to be ``lexical ultrafame'': An overall rank of approximately 300 or less which is largely the realm of function words and demarcated by the highly stable rank of `god'. By this measure, `trump' has become enduringly ultrafamous, from the 2016 election on. We use typical ranks for country names and function words as standards to improve perception of scale. We quantify relative fame rates and find that in the eight weeks leading up to the 2008 and 2012 elections, `obama' held a 1000:757 volume ratio over `mccain' and 1000:892 over `romney', well short of the 1000:544 and 1000:504 volumes favoring `trump' over `hillary' and `biden' in the eight weeks leading up to the 2016 and 2020 elections. Finally, we track how only one other entity has more sustained ultrafame than `trump' on Twitter: The K-pop (Korean pop) band BTS. We chart the dramatic rise of BTS, finding their Twitter handle `@bts_twt' has been able to compete with `a' and `the'. Our findings for BTS more generally point to K-pop's growing economic, social, and political power.
Submitted 29 October, 2021; v1 submitted 30 September, 2019;
originally announced October 2019.
-
Exploring Perceptions of Veganism
Authors:
Laura Jennings,
Christopher M. Danforth,
Peter Sheridan Dodds,
Elizabeth Pinel,
Lizzy Pope
Abstract:
This project examined perceptions of the vegan lifestyle using surveys and social media to explore barriers to choosing veganism. A survey of 510 individuals indicated that non-vegans did not believe veganism was as healthy or as difficult as vegans did. In a second analysis, Instagram posts using #vegan suggest content is aimed primarily at the female vegan community. Finally, sentiment analysis of roughly 5 million Twitter posts mentioning 'vegan' found veganism to be portrayed in a more positive light compared to other topics. Results suggest non-vegans' lack of interest in veganism is driven by non-belief in the health benefits of the diet.
Submitted 29 July, 2019;
originally announced July 2019.
-
Hahahahaha, Duuuuude, Yeeessss!: A two-parameter characterization of stretchable words and the dynamics of mistypings and misspellings
Authors:
Tyler J. Gray,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Stretched words like `heellllp' or `heyyyyy' are a regular feature of spoken language, often used to emphasize or exaggerate the underlying meaning of the root word. While stretched words are rarely found in formal written language and dictionaries, they are prevalent within social media. In this paper, we examine the frequency distributions of `stretchable words' found in roughly 100 billion tweets authored over an 8 year period. We introduce two central parameters, `balance' and `stretch', that capture their main characteristics, and explore their dynamics by creating visual tools we call `balance plots' and `spelling trees'. We discuss how the tools and methods we develop here could be used to study the statistical patterns of mistypings and misspellings, along with the potential applications in augmenting dictionaries, improving language processing, and in any area where sequence construction matters, such as genetics.
Submitted 8 July, 2019;
originally announced July 2019.
-
The shocklet transform: A decomposition method for the identification of local, mechanism-driven dynamics in sociotechnical time series
Authors:
David Rushing Dewhurst,
Thayer Alshaabi,
Dilan Kiley,
Michael V. Arnold,
Joshua R. Minot,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
We introduce a qualitative, shape-based, timescale-independent time-domain transform used to extract local dynamics from sociotechnical time series---termed the Discrete Shocklet Transform (DST)---and an associated similarity search routine, the Shocklet Transform And Ranking (STAR) algorithm, that indicates time windows during which panels of time series display qualitatively-similar anomalous behavior. After distinguishing our algorithms from other methods used in anomaly detection and time series similarity search, such as the matrix profile, seasonal-hybrid ESD, and discrete wavelet transform-based procedures, we demonstrate the DST's ability to identify mechanism-driven dynamics at a wide range of timescales and its relative insensitivity to functional parameterization. As an application, we analyze a sociotechnical data source (usage frequencies for a subset of words on Twitter) and highlight our algorithms' utility by using them to extract both a typology of mechanistic local dynamics and a data-driven narrative of socially-important events as perceived by English-language Twitter.
Submitted 18 December, 2019; v1 submitted 27 June, 2019;
originally announced June 2019.
-
Visitors to urban greenspace have higher sentiment and lower negativity on Twitter
Authors:
Aaron J. Schwartz,
Peter Sheridan Dodds,
Jarlath P. M. O'Neil-Dunne,
Christopher M. Danforth,
Taylor H. Ricketts
Abstract:
With more people living in cities, we are witnessing a decline in exposure to nature. A growing body of research has demonstrated an association between nature contact and improved mood. Here, we used Twitter and the Hedonometer, a word analysis tool, to investigate how sentiment, or the estimated happiness of the words people write, varied before, during, and after visits to San Francisco's urban park system. We found that sentiment was substantially higher during park visits and remained elevated for several hours following the visit. Leveraging differences in vegetative cover across park types, we explored how different types of outdoor public spaces may contribute to subjective well-being. Tweets during visits to Regional Parks, which are greener and have greater vegetative cover, exhibited larger increases in sentiment than tweets during visits to Civic Plazas and Squares. Finally, we analyzed word frequencies to explore several mechanisms theorized to link nature exposure with mental and cognitive benefits. Negation words such as 'no', 'not', and 'don't' decreased in frequency during visits to urban parks. These results can be used by urban planners and public health officials to better target nature contact recommendations for growing urban populations.
Submitted 27 August, 2019; v1 submitted 20 July, 2018;
originally announced July 2018.
-
Social media usage patterns during natural hazards
Authors:
Meredith T. Niles,
Benjamin F. Emery,
Andrew J. Reagan,
Peter Sheridan Dodds,
Christopher M. Danforth
Abstract:
Natural hazards are becoming increasingly expensive as climate change and development are exposing communities to greater risks. Preparation and recovery are critical for climate change resilience, and social media are being used more and more to communicate before, during, and after disasters. While there is a growing body of research aimed at understanding how people use social media surrounding disaster events, most existing work has focused on a single disaster case study. In the present study, we analyze five of the costliest disasters in the last decade in the United States (Hurricanes Irene and Sandy, two sets of tornado outbreaks, and flooding in Louisiana) through the lens of Twitter. In particular, we explore the frequency of both generic and specific food-security related terms, and quantify the relationship between network size and Twitter activity during disasters. We find differences in tweet volume for keywords depending on disaster type, with people using Twitter more frequently in preparation for hurricanes, and for real-time or recovery information for tornado and flooding events. Further, we find that people share a host of general disaster and specific preparation and recovery terms during these events. Finally, we find that among all account types, individuals with "average" sized networks are most likely to share information during these disasters, and in most cases, do so more frequently than normal. This suggests that around disasters, an ideal form of social contagion is being engaged in which average people rather than outsized influentials are key to communication. These results provide important context for the type of disaster information and target audiences that may be most useful for disaster communication during varying extreme events.
Submitted 24 October, 2018; v1 submitted 19 June, 2018;
originally announced June 2018.
-
A Sentiment Analysis of Breast Cancer Treatment Experiences and Healthcare Perceptions Across Twitter
Authors:
Eric M. Clark,
Ted James,
Chris A. Jones,
Amulya Alapati,
Promise Ukandu,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Background: Social media has the capacity to afford the healthcare industry with valuable feedback from patients who reveal and express their medical decision-making process, as well as self-reported quality of life indicators both during and post treatment. In prior work, [Crannell et al.], we have studied an active cancer patient population on Twitter and compiled a set of tweets describing their experience with this disease. We refer to these online public testimonies as "Invisible Patient Reported Outcomes" (iPROs), because they carry relevant indicators, yet are difficult to capture by conventional means of self-report. Methods: Our present study aims to identify tweets related to the patient experience as an additional informative tool for monitoring public health. Using Twitter's public streaming API, we compiled over 5.3 million "breast cancer" related tweets spanning September 2016 until mid-December 2017. We combined supervised machine learning methods with natural language processing to sift tweets relevant to breast cancer patient experiences. We analyzed a sample of 845 breast cancer patient and survivor accounts, responsible for over 48,000 posts. We investigated tweet content with a hedonometric sentiment analysis to quantitatively extract emotionally charged topics. Results: We found that positive experiences were shared regarding patient treatment, raising support, and spreading awareness. Further discussions related to healthcare were prevalent and largely negative focusing on fear of political legislation that could result in loss of coverage. Conclusions: Social media can provide a positive outlet for patients to discuss their needs and concerns regarding their healthcare coverage and treatment needs. Capturing iPROs from online communication can help inform healthcare professionals and lead to more connected and personalized treatment regimens.
Submitted 12 October, 2018; v1 submitted 24 May, 2018;
originally announced May 2018.
-
English verb regularization in books and tweets
Authors:
Tyler J. Gray,
Andrew J. Reagan,
Peter Sheridan Dodds,
Christopher M. Danforth
Abstract:
The English language has evolved dramatically throughout its lifespan, to the extent that a modern speaker of Old English would be incomprehensible without translation. One concrete indicator of this process is the movement from irregular to regular (-ed) forms for the past tense of verbs. In this study we quantify the extent of verb regularization using two vastly disparate datasets: (1) Six years of published books scanned by Google (2003--2008), and (2) A decade of social media messages posted to Twitter (2008--2017). We find that the extent of verb regularization is greater on Twitter, taken as a whole, than in English Fiction books. Regularization is also greater for tweets geotagged in the United States relative to American English books, but the opposite is true for tweets geotagged in the United Kingdom relative to British English books. We also find interesting regional variations in regularization across counties in the United States. However, once differences in population are accounted for, we do not identify strong correlations with socio-demographic variables such as education or income.
Submitted 3 January, 2019; v1 submitted 26 March, 2018;
originally announced March 2018.
-
Measuring the happiness of large-scale written expression: Songs, Blogs, and Presidents
Authors:
Peter Sheridan Dodds,
Christopher M. Danforth
Abstract:
The importance of quantifying the nature and intensity of emotional states at the level of populations is evident: we would like to know how, when, and why individuals feel as they do if we wish, for example, to better construct public policy, build more successful organizations, and, from a scientific perspective, more fully understand economic and social phenomena. Here, by incorporating direct human assessment of words, we quantify happiness levels on a continuous scale for a diverse set of large-scale texts: song titles and lyrics, weblogs, and State of the Union addresses. Our method is transparent, improvable, capable of rapidly processing Web-scale texts, and moves beyond approaches based on coarse categorization. Among a number of observations, we find that the happiness of song lyrics trends downward from the 1960's to the mid 1990's while remaining stable within genres, and that the happiness of blogs has steadily increased from 2005 to 2009, exhibiting a striking rise and fall with blogger age and distance from the equator.
Submitted 6 March, 2017;
originally announced March 2017.
-
Which friends are more popular than you? Contact strength and the friendship paradox in social networks
Authors:
James P. Bagrow,
Christopher M. Danforth,
Lewis Mitchell
Abstract:
The friendship paradox states that in a social network, egos tend to have lower degree than their alters, or, "your friends have more friends than you do". Most research has focused on the friendship paradox and its implications for information transmission, but treating the network as static and unweighted. Yet, people can dedicate only a finite fraction of their attention budget to each social interaction: a high-degree individual may have less time to dedicate to individual social links, forcing them to modulate the quantities of contact made to their different social ties. Here we study the friendship paradox in the context of differing contact volumes between egos and alters, finding a connection between contact volume and the strength of the friendship paradox. The most frequently contacted alters exhibit a less pronounced friendship paradox compared with the ego, whereas less-frequently contacted alters are more likely to be high degree and give rise to the paradox. We argue therefore for a more nuanced version of the friendship paradox: "your closest friends have slightly more friends than you do", and in certain networks even: "your best friend has no more friends than you do". We demonstrate that this relationship is robust, holding in both a social media and a mobile phone dataset. These results have implications for information transfer and influence in social networks, which we explore using a simple dynamical model.
Submitted 18 March, 2017;
originally announced March 2017.
-
Forecasting the onset and course of mental illness with Twitter data
Authors:
Andrew G. Reece,
Andrew J. Reagan,
Katharina L. M. Lix,
Peter Sheridan Dodds,
Christopher M. Danforth,
Ellen J. Langer
Abstract:
We developed computational models to predict the emergence of depression and Post-Traumatic Stress Disorder in Twitter users. Twitter data and details of depression history were collected from 204 individuals (105 depressed, 99 healthy). We extracted predictive features measuring affect, linguistic style, and context from participant tweets (N=279,951) and built models using these features with supervised learning algorithms. Resulting models successfully discriminated between depressed and healthy content, and compared favorably to general practitioners' average success rates in diagnosing depression. Results held even when the analysis was restricted to content posted before first depression diagnosis. State-space temporal analysis suggests that onset of depression may be detectable from Twitter data several months prior to diagnosis. Predictive results were replicated with a separate sample of individuals diagnosed with PTSD (174 users, 243,775 tweets). A state-space time series model revealed indicators of PTSD almost immediately post-trauma, often many months prior to clinical diagnosis. These methods suggest a data-driven, predictive approach for early screening and detection of mental illness.
Submitted 27 August, 2016;
originally announced August 2016.
-
Instagram photos reveal predictive markers of depression
Authors:
Andrew G. Reece,
Christopher M. Danforth
Abstract:
Using Instagram data from 166 individuals, we applied machine learning tools to successfully identify markers of depression. Statistical features were computationally extracted from 43,950 participant Instagram photos, using color analysis, metadata components, and algorithmic face detection. Resulting models outperformed general practitioners' average diagnostic success rate for depression. These results held even when the analysis was restricted to posts made before depressed individuals were first diagnosed. Photos posted by depressed individuals were more likely to be bluer, grayer, and darker. Human ratings of photo attributes (happy, sad, etc.) were weaker predictors of depression, and were uncorrelated with computationally-generated features. These findings suggest new avenues for early screening and detection of mental illness.
Submitted 13 August, 2016; v1 submitted 10 August, 2016;
originally announced August 2016.
-
Public Opinion Polling with Twitter
Authors:
Emily M. Cody,
Andrew J. Reagan,
Peter Sheridan Dodds,
Christopher M. Danforth
Abstract:
Solicited public opinion surveys reach a limited subpopulation of willing participants and are expensive to conduct, leading to poor time resolution and a restricted pool of expert-chosen survey topics. In this study, we demonstrate that unsolicited public opinion polling through sentiment analysis applied to Twitter correlates well with a range of traditional measures, and has predictive power for issues of global importance. We also examine Twitter's potential to canvass topics seldom surveyed, including ideas, personal feelings, and perceptions of commercial enterprises. Two of our major observations are that appropriately filtered Twitter sentiment (1) predicts President Obama's job approval three months in advance, and (2) correlates well with surveyed consumer sentiment. To make possible a full examination of our work and to enable others' research, we make public over 10,000 data sets, each a seven-year series of daily word counts for tweets containing a frequently used search term.
Submitted 5 August, 2016;
originally announced August 2016.
-
The emotional arcs of stories are dominated by six basic shapes
Authors:
Andrew J. Reagan,
Lewis Mitchell,
Dilan Kiley,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Advances in computing power, natural language processing, and digitization of text now make it possible to study a culture's evolution through its texts using a "big data" lens. Our ability to communicate relies in part upon a shared emotional experience, with stories often following distinct emotional trajectories and forming patterns that are meaningful to us. Here, by classifying the emotional arcs for a filtered subset of 1,327 stories from Project Gutenberg's fiction collection, we find a set of six core emotional arcs which form the essential building blocks of complex emotional trajectories. We strengthen our findings by separately applying matrix decomposition, supervised learning, and unsupervised learning. For each of these six core emotional arcs, we examine the closest characteristic stories in publication today and find that particular emotional arcs enjoy greater success, as measured by downloads.
Submitted 25 September, 2016; v1 submitted 24 June, 2016;
originally announced June 2016.
-
Divergent discourse between protests and counter-protests: #BlackLivesMatter and #AllLivesMatter
Authors:
Ryan J. Gallagher,
Andrew J. Reagan,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Since the shooting of Black teenager Michael Brown by White police officer Darren Wilson in Ferguson, Missouri, the protest hashtag #BlackLivesMatter has amplified critiques of extrajudicial killings of Black Americans. In response to #BlackLivesMatter, other Twitter users have adopted #AllLivesMatter, a counter-protest hashtag whose content argues that equal attention should be given to all lives regardless of race. Through a multi-level analysis of over 860,000 tweets, we study how these protests and counter-protests diverge by quantifying aspects of their discourse. We find that #AllLivesMatter facilitates opposition between #BlackLivesMatter and hashtags such as #PoliceLivesMatter and #BlueLivesMatter in such a way that historically echoes the tension between Black protesters and law enforcement. In addition, we show that a significant portion of #AllLivesMatter use stems from hijacking by #BlackLivesMatter advocates. Beyond simply injecting #AllLivesMatter with #BlackLivesMatter content, these hijackers use the hashtag to directly confront the counter-protest notion of "All lives matter." Our findings suggest that the Black Lives Matter movement was able to grow, exhibit diverse conversations, and avoid derailment on social media by making discussion of counter-protest opinions a central topic of #AllLivesMatter, rather than the movement itself.
Submitted 19 May, 2017; v1 submitted 22 June, 2016;
originally announced June 2016.
-
Connecting every bit of knowledge: The structure of Wikipedia's First Link Network
Authors:
Mark Ibrahim,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Apples, porcupines, and the most obscure Bob Dylan song---is every topic a few clicks from Philosophy? Within Wikipedia, the surprising answer is yes: nearly all paths lead to Philosophy. Wikipedia is the largest, most meticulously indexed collection of human knowledge ever amassed. More than information about a topic, Wikipedia is a web of naturally emerging relationships. By following the first link in each article, we algorithmically construct a directed network of all 4.7 million articles: Wikipedia's First Link Network. Here, we study the English edition of Wikipedia's First Link Network for insight into how the many articles on inventions, places, people, objects, and events are related and organized.
By traversing every path, we measure the accumulation of first links, path lengths, groups of path-connected articles, and cycles. We also develop a new method, traversal funnels, to measure the influence each article exerts in shaping the network. Traversal funnels provide a new measure of influence for directed networks without spill-over into cycles, in contrast to traditional network centrality measures. Within Wikipedia's First Link Network, we find scale-free distributions describe path length, accumulation, and influence. Far from dispersed, first links disproportionately accumulate at a few articles---flowing from specific to general and culminating around fundamental notions such as Community, State, and Science. Philosophy directs more paths than any other article by two orders of magnitude. We also observe a gravitation towards topical articles such as Health Care and Fossil Fuel. These findings enrich our view of the connections and structure of Wikipedia's ever growing store of knowledge.
Submitted 6 December, 2016; v1 submitted 1 May, 2016;
originally announced May 2016.
-
What we write about when we write about causality: Features of causal statements across large-scale social discourse
Authors:
Thomas C. McAndrew,
Joshua C. Bongard,
Christopher M. Danforth,
Peter S. Dodds,
Paul D. H. Hines,
James P. Bagrow
Abstract:
Identifying and communicating relationships between causes and effects is important for understanding our world, but is affected by language structure, cognitive and emotional biases, and the properties of the communication medium. Despite the increasing importance of social media, much remains unknown about causal statements made online. To study real-world causal attribution, we extract a large-scale corpus of causal statements made on the Twitter social network platform as well as a comparable random control corpus. We compare causal and control statements using statistical language and sentiment analysis tools. We find that causal statements have a number of significant lexical and grammatical differences compared with controls and tend to be more negative in sentiment than controls. Causal statements made online tend to focus on news and current events, medicine and health, or interpersonal relationships, as shown by topic models. By quantifying the features and potential biases of causality communication, this study improves our understanding of the accuracy of information and opinions found online.
Submitted 21 April, 2016; v1 submitted 19 April, 2016;
originally announced April 2016.
-
Zipf's law is a consequence of coherent language production
Authors:
Jake Ryland Williams,
James P. Bagrow,
Andrew J. Reagan,
Sharon E. Alajajian,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
The task of text segmentation may be undertaken at many levels in text analysis---paragraphs, sentences, words, or even letters. Here, we focus on a relatively fine scale of segmentation, hypothesizing it to be in accord with a stochastic model of language generation, as the smallest scale where independent units of meaning are produced. Our goals in this letter include the development of methods for the segmentation of these minimal independent units, which produce feature-representations of texts that align with the independence assumption of the bag-of-terms model, commonly used for prediction and classification in computational text analysis. We also propose the measurement of texts' association (with respect to realized segmentations) to the model of language generation. We find (1) that our segmentations of phrases exhibit much better associations to the generation model than words, and (2) that texts which are well fit are generally topically homogeneous. Because our generative model produces Zipf's law, our study further suggests that Zipf's law may be a consequence of homogeneity in language production.
Submitted 5 August, 2016; v1 submitted 28 January, 2016;
originally announced January 2016.
-
Benchmarking sentiment analysis methods for large-scale texts: A case for using continuum-scored words and word shift graphs
Authors:
Andrew J. Reagan,
Brian Tivnan,
Jake Ryland Williams,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
The emergence and global adoption of social media has rendered possible the real-time estimation of population-scale sentiment, bearing profound implications for our understanding of human behavior. Given the growing assortment of sentiment measuring instruments, comparisons between them are evidently required. Here, we perform detailed tests of 6 dictionary-based methods applied to 4 different corpora, and briefly examine a further 20 methods. We show that a dictionary-based method will only perform both reliably and meaningfully if (1) the dictionary covers a sufficiently large portion of a given text's lexicon when weighted by word usage frequency; and (2) words are scored on a continuous scale.
Submitted 7 September, 2016; v1 submitted 1 December, 2015;
originally announced December 2015.
-
Nonlinear functional mapping of the human brain
Authors:
Nicholas Allgaier,
Tobias Banaschewski,
Gareth Barker,
Arun L. W. Bokde,
Josh C. Bongard,
Uli Bromberg,
Christian Büchel,
Anna Cattrell,
Patricia J. Conrod,
Christopher M. Danforth,
Sylvane Desrivières,
Peter S. Dodds,
Herta Flor,
Vincent Frouin,
Jürgen Gallinat,
Penny Gowland,
Andreas Heinz,
Bernd Ittermann,
Scott Mackey,
Jean-Luc Martinot,
Kevin Murphy,
Frauke Nees,
Dimitri Papadopoulos-Orfanos,
Luise Poustka,
Michael N. Smolka
, et al. (5 additional authors not shown)
Abstract:
The field of neuroimaging has truly become data rich, and novel analytical methods capable of gleaning meaningful information from large stores of imaging data are in high demand. Those methods that might also be applicable on the level of individual subjects, and thus potentially useful clinically, are of special interest. In the present study, we introduce just such a method, called nonlinear functional mapping (NFM), and demonstrate its application in the analysis of resting state fMRI from a 242-subject subset of the IMAGEN project, a European study of adolescents that includes longitudinal phenotypic, behavioral, genetic, and neuroimaging data. NFM employs a computational technique inspired by biological evolution to discover and mathematically characterize interactions among ROI (regions of interest), without making linear or univariate assumptions. We show that statistics of the resulting interaction relationships comport with recent independent work, constituting a preliminary cross-validation. Furthermore, nonlinear terms are ubiquitous in the models generated by NFM, suggesting that some of the interactions characterized here are not discoverable by standard linear methods of analysis. We discuss one such nonlinear interaction in the context of a direct comparison with a procedure involving pairwise correlation, designed to be an analogous linear version of functional mapping. We find another such interaction that suggests a novel distinction in brain function between drinking and non-drinking adolescents: a tighter coupling of ROI associated with emotion, reward, and interoceptive processes such as thirst, among drinkers. Finally, we outline many improvements and extensions of the methodology to reduce computational expense, complement other analytical tools like graph-theoretic analysis, and allow for voxel level NFM to eliminate the necessity of ROI selection.
Submitted 8 September, 2015;
originally announced October 2015.
-
Vaporous Marketing: Uncovering Pervasive Electronic Cigarette Advertisements on Twitter
Authors:
Eric M. Clark,
Chris A. Jones,
Jake Ryland Williams,
Allison N. Kurti,
Michell Craig Nortotsky,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Background: Twitter has become the "wild-west" of marketing and promotional strategies for advertisement agencies. Electronic cigarettes have been heavily marketed across Twitter feeds, offering discounts, "kid-friendly" flavors, algorithmically generated false testimonials, and free samples. Methods: All electronic cigarette keyword-related tweets from a 10% sample of Twitter spanning January 2012 through December 2014 (approximately 850,000 total tweets) were identified and categorized as Automated or Organic by combining a keyword classification and a machine-trained Human Detection algorithm. A sentiment analysis using Hedonometrics was performed on Organic tweets to quantify the change in consumer sentiments over time. Commercialized tweets were topically categorized with key phrasal pattern matching. Results: The overwhelming majority (80%) of tweets were classified as automated or promotional in nature. The majority of these tweets were coded as commercialized (83.65% in 2013), up to 33% of which offered discounts or free samples and appeared on over a billion Twitter feeds as impressions. The positivity of Organic (human) classified tweets has decreased over time (5.84 in 2013 to 5.77 in 2014) due to a relative increase in the negative words ban, tobacco, doesn't, drug, against, poison, tax and a relative decrease in positive words like haha, good, cool. Automated tweets are more positive than Organic tweets (6.17 versus 5.84) due to a relative increase in the marketing words best, win, buy, sale, health, discount and a relative decrease in negative words like bad, hate, stupid, don't. Conclusions: Due to the youth presence on Twitter and the clinical uncertainty of the long-term health complications of electronic cigarette consumption, the protection of public health warrants scrutiny and potential regulation of social media marketing.
Submitted 5 March, 2016; v1 submitted 7 August, 2015;
originally announced August 2015.