1 Introduction
Online communities possess a unique capacity for rapid, large-scale growth [18, 19]. This stems both from the low barrier to entry for participation and from members' ability to participate anywhere at any time. Often, this is a strength. Communities devoted to niche topics can quickly establish themselves and attract enough members to maintain a vibrant culture [19]. Still, this capacity for growth comes with a cost. Online communities can easily expand beyond their initial audiences [16, 32]. In such cases, they might deviate from their original purposes or lose topical focus. Qualitative work suggests that community moderators and members are sensitive to this phenomenon [16, 18, 29, 32]. In interviews with moderators, Seering et al. [32] found that the perception that a community's population was undergoing a “sudden diversification” often prompted moderators to introduce new community rules to keep the community focused. At such moments, the question of whose interests a community serves becomes particularly salient. Communities face a choice: do they accommodate the interests of new members, or do they take measures to preserve the existing community identity?
Despite these qualitative accounts, the extent to which new communities deal with disruptive growth is unclear. A few factors work to prevent such disruptions. First, newcomers choose which communities to participate in [19]. Given the range of open communities available online, potentially disruptive newcomers may end up steered toward communities that are a better fit. Further, prior work indicates that users can effectively learn certain community norms before participating for the first time [27]. Thus, diverse newcomers may be able to pick up on existing community norms and blend in seamlessly.
While prior work has looked at how exogenous population shocks can impact online communities [18, 22], to our knowledge, no work has studied organic growth in a generalizable sample of newborn communities. Filling this gap in the literature is an important step toward building a holistic understanding of online community trajectories. Further, since prior studies focus on single communities [10, 18] or small groups of communities [22], a more generalizable sample enables community designers to reason about the extent to which discovered trends may apply to their own groups.
As such, we conduct a longitudinal analysis of the first two years of growth in 1,620 Reddit communities.
We track two particular attributes of community identity over this period: the distinctiveness of a community's language use [39] and the diversity of a community's user base. The former measure is inspired by a rich history of research tying language use to community identity [3, 6, 10, 12, 34, 39], while the latter is motivated by qualitative accounts of the factors that lead to changes in community norms [32]. Crucially, we believe these measures capture the underlying tension communities may face between accommodating users with a wide range of backgrounds and interests and maintaining a unique, shared identity.
We leverage a dataset of approximately 300 million comments and 12 million unique authors to answer the following research questions:
RQ1: To what extent does the linguistic behavior of a community become less distinctive as the community grows? How much does this vary between communities?
RQ2: To what extent do moderated communities see a stronger or weaker association between growth and distinctiveness?
RQ3: To what extent do diversifying communities see a stronger or weaker association between growth and distinctiveness?
Answering these questions is beneficial to both community moderators and users. Fundamentally, answering RQ1 can clarify the trade-offs associated with growth, an attribute which is often treated as a measure of success for online communities [2, 5, 9, 17].
RQ2 and RQ3 help tease apart why some communities may be more strongly impacted by growth than others. Specifically, answering RQ2 provides preliminary evidence for whether or not existing moderation interventions mitigate loss of community identity, a perceived harm of growth [16, 32]. Meanwhile, answering RQ3 can help clarify the kinds of growth that most strongly impact communities.

Using a hierarchical modeling approach, we find that, on average, subreddits experience a small drop in linguistic distinctiveness during periods of growth, providing some evidence for a trade-off between distinctiveness and growth. One potential explanation for this phenomenon is that newcomers introduce more generic language into the community during growth periods. We find some evidence for this: on average, newcomers use slightly less community-specific language. However, this difference is likely too small to fully explain the observed distinctiveness drops. Surprisingly, we find that neither moderation nor community diversification is significantly associated with changes to distinctiveness during growth periods. Taken together, our results both support and complicate hypotheses derived from qualitative work [16, 18, 22, 32]. Although a large body of qualitative work demonstrates how moderators can steer an online community's culture, we believe our findings highlight the need for more nuanced quantitative work to complement the existing literature.
2 Related Work
In this paper, we study the relationships between several attributes of online communities: growth, moderation, diversity of contributors, and distinctiveness of linguistic behavior. Prior work has provided preliminary evidence that these variables are related, though no studies have evaluated their relationships at the scale of the present paper.
2.1 Norms and Growth in Online Communities
The adoption of distinct norms has long been a feature of online communities. Burnett and Bonnici [4] surface this in an early qualitative study of Usenet newsgroups. They distinguish between explicit norms (e.g., rules and community FAQs) and implicit norms (uncodified community habits). More recent studies raise the possibility that newcomers may actually shift norms in the communities they join rather than merely adapting to the existing norms [7, 10, 11, 16, 22, 31]. Danescu-Niculescu-Mizil et al. [10] provide evidence for this, finding that newcomers within an online community tend to use trending vocabulary. Similarly, Dev et al. [11] find substantive differences between newcomers and long-time users in StackExchange Q&A groups. Despite preliminary evidence that newcomers act differently, evidence that they drive changes in norms is limited. Lin et al. [22] examine a small set of Reddit communities that received a massive influx of newcomers, finding no evidence for changes in language use. On the other hand, Chan et al. [7] find that influxes of newcomers in subreddits lead to more comment removals by moderators, suggesting that newcomers may be more likely to violate explicit community norms.
2.2 Moderation and Norms
On Reddit, volunteer moderators are often responsible for setting the explicit norms within their communities [31]. A substantial amount of research has explored the kinds of norms that moderators choose to enforce [8, 13, 30, 32]. Within a platform, it is common to see a wide range of explicit norms across communities [8, 14]. Grimmelmann [14] argues that this variation in norms helps to promote diverse content.
Mechanisms for enforcing explicit norms vary. Penalties, such as bans and content removals, are most common [21], though example-setting [31] and mediation [30] are also utilized. The latter two strategies may also help to enforce both implicit and explicit community norms.
2.3 Moderation and Growth
Moderators play an important role in managing activity as communities grow. Seering et al. [32] find that moderators may deal with a greater frequency of norm violations as communities outgrow an initially homogeneous user group. Qualitative work highlights several attributes of communities that contribute to effective newcomer management, including the technological and leadership capabilities of moderator teams and a shared sense of responsibility among community members [18]. In an empirical analysis of several subreddits that experienced massive, exogenous growth, Lin et al. [22] find that moderator responsiveness was associated with positive perceptions of content quality among community members. However, prior work has shown that moderation can lead to churn in community membership, suggesting that over-moderation may negatively affect growth [15, 33].
While we motivate our work through qualitative accounts of user experiences in online communities, we emphasize that we do not make explicit value judgments about any of our study measures. Rather, our measures were chosen based on qualitative accounts of community evolution. We contribute to existing literature by analyzing the relationship between all four of our measures in the context of a large-scale, generalizable dataset.
3 Data Collection
In this section, we discuss how we selected the 1,620 subreddits we studied and the data we collected to characterize their evolution over time.
3.1 Subreddits
Following Mensah et al.'s [26] study of growing subreddits, we chose to collect data from all subreddits created between March 1, 2018 and December 31, 2019 that grew to 10,000 subscribers within two years of creation. This ensures that subreddits in our pool experienced a non-trivial amount of growth and allows us to track subreddits from their inception.
We used a three-step process to identify these subreddits. First, we used the public Pushshift comment archives to identify all subreddits containing at least one comment in 2018 or 2019. Then, we filtered out subreddits created before 2018 using the Reddit API. Finally, we used Pushshift to identify the largest recorded subscriber count for each subreddit every month. We filtered out any subreddits that never reached 10,000 subscribers.
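To illustrate the third step, a sketch of the subscriber-count filter is shown below. It assumes locally downloaded Pushshift monthly comment archives as newline-delimited JSON and relies on the `subreddit_subscribers` field recorded in Pushshift comment objects; the exact field names and file handling are assumptions.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

def max_subscribers_by_month(lines):
    """lines: iterable of JSON strings from a Pushshift comment archive.
    Returns {(subreddit, 'YYYY-MM'): largest recorded subscriber count}."""
    peak = defaultdict(int)
    for line in lines:
        c = json.loads(line)
        subs = c.get("subreddit_subscribers")  # field name as in Pushshift dumps
        if subs is None:
            continue
        month = datetime.fromtimestamp(
            c["created_utc"], tz=timezone.utc
        ).strftime("%Y-%m")
        key = (c["subreddit"], month)
        peak[key] = max(peak[key], subs)
    return peak

def ever_reached(peak, threshold=10_000):
    """Subreddits whose recorded subscriber count ever reached the threshold."""
    return {sub for (sub, _), n in peak.items() if n >= threshold}
```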
3.2 Comments
For each subreddit in our sample, we use Pushshift to collect all comments from the subreddit's first two years of existence. For each comment, we extract the source subreddit, author name, body text, creation time, and removal status. Because comments can be removed by moderators or deleted by their authors after being indexed by Pushshift, we use the Reddit API to update each comment's final status. Following prior work [35], we apply a few simple filters to drop suspected bot accounts from our dataset. These filters, and an evaluation of their efficacy, are provided in Section B.
3.3 Additional Filtering
We refine our dataset to ensure that each month contains enough data to support reliable content- and user-level analyses. For each subreddit, we only analyze months with at least 50 distinct users and 50 comments. Because we use an auto-regressive model in some of our analyses, we drop subreddits that do not contain at least two consecutive months meeting these criteria. We also filter out “not safe for work” (e.g., pornographic) subreddits and non-English subreddits; our approach for identifying non-English subreddits is included in the supplement. The final dataset consists of 1,620 subreddits, 12 million unique authors, 291 million active comments, 10 million removed comments, and 17 million deleted comments.
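As an illustration, a minimal pandas sketch of these activity criteria (column names hypothetical) could look like:

```python
import pandas as pd

MIN_USERS, MIN_COMMENTS = 50, 50

def eligible_months(comments: pd.DataFrame) -> pd.DataFrame:
    """Subreddit-months with at least 50 distinct users and 50 comments.
    Expects columns: subreddit, author, month (names hypothetical)."""
    stats = comments.groupby(["subreddit", "month"]).agg(
        n_users=("author", "nunique"), n_comments=("author", "size")
    )
    return stats[(stats.n_users >= MIN_USERS) & (stats.n_comments >= MIN_COMMENTS)]

def has_consecutive_months(months) -> bool:
    """True if any two eligible months are adjacent, as required by the
    auto-regressive model (month is an integer index here)."""
    m = sorted(months)
    return any(b - a == 1 for a, b in zip(m, m[1:]))
```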
4 Data Analysis
4.1 Subreddit Measures
For each subreddit i in each month t, we compute four community-level measures across a two-year observation period. To respect users’ privacy, we exclude comments deleted by their authors from all analysis.
• Subscribers (\(s_{i,t}\)): The number of users subscribed to subreddit i in month t.
• Removal rate (\(r_{i,t}\)): The proportion of comments in month t of subreddit i that were removed by subreddit i's moderators.
• Distinctiveness (\(d_{i,t}\)): The linguistic uniqueness of commenting behavior in month t of subreddit i compared to a random sample of comments from elsewhere on Reddit.
• Diversity (\(v_{i,t}\)): The extent to which community members of subreddit i in month t vary in their participation across subreddits.
While the subscribers and removal rate measures are relatively straightforward to compute, our distinctiveness and diversity measures are more complex, relying on neural embeddings trained to capture similarity between Reddit comments and users, respectively [25, 36, 38].
4.2 Distinctiveness
To compute the distinctiveness score for a subreddit i in month t, we first generate an embedding for each comment posted to i in month t. If a month contains more than 1,000 comments, we randomly sample 1,000 comments to reduce computational cost. We compute a month-specific embedding for the subreddit by averaging these comment embeddings. To provide a baseline for Reddit-wide linguistic behavior in month t, we compute the average embedding of 10,000 comments randomly sampled from across all active non-NSFW, English-language subreddits in that month (potentially including subreddits in our sample of 1,620). Intuitively, subreddits are considered distinctive if their linguistic behavior differs substantially from this Reddit-wide baseline. Thus, the distinctiveness score of subreddit i in month t is one minus the cosine similarity between the subreddit-specific vector and the Reddit-wide baseline vector.
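A minimal sketch of this computation, assuming a fine-tuned SentenceTransformer model (the model path is hypothetical):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("path/to/fine-tuned-sbert")  # hypothetical path

def distinctiveness(subreddit_comments: list[str],
                    baseline_comments: list[str],
                    seed: int = 0) -> float:
    """1 - cosine similarity between the subreddit's mean comment embedding
    and the Reddit-wide baseline embedding for the same month."""
    rng = np.random.default_rng(seed)
    if len(subreddit_comments) > 1000:
        subreddit_comments = list(rng.choice(subreddit_comments, 1000, replace=False))
    sub_vec = model.encode(subreddit_comments).mean(axis=0)
    base_vec = model.encode(baseline_comments).mean(axis=0)  # 10,000 sampled comments
    cos = sub_vec @ base_vec / (np.linalg.norm(sub_vec) * np.linalg.norm(base_vec))
    return float(1.0 - cos)
```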
We use a fine-tuned S-BERT model [28] to compute the comment embeddings. To fine-tune our model, we use the generalized end-to-end (GE2E) loss function [38]. Intuitively, this loss function takes an input set of comments, grouped by subreddit, and rewards comments for (i) being close to their subreddit's centroid and (ii) being far from other subreddits' centroids. As such, subreddits whose centroids are closer together are harder to distinguish from one another in terms of linguistic behavior. This makes the loss function a good fit for our purposes, since subreddits with low distinctiveness scores will be harder to distinguish from the baseline Reddit-wide sample. All fine-tuning was conducted on a separate dataset of held-out subreddits (refer to Section D).
Formally, GE2E loss takes a batch of N × M comments, where N is the number of subreddits and M is the number of comments per subreddit. It then computes a centroid \(\vec{v}_{s}\) for each subreddit s, as well as a similarity matrix S containing the cosine similarities between each comment embedding \(\vec{y}_{ji}\) (the i-th comment from subreddit j) and each centroid. Following the suggestion in [38], we used the modified centroid \(\vec{v}_{s}^{(-i)} = \frac{1}{M-1} \sum_{m \ne i}^{M} \vec{y}_{jm}\) when computing the similarity between \(\vec{v}_{s}\) and \(\vec{y}_{ji}\) when j = s. We scaled all similarities by learned parameters w and b, resulting in the matrix:

\[ S_{ji,s} = \begin{cases} w \cdot \cos\!\left(\vec{y}_{ji},\, \vec{v}_{j}^{(-i)}\right) + b & \text{if } s = j \\ w \cdot \cos\!\left(\vec{y}_{ji},\, \vec{v}_{s}\right) + b & \text{otherwise.} \end{cases} \]

The final loss function is the softmax variant of GE2E, which pushes each comment toward its own subreddit's centroid and away from all others:

\[ L(S) = \sum_{j=1}^{N} \sum_{i=1}^{M} \left( -S_{ji,j} + \log \sum_{s=1}^{N} \exp\left(S_{ji,s}\right) \right) \]
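For concreteness, a minimal PyTorch sketch of this loss (the softmax variant, written from the formulation above; not the authors' released implementation) might look as follows:

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(emb: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Softmax variant of GE2E loss.
    emb: (N, M, D) tensor of comment embeddings, N subreddits x M comments.
    w, b: learned scalar parameters."""
    N, M, D = emb.shape
    centroids = emb.mean(dim=1)                                # (N, D)
    # Leave-one-out centroid of each comment's own subreddit
    loo = (emb.sum(dim=1, keepdim=True) - emb) / (M - 1)       # (N, M, D)
    # Cosine similarity of every comment to every subreddit centroid
    sim = F.cosine_similarity(
        emb.reshape(N * M, 1, D), centroids.reshape(1, N, D), dim=2
    ).reshape(N, M, N)
    # At a comment's own subreddit, use the leave-one-out centroid instead
    mask = torch.eye(N, dtype=torch.bool).unsqueeze(1).expand(N, M, N)
    own = F.cosine_similarity(emb, loo, dim=2).unsqueeze(2).expand(N, M, N)
    sim = torch.where(mask, own, sim)
    S = w * sim + b                                            # scaled similarity matrix
    # Cross-entropy rewards similarity to the own centroid, penalizes others
    labels = torch.arange(N).repeat_interleave(M)
    return F.cross_entropy(S.reshape(N * M, N), labels)
```

In the original formulation [38], w and b are initialized to 10 and −5, respectively, and w is constrained to remain positive during training.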
For qualitative evaluation, we present the most and least distinctive subreddits in December 2021 in Tables 1a and 1c. The most distinctive subreddits tended to be goal-oriented, focusing on technical topics like legal advice, technology, or health and wellness. In contrast, comments in the least distinctive subreddits tended to be simple responses to humorous content. Notably, these subreddits tended to lack clear community-wide goals around commenting behavior.
4.3 Diversity
We quantify the diversity, \(v_{i,t}\), of subreddit i in month t based on a set of embeddings representing users. To generate user embeddings, we use a modified, context-based word2vec procedure [20] first proposed by Waller and Anderson [36]. Intuitively, this procedure creates embeddings for subreddits so that subreddits (i.e., “contexts”) sharing many participants (i.e., “words”) will have similar embeddings. We use a skip-gram model and negative sampling to train a set of embeddings for each month t. The training set consists of all user-subreddit co-occurrences across Reddit in month t, not just those in our sample of 1,620 subreddits. This improves the robustness of the learned subreddit representations.

In each month, we represent a user as a weighted average of the embeddings of all subreddits that appear in their commenting history, weighted by the number of contributions the user made to each subreddit. For each subreddit in our pool, we compute the centroid of the embeddings of participating users in a given month. The diversity score of a subreddit is one minus the weighted average of the cosine similarities between each participant and the computed centroid. If there are more than 1,000 active users in a month, we randomly select 1,000 users with which to compute the score. Intuitively, if a subreddit consists of users who mainly participate in the same subreddits, the average cosine similarity between user embeddings and the subreddit centroid will be large, making the diversity score small. To train the embeddings in each month, we use the same hyperparameters as Waller and Anderson [36].
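A sketch of the diversity computation, assuming per-month subreddit embeddings are available as a dictionary (names hypothetical; the final averaging is shown unweighted, whereas the paper uses a weighted average):

```python
import numpy as np

def user_vector(history: dict[str, int], sub_embs: dict[str, np.ndarray]) -> np.ndarray:
    """Average of subreddit embeddings from a user's monthly commenting
    history, weighted by the user's comment count in each subreddit."""
    subs = [s for s in history if s in sub_embs]
    w = np.array([history[s] for s in subs], dtype=float)
    V = np.stack([sub_embs[s] for s in subs])
    return (w[:, None] * V).sum(axis=0) / w.sum()

def diversity(histories: list[dict[str, int]],
              sub_embs: dict[str, np.ndarray],
              seed: int = 0) -> float:
    """1 - mean cosine similarity between user vectors and their centroid."""
    rng = np.random.default_rng(seed)
    if len(histories) > 1000:
        histories = list(rng.choice(np.array(histories, dtype=object), 1000, replace=False))
    U = np.stack([user_vector(h, sub_embs) for h in histories])
    centroid = U.mean(axis=0)
    cos = (U @ centroid) / (np.linalg.norm(U, axis=1) * np.linalg.norm(centroid))
    return float(1.0 - cos.mean())
```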
Tables 1b and 1d contain the most and least diverse subreddits in our sample in December 2021. The least diverse subreddits featured insular commenters who rarely commented in other subreddits. In contrast, the most diverse subreddits were associated with more mainstream, general-interest topics. A full list of subreddits sorted by distinctiveness and diversity scores in December 2021 is included in the supplementary materials.
4.4 Bayesian Linear Regression Analysis
We used a Bayesian auto-regressive linear regression model to analyze the month-to-month changes in distinctiveness of each subreddit in our sample. We model each month-to-month change as a function of the growth in subscribers that a subreddit experienced from month t − 1 to month t. We measure growth as the difference in log-scaled subscriber counts between consecutive months (i.e., \(g_{i,t} = \log(s_{i,t}) - \log(s_{i,t-1})\)), allowing us to estimate the effect of growth on changes to community distinctiveness, answering RQ1.
RQ2 and RQ3 focus on understanding the extent to which the growth-distinctiveness association is moderated by other variables. Thus, we include an interaction term, γ, between growth and removal rate (RQ2) and an interaction term, η, between growth and diversification (RQ3). Although they do not directly answer our research questions, we also include terms ψ and ρ to model the main effects of removal rate and diversification, respectively.
To assess the variation in trends across subreddits, we model the growth coefficients (\(\alpha_i\)) hierarchically. In other words, each subreddit has its own coefficient governing the association between growth and change in distinctiveness. We model growth coefficients as being drawn from a Normal distribution with mean \(\mu_{\alpha}\) and standard deviation \(\sigma_{\alpha}\). This allows us to reason about the “typical” association between growth and change in distinctiveness and how this association varies between communities. We also include varying subreddit-specific intercepts \(\beta_i\) to account for the first month of data, where no distinctiveness change can be observed. Month-specific varying intercepts (\(\theta_T\)) account for real-world events, such as the start of the COVID-19 pandemic, that may have shifted Reddit-wide linguistic behavior. We use T(i, t) to refer to the absolute month corresponding to the t-th month of data for subreddit i.
Because distinctiveness scores range between 0 and 1, we use a Beta likelihood with a logit link to ensure that the model predictions \(d_{i,t}\) also range between 0 and 1. This yields the following final model, where \(g_{i,t}\) denotes growth, \(r_{i,t}\) the removal rate, \(\Delta v_{i,t}\) the change in diversity, and \(\phi\) a concentration parameter:

\[ d_{i,t} \sim \operatorname{Beta}\big(\mu_{i,t}\,\phi,\; (1 - \mu_{i,t})\,\phi\big) \]

\[ \operatorname{logit}(\mu_{i,t}) = \beta_i + \theta_{T(i,t)} + \operatorname{logit}(d_{i,t-1}) + \alpha_i\, g_{i,t} + \psi\, r_{i,t} + \rho\, \Delta v_{i,t} + \gamma\, g_{i,t}\, r_{i,t} + \eta\, g_{i,t}\, \Delta v_{i,t} \]

The \(\operatorname{logit}(d_{i,t-1})\) offset encodes the auto-regressive dependence on the previous month's distinctiveness.
We use a multivariate-normal prior for the subreddit-specific slope and intercept terms, following McElreath [24]. Prior distributions for all model parameters are included in Section E.
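To make the model structure concrete, the following PyMC sketch encodes the same likelihood and linear predictor on synthetic stand-in data. All variable names are hypothetical, the priors shown are simple placeholders rather than those in Section E, and the subreddit-specific slopes and intercepts are given independent priors here rather than the multivariate-normal prior used in the paper:

```python
import numpy as np
import pymc as pm

# Toy stand-ins for the real data: sub/mon are integer indices, g is growth,
# r is removal rate, dv is diversity change, d_prev and d are last month's
# and this month's distinctiveness scores.
rng = np.random.default_rng(0)
n_subs, n_months, n_obs = 50, 24, 500
sub = rng.integers(0, n_subs, n_obs)
mon = rng.integers(0, n_months, n_obs)
g = rng.normal(0.1, 0.2, n_obs)
r = rng.uniform(0.0, 0.05, n_obs)
dv = rng.normal(0.0, 0.02, n_obs)
d_prev = rng.uniform(0.2, 0.8, n_obs)
d = np.clip(d_prev + rng.normal(0.0, 0.02, n_obs), 0.01, 0.99)

with pm.Model() as model:
    # Hierarchical growth slopes: one alpha per subreddit
    mu_alpha = pm.Normal("mu_alpha", 0.0, 1.0)
    sigma_alpha = pm.HalfNormal("sigma_alpha", 1.0)
    alpha = pm.Normal("alpha", mu_alpha, sigma_alpha, shape=n_subs)
    beta = pm.Normal("beta", 0.0, 1.0, shape=n_subs)      # subreddit intercepts
    theta = pm.Normal("theta", 0.0, 1.0, shape=n_months)  # month intercepts
    psi = pm.Normal("psi", 0.0, 1.0)      # removal-rate main effect
    rho = pm.Normal("rho", 0.0, 1.0)      # diversification main effect
    gamma = pm.Normal("gamma", 0.0, 1.0)  # growth x removal-rate interaction
    eta = pm.Normal("eta", 0.0, 1.0)      # growth x diversification interaction
    phi = pm.HalfNormal("phi", 100.0)     # Beta concentration

    logit_mu = (beta[sub] + theta[mon] + pm.math.logit(d_prev)
                + alpha[sub] * g + psi * r + rho * dv
                + gamma * g * r + eta * g * dv)
    mu = pm.math.invlogit(logit_mu)
    pm.Beta("d_obs", alpha=mu * phi, beta=(1.0 - mu) * phi, observed=d)
    idata = pm.sample()
```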
5 Results
Because our distinctiveness and diversity measures are novel, we use Sections 5.1 and 5.2 to present summary statistics that help build intuition. We then answer our primary RQs using the hierarchical linear model in Section 5.3. Finally, in Section 5.4, we explore mechanistic explanations for the regression results by comparing newcomer and returning-user commenting behaviors.
5.1 Patterns in Community Growth, Distinctiveness, and Diversity
We first characterize changes in diversity and distinctiveness over the study period. We compare the distributions of subscriber counts, diversity scores, and distinctiveness scores in each subreddit's first and last available months of data. On average, the first available month was between months four and five (M = 4.66, SD = 5.57); many subreddits had little-to-no activity in their first few months. The last available month was between months 22 and 23 on average (M = 22.1, SD = 4.58).
Figure 2 illustrates the distribution comparisons. 60.5% of subreddits saw a decrease in distinctiveness (\(M_{\delta} = -.019\), \(SD_{\delta} = .088\)), while 66.9% saw an increase in diversity (\(M_{\delta} = .019\), \(SD_{\delta} = .061\)). 97.1% of subreddits saw an increase in subscribers (M = 54,040, SD = 176,517), providing an extra level of validation that our inclusion criterion successfully identified growing subreddits.
5.2 Correlations between Measures
We now look at the covariation of measures at a fixed point in time. We calculate Pearson’s r between each pairing of community size, distinctiveness, and diversity in the last month of data for each subreddit. We find a slight positive correlation between size and diversity (r = .0764, p < .05, CI\(_{95\%}\) [ − .0285, .121]), a slight negative correlation between size and distinctiveness (r = −.168, p < .05, CI\(_{95\%}\) [ − .212, − .118]), and a slight negative correlation between diversity and distinctiveness (r = −.118, p < .05, CI\(_{95\%}\) [ − .161, − .0747]).
Qualitatively, we find many subreddits that exemplify these observed trends. For instance, high-distinctiveness/low-diversity subreddits tend to be knowledge sharing groups for niche hobbies like r/IndieMusicFeedback and r/Pathfinder2e (a table-top roleplaying game). Meanwhile, low-distinctiveness/high-diversity communities tend to center on sharing humorous content and have few specific rules for commenting behavior. These include subreddits like r/CoupleMemes or r/SailorMood, two meme-sharing communities.
Still, Figure 3 demonstrates that many communities run counter to the overall trend. For example, many subreddits related to physical and mental wellness (e.g., r/waterbros, r/veggieshake, and r/LifeAfterSchool) scored high on both diversity and distinctiveness. Low-diversity/low-distinctiveness communities tended to center on memes and casual discussion for niche user groups, like r/teenagersnew and r/Jesser (a subreddit for fans of a particular YouTuber).
5.3 Linear Modeling
We now present the results from our linear model. This model allows us to assess the association between monthly distinctiveness changes and our other study measures, answering our primary research questions. We emphasize that results from our regression should be interpreted as comparisons rather than causal statements. That is, our regression model allows us to estimate the average difference in monthly distinctiveness change between two subreddits that differ by one of the predictor variables.
Table 2 contains 95% credible intervals for key model parameters. Recall that our model is hierarchical in nature, meaning that each subreddit is given an individualized \(\alpha_i\) parameter. Thus, the main effect of growth on distinctiveness is subreddit-specific. \(\mu_{\alpha}\) corresponds to the average of these main effects across all subreddits. As expected, growth is, on average, significantly associated with a decrease in distinctiveness, answering RQ1. However, the estimate for the standard deviation of these subreddit-specific main effects (\(\sigma_{\alpha}\)) is large, indicating substantial variation around this mean.
Surprisingly, we do not see significant interactions between removal rate and growth (γ, RQ2) or between diversification and growth (η, RQ3), conflicting somewhat with results from prior qualitative work [18, 32]. We provide several possible explanations for this in Section 6.
Note that although the main effect of removal rate on distinctiveness change is significant, the size of the association is extremely small.

To make the strength of associations in our model more intuitive, we visualize effect sizes in Figure 4. We plot the predicted distinctiveness change associated with different monthly growth factors. We do this for all combinations of two hypothetical comment removal rates (0% and 5%) and three hypothetical changes in diversity score (−0.1, 0, and 0.1). Consider a set of subreddits with a 0% removal rate that experiences no diversification. Our model predicts that, after doubling in size, these communities' distinctiveness scores would drop by .014 (95% CI [.012, .016]), on average. In communities that grow tenfold, we would expect an average change in distinctiveness of .051 (95% CI [.044, .057]). To put these numbers into perspective, a change of .014 is roughly equivalent to the difference in distinctiveness between r/pothos (.513) and r/monstera (.494), two subreddits about houseplants. A change of .051 approximates the difference between r/ratemydessert (.420) and r/ratemyplate (.370), subreddits for rating user-submitted pictures of desserts and of food more generally. Figure 4 also demonstrates that the predicted average distinctiveness changes are similar across different removal rates and diversification levels, even though the main effect of removal rate is marginally significant. Although the predicted changes to distinctiveness are small, they correspond to modeled averages for changes over the span of a single month. Thus, our results are still consistent with larger changes experienced over broader time spans.
5.4 Comparing Newcomers and Returning Users
Although growth is associated with a decrease in distinctiveness, the mechanism underlying this phenomenon is unclear. One explanation is that newcomers, unfamiliar with community norms, write comments whose linguistic patterns differ from those of the community; distinctiveness would then decrease during growth periods, when there are many newcomers. To test this hypothesis, we compare the distinctiveness of comments created by subreddit newcomers and returning users. For each subreddit-month pair, we define a newcomer to be a user who commented in the subreddit for the first time that month and a returner to be a user who commented in the subreddit that month and also in some previous month. We compute two distinctiveness scores for each subreddit-month pairing: a newcomer distinctiveness score, using only the first comment from each newcomer that month, and a returner distinctiveness score, using the first comment from each returner that month. The returner-newcomer gap is the difference between these two scores. We use only the first comment from each user since newcomers may learn community norms during subsequent contributions. We compute these scores in months with at least 50 newcomers and at least 50 returning users, using a maximum of 100 newcomers and 100 returners.
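A pandas sketch of the newcomer/returner labeling (column names hypothetical):

```python
import numpy as np
import pandas as pd

def label_first_comments(comments: pd.DataFrame) -> pd.DataFrame:
    """For each (subreddit, author, month), keep the author's first comment
    and label the author a newcomer (first-ever month in the subreddit)
    or a returner. Expects columns: subreddit, author, month, created_utc, body."""
    comments = comments.sort_values("created_utc")
    first = comments.groupby(
        ["subreddit", "author", "month"], as_index=False
    ).first()
    debut = comments.groupby(["subreddit", "author"])["month"].min().rename("debut")
    first = first.join(debut, on=["subreddit", "author"])
    first["role"] = np.where(first["month"] == first["debut"], "newcomer", "returner")
    return first
```

The per-group distinctiveness scores can then be computed by applying the embedding procedure from Section 4.2 to the newcomer and returner comment sets separately, subject to the 50-user minimum and 100-user cap.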
We use a hierarchical mixed-effects model to analyze the data. We model the newcomer and returner distinctiveness scores as a linear function of the subreddit's growth that month, plus a set of subreddit-specific random intercepts (refer to Section F). In Figure 5, we plot our model's posterior predictions for the returner-newcomer gap. We do this for 80 randomly selected subreddit-month pairs, half with high overall distinctiveness (above the 90th percentile) and half with low overall distinctiveness (below the 10th percentile). Although we find that returner distinctiveness scores are larger than those of newcomers on average, these differences are extremely small. Further, the returner-newcomer gaps were not significantly larger during periods of growth. Thus, it is unlikely that the returner-newcomer gap entirely explains the changes to distinctiveness observed during growth periods.
6 Discussion
In this work, we study changes in online community distinctiveness during growth periods. This represents one of the first attempts to validate qualitatively derived theories about community evolution and moderation [16, 18, 32]. In this section, we discuss implications of our work and contextualize our findings within the literature.
6.1 A Trade-off between Growth and Distinctiveness
Our regression analysis indicates that, on average, subreddits become less distinctive during growth periods. Size and activity level are often used as measures of success in foundational online communities research [2, 5, 9, 17]. However, our findings provide quantitative support to a line of qualitative work suggesting that communities may have a more complicated relationship with growth [16, 18, 32]. Hwang and Foote [16], for example, found that users felt smaller communities allowed for topical focus and niche discussion, which our linguistic analysis supports.
6.2 Mechanistic Explanations for the Growth-Distinctiveness Trade-off
To make the association between growth and distinctiveness actionable for community moderators, it is imperative to understand the mechanisms underlying this relationship. We explore one possible explanation: that newcomers tend to use more indistinct language, reducing the community's overall distinctiveness. Although we find support for this theory, the difference between the language use of newcomers and returners is extremely small and is unlikely to fully explain the observed losses in community distinctiveness. This finding is not entirely surprising; prior work suggests that users may learn community norms before they begin actively participating [27].

Given the linguistic similarity between newcomer and returner contributions, we suggest two alternative explanations for the association between growth and distinctiveness, each with different implications for moderators. First, influxes of newcomers may cause decreases in community distinctiveness that affect both newcomers and returning users alike. For example, newcomers might post off-topic content that returning users end up engaging with, decreasing both groups' distinctiveness scores. If future work finds support for this mechanism, moderators should think carefully before intervening to preserve distinctiveness; the newer, less distinctive content may actually be engaging for old and new users alike. Second, reverse causality may be at play, whereby newcomers are more likely to join a subreddit during periods when the community's distinctiveness is lower (e.g., when the community shifts discussion toward more general-interest topics). If this is the case, interventions that lower community distinctiveness could be an effective tool for attracting new users. In a community on a technical topic, for example, moderators could cultivate a more welcoming environment for newcomers by encouraging returning users to use less jargon.
6.3 Explaining Additional Variation
Although we find an aggregate negative association between growth and distinctiveness, our hierarchical modeling approach reveals substantial variation around this trend; in some communities, growth was actually associated with an increase in distinctiveness. The presence of this variation is nearly as important as the observed aggregate trend, as it indicates that growing communities are not destined to lose distinctiveness. Inspired by prior work, we explore two potential sources of this variation: degree of moderation [18] and community diversification [32]. Surprisingly, we find neither degree of moderation nor diversification to be significantly associated with responsiveness to growth. Below, we provide several potential explanations for this finding.

6.3.1 Moderation.
We highlight two factors that could explain the discrepancy between our findings around moderation and qualitative accounts of community growth [18]. First, and perhaps most important, communities may decide to moderate more aggressively whenever they anticipate larger-than-normal changes to community distinctiveness. Because our regression analysis does not estimate the outcome in the counterfactual case (i.e., what would have happened had communities not moderated?), we could be underestimating the impact of comment removals. Second, other moderation tools might be more effective at preserving distinctiveness [30]. Moderators have a wide range of non-punitive strategies for setting, monitoring, and maintaining community norms. These include stepping in to promote desirable content, increasing the visibility of community guidelines [23], and providing alternative channels for off-topic content, like separate threads devoted to casual conversation. Interestingly, qualitative findings suggest that Reddit moderators tend to use these tools less often than moderators on other platforms, like Twitch [32]. Currently, there is little quantitative work on non-punitive moderation interventions, suggesting the need for future work on this topic.

6.3.2 Diversification.
Following Seering et al. [32], we expected that communities undergoing a combination of growth and diversification would be especially likely to become less distinctive. However, we found no significant interaction between the two measures. A possible explanation is that, regardless of background, newcomers are equally able to pick up on existing community norms before participating [27]. Selection bias may play a role as well, since users may decide to join a community only if they already support the existing norms [16].

Still, our results around diversification do not rule out the possibility that the background of incoming users affects the magnitude of the associated distinctiveness change. For example, it is possible that users from specific kinds of communities, like those with higher tolerances for trolling, might be especially disruptive. A large influx of such users could actually make a community appear less diverse, even if it comes with a drop in distinctiveness. Ultimately, communities are interested in understanding what will happen to them, not what will happen to the hypothetical “average” community. Knowing what kinds of communities lose more or less distinctiveness during growth periods would be invaluable to community designers, who may desire both size and distinctiveness in a community. While we made an attempt at disentangling the sources of variation underpinning the trade-off between growth and distinctiveness, we believe this remains a ripe area for future work.
7 Limitations
Although we focus our analysis on linguistic distinctiveness, we recognize that it represents a single dimension of online community culture. A community could undergo significant changes to discussion content while remaining at the same level of distinctiveness, as measured by our embedding-based approach. Future work could conduct a similar analysis on other computational measures of community culture.
Further, while we believe that both our user and comment embeddings produce meaningful measures of distinctiveness and diversity, we acknowledge two limitations in our current approach. First, our embedding-based methodology does not accurately capture the underlying uncertainty in the data [1]. This is important given that subreddits tend to have few comments during their first several months, and because of the relative diversity of comments across Reddit. Though the overall size of our dataset helps to mitigate this concern, our analysis may be overconfident for specific subsections of the data. Second, we acknowledge that the evidence for the construct validity of both measures is largely qualitative. Although both approaches are inspired by prior work [25, 36, 37], further validation of these measures would strengthen our current analysis.
B Bot Detection Approach
After curating our comments dataset, we manually inspected known bot accounts to identify behavioral traces indicative of bot activity. We observed that bot accounts tended to signal their bot status through either their account name or messages included in their comments. This led us to include two sets of filters for flagging bot accounts.
First, we examine account names. If an account name contains ‘transcriber’, ‘automoderator’, or ‘savevideo’, or if the name ends with some form of ‘-bot’, ‘_bot’, or ‘bot’, we flag the account. For any account flagged by this name filter, we collect the account's entire comment history. Among flagged accounts, we mark an account as a bot if it has either (i) commented 10,000 or more times or (ii) commented between 10 and 10,000 times with 80% of its comments in a single subreddit. This was designed to catch bots that comment across all of Reddit (e.g., haikubot, which turns random Reddit comments into haikus), as well as bots that assist with specific subreddit functionalities (e.g., DeltaBot, which manages a leaderboard for r/ChangeMyView).
Still, many bot accounts evade this first filter (e.g., RugScreen). For this reason, we also examine the text of each account's comments. We flag any comments whose body text contains one of a handful of keyword phrases:
• ^^I ^^am ^^a ^^bot OR ^I ^am ^a ^bot OR ^^I'm ^^a ^^bot OR ^I'm ^a ^bot
• ^^this ^^comment ^^was ^^written ^^by ^^a ^^bot OR ^this ^comment ^was ^written ^by ^a ^bot
• [Info] OR [*Info*] OR [**Info**] OR [(Info)] OR [^^Info]
• This bot wants to find the best and worst Reddit bots
• I detect haikus. And sometimes, successfully
• this comment was written by a bot
If an account has at least 20 comments flagged and at least 50% of its history is flagged after the first flag, we mark the account as a suspected bot. A threshold of 20 comments was chosen because manual review revealed that non-bot accounts may occasionally use the above phrases, but usually not repeatedly. An account only needs to be flagged by either the name filter or the comment text filter.
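The following sketch approximates both filters (patterns simplified and illustrative; the full keyword list is above):

```python
import re

# '^' is Reddit's superscript markup, which precedes each word in many
# bot signatures.
NAME_PAT = re.compile(r"transcriber|automoderator|savevideo|[-_]?bot$", re.IGNORECASE)
TEXT_PATS = [
    re.compile(r"\^{0,2}i(?:'m|\s+\^{0,2}am)\s+\^{0,2}a\s+\^{0,2}bot", re.IGNORECASE),
    re.compile(
        r"\^{0,2}this\s+\^{0,2}comment\s+\^{0,2}was\s+\^{0,2}written"
        r"\s+\^{0,2}by\s+\^{0,2}a\s+\^{0,2}bot",
        re.IGNORECASE,
    ),
]

def flagged_by_name(name: str) -> bool:
    """Name filter; name-flagged accounts are then checked against the
    activity thresholds above (10,000+ comments, or 10-10,000 comments
    with 80% in one subreddit)."""
    return bool(NAME_PAT.search(name))

def flagged_by_text(history: list[str]) -> bool:
    """Text filter: at least 20 flagged comments, and at least 50% of
    the history flagged from the first flagged comment onward."""
    hits = [any(p.search(body) for p in TEXT_PATS) for body in history]
    if sum(hits) < 20:
        return False
    tail = hits[hits.index(True):]
    return sum(tail) / len(tail) >= 0.5
```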
We evaluate our bot detection approach by manually reviewing a random sample of suspected bot and non-bot accounts. We curate a corpus by randomly sampling 300 suspected bot accounts and 700 suspected non-bot accounts from Pushshift's aggregated comment files from March 2018 to December 2021. For each account, we collect 10% of the comment history. Two of the paper's authors individually reviewed a random assortment of 500 accounts, judging whether each account was a bot. If the authors were unsure based on the comment history, they reviewed the account's name. If still unsure, they reviewed the account's page on Reddit to view a greater proportion of the account's comment history and checked the account description. Compared to human review, our automated approach achieves a recall of .96 and a precision of .80.
D Comment-embedding Model Fine-tuning Details
We conduct all model fine-tuning and hyperparameter tuning on a separate set of 1,639,400 comments from 16,394 subreddits sampled during the same window of time as our main dataset. Importantly, this fine-tuning dataset contains only comments from subreddits that do not appear in our final sample of 1,620 subreddits. We included 100 randomly selected comments from any non-NSFW, English-language subreddit that had at least 100 comments in 2018 or 2019, applying the same NSFW and language filters as in our main dataset. As such, the model is tuned to produce embeddings that are generally effective at separating subreddits in the embedding space, rather than specifically learning to separate the 1,620 subreddits we focus on in our study. We use a 70%-15%-15% training-validation-test split across subreddits and conducted all model training on a single GPU provided by Google Colab. We use the first 128 tokens of each comment to produce an embedding.
After conducting a modest grid search to select hyperparameters, we fine-tuned our embedding model with a learning rate of 5e-6 on batches consisting of 10 subreddits and 10 comments per subreddit. We fine-tuned for a single epoch. We report two measures to assess the quality of our embeddings before and after fine-tuning: the average loss over the test set, and performance on a few-shot classification task described by McIlroy-Young et al. [25]. In this task, each subreddit in the test set is given a set of 50 known “reference” comments. We then match sets of 50 unlabelled “query” comments to the subreddits in the test set. This is done by computing centroids for the reference and query sets, then matching each query centroid to the closest reference centroid. Overall, our model achieved an accuracy of 87.6% prior to fine-tuning and 88.1% after fine-tuning, suggesting a modest performance gain. We see a larger improvement in average test-set loss, which falls from .958 prior to fine-tuning to .867 after.
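A sketch of this matching procedure (assuming embeddings have already been computed; all names hypothetical):

```python
import numpy as np

def few_shot_accuracy(reference: dict, query: dict) -> float:
    """reference/query: subreddit name -> (50, D) array of comment embeddings.
    Matches each query centroid to the nearest reference centroid by cosine."""
    subs = sorted(reference)
    R = np.stack([reference[s].mean(axis=0) for s in subs])
    Q = np.stack([query[s].mean(axis=0) for s in subs])
    R /= np.linalg.norm(R, axis=1, keepdims=True)
    Q /= np.linalg.norm(Q, axis=1, keepdims=True)
    pred = (Q @ R.T).argmax(axis=1)  # index of the closest reference centroid
    return float((pred == np.arange(len(subs))).mean())
```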