
Twitter Users #CodeSwitch Hashtags!

#MoltoImportante #wow #ᄒ



David Jurgens, Stefan Dimitrov, Derek Ruths
School of Computer Science
McGill University
Montreal, Canada
jurgens@cs.mcgill.ca, stefan.dimitrov@mail.mcgill.ca,
druths@networkdynamics.org

Proceedings of The First Workshop on Computational Approaches to Code Switching, pages 51–61, October 25, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

Abstract

When code switching, individuals incorporate elements of multiple languages into the same utterance. While code switching has been studied extensively in formal and spoken contexts, its behavior and prevalence remain unexamined in many newer forms of electronic communication. The present study examines code switching in Twitter, focusing on instances where an author writes a post in one language and then includes a hashtag in a second language. In the first experiment, we perform a large scale analysis on the languages used in millions of posts to show that authors readily incorporate hashtags from other languages, and in a manual analysis of a subset of the hashtags, reveal prolific code switching, with code switching occurring for some hashtags in over twenty languages. In the second experiment, French and English posts from three bilingual cities are analyzed for their code switching frequency and its content.

1 Introduction

Online platforms enable individuals from a wide variety of linguistic backgrounds to communicate. When individuals share multiple languages in common, their communication will occasionally include linguistic elements from multiple languages (Nilep, 2006), a practice commonly referred to as code switching. Typically, during code switching, the text or speech in a language retains its syntactic and morphological constraints for that language, rather than having text from both languages conform to one of the languages’ grammatical rules. This requirement enables code switching to be separated from borrowing, where foreign words are integrated into a native language’s lexicon and morphology (Gumperz, 1982; Poplack et al., 1988; Sankoff et al., 1990).

While work on code switching began with conversational analyses, recent work has examined the phenomenon in electronic communication, finding similar evidence of code switching (Climent et al., 2003; Lee, 2007; Paolillo, 2011). However, these investigations into code switching have largely examined interpersonal communication or settings where the number of participants is limited. In contrast, social media platforms such as Twitter offer individuals the ability to write a text that is decoupled from direct conversation but may be read widely.

Twitter enables users to post messages with special markers known as hashtags, which can serve as a side channel to comment on the post itself (Davidov et al., 2010). As a result, multilingual authors have embraced using hashtags from languages other than the language of their post. Consider the following real examples:

• Eating an apple for lunch while everyone around me eats cheeseburgers and fries. #yoquiero
• Jetzt gibt’s was vernünftiges zum essen! #salad #turkey #lunch #healthy #healthylifestyle #loveit
• Hasta mañana a todo mundo. Que tengan linda noche. #MarketerosNocturnos #MarketingDigital #BlackVirs #SocialMedia
• 1% มันสำคัญมากนะ เพราะมันอาจเปลี่ยนจากD+ เป็น C และ B+เป็นA เกรดเฉลี่ยคงดีกว่านี้อ่ะ #พลาด #เสียดาย #fail

Here, the first author posted in English with a Spanish hashtag reflecting the author’s envious disposition. In the second, the author comments in German on sensible food, using multiple English hashtags to describe the meal and their attitude. In the third and fourth, the authors comment on sleep and school, respectively, and then each use hashtags with similar meanings in both their native language and English.

Hashtags provide authors with a communication medium that also has broader social utility by embedding their post within global discussion of other posts using the same hashtag (Letierce et al., 2010) or by becoming a part of a virtual community (Gupta et al., 2010). These social motivations resemble those seen for why individuals may code switch, such as to assimilate into a group or make discussions easier (Urciuoli, 1995). Twitter and other hashtag-supporting platforms such as Instagram and Facebook offer a unique setting for code switching hashtags for two reasons: (1) potential readers are disconnected from the author, who may not know of their language fluency, and (2) text translation is built into the platform, which enables readers to translate a post into their native language. As such, authors may be motivated to include a hashtag of another language to increase their potential audience size or to appear as a member of a multilingual virtual community.

Despite the prevalence of non-English tweets, which are approaching 50% of the total volume (Liu et al., 2014), no study has examined the prevalence of hashtag code switching. We propose an initial study of hashtag code switching in Twitter focusing on three central questions: (1) for which language pairs do authors write in the first language and then incorporate a hashtag of the second language, (2) when tweets include a hashtag of a different language, which instances signal code switching behavior, and (3) the degree to which bilingual populations code switch hashtags. Here, we adopt a general definition of code switching as instances where an individual establishes a linguistic context in one language and then includes elements (such as words) from one or more other languages different from the first.

Two experiments are performed to answer these questions. In the first, we test general methods to identify which languages adopt the same hashtags and whether those shared hashtags are examples of code switching. In the second, we focus on three bilingual cities to examine hashtag code switching behavior in French and English speakers.

Our study provides three main contributions. First, we demonstrate that hashtag code switching is widespread in Twitter. Second, we show that Twitter as a platform includes multiple phenomena that can be falsely interpreted as code switching and therefore must be accounted for in future analyses. Third, in a study of French and English tweets from three cities, we find that an increased rate of bilinguality decreases the frequency of including hashtags from another language but increases the overall rate of code switching when such hashtags are present. Furthermore, all data for the experiments is made publicly available.

2 Related Work

Research on code switching is long standing, with many theories proposed for the motivations behind code switching and how the two languages interact linguistically (Poplack and Sankoff, 1984; Myers-Scotton, 1997; Auer, 1998). Most related to the present work are those studies examining code switching in online communications.

Climent et al. (2003) examined the use of Spanish and Catalan in newsgroups, finding that code switching occurs in 2.2% and 4.4% of the Catalan and Spanish contexts, respectively. Lee (2007) analyzed a corpus of Cantonese and English emails and ICQ instant messages and surveyed Hong Kong users of each form of communication. She found that the users preferred mixed-language communication, with no user indicating that they communicated in only Cantonese. Furthermore, the shorter, more informal ICQ messages were more likely to be code switched (99.4%) than emails (41.3%).

Paolillo (2011) measured code switching among English, Hindi, and Punjabi in both IRC and Usenet forum posts, finding, similar to Climent et al. (2003), that the shorter, more conversational IRC posts had higher rates of code switching. Paolillo (2011) also notes that code switching rates differed between Hindi and Punjabi speakers.

The present work differs significantly from these three studies in two aspects. First, we assess code switching across all language communities on Twitter, rather than examining individual groups of bilingual speakers. Second, we focus our analysis only on the code switching of a post’s hashtag due to its unique role in microtext (Gupta et al., 2010), which has yet to be examined in this context.

3 Hashtag Use in Twitter

Hashtags provide general functionality on Twitter and prior works have proposed that they serve
a dual role as (1) bookmarking content with the tag’s particular expression and (2) functioning as a method for ad hoc community formation and discussion around a tag’s topic (Gupta et al., 2010; Davidov et al., 2010; Yang et al., 2012). However, the diverse user base of the Twitter platform has given rise to additional roles for hashtags beyond these two. For example, many popular hashtags focus on promoting users to follow each other,¹ such as #followback and #openfollow. Similarly, contests are run on Twitter, which have individuals vote by posting using a specific hashtag, e.g., #MtvHottest.

Given hashtags’ flexible roles, some may be used in multiple languages without being examples of code switching, such as the contest-based or follower-promotion hashtags noted above. Therefore, we first propose a taxonomy for classifying all types of hashtags according to their primary observed use in order to disentangle potential code switching behavior from Twitter-specific behavior. To construct the taxonomy, two annotators independently reviewed several thousand hashtags of different frequency to assess the differences in how the tag was used in practice. Each annotator then proposed their own taxonomy. The final taxonomy was produced from a discussion of differences, with both annotators initially proposing highly similar taxonomies.²

Table 1 shows the proposed taxonomy, containing eight broad types of hashtags. The first two types of hashtags correspond to the main hashtag roles proposed in Yang et al. (2012). The NAMED ENTITY tags also serve as a method for individuals to link their content with a specific audience like the COMMUNITY type; however, NAMED ENTITY tags were treated as a separate group for the purposes of this study because the entities typically have a common name which is used in all languages and therefore would not be translated; in contrast, COMMUNITY hashtags refer to more general topics such as #soccer, which may be translated, e.g., #futbol. Hashtags of the five remaining types would likely not be observed in instances of code switching, with such hashtags often being used for purposes other than interpersonal communication.

Name | Description | Examples
ANNOTATION | Serves as an annotation about the author’s feelings or comments on the content of a tweet. | #happy #fail #cute #joking #YoloSwaggins
COMMUNITY | A topical entity that links the tweet with an external community, which is commonly topical but also includes “team-like” groups | #music #friends #BecauseItIsTheCup #TeamEdward
NAMED ENTITY | Refers to a specific entity that has a universally recognized name. | #Glee #TeenChoiceAwards #WorldCup2014
PLATFORM | Refers to some feature or behavior specific to the Twitter platform. | #followback #lasttweet #oomf
APPLICATION | Generated by a third-party application, which automatically includes its hashtag in the message. | #AndroidGames #NowPlaying #iPhone #Android
VOTING | Created as a result of certain real-world phenomena asking individuals to tweet with specific hashtags as a way of voting. | #MtvHottest #iHeartAwards
ADVERTISING | Promoting an item, good, or service, which can be sought out by interested parties. | #forsale #porn
SPAM | Used by adversarial parties to appear on trending lists and to make spam accounts appear real. | #NanaLoveLingga #681team #LORDJASONJEROME

Table 1: A taxonomy of hashtags according to their intended use.

4 Experiment 1: Popular Hashtags

Persistently popular hashtags reflect established norms of communication on Twitter. We hypothesize that these hashtags may be adopted by the speakers of multiple languages for joining a global discussion. Therefore, the first experiment examines the most-used hashtags over a five month period to measure two aspects: (1) which languages adopt the hashtags of other languages and (2) which hashtags used in multiple languages are evidence of code switching.

¹ In Twitter, following denotes creating a directional social relation from one account to another.
² We note that a small number of hashtags did not fit this taxonomy due to their idiosyncratic use. These hashtags were typically single-letter hashtags used when spelling out words, e.g., “tonight is going to be #f #u #n,” or when the author has mistakenly used punctuation, which is not included in Twitter’s definition of a hashtag, e.g., “#I’mAwesome,” which has the hashtag #I rather than the full expression.
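
The punctuation case described in footnote 2 is easy to reproduce. The sketch below approximates the hashtag rule with a simple regular expression (a `#` followed by word characters). This pattern is an assumption made for illustration and is not Twitter's actual tokenizer; the function name `extract_hashtags` is likewise hypothetical.

```python
import re

# Simplified stand-in for Twitter's hashtag rule (an assumption for
# illustration): '#' followed by letters, digits, or underscores.
HASHTAG_PATTERN = re.compile(r"#\w+")


def extract_hashtags(text):
    """Return the hashtags found in a tweet, in order of appearance."""
    return HASHTAG_PATTERN.findall(text)


# The apostrophe ends the hashtag, so only '#I' is captured.
print(extract_hashtags("#I'mAwesome"))
# Single-letter hashtags used to spell out a word.
print(extract_hashtags("tonight is going to be #f #u #n"))
```

Under this approximation the first call yields only `#I`, matching the behavior the footnote describes, and the second yields three single-letter hashtags that carry no meaning on their own.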

Hashtag | # Langs. | Primary Lang. | Type
#lastfm | 39 | en | APPLICATION
#WaliSupitKEPO | 32 | id | SPAM
#RenggiTampanDanKece | 32 | id | SPAM
#NP | 32 | en | APPLICATION
#Np | 32 | en | APPLICATION
#MTVHottest | 31 | en | VOTING
#SidikLoveTini | 30 | id | SPAM
#np | 30 | en | APPLICATION
#GER | 29 | en | NAMED ENTITY
#User Indonesia | 29 | id | APPLICATION
#Soccer | 29 | en | COMMUNITY
#RobotKepo | 29 | id | APPLICATION
#KeePO | 27 | id | APPLICATION
#NowPlaying | 28 | en | APPLICATION
#Hot | 28 | en | ADVERTISING

Table 3: The hashtags used in the largest number of languages, counting a language only if it has at least 20 tweets using that hashtag

4.1 Experimental Setup

Data. Hashtag frequencies were calculated from 981M tweets spanning March 2014 to July 2014. Frequencies were calculated over this five month period in order to focus on widely-used hashtags, rather than bursty hashtags that are popular only for a short time, such as those studied in Huang et al. (2010) and Lin et al. (2013). For each hashtag, up to 10K non-retweet posts containing that hashtag were retained, randomly sampling from the time period studied when more than 10K were observed. To enable a more reliable estimate of the language distribution, we restrict our analysis to only those hashtags with more than 1000 posts, for a total number of 19.4M posts for 4624 hashtags, with an average of 4204 posts per hashtag.
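
The retention rule in the Data paragraph can be sketched as follows. This is a minimal illustration, assuming the non-retweet posts have already been grouped by hashtag in memory. The function name `retain_posts`, the dictionary layout, and the fixed seed are hypothetical and are not taken from the authors' pipeline; uniform sampling over the collected posts stands in for sampling across the time period.

```python
import random


def retain_posts(posts_by_hashtag, cap=10_000, min_posts=1_000, seed=0):
    """Apply the two frequency rules from the Data paragraph.

    posts_by_hashtag: dict mapping a hashtag to a list of its
    non-retweet posts (hypothetical in-memory layout).
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility
    retained = {}
    for hashtag, posts in posts_by_hashtag.items():
        # Keep at most `cap` posts, sampling uniformly when there are more.
        if len(posts) > cap:
            posts = rng.sample(posts, cap)
        # Drop hashtags that do not have more than `min_posts` posts.
        if len(posts) > min_posts:
            retained[hashtag] = list(posts)
    return retained


# Toy run with small thresholds so the behaviour is easy to see.
toy = {
    "#np": [f"post {i}" for i in range(12)],
    "#rare": ["post a", "post b"],
}
kept = retain_posts(toy, cap=5, min_posts=3)
print({tag: len(posts) for tag, posts in kept.items()})
```

In the toy run, `#np` is capped at five posts and `#rare` is dropped because it does not exceed the minimum. With the paper's thresholds the defaults `cap=10_000` and `min_posts=1_000` would apply.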
Language Identification. The languages of tweets were identified using a two-step procedure. First, message content was filtered to remove content such as usernames, URLs, emoji, and hashtags. Tweets with fewer than three remaining tokens were excluded (e.g., a message with only hashtags). Second, the remaining content was processed using langid.py (Lui and Baldwin, 2012), a state-of-the-art language identification program that supports the diversity of languages found on Twitter.

Determining the language of a hashtag in a general setting for all languages is difficult due to the presence of acronyms, abbreviations, and slang. Therefore, we adopt a heuristic where a hashtag’s language is set as the language used by the majority of its tweets. To quantify the accuracy of this heuristic, two annotators inspected the tweets of 200 hashtags to identify the language of the hashtag and of the majority of its tweets. This analysis showed that the heuristic correctly identifies the hashtag’s language in 96.5% of the instances.

4.2 Hashtag Sharing by Languages

The adoption of a hashtag by a second language was measured by calculating the frequency with which tweets using a hashtag with language l1 were labeled with language l2. The noisy nature of microtext is known to make language identification difficult (Bergsma et al., 2012; Goldszmidt et al., 2013) and can create spurious instances of second-language hashtag adoption. Therefore, we impose a minimum frequency of hashtag use where l2 is only said to use a hashtag of l1 if at least 20 tweets using that hashtag were labeled with l2. To quantify the accuracy of our hashtag adoption measure, two annotators inspected the second-language tweets of 200 hashtags, sampled from the data and representing 40 language pair combinations; this analysis showed that, with the filtering, the assertion that at least one author from language l1 used a hashtag of language l2 was correct in 67% of the instances.

Table 2 shows the frequency with which authors using the 15 most-commonly observed languages (shown as columns using their ISO 639-1 language codes) adopt a hashtag from another of the most-common languages (shown as rows), revealing widespread sharing of hashtags between languages. English hashtags are the most frequently used in other languages, likely because English is the most common language in Twitter. However, other languages’ hashtags are also adopted, with Spanish, Japanese, and Indonesian being the most common after English.

Despite the strong evidence of a single hashtag being used in multiple languages, the results in Table 2 should not be interpreted as evidence of code switching. Table 3 shows the 15 hashtags used in the largest number of languages. The majority of these hashtags are generated by either (1) Twitter-based applications that automatically write a tweet in a user’s native language and then append a fixed English-language hashtag or (2) spam-like accounts that use the same hashtag and include random text snippets in various languages, neither of which signals code switching behavior. Furthermore, given the noise introduced by language misidentification and spam behavior on the

Language of tweet
de ru ko pt en it fr zh es ar th ja id nl tr
de 2 1 4 15 6 9 4 9 1 4 6 1
ru 3 3 3 25 7 5 8 7 2 1 7 7 7 1
ko 4 2 13 3 6 5 10 3 10 11 4 2
pt 14 3 64 45 40 13 63 2 4 3 15 10
en 1705 532 155 1235 1735 2183 1171 2482 362 176 742 1097 1101 342
it 5 2 1 10 29 15 4 22 5 3 6 3 1
fr 38 2 3 36 87 49 28 67 8 1 12 19 29 6
zh 3 4 2 2 12 1 2 4 1 11 1 1
es 67 17 3 321 435 264 206 105 29 5 32 66 66 31
ar 6 2 38 4 9 6 7 8 5 1 2
th 3 7 1 24 5 4 8 8 2 6 4 1
ja 17 18 11 11 123 17 24 132 45 2 2 14 12 4
id 84 2 6 25 131 88 58 14 92 6 5 11 52 17
nl 13 1 3 17 6 11 2 9 1 1
tr 17 1 3 28 9 7 7 13 3 1 22 9

Table 2: The frequency with which a hashtag is used by multiple languages. Columns denote the language in which the tweet is written; rows denote the hashtag’s language; and cell values report the number of hashtags where the column’s language has used the hashtag in at least 20 tweets. Diagonal same-language values are omitted for clarity.
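
The cell values in Table 2 combine two rules from Sections 4.1 and 4.2: a hashtag's language is the one used by the majority of its tweets, and another language counts as having adopted the hashtag only if at least 20 of the hashtag's tweets were labeled with that language. A minimal sketch of that computation follows, assuming each tweet has already been assigned a language label (for example by langid.py). The function names and the input layout are hypothetical, the toy counts are invented for illustration, and the most frequent label is used as a stand-in for the majority when no language exceeds half of the tweets.

```python
from collections import Counter, defaultdict


def hashtag_language(tweet_langs):
    """Majority heuristic: the most frequent language among a hashtag's tweets."""
    return Counter(tweet_langs).most_common(1)[0][0]


def adoption_counts(langs_by_hashtag, min_tweets=20):
    """Count adopted hashtags per (hashtag language, tweet language) pair.

    langs_by_hashtag: dict mapping a hashtag to the list of language
    labels of its tweets (hypothetical layout).
    """
    counts = defaultdict(int)
    for hashtag, tweet_langs in langs_by_hashtag.items():
        source = hashtag_language(tweet_langs)
        for lang, n in Counter(tweet_langs).items():
            # Skip the diagonal and apply the minimum-frequency filter.
            if lang != source and n >= min_tweets:
                counts[(source, lang)] += 1
    return dict(counts)


# Invented toy labels, not data from the study.
toy = {
    "#soccer": ["en"] * 100 + ["es"] * 25 + ["fr"] * 3,
    "#cine": ["es"] * 60 + ["en"] * 20,
}
print(adoption_counts(toy))
```

In the toy input, the three French tweets for `#soccer` fall below the threshold and are ignored, which is the kind of spurious adoption the minimum-frequency filter is meant to suppress.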

Twitter platform, we view the initial results in Table 2 as an overestimate of hashtag adoption by languages other than the hashtag’s source language. A further inspection of language classification errors revealed four common factors: (1) the lack of accents on characters,³ (2) the use of short words, which appeared ambiguous to langid.py, (3) the use of non-Latin characters for emoticons or visual affect, and (4) proper names originating from a language different from the tweet’s. Nevertheless, the observed trends do provide some guidance as to which language pairs might share hashtags and also may code switch.

Among the hashtags in Table 3, two are legitimately used by authors in multiple languages: #soccer and #GER, the latter corresponding to the German soccer team. Both hashtags were popular due to the World Cup, which occurred during the time period studied. For both, authors included these hashtags while taking part in a global conversation about the games and event. The hashtag #soccer is a clear case of code switching, where individuals are communicating their interests in multiple languages, even when equivalent hashtags in the tweet’s language are actively being used. Indeed, over half of the languages using #football had at least one tweet containing both #football and #futbol. The example of #GER highlights a boundary case of code switching. Here, GER is an abbreviation for the country’s name, making it a highly-recognized marker, rather than an example of a language change that results in code switching; however, the country has different names depending on the language used (e.g., Deutschland), which does point to an active choice on an author’s part when selecting a particular name and its abbreviation.

³ In particular, the lack of character accents caused significant difficulties in distinguishing between Spanish and Catalan.
⁴ In particular, mistakes were more common when analyzing hashtags used in languages outside the annotators’ fluency, which required a more careful assessment of why the hashtag was being used.

4.3 Analysis by Hashtag Type

In a second analysis, we focus specifically on hashtags classified as COMMUNITY and ANNOTATION, which are more associated with intentional communication actions and therefore more likely to be used in instances of code switching. Performing such an analysis at scale would require automated methods for classifying hashtags by their use, which is beyond the scope of this initial investigation. Therefore, we performed a manual analysis of the 100 most-common, 100 least-common, and 100 median-frequency hashtags in our dataset to assess the distribution of hashtag types and cases of code switching among the COMMUNITY and ANNOTATION hashtags. Two annotators labeled each hashtag, achieving 64.6% agreement on the type annotations; disagreements were largely due to mistaken assignments rather than disputed classifications.⁴ An adjudication step resolved all disagreements. Additionally, eleven hashtags were excluded from analysis due to being made of common words (e.g., #go, #be) which had
no meaningful interpretation for their use. In the following, we describe the results of the analysis and then highlight several types of hashtags.

Figure 1 shows the distribution of hashtag types observed in the three samples. SPAM and APPLICATION hashtags were most common among highest frequency hashtags, whereas the lowest frequency tags in the dataset were also either SPAM or VOTING. Surprisingly, the median frequency hashtags had the majority of the discussion-related hashtag types.

[Figure 1: y-axis “Frequency” (0 to 50); x-axis: hashtag type (advertisement, annotation, application, community, named entity, platform, spam, voting); legend: Lowest Frequency, Median Frequency, Highest Frequency]
Figure 1: Type distributions of the sets of 100 highest, median, and lowest frequency hashtags used in our dataset

Within the ANNOTATION and COMMUNITY types, we selected thirteen hashtags each to manually evaluate if code switching behavior was observed. For each hashtag, two annotators reviewed all associated tweets that were identified as using a different language than that of the hashtag. Annotators were instructed to consider the tweet an instance of code switching only in cases where (1) there was sufficient text to determine the message’s actual language and (2) the message was an act of communication (in contrast to spam-like or nonsensical messages).

Code switching behavior was observed for eleven of the ANNOTATION hashtags and twelve of the COMMUNITY hashtags. Table 4 shows those code switched hashtags and the languages in which they were seen, highlighting the varying frequency with which hashtags were used in multiple languages. For example, the primarily Arabic hashtag #Hadith was used in English and Dutch tweets; similarly, all three Spanish hashtags were used in English tweets.

Hashtag | Lang. | Lang. of Code Switched Tweet
#Noticias | es | en
#Facts | en | id th fr es ru
#simple | en | id es fr ms tr tl sw zh ja ko
#bitch | en | ar cs de es fr id it ja ms nl pt ru sv tl tr zh
#delicious | en | ca de es fr id it ja ko ms nl ru th tr zh
#Design | en | ar de es fr ja kr pt th tl zh
#Felicidad | es | ca en
#SWAG | en | de es fr id it pl pt ru
#fresh | en | es fr id it ms nl sv
#BoludecesNO | es | en
#truth | en | ar bs bu es fr hi id ja it ms pa pt ru tl zh

#Hadith | ar | nl en
#Quran | ar | fa ms id sw az it de en
#hadith | ar | fr en
#tech | en | de es nl ar el fr ro id it ja ms no pl pt ru sq sv zh
#RemajaIndonesia | jv | ms
#class | en | ar tr es bg de fr pt he hr id it ja lt lv ms nl ru sw tl uk zh
#animals | en | ar ca de es fr pt it ms ja mk pl pt ro ru tl tr ur vi
#cine | es | ca de en fr ja pt ro ru
#sunday | en | es ar tr fr ca de el gl hu id it ja ms ko pt nl nn no pl ro ru sl sv th tl zh
#Energy | en | ru es de fr it pt tr
#change | en | ar nl es cs de el eu fr pt id it ja ko jv lv ms nb no pl ro uk ru sv ta th tl tr ur zh
#magic | en | nl fr ar ru ca cs de el it es hu id ja jv ko lv ru ms nn pl pt ro sq sv sw sl tl tr zh

Table 4: Code switched hashtags and the languages of the tweets in which they were seen (ANNOTATION types top, COMMUNITY types bottom).

Many hashtags are used primarily with languages that are associated with countries known to have bilingual speakers fluent in English. However, several hashtags were used in a variety of diverse languages. For example, #truth was used with languages such as Arabic, Bosnian, Bulgarian, Hindi, and Punjabi. The most widely code switched hashtag was #magic. In English, the hashtag is commonly used with content on magic tricks; however, in other languages, the hashtag often connotes surprise. For example, the Latvian tweet “Es izmeklēju visu plauktu, nekur nav. Mamma piejiet ne sekunde nepagāja, kad viņa atrada. #magic” comments on having an item on the shelf disappear when looking for it, only for it to reappear like magic.

During annotation, we observed that authors were highly productive in their code switching, using these hashtags to generate the types of emotional and sarcastic messages typically seen in same-language messages. For example, in the Swedish tweet “Bussen luktar spya och öl. #fresh” the author is sarcastically commenting on a bus
that smells of vomit and beer.

4.4 Discussion

The process of annotating code switching for hashtags revealed four notable trends in author behavior that occurred with multiple hashtags. First, authors fluent in non-Latin writing systems will often use Latin-transliterated hashtags, which are then adopted by authors of Latin-based systems. For example, the hashtag #aikatsu describes a collectible card game and anime and is heavily used by both Japanese and English authors. Similarly, the transliterated hashtags #Hadith and #Quran are commonly associated with Arabic-language tweets, which rarely include an Arabic-script version of those hashtags even when the tweets include other hashtags in Arabic.

Second, when two or more languages share the same written form of a word (i.e., homographs), the resulting hashtags become conflated and appear as false examples of code switching. For example, #Real was widely used in both English and Spanish, but with two meanings: the English usages denoting something existent (i.e., not fake) and the Spanish usages referring to Real Madrid FC, a soccer club. The hashtag #cine also posed a challenge due to abbreviation. While many Spanish-language tweets include #cine (cinema), tweets in other languages include #cinema and its abbreviated form #cine, which matches the Spanish term, creating false evidence of code switching.

Third, multilingual individuals may adopt a common hashtag for reasons other than code switching, which we highlight with two examples. The hashtag #1DWelcomeToBrazil is used in a large number of English and Portuguese tweets. This hashtag is associated with the arrival of the English-speaking band One Direction in Brazil. Similarly, the #100happydays hashtag was spawned from a movement where individuals describe positive aspects of their day. These global phenomena increase the difficulty of automatically identifying code switching instances.

Fourth, spam accounts will occasionally latch onto a hashtag and use it in a variety of languages. For example, the popular hashtag #1000ADAY is used to attract new followers, which resulted in adult content services also using the hashtag to post spam advertisements. Surprisingly, nearly a third of tweets for this hashtag are in Russian and feature fully-grammatical text that appears to be randomly sampled from other sources, such as lists of proverbs. After examining multiple accounts, we speculate that these messages actually come from bot accounts that need to generate a sufficient number of messages to avoid Twitter’s spam filters. Work on detecting fake accounts has largely been done in English (Benevenuto et al., 2010; Grier et al., 2010; Ghosh et al., 2012) and so may benefit from detecting this cross-lingual hashtag use in accounts.

5 Experiment 2: Bilingual Cities

The second experiment measures the prevalence of hashtag code switching in tweets from three cities with different populations of English and French speakers: Montreal, Canada; Quebec City, Canada; and Paris, France. All three cities are known to contain bilingual speakers as well, who have been shown to actively code switch (Heller, 1992). To test for differences in the code switching behavior of populations, each city is analyzed according to the degree to which Anglophone and Francophone speakers incorporate hashtags of other languages into their tweets and whether translations of the code switched hashtags are used in the original language.

5.1 Experimental Setup

Data. Tweets were gathered for each city by using the method of Jurgens (2013) to identify Twitter users with a home location within each city’s greater metropolitan area. Tweets were then extracted for these users over a three year sample of 10% of Twitter. This process yielded 4.4M tweets for Montreal, 203K for Quebec City, and 58.1M for Paris. For efficiency, we restricted the Paris dataset to 5M tweets, randomly sampled across the time period.

Language Identification. The language of a tweet was identified using a similar process as in Experiment 1. Because this setting restricts the analysis to only English and French, a different method was used to determine the language of a hashtag. Given a tweet in language l1, the text of a hashtag is tested to see if it wholly occurs within the dictionary for l1; if not, a greedy tokenization algorithm is run to attempt to split a hashtag into constituent words that are in the dictionary of l1. If either the dictionary-lookup or tokenization steps
57
French hashtags on English tweets English hashtags on French tweets
Quebec City Montreal Paris Quebec City Montreal Paris
imfc imfc comprendraquipourra lasttweet gohabsgo bbl
rilive charte sachezle bbl fail teaminsomniaque
relev seriea nian mtvhottest ind teamportugal
ceta bel hollande gohabsgo mtvhottest ps
preorderproblemonitunes brasil2014 federer not not findugame
derpatrash touspourgagner tropa fail soccer adp
villequebec 2ne1 guillaumeradio 100factsaboutme wow lasttweet
tufnations ma vousetespaspret herbyvip podcast follow
ta lavoixtva bel foodies ukraine teamom
rougeetor passionforezria retouraupensionnat electionsqc2014 int thebest

Table 5: The ten most frequent hashtags occurring in French and English tweets

[Figure 2 plot: percentage of tweets (y-axis) by city (Montreal, Quebec City, Paris);
series: English tweet with French hashtag, French tweet with English hashtag]

Figure 2: Percentages of tweets with any hashtag that include a hashtag from the other language

succeed, the hashtag is said to be in l1. Otherwise, the tests are repeated with the second
language l2. If the hashtag cannot be recognized in l1 or l2, it is assumed to be in the language
of its tweet. The aspell dictionaries were used to recognize words. Furthermore, after analyzing
the errors made due to missing words, dictionaries were augmented to include common social media
terms in each language (e.g., “selfie”). A manual analysis of 100 hashtags each for French and
English showed that this language assignment method was correct for 91% of the instances.

5.2 Results

Francophone authors were much more likely to use English hashtags than Anglophone authors were to
use French hashtags. For tweets in each locale and language, Figure 2 shows the percentage
containing a hashtag in the other language relative to the total number in that city using a
hashtag in either language. Notably, Paris has a higher rate of using English hashtags than both
Canadian cities. We speculate that this difference is due to the high rate of bilingualism in
Montreal and Quebec City; because authors are fully fluent in both languages, should Francophone
authors need to express themselves with an English hashtag, they may write the entire tweet in
English, rather than code switching. In contrast, Parisian authors are less likely to be fully
fluent in English (though functional) and therefore express themselves primarily in French with
English hashtags as desired. An analogous trend may be seen for French hashtags in the English
tweets from Montreal, which has a higher population of primarily Anglophone speakers who might be
less willing to communicate entirely in French but will still use French hashtags to connect their
content with the dominant language used in the city.

For each language and city, Table 5 shows the ten most popular hashtags incorporated into tweets
of the other language. Examining the most popular English tags in French tweets shows a clear
distinction between the two populations; French Parisian tweets include more universal English
hashtags or those generated by applications, which are not generally instances of code switching.
In contrast, the Canadian cities include more ANNOTATION type hashtags, including the
sarcasm-marking #not, which are more indicative of code switching behavior.

An established linguistic convention within a population can also motivate authors to prefer one
language’s expression over another (Myers-Scotton, 1997). To test whether a high-frequency concept
was equally expressed in French and English or whether one language’s expression was preferred, we
created pairs of equivalent English and French hashtags expressing the same concept (e.g.,
#happy/#heureux) by translating the 50 most-popular English hashtags used in French tweets. Then,
the tweets for each city were analyzed to identify which languages were used in expressing each
concept as a hashtag. The results in Figure 3 reveal that for nearly half of the hashtags,
equivalent French language versions are in use; however, examining the relative frequencies shows
that in all cases, the English version is still preferred, despite the presence of a large
Francophone population. For hashtags that were only seen in English, many were of the COMMUNITY
type, e.g., #50factsaboutme, which may not have an equivalent French-language version. However, we
observed that when both an English hashtag and its French translation were attested, the use of
the English hashtag in French was most often an instance of code switching. Hence, testing for the
presence of hashtag translation pairs may serve as a helpful heuristic for identifying hashtags
whose use signals code switching behavior.

[Figure 3 plot: hashtag count (y-axis) by city (Montreal, Quebec City, Paris);
series: English only, Both languages, French only]

Figure 3: For the 50 most-common concepts expressed in equivalent French and English translations,
the frequency with which the hashtags for a concept were seen in each language.

6 Discussion

Typically, code switching is distinguished from the related phenomenon of borrowing by testing
whether the word is being fluently mixed into the utterance instead of simply functioning as a
loan word (Poplack, 2001). Hashtags present a unique challenge for distinguishing between the two
phenomena due to their brief content and unstructured usage: a hashtag may occur anywhere in a
tweet and its general content lacks grammatical constraints. Examining the hashtags seen in our
study, we find evidence spanning both types of uses. Common hashtags such as #win or #fail are
widely recognized outside of English and their uses could easily be interpreted as instances of
borrowing. However, the complexity of other hashtags gives the appearance that their uses go
beyond that of borrowing, e.g., #goingbacktoschool in “Nadie dijo que sería fácil, pero cómo
cuesta estudiar después de 4 años de no tener nada académico cerquita #goingbacktoschool” where
the author is commenting on the difficulty of returning for a degree. Still other posts include
multiple single-token hashtags from a second language, e.g., the earlier example of “Jetzt gibt’s
was vernünftiges zum essen! #salad #turkey #lunch #healthy #healthylifestyle #loveit.” Although
individually these hashtags may be widely recognized and operate as interlingual markers, their
combined presence suggests an intentional language shift on the part of the author that could be
interpreted as code switching. Together, the examples point to hashtag use across multiple
languages as a complex phenomenon where shared hashtag entities exist on a graded scale from
simple borrowing to fully signaling code switching. Our study is intended as a starting point for
analyzing this practice and all our data is made available to support future discussions on the
roles these hashtags play and how they facilitate communication both within and across language
communities.

7 Conclusion

The present work has provided an initial study of code switching in Twitter focusing on instances
where an author produces a message in one language and then includes a hashtag from a second
language. Our work provides three main contributions. First, using state-of-the-art language
identification techniques, we show that hashtags are widely shared across languages, though the
challenges of correctly classifying the language of tweets limit our ability to quantify the exact
scale. Second, in a manual analysis of ANNOTATION and COMMUNITY hashtags, we show that authors
readily code switch with these types of hashtags, using them just as they would in single language
tweets (e.g., indicating sarcasm). Third, in a case study of French and English tweets from three
Francophone cities with bilingual speakers, we find that the cities with more bilingual speakers
tended to have fewer occurrences of English hashtags in French tweets, which we speculate is due
to authors being more likely to write such tweets entirely in English, rather than code switch;
however, when English hashtags were observed in French tweets from these more bilingual cities,
they were much more likely to be used in instances of code switching. Data for all of the
experiments is
available at http://www.networkdynamics.org/datasets/.

Our work raises several avenues for future work. First, we plan to examine how to improve language
identification in microtext in order to gain a more accurate estimation of hashtag sharing and
code switching rates for languages. Second, the Twitter platform enables measuring additional
factors that may influence an individual’s rate of code switching; specifically, we plan to
investigate (1) a user’s historical tweets to estimate the degree of bilinguality and (2) the
impact of a user’s social network with respect to homophily and language use.

References

Peter Auer. 1998. Code-switching in conversation: Language, interaction and identity. Routledge.

Fabrício Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgílio Almeida. 2010. Detecting
spammers on twitter. In Collaboration, electronic messaging, anti-abuse and spam conference
(CEAS), volume 6, page 12.

Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and Theresa Wilson. 2012. Language
identification for creating language-specific twitter collections. In Proceedings of the Second
Workshop on Language in Social Media, pages 65–74. Association for Computational Linguistics.

S. Climent, J. Moré, A. Oliver, M. Salvatierra, I. Sànchez, M. Taulé, and L. Vallmanya. 2003.
Bilingual newsgroups in catalonia: A challenge for machine translation. Journal of
Computer-Mediated Communication, 9(1).

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcastic
sentences in twitter and amazon. In Proceedings of the Fourteenth Conference on Computational
Natural Language Learning (CoNLL), pages 107–116. Association for Computational Linguistics.

Saptarshi Ghosh, Bimal Viswanath, Farshad Kooti, Naveen Kumar Sharma, Gautam Korlam, Fabricio
Benevenuto, Niloy Ganguly, and Krishna Phani Gummadi. 2012. Understanding and combating link
farming in the twitter social network. In Proceedings of the 21st international conference on
World Wide Web (WWW), pages 61–70. ACM.

Moises Goldszmidt, Marc Najork, and Stelios Paparizos. 2013. Boot-strapping language identifiers
for short colloquial postings. In Proceedings of the European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD 2013). Springer Verlag,
September.

Chris Grier, Kurt Thomas, Vern Paxson, and Michael Zhang. 2010. @spam: the underground on 140
characters or less. In Proceedings of the 17th ACM conference on Computer and communications
security (CCS), pages 27–37. ACM.

John Joseph Gumperz. 1982. Discourse strategies. Cambridge University Press.

Manish Gupta, Rui Li, Zhijun Yin, and Jiawei Han. 2010. Survey on social tagging techniques. ACM
SIGKDD Explorations Newsletter, 12(1):58–72.

Monica Heller. 1992. The politics of codeswitching and language choice. Journal of Multilingual &
Multicultural Development, 13(1-2):123–142.

Jeff Huang, Katherine M Thornton, and Efthimis N Efthimiadis. 2010. Conversational tagging in
twitter. In Proceedings of the 21st ACM conference on Hypertext and hypermedia, pages 173–178.
ACM.

David Jurgens. 2013. That’s what friends are for: Inferring location in online social media
platforms based on social relationships. In Proceedings of the 7th International Conference on
Weblogs and Social Media (ICWSM). AAAI.

Carmen K. M. Lee. 2007. Linguistic features of email and icq instant messaging in hong kong. In
Brenda Danet and Susan C. Herring, editors, The Multilingual Internet: Language, Culture, and
Communication Online. Oxford University Press.

Julie Letierce, Alexandre Passant, John Breslin, and Stefan Decker. 2010. Understanding how
twitter is used to spread scientific messages. In WebSci10: Extending the Frontiers of Society
On-Line.

Yu-Ru Lin, Drew Margolin, Brian Keegan, Andrea Baronchelli, and David Lazer. 2013. #bigbirds never
die: Understanding social dynamics of emergent hashtags. In Seventh International Conference on
Weblogs and Social Media (ICWSM). AAAI.

Yabing Liu, Chloe Kliman-Silver, and Alan Mislove. 2014. The tweets they are a-changin’: Evolution
of twitter users and behavior. In Proceedings of the 8th International Conference on Weblogs and
Social Media (ICWSM). AAAI.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In
Proceedings of the ACL 2012 System Demonstrations, pages 25–30. Association for Computational
Linguistics.

Carol Myers-Scotton. 1997. Duelling Languages: Grammatical Structure in Codeswitching. Clarendon
Press.

Chad Nilep. 2006. Code switching in sociocultural linguistics. Colorado Research in Linguistics,
19(1):1–22.

John C. Paolillo. 2011. Conversational codeswitching on usenet and internet relay chat.
Language@Internet, 8.

Shana Poplack and David Sankoff. 1984. Borrowing: the synchrony of integration. Linguistics,
22(1):99–136.

Shana Poplack, David Sankoff, and Christopher Miller. 1988. The social correlates and linguistic
processes of lexical borrowing and assimilation. Linguistics, 26(1):47–104.

Shana Poplack. 2001. Code-switching (linguistic). In International Encyclopedia of the Social and
Behavioral Sciences, pages 2062–2065. Elsevier Science Ltd., 2nd edition.

David Sankoff, Shana Poplack, and Swathi Vanniarajan. 1990. The case of the nonce loan in tamil.
Language variation and change, 2(01):71–101.

Bonnie Urciuoli. 1995. Language and borders. Annual Review of Anthropology, 24:525–546.

Lei Yang, Tao Sun, Ming Zhang, and Qiaozhu Mei. 2012. We know what @you #tag: does the dual role
affect hashtag adoption? In Proceedings of the 21st international conference on World Wide Web
(WWW), pages 261–270. ACM.
