Twitter As Data PDF
TWITTER AS DATA
Zachary C. Steinert-Threlkeld
University of California, Los Angeles
Downloaded from https://www.cambridge.org/core. IP address: 39.46.131.221, on 09 Apr 2018 at 16:42:53, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781108529327
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre,
New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906
www.cambridge.org
Information on this title: www.cambridge.org/9781108438339
DOI: 10.1017/9781108529327
© Zachary C. Steinert-Threlkeld 2018
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2018
A catalogue record for this publication is available from the British Library.
ISBN 978-1-108-43833-9 Paperback
ISSN 2398-4023 (Online)
ISSN 2514-3794 (Print)
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Twitter As Data
Zachary C. Steinert-Threlkeld*
1 Twitter
The increasing prevalence of digital communications technology –
the internet and mobile phones – provides the possibility of ana-
lyzing human behavior at a level of detail previously unimaginable.
2 Quantitative and Computational Methods for Social Science
Topic               Reference
Prediction
  Box office        Asur and Huberman, 2010
  Coups             Kallus, 2013
  Crime             Gerber, 2014
  Memes             Garcia-Herranz et al., 2014
  Stock market      Bollen et al., 2011; Zheludev et al., 2014
  Unrest            Ramakrishnan et al., 2014; Steinert-Threlkeld, 2017b
Disaster Response   Starbird and Palen, 2010; Vieweg et al., 2010; Yardi and Boyd, 2010
Polarization        Barberá et al., 2015a; Borge-Holthoefer et al., 2015
Congress            Barberá et al., 2014; Anastasopoulos et al., 2016
Demographics        Hale et al., 2011; Zamal et al., 2012; Mocanu et al., 2013
Economics           Acemoglu et al., 2014; Llorente et al., 2014
Geography           Yardi and Boyd, 2010; Kulshrestha et al., 2012; Conover et al., 2013; Frank et al., 2013
Sentiment           Dodds et al., 2011; Golder and Macy, 2011
this approach include being able to define search terms, not relying
on others for data, and, depending on how much data is involved,
cost. Disadvantages include a steeper learning curve than purchas-
ing or working with others, difficulty accessing historic data, and
needing to maintain your own infrastructure. Sections 2.1.1 and
2.1.2 explain the two application programming interfaces (APIs) for
acquiring data for free and what kinds of data are available from
each. Though Twitter does not charge for using those interfaces,
you still need hardware with which to store and analyze the data.
1
The streaming API technically has three endpoints: GET statuses/sample, GET
user, and GET site. Academics will only need to work with GET statuses/sample,
so that is the connection assumed for the rest of this Element.
asking for tweets from San Francisco and tweets in Spanish will
return all tweets from San Francisco (regardless of language) and
all tweets in Spanish. Since 2–3% of tweets contain GPS coordinates
(Leetaru et al., 2013), passing the coordinate pairs
[-180,-90,180,90] – a box around the world – will return 33% to
50% of all tweets with GPS coordinates. Twitter accepts up to
25 bounding boxes per connection. The streaming API does not
use a user’s self-reported location.
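The 33% to 50% range follows from the streaming API's 1% ceiling: if geotagged tweets are 2–3% of all tweets, a stream capped at 1% of all tweets can deliver at most one half to one third of them. A minimal sketch of that arithmetic, assuming the ceiling is exactly 1% (the function name is illustrative and not part of any Twitter library):

```python
def share_of_matches_returned(match_share, ceiling=0.01):
    """Fraction of matching tweets that a capped stream can deliver.

    match_share: fraction of all tweets that match the filter (e.g. 0.02).
    ceiling: fraction of all tweets the stream may return (assumed to be 1%).
    """
    if match_share <= 0:
        raise ValueError("match_share must be positive")
    return min(1.0, ceiling / match_share)


# The two ends of the 2-3% range reported for geotagged tweets
for share in (0.02, 0.03):
    returned = share_of_matches_returned(share)
    print(f"{share:.0%} geotagged -> {returned:.0%} of them returned")
```

A filter that matches less than 1% of all tweets is not affected by the ceiling, which is why the function never returns more than 1.0.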
Specific Keywords. Twitter will return tweets containing a user-
supplied string, and multiple strings can be passed. This func-
tionality can be used to search for specific hashtags, individual
words, links (Twitter will search the expanded URL of a shor-
tened link), retweets, or mentions of a user. Four hundred pieces
of text can be passed per connection. Note that non-space sepa-
rated languages, like Korean, Japanese, and Chinese, are not
supported.
Language Sample. When connecting to the streaming API, you can
request only tweets in a certain language. Twitter will then return
all tweets in that language up to the 1% ceiling. Note that language
cannot be the only parameter passed. To download tweets in
a specific language, it is therefore best to pass generic keywords
in that language, e.g. “this”, “the”, “is”, and so on if you want
a sample of tweets in English. The request will then return
a random sample of the tweets in English that contain one of
those words. Multiple languages can be requested simultaneously.
Because Twitter does not filter for non-space separated languages,
asking for tweets in those languages requires use of other
parameters.
Specific People. You can submit specific user identification num-
bers to the streaming API and receive all tweets the users create, all
tweets the users retweet, replies to tweets of the users, and retweets
of the users’ tweets. Five thousand people can be followed per
connection. This feature is especially useful when the accounts to
be studied are known. The best way to identify accounts is through
Fourth, while the REST API allows you to search historic tweets,
the results cover only the previous seven to nine days and are not
exhaustive of those days. Twitter does not explain how it decides
which tweets to return, so the search results should not be relied
upon to reconstruct histories. Twitter returns only 100 results per
request, up to 180 requests per 15 minutes (see footnote 2). A broad
search with thousands of results may therefore take a while to
download and will not provide the population of tweets matching
a search query. Searching directly at www.twitter.com returns all
historic matches, but you cannot download those matches.
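To get a feel for how long a broad search takes, the two limits can be combined: 100 results per request and 180 requests per 15 minutes allow at most 18,000 tweets per window. The sketch below is a rough lower bound on waiting time, assuming every request returns a full page of 100 results; the constant and function names are illustrative.

```python
import math

RESULTS_PER_REQUEST = 100   # maximum results per search request (from the text)
REQUESTS_PER_WINDOW = 180   # requests per window under user authentication
WINDOW_MINUTES = 15


def minutes_to_download(n_tweets):
    """Lower bound, in minutes, on the wait needed to page through n_tweets results."""
    if n_tweets <= 0:
        return 0
    requests_needed = math.ceil(n_tweets / RESULTS_PER_REQUEST)
    # Requests in the first window run immediately; each further window costs one wait.
    windows_to_wait = (requests_needed - 1) // REQUESTS_PER_WINDOW
    return windows_to_wait * WINDOW_MINUTES


for n in (18000, 18001, 100000):
    print(n, "tweets: at least", minutes_to_download(n), "minutes of waiting")
```

Real downloads will be slower, since pages are often not full and network time is ignored here.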
Fifth, Twitter only allows you to share 50,000 public tweets and/or
accounts’ metadata per day, and the sharing cannot be automated.
For example, if a researcher uses more than 50,000 tweets for
a paper and needs to share them, they cannot be made freely
available. A system would need to be constructed to verify that an
interested party is not downloading the data more than once
per day, and the data cannot be pushed (sent automatically) to an
interested party. Twitter does allow, however, the unlimited distri-
bution of the numeric identification number of each tweet or user
account. An interested party can then take these numbers to the
REST API and download the full tweet and account information.
But, as noted in the third limitation, the metadata from these tweets
will differ from the metadata of the original tweets, if the original
tweets were obtained from the streaming API.
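Sharing identification numbers instead of full tweets can be sketched as follows. The example assumes tweets were saved as one JSON object per line and that each tweet carries its identifier in the `id_str` field; the function name and the sample lines are made up for illustration.

```python
import json


def tweet_ids_from_json_lines(lines):
    """Yield the ID of every tweet in an iterable of JSON strings, one tweet per string."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        message = json.loads(line)
        # Streams also carry non-tweet messages (e.g. deletion notices) without a top-level ID.
        if "id_str" in message:
            yield message["id_str"]


# Made-up sample: one tweet and one deletion notice
sample = [
    '{"id_str": "20", "text": "an example tweet"}',
    '{"delete": {"status": {"id_str": "21"}}}',
]
ids = list(tweet_ids_from_json_lines(sample))
print(ids)
```

The resulting list of IDs is what can be distributed without limit; the recipient then requests the full tweets from the REST API.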
Sixth, the streaming API occasionally disconnects. These dis-
connections are rare and random but can imperil research design
if not caught quickly. At least three solutions are available. If your
connection is designed to last indefinitely, that connection’s code
should generate an e-mail, or similar notification, whenever the
connection is interrupted. Alternatively, you can intentionally dis-
connect from the streaming API and immediately reconnect at
2
Technically, the number of requests depends on whether you are authenticated
as a user or application. Since whether or not one form of authentication returns
more results depends on the type of request and as most academics are not
trying to build an application, the rate limits presented are for the user
authentication.
2.2 Collaborate
On April 14, 2010, Twitter and the Library of Congress announced the
Twitter Research Access project, a collaboration to make every tweet
ever published available to researchers (Stone, 2010). Scheduled for
completion in 2013, the project still has not resulted in an available
archive. Updates from the Library have been intermittent, though it is
clear it has at least all tweets from 2006, when Twitter started, through
the end of 2010. Disconcertingly, a report from the Library in 2013
suggested that the hardware necessary to enable fast searches of the
archive is “cost-prohibitive and impractical for a public institution”
(Update on the Twitter Archive at the Library of Congress, 2013).
The most detailed reports on the project and its current status are
Zimmer (2015) and McGill (2016). The situation appears
no closer to resolving itself.
2.3 Purchase
If you are interested in tweets from the past, the most thorough
approach is to purchase them from a vendor. (Section 2.1.1
describes how to download some old tweets for free.)
Companies which provide access to old tweets pay a large, undi-
sclosed licensing fee to Twitter, and their main customers are
marketing and public relations firms. Since many companies
provide this service, this section focuses on three of the most
prominent.
The main vendor is Gnip; it was founded in 2008 and started
licensing Twitter’s data in 2010, and Twitter purchased the com-
pany outright in 2014. While Gnip’s target market is businesses,
anyone can purchase their old Twitter data using their Historical
PowerTrack application programming interface (API). Gnip
claims that prices start at $500, but a project will more likely
spend upwards of $5,000 purchasing data. The price is
a function of the number of tweets that will be returned and the
timespan of the request, though the final price requires consulta-
tion with a sales representative. Gnip provides its own metadata
as well, including expanded links, a Klout score, language detec-
tion, and enhanced geo-information. Though Gnip provides
a programming interface, a one-time purchase is best handled
through contacting a salesperson directly. Gnip has offered Data
Grants, free downloads of tweets to winning applicants; its first,
and so far only, competition, in 2014, saw 1,300 entrants compete
for six grants.
DataSift is the second major reseller of archived tweets. It
provides the same services as Gnip, including a programming
interface that uses its own syntax to filter historic data.
(DataSift and Gnip also ingest other data sources such as
Wikipedia, reddit, YouTube, and WordPress.) DataSift’s historic
data starts on January 1, 2010, whereas Gnip has every tweet
since 2006. DataSift charges $1 per 5 hours of computation time,
plus $0.10 per 1,000 tweets a query returns. To estimate the cost
3 Process Data
3.1 Getting Started
This section takes the reader through creating a Twitter account
and creating an application with that account. The application is
what will connect to Twitter’s APIs, and an account can own multi-
ple applications.
Section 3.1.1 explains how to create an account, and
Section 3.1.2 shows how to use that account to obtain the cre-
dentials you will need to access Twitter’s API. Section 3.2
provides Python and R scripts to download a user’s tweets,
Image 1.
3
These steps are current as of May 2016.
Image 2.
Image 3.
8. Check your e-mail using the e-mail address you gave Twitter.
You will have an e-mail with a blue “Confirm now” button. Click
that button. If you do not see this e-mail after a few minutes,
check your Spam folder.
9. Clicking the “Confirm now” button will take you to the Twitter
homepage for your account, as shown in Image 4. It is worth
following some accounts and sending a couple of tweets
immediately so Twitter does not delete your account for
inactivity.
Image 4.
Image 5.
Image 6.
Image 7.
Image 8.
Image 9.
8. Copy and paste the “Consumer Key (API Key)” and “Consumer
Secret (API Secret)” fields somewhere you can easily retrieve
them, such as in a text document. These will be necessary
soon. The method of connecting to the streaming API through
R does not require the next steps; if you are using R, skip to “Use
R to Verify Your Identity”. Libraries in other languages will
accept the two items created in the next steps or let you replicate
the steps R requires.
9. At the bottom of the screen, click the “Create my access token”
button.
10. Once you click the “Create my access token” button, the screen
will refresh, with more information displayed under the “Your
Access Token” section. If there is no information, wait a few
minutes and click the “Refresh” blue text at the top of the
screen. Your screen should look like Image 10:
Image 10.
11. Copy and paste the “Access Token” and “Access Token Secret”
fields somewhere you can easily retrieve them, such as in a text
document. You will also need these shortly.
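Once the four strings are saved, one option is to keep them out of your scripts altogether and read them at run time. The sketch below assumes you have stored them in environment variables; the variable names and the function are hypothetical, so use whatever names suit your setup.

```python
import os

# Hypothetical environment variable names; rename as you see fit.
CREDENTIAL_NAMES = (
    "TWITTER_APP_KEY",
    "TWITTER_APP_SECRET",
    "TWITTER_OAUTH_TOKEN",
    "TWITTER_OAUTH_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, raising KeyError if any is missing or empty."""
    missing = [name for name in CREDENTIAL_NAMES if not environ.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: environ[name] for name in CREDENTIAL_NAMES}
```

Keeping the credentials out of the script lets you share your code without also sharing access to your account.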
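    # Request up to 200 tweets older than the oldest tweet downloaded so far
    temp = connection.get_user_timeline(screen_name=screen_name, count=200, max_id=maxID - 1)

    # getUserTweets and saveTweets are helper functions defined in the full script
    tweets = getUserTweets(screen_name='ZacharyST')
    saveTweets(tweets, filename='<insert your name here>')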
3.2.2 Tweets by ID
You can download specific tweets if the tweet identification number
is known. This feature is useful if you want to replicate other Twitter
studies: Twitter does not allow you to share more than 50,000 raw
tweets per day, but you can share an unlimited number of tweet
identification numbers. For tweet identification numbers, Twitter
returns 6,000 tweets per 15 minutes, or 576,000 per day. For an
example of a researcher sharing tweets, see Freelon (2012).
The full R script to download specific tweets is here
(www.cambridge.org/download_file/949242). The key lines are:
    # Download tweets
    for chunk in chunkedTweets:
        print('On new chunk')
        temp = connection.lookup_status(id=chunk)  # Notice that this is lookup_status and not show_status
        tweets.extend(temp)  # Add this chunk's tweets to the running list
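The loop above iterates over `chunkedTweets`, whose construction is not shown in the key lines. A minimal sketch of one way to build it, assuming each lookup request accepts at most 100 identification numbers (the helper name `chunk_ids` and the IDs are made up):

```python
def chunk_ids(tweet_ids, size=100):
    """Split a list of tweet IDs into consecutive lists of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [tweet_ids[i:i + size] for i in range(0, len(tweet_ids), size)]


# 250 made-up IDs produce chunks of 100, 100, and 50
chunkedTweets = chunk_ids([str(n) for n in range(1, 251)])
print([len(chunk) for chunk in chunkedTweets])
```

At 6,000 tweets per 15 minutes, chunks of 100 correspond to 60 requests per window.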
City that contain the acronym “NBA”. None of the tweets contains
GPS coordinates, and the self-reported location of many of the
users is unmistakably not within five miles of the city. For example,
one user is from “Philly”, another from “ALABAMA”, and another
from “Watching a Game Somewhere.” The original tweets can be
found at this link (www.cambridge.org/download_file/949236).
The full R script to search for tweets matching specific criteria is
here (www.cambridge.org/download_file/949224). The key lines
are:
The full Python script that does the same is here
(www.cambridge.org/download_file/949221). The key lines are:
    import twython as twy

    # Connect
    connection = twy.Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

    # How many tweets to return per query
    size = 100  # Modify as needed, maximum is 100

    # One word
    nba_tweets = connection.search(q='nba', count=size)  # Also returns hashtags, and Twitter is not case sensitive.
3.2.4 Followers
The full R script to download a user’s followers is here
(www.cambridge.org/download_file/949218). The key lines are:
    total = len(ids)
    i = 0
    while i < total:
        print('On follower %d' % i)
        j = min(i + 100, total)  # Do not run past the end of the list
        hydrateFollowers(user='BarackObama', IDs=ids, start=i, end=j)
        i += 100
        pct_done = (j / float(total)) * 100  # float() keeps the division exact in Python 2 as well
        print('Finished %.1f percent' % pct_done)
3.2.5 Following
The full R script to download a user’s friends is here
(www.cambridge.org/download_file/949212). The key lines are:
    ###### Hydrate friends list. The function below just gets the list of Twitter IDs.
    ids = openIDs(user='BarackObama')

    total = len(ids)
    i = 0
    while i < total:
        print('On friend %d' % i)
        j = min(i + 100, total)  # Do not run past the end of the list
        hydrateFriends(user='BarackObama', IDs=ids, start=i, end=j)
        i += 100
        pct_done = (j / float(total)) * 100  # float() keeps the division exact in Python 2 as well
        print('Finished %.1f percent' % pct_done)
    # Import libraries
    import twython as twy
    import json
    import datetime as dt

    # Key, secret, token, token_secret for one of my developer accounts.
    # Update with your own strings as necessary
    APP_KEY = 'yourConsumerKey'
    APP_SECRET = 'yourConsumerSecret'
    OAUTH_TOKEN = 'yourAccessToken'
    OAUTH_TOKEN_SECRET = 'yourAccessTokenSecret'

    # Make class
    class MyStreamer(twy.TwythonStreamer):
        fileDirectory = '/path/to/directory/to/save/to/'  # Any result from this class will save to this directory

        stop_time = dt.datetime.now() + dt.timedelta(minutes=60)  # Connect to Twitter for 60 minutes. Comment out if do not want it timed.

        def on_success(self, data):
            if dt.datetime.now() > self.stop_time:  # Once minutes=60 have passed, stop. Comment out these 2 lines if do not want timed connection.
                raise Exception('Time expired')

            fileName = self.fileDirectory + 'Tweets_' + dt.datetime.now().strftime('%Y_%m_%d_%H') + '.txt'  # File name includes date out to hour.
            with open(fileName, 'a') as f:
                f.write(json.dumps(data) + '\n')  # Append tweet to the file
            # Because the file name includes the hour, a new file is created automatically every hour.

        def on_error(self, status_code, data, headers=None):  # headers=None tolerates Twython versions that also pass response headers
            fileName = self.fileDirectory + dt.datetime.now().strftime('%Y_%m_%d_%H') + '_Errors.txt'
            with open(fileName, 'a') as f:
                f.write(json.dumps({'status_code': status_code, 'data': str(data)}) + '\n')  # Log the status code and message


    # Make function. Connects to the random sample stream.
    def streamConnect(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET):
        stream = MyStreamer(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
        stream.statuses.sample()

    # Execute
    streamConnect(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

    # R version
    filterStream(file.name='raw_tweets_language.txt', timeout = 30, track = 'a,an,the,and,but,is,this,that', oauth = my_oauth, language = 'en') # Use filler words to capture a large amount of tweets.

    # Python version
    # Make function
    def streamConnect(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET):
        stream = MyStreamer(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
        stream.statuses.filter(track='a,an,the,and,but,is,this,that,on,in,up,to', language='en')  # Keywords need the filter endpoint; sample does not accept track

    # Make function
    def streamConnect(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET):
        stream = MyStreamer(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
        stream.statuses.filter(locations=[-180, -90, 180, 90])  # Bounding box around the whole world
    filterStream(file.name='tweets_keywords.txt', timeout = 30, track='LeBron James,Steph Curry,NBA,basketball,Warriors,GSW,Cavaliers,espn com', oauth = my_oauth)
LeBron”; results are not case sensitive, so “lebron james is ready for
the game” matches as well. “espn com” is the recommended way to
download any tweet from the espn.com domain. Finally, note there
are no spaces after the commas; Twitter treats a space as a character
to match, so “Steph Curry is ready for the game” would not
match if the term passed is “..., Steph Curry”.
    filterStream(file.name='tweets_accounts.txt', timeout = 30, oauth = my_oauth, follow = '813286,1536791610') # @BarackObama, @POTUS

    # Make function
    def streamConnect(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET):
        stream = MyStreamer(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
        stream.statuses.filter(follow=['813286,1536791610'])
    # Make local time (timezone is defined earlier in the full script)
    utc_time = dt.datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y').replace(tzinfo=pytz.UTC)  # Convert tweet timestamp to datetime object
    local_time = utc_time.astimezone(timezone)  # Get local time as datetime object

    ### If tweets do not contain GPS coordinates
    # Correct for user timezone
    utc_time = dt.datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
    local_time = utc_time + dt.timedelta(seconds=tweet['user']['utc_offset'])  # Shift by the UTC offset (in seconds, negative west of Greenwich) from the profile
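The offset-based conversion can also be sketched with only the standard library. The example assumes the `created_at` string follows the format used in the lines above and that the profile's offset is available as a number of seconds; the function name and the sample values are illustrative. Profiles without an offset would need to be handled separately.

```python
import datetime as dt

TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"  # %z parses the "+0000" part


def local_time_from_offset(created_at, utc_offset_seconds):
    """Convert a created_at string to local time, given a UTC offset in seconds."""
    utc_time = dt.datetime.strptime(created_at, TWITTER_TIME_FORMAT)  # timezone-aware, in UTC
    local_zone = dt.timezone(dt.timedelta(seconds=utc_offset_seconds))
    return utc_time.astimezone(local_zone)


# Made-up example: a tweet sent at 13:08 UTC by a user seven hours behind UTC
local = local_time_from_offset("Wed Aug 27 13:08:45 +0000 2008", -25200)
print(local.isoformat())
```

Unlike adding a `timedelta` to a naive timestamp, this keeps the offset attached to the result, so later comparisons across users remain unambiguous.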
    # rg is imported in the full script (presumably the reverse_geocoder package)
    if tweet['place']['place_type'] == 'city':
        tweet['city'] = tweet['place']['name']
    else:  # If the place_type is admin, neighborhood, or poi
        # Other processing not shown here but that is in the script
        tweet['reversegeocode_results'] = rg.search(tweet['place.bounding_box.SWcorner_rg'])  # Perform reverse geocode
        tweet['city'] = [item['name'] for item in tweet['reversegeocode_results']]
legislation will have values for date, sponsor, number of words, and
so on. Structured data resemble the spreadsheets that most people
are comfortable working with. Databases which store these data
are known as SQL databases after the main language for querying
them, Structured Query Language (SQL). Many kinds of data,
especially from social media, do not have the same variables for
each datum. For example, a tweet downloaded from Twitter will list
the hashtags, links, and user mentions it contains if the tweet text
has any; if none exists, an empty list is returned. Twitter will also
identify any stock symbols in a tweet as long as they are preceded
by a dollar sign and are uppercase. A tweet with zero hashtags
therefore looks different than one with one, and one with one looks
different than one with two.
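To make the contrast concrete, here is a minimal Python sketch of two tweets whose hashtag lists differ in length. The tweets are hand-made illustrations, not real data; the field names (`entities`, `hashtags`, `symbols`, `text`) follow Twitter's JSON layout as I understand it, and the helper `hashtag_texts` is a hypothetical name introduced only for this example.

```python
import json

# Two hand-made tweets (illustrative, not real data) that mimic Twitter's JSON.
raw_tweets = [
    '{"id": 1, "text": "Good morning", '
    '"entities": {"hashtags": [], "symbols": []}}',
    '{"id": 2, "text": "#jan25 #egypt $TWTR", '
    '"entities": {"hashtags": [{"text": "jan25"}, {"text": "egypt"}], '
    '"symbols": [{"text": "TWTR"}]}}',
]


def hashtag_texts(tweet):
    """Return the hashtags of one tweet as a list of strings (possibly empty)."""
    entities = tweet.get("entities", {})
    return [tag["text"] for tag in entities.get("hashtags", [])]


tweets = [json.loads(line) for line in raw_tweets]
hashtags_per_tweet = [hashtag_texts(t) for t in tweets]

# A fixed-width table would need as many hashtag columns as the widest tweet.
widest = max(len(tags) for tags in hashtags_per_tweet)
print(hashtags_per_tweet)
print(widest)
```

Each tweet yields a list of a different length, which is why a table with a fixed number of hashtag columns fits these data poorly.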
While a researcher can create a structured database that
accounts for this eventuality, it is unwise to do so. First, you must
decide in advance how many columns to create for a variable
that may or may not exist. Returning to the hashtag example,
a tweet could contain up to 47 hashtags.5 Constructing the data-
base requires similar calculations for user mentions, stock sym-
bols, and links. You can create a SQL database with as many
columns as possible variables, but doing so leads to a much larger
database than is required. Second, Twitter could decide to change
the actual data a tweet contains. For example, in April 2013, Twitter
added annotation to tweets if they contained stock symbols.
Attempting to load a tweet with a stock symbol would crash the
database script, preventing subsequent tweets without stock sym-
bols from entering the database. Your script can be structured to
avoid this problem, but then you would miss the data on stock
symbols. Structured databases are not ideal for semi-structured
data.
5 The smallest hashtag is two characters, e.g. “#a”, and hashtags will have
a space character separating them, except for the final hashtag. Solving
$2x + x - 1 \le 140$ gives $x = 47$.
Semi-structured databases, commonly called NoSQL databases,
are designed to work with data whose representation can
vary for each datum. There is not a dominant query language for
semi-structured data; the two most common databases are Cassandra and
MongoDB. Cassandra started at Facebook but is now an open
source project, and it did not originally handle JSON objects
(tweets are delivered as JSON objects). MongoDB is the preferred
database for JSON objects and is therefore the one most commonly
used for storing tweets in a database. (Ironically, Twitter stores
tweets internally as SQL objects. See their GitHub page for their
implementation, and this Quora conversation for more informa-
tion.) The Social Media and Political Participation Lab at New York
University, for example, stores its tweets in MongoDB databases.
MongoDB is open source and can be used through R or Python.
Databases may not be necessary, however. Their advantages
over flat files dominate when the object to be scanned is too large
to fit into memory or when computing time is a constraining factor,
neither of which is as large an impediment as it may initially
seem. First, a day’s tweets require approximately 12 gigabytes, and
high-performance laptops commonly have 16 gigabytes of RAM;
processing and compressing the tweets after they are downloaded
makes them even smaller. Desktops commonly have 32 or 64
gigabytes of RAM, and a server much more. Second, the connec-
tion to Twitter can be maintained in such a way as to minimize the
size of flat files. For example, my connections to Twitter’s API
restart every hour, meaning I have one file per hour of tweets.
These files are 500 megabytes raw and 33 megabytes once subsetted and
compressed, and it is trivial to read a file that size into memory.
Though individual files may be small, you are likely to want to
read many of them to find tweets of interest. For example, I have
downloaded tweets since August 26, 2013 and frequently want to
scan the 24*(number of days since then) files for tweets from
a particular country. A database could do this quickly. But it is
trivial to write a script to scan these files and pull the tweets that are
from a country (or match any other criteria in which I am inter-
ested). How long this script takes will depend on how many files
exist, how they are loaded into memory, and if the code runs in
parallel. My script takes a few days to scan every file, but it is rare
that I need to read every file; the vast majority of files can be
ignored based on their date.
Computing time is inexpensive, and there are always other tasks
to focus on in the meantime. Once the subset of tweets matching
my criteria is found (the equivalent of the results of a database
query), subsequent analysis can proceed much more quickly.
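The flat-file approach described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script referred to in the text: the file naming scheme (`tweets_YYYY-MM-DD_HH.json.gz`), the one-tweet-per-line layout, and the function names `files_in_range` and `tweets_from_country` are all hypothetical. The `place.country_code` field follows Twitter's JSON layout as I understand it.

```python
import gzip
import json
import os
import tempfile
from datetime import datetime


def files_in_range(folder, start, end):
    """Yield paths whose filename timestamp falls within [start, end].

    Assumes hypothetical names such as tweets_2013-08-26_14.json.gz.
    """
    for name in sorted(os.listdir(folder)):
        if not (name.startswith("tweets_") and name.endswith(".json.gz")):
            continue
        stamp = name[len("tweets_"):-len(".json.gz")]
        when = datetime.strptime(stamp, "%Y-%m-%d_%H")
        if start <= when <= end:
            yield os.path.join(folder, name)


def tweets_from_country(folder, country_code, start, end):
    """Return tweets whose place.country_code matches, one file at a time."""
    matches = []
    for path in files_in_range(folder, start, end):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                tweet = json.loads(line)
                place = tweet.get("place") or {}  # place is often null
                if place.get("country_code") == country_code:
                    matches.append(tweet)
    return matches


# Tiny demonstration with two made-up hourly files.
demo_dir = tempfile.mkdtemp()
samples = {
    "tweets_2013-08-26_14.json.gz": [
        {"id": 1, "place": {"country_code": "EG"}},
        {"id": 2, "place": None},
    ],
    "tweets_2014-01-01_00.json.gz": [
        {"id": 3, "place": {"country_code": "EG"}},
    ],
}
for name, rows in samples.items():
    with gzip.open(os.path.join(demo_dir, name), "wt", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")

found = tweets_from_country(demo_dir, "EG",
                            datetime(2013, 8, 1), datetime(2013, 9, 1))
print([t["id"] for t in found])
```

Because the date is read from the filename, files outside the window are skipped without being opened, which is what keeps a full scan rare.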
4.1 Clean
4.1.1 Tweets from Sifter
I purchased tweets from 21 accounts in Egypt and Bahrain
over a three-month period in early 2011; the accounts generated
55,849 tweets in that period. The raw tweets from Egypt are here
tweets <- NULL # Empty object that tweets will feed into
# Download tweets
for (i in seq_along(chunkedTweets)) {
  print(paste('On cycle', i, 'of', length(chunkedTweets), sep = ' ')) # Status tracker
  temp <- twListToDF(lookup_statuses(ids = chunkedTweets[[i]]))
  tweets <- rbind(tweets, temp)
  print('Pausing')
  # Seconds to pause so as not to trip the rate limit. The pause can be
  # commented out in this loop because downloading 3,200 tweets will never
  # exceed the rate limit: (60 requests * 100 tweets per request) > 3,200 tweets
  Sys.sleep(delay)
}
write.csv(tweets, 'Sifter_Twitter_IDs_Downloadedtweets.csv')
The full script takes you through steps to verify your account
with Twitter, load the identification numbers, and calculate the
length of the delay based on the rate limit. Because the process
is no different than downloading tweets not acquired through
Sifter, I have not created a specific Python script for this pro-
cess; use the Python script for tweets based on their identifica-
tion number.
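The chunking and delay calculation can be sketched in a few lines of Python. This is a rough illustration rather than the full script: the figures of 100 identification numbers and 60 requests come from the comment in the listing above, the 15-minute rate-limit window is an assumption, and the function names `chunk_ids` and `delay_seconds` are hypothetical.

```python
def chunk_ids(tweet_ids, size=100):
    """Split a list of tweet IDs into consecutive chunks of at most `size`."""
    return [tweet_ids[i:i + size] for i in range(0, len(tweet_ids), size)]


def delay_seconds(requests_per_window, window_seconds=15 * 60):
    """Seconds to wait between requests so the allowance is never exceeded."""
    return window_seconds / requests_per_window


ids = list(range(1, 3201))   # 3,200 made-up identification numbers
chunks = chunk_ids(ids)      # 100 IDs per request
delay = delay_seconds(60)    # assumed limit: 60 requests per 15 minutes
print(len(chunks), delay)
```

With 3,200 identification numbers the loop makes 32 requests, fewer than the assumed allowance of 60, which is why the pause is unnecessary for a download of that size.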
Once the tweets are downloaded and put into a data frame,
they require further processing to integrate with the scripts
developed to analyze Sifter data. This R script
(www.cambridge.org/download_file/949137) modifies column names; assigns
each tweet to a country; adds variables for hashtags, user mentions,
retweets, and the country of the tweet author; and adjusts
the time to local time. It then saves the data as
Tweets_Dataframe_Twitter_IDs_Downloadedtweets_Cleaned.csv.
This Python script does the same (www.cambridge
[Figure: Tweets with at Least 1 Hashtag, by Hour. Legend: Country (Bahrain, Egypt).]
[Figure: Percent of Tweets with at Least 1 Hashtag, by Hour. Legend: Country (Bahrain, Egypt).]
[Figure: Average Hashtags per Tweet, by Hour. Legend: Country (Bahrain, Egypt).]
[Figure: y-axis label not recoverable; x-axis: Hour. Legend: Country (Bahrain, Egypt).]
Note that this script will find the word both when it is a hashtag (“Let’s
go #protest”) and when it stands on its own (“Let’s go protest”). The script
then aggregates and plots these words. Figure 5 shows the result.
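A keyword match of this kind can be sketched in Python with a regular expression. This is a sketch under assumptions, not the script discussed above: the function name `contains_keyword` is hypothetical, and I assume the match should be case-insensitive and limited to whole words.

```python
import re


def contains_keyword(text, keyword):
    """True if `keyword` appears as a whole word, with or without a leading '#'."""
    pattern = r"(?<!\w)#?" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


examples = ["Let's go #protest", "Let's go protest", "Protesters are gathering"]
flags = [contains_keyword(t, "protest") for t in examples]
print(flags)
```

The first two examples match and the third does not, because “Protesters” only contains the keyword as part of a longer word. Dropping the final `(?!\w)` would count such longer words as matches too.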
Finding a tweet with a link is a similar process to finding
a keyword. Instead of looking for a whole word, however, you can
look for “http://”. The key line then becomes:
[Figure 5. Keywords: Percent of Tweets, by Hour. Legend: Keyword (Protest, Police, jan25, feb14, Egypt, Bahrain); Country (Bahrain, Egypt).]
[Figure: Percent of Tweets, by Hour. Legend: Country (Bahrain, Egypt).]
[Figure: Percent of Tweets, by Hour. Legend: Retweet Method (Native Retweet, Read Tweet for ‘RT’); Country (Bahrain, Egypt).]
[Figure: Percent of Tweets, by Hour. Legend: Language (English, Arabic); Country (Bahrain, Egypt).]
library(xtable)
xtable(as.table(sort(table(data$source), decreasing = TRUE))) # Outputs LaTeX table

desktop <- c('web', 'Choqok', 'TweetDeck', 'HootSuite', 'Ping.fm') # The sources most likely to be from a desktop computer
data$desktop <- ifelse(data$source %in% desktop, 1, 0)
data$mobile <- ifelse(!(data$source %in% desktop), 1, 0)
Tweet source                   Number of tweets
Choqok                                    22285
web                                       17424
Twitter for BlackBerry®                    5142
Twitter for iPhone                         4157
Gravity                                    2003
ÜberSocial                                 1977
Facebook                                    741
HootSuite                                   614
TweetDeck                                   353
Snaptu                                      344
twitterfeed                                 282
Ping.fm                                     178
Tweet Button                                148
Samsung Mobile                               45
Twitter for iPad                             32
Google                                       27
oauth:173069                                 21
Mobile Web                                   17
harassmap.org                                13
Bambuser                                     11
TwitLonger Beta                              10
Yfrog                                         8
TweetMeme                                     5
Twitpic                                       4
The BOBs                                      3
See Who Viewed Your Profile                   2
My Tweet Lovers                               1
oauth:3294                                    1
StumbleUpon iPhone                            1
[Figure: Percent of Tweets, by Hour. Legend: Country (Bahrain, Egypt); Source (Desktop, Mobile).]
[Figure: three panels showing Unique Events, Unique Actors, and Unique Locations per day, 2011-02-13 to 2011-03-12. Legend: Source (ICEWS, Twitter).]
I compare them. The tweets record events across many more loca-
tions than ICEWS, an average of 20 per day versus three for ICEWS.
For example, the tweets record clashes in suburbs such as Duraz or
outlying cities such as Sanabis, and there are two reports of nerve
gas used against protestors. Within Manama, clashes are recorded
at the airport, Dana Mall, and Bahrain University, among other
places too precise for ICEWS to reference. Other locations include
an activist’s home and Sitra, Bahrain’s seventh largest city.
The tweets also record more actors than ICEWS, an average of 25
they remain understudied. The most common way links are ana-
lyzed is for researchers to note the percentage of tweets containing
them (Suh et al., 2010; Steinert-Threlkeld, 2017a), but I am aware of
no articles that analyze the content of those links. There are two
reasons links remain understudied.
First, spam accounts tweet links. The presence of a link, in
conjunction with age of an account and the use of trending topics,
is a common way researchers identify spam accounts (Kwak et al.,
2010). While bots may represent 6% to 8.4% of all accounts, it is
unknown what percentage of links they share (Lotan et al., 2011); it
is probably a larger share. (Not all bots are spam accounts, but I am not
aware of any work which manually identifies spam accounts.)
Astroturf political campaigns commonly use authentic-looking
accounts controlled by a political operation to share one or a few
links, creating the appearance of a grassroots concern where none
exists (Mustafaraj and Metaxas, 2010; Ratkiewicz et al., 2011).
Without a reliable, precise spam filter, a researcher studying links
risks studying spam.
Second, studying links requires additional processing work. When
Twitter delivers a tweet with a link, it extracts the link for the down-
loader. It does not, however, deliver the content contained at the
link. Theoretically, you could estimate the link’s content by reading
the URL, as newspaper links often contain the headline. Twitter,
however, automatically shortens links; while useful to the tweet
creator, the shortening means information the full URL contains is
removed, and the researcher has to follow the link to the webpage.
A researcher interested in link content therefore has to build a web
crawling system on top of the one connecting to Twitter.
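The two steps described above, expanding a shortened link and then reading the headline out of the full URL, can be sketched in Python with the standard library. This is a minimal illustration under assumptions: the function names `expand_url` and `headline_guess` are hypothetical, the example URLs are made up, and real newspaper URLs vary enough that the headline guess will often fail.

```python
import urllib.request
from urllib.parse import urlparse


def expand_url(short_url, timeout=10):
    """Follow redirects and return the final URL (requires network access)."""
    with urllib.request.urlopen(short_url, timeout=timeout) as response:
        return response.geturl()


def headline_guess(full_url):
    """Guess headline words from the last path segment of an expanded URL."""
    path = urlparse(full_url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1]
    if "." in slug:
        slug = slug.rsplit(".", 1)[0]  # drop an extension such as .html
    words = [w for w in slug.replace("_", "-").split("-") if w]
    return " ".join(words)


# Made-up URLs for illustration; expand_url is not called here because it
# needs a live connection.
print(headline_guess("https://news.example.com/2011/02/14/protesters-clash-with-police.html"))
print(headline_guess("https://t.co/abc123"))
```

The second call shows why shortening is costly for the researcher: the shortened form carries no words from the headline, so the link must be followed before anything can be learned from it.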
The most compelling reason to study links is that they often
point to images, and image analysis is a new frontier of machine
learning. Twitter is traditionally text focused, but imagery is
becoming the most popular content shared on social networks.
Images are commonly shared on Twitter, including images con-
taining more than 140 characters of text. I am aware of no study
which quantifies what percentage of tweets contain imagery,
though a few studies have analyzed images on Twitter. Kaneko
First, you can consult the metadata Twitter provides with a tweet.
These data include how many followers the author has, how many
accounts the author follows, when the account was created, the
account’s default language, the language of the tweet, and if the
tweet has GPS coordinates, among others. The user’s self-reported
location is provided in approximately 50% of tweets (Leetaru and
Schrodt, 2013), and individuals commonly make their screen name
the same as, or similar to, their real name (Barberá et al., 2015a).
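As a rough illustration of consulting this metadata, the Python sketch below pulls the fields just listed out of one tweet. The tweet is made up, the function name `account_metadata` is hypothetical, and the field names (`followers_count`, `friends_count`, `created_at`, `lang`, `coordinates`, `location`) follow Twitter's JSON layout as I understand it; check them against the tweets you actually download.

```python
def account_metadata(tweet):
    """Pull a few descriptive fields out of one tweet dictionary."""
    user = tweet.get("user", {})
    return {
        "followers": user.get("followers_count"),
        "following": user.get("friends_count"),
        "account_created": user.get("created_at"),
        "account_language": user.get("lang"),
        "tweet_language": tweet.get("lang"),
        "has_gps": tweet.get("coordinates") is not None,
        # An empty profile location is treated as missing.
        "self_reported_location": user.get("location") or None,
    }


# Made-up tweet for illustration only.
example = {
    "lang": "ar",
    "coordinates": None,
    "user": {
        "followers_count": 120,
        "friends_count": 80,
        "created_at": "Mon Jan 10 09:00:00 +0000 2011",
        "lang": "en",
        "location": "",
    },
}
summary = account_metadata(example)
print(summary)
```

Using `.get` throughout means a tweet that lacks a field yields `None` rather than an error, which matters when fields are present for some tweets and absent for others.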
Second, you could also interview the individuals behind an
account. Two difficulties arise. First, messaging an account
requires that the recipient follows the sending account. In other
words, survey respondents have to first follow the survey adminis-
trator, which is unrealistic. Second, interviewing would require
Institutional Review Board (IRB) approval. A more expeditious
approach is to interview people and ask if they use Twitter
(Tufekci and Wilson, 2012; Zickuhr, 2013).
Third, accounts’ tweeting patterns and social network can reveal
information not in tweets’ text. For example, accounts belonging to
unemployed individuals tweet more during the day, and
cities with higher levels of unemployment have low communica-
tion entropy (Llorente et al., 2014). The style of tweets gives some
indication of an account’s age (Nguyen et al., 2013). Social class,
ethnicity, and education can be estimated probabilistically when
tweets contain GPS coordinates (Malik et al., 2015). A user’s social
network is also more predictive of that user’s age and political
affiliation than relying on just user attributes (Zamal et al., 2012).
These techniques are compelling, but there exist no R or Python
packages that implement them automatically, increasing the costs
of their use. A researcher may have to reinvent the wheel each time;
the need for such data should be great before time is invested in
this approach.
Finally, you can manually inspect each account in a sample.
By viewing a profile’s photo, gender may be obvious, and an age
range could be created. An account’s past tweets can give an indica-
tion of the author’s primary location and interests. Googling the
account name may reveal other sites where the author has
after 2010 (Twitter was founded in 2006). The appeal of Twitter, and
“big data” more broadly, is that it provides data on more people in
more places across more time than scholars could realistically hope
to achieve with survey methods. For example, it has long been known
that individuals’ happiness is lowest during the middle of the day,
and lower during the week than on weekends; the cost of acquiring
data meant these results were only tested in WEIRD (Western, edu-
cated, industrialized, rich, and democratic) countries (Henrich et al.,
2010), but Twitter reveals that the pattern holds in at least 84 countries
(Golder and Macy, 2011). Countries also experience similar changes
in their total happiness, though baseline levels vary (Poblete et al.,
2011). Because people use social media to talk about topics such as
health, these platforms can also be used to monitor public health and
identify individuals susceptible to treatment (Charles-Smith et al.,
2015).
In political science, Twitter has been used in two main areas of
research: conflict dynamics and public opinion. For an early review
of social media and social movements, see the 2013 special issue of
American Behavioral Scientist called “New Media and Social Unrest”
(Tufekci and Freelon, 2013). For a review of the potential and
challenges of using online social network platforms for research,
see Golder and Macy (2014). For a series of essays on the role of big
data in the social sciences, see the 2015 symposium in PS: Political
Science and Politics titled “Big Data, Causal Inference, and Formal
Theory: Contradictory Trends in Political Science?”
(www.cambridge.org/core/journals/ps-political-science-and-politics/issue/F71EE285BFB51E27DCE368E94D5A0F8B).
Thomas Zeitzoff has three papers that show how Twitter can
generate new insights into conflict and foreign policy. Because the
costs of posting on Twitter are much lower than for publishing in
a newspaper or broadcasting on television, individuals and small
organizations have become sources of events data. Zeitzoff (2011)
combines @AJGaza (an Al-Jazeera Twitter account) and
@QassamCount (a record of rocket attacks into Israel) tweets
with blog reports and a Wikipedia event timeline to examine the
microdynamics of Israel’s 2009 war in Gaza, finding that:
users and their 3,200 previous tweets, a classifier for pro- or anti-
military tweets, and known secular or Islamist accounts, the
authors show that few users, never more than 3% on any
given day, express views that contradict their previous preferences
as expressed on Twitter (Borge-Holthoefer et al., 2015). Twitter can
be used to measure pro-Ukraine and pro-Russia sentiment from
the start of the protests in Ukraine through its civil war;
because Twitter polling does not require enumerators, it is espe-
cially useful in violent areas of the world such as Ukraine’s Luhansk
and Donetsk oblasts (Driscoll and Steinert-Threlkeld, 2017).
The failure of voter turnout models used to predict the results of
the presidential election in the United States in 2016 suggests that
many voters are also difficult to survey. If the election result sug-
gests that traditional institutions, such as the political party and
media, are not as influential over individuals as previously
believed, then data sources which provide direct access to those
individuals may become an important source of polling.
Republican nominee Donald Trump’s personal use of Twitter
also defied expectations and galvanized new political actors. That
these actors may misrepresent themselves to pollsters, if they
respond in the first place, but may exhibit more candor online
suggests that Twitter may have more relevance for understanding
American political behavior than previously thought.
The presidential election also saw the first use of “fake news” on
social media platforms, especially Facebook. (“Fake news” refers to
articles that appear to detail actual events but are designed instead
to generate internet traffic against which advertisements can be
sold. They generate traffic with sensationalist headlines.) Because
the behavior is so novel, there does not yet exist a method for
automatically detecting fake news; in response to post-election
backlash, Facebook’s solution is to flag questionable articles for
independent organizations to verify manually.
While your initial desire may be to remove tweets containing
fake news from a dataset, they should be preserved since their
presence presents the opportunity to study information flow in
all its forms. Just as social media may provide insight into difficult
5.2 Competitors
Twitter is unique in its global reach and data availability; it is
probably the most studied social network. As of September 21,
2016, a Google Scholar search for “twitter” returns 6,370,000
items, “facebook” 5,390,000, and “instagram” 172,000. Other com-
mon platforms are Tumblr (owned by Yahoo), reddit, and Sina
Weibo. For a history of social networks and the internet, see Bury
et al. (2013).
Founded in 2004 for elite US undergraduates, Facebook opened
itself to anyone over the age of 13 on September 26, 2006. It is now
a major corporation with global market penetration; with over
1 billion users, it is the most popular social network and one of
the internet’s most visited sites (Bhatia, 2016; Solon, 2016).
Facebook was the first social network platform to reach such
a large audience, and it quickly drew attention from academics of
all disciplines. The first article about it, discussing privacy con-
cerns, appeared in 2005 and analyzed information sharing beha-
viors of college students (Jones and Soltren, 2005). The ability to
observe social connections across large groups of people has
drawn the interest of network scientists (Lazer et al., 2009; Gjoka
et al., 2010; Ferrara, 2012), physicists and computer scientists
(Catanese et al., 2011; Ugander et al., 2011), social scientists
6 Discussion
I conclude by discussing non-programmatic aspects of Twitter.
The first section discusses the types of data that tweets do not
provide and the limitations thus imposed on analysis. Section 6.2
raises potential ethical concerns of using Twitter data, especially as
they relate to minors. I then use user behaviors on Twitter to argue that it
has features of both a media platform, like newspapers or televi-
sion, and a social network.
6.1 Limitations
While the works referred to above that are skeptical of Twitter
enumerate shortcomings of the platform’s data, it is worth empha-
sizing them here as well. The simplest way to summarize these
shortcomings is that individual tweets contain little information.
The main reason individual tweets have little information is
that they are limited to 140 characters, 20 characters fewer
than a text message. (As of November 2017, Twitter is transitioning
to 280 characters per tweet.) As any quick perusal of Twitter
reveals, this restriction leads to frequent use of abbreviations; it is
also common for a tweet to be a comment on a link shared in the
tweet.
What tweets lack in information they make up for in quantity.
One tweet may be about football, another music, and another
a political candidate; individually, they are not interesting, but
aggregated, they reveal interesting patterns about what topics are
salient to a given group of people and how that saliency varies by
place and time.
One of the drawbacks of Twitter for researchers is one of its appeals
for users: anonymity. Registration is free, and registrants can make
their screen name any word or phrase they want. Unlike Facebook,
then, where you are asked for your first and last name, Twitter will let
you appear to the world as “Zachary Steinert-Threlkeld” or “Brown
Curtain”. While many users choose a screen name that is their name,
most do not. Moreover, Twitter does not ask the users their age,
There are two idiosyncrasies to working with tweet text. First, the
140-character limit means tweets tend to concern themselves
with one topic, simplifying analysis. This limit pushes users
towards abbreviations and slang, however, which complicate language
processing. Second, tweet style is bimodal, with many tweets,
perhaps a majority, using abbreviations and slang. Existing
corpora used for dictionary approaches do not include slang, and
the idiosyncratic nature of slang means unsupervised approaches
are more likely to assign tweets about the same topic to different
topics. For a more detailed discussion on natural language proces-
sing and Twitter, see Sriram et al. (2010) and Han and Baldwin
(2011). For a thorough introduction to natural language processing
more broadly, see Manning and Schütze (1999).
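The mismatch between slang and standard-English dictionaries can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `SLANG` lookup table, the `PROTEST_TERMS` dictionary, and the example tweet are all hypothetical, and a real project would need a curated normalization resource rather than a four-entry table.

```python
import re

# Hypothetical slang lookup; not taken from any published resource.
SLANG = {"u": "you", "2moro": "tomorrow", "ppl": "people", "govt": "government"}

# Hypothetical topic dictionary containing only standard-English terms.
PROTEST_TERMS = {"protest", "march", "government", "people"}


def normalize(text):
    """Lowercase, tokenize, and replace known slang tokens."""
    tokens = re.findall(r"[#@]?\w+", text.lower())
    return [SLANG.get(tok, tok) for tok in tokens]


def dictionary_hits(text, dictionary):
    """Count tokens that match the dictionary after slang normalization."""
    return sum(1 for tok in normalize(text) if tok in dictionary)


raw = "ppl will march on the govt 2moro"
# Without normalization, only tokens already in standard English match.
hits_raw = sum(1 for tok in raw.lower().split() if tok in PROTEST_TERMS)
hits_normalized = dictionary_hits(raw, PROTEST_TERMS)
```

In this toy case the dictionary matches more tokens after normalization than before, which is the sense in which slang can hide a tweet's topic from a dictionary approach.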
6.2 Ethics
Ethical concerns around the use of Twitter data flow from the
scarcity of information in tweets and user profiles.
Because Twitter requires no identifying information to register,
it is possible that tweets in a dataset are from children. Because
Twitter is a common marketing and branding tool, many products
exist which will estimate the demographic age of an account’s
followers, but age-verification products are expensive on an
academic’s budget. (Twitter also allows brands to require potential
followers to confirm their age before they are allowed to follow.)
It is also unknown what percentage of users on Twitter are under 18;
Pew, which conducts an annual survey of social media usage in
America, does not interview minors. While there are studies that
estimate demographic characteristics from the behavior of
a Twitter account (Nguyen et al., 2013; Sloan et al., 2015; Sloan
and Morgan, 2015), no solution is trivial to implement,
and I am not aware of any academic Twitter study that attempts
to remove minors (or other protected categories) from its
sample.
The researcher must also be careful to respect users’ desire to
delete tweets. If a user deletes a specific tweet, the streaming API
will deliver a JSON message with the tweet ID of the now deleted
tweet. It is incumbent on the person or group connected to the API
to delete the tweet from their data. Twitter is unclear about
whether or not the streaming API provides the identification num-
ber of all deleted tweets or only a sample of them. Twitter will not
make a deleted tweet available via the REST API.
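A minimal sketch of honoring deletion notices while consuming the stream follows. It assumes that each line from the stream is one JSON message and that a deletion notice has the shape `{"delete": {"status": {"id": ...}}}`; that shape should be checked against Twitter's current documentation. The function name `apply_stream_message`, the in-memory `store`, and the sample messages are hypothetical.

```python
import json


def apply_stream_message(store, raw_line):
    """Update `store` (dict mapping tweet ID to tweet) with one stream line."""
    line = raw_line.strip()
    if not line:
        # Blank keep-alive lines carry no data.
        return store
    message = json.loads(line)
    if "delete" in message:
        # Assumed shape of a deletion notice: {"delete": {"status": {"id": ...}}}
        deleted_id = message["delete"]["status"]["id"]
        # pop() with a default does nothing if the tweet was never stored.
        store.pop(deleted_id, None)
    elif "id" in message and "text" in message:
        store[message["id"]] = message
    return store


# Invented sample of stream lines, including one deletion notice.
stream = [
    '{"id": 1, "text": "first tweet"}',
    '{"id": 2, "text": "second tweet"}',
    "",
    '{"delete": {"status": {"id": 1, "user_id": 42}}}',
]

store = {}
for line in stream:
    apply_stream_message(store, line)
remaining_ids = sorted(store)
```

A dataset written to disk would need the same logic applied to the stored files, since a deletion notice can arrive long after the original tweet was saved.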
IRBs have not established a common standard for the treatment of
Twitter data. When tweets are publicly available and researchers are
not conducting interventions, there is no prima facie reason studies
should need IRB approval. Twitter’s public nature has not stopped
IRBs from expressing caution about using its data (Halavais, 2011;
Hayden, 2013). For a project where my colleague interviewed activists
in Egypt and I examined their Twitter behavior, my university’s IRB
had to approve the fieldwork (Fowler and Steinert-Threlkeld, 2016).
When our IRB application mentioned Twitter, the board asked for more
detail, though Twitter’s public nature mollified it.
IRBs’ approach to minimally invasive research, such as down-
loading public data from Twitter, is in flux. On January 19, 2017,
United States Government agencies in charge of protecting human
subjects issued new guidelines for research that will take effect
in January 2018. These guidelines create new exempt categories
that require minimal IRB review, and research under these cate-
gories does not require continuing review. One of these exempt
categories is the “observation and recording of verbal and non-
verbal behavior in schools and public places”, under which obser-
vational Twitter research should fall (Shweder and Nisbett, 2017).
How institutions interpret these rules and by how much they lower
the administrative cost of research remain to be seen.
While IRBs appear to have adopted an appropriate attitude to
observational data from Twitter, there is growing interest in conduct-
ing experiments (Coppock et al., 2016; Munger, 2016). Procedures to
protect research subjects on Twitter appear to be the same as those
for offline experiments. For example, the replication data for Munger
(2016) is anonymized and aggregated to the account level so that the
accounts targeted with messages cannot be identified. Coppock,
Guess and Ternovski (2016) similarly do not share individual tweets,
are not followed by any of the accounts they follow, suggesting that
these people use Twitter more to gather information than to
engage socially. Twitter has a short diameter (4.12 average links
between each account and all other accounts), which the authors
interpret as support for the broadcast side of Twitter. As in social
networks, users exhibit homophily, in this case with respect to their
number of followers and time zone. Finally, ranking accounts by
their number of followers, their PageRank, and the number of
times they are retweeted shows that the top 20 accounts in each
tend to be news organizations or celebrities.
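Two of the quantities in this paragraph, the share of accounts whose follows are never reciprocated and the average number of links between accounts, can be computed on a follower graph as sketched below. This is an illustration under stated assumptions and not the cited study's procedure: the toy graph, the account names, and the function names are invented, and distances are averaged only over pairs that are reachable along follow links.

```python
from collections import deque

# Invented toy graph: each account maps to the set of accounts it follows.
follows = {
    "ann": {"bob", "news"},
    "bob": {"ann", "news"},
    "cat": {"news"},
    "dan": {"cat"},
    "news": set(),
}


def share_unreciprocated(follows):
    """Share of accounts following someone that none of those accounts follow back."""
    active = [u for u, out in follows.items() if out]
    alone = [
        u for u in active
        if not any(u in follows.get(v, set()) for v in follows[u])
    ]
    return len(alone) / len(active)


def average_path_length(follows):
    """Mean shortest-path length over reachable ordered pairs, via BFS."""
    total, pairs = 0, 0
    for source in follows:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in follows.get(u, set()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs


unreciprocated = share_unreciprocated(follows)
avg_links = average_path_length(follows)
```

On a graph the size of Twitter, a full breadth-first search from every account is expensive, so a sample of source accounts would be a more practical starting point.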
Though news organizations and celebrities dominate in terms of
followers and retweets, that does not mean they dominate on other
dimensions. For example, Steinert-Threlkeld (2017b) finds that
those with the most followers did not drive protest mobilization
during the Arab Spring. Even though those accounts will tweet
about an upcoming or ongoing protest, the need for a critical
mass of protestors means that it is the use of hashtags by those
not at the top of the follower distribution that correlates with
subsequent protest mobilization (Marwell et al., 1988).
This result is in line with Barberá et al. (2015a). They find that
communication around the 2014 Academy Awards and raising the
United States’ minimum wage resembles a broadcast network,
while that for collective action has the same network dynamics
identified in Steinert-Threlkeld (2017a). Gonzalez-Bailon et al.
(2013) find four major types of Twitter users. Two of them – broad-
casters (follow many fewer accounts than follow them, mentioned
infrequently) and influentials (follow many fewer accounts than
follow them, mentioned frequently) – are consistent with a media
platform. The other two – common users (follow many more
accounts than follow them, mentioned infrequently) and hidden
influentials (follow many more accounts than follow them, men-
tioned frequently) – are consistent with a social network. Finally,
a study of 1.8 billion tweets from four months in 2014 finds that
only 0.8% of tweets are from news organizations, though some
topics have up to 15% of their tweets coming from news organiza-
tions (Malik and Pfeffer, 2016). Individuals also commonly use
6 Facebook and Instagram are great sources of data, if you can convince them to
work with you and are willing to risk their veto power.
Glossary
References
Chang, Yi, Lei Tang, Yoshiyuki Inagaki, and Yan Liu (2014). “What is
Tumblr: A Statistical Overview and Comparison.” SIGKDD
Explorations 16(1), 21–30.
Charles-Smith, Lauren E., Tera L. Reynolds, Mark A. Cameron, Mike
Conway, Eric H. Y. Lau, Jennifer M. Olsen, Julie A. Pavlin, Mika
Shigematsu, Laura C. Streichert, Katie J. Suda, and Courtney D.
Corley (2015). “Using Social Media for Actionable Disease
Surveillance and Outbreak Management: A Systematic Literature
Review.” PLOS One 10(10), e0139701.
Cheng, Zhiyuan, James Caverlee, and Kyumin Lee (2010). “You Are
Where You Tweet: A Content-Based Approach to Geo-locating
Twitter Users.” In ACM International Conference on Information and
Knowledge Management. Toronto.
Christia, Fotini, Leon Yao, Stephen Wittels, and Jure Leskovec (2015).
“Yemen Calling: Seven Things Cell Data Reveal about Life in the
Republic.” Foreign Affairs. www.foreignaffairs.com/articles/yemen/
2015-07-06/yemen-calling.
Conover, M.D., J. Ratkiewicz, M. Francisco, B. Goncalves, A. Flammini,
and F. Menczer (2011). “Political Polarization on Twitter.” In Fifth
International AAAI Conference on Weblogs and Social Media, pp.
89–96.
Conover, Michael D., Bruno Gonçalves, Alessandro Flammini and Filippo
Menczer (2012). “Partisan Asymmetries in Online Political Activity.”
EPJ Data Science 1(1), 1–19.
Conover, Michael D, Clayton Davis, Emilio Ferrara, Karissa McKelvey,
Filippo Menczer, and Alessandro Flammini (2013). “The Geospatial
Characteristics of a Social Movement Communication Network.” PLOS
One 8(3), e55957.
Coppock, Alexander, Andrew Guess, and John Ternovski (2016). “When
Treatments are Tweets: A Network Mobilization Experiment over
Twitter.” Political Behavior 38(1), 105–128. http://dx.doi.org/10.1007/
s11109-015-9308-6.
Dalton, Russell J., Steven Greene, Paul Allen Beck, and Robert Huckfeldt
(2002). “The Social Calculus of Voting: Interpersonal, Media, and
Organizational Influences on Presidential Choices.” The American
Political Science Review 96(1), 57–73.
Davenport, Christian and Patrick Ball (2002). “Views to a Kill: Exploring
the Implications of Source Selection in the Case of Guatemalan State
Terror, 1977–1995.” Journal of Conflict Resolution 46(3), 427–450.
Diaz, Fernando, Michael Gamon, Jake Hofman, Emre Kiciman, and David
Rothschild (2016). “Online and Social Media Data as a Flawed
Continuous Panel Survey.” PLOS One 11(1), e0145406.
Dodds, Peter Sheridan, Kameron Decker Harris, Isabel M. Kloumann,
Catherine A. Bliss, and Christopher M. Danforth (2011). “Temporal
Patterns of Happiness and Information in a Global Social Network:
Hedonometrics and Twitter.” PLOS One 6(12), e26752.
Douglass, Rex W., David A. Meyer, Megha Ram, David Rideout, and
Dongjin Song (2015). “High Resolution Population Estimates from
Telecommunications Data.” EPJ Data Science 4(1), 4.
Dowle, Matt, T Short, S Lianoglou, and A Srinivasan (2015). “data.table:
Extension of data.frame.” https://cran.r-project.org/web/packages/
data.table/index.html.
Driscoll, Jesse and Zachary C. Steinert-Threlkeld (2017). “Structure,
Agency, Hegemony, and Action: Ukrainian Nationalism in East
Ukraine.” Working paper.
Dunbar, R. I. M. (2011). “Constraints on the Evolution of Social Institutions
and Their Implications for Information Flow.” Journal of Institutional
Economics 7(03), 345–371. www.journals.cambridge.org/
abstracES1744137410000366.
Dunbar, R. I. M. (1995). “Neocortex Size and Group Size In Primates:
A Test of the Hypothesis.” Journal of Human Evolution 28(3), 287–296.
Dunbar, R.I.M., Valerio Arnaboldi, Marco Conti, and Andrea Passarella
(2015). “The Structure of Online Social Networks Mirrors Those in the
Offline World.” Social Networks 43: 39–47.
Eubank, Nicholas (2016). “Social Networks and the Political Salience of
Ethnicity.” Working paper.
Evans, Heather K., Victoria Cordova, and Savannah Sipole (2014).
“Twitter Style: An Analysis of How House Candidates Used Twitter
in Their 2012 Campaigns.” PS: Political Science & Politics 47(02),
454–462.
Farrell, Henry (2012). “The Consequences of the Internet for Politics.”
Annual Review of Political Science 15(1), 35–52.
Ferrara, Emilio (2012). “A Large-Scale Community Structure Analysis in
Facebook.” EPJ Data Science 1(9), 1–30.
Ferrara, Emilio and Alessandro Bessi (2016). “Social Bots Distort the 2016
US Presidential Election Online Discussion.” First Monday 21(11), 1–17.
Ferrara, Emilio, Onur Varol, Clayton Davis, Filippo Menczer, and
Alessandro Flammini (2016a). “BotOrNot: A System to Evaluate Social
Kallus, Nathan (2013). “Predicting Crowd Behavior with Big Public Data.”
In 23rd International Conference on World Wide Web.
Kalyvas, Stathis N (2004). The Urban Bias in Research on Civil Wars.
Vol. 13.
Kaneko, Takamu and Keiji Yanai (2013). “Visual Event Mining from Geo-
Tweet Photos.” In IEEE International Conference on Multimedia and
Expo Workshops, pp. 1–6.
King, Gary, Jennifer Pan, and Margaret E. Roberts (2014). “Reverse-
Engineering Censorship in China: Randomized Experimentation and
Participant Observation.” Science 345(6199), 1–10.
King, Gary, Jennifer Pan, and Margaret E. Roberts (2016). “How the
Chinese Government Fabricates Social Media Posts for Strategic
Distraction, not Engaged Argument.” http://gking.harvard.edu/50c?
platform=hootsuite.
Kramer, Adam D.I., Jamie E. Guillory, and Jeffrey T. Hancock (2014).
“Experimental evidence of massive-scale emotional contagion through
social networks.” Proceedings of the National Academy of Sciences
111(24), 8788–8790.
Kulshrestha, Juhi, Farshad Kooti, Ashkan Nikravesh, and Krishna P
Gummadi (2012). “Geographic Dissection of the Twitter Network.” In
Proceedings of the Sixth International AAAI Conference on Weblogs and
Social Media, pp. 202–209.
Kwak, Haewoon, Changhyun Lee, Hosung Park, and Sue Moon (2010).
“What Is Twitter, a Social Network or a News Media?” In
International World Wide Web Conference. Raleigh: ACM Press, pp.
591–600.
Lake, Ronald La Due and Robert Huckfeldt (1998). “Social Capital, Social
Networks, and Political Participation.” Political Psychology 19(3),
567–584.
Lakkaraju, Himabindu, Julian J. McAuley, and Jure Leskovec (2013).
“What’s in a Name? Understanding the Interplay between Titles,
Content, and Communities in Social Media.” In International
Conference on Web and Social Media.
Lang, Duncan Temple and the CRAN team (2016). RCurl: General
Network Client Interface for R. R package version 1.95-4.8. https://
CRAN.R-project.org/package=RCurl
Larson, Jennifer M., Jonathan Nagler, Jonathan Ronen, and Joshua A.
Tucker (2016). “Social Networks and Protest Participation: Evidence
from 130 Million Twitter Users.” Working paper.
Lazer, David, Devon Brewer, Nicholas Christakis, James Fowler, and Gary
King (2009). “Life in the Network: The Coming Age of Computational
Social Science.” Science 323(5915), 721–723.
Leetaru, Kalev H., Shaowen Wang, Guofeng Cao, Anand Padmanabhan,
and Eric Shook (2013). “Mapping the Global Twitter Heartbeat: The
Geography of Twitter.” First Monday 18(5–6), 1–33.
Leetaru, Kalev and Philip Schrodt (2013). “GDELT: Global Data on Events,
Language, and Tone, 1979–2012.” International Studies Association
Annual Conference.
Lewis, Kevin, Jason Kaufman, Marco Gonzalez, Andreas Wimmer, and
Nicholas Christakis (2008). “Tastes, Ties, and Time: A New Social
Network Dataset Using Facebook.com.” Social Networks 30(4), 330–
342. http://linkinghub.elsevier.com/retrieve/pii/S0378873308000385.
Lin, Chengfeng, Jianhua He, Yi Zhou, Xiaokang Yang, Kai Chen, and Li
Song (2013). “Analysis and Identification of Spamming Behaviors in
Sina Weibo Microblog.” In Proceedings of the 7th Workshop on Social
Network Mining and Analysis 13: 1–9.
Llorente, Alejandro, Manuel Garcia-Herranz, Manuel Cebrian, and
Esteban Moro (2014). “Social media fingerprints of unemployment.”
http://arxiv.org/abs/1411.3140.
Lotan, Gilad, Mike Ananny, Devin Gaffney, Danah Boyd, Ian Pearce, and
Erhardt Graeff (2011). “The Revolutions Were Tweeted: Information
Flows During the 2011 Tunisian and Egyptian Revolutions.”
International Journal of Communication 5: 1375–1406.
Lucas, Christopher, Richard A. Nielsen, Margaret E. Roberts, Brandon M.
Stewart, Alex Storer, and Dustin Tingley (2015). “Computer-Assisted
Text Analysis for Comparative Politics.” Political Analysis 23(2),
254–277.
Malik, Momin M., Constantine Nakos, Hemank Lamba, and Jürgen
Pfeffer (2015). “Population Bias in Geotagged Tweets.” In 9th
International AAAI Conference on Weblogs and Social Media. Oxford.
Malik, Momin M. and Jürgen Pfeffer (2016). “A Macroscopic Analysis of
News Content in Twitter.” Digital Journalism 0811(May), 1–25.
Manning, Christopher D. and Hinrich Schütze (1999). Foundations of
Statistical Natural Language Processing. Cambridge, MA: Massachusetts
Institute of Technology.
Marwell, Gerald, Pamela E. Oliver, and Ralph Prahl (1988). “Social
Networks and Collective Action: A Theory of the Critical Mass.”
American Journal of Sociology 94(3), 502–534.
Masad, David (2013). “Studying the Syrian Civil War with GDELT.” The
Monkey Cage. http://themonkeycage.org/2013/07/09/how-computers-
can-help-us-track-violent-conflicts-including-right-now-in-syria/.
McAdam, Doug (1986). “Recruitment to High-Risk Activism: The Case of
Freedom Summer.” American Journal of Sociology 92(1), 64–90.
McGrath, Ryan (2015). “twython.” https://twython.readthedocs.io/en/
latest/.
McKinney, Wes (2015). “pandas.” http://pandas.pydata.org/.
Metternich, Nils W., Cassy Dorff, Max Gallop, Simon Weschle, and
Michael D. Ward (2013). “Antigovernment Networks in Civil
Conflicts: How Network Structures Affect Conflictual Behavior.”
American Journal of Political Science 57(4).
Mislove, Alan, Sune Lehmann, Yong-Yeol Ahn, Jukka-Pekka Onnela, and
J. Niels Rosenquist (2011). “Understanding the Demographics of
Twitter Users.” In Proceedings of the Fifth International AAAI
Conference on the Weblogs and Social Media, pp. 554–557.
Mocanu, Delia, Andrea Baronchelli, Nicola Perra, Alessandro Vespignani,
Bruno Goncalves, and Qian Zhang (2013). “The Twitter of Babel:
Mapping World Languages through Microblogging Platforms.” PLOS
One 8(4), e61981.
Morstatter, Fred, Jürgen Pfeffer, Kathleen M. Carley, and Huan Liu (2013).
“Is the Sample Good Enough? Comparing Data from Twitter’s
Streaming API with Twitter’s Firehose.” In Association for the
Advancement of Artificial Intelligence.
Mueller, Andreas (2015). “scikit-learn.” http://scikit-learn.org/stable/.
Munger, Kevin (2016). “Tweetment Effects on the Tweeted:
Experimentally Reducing Racist Harassment.” Political Behavior, pp.
1–21.
Mustafaraj, E. and P. T. Metaxas (2010). “From Obscurity to Prominence in
Minutes: Political Speech and Real-Time Search.” In WebSci10:
Extending the Frontiers of Society On-Line. p. 317. http://repository
.wellesley.edu/computersciencefaculty/9/.
Nguyen, Dong, Rilana Gravel, Dolf Trieschnigg, and Theo Meder (2013).
“‘How Old Do You Think I Am?’: A Study of Language and Age in
Twitter.” Proceedings of the Seventh International AAAI Conference on
Weblogs and Social Media.
Nickerson, David W. (2008). “Is Voting Contagious? Evidence from Two
Field Experiments.” American Political Science Review 102(01),
49–57.