The power to predict outcomes based on Twitter data is greatly exaggerated, especially for political elections.
References
Asur, S. and Huberman, B.A. Predicting the future with social media. In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (Toronto, Aug. 31--Sept. 3). IEEE Computer Society, Los Alamitos, CA, 2010, 492--499.
Boiy, E., Hens, P., Deschacht, K., and Moens, M.F. Automatic sentiment analysis in online text. In Proceedings of the 2007 Conference on Electronic Publishing (Vienna, June 13--15). ÖKK Editions, Vienna, 2007, 349--360.
Choi, H. and Varian, H. Predicting the Present with Google Trends. Tech. Rep. Google, Inc., Mountain View, CA, 2009; http://google.com/googleblogs/pdfs/google_predicting_the_present.pdf
Choi, Y. and Cardie, C. Adapting a polarity lexicon using integer linear programming for domain-specific sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (Singapore, Aug. 6--7). Association for Computational Linguistics, Stroudsburg, PA, 2009, 590--598.
Keeter, S., Kiley, J., Christian, L., and Dimock, M. Perils of Polling in Election '08. Pew Internet and American Life Project, Washington, D.C., 2009; http://pewresearch.org/pubs/1266/polling-challenges-election-08-success-in-dealing-with
Kwak, H., Lee, C., Park, H., and Moon, S. What is Twitter, a social network or a news media? In Proceedings of the 19th International World Wide Web Conference (Raleigh, NC, Apr. 26--30). ACM Press, New York, 2010, 591--600.
Hughes, A.L. and Palen, L. Twitter adoption and use in mass convergence and emergency events. In Proceedings of the Sixth International Community on Information Systems for Crisis Response and Management Conference (Gothenburg, Sweden, May 10--13, 2009).
Lenhart, A. and Fox, S. Twitter and Status Updating. Pew Internet and American Life Project, Washington, D.C. 2009; http://www.pewinternet.org/Reports/2009/Twitter-and-status-updating.aspx
O'Connor, B., Balasubramanyan, R., Routledge, B.R., and Smith, N.A. From tweets to polls: linking text sentiment to public opinion time series. In Proceedings of the Fourth International Association for the Advancement of Artificial Intelligence Conference on Weblogs and Social Media (Washington, D.C., May 23--26). Association for the Advancement of Artificial Intelligence, Menlo Park, CA, 2010, 122--129.
Smith, A. and Rainie, L. The Internet and the 2008 Election. Pew Internet and American Life Project, Washington, D.C., 2008; http://www.pewinternet.org/Reports/2008/The-Internet-and-the-2008-Election.aspx
Tumasjan, A., Sprenger, T.O., Sandner, P.G., and Welpe, I.M. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In Proceedings of the Fourth International Association for the Advancement of Artificial Intelligence Conference on Weblogs and Social Media (Washington, D.C., May 23--26). Association for the Advancement of Artificial Intelligence, Menlo Park, CA, 2010, 178--185.
Turney, P.D. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (Philadelphia, July 6--12). Association for Computational Linguistics, Stroudsburg, PA, 2002, 417--424.
Wilson, T., Wiebe, J., and Hoffmann, P. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Human Language Technology Conference, Conference on Empirical Methods in Natural Language Processing (Vancouver, Canada, Oct. 6--8). Association for Computational Linguistics, Stroudsburg, PA, 2005, 347--354.
Yu, B., Kaufmann, S., and Diermeier, D. Exploring the characteristics of opinion expressions for political opinion classification. In Proceedings of the 2008 International Conference on Digital Government Research (Montréal, May 18--21). Digital Government Society of North America, Marina del Rey, CA, 2008, 82--91.
Twitter, due to its growth and ever-expanding user base, has become a gold mine of information for analysts who mine tweet content as a data source for gauging public opinion. Even The New York Times is discussing this phenomenon [1]. But what does a Literary Digest poll, conducted in 1936, have to do with the current practice of mining data on Twitter to make projections? Gayo-Avello discusses the dangers that may result when negative results from data extracted from Twitter are ignored, stating that "current research risks turning social media analytics into the next Literary Digest poll."
For those unfamiliar with this reference, it is the classic demonstration of how data bias can seriously skew poll results. The poll was conducted by the Literary Digest for the 1936 US presidential election. The magazine's own readers were its query source, and readers were asked which candidate they preferred: New Deal candidate Franklin Roosevelt, or Republican Alf Landon. The readers were reached through telephone numbers listed in nationwide directories and through a list of registered car owners. The poll concluded that Landon would win in a landslide. However, Roosevelt ultimately won with 61 percent of the popular vote. This poll became the quintessential example of the need to mine unbiased data as the source for reliable projections.
In the 2008 US presidential election, projections based on tweet data were heavily in favor of Barack Obama, even in states where he eventually lost heavily to John McCain. Gayo-Avello's study looks for reasons for the faulty projections. He states:
My aim was not to compare Twitter data with pre-election polls or with the popular vote, as had been done previously, but to obtain predictions on a state-by-state basis. Additionally, unlike the other studies, my predictions were not to be derived from aggregating Twitter data but by detecting voting intention for every single user from their individual tweets.
Gayo-Avello "applied four different sentiment-analysis methods described in the most recent literature and carefully evaluated their performance." He then demonstrated that the results for the 2008 US presidential election "could not have been predicted from Twitter data alone through commonly applied methods." A substantial bibliography lists all of the references Gayo-Avello used in his research.
Relying heavily on statistical research and evaluation, Gayo-Avello, in the "Election Twitter Data Set" section, first describes an analysis in which simple Twitter data was used to overestimate President Obama's victory in the 2008 US presidential election. He then hypothesizes that Twitter users constitute a sample, and most likely a biased one. The article continues, focusing on whether data extracted from Twitter can be used to reliably predict outcomes, both current and future. Gayo-Avello provides a series of statistical analyses, described through detailed text and tables, and highlights both errors and corrections in his research.
The data used in his study was collected from users' unprotected tweets viewed in Twitter's public timeline. These tweets are easily accessed and collected through Twitter's own application programming interface (API). For Gayo-Avello's study, tweets were collected shortly after the 2008 election using Twitter's API search function, picking up 100 tweets per candidate, per county, per day. Only one query was used per candidate (Obama or Biden for the Democratic ticket, McCain or Palin for the Republican ticket), and the query was limited to tweets published by US residents within a specified time interval. This first view counted the number of appearances of a candidate in a user's tweets, assuming that the candidate mentioned more often would be the one the user would later vote for. This view would ultimately prove to be wrong.
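The mention-counting heuristic described above can be sketched in a few lines. This is a minimal illustration, not Gayo-Avello's actual code; the sample tweets and keyword lists are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical sample of collected tweets: (user, text) pairs.
tweets = [
    ("alice", "Obama gave a great speech tonight"),
    ("alice", "Watching Obama and Biden at the rally"),
    ("bob", "McCain has my vote and Palin too"),
]

# Map each ticket to the candidate keywords used in the queries.
tickets = {
    "Democrat": ("obama", "biden"),
    "Republican": ("mccain", "palin"),
}

def infer_vote_by_mentions(user_tweets):
    """Predict a user's vote as the ticket they mention most often."""
    counts = Counter()
    for text in user_tweets:
        words = text.lower().split()
        for ticket, names in tickets.items():
            counts[ticket] += sum(words.count(n) for n in names)
    return counts.most_common(1)[0][0] if counts else None

# Group tweets by user, then infer one vote per user.
by_user = defaultdict(list)
for user, text in tweets:
    by_user[user].append(text)

predictions = {u: infer_vote_by_mentions(ts) for u, ts in by_user.items()}
# alice -> "Democrat", bob -> "Republican"
```

As the review notes, the per-user framing (rather than aggregating all mentions) was what distinguished this study; the heuristic itself is what proved unreliable.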
In the next section, "Inferring Voter Intention," Gayo-Avello presents a second method using terms labeled either positive or negative. If a tweet contained more positive terms than negative, it was labeled positive; the opposite held for negative terms. Since each tweet in the collection applied to just one candidate, it was possible to count, for each user, the number of positive and negative tweets for each set of candidates. Three more elaborate procedures were also tested: vote and flip, semantic orientation, and polarity lexicon. The polarity lexicon was ultimately used to infer votes for all users in the dataset, since it was better at estimating McCain support and achieved better overall accuracy. However, polling results achieved by analyzing Twitter data were still far less accurate than the predictive results achieved through traditional polling methods. Selection bias had tainted the sample.
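The basic term-counting polarity rule can be sketched as follows. The toy lexicon here is invented for illustration; the study's actual lexicons were far larger and domain-adapted.

```python
# Toy polarity lexicon; a real study would use a large, adapted lexicon.
POSITIVE = {"great", "win", "hope", "support"}
NEGATIVE = {"fail", "lies", "worst", "against"}

def label_tweet(text):
    """Label a tweet positive/negative/neutral by counting lexicon hits."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

labels = [label_tweet(t) for t in [
    "Great speech and I support Obama",
    "That campaign is the worst",
    "Election day tomorrow",
]]
# ["positive", "negative", "neutral"]
```

Per-user vote inference then reduces to comparing, for each user, the counts of positive and negative tweets about each ticket.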
Gayo-Avello then tested for Twitter bias. The first test applied to the data checked the number of users per county, based on the premise that city dwellers and young adults are more likely to use Twitter and lean toward more liberal political opinions. This test looked for correlations between the percentage of users per county and population density, using the actual election results for each county. Results showed that, within Twitter, all of the states showed a positive correlation between population density and the Democratic vote in the 2008 US presidential election. Moreover, every state except Missouri and Texas showed a positive correlation between population density and Twitter use. (Gayo-Avello notes that, in his own 2008 study, younger people were clearly overrepresented on Twitter, which explains part of the faulty prediction.) Results also showed that Republican voters, or at least McCain supporters, tweeted much less than Democratic voters during the 2008 election, whether because they used Twitter less or because they were reluctant to express their political opinions publicly.
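The correlation test amounts to computing a Pearson coefficient between county-level population density and Democratic vote share (or Twitter use). A minimal sketch, with county figures invented for illustration:

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical county-level figures for one state:
# population density (people per square mile) and Democratic vote share.
density = [50, 120, 300, 800, 2500]
dem_share = [0.38, 0.42, 0.51, 0.58, 0.67]

r = pearson(density, dem_share)
# r close to +1 here: denser counties lean more Democratic in this toy data
```

A positive r across states, as the study found, is evidence that the Twitter sample skews toward dense, urban, more liberal counties.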
In conclusion, the outcome of the 2008 US presidential election could not have been predicted from user content published through Twitter by applying the most common sentiment-analysis methods. According to the article, "Due to the prevalence of younger users and their tilt toward Democrats and Obama, 'Democrats and Obama backers are more in evidence on the Internet than backers of other candidates or parties' [2]." The possible biases in the data are consistent with the conclusions drawn by Lenhart and Fox [3], and Rainie and Smith [2].
The article ends with "Lessons Learned." The problem with trying to predict the outcome of the 2008 US presidential election was not data collection per se, but two other things: the need to learn how to minimize the effect of bias in social media data, and the tendency to ignore how such data differs from the actual population.
Four lessons can be learned from this study. First, the fact that researchers can assemble very large datasets for mining does not make those data statistically representative of the overall population. Second, bias can be introduced through the relative youth of social networking users; researchers need to correct for this by knowing user ages within their samples. Third, a topic that appears frequently within a given sample, or is repeated often, can skew results within Twitter. And finally, a nonresponse within Twitter may play a more important role than is realized, especially if the lack of information mainly affects one group in particular.
Gayo-Avello concludes:
Until social media is used regularly by a broad segment of the voting population, its users cannot be considered a representative sample, and forecasts from the data will be of questionable value at best and incorrect in many cases. Until then, researchers using such data should identify the various strata of users (based on, say, age, income, gender, and race) to properly weigh their opinions according to the proportion of each of them in the population.
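The stratified weighting Gayo-Avello recommends is essentially post-stratification: each stratum's observed opinion is reweighted from its share of the sample to its share of the electorate. A minimal sketch, with all shares invented for illustration:

```python
# Hypothetical age strata: share of each stratum in the Twitter sample
# versus its share in the voting population.
sample_share = {"18-29": 0.55, "30-49": 0.30, "50+": 0.15}
population_share = {"18-29": 0.22, "30-49": 0.35, "50+": 0.43}

# Observed support for candidate A within each stratum of the sample.
support_a = {"18-29": 0.70, "30-49": 0.50, "50+": 0.40}

# Unweighted estimate: average over the (biased) sample as-is.
unweighted = sum(sample_share[s] * support_a[s] for s in sample_share)

# Post-stratified estimate: reweight each stratum to its population share.
weighted = sum(population_share[s] * support_a[s] for s in support_a)

# With young users overrepresented and tilting toward A, the unweighted
# estimate overstates A's support relative to the weighted one.
```

This corrects known demographic skew, but, as the article stresses, it cannot fix biases along dimensions the researcher has not measured.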
Statisticians, pollsters, media, and others interested in social behavior will find this paper enlightening. Follow-ups on the topic might investigate whether instinctive responses when tweeting, herd behavior, or the lack of critical evaluation by tweeters skew the results of a Twitter survey.
Gaur AYadav D(2025)A comprehensive analysis of forecasting elections using social media textMultimedia Tools and Applications10.1007/s11042-024-20528-wOnline publication date: 27-Jan-2025
de Almeida MVieira VAgnihotri Rde Freitas Souza R(2024)The Social Psychological Explanations for the Effects of Social Media Marketing and Traditional Media on Voting IntentionsJournal of Marketing Theory and Practice10.1080/10696679.2024.2415973(1-18)Online publication date: 20-Oct-2024
Brito KSilva Filho RAdeodato P(2024)Stop trying to predict elections only with twitter – There are other data sources and technical issues to be improvedGovernment Information Quarterly10.1016/j.giq.2023.10189941:1(101899)Online publication date: Mar-2024
Lima JSantana MCorrea ABrito K(2023)The use and impact of TikTok in the 2022 Brazilian presidential electionProceedings of the 24th Annual International Conference on Digital Government Research10.1145/3598469.3598485(144-152)Online publication date: 11-Jul-2023
Pokhriyal NValentino BVosoughi S(2023)Quantifying participation biases on social mediaEPJ Data Science10.1140/epjds/s13688-023-00405-612:1Online publication date: 28-Jul-2023
Dahish ZMaih S(2023)Crafting and Analyzing Advanced Social Monitoring Techniques for Digital Retail Platforms2023 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)10.1109/CSDE59766.2023.10487650(1-5)Online publication date: 4-Dec-2023
Vicente P(2023)Sampling Twitter users for social science research: evidence from a systematic review of the literatureQuality & Quantity10.1007/s11135-023-01615-w57:6(5449-5489)Online publication date: 27-Jan-2023
Chauhan PSharma NSikka G(2023)On the importance of pre-processing in small-scale analyses of twitter: a case study of the 2019 Indian general electionMultimedia Tools and Applications10.1007/s11042-023-16158-383:7(19219-19258)Online publication date: 26-Jul-2023
Kumaran P Sridhar RNandy H(2023)Multi-layered perceptron based deep learning model for emotion extraction on monolingual text using intelligence feature engineering and filtering techniquesMultimedia Tools and Applications10.1007/s11042-023-15438-282:28(44037-44052)Online publication date: 27-Apr-2023