DOI: 10.1145/3308558.3313684
research-article

Demographic Inference and Representative Population Estimates from Multilingual Social Media Data

Published: 13 May 2019

Abstract

Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While demographic attribute inference could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification to create a more representative population sample. To learn demographic attributes, we create a new multimodal deep neural architecture for joint classification of age, gender, and organization-status of social media users that operates in 32 languages. This method substantially outperforms current state of the art while also reducing algorithmic bias. To correct for sampling biases, we propose fully interpretable multilevel regression methods that estimate inclusion probabilities from inferred joint population counts and ground-truth population counts.
In a large experiment over multilingual heterogeneous European regions, we show that our demographic inference and bias correction together allow for more accurate estimates of populations and make a significant step towards representative social sensing in downstream applications with multilingual social media.
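
To make the bias-correction idea in the abstract concrete, the sketch below shows plain cell-wise post-stratification in Python. It is not the multilevel regression method the paper proposes: it only illustrates how per-cell inclusion probabilities can be derived from inferred user counts and ground-truth population counts, and how re-weighting by population shares changes an estimate. All function names, demographic cells, and numbers are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Tuple

# A demographic cell, here (age_group, gender). Hypothetical labels.
Cell = Tuple[str, str]


def inclusion_probabilities(
    sample_counts: Dict[Cell, int], census_counts: Dict[Cell, int]
) -> Dict[Cell, float]:
    """Estimate, per cell, the share of the true population seen in the sample."""
    probs = {}
    for cell, population in census_counts.items():
        if population <= 0:
            raise ValueError(f"census count for {cell} must be positive")
        probs[cell] = sample_counts.get(cell, 0) / population
    return probs


def poststratified_mean(
    cell_means: Dict[Cell, float], census_counts: Dict[Cell, int]
) -> float:
    """Weight each cell's mean outcome by its share of the true population."""
    total = sum(census_counts.values())
    return sum(cell_means[c] * census_counts[c] for c in census_counts) / total


def naive_mean(cell_means: Dict[Cell, float], sample_counts: Dict[Cell, int]) -> float:
    """Unweighted sample mean, i.e. each cell weighted by its sample size."""
    total = sum(sample_counts.values())
    return sum(cell_means[c] * sample_counts[c] for c in sample_counts) / total


# Toy data (invented): inferred users per cell vs. census population per cell.
sample_counts = {("<30", "F"): 300, ("<30", "M"): 500, ("30+", "F"): 100, ("30+", "M"): 100}
census_counts = {("<30", "F"): 2000, ("<30", "M"): 2000, ("30+", "F"): 3000, ("30+", "M"): 3000}
# Mean of some outcome of interest measured within each cell of the sample.
cell_means = {("<30", "F"): 0.8, ("<30", "M"): 0.6, ("30+", "F"): 0.4, ("30+", "M"): 0.2}

probs = inclusion_probabilities(sample_counts, census_counts)
biased = naive_mean(cell_means, sample_counts)
corrected = poststratified_mean(cell_means, census_counts)
print(probs)
print(biased, corrected)
```

In this toy setup the younger cells are over-represented in the sample, so the unweighted mean leans towards their outcomes, while the post-stratified mean pulls the estimate back towards the older, under-sampled cells.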

Information

Published In

WWW '19: The World Wide Web Conference
May 2019
3620 pages
ISBN:9781450366748
DOI:10.1145/3308558
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Deep Learning
  2. Demographics
  3. Inclusion Probabilities
  4. Latent Attribute Inference
  5. Multilingual
  6. Post-stratification
  7. Social Media

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '19: The Web Conference
May 13 - 17, 2019
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (last 12 months): 335
  • Downloads (last 6 weeks): 30
Reflects downloads up to 01 Sep 2024

Cited By

  • (2024) Machine Learning for Multimodal Mental Health Detection: A Systematic Review of Passive Sensing Approaches. Sensors 24:2 (348). DOI: 10.3390/s24020348. Online publication date: 6-Jan-2024
  • (2024) Public Perceptions and Discussions of the US Food and Drug Administration's JUUL Ban Policy on Twitter: Observational Study. JMIR Formative Research 8 (e51327). DOI: 10.2196/51327. Online publication date: 11-Jul-2024
  • (2024) Detecting Substance Use Disorder Using Social Media Data and the Dark Web: Time- and Knowledge-Aware Study. JMIRx Med 5 (e48519). DOI: 10.2196/48519. Online publication date: 1-May-2024
  • (2024) Methods and Annotated Data Sets Used to Predict the Gender and Age of Twitter Users: Scoping Review. Journal of Medical Internet Research 26 (e47923). DOI: 10.2196/47923. Online publication date: 15-Mar-2024
  • (2024) Mapping the Risk of Spreading Fake-News Via Wisdom-of-The-Crowd & MrP. SSRN Electronic Journal. DOI: 10.2139/ssrn.4868717. Online publication date: 2024
  • (2024) Negative affect variability differs between anxiety and depression on social media. PLOS ONE 19:2 (e0272107). DOI: 10.1371/journal.pone.0272107. Online publication date: 21-Feb-2024
  • (2024) Inferring Depression and Its Semantic Underpinnings from Simple Lexical Choices. Depression and Anxiety 2024 (1-11). DOI: 10.1155/2024/3010831. Online publication date: 28-Mar-2024
  • (2024) Social Group Differences in the Social Media Discussion about ChatGPT and Bing Chat. Proceedings of the 16th ACM Web Science Conference (114-118). DOI: 10.1145/3614419.3643997. Online publication date: 21-May-2024
  • (2024) Profile update: the effects of identity disclosure on network connections and language. EPJ Data Science 13:1. DOI: 10.1140/epjds/s13688-024-00483-0. Online publication date: 28-Jun-2024
  • (2024) Analysis of Public Sentiment on COVID-19 Mitigation Measures in Social Media in the United States Using Machine Learning. IEEE Transactions on Computational Social Systems 11:1 (307-318). DOI: 10.1109/TCSS.2022.3214527. Online publication date: Feb-2024
