DOI: 10.1145/3292500.3330784

Predicting Economic Development using Geolocated Wikipedia Articles

Published: 25 July 2019

Abstract

Progress on the UN Sustainable Development Goals (SDGs) is hampered by a persistent lack of data regarding key social, environmental, and economic indicators, particularly in developing countries. For example, data on poverty, the first of seventeen SDGs, is both spatially sparse and infrequently collected in Sub-Saharan Africa due to the high cost of surveys. Here we propose a novel method for estimating socioeconomic indicators using open-source, geolocated textual information from Wikipedia articles. We demonstrate that modern NLP techniques can be used to predict community-level asset wealth and education outcomes using nearby geolocated Wikipedia articles. When paired with nightlights satellite imagery, our method outperforms all previously published benchmarks for this prediction task, indicating the potential of Wikipedia to inform both research in the social sciences and future policy decisions.
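
The idea of using "nearby geolocated Wikipedia articles" can be pictured with a minimal sketch. It is not the authors' pipeline. It assumes each article already has a latitude, a longitude, and a fixed-length text embedding (the toy vectors below are made up), and it picks the k closest articles by great-circle distance, averages their embeddings, and appends the mean distance as one extra feature. The function names, the dictionary layout, and the choice of k are all hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_articles(point, articles, k=3):
    """Return the k articles closest to point as (distance_km, article) pairs."""
    lat, lon = point
    ranked = sorted(
        ((haversine_km(lat, lon, a["lat"], a["lon"]), a) for a in articles),
        key=lambda pair: pair[0],  # sort on distance only
    )
    return ranked[:k]


def location_features(point, articles, k=3):
    """Average the embeddings of the k nearest articles and append their mean distance."""
    nearest = nearest_articles(point, articles, k)
    if not nearest:
        raise ValueError("no articles available for this location")
    dim = len(nearest[0][1]["embedding"])
    mean_emb = [sum(a["embedding"][i] for _, a in nearest) / len(nearest) for i in range(dim)]
    mean_dist = sum(d for d, _ in nearest) / len(nearest)
    return mean_emb + [mean_dist]


# Toy, made-up articles: coordinates and 2-dimensional "embeddings" are illustrative only.
ARTICLES = [
    {"title": "A", "lat": 0.0, "lon": 0.0, "embedding": [1.0, 0.0]},
    {"title": "B", "lat": 0.0, "lon": 1.0, "embedding": [0.0, 1.0]},
    {"title": "C", "lat": 10.0, "lon": 10.0, "embedding": [5.0, 5.0]},
]

features = location_features((0.0, 0.0), ARTICLES, k=2)
print(features)
```

A feature vector of this kind could then be passed to any regressor trained on survey-derived labels. Which embedding model, which value of k, and which regressor to use are open choices that this sketch does not settle.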

    Published In

    KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
    July 2019
    3305 pages
    ISBN: 9781450362016
    DOI: 10.1145/3292500
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. computational sustainability
    2. deep learning
    3. remote sensing

    Qualifiers

    • Research-article

    Acceptance Rates

    KDD '19 Paper Acceptance Rate 110 of 1,200 submissions, 9%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
