DOI: 10.1145/3543507.3583218
Research Article
Open Access

Longitudinal Assessment of Reference Quality on Wikipedia

Published: 30 April 2023
Abstract

Wikipedia plays a crucial role in the integrity of the Web. This work analyzes the reliability of this global encyclopedia through the lens of its references. We operationalize the notion of reference quality by defining reference need (RN), i.e., the percentage of sentences missing a citation, and reference risk (RR), i.e., the proportion of non-authoritative references. We release Citation Detective, a tool for automatically calculating the RN score, and find that the RN score has dropped by 20 percentage points over the last decade, with more than half of verifiable statements now accompanied by references. The RR score has remained below 1% over the years as a result of community efforts to eliminate unreliable references. We propose pairing novice and experienced editors on the same Wikipedia article as a strategy to enhance reference quality. Our quasi-experiment indicates that such a co-editing experience can confer a lasting advantage in identifying unreliable sources in future edits. As Wikipedia is frequently used as the ground truth for numerous Web applications, our findings and suggestions on its reliability can have a far-reaching impact. We discuss the possibility of other Web services adopting Wiki-style user collaboration to eliminate unreliable content.
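
To make the two definitions concrete, below is a minimal sketch of how RN and RR could be computed from pre-labeled data. It is an illustration under stated assumptions, not the paper's Citation Detective implementation: the needs_citation flags stand in for a citation-need model's per-sentence predictions, the deny-list stands in for whatever source list marks references as non-authoritative, and all function and field names are hypothetical.

```python
# Hedged sketch of the RN and RR scores as defined in the abstract.
# NOT the paper's Citation Detective pipeline: labels are assumed given.

def reference_need(sentences):
    """RN: percentage of sentences that need a citation but lack one.

    Reads the abstract's definition literally, so the denominator is
    all sentences; the paper may normalize differently.
    """
    if not sentences:
        return 0.0
    missing = sum(
        1 for s in sentences if s["needs_citation"] and not s["has_citation"]
    )
    return 100.0 * missing / len(sentences)

def reference_risk(ref_domains, non_authoritative):
    """RR: percentage of references drawn from non-authoritative sources."""
    if not ref_domains:
        return 0.0
    risky = sum(1 for d in ref_domains if d in non_authoritative)
    return 100.0 * risky / len(ref_domains)

if __name__ == "__main__":
    # Toy article: one of four sentences needs a citation but has none.
    sentences = [
        {"needs_citation": True,  "has_citation": True},
        {"needs_citation": True,  "has_citation": False},
        {"needs_citation": False, "has_citation": False},
        {"needs_citation": True,  "has_citation": True},
    ]
    # Toy reference list: one of four domains is on the deny-list.
    refs = ["nytimes.com", "nature.com", "example-tabloid.com", "bbc.co.uk"]
    deny = {"example-tabloid.com"}
    print(f"RN = {reference_need(sentences):.1f}%")   # RN = 25.0%
    print(f"RR = {reference_risk(refs, deny):.1f}%")  # RR = 25.0%
```

In the paper's setting, the per-sentence labels would presumably come from a trained citation-need classifier and the deny-list from community-maintained source lists; the sketch only fixes the arithmetic of the two scores.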

Supplemental Material

Appendix (PDF file)


Cited By

• (2024) The Most Cited Scientific Information Sources in Wikipedia Articles Across Various Languages. Biblioteka. DOI: 10.14746/b.2023.27.12, pp. 269-294. Online publication date: 7-Mar-2024.
• (2024) Polarization and reliability of news sources in Wikipedia. Online Information Review. DOI: 10.1108/OIR-02-2023-0084. Online publication date: 18-Jan-2024.
• (2023) A Comparative Study of Reference Reliability in Multiple Language Editions of Wikipedia. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. DOI: 10.1145/3583780.3615254, pp. 3743-3747. Online publication date: 21-Oct-2023.


      Published In

      WWW '23: Proceedings of the ACM Web Conference 2023
      April 2023
      4293 pages
ISBN: 9781450394161
DOI: 10.1145/3543507
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 30 April 2023


      Author Tags

      1. Fake News
      2. NLP
      3. Verifiability
      4. Wikipedia
      5. the Web

      Qualifiers

      • Research-article
      • Research
      • Refereed limited


      Conference

WWW '23: The ACM Web Conference 2023
April 30 - May 4, 2023
Austin, TX, USA

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


Article Metrics

• Downloads (Last 12 months): 352
• Downloads (Last 6 weeks): 24

