DOI: 10.1145/3517745.3561451
research-article

Characterizing "permanently dead" links on Wikipedia

Published: 25 October 2022

Abstract

It is common for a web page to include links that help visitors discover related pages on other sites. When a link ceases to work (e.g., because the page it points to no longer exists or has been moved), users can rely on an archived copy of the linked page. However, because web archives are incomplete, a sizeable fraction of dead links have no archived copies.
We study this problem in the context of Wikipedia. Broken external references on Wikipedia that lack archived copies are marked as "permanently dead". However, we find this term to be a misnomer, as many previously dysfunctional links work fine today. For links that do not work, it is rarely the case that no archived copies exist. Instead, we find that the current policy for determining which archived copies of a URL are not erroneous is too conservative, and that many URLs are archived for the first time only after they have stopped working. We discuss the implications of our findings for Wikipedia and the web at large.
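
The distinction the abstract draws (a dead link with no archived copy, versus one whose only copies are rejected or were captured too late) can be illustrated with a minimal sketch. Everything here is hypothetical: the `Snapshot` record, the function names, and in particular the "usable means the capture recorded HTTP 200" rule are assumptions made for illustration, not the policy Wikipedia or its bots actually apply.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:
    """One archived capture of a URL (hypothetical record layout)."""
    timestamp: str  # fixed-width "YYYYMMDDhhmmss", so string order is time order
    status: int     # HTTP status the archive recorded at capture time


def usable_snapshots(snapshots: List[Snapshot]) -> List[Snapshot]:
    """Keep captures that pass a deliberately strict check (assumed: status 200)."""
    return sorted((s for s in snapshots if s.status == 200),
                  key=lambda s: s.timestamp)


def label_link(works_now: bool, snapshots: List[Snapshot]) -> str:
    """Label a link the way the abstract describes the outcome categories."""
    if works_now:
        return "alive"
    if usable_snapshots(snapshots):
        return "dead, archived copy available"
    # No capture passes the check, even if some captures exist.
    return "permanently dead"


def first_archived_after_death(snapshots: List[Snapshot],
                               died_at: Optional[str]) -> bool:
    """True if the earliest capture of any kind postdates the link's failure."""
    if died_at is None or not snapshots:
        return False
    earliest = min(s.timestamp for s in snapshots)
    return earliest > died_at
```

Under these assumptions, a link whose only captures recorded a non-200 status is labelled "permanently dead" even though copies exist, which is the kind of conservative outcome the abstract questions.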

Supplementary Material

M4V File (316.m4v)
Presentation video


      Published In

      IMC '22: Proceedings of the 22nd ACM Internet Measurement Conference
      October 2022
      796 pages
      ISBN: 9781450392594
      DOI: 10.1145/3517745

      In-Cooperation

      • USENIX Assoc

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. link rot
      2. web archives

      Qualifiers

      • Research-article

      Conference

      IMC '22: ACM Internet Measurement Conference
      October 25-27, 2022
      Nice, France

      Acceptance Rates

      Overall Acceptance Rate 277 of 1,083 submissions, 26%

      Article Metrics

      • Total Citations: 0
      • Total Downloads: 194
      • Downloads (last 12 months): 39
      • Downloads (last 6 weeks): 5

      Reflects downloads up to 09 Nov 2024
