DOI: 10.1145/3618257.3624832
research-article

Reviving Dead Links on the Web with Fable

Published: 24 October 2023

Abstract

The web is littered with millions of links that previously worked but no longer do. When users encounter such a broken link, they resort to looking up an archived copy of the linked page. But for a sizeable fraction of these broken links, no archived copy exists. Even if a copy exists, it often poorly approximates the original page. For example, any functionality on the page that requires the client browser to communicate with the page's backend servers will not work, and even the latest copy will be missing updates made to the page's content after that copy was captured.
To address this situation, we observe that broken links are often merely a result of website reorganizations; the linked page still exists on the same site, albeit at a different URL. Therefore, given a broken link, our system FABLE attempts to find the linked page's new URL by learning and exploiting the pattern in how the old URLs for other pages on the same site have transformed to their new URLs. We show that our approach is significantly more accurate and efficient than prior approaches which rely on stability in page content over time. FABLE increases the fraction of dead links for which the corresponding new URLs can be found by 50%, while reducing the median delay incurred in identifying the new URL for a broken link from over 40 seconds to less than 10 seconds.
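
To make the idea of learning a URL transformation from examples concrete, here is a minimal Python sketch. It is not FABLE's algorithm, which the abstract does not specify. It assumes a much simpler rule language in which each segment of the new URL is either a constant or a copy of one segment of the old URL. The site `example.org`, its URLs, and the function names `tokens`, `learn_rule`, and `apply_rule` are all hypothetical.

```python
from urllib.parse import urlsplit


def tokens(url):
    """Split a URL into its host followed by its non-empty path segments."""
    parts = urlsplit(url)
    return [parts.netloc] + [seg for seg in parts.path.split("/") if seg]


def learn_rule(examples):
    """Infer one rewrite rule from (old_url, new_url) pairs of the same site.

    Each position of the new URL becomes ("copy", i), meaning "reuse old
    segment i", or ("const", text). Returns None when no single rule of
    this simple form explains every example.
    """
    old_tok = [tokens(old) for old, _ in examples]
    new_tok = [tokens(new) for _, new in examples]
    if len({len(t) for t in new_tok}) != 1:
        return None  # new URLs differ in length; this sketch gives up
    shortest_old = min(len(t) for t in old_tok)
    rule = []
    for j in range(len(new_tok[0])):
        # Prefer a copy: it generalizes to pages not seen in the examples.
        copy_from = next(
            (i for i in range(shortest_old)
             if all(o[i] == n[j] for o, n in zip(old_tok, new_tok))),
            None,
        )
        values = {t[j] for t in new_tok}
        if copy_from is not None:
            rule.append(("copy", copy_from))
        elif len(values) == 1:
            rule.append(("const", values.pop()))
        else:
            return None  # segment varies but matches no old segment
    return rule


def apply_rule(rule, broken_url):
    """Build a candidate new URL for broken_url, or None if the rule does not fit."""
    old = tokens(broken_url)
    out = []
    for kind, arg in rule:
        if kind == "copy":
            if arg >= len(old):
                return None
            out.append(old[arg])
        else:
            out.append(arg)
    scheme = urlsplit(broken_url).scheme or "https"
    return f"{scheme}://{out[0]}/" + "/".join(out[1:])


# Hypothetical old -> new pairs already known for other pages on the site.
examples = [
    ("http://example.org/articles/2019/alpha", "http://example.org/blog/alpha"),
    ("http://example.org/articles/2020/beta", "http://example.org/blog/beta"),
]
rule = learn_rule(examples)
candidate = apply_rule(rule, "http://example.org/articles/2021/gamma")
print(candidate)
```

The output is only a candidate: it would still have to be fetched and checked before being treated as the page's new location. Real reorganizations also change things this sketch cannot express, such as query strings, case, or separators inside a segment.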


Published In

IMC '23: Proceedings of the 2023 ACM on Internet Measurement Conference
October 2023
746 pages
ISBN: 9798400703829
DOI: 10.1145/3618257
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. web archives
  2. web page rediscovery

Qualifiers

  • Research-article

Conference

IMC '23: ACM Internet Measurement Conference
October 24 - 26, 2023
Montreal, QC, Canada

Acceptance Rates

Overall acceptance rate: 277 of 1,083 submissions (26%)

Article Metrics

  • Total Citations: 0
  • Total Downloads: 273
  • Downloads (Last 12 months): 220
  • Downloads (Last 6 weeks): 15

Reflects downloads up to 09 Nov 2024
