DOI: 10.1145/1149941.1149971

Just-in-time recovery of missing web pages

Published: 22 August 2006

Abstract

We present Opal, a lightweight framework for interactively locating missing web pages (HTTP status code 404). Opal is an example of "in vivo" preservation: harnessing the collective behavior of web archives, commercial search engines, and research projects for the purpose of preservation. Opal servers learn from their experiences and share their knowledge with other Opal servers through mutual harvesting over the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Using cached copies that can be found on the web, Opal creates lexical signatures, which are then used to search for similar versions of the web page. We present the architecture of the Opal framework, discuss a reference implementation, and present a quantitative analysis indicating that Opal could be deployed effectively.
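
The lexical-signature step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how Opal weights terms, so this example assumes a plain TF-IDF ranking and a five-term signature; the function name, stop list, document-frequency table, and corpus size are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop list (assumption); a real deployment would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def lexical_signature(text, doc_freq, corpus_size, k=5):
    """Return the k terms of `text` with the highest TF-IDF weight.

    doc_freq maps a term to the number of documents containing it;
    corpus_size is the total number of documents (both are assumed inputs).
    """
    terms = [t for t in re.findall(r"[a-z0-9]+", text.lower())
             if t not in STOP_WORDS]
    tf = Counter(terms)

    def weight(term):
        # Terms missing from the table get document frequency 1 (assumption),
        # which keeps the IDF finite and treats them as rare.
        df = doc_freq.get(term, 1)
        return tf[term] * math.log(corpus_size / df)

    # Highest weight first; ties are broken alphabetically so the
    # result is deterministic.
    ranked = sorted(tf, key=lambda t: (-weight(t), t))
    return ranked[:k]


# Text standing in for a cached copy of the missing page.
cached_copy = ("Opal locates missing web pages. "
               "Opal builds lexical signatures from cached pages.")

# Made-up document frequencies for a corpus of 1000 documents.
doc_freq = {
    "web": 900, "pages": 800, "from": 950, "missing": 200,
    "builds": 150, "locates": 100, "cached": 50,
    "signatures": 20, "lexical": 10, "opal": 2,
}

signature = lexical_signature(cached_copy, doc_freq, corpus_size=1000, k=5)
query = " ".join(signature)  # string that would be sent to a search engine
print(query)
```

The resulting query string is what a system of this kind would submit to a search engine to look for relocated or similar versions of the page. How many terms to keep and how to weight them are design choices that this sketch does not settle.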

Published In

HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia
August 2006
178 pages
ISBN: 1595934170
DOI: 10.1145/1149941
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. 404 web pages
  2. apache web server
  3. digital preservation

Qualifiers

  • Article

Conference

HT06: 17th Conference on Hypertext and Hypermedia
August 22 - 25, 2006
Odense, Denmark

Acceptance Rates

Overall Acceptance Rate 378 of 1,158 submissions, 33%

Cited By

  • (2023) Reviving Dead Links on the Web with Fable. Proceedings of the 2023 ACM on Internet Measurement Conference, pages 131-144. DOI: 10.1145/3618257.3624832. Online publication date: 24-Oct-2023
  • (2022) Proxy-Terms Based Query Obfuscation Technique for Private Web Search. IEEE Access, 10:17845-17863. DOI: 10.1109/ACCESS.2022.3149929. Online publication date: 2022
  • (2018) Automatic Recovery of Broken Links Using Information Retrieval Techniques. Proceedings of the 2nd International Conference on Natural Language Processing and Information Retrieval, pages 32-36. DOI: 10.1145/3278293.3278296. Online publication date: 7-Sep-2018
  • (2017) Broken link repairing system for constructing contextual information portals. Journal of King Saud University - Computer and Information Sciences. DOI: 10.1016/j.jksuci.2017.12.013. Online publication date: Dec-2017
  • (2014) Who and what links to the Internet Archive. International Journal on Digital Libraries, 14(3-4):101-115. DOI: 10.1007/s00799-014-0111-5. Online publication date: 1-Aug-2014
  • (2014) Moved but not gone. International Journal on Digital Libraries, 14(1-2):17-38. DOI: 10.1007/s00799-014-0108-0. Online publication date: 1-Apr-2014
  • (2013) Who and What Links to the Internet Archive. Research and Advanced Technology for Digital Libraries, pages 346-357. DOI: 10.1007/978-3-642-40501-3_35. Online publication date: 2013
  • (2012) Identifying "soft 404" error pages. Proceedings of the Second international conference on Theory and Practice of Digital Libraries, pages 197-208. DOI: 10.1007/978-3-642-33290-6_22. Online publication date: 23-Sep-2012
  • (2011) Page History Explorer: Visualizing and Comparing Page Histories. IEICE Transactions on Information and Systems, E94-D(3):564-577. DOI: 10.1587/transinf.E94.D.564. Online publication date: 2011
  • (2010) Evaluating methods to rediscover missing web pages from the web infrastructure. Proceedings of the 10th annual joint conference on Digital libraries, pages 59-68. DOI: 10.1145/1816123.1816133. Online publication date: 21-Jun-2010