DOI: 10.1145/1145581.1145623
Article

Modelling information persistence on the web

Published: 11 July 2006

Abstract

Models of web data persistency are essential tools for the design of efficient information extraction systems that repeatedly collect and process the data. This study models the persistence of web data through the measurement of URL and content persistence across several snapshots of a national community web, collected for 3 years. We found that the lifetimes of URLs and contents are modelled by logarithmic functions. We gathered statistics on the structure of the web, identified reasons for URL death and characterized persistent URLs and contents. The lasting contents tend to be referenced by different URLs during their lifetime, while half of the contents referenced by persistent URLs do not change.

Published In

ICWE '06: Proceedings of the 6th international conference on Web engineering
July 2006
384 pages
ISBN:1595933522
DOI:10.1145/1145581
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. content persistence
  2. tomba
  3. url persistence

Qualifiers

  • Article

Cited By

  • (2023) Searching images in a web archive. 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA), pages 1-10. DOI: 10.1109/DSAA60987.2023.10302607. Online publication date: 9-Oct-2023
  • (2022) Web archives as research infrastructure for digital societies: the case study of Arquivo.pt. Archeion, 123. DOI: 10.4467/26581264ARC.22.012.16665. Online publication date: 14-Nov-2022
  • (2021) Automatic Generation of Timelines for Past-Web Events. The Past Web, pages 225-242. DOI: 10.1007/978-3-030-63291-5_18. Online publication date: 1-Jul-2021
  • (2021) The Problem of Web Ephemera. The Past Web, pages 5-10. DOI: 10.1007/978-3-030-63291-5_1. Online publication date: 1-Jul-2021
  • (2019) Data Collection from the Web for Informetric Purposes. Springer Handbook of Science and Technology Indicators, pages 781-800. DOI: 10.1007/978-3-030-02511-3_30. Online publication date: 2019
  • (2018) Evolving networks. Intelligent Data Analysis, 17(1):27-48. DOI: 10.5555/2595545.2595548. Online publication date: 27-Dec-2018
  • (2018) Macroscopic characterisations of Web accessibility. The New Review of Hypermedia and Multimedia, 16(3):221-243. DOI: 10.1080/13614568.2010.534185. Online publication date: 14-Dec-2018
  • (2016) A quantitative approach to evaluate Website Archivability using the CLEAR+ method. International Journal on Digital Libraries, 17(2):119-141. DOI: 10.1007/s00799-015-0144-4. Online publication date: 1-Jun-2016
  • (2015) The fallacy of the multi-API culture. Journal of Documentation, 71(2):233-252. DOI: 10.1108/JD-07-2013-0098. Online publication date: 9-Mar-2015
  • (2014) Learning temporal-dependent ranking models. Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 757-766. DOI: 10.1145/2600428.2609619. Online publication date: 3-Jul-2014
