Effective page refresh policies for Web crawlers

Published: 01 December 2003

Abstract

In this article, we study how to keep local copies of remote data sources "fresh" when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Websites) do not notify the copies (Web crawlers) of new changes, so we need to poll the sources periodically to keep the copies up-to-date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up-to-date.

This article proposes various refresh policies and studies their effectiveness. We first formalize the notion of "freshness" of copied data by defining two freshness metrics, and we propose a Poisson process as the change model of data sources. Based on this framework, we examine the effectiveness of the proposed refresh policies analytically and experimentally. We show that a Poisson process is a good model to describe the changes of Web pages, and we also show that our proposed refresh policies improve the "freshness" of data very significantly. In certain cases, we obtained orders-of-magnitude improvement over existing policies.
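
To make the Poisson change model concrete, the following is a minimal sketch, not the article's own code. It assumes (these are assumptions for illustration, since the abstract does not spell out the metrics) that the freshness of a single page is the fraction of time its local copy matches the source, that the page changes as a Poisson process with rate `change_rate`, and that the crawler refreshes it at a fixed `refresh_interval`. Under those assumptions the time-averaged freshness is (1 - e^(-λI)) / (λI). The function names are hypothetical.

```python
import math
import random


def expected_freshness(change_rate: float, refresh_interval: float) -> float:
    """Closed-form time-averaged freshness of one page (assumed model).

    After a refresh, the copy is still current at time t with probability
    exp(-change_rate * t); averaging that over one refresh interval gives
    (1 - exp(-x)) / x with x = change_rate * refresh_interval.
    """
    x = change_rate * refresh_interval
    if x == 0:
        return 1.0  # a page that never changes is always fresh
    return -math.expm1(-x) / x  # expm1 keeps precision for small x


def simulated_freshness(change_rate: float, refresh_interval: float,
                        cycles: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    fresh_time = 0.0
    for _ in range(cycles):
        # Under a Poisson process, the time from a refresh to the first
        # change is exponentially distributed with the same rate.
        first_change = rng.expovariate(change_rate)
        # The copy is fresh until the first change or the next refresh.
        fresh_time += min(first_change, refresh_interval)
    return fresh_time / (cycles * refresh_interval)


if __name__ == "__main__":
    for rate in (0.1, 1.0, 10.0):  # hypothetical changes per day
        print(rate, expected_freshness(rate, 1.0), simulated_freshness(rate, 1.0))
```

In this sketch, a page that changes far more often than it is refreshed contributes little freshness however the crawler treats it, which is one reason the choice of refresh policy matters when polling resources are limited.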

Published In

ACM Transactions on Database Systems, Volume 28, Issue 4
December 2003
286 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/958942

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Web crawlers
  2. page refresh
  3. web search engines
  4. world-wide web

Qualifiers

  • Article

Article Metrics

  • Downloads (last 12 months): 24
  • Downloads (last 6 weeks): 2

Reflects downloads up to 17 Oct 2024.

Cited By

  • (2024) Timely Cache Updating in Parallel Multi-Relay Networks. IEEE Transactions on Wireless Communications 23:1 (2-15). DOI: 10.1109/TWC.2023.3235971. Online publication date: 1-Jan-2024
  • (2024) Water-Filling-Based Scheduling for Weighted Binary Freshness in Cache Update Systems. IEEE Internet of Things Journal 11:5 (8961-8972). DOI: 10.1109/JIOT.2023.3322360. Online publication date: 1-Mar-2024
  • (2024) Intelligent algorithm selection for efficient update predictions in social media feeds. Social Network Analysis and Mining 14:1. DOI: 10.1007/s13278-024-01315-9. Online publication date: 20-Aug-2024
  • (2024) Web Miner: Automated Web Crawling and Database System with Puppeteer and Node.js. Smart Systems: Innovations in Computing (149-159). DOI: 10.1007/978-981-97-3690-4_12. Online publication date: 30-Sep-2024
  • (2023) The Role of Gossiping in Information Dissemination over a Network of Agents. Entropy 26:1 (9). DOI: 10.3390/e26010009. Online publication date: 21-Dec-2023
  • (2023) Improving the Exploration/Exploitation Trade-Off in Web Content Discovery. Companion Proceedings of the ACM Web Conference 2023 (1183-1189). DOI: 10.1145/3543873.3587574. Online publication date: 30-Apr-2023
  • (2023) Age of Gossip on Generalized Rings. MILCOM 2023 - 2023 IEEE Military Communications Conference (MILCOM) (182-187). DOI: 10.1109/MILCOM58377.2023.10356227. Online publication date: 30-Oct-2023
  • (2023) Optimal Update Times for Stale Information Metrics Including the Age of Information. IEEE Journal on Selected Areas in Information Theory 4 (734-746). DOI: 10.1109/JSAIT.2023.3344760. Online publication date: 2023
  • (2023) Timely Opportunistic Gossiping in Dense Networks. IEEE INFOCOM 2023 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS) (1-6). DOI: 10.1109/INFOCOMWKSHPS57453.2023.10225855. Online publication date: 20-May-2023
  • (2023) Lock-based or Lock-less: Which Is Fresh? IEEE INFOCOM 2023 - IEEE Conference on Computer Communications (1-10). DOI: 10.1109/INFOCOM53939.2023.10229077. Online publication date: 17-May-2023
