DOI: 10.1145/3331184.3331241
research-article

Optimal Freshness Crawl Under Politeness Constraints

Published: 18 July 2019

Abstract

A Web crawler is an essential part of a search engine that procures information subsequently served by the search engine to its users. As the Web becomes increasingly dynamic, a crawler must not only discover new web pages but also keep revisiting those already in the search engine's index, so that the index stays fresh by picking up the pages' changed content. Determining how often to recrawl pages requires making tradeoffs based on the pages' relative importance and change rates, subject to multiple resource constraints: the limited daily budget of crawl requests on the search engine's end, and politeness constraints restricting the rate at which pages can be requested from a given host. In this paper, we introduce PoliteBinaryLambdaCrawl, the first optimal algorithm for freshness crawl scheduling in the presence of politeness constraints as well as non-uniform page importance scores and the crawler's own crawl request limit. We also propose an approximation for it, state its theoretical optimality conditions, and in the process discover a connection to an approach previously thought of as a mere heuristic for freshness crawl scheduling. We explore the relative performance of PoliteBinaryLambdaCrawl and other methods for handling politeness constraints on a dataset collected by crawling over 18.5M URLs daily over 14 weeks.
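
To make the tradeoff between importance, change rate, and crawl budget concrete, the following is a minimal illustrative sketch. It is not PoliteBinaryLambdaCrawl. It assumes, purely for illustration, that each page changes according to a Poisson process and is crawled at memoryless random times, so that a page with change rate `delta` crawled at rate `rho` is fresh a fraction `rho / (rho + delta)` of the time. It handles only a single global budget through one Lagrange multiplier and omits politeness constraints entirely. The function names `freshness`, `rates_for_multiplier`, and `allocate` are hypothetical.

```python
import math


def freshness(rate, change_rate):
    """Long-run fraction of time a page is fresh, under the assumed model:
    Poisson changes at change_rate, crawls at memoryless random times."""
    if rate <= 0.0:
        return 0.0
    return rate / (rate + change_rate)


def rates_for_multiplier(lam, importance, change_rate):
    """Crawl rates that maximize the Lagrangian for a fixed multiplier lam."""
    rates = []
    for mu, delta in zip(importance, change_rate):
        # Stationarity condition mu * delta / (rho + delta)^2 = lam,
        # solved for rho and clipped at zero (some pages get no crawls).
        rates.append(max(0.0, math.sqrt(mu * delta / lam) - delta))
    return rates


def allocate(importance, change_rate, budget, iters=200):
    """Bisect on the multiplier until the rates just fit the budget.

    Assumes all importance scores and change rates are strictly positive.
    """
    lo = 1e-12  # tiny multiplier: rates are large, budget is exceeded
    # At this multiplier every rate is clipped to zero, so the budget holds.
    hi = max(mu / delta for mu, delta in zip(importance, change_rate))
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint: lam spans many scales
        if sum(rates_for_multiplier(mid, importance, change_rate)) > budget:
            lo = mid
        else:
            hi = mid
    # hi always satisfies the budget, so the returned rates are feasible.
    return rates_for_multiplier(hi, importance, change_rate)
```

For example, `allocate([4.0, 1.0], [1.0, 1.0], 1.0)` gives nearly the whole budget to the more important page. Handling per-host politeness limits, which is the contribution described in the abstract, would require additional constraints that this sketch does not attempt to model.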

Supplementary Material

MP4 File (cite3-13h50-d2.mp4)

Published In

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2019
1512 pages
ISBN: 9781450361729
DOI: 10.1145/3331184

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. convex optimization
  2. Lagrange multiplier
  3. planning under uncertainty
  4. politeness constraint
  5. search engine
  6. web crawling

Conference

SIGIR '19
Acceptance Rates

SIGIR'19 Paper Acceptance Rate 84 of 426 submissions, 20%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 19
  • Downloads (last 6 weeks): 4
Reflects downloads up to 03 Oct 2024.

Cited By

  • (2024) Timely Cache Updating in Parallel Multi-Relay Networks. IEEE Transactions on Wireless Communications 23:1 (2-15). DOI: 10.1109/TWC.2023.3235971. Online publication date: Jan-2024
  • (2023) The Role of Gossiping in Information Dissemination over a Network of Agents. Entropy 26:1 (9). DOI: 10.3390/e26010009. Online publication date: 21-Dec-2023
  • (2022) Version Age of Information in Clustered Gossip Networks. IEEE Journal on Selected Areas in Information Theory 3:1 (85-97). DOI: 10.1109/JSAIT.2022.3159745. Online publication date: Mar-2022
  • (2022) The Dissemination of Time-Varying Information over Networked Agents with Gossiping. 2022 IEEE International Symposium on Information Theory (ISIT) (934-939). DOI: 10.1109/ISIT50566.2022.9834729. Online publication date: 26-Jun-2022
  • (2022) Timely Gossiping with File Slicing and Network Coding. 2022 IEEE International Symposium on Information Theory (ISIT) (928-933). DOI: 10.1109/ISIT50566.2022.9834557. Online publication date: 26-Jun-2022
  • (2021) Age of Information for Updates With Distortion: Constant and Age-Dependent Distortion Constraints. IEEE/ACM Transactions on Networking 29:6 (2425-2438). DOI: 10.1109/TNET.2021.3091493. Online publication date: Dec-2021
  • (2021) Freshness Based Cache Updating in Parallel Relay Networks. 2021 IEEE International Symposium on Information Theory (ISIT) (3355-3360). DOI: 10.1109/ISIT45174.2021.9518003. Online publication date: 12-Jul-2021
  • (2021) Gossiping with Binary Freshness Metric. 2021 IEEE Globecom Workshops (GC Wkshps) (1-6). DOI: 10.1109/GCWkshps52748.2021.9682174. Online publication date: Dec-2021
  • (2021) Cache Freshness in Information Updating Systems. 2021 55th Annual Conference on Information Sciences and Systems (CISS) (01-06). DOI: 10.1109/CISS50987.2021.9400310. Online publication date: 24-Mar-2021
  • (2021) Online algorithms for estimating change rates of web pages. Performance Evaluation (102261). DOI: 10.1016/j.peva.2021.102261. Online publication date: Nov-2021
