Crawling Policies Based on Web Page Popularity Prediction

Ostroumova, Liudmila; Bogatyy, Ivan; Chelnokov, Arseniy; Tikhonov, Alexey; Gusev, Gleb

doi:10.1007/978-3-319-06028-6_9


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8416)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

In this paper, we focus on crawling strategies for newly discovered URLs. Since it is impossible to crawl all the new pages right after they appear, the most important (or popular) pages should be crawled with a higher priority. One natural measure of page importance is the number of user visits. However, the popularity of newly discovered URLs cannot be known in advance, and therefore should be predicted relying on URLs’ features. In this paper, we evaluate several methods for predicting new page popularity against previously investigated crawler performance measurements, and propose a novel measurement setup aiming to evaluate crawler performance more realistically. In particular, we compare short-term and long-term popularity of new ephemeral URLs by estimating the rate of popularity decay. Our experiments show that the information about popularity decay can be effectively used for optimizing ordering policies of crawlers, but further research is required to predict it accurately enough.
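The ordering idea in the abstract can be illustrated with a small sketch. It is not the authors' method, which the abstract does not spell out. It assumes, purely for illustration, that each new URL comes with a predicted total number of visits and a predicted exponential decay rate, and it ranks URLs by the visits still expected after a given crawl delay. The function names, the exponential decay model, and the sample numbers are all hypothetical.

```python
import math


def expected_remaining_visits(total_visits, decay_rate, delay_hours):
    """Visits still expected after `delay_hours`.

    Assumption (not from the paper): visits decay exponentially, so the
    share of visits remaining after a delay t is exp(-decay_rate * t).
    """
    return total_visits * math.exp(-decay_rate * delay_hours)


def crawl_order(predictions, delay_hours):
    """Return URLs sorted by expected remaining visits, highest first.

    `predictions` maps url -> (predicted_total_visits, predicted_decay_rate).
    """
    return sorted(
        predictions,
        key=lambda url: expected_remaining_visits(*predictions[url], delay_hours),
        reverse=True,
    )


# Hypothetical predictions: a fast-decaying news page and a slow-decaying page.
predictions = {
    "news": (1000.0, 0.5),
    "evergreen": (400.0, 0.01),
}

print(crawl_order(predictions, delay_hours=0.0))
print(crawl_order(predictions, delay_hours=4.0))
```

With no delay the page with the larger predicted total ranks first. With a four-hour delay the slowly decaying page overtakes it, because most of the fast-decaying page's visits have already passed. This is one way to see why the decay rate, and not only the total popularity, can matter for ordering.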



Author information

Authors and Affiliations

Yandex, Moscow, Russia
Liudmila Ostroumova, Ivan Bogatyy, Arseniy Chelnokov, Alexey Tikhonov & Gleb Gusev

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Maarten de Rijke & Tom Kenter
Centrum Wiskunde en Informatica, Amsterdam, The Netherlands and Delft University of Technology, Delft, The Netherlands
Arjen P. de Vries
University of Illinois at Urbana-Champaign, Urbana, IL, USA
ChengXiang Zhai
University of Twente, Twente, The Netherlands and Erasmus University Rotterdam, Rotterdam, The Netherlands
Franciska de Jong
SalesPredict, Haifa, Israel
Kira Radinsky
Microsoft Research, Cambridge, UK
Katja Hofmann

Cite this paper

Ostroumova, L., Bogatyy, I., Chelnokov, A., Tikhonov, A., Gusev, G. (2014). Crawling Policies Based on Web Page Popularity Prediction. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_9

DOI: https://doi.org/10.1007/978-3-319-06028-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06027-9
Online ISBN: 978-3-319-06028-6
eBook Packages: Computer Science (R0)
