Abstract
We study crawling strategies for newly discovered URLs. Because not all new pages can be crawled immediately after they appear, the most important (or popular) pages should be crawled first. One natural measure of page importance is the number of user visits. However, the popularity of a newly discovered URL cannot be known in advance and must therefore be predicted from the URL’s features. We evaluate several methods for predicting the popularity of new pages against previously studied measures of crawler performance, and propose a novel measurement setup that evaluates crawler performance more realistically. In particular, we compare the short-term and long-term popularity of new ephemeral URLs by estimating the rate of popularity decay. Our experiments show that information about popularity decay can be used effectively to optimize the ordering policies of crawlers, but further research is required to predict it accurately enough.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Ostroumova, L., Bogatyy, I., Chelnokov, A., Tikhonov, A., Gusev, G. (2014). Crawling Policies Based on Web Page Popularity Prediction. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_9
DOI: https://doi.org/10.1007/978-3-319-06028-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06027-9
Online ISBN: 978-3-319-06028-6
eBook Packages: Computer Science (R0)