DOI: 10.5555/2964060.2964158
Article

Crawling Policies Based on Web Page Popularity Prediction

Published: 13 April 2014

Abstract

This paper focuses on crawling strategies for newly discovered URLs. Since it is impossible to crawl all new pages right after they appear, the most important or popular pages should be crawled with higher priority. One natural measure of page importance is the number of user visits. However, the popularity of a newly discovered URL cannot be known in advance and must therefore be predicted from the URL's features. We evaluate several methods for predicting new page popularity against previously investigated crawler performance measurements, and propose a novel measurement setup that aims to evaluate crawler performance more realistically. In particular, we compare the short-term and long-term popularity of new ephemeral URLs by estimating the rate of popularity decay. Our experiments show that information about popularity decay can be used effectively to optimize the ordering policies of crawlers, but further research is required to predict it accurately enough.
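
As an illustration of the idea of ordering a crawl queue by predicted popularity and its decay, here is a minimal sketch. It is not the authors' method: the abstract does not specify the predictors or the decay model. The sketch assumes an exponential decay of the visit rate, and all names (`expected_remaining_visits`, `crawl_order`, the tuple layout of `candidates`) are hypothetical. The predicted total visits and decay rate are taken as given inputs, standing in for whatever URL-feature model produces them.

```python
import math


def expected_remaining_visits(total_visits, decay_rate, age):
    """Visits still expected after `age` time units.

    Assumed model (not from the paper): the visit rate decays
    exponentially, rate(t) = total_visits * decay_rate * exp(-decay_rate * t),
    so the visits remaining after `age` are total_visits * exp(-decay_rate * age).
    A decay_rate of 0 is treated as "no decay": all visits remain.
    """
    return total_visits * math.exp(-decay_rate * age)


def crawl_order(candidates, now):
    """Order URLs so those with the most expected remaining visits come first.

    `candidates` is an iterable of hypothetical tuples:
    (url, predicted_total_visits, predicted_decay_rate, discovered_at).
    """
    scored = []
    for url, total, decay, discovered_at in candidates:
        score = expected_remaining_visits(total, decay, now - discovered_at)
        scored.append((score, url))
    # Highest expected benefit first; sorted() is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]


if __name__ == "__main__":
    candidates = [
        ("news/flash", 100.0, 1.0, 0.0),   # popular but quickly forgotten
        ("docs/guide", 50.0, 0.01, 0.0),   # less popular but long-lived
        ("misc/page", 10.0, 0.0, 0.0),     # low, steady interest
    ]
    print(crawl_order(candidates, now=0.0))
    print(crawl_order(candidates, now=5.0))
```

Under this assumed model, a page with high short-term popularity is the best crawl target right after discovery but drops down the queue if crawling is delayed, which is why the decay rate matters for the ordering as well as the total popularity.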

Published In

ECIR 2014: Proceedings of the 36th European Conference on IR Research on Advances in Information Retrieval - Volume 8416
April 2014
826 pages
ISBN: 9783319060279
Editors:
• Maarten de Rijke
• Tom Kenter
• Arjen P. de Vries
• ChengXiang Zhai
• Franciska de Jong
• Kira Radinsky
• Katja Hofmann

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

1. crawling policies
2. new web pages
3. popularity prediction
