DOI: 10.1145/1571941.1572041
research-article

The impact of crawl policy on web search effectiveness

Published: 19 July 2009 Publication History
  • Abstract

    Crawl selection policy has a direct influence on Web search effectiveness, because a useful page that is not selected for crawling will also be absent from search results. Yet there has been little or no work on measuring this effect. We introduce an evaluation framework, based on relevance judgments pooled from multiple search engines, that measures the maximum potential NDCG achievable using a particular crawl. This allows us to evaluate different crawl policies and to investigate important scenarios such as selection stability over multiple iterations. We conduct two sets of crawling experiments, at the scale of 1 billion and 100 million pages respectively. These show that crawl selection based on PageRank, indegree and trans-domain indegree all allow better retrieval effectiveness than a simple breadth-first crawl of the same size. PageRank is the most reliable and effective method. Trans-domain indegree can outperform PageRank, but over multiple crawl iterations it is less effective and less stable. Finally, we experiment with combinations of crawl selection methods and per-domain page limits, which yield crawls with greater potential NDCG than PageRank.
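
    The idea of "maximum potential NDCG" can be illustrated with a minimal sketch. The abstract does not give the paper's exact gain or discount functions, so the code below assumes linear gains and the common log2 rank discount; the function names, the `judgments` mapping and the cutoff `k` are hypothetical and chosen only for illustration. For one query, the best ranking any engine could produce from a crawl simply orders the judged pages that the crawl contains by relevance, and that ranking is scored against the ideal ranking over all judged pages.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a ranked list of gain values.

    Assumption: linear gain with a log2(rank + 1) discount (rank is 1-based).
    """
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def max_potential_ndcg(judgments, crawl, k=10):
    """Upper bound on NDCG@k for one query, given the set of crawled URLs.

    judgments: dict mapping URL -> graded relevance gain (pooled labels)
    crawl:     set of URLs selected by the crawl policy
    """
    # Ideal ranking: the k most relevant judged pages, crawled or not.
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0  # no relevant page is known for this query

    # Best achievable ranking: the k most relevant judged pages in the crawl.
    reachable = sorted(
        (gain for url, gain in judgments.items() if url in crawl),
        reverse=True,
    )[:k]
    return dcg(reachable) / ideal_dcg


def mean_max_potential_ndcg(judgments_by_query, crawl, k=10):
    """Average the per-query upper bound over a set of judged queries."""
    scores = [max_potential_ndcg(j, crawl, k) for j in judgments_by_query.values()]
    return sum(scores) / len(scores) if scores else 0.0
```

    Under these assumptions, a crawl that contains every judged page scores 1.0 and a crawl that misses all relevant pages scores 0.0, so two crawl policies of equal size can be compared by this single number.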


      Published In

      SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
      July 2009
      896 pages
      ISBN:9781605584836
      DOI:10.1145/1571941

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. corpus selection
      2. crawl ordering
      3. web crawling

      Qualifiers

      • Research-article

      Conference

      SIGIR '09
      Acceptance Rates

      Overall Acceptance Rate 792 of 3,983 submissions, 20%
