The impact of crawl policy on web search effectiveness

Authors: Dennis Fetterly, Nick Craswell, and Vishwa Vinay

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

July 2009

Pages 580 - 587

https://doi.org/10.1145/1571941.1572041

Published: 19 July 2009

Abstract

Crawl selection policy has a direct influence on Web search effectiveness, because a useful page that is not selected for crawling will also be absent from search results. Yet there has been little or no work on measuring this effect. We introduce an evaluation framework, based on relevance judgments pooled from multiple search engines, that measures the maximum potential NDCG achievable using a particular crawl. This allows us to evaluate different crawl policies and investigate important scenarios like selection stability over multiple iterations. We conduct two sets of crawling experiments, at the scale of 1 billion and 100 million pages respectively. These experiments show that crawl selection based on PageRank, indegree and trans-domain indegree all allow better retrieval effectiveness than a simple breadth-first crawl of the same size. PageRank is the most reliable and effective method. Trans-domain indegree can outperform PageRank, but over multiple crawl iterations it is less effective and more unstable. Finally, we experiment with combinations of crawl selection methods and per-domain page limits, which yield crawls with greater potential NDCG than PageRank.
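The abstract does not spell out how maximum potential NDCG is computed, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that each query has a pool of judged URLs with graded relevance labels, that gain is `2^label - 1`, and that the best a ranker could do with a given crawl is to return the crawled judged pages in descending order of gain. The function names, the cutoff `k`, and the example URLs are all hypothetical.

```python
import math
from typing import Dict, Iterable, Set


def dcg(gains: Iterable[float], k: int) -> float:
    """Discounted cumulative gain of the first k gains, in the order given."""
    top = list(gains)[:k]
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(top))


def max_potential_ndcg(judgments: Dict[str, int], crawl: Set[str], k: int = 10) -> float:
    """Upper bound on NDCG@k for one query if only crawled pages can be returned.

    judgments maps URL -> graded relevance label (assumed format).
    Gain is assumed to be 2**label - 1; the paper may use a different gain.
    """
    # Ideal ranking over every judged page in the pool.
    all_gains = sorted((2 ** label - 1 for label in judgments.values()), reverse=True)
    # Best achievable ranking when restricted to pages present in the crawl.
    crawled_gains = sorted(
        (2 ** label - 1 for url, label in judgments.items() if url in crawl),
        reverse=True,
    )
    ideal = dcg(all_gains, k)
    if ideal == 0:
        return 0.0  # no relevant judged page for this query
    return dcg(crawled_gains, k) / ideal


def mean_max_potential_ndcg(pool: Dict[str, Dict[str, int]], crawl: Set[str], k: int = 10) -> float:
    """Average the per-query upper bound over all judged queries."""
    scores = [max_potential_ndcg(judged, crawl, k) for judged in pool.values()]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical judgment pool: query -> {url: label}.
    pool = {
        "q1": {"http://a.example/": 3, "http://b.example/": 2, "http://c.example/": 0},
        "q2": {"http://d.example/": 1, "http://a.example/": 2},
    }
    crawl = {"http://a.example/", "http://d.example/"}
    print(mean_max_potential_ndcg(pool, crawl))
```

Under these assumptions, two crawls of the same size can be compared by passing each crawl's URL set to `mean_max_potential_ndcg` with the same judgment pool; a crawl that misses highly relevant judged pages scores lower regardless of how good the ranker is.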

Cited By

Cambazoglu B, Baeza-Yates R (2015). Scalability Challenges in Web Search Engines. Synthesis Lectures on Information Concepts, Retrieval, and Services 7:6, pages 1-138. Online publication date: 29-Dec-2015.
https://doi.org/10.2200/S00662ED1V01Y201508ICR045
Tran G, Turk A, Cambazoglu B, Nejdl W, Baeza-Yates R, Lalmas M, Moffat A, Ribeiro-Neto B (2015). A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 153-162. Online publication date: 9-Aug-2015.
https://dl.acm.org/doi/10.1145/2766462.2767737
Ostroumova L, Bogatyy I, Chelnokov A, Tikhonov A, Gusev G (2014). Crawling Policies Based on Web Page Popularity Prediction. Proceedings of the 36th European Conference on IR Research on Advances in Information Retrieval - Volume 8416, pages 100-111. Online publication date: 13-Apr-2014.
https://dl.acm.org/doi/10.5555/2964060.2964158

Index Terms

The impact of crawl policy on web search effectiveness
1. Information systems
  1. Information retrieval

Recommendations

Optimal Freshness Crawl Under Politeness Constraints
SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

A Web crawler is an essential part of a search engine that procures information subsequently served by the search engine to its users. As the Web is becoming increasingly more dynamic, in addition to discovering new web pages a crawler needs to keep ...
Crawl ordering by search impact
WSDM '08: Proceedings of the 2008 International Conference on Web Search and Data Mining

We study how to prioritize the fetching of new pages under the objective of maximizing the quality of search results. In particular, our objective is to fetch new pages that have the most impact, where the impact of a page is equal to the number of ...
A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking
SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval

Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and ...

Published In

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

July 2009

896 pages

ISBN:9781605584836

DOI:10.1145/1571941

General Chairs:
James Allan, University of Massachusetts Amherst, USA
Javed Aslam, Northeastern University, USA

Program Chairs:
Mark Sanderson, University of Sheffield, UK
ChengXiang Zhai, University of Illinois at Urbana-Champaign, USA
Justin Zobel, University of Melbourne, Australia

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGIR '09

SIGIR '09: The 32nd International ACM SIGIR conference on research and development in Information Retrieval

July 19 - 23, 2009

Boston, MA, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Citations

Cited By

Santos A, de Carvalho C, Almeida J, de Moura E, da Silva A, Ziviani N (2014). A genetic programming framework to schedule webpage updates. Information Retrieval Journal 18:1, pages 73-94. Online publication date: 28-Oct-2014.
https://doi.org/10.1007/s10791-014-9248-5
Ostroumova L, Bogatyy I, Chelnokov A, Tikhonov A, Gusev G (2014). Crawling Policies Based on Web Page Popularity Prediction. Advances in Information Retrieval, pages 100-111. Online publication date: 2014.
https://doi.org/10.1007/978-3-319-06028-6_9
Lee C, Croft W (2013). Incorporating social anchors for ad hoc retrieval. Proceedings of the 10th Conference on Open Research Areas in Information Retrieval, pages 181-188. Online publication date: 15-May-2013.
https://dl.acm.org/doi/10.5555/2491748.2491786
Schäfer R, Bildhauer F (2013). Web Corpus Construction. Synthesis Lectures on Human Language Technologies 6:4, pages 1-145. Online publication date: 19-Jul-2013.
https://doi.org/10.2200/S00508ED1V01Y201305HLT022
Lefortier D, Ostroumova L, Samosvat E, Serdyukov P, He Q, Iyengar A, Nejdl W, Pei J, Rastogi R (2013). Timely crawling of high-quality ephemeral new content. Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 745-750. Online publication date: 27-Oct-2013.
https://dl.acm.org/doi/10.1145/2505515.2505641
Alam M, Ha J, Lee S (2012). Novel approaches to crawling important pages early. Knowledge and Information Systems 33:3, pages 707-734. Online publication date: 1-Dec-2012.
https://dl.acm.org/doi/10.1007/s10115-012-0535-4
Uemura Y, Itokawa T, Kitasuka T, Aritsugi M (2012). An Effectively Focused Crawling System. Innovations in Intelligent Machines – 2, pages 61-76. Online publication date: 2012.
https://doi.org/10.1007/978-3-642-23190-2_5