research-article

Clustering-based incremental web crawling

Authors:

Prasenjit Mitra

ACM Transactions on Information Systems (TOIS), Volume 28, Issue 4

Article No.: 17, Pages 1 - 27

https://doi.org/10.1145/1852102.1852103

Published: 23 November 2010

Abstract

When crawling resources (for example, the number of machines, crawl time, and so on) are limited, a crawler has to decide on an optimal order in which to crawl and recrawl Web pages. Ideally, crawlers should request only those Web pages that have changed since the last crawl; in practice, a crawler may not know whether a Web page has changed before downloading it. In this article, we identify features of Web pages that are correlated with their change frequency. We design a crawling algorithm that clusters Web pages based on features that correlate with their change frequencies obtained by examining past history. The crawler downloads a sample of Web pages from each cluster and, depending on whether a significant number of these Web pages have changed in the last crawl cycle, decides whether to recrawl the entire cluster. To evaluate the performance of our incremental crawler, we develop an evaluation framework that measures which crawling policy yields the best search results for the end user. We run experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 Web sites. The results demonstrate that the clustering-based sampling algorithm effectively clusters pages with similar change patterns, and that our clustering-based crawling algorithm outperforms existing algorithms in that it can improve the quality of the user experience for those who query the search engine.
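The sample-then-decide step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `select_clusters_to_recrawl`, the `has_changed` callback, the sample size, and the fixed change threshold are all hypothetical stand-ins, and the abstract does not say how "a significant number" of changed pages is determined. The clustering itself is assumed to have been done already.

```python
import random
from typing import Callable, Dict, List, Optional


def select_clusters_to_recrawl(
    clusters: Dict[str, List[str]],
    has_changed: Callable[[str], bool],
    sample_size: int = 5,
    change_threshold: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return IDs of clusters whose sampled pages changed often enough.

    All parameters are illustrative assumptions: `has_changed` stands in
    for downloading a page and comparing it with the stored copy, and
    `change_threshold` is a simple fraction, not the paper's criterion.
    """
    rng = rng or random.Random(0)
    selected = []
    for cluster_id, urls in clusters.items():
        if not urls:
            continue  # nothing to sample from an empty cluster
        # Download only a sample of the cluster, not every page.
        sample = rng.sample(urls, min(sample_size, len(urls)))
        changed = sum(1 for url in sample if has_changed(url))
        # Recrawl the whole cluster when enough sampled pages changed.
        if changed / len(sample) >= change_threshold:
            selected.append(cluster_id)
    return selected


if __name__ == "__main__":
    clusters = {
        "news": ["n1", "n2", "n3", "n4"],
        "archive": ["a1", "a2", "a3", "a4"],
    }
    changed_pages = {"n1", "n2", "n3", "a1"}
    # Sampling all four pages per cluster: "news" has 3 of 4 changed,
    # "archive" only 1 of 4, so only "news" passes the 0.5 threshold.
    print(select_clusters_to_recrawl(
        clusters, changed_pages.__contains__, sample_size=4))
```

The point of the design is that the cost of checking a cluster is bounded by the sample size, so crawl resources go mainly to clusters whose sampled pages suggest that many of their members have changed.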

Index Terms

Clustering-based incremental web crawling
1. Information systems
  1. Information systems applications

Published In

ACM Transactions on Information Systems Volume 28, Issue 4

November 2010

204 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/1852102

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 November 2010

Accepted: 01 March 2010

Revised: 01 October 2009

Received: 01 October 2008

Published in TOIS Volume 28, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation

Bibliometrics & Citations

Article Metrics

Total Citations: 33
Total Downloads: 936
Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 2

Reflects downloads up to 03 Feb 2025

Cited By

Alanoğlu Z, Akcayol M (2024). EMACrawler: Web Arama Motoru Veritabanı Tazeliği Optimizasyonu. Politeknik Dergisi 27:6 (2201-2214). Online publication date: 12-Dec-2024
https://doi.org/10.2339/politeknik.1347054
Arthy J, Raja K (2024). A Study on Design, Development and Deployment of Web Crawler Algorithms and Their Metrics. 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS) (1-6). Online publication date: 18-Apr-2024
https://doi.org/10.1109/ADICS58448.2024.10533459
ALANOĞLU Z, AKCAYOL M (2023). Web Tarayıcılarında Tohum URL Seçimi ve Performans Analizi: Kapsamlı Bir İnceleme [Seed URL Selection and Performance Analysis in Web Crawlers: A Comprehensive Review]. Düzce Üniversitesi Bilim ve Teknoloji Dergisi 11:3 (1399-1423). Online publication date: 31-Jul-2023
https://doi.org/10.29130/dubited.1097123
Dang T, Bucur D, Atil B, Pitel G, Ruis F, Kadkhodaei H, Litvak N (2023). Look back, look around: A systematic analysis of effective predictors for new outlinks in focused Web crawling. Knowledge-Based Systems 260 (110126). Online publication date: Jan-2023
https://doi.org/10.1016/j.knosys.2022.110126
Han S, Bendersky M, Gajda P, Novikov S, Najork M, Brodowsky B, Popescul A (2020). Adversarial Bandits Policy for Crawling Commercial Web Content. Proceedings of The Web Conference 2020 (407-417). Online publication date: 20-Apr-2020
https://dl.acm.org/doi/10.1145/3366423.3380125
Makkar A, Kumar N (2020). An efficient deep learning-based scheme for web spam detection in IoT environment. Future Generation Computer Systems. Online publication date: Mar-2020
https://doi.org/10.1016/j.future.2020.03.004
Han S, Brodowsky B, Gajda P, Novikov S, Bendersky M, Najork M, Dua R, Popescul A (2019). Predictive Crawling for Commercial Web Content. The World Wide Web Conference (627-637). Online publication date: 13-May-2019
https://dl.acm.org/doi/10.1145/3308558.3313694
Yang G (2019). Network node grouping algorithm and evaluation model based on clustering and Bayesian classifier. International Journal of Computers and Applications 45:1 (70-76). Online publication date: 18-Dec-2019
https://doi.org/10.1080/1206212X.2019.1703337
Pham K, Santos A, Freire J, Chang Y, Zhai C, Liu Y, Maarek Y (2018). Learning to Discover Domain-Specific Web Content. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (432-440). Online publication date: 2-Feb-2018
https://dl.acm.org/doi/10.1145/3159652.3159724
Akhtar U, Razzaq M, Ur Rehman U, Amin M, Khan W, Huh E, Lee S (2018). Change-Aware Scheduling for Effectively Updating Linked Open Data Caches. IEEE Access 6 (65862-65873). Online publication date: 2018
https://doi.org/10.1109/ACCESS.2018.2871511