research-article

Clustering-based incremental web crawling

Published: 23 November 2010

Abstract

Crawling resources, such as the number of machines and the available crawl time, are limited, so a crawler has to decide on an optimal order in which to crawl and recrawl Web pages. Ideally, crawlers should request only those Web pages that have changed since the last crawl; in practice, a crawler may not know whether a Web page has changed before downloading it. In this article, we identify features of Web pages that are correlated with their change frequency. We design a crawling algorithm that clusters Web pages based on features that correlate with their change frequencies obtained by examining past history. The crawler downloads a sample of Web pages from each cluster and, depending on whether a significant number of these Web pages have changed in the last crawl cycle, decides whether to recrawl the entire cluster. To evaluate the performance of our incremental crawler, we develop an evaluation framework that measures which crawling policy results in the best search results for the end user. We run experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 Web sites. The results demonstrate that the clustering-based sampling algorithm effectively clusters pages with similar change patterns, and that our clustering-based crawling algorithm outperforms existing algorithms in that it can improve the quality of the user experience for those who query the search engine.
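The sample-then-decide step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `clusters_to_recrawl`, the parameters `sample_size` and `change_threshold`, and the fixed-fraction rule standing in for "a significant number" are all assumptions made for illustration, and the clusters are assumed to have been computed beforehand.

```python
import random
from typing import Callable, Dict, List, Optional


def clusters_to_recrawl(
    clusters: Dict[str, List[str]],
    has_changed: Callable[[str], bool],
    sample_size: int = 3,            # hypothetical: pages sampled per cluster
    change_threshold: float = 0.5,   # hypothetical: fraction counted as "significant"
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the ids of clusters whose sampled pages changed often enough."""
    rng = rng or random.Random(0)
    selected = []
    for cluster_id, urls in clusters.items():
        if not urls:
            continue  # nothing to sample in an empty cluster
        # Download only a sample of the cluster, not every page.
        sample = rng.sample(urls, min(sample_size, len(urls)))
        changed = sum(1 for url in sample if has_changed(url))
        # Recrawl the whole cluster when enough of the sampled pages changed.
        if changed / len(sample) >= change_threshold:
            selected.append(cluster_id)
    return selected


if __name__ == "__main__":
    # Toy data: every "news" page changed, no "archive" page did.
    clusters = {
        "news": ["n1", "n2", "n3", "n4"],
        "archive": ["a1", "a2", "a3", "a4"],
    }
    changed_pages = {"n1", "n2", "n3", "n4"}
    print(clusters_to_recrawl(clusters, lambda url: url in changed_pages))
```

In a real crawler, `has_changed` would download the page and compare it with the stored copy, so the cost of the decision is bounded by the sample size rather than by the cluster size.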

    Published In

    ACM Transactions on Information Systems  Volume 28, Issue 4
    November 2010
    204 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/1852102
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 November 2010
    Accepted: 01 March 2010
    Revised: 01 October 2009
    Received: 01 October 2008
    Published in TOIS Volume 28, Issue 4

    Author Tags

    1. Clustering
    2. Web crawler
    3. refresh policy
    4. sampling
    5. search engine

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics & Citations

    Article Metrics

    • Downloads (Last 12 months): 8
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 03 Feb 2025

    Cited By

    • (2024) EMACrawler: Web Arama Motoru Veritabanı Tazeliği Optimizasyonu [EMACrawler: Web Search Engine Database Freshness Optimization]. Politeknik Dergisi 27:6 (2201-2214). DOI: 10.2339/politeknik.1347054. Online publication date: 12-Dec-2024
    • (2024) A Study on Design, Development and Deployment of Web Crawler Algorithms and Their Metrics. 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS) (1-6). DOI: 10.1109/ADICS58448.2024.10533459. Online publication date: 18-Apr-2024
    • (2023) Web Tarayıcılarında Tohum URL Seçimi ve Performans Analizi: Kapsamlı Bir İnceleme [Seed URL Selection and Performance Analysis in Web Crawlers: A Comprehensive Review]. Düzce Üniversitesi Bilim ve Teknoloji Dergisi 11:3 (1399-1423). DOI: 10.29130/dubited.1097123. Online publication date: 31-Jul-2023
    • (2023) Look back, look around: A systematic analysis of effective predictors for new outlinks in focused Web crawling. Knowledge-Based Systems 260 (110126). DOI: 10.1016/j.knosys.2022.110126. Online publication date: Jan-2023
    • (2020) Adversarial Bandits Policy for Crawling Commercial Web Content. Proceedings of The Web Conference 2020 (407-417). DOI: 10.1145/3366423.3380125. Online publication date: 20-Apr-2020
    • (2020) An efficient deep learning-based scheme for web spam detection in IoT environment. Future Generation Computer Systems. DOI: 10.1016/j.future.2020.03.004. Online publication date: Mar-2020
    • (2019) Predictive Crawling for Commercial Web Content. The World Wide Web Conference (627-637). DOI: 10.1145/3308558.3313694. Online publication date: 13-May-2019
    • (2019) Network node grouping algorithm and evaluation model based on clustering and Bayesian classifier. International Journal of Computers and Applications 45:1 (70-76). DOI: 10.1080/1206212X.2019.1703337. Online publication date: 18-Dec-2019
    • (2018) Learning to Discover Domain-Specific Web Content. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (432-440). DOI: 10.1145/3159652.3159724. Online publication date: 2-Feb-2018
    • (2018) Change-Aware Scheduling for Effectively Updating Linked Open Data Caches. IEEE Access 6 (65862-65873). DOI: 10.1109/ACCESS.2018.2871511. Online publication date: 2018
