Topical web crawlers: Evaluating adaptive algorithms

Authors: Filippo Menczer, Padmini Srinivasan

ACM Transactions on Internet Technology (TOIT), Volume 4, Issue 4

Pages 378 - 419

https://doi.org/10.1145/1031114.1031117

Published: 01 November 2004

Abstract

Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topical crawling algorithms under a number of performance metrics. Such a framework is employed here to evaluate different algorithms that have proven highly competitive among those proposed in the literature and in our own previous research. In particular we focus on the tradeoff between exploration and exploitation of the cues available to a crawler, and on adaptive crawlers that use machine learning techniques to guide their search. We find that the best performance is achieved by a novel combination of explorative and exploitative bias, and introduce an evolutionary crawler that surpasses the performance of the best nonadaptive crawler after sufficiently long crawls. We also analyze the computational complexity of the various crawlers and discuss how performance and complexity scale with available resources. Evolutionary crawlers achieve high efficiency and scalability by distributing the work across concurrent agents, resulting in the best performance/cost ratio.
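
To make the exploration/exploitation tradeoff concrete, the following is a minimal sketch of a best-first frontier that expands a batch of links per step. It is an illustration under stated assumptions, not the authors' implementation: the function name `best_n_first_crawl`, the callbacks `fetch_links` and `score`, the `batch_size` parameter, and the toy graph `TOY_LINKS` / `TOY_SCORES` are all hypothetical. A batch size of 1 is purely exploitative (always follow the single best-scoring link); larger batches also visit lower-scoring links, which adds exploration.

```python
import heapq
from typing import Callable, Dict, Iterable, List


def best_n_first_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],  # hypothetical: returns out-links of a page
    score: Callable[[str], float],                # hypothetical: estimated relevance of a link
    batch_size: int,
    max_pages: int,
) -> List[str]:
    """Visit up to max_pages pages, taking the batch_size best frontier links per step."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited: List[str] = []

    while frontier and len(visited) < max_pages:
        take = min(batch_size, len(frontier))
        batch = [heapq.heappop(frontier) for _ in range(take)]
        for _, url in batch:
            if len(visited) >= max_pages:
                break
            visited.append(url)
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score(link), link))
    return visited


# Hypothetical toy graph: the most relevant page "e" sits behind the
# low-scoring link "c", so a purely greedy crawler reaches it late.
TOY_LINKS: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}
TOY_SCORES: Dict[str, float] = {"a": 1.0, "b": 0.9, "c": 0.5, "d": 0.7, "e": 0.95}

if __name__ == "__main__":
    for n in (1, 2):
        order = best_n_first_crawl(
            ["a"], TOY_LINKS.__getitem__, TOY_SCORES.__getitem__, n, max_pages=4
        )
        print(n, order)
```

With a budget of four pages on this toy graph, the greedy setting (`batch_size=1`) visits `a, b, d, c` and never reaches `e`, while `batch_size=2` visits `a, b, c, e`. The example only shows why some explorative bias can pay off; it says nothing about the performance figures reported in the article.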

Published In

ACM Transactions on Internet Technology Volume 4, Issue 4

November 2004

108 pages

ISSN:1533-5399

EISSN:1557-6051

DOI:10.1145/1031114

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
