Topical web crawlers: Evaluating adaptive algorithms

Published: 01 November 2004

Abstract

Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topical crawling algorithms under a number of performance metrics. Such a framework is employed here to evaluate different algorithms that have proven highly competitive among those proposed in the literature and in our own previous research. In particular we focus on the tradeoff between exploration and exploitation of the cues available to a crawler, and on adaptive crawlers that use machine learning techniques to guide their search. We find that the best performance is achieved by a novel combination of explorative and exploitative bias, and introduce an evolutionary crawler that surpasses the performance of the best nonadaptive crawler after sufficiently long crawls. We also analyze the computational complexity of the various crawlers and discuss how performance and complexity scale with available resources. Evolutionary crawlers achieve high efficiency and scalability by distributing the work across concurrent agents, resulting in the best performance/cost ratio.

Published In

ACM Transactions on Internet Technology, Volume 4, Issue 4
November 2004
108 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/1031114
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Efficiency
  2. evaluation
  3. evolution
  4. exploitation
  5. exploration
  6. reinforcement learning
  7. topical crawlers

Qualifiers

  • Article
