Learning to crawl: Comparing classification schemes

Published: 01 October 2005


Topical crawling is a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been sporadically suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler, thus biasing it towards certain portions of the Web graph. Our results show that Naive Bayes is a weak choice for guiding a topical crawler when compared with a Support Vector Machine or a Neural Network. Further, the weak performance of Naive Bayes can be partly explained by the extreme skewness of the posterior probabilities it generates. We also observe that, despite similar performance, different topical crawlers cover subspaces of the Web with low overlap.
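The idea of a classifier-guided best-first crawl can be illustrated with a minimal sketch. It is not the authors' implementation: it is sequential rather than parallel, it runs over a small in-memory graph instead of the Web, and it assumes that each unvisited link inherits the classifier score of the page on which it was found. The names `best_first_crawl`, `TOY_GRAPH`, and `TOY_SCORES` are hypothetical and exist only for this example.

```python
import heapq


def best_first_crawl(graph, seeds, score, max_pages):
    """Visit pages in order of descending priority.

    `graph` maps a URL to its outlinks; `score` stands in for a trained
    classifier and returns a relevance estimate for a fetched page.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]  # seeds get top priority
    heapq.heapify(frontier)
    queued = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        # Assumption: outlinks inherit the score of the page linking to them.
        priority = score(url)
        for link in graph.get(url, []):
            if link not in queued:
                queued.add(link)
                heapq.heappush(frontier, (-priority, link))
    return visited


# Toy data (hypothetical): page "a" looks relevant, page "b" does not.
TOY_GRAPH = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
TOY_SCORES = {"seed": 0.5, "a": 0.9, "b": 0.1, "c": 0.8, "d": 0.2}

order = best_first_crawl(TOY_GRAPH, ["seed"], TOY_SCORES.get, max_pages=4)
print(order)
```

With a budget of four pages, the link found on the high-scoring page "a" is followed before the crawler returns to "b", and the link found on "b" is never fetched. This is the sense in which the classifier biases the crawl towards certain portions of the graph.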


Jonathan P. E. Hodgson

The construction of a Web crawler that searches for pages relevant to a specific topic requires some way of choosing which links to pursue, based on information provided by the pages seen so far. This paper describes experiments designed to compare topical Web crawlers. Each crawler considered uses a classification scheme derived from a set of seed pages that includes both relevant and nonrelevant pages, so that they can be used as training examples. Three kinds of selection mechanism were considered: a naive Bayes classifier, a support vector machine classifier, and a neural network. In fact, within each of these types, the authors experimented with slightly different mechanisms before settling on the one that seemed best in each category.

Since neither precision nor recall can be calculated for the Web, the authors used proxies. For precision, they use the harvest rate, which estimates the relevance of the returned pages using a classifier trained on a larger set than that used for the crawler's classifier. For recall, they use target recall, which measures the proportion of a set of target pages that are retrieved.

The paper describes the experiments in detail, and comes to several conclusions. One conclusion is that the naive Bayes classifier is inferior to the other two mechanisms. This appears to be a consequence of the sharp distinctions between relevant and nonrelevant pages made by the naive Bayes system, whereas the others allow more shades of gray. A second conclusion is that the support vector machine and neural network crawlers have an average overlap of no more than 50 percent in both the uniform resource locators (URLs) retrieved and the targets fetched, but that their results seem equally good.

The paper covers a great deal of ground, and initiates an interesting area of research. It is recommended to anyone interested in topical information retrieval on the Web.

Online Computing Reviews Service
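The proxy measures mentioned in the review can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's exact formulas: harvest rate is taken here as the fraction of crawled pages that an evaluation classifier judges relevant, and overlap is taken as the shared pages divided by the size of the smaller crawl. All function names, page identifiers, and data below are hypothetical.

```python
def harvest_rate(crawled, is_relevant):
    """Fraction of crawled pages judged relevant by an evaluation classifier."""
    if not crawled:
        return 0.0
    return sum(1 for url in crawled if is_relevant(url)) / len(crawled)


def target_recall(crawled, targets):
    """Proportion of a known set of target pages that the crawl retrieved."""
    targets = set(targets)
    if not targets:
        return 0.0
    return len(set(crawled) & targets) / len(targets)


def overlap(crawl_a, crawl_b):
    """Shared pages relative to the smaller crawl (an assumed definition)."""
    a, b = set(crawl_a), set(crawl_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical crawls by two different crawlers over made-up page IDs.
crawl_svm = ["p1", "p2", "p3", "p4"]
crawl_nn = ["p3", "p4", "p5", "p6"]
relevant_pages = {"p1", "p3", "p5", "p6"}  # stand-in for the evaluation classifier
targets = {"p3", "p6", "p9", "p10"}


def judged_relevant(url):
    return url in relevant_pages


print(harvest_rate(crawl_svm, judged_relevant))  # 0.5
print(target_recall(crawl_nn, targets))          # 0.5
print(overlap(crawl_svm, crawl_nn))              # 0.5
```

In this toy data the two crawls share only half of their pages, which mirrors in miniature the review's observation that crawlers can fetch largely different pages while scoring similarly.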





Published In

ACM Transactions on Information Systems  Volume 23, Issue 4
October 2005
135 pages
Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 October 2005
Published in TOIS Volume 23, Issue 4


Author Tags

  1. topical crawlers
  2. classifiers
  3. focused crawlers
  4. machine learning


