Learning to crawl: Comparing classification schemes

Published: 01 October 2005

Abstract

Topical crawling is a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been sporadically suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler thus biasing it towards certain portions of the Web graph. Our results show that Naive Bayes is a weak choice for guiding a topical crawler when compared with Support Vector Machine or Neural Network. Further, the weak performance of Naive Bayes can be partly explained by extreme skewness of posterior probabilities generated by it. We also observe that despite similar performances, different topical crawlers cover subspaces on the Web with low overlap.
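The best-first search described in the abstract can be illustrated with a minimal, single-threaded sketch over a toy in-memory link graph. This is not the authors' implementation: the function name `best_first_crawl`, the toy graph, and the `score` callable are assumptions made for illustration, and `score` merely stands in for whatever relevance estimate a trained classifier would supply. The parallel aspect mentioned in the abstract is omitted.

```python
import heapq
from typing import Callable, Dict, List


def best_first_crawl(
    graph: Dict[str, List[str]],
    seeds: List[str],
    score: Callable[[str], float],
    max_pages: int,
) -> List[str]:
    """Visit up to max_pages URLs, always expanding the best-scored frontier URL."""
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


# Hypothetical toy data: a tiny link graph and made-up relevance scores.
toy_graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
toy_relevance = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.8, "d": 0.1}

order = best_first_crawl(
    toy_graph, ["seed"], lambda u: toy_relevance.get(u, 0.0), max_pages=4
)
print(order)  # ['seed', 'b', 'a', 'c']
```

With these made-up scores the crawl visits `b` before `a` and never reaches `d` within its budget, which is the sense in which a scoring heuristic biases a crawler toward some portions of the graph.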

Reviews

Jonathan P. E. Hodgson

The construction of a Web crawler that searches for pages relevant to a specific topic requires some way of choosing the links to pursue, based on information provided by the pages seen so far. This paper describes experiments designed to compare topical Web crawlers. Each crawler considered uses a classification scheme derived from a set of seed pages that include both relevant and nonrelevant pages, so that they can be used as training examples.

Three kinds of selection mechanism were considered: a naive Bayes classifier, a support vector machine classifier, and a neural network. Within each of these types, the authors experimented with slightly different mechanisms before settling on the one that seemed best in its category.

Since neither precision nor recall can be calculated for the Web, the authors used proxies. For precision, they use the harvest rate, which determines the relevance of the returned pages based on a classifier trained on a larger set than that used for the crawler's classifier. For recall, they use target recall, which measures the proportion of a set of target pages that are retrieved.

The paper describes the experiments in detail and comes to several conclusions. One conclusion is that the naive Bayes classifier is inferior to the other two mechanisms. This appears to be a consequence of the sharp distinctions between relevant and nonrelevant pages made by the naive Bayes system, whereas the others have more shades of gray. A second conclusion is that the support vector machine and neural network crawlers have an average overlap of no more than 50 percent in both the uniform resource locators (URLs) retrieved and the targets fetched, yet their results seem equally good.

The paper covers a great deal of ground and initiates an interesting area of research. It is recommended to anyone interested in topical information retrieval on the Web.

Online Computing Reviews Service
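The two proxy measures named in the review can be sketched as follows. This is an illustrative reading rather than the paper's exact definitions: the function names, the sample page identifiers, and the boolean `is_relevant` judge (standing in for the separately trained evaluation classifier) are all assumptions, and the paper may, for example, average classifier scores rather than count yes/no judgments.

```python
from typing import Callable, Iterable, Set


def harvest_rate(crawled: Iterable[str], is_relevant: Callable[[str], bool]) -> float:
    """Fraction of crawled pages that the evaluation judge marks as relevant."""
    pages = list(crawled)
    if not pages:
        return 0.0
    return sum(1 for page in pages if is_relevant(page)) / len(pages)


def target_recall(crawled: Iterable[str], targets: Set[str]) -> float:
    """Fraction of the known target pages that the crawl retrieved."""
    if not targets:
        return 0.0
    return len(set(crawled) & targets) / len(targets)


# Hypothetical data: four crawled pages, a made-up relevance judge, two targets.
crawled_pages = ["p1", "p2", "p3", "p4"]
judged_relevant = {"p1", "p3", "p4"}
target_pages = {"p3", "p9"}

hr = harvest_rate(crawled_pages, lambda page: page in judged_relevant)
tr = target_recall(crawled_pages, target_pages)
print(hr, tr)  # 0.75 0.5
```

Both functions return 0.0 on empty input, a convention chosen here only to avoid division by zero.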

Published In

ACM Transactions on Information Systems, Volume 23, Issue 4
October 2005
135 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/1095872

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 October 2005
Published in TOIS Volume 23, Issue 4

Author Tags

  1. Topical crawlers
  2. classifiers
  3. focused crawlers
  4. machine learning

Qualifiers

  • Article

Cited By

  • (2024) Web Miner: Automated Web Crawling and Database System with Puppeteer and Node.js. Smart Systems: Innovations in Computing. DOI: 10.1007/978-981-97-3690-4_12 (149-159). Online publication date: 30-Sep-2024
  • (2023) Automatic Creation of a Pharmaceutical Corpus Based on Open-Data. Computational Linguistics and Intelligent Text Processing. DOI: 10.1007/978-3-031-24337-0_31 (432-450). Online publication date: 26-Feb-2023
  • (2022) Online learning agents for cost-sensitive topical data acquisition from the web. Intelligent Data Analysis 26:3 (695-722). DOI: 10.3233/IDA-205107. Online publication date: 1-Jan-2022
  • (2022) A new similarity measure for vector space models in text classification and information retrieval. Journal of Information Science 48:4 (463-476). DOI: 10.1177/0165551520968055. Online publication date: 1-Aug-2022
  • (2022) An efficient focused crawler using LSTM-CNN based deep learning. International Journal of System Assurance Engineering and Management 14:1 (391-407). DOI: 10.1007/s13198-022-01808-w. Online publication date: 19-Dec-2022
  • (2021) Creating Event-Centric Collections from Web Archives. The Past Web. DOI: 10.1007/978-3-030-63291-5_6 (57-67). Online publication date: 1-Jul-2021
  • (2020) Crawling the German Health Web: Exploratory Study and Graph Analysis. Journal of Medical Internet Research 22:7 (e17853). DOI: 10.2196/17853. Online publication date: 24-Jul-2020
  • (2020) Towards extracting event-centric collections from Web archives. International Journal on Digital Libraries 21:1 (31-45). DOI: 10.1007/s00799-018-0258-6. Online publication date: 1-Mar-2020
  • (2019) An Improved Focused Web Crawler based on Hybrid Similarity. International Journal of Performability Engineering 15:10 (2645). DOI: 10.23940/ijpe.19.10.p10.26452656. Online publication date: 2019
  • (2019) An ontology-driven multimedia focused crawler based on linked open data and deep learning techniques. Multimedia Tools and Applications 79:11-12 (7577-7598). DOI: 10.1007/s11042-019-08252-2. Online publication date: 24-Dec-2019