
Learning to crawl: Comparing classification schemes

Authors: Gautam Pant, Padmini Srinivasan

ACM Transactions on Information Systems (TOIS), Volume 23, Issue 4

Pages 430 - 462

https://doi.org/10.1145/1095872.1095875

Published: 01 October 2005

Abstract

Topical crawling is a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been sporadically suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler thus biasing it towards certain portions of the Web graph. Our results show that Naive Bayes is a weak choice for guiding a topical crawler when compared with Support Vector Machine or Neural Network. Further, the weak performance of Naive Bayes can be partly explained by extreme skewness of posterior probabilities generated by it. We also observe that despite similar performances, different topical crawlers cover subspaces on the Web with low overlap.
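The best-first search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is single-threaded rather than parallel, the Web is replaced by a small in-memory graph, and `toy_graph`, `toy_scores`, and `best_first_crawl` are hypothetical names invented for illustration. The `score` argument stands in for whatever heuristic a trained classifier would supply.

```python
import heapq


def best_first_crawl(graph, seeds, score, max_pages):
    """Visit up to max_pages nodes, always expanding the highest-scored URL.

    graph: dict mapping a URL to the list of URLs it links to (toy stand-in
    for fetching and parsing a page). score: callable returning a priority
    for a URL (assumed here to come from a classifier).
    """
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


# Hypothetical toy data: link structure and classifier-like scores.
toy_graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
toy_scores = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.8, "d": 0.1}

order = best_first_crawl(toy_graph, ["seed"], toy_scores.get, max_pages=4)
print(order)  # ['seed', 'b', 'a', 'c']
```

The scores bias the traversal: `b` is expanded before `a`, and `c` is fetched ahead of the low-scored `d` once it enters the frontier.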



Index Terms

Learning to crawl: Comparing classification schemes
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
    2. Information retrieval query processing


Reviews

Reviewer: Jonathan P. E. Hodgson

The construction of a Web crawler that searches for pages relevant to a specific topic requires some way of choosing the links to pursue, based on information provided by the pages seen so far. This paper describes experiments designed to compare topical Web crawlers. Each crawler considered uses a classification scheme derived from a set of seed pages that include both relevant and nonrelevant pages, so that they can be used as training examples.

Three kinds of selection mechanism were considered: a naive Bayes classifier, a support vector machine classifier, and a neural network. In fact, within these types, the authors experimented with slightly different mechanisms before settling on one that seemed best in each category. Since neither precision nor recall can be calculated for the Web, the authors used proxies. For precision, they use the harvest rate, which determines the relevance of the returned pages based on a classifier trained on a larger set than that used for the crawler's classifier. For recall, they use target recall, which measures the proportion of a set of target pages that are retrieved.

The paper describes the experiments in detail, and comes to several conclusions. One conclusion is that the naive Bayes classifier is inferior to the other two mechanisms. This appears to be a consequence of the sharp distinctions between relevant and nonrelevant pages made by the naive Bayes system, whereas the others have more shades of gray. A second conclusion is that the support vector machine and neural network crawlers have an average overlap of no more than 50 percent in both the uniform resource locators (URLs) retrieved and the targets fetched, but that their results seem equally good.

The paper covers a great deal of ground, and initiates an interesting area of research. It is recommended to anyone interested in topical information retrieval on the Web.

Online Computing Reviews Service
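The two proxy measures mentioned in the review can be sketched as follows. This is an illustrative reading, not the paper's exact definitions: harvest rate is taken here as the fraction of crawled pages that an evaluation classifier judges relevant, and target recall as the fraction of a known target set that the crawl reached. The function names, the `is_relevant` callable, and the sample page identifiers are all hypothetical.

```python
def harvest_rate(crawled, is_relevant):
    """Fraction of crawled pages judged relevant by an evaluation classifier.

    is_relevant is an assumed callable standing in for the classifier
    trained on the larger evaluation set.
    """
    if not crawled:
        return 0.0
    return sum(1 for page in crawled if is_relevant(page)) / len(crawled)


def target_recall(crawled, targets):
    """Fraction of the known target pages that appear among the crawled pages."""
    targets = set(targets)
    if not targets:
        return 0.0
    return len(set(crawled) & targets) / len(targets)


# Hypothetical sample data for illustration only.
crawled_pages = ["p1", "p2", "p3", "p4"]
judged_relevant = {"p1", "p3"}
target_pages = {"p3", "p4", "p9", "p10"}

rate = harvest_rate(crawled_pages, lambda page: page in judged_relevant)
recall = target_recall(crawled_pages, target_pages)
print(rate, recall)  # 0.5 0.5
```

A crawler can score well on one measure and poorly on the other, which is one reason to report both.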


Published In

ACM Transactions on Information Systems, Volume 23, Issue 4, October 2005, 135 pages

ISSN: 1046-8188

EISSN: 1558-2868

DOI: 10.1145/1095872

Publisher

Association for Computing Machinery, New York, NY, United States


Bibliometrics

Total Citations: 102

Total Downloads: 2,757

Downloads (Last 12 months): 21

Downloads (Last 6 weeks): 0

Reflects downloads up to 10 Oct 2024

Cited By

Gaur, A., Singh, V., Kumar, S., and Kaur, M. 2024. Web Miner: Automated Web Crawling and Database System with Puppeteer and Node.js. Smart Systems: Innovations in Computing, 149--159. Online publication date: 30-Sep-2024. https://doi.org/10.1007/978-981-97-3690-4_12

Bravo, C., Otálora, S., and Ordoñez-Salinas, S. 2023. Automatic Creation of a Pharmaceutical Corpus Based on Open-Data. Computational Linguistics and Intelligent Text Processing, 432--450. Online publication date: 26-Feb-2023. https://doi.org/10.1007/978-3-031-24337-0_31

Naghibi, M., Anvari, R., Forghani, A., and Minaei, B. 2022. Online learning agents for cost-sensitive topical data acquisition from the web. Intelligent Data Analysis 26, 3, 695--722. Online publication date: 1-Jan-2022. https://dl.acm.org/doi/10.3233/IDA-205107

Eminagaoglu, M. 2022. A new similarity measure for vector space models in text classification and information retrieval. Journal of Information Science 48, 4, 463--476. Online publication date: 1-Aug-2022. https://dl.acm.org/doi/10.1177/0165551520968055

Shrivastava, G., Pateriya, R., and Kaushik, P. 2022. An efficient focused crawler using LSTM-CNN based deep learning. International Journal of System Assurance Engineering and Management 14, 1, 391--407. Online publication date: 19-Dec-2022. https://doi.org/10.1007/s13198-022-01808-w

Demidova, E. and Risse, T. 2021. Creating Event-Centric Collections from Web Archives. The Past Web, 57--67. Online publication date: 1-Jul-2021. https://doi.org/10.1007/978-3-030-63291-5_6

Zowalla, R., Wetter, T., and Pfeifer, D. 2020. Crawling the German Health Web: Exploratory Study and Graph Analysis. Journal of Medical Internet Research 22, 7, e17853. Online publication date: 24-Jul-2020. https://doi.org/10.2196/17853

Gossen, G., Risse, T., and Demidova, E. 2020. Towards extracting event-centric collections from Web archives. International Journal on Digital Libraries 21, 1, 31--45. Online publication date: 1-Mar-2020. https://dl.acm.org/doi/10.1007/s00799-018-0258-6

Songtao, S., Huaiguang, W., and Jiangtao, M. 2019. An Improved Focused Web Crawler based on Hybrid Similarity. International Journal of Performability Engineering 15, 10, 2645. Online publication date: 2019. https://doi.org/10.23940/ijpe.19.10.p10.26452656

Capuano, A., Rinaldi, A., and Russo, C. 2019. An ontology-driven multimedia focused crawler based on linked open data and deep learning techniques. Multimedia Tools and Applications 79, 11-12, 7577--7598. Online publication date: 24-Dec-2019. https://doi.org/10.1007/s11042-019-08252-2
