DOI: 10.1007/11424758_20

A focused crawling for the web resource discovery using a modified proximal support vector machines

Published: 09 May 2005

Abstract

With the rapid growth of the World Wide Web, focused crawling has become increasingly important. The goal of focused crawling is to seek out and collect pages that are relevant to a predefined set of topics. Determining the relevance of a page to a specific topic has typically been treated as a classification problem. When training such classifiers, however, it is often difficult to select negative samples, because collecting a set of pages relevant to a specific topic is not by nature a classification process.
In this paper, we propose a novel focused crawling method that uses only positive samples to represent a given topic as a hyperplane, obtained from a modified Proximal Support Vector Machine. The distance from a page to the hyperplane is used to prioritize the order in which pages are visited. We demonstrate the performance of the proposed method on the WebKB data set and on the Web. The promising results suggest that the proposed method is more effective for the focused crawling problem than traditional approaches.
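
The idea of fitting a hyperplane to positive samples only and ranking pages by their distance to it can be illustrated with a minimal sketch. The abstract does not give the modified formulation, so the code below is an assumption: it takes the standard proximal SVM regularized least-squares objective and restricts it to positive labels, which pulls the value `x.w - gamma` toward 1 for on-topic pages. The function names, the dense feature vectors, the parameter `nu`, and the rule that pages closest to the plane are visited first are all hypothetical choices made for illustration, not the authors' method.

```python
import heapq
import math


def solve_linear(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(M[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_positive_plane(A, nu=1.0):
    """Fit (w, gamma) so that x.w - gamma is close to 1 for positive samples.

    Assumed objective (standard proximal SVM with all labels set to +1):
        (nu / 2) * ||A w - e*gamma - e||^2 + (1 / 2) * (w.w + gamma^2)
    """
    E = [list(row) + [-1.0] for row in A]  # E = [A, -e]
    d = len(E[0])
    # Normal equations: (I / nu + E^T E) z = E^T e, with z = [w; gamma]
    M = [[sum(r[i] * r[j] for r in E) + (1.0 / nu if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] for r in E) for i in range(d)]
    z = solve_linear(M, b)
    return z[:-1], z[-1]


def distance_to_plane(x, w, gamma):
    """Euclidean distance from x to the plane x.w - gamma = 1."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0.0:
        raise ValueError("degenerate plane: w is the zero vector")
    return abs(sum(xi * wi for xi, wi in zip(x, w)) - gamma - 1.0) / norm


def crawl_order(candidates, w, gamma):
    """Return candidate URLs ordered by increasing distance to the plane."""
    heap = [(distance_to_plane(vec, w, gamma), url)
            for url, vec in candidates.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

In a crawler, `fit_positive_plane` would be called once on feature vectors of known on-topic pages, and `crawl_order` (or an incrementally maintained heap) would decide which fetched link to follow next. How pages are turned into feature vectors is left open here, since the abstract does not specify it.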

    Published In

    ICCSA'05: Proceedings of the 2005 international conference on Computational Science and its Applications - Volume Part I
    May 2005
    1233 pages
    ISBN: 3540258604
    Editors:
    • Osvaldo Gervasi,
    • Marina L. Gavrilova,
    • Vipin Kumar,
    • Antonio Laganà,
    • Heow Pueh Lee

    Sponsors

    • University of Minnesota - IMA
    • The University of Perugia
    • SIAM: Society for Industrial and Applied Mathematics
    • UOC: University of Calgary
    • The Queen's University of Belfast

    Publisher

    Springer-Verlag

    Berlin, Heidelberg