DOI: 10.1145/355214.355216

Web page classification based on k-nearest neighbor approach

Published: 01 November 2000

    Abstract

    Automatic categorization is the only viable method to deal with the scaling problem of the World Wide Web. In this paper, we propose a Web page classifier based on an adaptation of the k-Nearest Neighbor (k-NN) approach. To improve the performance of the k-NN approach, we supplement it with a feature selection method and a term-weighting scheme that uses markup tags, and we reformulate the document-document similarity measure used in the vector space model. In our experiments on a Korean commercial Web directory, the proposed methods improved the classification performance of the k-NN approach.
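
As a rough illustration of the ingredients named in the abstract, the following minimal sketch combines tag-based term weighting with a k-NN vote over vector-space similarities. It is not the authors' method: the tag weights in TAG_WEIGHTS, the function names, and the toy pages are all hypothetical, plain cosine similarity stands in for the paper's reformulated similarity measure, and feature selection is omitted.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tag weights (an assumption, not taken from the paper):
# a term inside <title> counts more than the same term in body text.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}


def weighted_vector(page):
    """Build a term -> weight vector from a page given as {tag: [terms]}."""
    vec = defaultdict(float)
    for tag, terms in page.items():
        weight = TAG_WEIGHTS.get(tag, 1.0)  # unknown tags default to 1.0
        for term in terms:
            vec[term] += weight
    return dict(vec)


def cosine(a, b):
    """Standard cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    """Return the label with the largest summed similarity among the
    k training vectors nearest to the query.

    training is a non-empty list of (vector, label) pairs.
    """
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim  # similarity-weighted vote
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
training = [
    (weighted_vector({"title": ["football"], "body": ["league", "goal"]}), "sports"),
    (weighted_vector({"title": ["basketball"], "body": ["league", "score"]}), "sports"),
    (weighted_vector({"title": ["stock"], "body": ["market", "price"]}), "finance"),
    (weighted_vector({"title": ["bank"], "body": ["loan", "price"]}), "finance"),
]
query = weighted_vector({"title": ["league"], "body": ["goal", "score"]})
print(knn_classify(query, training, k=3))  # prints: sports
```

Summing similarities rather than counting neighbors is one common way to break ties in k-NN text categorization; whether the paper votes this way is not stated in the abstract.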

    Cited By

    • (2023) Analyzing the likeness of a person based on DNS logs using machine learning. 2023 International Conference on Signal Processing, Computation, Electronics, Power and Telecommunication (IConSCEPT), pp. 1-6. DOI: 10.1109/IConSCEPT57958.2023.10170228. Online publication date: 25-May-2023
    • (2023) Comparative Analysis of Various Ensemble Approaches for Web Page Classification. Data Engineering and Data Science, pp. 137-172. DOI: 10.1002/9781119841999.ch6. Online publication date: 5-Sep-2023
    • (2022) ProS: data series progressive k-NN similarity search and classification with probabilistic quality guarantees. The VLDB Journal, 32:4, pp. 763-789. DOI: 10.1007/s00778-022-00771-z. Online publication date: 30-Nov-2022

    Published In

    IRAL '00: Proceedings of the fifth international workshop on Information retrieval with Asian languages
    November 2000
    220 pages
    ISBN: 1581133006
    DOI: 10.1145/355214
    Chairmen: Kam-Fai Wong, Dik L. Lee, Jong-Hyeok Lee
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Web page classification
    2. feature selection
    3. k-nearest neighbor approach
    4. similarity measure
    5. term weighting scheme
    6. text categorization

    Qualifiers

    • Article

    Conference

    IRAL00

    Cited By

    • (2022) Web Page Classification Based on an Accurate Technique for Key Data Extraction. Advanced Intelligent Systems for Sustainable Development (AI2SD’2020), pp. 1124-1131. DOI: 10.1007/978-3-030-90639-9_91. Online publication date: 10-Feb-2022
    • (2021) Leveraging Data Mining Techniques in Developing an Integrated Framework to Predict the Health Predictions. International Journal of Research in Medical Sciences & Technology, 12:1. DOI: 10.37648/ijrmst.v11i02.007. Online publication date: 31-Dec-2021
    • (2021) Ensemble approach for web page classification. Multimedia Tools and Applications. DOI: 10.1007/s11042-021-10891-3. Online publication date: 15-Apr-2021
    • (2020) Comparison of Gradient Boosting and Extreme Boosting Ensemble Methods for Webpage Classification. 2020 Fifth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pp. 77-82. DOI: 10.1109/ICRCICN50933.2020.9296176. Online publication date: 26-Nov-2020
    • (2020) Naive Website Categorization Based on Text Coverage. Advanced Technologies, Systems, and Applications V, pp. 435-448. DOI: 10.1007/978-3-030-54765-3_30. Online publication date: 5-Nov-2020
    • (2019) A new architecture for improving focused crawling using deep neural network. Journal of Intelligent & Fuzzy Systems, pp. 1-13. DOI: 10.3233/JIFS-182683. Online publication date: 15-Jun-2019
    • (2018) A Review of Machine Learning Algorithms for Web Page Classification. 2018 IEEE 5th International Congress on Information Science and Technology (CiSt), pp. 220-226. DOI: 10.1109/CIST.2018.8596420. Online publication date: Oct-2018
