Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification

Authors:

George Karypis,

Vipin Kumar

PAKDD '01: Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining

Pages 53 - 65

Published: 16 April 2001

Abstract

Text categorization presents unique challenges: the data sets have a large number of attributes and a large number of training samples, attributes are interdependent, and categories are multi-modal. Existing classification techniques have limited applicability to data sets of this nature. In this paper, we present Weight Adjusted k-Nearest Neighbor (WAKNN) classification, which learns feature weights using a greedy hill-climbing technique. We also present two performance optimizations of WAKNN that improve its computational performance by a few orders of magnitude without compromising classification quality. We experimentally evaluated WAKNN on 52 document data sets from a variety of domains and compared its performance against several classification algorithms, such as C4.5, RIPPER, Naive-Bayesian, PEBLS, and VSM. Experimental results on these data sets confirm that WAKNN consistently outperforms the other existing classification algorithms.
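The idea of learning feature weights for k-nearest-neighbor classification by greedy hill climbing can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the similarity measure, the objective function, or the weight-update schedule, so the sketch assumes a weighted cosine similarity, leave-one-out accuracy as the objective, and a fixed set of multiplicative factors. All function names and parameters (weighted_cosine, knn_predict, loo_accuracy, hill_climb_weights, factors, max_rounds) are hypothetical, and the two performance optimizations mentioned in the abstract are not modeled.

```python
import math
from collections import defaultdict


def weighted_cosine(x, y, w):
    """Cosine similarity after scaling every feature by its weight."""
    xs = [wi * xi for wi, xi in zip(w, x)]
    ys = [wi * yi for wi, yi in zip(w, y)]
    dot = sum(a * b for a, b in zip(xs, ys))
    norm = math.sqrt(sum(a * a for a in xs)) * math.sqrt(sum(b * b for b in ys))
    return dot / norm if norm > 0 else 0.0


def knn_predict(query, train, labels, w, k, skip=None):
    """Similarity-weighted vote among the k most similar training documents.

    `skip` excludes one training index, which enables leave-one-out scoring.
    """
    sims = [(weighted_cosine(query, x, w), labels[i])
            for i, x in enumerate(train) if i != skip]
    sims.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for sim, label in sims[:k]:
        votes[label] += sim
    # Sorting the labels first makes tie-breaking deterministic.
    return max(sorted(votes), key=votes.get)


def loo_accuracy(train, labels, w, k):
    """Assumed objective: leave-one-out accuracy on the training set."""
    hits = sum(knn_predict(x, train, labels, w, k, skip=i) == labels[i]
               for i, x in enumerate(train))
    return hits / len(train)


def hill_climb_weights(train, labels, k=1, factors=(0.5, 2.0), max_rounds=10):
    """Greedy hill climbing over feature weights.

    Each round tries scaling every single weight by every factor and keeps
    the one change that improves the objective the most. The search stops
    when no single change improves it.
    """
    w = [1.0] * len(train[0])
    best = loo_accuracy(train, labels, w, k)
    for _ in range(max_rounds):
        best_move = None
        for j in range(len(w)):
            for f in factors:
                cand = w[:]
                cand[j] *= f
                score = loo_accuracy(train, labels, cand, k)
                if score > best:
                    best, best_move = score, cand
        if best_move is None:
            break
        w = best_move
    return w, best


# Toy data: feature 0 is a noisy "common term" count, while features 1 and 2
# each indicate one category.
docs = [[3, 1, 0], [1, 1, 0], [3, 0, 1], [1, 0, 1]]
cats = ["a", "a", "b", "b"]
weights, accuracy = hill_climb_weights(docs, cats, k=1)
print(weights, accuracy)
```

On this toy data the search lowers the weight of the noisy feature relative to the discriminative ones. Evaluating every candidate change with a full leave-one-out pass is expensive, which gives some intuition for why performance optimizations matter for this kind of method.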





Published In

PAKDD '01: Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining

April 2001, 592 pages

ISBN: 3540419101

Publisher: Springer-Verlag, Berlin, Heidelberg
