Abstract
In this paper, we study the theoretical properties of the class feature centroid (CFC) classifier by considering the rate of change of each prototype vector with respect to individual dimensions (terms). We show that CFC is inherently biased toward the larger (dominant majority) classes, which invariably leads to poor performance on class-imbalanced data. CFC also aggressively prunes terms that appear across all classes, discarding some non-exclusive but useful terms. To overcome these limitations while retaining CFC's intrinsic and worthy design goals, we propose an improved centroid-based classifier that uses precise term-class distribution properties instead of the mere presence or absence of terms in classes. Specifically, terms are weighted based on the Kullback–Leibler (KL) divergence between pairs of class-conditional term probabilities; we call this the CFC–KL centroid classifier. We then generalize CFC–KL to handle multi-class data by replacing the KL measure with the multi-class Jensen–Shannon (JS) divergence; we call the resulting classifier CFC–JS. Our proposed supervised term weighting schemes have been evaluated on five datasets; the KL and JS weighted classifiers consistently outperformed baseline CFC and unweighted support vector machines (SVM). We also devise a word cloud visualization approach to highlight the important class-specific words picked out by our KL and JS term weighting schemes, which were otherwise obscured by unsupervised term weighting. The experimental and visualization results show that KL and JS term weighting not only notably improve centroid-based classifiers but also benefit SVM classifiers.
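To make the weighting idea concrete, the following is a minimal sketch of divergence-based term weights, not the paper's exact formulation, which the abstract does not spell out. It assumes each class-conditional term probability is estimated from document frequencies with Laplace smoothing and is treated as a Bernoulli distribution (term present or absent). The function names, the smoothing constant `alpha`, and the uniform class priors are illustrative assumptions.

```python
import math


def class_conditional_prob(docs_with_term, docs_in_class, alpha=1.0):
    """Laplace-smoothed estimate of P(term | class); always strictly in (0, 1).

    Assumption: probabilities come from document frequencies.
    """
    return (docs_with_term + alpha) / (docs_in_class + 2 * alpha)


def bernoulli_entropy(p):
    """Shannon entropy (bits) of a Bernoulli(p) variable, 0 < p < 1."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in bits, for the two-class case."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))


def js_divergence(probs, priors=None):
    """Generalized Jensen-Shannon divergence among Bernoulli(p_i).

    JS = H(sum_i pi_i * P_i) - sum_i pi_i * H(P_i).
    Assumption: uniform priors pi_i when none are given.
    """
    if priors is None:
        priors = [1.0 / len(probs)] * len(probs)
    mixture = sum(w * p for w, p in zip(priors, probs))
    return bernoulli_entropy(mixture) - sum(
        w * bernoulli_entropy(p) for w, p in zip(priors, probs)
    )


# Hypothetical counts: a term seen in 90/100 documents of class A and
# 10/100 of class B, versus a term seen in 50/100 documents of both.
p_a = class_conditional_prob(90, 100)
p_b = class_conditional_prob(10, 100)
p_even = class_conditional_prob(50, 100)

discriminative_kl = bernoulli_kl(p_a, p_b)
uniform_kl = bernoulli_kl(p_even, p_even)
discriminative_js = js_divergence([p_a, p_b])
uniform_js = js_divergence([p_even, p_even])
```

Under these assumptions, a term spread evenly across classes receives a weight of zero, while a term concentrated in one class receives a positive weight. This is a graded alternative to pruning every term that appears in all classes.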
Notes
In \(tf \times rf\), uniformly distributed terms, that is, those that appear equally in both classes, are assigned a constant weight of \(1.58 tf\).
Generated by Wordle (http://www.wordle.net).
We use the LibSVM library (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) with linear kernels and default parameters.
Acknowledgments
This research was supported in part by the Singapore Ministry of Education’s Academic Research Fund Tier 2 grant ARC 9/12 (MOE2011-T2-2-056).
Cite this article
Nguyen, T.T., Chang, K. & Hui, S.C. Supervised term weighting centroid-based classifiers for text categorization. Knowl Inf Syst 35, 61–85 (2013). https://doi.org/10.1007/s10115-012-0559-9
DOI: https://doi.org/10.1007/s10115-012-0559-9