
Live and learn from mistakes: A lightweight system for document classification

Published: 01 January 2013

Abstract

We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, applicable in scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. 3LM is a competitive learning algorithm that avoids the over-smoothing characteristic of centroid-based classifiers by using a different class representative, which we call a clusterhead. The clusterheads, competing for vector-space dominance, are drawn toward misclassified documents, eventually bringing the model to a "balanced state" for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by "leashing" the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyperplane from all other classes. Lifelong learning with a fixed learning rate allows 3LM to adapt to a possibly changing data distribution and to continually learn and unlearn document classes. We report on experiments demonstrating high classification accuracy on the Reuters-21578, OHSUMED, and TREC07p spam datasets. 3LM showed no over-fitting, while consistently outperforming centroid-based, Naive Bayes, C4.5, AdaBoost, kNN, and SVM classifiers whose accuracy had been reported on the same three corpora.
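The dynamics sketched in the abstract — one fixed centroid and one movable clusterhead per class, with the clusterhead pulled toward misclassified documents and "leashed" back toward its centroid — can be illustrated with a toy implementation. This is a minimal sketch reconstructed from the abstract's description only; the class name, the `eta` and `leash` parameters, and the exact update rule are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

class ThreeLM:
    """Toy sketch of a 3LM-style online classifier.

    The update rule below is an assumption inferred from the abstract,
    not the authors' published algorithm.
    """

    def __init__(self, centroids, eta=0.1, leash=0.5):
        # One fixed centroid and one movable clusterhead per class,
        # both kept on the unit sphere (cosine-similarity geometry).
        self.centroids = {c: v / np.linalg.norm(v) for c, v in centroids.items()}
        self.heads = {c: v.copy() for c, v in self.centroids.items()}
        self.eta = eta      # fixed learning rate: enables lifelong adaptation
        self.leash = leash  # pull back toward the centroid to limit over-fitting

    def predict(self, x):
        x = x / np.linalg.norm(x)
        # The clusterhead nearest in cosine similarity wins the competition.
        return max(self.heads, key=lambda c: float(self.heads[c] @ x))

    def learn(self, x, label):
        """One online step with negative feedback: update only on a mistake."""
        x = x / np.linalg.norm(x)
        pred = self.predict(x)
        if pred != label:
            h = self.heads[label]
            # Draw the correct class's clusterhead toward the misclassified
            # document...
            h += self.eta * (x - h)
            # ...then "leash" it back toward its fixed class centroid.
            h += self.leash * self.eta * (self.centroids[label] - h)
            self.heads[label] = h / np.linalg.norm(h)
        return pred

# Illustrative usage with two hand-made 2-D "document" classes.
clf = ThreeLM({"spam": np.array([1.0, 0.0]), "ham": np.array([0.0, 1.0])})
clf.learn(np.array([0.6, 0.8]), "spam")  # a mistake moves the spam clusterhead
```

With a fixed learning rate the clusterheads never freeze, which is what lets the model track a drifting document distribution; the leash term is what keeps a head from chasing outliers arbitrarily far from its class centroid.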

Published In

Information Processing and Management: an International Journal, Volume 49, Issue 1 (January 2013), 405 pages.
Publisher: Pergamon Press, Inc., United States.

Author Tags: 3LM, Centroid, Classifier, Clusterhead, Lifelong, Online
