Abstract
The problem of classification has been widely studied in the data mining, machine learning, database, and information retrieval communities, with applications in a number of diverse domains such as target marketing, medical diagnosis, newsgroup filtering, and document organization. In this paper we provide a survey of a wide variety of text classification algorithms.
Copyright information
© 2012 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Aggarwal, C.C., Zhai, C. (2012). A Survey of Text Classification Algorithms. In: Aggarwal, C., Zhai, C. (eds) Mining Text Data. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-3223-4_6
DOI: https://doi.org/10.1007/978-1-4614-3223-4_6
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4614-3222-7
Online ISBN: 978-1-4614-3223-4
eBook Packages: Computer Science, Computer Science (R0)