Measures of Rule Quality for Feature Selection in Text Categorization

Montañés, Elena; Fernández, Javier; Díaz, Irene; Combarro, Elías F.; Ranilla, José

doi:10.1007/978-3-540-45231-7_54


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2810)

Included in the following conference series:

International Symposium on Intelligent Data Analysis

Abstract

Text Categorization is the process of assigning documents to a set of previously fixed categories. Much current research aims to automate this time-consuming task. Several different algorithms have been applied, and Support Vector Machines have shown very good results. In this paper we propose a new family of measures, taken from the Machine Learning environment, and apply them to the feature reduction task. The experiments are performed on two different corpora (Reuters and Ohsumed). The results show that the new family of measures performs better than the traditional Information Theory measures.

The research reported in this paper has been supported in part under MCyT and Feder grant TIC2001-3579 and FICYT grant BP01-114.
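
As a rough illustration of the feature reduction task described in the abstract, the sketch below scores each term with information gain and keeps the top-scoring ones. Information gain is used here only as an assumed, commonly cited example of an Information Theory measure; the abstract does not name the measures compared, and this is not the family of rule-quality measures the paper proposes. The function names, the set-of-tokens document representation, and the toy data are all hypothetical.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a split with `pos` and `neg` documents."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(docs, labels, term):
    """Information gain of the presence of `term` for a binary category.

    docs   -- list of sets of tokens (hypothetical representation)
    labels -- list of booleans, True if the document belongs to the category
    """
    n = len(docs)
    if n == 0:
        return 0.0
    with_pos = with_neg = without_pos = without_neg = 0
    for doc, in_category in zip(docs, labels):
        if term in doc:
            if in_category:
                with_pos += 1
            else:
                with_neg += 1
        elif in_category:
            without_pos += 1
        else:
            without_neg += 1
    prior = entropy(with_pos + without_pos, with_neg + without_neg)
    n_with = with_pos + with_neg
    n_without = without_pos + without_neg
    conditional = (n_with / n) * entropy(with_pos, with_neg) \
        + (n_without / n) * entropy(without_pos, without_neg)
    return prior - conditional


def select_features(docs, labels, k):
    """Return the k terms with the highest score (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary,
                    key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Toy data, invented for illustration only.
docs = [{"wheat", "price"}, {"wheat", "export"},
        {"oil", "price"}, {"oil", "export"}]
labels = [True, True, False, False]
selected = select_features(docs, labels, 2)
```

On this toy data the terms "wheat" and "oil" separate the two classes perfectly, while "price" and "export" carry no information about the category, so only the former pair survives the reduction. Swapping a different scoring function into `select_features` is all that would be needed to compare measures under this setup.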

Author information

Authors and Affiliations

Artificial Intelligence Center, University of Oviedo, Spain
Elena Montañés, Javier Fernández, Irene Díaz, Elías F. Combarro & José Ranilla

Editor information

Editors and Affiliations

Berkeley Initiative in Soft Computing (BISC), University of California at Berkeley, USA
Michael R. Berthold
Freie Universität Berlin, Garystr. 21, 14195, Berlin, Germany
Hans-Joachim Lenz
Department of Computer Science, University of Colorado, Boulder, Colorado, USA
Elizabeth Bradley
Otto-von-Guericke-University of Magdeburg, Germany
Rudolf Kruse
Department of Knowledge Processing and Language Engineering, University of Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany
Christian Borgelt

About this paper

Cite this paper

Montañés, E., Fernández, J., Díaz, I., Combarro, E.F., Ranilla, J. (2003). Measures of Rule Quality for Feature Selection in Text Categorization. In: Berthold, M.R., Lenz, H.-J., Bradley, E., Kruse, R., Borgelt, C. (eds) Advances in Intelligent Data Analysis V. IDA 2003. Lecture Notes in Computer Science, vol 2810. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45231-7_54

DOI: https://doi.org/10.1007/978-3-540-45231-7_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40813-0
Online ISBN: 978-3-540-45231-7
eBook Packages: Springer Book Archive
