
Measures of Rule Quality for Feature Selection in Text Categorization

  • Conference paper
Advances in Intelligent Data Analysis V (IDA 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2810)


Abstract

Text Categorization is the process of assigning documents to a set of previously fixed categories. Much current research aims to automate this time-consuming task. Several different algorithms have been applied, and Support Vector Machines have shown very good results. In this paper we propose a new family of measures, taken from the Machine Learning environment, and apply them to the feature reduction task. The experiments are performed on two different corpora (Reuters and Ohsumed). The results show that the new family of measures performs better than the traditional Information Theory measures.
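The abstract does not name the specific measures, so the following is only an illustrative sketch of the general idea of scoring terms with a rule-quality measure. It assumes each term w is read as the hypothetical rule "if w occurs in a document, assign the category", and it uses the Laplace accuracy estimate, a measure familiar from rule-induction systems, as a stand-in. The function names, the toy documents, and the choice of measure are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def laplace_scores(docs, labels, category, num_classes=2):
    """Score each term by the Laplace-corrected accuracy of the
    hypothetical rule 'if term occurs in a document, assign `category`'.

    The Laplace estimate is an assumed stand-in; the abstract does not
    state which rule-quality measures the paper uses.
    """
    covered = Counter()  # number of documents containing the term
    correct = Counter()  # of those, documents labelled `category`
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            covered[term] += 1
            if label == category:
                correct[term] += 1
    return {t: (correct[t] + 1) / (covered[t] + num_classes) for t in covered}


def select_top_terms(scores, k):
    """Keep the k best-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (invented for illustration only).
docs = [
    ["oil", "price", "rise"],
    ["oil", "export"],
    ["wheat", "price"],
    ["wheat", "harvest"],
]
labels = ["crude", "crude", "grain", "grain"]

scores = laplace_scores(docs, labels, "crude")
top = select_top_terms(scores, 2)
print(top)
```

In this toy corpus "oil" appears only in documents of the target category and so ranks first, while "price" is split evenly between the two categories and scores lower. A real experiment would compute such scores per category on the training split and pass only the retained terms to the classifier.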

The research reported in this paper has been supported in part under MCyT and Feder grant TIC2001-3579 and FICYT grant BP01-114.




Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Montañés, E., Fernández, J., Díaz, I., Combarro, E.F., Ranilla, J. (2003). Measures of Rule Quality for Feature Selection in Text Categorization. In: Berthold, M.R., Lenz, H.-J., Bradley, E., Kruse, R., Borgelt, C. (eds) Advances in Intelligent Data Analysis V. IDA 2003. Lecture Notes in Computer Science, vol 2810. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45231-7_54


  • DOI: https://doi.org/10.1007/978-3-540-45231-7_54

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40813-0

  • Online ISBN: 978-3-540-45231-7

  • eBook Packages: Springer Book Archive
