Simple and efficient classification scheme based on specific vocabulary

  • Original Paper
Computational Management Science

Abstract

Assuming a binomial distribution for word occurrence, we propose computing a standardized Z score to define the specific vocabulary of a subset compared to that of the entire corpus. This approach is applied to weight the terms (character n-grams, words, stems, lemmas, or sequences of them) that characterize a document. We then show how these Z score values can be used to derive a simple and efficient categorization scheme. To evaluate this proposition and demonstrate its effectiveness, we conduct two experiments. In the first, the system must categorize speeches given by B. Obama as either electoral or presidential. In the second, sentences are extracted from these speeches and then categorized under the headings electoral or presidential. Based on these evaluations (10-fold cross-validation), the proposed classification scheme tends to perform better than a support vector machine model in both experiments. Compared with a Naïve Bayes classifier, it performs better on the first test and slightly worse on the second.
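The abstract does not give the formula, so the following is only a minimal sketch of one common way to standardize a binomial count. It assumes that a term's probability p is estimated from the whole corpus, and that its Z score in a subset of n' tokens is (a - n'p) / sqrt(n'p(1 - p)), where a is the term's count in the subset. The function names `z_scores` and `classify`, and the decision rule that sums the Z scores of a document's terms per category, are hypothetical illustrations and are not taken from the paper.

```python
import math
from collections import Counter


def z_scores(subset_tokens, rest_tokens):
    """Z score of each corpus term for the subset, under a binomial model.

    Assumption: p is the term's relative frequency in the whole corpus
    (subset + rest), and the subset count is standardized with the
    binomial mean n_sub * p and variance n_sub * p * (1 - p).
    """
    subset_counts = Counter(subset_tokens)
    corpus_counts = subset_counts + Counter(rest_tokens)
    n = sum(corpus_counts.values())   # size of the whole corpus
    n_sub = len(subset_tokens)        # size of the subset
    scores = {}
    for term, total in corpus_counts.items():
        p = total / n
        variance = n_sub * p * (1 - p)
        if variance == 0:
            continue  # degenerate case: empty subset or a single-term corpus
        expected = n_sub * p
        scores[term] = (subset_counts[term] - expected) / math.sqrt(variance)
    return scores


def classify(doc_tokens, scores_by_category):
    """Hypothetical rule: pick the category with the largest summed Z score."""
    totals = {
        category: sum(scores.get(token, 0.0) for token in doc_tokens)
        for category, scores in scores_by_category.items()
    }
    return max(totals, key=totals.get)


# Toy data, invented for illustration only.
electoral = "vote change hope vote campaign".split()
presidential = "congress budget policy congress law".split()

scores_by_category = {
    "electoral": z_scores(electoral, presidential),
    "presidential": z_scores(presidential, electoral),
}
label = classify(["vote", "hope"], scores_by_category)
```

A positive Z score marks a term as over-used in the subset relative to the corpus, and a negative one as under-used. Terms with a large absolute Z score are candidates for the subset's specific vocabulary.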

Author information

Corresponding author

Correspondence to Jacques Savoy.

About this article

Cite this article

Savoy, J., Zubaryeva, O. Simple and efficient classification scheme based on specific vocabulary. Comput Manag Sci 9, 401–415 (2012). https://doi.org/10.1007/s10287-012-0149-z

  • DOI: https://doi.org/10.1007/s10287-012-0149-z
