Abstract
Assuming a binomial distribution for word occurrence, we propose computing a standardized Z score to define the specific vocabulary of a subset compared to that of the entire corpus. This approach is applied to weight the terms (character n-grams, words, stems, lemmas, or sequences of them) that characterize a document. We then show how these Z score values can be used to derive a simple and efficient categorization scheme. To evaluate this proposition and demonstrate its effectiveness, we conduct two experiments. In the first, the system must categorize speeches given by B. Obama as either electoral or presidential. In the second, sentences are extracted from these speeches and then categorized as electoral or presidential. Based on these evaluations, the proposed classification scheme tends to perform better than a support vector machine model in both experiments. Compared with a Naïve Bayes classifier, it performs better in the first experiment and slightly worse in the second (10-fold cross-validation).
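The following is a minimal sketch of the idea, not the authors' implementation. It assumes the usual standardized form of the score under a binomial model: a term with corpus-wide probability p is expected to occur n·p times in a subset of n tokens, with variance n·p·(1 − p), so Z = (observed − n·p) / sqrt(n·p·(1 − p)). The decision rule (sum the Z scores of a document's terms per category and pick the largest total) is an assumption made for illustration, since the abstract does not spell out the rule. The function names `z_scores` and `classify` and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def z_scores(subset_tokens, corpus_tokens):
    """Z score of each corpus term in a subset, relative to the whole corpus.

    Assumption: the corpus includes the subset, and term occurrence
    follows a binomial distribution with p estimated on the whole corpus.
    """
    corpus_tf = Counter(corpus_tokens)
    subset_tf = Counter(subset_tokens)
    n_corpus = len(corpus_tokens)
    n_sub = len(subset_tokens)

    scores = {}
    for term, tf in corpus_tf.items():
        p = tf / n_corpus                      # estimated occurrence probability
        sd = math.sqrt(n_sub * p * (1.0 - p))  # binomial standard deviation
        if sd == 0.0:
            continue                           # degenerate case: no variance
        scores[term] = (subset_tf[term] - n_sub * p) / sd
    return scores


def classify(doc_tokens, z_by_category):
    """Illustrative rule (an assumption): pick the category whose
    summed Z scores over the document's tokens is largest.
    Terms unseen in the corpus contribute 0."""
    totals = {
        category: sum(z.get(token, 0.0) for token in doc_tokens)
        for category, z in z_by_category.items()
    }
    return max(totals, key=totals.get)


# Toy data, invented for this example only.
electoral = "change hope vote change campaign vote".split()
presidential = "congress budget policy congress law budget".split()
corpus = electoral + presidential

z_by_category = {
    "electoral": z_scores(electoral, corpus),
    "presidential": z_scores(presidential, corpus),
}

label = classify("vote for change".split(), z_by_category)
print(label)
```

A positive Z score marks a term as over-used in the subset relative to the corpus, and a negative one marks it as under-used. In practice the terms could be any of the units the abstract lists (character n-grams, words, stems, lemmas), and a threshold on |Z| could be used to keep only the most specific terms.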
Cite this article
Savoy, J., Zubaryeva, O. Simple and efficient classification scheme based on specific vocabulary. Comput Manag Sci 9, 401–415 (2012). https://doi.org/10.1007/s10287-012-0149-z
DOI: https://doi.org/10.1007/s10287-012-0149-z