Simple and efficient classification scheme based on specific vocabulary

Savoy, Jacques; Zubaryeva, Olena

doi:10.1007/s10287-012-0149-z

Original Paper
Published: 05 July 2012

Volume 9, pages 401–415, (2012)

Computational Management Science

Abstract

Assuming a binomial distribution for word occurrence, we propose computing a standardized Z score to define the specific vocabulary of a subset compared to that of the entire corpus. This approach is applied to weight the terms (character n-grams, words, stems, lemmas, or sequences of them) that characterize a document. We then show how these Z score values can be used to derive a simple and efficient categorization scheme. To evaluate this proposition and demonstrate its effectiveness, we conduct two experiments. In the first, the system must categorize speeches given by B. Obama as either electoral or presidential. In the second, sentences are extracted from these speeches and then categorized under the same two headings, electoral or presidential. Based on these evaluations (10-fold cross-validation), the proposed classification scheme tends to perform better than a support vector machine model in both experiments. Compared with a Naïve Bayes classifier, it performs better in the first experiment and slightly worse in the second.
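
The abstract does not give the formulas, so the following is only a minimal sketch of the idea under stated assumptions. It assumes the probability of a term is estimated by its relative frequency in the entire corpus, that the Z score of a term in a subset is (observed - expected) / standard deviation under the binomial model, and that a document is assigned to the category whose Z scores sum highest over the document's tokens. The function names `z_scores` and `classify`, the summation rule, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def z_scores(subset_tokens, corpus_tokens):
    """Z score of every corpus term within a subset, under a binomial model.

    Assumption: p(term) is estimated as the term's relative frequency in the
    entire corpus (which is expected to include the subset itself).
    """
    subset_tf = Counter(subset_tokens)
    corpus_tf = Counter(corpus_tokens)
    n_subset = len(subset_tokens)
    n_corpus = len(corpus_tokens)

    scores = {}
    for term, tf_corpus in corpus_tf.items():
        p = tf_corpus / n_corpus
        if p >= 1.0:
            # Degenerate corpus with a single term type: variance is zero.
            continue
        expected = n_subset * p
        std = math.sqrt(n_subset * p * (1.0 - p))
        # Counter returns 0 for terms absent from the subset (under-used terms).
        scores[term] = (subset_tf[term] - expected) / std
    return scores


def classify(doc_tokens, category_tokens):
    """Pick the category whose Z scores sum highest over the document tokens.

    Assumption: summing Z scores is one simple decision rule; terms unknown
    to the corpus contribute nothing.
    """
    corpus = [tok for toks in category_tokens.values() for tok in toks]
    totals = {}
    for label, toks in category_tokens.items():
        z = z_scores(toks, corpus)
        totals[label] = sum(z.get(tok, 0.0) for tok in doc_tokens)
    return max(totals, key=totals.get)


# Toy data (invented for illustration only).
training = {
    "electoral": "change hope vote change vote campaign we we".split(),
    "presidential": "congress budget policy congress law budget we we".split(),
}
label = classify("vote for change".split(), training)
print(label)
```

In this sketch a term used equally often in both categories (here "we") gets a Z score of zero and does not influence the decision, while terms over-used in one category pull the document toward it. Any thresholding of Z scores or normalization by document length would be a further design choice not covered here.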

Author information

Authors and Affiliations

Computer Science Department, University of Neuchatel, Rue Emile Argand 11, 2000, Neuchâtel, Switzerland
Jacques Savoy & Olena Zubaryeva

Corresponding author

Correspondence to Jacques Savoy.

About this article

Cite this article

Savoy, J., Zubaryeva, O. Simple and efficient classification scheme based on specific vocabulary. Comput Manag Sci 9, 401–415 (2012). https://doi.org/10.1007/s10287-012-0149-z

Received: 27 June 2011
Accepted: 18 June 2012
Published: 05 July 2012
Issue Date: August 2012
DOI: https://doi.org/10.1007/s10287-012-0149-z
