
A pipeline and comparative study of 12 machine learning models for text classification

Published: 01 September 2022

Abstract

Text-based communication is a highly favoured means of communication, especially in business environments. As a result, it is often abused through malicious messages, e.g., spam emails, that deceive users into revealing personal information, including online account credentials or banking details. For this reason, many machine learning methods for text classification have been proposed and incorporated into the services of most providers. However, optimising text classification algorithms and finding the right trade-off for their aggressiveness remains a major research problem.
We present an updated survey of 12 machine learning text classifiers applied to a public spam corpus. A new pipeline is proposed to optimise hyperparameter selection and improve the models’ performance by applying specific methods (based on natural language processing) in the preprocessing stage.
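The following is a minimal, illustrative sketch (not the authors' exact code) of such a pipeline, assuming scikit-learn and NLTK: stop-word removal and Porter stemming in the preprocessing stage, TF-IDF features of varying size, and a grid search over hyperparameters for one candidate classifier. The toy corpus, parameter grid, and choice of model are placeholders.

```python
# Illustrative sketch only: NLP-based preprocessing, TF-IDF features of
# varying size, and joint hyperparameter search for one candidate classifier.
# The corpus, grids and model choice below are placeholders, not the
# configuration used in the paper.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

nltk.download("stopwords", quiet=True)
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    """Lower-case, keep alphabetic tokens, drop stop words, apply Porter stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

# Toy spam/ham examples (1 = spam, 0 = ham); any labelled corpus could be used.
texts = [
    "Win a free prize, claim your reward today",
    "Meeting agenda attached for Monday morning",
    "Cheap loans approved instantly, click here",
    "Quarterly report figures look good, thanks",
    "Congratulations, you were selected for a bonus",
    "Lunch with the project team at noon",
    "Urgent: verify your account password now",
    "Slides for tomorrow's presentation attached",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
docs = [preprocess(t) for t in texts]

# Feature size (vocabulary cap) and model hyperparameters are tuned together.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
param_grid = {
    "tfidf__max_features": [100, 1000, 5000],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(docs, labels)
print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```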
Our study aims to provide a new methodology for investigating and optimising the effect of different feature sizes and hyperparameters in machine learning classifiers that are widely used in text classification problems. The classifiers are tested and evaluated on different metrics, including F-score (accuracy), precision, recall, and run time. By analysing all these aspects, we show how the proposed pipeline can be used to achieve good accuracy in spam filtering on the Enron dataset, a widely used public corpus. Statistical tests and explainability techniques (SHAP) are applied to provide a robust analysis of the proposed pipeline and to interpret the classification outcomes of the 12 machine learning models, also identifying the words that drive the classification results. Our analysis shows that it is possible to identify an effective machine learning model that classifies the Enron dataset with an F-score of 94%. All data, models, and code used in this work are available on GitHub at https://github.com/Angione-Lab/12-machine-learning-models-for-text-classification.
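As an illustrative sketch of the evaluation and explainability stage (again assuming scikit-learn plus the shap package, with a toy corpus and a logistic regression standing in for any of the 12 models), the snippet below reports precision, recall, F-score and training run time, and uses SHAP values to rank the words that most influence the spam/ham decision:

```python
# Illustrative sketch only: timing, precision/recall/F-score, and SHAP-based
# ranking of influential words for one stand-in model (logistic regression).
# The corpus is a placeholder and the shap package is assumed to be installed.
import time

import numpy as np
import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

texts = [
    "win a free prize claim your reward", "meeting agenda for monday",
    "cheap loans approved instantly", "quarterly report attached thanks",
    "urgent verify your account password", "lunch with the team at noon",
    "you were selected for a cash bonus", "slides for tomorrow attached",
]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # 1 = spam, 0 = ham

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0)

model = LogisticRegression(max_iter=1000)
start = time.perf_counter()
model.fit(X_train, y_train)
train_time = time.perf_counter() - start

y_pred = model.predict(X_test)
print(f"precision={precision_score(y_test, y_pred, zero_division=0):.2f} "
      f"recall={recall_score(y_test, y_pred, zero_division=0):.2f} "
      f"F1={f1_score(y_test, y_pred, zero_division=0):.2f} "
      f"train time={train_time:.4f}s")

# SHAP values for a linear model: the mean absolute value per feature gives a
# ranking of the words that most influence the spam/ham decision.
explainer = shap.LinearExplainer(model, X_train)
shap_values = np.asarray(explainer.shap_values(X_test))
importance = np.abs(shap_values).reshape(-1, X.shape[1]).mean(axis=0)
words = np.array(vectorizer.get_feature_names_out())
top = importance.argsort()[::-1][:5]
print("Most influential words:", list(words[top]))
```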

    Published In

    Expert Systems with Applications: An International Journal, Volume 201, Issue C
    Sep 2022
    1333 pages

    Publisher

    Pergamon Press, Inc.

    United States

    Publication History

    Published: 01 September 2022

    Author Tags

    1. Text classification
    2. Machine learning
    3. Spam classification
    4. Text classifiers
    5. Model explainability

    Qualifiers

    • Research-article

    Cited By

    • (2025) XAIRF-WFP: a novel XAI-based random forest classifier for advanced email spam detection. International Journal of Information Security 24(1). DOI: 10.1007/s10207-024-00920-1. Online publication date: 1-Feb-2025.
    • (2024) A Machine Learning-Based Framework for Accurate and Early Diagnosis of Liver Diseases. International Journal of Intelligent Systems 2024. DOI: 10.1155/2024/6111312. Online publication date: 1-Jan-2024.
    • (2024) Elastic deep autoencoder for text embedding clustering by an improved graph regularization. Expert Systems with Applications: An International Journal 238(PA). DOI: 10.1016/j.eswa.2023.121780. Online publication date: 15-Mar-2024.
    • (2024) User Story Classification with Machine Learning and LLMs. Knowledge Science, Engineering and Management, pp. 161-175. DOI: 10.1007/978-981-97-5492-2_13. Online publication date: 16-Aug-2024.
    • (2023) Maximizing total yield in safety hazard monitoring of online reviews. Expert Systems with Applications: An International Journal 229(PA). DOI: 10.1016/j.eswa.2023.120540. Online publication date: 13-Jul-2023.
    • (2023) Detecting and mitigating DDoS attacks with moving target defense approach based on automated flow classification in SDN networks. Computers and Security 134(C). DOI: 10.1016/j.cose.2023.103462. Online publication date: 1-Nov-2023.
    • (2022) Integrate deep learning and physically-based models for multi-step-ahead microclimate forecasting. Expert Systems with Applications: An International Journal 210(C). DOI: 10.1016/j.eswa.2022.118481. Online publication date: 30-Dec-2022.
    • (2022) Introducing attentive neural networks into unconventional oil and gas violation analysis and emergency response system. Expert Systems with Applications: An International Journal 210(C). DOI: 10.1016/j.eswa.2022.118352. Online publication date: 30-Dec-2022.
