
A pipeline and comparative study of 12 machine learning models for text classification

Published: 01 September 2022

Abstract

Text-based communication is a highly favoured means of communication, especially in business environments. As a result, it is often abused through malicious messages, e.g., spam emails, that deceive users into revealing personal information, including online account credentials or banking details. For this reason, many machine learning methods for text classification have been proposed and incorporated into the services of most providers. However, optimising text classification algorithms and finding the right trade-off for their aggressiveness remains a major research problem.
We present an updated survey of 12 machine learning text classifiers applied to a public spam corpus. A new pipeline is proposed to optimise hyperparameter selection and improve the models’ performance by applying specific methods (based on natural language processing) in the preprocessing stage.
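The following is a minimal, illustrative sketch (not the authors' exact code) of such a pipeline, assuming scikit-learn and NLTK: stop-word removal and Porter stemming in the preprocessing stage, TF-IDF features of varying size, and a grid search over hyperparameters for one candidate classifier. The toy corpus, parameter grid, and choice of model are placeholders.

```python
# Illustrative sketch only: NLP-based preprocessing, TF-IDF features of
# varying size, and joint hyperparameter search for one candidate classifier.
# The corpus, grids and model choice below are placeholders, not the
# configuration used in the paper.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

nltk.download("stopwords", quiet=True)
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    """Lower-case, keep alphabetic tokens, drop stop words, apply Porter stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

# Toy spam/ham examples (1 = spam, 0 = ham); any labelled corpus could be used.
texts = [
    "Win a free prize, claim your reward today",
    "Meeting agenda attached for Monday morning",
    "Cheap loans approved instantly, click here",
    "Quarterly report figures look good, thanks",
    "Congratulations, you were selected for a bonus",
    "Lunch with the project team at noon",
    "Urgent: verify your account password now",
    "Slides for tomorrow's presentation attached",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
docs = [preprocess(t) for t in texts]

# Feature size (vocabulary cap) and model hyperparameters are tuned together.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
param_grid = {
    "tfidf__max_features": [100, 1000, 5000],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(docs, labels)
print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```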
Our study aims to provide a new methodology for investigating and optimising the effect of different feature sizes and hyperparameters in machine learning classifiers that are widely used in text classification problems. The classifiers are tested and evaluated on different metrics, including F-score (accuracy), precision, recall, and run time. By analysing all these aspects, we show how the proposed pipeline can be used to achieve good accuracy in spam filtering on the Enron dataset, a widely used public corpus. Statistical tests and explainability techniques (SHAP) are applied to provide a robust analysis of the proposed pipeline and to interpret the classification outcomes of the 12 machine learning models, also identifying the words that drive the classification results. Our analysis shows that it is possible to identify an effective machine learning model that classifies the Enron dataset with an F-score of 94%. All data, models, and code used in this work are available on GitHub at https://github.com/Angione-Lab/12-machine-learning-models-for-text-classification.
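As an illustrative sketch of the evaluation and explainability stage (again assuming scikit-learn plus the shap package, with a toy corpus and a logistic regression standing in for any of the 12 models), the snippet below reports precision, recall, F-score and training run time, and uses SHAP values to rank the words that most influence the spam/ham decision:

```python
# Illustrative sketch only: timing, precision/recall/F-score, and SHAP-based
# ranking of influential words for one stand-in model (logistic regression).
# The corpus is a placeholder and the shap package is assumed to be installed.
import time

import numpy as np
import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

texts = [
    "win a free prize claim your reward", "meeting agenda for monday",
    "cheap loans approved instantly", "quarterly report attached thanks",
    "urgent verify your account password", "lunch with the team at noon",
    "you were selected for a cash bonus", "slides for tomorrow attached",
]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # 1 = spam, 0 = ham

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0)

model = LogisticRegression(max_iter=1000)
start = time.perf_counter()
model.fit(X_train, y_train)
train_time = time.perf_counter() - start

y_pred = model.predict(X_test)
print(f"precision={precision_score(y_test, y_pred, zero_division=0):.2f} "
      f"recall={recall_score(y_test, y_pred, zero_division=0):.2f} "
      f"F1={f1_score(y_test, y_pred, zero_division=0):.2f} "
      f"train time={train_time:.4f}s")

# SHAP values for a linear model: the mean absolute value per feature gives a
# ranking of the words that most influence the spam/ham decision.
explainer = shap.LinearExplainer(model, X_train)
shap_values = np.asarray(explainer.shap_values(X_test))
importance = np.abs(shap_values).reshape(-1, X.shape[1]).mean(axis=0)
words = np.array(vectorizer.get_feature_names_out())
top = importance.argsort()[::-1][:5]
print("Most influential words:", list(words[top]))
```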

    Published In

    Expert Systems with Applications: An International Journal, Volume 201, Issue C
    Sep 2022
    1333 pages

    Publisher

    Pergamon Press, Inc.

    United States

    Publication History

    Published: 01 September 2022

    Author Tags

    1. Text classification
    2. Machine learning
    3. Spam classification
    4. Text classifiers
    5. Model explainability

    Qualifiers

    • Research-article

    Cited By

    • (2025) XAIRF-WFP: a novel XAI-based random forest classifier for advanced email spam detection. International Journal of Information Security 24(1). DOI: 10.1007/s10207-024-00920-1. Online publication date: 1-Feb-2025.
    • (2024) A Machine Learning-Based Framework for Accurate and Early Diagnosis of Liver Diseases. International Journal of Intelligent Systems 2024. DOI: 10.1155/2024/6111312. Online publication date: 1-Jan-2024.
    • (2024) Elastic deep autoencoder for text embedding clustering by an improved graph regularization. Expert Systems with Applications: An International Journal 238(PA). DOI: 10.1016/j.eswa.2023.121780. Online publication date: 15-Mar-2024.
    • (2024) User Story Classification with Machine Learning and LLMs. Knowledge Science, Engineering and Management, pp. 161-175. DOI: 10.1007/978-981-97-5492-2_13. Online publication date: 16-Aug-2024.
    • (2023) Maximizing total yield in safety hazard monitoring of online reviews. Expert Systems with Applications: An International Journal 229(PA). DOI: 10.1016/j.eswa.2023.120540. Online publication date: 13-Jul-2023.
    • (2023) Detecting and mitigating DDoS attacks with moving target defense approach based on automated flow classification in SDN networks. Computers and Security 134(C). DOI: 10.1016/j.cose.2023.103462. Online publication date: 1-Nov-2023.
    • (2022) Integrate deep learning and physically-based models for multi-step-ahead microclimate forecasting. Expert Systems with Applications: An International Journal 210(C). DOI: 10.1016/j.eswa.2022.118481. Online publication date: 30-Dec-2022.
    • (2022) Introducing attentive neural networks into unconventional oil and gas violation analysis and emergency response system. Expert Systems with Applications: An International Journal 210(C). DOI: 10.1016/j.eswa.2022.118352. Online publication date: 30-Dec-2022.
