(IJCST-V11I2P16) :shikha, Jatinder Singh Saini
ABSTRACT
Email users often face a large number of spam emails arriving in their mailboxes every day from unfamiliar senders. Spamming also drives online fraud based on social engineering. Most of these frauds start with an email from an unauthenticated origin that contains a URL and compromises the recipient's personal data once it is opened. Email spam can be detected in several stages: pre-processing the data, extracting the attributes, and classifying the emails. Researchers have developed several ML (Machine Learning) algorithms to detect email spam. This paper reviews the diverse methods used to detect email spam.
Keywords: - Email Spam, Machine Learning, Supervised learning
mining [3]. The main objective of this stage is to exclude terms from the email structure that are not crucial for classification, such as conjunctions and articles. Typical pre-processing activities are keyword identification, tokenization, stop-word removal, stemming, and spell checking. To reduce the amount of data after pre-processing, a subset of features is chosen during feature selection. This strategy minimizes a certain cost function. Feature selection, as opposed to feature extraction, does not change the data and is used to clean the data before a classifier model is trained. This procedure is also known as variable selection, feature reduction, or variable subset selection [4].

Some of the most useful features for email spam identification are the mail body and subject, word count, word size, circadian rhythms, the recipient's age, gender, and country, whether the recipient responded to the message, mature content, and a bag of words from the mail content. Spam emails frequently include many semantic anomalies. The last step of the framework for categorizing email by domain trains classification algorithms on the features selected in the previous stage. The learned models are used to classify new email documents (test data) into one of the predefined categories, such as Health, Education, Money, Adult, or Computing. Several methods for classifying emails into different domains are compared in the experimental section [5].

The Bayes classifier, often known as naive Bayes, is one of the most frequently used statistical spam classifiers. It is called "naive" because it ignores any dependencies or correlations among the inputs and breaks a multivariate problem down into a series of univariate problems. Spam emails can be categorized using this technique. These classifiers operate mainly on probabilities: if specific terms are regularly found in spam but not in ham, an incoming email containing those terms is most likely spam. This classification approach has become very common in mail filtering software [6]. A Bayesian filter must be trained well to be effective. In its database, every word has a probability of occurring in spam or in legitimate email.

A decision tree resembles a finite tree whose branches represent tests and whose leaves represent categories. The tests are frequently Boolean formulas on the term weights in the document. A document is categorised by starting at the root of the tree and following the branches whose conditions hold until a leaf is reached; the document is then placed in the category that annotates the leaf. The tree is learned using a number of techniques, including ID3, C4.5, and C5 [7].

The k-nearest neighbour (K-NN) classifier is an example-based classifier. In other words, it compares a document with the training documents rather than building an explicit description of the categories. There is usually no training phase with this method. To classify a new document, the k most similar training documents are retrieved, and the new document is assigned to the class to which the majority of them belong. The search for the closest neighbours can be sped up with traditional indexing techniques. To decide whether a message is spam or ham, the classes of the messages closest to it are taken into consideration. The vector comparison can be performed in real time.

1.1 Email Semantic Features Extraction

This stage involves extracting the semantic features from the email text. Email semantics refers to a set of hidden concepts that characterize an email's content. The ultimate goal is to create a semantic representation that makes spam identification highly accurate. An effective method for automatically extracting semantic information in this situation is CN2-SD [8]. The classification rule learner CN2 and Subgroup Discovery (SD) are the two techniques most often employed for semantic feature extraction. CN2 induces classification rules that predict the class labels, while SD inspects the training data for interesting patterns. Subgroup discovery differs from classification in that subgroup discovery is a descriptive task, whereas classification is a predictive one. These two algorithms are described as follows:

• Subgroup discovery algorithm: The descriptive induction performed by the subgroup discovery algorithm makes it possible to look for the patterns that most closely match the data [9]. The semantic concepts in email messages are described
using this technique. Condensing the features of a target population (domain) into a set of understandable patterns is a vital function of semantic concept description in data mining. SD is a data mining technique for discovering relations between different objects (such as emails) and particular properties of a target variable (the class). These relations are encoded as rules of the form:

r ∶ cond → y

where cond is a conjunction of properties and y is the target variable (in our case, spam or ham). The objective of SD is not to generate a global model. Instead, it makes it possible to spot particular patterns of interest and extract knowledge that can then be analysed and evaluated for descriptive purposes.

• CN2 rule induction algorithm: The CN2 algorithm is one of the conventional rule-based learning methods for producing propositional classification rules. The algorithm is made up of two fundamental parts: a low-level component and a high-level component. The low-level component, usually referred to as the search strategy, searches for a single rule that covers many examples [10]. The high-level component, also referred to as the control procedure, repeatedly executes the low-level component to induce a set of rules. Many heuristic metrics are used in the literature to assess the quality of an induced rule at the low level. The CN2 algorithm can employ two high-level control procedures: one produces an ordered list of rules and the other produces an unordered set of rules. For an ordered list, the low-level component uses heuristic metrics to choose the best rule for the current training set. In each iteration of the search procedure, the high-level component removes all examples covered by the induced (learned) rule, until all examples in the training data are covered [11]. For an unordered set of rules, the control procedure (high-level) is repeated to learn the rules for each class separately. With each learned rule, only the covered examples that belong to the rule's class are removed, rather than all covered examples as is the case for an ordered list. CN2 removes the examples that learned rules cover so that the same rule is not induced again in later iterations.

1.2 Generation of Domain-specific Classifiers

To develop a domain-specific classifier for each distinct domain, the semantic features extracted in the preceding stage are used as learning attributes [12]. The classification of email messages is a supervised learning task. It seeks to create a probabilistic model of a function for email classification. Supervised learning on the text of email messages presents a learning algorithm with a set of pre-classified, or labelled, patterns, where each whole email message serves as one example to be classified. This set is referred to as the training set. Some labelled messages are withheld from the training set before the model is built so that they can be used to test its effectiveness. This collection serves as the testing set. To evaluate the classification accuracy of the obtained model, several models are built using different partitions of the instances into training and testing sets [13]. The classification error is then averaged over all models. The procedure is performed n times, where n is the number of partitions of the instance set. Building and evaluating several models in this cycle constitutes repeated cross-validation. Once developed, the model can be used to classify incoming emails.

II. LITERATURE REVIEW

N. Saidani, et al. (2020) focused on analysing text semantics to improve the accuracy of spam detection [14]. A technique based on two levels of semantic analysis was investigated. First, emails were classified into specific domains, such as the healthcare, educational and commercial sectors, so that a separate conceptual view of spam was obtained for each domain. Subsequently, a set of manually and automatically extracted semantic attributes was used to detect spam in each domain. These features summarized the email content into compact topics, which helped to distinguish spam from authentic emails efficiently. The results showed that the investigated technique was more efficient than the traditional techniques and gave more interpretable results.
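The two-level idea summarized for Saidani et al. [14] (first assign an email to a domain, then apply a spam classifier trained for that domain) can be illustrated with a minimal sketch. This is not the method of [14]: it substitutes a small hand-written naive Bayes model over raw words for the CN2-SD semantic features, and the class `TinyNaiveBayes`, the helper functions, and the toy training data are all hypothetical names and values introduced only for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus per-word log likelihoods: the "naive" assumption
            # treats every word as independent given the class.
            score = math.log(n / total)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: (text, domain, spam/ham label).
TRAIN = [
    ("cheap pills cure everything buy now", "health", "spam"),
    ("your appointment with the doctor is confirmed", "health", "ham"),
    ("miracle diet pills buy now", "health", "spam"),
    ("clinic opening hours and doctor schedule", "health", "ham"),
    ("win free money transfer now", "money", "spam"),
    ("your bank statement is ready", "money", "ham"),
    ("free money claim your prize now", "money", "spam"),
    ("invoice and bank payment receipt attached", "money", "ham"),
]


def train_two_level(rows):
    """Train one domain classifier and one spam classifier per domain."""
    domain_clf = TinyNaiveBayes().fit([t for t, _, _ in rows],
                                      [d for _, d, _ in rows])
    spam_clfs = {}
    for domain in {d for _, d, _ in rows}:
        subset = [(t, y) for t, d, y in rows if d == domain]
        spam_clfs[domain] = TinyNaiveBayes().fit([t for t, _ in subset],
                                                 [y for _, y in subset])
    return domain_clf, spam_clfs


def classify(text, domain_clf, spam_clfs):
    """Level 1 picks the domain; level 2 applies that domain's spam model."""
    domain = domain_clf.predict(text)
    return domain, spam_clfs[domain].predict(text)


domain_clf, spam_clfs = train_two_level(TRAIN)
print(classify("buy miracle pills now", domain_clf, spam_clfs))
```

On this toy data the example message is routed to the health domain and labelled spam by the health-specific model. With realistic data, each per-domain model would be trained on the semantic features of its own domain rather than on raw word counts.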
G. Andresini, et al. (2022) developed a novel technique known as EUPHORIA for distinguishing between spam and authentic reviews [15]. In this technique, MVL (multi-view learning) was integrated with DL (deep learning) to attain higher accuracy by exploiting different information related to the content of reviews and the behavior of reviewers. Two datasets from Yelp.com, Hotel and Restaurant, were employed to conduct the experiments. The results validated that the developed technique enhanced the efficacy of the DL algorithm in detecting spam in reviews. Moreover, this technique offered an AUC-ROC of around 0.813 on the first dataset and 0.708 on the second dataset.

C. Kumar, et al. (2023) formulated a hybrid mechanism called SMOTE-ENN (Synthetic Minority Oversampling Technique-Edited Nearest Neighbor) for detecting spam on Twitter [16]. Both algorithms were combined to generate balanced data. Different DL (deep learning) methods were presented that used this data to recognize a tweet as spam or genuine. Moreover, classifiers such as DT (Decision Tree), SVM (Support Vector Machine) and LR (Logistic Regression) were implemented. A simulation and comparative analysis was conducted to quantify the formulated mechanism with respect to different parameters. The formulated mechanism performed well, and the RF (Random Forest) algorithm yielded an accuracy of 99.26%, recall of 99.07% and precision of 99.49%.

X. Liu, et al. (2021) suggested a modified Transformer algorithm to detect SMS spam messages [17]. The SMS Spam Collection v.1 dataset and UtkMl's dataset were used to compare the suggested algorithm against diverse ML (machine learning) algorithms. The experimental results reported that the suggested algorithm was more effective and yielded an accuracy of 98.92%, recall of up to 94.51%, and F1-score of 96.13%. Moreover, the suggested algorithm also outperformed the other methods on the second dataset, which indicated its adaptability to other similar problems.

Z. Zhang, et al. (2020) focused on analysing Twitter spam attributes, namely user attributes, content, activity and association [18]. A new spam detection algorithm based on RELM (regularized extreme learning machine), called I2FELM (Improved Incremental Fuzzy-kernel-regularized Extreme Learning Machine), was introduced for detecting Twitter spam accurately. The experimental results revealed the effectiveness of the introduced algorithm on both balanced and unbalanced datasets. Additionally, based on some characteristics, the introduced algorithm detected spam more successfully than the conventional methods.

G. Al-Rawashdeh, et al. (2019) devised a hybrid approach combining WC (Water Cycle) and SA (Simulated Annealing) for optimizing the results and detecting spam [19]. The approach comprised groundwork, introduction, enhancement, estimation and comparison quality. The model was trained and tested using cross-validation, and the devised approach was evaluated on 7 datasets for classifying spam. This work exploited a meta-heuristic called WCFS (water cycle feature selection) and 3 schemes of hybridization with SA as a feature selection technique. The experimental results confirmed that the devised approach attained an accuracy of 96.3%. This approach also helped to reduce the number of attributes.

S. A. A. Ghaleb, et al. (2022) designed a wrapper technique based on MOGOA (multi-objective grasshopper optimization algorithm) to improve the efficiency of an SDS (spam detection system) [20]. The technique was used to extract the attributes. Moreover, the recently revised EGOA algorithm was utilized to train an MLP (multilayer perceptron). The SpamBase, SpamAssassin, and UK-2011 datasets were used to evaluate the designed technique. The simulation outcomes demonstrated the superiority of the designed technique over other methods. In addition, the accuracy of the designed technique was measured at 97.5% on the first dataset, 98.3% on the second, and 96.4% on the last dataset.

D. Liu, et al. (2020) proposed an innovative detection technique in which the viewpoint of users was considered and screenshots of malicious webpages were captured for invalidating Web spam [21]. A CNN (Convolutional Neural Network), a form of DNN (deep neural network), was implemented as the classifier. The proposed technique was evaluated experimentally. Initially, it was compared with other ML (machine learning) based methods. Subsequently, it was tested on detecting malicious websites in a real-time Web environment. The experimental outcomes revealed that, in contrast to the traditional methods, the proposed technique was applicable to a practical Web environment.
J. D. Rosita, et al. (2022) recommended the MOGA–CNN–DLAS (Multi-Objective Genetic Algorithm and a CNN-based Deep Learning Architectural Scheme) method to detect Twitter spam [22]. The MO (multi-objective optimization) procedure was integrated with selection, mutation, and cross-layer to assist in classifying tweets as genuine or malicious spam tweets. The experimental outcomes showed that the recommended method improved accuracy by up to 0.17, precision by around 0.13, recall by 0.10 and F-score by 0.19, and reduced the RMSE by around 19%, the MAD by 16%, and the MAE by 21%.

X. Tong, et al. (2021) established a CapsNet (capsule network) model in which an LSA (long-short attention) mechanism was adopted to attain higher efficacy in detecting Chinese spam [23]. The text was represented using an MCS (multi-channel structure) based on the LSA mechanism in order to capture the complex text attributes in spam and generate contextual word vectors with more semantic information. The attributes were then extracted and classified, with the model enhancing the structure of the classic CapsNet and optimizing the dynamic routing algorithm. Hence, the established model offered higher accuracy at a higher running speed. Experimental results reported the superiority of the established model over the existing methods for classifying and detecting spam, with an accuracy of 98.72% on an unbalanced dataset and 99.30% on a balanced dataset.

A. S. Mashaleh, et al. (2022) introduced a new method in which the HHO (Harris Hawks optimizer) algorithm was combined with the KNN (K-Nearest Neighbor) algorithm for classifying spam [24]. The HHO algorithm is based on the cooperative behaviour of Harris' hawks. The introduced algorithm helped in handling high-dimensional data. Moreover, its accuracy was higher than that of the traditional methods. According to the experimental results, the introduced method yielded an accuracy of 94.3% for classifying and detecting spam.
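Most of the studies reviewed above report accuracy, precision, recall and F1-score. As a reminder of how these figures relate to one another, the following minimal sketch computes them from true and predicted labels, treating spam as the positive class. The function name `spam_metrics` and the toy labels are assumptions made for illustration; they do not come from any of the cited papers.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    tn = sum(t != positive and p != positive for t, p in pairs)  # ham passed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels: 3 spam and 5 ham messages.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
print(spam_metrics(y_true, y_pred))
```

In this toy case two of three spam messages are caught and one ham message is wrongly flagged, so accuracy is 0.75 while precision, recall and F1 are each 2/3. The gap shows why accuracy alone can look favourable on unbalanced data, which is a concern several of the reviewed works address by balancing their datasets.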
[7] S. Shrivastava and R. Anju, "Spam mail detection through data mining techniques," 2017 International Conference on Intelligent Communication and Computational Techniques (ICCT), Jaipur, India, 2017, pp. 61-64

[8] W. Peng, L. Huang, J. Jia and E. Ingram, "Enhancing the Naive Bayes Spam Filter Through Intelligent Text Modification Detection," 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), New York, NY, USA, 2018, pp. 849-854

[9] S. E. Rahman and S. Ullah, "Email Spam Detection using Bidirectional Long Short Term Memory with Convolutional Neural Network," 2020 IEEE Region 10 Symposium (TENSYMP), Dhaka, Bangladesh, 2020, pp. 1307-1311

[10] R. P. Cota and D. Zinca, "Comparative Results of Spam Email Detection Using Machine Learning Algorithms," 2022 14th International Conference on Communications (COMM), Bucharest, Romania, 2022, pp. 1-5

[11] N. Nisar, N. Rakesh and M. Chhabra, "Voting-Ensemble Classification for Email Spam Detection," 2021 International Conference on Communication information and Computing Technology (ICCICT), Mumbai, India, 2021, pp. 1-6

[12] V. Vishagini and A. K. Rajan, "An Improved Spam Detection Method with Weighted Support Vector Machine," 2018 International Conference on Data Science and Engineering (ICDSE), Kochi, India, 2018, pp. 1-5

[13] T. Toma, S. Hassan and M. Arifuzzaman, "An Analysis of Supervised Machine Learning Algorithms for Spam Email Detection," 2021 International Conference on Automation, Control and Mechatronics for Industry 4.0 (ACMI), Rajshahi, Bangladesh, 2021, pp. 1-5

[14] N. Saidani, K. Adi and M. S. Allili, "A semantic-based classification approach for an enhanced spam detection," Computers & Security, vol. 11, no. 2, pp. 6594-6609, 9 January 2020

[15] G. Andresini, A. Iovine and A. Appice, "EUPHORIA: A neural multi-view approach to combine content and behavioral features in review spam detection," Journal of Computational Mathematics and Data Science, vol. 7, no. 4, pp. 170003-170011, 22 April 2022

[16] C. Kumar, T. S. Bharti and S. Prakash, "A hybrid Data-Driven framework for Spam detection in Online Social Network," Procedia Computer Science, vol. 218, pp. 124-132, 31 January 2023

[17] X. Liu, H. Lu and A. Nayak, "A Spam Transformer Model for SMS Spam Detection," in IEEE Access, vol. 9, pp. 80253-80263, 2021

[18] Z. Zhang, R. Hou and J. Yang, "Detection of Social Network Spam Based on Improved Extreme Learning Machine," in IEEE Access, vol. 8, pp. 112003-112014, 2020

[19] G. Al-Rawashdeh, R. Mamat and N. Hafhizah Binti Abd Rahim, "Hybrid Water Cycle Optimization Algorithm With Simulated Annealing for Spam E-mail Detection," in IEEE Access, vol. 7, pp. 143721-143734, 2019

[20] S. A. A. Ghaleb et al., "Feature Selection by Multiobjective Optimization: Application to Spam Detection System by Neural Networks and Grasshopper Optimization Algorithm," in IEEE Access, vol. 10, pp. 98475-98489, 2022

[21] D. Liu and J.-H. Lee, "CNN Based Malicious Website Detection by Invalidating Multiple Web Spams," in IEEE Access, vol. 8, pp. 97258-97266, 2020

[22] J. D. Rosita P and W. S. Jacob, "Multi-Objective Genetic Algorithm and CNN-Based Deep Learning Architectural Scheme for effective spam detection," International Journal of Intelligent Networks, vol. 10, no. 2, pp. 5207-5222, 2 February 2022

[23] X. Tong et al., "A Content-Based Chinese Spam Detection Method Using a Capsule Network With Long-Short Attention," in IEEE Sensors Journal, vol. 21, no. 22, pp. 25409-25420, 15 Nov. 2021

[24] A. S. Mashaleh, N. F. B. Ibrahim and Q. M. Yaseen, "Detecting Spam Email with Machine Learning Optimized with Harris Hawks optimizer (HHO) Algorithm," Procedia Computer Science, vol. 201, pp. 659-664, 27 April 2022