Message Spam Identification by Naive Bayes Classifier Algorithm Using Machine Learning

Volume 9, Issue 3, March – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Lokam. Devi Naga Srinu1
Meesala. Dhanush Kumar2
Mulaparthi. Mani Gopal3
Swarnandhra College of Engineering and Technology

Abstract:- With the spread of modern life, messaging has become one of the most important forms of communication. SMS (Short Message Service) is a text messaging service available on all smartphones and mobile phones. Unlike chat-based communication applications such as Facebook and WhatsApp, SMS does not require an internet connection. SMS traffic has increased significantly, and spam has increased rapidly with it. Hackers and spammers try to scam device users through SMS. As a result, SMS-enabled mobile devices have become more vulnerable. Spam messages may advertise business promotions or lottery wins, or ask for credit card information. Spammers send such messages for financial or commercial gain; attackers also try to break into devices by sending spam links that, when clicked, give them control of the mobile device. To analyze this communication, the authors developed a system that can analyze malicious messages and determine whether they are HAM or SPAM. Here, we use text classification methods such as the Naive Bayes classifier algorithm to classify the texts and determine whether a message is spam or not.

Keywords:- Machine Learning, Language Processing, Spam, Ham, SMS, Naive Bayes, Logistic Regression.

I. INTRODUCTION

Short text messages (SMS) are more than just a conversation. SMS, which was first defined as part of the GSM family of standards in 1985, is a method of sending messages of up to 160 characters to GSM mobile phones. SMS technology is based on international mobile communications standards and is recognized worldwide. Spam is the misuse of electronic messaging to send unsolicited messages, some of them illegal. While the most widely recognized form of spam is email spam, the term is also applied to similar abuse in other media, including SMS. SMS spam consists of unsolicited text messages, often sent for commercial benefit. Spam messages are used for marketing and to distribute phishing links. Commercial spammers use malware to send spam because spam is illegal in many countries. Sending spam from an infected computer reduces the spammer's risk by masking the source of the spam. An SMS can contain only a limited number of characters, including letters, numbers, and some symbols.

Spam messages tend to follow a clear pattern. Nearly all spam messages ask users to call a number, reply by text message, or visit a URL. This pattern can be seen in the results of a simple SQL query over the spam messages. The low cost and high bandwidth of SMS networks attract a large volume of spam. Every time a spam message arrives in the user's inbox, a notification is sent to the user's phone. Users are annoyed when they see spam, and unwanted messages take up space in their phone storage. The aim of this project is to apply different machine learning algorithms to the spam message classification problem, to gain insight into the problem by comparing their performance, and to build an application based on one of these algorithms that can filter SMS spam with high accuracy. Feature extraction and initial analysis are carried out in MATLAB, and the scikit-learn library is then used to implement the machine learning algorithms in Python.

Spam and not spam: Spam is the use of electronic communications to send unsolicited messages, especially advertisements and malicious links. Therefore, if you do not know the sender, the message may be spam. Many users do not realize that when they download free services, software, or updates, they may also be signing up to receive such messages. "Ham" means the message is not spam.

In this project, we apply the Naive Bayes technique to learn a model that can categorize text as spam or non-spam. Spam messages frequently use words such as "free", "win", "money", and "gift". Such words are chosen to grab the reader's attention. Additionally, spam messages often contain words written in all capital letters and many exclamation marks. Spam messages are usually easy for recipients to detect, and our goal is to train a model to do this for us. Spam detection is a binary classification problem, since each message is categorized as either "spam" or "not spam". It is also a supervised learning problem, because we feed labeled data to the model, from which it learns to make predictions about future messages.

SMS is a text communication service that allows mobile phone users to exchange short text messages. It is the most widely used data application and was expected to have 3.5 billion users by the end of 2010, which is approximately 80% of all mobile phone users [3]. As the popularity of the service grows, we also see an increase in the number of unsolicited advertisements sent to mobile phones by text message. SMS spam is still less common than email spam: in 2010, about 90% of emails were spam, whereas SMS spam is not yet a big problem in North America, where it accounted for less than 1% of text messages as of December 2012 [4]. But due

IJISRT24MAR103 www.ijisrt.com 58
to the significant growth of the youth market and the long-term decline in the cost of sending SMS (in China it now costs less than $0.001 to send a text message), SMS spam is growing. In 2012, up to 30% of text messages in parts of Asia were spam. In the Middle East, some operators are themselves responsible for sending promotional text messages. Additionally, SMS spam is more troublesome than email spam because in some countries it also imposes a cost on the recipient. These factors, combined with the limited availability of mobile spam filters, make SMS spam an interesting problem worth studying. There are several differences between spam filtering for text messages and for emails. The number of real SMS spam datasets is very small compared with the amount of email data available. Moreover, because text messages are short, the number of features available for classifying them is smaller than the corresponding number for emails.

II. LITERATURE SURVEY

[1] Due to the great popularity of the short message service (SMS), it has become an attractive target for spammers. Spam messages can trick mobile phone users into revealing confidential information, which can lead to serious consequences. The magnitude of this problem increases the need for spam filtering solutions. Machine learning algorithms have become some of the best tools for classifying text data. This approach fits our case well because it separates SMS messages into two labels: spam or normal. By merging two machine learning techniques, supervised and unsupervised learning, this paper presents an SMS spam filtering solution. The new hybrid system is designed to improve spam filtering accuracy and F-measure.

[2] In today's world, where digitalization is everywhere, messaging has grown to be among the most significant channels of communication. In contrast to social media networks like Facebook and WhatsApp, messaging does not require an internet connection at all. Hackers and spammers are known to try to break into devices, and SMS-enabled mobiles have become more vulnerable as attackers try to infiltrate the system by sending unsolicited links; the attacker gains remote access to the mobile phone when the user clicks on such a link. Therefore, to identify these messages, the authors developed a system that can detect whether a message is SPAM or not. The authors use the TF-IDF vectorizer to create a dictionary containing the entire content of spam messages.

[3] Although today's mobile phones continue to evolve with many different communication media, SMS still remains a preferred communication tool. However, as the cost of SMS has decreased, SMS spam has increased, and some people use SMS as another channel for advertising and fraud. It has therefore become a major problem that affects and harms users, and one solution is automatic SMS spam filtering. One of the major problems in spam filtering is accuracy. This work aims to improve SMS spam filtering by combining frequent-pattern mining with classification. FP-Growth is used to mine frequent patterns in SMS messages, and the Naive Bayes classifier is used to classify each SMS as spam or regular. Training uses SMS spam data collected in earlier studies. For the SMS Spam Collection v.1 dataset, combining Naive Bayes with FP-Growth achieves the highest average accuracy of 98.506%, which is 0.025% better than without FP-Growth, and improves the precision score, thus making the classification more effective.

[4] Short Message Service (SMS) spam refers to unsolicited or unwanted messages received on mobile phones. These spam messages are a real problem for phone users. The practice is also worrisome for service providers because it can upset their customers and even cause them to lose customers. To reduce it, researchers have proposed many solutions to detect and filter spam messages. This article reviews the existing methods, their pros and cons, and future research on detecting and filtering mobile SMS spam and reducing its risk. The body of research literature was examined and evaluated. The most common SMS spam detection, filtering, and mitigation techniques are compared, including the datasets used and their outcomes, and limitations and future research directions are discussed. The main goal of the review is to help researchers identify areas that require further development.

[5] Spam message analysis is an important task in identifying and filtering spam messages. The number of SMS messages sent each day is increasing, and it becomes more difficult for users to remember the new SMS messages they receive in their inbox and associate them with messages they have received before. In this study, the problems of SMS spam detection and thread identification are discussed. The approach has two phases. In the first phase, a binary classification algorithm divides the messages into two groups, i.e., spam and non-spam. Then, in the second phase, non-negative matrix factorization and the K-means clustering algorithm are applied. A similarity feature for grouping messages into threads, namely the time between consecutive communications, is described, and its effect on thread identification is analyzed. Performance parameters such as accuracy, precision, recall, and F-measure are also evaluated. The SMS threads identified in this work can be used by other applications, such as SMS content summarization, organization of the SMS inbox, and other SMS management tasks.

III. METHODOLOGY

After cleaning the dataset, we divide it into training and test sets. The Naive Bayes classifier is trained on the training set, and its performance is evaluated on the test set.

Description of the Dataset: We used the SMS Spam Collection v.1 dataset [9], downloaded from [10]. The dataset contains 5572 text messages, each labeled as spam or regular (ham). It has two columns, v1 and v2. The v1 column holds the label, which takes only two values, ham and spam, indicating whether the text in the v2 column is a legitimate message or spam.

The data are provided as a CSV (comma-separated values) file. The messages in the collection come from the Grumbletext website, the NUS SMS Corpus (NSC), Caroline Tagg's PhD thesis, and the SMS Spam Corpus v.0.1 Big. Of the 5572 messages, 747 are labeled as spam and 4825 as regular messages.

The original columns v1 and v2 are renamed class and text, respectively. After renaming the columns, we shuffle the dataset to reduce overfitting. After shuffling, the dataset is cleaned: all text is converted to lowercase, and punctuation, numbers, stop words, and URLs are removed.

Naive Bayes Classifier: After preprocessing, the dataset is divided into training data and test data. The training set contains 4,000 messages, of which 3,461 are regular messages and 539 are spam. The test set contains the remaining 1,572 messages, of which 1,364 are regular messages and 208 are spam. For classification, a model is built that is then used to predict class labels [11]. First, we convert the text of the training data into a document-term matrix and remove words with a frequency of less than 5. Each 0 entry of the document-term matrix is replaced with "No", and every non-zero entry is replaced with "Yes", so the matrix has only two values: "Yes" and "No". This matrix and the word list from the training data are used to train the Naive Bayes classifier. Similarly, a document-term matrix is created for the test data, and the trained Naive Bayes classifier is used to predict the class labels.

Naive Bayes is one of the most straightforward and effective classification algorithms. For a hypothesis A and evidence B, Bayes' theorem relates the prior probability of the hypothesis, $P(A)$, to its posterior probability after the evidence has been observed, $P(A \mid B)$:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Where:
A, B = events
$P(A \mid B)$ = probability of A given that B is true
$P(B \mid A)$ = probability of B given that A is true
$P(A)$, $P(B)$ = the independent probabilities of A and B

The Naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem, which updates a hypothesis given the data and some prior information. It assumes that the features are independent of one another given the class. Despite this basic assumption, which often does not hold in practice, the Naive Bayes classifier is widely used in many applications due to its effectiveness and efficiency.

The Naive Bayes classifier is one of the simplest forms of Bayesian network model. It can achieve high accuracy together with fast prediction. The method can use kernel functions to estimate the probabilities of the inputs, which allows the classifier to perform better under certain conditions. Hence, Naive Bayes is a powerful tool in machine learning. It works especially well in text classification tasks such as filtering out spam messages.
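The classifier described in this section, trained on "Yes"/"No" word-presence features, can be illustrated with a minimal sketch. This is not the authors' code: the class name `BernoulliNaiveBayes`, the Laplace (add-one) smoothing, the `min_count` parameter, and the six toy messages are all assumptions made for illustration, and the sketch uses only the Python standard library rather than scikit-learn.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it on whitespace."""
    return text.lower().split()


class BernoulliNaiveBayes:
    """Naive Bayes over binary word-presence features (hypothetical helper)."""

    def fit(self, messages, labels, min_count=1):
        # Vocabulary: words that occur in at least `min_count` training messages.
        doc_freq = Counter(w for m in messages for w in set(tokenize(m)))
        self.vocab = sorted(w for w, c in doc_freq.items() if c >= min_count)
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.p_word = {}
        for c in self.classes:
            docs = [set(tokenize(m)) for m, y in zip(messages, labels) if y == c]
            self.log_prior[c] = math.log(len(docs) / len(labels))
            # P(word present | class) with add-one smoothing (an assumption),
            # so that no probability is exactly 0 or 1.
            self.p_word[c] = {
                w: (sum(w in d for d in docs) + 1) / (len(docs) + 2)
                for w in self.vocab
            }
        return self

    def log_scores(self, message):
        """Log of prior times likelihood for each class (unnormalized posterior)."""
        present = set(tokenize(message))
        scores = {}
        for c in self.classes:
            total = self.log_prior[c]
            for w in self.vocab:
                p = self.p_word[c][w]
                total += math.log(p if w in present else 1.0 - p)
            scores[c] = total
        return scores

    def predict(self, message):
        scores = self.log_scores(message)
        return max(scores, key=scores.get)


# Toy training data, invented for this example.
train_messages = [
    "win free money now",
    "free gift claim your prize",
    "win a free gift today",
    "are we meeting for lunch today",
    "call me when you get home",
    "see you at the meeting",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = BernoulliNaiveBayes().fit(train_messages, train_labels)
print(model.predict("claim your free money"))  # spam
print(model.predict("lunch at home today"))    # ham
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. Passing `min_count=5` to `fit` would mirror the frequency cutoff mentioned in the methodology, although the toy data here is too small for that setting to leave any vocabulary.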

Fig (1): Data Set
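The preprocessing applied to the dataset (lowercasing; removing punctuation, numbers, stop words, and URLs; then shuffling and splitting) might be sketched as follows. The stop-word list, the URL pattern, the function names, and the sample rows are assumptions for illustration only, since the paper does not specify them.

```python
import random
import re
import string

# Small illustrative stop-word list (an assumption; the paper does not give one).
STOP_WORDS = {"a", "an", "the", "to", "is", "and", "or", "you", "your", "for", "of", "in"}

# Hypothetical URL pattern covering http(s):// and www. prefixes.
URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")

# Translation table that maps every punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_message(text):
    """Lowercase the text and drop URLs, digits, punctuation, and stop words."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def shuffle_and_split(rows, train_size, seed=0):
    """Shuffle (class, text) rows reproducibly and cut them into train and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows[:train_size], rows[train_size:]


# Sample rows in the renamed (class, text) layout; the messages are invented.
rows = [
    ("spam", "WINNER!! Claim your FREE prize at http://example.com now, call 0800123456"),
    ("ham", "Are you coming to the meeting at 5?"),
    ("ham", "I'll call you later."),
    ("spam", "Free entry in 2 a weekly competition!"),
]
cleaned = [(label, clean_message(text)) for label, text in rows]
train, test = shuffle_and_split(cleaned, train_size=3)

print(cleaned[0][1])  # winner claim free prize at now call
```

URLs are removed before punctuation so that a link is dropped as a whole rather than broken into word fragments. On the full dataset, `train_size=4000` would correspond to the split sizes reported in the methodology.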

IV. RESULT

Fig (2): Code

Output:
Accuracy score: 0.9885139985642 (approximately 98.85%)

Fig (3): Precision-Recall Curve
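For readers unfamiliar with how a precision-recall curve is obtained, the following sketch computes its points from classifier scores by lowering the decision threshold one distinct score at a time. The function name, the labels, and the scores are invented for illustration and are not the values behind the figure.

```python
def precision_recall_points(y_true, scores, positive="spam"):
    """Return (recall, precision) pairs, one per distinct score threshold."""
    pos = sum(1 for y in y_true if y == positive)
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Count every message whose score ties with the current threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / pos, tp / (tp + fp)))
    return points


# Invented example: a higher score means "more likely spam".
y_true = ["spam", "spam", "ham", "spam", "ham", "ham"]
scores = [0.95, 0.80, 0.70, 0.60, 0.30, 0.10]

pr_points = precision_recall_points(y_true, scores)
for recall, precision in pr_points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

The sketch assumes at least one positive (spam) example is present; otherwise recall is undefined.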

Fig (4): Confusion Matrix
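The four cells of a confusion matrix, and the accuracy, precision, recall, and F-measure derived from them, can be computed as in this small sketch. The function names and the label lists are made up for illustration; they are not the paper's test-set predictions.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summarize(tp, fp, fn, tn):
    """Derive the usual summary metrics from the confusion-matrix cells."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels for eight messages.
y_true = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham", "spam", "spam", "ham"]

counts = confusion_counts(y_true, y_pred)
metrics = summarize(*counts)
print(counts)               # (2, 1, 1, 4)
print(metrics["accuracy"])  # 0.75
```

Because regular messages greatly outnumber spam in this dataset, precision and recall for the spam class are more informative than accuracy alone.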

Fig (5): ROC Curve
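An ROC curve plots the true positive rate against the false positive rate as the decision threshold is lowered, and the area under it (AUC) summarizes the curve in one number. The sketch below shows one way to compute both. The function names, the labels, and the scores are illustrative assumptions, and the code assumes both classes are present.

```python
def roc_points(y_true, scores, positive="spam"):
    """Return (false positive rate, true positive rate) pairs, starting at (0, 0)."""
    pos = sum(1 for y in y_true if y == positive)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every message tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented example: a higher score means "more likely spam".
y_true = ["spam", "spam", "ham", "spam", "ham", "ham"]
scores = [0.95, 0.80, 0.70, 0.60, 0.30, 0.10]

points = roc_points(y_true, scores)
area = auc(points)
print(f"AUC = {area:.4f}")  # AUC = 0.8889
```

In this example the AUC of 8/9 equals the fraction of (spam, ham) pairs in which the spam message receives the higher score.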

V. CONCLUSION

We conclude that the Naive Bayes algorithm is well suited to SMS spam classification. The Multinomial Naive Bayes algorithm is also worth studying, as it has applications in many industries and its predictions are fast enough to be made in real time. News classification is one of the most popular uses of the Naive Bayes algorithm, for example sorting articles into political, regional, international, and other categories.

REFERENCES

[1]. Baaqeel, Hind, and Rachid Zagrouba. "Hybrid SMS Spam Filtering System Using Machine Learning Techniques." 2020 21st International Arab Conference on Information Technology (ACIT). IEEE, 2020.
[2]. Gupta, Suparna Das, Soumyabrata Saha, and Suman Kumar Das. "SMS Spam Detection Using Machine Learning." Journal of Physics: Conference Series. Vol. 1797. No. 1. IOP Publishing, 2021.
[3]. Arifin, Dea Delvia, Shaufiah, and Moch. Arif Bijaksana. "Enhancing Spam Detection on Mobile Phone Short Message Service (SMS) Performance Using FP-Growth and Naive Bayes Classifier." 2016 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob). IEEE, 2016.
[4]. Abdulhamid, Shafi'i Muhammad. "A Review on Mobile SMS Spam Filtering Techniques." IEEE Access, 2017.
[5]. Nagwani, Naresh Kumar, and Aakanksha Sharaff. "SMS Spam Filtering and Thread Identification Using Bi-level Text Classification and Clustering Techniques." Journal of Information Science, 2017.
