Message Spam Identification by Naive Bayes Classifier Algorithm Using Machine Learning
ISSN No:-2456-2165
Abstract:- With the spread of modern life, messaging has become one of the most important forms of communication. SMS (Short Message Service) is a text messaging service available on all smartphones and mobile phones. Unlike chat-based communication applications such as Facebook and WhatsApp, SMS does not require an internet connection. SMS traffic has increased significantly, and spam has increased rapidly with it. Hackers and spammers try to scam users and compromise devices through SMS. As a result, mobile devices that support SMS have become more vulnerable. Spam messages may concern business promotion, lottery announcements, requests for credit card information, and so on. Spammers also send messages to obtain financial or commercial benefit; for example, attackers attempt to compromise the system by sending spam links that, when clicked, allow them to control the mobile device. To address this, the authors developed a system that analyzes messages and determines whether they are HAM or SPAM. Here, we use a text classification method, the Naive Bayes classifier algorithm, to classify the texts and determine whether each message is spam or not.

Keywords:- Machine Learning, Language Processing, Spam, Ham, SMS, Naive Bayes, Logistic Regression.

I. INTRODUCTION

Short text messages (SMS) are more than just a conversation. SMS, first defined as part of the GSM family of standards in 1985, is a method of sending messages of up to 160 characters to GSM mobile phones. SMS technology is based on international mobile communications standards and is recognized worldwide. Spam is the misuse of electronic messaging to send unsolicited messages. While SMS spam is the type considered here, the word is also used for similar abuse in other media. SMS spam messages are unsolicited messages, similar to email spam, often sent for commercial benefit. Spam is used for marketing and for sending phishing links. Because spam is illegal in many countries, commercial spammers use malware to send it: sending spam from an infected computer reduces the spammer's risk by masking the source of the spam. An SMS can contain a limited number of characters, including letters, numbers, and some symbols.

Looking through the messages reveals a clear pattern. Nearly all spam messages ask users to call a number, respond to a text message, or visit a URL. This pattern can be seen in the results of a simple SQL query over the spam messages. The low cost and high bandwidth of SMS networks attract a large number of spam messages. Every time a spam message arrives in the user's inbox, a notification is sent to the user's phone. Users get annoyed when spam and malicious messages take up space in their phone storage. The aim of this project is to apply different machine learning algorithms to the spam message classification problem, to gain insight into the problem by comparing their performance, and to build an application based on one of these algorithms that can effectively filter spam messages. Feature extraction and initial analysis are carried out in MATLAB, and various machine learning algorithms are then implemented in Python using functions from the scikit-learn library.

Spam and not spam: Spam refers to the use of electronic communications to send unsolicited messages, especially advertisements and malicious links. Therefore, if you do not know the sender, the message may be spam. Many users do not realize that when downloading free services, software, or updates, they may also be signing up to receive such messages. "Ham" means the message is not spam.

In this project, we apply the Naive Bayes technique to learn a model that can categorize text as spam or non-spam. Spam messages frequently use words such as "free", "win", "money", and "gift". These words are chosen to grab the reader's attention. Additionally, spam messages often contain words written in all capital letters and use many exclamation marks. Spam is often easily detected by human recipients, and our goal is to train a model to do this for us. Spam detection is a binary classification problem, since each message is categorized as either "spam" or "not spam." It is also a supervised learning problem, because we feed labeled data to the model so that it can learn to make predictions on future messages.

SMS is a text communication system that allows mobile phone users to send text messages. It is the most widely used data application and was expected to have 3.5 billion users by the end of 2010, approximately 80% of all mobile phone users [3]. As the importance of SMS grows, we also see an increase in the number of unsolicited messages sent to mobile phones. SMS spam is not yet as widespread as email spam. In 2010, about 90% of emails were spam, whereas SMS spam was not yet a major problem in North America, where it made up less than 1% of text messages as of December 2012 [4]. But due
IJISRT24MAR103 www.ijisrt.com 58
Volume 9, Issue 3, March – 2024 International Journal of Innovative Science and Research Technology
to the significant growth of the youth market and the long-term decline in the cost of SMS (the cost of sending a text message in China is now under $0.001), the potential for SMS spam is growing. In 2012, the share of spam rose to 30% of text messages in parts of Asia. In the Middle East, some operators are themselves responsible for sending promotional text messages. Additionally, SMS spam is more problematic than email spam because in some countries recipients are charged for incoming messages. These factors, combined with the limited availability of mobile spam filters, make SMS spam an interesting problem worth studying. There are several differences between spam filtering for text messages and for emails. The number of available SMS spam datasets is very small compared to those for email. Moreover, due to the short length of text messages, the number of features available for classification is smaller than the corresponding number for emails.

II. LITERATURE SURVEY

[1] Due to the great popularity of short message services (SMS), spammers have managed to reach many targets. Spam messages can trick mobile phone users into revealing confidential information, which can lead to serious consequences. The magnitude of this problem increases the need for spam filtering solutions. Machine learning algorithms have become among the best tools for classifying data into labels. This description fits our case well because SMS messages are separated into two labels: spam or ham. By combining two machine learning techniques, supervised and unsupervised learning algorithms, this paper presents an SMS spam filtering solution. The new hybrid system is designed to improve spam filtering accuracy and F-measure.

[2] In today's world, where digitalization is everywhere, messaging has grown to be among the most significant channels of communication. In contrast to social media networks like Facebook and WhatsApp, SMS does not require an internet connection at all. Hackers and spammers try to compromise devices through SMS, and SMS-capable mobile phones have become more vulnerable as attackers try to infiltrate the system by sending unsolicited links; when the user clicks on such a link, the attacker gains remote access to the mobile phone. Therefore, the authors have developed a system that can detect these types of messages and determine whether a message is SPAM or not. The authors use the TF-IDF vectorizer to create a dictionary containing the entire content of the spam messages.

[3] Although today's mobile phones continue to evolve with many different communication media, SMS still remains a preferred communication tool. However, as the cost of SMS has decreased, SMS spam has increased, and some people use SMS as another channel for advertising and fraud. It has since become a major problem because it affects and harms users, and one solution is automatic SMS spam filtering. One of the major problems in spam filtering is accuracy. In this work, the authors aim to improve SMS spam filtering by combining association (frequent-pattern) mining with classification. FP-Growth is used to mine frequent patterns in the SMS samples, and a Naive Bayes classifier is used to classify each SMS as spam or ham. Training uses SMS spam data collected in earlier studies. For the SMS Spam Collection v.1 dataset, the combination of Naive Bayes and FP-Growth is reported to reach an average maximum accuracy of 98.506%, which is 0.025% higher than without FP-Growth, making the classification more effective.

[4] Short Message Service (SMS) spam refers to unsolicited or unwanted messages received on mobile phones. These spam messages are a real problem for phone users. The practice is also worrisome for service providers because it can upset their customers and even cause them to lose customers. To reduce it, researchers have suggested many solutions to control and filter spam messages. This article reviews the existing methods, their pros and cons, and directions for further research on detecting, filtering, and mitigating mobile SMS spam. The body of research literature was examined and evaluated. The most common SMS spam detection, filtering, and mitigation techniques are compared, including the datasets used and their outcomes, and limitations and future research directions are discussed. The main goal of this review is to help researchers identify areas that require further development.

[5] SMS spam analysis is an important task in identifying and filtering spam messages. As the number of SMS messages sent each day increases, it becomes more difficult for users to keep track of the new messages they receive in their inbox and relate them to messages received before. In this study, the problems of SMS spam detection and the identification of related messages are discussed. The approach has two phases. In the first phase, a binary classification algorithm divides the texts into two groups, spam and non-spam. In the second phase, non-negative matrix factorization and the K-means clustering algorithm are used. Grouping SMS messages by similarity, that is, into threads of consecutive communication, is explained, and its effect is analyzed. Performance measures such as accuracy, precision, recall, and F-measure are also studied and evaluated. The message groups identified by this approach can be used by other applications, for example for presenting SMS message content, organizing the SMS inbox, and other related SMS management tasks.

III. METHODOLOGY

After cleansing the dataset, we divide it into training and testing sets. The Naive Bayes classifier is trained on the training set, and the effectiveness of the training is evaluated on the test set.

A. Description of the Dataset

We used the SMS Spam Collection v.1 dataset [9], downloaded from [10]. The dataset contains 5572 text messages, each labeled as spam or ham. It has two columns, v1 and v2. The first column, v1, takes one of two values, ham or spam, indicating whether the text in the second column, v2, is a legitimate message or spam.
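The dataset layout and the train/test split described in the methodology can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' scikit-learn code: the sample rows are invented stand-ins for the real CSV file, and the helper names `load_messages` and `train_test_split`, the 25% test fraction, and the random seed are all assumptions made for the example. The field names `class` and `text` follow the column renaming the paper describes.

```python
import csv
import io
import random
from collections import Counter

# Tiny in-memory stand-in for the two-column CSV (v1 = label, v2 = message text).
# These rows are invented for illustration; they are not taken from the dataset.
SAMPLE_CSV = """v1,v2
ham,Are we still meeting for lunch today?
spam,WIN a FREE gift now! Call 0800 000 000
ham,I'll call you later tonight
spam,Free money!!! Reply WIN to claim your prize
ham,Can you send me the notes from class?
spam,URGENT! You have won a cash gift. Visit our site
"""


def load_messages(handle):
    """Read the CSV rows and rename columns v1/v2 to class/text."""
    reader = csv.DictReader(handle)
    return [{"class": row["v1"], "text": row["v2"]} for row in reader]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


messages = load_messages(io.StringIO(SAMPLE_CSV))
label_counts = Counter(m["class"] for m in messages)
train_rows, test_rows = train_test_split(messages)

print("label counts:", dict(label_counts))
print("train size:", len(train_rows), "test size:", len(test_rows))
```

To run the same steps on the real file, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an opened file handle; the file's name and encoding are not stated in the paper and would need to be checked.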
The files are provided in CSV (comma-separated values) format. The messages in this collection come from the Grumbletext website, the NUS SMS Corpus (NSC), Caroline Tagg's PhD thesis, and the SMS Spam Corpus v.0.1 Big. The dataset contains 747 messages labeled as spam and 4825 labeled as ham.

B. The original data columns v1 and v2 are renamed as class and text, respectively.

One of the most straightforward and effective classification algorithms is Naive Bayes. Given a hypothesis A and evidence B, Bayes' theorem relates the prior probability of the hypothesis, P(A), to its posterior probability after observing the evidence, P(A|B):

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
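To make the use of Bayes' theorem for spam filtering concrete, the sketch below implements a small multinomial Naive Bayes classifier from scratch. It is an illustration under stated assumptions rather than the authors' implementation (the paper reports using scikit-learn): the tokenizer, the add-one (Laplace) smoothing, the class name `NaiveBayes`, and the toy training messages are all choices made for this example. For each class, the score is the log prior plus the sum of the log likelihoods of the words in the message, and the class with the highest score is predicted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)          # messages per class
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        self.total_docs = len(labels)
        return self

    def log_posterior(self, text, label):
        """Unnormalised log P(label | text): log prior + sum of log likelihoods."""
        score = math.log(self.doc_counts[label] / self.total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + len(self.vocab)
        for word in tokenize(text):
            if word in self.vocab:  # words never seen in training are skipped
                score += math.log((self.word_counts[label][word] + 1) / denom)
        return score

    def predict(self, text):
        return max(self.classes, key=lambda c: self.log_posterior(text, c))


# Invented toy training data, for illustration only.
train_texts = [
    "win free money now",
    "free gift claim your prize",
    "are we meeting for lunch",
    "call me when you get home",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("free prize money"))
print(model.predict("lunch at home"))
```

The denominator P(B) from Bayes' theorem is the same for every class, so it can be dropped when only the most probable class is needed; working in log space avoids numerical underflow when many small probabilities are multiplied.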
IV. RESULT
Output:
Accuracy score: 0.9885139985642 (approximately 98.9%)
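The accuracy score is the fraction of test messages whose predicted label matches the true label. The sketch below shows how accuracy, together with precision, recall, and F-measure for the spam class, could be computed from a list of predictions. The function name `evaluate` and the example labels are hypothetical; they do not reproduce the paper's test set or its result.

```python
def evaluate(true_labels, predicted_labels, positive="spam"):
    """Compute accuracy plus precision, recall and F-measure for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Invented labels for illustration only.
y_true = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham", "spam", "spam", "ham"]

metrics = evaluate(y_true, y_pred)
print(metrics)
```

Because 4825 of the 5572 messages in the dataset are ham, a classifier that always predicts ham would already reach about 86.6% accuracy, so precision and recall on the spam class are a useful complement to the accuracy figure.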
V. CONCLUSION
REFERENCES