SMS SPAM DETECTION USING MACHINE LEARNING AND DEEP LEARNING TECHNIQUES

ABSTRACT

The number of people using mobile devices is increasing day by day. SMS (Short Message Service) is a text messaging service available on smartphones as well as basic phones, so SMS traffic has grown drastically, and with it the volume of spam messages. Attackers send spam messages for financial or business gain, for example fake market-growth offers, lottery notifications, or attempts to harvest credit card information. Spam classification has therefore received special attention. In this paper, we apply various machine learning and deep learning techniques to SMS spam detection. We train machine learning and deep learning models such as LSTM and Naive Bayes (NB) on the SMS Spam Collection dataset, which is split into two subsets for training and testing. Our experimental results show that our NB model outperforms previous models in spam detection, achieving good accuracy.

TABLE OF CONTENTS

CHAPTER NO.  TITLE                                              PAGE NO.

CHAPTER 1: INTRODUCTION
  1.1    GENERAL
  1.1.1  THE MACHINE LEARNING SYSTEM
  1.1.2  FUNDAMENTAL                                            8
  1.2    JUPYTER
  1.3    MACHINE LEARNING                                       10
  1.4    CLASSIFICATION TECHNIQUES                              13
  1.4.1  NEURAL NETWORK AND DEEP LEARNING
  1.4.2  METHODOLOGIES - GIVEN INPUT AND EXPECTED OUTPUT        15
  1.5    OBJECTIVE AND SCOPE OF THE PROJECT                     16
  1.6    EXISTING SYSTEM
  1.6.1  DISADVANTAGES OF EXISTING SYSTEM                       18
  1.6.2  LITERATURE SURVEY
  1.7    PROPOSED SYSTEM                                        21
  1.7.1  PROPOSED SYSTEM ADVANTAGES                             22

CHAPTER 2: PROJECT DESCRIPTION
  2.1    INTRODUCTION                                           25
  2.2    DETAILED DIAGRAM                                       25
  2.2.1  FRONT END DESIGN                                       26
  2.2.2  BACK END FLOW                                          26
  2.3    SOFTWARE SPECIFICATION
  2.3.1  HARDWARE SPECIFICATION
  2.3.2  SOFTWARE SPECIFICATION
  2.4    MODULE DESCRIPTION                                     27
  2.4.1  DATA COLLECTION
  2.4.2  DATA AUGMENTATION
  2.4.3  DATA SPLITTING
  2.4.4  CLASSIFICATION                                         32
  2.4.5  PERFORMANCE METRICS
  2.4.6  CONFUSION MATRIX                                       32
  2.5    MODULE DIAGRAM
  2.5.1  SYSTEM ARCHITECTURE                                    33
  2.5.2  USE CASE DIAGRAM                                       34
  2.5.3  CLASS DIAGRAM                                          35
  2.5.4  ACTIVITY DIAGRAM
  2.5.5  SEQUENCE DIAGRAM                                       35
  2.5.6  STATE FLOW DIAGRAM                                     36
  2.5.7  FLOW DIAGRAM                                           37

CHAPTER 3: SOFTWARE SPECIFICATION
  3.1    GENERAL                                                38
  3.2    ANACONDA                                               38
  3.3    PYTHON                                                 41
  3.3.1  SCIENTIFIC AND NUMERIC COMPUTING                       42
  3.3.2  CREATING SOFTWARE PROTOTYPES                           42
  3.3.3  GOOD LANGUAGE TO TEACH PROGRAMMING                     43

CHAPTER 4: IMPLEMENTATION
  4.1    GENERAL                                                44
  4.2    IMPLEMENTATION CODING                                  44
  4.3    SNAPSHOTS                                              51

CHAPTER 5: CONCLUSION & REFERENCES
  5.1    CONCLUSION                                             52
  5.2    REFERENCES                                             53

CHAPTER I

INTRODUCTION

1.2 Jupyter

Jupyter, previously known as IPython Notebook, is a web-based, interactive development environment. Originally developed for Python, it has since expanded to support over 40 other programming languages, including Julia and R.
Jupyter allows notebooks to be written that contain text, live code, images, and equations. These notebooks can be shared, and can even be hosted on GitHub for free. For each section of this tutorial, you can download a Jupyter notebook that allows you to edit and experiment with the code and examples for each topic. Jupyter is part of the Anaconda distribution; it can be started from the command line using the jupyter command:
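For example (assuming Anaconda's binaries are on your PATH):

    # Start the Jupyter notebook server; a browser tab opens
    # at http://localhost:8888 by default.
    jupyter notebook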

1.3 Machine Learning

We will now move on to the task of machine learning itself. In the following sections
we will describe how to use some basic algorithms, and perform regression,
classification, and clustering on some freely available medical datasets concerning
breast cancer and diabetes, and we will also take a look at a DNA microarray dataset.

SciKit-Learn
SciKit-Learn provides a standardised interface to many of the most commonly used machine learning algorithms, and is the most popular and most frequently used machine learning library for Python. As well as providing many learning algorithms, SciKit-Learn has a large number of convenience functions for common preprocessing tasks (for example, normalisation or k-fold cross validation). It is a very large software library.
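As a minimal sketch of this standardised interface (the particular estimator and dataset below are illustrative assumptions, not code from this project), every preprocessor exposes fit/transform and every estimator exposes fit/predict, so swapping algorithms usually means changing a single line:

    # Illustrative sketch of SciKit-Learn's uniform estimator interface.
    from sklearn.datasets import load_breast_cancer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)

    # Preprocessing convenience: normalise each feature.
    X_scaled = StandardScaler().fit_transform(X)

    # Any SciKit-Learn classifier could be swapped in on this line.
    clf = LogisticRegression(max_iter=5000)
    clf.fit(X_scaled, y)
    print(clf.predict(X_scaled[:5]))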

Clustering
Clustering algorithms focus on ordering data together into groups. In general, clustering algorithms are unsupervised—they require no y response variable as input. That is to say, they attempt to find groups or clusters within data where you do not know the label for each sample. SciKit-Learn has many clustering algorithms, but in this section we will demonstrate hierarchical clustering on a DNA expression microarray dataset using an algorithm from the SciPy library.
We will plot a visualisation of the clustering using what is known as a dendrogram, also using the SciPy library.
The goal is to cluster the data properly into logical groups, in this case into the cancer types represented by each sample's expression data. We do this using agglomerative hierarchical clustering, using Ward's linkage method:
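A minimal sketch of this approach is shown below; since the microarray data itself is not reproduced here, a randomly generated matrix X (rows are samples, columns are expression values) stands in for it:

    # Agglomerative hierarchical clustering with Ward's linkage,
    # visualised as a dendrogram using SciPy.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Stand-in for the DNA expression microarray data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 100))

    Z = linkage(X, method="ward")  # Ward's linkage method

    dendrogram(Z)
    plt.xlabel("Sample index")
    plt.ylabel("Distance")
    plt.show()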

1.4 Classification

In the previous section we analysed data that was unlabelled—we did not know to what class a sample belonged (this is known as unsupervised learning). In contrast, a supervised problem deals with labelled data, where we are aware of the discrete classes to which each sample belongs. When we wish to predict which class a sample belongs to, we call this a classification problem. SciKit-Learn has a number of algorithms for classification; in this section we will look at the Support Vector Machine.
We will work on the Wisconsin breast cancer dataset, split it into a training set and a test set, train a Support Vector Machine with a linear kernel, and test the trained model on the unseen test data. The Support Vector Machine model should be able to predict whether a new, unseen sample is malignant or benign based on its features:
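A sketch of this workflow, using SciKit-Learn's bundled copy of the Wisconsin dataset (the split proportion and random seed here are our own assumptions):

    # Train an SVM with a linear kernel on the Wisconsin breast cancer
    # dataset and evaluate it on a held-out test set.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import classification_report

    X, y = load_breast_cancer(return_X_y=True)

    # Hold out 30% of the samples as an unseen test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    model = SVC(kernel="linear")
    model.fit(X_train, y_train)

    # Precision, recall, F1 score and support for each class.
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred,
                                target_names=["malignant", "benign"]))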

You will notice that the SVM model performed very well at predicting the malignancy of new, unseen samples from the test set—this can be quantified nicely by printing a number of metrics using the classification report function. Here, the precision, recall, and F1 score (F1 = 2 · (precision · recall) / (precision + recall)) for each class are shown. The support column is a count of the number of samples in each class.
Support Vector Machines are a very powerful tool for classification. They work well in high-dimensional spaces, even when the number of features is higher than the number of samples. However, their running time is quadratic in the number of samples, so large datasets can become difficult to train: if you increase a dataset in size by 10 times, training will take roughly 100 times longer.
Last, you will notice that the breast cancer dataset consists of 30 features. This makes it difficult to visualise or plot the data. To aid in the visualisation of high-dimensional data, we can apply a technique called dimensionality reduction.

Dimensionality Reduction
Another important method in machine learning, and data science in general, is dimensionality reduction. For this example, we will look at the Wisconsin breast cancer dataset once again. The dataset consists of over 500 samples, where each sample has 30 features. The features relate to images of a fine needle aspirate of breast tissue, and they describe the characteristics of the cells present in the images. All features are real values. The target variable is a discrete value (either malignant or benign), so this is a classification dataset.
You will recall from the Iris example in Sect. 7.3 that we plotted a scatter matrix of the data, where each feature was plotted against every other feature in the dataset to look for potential correlations (Fig. 3). By examining this plot you could probably find features which would separate the dataset into groups. Because that dataset had only 4 features, we were able to plot each feature against every other relatively easily. However, as the number of features grows, this becomes less and less feasible, especially if you consider the gene expression example in Sect. 9.4, which had over 6000 features.
One method that is used to handle high-dimensional data is Principal Component Analysis, or PCA. PCA is an unsupervised algorithm for reducing the number of dimensions of a dataset. For example, for plotting purposes you might want to reduce your data down to 2 or 3 dimensions, and PCA allows you to do this by generating components, which are combinations of the original features, that you can then use to plot your data.
You supply PCA with your data, X, and you specify the number of components you wish to reduce its dimensionality to. This is known as transforming the data:
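A minimal sketch (the choice of two components is just for plotting in a plane):

    # Reduce the 30-dimensional breast cancer data to 2 principal
    # components so the samples can be plotted in two dimensions.
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA

    X, y = load_breast_cancer(return_X_y=True)

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)  # "transforming" the data

    plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap="coolwarm", s=10)
    plt.xlabel("First principal component")
    plt.ylabel("Second principal component")
    plt.show()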

Again, you would not use this model for new data—in a real-world scenario, you would, for example, perform a 10-fold cross validation on the dataset, choosing the model parameters that perform best on the cross validation. This model would be much more likely to perform well on new data. At the very least, you would randomly select a subset, say 30% of the data, as a test set and train the model on the remaining 70% of the dataset. You would then evaluate the model based on its score on the test set, not on the training set.
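As an illustration of the cross-validation step described above (the estimator and fold count are assumptions made for this sketch):

    # 10-fold cross validation: the score is averaged over ten
    # different train/test partitions rather than one fixed split.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
    print("Mean accuracy over 10 folds:", scores.mean())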
1.4.1 NEURAL NETWORKS AND DEEP LEARNING

While a proper description of neural networks and deep learning is far beyond the scope of this chapter, we will discuss an example use case of one of the most popular frameworks for deep learning: Keras.
In this section we will use Keras to build a simple neural network to classify the Wisconsin breast cancer dataset that was described earlier. Often, deep learning algorithms and neural networks are used to classify images—convolutional neural networks are especially suited to image-related classification. However, they can of course be used for text or tabular data as well. Here we will build a standard feed-forward, densely connected neural network and classify a tabular cancer dataset in order to demonstrate the framework's usage.
In this example we are once again using the Wisconsin breast cancer dataset, which consists of 30 features and 569 individual samples. To make it more challenging for the neural network, we will use a training set consisting of only 50% of the entire dataset, and test our neural network on the remaining 50% of the data.
Note that Keras is not installed as part of the Anaconda distribution; to install it, use pip:
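    # Install Keras into the current Python environment.
    pip install keras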

Keras additionally requires either Theano or TensorFlow to be installed. In the examples in this chapter we are using Theano as a backend; however, the code will work identically with either backend. You can install Theano using pip, but it has a number of dependencies that must be installed first. Refer to the Theano and TensorFlow documentation for more information [12].
Keras is a modular API. It allows you to create neural networks by building a stack of modules, from the input of the neural network to its output, piece by piece, until you have a complete network. Keras can also be configured to use your Graphics Processing Unit, or GPU, which makes training neural networks far faster than using a CPU. We begin by importing Keras:
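A minimal sketch of such a network follows, assuming the standalone Keras API; the layer sizes, optimiser, and epoch count are illustrative assumptions rather than this project's exact settings:

    # Build a small feed-forward, densely connected network for the
    # 30-feature Wisconsin breast cancer dataset.
    from keras.models import Sequential
    from keras.layers import Dense
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X = StandardScaler().fit_transform(X)  # normalise the features

    # Train on only 50% of the data, test on the remaining 50%.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=42)

    model = Sequential()
    model.add(Dense(16, activation="relu", input_dim=30))
    model.add(Dense(8, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))  # binary output

    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

    history = model.fit(X_train, y_train, epochs=50, batch_size=16,
                        validation_data=(X_test, y_test))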

We may want to view the network's accuracy on the test set (or its loss on the training set) over time (measured at each epoch), to get a better idea of how well it is learning. An epoch is one complete pass through the training data.
Fortunately, this is quite easy to plot, as Keras' fit function returns a history object which we can use to do exactly this:
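A sketch of such a plot, reusing the history object from the fit call above (note that older Keras versions record the keys as "acc"/"val_acc" rather than "accuracy"/"val_accuracy"):

    # Plot accuracy on the training and test (validation) sets
    # at each epoch, from the history object returned by fit.
    import matplotlib.pyplot as plt

    plt.plot(history.history["accuracy"], label="training accuracy")
    plt.plot(history.history["val_accuracy"], label="test accuracy")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()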

This will result in a plot similar to that shown. Often you will also want to plot the loss on the test set and training set, and the accuracy on the test set and training set.
Plotting the loss and accuracy can be used to see whether you are overfitting (you experience a tiny loss on the training set, but a large loss on the test set) and to see when your training has plateaued.

PROBLEM STATEMENT:

A number of major differences exist between spam filtering in text messages and in emails. Unlike email, for which a variety of large datasets are available, real databases of SMS spam are very limited. Additionally, due to the short length of text messages, the number of features that can be used for their classification is far smaller than the corresponding number for emails; text messages also carry no header information. Furthermore, text messages are full of abbreviations and use much less formal language than one would expect in emails. All of these factors may result in a serious degradation in the performance of major email spam filtering algorithms when applied to short text messages.

1.5 OBJECTIVE STATEMENT:

The prediction of SMS spam has been an important area of research for a long time. The goal of this project is to apply different machine learning algorithms to SMS spam classification, as sketched below.
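As a concrete illustration, a minimal sketch of the Naive Bayes approach mentioned in the abstract might look as follows; the file path and column names are hypothetical assumptions, not this project's actual implementation code (which appears in Chapter 4):

    # Hypothetical sketch: a Naive Bayes SMS spam classifier.
    # Assumes a CSV with 'label' (ham/spam) and 'text' columns,
    # such as the SMS Spam Collection dataset.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("sms_spam_collection.csv")  # hypothetical path

    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, random_state=42)

    # Bag-of-words features: token counts per message.
    vectorizer = CountVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    clf = MultinomialNB()
    clf.fit(X_train_vec, y_train)
    print("Accuracy:", accuracy_score(y_test, clf.predict(X_test_vec)))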
