Machine Learning with Email Spam Detection

This document describes the development of a spam filter using machine learning. It walks through an example pipeline that: 1. Preprocesses the text data by removing punctuation and stopwords and tokenizing it. 2. Splits the data into train and test sets. 3. Defines a scikit-learn pipeline with CountVectorizer, TfidfTransformer, and MultinomialNB. 4. Trains the pipeline on the training data and evaluates it on held-out data, reaching 97% accuracy. The document provides code snippets for each step and closes with exercises to experiment with other classifiers and datasets and to improve the spam filter.


Machine Learning Lab4

August 8, 2019

0.0.1 Aim: To develop a Spam Filter


Objective: Refer to an example of a spam filter and develop your own version.

Example: the SMS Spam Collection dataset and example from Kaggle


Step 1: Import Necessary Libraries
[1]: import nltk
from nltk.corpus import stopwords
import string
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
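Note (not part of the original lab): the NLTK stopword list used in Step 4 is a separate download; if it is not already present, fetch it once before running the rest of the notebook.

nltk.download('stopwords')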
Step 2: Read the spam dataset corpus using the pandas library, then drop and/or rename attributes
[2]: messages = pd.read_csv('/home/brijesh/collegework/labmanual/ML/sms-spam-collection-dataset/spam.csv',
                            encoding='latin-1')

     # drop the three unused trailing columns the CSV carries
     messages.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

     # rename the remaining columns to meaningful names
     messages = messages.rename(columns={'v1': 'class', 'v2': 'text'})
Step 3: View dataset and stats
[3]: messages.head()
[3]: class text
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
[4]: messages.groupby('class').describe()

[4]: text
count unique top freq
class
ham 4825 4516 Sorry, I'll call later 30
spam 747 653 Please call our customer service representativ... 4
[5]: messages['length'] = messages['text'].apply(len)
[6]: messages.hist(column='length',by='class',bins=50, figsize=(15,6))
[6]: array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f41ea8c96a0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7f41ea89eb70>],
dtype=object)

Step 4: Perform basic text preprocessing


[7]: def process_text(text):
         '''
         What will be covered:
         1. Remove punctuation
         2. Remove stopwords
         3. Return list of clean text words
         '''
         # 1. strip punctuation characters
         nopunc = [char for char in text if char not in string.punctuation]
         nopunc = ''.join(nopunc)

         # 2. drop English stopwords
         clean_words = [word for word in nopunc.split()
                        if word.lower() not in stopwords.words('english')]

         # 3. return the list of cleaned tokens
         return clean_words
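As a quick sanity check (this call is not in the original lab), the function can be tried on a single string:

print(process_text('Free entry!! Win a prize now'))
# e.g. ['Free', 'entry', 'Win', 'prize'] -- the exact tokens depend on NLTK's stopword list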

[8]: messages['text'].apply(process_text).head()
[8]: 0 [Go, jurong, point, crazy, Available, bugis, n...
1 [Ok, lar, Joking, wif, u, oni]
2 [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3 [U, dun, say, early, hor, U, c, already, say]
4 [Nah, dont, think, goes, usf, lives, around, t...
Name: text, dtype: object
Step 5: Divide the corpus into training and testing sets
[9]: msg_train, msg_test, class_train, class_test = train_test_split(
         messages['text'], messages['class'], test_size=0.2)
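Note that train_test_split shuffles the data randomly, so the exact split (and the scores reported below) varies from run to run. An optional variant with a fixed, purely illustrative seed makes the split reproducible:

msg_train, msg_test, class_train, class_test = train_test_split(
    messages['text'], messages['class'], test_size=0.2, random_state=42)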

Step 6: Define the transformation pipeline


[10]: '''
      1. Convert the text into numbers and prepare a vector representation for
         every message.
      2. Calculate TF*IDF for every vector.
      3. Train the classifier.

      Since we need to do the above 3 steps on the entire training set, we define
      a pipeline object to sequentially apply all three transformations to the
      dataset.
      '''
      pipeline = Pipeline([
          ('bow', CountVectorizer(analyzer=process_text)),  # converts strings to integer counts
          ('tfidf', TfidfTransformer()),                    # converts integer counts to weighted TF-IDF scores
          ('classifier', MultinomialNB())                   # trains Naive Bayes on the TF-IDF vectors
      ])
Step 7: Train the classifier by calling the 'fit' method on the pipeline object.
[11]: pipeline.fit(msg_train,class_train)
[11]: Pipeline(memory=None,
               steps=[('bow',
                       CountVectorizer(analyzer=<function process_text at 0x7f41ea573d90>,
                                       binary=False, decode_error='strict',
                                       dtype=<class 'numpy.int64'>, encoding='utf-8',
                                       input='content', lowercase=True, max_df=1.0,
                                       max_features=None, min_df=1,
                                       ngram_range=(1, 1), preprocessor=None,
                                       stop_words=None, strip_accents=None,
                                       token_pattern='(?u)\\b\\w\\w+\\b',
                                       tokenizer=None, vocabulary=None)),
                      ('tfidf',
                       TfidfTransformer(norm='l2', smooth_idf=True,
                                        sublinear_tf=False, use_idf=True)),
                      ('classifier',
                       MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
               verbose=False)
Step 8: Check the performance of the system by predicting output on the test data
[12]: predictions = pipeline.predict(msg_test)
[13]: print(classification_report(class_test,predictions))

                    precision    recall  f1-score   support

               ham       0.97      1.00      0.98       974
              spam       1.00      0.76      0.86       141

          accuracy                           0.97      1115
         macro avg       0.98      0.88      0.92      1115
      weighted avg       0.97      0.97      0.97      1115

[14]: import seaborn as sns

      sns.heatmap(confusion_matrix(class_test, predictions), annot=True)
[14]: <matplotlib.axes._subplots.AxesSubplot at 0x7f41e9c5ca58>
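The default heatmap above leaves the axes unlabeled and prints the counts in scientific notation. An optional variant (assuming the alphabetical label order 'ham', 'spam' that confusion_matrix uses by default, with true classes on the rows and predictions on the columns) is easier to read:

cm = confusion_matrix(class_test, predictions, labels=['ham', 'spam'])
sns.heatmap(cm, annot=True, fmt='d',   # fmt='d' shows plain integer counts
            xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
plt.xlabel('predicted class')
plt.ylabel('true class')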

Exercise

1. Check the output of the system for the following messages (a sketch for classifying
   new messages follows this list):

   (A) You have won 1 bn dollar lottery.
   (B) Hi, I miss you.
   (C) Contact customer care service for more details.
   (D) Tomorrow's meeting is scheduled at 1:30 pm
   (E) You can fool all the people some of the time, and you can fool some of the people
       all the time, but you can not fool all the people all the time.
   (F) Not my circus not my monkey.
   (G) They say teaching is like walking in a park; what they don't say is that the park
       is the Jurassic Park.

2. Use a Decision Tree classifier instead of Naive Bayes and rerun the program. Comment
   on the performance of both algorithms (see the pipeline sketch after this list).

3. Try to run the program on a different dataset and observe the performance.

4. Print all the messages for which the Naive Bayes classifier predicts the wrong class
   (i.e. all spam messages which are predicted as ham, and all ham messages which are
   predicted as spam); a short sketch follows the list.

5. Comment on the overall performance of the system and suggest some ways for improvement.

6. Design and develop your own spam filter.
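A minimal sketch for Exercise 1, assuming the pipeline fitted above is still in memory: new messages can be passed to pipeline.predict as a list of strings, exactly like the test set (only a few of the sample messages are shown here).

samples = [
    'You have won 1 bn dollar lottery.',
    'Hi, I miss you.',
    'Contact customer care service for more details.',
    "Tomorrow's meeting is scheduled at 1:30 pm",
]
# the pipeline applies process_text, bag-of-words and TF-IDF before classifying
for message, label in zip(samples, pipeline.predict(samples)):
    print(label, '->', message)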
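For Exercise 2, one possible starting point is to rebuild the same pipeline with scikit-learn's DecisionTreeClassifier in place of MultinomialNB and compare the two classification reports; the tree's hyperparameters are left at their defaults here.

from sklearn.tree import DecisionTreeClassifier

dt_pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=process_text)),   # same text preprocessing as before
    ('tfidf', TfidfTransformer()),
    ('classifier', DecisionTreeClassifier())           # decision tree instead of Naive Bayes
])
dt_pipeline.fit(msg_train, class_train)
print(classification_report(class_test, dt_pipeline.predict(msg_test)))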
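For Exercise 4, a short sketch that reuses the predictions computed in Step 8: a boolean mask over the test set selects the messages whose predicted class differs from the true one.

wrong = predictions != class_test   # True wherever the classifier was mistaken
for text, actual, predicted in zip(msg_test[wrong], class_test[wrong], predictions[wrong]):
    print(f'actual={actual}, predicted={predicted}: {text}')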
