Machine Learning with Email Spam Detection

This document describes the development of a spam filter using machine learning. It walks through an example pipeline that: 1. Preprocesses the text data by removing punctuation and stopwords and tokenizing it. 2. Splits the data into train and test sets. 3. Defines a scikit-learn pipeline with CountVectorizer, TfidfTransformer, and MultinomialNB. 4. Trains the pipeline on the training data and evaluates it on held-out data, reaching 97% accuracy. The document provides code snippets for each step and closes with exercises to experiment with other classifiers and datasets and to improve the spam filter.


Machine Learning Lab4

August 8, 2019

0.0.1 Aim: To develop a Spam Filter


Objective: Refer to an example of a spam filter and develop your own version.

Example: the SMS Spam Collection dataset and example from Kaggle


Step 1: Import Necessary Libraries
[1]: import nltk
from nltk.corpus import stopwords
import string
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
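Note (not part of the original lab): the NLTK stopword list used in Step 4 is a separate download; if it is not already present, fetch it once before running the rest of the notebook.

nltk.download('stopwords')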
Step 2: Read the spam dataset corpus using the pandas library, then drop and/or rename attributes
[2]: messages = pd.read_csv('/home/brijesh/collegework/labmanual/ML/sms-spam-collection-dataset/spam.csv',
                            encoding='latin-1')

     # drop the three unused trailing columns the CSV carries
     messages.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

     # rename the remaining columns to meaningful names
     messages = messages.rename(columns={'v1': 'class', 'v2': 'text'})
Step 3: View dataset and stats
[3]: messages.head()
[3]: class text
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
[4]: messages.groupby('class').describe()

[4]: text
count unique top freq
class
ham 4825 4516 Sorry, I'll call later 30
spam 747 653 Please call our customer service representativ... 4
[5]: messages['length'] = messages['text'].apply(len)
[6]: messages.hist(column='length',by='class',bins=50, figsize=(15,6))
[6]: array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f41ea8c96a0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7f41ea89eb70>],
dtype=object)

Step 4: Perform basic text preprocessing


[7]: def process_text(text):
         '''
         What will be covered:
         1. Remove punctuation
         2. Remove stopwords
         3. Return list of clean text words
         '''
         # 1. strip punctuation characters
         nopunc = [char for char in text if char not in string.punctuation]
         nopunc = ''.join(nopunc)

         # 2. drop English stopwords
         clean_words = [word for word in nopunc.split()
                        if word.lower() not in stopwords.words('english')]

         # 3. return the list of cleaned tokens
         return clean_words
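As a quick sanity check (this call is not in the original lab), the function can be tried on a single string:

print(process_text('Free entry!! Win a prize now'))
# e.g. ['Free', 'entry', 'Win', 'prize'] -- the exact tokens depend on NLTK's stopword list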

[8]: messages['text'].apply(process_text).head()
[8]: 0 [Go, jurong, point, crazy, Available, bugis, n...
1 [Ok, lar, Joking, wif, u, oni]
2 [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3 [U, dun, say, early, hor, U, c, already, say]
4 [Nah, dont, think, goes, usf, lives, around, t...
Name: text, dtype: object
Step 5: Divide the corpus into training and testing sets
[9]: msg_train, msg_test, class_train, class_test = train_test_split(
         messages['text'], messages['class'], test_size=0.2)
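Note that train_test_split shuffles the data randomly, so the exact split (and the scores reported below) varies from run to run. An optional variant with a fixed, purely illustrative seed makes the split reproducible:

msg_train, msg_test, class_train, class_test = train_test_split(
    messages['text'], messages['class'], test_size=0.2, random_state=42)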

Step 6: Define the transformation pipeline


[10]: '''
      1. Convert the text into numbers and prepare a vector representation for
         every message.
      2. Calculate TF*IDF for every vector.
      3. Train the classifier.

      Since we need to do the above 3 steps on the entire training set, we define
      a pipeline object to sequentially apply all three transformations to the
      dataset.
      '''
      pipeline = Pipeline([
          ('bow', CountVectorizer(analyzer=process_text)),  # converts strings to integer counts
          ('tfidf', TfidfTransformer()),                    # converts integer counts to weighted TF-IDF scores
          ('classifier', MultinomialNB())                   # trains Naive Bayes on the TF-IDF vectors
      ])
Step 7: Train the classifier by calling the 'fit' method on the pipeline object.
[11]: pipeline.fit(msg_train,class_train)
[11]: Pipeline(memory=None,
               steps=[('bow',
                       CountVectorizer(analyzer=<function process_text at 0x7f41ea573d90>,
                                       binary=False, decode_error='strict',
                                       dtype=<class 'numpy.int64'>, encoding='utf-8',
                                       input='content', lowercase=True, max_df=1.0,
                                       max_features=None, min_df=1,
                                       ngram_range=(1, 1), preprocessor=None,
                                       stop_words=None, strip_accents=None,
                                       token_pattern='(?u)\\b\\w\\w+\\b',
                                       tokenizer=None, vocabulary=None)),
                      ('tfidf',
                       TfidfTransformer(norm='l2', smooth_idf=True,
                                        sublinear_tf=False, use_idf=True)),
                      ('classifier',
                       MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
               verbose=False)
Step 8: Check the performance of the system by predicting output on the test data
[12]: predictions = pipeline.predict(msg_test)
[13]: print(classification_report(class_test,predictions))

                    precision    recall  f1-score   support

               ham       0.97      1.00      0.98       974
              spam       1.00      0.76      0.86       141

          accuracy                           0.97      1115
         macro avg       0.98      0.88      0.92      1115
      weighted avg       0.97      0.97      0.97      1115

[14]: import seaborn as sns

      sns.heatmap(confusion_matrix(class_test, predictions), annot=True)
[14]: <matplotlib.axes._subplots.AxesSubplot at 0x7f41e9c5ca58>
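The default heatmap above leaves the axes unlabeled and prints the counts in scientific notation. An optional variant (assuming the alphabetical label order 'ham', 'spam' that confusion_matrix uses by default, with true classes on the rows and predictions on the columns) is easier to read:

cm = confusion_matrix(class_test, predictions, labels=['ham', 'spam'])
sns.heatmap(cm, annot=True, fmt='d',   # fmt='d' shows plain integer counts
            xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
plt.xlabel('predicted class')
plt.ylabel('true class')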

Exercise

1. Check the output of the system for the following messages (a sketch for classifying
   new messages follows this list):

   (A) You have won 1 bn dollar lottery.
   (B) Hi, I miss you.
   (C) Contact customer care service for more details.
   (D) Tomorrow's meeting is scheduled at 1:30 pm
   (E) You can fool all the people some of the time, and you can fool some of the people
       all the time, but you can not fool all the people all the time.
   (F) Not my circus not my monkey.
   (G) They say teaching is like walking in a park; what they don't say is that the park
       is the Jurassic Park.

2. Use a Decision Tree classifier instead of Naive Bayes and rerun the program. Comment
   on the performance of both algorithms (see the pipeline sketch after this list).

3. Try to run the program on a different dataset and observe the performance.

4. Print all the messages for which the Naive Bayes classifier predicts the wrong class
   (i.e. all spam messages which are predicted as ham, and all ham messages which are
   predicted as spam); a short sketch follows the list.

5. Comment on the overall performance of the system and suggest some ways for improvement.

6. Design and develop your own spam filter.
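A minimal sketch for Exercise 1, assuming the pipeline fitted above is still in memory: new messages can be passed to pipeline.predict as a list of strings, exactly like the test set (only a few of the sample messages are shown here).

samples = [
    'You have won 1 bn dollar lottery.',
    'Hi, I miss you.',
    'Contact customer care service for more details.',
    "Tomorrow's meeting is scheduled at 1:30 pm",
]
# the pipeline applies process_text, bag-of-words and TF-IDF before classifying
for message, label in zip(samples, pipeline.predict(samples)):
    print(label, '->', message)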
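For Exercise 2, one possible starting point is to rebuild the same pipeline with scikit-learn's DecisionTreeClassifier in place of MultinomialNB and compare the two classification reports; the tree's hyperparameters are left at their defaults here.

from sklearn.tree import DecisionTreeClassifier

dt_pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=process_text)),   # same text preprocessing as before
    ('tfidf', TfidfTransformer()),
    ('classifier', DecisionTreeClassifier())           # decision tree instead of Naive Bayes
])
dt_pipeline.fit(msg_train, class_train)
print(classification_report(class_test, dt_pipeline.predict(msg_test)))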
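For Exercise 4, a short sketch that reuses the predictions computed in Step 8: a boolean mask over the test set selects the messages whose predicted class differs from the true one.

wrong = predictions != class_test   # True wherever the classifier was mistaken
for text, actual, predicted in zip(msg_test[wrong], class_test[wrong], predictions[wrong]):
    print(f'actual={actual}, predicted={predicted}: {text}')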
