Machine Learning Learning With Email Spam Detection
August 8, 2019
[4]:       text
           count  unique  top                                                freq
     class
     ham   4825   4516    Sorry, I'll call later                             30
     spam  747    653     Please call our customer service representativ...  4
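The counts in this table already give a baseline for judging the classifier later on. The short check below uses plain Python with the numbers copied from the table; the variable names are illustrative only.

```python
# Counts copied from the describe() table above.
counts = {"ham": 4825, "spam": 747}
unique = {"ham": 4516, "spam": 653}

total = sum(counts.values())
spam_share = counts["spam"] / total

# A classifier that always answers "ham" is right on every ham message,
# so its accuracy equals the ham share of the corpus.
majority_baseline = counts["ham"] / total

# Messages whose text repeats an earlier message of the same class.
duplicates = {label: counts[label] - unique[label] for label in counts}

print(f"total messages:      {total}")
print(f"spam share:          {spam_share:.1%}")
print(f"always-ham accuracy: {majority_baseline:.1%}")
print(f"duplicate texts:     {duplicates}")
```

Because roughly 87% of the messages are ham, accuracy alone is a weak measure here. The per-class precision and recall for spam say more about the classifier.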
[5]: messages['length'] = messages['text'].apply(len)
[6]: messages.hist(column='length',by='class',bins=50, figsize=(15,6))
[6]: array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f41ea8c96a0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7f41ea89eb70>],
dtype=object)
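[7]: import string
     from nltk.corpus import stopwords

     def process_text(text):
         # 1. Remove punctuation characters
         nopunc = [char for char in text if char not in string.punctuation]
         nopunc = ''.join(nopunc)
         # 2. Remove English stopwords (case-insensitive comparison)
         clean_words = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
         # 3. Return the list of remaining words
         return clean_words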
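To see what these three steps do to a single message, here is a minimal sketch that runs without downloading the NLTK corpus. It assumes a tiny hand-written stopword set (`DEMO_STOPWORDS`, hypothetical) in place of NLTK's full English list, so its output can differ from the notebook's on real data.

```python
import string

# Hypothetical, tiny stopword set used only for this illustration;
# the notebook relies on NLTK's full English list instead.
DEMO_STOPWORDS = {"i", "will", "you", "the", "a", "to", "is"}

def process_text_demo(text, stop_words=DEMO_STOPWORDS):
    """Strip punctuation, then drop stopwords (case-insensitive)."""
    nopunc = ''.join(ch for ch in text if ch not in string.punctuation)
    return [word for word in nopunc.split() if word.lower() not in stop_words]

first = process_text_demo("Sorry, I'll call later!")
second = process_text_demo("You WIN a prize")
print(first)
print(second)
```

Because punctuation is removed before the stopword check, a contraction such as "I'll" becomes the token "Ill", which no longer matches any stopword. The original casing of the kept words is also preserved.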
[8]: messages['text'].apply(process_text).head()
[8]: 0 [Go, jurong, point, crazy, Available, bugis, n...
1 [Ok, lar, Joking, wif, u, oni]
2 [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3 [U, dun, say, early, hor, U, c, already, say]
4 [Nah, dont, think, goes, usf, lives, around, t...
Name: text, dtype: object
Step 5: Divide the corpus into training and testing sets
[9]: msg_train, msg_test, class_train, class_test = train_test_split(messages['text'], messages['class'], test_size=0.2)
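With about 13% spam, a purely random split can leave the test set with a slightly different spam share on each run. scikit-learn's train_test_split accepts `stratify` and `random_state` arguments for this purpose. The sketch below is a plain-Python illustration of what a stratified split does, not scikit-learn's implementation; `stratified_split` and the toy data are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_size=0.2, seed=0):
    """Split so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idxs, seq):
        return [seq[i] for i in idxs]

    return (pick(train_idx, texts), pick(test_idx, texts),
            pick(train_idx, labels), pick(test_idx, labels))

# Toy corpus with a similar kind of imbalance: 16 ham, 4 spam.
toy_labels = ["ham"] * 16 + ["spam"] * 4
toy_texts = [f"message {i}" for i in range(len(toy_labels))]
toy_train, toy_test, toy_class_train, toy_class_test = stratified_split(
    toy_texts, toy_labels, test_size=0.25)
print(toy_class_test.count("ham"), toy_class_test.count("spam"))
```

Each class contributes a quarter of its messages to the test set, so the 4:1 ham-to-spam ratio is the same in both parts.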
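Step 6: Build a pipeline that chains the bag-of-words vectorizer, the TF-IDF transformer, and the Naive Bayes classifier.
[10]: pipeline = Pipeline([
          ('bow', CountVectorizer(analyzer=process_text)),  # converts strings to integer counts
          ('tfidf', TfidfTransformer()),  # converts integer counts to weighted TF-IDF scores
          ('classifier', MultinomialNB()),  # trains a Naive Bayes classifier on the TF-IDF vectors
      ])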
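The 'bow' and 'tfidf' stages can be followed by hand on a toy corpus. The sketch below is a plain-Python illustration, not scikit-learn's code. It assumes the documents are already tokenised and uses the smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation, which is what TfidfTransformer computes with its default settings.

```python
import math
from collections import Counter

# Already-tokenised toy documents (the kind of lists process_text returns).
docs = [
    ["free", "entry", "win", "free"],
    ["call", "later"],
    ["win", "cash", "call"],
]

# Bag of words: one Counter of token counts per document.
bow = [Counter(doc) for doc in docs]
vocab = sorted(set().union(*docs))

# Document frequency: number of documents that contain each token.
n_docs = len(docs)
df = {t: sum(1 for c in bow if t in c) for t in vocab}

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(count):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    raw = [count[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(c) for c in bow]
for token, weight in zip(vocab, matrix[0]):
    print(f"{token:>6}: {weight:.3f}")
```

In the first document "free" ends up with a larger weight than "win": it occurs twice, and it appears in fewer documents overall.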
Step 7: Train the classifier by calling the 'fit' method on the pipeline object.
[11]: pipeline.fit(msg_train,class_train)
[11]: Pipeline(memory=None,
steps=[('bow',
CountVectorizer(analyzer=<function process_text at 0x7f41ea573d90>,
binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8',
input='content', lowercase=True, max_df=1.0,
max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None,
stop_words=None, strip_accents=None,
token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)),
('tfidf',
TfidfTransformer(norm='l2', smooth_idf=True,
sublinear_tf=False, use_idf=True)),
('classifier',
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
verbose=False)
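The printed parameters show the classifier settings used: `alpha=1.0` (add-one smoothing) and `fit_prior=True` (class priors estimated from the training data). The sketch below illustrates multinomial Naive Bayes with those two settings in plain Python. It is a simplification that works on raw token counts rather than the TF-IDF weights the pipeline passes on, and `train_nb`, `predict_nb` and the toy data are hypothetical names made up for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on tokenised documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counter in word_counts.values() for w in counter}

    # Class priors estimated from label frequencies (fit_prior=True).
    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}

    # Smoothed word likelihoods: (count + alpha) / (class total + alpha * |V|).
    log_like = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                       for w in vocab}
    return log_prior, log_like, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior score."""
    log_prior, log_like, vocab = model
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

# Made-up training data for illustration only.
train_docs = [["free", "win", "cash"], ["win", "prize", "free"],
              ["call", "later"], ["see", "you", "later"], ["call", "me"]]
train_labels = ["spam", "spam", "ham", "ham", "ham"]

model = train_nb(train_docs, train_labels)
pred_a = predict_nb(model, ["free", "prize"])
pred_b = predict_nb(model, ["call", "you", "later"])
print(pred_a, pred_b)
```

Smoothing keeps a word that never occurred in one class from forcing that class's probability to zero. Words not seen during training at all are simply skipped at prediction time.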
Step 8: Check the performance of the system by predicting the output on the test data.
[12]: predictions = pipeline.predict(msg_test)
[13]: print(classification_report(class_test,predictions))
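classification_report prints precision, recall, F1-score and support for each class. As a reminder of how those numbers are derived, here is a minimal plain-Python sketch. The labels are made up for illustration and are not results from this notebook, and `per_class_report` is a hypothetical helper.

```python
def per_class_report(y_true, y_pred, labels):
    """Compute precision, recall, F1 and support for each class."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report

# Made-up labels: 6 ham and 4 spam, with 2 spam messages missed.
y_true = ["ham"] * 6 + ["spam"] * 4
y_pred = ["ham"] * 6 + ["spam", "spam", "ham", "ham"]

report = per_class_report(y_true, y_pred, labels=["ham", "spam"])
for label, scores in report.items():
    print(label, {k: round(v, 2) for k, v in scores.items()})
```

In this toy case spam precision is perfect while spam recall is only one half. A spam filter that misses spam but never flags a legitimate message typically shows this pattern.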
Exercise
2. Use a Decision Tree classifier instead of Naive Bayes and rerun the program. Comment on the performance of both algorithms.
3. Run the program on a different dataset and observe the performance.
4. Print all the messages for which the Naive Bayes classifier predicts the wrong class (i.e. all spam messages that are predicted as ham, and all ham messages that are predicted as spam).
5. Comment on the overall performance of the system and suggest some ways to improve it.