Final ML Report
NIKITA LAWANDE(16CE1045)
PRAKARSHA DAHAT(16CE1057)
RIYA THAKUR (16CE2025)
Under the guidance of
Mrs PUJA PADIA
CERTIFICATE
This is to certify that the Mini Project entitled “FRAUD DETECTION SYSTEM” is a bonafide
work done by PRAKARSHA DAHAT, NIKITA LAWANDE, AND RIYA THAKUR under the
supervision of Mrs. PUJA PADIA. This Mini Project has been approved for Third Year Computer
Engineering.
Internal Examiner:
1...............................
2...............................
External Examiners:
1...............................
2...............................
Date :. . . /. . . /. . . . . .
Place :. . . . . . . . . . . .
DECLARATION
I declare that this written submission represents my ideas and does not involve plagiarism. I
have adequately cited and referenced the original sources wherever others' ideas or words have
been included. I also declare that I have adhered to all principles of academic honesty and
integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in my
submission. I understand that any violation of the above will be cause for disciplinary action
against me by the Institute and can also evoke penal action from the sources which have thus
not been properly cited or from whom proper permission has not been taken when needed.
Date:
NIKITA LAWANDE(16CE1045)
RIYA THAKUR (16CE2025)
PRAKARSHA DAHAT (16CE1057)
Abstract
List of Tables
List of Figures
1 Introduction
1.1 Overview
1.2 Motivation
1.3 Problem Definition
1.4 Objectives
1.5 Organization of Report
2 Literature Survey
2.1 Existing Systems
2.2 Outcome of Literature Survey
3 Proposed System
3.1 Proposed Work
3.1.1 SVM
3.1.2 Naive Bayes
3.2 Proposed Methodology/Techniques
3.2.1 Sample and Data
3.2.2 Variables Measurement
3.3 Design of the System
3.4 Hardware/Software Requirement
3.4.1 Hardware System Configuration
3.4.2 Software System Configuration
Chapter 1
Introduction
1.1 Overview
Machine learning is an application of artificial intelligence that gives systems the ability to
learn and improve from experience automatically, without being explicitly programmed. This
project requires a supervised classification algorithm for predicting fraudulent emails. Many
algorithms are available for this task; the Naive Bayes algorithm was found to be the most
accurate, outperforming the other algorithms considered. Naive Bayes classifiers are a
collection of classification algorithms based on Bayes' theorem. It is not a single algorithm
but a family of algorithms that share a common principle: every pair of features being
classified is assumed to be independent of the others.
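As a short illustration of this independence assumption (a standard textbook formulation, given here for clarity rather than taken from a specific source in this report), the class chosen for an email with features x_1, ..., x_n is:

\[
P(C \mid x_1,\dots,x_n) \propto P(C)\prod_{i=1}^{n} P(x_i \mid C),
\qquad
\hat{C} = \arg\max_{C}\; P(C)\prod_{i=1}^{n} P(x_i \mid C).
\]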
1.2 Motivation
The harmful effects of spam emails could be extent to access the user's
confidential details, which could result in financial losses for users and even
prevent them from access their own accounts. Therefore, we will quantify and
qualify the spam email features to prevent and mitigate the risk of fraudulent
emails. Hence the need of developing a fraud email detector is necessary to avoid
loss of confidential data.
1.4 Objectives
The objectives of this project are as follows:
Determine and evaluate the best set of features to be used for fraud email detection, using
manual feature selection based on the email structure as well as automated selection
techniques. Determine the best classification algorithm for email fraud detection. The idea is
to learn the differences between legitimate and malicious emails and then predict the nature
of an unseen email.
1.5 Organization of Report
The remainder of this report discusses a fraud detection platform trained on the open datasets
provided by Enron, LingSpam, and the Nigerian fraud email collection. Section 2 describes
related work in the field of fraud detection and the work that has been done to classify
fraudulent emails. Section 3 describes the implementation of the system and the pipeline for
data processing, and provides an in-depth look at the machine learning framework used in the
final version of the fraud detection program. Section 4 provides an overall assessment of the
results of this project and suggests future work that can be done to improve it. The report
concludes with Section 5, which offers concluding remarks on the project as a whole.
Chapter 2
Literature Survey
Some work has been done on detecting spam and fraudulent emails on the internet using various
algorithms. The most famous fraud email detection filter, “Spambayes”, used by Microsoft
Outlook as a plug-in, applies Bayes' theorem and uses a keyword-based approach for fraud email
detection. The authors note that the internet is an effective tool for the solicitation of
fraudulent emails, and that there are systematic patterns in the way the contents of these
messages are disseminated. Thus, we can ascertain that there may be patterns in such emails
that are detectable by some form of classifier.
ALGORITHM COMPARISON
In this section, our results are compared with some of the related work. Of the references
mentioned throughout this paper, we only compare our results with the commensurate ones, i.e.,
those that used the same dataset. Moreover, the compared results are evaluated with a Student's
t-test with the significance level set to 5% (i.e., α = 0.05) to report whether the differences
are significant. We have mixed results for the CSDMC2010 dataset. Qaroush et al. [19], for
instance, investigated the performance of several learning algorithms on this dataset and
concluded that RF outperforms the rest. The spam recall reported in their paper is 0.958, which
is significantly better than what we found (0.912) (Table VIa). Whereas their precision is
similar to that of our approach, because of their high recall, their 0.958 F-score also
outperforms our F-score of 0.922 (Table VIa). Surprisingly, we outperform them if we do a
cost-sensitive analysis of our data: the AUC that we found for the dataset is 0.988 (Table VIa),
which is better than what they found (0.981). An SVM-based spam filter developed by Yang et
al. [28], on the other hand, reported 0.943 precision, 0.965 recall, and a promising AUC of
0.995. Among the three measures, we only obtained a better precision. Their second anti-spam
filter uses an NB classifier. This filter, interestingly, achieved 100% recall; its precision
of 0.935 and AUC of 0.976, however, were outperformed by our approach (Table VIa). Note that
the differences in the results are statistically significant. By using 328 features, the filter
developed by Ma et al. generates a Neural Network classifier. On the SpamAssassin dataset, they
reported that both their precision and accuracy were 0.920. On the other hand, our approach
achieved 0.948 precision and 0.957 accuracy. Both of these results are statistically
significant. Another Neural Network based filter, developed by Srisanyalak and Sornil [29],
uses immunity-based features from emails. The filter has been reported to be accurate 92.4% of
the time; our reported accuracy is better than this (Table VIb). The phenomenal FPR and FNR
achieved by the filter developed by Bratko et al. (FPR = 0.001 and FNR = 0.012) indicate that
our approach needs further improvement in these measures; our reported FPR and FNR are 0.023
and 0.079, respectively (Table VIb).
From previous studies, we found that the performance of the filters is relatively low on the
LingSpam dataset. Prabhakar and Basavaraju [10], for instance, applied K-NNC and a data
clustering algorithm called BIRCH to this dataset. Their filter achieved 0.698 precision, 0.637
recall, 0.828 specificity, and an accuracy of 0.755. In contrast, the data in Table VIc show
that our approach has a precision of 0.944 with 0.838 recall, 0.990 specificity (1 − FPR), and
0.960 accuracy. Our reported AUC on LingSpam also outperformed that reported by Cormack and
Bratko [30]; our AUC of 0.986 is significantly better than their AUC of 0.960. The recall we
have on this dataset is much better than that reported by Yang et al. [28]; the precisions,
however, are similar. Their NB-based filter achieved 0.943 precision and 0.820 recall.
Surprisingly, the AUC of their filter (0.992) significantly outperformed the AUC of our
approach (Table VIc). As mentioned in Section VI-B, our results with the Enron-Spam dataset are
not satisfactory because of the well-balanced nature of the dataset. The curators of the
dataset, however, reported a spectacular spam recall of 0.975 [9], while our best spam recall
on the dataset is 0.929 with BOOSTED RF. Moreover, their reported ham recall is 0.972; ours is
a mere 0.842 (Table VId). However, we have recently surpassed the results reported by Metsis
et al. [9] using an anti-spam filter named SENTINEL [31] that we have developed using the ideas
presented in this paper.
Various research papers on this topic were studied. The techniques used in them included the
Naive Bayes classifier, a convolutional neural network based multimodal fraud email detection
algorithm, and the backpropagation algorithm. On studying the research papers, we found that
the best-suited algorithm was the Naive Bayes classifier, since it works well with large
amounts of data and can predict the result accurately.
Chapter 3
Proposed System
3.1 Proposed Work
In this project, an efficient multinomial Naive Bayes algorithm is used for the prediction of
illegitimate emails by training it on a set of data before deployment. Fraudulent emails can
result in the loss of confidential information and cause harm to the recipient. The algorithm
helps to identify malicious emails and warns the recipient of the fraud that could take place
by opening the email or by clicking on the links or websites attached to the email sent by the
attacker.
The first approach taken used a Naive Bayes algorithm. Naive Bayes assigns class labels in
classification problems by looking at the relationships between features and classes; the
algorithm leverages Bayes' theorem and conditional probabilities. For each email text T, we
look for the class label that maximizes the posterior probability. Our dataset involves only
two potential labels in a complex corpus. Many of the conventional methods for improving Naive
Bayes performance, such as the inclusion of tf-idf scores, were very effective for this
implementation because each individual document had many tokens. Additionally, the number of
tokens varied between emails, and some emails had an unnecessarily high number of tokens. Thus,
tf-idf scores were considered irrespective of the length of the email.
The Naive Bayes algorithm is a conditional-probability classifier based on Bayes' theorem,
which describes the probability of an event based on prior knowledge of conditions related to
that event. For example, if phishing emails are identified by the presence of phishing
keywords, then a particular keyword can be used to assess the probability that a particular
email is indeed a phishing email more accurately than an assessment made without considering
that keyword.
(i) Posterior probability P(C | x): the probability of the target C (in our case, the
probability that an email is fraudulent) given the predictor x (a keyword fed to the
classifier).
(ii) P(C): the prior probability of the target.
(iii) P(x | C): the likelihood, i.e. the probability of the predictor given the target.
(iv) P(x): the probability of the predictor.
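Combining these four quantities, Bayes' theorem gives the posterior used by the classifier:

\[
P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}.
\]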
Notebook: Jupyter
Programming language: Python
Frontend: Tkinter
Dataset: CSV file format
STEP 1: Access the legitimate and fraudulent text files and combine them into a .csv file.
Fig. 3.6.1
The individual fraudulent and legitimate emails are stored as .txt files in the spam and ham folders. Using
the os module in Python, the individual .txt files are fetched iteratively. The fetched emails are
stored in a dataframe with the column names Text and Class. The emails from the ham folder are
given class 0, whereas emails from the spam folder are assigned class 1. This dataframe is randomized
and then saved as a .csv file. The top 5 entries of the dataframe are viewed as shown in Fig. 3.6.1.
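A minimal sketch of this step is given below. The folder names ham and spam come from the report; the exact paths, the output file name emails_raw.csv, and the encoding choice are our assumptions, not the report's exact code.

import os
import pandas as pd

def load_emails(folder, label):
    # Read every .txt file in the folder and tag it with the given class label.
    rows = []
    for name in os.listdir(folder):
        with open(os.path.join(folder, name), encoding="latin-1") as f:  # lenient encoding for raw emails
            rows.append({"Text": f.read(), "Class": label})
    return rows

# ham -> class 0, spam -> class 1
df = pd.DataFrame(load_emails("ham", 0) + load_emails("spam", 1))
df = df.sample(frac=1, random_state=42).reset_index(drop=True)   # randomize the row order
df.to_csv("emails_raw.csv", index=False)
print(df.head())                                                  # top 5 entries, as in Fig. 3.6.1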
STEP 2: Extract the email bodies from the raw message strings and save the modified dataframe as a .csv file.
Fig. 3.6.2
The headers of the columns, i.e. 'Text' and 'Class', are fetched and stored in a dictionary, and the
header information is retrieved. Only the Text column of the dataframe is stored in a variable named
message. The email strings in the Text column are converted into message objects. Finally, the body of
each message is retrieved using the get_payload() function, and this body replaces the raw email in the
Text column. The top 5 entries of this modified dataframe are viewed as shown in Fig. 3.6.2, and the
modified dataframe is saved on the machine as a .csv file.
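A sketch of this body-extraction step using the standard email module; the input and output file names continue from the previous sketch and are assumptions rather than the report's exact code.

import email
import pandas as pd

df = pd.read_csv("emails_raw.csv")

def extract_body(raw_text):
    # Parse the raw email string and return only the message body.
    msg = email.message_from_string(str(raw_text))
    payload = msg.get_payload()
    if isinstance(payload, list):                     # multipart messages return a list of parts
        return " ".join(str(p.get_payload()) for p in payload)
    return str(payload)

df["Text"] = df["Text"].apply(extract_body)
print(df.head())                                      # top 5 entries, as in Fig. 3.6.2
df.to_csv("emails_body.csv", index=False)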
STEP 3: Combine the second sample of emails, randomize, and store as a new .csv file.
Remove the null values.
Fig.3.6.3
The .csv file with the Nigerian fraud emails is combined with the previously modified .csv file using
the command line and saved as a new .csv file on our system. The final .csv file containing both email
datasets is loaded into a dataframe, which is then randomized. The number of null values in the final
dataframe is inspected, and the null rows are dropped in place using the dropna() command with its
inplace argument, as shown in Fig. 3.6.3.
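The same step can also be done entirely in pandas, as sketched below; the report performs the concatenation on the command line instead, and the file names used here are assumptions.

import pandas as pd

combined = pd.concat(
    [pd.read_csv("emails_body.csv"), pd.read_csv("nigerian_fraud.csv")],
    ignore_index=True,
)
combined = combined.sample(frac=1, random_state=42).reset_index(drop=True)  # randomize
print(combined.isnull().sum())        # count null values per column
combined.dropna(inplace=True)         # drop the null rows in place, as in Fig. 3.6.3
combined.to_csv("emails_combined.csv", index=False)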
STEP 4: Remove duplicate rows and extra class labels from the dataframe.
Fig. 3.6.4
Next, the number of duplicate rows in the dataframe is checked. As shown in Fig. 3.6.4, the dataframe
has 1884 duplicate rows; these are dropped using the drop_duplicates() command. We then check for extra
labels in the dataframe. Our dataset has only two classes, '1' and '0', but Fig. 3.6.4 shows that the
dataframe contains a third class label, so the extra label is dropped and the dataframe is saved.
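A sketch of the de-duplication and label clean-up; the column names Text and Class come from the report, while the file names are assumptions.

import pandas as pd

df = pd.read_csv("emails_combined.csv")

print(df.duplicated().sum())               # the report finds 1884 duplicate rows (Fig. 3.6.4)
df = df.drop_duplicates()

print(df["Class"].value_counts())          # reveals the unexpected third label
df = df[df["Class"].isin([0, 1])]          # keep only the two valid classes
df.to_csv("emails_clean.csv", index=False)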
STEP 5: Save the cleaned data in a new .csv file. Split the cleaned data into training and
testing samples with 70% and 30% of the data respectively.
Fig. 3.6.5
Here the cleaning of our data is complete, and we save the dataframe as a .csv file. The
cleaned data is divided into training and test sets. The email bodies and class labels of the
train and test sets are stored separately in x_train, x_test, y_train, and y_test. The
splitting is done using the train_test_split() function from scikit-learn in the ratio 7:3.
The lengths of the train and test sets are printed as shown in Fig. 3.6.5.
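A sketch of the 70/30 split with scikit-learn; the input file name and the random_state are our choices for illustration and reproducibility.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("emails_clean.csv")

x_train, x_test, y_train, y_test = train_test_split(
    df["Text"], df["Class"], test_size=0.3, random_state=42
)
print(len(x_train), len(x_test))           # sizes of the 70%/30% split, as in Fig. 3.6.5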
STEP 6: Preprocess the training and testing data to convert it to lower case, remove HTML tags,
punctuation, and stop words, and stem the data.
Fig.3.6.6
We create a preprocess function. This function recognizes the punctuation marks in the email body and
strips them out iteratively. The whole body is converted to lowercase using the lower() function. Words
are reduced to their root form using the Snowball stemmer in Python, and the stemmed words are then
lemmatized using the WordNet lemmatizer. All unnecessary stop words are removed from the body. As shown
in Fig. 3.6.6, the training and test datasets are preprocessed using this function and the first 5
entries are viewed.
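A sketch of such a preprocess function using NLTK; this is our reconstruction of the step described above, not the report's exact code, and it continues from the x_train/x_test series of the previous sketch.

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)     # one-time downloads
nltk.download("wordnet", quiet=True)

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase, strip punctuation, drop stop words, then stem and lemmatize each token.
    text = str(text).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [w for w in text.split() if w not in stop_words]
    return " ".join(lemmatizer.lemmatize(stemmer.stem(w)) for w in tokens)

x_train = x_train.apply(preprocess)
x_test = x_test.apply(preprocess)
print(x_train.head())                      # first 5 preprocessed entries, as in Fig. 3.6.6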
STEP 7: Vectorize the training and testing data. Train the Naive Bayes model with the
vectorized training data and training labels. Test the model with the vectorized testing data
and find its accuracy and confusion matrix.
Fig.3.6.7
The preprocessed training data is vectorized using the CountVectorizer function with a maximum of 8000
features. The tf-idf values of these top 8000 feature vectors are then calculated using the
TfidfTransformer function. The MultinomialNB() classifier is then fitted on the tf-idf values and the
training classes to train our model. The test set is vectorized in the same way as the training set and
given to the trained model to calculate its accuracy. From Fig. 3.6.7 we can see that the accuracy of
our model is 95.832%.
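A sketch of this vectorization and training step with scikit-learn, continuing from the preprocessed x_train/x_test above; CountVectorizer, TfidfTransformer, and MultinomialNB are the functions named in the report, while the variable names are ours.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

vectorizer = CountVectorizer(max_features=8000)
tfidf = TfidfTransformer()

x_train_tfidf = tfidf.fit_transform(vectorizer.fit_transform(x_train))  # fit on training data only
x_test_tfidf = tfidf.transform(vectorizer.transform(x_test))

model = MultinomialNB()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(accuracy_score(y_test, y_pred))      # the report obtains about 0.958 (Fig. 3.6.7)
print(confusion_matrix(y_test, y_pred))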
STEP 8: Generate the classification report of the trained model.
Fig. 3.6.8
To find the classification metrics of our trained model, we use the classification_report function
available in scikit-learn. As shown in Fig. 3.6.8, this function takes the test labels and the predicted
test labels as input and outputs the classification metrics, which include the precision, recall,
f1-score, and support of our trained model.
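For example, with the y_test and y_pred objects from the previous sketch:

from sklearn.metrics import classification_report

# Precision, recall, f1-score, and support per class, as in Fig. 3.6.8.
print(classification_report(y_test, y_pred))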
STEP 9: Deploy the trained model with a Tkinter interface that accepts an email from the user.
Fig. 3.6.9
Finally, we deploy our model using the tkinter module of Python. As shown in Fig. 3.6.9, the user is
asked to enter an email in the text box field. This input is fetched from the tkinter text box widget
and stored in a variable, which is then forwarded to our trained model.
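A minimal sketch of such a Tkinter front end, reusing the preprocess, vectorizer, tfidf, and model objects from the earlier sketches; the widget layout and labels are our assumptions, not the report's exact GUI.

import tkinter as tk

def classify_email():
    # Run the pasted email through the same preprocessing/vectorization pipeline, then predict.
    raw = text_box.get("1.0", tk.END)
    features = tfidf.transform(vectorizer.transform([preprocess(raw)]))
    result_var.set("Fraud" if model.predict(features)[0] == 1 else "Legitimate")

root = tk.Tk()
root.title("Email Fraud Detector")

text_box = tk.Text(root, height=15, width=80)
text_box.pack()

result_var = tk.StringVar(value="")
tk.Button(root, text="Check Email", command=classify_email).pack()
tk.Label(root, textvariable=result_var).pack()

root.mainloop()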
STEP 10: Display the result of the classification.
Fig. 3.6.10
The input email from the user goes through all the steps that our training and testing datasets went
through. Finally, as shown in Fig. 3.6.10, our trained model predicts the class of the user-entered
email and displays the result as either Legitimate or Fraud.
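The same prediction path can be exercised without the GUI; the helper below is hypothetical and simply reuses the pipeline objects from the earlier sketches, and the sample email text is made up for illustration.

def predict_label(raw_email):
    # Apply the training-time preprocessing and vectorization, then map the class to a string.
    features = tfidf.transform(vectorizer.transform([preprocess(raw_email)]))
    return "Fraud" if model.predict(features)[0] == 1 else "Legitimate"

print(predict_label("Dear friend, I urgently need your help to transfer $10,000,000 ..."))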
Chapter 4
When the user receives any email, the Naive Bayes algorithm is applied to it and it separates
the fraudulent emails (spam) from the normal ones. The spam is predicted based on the
calculated probability. It was found that Naive Bayes works well on the emails selected by the
user and gives accurate predictions of fraudulent emails.
The interdependent nature of these spam attributes disagrees with the conditional independence
assumption of the algorithm, which can sometimes cause mistakes when classifying emails and
thus affects the accuracy of Naive Bayes classifiers. Furthermore, the values of core
parameters, such as the default probability of spam and the threshold values, may lead to
different results in NBC, and corresponding issues come up. To address these issues and enhance
the performance of spam filtering, machine learning (ML) is used in recent research and
applications to achieve higher accuracy. The first feature of ML is the supervision mechanism:
the outputs can be used as references for the inputs, which is called feedback. For example,
the backpropagation algorithm is one of the most famous methods in image recognition [6]. In
spam filtering, ML could help refine the prior parameters to tune the filtering results.
Chapter 5
5.1 Conclusion
To sum up, we consider the task of email classification as a supervised machine-learning
problem. The novelty of this work is the use of a set of features related to the readability of
email texts. Because the features are language-independent, the method reported in this paper
is potentially able to classify emails written in any language. The aforementioned features, as
well as the traditional ones, are used to generate binary classifiers with five well-known
learning algorithms. We then evaluate the classifier performances on four benchmark email
datasets. The evidence from this study suggests that although traditional features are
individually more important than the other feature types, the combination of all of the
features produces the optimal results. Extensive experiments also imply that classifiers
generated using meta-learning algorithms perform better than trees, functions, and
probabilistic methods. Finally, we compare the results of our method with those of many
state-of-the-art anti-spam filters. Although the performance of our method is not always
superior to other filter-dataset instances, we find that our approach surpasses a number of
them. Taken together, the results suggest that the method described in this paper can be a good
means to classify spam emails. Because our results suggest that meta-learning algorithms
perform the best, further tests should be carried out to see the performance of classifiers
generated by stacking several algorithms.
5.2 Future Work
Even though this algorithm is sophisticated and effective in real life, it still has certain
limitations, which account for the mistakes made by these anti-spam filters: users can
sometimes find a spam email in their “Inbox” while finding certain legitimate emails in the
“Spam” section once in a while. One significant weakness of this algorithm is the conditional
independence assumption, which is the premise for using Bayes' theorem. Often, in reality,
events or attributes are interrelated and dependent on each other. Thus, the presence or
absence of each word or phrase has an impact on the presence or absence of other words or
phrases, since words on related topics tend to appear together, and vice versa.
Looking towards future developments, with ever-advancing artificial intelligence technology
and the efforts of countless mathematicians, computer scientists, and researchers, the Naive
Bayes classifier will certainly evolve and improve over time to cater to the needs of email
users. One possible direction is better classification criteria. Currently, the Naive Bayes
classifier simply computes the probability of an email being spam based solely on the presence
and absence of words and phrases. However, as the field of AI makes more progress on the
ability of machines to understand the semantic meaning of language, better classification
decisions can be made by considering the more complex defining properties of spam emails as
determined by the meanings of sentences. Therefore, a future with enhanced and more
user-friendly spam email filters is in prospect.
Bibliography
[1] http://airccse.org/journal/jcsit/0211ijcsit12.pdf
[2] https://meu.edu.jo/libraryTheses/590422b4d5dd8_1.pdf
[3] https://aip.scitation.org/doi/abs/10.1063/1.5038979
[4] https://www.sciencedirect.com/science/article/pii/S1110866514000280
[5] https://su-plus.strathmore.edu/handle/11071/5616
[6] Yuanyuan Grace Zeng, “Identifying Email Threats Using Predictive Analysis”, Rep. No. 08074848.
[7] Rushdi Shams and Robert E. Mercer, “Classifying Spam Emails using Text and Readability Features”, Department of Computer Science, University of Western Ontario.
[8] Reshma Varghese and Dhanya K. A., “Efficient Feature Set for Spam Email Filtering”, IEEE 7th International Advance Computing Conference, 2017.
[9] Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, George Paliouras and Constantine D. Spyropoulos, “An Evaluation of Naive Bayesian Anti-Spam Filtering”, 11th European Conference on Machine Learning, Barcelona, Spain, pp. 9-17, 2000.
Acknowledgements
We would like to express our deepest appreciation to all those who made it possible for us to
complete this report. We wish to express sincere gratitude to Dr. Ramesh Vasappanavara,
Principal, R.A.I.T. College, Nerul. We owe a deep sense of gratitude to Dr. Leena Ragha, Head
of Department, Computer Engineering, R.A.I.T. College, Nerul. We give special gratitude to our
guide Mrs. Puja Padiya and evaluator Mrs. Trupti Patil, whose stimulating suggestions and
encouragement helped us to coordinate the project, especially in writing this report. Lastly,
we would like to extend our thanks to the faculties of the college who have directly or
indirectly helped us in exploring this topic and completing the study of ‘Email Fraud
Detector’.
Date: