
Credit Card Fraud Detection Using Machine Learning

This document discusses credit card fraud detection using machine learning and data science. It begins with an introduction explaining that credit card companies need to identify fraudulent transactions to protect customers. Machine learning algorithms are used to analyze past transactions and detect suspicious activity. The model is updated over time using feedback from investigators. The document then reviews common fraud detection methods and related literature before describing the proposed methodology. The methodology uses anomaly detection algorithms like Isolation Forest and Local Outlier Factor on a credit card transaction dataset to identify fraudulent transactions.

Credit Card Fraud Detection using Machine Learning and Data Science

Abstract— It is vital that credit card companies are able to identify fraudulent credit card transactions so that customers are not charged for items that they did not purchase. Such problems can be tackled with Data Science, and its importance, along with Machine Learning, cannot be overstated. This project intends to illustrate the modelling of a data set using machine learning for Credit Card Fraud Detection. The Credit Card Fraud Detection Problem includes modelling past credit card transactions with the data of the ones that turned out to be fraud. This model is then used to recognize whether a new transaction is fraudulent or not. Our objective here is to detect 100% of the fraudulent transactions while minimizing the incorrect fraud classifications. Credit Card Fraud Detection is a typical example of classification. In this process, we have focused on analysing and pre-processing data sets as well as the deployment of multiple anomaly detection algorithms, such as Local Outlier Factor and Isolation Forest, on the PCA-transformed Credit Card Transaction data.

Keywords— Credit card fraud, applications of machine learning, data science, isolation forest algorithm, local outlier factor, automated fraud detection.

I. INTRODUCTION
'Fraud' in credit card transactions is unauthorized and unwanted usage of an account by someone other than the owner of that account. Necessary prevention measures can be taken to stop this abuse, and the behaviour of such fraudulent practices can be studied to minimize it and protect against similar occurrences in the future. In other words, Credit Card Fraud can be defined as a case where a person uses someone else's credit card for personal reasons while the owner and the card-issuing authorities are unaware of the fact that the card is being used.

Fraud detection involves monitoring the activities of populations of users in order to estimate, perceive or avoid objectionable behaviour, which consists of fraud, intrusion, and defaulting.

This is a very relevant problem that demands the attention of communities such as machine learning and data science, where the solution to this problem can be automated.

This problem is particularly challenging from the perspective of learning, as it is characterized by various factors such as class imbalance: valid transactions far outnumber fraudulent ones. Also, the transaction patterns often change their statistical properties over the course of time.

These are not the only challenges in the implementation of a real-world fraud detection system, however. In real-world examples, the massive stream of payment requests is quickly scanned by automatic tools that determine which transactions to authorize.

Machine learning algorithms are employed to analyse all the authorized transactions and report the suspicious ones. These reports are investigated by professionals who contact the cardholders to confirm whether the transaction was genuine or fraudulent.

The investigators provide feedback to the automated system, which is used to train and update the algorithm so that fraud-detection performance improves over time.

Fraud detection methods are continuously developed to defend against criminals who keep adapting their fraudulent strategies. These frauds are classified as:
• Credit Card Frauds: Online and Offline
• Card Theft
• Account Bankruptcy
• Device Intrusion
• Application Fraud
• Counterfeit Card
• Telecommunication Fraud

Some of the currently used approaches to detection of such fraud are:
• Artificial Neural Network
• Fuzzy Logic
• Genetic Algorithm
• Logistic Regression
• Decision tree
• Support Vector Machines
• Bayesian Networks
• Hidden Markov Model
• K-Nearest Neighbour

II. LITERATURE REVIEW
Fraud is the unlawful or criminal deception intended to result in financial or personal benefit. It is a deliberate act that is against the law, rule or policy with an aim to attain unauthorized financial benefit.

Numerous works pertaining to anomaly or fraud detection in this domain have been published already and are available for public usage. A comprehensive survey conducted by Clifton Phua and his associates has revealed that techniques employed in this domain include data mining applications, automated fraud detection and adversarial detection. In another paper, Suman, Research Scholar, GJUS&T, Hisar HCE, presented techniques like Supervised and Unsupervised Learning for credit card fraud detection. Even though these methods and algorithms achieved unexpected success in some areas, they failed to provide a permanent and consistent solution to fraud detection.

Similar research was presented by Wen-Fang YU and Na Wang, who used Outlier mining, Outlier detection mining and Distance sum algorithms to accurately predict fraudulent transactions in an emulation experiment on the credit card transaction data set of one certain commercial bank. Outlier mining is a field of data mining which is mainly used in monetary and internet fields. It deals with detecting objects that are detached from the main system, i.e. the transactions that aren't genuine. They took attributes of customers' behaviour and, based on the values of those attributes, calculated the distance between the observed value of each attribute and its predetermined value.

Unconventional techniques such as a hybrid data mining/complex network classification algorithm, which can perceive illegal instances in an actual card transaction data set and is based on a network reconstruction algorithm that allows creating representations of the deviation of one instance from a reference group, have proved efficient, typically on medium-sized online transactions.

There have also been efforts to approach the problem from a completely new angle. Attempts have been made to improve the alert-feedback interaction in case of a fraudulent transaction: the authorised system would be alerted and feedback would be sent to deny the ongoing transaction.

Artificial Genetic Algorithm, one of the approaches that shed new light on this domain, countered fraud from a different direction. It proved accurate in finding the fraudulent transactions and minimizing the number of false alerts, even though it was accompanied by a classification problem with variable misclassification costs.

III. METHODOLOGY
The approach that this paper proposes uses the latest machine learning algorithms to detect anomalous activities, called outliers.

The basic rough architecture diagram can be represented with the following figure:
[Figure: basic architecture diagram]

When looked at in detail on a larger scale along with real-life elements, the full architecture diagram can be represented as follows:
[Figure: full architecture diagram]

First of all, we obtained our dataset from Kaggle, a data analysis website which provides datasets. Inside this dataset, there are 31 columns, out of which 28 are named v1-v28 to protect sensitive data. The other columns represent Time, Amount and Class. Time is the time elapsed between the first transaction in the dataset and the current one. Amount is the amount of money transacted. Class 0 represents a valid transaction and 1 represents a fraudulent one.

We plot several graphs to check the dataset for inconsistencies and to understand it visually.
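As a quick check on the Class column described above, the share of fraudulent rows can be computed directly from the labels. The sketch below is illustrative only: the `labels` list is made-up toy data rather than the Kaggle dataset, and `class_summary` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

def class_summary(labels):
    """Return (valid_count, fraud_count, fraud_fraction) for 0/1 class labels."""
    counts = Counter(labels)
    valid, fraud = counts.get(0, 0), counts.get(1, 0)
    total = valid + fraud
    # Guard against an empty label list to avoid dividing by zero.
    fraction = fraud / total if total else 0.0
    return valid, fraud, fraction

# Toy labels (an assumption for illustration): 0 = valid, 1 = fraudulent.
labels = [0] * 995 + [1] * 5
valid, fraud, fraction = class_summary(labels)
print(f"valid={valid} fraud={fraud} fraud_fraction={fraction:.4f}")
```

A fraction of this kind is also what outlier detectors commonly accept as their expected contamination rate, so it is a useful number to have before fitting any model.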
[Figure: number of fraudulent versus valid transactions]
This graph shows that the number of fraudulent transactions is much lower than the legitimate ones.

[Figure: transaction times over two days]
This graph shows the times at which transactions were made within two days. It can be seen that the fewest transactions were made at night and the most during the day.

[Figure: transaction amounts]
This graph represents the amount that was transacted. A majority of transactions are relatively small and only a handful of them come close to the maximum transacted amount.

After checking this dataset, we plot a histogram for every column. This gives a graphical representation of the dataset which can be used to verify that there are no missing values in it. This ensures that we don't require any missing value imputation and that the machine learning algorithms can process the dataset smoothly.

After this analysis, we plot a heatmap to get a coloured representation of the data and to study the correlation between our predicting variables and the class variable. This heatmap is shown below:
[Figure: correlation heatmap]

The dataset is now formatted and processed. The Time and Amount columns are standardized and the Class column is removed from the inputs to ensure fairness of evaluation. The data is processed by a set of algorithms from modules. The following module diagram explains how these algorithms work together:
[Figure: module diagram]

This data is fit into a model and the following outlier detection modules are applied on it:
• Local Outlier Factor
• Isolation Forest Algorithm

These algorithms are a part of sklearn. The ensemble module in the sklearn package, which provides Isolation Forest, includes ensemble-based methods and functions for classification, regression and outlier detection; Local Outlier Factor is provided by the neighbors module.
This free and open-source Python library is built using the NumPy, SciPy and matplotlib modules and provides a lot of simple and efficient tools which can be used for data analysis

and machine learning. It features various classification, clustering and regression algorithms and is designed to interoperate with the numerical and scientific libraries.

We've used the Jupyter Notebook platform to write a program in Python that demonstrates the approach this paper suggests. This program can also be executed in the cloud using the Google Colab platform, which supports all Python notebook files.

Detailed explanations of the modules, with pseudocode for their algorithms and output graphs, are given as follows:

A. Local Outlier Factor
It is an Unsupervised Outlier Detection algorithm. 'Local Outlier Factor' refers to the anomaly score of each sample. It measures the local deviation of the sample data with respect to its neighbours. More precisely, locality is given by the k-nearest neighbours, whose distance is used to estimate the local density. By comparing the local density of a sample to that of its neighbours, one can identify samples whose density is substantially lower than that of their neighbours. These samples are anomalous and are considered outliers.

The pseudocode for this algorithm is written as:
[Figure: Local Outlier Factor pseudocode]

On plotting the results of the Local Outlier Factor algorithm, we get the following figure:
[Figure: Local Outlier Factor results]

As the dataset is very large, we used only a fraction of it in our tests to reduce processing times. The final result with the complete dataset processed is also determined and is given in the results section of this paper.

B. Isolation Forest Algorithm
The Isolation Forest 'isolates' observations by arbitrarily selecting a feature and then randomly selecting a split value between the maximum and minimum values of the designated feature. Recursive partitioning can be represented by a tree, so the number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating node. The average of this path length gives a measure of normality and the decision function which we use.

The pseudocode for this algorithm can be written as:
[Figure: Isolation Forest pseudocode]

On plotting the results of the Isolation Forest algorithm, we get the following figure:
[Figure: Isolation Forest results]

Partitioning the samples randomly produces noticeably shorter paths for anomalies. When a forest of random trees collectively produces shorter path lengths for specific samples, those samples are highly likely to be anomalies.
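The claim that anomalies end up with shorter paths can be checked with a small experiment. The following is a deliberately simplified, one-dimensional sketch in plain Python: it is not the sklearn implementation (which also picks a random feature and subsamples the data), and the helper names and toy values are assumptions made for illustration.

```python
import random

def isolation_depth(values, target, rng):
    """Count the random splits needed until `target` is alone in its partition."""
    subset = list(values)
    depth = 0
    while len(subset) > 1:
        lo, hi = min(subset), max(subset)
        if lo == hi:  # only duplicates remain, so no further split is possible
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target.
        if target < split:
            subset = [v for v in subset if v < split]
        else:
            subset = [v for v in subset if v >= split]
        depth += 1
    return depth

def mean_depth(values, target, trials=200, seed=0):
    """Average isolation depth of `target` over several random 'trees'."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials

# Toy data (an assumption): 200 clustered values plus one distant value.
data_rng = random.Random(42)
inliers = [data_rng.gauss(50.0, 5.0) for _ in range(200)]
outlier = 500.0
typical = sorted(inliers)[len(inliers) // 2]  # a point in the middle of the cluster
values = inliers + [outlier]

print("mean depth, outlier:", mean_depth(values, outlier))
print("mean depth, typical point:", mean_depth(values, typical))
```

The distant value is usually separated by the very first split, while a point inside the cluster needs many more splits before it stands alone.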
Once the anomalies are detected, the system can be used to
report them to the concerned authorities. For testing purposes,
we are comparing the outputs of these algorithms to
determine their accuracy and precision.
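A minimal sketch of how the two detectors might be applied with sklearn is given below. It assumes a synthetic feature matrix standing in for the transaction data, and it sets the contamination rate from the known labels purely for illustration; the variable names, the cluster positions and the `n_neighbors=20` setting are assumptions, not values taken from the original notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in for the transaction features (not the Kaggle data):
# 500 "valid" points near the origin and 10 "fraud" points far away.
X_valid = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
X_fraud = rng.normal(loc=8.0, scale=1.0, size=(10, 4))
X = StandardScaler().fit_transform(np.vstack([X_valid, X_fraud]))
y = np.array([0] * 500 + [1] * 10)  # 0 = valid, 1 = fraudulent

# Expected share of outliers, taken from the labels for illustration only.
outlier_fraction = float(y.mean())

detectors = {
    "Isolation Forest": IsolationForest(
        contamination=outlier_fraction, random_state=0
    ),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20, contamination=outlier_fraction
    ),
}

predictions = {}
for name, detector in detectors.items():
    raw = detector.fit_predict(X)                   # +1 = inlier, -1 = outlier
    predictions[name] = np.where(raw == -1, 1, 0)   # map to 0 = valid, 1 = fraud
    errors = int((predictions[name] != y).sum())
    print(f"{name}: {errors} misclassified out of {len(y)}")
```

Both detectors return +1 for inliers and -1 for outliers, so mapping their output onto the 0/1 class convention puts the two modules in the same format and lets them be compared against the labels in the same way.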

IV. IMPLEMENTATION
This idea is difficult to implement in real life because it requires cooperation from banks, which aren't willing to share information due to market competition, legal reasons and the protection of their users' data.

Therefore, we looked up some reference papers which followed similar approaches and gathered results. As stated in one of these reference papers:

“This technique was applied to a full application data set supplied by a German bank in 2006. For banking confidentiality reasons, only a summary of the results obtained is presented below. After applying this technique, the level 1 list encompasses a few cases but with a high probability of being fraudsters.
All individuals mentioned in this list had their cards closed to avoid any risk due to their high-risk profile. The condition is more complex for the other list. The level 2 list is still restricted adequately to be checked on a case by case basis. Credit and collection officers considered that half of the cases in this list could be considered as suspicious fraudulent behaviour. For the last list and the largest, the work is equitably heavy. Less than a third of them are suspicious.
In order to maximize the time efficiency and the overhead charges, a possibility is to include a new element in the query; this element can be the five first digits of the phone numbers, the email address, and the password, for instance, those new queries can be applied to the level 2 list and level 3 list.”

V. RESULTS
The code prints out the number of false positives it detected and compares it with the actual values. This is used to calculate the accuracy score and precision of the algorithms.

The fraction of data we used for faster testing is 10% of the entire dataset. The complete dataset is also used at the end and both results are printed.

These results, along with the classification report for each algorithm, are given in the output as follows, where class 0 means the transaction was determined to be valid and 1 means it was determined to be a fraudulent transaction. This result is matched against the class values to check for false positives.

Results when 10% of the dataset is used:
[Figure: output for 10% of the dataset]

Results when the complete dataset is used:
[Figure: output for the complete dataset]

VI. CONCLUSION
Credit card fraud is without a doubt an act of criminal dishonesty. This article has listed the most common methods of fraud along with their detection methods and reviewed recent findings in this field. This paper has also explained in detail how machine learning can be applied to get better results in fraud detection, along with the algorithm, pseudocode, an explanation of its implementation and experimentation results.

While the algorithm does reach over 99.6% accuracy, its precision remains only at 28% when a tenth of the data set is taken into consideration. However, when the entire dataset is fed into the algorithm, the precision rises to 33%. This high percentage of accuracy is to be expected due to the huge imbalance between the number of valid and the number of fraudulent transactions.
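The gap between accuracy and precision on imbalanced data can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical toy labels, not the results reported in this paper, and `accuracy_and_precision` is an illustrative helper rather than the code used in the experiments.

```python
def accuracy_and_precision(y_true, y_pred):
    """Compute accuracy and fraud-class precision for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    flagged = true_pos + false_pos
    # Precision is undefined when nothing is flagged; report 0.0 in that case.
    precision = true_pos / flagged if flagged else 0.0
    return accuracy, precision

# Hypothetical toy case: 1000 transactions, of which only 2 are fraudulent.
y_true = [0] * 998 + [1] * 2
# A detector that flags 5 transactions: 4 valid ones and 1 real fraud.
y_pred = [0] * 994 + [1] * 4 + [1, 0]

accuracy, precision = accuracy_and_precision(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f}")
```

Because almost every transaction is valid, a detector can misjudge most of the cases it flags and still be correct on nearly all rows, which is why precision is the more informative figure here.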

Since the entire dataset consists of only two days' transaction records, it is only a fraction of the data that could be made available if this project were to be used on a commercial scale. Being based on machine learning algorithms, the program should become more effective over time as more data is put into it.

VII. FUTURE ENHANCEMENTS
While we couldn't reach our goal of 100% accuracy in fraud detection, we did end up creating a system that can, with enough time and data, get very close to that goal. As with any such project, there is some room for improvement here.

The very nature of this project allows multiple algorithms to be integrated together as modules, and their results can be combined to increase the accuracy of the final result.

This model can be further improved with the addition of more algorithms. However, the output of these algorithms needs to be in the same format as the others. Once that condition is satisfied, the modules are easy to add, as done in the code. This provides a great degree of modularity and versatility to the project.

More room for improvement can be found in the dataset. As demonstrated before, the precision of the algorithms increases when the size of the dataset is increased. Hence, more data is likely to make the model more accurate in detecting fraud and to reduce the number of false positives. However, this requires official support from the banks themselves.

REFERENCES
[1] John Richard D. Kho and Larry A. Vea, “Credit Card Fraud Detection Based on Transaction Behaviour,” Proc. of the 2017 IEEE Region 10 Conference (TENCON), Malaysia, November 5-8, 2017.
[2] Clifton Phua, Vincent Lee, Kate Smith and Ross Gayler, “A Comprehensive Survey of Data Mining-based Fraud Detection Research,” School of Business Systems, Faculty of Information Technology, Monash University, Wellington Road, Clayton, Victoria 3800, Australia.
[3] Suman, Research Scholar, GJUS&T Hisar HCE, Sonepat, “Survey Paper on Credit Card Fraud Detection,” International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 3, Issue 3, March 2014.
[4] Wen-Fang YU and Na Wang, “Research on Credit Card Fraud Detection Model Based on Distance Sum,” 2009 International Joint Conference on Artificial Intelligence.
[5] Massimiliano Zanin, Miguel Romance, Regino Criado and Santiago Moral, “Credit Card Fraud Detection through Parenclitic Network Analysis,” Hindawi Complexity, Volume 2018, Article ID 5764370, 9 pages.
[6] “Credit Card Fraud Detection: A Realistic Modeling and a Novel Learning Strategy,” IEEE Transactions on Neural Networks and Learning Systems, Vol. 29, No. 8, August 2018.
[7] Ishu Trivedi, Monika, Mrigya and Mridushi, “Credit Card Fraud Detection,” International Journal of Advanced Research in Computer and Communication Engineering, Vol. 5, Issue 1, January 2016.
[8] David J. Weston, David J. Hand, M. Adams, Whitrow and Piotr Juszczak, “Plastic Card Fraud Detection using Peer Group Analysis,” Springer, Issue 2008.
