Report On Credit Card Fraud Detection Algo Using Machine Learning 1
Submitted By
Atharva Nitin Gokhare
20CO028
DEPARTMENT OF COMPUTER ENGINEERING
CERTIFICATE
SEMINAR APPROVAL
The Seminar entitled
By
Bachelor of Engineering
In
Computer Engineering
Examiner 1 Examiner 2
Name and Signature Name and Signature
Date: -
Place: -
ABSTRACT
Credit cards offer a convenient and effective way to make purchases
online. As credit card use has grown, credit card misuse has become
more likely. Credit card fraud causes large financial losses for both
cardholders and financial institutions. The primary goal of this
research study is to identify such fraud, a task complicated by
severe class imbalance, limited data accessibility, changes in the
nature of fraud, and high false alarm rates. Many machine
learning-based algorithms for credit card fraud detection are
presented in the pertinent literature; this report focuses on the
Random Forest Algorithm (RFA). However, because precision remains
poor, state-of-the-art techniques are still needed.
ACKNOWLEDGEMENT
It gives me great pleasure to acknowledge the contribution of all those who have directly or
indirectly contributed to the completion of this seminar.
I express my foremost and deepest gratitude to my guide Mrs. Minal Swami for her
supervision, noble guidance and constant encouragement in carrying out the seminar.
This acknowledgement would not be complete without special thanks to all the
teaching and non-teaching staff of the Computer Engineering Department for their
direct and indirect support.
Date-
CONTENTS
Certificate
Seminar Approval
Abstract
Acknowledgements
Contents
List of Figures
Introduction
2 Related work
3 Research methodology
4 Feature Selection
11 SYSTEM ARCHITECTURE
12 Implementation Modules
13 Future Enhancements
14 Conclusion
15 References
CREDIT CARD FRAUD DETECTION USING MACHINE LEARNING
ALGORITHM
Introduction
Valid transactions far outnumber fraudulent ones, and transaction patterns often change
their statistical properties over time. These are not the only challenges in implementing a
real-world fraud detection system, however. In real-world systems, the massive stream of
payment requests is quickly scanned by automatic tools that determine which transactions to
authorize. Machine learning (ML) algorithms are employed to analyze all the authorized
transactions and report suspicious ones. These reports are investigated by professionals
who contact the cardholders to confirm whether each transaction was genuine or fraudulent.
The investigators provide feedback to the automated system, which is used to train and
update the algorithm so that fraud-detection performance improves over time.
ML models have been used in many studies to solve numerous challenges. Deep learning (DL)
algorithms have been applied in computer networks, intrusion detection, banking, insurance,
mobile cellular networks, health care fraud detection, medical and malware detection, video
surveillance, location tracking, Android malware detection, home automation, and heart
disease prediction. In this paper, we explore the practical application of ML, particularly
DL algorithms, to identifying credit card fraud (CCF) in the banking industry. The support
vector machine (SVM) is a supervised ML technique for data classification problems. It is
employed in a variety of domains, including image recognition, credit rating, and public
safety. SVM can tackle linear and nonlinear binary classification problems by finding a
hyperplane that separates the input data, and it is often reported to compare well with
other classifiers. Neural networks were the first method used to identify credit card fraud.
As a result, current work focuses on DL, a branch of ML. In recent years, DL approaches have
received significant attention due to substantial and promising outcomes in various
applications, such as computer vision, natural language processing, and speech. However,
only a few studies have examined the application of deep neural networks to identifying
CCF. Several DL algorithms can be used to detect CCF; in this study, we choose the CNN model
and its layers to determine whether a transaction in the labelled dataset is fraudulent or
normal. Some transactions in the dataset that have been labelled fraudulent also demonstrate
questionable transaction behavior. As a result, we focus on supervised and unsupervised
learning in this research paper.
Class imbalance is the ML problem in which the total number of samples in one class
(positive) is far smaller than the total number in the other class (negative).
Classification on imbalanced datasets has been the subject of several studies, and this
body of work offers several possible answers.
To the best of our knowledge, however, the problem of class imbalance has not yet been
fully solved. We propose to alter the DL algorithm of the CNN model by adding layers for
feature extraction and for the classification of credit card transactions as fraudulent or
otherwise. The top attributes of the prepared dataset are ranked using feature selection
techniques. After that, CCF is classified using several supervised ML and DL models. The
main aim of this study is to detect fraudulent credit card transactions with the help of ML
and DL algorithms. This study makes the following contributions:
• Feature selection algorithms are used to rank the top features of the CCF transaction
dataset, which help in class label prediction.
• A deep learning model is proposed in which a number of additional layers are used for
feature extraction and classification on the credit card fraud detection dataset.
• Different architectures of CNN layers are applied to analyze the performance of the CNN
model.
• A comparative analysis is performed between ML and DL algorithms, and between the
proposed CNN and a baseline model; the results indicate that the proposed approach
outperforms existing approaches.
• Accuracy, precision, and recall are used as performance evaluation measures to assess
the classifiers.
Experiments are performed on the latest credit card dataset. The rest of the paper is
structured as follows: the second section examines the related work. The paper also
presents the outcomes of our tests on a real dataset, as well as their analysis.
1. SUPERVISED MACHINE LEARNING APPROACHES
ML has many branches, each of which deals with different learning tasks and uses different
types of framework. For CCF, ML offers approaches such as random forest (RF), an ensemble
of decision trees that most researchers use. RF can also be combined with network analysis;
this method is called APATE. Researchers can use different ML techniques, both supervised
and unsupervised. ML algorithms such as logistic regression (LR), artificial neural
networks (ANN), decision trees (DT), SVM, and naïve Bayes (NB) are commonly used for CCF
detection, and they can be combined with ensemble techniques to construct solid detection
classifiers.
An artificial neural network is a set of interconnected neurons (nodes). A feed-forward
multilayer perceptron is built up of several layers: an input layer, an output layer, and
one or more hidden layers. The first layer contains the input nodes, which represent the
explanatory variables. These inputs are multiplied by weights and passed to each node of
the hidden layer, where they are added together along with a bias. An activation function
is then applied to this sum to produce the output of each neuron, which is passed on to the
next layer. Finally, the output layer provides the algorithm's response. The weights are
first set randomly and are then adjusted on the training set to minimise the error, using
algorithms such as backpropagation.
A Bayesian belief network (BBN) is a graphical model of the probabilistic relationships
among a set of variables. It was developed to relax the independence assumption of naïve
Bayes and to allow for dependencies among variables. Variables are represented as nodes,
while conditional dependencies between variables are shown as arcs between nodes. Each node
has an associated conditional probability table, which gives the probabilities of the
node's variable conditional on the values of its parent nodes. A BBN is constructed as
follows. The first step is to find a structure for the network, which can be specified by
human experts or learned from the data by specific algorithms. Once the network topology is
known, the network is fitted to historical data in a straightforward way, as in naïve
Bayes, with continuous variables either discretised or assumed to be normally distributed.
In a BBN, each node is assumed to be independent of its non-descendants given its parents
in the graph. This is known as the Markov condition.
TABLE 1. Algorithms of machine learning and their accuracy.
The support vector machine (SVM) is a linear model for classification and regression
problems. According to the SVM algorithm, we find the points closest to the separating
line from both classes. These points are called support vectors. This report is concerned
with the integration of unsupervised techniques with supervised techniques for CCF
detection.
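The forward pass of the multilayer perceptron described in this section can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the layer sizes, the weights and biases, and the choice of a sigmoid activation are all hypothetical and are not taken from any model in this report.

```python
import math


def sigmoid(z):
    """Logistic activation; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute one layer: weighted sum of the inputs plus a bias, then activation.

    weights[j][i] is the weight from input i to neuron j.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(z))
    return outputs


def mlp_forward(inputs, layers):
    """Pass the inputs through each (weights, biases) layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
example_layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.0, -0.1]),  # hidden layer
    ([[1.2, -0.7]], [0.05]),                   # output layer
]
score = mlp_forward([1.0, 2.0], example_layers)[0]
print(f"output score: {score:.4f}")
```

Training would then adjust the weights and biases, for example by backpropagation, which is not shown here.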
2. Related work
An earlier study proposed a credit card fraud detection method using ML.
The authors used a credit card fraud dataset sourced from Kaggle, which
contains transactions made over 2 days by European credit card holders.
To deal with the class imbalance present in the dataset, the researchers
applied the Synthetic Minority Oversampling Technique (SMOTE). The
following ML methods were implemented to assess the efficacy of the
proposed method: RF, NB, and multilayer perceptron (MLP). The
experimental results showed that the RF algorithm performed best, with a
fraud detection accuracy of 99.96%. The NB and MLP methods obtained
accuracy scores of 99.23% and 99.93%, respectively. The authors concede
that more research is needed on feature selection methods that could
improve the accuracy of other ML methods.
3. Research methodology
Dataset
In this research, we use a dataset of credit card transactions made by
European cardholders over 2 days in September 2013. The dataset contains
284807 transactions in total, of which 0.172% are fraudulent. It has 30
features: V1 to V28, Time, and Amount. All the attributes within the
dataset are numerical. The last column represents the class (type of
transaction), where a value of 1 denotes a fraudulent transaction and a
value of 0 a valid one. The features V1 to V28 are not named, for data
security and integrity reasons. This dataset has been used in earlier
work, and one of the key issues that we found is the low detection
accuracy obtained by those models because of the highly imbalanced
nature of the dataset. To address the class imbalance, we applied the
Synthetic Minority Oversampling Technique (SMOTE) in the
data-preprocessing phase of the proposed framework. SMOTE works by
picking minority-class samples that are close to each other in the
feature space, drawing a line between them, and creating a new instance
of the minority class at a point along that line.
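The interpolation step of SMOTE can be illustrated with a short sketch. It is a simplification under stated assumptions: it interpolates towards the single nearest minority neighbour, whereas SMOTE as usually implemented chooses among several nearest neighbours, and the function names and sample values are hypothetical. In practice a library implementation such as the `SMOTE` class in imbalanced-learn would normally be used instead.

```python
import math
import random


def nearest_neighbour(point, others):
    """Return the sample in `others` closest to `point` (Euclidean distance)."""
    return min(others, key=lambda other: math.dist(point, other))


def smote_like_sample(minority, rng):
    """Create one synthetic minority sample by interpolating between a randomly
    chosen minority point and its nearest minority neighbour."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbour = nearest_neighbour(base, others)
    gap = rng.random()  # position along the line segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Hypothetical minority-class (fraud) samples with two features each.
fraud_samples = [[1.0, 1.0], [1.2, 0.9], [3.0, 3.5], [3.1, 3.4]]
rng = random.Random(42)
synthetic = [smote_like_sample(fraud_samples, rng) for _ in range(3)]
for row in synthetic:
    print([round(value, 3) for value in row])
```

Each synthetic point lies on the segment between two existing fraud samples, so oversampling adds new minority examples without simply duplicating old ones.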
4. Feature selection
5. RANDOM FOREST – The machine learning Algorithm
Random forests are an ensemble method used for classification. In a random forest, we grow
multiple trees, as opposed to the single tree of a decision tree model. Why use multiple
trees when a single tree can do the same work? One of the major problems of a decision tree
is overfitting, which gives a poor predictive model. Growing multiple trees introduces
randomness into the forest, which in turn reduces overfitting and gives a better predictive
model. To classify a new object based on its attributes, each tree gives a classification,
and we say the tree “votes” for that class. The forest chooses the classification with the
most votes (over all the trees in the forest); in the case of regression, it takes the
average of the outputs of the different trees. Random Forest, also called Random Decision
Forest, is used for classification, regression, and other tasks that are performed by
constructing multiple decision trees. The Random Forest Algorithm (RFA) is based on
supervised learning, and a major advantage of the algorithm is that it can be used for both
classification and regression. It is widely used and often gives better accuracy than
other existing approaches. In this paper, the use of the Random Forest algorithm in credit
card fraud detection gives an accuracy of about 90 to 95%.
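The two ways a forest combines its trees, voting for classification and averaging for regression, can be shown with a small sketch. The per-tree predictions below are hypothetical values chosen for illustration, not outputs of a trained model.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the class label predicted by the most trees.

    Ties go to the label that was encountered first.
    """
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Return the average of the numeric outputs of the trees."""
    return mean(tree_outputs)


# Hypothetical predictions from five trees for one transaction
# (1 = fraudulent, 0 = valid).
votes = [0, 1, 1, 0, 1]
print(aggregate_classification(votes))

# Hypothetical numeric outputs from five trees in a regression task.
outputs = [0.2, 0.9, 0.8, 0.1, 0.7]
print(aggregate_regression(outputs))
```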
6. How does Random Forest algorithm work?
Random Forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make a prediction with each tree created in the first
phase.
The working process can be explained in the steps and diagram below:
Step-1: Select random data points (subsets) from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2 until N trees have been built.
Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority of votes.
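The procedure can be sketched end to end in plain Python. This is a deliberately simplified illustration, not the implementation used in this report: each tree is reduced to a single split (a decision stump) on one randomly chosen feature, the data points are selected by sampling with replacement (an assumption, since the text only says that data points are selected), and the toy data and all function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def bootstrap_sample(rows, labels, rng):
    """Select len(rows) data points at random, with replacement."""
    picks = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in picks], [labels[i] for i in picks]


def train_stump(rows, labels, rng):
    """Build a one-split tree (a stump) on one randomly chosen feature.

    Returns (feature, threshold, label_if_below_or_equal, label_if_above).
    """
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        if not right:
            continue  # nothing above the threshold, so this is not a split
        left_label, right_label = majority(left), majority(right)
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # the feature is constant in this sample
        only = majority(labels)
        return feature, rows[0][feature], only, only
    return feature, best[1], best[2], best[3]


def predict_stump(stump, row):
    """Apply the single split of a stump to one data point."""
    feature, threshold, below, above = stump
    return below if row[feature] <= threshold else above


def train_forest(rows, labels, n_trees, seed=0):
    """Repeat sampling and tree building until n_trees trees exist."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
        forest.append(train_stump(sample_rows, sample_labels, rng))
    return forest


def predict_forest(forest, row):
    """Let every tree vote and return the class with the most votes."""
    return majority([predict_stump(stump, row) for stump in forest])


# Hypothetical toy data: [amount, risk score]; label 1 = fraudulent, 0 = valid.
rows = [[1.0, 0.20], [1.5, 0.10], [2.0, 0.30], [2.5, 0.20],
        [8.0, 0.90], [8.5, 0.80], [9.0, 0.95], [9.5, 0.85]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = train_forest(rows, labels, n_trees=25, seed=1)
print(predict_forest(forest, [9.2, 0.90]))
print(predict_forest(forest, [1.0, 0.10]))
```

A real forest grows deeper trees and considers several candidate features at each split; the stump keeps the sketch short while preserving the sample, build, repeat, and vote structure.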
The working of the algorithm can be better understood through the following example:
Example: Suppose there is a dataset that contains multiple fruit images. This dataset is
given to the Random Forest classifier. The dataset is divided into subsets, and each subset
is given to one decision tree. During the training phase, each decision tree produces a
prediction result, and when a new data point occurs, the Random Forest classifier predicts
the final decision based on the majority of those results. Consider the below image:
7. Applications of Random Forest
There are mainly four sectors where Random Forest is used:
1. Banking: The banking sector mostly uses this algorithm for the identification of loan
risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease
can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
10. RFA IMPLEMENTATION IN CREDIT CARD
FRAUD DETECTION
In credit card fraud detection, the Random Forest Algorithm gives better accuracy in
results. First, the datasets are collected and analyzed. During the analysis, all duplicate
and null values are removed from the dataset. The dataset is then preprocessed based on
the amount and transaction time, so that the accuracy on the resulting dataset can be
measured. After preprocessing, the dataset is divided into two parts: a training dataset
and a test dataset. For this we use a library called ‘Scikit-learn’. Scikit-learn is a free
machine learning library for Python that provides classification, regression, and
clustering algorithms, among others, and is designed to interoperate with other Python
libraries. Next, we apply the Random Forest Algorithm to the preprocessed dataset and
obtain a confusion matrix. In the confusion matrix, the predictions are partitioned into
four blocks: True Positive (TP), True Negative (TN), False Positive (FP), and False
Negative (FN). The dataset is partitioned repeatedly until all the data has been validated.
The partitioned data is then evaluated and finally represented as separate graphs. Taken
separately, these graphs give lower accuracy on the resulting dataset. To obtain better
accuracy, the Random Forest Algorithm takes the values from all the graphs and returns
only the necessary values, with better accuracy than the other algorithms considered.
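The four blocks of the confusion matrix can be counted directly from the true and predicted labels. The sketch below uses plain Python and hypothetical labels to show the counting; scikit-learn provides an equivalent in `sklearn.metrics.confusion_matrix`.

```python
def confusion_matrix(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = fraudulent, 0 = valid)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if truth == 1 and guess == 1:
            counts["TP"] += 1  # fraud correctly flagged
        elif truth == 0 and guess == 0:
            counts["TN"] += 1  # valid transaction correctly passed
        elif truth == 0 and guess == 1:
            counts["FP"] += 1  # valid transaction wrongly flagged (false alarm)
        else:
            counts["FN"] += 1  # fraud that was missed
    return counts


# Hypothetical true labels and classifier predictions for ten transactions.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
matrix = confusion_matrix(actual, predicted)
print(matrix)
```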
11. SYSTEM ARCHITECTURE
In our architecture, we start with a credit card dataset that contains all the details
about the credit card transactions. Here we take only the Amount and transaction Time for
analysis and preprocessing of the dataset. The next step is data cleaning, in which the
dataset is analyzed and all duplicate and null values are eliminated. The next step is data
partitioning, in which the credit card dataset is split into a training dataset and a
testing dataset. After that, the Random Forest Algorithm is applied and a confusion matrix
is obtained. Finally, performance analysis is carried out on the confusion matrix. This
analysis gives an accuracy of about 90% for the credit card fraud detection system.
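The cleaning and partitioning stages of this architecture can be sketched as follows. The row layout (Time, Amount, Class), the sample values, and the 80/20 split ratio are assumptions made for illustration; the report does not state the ratio that was used.

```python
import random


def clean(rows):
    """Drop rows with missing (None) values and exact duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row):
            continue  # null value present
        key = tuple(row)
        if key in seen:
            continue  # duplicate of an earlier row
        seen.add(key)
        cleaned.append(row)
    return cleaned


def train_test_partition(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical rows of (Time, Amount, Class).
raw = [
    (0, 149.62, 0),
    (0, 149.62, 0),   # duplicate row
    (1, 2.69, 0),
    (2, None, 0),     # missing amount
    (4, 378.66, 0),
    (7, 123.50, 1),
    (9, 69.99, 0),
]
cleaned_rows = clean(raw)
train_rows, test_rows = train_test_partition(cleaned_rows, test_fraction=0.2, seed=0)
print(len(cleaned_rows), len(train_rows), len(test_rows))
```

The classifier would then be fitted on `train_rows` and its confusion matrix computed on `test_rows`.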
13. FUTURE ENHANCEMENT
While we couldn’t reach our goal of 100% accuracy in fraud detection, we did
end up creating a system that can, with enough time and data, get very close
to that goal. As with any such project, there is some room for improvement
here. The very nature of this project allows for multiple algorithms to be
integrated together as modules and their results can be combined to increase
the accuracy of the result. This model can further be improved with the
addition of more algorithms into it. However, the output of these algorithms
needs to be in the same format as the others. Once that condition is satisfied,
the modules are easy to add as done in the code. This provides a great degree
of modularity and versatility to the project. More room for improvement can
be found in the dataset. As demonstrated before, the precision of the
algorithms increases when the size of the dataset is increased. Hence, more
data should make the model more accurate in detecting fraud and reduce the
number of false positives. However, this requires official support from the
banks themselves.
14. CONCLUSION
Credit card fraud is without a doubt an act of criminal dishonesty. This article
has listed the most common methods of fraud along with their detection
methods and reviewed recent findings in this field. This paper has also
explained in detail how machine learning can be applied to get better results
in fraud detection, along with the algorithm, pseudocode, an explanation of
its implementation, and experimental results. While the algorithm does reach
over 99.6% accuracy, its precision remains only at 28% when a tenth of the
data set is taken into consideration. However, when the entire dataset is fed
into the algorithm, the precision rises to 33%. This high accuracy is to be
expected, given the huge imbalance between the number of valid and the
number of fraudulent transactions.
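The gap between accuracy and precision on an imbalanced dataset can be checked with a short calculation. The dataset size and fraud rate are the figures given in the dataset description; the confusion-matrix counts for the second classifier are hypothetical values chosen only to illustrate the effect and are not results from this report.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all transactions that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Fraction of flagged transactions that really were fraudulent."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of fraudulent transactions that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


total = 284807                       # transactions in the dataset
fraud = round(total * 0.172 / 100)   # about 0.172% are fraudulent
valid = total - fraud

# A classifier that labels every transaction as valid catches no fraud at all,
# yet still scores a very high accuracy.
baseline_accuracy = accuracy(tp=0, tn=valid, fp=0, fn=fraud)
print(f"all-valid baseline accuracy: {baseline_accuracy:.4f}")

# Hypothetical classifier: flags 1400 transactions, 400 of them truly fraudulent.
tp, fp = 400, 1000
fn, tn = fraud - tp, valid - fp
print(f"accuracy:  {accuracy(tp, tn, fp, fn):.4f}")
print(f"precision: {precision(tp, fp):.4f}")
print(f"recall:    {recall(tp, fn):.4f}")
```

Because valid transactions dominate, accuracy stays high whatever happens to the fraud class, which is why precision and recall are the more informative measures here.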