PROJECT REPORT
on
Online Payment Fraud Detection
in
MACHINE LEARNING
Session 2023-24
By
DIVY ANANT VARSHNEY (23SCSE2030)
MONIKA GUPTA (23SCSE2030438)
YADVENDRA SINGH RATHAUR (23SCSE2030453)
April, 2024
SCHOOL OF COMPUTER APPLICATION AND TECHNOLOGY
CANDIDATE’S DECLARATION
I/We hereby certify that the work presented in this project, entitled “Online
Payment Fraud Detection”, in partial fulfillment of the requirements for the award of the
degree of MCA in the School of Computer Application and Technology of Galgotias University, Greater Noida,
is an original work carried out during the period August 2023 to January 2024, under the supervision of Dr. Prashant.
The matter presented in this project has not been submitted by me/us for the award of any other degree.
This is to certify that the above statement made by the candidates is correct to the best of
my knowledge.
CERTIFICATE
This is to certify that the Project Report entitled “Online Payment Fraud Detection”, which
was submitted by DIVY ANANT VARSHNEY (23SCSE2030), MONIKA GUPTA
(23SCSE2030438), and YADVENDRA SINGH RATHAUR (23SCSE2030453) in partial
fulfillment of the requirements for the award of the MCA degree in the Department of Computer
Science, School of Computer Application and Technology, Galgotias University, Greater
Noida, India, is a record of the candidates' own work carried out by them under my
supervision. The matter embodied in this report is original and has not been submitted for
the award of any other degree.
Supervisor(s)
TABLE OF CONTENTS
DECLARATION
CERTIFICATE
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
CHAPTER 2
CHAPTER 3
CHAPTER 4: IMPLEMENTATION AND RESULTS
CHAPTER 5: CONCLUSION
REFERENCES
ABBREVIATIONS
ML - Machine Learning
FP - False Positive
CHAPTER 1
INTRODUCTION
Online transactions are an easy target for fraud. E-commerce and other online platforms
have multiplied the number of online payment methods, raising the danger of online fraud.
With fraud rates rising, machine learning approaches can be used to identify and
evaluate fraud in online transactions. The primary goal of this project is to implement
supervised machine learning models for fraud detection by analyzing prior
transaction data, in which transactions are classified into distinct groups based on the
type of transaction. Various classifiers are then trained independently, and the
models are assessed for correctness. The classifier with the highest score can then be
picked as one of the best approaches for predicting fraud. We worked with the Kaggle
Synthetic Financial Datasets for Fraud Detection dataset collected by Edgar Lopez-Rojas.
In this project, K-Nearest Neighbor, Logistic Regression, Support Vector Machine
(SVM), Decision Tree, and Random Forest machine learning models are implemented for
the detection of fraudulent transactions. A comparative analysis of these algorithms is
performed to identify an optimal solution.
Mobile payment has become one of the mainstream payment methods, and thousands
of transactions are carried out on online trading platforms at any moment. The popularity of
network transactions gives criminals opportunities to commit crimes. Personal property
in this complex network environment is at risk of theft, which not only damages the
interests of consumers but also seriously affects the healthy development of the network
economy. Transaction fraud detection is therefore one of the key tools for addressing
network transaction fraud [1]. Traditional fraud detection mostly relies on statistical and
multi-dimensional analysis techniques; since these are verification techniques, it is
difficult for them to uncover the patterns hidden in transaction data. Big data technology
and machine learning algorithms provide efficient detection methods for transaction
fraud [2]. Compared to traditional statistical methods, machine learning can learn
important features from large amounts of data that the former cannot describe. Using an
appropriate machine learning method, we can build a model from existing transaction
data to detect network transaction fraud and thereby reduce the losses it causes.
In 2018, Zhaohui Zhang proposed a reconstructed-feature convolutional neural network
prediction model for transaction fraud detection, which showed better stability and
availability in classification than other convolutional neural network models [3].
However, its detection accuracy was limited by the imbalance of sample labels.
To meet these requirements, this paper proposes two fraud detection algorithms, based on
a Fully Connected Neural Network and on XGBoost. The former integrates two neural
network models with different cross-entropy loss functions, and the design process of the
integrated model is quick and convenient. The latter uses Hyperopt to optimize the
XGBoost classifier, so that the model is constructed with the best parameters and
achieves high fraud detection performance. The two algorithms suit different application
scenarios. To ensure good detection performance, we decided to use the XGBoost model
to build the online transaction fraud detection system. This system has clear advantages
in running time and applicability, and can accurately predict the fraud probability of
network transaction behavior.
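A minimal sketch of Hyperopt-driven tuning of an XGBoost classifier, as described above; the search space, evaluation budget, and synthetic data are illustrative assumptions, not the project's actual configuration.

```python
# Sketch: tune an XGBoost classifier with Hyperopt's TPE search (assumed setup).
from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

space = {
    "max_depth": hp.choice("max_depth", [3, 5, 7]),
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.3),
    "n_estimators": hp.choice("n_estimators", [100, 200, 400]),
}

def objective(params):
    model = XGBClassifier(**params, eval_metric="logloss")
    # Minimize 1 - mean cross-validated accuracy.
    return 1.0 - cross_val_score(model, X, y, cv=3).mean()

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=Trials())
print("best hyperparameters:", best)
```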
2. *Enhance Security*:
- Protect sensitive financial and personal information from unauthorized access and
breaches.
- Ensure that legitimate customers have a seamless and hassle-free experience without
frequent false positives that might inconvenience them.
4. *Compliance with Regulations*:
- Adhere to legal and regulatory requirements related to financial transactions and data
privacy, such as PCI DSS (Payment Card Industry Data Security Standard).
5. *Adaptive Detection*:
- Continuously update and improve fraud detection algorithms and rules to adapt to
evolving fraud tactics and schemes.
7. *Cost-Effectiveness*:
- Optimize the balance between fraud detection accuracy and operational costs to ensure
the solution is financially sustainable.
- Utilize transaction data to gain insights into fraud patterns and trends, enabling
proactive measures and strategic decisions.
2. *Data Analysis:*
- *Device and Network Data:* Tracking device IDs, IP addresses, and network
behaviors to spot inconsistencies.
3. *Technological Tools:*
- *GDPR, PCI DSS, and PSD2:* Adhering to regulations and standards that
mandate security measures for protecting payment data and ensuring user
privacy.
5. *Industry Applications:*
- *Banking and Financial Services:* Securing online banking, wire transfers, and
credit card transactions.
6. *Challenges:*
- *Data Privacy:* Ensuring fraud detection methods comply with privacy laws and
regulations.
7. *Future Trends:*
- *Enhanced AI Capabilities:* Leveraging advanced AI for more accurate and
faster fraud detection.
CHAPTER 2
2.1 System Design
The system first acquires the raw transaction data uploaded by users and obtains the specified
data by preprocessing. It then uses the XGBoost-based fraud detection algorithm to
predict the probability that a transaction is fraudulent. If the detected probability
exceeds the set threshold (0.85 in this system), a warning is issued
immediately. Finally, the detection results are saved in the database and returned to the
user. The system flow chart is shown in Figure 1.
Figure 1. System Flow Chart
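As a rough illustration of the scoring-and-threshold step in Figure 1, here is a minimal sketch; the function name and feature layout are assumptions, not the system's actual interface.

```python
# Sketch: score one preprocessed transaction and flag it if the fraud
# probability exceeds the system threshold of 0.85.
import numpy as np
import xgboost as xgb

THRESHOLD = 0.85  # fraud-probability threshold used by the system

def detect_fraud(model: xgb.XGBClassifier, features: np.ndarray) -> dict:
    """Return the fraud probability and whether a warning should be raised."""
    proba = float(model.predict_proba(features.reshape(1, -1))[0, 1])
    return {"fraud_probability": proba, "alert": proba > THRESHOLD}
```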
Fraud Detection Algorithm Based on a Fully Connected Neural Network
The fraud detection algorithm based on a Fully Connected Neural Network consists of three
steps: feature selection, data preprocessing, and neural network construction [4].
(1) Feature selection: The experiment contains two types of data sets, Transaction and Identity,
which come from the payment service company Vesta. Since selecting all the features did not
significantly improve the model, we decided to train the neural network model using only the
features of the Transaction data set.
(2) Data preprocessing: Features in the Transaction data set can be divided into continuous features
and categorical features. For continuous features, we first apply a logarithmic transformation so that
the processed data approximate a standard normal distribution; Z-score standardization is then
carried out on the data, so that each dimension is dimensionless and differing scales do not
dominate distance calculations. For categorical features, One-Hot Encoding is applied to generate
feature vectors, but only for the top 50 most common values of each feature, to reduce sparsity.
(3) Neural network construction: In this step, we use Keras to build the Fully Connected Neural
Network.
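A minimal sketch of the preprocessing described in step (2); the column lists passed in are the caller's responsibility, and the scheme below (log transform, Z-score, top-50 One-Hot) follows the text rather than any published reference code.

```python
# Sketch: log-transform and standardize continuous features; one-hot encode
# only the 50 most common values of each categorical feature.
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, continuous: list, categorical: list) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    for col in continuous:
        x = np.log1p(df[col])                # logarithmic transformation
        out[col] = (x - x.mean()) / x.std()  # Z-score standardization
    for col in categorical:
        top50 = df[col].value_counts().index[:50]  # keep 50 most common values
        kept = df[col].where(df[col].isin(top50))  # rarer values become NaN
        out = out.join(pd.get_dummies(kept, prefix=col))  # One-Hot Encoding
    return out
```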
The classification problem is a formalized task involving a set of objects (situations)
divided in a certain way into classes. A finite set of objects is specified for which we know
to which class each belongs; this set is called a sample. There is no information about the
remaining objects, so we do not know to which classes they belong. The aim is to create an
algorithm able to classify an arbitrary object from the initial set. To classify an
object means to indicate the number (or name) of the class to which it belongs.
The classification of an object is the number or class name issued by the classification
algorithm as a result of its application to that particular object [1]. In mathematical statistics,
classification problems are also known as problems of discriminant analysis. In machine
learning, classification problems can be solved with artificial neural
network methods, in particular by staging an experiment in the form of supervised
training. Let X be a set of object descriptions and Y a set of numbers (or names) of
classes. There is an unknown target dependence, the mapping (1), whose values are known
only for the elements of a finite learning sample (2). The aim is to construct an algorithm (3)
capable of classifying any arbitrary object x ∈ X [2].
y*: X → Y (1)
Xm = {(x1, y1), …, (xm, ym)} (2)
a: X → Y (3)
The probabilistic formulation of the problem is more general. It assumes that the set of pairs
“object-class” X × Y is a probabilistic space with an unknown probability measure P, and that
the finite learning sample of observations (2) is generated according to P. The aim is again
to construct an algorithm (3) capable of classifying an arbitrary object x ∈ X.
A characteristic is a mapping (4), where Df is the set of permissible values of the
characteristic.
f: X → Df (4)
If the characteristics f1, …, fn are given, then the vector (5) is called the characteristic
description of the object x ∈ X.
(f1(x), …, fn(x)) (5)
Characteristics can be identified with the objects themselves; in this case, the set (6) is
called the space of characteristics.
Df1 × … × Dfn (6)
Depending on the set Df, characteristics are divided into types; for example, binary
characteristics have Df = {0, 1}. Classification tasks themselves fall into the following types:
• Two-class classification, which is technically the easiest case and serves as the basis for
solving more complex tasks;
• Multiclass classification; when the number of classes reaches thousands (for example, when
recognizing hieroglyphs or continuous speech), the classification task becomes significantly
more difficult;
• Non-overlapping classes;
• Fuzzy classes, where it is necessary to determine the degree to which an object belongs to each
of the classes, usually a real number from 0 to 1 [2]. In our case, we are interested in
binary (two-class) classification.
Bertrand Lebichot and Yann-Aël Le Borgne have researched the problem in the publication
“Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection”
[3]. They worked on the design of automatic Fraud Detection Systems (FDS) able to detect
fraudulent transactions with high precision and to deal with the heterogeneous nature of
fraudster behavior. Indeed, the nature of fraud behavior may differ strongly according
to the payment system (e.g., e-commerce or shop terminal), the country, and the population
segment. Another relevant publication is “Improving Card Fraud Detection Through Suspicious
Pattern Discovery” by Olivier Caelen and Evgueni N. Smirnov [4]. They proposed a new
approach to detecting credit card fraud based on suspicious payment patterns: their
hypothesis is that fraudsters use stolen credit card data at specific, recurring sets of shops,
and they exploited this behavior to identify fraudulent transactions.
The problem was also addressed in the article “Calibrating Probability with Undersampling for
Unbalanced Classification” by Andrea Dal Pozzolo, Olivier Caelen, and Gianluca
Bontempi [5]. In this paper, they study analytically and experimentally how undersampling
affects the posterior probability of a machine learning model. They formalize the problem
of undersampling and explore the relationship between the conditional probability in the
presence and absence of undersampling, then use Bayes Minimum Risk theory to find the
correct classification threshold and show how to adjust it after undersampling.
Logistic regression is suitable for solving the classification problem. It is a statistical
regression method used when the dependent variable is categorical and can therefore
take only two values (or, more generally, a finite set of values) [6].
Let some variable Y take only two values, usually denoted by the numbers 0 and 1, and let
this value depend on a set of explanatory variables (7) through a latent variable y*. Then (9):
Y = 0 if y* ≤ 0; Y = 1 if y* > 0 (9)
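A minimal sketch of fitting such a classifier with scikit-learn; the synthetic balanced data stands in for the transaction features and is an assumption for illustration only.

```python
# Sketch: train and score a logistic-regression classifier on a balanced
# binary-classification dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```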
The next tool is the support vector method, a data analysis approach for classification
and regression using supervised models with associated learning algorithms,
called support vector machines (SVM). Given a set of training samples, each of
which is marked as belonging to one of two categories, the SVM training algorithm
builds a model that assigns new samples to one category or the other, making it
a non-probabilistic binary linear classifier. The SVM model is a representation of the samples as
points in space, mapped so that samples from the separate categories are
divided by a gap that is as wide as possible. New samples are then mapped into the same
space and predicted to belong to a category based on which side of the gap they fall.
The next classification method uses a slightly different approach. The method of k nearest
neighbours is a simple nonparametric classification method in which the distances
(usually Euclidean), computed to all other objects in the feature space, are used to
classify an object: the objects at the smallest distance are selected and allocated to a
separate class.
The basic principle of the nearest-neighbour method is that an object is assigned to
the class that is most common among its neighbours. The neighbours are taken from a set
of objects whose classes are already known; based on the value of k, which is the key
parameter of the method, the class that is most numerous among them is determined.
Each object has a finite number of attributes (dimensions), and it is assumed that
a certain set of objects with an existing classification is available [7].
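A minimal sketch of k-nearest-neighbour classification as just described; the synthetic data is an assumed stand-in for the (anonymized) transaction features.

```python
# Sketch: classify with k = 5 nearest neighbours under the Euclidean distance.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=28, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")  # Euclidean distance
knn.fit(X[:800], y[:800])                                      # known-class objects
print("test accuracy:", knn.score(X[800:], y[800:]))
```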
The next method for solving classification tasks is the decision tree, used in
statistics and data analysis for predictive models.
The tree structure contains the following elements: “leaves” and “branches”. On the edges
(“branches”) of a decision tree, the attributes on which the target function depends are
written; the “leaves” hold the values of the target function, while the other nodes contain
the attributes that distinguish the cases. To classify a new case, we go down the tree to
a leaf and return the corresponding value. Such decision trees are widely used in
intelligent data analysis. The goal is to create a model that predicts the value of the target
variable based on multiple input variables [7].
Each leaf represents a value of the target variable, refined in the course of the movement
from the root to the leaf, and each internal node corresponds to one of the input variables. A
tree can be “learned” by splitting the source sets of variables into subsets based
on testing attribute values; this process is repeated on each derived
subset. The recursion ends when the subset at a node has the same value of the target
variable, so that splitting adds no value to the predictions. This top-down induction of
decision trees (TDIDT) is an example of a greedy algorithm, and it is by far the most
widespread strategy for learning decision trees from data, though not the only possible one.
The decision trees used in data mining are of two main types:
• Classification tree analysis, when the predicted result is the class to which the data
belongs;
• Regression tree analysis, when the predicted result can be considered a real
number (e.g., the price of a house, or a patient's length of stay in a hospital) [8].
In the context of the current task, we are interested in the first type of decision tree, for
solving classification issues.
Artificial neural networks can also be used to solve classification problems. An artificial
neural network is a network of simple elements, neurons, which receive
input, change their internal state (activation) according to that input, and produce an output
that depends on the input and the activation. The network is formed by connecting the outputs of
certain neurons to the inputs of other neurons, forming a directed weighted
graph. The weights, as well as the functions that compute the activations, can be modified by
a process called learning, which is governed by a learning rule [7].
A neuron with label j, receiving input pj(t) from its predecessor neurons,
consists of the following components:
• a threshold θj (for a binary neuron), which remains unchanged unless modified by the
learning function;
• an activation function f, which computes the new activation at time t+1 from
aj(t), θj, and the network input pj(t), yielding the relation (10). The activation function is
applied to all layers except the last one (where the output function is applied), and each
intermediate connection has its own activation function.
aj(t+1) = f(aj(t), pj(t), θj) (10)
The output function is often simply the identity function. An input neuron has no predecessors
and serves as the input interface for the whole network; similarly, an output neuron has no
successors and serves as the output interface of the network.
The network consists of connections, each of which transmits the output of a neuron i to
the input of a neuron j. In other words, i is the predecessor (parent) of j, and j is the
successor (child) of i. Each such connection is assigned a weight wij.
3) Distribution Functions
The distribution function computes the input pj(t) to neuron j from the outputs oi(t)
of the predecessor neurons and usually has the form:
pj(t) = Σi oi(t) wij (12)
4) The Learning Rule
The learning rule is a rule or algorithm that modifies the parameters of the neural network so
that a given input to the network produces a suitable output. This learning process usually
involves changing the weights and thresholds of the network variables [7]. There are three
main learning paradigms, each corresponding to a particular learning objective:
supervised learning, unsupervised learning, and reinforcement learning [7]. We
are interested in the first paradigm, because it is used to solve classification problems.
Supervised learning uses a set of example pairs (x, y), x ∈ X, y ∈ Y, and aims at
finding a function (13) in the permitted class of functions that matches these examples.
f: X → Y (13)
In other words, we want to infer the mapping implied by the data; the cost function is
related to the mismatch between our mapping and the data, and it implicitly contains
prior knowledge about the problem domain. Tasks that fit the supervised learning
paradigm are pattern recognition (also known as classification) and regression (also known
as function approximation). The supervised learning paradigm is also applicable to sequential
data (for example, handwriting, speech, and gesture recognition). It can be seen
as learning with a “teacher”, in the form of a function that provides continuous feedback on
the quality of the solutions obtained so far.
CHAPTER 3
To investigate this problem and find a solution, a database [9] of a payment system with
transaction records was obtained. The database reflects transactions executed within 2
days and contains 284,807 transactions in total, of which 492 are fraudulent (0.172%). The
dataset was gathered by Worldline and ULB (Université Libre de Bruxelles) and prepared
using several approaches: their private software algorithms, manual testing, and
customers' feedback, which resulted in the merged dataset. The
database consists only of numerical data. For confidentiality, the fields of the database are
anonymized. Because of this, it is not possible to describe the peculiarity to which each
field corresponds, or to give a more precise description of the data from an economic
point of view.
All 28 parameters (V1, V2, ..., V28) were obtained using principal component analysis
(PCA), a statistical procedure that uses an orthogonal transformation to convert a set of
observations of possibly correlated variables (entities each of which takes on various
numerical values) into a set of values of linearly uncorrelated variables called principal
components. The transformation is defined in such a way that the first principal component
has the largest possible variance (that is, it accounts for as much of the variability in the
data as possible), and each succeeding component in turn has the highest variance possible
under the constraint that it is orthogonal to the preceding components. The resulting
vectors (each a linear combination of the variables, containing n observations) form an
uncorrelated orthogonal basis set.
The only two fields that have not been transformed are “time” and “amount”. The “time”
value shows the number of seconds that passed between the given transaction and the first
transaction; the “amount” field shows the sum of money that went through the
transaction. All other fields have no labels or legend for security and privacy
reasons: the bank decided not to disclose what exactly these fields are, giving only their
transformed numerical values. The data set is very unbalanced, since the target class,
fraudulent transactions, makes up only 0.17% of all transactions (Figure 1). If we use the
data as-is to construct models, we will probably get many false classifications due to
overfitting: the resulting model will assume that almost any transaction is likely to be a
verified one.
We need to create a balanced subset of the data with the same frequency of fraudulent and
verified transactions, which will help the subsequent algorithms show more accurate results.
What will the subset of data be? In our case, this will be a dataset with a 50/50 ratio of
verified and fraudulent operations: the number of fraudulent and normal operations will be
the same.
Why create a subset of data? We found that the initial data set is very unbalanced. Its
use can create the following problems:
• Overfitting. Since almost all records are verified, our model will empirically assign
almost every transaction as non-fraudulent.
• Wrong correlations. Although we do not know what exactly each “V” field corresponds to,
it will definitely be useful to understand how each of them affects the target function.
With an unbalanced data set, the correlation matrix will be fuzzy and shifted
toward non-fraud transactions [8].
Before applying random subsampling to the training data, we must divide the initial
data set into a training set and a test set. Data balancing techniques (over-sampling or
sub-sampling) should be applied only to the training set used to create the model, while
model testing should be done on the initial dataset. In the next step, we apply
random sub-sampling, which removes entries from the class whose count is larger.
Thus we achieve a 50/50 ratio by excluding verified transactions (Figure 2), as sketched below.
Correlation matrices are the basis for understanding the data: it is interesting to
understand which arguments significantly affect the classification of a transaction.
A comparison of the matrices for the balanced and unbalanced data sets is particularly
indicative (Figure 3).
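A minimal sketch of the random sub-sampling step; the column name "Class" (1 = fraud) follows the ULB credit-card dataset convention and is an assumption about the local schema.

```python
# Sketch: build a 50/50 balanced training subset by dropping majority rows.
import pandas as pd
from sklearn.model_selection import train_test_split

def balance_by_subsampling(df: pd.DataFrame, label: str = "Class", seed: int = 42) -> pd.DataFrame:
    fraud = df[df[label] == 1]
    legit = df[df[label] == 0].sample(n=len(fraud), random_state=seed)  # drop majority rows
    return pd.concat([fraud, legit]).sample(frac=1, random_state=seed)  # shuffle the 50/50 mix

# Split first, then balance only the training portion, as the text recommends:
# train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# balanced_train = balance_by_subsampling(train_df)
```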
Fig. 2. Histogram of equally distributed classes after sub-sampling.
• negative correlation: V10, V12, V14, V17. The smaller the value of these variables is, the
more likely the transaction will be fraudulent.
• positive correlation: V2, V4, V11, V19. The larger the value of the variable is, the more
likely the operation is fraudulent [8].
Before we begin, we need to divide our data into training and test subsets.
Of course, computing large volumes of data, deriving the result and, most
importantly, doing so at high speed requires computing machines. In practice,
there are many tools and technologies for data processing, but the most popular are Python
and R. Which language to use is entirely up to the user; the mathematical and statistical
methods described above are implemented in both environments. This work is done in
Python [10], but all the same techniques and methods are implemented in R with comparable
accuracy [12].
Fig. 4. Logistic regression learning curve.
Logistic regression showed the best accuracy, with an estimate of 94%. This is a training
result, obtained by assessing how precisely the model determines fraud on the training
sample. For a more accurate result, we check the resulting models on the test sample
(remember that this is still a balanced sample, so the result will still be approximate).
As the obtained results show, the logistic regression method performed best,
with a result of 94% on the training sample and 93.52% on the test sample (the best result
was evaluated as the maximum arithmetic mean of the two indicators [13]). The
k-nearest neighbors method and the support vector method also showed fairly
precise results, and the support vector method performed even better than logistic
regression on the test sample, at 93.78%.
For a more detailed demonstration of the results, we output a confusion matrix [14] for
the logistic regression method. The upper-left and lower-right squares (yellow) hold the
correct results; the other squares (black) hold the wrong results.
Fig. 6. K-nearest neighbours learning curve.
Fig. 8. Logistic regression results’ confusion matrix.
To create the neural network, the same Python software stack, based on TensorFlow,
was used.
The structure of the neural network: a simple model that consists of one input layer, one
hidden layer of 32 nodes, and one output layer that can take one of two possible values: 0
or 1.
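A minimal Keras sketch of that architecture; the input width of 30 features is an assumption based on the dataset's 28 PCA components plus "time" and "amount", not a confirmed detail of the original model.

```python
# Sketch: one hidden layer of 32 nodes, sigmoid output for the binary label.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30,)),
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer, 32 nodes
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output: fraud probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20)  # 20 iterations, as used in the text
```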
We train the neural network twice: first on sub-sampled data, and then on over-sampled
data. In the first case, we narrow the data to a 50/50 ratio, randomly dropping a
significant portion of the verified transactions. During over-sampling, we expand the
data by adding new records of fraudulent transactions, generated based on the existing
fraudulent records.
To train the neural network, 20 iterations were performed on the corresponding data
set. After training, we evaluate the network on the original data set
and compare the results between the neural networks themselves and the best classifiers.
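The text generates synthetic fraud records from existing ones; a common way to do this is SMOTE, shown here as an assumed, illustrative choice (the imbalanced-learn library), not necessarily the method the authors used.

```python
# Sketch: over-sample the fraud class by synthesizing new minority records.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)  # classes now balanced 50/50
print(sum(y_res == 1), sum(y_res == 0))
```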
Fig. 9. Confusion matrix for the neural network, trained on sub-sampling.
Fig. 10. Confusion matrix for the neural network, trained on over-sampling.
As Figure 9 shows, the neural network trained on the sub-sampled data classified a
significant part of the verified transactions (Y-axis) as fraudulent, but let only 1
fraudulent transaction pass. Overall, the score of this neural network was 93.1%.
Over-sampling (data expansion) showed the best result (Figure 10) among both neural
networks and all models in general, demonstrating 99.9% correct
classifications. However, it should be noted that 24 fraudulent transactions passed,
so the percentage of blocked fraudulent transactions is lower.
4 Proposed System
The proposed system aims to bolster the security of online payment systems by employing
advanced data analysis techniques and machine learning algorithms to detect and prevent
fraudulent transactions effectively.
4.1 Data Collection: The system collects comprehensive transactional data from various
sources, including payment gateways, merchants, and financial institutions. This data
encompasses transaction amounts, timestamps, user demographics, device information, and
transaction histories.
4.2 Data Preprocessing:
Upon collection, the raw transactional data undergoes preprocessing, including data
cleaning, normalization, and feature engineering. Missing values are handled, outliers are
identified and treated, and relevant features are extracted or transformed to enhance model
performance.
4.3 Feature Selection:
Feature selection techniques, such as correlation analysis and feature importance ranking,
are employed to identify the most discriminative features for fraud detection. This step
helps reduce dimensionality and improve model efficiency.
4.4 Model Development:
The system utilizes machine learning algorithms, including supervised and unsupervised
techniques, to build robust fraud detection models. Supervised algorithms such as logistic
regression, decision trees, and ensemble methods learn from labeled data to classify
transactions as either legitimate or fraudulent. Unsupervised algorithms such as clustering
and anomaly detection identify unusual patterns indicative of fraudulent activities without
the need for labeled data.
4.5 Model Training and Evaluation:
The selected models are trained on historical transaction data and evaluated using
appropriate performance metrics such as accuracy, precision, recall, and F1-score. Cross-
validation techniques ensure the generalizability of the models, while hyperparameter
tuning optimizes their performance.
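A minimal sketch of such cross-validated evaluation; the synthetic data and decision-tree model are illustrative stand-ins for the actual transactions and models.

```python
# Sketch: 5-fold cross-validated accuracy, precision, recall, and F1-score.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
scores = cross_validate(DecisionTreeClassifier(random_state=0), X, y, cv=5,
                        scoring=["accuracy", "precision", "recall", "f1"])
for name, vals in scores.items():
    if name.startswith("test_"):
        print(name, round(vals.mean(), 3))
```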
4.6 Real-Time Monitoring:
The trained models are deployed in a real-time monitoring system that continuously
analyzes incoming transactions for signs of fraud. Transactions flagged as suspicious
trigger immediate alerts for further investigation by fraud analysts or automated response
mechanisms.
4.7 Adaptive Learning:
The system incorporates adaptive learning mechanisms to continuously update and refine
the fraud detection models based on new data and emerging fraud trends. Feedback loops
enable the system to adapt to evolving fraud tactics and maintain high detection accuracy
over time.
4.8 Reporting and Visualization:
Comprehensive reports and visualizations are generated to provide insights into the
effectiveness of the fraud detection system. Key performance indicators, trends, and
patterns are communicated to stakeholders to support decision-making and strategic
planning.
4.9 Flowchart
Here is a brief explanation of the model-training flowchart in Figure 3:
1. Start: The flowchart begins with the start symbol, indicating the beginning of the decision tree algorithm.
2. Load Dataset: The algorithm loads the dataset, which contains the input features and the target variable.
3. Define Features and Target: The feature columns and the target column are defined, specifying the variables to be used for training the decision tree.
4. Split Data: The dataset is split into training and testing sets using the train_test_split function, allocating a portion of the data for model evaluation.
5. Data Imputation: A SimpleImputer object is used to handle missing values in the dataset, replacing them with the mean value of the respective feature.
6. Build Decision Tree: A DecisionTreeClassifier object is created, representing the decision tree model. It is trained on the training data using the fit function.
7. Predictions: The trained decision tree is used to make predictions on the test set, using the predict function.
The flowchart concludes with the end symbol, indicating the completion of the decision tree algorithm. The flowchart shown in Figure 3 provides a visual representation of the steps involved in training and evaluating the decision tree model, aiding understanding of the overall process and facilitating communication between stakeholders.
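A minimal sketch of the pipeline the flowchart describes; the CSV path and the "isFraud" column name are placeholders, not the project's actual file layout.

```python
# Sketch: the load / split / impute / fit / predict steps from the flowchart.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("transactions.csv")                  # 2. load dataset (placeholder path)
X, y = df.drop(columns=["isFraud"]), df["isFraud"]    # 3. define features and target
X_train, X_test, y_train, y_test = train_test_split(  # 4. split data
    X, y, test_size=0.2, random_state=42)

imputer = SimpleImputer(strategy="mean")              # 5. impute missing values with the mean
X_train, X_test = imputer.fit_transform(X_train), imputer.transform(X_test)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)  # 6. build and fit
predictions = tree.predict(X_test)                    # 7. predict on the test set
```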
CHAPTER 4
IMPLEMENTATION AND RESULTS
4.11 Methodology
The methodology covers the algorithm used, the dataset used, and the flowchart of the data
as implemented. A step-by-step explanation of the algorithm follows.
Algorithm Used: The decision tree algorithm is a widely used supervised learning
technique employed for both classification and regression tasks. It constructs a structured
model resembling a flowchart, driven by the input features.
Tree Construction: The algorithm starts by considering the entire dataset as the
root node and selects the optimal feature for partitioning the data.
Feature Split: The chosen feature is used to divide the data into subsets, creating
branches or paths within the decision tree.
Recursive Splitting: The feature-splitting process is applied iteratively to each subset
until a predefined stopping criterion is satisfied.
Leaf Node Assignment: Leaf nodes are assigned class labels or regression values based
on the majority class or the mean value of the target variable within each respective subset.
Prediction: To make a prediction, the algorithm traverses the decision tree by evaluating
feature values until it reaches a leaf node, which holds the final prediction.
Advantages: easy to comprehend and interpret; accommodates numerical and categorical
data; handles missing values gracefully; captures nonlinear relationships effectively.
Limitations: prone to overfitting, necessitating proper regularization techniques; can be
sensitive to changes in the dataset, leading to instability; exhibits bias towards features
with high cardinality or many levels.
In conclusion, decision trees offer versatility and transparency in modeling.
4.12 Result
The goal was to predict whether a transaction is a legitimate or a fraudulent one, which
falls under the scope of a classification problem. We deploy supervised machine learning
models to achieve the highest prediction accuracy. K-Nearest Neighbor, Logistic
Regression, Support Vector Machine, Decision Tree, and Random Forest models were
trained using the k-fold technique. Training used 5 folds in total, and with each fold the
accuracy of the model kept increasing up to the 5th fold; after the 5th fold, accuracy
started decreasing because our dataset was not large enough for more than 5 folds. So the
final model was trained on 5 folds with 88.55% average accuracy. This suggests that
training Random Forest with a bigger data set using the k-fold technique would yield an
even higher average accuracy.
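A minimal sketch of 5-fold cross-validated training as described above; the synthetic data is an assumed stand-in for the project's transaction features.

```python
# Sketch: per-fold and mean accuracy over 5 folds for a Random Forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print("per-fold accuracy:", scores.round(4), "mean:", scores.mean().round(4))
```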
As a result, the Decision Tree model had the greatest prediction accuracy, 99.92%, with a
recall of 86.96%. Due to the huge amount of data, the Support Vector Machine and
Random Forest models could not finish training, even on Google Colab. Further work can
be done by under-sampling the data to a 50:50 ratio, which would reduce the data size even
more, so that SVM and Random Forest results could be compiled accurately. Only initial
results are reported; final results could not be compiled due to insufficient computing power.
CHAPTER 5
CONCLUSION
Logistic regression reaches up to 94% correct classifications, while the neural network
shows 93.1% on the sub-sampled data and as much as 99.9% on the over-sampled data,
but the latter misses a significant number of fraudulent operations. On the one hand, the
accuracy of the neural network with over-sampling is higher; on the other hand, it misses
most of the fraudulent operations, although it classifies the verified ones better.
Logistic regression showed average accuracy, but it also missed a significant part of the
fraudulent transactions. Although the neural network on the sub-sampled data showed the
worst overall result, 93.1%, it prevented the largest number of fraudulent transactions.
In general, the choice of model depends on the specific situation: whether clients are
prepared to occasionally have a transaction denied in exchange for confidence that their
funds will not be taken by fraud, or whether they care more about ease of use, with
security being less important.
References
1. Design and development of financial fraud detection using machine learning. (2024). International Journal of Emerging Trends in Engineering Research, 8(9), 5838–5843. https://doi.org/10.30534/ijeter/2020/152892020
2. Rucco, M., Giannini, F., Lupinetti, K., & Monti, M. (2019). A methodology for part classification with supervised machine learning. Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 33(1), 100–113. https://doi.org/10.1017/S0890060418000197
3. Saarikoski, J., Joutsijoki, H., Järvelin, K., Laurikkala, J., & Juhola, M. (2015). On the influence of training data quality on text document classification using machine learning methods. International Journal of Knowledge Engineering and Data Mining, 3(2), 143. https://doi.org/10.1504/IJKEDM.2015.071284