
A Project Report

On

Online Payment Fraud Detection


Submitted in partial fulfillment of the

requirement for the award of the degree of

MASTER OF COMPUTER APPLICATIONS
Session 2023-24
in

[MACHINE LEARNING]
By
[DIVY ANANT VARSHNEY 23SCSE2030]
[MONIKA GUPTA 23SCSE2030438]
[YADVENDRA SINGH RATHAUR 23SCSE2030453]

Under the guidance of


[Dr. PRASHANT JOHRI]

SCHOOL OF COMPUTER APPLICATION AND TECHNOLOGY

GALGOTIAS UNIVERSITY, GREATER NOIDA, INDIA

April, 2024

SCHOOL OF COMPUTER APPLICATION AND TECHNOLOGY

GALGOTIAS UNIVERSITY, GREATER NOIDA

CANDIDATE’S DECLARATION

I/We hereby certify that the work presented in the project entitled “Online Payment Fraud Detection”, submitted in partial fulfillment of the requirements for the award of the MCA (Master of Computer Applications) degree in the School of Computer Application and Technology of Galgotias University, Greater Noida, is an original work carried out during the period August 2023 to January 2024 under the supervision of Dr. Prashant Johri, Department of Computer Science and Engineering / School of Computer Application and Technology, Galgotias University, Greater Noida.

The matter presented in this thesis/project/dissertation has not been submitted by me/us for the award of any other degree of this or any other institution.

[DIVY ANANT VARSHNEY 23SCSE2030]


[MONIKA GUPTA 23SCSE2030438]
[YADVENDRA SINGH RATHAUR 23SCSE2030453]

This is to certify that the above statement made by the candidates is correct to the best of

my knowledge.

Dr. Prashant Johri (Professor)

CERTIFICATE

This is to certify that the Project Report entitled “Online Payment Fraud Detection”, which has been submitted by [DIVY ANANT VARSHNEY 23SCSE2030], [MONIKA GUPTA 23SCSE2030438], and [YADVENDRA SINGH RATHAUR 23SCSE2030453] in partial fulfillment of the requirement for the award of the MCA degree in the Department of Computer Science, School of Computer Application and Technology, Galgotias University, Greater Noida, India, is a record of the candidates' own work carried out by them under my supervision. The matter embodied in this thesis is original and has not been submitted for the award of any other degree.

Signature of Examiner(s)                          Signature of Supervisor(s)

Date: June 2024

Place: Greater Noida

TABLE OF CONTENTS                                                        Page

DECLARATION ............................................................... 2
CERTIFICATE ............................................................... 3
LIST OF ABBREVIATIONS ..................................................... 5
CHAPTER 1  INTRODUCTION ................................................... 6
  1.1 Problem Introduction ................................................ 7
  1.2 Project Objective ................................................... 7-8
  1.3 Scope of the Project ................................................ 8
CHAPTER 2 ................................................................. 11
  2.1 System Design ....................................................... 11
  2.2 System Realization .................................................. 11-12
  2.3 The Notion of the Classification Problem and Its Characteristics .... 12-14
  2.4 Methods of Solving the Classification Problem ....................... 14-17
CHAPTER 3  PRACTICAL EXAMPLE .............................................. 18
  3.1 Overview and Description of the Transaction Database ................ 18
  3.2 Initial Analysis of the Transaction Database ........................ 19
  3.3 Creation and Evaluation of the Fraud Detection Classifiers .......... 20-22
  3.4 Fraud Detection Using Neural Networks ............................... 22-26
CHAPTER 4  IMPLEMENTATION AND RESULTS ..................................... 27-28
CHAPTER 5  CONCLUSIONS .................................................... 29
REFERENCES ................................................................ 30

ABBREVIATIONS

1. ML - Machine Learning
2. AML - Anti-Money Laundering
3. API - Application Programming Interface
4. CNP - Card Not Present
5. CVV - Card Verification Value
6. DNN - Deep Neural Network
7. ETL - Extract, Transform, Load
8. FDS - Fraud Detection System
9. FP - False Positive
10. FN - False Negative
11. KYC - Know Your Customer
12. OTP - One-Time Password
13. PCI DSS - Payment Card Industry Data Security Standard
14. RNN - Recurrent Neural Network

CHAPTER 1

INTRODUCTION
Online transactions are a simple and easy target for fraud. E-commerce and other online sites have increased the number of online payment methods, raising the danger of online fraud. With the rise in fraud rates, machine learning approaches can be used to identify and evaluate fraud in online transactions. The primary goal of this project is to implement supervised machine learning models for fraud detection by analyzing prior transaction information, where transactions are classified into distinct groups based on the type of transaction. Following that, various classifiers are trained independently, and the models are assessed for correctness. The classifier with the highest rating score can then be picked as one of the best approaches for predicting fraud. We worked with the Kaggle Synthetic Financial Datasets for Fraud Detection dataset collected by Edgar Lopez-Rojas. In this project, K Nearest Neighbor, Logistic Regression, Support Vector Machine (SVM), Decision Tree, and Random Forest machine learning models are implemented for the detection of fraudulent transactions. A comparative analysis of these algorithms is performed to identify an optimal solution.

Mobile payment has become one of the mainstream payment methods, and thousands of transactions are carried out on online trading platforms all the time. The popularity of network transactions gives criminals an opportunity to commit crimes. Personal property in the complex network environment is at risk of theft, which not only damages the interests of consumers but also seriously affects the healthy development of the network economy. Transaction fraud detection is therefore one of the key tools for solving the problem of network transaction fraud [1]. Traditional fraud detection mostly adopts statistical and multi-dimensional analysis techniques; since these are verification techniques, it is difficult for them to uncover the laws hidden behind the transaction data. Big data technology and machine learning algorithms provide efficient detection methods for transaction fraud detection [2]. Compared to traditional statistical methods, machine learning can represent important features through a large amount of data, which cannot be described by the former. By using an appropriate machine learning method, we can build a model based on existing transaction data to detect network transaction fraud and thereby reduce the loss caused by fraud. In 2018, Zhaohui Zhang proposed a reconstructed-feature convolutional neural network prediction model applied to transaction fraud detection, which has better stability and availability in classification effect compared with other convolutional neural network models [3]. However, its detection accuracy is limited by the imbalance of sample labels. To meet these requirements, this paper proposes two fraud detection algorithms, one based on a Fully Connected Neural Network and one based on XGBoost. The former integrates two neural network models with different cross-entropy loss functions, and the design process of the integrated model is quick and convenient. The latter uses Hyperopt to optimize the XGBoost classifier, so that the model can be constructed with the best parameters and achieve high fraud detection performance. The two algorithms suit different application scenarios. To ensure good detection performance, we decided to use the XGBoost model to build an online transaction fraud detection system. This system has clear advantages in running time and applicability, and can accurately predict the fraud probability of network transaction behaviors.

1.1 Problem Introduction


Online payment fraud poses a persistent challenge in e-commerce, causing financial losses and eroding consumer trust. The evolving tactics of fraudsters necessitate advanced detection mechanisms to safeguard online transactions. Despite existing preventive measures, the detection and mitigation of fraudulent activities remain a significant concern. The problem statement underscores the imperative for a robust and adaptive fraud detection system capable of identifying suspicious transactions in real time, thereby minimizing risks for merchants and consumers alike. Addressing this challenge requires a comprehensive understanding of fraudulent patterns and behaviors, coupled with the implementation of sophisticated data analysis techniques and machine learning algorithms.

1.2 Project Objective


The primary objective of online payment fraud detection is to identify and prevent
fraudulent transactions while ensuring a smooth and secure experience for legitimate users.
Here are the key goals in detail:

1. Minimize Financial Losses: Detect fraudulent transactions in real time or near real time to prevent financial losses to both customers and businesses.

2. Enhance Security: Protect sensitive financial and personal information from unauthorized access and breaches.

3. Maintain Customer Trust: Ensure that legitimate customers have a seamless, hassle-free experience without frequent false positives that might inconvenience them.

4. Compliance with Regulations: Adhere to legal and regulatory requirements related to financial transactions and data privacy, such as PCI DSS (Payment Card Industry Data Security Standard).

5. Adaptive Detection: Continuously update and improve fraud detection algorithms and rules to adapt to evolving fraud tactics and schemes.

6. Efficient Handling of Fraudulent Cases: Implement efficient systems for the investigation, resolution, and reporting of fraudulent transactions.

7. Cost-Effectiveness: Optimize the balance between fraud detection accuracy and operational costs to ensure the solution is financially sustainable.

8. Data Analysis and Insights: Utilize transaction data to gain insights into fraud patterns and trends, enabling proactive measures and strategic decisions.

1.3 Scope of the Project


The scope of online payment fraud detection is broad and multifaceted, involving various techniques, tools, and strategies to identify and prevent fraudulent activities in online transactions. Here are the key aspects:

1. Techniques and Methods:
- Machine Learning and AI: Use of supervised and unsupervised learning algorithms to detect patterns and anomalies indicative of fraud.
- Rule-Based Systems: Implementing predefined rules and thresholds to flag suspicious transactions.
- Behavioral Analytics: Monitoring user behavior to identify deviations from normal patterns.
- Real-Time Monitoring: Analyzing transactions in real time to quickly identify and respond to potential fraud.

2. Data Analysis:
- Transaction Data: Analysis of transaction details such as amount, frequency, and geographical location.
- User Data: Examination of user profiles, historical behavior, and account activities.
- Device and Network Data: Tracking device IDs, IP addresses, and network behaviors to spot inconsistencies.

3. Technological Tools:
- Fraud Detection Software: Utilizing platforms and software designed to identify and prevent fraud, such as FICO Falcon and SAS Fraud Management.
- Biometric Verification: Using fingerprint, facial recognition, and other biometric data to verify user identity.
- Encryption and Secure Protocols: Ensuring data security through encryption and secure transaction protocols.

4. Regulations and Compliance:
- GDPR, PCI DSS, and PSD2: Adhering to regulations and standards that mandate security measures for protecting payment data and ensuring user privacy.
- KYC (Know Your Customer): Implementing thorough customer verification processes to prevent identity fraud.

5. Industry Applications:
- E-Commerce: Protecting online retail transactions from fraudulent activities.
- Banking and Financial Services: Securing online banking, wire transfers, and credit card transactions.
- Digital Wallets and Cryptocurrencies: Ensuring the security of transactions involving digital and crypto assets.

6. Challenges:
- Evolving Tactics: Fraudsters continually adapt their methods, requiring constant updates and improvements to detection systems.
- False Positives/Negatives: Balancing the sensitivity of detection systems to minimize legitimate transactions being flagged as fraudulent and vice versa.
- Data Privacy: Ensuring fraud detection methods comply with privacy laws and regulations.

7. Future Trends:
- Enhanced AI Capabilities: Leveraging advanced AI for more accurate and faster fraud detection.
- Integration of Blockchain: Using blockchain technology for secure, transparent transactions.
- Improved User Authentication: Developing more sophisticated multi-factor authentication methods to enhance security.

Overall, the scope of online payment fraud detection encompasses a combination of advanced technologies, regulatory compliance, and continuous adaptation to emerging fraud tactics.

CHAPTER 2
2.1 System Design

The system first acquires the raw transaction data uploaded by users and obtains the specified data through data preprocessing. The system then uses the XGBoost-based fraud detection algorithm to predict the probability that the transaction is fraudulent. If the predicted probability exceeds the set threshold (0.85 in this system), a warning is issued immediately. Finally, the detection results are saved in the database and returned to the user. The System Flow Chart is shown in Figure 1.
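As a minimal sketch of this scoring step in Python (the feature handling, model object, and alert handling are illustrative assumptions; only the 0.85 threshold comes from the text):

import pandas as pd
import xgboost as xgb

FRAUD_THRESHOLD = 0.85  # threshold used by this system

def score_transactions(model: xgb.XGBClassifier, raw: pd.DataFrame) -> pd.DataFrame:
    """Preprocess raw transactions, predict fraud probability, flag alerts."""
    features = raw.select_dtypes("number").fillna(0)  # placeholder preprocessing
    proba = model.predict_proba(features)[:, 1]       # probability of fraud
    scored = raw.copy()
    scored["fraud_probability"] = proba
    scored["alert"] = proba > FRAUD_THRESHOLD         # warn immediately if exceeded
    return scored

In a deployed system, the scored results would then be written to the database and returned to the user, as in the flow chart.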

Figure 1. System Flow Chart

2.2 System Realization

Fraud Detection Algorithm Based on a Fully Connected Neural Network. The fraud detection algorithm based on a Fully Connected Neural Network consists of three steps: feature selection, data preprocessing, and neural network construction [4].

(1) Feature selection: The experiment contains two types of data sets, Transaction and Identity, which come from the payment service company Vesta. Since selecting all the features did not significantly improve the model, we decided to train the neural network model using only the features of the Transaction data set.

(2) Data preprocessing: Features in the Transaction data set can be divided into continuous features and categorical features. For continuous features, we first apply a logarithmic transformation so that the processed data better conform to a normal distribution; Z-score standardization is then carried out so that each dimension is dimensionless, avoiding the strong influence of differing scales on distance calculations. For categorical features, One-Hot Encoding is applied to generate feature vectors, but only for the top 50 most common values of each feature, to reduce sparsity.
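A minimal sketch of this preprocessing in Python (assuming pandas DataFrames; the column lists are placeholders):

import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, continuous: list, categorical: list) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    # Continuous features: logarithmic transform, then Z-score standardization.
    for col in continuous:
        logged = np.log1p(df[col].clip(lower=0))           # log1p tolerates zeros
        out[col] = (logged - logged.mean()) / logged.std()
    # Categorical features: one-hot encode only the top 50 values of each
    # feature, mapping everything else to "OTHER" to limit sparsity.
    for col in categorical:
        top = df[col].value_counts().nlargest(50).index
        kept = df[col].where(df[col].isin(top), other="OTHER")
        out = out.join(pd.get_dummies(kept, prefix=col))
    return out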

(3) Neural network construction: In this step, we use Keras to build the Fully Connected Neural Network. The network structure is shown in Figure 2.


Figure 2. Overall network structure
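A minimal Keras sketch of step (3); the exact layer widths come from Figure 2, which is not reproduced here, so the sizes below are illustrative assumptions:

from tensorflow import keras

def build_fcnn(input_dim: int) -> keras.Model:
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(128, activation="relu"),   # hidden widths assumed
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),  # fraud probability
    ])
    # Binary cross entropy matches the loss family mentioned in the text.
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model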

2.3 The notion of classification problem and its characteristics

2.3.1 The definition of classification problem

The classification problem is a formalized task involving a set of objects (situations) divided in a certain way into classes. A finite set of objects is specified for which the class membership is known; this set is called a sample. There is no information about other objects, so we do not know to which class they belong. The aim is to create an algorithm that will be able to classify an arbitrary object from the initial set. To classify an object means to indicate the number (or name) of the class to which this object belongs.

The classification of an object is the number or class name issued by the classification algorithm as a result of its application to this particular object [1]. In mathematical statistics, classification problems are also called problems of discriminant analysis. In machine learning, classification problems can be solved with the help of artificial neural network methods, particularly by staging an experiment in the form of supervised training. Let X be a set of object descriptions and Y a set of numbers (or names) of classes. There is an unknown target dependence, the mapping (1), whose values are known only for the elements of a finite learning sample (2). The aim is to construct an algorithm (3) capable of classifying any arbitrary object x ∈ X [2].

y* : X → Y (1)

Xm = {(x1, y1), …, (xm, ym)} (2)

a : X → Y (3)

The probabilistic definition of the problem is more general. It assumes that the set of pairs “object-class” X × Y is a probabilistic space with an unknown probability measure P. There is a finite learning sample of observations (2) generated in accordance with the probability measure P. The aim is to construct an algorithm (3) capable of classifying an arbitrary object x ∈ X.

2.3.2 The concept of characteristics in the tasks of classification

The characteristic is the mapping (4), where Df is the set of permissible values of the characteristic.

f : X → Df (4)

If the characteristics f1, …, fn are given, then the vector (5) is called the characteristic description of the object x ∈ X.

x = (f1(x), …, fn(x)) (5)

Characteristics can be identified with the objects themselves. In this case, the set (6) is
called the space of characteristics.

X = Df1× … × Dfn (6)

Depending on the set Df, characteristics are divided into the following types.

• Binary characteristics: Df = { 0, 1 };

• Nominal characteristics: Df – finite set;

• Ordinal characteristics: Df – finite ordered set;

• Quantitative characteristics: Df – the set of real numbers.

Classification tasks, in turn, fall into the following classes:

• Two-class classification, which is technically the easiest case and serves as the basis for solving more complex tasks;

• Multiclass classification. When the number of classes reaches thousands (for example, when recognizing hieroglyphs or continuous speech), the classification task becomes significantly more difficult;

• Non-overlapping classes;

• Overlapping classes. An object may belong to several classes at a time;

• Fuzzy classes. It is necessary to determine the degree to which the object belongs to each of the classes, usually a real number from 0 to 1 [2].

In our case, we are interested in the binary characteristic of the set with a two-class specification.

2.3.3 Publications dedicated to the fraud detection problem

Bertrand Lebichot and Yann-Ael Le Borgne have researched the problem in the publication “Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection” [3]. They worked on the design of automatic Fraud Detection Systems (FDS) able to detect fraudulent transactions with high precision and deal with the heterogeneous nature of fraudster behavior. Indeed, the nature of fraud behavior may strongly differ according to the payment system (e.g. e-commerce or shop terminal), the country, and the population segment. Another publication is “Improving Card Fraud Detection Through Suspicious Pattern Discovery” by Olivier Caelen and Evgueni N. Smirnov [4]. They proposed a new approach to detecting credit card fraud based on suspicious payment patterns. According to their hypothesis, fraudsters use stolen credit card data at specific, recurring sets of shops, and they exploited this behavior to identify fraudulent transactions.

The problem was also addressed in the article “Calibrating Probability with Undersampling for Unbalanced Classification” by Andrea Dal Pozzolo, Olivier Caelen, and Gianluca Bontempi [5]. In this paper, they study analytically and experimentally how undersampling affects the posterior probability of a machine learning model. They formalize the problem of undersampling and explore the relationship between the conditional probability in the presence and absence of undersampling. They use Bayes Minimum Risk theory to find the correct classification threshold and show how to adjust it after undersampling.
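Their correction can be sketched as follows (a minimal illustration of the idea in [5]; here beta is the fraction of majority-class records kept during undersampling, and p_s is the probability predicted by a model trained on the undersampled data):

def correct_probability(p_s: float, beta: float) -> float:
    """Map a posterior learned on undersampled data back to the original distribution."""
    return beta * p_s / (beta * p_s - p_s + 1.0)

# Example: with a 0.17%-fraud dataset balanced to 50/50, beta is roughly
# 0.0017 / 0.9983, so a predicted probability of 0.9 maps back to about 0.015.
print(correct_probability(0.9, beta=0.0017 / 0.9983))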

2.4 Methods of solving the classification problem

2.4.1 Regression methods in solving classification problems

Logistic regression is suitable for solving the classification problem. This is a statistical
regression method used in the case when the dependent variable is categorical, so it can
acquire only two values (or, more generally, a finite set of values) [6].

Let some set Y have only two values, which are usually indicated by numbers 0 and 1. Let
this value depend on some set of explanatory variables (7).

x = (1, x1, x2, …, xk) (7)

The dependence of Y on x1, x2, …, xk can be determined by introducing an additional latent variable y*, defined in (8):

y* = θ0 + θ1x1 + ⋯ + θkxk + u (8)

Then (9):

Y = 0 if y* ≤ 0; Y = 1 if y* > 0 (9)

The next tool is the support vector method, a data analysis approach for classification and regression using supervised learning models with associated learning algorithms, which are called support vector machines. Given a set of training samples, each marked as belonging to one of two categories, the SVM training algorithm builds a model that assigns new samples to one category or the other, making it a non-probabilistic binary linear classifier. The SVM model represents samples as points in space, mapped in such a way that samples from the separate categories are divided by a gap that is as wide as possible. New samples are then mapped into the same space, and their category is predicted based on which side of the gap they fall.

In addition to performing linear classification, the SVM can efficiently perform nonlinear classification using the so-called kernel trick, implicitly mapping its inputs into high-dimensional feature spaces.

Formally, the support vector machine builds a hyperplane, or a set of hyperplanes, in a space of high or infinite dimensionality that can be used for classification, regression, and other tasks. Intuitively, good separation is achieved by the hyperplane that has the greatest distance to the nearest training data points of any of the classes (the so-called functional margin) [7].
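A minimal scikit-learn sketch of the two methods discussed above (synthetic data stands in for the transaction features):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
svm = SVC(kernel="rbf").fit(X_train, y_train)  # kernel trick for nonlinear separation

print("logistic regression:", logreg.score(X_test, y_test))
print("support vector machine:", svm.score(X_test, y_test))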

2.4.2 Discrete methods in solving classification problems

The next method for solving classification problems takes a slightly different approach. The method of k-nearest neighbours is a simple nonparametric classification method in which the distances (usually Euclidean) from an object to all other objects in the property space are used to classify it. The objects to which the distance is smallest are selected and allocated to a separate class.

The basic principle of the nearest-neighbours method is that an object is assigned to the class that is most common among its neighbours. Neighbours are taken from a set of objects whose classes are already known; based on the key value k for the method, it is determined which class is the most numerous among them. Each object has a finite number of attributes (dimensions), and it is assumed that a certain set of objects with an existing classification is available [7].

The next method for solving classification tasks is the decision tree, which is used in statistics and data analysis for predictive models.

The tree structure contains the following elements: “leaves” and “branches”. On the edges (“branches”) of the decision tree, the attributes on which the target function depends are written; the “leaves” hold the values of the target function, while the other nodes contain the attributes that distinguish the cases. To classify a new case, we go down the tree to a leaf and return the corresponding value. Such decision trees are widely used in intelligent data analysis. The goal is to create a model that predicts the value of the target variable based on multiple input variables [7].

Each leaf represents a value of the target variable, refined in the course of the movement from the root to the leaf. Each internal node corresponds to one of the input variables. A tree can also be “learned” by dividing the source sets of variables into subsets based on tests of attribute values; this process is repeated on each of the resulting subsets. The recursion ends when the subset in a node has the same value of the target variable, so splitting it adds no value to the predictions. This top-down induction of decision trees (TDIDT) is an example of a greedy algorithm, and it is by far the most widespread strategy for learning decision trees from data, though not the only possible one.

The decision trees used in Data Mining are of two main types:

• Classification tree analysis, when the predicted result is the class to which the data belongs;

• Regression tree analysis, when the predicted result can be considered a real number (e.g. a house price, or the length of stay of a patient in a hospital) [8].

In the context of the current task, we are interested in the first type of decision tree for
solving classification issues.
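The two discrete methods above can be sketched in the same way (again with placeholder data):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k-NN: assign the class most common among the k nearest (Euclidean) neighbours.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# Decision tree: recursively split on the most informative attribute.
tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)

print("k-nearest neighbours:", knn.score(X_test, y_test))
print("decision tree:", tree.score(X_test, y_test))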

2.4.3 Artificial neural networks in solving classification problems

Artificial neural networks can also be used to solve classification problems. An artificial neural network is a network of simple elements called neurons that receive input, change their internal state (activation) according to this input, and produce an output that depends on the input and the activation. The network is formed by connecting the outputs of certain neurons to the inputs of other neurons, forming a directed weighted graph. The weights, as well as the functions that compute the activation, can be modified by a process called learning, which is governed by a learning rule [7].

Components of the artificial neural network:


1) Neurons

The neuron with label j, which receives input pj(t) from its predecessor neurons, consists of the following components:

• an activation aj(t), which depends on the discrete time parameter;

• a threshold θj (for a binary neuron), which remains fixed unless changed by the learning function;

• an activation function f, which computes the new activation at time t+1 from aj(t), θj, and the network input pj(t), giving the relation (10). The function is applied to all layers except the last one (where the output function is applied). Each intermediate connection has its own activation function.

aj(t+1) = f(aj(t), pj(t), θj) (10)

• an output function fout, which computes the output activation (11):

oj(t) = fout(aj(t)) (11)

The output function is often simply the identity function. An input neuron has no predecessors and serves as the input interface for the entire network. Similarly, an output neuron has no successors and serves as the output interface for the entire network.

2) Connections and weights

The network consists of connections, each of which transmits the output of a neuron i to the input of a neuron j. In other words, i is the predecessor (parent) of j, and j is the successor (child) of i. Each such connection is assigned a weight wij.

3) Distribution Functions

The distribution function calculates the input pj(t) to the neuron j from the outputs oi(t) of the predecessor neurons and usually has the form of a weighted sum:

pj(t) = Σi oi(t) wij (12)

4) The rule of training

A training rule is a rule or algorithm that changes the parameters of the neural network so that a given input to the network produces a suitable output. This learning process usually involves changing the weights and thresholds of the network variables [7]. There are three main learning paradigms, each corresponding to a particular learning objective: supervised learning, unsupervised learning, and reinforcement learning [7]. We are interested in the first paradigm, because it is used to solve classification problems.

Supervised learning uses a set of example pairs (x, y), x ∈ X, y ∈ Y, and its purpose is to find a function (13) in a permitted class of functions that corresponds to these examples.

f : X → Y (13)

In other words, we want to infer the mapping implied by the data; the cost function captures the discrepancy between our mapping and the data, and it implicitly contains a priori knowledge of the subject domain. Tasks that fit the supervised learning paradigm are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (for example, handwriting, speech, and gesture recognition). It can be seen as learning with a “teacher” in the form of a function that provides constant feedback on the quality of the solutions obtained so far.

CHAPTER 3

PRACTICAL EXAMPLE OF TRANSACTION ANALYSIS AND FRAUD DETECTION USING MACHINE LEARNING

3.1. Overview and description of the transaction database

To investigate this problem and find a solution, a database [9] of a payment system with transactional accounts was obtained. The database reflects transactions executed within 2 days and contains 284,807 transactions in total, of which 492 are fraud (0.172%). The dataset was gathered by Worldline and ULB (Université Libre de Bruxelles) and prepared using several approaches: their private software algorithms, manual testing, and customers' feedback, which resulted in the merged dataset. The database consists only of numerical data. For confidentiality, the fields of the database are anonymized. Because of this, it is not possible to describe the attribute to which each field corresponds, or to give a more precise description of the data from an economic point of view.

All 28 parameters (V1, V2, ..., V28) were obtained using principal component analysis (PCA), a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. The transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) form an uncorrelated orthogonal basis set.

The only 2 fields that have not been transformed are “time” and “quantity”. The “time” value shows the number of seconds that passed between the given transaction and the first transaction. The “quantity” field shows the amount of money that went through the transaction. All other fields have no labels or legend for security and privacy reasons; the bank decided not to disclose what exactly these fields are, giving only their transformed numerical values. The data set is very unbalanced, since the target class, fraudulent transactions, makes up only 0.17% of all transactions (Figure 1). If we use the data as-is to construct models, we will probably get many false classifications due to overfitting: the resulting model will assume that a transaction is likely to be a verified one, since almost the entire data set consists of such transactions.

Fig. 1. Distribution of the initial transactions database by classes.

3.2 Initial analysis of the transaction database

We need to create a balanced subset of the data with the same frequency of fraudulent and verified transactions, which will help the subsequent algorithms produce more accurate results. What will this subset look like? In our case, it will be a dataset with a 50/50 ratio of verified and fraudulent operations: the number of fraudulent and normal operations will be the same.

Why create a subset of the data? We found that the initial data set is very unbalanced. Its use can create the following problems:

• Overfitting. Since almost all records are verified, our model will tend to classify almost every transaction as non-fraudulent.

• Wrong correlation. Although we do not know what exactly each “V” field corresponds to, it is definitely useful to understand how each of them affects the target function. With an unbalanced data set, the correlation matrix will be fuzzy and shifted toward non-fraud transactions [8].

Before applying random sub-sampling, we must divide the initial data set into a training set and a test set. Data balancing techniques (over-sampling or sub-sampling) should be applied only to the training set in order to create the model, while model testing should be done on the initial dataset. In the next step, we apply random under-sampling, which removes entries from the over-represented class. Thus, we achieve a 50/50 ratio by excluding verified transactions (Figure 2). Correlation matrices are the basis for understanding the data: it is interesting for us to understand which arguments significantly affect the classification of a transaction. Particularly revealing is the comparison of the matrices for the balanced and unbalanced data sets (Figure 3).
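A minimal sketch of this split-then-undersample procedure, assuming the transactions are in a pandas DataFrame with a binary "Class" column (1 = fraud), as in the dataset described in 3.1:

import pandas as pd
from sklearn.model_selection import train_test_split

def make_balanced_train(df: pd.DataFrame, seed: int = 42):
    # Split first, so the test set keeps the original class ratio.
    train, test = train_test_split(df, test_size=0.2,
                                   stratify=df["Class"], random_state=seed)
    fraud = train[train["Class"] == 1]
    # Keep only as many verified transactions as there are frauds (50/50).
    verified = train[train["Class"] == 0].sample(n=len(fraud), random_state=seed)
    balanced = pd.concat([fraud, verified]).sample(frac=1, random_state=seed)
    return balanced, test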

Fig. 2. Histogram of equally distributed classes after sub-sampling.

Fig. 3. Correlation matrixes of unbalanced (top) and balanced (bottom) data.

Correlation matrix analysis:

• Negative correlation: V10, V12, V14, V17. The smaller the value of these variables, the more likely the transaction is fraudulent.

• Positive correlation: V2, V4, V11, V19. The larger the value of the variable, the more likely the operation is fraudulent [8].

3.3 Creation and evaluation of the fraud detection classifiers

Before we begin, we need to divide our data into training and test subsets.

Processing large volumes of data, deriving results, and, most importantly, doing so at high speed requires the use of computers. In practice, there are many tools and technologies for data processing, but the most popular are Python and R. Which language to use is completely up to the user; the mathematical and statistical methods described above are implemented in both environments. This work uses Python [10], but all the same techniques and methods are available in R.

We will use the following libraries [11]:

• Pandas – for easier data processing;
• Matplotlib – for visualization;
• NumPy and SciPy – for scientific calculations;
• Seaborn – for visualization of statistical data;
• Sklearn – machine learning library;
• Tensorflow – machine learning library.

For each classifier, we build a model and find its accuracy [12].

Next, let us analyze and compare the learning curves of all four models (Figure 4 – Figure 7):
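Curves like those in Figures 4-7 can be produced with scikit-learn's learning_curve helper (a sketch; clf stands for any of the four classifiers, and X and y for the balanced features and labels):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve

def plot_learning_curve(clf, X, y, title):
    sizes, train_scores, val_scores = learning_curve(
        clf, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
    plt.plot(sizes, train_scores.mean(axis=1), label="training score")
    plt.plot(sizes, val_scores.mean(axis=1), label="cross-validation score")
    plt.xlabel("training set size")
    plt.ylabel("accuracy")
    plt.title(title)
    plt.legend()
    plt.show()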

Fig. 4. Logistic regression learning curve.

Fig. 5. Support vectors learning curve.

Logistic regression showed the best accuracy, with an estimate of 94%. This is a training result, obtained by assessing how precisely the model identifies fraud in the training sample. For a more accurate result, we check the resulting models on the test sample (remember that this is still a balanced sample, so the result will still be approximate).

As we see from the obtained results, the logistic regression method performed best, with a result of 94% on the training sample and 93.52% on the test sample (the best result was evaluated as the maximum arithmetic mean of the two indicators [13]). The k-nearest neighbours method and the support vector method also showed fairly precise results, and the support vector method achieved an even better result on the test sample than logistic regression: 93.78%.

For a more detailed demonstration of the results, we output a confusion matrix [14] for the logistic regression method. In the upper-left and lower-right squares (yellow), the correct results are placed; in the other squares (black), the wrong results are placed.

Fig. 6. K-nearest neighbours learning curve.

Fig. 7. Decision tree classifier learning curve.

Fig. 8. Logistic regression results’ confusion matrix.

As we see from Figure 8, this method correctly detected 96 + 89 = 185 transactions. The other 8 transactions fell into the wrong groups, so they were not predicted correctly. Remember, the above results were obtained on the sub-sampled test dataset.

3.4 Fraud detection using neural networks

To create the neural network, the same Python software stack, based on TensorFlow, was used.

The structure of the neural network is a simple model that consists of one input layer, one hidden layer of 32 nodes, and one output layer that can take one of two possible values: 0 or 1.

We will conduct two trainings of the neural network: the first using sub-sampling, and the second using over-sampling. In the first case, we narrow our data to a 50/50 ratio, randomly dropping a significant portion of the verified transactions. In over-sampling, we expand our data by adding new records of fraudulent data generated based on the existing fraudulent records.

To train the neural network, 20 iterations were performed on the corresponding data set. After training the neural network, we evaluate it on the original data set and compare the results between the neural networks themselves and the best classifiers.
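A minimal sketch of this network and its training (the 32-node hidden layer and the 20 iterations come from the text; the optimizer and batch size are assumptions):

from tensorflow import keras

def train_fraud_net(X_train, y_train):
    model = keras.Sequential([
        keras.layers.Input(shape=(X_train.shape[1],)),
        keras.layers.Dense(32, activation="relu"),    # single hidden layer, 32 nodes
        keras.layers.Dense(1, activation="sigmoid"),  # output: 0 or 1
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=20, batch_size=256, verbose=0)  # 20 iterations
    return model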

Fig. 9. Confusion matrix for the neural network, trained on sub-sampling.

Fig. 10. Confusion matrix for the neural network, trained on over-sampling.

As we see from Figure 9, the neural network trained on the sub-sampled data classified a significant part of the verified transactions as fraudulent, but only 1 fraudulent transaction slipped through. Overall, the score of the neural network was 93.1%.

Over-sampling (data expansion) showed the best result (Figure 10) among both neural networks and all models in general, demonstrating 99.9% correct classifications. However, it should be noted that 24 fraudulent transactions slipped through, so the percentage of blocked fraudulent transactions is lower.

4 Proposed System

The proposed system aims to bolster the security of online payment systems by employing
advanced data analysis techniques and machine learning algorithms to detect and prevent
fraudulent transactions effectively.

4.1 Data Collection: The system collects comprehensive transactional data from various
sources, including payment gateways, merchants, and financial institutions. This data
encompasses transaction amounts, timestamps, user demographics, device information, and
transaction histories.

4.2 Data Preprocessing:

Upon collection, the raw transactional data undergoes preprocessing, including data
cleaning, normalization, and feature engineering. Missing values are handled, outliers are
identified and treated, and relevant features are extracted or transformed to enhance model
performance.

4.3 Feature Selection:

Feature selection techniques, such as correlation analysis and feature importance ranking,
are employed to identify the most discriminative features for fraud detection. This step
helps reduce dimensionality and improve model efficiency.

4.4 Model Building:

The system utilizes machine learning algorithms, including supervised and unsupervised
techniques, to build robust fraud detection models. Supervised algorithms such as logistic

regression, decision trees, and ensemble methods learn from labeled data to classify
transactions as either legitimate or fraudulent. Unsupervised algorithms such as clustering
and anomaly detection identify unusual patterns indicative of fraudulent activities without
the need for labeled data.
4.5 Model Training and Evaluation:

The selected models are trained on historical transaction data and evaluated using
appropriate performance metrics such as accuracy, precision, recall, and F1-score. Cross-
validation techniques ensure the generalizability of the models, while hyperparameter
tuning optimizes their performance.
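A small sketch of this evaluation step (names such as model, X_train, and y_test are placeholders):

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score

def evaluate(model, X_train, y_train, X_test, y_test):
    # Cross-validation on the training data checks generalizability.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "cv_mean_accuracy": cv_scores.mean(),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }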

4.6 Real-time Monitoring:

The trained models are deployed in a real-time monitoring system that continuously analyzes incoming transactions for signs of fraud. Transactions flagged as suspicious trigger immediate alerts for further investigation by fraud analysts or automated response mechanisms.

4.7 Adaptive Learning:

The system incorporates adaptive learning mechanisms to continuously update and refine
the fraud detection models based on new data and emerging fraud trends. Feedback loops
enable the system to adapt to evolving fraud tactics and maintain high detection accuracy
over time.

4.8 Reporting and Visualization:

Comprehensive reports and visualizations are generated to provide insights into the
effectiveness of the fraud detection system. Key performance indicators, trends, and
patterns are communicated to stakeholders to support decision-making and strategic
planning.

4.9 Flowchart

The flowchart concludes with the end symbol, indicating the completion of the decision tree algorithm. The flowchart shown in Figure 3 provides a visual representation of the steps involved in training and evaluating the decision tree model, aiding understanding of the overall process and facilitating communication between different stakeholders.

Here is a brief explanation of the model-training flowchart in Figure 3:

1. Start: The flowchart begins with the start symbol, indicating the beginning of the decision tree algorithm.
2. Load Dataset: The algorithm loads the dataset, which contains the input features and target variable.
3. Define Features and Target: The feature columns and target column are defined, specifying the variables to be used for training the decision tree.
4. Split Data: The dataset is split into training and testing sets using the train_test_split function, allocating a portion of the data for model evaluation.
5. Data Imputation: The SimpleImputer object is used to handle missing values in the dataset, replacing them with the mean value of the respective feature.
6. Build Decision Tree: The DecisionTreeClassifier object is created, representing the decision tree model. It is trained on the training data using the fit function.
7. Predictions: The trained decision tree is used to make predictions on the test set, using the predict function.
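The steps above can be sketched as runnable code (the file name and the "isFraud" target column are illustrative assumptions):

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("transactions.csv")                      # 2. Load Dataset
feature_cols = [c for c in df.columns if c != "isFraud"]  # 3. Define Features and Target
X, y = df[feature_cols], df["isFraud"]
X_train, X_test, y_train, y_test = train_test_split(      # 4. Split Data
    X, y, test_size=0.2, random_state=42)

imputer = SimpleImputer(strategy="mean")                  # 5. Data Imputation
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

tree = DecisionTreeClassifier(random_state=42)            # 6. Build Decision Tree
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)                             # 7. Predictions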

CHAPTER 4

IMPLEMENTATION AND RESULTS

4.11 Methodology

The methodology covers the algorithm used, the dataset used, and the flowchart of the data as implemented. A step-by-step explanation of the algorithm follows.

Algorithm Used: The decision tree algorithm is a widely used supervised learning
technique employed for both classification and regression tasks. It constructs a structured
model resembling a flowchart, driven by input features.

Tree Construction: The algorithm commences by considering the entire dataset as the
root node, and selects the optimal feature for partitioning the data.

Feature Split: The chosen feature is utilized to divide the data into subsets, thereby creating branches or paths within the decision tree.

Recursive Splitting: The process of feature splitting is iteratively applied to each subset until a predefined stopping criterion is satisfied.

Leaf Node Assignment: Leaf nodes are assigned class labels or regression values based
on the majority class or mean value of the target variable within each respective subset.

Prediction: To make predictions, the algorithm traverses the decision tree by evaluating feature values, ultimately reaching a leaf node to obtain the final prediction.

Advantages: easy to comprehend and interpret; accommodates numerical and categorical data; handles missing values gracefully; captures nonlinear relationships effectively.

Limitations: prone to overfitting, necessitating proper regularization techniques; can be sensitive to changes in the dataset, leading to instability; exhibits bias towards features with high cardinality or many levels.

In conclusion, decision trees offer versatility and transparency in model interpretation. However, caution must be exercised to address overfitting issues and effectively manage the algorithm's limitations.

4.12 Result

The goal was to predict whether a transaction is legitimate or fraudulent, which falls under the scope of a classification problem. We intend to deploy supervised machine learning models in order to achieve the highest prediction accuracy. K Nearest Neighbor, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest models were trained using the k-fold technique. Training contained a total of 5 folds, and with each fold the accuracy of the model kept increasing, up to the 5th fold. After the 5th fold, accuracy started decreasing because our dataset was not large enough for more than 5 folds. So, the final model was trained on 5 folds with 88.55% average accuracy. This suggests that training Random Forest on a bigger dataset using the k-fold technique would yield an even higher average accuracy.
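The k-fold training can be sketched as follows (the Random Forest settings are assumptions; X and y stand for the prepared features and labels):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # 5 folds, as above
scores = cross_val_score(rf, X, y, cv=kfold)              # one accuracy per fold
print("per-fold accuracy:", scores)
print("average accuracy: %.2f%%" % (100 * scores.mean()))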

As a result, the Decision Tree model had the greatest prediction accuracy of 99.92%, with a recall of 86.96%. Due to the huge amount of data, the Support Vector Machine and Random Forest models were unable to finish training, even on Google Colab. Further work can be done by under-sampling the data to a 50:50 ratio, which would reduce the data size even more, so that the SVM and Random Forest results could be compiled accurately. These are initial results; final results could not be compiled due to insufficient computing power.

CHAPTER 5

CONCLUSION

Logistic regression reaches up to 94% correct classifications, while the neural network on the sub-sampled data shows a result of 93.1%, and on the over-sampled data as much as 99.9%, though it misses a significant number of fraudulent operations. On the one hand, the accuracy of the neural network with over-sampling is higher; on the other hand, it misses most of the fraudulent operations, although it classifies the verified ones better. Logistic regression showed average accuracy, but also missed a significant part of the fraudulent transactions. Although the neural network on the sub-sampling showed the worst overall result of 93.1%, it prevented the largest number of fraudulent transactions. In general, the choice of model depends on the specific situation: whether clients are prepared to occasionally have a transaction denied in exchange for assurance that their funds will not be obtained by fraud, or whether they are more interested in ease of use and security is not that important.

References
1. Design and development of financial fraud detection using machine learning. (2024). International Journal of Emerging Trends in Engineering Research, 8(9), 5838–5843. https://doi.org/10.30534/ijeter/2020/152892020

2. Rucco, M., Giannini, F., Lupinetti, K., & Monti, M. (2019). A methodology for part classification with supervised machine learning. Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 33(1), 100–113. https://doi.org/10.1017/S0890060418000197

3. Saarikoski, J., Joutsijoki, H., Järvelin, K., Laurikkala, J., & Juhola, M. (2015). On the influence of training data quality on text document classification using machine learning methods. International Journal of Knowledge Engineering and Data Mining, 3(2), 143. https://doi.org/10.1504/IJKEDM.2015.071284
