Fraud Detection using Machine Learning
Oladimeji Kazeem
University of Stirling
All content following this page was uploaded by Oladimeji Kazeem on 21 September 2023.
By
Oladimeji Kazeem
Python Programming
IU University of Applied Sciences
Abstract
The threat posed by financial transaction fraud to organizations and individuals has
prompted the development of cutting-edge methods for detection and prevention. The use of
real-time monitoring systems and machine learning algorithms to improve fraud detection
and prevention in financial transactions is explored in this research study. The paper
addresses the drawbacks of conventional rule-based systems, explains why real-time
monitoring and machine learning should be used, and describes the goals of the research.
To comprehend the current methodologies and pinpoint research gaps, a thorough literature
study is done. The suggested approach includes dimensionality reduction, feature
engineering, data preparation, and the application of machine learning models built into a
real-time monitoring system. Results are assessed using performance measures and
contrasted with the performance of current systems. Adaptive thresholds and dynamic risk
scoring are two proactive fraud prevention strategies that are being investigated. Considerations
for scalability and deployment, including data security and legal compliance, are also
covered. The study suggests areas for additional research in this field and helps to design
reliable fraud detection systems.
Table of Contents
1. Introduction
1.1 Research Objectives
1.2 Research Questions
2. Literature Review
2.1 Supervised Learning Approaches
2.2 Unsupervised Learning Approaches
2.3 Hybrid Approaches
2.4 Deep Learning Approaches
2.4 Features Engineering and Dimensionality Reduction
2.5 Feature extraction
2.6 Dimensionality Reduction
3 Methodology
3.1 Dataset Description
3.2 Preprocessing Steps
3.3 Exploratory Data Analysis
3.4 Feature Engineering and Dimensionality Reduction
3.5 Machine Learning Algorithms
3.6 Solution Deployment
3.7 Model Deployment Options
4 Results & Findings
4.1 Categorical Analysis of Customer Categories
5 Discussions
5.1 Proactive Measure for Fraud Prevention
5.1.1 Solution Integration into the System
5.1.2 Potential Efficacy and Restrictions
5.2 Scalability Large-Scale Financial Transaction Data Handling Issues
5.2.1 Architectural Points to Keep in Mind for Financial Institutions in the Real World
5.2.2 Data security and adherence to legal requirements
5.2.3 System integration difficulties
6 Conclusion
6.1 Research Contributions and Findings
6.2 Future Study and Developments
1. Introduction
For organizations, financial institutions, and people everywhere, detecting and preventing
fraud in financial transactions is a top priority. The need to investigate more sophisticated
techniques has arisen as sophisticated fraud has made clear the limitations of conventional
rule-based systems. This study explores how real-time monitoring systems and machine
learning algorithms can be used to improve financial transaction fraud detection and
prevention capabilities.
In the literature, the importance of fraud prevention and detection in financial transactions
has been extensively discussed. In addition to causing significant financial losses, financial
fraud also erodes public faith in the financial system (Association of Certified Fraud
Examiners, 2020). Traditional rule-based systems look for suspected fraudulent actions
using predetermined rules and patterns. But these systems struggle to adjust to new and
developing fraud strategies, which results in many false negatives and potential financial
losses (Kumar et al., 2020). The use of machine learning algorithms has drawn a lot of
interest as a solution to these restrictions.
Large volumes of transactional data can be automatically mined for patterns and
abnormalities using machine learning algorithms, leading to more precise and adaptable
fraud detection. Financial institutions can examine past transactional data to find trends
linked to fraudulent actions by utilizing machine learning techniques like supervised learning,
unsupervised learning, and deep learning (Dal Pozzolo et al., 2015). Additionally, by
continuously monitoring transactions in real-time and sending out notifications for suspected
fraud, the integration of real-time monitoring systems improves fraud detection (Bolton et al.,
2011). With timely action made possible by this proactive strategy, potential losses and
damages are reduced.
The necessity for a more effective and efficient strategy to counteract changing fraud
strategies is what motivates the use of machine learning algorithms and real-time monitoring
systems. Financial fraud is dynamic, necessitating the use of adaptable systems that can
recognize emerging trends and abnormalities. Detecting complex and changing fraud
patterns is made possible by machine learning algorithms, allowing for early identification
and prevention (Phua et al., 2010). In addition to machine learning, real-time monitoring
systems offer fast response capabilities, enabling prompt intervention to stop fraudulent
transactions (Kou et al., 2020).
1.1 Research Objectives
1. Investigate the use of machine learning algorithms for fraud detection in financial
transactions.
2. Design and develop a real-time monitoring system for continuous fraud detection and
prevention.
3. Assess the performance of the suggested approach in comparison to conventional
rule-based systems.
4. Explore proactive measures for fraud prevention, such as dynamic risk scoring and
adaptive thresholds.
5. Analyse scalability and deployment considerations for implementing the proposed
system in real-world financial institutions.
2. Literature Review
In recent years, there has been a lot of study on applying machine learning algorithms to
detect fraud in financial transactions. Various strategies and algorithms have been examined
in several studies to increase the precision and effectiveness of fraud detection systems.
This section reviews earlier studies and research articles in the field, addressing the benefits
and drawbacks of various strategies while identifying the gaps in the body of knowledge that
the current study seeks to fill.
2.1 Supervised Learning Approaches
Decision trees are a well-liked supervised learning strategy for fraud detection. To
categorize occurrences as fraudulent or authentic, decision tree algorithms, such as the C4.5
algorithm, build a tree-like model that divides the dataset depending on feature values.
Because they can manage non-linear relationships between features and the target variable,
decision trees are well suited to identifying intricate fraud patterns.
The ability of Support Vector Machines (SVMs) to handle high-dimensional data and
nonlinear relationships has led to their use in fraud detection as well. SVMs look for an
optimal hyperplane that separates fraudulent from legitimate transactions with the greatest
margin. When dealing with unbalanced datasets, SVMs have been shown to perform well at
classifying fraudulent transactions.
Although these supervised learning algorithms are easy to use and interpret, they could
have trouble spotting fraud. The complexity of fraud patterns is one of the biggest problems.
The techniques used by fraudsters are constantly changing, creating complex and dynamic
fraud patterns that these algorithms would find challenging to successfully detect.
Techniques such as the Synthetic Minority Over-sampling Technique (SMOTE), which
oversamples the minority class, and under-sampling of the majority class have been suggested
as solutions to the problem of unbalanced data. These methods seek to improve the
identification of fraudulent transactions by balancing the distribution of classes.
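The two resampling ideas can be sketched in plain Python. This is a simplified illustration and not the reference SMOTE implementation: it only shows the core step of interpolating between a minority sample and one of its nearest minority neighbours. The function names, the value of k, and the toy data are assumptions made for this example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base sample.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class."""
    return random.Random(seed).sample(majority, n_keep)


# Toy data (hypothetical): 3 fraudulent and 12 legitimate transactions as (amount, hour).
fraud = [(900.0, 2.0), (950.0, 3.0), (1000.0, 1.0)]
legit = [(float(20 + 5 * i), float(8 + i % 10)) for i in range(12)]

fraud_balanced = fraud + smote_like_oversample(fraud, n_new=3)
legit_balanced = random_undersample(legit, n_keep=6)
print(len(fraud_balanced), len(legit_balanced))
```

Because every synthetic point lies on a segment between two real minority samples, the new fraud examples stay inside the region already occupied by known fraud cases. Resampling of this kind would normally be applied to the training split only.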
2.2 Unsupervised Learning Approaches
Clustering algorithms were used in a study by Ranshous et al. (2015) to identify fraud. To
find clusters of connected fraudulent transactions, the authors used clustering techniques,
which made it possible to spot trends and similarities in fraudulent behaviour. This method is
especially beneficial for identifying novel or previously unidentified fraud patterns that
may not be picked up by predetermined rules or labelled data.
Unsupervised learning techniques have the advantage of being able to adapt to new fraud
methods without relying on labels that have been predetermined. They can find irregularities
and patterns in the data that may be signs of fraud. Unsupervised learning techniques face
considerable difficulties due to their increased false positive rate when compared to
supervised methods. Unsupervised models have a high rate of false positives because they
can classify genuine transactions as anomalies or find clusters that include both valid and
fraudulent transactions.
Another drawback is the challenge of identifying specific fraud incidents. While unsupervised
learning techniques offer a more comprehensive perspective of fraud tendencies, they could
fall short in terms of the level of detail needed to pinpoint fraudulent transactions or the
participants. To recognize and authenticate specific fraud cases, more research and analysis
are frequently required.
2.3 Hybrid Approaches
Hybrid methods that blend supervised and unsupervised techniques have been developed
to solve the issues of false positives and the difficulty in identifying specific fraud instances.
A hybrid fraud detection system with integrated clustering and classification algorithms was
proposed by Bhattacharyya et al. (2018). Once the clustering algorithm had identified groups
of similar transactions, the classification technique was used to distinguish between
fraudulent and valid transactions inside each cluster. Compared with employing either
strategy alone, their hybrid model showed enhanced fraud detection performance.
The benefit of hybrid techniques is that they combine supervised learning, which captures
well-known fraud patterns, with unsupervised learning, which detects new fraud patterns. Hybrid
models seek to increase fraud detection accuracy while lowering false positives by
incorporating the best features of both approaches.
However, using hybrid models in practical settings is not without its difficulties. When
compared to individual approaches, these models are typically more intricate and
computationally intensive. Large-scale implementation may be more difficult because of the
need for additional resources and knowledge for the integration and coordination of multiple
algorithms.
2.4 Deep Learning Approaches
There are a few things to consider when using deep learning models for fraud
detection. First, for deep learning models to operate at their best, a lot of labelled training
data is frequently necessary. In the area of fraud detection, gathering an extensive and
precisely annotated dataset might be difficult because fraudulent instances are typically
much rarer than valid ones. To lessen the problem of imbalanced datasets, sophisticated
sampling techniques and data augmentation approaches might be used.
Second, training and optimizing deep learning models can be computationally taxing and
may call for a lot of processing power. Large datasets and complex neural architectures may
require the utilization of specialized hardware or distributed computing resources in order to
train models effectively.
Despite these difficulties, deep learning approaches such as convolutional neural networks
and recurrent neural networks have advanced and continue to help fraud
detection systems become more effective. The goal of ongoing research is to improve the
effectiveness of deep learning models for fraud detection. This includes developing
lightweight architectures, model compression methods, and transfer
learning methods.
The current study tries to fill various gaps in the literature despite the advancements made in
machine learning-based fraud detection. These gaps include the following:
1. Limited attention paid to real-time fraud detection: While real-time fraud detection
calls for prompt identification and prevention during live transactions, many existing
studies concentrate on offline analysis of past data.
2. Insufficient attention to temporal aspects: Although they frequently go unnoticed,
time-dependent characteristics and temporal dependencies in financial transactions
are vital for spotting fraud.
3. Lack of consideration for interpretability and explainability: To win the trust of
stakeholders and meet regulatory obligations, it is crucial to offer explanations and
interpretability as machine learning models get increasingly complicated.
4. Inadequate analysis of unbalanced datasets: In fraud detection, where there are far
fewer cases of fraud than there are of valid transactions, unbalanced datasets are
typical. Further research is required to determine how well current approaches
perform on data that is unbalanced.
2.5 Feature extraction
• Time-Based Features: Extraction of temporal data, such as the day of the week, the
hour of the day, or the amount of time since the last transaction, using transaction
timestamps.
• Statistical Features: Calculating statistical measures of transaction amounts or
other pertinent variables, such as mean, standard deviation, and skewness.
• Text Mining: The process of extracting terms or patterns from text-based fields, such
as transaction descriptions, that may be indicators of fraud.
2.6 Dimensionality Reduction
• Principal Component Analysis (PCA): Converts the original features into a new set of
uncorrelated variables (principal components), the first few of which account for most
of the variance in the data.
• Linear Discriminant Analysis (LDA): A supervised dimensionality reduction technique that
maximizes the separation between classes while minimizing within-class
variation.
• t-Distributed Stochastic Neighbour Embedding (t-SNE): A non-linear technique,
frequently used for visualization, that maintains the data's local structure while
lowering its dimensionality.
• Feature Aggregation: The process of taking averages, sums, or other aggregations
to combine several related features into a single feature.
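To make the PCA idea concrete, the following sketch finds the first principal component of a small table using power iteration on the covariance matrix. It is a minimal illustration under stated assumptions: the feature values are invented, the function name is hypothetical, and a production system would normally rely on a tested library implementation instead.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (component, explained_variance_ratio) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance captured along the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Hypothetical, already scaled features (amount, balance) that move together.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
component, ratio = first_principal_component(data)
print([round(c, 3) for c in component], round(ratio, 3))
```

When two features are this strongly correlated, a single component captures almost all of the variance, which is why the two columns could be replaced by one derived feature with little loss of information.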
3 Methodology
3.1 Dataset Description
The dataset used for the research is a synthetic dataset generated for the purpose of this
study (see Appendix 1). It contains information about financial transactions, including
transaction IDs, customer IDs, transaction amounts, transaction timestamps, regions, states,
customer categories, and account balances. The dataset consists of 10,000 records and includes
characteristics such as geographical information, customer profiles, and transaction details.
3.2 Preprocessing Steps
• Handling missing values: Identify and handle any missing values in the dataset,
either by imputing them or removing the corresponding records.
• Data normalization: Scale numerical features such as transaction amounts and
account balances to a common range to ensure they have a similar impact during
model training.
• Encoding categorical variables: Convert categorical variables like regions, states,
and customer categories into numerical representations using techniques like one-
hot encoding or label encoding.
• Feature selection: Identify and select the most relevant features that contribute
significantly to fraud detection, considering their impact and reducing computational
complexity.
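The first three preprocessing steps can be sketched without any external library. The records below are hypothetical and only borrow column names from the dataset description; median imputation, min-max scaling, and one-hot encoding are one possible choice for each step, not necessarily the ones used in the study.

```python
from statistics import median

# Hypothetical records using column names from the dataset description.
records = [
    {"amount": 120.0, "balance": 1500.0, "region": "North"},
    {"amount": None, "balance": 300.0, "region": "South"},
    {"amount": 80.0, "balance": 900.0, "region": "North"},
]


def impute_median(rows, column):
    """Replace missing values (None) in a numeric column with the column median."""
    fill = median(r[column] for r in rows if r[column] is not None)
    for r in rows:
        if r[column] is None:
            r[column] = fill


def min_max_scale(rows, column):
    """Rescale a numeric column to the range [0, 1]."""
    values = [r[column] for r in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for constant columns
    for r in rows:
        r[column] = (r[column] - low) / span


def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    for r in rows:
        value = r.pop(column)
        for category in categories:
            r[f"{column}_{category}"] = 1 if value == category else 0


impute_median(records, "amount")
for numeric_column in ("amount", "balance"):
    min_max_scale(records, numeric_column)
one_hot_encode(records, "region")
print(records[1])
```

In a real pipeline the median, minimum, and maximum would be computed on the training split and then reused on new data, so that evaluation data does not influence the preprocessing parameters.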
3.3 Exploratory Data Analysis
By visualizing the data, it becomes easier to identify any anomalies, outliers, or patterns that
may require further investigation or preprocessing before training the machine learning
models.
3.4 Feature Engineering and Dimensionality Reduction
• Feature Selection: We screened the data for noise and kept the features that
contributed most to fraud detection. This lessened the possibility
of overfitting while also enhancing the model's accuracy and interpretability.
• Feature Extraction: Transaction data frequently contains important information that
the raw features do not readily capture. Feature extraction was used to create
meaningful representations that expose significant fraud-related patterns or
trends.
• Dimensionality Reduction: Datasets related to financial transactions may be highly
dimensional, which increases computing complexity and raises the possibility of
overfitting. Methods for dimensionality reduction reduced the number of features
while retaining the most important information, which helped to solve these problems.
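As an illustration of the feature extraction step, the sketch below derives time-based and statistical features from a transaction history. The history, the field names, and the choice of statistics are assumptions made for this example and are not taken from the study's dataset.

```python
from datetime import datetime
from statistics import mean, pstdev

# Hypothetical transaction history for one customer: (ISO timestamp, amount).
history = [
    ("2023-03-06T09:15:00", 40.0),
    ("2023-03-06T18:40:00", 55.0),
    ("2023-03-08T02:05:00", 900.0),
]


def extract_features(history):
    """Derive time-based and statistical features for each transaction in order."""
    rows = []
    previous_time = None
    past_amounts = []
    for stamp, amount in history:
        when = datetime.fromisoformat(stamp)
        rows.append({
            "day_of_week": when.weekday(),  # Monday = 0
            "hour_of_day": when.hour,
            "hours_since_last": (
                (when - previous_time).total_seconds() / 3600 if previous_time else None
            ),
            # Statistics use only earlier transactions, so no future data leaks in.
            "mean_past_amount": mean(past_amounts) if past_amounts else None,
            "std_past_amount": pstdev(past_amounts) if past_amounts else None,
        })
        previous_time = when
        past_amounts.append(amount)
    return rows


for row in extract_features(history):
    print(row)
```

Computing the statistics from earlier transactions only mirrors what a real-time system can know at scoring time.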
The trade-off between model performance and interpretability was considered while
choosing these techniques. Higher predictive accuracy may be obtained using more
sophisticated approaches like deep learning or ensemble methods, but they may also be
more difficult to interpret. To balance model complexity, interpretability, and computing
efficiency, one must consider both the resources at hand and the needs of the fraud
detection system.
• Logistic Regression: This algorithm is suitable for binary classification tasks and
can provide interpretable results.
• Decision Trees: Decision trees can capture non-linear relationships and are
effective in handling categorical features.
• Random Forest: This ensemble method combines multiple decision trees to improve
accuracy and handle complex fraud patterns.
• Support Vector Machines (SVM): SVMs can handle high-dimensional data and are
effective in separating classes with a clear margin.
All four algorithms were trained and compared in order to identify the best-performing
algorithm and its applicable hyperparameters.
• Model serialization
The trained machine learning models were serialized into a format that makes them simple
to load and use during deployment. Pickle files, joblib files, or
serialized representations particular to the machine learning framework of choice are
examples of common formats.
The final machine learning model was deployed to a local device, which simulates
the on-premises scenario.
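The serialization round trip can be sketched with Python's standard pickle module. The model here is a hypothetical stand-in (a dictionary of logistic-regression-style parameters), and the file name is an assumption; it is not the model trained in the study.

```python
import math
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: logistic-regression-style parameters.
model = {"weights": {"amount": 0.004, "hour_of_day": -0.05}, "bias": -3.0}


def predict_proba(model, features):
    """Return the estimated fraud probability for one transaction."""
    z = model["bias"] + sum(w * features[name] for name, w in model["weights"].items())
    return 1.0 / (1.0 + math.exp(-z))


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "fraud_model.pkl")
    # Serialize the trained model to disk after training.
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    # Load it again, as the deployed service would do at start-up.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(round(predict_proba(restored, {"amount": 1000.0, "hour_of_day": 2}), 3))
```

For scikit-learn estimators, joblib.dump and joblib.load play the same role. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.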
• On-Premises Deployment: Setting up the models on the organization's own local servers
or infrastructure.
• Cloud Deployment: Hosting the models on cloud infrastructure like AWS, Azure, or
Google Cloud.
• Containerization: Packing the models into containers for scalability and simple
deployment (like Docker).
• Serverless Deployment: This method involves deploying the models as functions using
serverless platforms (such as AWS Lambda and Google Cloud Functions).
API Development
To expose the deployed models, a microservice or an API endpoint was created. This made
it possible for other programs or systems to communicate with fraud detection models and
make predictions. Transaction data are accepted as input by the API, which should then
output estimated fraud probability or binary labels.
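The request and response contract described above can be sketched as a plain handler function. Everything specific here is an assumption made for illustration: the endpoint name, the field names, the weights, and the 0.5 decision threshold. A web framework would wrap a function like this in a real deployment.

```python
import json
import math

# Hypothetical model parameters; in practice they come from the deserialized model.
WEIGHTS = {"amount": 0.004, "account_balance": -0.0005}
BIAS = -2.0
THRESHOLD = 0.5


def score(transaction):
    """Logistic score computed from the numeric fields listed in WEIGHTS."""
    z = BIAS + sum(w * float(transaction[name]) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def handle_predict(request_body):
    """Handle the body of a hypothetical POST /predict request.

    Returns a (status_code, json_text) pair.
    """
    try:
        transaction = json.loads(request_body)
        probability = score(transaction)
    except (ValueError, KeyError, TypeError) as error:
        return 400, json.dumps({"error": f"invalid transaction: {error}"})
    return 200, json.dumps({
        "fraud_probability": round(probability, 4),
        "is_fraud": int(probability >= THRESHOLD),
    })


print(handle_predict('{"amount": 1200, "account_balance": 300}'))
print(handle_predict('{"amount": 1200}'))
```

Returning both the probability and the binary label lets calling systems apply their own thresholds, while malformed input is rejected with a client error instead of reaching the model.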
The solution was designed to handle increasing transaction volumes in real time. To increase
performance and scalability, strategies like load balancing, caching, and parallel processing
are suggested.
Monitoring and logging systems were implemented to keep track of the operation and behaviour
of the deployed models. This entailed logging all input information, predictions, and runtime
faults or exceptions. Monitoring makes continuous improvement possible because it helps
find any drift in model performance over time.
Security Consideration
Appropriate security precautions must be applied to safeguard the deployed models and the data
they analyse. Access controls, encryption of sensitive data, and frequent security audits may
all be necessary for this.
A versioning mechanism for the deployed models was created to keep track of changes and
simplify future updates. To adapt to changing fraud tendencies, automated pipelines are
suggested for model updates and retraining.
A/B testing was performed to compare the performance of the deployed models against a
baseline or alternative approaches. The effectiveness of the deployed models was evaluated
continuously using relevant metrics including precision, recall, and F1-score.
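For reference, the three metrics named above can be computed directly from the true and predicted labels. The labels below are invented for the example (1 = fraud, 0 = legitimate); scikit-learn's classification_report, which the appendix imports, reports the same quantities.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1-score for the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 3 real fraud cases, of which the model catches 2,
# plus 1 legitimate transaction that is wrongly flagged.
y_true = [1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Accuracy is deliberately left out: with heavily unbalanced classes, a model that never predicts fraud can still score a high accuracy, whereas its recall would be zero.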
Continuous Improvement
Feedback loops were incorporated to collect labelled data on detected fraud cases and use
it to improve the models. This iterative process helps enhance the accuracy and
effectiveness of the fraud detection system over time.
4 Results & Findings
4.1 Categorical Analysis of Customer Categories
The bar plot reveals the distribution of customer categories in the dataset. The x-axis
represents the different customer categories, and the y-axis represents the count of
customers in each category. The following observations can be made from the plot:
Low-Profile: This category has the highest count, indicating that a significant portion of the
customers falls into this category.
High-Profile: This category has a relatively low count compared to the others, indicating a
smaller proportion of customers.
Implications:
The distribution of customer categories provides valuable insights into the customer base.
The dominance of the Low-Profile category suggests that most customers in the dataset
have low transaction activity or account balances. On the other hand, the presence of
Medium-Profile and High-Profile categories indicates the existence of customers with
relatively higher transaction activity or account balances.
Understanding the distribution of customer categories can be useful for various purposes,
such as targeted marketing campaigns, customer segmentation, and fraud detection. Further
analysis can be performed to explore the relationships between customer categories and
other variables in the dataset.
It is important to note that this analysis is based on the given dataset and may not represent
the entire population accurately. Additional data and more comprehensive analysis can
provide deeper insights into customer categories and their significance in the context of the
domain.
5 Discussions
5.1 Proactive Measure for Fraud Prevention
Dynamic Risk Scoring: Dynamic risk scoring entails continuously evaluating, in real time, the
risk attached to each financial transaction. It considers several factors, including the
transaction amount, previous interactions with the customer, location, and the device used for
the transaction. Each transaction is given a risk score, which allows the system to detect
suspicious activity based on changes in the customer's usual behavior.
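A minimal sketch of such a score is shown below. The factors follow the paragraph above, but the weights, the saturation point of ten times the typical amount, and all field names are assumptions chosen for illustration, not values from the study.

```python
def risk_score(transaction, profile):
    """Combine simple deviation signals into a risk score between 0 and 1."""
    amount_ratio = transaction["amount"] / max(profile["typical_amount"], 1.0)
    signals = {
        # Saturates at 10 times the customer's typical amount.
        "amount": min(amount_ratio / 10.0, 1.0),
        "new_location": 0.0 if transaction["location"] in profile["known_locations"] else 1.0,
        "new_device": 0.0 if transaction["device_id"] in profile["known_devices"] else 1.0,
    }
    # Illustrative weights that sum to 1.
    weights = {"amount": 0.5, "new_location": 0.3, "new_device": 0.2}
    return sum(weights[name] * value for name, value in signals.items())


# Hypothetical customer profile built from previous interactions.
profile = {
    "typical_amount": 50.0,
    "known_locations": {"home_city"},
    "known_devices": {"phone-1"},
}
usual = {"amount": 45.0, "location": "home_city", "device_id": "phone-1"}
unusual = {"amount": 900.0, "location": "other_city", "device_id": "laptop-9"}
print(round(risk_score(usual, profile), 3), round(risk_score(unusual, profile), 3))
```

In practice the weights would be learned from labelled data or tuned by fraud analysts; a trained model's predicted probability can serve as the score in the same way.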
Adaptive Thresholds: Based on past trends and the current risk level, adaptive thresholds
modify the fraud detection criteria. The system dynamically modifies the thresholds to
account for legitimate variances and maintain sensitivity to suspected fraud trends as the
risk level changes. This lessens the likelihood of both false positives (valid transactions
marked as fraudulent) and false negatives (fraudulent transactions that go undetected).
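One simple way to realise an adaptive threshold is to derive it from the recent distribution of risk scores. The sketch below is only one possible scheme: the window size, the multiplier k, the floor and ceiling, and the class name are all assumptions made for this example.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Flag scores above mean + k * std of recent unflagged scores, within fixed bounds."""

    def __init__(self, window=100, k=3.0, floor=0.3, ceiling=0.9, default=0.5):
        self.recent = deque(maxlen=window)
        self.k, self.floor, self.ceiling, self.default = k, floor, ceiling, default

    def current(self):
        if len(self.recent) < 2:
            return self.default  # too little history: fall back to a fixed threshold
        raw = mean(self.recent) + self.k * pstdev(self.recent)
        return min(max(raw, self.floor), self.ceiling)

    def check(self, score):
        flagged = score > self.current()
        if not flagged:
            self.recent.append(score)  # only unflagged scores update the baseline
        return flagged


threshold = AdaptiveThreshold()
for score in [0.10, 0.12, 0.08, 0.11, 0.09]:  # calm period with low risk scores
    threshold.check(score)
print(threshold.current(), threshold.check(0.45), threshold.check(0.20))
```

The floor prevents the threshold from collapsing during very quiet periods, which would flood analysts with false positives, and the ceiling keeps a noisy period from hiding clearly risky transactions.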
Behavioural Analysis: Behavioural analysis examines consumer behavior and transaction trends
over time. The system can spot abnormal actions that differ from the
customer's typical usage patterns by creating a baseline of normal behavior. Changes in
transaction amounts, frequency, locations, or unexpected transaction sequences fall under
this category.
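A baseline of normal behavior can be as simple as the mean and standard deviation of a customer's past amounts. The following sketch flags a transaction whose amount lies far from that baseline; the limit of three standard deviations and the sample history are assumptions made for illustration.

```python
from statistics import mean, stdev


def is_anomalous(past_amounts, new_amount, z_limit=3.0):
    """Flag a transaction whose amount deviates strongly from the customer's baseline."""
    if len(past_amounts) < 2:
        return False  # too little history to form a baseline
    baseline, spread = mean(past_amounts), stdev(past_amounts)
    if spread == 0:
        return new_amount != baseline
    return abs(new_amount - baseline) / spread > z_limit


# Hypothetical history of one customer's transaction amounts.
history = [42.0, 38.0, 45.0, 40.0, 41.0]
print(is_anomalous(history, 44.0), is_anomalous(history, 400.0))
```

The early return for short histories reflects a known weakness of this approach: a new customer has no baseline yet, so other signals are needed until enough transactions accumulate.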
Real-time Monitoring: Put in place a system for real-time monitoring that continuously
assesses incoming transactions utilizing dynamic risk scoring and flexible thresholds. This
makes it possible to quickly identify and stop suspicious transactions before they are
executed.
Machine Learning Model: Use machine learning models, including anomaly detection and
predictive modeling, to analyze activity and spot odd transaction patterns. These models
can be trained using past data to identify new fraud tendencies.
Integrate rule-based filters to detect well-known fraud behaviors and use them as extra
levels of security.
• Real-time fraud detection is made possible by dynamic risk scoring and adaptive
thresholds, which lowers the possibility of successful fraud attempts.
• Behavior analysis improves accuracy by spotting new, previously unseen fraud patterns.
• The financial losses brought on by fraudulent activity might be considerably
decreased with proactive actions.
Limitations
• If adaptive thresholds are set too conservatively, legitimate transactions may be flagged
as high-risk, producing false positives that inconvenience genuine customers.
• It may take time for proactive methods to identify sophisticated fraud techniques,
necessitating ongoing model training and upgrades.
• Without adequate previous data to establish a baseline, behavior analysis can be
difficult for new clients.
5.2.1 Architectural Practices for Financial Institutions
• Adopting a microservices design enables the independent and modular construction
of system components, making it simpler to grow, update, and manage the system.
• Implement load-balancing strategies to split up incoming requests among several
servers, guaranteeing optimum resource usage and avoiding overloading of
components.
• High Availability: Ensure the system's high availability by implementing failover
mechanisms, deploying redundant components, and taking disaster recovery plans into
account.
• Data Replication: To ensure data redundancy and preserve service continuity in the
event of data center failure, use data replication across geographically dispersed
data centers.
6 Conclusion
This study examined numerous methods for detecting and preventing fraud in financial
transactions. In order to identify fraudulent activity,
the study looked at the usage of supervised learning algorithms, unsupervised learning
algorithms, and hybrid approaches. In addition, the capacity of deep learning models, notably
neural networks, to recognize intricate fraud patterns was examined. The study also
stressed the significance of incorporating machine learning models into real-time monitoring
to create a reliable fraud detection system.
d) Investigate the use of online learning strategies to modify the fraud detection system
in real-time as new data becomes available, enhancing its response to changing
fraud patterns.
e) Investigate how deep reinforcement learning can be used to detect fraud. Through
interactions with its environment, the system can learn the best practices for
preventing fraud.
f) Enhanced Data Preprocessing: Improve the training dataset's quality by further
refining data preprocessing procedures to manage missing or noisy data.
g) Integration with External Data Sources: To improve the fraud detection process, think
about integrating external data sources, such as social media data or transaction
history from partner institutions.
h) Develop a thorough system for continual monitoring, evaluation, and modifications to
accommodate new fraud schemes and guarantee the system's continued
applicability.
7 References
1. Buczak, A. L., & Guven, E. (2016). A Survey of Data Mining and Machine Learning Methods
for Cyber Security Intrusion Detection. IEEE Communications Surveys & Tutorials, 18(2),
1153-1176. DOI: 10.1109/COMST.2015.2494502.
2. Ranshous, S., Bay, C., Cramer, N., Henricksen, M., & Hannigan, B. (2015). Combining
Clustering and Classification for Anomalous Activity Detection in Cybersecurity. In
Proceedings of the 2015 Workshop on Artificial Intelligence and Security (pp. 49-58).
3. Bhattacharyya, D., Kalaimannan, E., & Verma, A. (2018). Anomalous Pattern Detection in
Enterprise Data Using Hybrid Classification and Clustering Techniques. Procedia Computer
Science, 132, 1066-1075. DOI: 10.1016/j.procs.2018.05.110.
4. Phua, C., Lee, V., Smith, K., & Gayler, R. (2010). A Comprehensive Survey of Data Mining-
based Fraud Detection Research. Artificial Intelligence Review, 33(4), 229-246. DOI:
10.1007/s10462-009-9128-7.
5. Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. New York, NY: Springer-Verlag.
6. Brownlee, J. (2020). Master Machine Learning Algorithms. Machine Learning Mastery.
7. Chollet, F. (2018). Deep Learning with Python. Manning Publications.
8. Varshney, A., Mishra, S., & Jha, R. P. (2019). A Review on Machine Learning Algorithms for
Fraud Detection. Procedia Computer Science, 132, 1575-1584. DOI:
10.1016/j.procs.2019.04.169.
9. Cawley, G. C., & Talbot, N. L. (2010). On Over-fitting in Model Selection and Subsequent
Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 11, 2079-
2107.
10. Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. New York, NY: Springer-Verlag.
11. Kotsiantis, S. B. (2013). Decision Trees: A Recent Overview. Artificial Intelligence Review,
39(4), 261-283. DOI: 10.1007/s10462-011-9272-4.
12. Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. International
Conference on Learning Representations (ICLR).
8 Appendix 1 – Code using VSCODE
import pandas as pd
import joblib
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report


class FraudDetection:
    def __init__(self):
        self.data = None
        self.model = None
        self.X = None
        self.y = None

    def preprocess_data(self):
        # Data Encoding
        label_encoder = LabelEncoder()
        self.data['region'] = label_encoder.fit_transform(self.data['region'])
        self.data['state'] = label_encoder.fit_transform(self.data['state'])
        self.data['customer_category'] = label_encoder.fit_transform(
            self.data['customer_category'])
        # Preprocessing steps
        self.data_encoded.drop(['step', 'type', 'nameOrig', 'nameDest',
                                'isFlaggedFraud'], axis=1, inplace=True)
        elif model_type == 'SVM':
            self.model = SVC()
        elif model_type == 'KMeans':
            self.model = KMeans(n_clusters=2, random_state=42)
        else:
            raise ValueError("Invalid model type. Please choose "
                             "'RandomForest', 'SVM', or 'KMeans'.")

        return predictions


if __name__ == "__main__":
    # Instantiate the FraudDetection class
    fraud_detector = FraudDetection()
    # Sample data for prediction (replace this with your sample data)
    sample_data = pd.DataFrame(...)  # Provide your sample data as a DataFrame
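The model-selection branch in the class above could also be expressed as a lookup table, which keeps the list of supported names in one place. This is a sketch under assumptions: `MODEL_FACTORIES` and `build_model` are hypothetical names, and the `RandomForestClassifier` arguments are assumed because that branch is not shown in the listing.

```python
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Hypothetical registry: maps the model_type string to a zero-argument factory.
MODEL_FACTORIES = {
    "RandomForest": lambda: RandomForestClassifier(random_state=42),  # assumed settings
    "SVM": lambda: SVC(),
    "KMeans": lambda: KMeans(n_clusters=2, random_state=42),
}


def build_model(model_type):
    """Return a fresh, unfitted estimator for the requested model type."""
    try:
        return MODEL_FACTORIES[model_type]()
    except KeyError:
        raise ValueError(
            "Invalid model type. Please choose 'RandomForest', 'SVM', or 'KMeans'."
        ) from None


model = build_model("SVM")
print(type(model).__name__)
```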
step  type  amount  nameOrig  oldbalanceOrg  newbalanceOrig  nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud
In [4]:
df.tail()
Out[4]:
         step      type      amount     nameOrig  oldbalanceOrg  newbalanceOrig     nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud
6362615   743  CASH_OUT   339682.13   C786484425      339682.13             0.0   C776919290            0.00       339682.13        1               0
6362616   743  TRANSFER  6311409.28  C1529008245     6311409.28             0.0  C1881841831            0.00            0.00        1               0
6362617   743  CASH_OUT  6311409.28  C1162922333     6311409.28             0.0  C1365125890        68488.84      6379898.11        1               0
6362618   743  TRANSFER   850002.52  C1685995037      850002.52             0.0  C2080388513            0.00            0.00        1               0
6362619   743  CASH_OUT   850002.52  C1280323807      850002.52             0.0   C873221189      6510099.11      7360101.63        1               0
In [5]:
#check the data columns and rows
df.shape
Out[5]:
(6362620, 11)
In [6]:
#checking the columns or variables in the dataset
df.columns
Out[6]:
Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')
In [7]:
# Identify the column types
df.dtypes
Out[7]:
step int64
type object
amount float64
nameOrig object
oldbalanceOrg float64
newbalanceOrig float64
nameDest object
oldbalanceDest float64
newbalanceDest float64
isFraud int64
isFlaggedFraud int64
dtype: object
The dataset contains 6,362,620 records with 11 variables (columns).
In [8]:
#missing values
df.isnull().sum()
Out[8]:
step 0
type 0
amount 0
nameOrig 0
oldbalanceOrg 0
newbalanceOrig 0
nameDest 0
oldbalanceDest 0
newbalanceDest 0
isFraud 0
isFlaggedFraud 0
dtype: int64
In [9]:
# Handling missing values
df.dropna(inplace=True) # Drop rows with missing values
In [10]:
df.shape
Out[10]:
(6362620, 11)
After dropping rows with missing values, the number of rows remains 6,362,620, which confirms that
there are no missing values in the dataset.
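The same check can be illustrated on a toy frame in which one value is missing, so that the effect of `dropna` becomes visible. This is a minimal sketch; the frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the transaction data (values are made up).
toy = pd.DataFrame({
    "amount": [100.0, np.nan, 250.0],
    "isFraud": [0, 1, 0],
})

missing_per_column = toy.isnull().sum()   # count of NaN values per column
rows_before = len(toy)
rows_after = len(toy.dropna())            # dropna returns a copy by default

print(missing_per_column.to_dict())
print(rows_before - rows_after)
```

When the per-column counts are all zero, as in the output above for the full dataset, the row difference is also zero and the `dropna` step has no effect.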
3. Exploratory Data Analysis
In [11]:
df.describe()
Out[11]:
step  amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud
In [12]:
# Visualize the distribution of the target variable (isFraud)
# Pass the column by keyword; seaborn 0.12+ no longer accepts it positionally
sns.countplot(x='isFraud', data=df)
plt.title('Distribution of Fraudulent and Non-Fraudulent Transactions')
plt.xlabel('isFraud')
plt.ylabel('Count')
plt.show()
In [20]:
# Explore the distribution of 'amount' column using a histogram
plt.figure(figsize=(10, 6))
plt.hist(df['amount'], bins=50, color='blue')
plt.xlabel('Transaction Amount')
plt.ylabel('Frequency')
plt.title('Distribution of Transaction Amount')
plt.show()
In [22]:
# Explore the distribution of 'type' column using a bar plot
plt.figure(figsize=(8, 5))
df['type'].value_counts().plot(kind='bar', color='green')
plt.xlabel('Transaction Type')
plt.ylabel('Frequency')
plt.title('Distribution of Transaction Types')
plt.xticks(rotation=45)
plt.show()
In [23]:
# Explore the relationship between 'amount' and 'isFraud' using a box plot
plt.figure(figsize=(8, 5))
plt.boxplot([df[df['isFraud'] == 0]['amount'], df[df['isFraud'] ==
1]['amount']], labels=['Not Fraud', 'Fraud'])
plt.xlabel('Fraud')
plt.ylabel('Transaction Amount')
plt.title('Transaction Amount vs. Fraud')
plt.show()
In [24]:
# Explore the distribution of 'isFraud' using a pie chart
plt.figure(figsize=(6, 6))
df['isFraud'].value_counts().plot(kind='pie', autopct='%1.1f%%',
colors=['lightcoral', 'lightgreen'])
plt.title('Percentage of Fraudulent Transactions')
plt.legend(['Not Fraud', 'Fraud'])
plt.show()
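Because the pie chart labels are formatted to one decimal place, a very small fraud share is hard to read from the figure. The sketch below computes the share directly from the class supports reported in the classification report of this appendix (1,270,904 non-fraud and 1,620 fraud rows in the test split). Whether the full dataset has the same ratio is an assumption.

```python
# Class supports taken from the classification report in this appendix (test split).
non_fraud, fraud = 1_270_904, 1_620
total = non_fraud + fraud
fraud_rate = fraud / total

print(f"{total:,} rows, fraud share {fraud_rate:.3%}")
print(f"pie label at one decimal: {fraud_rate * 100:.1f}%")
```

On the full DataFrame, `df['isFraud'].value_counts(normalize=True)` would give the corresponding proportions.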
In [13]:
# Encode categorical variables using LabelEncoder
label_encoder = LabelEncoder()
df['type'] = label_encoder.fit_transform(df['type'])
In [14]:
# Remove unnecessary columns
df.drop(['step', 'nameOrig', 'nameDest', 'isFlaggedFraud'], axis=1,
inplace=True)
In [15]:
# Perform one-hot encoding on categorical variables
categorical_cols = ['type']
df_encoded = pd.get_dummies(df, columns=categorical_cols)
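In the cells above, `type` is label-encoded before `pd.get_dummies` is called, and the later cells build the features from `df` rather than `df_encoded`. To make the difference between the two encodings concrete, here is a small sketch on raw string values; the frame `toy` and its category values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column of transaction types (illustrative values).
toy = pd.DataFrame({"type": ["CASH_OUT", "TRANSFER", "PAYMENT", "CASH_OUT"]})

# Label encoding: a single integer column; codes follow alphabetical order,
# which implies an ordering that has no real meaning for transaction types.
label_codes = LabelEncoder().fit_transform(toy["type"])

# One-hot encoding: one indicator column per category, no implied ordering.
one_hot = pd.get_dummies(toy, columns=["type"])

print(label_codes.tolist())
print(list(one_hot.columns))
```

Tree-based models are usually insensitive to the choice, whereas distance-based methods such as SVM and K-Means can be affected by the artificial ordering of integer codes.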
In [16]:
# Split the dataset into features (X) and labels (y)
X = df.drop('isFraud', axis=1)
y = df['isFraud']
In [17]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
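With a rare positive class, a purely random split can leave the test set with a different fraud share than the training set. One common option, not used in the cell above, is the `stratify` argument of `train_test_split`. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 10 positives among 1,000 rows (made-up data).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 10 + [0] * 990)

# stratify=y keeps the class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(int(y_train.sum()), int(y_test.sum()))
```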
In [18]:
# Scale the numerical features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
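Keeping the scaled and unscaled arrays apart by hand is error-prone, since a model fitted on scaled data must also be scored on scaled data. One way to avoid the bookkeeping is to bundle the scaler and the classifier in a scikit-learn pipeline. The sketch below uses synthetic data and an arbitrary small forest; it is not the configuration used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for transaction features (made-up data).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=1000.0, scale=300.0, size=(200, 2))
y_train = (X_train[:, 0] > 1000.0).astype(int)
X_test = rng.normal(loc=1000.0, scale=300.0, size=(50, 2))

# The pipeline fits the scaler on the training data only and reuses it at prediction time.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=20, random_state=0),
)
pipe.fit(X_train, y_train)
probs = pipe.predict_proba(X_test)[:, 1]
print(probs.shape)
```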
              precision    recall  f1-score   support
           0       1.00      1.00      1.00   1270904
           1       0.99      0.47      0.64      1620
K-Means Clustering:
              precision    recall  f1-score   support
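The class 1 row of the report above (precision 0.99, recall 0.47, support 1,620) can be checked by hand: the F1 score is the harmonic mean of precision and recall. The short sketch below reproduces that arithmetic. Because the reported values are rounded to two decimals, the count of detected frauds is approximate.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values read from the class 1 row of the classification report.
precision, recall, support = 0.99, 0.47, 1620

f1 = f1_from(precision, recall)
caught = recall * support          # approximate number of frauds detected
missed = support - caught          # approximate number of frauds missed

print(round(f1, 2))
print(round(caught), round(missed))
```

A recall of 0.47 means that roughly half of the fraudulent transactions in the test split go undetected, even though nearly every flagged transaction is truly fraudulent.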
In [138]:
# ROC Curve
# Assumes the models were fitted on X_train_scaled, so score the matching scaled test set
rf_probs = rf_model.predict_proba(X_test_scaled)[:, 1]
svm_probs = svm_model.decision_function(X_test_scaled)
# Distance to the centroid of cluster 1, used here as the K-Means score
kmeans_probs = kmeans_model.transform(X_test_scaled)[:, 1]
In [141]:
from sklearn.metrics import roc_curve

rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
svm_fpr, svm_tpr, _ = roc_curve(y_test, svm_probs)
kmeans_fpr, kmeans_tpr, _ = roc_curve(y_test, kmeans_probs)
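For K-Means, `transform` returns the distance of each sample to every centroid, and the cluster numbering is arbitrary, so column 1 is not guaranteed to correspond to the fraud class. A possible alternative, sketched here on synthetic data, is to score each sample by its distance from the centroid of the largest cluster. The two toy blobs and the majority-cluster rule are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two toy blobs: a large "normal" group and a small, distant group.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
unusual = rng.normal(loc=8.0, scale=1.0, size=(10, 2))
X = np.vstack([normal, unusual])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
distances = kmeans.transform(X)            # shape (n_samples, n_clusters)

# Cluster ids are arbitrary, so identify the majority cluster by its size.
majority = np.bincount(kmeans.labels_).argmax()
score = distances[:, majority]             # larger = farther from typical behaviour

print(distances.shape)
print(score[:200].mean() < score[200:].mean())
```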
In [142]:
plt.plot(rf_fpr, rf_tpr, label='Random Forest')
plt.plot(svm_fpr, svm_tpr, label='Support Vector Machine')
plt.plot(kmeans_fpr, kmeans_tpr, label='K-Means Clustering')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
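The curves can be complemented with single-number summaries. On heavily imbalanced data, the area under the ROC curve can look favourable even when precision on the rare class is poor, so average precision (the area under the precision-recall curve) is often reported alongside it. The sketch below uses a synthetic imbalanced dataset and an arbitrary small forest; it does not reproduce the results of this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives (made-up stand-in for fraud labels).
X, y = make_classification(
    n_samples=2000, n_features=6, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # probability of the positive class

roc_auc = roc_auc_score(y_test, scores)
avg_precision = average_precision_score(y_test, scores)
print(f"ROC AUC: {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f}")
```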