Deep Learning Methods for Credit Card Fraud Detection

Thanh Thi Nguyen¹, Hammad Tahir¹, Mohamed Abdelrazek¹ and Ali Babar²

¹School of Information Technology, Deakin University, Victoria, Australia
²School of Computer Science, The University of Adelaide, South Australia, Australia

E-mails: thanh.nguyen@deakin.edu.au, mohamed.abdelrazek@deakin.edu.au, ali.babar@adelaide.edu.au
Tel: +61 3 52278281

Abstract—Credit card frauds are occurring at an ever-increasing rate and have become a major problem in the financial sector. Because of these frauds, card users are hesitant in making purchases, and both the merchants and the financial institutions bear heavy losses. Some major challenges in credit card fraud detection involve the availability of public data, high class imbalance in the data, the changing nature of frauds, and the high number of false alarms. Machine learning techniques have been used to detect credit card frauds, but no fraud detection system has been able to offer great efficiency to date. Recent developments in deep learning have been applied to solve complex problems in various areas. This paper presents a thorough study of deep learning methods for the credit card fraud detection problem and compares their performance with various machine learning algorithms on three different financial datasets. Experimental results show great performance of the proposed deep learning methods against traditional machine learning models and imply that the proposed approaches can be implemented effectively for real-world credit card fraud detection systems.

Keywords—deep learning, machine learning, credit card frauds, fraud detection, cyber security, CNN, LSTM

I. INTRODUCTION

We live in a world today where, with the power of a single touch, we can achieve massive results. We can book rides, talk to personal virtual assistants, get recommendations, navigate maps and order food to our homes. All of this is possible only because of higher computing capabilities and shared IT infrastructure. Because of this, a large volume of data is created, and data generation is expected to reach 44 zettabytes (40 trillion gigabytes) in 2020, up from 4.4 zettabytes in 2013 [1]. The rise of artificial intelligence (AI) and machine learning (ML) in recent years is a consequence of this upsurge of data. Today we rely on countless implementations of ML in our everyday life without even realizing it. One such implementation is credit card fraud detection systems using ML techniques, which make our payment methods more robust.

Credit card fraud is an extension of theft and fraud carried out by obtaining credit card details or using counterfeit cards for illegal monetary transactions. Credit card fraud can be physical, where the fraudster presents a credit card physically to the merchant, or virtual, where the transaction is made over the internet. Fraudsters use various techniques for this purpose, such as site cloning, where a duplicate copy of a merchant's website is created to obtain credit card information, or cloning methods, where a magnetic stripe reader obtains information on the credit card to make a fake copy of it [2].

ML is a branch of AI in which a computer (machine) is able to perform predictions based on findings from previous data trends. Since the astonishing success of Google DeepMind in 2015, AI and ML have expanded to new horizons. Some practical implementations using deep anomaly detection are computer network intrusion detection [3]; banking, insurance, mobile cellular network and health care fraud detection; medical and malware anomaly detection; and anomaly detection for video surveillance [4]. Location tracking, Android malware detection, home automation and predicting the occurrence of heart disease are some applications of ML in the Internet of Things domain [5]. In this paper, we study in depth the practical implementation of ML, especially deep learning methods, in detecting credit card frauds in the financial sector.

II. BACKGROUND AND LITERATURE

A. Credit Card Frauds

In recent years, we have become more dependent on mobile phones and web applications, which has caused an increase in the number of online payment transactions. Card frauds have resulted in millions in revenue losses to financial institutions globally and have added stress to credit card users. Approximately 20.48 billion cards (including credit, debit and all prepaid cards) were in circulation worldwide for the year 2017 [6].

Fig. 1. Card Fraud Worldwide [7]
Fig. 1 provides a summary of gross card frauds and cents lost per 100 USD until 2027 globally [7]. It can be seen that credit card fraud reached almost 30 billion USD in 2019 and is projected to increase each year, while cents lost per 100 USD are expected to decrease. The Australian Payments Network, in its annual report, shows an increase in credit card use over the previous years and a total of 574 million AUD lost to fraud in Australia. An increase of 10% among all credit card frauds was observed, whereas Card Not Present (CNP) fraud accounted for 84.9% of all frauds [8]. Some fraud limitation steps taken by the Australian government include the CNP Fraud Mitigation Framework, in which standards are defined for issuers and merchants; the Australian Payments Council's partnership with the Joint Cyber Security Centre to acquire actionable information in the said matter; regulation of EMV chips; avoiding refunds to alternative cards; and the implementation of fraud detection systems by the financial institutions.

B. Fraud Detection using Machine Learning

ML can be implemented for credit card fraud detection, where it becomes possible to classify an incoming transaction as legitimate or fraudulent based on the pattern of previous transactions [9]. ML, nature-inspired learning, and the combination of both these methods to form more robust hybrid models are used for card fraud detection [2]. The increased volume of card transaction data can be used to find outliers in the data by using methods like auto-encoders, long short-term memory networks and convolutional neural networks [4]. Decision tree classifiers and ANNs are also used for fraud detection, and their performance is compared with rule-based models [10]. Decision tree classifiers are straightforward, but their performance with complex data is quite low; ANNs perform better with large datasets but require heavy processing power. Rule-based methods are easy to implement, but they are not good at classifying new types of fraud.

The use of Predictive Analytics Technologies (PAT) to detect credit card frauds was advocated in [11]. PAT uses ML and statistical models to make future predictions. The five phases of the predictive analytics process are: outlining the business problem, acquiring and preparing data, analyzing data and formulating the model, deploying the predictive model, and testing model performance. The common red-flag schemes used by PAT to make predictions are uncommon purchases made by the cardholder, sudden identical purchases on the same credit card, purchases with overnight shipping, purchases with international shipment, multiple card shipments to a single address, multiple transactions on a card in a short time, the geolocation of a transaction compared with the cardholder's registered location, and the usage of a single IP address for multiple credit cards. Most vendors rely on ANNs for predictions, but this method is restrained by a high number of false positives. Other challenges associated with PAT are model limitations such as implementation cost and complexity, limited training and learning competence, the inability to adapt to fraud tactics, misclassification cost caused by emphasizing large-amount transactions while ignoring lower amounts cumulatively, reluctance in sharing fraud data, and compliance with law and regulation.

A method to detect fraudulent transactions with the least amount of false predictions was introduced in [12]. The methodology used is a combination of a Hidden Markov Model (HMM), behaviour-based and genetic algorithms. The HMM is used to generate transaction logs for a user, and then a behaviour-based algorithm classifies each transaction into lower, medium and higher profiles by clustering the data and detects fraud by matching the transactions with the user's spending history. The genetic algorithm is then used to calculate thresholds, and finally a transaction is flagged as fraudulent if the average value of all three techniques is greater than 40%. Likewise, an experimental study for detecting credit card fraud using ML methods was provided in [13]. Out of eight classification algorithms tested, C5.0 (a decision tree algorithm), SVM and ANN yielded promising results on a labelled dataset. To evaluate performance, a combination of accuracy, recall and area under the precision-recall curve (AUC) is used. Accuracy alone is not used for evaluation because of the data imbalance in credit card frauds. Recall is the ratio of correctly identified fraud transactions over the actual number of fraud transactions; this ensures the robustness of the system. The shortlisted algorithms are then tested with imbalanced classification techniques: random over-sampling, one-class classification, and cost-sensitive models.

A detailed mechanism using K-means clustering and a genetic algorithm to create new data samples for minority clusters, so as to create a balanced dataset and improve classification performance, was introduced in [14]. In that method, unsupervised learning is used to make clusters of similar data points. Then the genetic algorithm, which is inspired by natural selection and genetics, produces new samples for the minority classes. That method helps generate more balanced training sets for card fraud detection and reduces classification error. In contrast, a study of ensemble learning to detect credit card frauds was reported in [15]. Ensemble learning is the approach of combining several ML classifiers to increase prediction performance. ANNs and random forests correctly identify fraud and non-fraud cases. Misclassifying normal or fraudulent transactions are both associated with high financial costs. In an effort to reduce the number of misclassified instances, a combination of 3 feed-forward NNs with different hyperparameters and 2 random forest classifiers with 300 and 400 decision trees is used. The output is then calculated by taking the majority result of the 5 models.

A comparative study on a simple NN, a multi-layer perceptron (MLP) and a Convolutional Neural Network (CNN) was presented in [16]. The data used for that study are self-generated, with 60,000 transactions and 12 features. The features selected in generating the data are common attributes acquired from financial institution databases and usual predictors identified for predictive modeling. The dataset is highly imbalanced and is balanced using under-sampling. The learning rate is set to 0.001 and the activation function used is softmax. The results showed that the MLP performed best, with the highest accuracy of 87.88%, followed by the CNN with an accuracy of 82.86%. Alternatively, results of a real-time deep learning model using auto-encoders for credit card fraud detection were reported in [17]. The performance metrics selected for model evaluation are the confusion matrix, precision, recall and accuracy. The non-linear auto-regression model predicted the most fraudulent transactions; however, it also misclassified most of the legitimate transactions. Logistic regression misclassified legitimate transactions the least, but with low prediction accuracy for fraud cases. Under these circumstances, the deep NN auto-encoder provides the most stable results, with a relatively higher prediction rate and lower misclassification error.
Likewise, CNNs were applied to detect credit card frauds in [18] because of their ability to reduce over-fitting and reveal hidden fraud patterns. That approach used feature engineering to generate aggregated features from the transaction data and introduced a novel feature called trading entropy. Synthetic fraud samples are created from real fraudulent data using cost-based sampling to balance the dataset. The sampled dataset is then transformed into a feature matrix based on different time windows. A CNN similar to LeNet, with 6 layers, takes the feature matrix as input. The evaluation parameter selected is the F1 score. The performance of the different classifiers increased when using the trading entropy feature, and in comparison with NN, SVM and RF, the CNN produced a much better performance on all the different sample sets tested. Similarly, credit card fraud detection was investigated in [19] by looking at individual transactions and advocating the use of time in a sequence of same-card transactions to capture the changing nature of fraud. In this regard, adding statistical features derived from the real features can help improve classification performance. One example is transaction velocity, which at a certain point in time counts the number of transactions carried out within a time frame. By adding time series components to the data, the authors compared the performance of SVM and LSTM models. The metrics set for the performance evaluation were AUC and mean squared error. The experimental results showed that LSTM performed far better than SVM in terms of both the evaluation metrics and the classification rate (transactions/sec).

A comparison between CNN, Stacked LSTM (SLSTM) and a hybrid model combining CNN and LSTM (CNN-LSTM) was presented in [20] for credit card fraud detection. CNN is powerful in learning from short-term sequences in the data, while LSTM is good at capturing long-term sequences. The dataset used is from an Indonesian bank, and the majority class of non-fraud values is under-sampled at 4 different ratios to create 4 datasets for testing. That study represents features with respect to time, and PCA is used for dimensionality reduction. The results revealed that increasing the ratio between non-fraud and fraud values increased the accuracy of the classifier. For training accuracy, SLSTM was on top, CNN-LSTM stood second and CNN was last on all 4 datasets. However, due to the imbalanced nature of the datasets, accuracy is not the only measure for performance validation, and the AUC values reveal that CNN performed best, followed by CNN-LSTM and then SLSTM, which highlights that the patterns of fraud transactions are dominated by short-term relations rather than long-term ones.

III. PROPOSED DEEP LEARNING-BASED CLASSIFIERS

As ML methods rely solely on historical data, and due to the sensitive nature of financial data and the need to protect users' privacy, publicly available datasets are not very common, which limits the scope of study in this area. Because of this, almost every study in this area is limited to just one dataset, if any is used at all. In ML, the performance of a model can differ widely for different datasets (business cases). How performance varies across three datasets with varying numbers of features and transactions is one research question of this study. Furthermore, credit card fraud detection methods face the problem of class imbalance, as the number of fraud cases compared to the hundreds of thousands of normal transactions is very small. Properly addressing class imbalance is a major challenge, and how classifier behaviour changes when various sampling methods are applied to deal with class imbalance is another aim of this study.

Traditional ML algorithms such as Support Vector Machines (SVM), Decision Tree (DT) and Logistic Regression (LR) have been extensively proposed for credit card fraud detection. These traditional algorithms are not very suitable for large datasets. The use of deep learning methods is still very limited; methods such as CNN and LSTM are encouraged for image classification and Natural Language Processing (NLP), respectively, because of their ability to handle massive datasets. How these deep learning methods perform for credit card fraud classification is the major focus of this study. In addition, data pre-processing is an important stage in the ML process. How the classification performance is affected in response to data pre-processing in credit card fraud detection is another question that needs to be answered.

A. One-Dimensional CNN (1DCNN)

CNN is a deep learning method heavily associated with spatial data such as image data. Similar to an ANN, a CNN has the same hidden layer structure, in addition to special convolution layers with a different number of channels in each layer. The word convolution is linked with the idea of moving filters that capture the key information from the data. CNN is widely used in image processing as it automatically performs feature reduction, which makes it less prone to overfitting; thus, training a CNN does not require heavy data pre-processing. The role of a CNN in image processing is to minimize the processing by reducing the image without losing the key features needed to make predictions [21]. The key terms in CNN are feature maps, channels, pooling, stride and padding.

In comparison to the popular multi-layer perceptron (MLP) network, CNNs are not fully connected from layer to layer, and unlike an MLP, which has a different weight associated with each node, a CNN has a constant weight parameter for each filter; these two features reduce the number of parameters in a CNN model. Also, the pooling method improves the feature detection process, making it more robust to changes in the size and position of an element in an image.

CNN models are conventionally used for image and video processing, which has two-dimensional data as input, and are therefore named 2DCNN. The feature mapping process is used to learn the internal representation from the input data, and the same procedure can be used for one-dimensional data as well, where the location of features is not relevant. A very popular example of a 1DCNN application is Natural Language Processing, which is a sequence classification problem. In a 1DCNN, the kernel filter moves top to bottom over a sequence of a data sample, instead of moving left to right and top to bottom as in a 2DCNN.
TABLE I. 2DCNN STRUCTURE

Layer 1: Input; Input Shape (Row sample, 5, 6, 1)
Layer 2: CONV2D; Number of channels 64; Kernel Size 3x3; Activation Function ReLU
Layer 3: CONV2D; Number of channels 32; Kernel Size 3x3; Activation Function ReLU
Layer 4: Flatten; Number of Nodes 64
Layer 5: Output; Number of Nodes 1; Activation Function Sigmoid

TABLE II. 1DCNN STRUCTURE

Layer 1: Input; Input Shape (Row sample, 1, Number of Features)
Layer 2: CONV1D; Number of channels 64; Kernel Size 1; Activation Function ReLU
Layer 3: CONV1D; Number of channels 64; Kernel Size 1; Activation Function ReLU
Layer 4: Dropout; Threshold 0.5
Layer 5: MaxPooling1D; Pool size 1
Layer 6: Flatten; Number of Nodes 64
Layer 7: Dense; Number of Nodes 100; Activation Function ReLU
Layer 8: Output; Number of Nodes 1; Activation Function Sigmoid
This study uses both the 2DCNN and the 1DCNN to classify fraud and non-fraud cases. The 2DCNN is used for the European Card Dataset, in which the number of features is thirty. Each transaction sample is reshaped into a two-dimensional image and passed as input to the 2DCNN model. Tables I and II show the construction parameters of the 2DCNN and 1DCNN, respectively, as used in this study; a minimal sketch of both models is given below.
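As an illustration only, the following Keras sketch shows how models with the layer structures of Tables I and II could be assembled. It is not the authors' exact implementation: the optimizer, loss function and any settings not listed in the tables are assumptions.

```python
# Sketch of the 2DCNN (Table I) and 1DCNN (Table II) classifiers.
# Optimizer/loss are assumptions; the paper does not specify them.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv2D, Conv1D, Dropout,
                                     MaxPooling1D, Flatten, Dense)

def build_2dcnn():
    # Each ECD transaction (30 features) is assumed to be reshaped
    # into a 5x6x1 "image" before being fed to this model.
    return Sequential([
        Conv2D(64, (3, 3), activation="relu", input_shape=(5, 6, 1)),
        Conv2D(32, (3, 3), activation="relu"),
        Flatten(),                       # 1x2x32 -> 64 values, as in Table I
        Dense(1, activation="sigmoid"),  # fraud probability
    ])

def build_1dcnn(n_features):
    # Input shape (1, n_features): one time step, n_features channels.
    return Sequential([
        Conv1D(64, 1, activation="relu", input_shape=(1, n_features)),
        Conv1D(64, 1, activation="relu"),
        Dropout(0.5),
        MaxPooling1D(pool_size=1),
        Flatten(),
        Dense(100, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])

model = build_2dcnn()
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```

Note that with the default 'valid' padding, the two 3x3 convolutions reduce the 5x6 input to 1x2x32, so the flatten layer emits exactly the 64 nodes listed in Table I.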
B. Long Short Term Memory Network

LSTM is a type of Recurrent Neural Network (RNN). A standard NN cannot keep track of previous information, and every time it has to perform the learning task from scratch. In very simple words, an RNN is a neural network with memory. RNNs tend to have short-term memory because of the vanishing gradient problem. Backpropagation is the backbone of neural networks, as it minimizes the loss by adjusting the weights of the network, which are found using gradients. In an RNN, as the gradient moves back through the network it shrinks, so the weight updates become very small. The earlier layers of the network that are affected by these small updates do not learn much, and the RNN loses the ability to remember early examples in long sequences, making it a short-term memory network.

LSTMs come to the rescue for this short-term memory problem by having a cell state, which is the memory of the network passing through each step, and gates in each step that control the flow of memory by keeping necessary and discarding irrelevant information. The gates used are the forget gate, the input gate and the output gate. The forget gate decides what information to keep from the previous step, the input gate decides what information to add from the current step, and the output gate decides the hidden state for the next step. The difference between the hidden state and the cell state is that while the hidden state keeps the information of the model on previous inputs, it still suffers from short-term memory; the cell state overcomes that by remembering key information starting from the earliest examples in the sequence. To understand the complete flow of an LSTM cell, consider Fig. 2, where each dotted box represents a single step [22].

Fig. 2. An LSTM Cell

The first step, shown in the red dotted box, is the forget gate. The previous hidden state (ht-1) and the current input (xt) are passed together to a sigmoid activation function that produces an output between 0 (forget) and 1 (keep). The next step, shown in the yellow dotted box, is the input gate. In this gate, the previous hidden state and the current input are passed through sigmoid and tanh activation functions, and the outputs of both are multiplied; the tanh function regulates the model, while the sigmoid tells which information to keep from the current input. The next step, shown in the purple dotted box, calculates the cell state. Here the previous cell state (Ct-1) is pointwise multiplied with the output of the forget gate, and the product is pointwise added to the output of the input gate to get the new cell state (Ct). The last step, shown in the blue dotted box, is the output gate, which calculates the new hidden state (ht). The new cell state is passed through a tanh activation function and multiplied with the output of a sigmoid activation function whose input is the previous hidden state and the current input. Table III shows the hyperparameters of the LSTM network used in this study.

TABLE III. LSTM STRUCTURE

Layer 1: Input; Input Shape (1, Number of Features)
Layer 2: Dense (recurrent); Number of LSTM Blocks 50; Activation Function ReLU
Layer 3: Output; Number of Nodes 1; Activation Function Sigmoid
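As a companion to Table III, the following is a minimal Keras sketch of such a classifier. The recurrent layer is implemented here with Keras's standard LSTM layer (50 blocks); the optimizer, loss and the layer's default gate activations are assumptions, since the paper lists only the sizes and ReLU/Sigmoid activations.

```python
# Sketch of the LSTM classifier in Table III: 50 LSTM blocks over an
# input of shape (1, number of features), one sigmoid output node.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

def build_lstm(n_features):
    model = Sequential([
        LSTM(50, input_shape=(1, n_features)),  # 50 LSTM blocks
        Dense(1, activation="sigmoid"),         # fraud probability
    ])
    # Assumed training configuration (not specified in the paper).
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Tabular rows must first be reshaped to (samples, 1, n_features), e.g.
# X3d = X.reshape((X.shape[0], 1, X.shape[1])), before calling fit().
```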
IV. EXPERIMENTAL DATASETS

One main focus of this study is to determine the performance of classifiers for credit card fraud detection on datasets having varied numbers of samples and features. For this purpose, three different datasets, i.e. European Card Data (ECD), Small Card Data (SCD) and Tall Card Data (TCD), are used. Like all credit card fraud datasets, in which fraud instances are very few compared to normal transactions, these datasets are highly imbalanced. All three datasets in this study are labelled, with class value '0' representing no fraud and '1' indicating fraud. Further details and the class imbalance percentages are given below.
A. European Card Data

This dataset, courtesy of the Machine Learning Group of Université Libre de Bruxelles, is retrieved from Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud). It contains two days of transaction data of European cardholders in September 2013. The dataset contains 284,807 samples and 31 features. Out of the given samples, only 492 are fraud cases, accounting for 0.172% of the dataset. Due to the privacy of customer information and the sensitivity of transactional details, all except the 'Time' and 'Amount' features in the dataset are PCA transformed. The 'Time' feature represents the time in seconds elapsed since the first sample in the dataset, and the 'Amount' feature shows the total amount of the transaction. This dataset is referred to as 'ECD' in this study.
B. Small Card Data

This dataset, also retrieved from Kaggle (https://www.kaggle.com/shubhamjoshi2130of/abstract-data-set-for-credit-card-fraud-detection#creditcardcsvpresent.csv), is a small dataset containing 3,075 samples and 12 features. Half of the features are categorical while the other half are numerical. Out of the 3,075 samples, 448 are fraud cases, contributing 14.6% of all cases. The features used in this dataset are Merchant ID, Transaction date, Average transaction amount per day, Is the transaction amount declined, Number of declines per day, Is a foreign transaction, Is a high-risk country, Daily chargeback average amount, Six-month chargeback average amount, Six-month chargeback frequency, and Is fraudulent. Due to its smaller number of rows and columns, this dataset is named Small Card Data (SCD).
C. Tall Card Data

This dataset is obtained from an online database (http://packages.revolutionanalytics.com/datasets/) courtesy of [13], containing 10 million samples (rows) and 9 features (columns). Having a high number of samples and a low number of features, this dataset is named Tall Card Data (TCD) in this study. Only 5.96% of the dataset contains fraud cases. The features comprising the dataset are customer ID, gender, state, number of cards a customer has, balance on the card, number of transactions to date, number of international transactions to date, the customer's credit line, and a fraud risk label indicating fraud or non-fraud. Due to limited computing power and the higher training times associated with the classifiers, we take a small proportion of this data for the study. Out of the 10 million samples, we only process 0.5 million, which is almost double the number of samples in ECD. The class imbalance is kept the same: out of the 500,000 total samples, there are 28,000 fraud cases.
V. EVALUATION METRICS

Accuracy is not a suitable metric for model evaluation due to the high class imbalance in the datasets. Selecting the metric for evaluation depends on the nature of the solution; for credit card fraud detection systems, capturing all fraud cases and reducing false alarms (legitimate transactions identified as fraud) is the goal. In this study, we use the confusion matrix and denote non-fraud (legitimate) instances as Negatives and fraud instances as Positives. True negatives (TN) are the non-fraud cases predicted correctly, true positives (TP) are the fraud cases predicted correctly, false positives (FP) are the non-fraud cases predicted as fraud, and false negatives (FN) are the fraud cases predicted as non-fraud. To further understand the evaluation metrics, consider the equations for Accuracy, Precision, Recall and the F1 Score given below.

Accuracy = (TN + TP) / (TN + TP + FN + FP)   (1)
Precision = TP / (TP + FP)   (2)
Recall = TP / (TP + FN)   (3)
F1 = 2 * (Precision * Recall) / (Precision + Recall)   (4)

It can be seen that precision is associated with positive predicted values. Decreasing the number of false positives will increase the precision, so for circumstances where the cost of a false positive is high, precision is a suitable metric. Eq. (3) shows that recall is associated with actual positives. Decreasing the number of false negatives will increase the recall, and problems with a high cost of false negatives tend to aim for high recall. For credit card fraud detection, a balance is required between false positives and false negatives. Predicting all samples as fraud will give high recall but low accuracy, precision and F1 score, while predicting all samples as non-fraud will result in high accuracy but zero recall and undefined precision and F1 score. In this paper, we use all four metrics for the comparisons.
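For illustration, Eqs. (1)-(4) can be computed directly from a confusion matrix; the short sketch below uses scikit-learn with small placeholder label arrays.

```python
# Computing Eqs. (1)-(4) from a confusion matrix (1 = fraud, 0 = legitimate).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 0, 1, 0]   # placeholder ground-truth labels
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]   # placeholder model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy  = (tn + tp) / (tn + tp + fn + fp)                 # Eq. (1)
precision = tp / (tp + fp)                                  # Eq. (2)
recall    = tp / (tp + fn)                                  # Eq. (3)
f1 = 2 * precision * recall / (precision + recall)          # Eq. (4)
print(accuracy, precision, recall, f1)
```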

VI. ADDRESSING THE CLASS IMBALANCE PROBLEM

Class imbalance occurs when the instances in a labelled dataset are not equally divided and the data is separated into majority and minority classes. Credit card transactional data is highly imbalanced because, out of the millions of transactions each day, fraudulent transactions are very few. However, this does not mean that the impact of these few fraud cases compared to legitimate cases can be ignored. These fraud transactions heavily affect everyone in the card business, from user to merchant to issuer. Having an imbalanced dataset for problems where the class of interest is the minority class can lead to poor performance, as the classifier develops a bias towards the majority class and misclassifies the minority class by treating it as noise [23]. Data sampling, cost-sensitive learning, one-class learning and ensemble learning are a few methods to improve performance on imbalanced datasets. In this paper, under-sampling and over-sampling methods are further explored, and performance on sampled data for each dataset is evaluated for all the classifiers. The sampling methods used are as follows.

A. Random Under Sampling

Random under sampling (RUS) is the process in which the instances of the majority class are reduced by randomly selecting and removing them from the data. This is an under-sampling method in which the class imbalance is reduced by shrinking the majority class. To perform random under sampling in Python, the RandomUnderSampler class from the imbalanced-learn library is used. The sampling ratio is selected by the 'sampling_strategy' parameter, which is the desired ratio of minority-class samples over majority-class samples after resampling (a usage sketch covering all three samplers is given after the SMOTE section below).
B. Near Miss Sampling

Randomly selecting instances in RUS can remove key information from the dataset; for this reason, Near Miss (NM) sampling uses distance to select the instances to sample. Near Miss has three variants, versions 1, 2 and 3. After evaluating the performance of all three variants (results provided in the section below), version 1 was selected for this study; it keeps the majority-class instances that have the smallest average distance to the closest three instances of the minority class. In Python, the NearMiss class of the imbalanced-learn library is used to perform this under-sampling.
C. Synthetic Minority Over Sampling Technique (SMOTE)

SMOTE is an over-sampling method that increases the number of instances in the minority class by generating new synthetic samples. These synthetic samples are generated by identifying the nearest neighbours of a minority-class sample and then creating a new sample along the line between the sample and one of its nearest neighbours. In this study, the SMOTE class of the imbalanced-learn library is used to perform the over-sampling, and the sampling ratio is the number of minority-class instances after resampling over the number of majority-class instances.
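As a rough usage sketch of the three samplers described above, assuming imbalanced-learn's fit_resample API and using an artificially generated imbalanced dataset in place of the real data (the 0.1 ratio mirrors the setting compared in the experiments):

```python
# Usage sketch of RandomUnderSampler, NearMiss (version 1) and SMOTE.
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, NearMiss
from imblearn.over_sampling import SMOTE

# Placeholder data: ~1% minority (fraud) class.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.99], flip_y=0, random_state=0)

# RUS: randomly drop majority samples until minority/majority = 0.1.
rus = RandomUnderSampler(sampling_strategy=0.1, random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)

# Near Miss v1: keep the majority samples with the smallest average
# distance to their three closest minority samples.
nm = NearMiss(version=1, sampling_strategy=0.1)
X_nm, y_nm = nm.fit_resample(X, y)

# SMOTE: synthesize minority samples between nearest minority neighbours.
sm = SMOTE(sampling_strategy=0.1, random_state=42)
X_sm, y_sm = sm.fit_resample(X, y)
```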
VII. EXPERIMENTS AND DISCUSSIONS

A. Data Pre-Processing

The first step in the experimentation is data pre-processing. In this step, all three datasets are explored in detail by inspecting the data manually and applying statistical operations. The purpose of data pre-processing is to provide a refined input to the classifiers in order to achieve the best possible output. Missing values, categorical features, variable scales and high dimensionality can all affect the performance of a classifier. The pre-processing methods involved in this study are data exploration, data scaling, and the test-train split.

Table IV provides key information on the data exploration process in this study. For ECD, all the features were numeric, no missing values were found, and no feature was dropped to clean the data. All the categorical features in SCD were changed to numeric, and the 'Transaction date' feature was dropped from the dataset as it consisted entirely of missing values. Both the ECD and SCD datasets were used in whole, while for TCD a smaller fraction of the actual dataset was used. For TCD, all the values were numeric, and the 'custID' feature was dropped as it contained only unique values and added no information to the dataset.
Next in data exploration, the correlation between the features of each dataset is examined. Correlation is a statistical method that helps establish the dependency between variables. It is a number between -1 and 1, where 0 means no relation at all, a negative correlation means the variables are inversely proportional, and a positive correlation means they are directly proportional. Finding correlations can help eliminate features that show similar behaviour in the data, reducing the dimensions of the data; lower-dimensional data helps improve training times and classification performance. Fig. 3 shows the correlation between features in the SCD dataset. It can be observed that there is no negative correlation between the features and no uniform correlation throughout the SCD dataset. A higher average amount of transactions per day is associated with higher transaction amounts. The daily chargeback amount, six-month chargeback amount and six-month chargeback frequency are also highly associated with each other. Finally, the high-risk country feature is the most significant in determining the fraud class. As this dataset is already very small, no dimensionality reduction is performed on the data.

TABLE IV. DATA EXPLORATION

ECD
Number of Rows: 284807
Number of Columns: 31
Feature Type: Numeric
Missing Values: None
Dropped Features: None
Categorical to Numeric: None
Smaller Sample Used: No

SCD
Number of Rows: 3075
Number of Columns: 12
Feature Type: Numeric + Categorical
Missing Values: 3075
Dropped Features: 'Transaction date'
Categorical to Numeric: 'Merchant_id', 'Is declined', 'isForeignTransaction', 'isHighRiskCountry', 'isFradulent'
Smaller Sample Used: No

TCD
Number of Rows: 10000000
Number of Columns: 9
Feature Type: Numeric
Missing Values: None
Dropped Features: 'custID'
Categorical to Numeric: None
Smaller Sample Used: Yes

Fig. 3. Correlation Heatmap of the SCD dataset
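A correlation heatmap such as Fig. 3 can be produced with pandas and seaborn; the small sketch below is an assumption of how it might be done, taking for granted that the SCD file (creditcardcsvpresent.csv) has been downloaded and its categorical columns already encoded as numbers.

```python
# Sketch of the correlation step behind Fig. 3.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("creditcardcsvpresent.csv")       # SCD file from Kaggle
corr = df.select_dtypes(include="number").corr()   # pairwise values in [-1, 1]

sns.heatmap(corr)
plt.title("Correlation Heatmap of the SCD dataset")
plt.tight_layout()
plt.show()
```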
B. Data Scaling and Standardization

Normally the features in a dataset are on different scales. For example, the features 'Amount' and 'V1' in the ECD dataset have means of 94826.6 and 0.000639, respectively. Deep learning algorithms do not perform very well if the input features are not on a fairly similar scale. Scaling and standardization methods bring the features to almost the same scale to make the input more comprehensible for the classifier. In this study, the StandardScaler class from the sklearn preprocessing library in Python is used. The standard scaler transforms each feature in the dataset such that its mean is 0 and its standard deviation is 1. After applying the standard scaler to the dataset, the means of the two features discussed above both changed to 0.
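A minimal sketch of this standardization step, using a tiny placeholder matrix in place of the real features:

```python
# Standardize each column to mean 0 and standard deviation 1.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[94826.6, 1.2],   # placeholder rows on very different scales
              [10.0,   -0.5],
              [500.0,   0.1]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))   # ~[0, 0]
print(X_scaled.std(axis=0))    # ~[1, 1]
```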
C. Test, Train and Validation Split

All three datasets are further divided into test, training and validation data. The test data is a small chunk of data obtained randomly from the dataset, occupying 3.5% of each dataset. The training data is 80% of the remaining data and is used for training the models. The validation data is the remaining 20% and is used for validating the classifiers. The classifiers use this validation data to avoid overfitting and improve model performance.
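One way to realize this split is with two successive calls to scikit-learn's train_test_split, as sketched below. The generated X and y are placeholders for the pre-processed features and labels, and stratifying on the label is an assumption made here to preserve the class ratio in every partition.

```python
# 3.5% test split, then 80/20 train/validation on the remainder.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.035, stratify=y, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.20, stratify=y_rest, random_state=42)
```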
D. Results and Discussions

Fig. 4 provides the comparison between different class imbalance ratios of the ECD dataset; the performance is evaluated using the test data.

Fig. 4. Class Imbalance Comparison (Test Data ECD)

It can be seen that initially the recall is at its maximum and all other metrics are at their minimum; as the class imbalance is increased, recall begins to decrease, while accuracy, precision and F1 score start to increase. When there is no class imbalance, at ratio 1, the model predicts all the fraud cases correctly but misclassifies most of the non-fraud cases, resulting in low precision. Low accuracy is obtained because the proportion of non-fraud cases is much larger than that of fraud cases in the test data. When the class imbalance starts to increase, a significant rise in accuracy is observed, meaning an increase in correct predictions over total predictions. The F1 score and precision also rise with the increase in class imbalance. This is because, with the increase in class imbalance, the model gets more training instances and is able to generalize well.

Fig. 5 shows the performance comparison between the three datasets. The performance is calculated using both validation and test data, and the performance measure selected here is the F1 score. It can be observed that, in general, validation performance decreases as the dataset size increases. SCD gives the best prediction performance on the examples that the system has already seen but drops on new, unseen examples. TCD has the lowest validation and test performance among all the datasets, possibly due to its smaller number of features and the little correlation between them. The performance on ECD is stable, with little variation between validation and test performance. The missing values of 2DCNN for SCD and TCD are because the smaller number of features in those datasets made it impossible to create the feature matrix. The missing values of RF (i.e. random forest) and SVM for TCD are due to the higher training times associated with these classifiers.

Fig. 5. F1 Score - SCD vs ECD vs TCD

The comparison of the sampling methods discussed earlier, tested on validation data, is presented in Fig. 6. It can be observed that overall performance is increased with sampling. The SMOTE method provides the best results, whereas Near Miss performs slightly better than the Random Under Sampler. Finding the best sampling method leads to the next experiment.

Fig. 6. Sampling Method Comparison (Validation Data ECD)

Fig. 7. SMOTE vs Normal Distribution (Test Data ECD)
Fig. 7 demonstrates the comparison of SMOTE with no sampling using the test data. As expected, the recall is increased; however, precision and F1 score decrease drastically with SMOTE, meaning the model predicts fraud cases very accurately but misclassifies the majority of the non-fraud cases. This may be because SMOTE creates new synthetic fraud instances overlapping the non-fraud instances, thus changing the decision boundary.

One of the main objectives of this study was to find the best performing algorithm. From Fig. 5, it can be observed that, although the deep learning methods performed side by side with the traditional algorithms, LSTM has a slightly better performance amongst all tested algorithms. Some other observations made during the experimentation process are as follows: 1) with the support of GPU computing, deep learning methods implemented with the tensorflow library require less training time compared to traditional algorithms, i.e. SVM and RF, on large datasets; 2) increasing the number of epochs increased the misclassification; 3) all the variants of the Near Miss algorithm provided similar results.

VIII. CONCLUSIONS AND FUTURE WORK

Credit card frauds are an increasing threat to financial institutions. Fraudsters tend to come up with new fraud methods every now and then, and a robust classifier is one that can cope with the changing nature of the frauds. Accurately predicting the fraud cases and reducing the number of false positives is the foremost priority of a fraud detection system. The performance of machine learning methods varies for each business case, and the type of input data is a dominant factor driving the machine learning model. For credit card fraud detection, the number of features, the number of transactions and the correlation between the features are important factors in determining model performance. Deep learning methods such as CNN and LSTM are associated with image processing and NLP, respectively. Using these methods for credit card fraud detection yielded better performance than traditional algorithms. While all the algorithms performed side by side, the LSTM with 50 blocks was on top with an F1 score of 84.85%. In this study, sampling methods have been used to deal with the class imbalance problem. Using various sampling methods increased the performance on existing examples but decreased it significantly on newly unseen data. The performance on unseen data increased as the class imbalance was increased. Future work associated with this study is to explore the hyperparameters used to construct the deep learning methods to improve model performance.

REFERENCES

[1] J. Desjardins, How much data is generated each day?, World Economic Forum, April 17, 2019. Accessed on: Nov. 18, 2020. [Online]. Available: https://www.weforum.org/agenda/2019/04/how-much-data-is-generated-each-day-cf4bddf29f/
[2] A. O. Adewumi and A. A. Akinyelu, "A survey of machine-learning and nature-inspired based credit card fraud detection techniques," International Journal of System Assurance Engineering and Management 8, no. 2 (2017): 937-953.
[3] T. T. Nguyen and V. J. Reddi, "Deep reinforcement learning for cyber security," arXiv preprint arXiv:1906.05799 (2019).
[4] R. Chalapathy and S. Chawla, "Deep learning for anomaly detection: a survey," arXiv preprint arXiv:1901.03407 (2019).
[5] K. Sharma and R. Nandal, "A literature study on machine learning fusion with IoT," 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 2019, pp. 1440-1445.
[6] Payment cards projected worldwide, The Nilson Report, Issue 1140, Oct. 2018. Accessed on: Nov. 12, 2020. [Online]. Available: https://nilsonreport.com/upload/issues/1140_0321.pdf
[7] Issue 1164, The Nilson Report, Nov. 2019. Accessed on: Nov. 12, 2020. [Online]. Available: https://nilsonreport.com/publication_chart_of_the_month.php?1=1&issue=1164
[8] Australian Payment Card Fraud 2019, Australian Payments Network, 2019. Accessed on: Nov. 26, 2020. [Online]. Available: https://www.auspaynet.com.au/sites/default/files/2019-08/AustralianPaymentCardFraud2019_0.pdf
[9] I. Sakharova, "Payment card fraud: Challenges and solutions," 2012 IEEE International Conference on Intelligence and Security Informatics, Arlington, VA, 2012, pp. 227-234.
[10] K. Modi and R. Dayma, "Review on fraud detection methods in credit card transactions," 2017 International Conference on Intelligent Computing and Control (I2C2), Coimbatore, 2017, pp. 1-5.
[11] K. T. Hafiz, S. Aghili and P. Zavarsky, "The use of predictive analytics technology to detect credit card fraud in Canada," 2016 11th Iberian Conference on Information Systems and Technologies (CISTI), Las Palmas, 2016, pp. 1-6.
[12] A. Agrawal, S. Kumar and A. K. Mishra, "Credit card fraud detection: a case study," 2nd International Conference on Computing for Sustainable Global Development, New Delhi, 2015, pp. 5-7.
[13] S. Makki, Z. Assaghir, Y. Taher, R. Haque, M. Hacid and H. Zeineddine, "An experimental study with imbalanced classification approaches for credit card fraud detection," IEEE Access, vol. 7, pp. 93010-93022, 2019.
[14] I. Benchaji, S. Douzi and B. E. Ouahidi, "Using genetic algorithm to improve classification of imbalanced datasets for credit card fraud detection," in International Conference on Advanced Information Technology, Services and Systems, pp. 220-229. Springer, Cham, 2018.
[15] I. Sohony, R. Pratap and U. Nambiar, "Ensemble learning for credit card fraud detection," in ACM India Joint International Conference on Data Science and Management of Data, pp. 289-294, 2018.
[16] I. Sadgali, N. Sael and F. Benabbou, "Fraud detection in credit card transaction using neural networks," in Proceedings of the 4th International Conference on Smart City Applications, pp. 1-4, 2019.
[17] Y. Abakarim, M. Lahby and A. Attioui, "An efficient real time model for credit card fraud detection based on deep learning," in Proceedings of the 12th International Conference on Intelligent Systems: Theories and Applications, pp. 1-7, 2018.
[18] K. Fu, D. Cheng, Y. Tu and L. Zhang, "Credit card fraud detection using convolutional neural networks," in International Conference on Neural Information Processing, pp. 483-490. Springer, Cham, 2016.
[19] B. Wiese and C. Omlin, "Credit card transactions, fraud detection, and machine learning: Modelling time with LSTM recurrent neural networks," in Innovations in Neural Information Paradigms and Applications, pp. 231-268. Springer, Berlin, Heidelberg, 2009.
[20] Y. Heryadi and H. L. H. S. Warnars, "Learning temporal representation of transaction amount for fraudulent transaction recognition using CNN, Stacked LSTM, and CNN-LSTM," 2017 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom), Phuket, 2017, pp. 84-89.
[21] T. T. Nguyen, C. M. Nguyen, D. T. Nguyen, D. T. Nguyen and S. Nahavandi, "Deep learning for deepfakes creation and detection: a survey," arXiv preprint arXiv:1909.11573 (2019).
[22] M. Phi, Illustrated Guide to LSTM's and GRU's: A step by step explanation, Towards Data Science, 2018. Accessed on: Nov. 23, 2020. [Online]. Available: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
[23] S. M. A. Elrahman and A. Abraham, "A review of class imbalance problem," Journal of Network and Innovative Computing 1, no. 2013 (2013): 332-340.
