Deep Learning Methods for Credit Card Fraud Detection
Thanh Thi Nguyen (1), Hammad Tahir (1), Mohamed Abdelrazek (1) and Ali Babar (2)
(1) School of Information Technology, Deakin University, Victoria, Australia
(2) School of Computer Science, The University of Adelaide, South Australia, Australia
Abstract: Credit card fraud is growing at an ever-increasing rate and has become a major problem
in the financial sector. Because of these frauds, card users are hesitant to make purchases, and
both merchants and financial institutions bear heavy losses. Some major challenges in credit card
fraud detection involve the availability of public data, high class imbalance in the data, the
changing nature of frauds and the high number of false alarms. Machine learning techniques have
been used to detect credit card frauds, but no fraud detection system has been able to offer great
efficiency to date. Recent developments in deep learning have been applied to solve complex
problems in various areas. This paper presents a thorough study of deep learning methods for the
credit card fraud detection problem and compares their performance with various machine learning
algorithms on three different financial datasets. Experimental results show great performance of
the proposed deep learning methods against traditional machine learning models and imply that the
proposed approaches can be implemented effectively for real-world credit card fraud detection.

Keywords: deep learning, machine learning, credit card frauds, fraud detection, cyber security,
CNN, LSTM

I. INTRODUCTION
We live in a world today where, with the power of a single touch, we can achieve massive results.
We can book rides, talk to personal virtual assistants, get recommendations, navigate maps and
order food to our homes. All of this is possible only because of higher computing capabilities and
shared IT infrastructure. Because of this, a large volume of data is created, and data generation
is expected to reach 44 zettabytes (44 trillion gigabytes) in 2020, up from 4.4 zettabytes in
2013 [1]. The rise of artificial intelligence (AI) and machine learning (ML) in recent years is a
consequence of this upsurge of data. Today we rely on countless implementations of ML in our
everyday life without even realizing it. One such implementation is credit card fraud detection
systems using ML techniques that make our payment methods more robust.

Credit card fraud is an extension of theft and fraud carried out by obtaining credit card details
or using counterfeit cards for illegal monetary transactions. Credit card fraud can be physical,
where the fraudster presents a credit card physically to the merchant, or virtual, where the
transaction is made over the internet. Fraudsters use various techniques for this purpose, such as
site cloning, where a duplicate copy of a merchant's website is created to obtain credit card
information, or cloning methods, where a magnetic stripe reader obtains information on the credit
card to make a fake copy of it [2].

ML is a branch of AI in which a computer (machine) is able to make predictions based on findings
from previous data trends. Since the astonishing success of Google DeepMind in 2015, AI and ML have
expanded to new horizons. Some practical implementations using deep anomaly detection are computer
network intrusion detection [3]; banking, insurance, mobile cellular network, and health care fraud
detection; medical and malware anomaly detection; and anomaly detection for video surveillance [4].
Location tracking, Android malware detection, home automation and predicting the occurrence of
heart disease are some applications of ML in the Internet of Things domain [5]. In this paper, we
study in depth the practical implementation of ML, especially deep learning methods, in detecting
credit card frauds in the financial sector.

A. Credit Card Frauds
In recent years, we have become more dependent on mobile phones and web applications, which has
caused an increase in the number of online payment transactions. Card frauds have resulted in
millions in revenue loss to financial institutions globally and added stress to credit card users.
Approximately 20.48 billion cards (including credit, debit and all prepaid cards) were in
circulation worldwide in 2017 [6].

Fig. 1. Card Fraud Worldwide [7]
Fig. 1 summarizes gross card fraud and cents lost per 100 USD globally, with projections to
2027 [7]. Credit card fraud reached almost 30 billion USD in 2019 and is projected to increase
each year, while cents lost per 100 USD is expected to decrease. The Australian Payments Network,
in its annual report, shows an increase in credit card use from previous years and a total of 574
million AUD lost to fraud in Australia. An increase to 10% among all credit card frauds was
observed, whereas Card Not Present (CNP) fraud accounted for 84.9% of all frauds [8]. Some fraud
limitation steps taken by the Australian government include the CNP Fraud Mitigation framework, in
which standards are defined for issuers and merchants; the Australian Payment Council's partnership
with the Joint Cyber-Security Centre to acquire actionable information on the matter; regulation of
EMV chips; avoiding refunds to alternative cards; and the implementation of fraud detection systems
by financial institutions.

B. Fraud Detection using Machine Learning
ML can be implemented for credit card fraud detection, where it becomes possible to classify an
incoming transaction as legitimate or fraudulent based on the pattern of previous transactions [9].
ML, nature-inspired learning, and combinations of both that form more robust hybrid models are
used for card fraud detection [2]. The increased volume of card transaction data can be used to
find outliers in the data with methods such as auto-encoders, long short-term memory networks, and
convolutional neural networks [4]. Decision tree classifiers and ANNs are also used for fraud
detection, and their performance has been compared with rule-based models [10]. Decision tree
classifiers are straightforward, but their performance on complex data is quite low; ANNs perform
better with large datasets but require heavy processing power. Rule-based methods are easy to
implement, but they are not good at classifying new types of fraud.

The use of Predictive Analytic Technologies (PAT) to detect credit card frauds was advocated in
[11]. PAT uses ML and statistical models to make future predictions. The five phases of the
predictive analytics process are: outlining the business problem, acquiring and preparing data,
analyzing data and formulating a model, deploying the predictive model, and testing model
performance. The common red flag schemes used by PAT to make predictions are an uncommon purchase
made by the cardholder, sudden identical purchases on the same credit card, purchases with
overnight shipping, purchases with international shipment, multiple card shipments to a single
address, multiple transactions on a card in a short time, the geolocation of a transaction compared
with the cardholder's registered location, and the use of a single IP address for multiple credit
cards. Most vendors rely on ANNs for predictions, but this method is restrained by a high number of
false positives. Other challenges associated with PAT are model limitations such as implementation
cost and complexity, limited training and learning competence and the inability to adapt to fraud
tactics; misclassification cost caused by emphasizing large-amount transactions and ignoring lower
amounts cumulatively; reluctance to share fraud data; and compliance with law and regulation.

A method to detect fraudulent transactions with the least number of false predictions was
introduced in [12]. The methodology is a combination of a Hidden Markov Model (HMM), a
behaviour-based algorithm and a genetic algorithm. The HMM is used to generate transaction logs for
a user, and then the behaviour-based algorithm classifies each transaction into lower, medium and
higher profiles by clustering the data and detects fraud by matching the transactions with the
user's spending history. The genetic algorithm is then used to calculate thresholds, and finally a
transaction is flagged as fraudulent if the average value of all three techniques is greater than
40%. Likewise, an experimental study for detecting credit card fraud using ML methods was provided
in [13]. Out of eight classification algorithms tested, C5.0 (a decision tree algorithm), SVM, and
ANN yielded promising results on a labelled dataset. To evaluate performance, a combination of
accuracy, recall and area under the precision-recall curve (AUC) is used. Accuracy alone is not
used for evaluation because of the data imbalance in credit card frauds. Recall is the ratio of
correctly identified fraud transactions over the actual number of fraud transactions. This ensures
the robustness of the system. The shortlisted algorithms are then tested with imbalanced
classification techniques: random over-sampling, one-class classification, and cost-sensitive
models.

A detailed mechanism using K-means clustering and the genetic algorithm to create new data samples
for minority clusters, in order to create a balanced dataset and improve classification
performance, was introduced in [14]. In that method, unsupervised learning is used to make clusters
of similar data points. Then the genetic algorithm, which is inspired by natural selection and
genetics, produces new samples for the minority classes. That method helps generate more balanced
training sets for card fraud detection and reduce classification error. In contrast, a study of
ensemble learning to detect credit card frauds was reported in [15]. Ensemble learning is the
approach of combining several ML classifiers to increase prediction performance. ANNs and random
forests correctly identify fraud and non-fraud cases. Misclassifying either normal or fraudulent
transactions is associated with high financial cost. To reduce the number of misclassified
instances, a combination of 3 feed-forward NNs with different hyperparameters and 2 random forest
classifiers with 300 and 400 decision trees is used. The output is then calculated by taking the
majority result of the 5 models.

A comparative study on a simple NN, a multilayer perceptron (MLP) and a Convolutional Neural
Network (CNN) was presented in [16]. The data used for this study is self-generated, with 60000
transactions and 12 features. The features selected in generating this data are common attributes
acquired from financial institution databases and usual predictors identified for predictive
modeling. The dataset is highly imbalanced and is then balanced using under-sampling. The learning
rate is set to 0.001 and the activation function used in this study is 'softmax'. The results
showed that the MLP performed best with the highest accuracy of 87.88%, followed by the CNN with an
accuracy of 82.86%. Alternatively, results of a real-time deep learning model using auto-encoders
for credit card fraud detection were reported in [17]. The performance metrics selected for model
evaluation are the confusion matrix, precision, recall, and accuracy. The non-linear auto
regression predicted the most fraudulent transactions; however, it also misclassified most of the
legitimate transactions. Logistic regression misclassified legitimate transactions the least but
with low
prediction accuracy for fraud cases. Under these circumstances, the deep NN auto-encoder provides
the most stable results, with a relatively higher prediction rate and lower misclassification
error. Likewise, CNNs were applied to detect credit card frauds in [18] because of their ability to
reduce over-fitting and reveal hidden fraud patterns. That approach used feature engineering to
generate aggregated features from the transaction data and introduced a novel feature called
trading entropy. Synthetic fraud samples are created from real fraudulent data using cost-based
sampling to balance the dataset. The sampled dataset is then transformed into a feature matrix
based on different time windows. The CNN, similar to LeNet, has 6 layers and takes the feature
matrix as input. The evaluation parameter selected is the F1 score. The performance of different
classifiers increased when using the trading entropy feature and, in comparison to NN, SVM, and RF,
the CNN produced a much better performance for all the different sample sets tested. Similarly,
credit card fraud detection was investigated in [19], which looked beyond individual transactions
and advocated using time in a sequence of same-card transactions to capture the changing nature of
fraud. In this regard, adding statistical features obtained from real features can help improve
classification performance. One example is transaction velocity, which at a certain point in time
counts the number of transactions carried out within a time frame. By adding time series components
to the data, the authors compared the performance of SVM and LSTM models. The metrics set for the
performance evaluation were AUC and mean squared error. The experimental results showed that LSTM
performed far better than SVM in terms of the evaluation metrics and also classification rate
(transactions/sec).

A comparison between CNN, Stacked LSTM (SLSTM) and a hybrid model combining CNN and LSTM
(CNN-LSTM) was presented in [20] for credit card fraud detection. CNN is powerful in learning from
short-term sequences in the data, while LSTM is good at capturing long-term sequences. The dataset
used is from an Indonesian bank, and the majority class of non-fraud values is under-sampled in 4
different ratios to create 4 datasets for testing. This study represents features with respect to
time, and PCA is used for dimensionality reduction. The results revealed that increasing the ratio
between non-fraud and fraud values increased the accuracy of the classifier. For training accuracy,
SLSTM was on top, CNN-LSTM stood second and CNN was last in all 4 datasets. However, due to the
imbalanced nature of the datasets, accuracy is not the only measure for performance validation, and
the AUC values reveal that CNN performed best, followed by CNN-LSTM and then SLSTM, which
highlights that the patterns of fraud transactions are dominated by short-term relations over
long-term ones.

III. PROPOSED DEEP LEARNING-BASED CLASSIFIERS
ML methods rely solely on historical data, but because of the sensitive nature of financial data
and the need to protect users' privacy, publicly available datasets are not very common, which
limits the scope of study in this area. Because of this, every study in this area is limited to
just one dataset, if any is used. In ML, the performance of a model can differ widely for different
datasets (business cases). How performance varies across three datasets with a varying number of
features and transactions is one research question in this study. Furthermore, credit card fraud
detection methods face the problem of class imbalance, as the number of fraud cases compared to a
hundred thousand normal transactions is very small. Aptly addressing class imbalance is a major
challenge, and how the behavior changes when various sampling methods are applied to deal with
class imbalance is another aim of this study.

Traditional ML algorithms such as Support Vector Machines (SVM), Decision Tree (DT) and Logistic
Regression (LR) have been extensively proposed for credit card fraud detection. These traditional
algorithms are not very suitable for large datasets. The use of deep learning methods is still very
limited, and methods such as CNN and LSTM are encouraged for image classification and Natural
Language Processing (NLP) respectively because of their ability to handle massive datasets. How
these deep learning methods perform for credit card fraud classification is the major focus of this
study. In addition, data pre-processing is an important stage in the ML process. How the
classification performance is affected by data pre-processing in credit card fraud detection is
another question that needs to be answered.

A. One-Dimensional CNN (1DCNN)
CNN is a deep learning method heavily associated with spatial data such as image processing data.
Similar to ANN, CNN has the same hidden layer structure, in addition to special convolution layers
with a different number of channels in each layer. The word convolution is linked with the idea of
moving filters that capture the key information from the data. CNN is widely used in image
processing as it automatically performs feature reduction, which makes it less prone to
overfitting, and thus training a CNN does not require heavy data pre-processing. The role of CNN in
image processing is to minimize the processing by reducing the image without losing the key
features needed to make predictions [21]. The key terms in CNN are feature maps, channels, pooling,
stride, and padding.

In comparison to the popular multi-layer perceptron (MLP) network, CNNs are not fully connected
from layer to layer, and unlike an MLP, which has different weights associated with each node, a
CNN has a constant weight parameter for each filter. These two features reduce the number of
parameters in a CNN model. Also, the pooling method improves the feature detection process, making
it more robust to size and position changes of an element in an image.

CNN models are conventionally used for image and video processing, which has two-dimensional data
as input, and are therefore named 2DCNN. The feature mapping process is used to learn the internal
representation of the input data, and the same procedure can be used for one-dimensional data as
well, where the location of features is not relevant. A very popular example of a 1DCNN application
is Natural Language Processing, which is a sequence classification problem. In 1DCNN, the kernel
filter moves top to bottom along the sequence of a data sample instead of moving left to right and
top to bottom as in 2DCNN.

TABLE I. 2DCNN STRUCTURE
Layer 1  Input
    Input Shape            (Row sample, 5, 6, 1)
Layer 2  CONV2D
    Number of channels     64
    Kernel Size            3x3
    Activation Function    ReLU
Layer 3  CONV2D
    Number of channels     32
    Kernel Size            3x3
    Activation Function    ReLU
Layer 4  Flatten
    Number of Nodes        64
Layer 5  Output
    Number of Nodes        1
    Activation Function    Sigmoid

[...] the model on previous inputs, it still suffers from short-term memory, and the cell state
overcomes that by remembering key information starting from the earliest examples in the sequence.
To understand the complete flow of an LSTM cell, consider Fig. 2 below, where each dotted box
represents a single step [22].
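The role of the cell state in carrying early information forward can be illustrated with a small sketch of one LSTM step. This uses the standard textbook LSTM update on scalar values, not necessarily the exact flow drawn in the paper's Fig. 2; the function name `lstm_step`, the gate labels and all weights are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function used by the three LSTM gates."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. p maps a gate name to (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = p[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # candidate value to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g            # updated cell state (long-term memory)
    h = o * math.tanh(c)              # updated hidden state (short-term output)
    return h, c

# Hypothetical weights: the forget gate is almost fully open and the input
# gate almost closed, so the cell state should survive the whole sequence.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}

h, c = 0.0, 1.0                       # c = 1.0 stands for early "key information"
for x in [0.5, -0.3, 0.8, 0.1]:
    h, c = lstm_step(x, h, c, params)
print("cell state after 4 steps:", c)
```

With these settings the cell state stays very close to its initial value regardless of the later inputs, which is the behaviour the text attributes to the cell state. In a trained network the gate weights are learned rather than fixed by hand.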
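The top-to-bottom kernel movement described for the 1DCNN can be sketched in plain Python. This is a minimal illustration of a single one-dimensional filter followed by ReLU, not the architecture used in the study; the helper names `conv1d` and `relu`, the feature values and the kernel are all hypothetical.

```python
def conv1d(sequence, kernel, stride=1):
    """Slide a kernel along a 1D sequence (no padding) and return the feature map."""
    k = len(kernel)
    feature_map = []
    for start in range(0, len(sequence) - k + 1, stride):
        window = sequence[start:start + k]
        # The same kernel weights are reused at every position (weight sharing).
        feature_map.append(sum(w * v for w, v in zip(kernel, window)))
    return feature_map

def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

# Hypothetical feature vector of one transaction, read top to bottom.
features = [1.0, 2.0, 4.0, 7.0, 11.0]
kernel = [-1.0, 0.0, 1.0]             # responds to increases across three positions

feature_map = relu(conv1d(features, kernel))
print(feature_map)
```

Because the kernel has only three weights regardless of the input length, the parameter count stays small, which matches the weight-sharing argument made in the comparison with MLP.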
[...] selected by the 'sampling_strategy' parameter, which is the ratio of required majority class
over minority class.

B. Near Miss Sampling
Randomly selecting instances in RUS can remove key information from the dataset, and for this
reason Near Miss (NM) sampling uses distance to select the instances to sample. Near Miss has three
variants (versions 1, 2 and 3). After evaluating the performance of all three variants (results
provided in a section below), version 1 was selected in this study; it selects the majority-class
instances that have the smallest average distance to the three closest instances of the minority
class. In Python, the NearMiss class of the imbalanced-learn library is used to perform this
under-sampling.

C. Synthetic Minority Over Sampling Technique (SMOTE)
SMOTE is an over-sampling method that increases the number of instances in the minority class by
generating new synthetic samples. These new synthetic samples are generated by identifying the
nearest neighbors of a minority class sample and then generating a sample anywhere on the line
between the sample and its nearest neighbors. In this study, the SMOTE class of the
imbalanced-learn library is used to perform the over-sampling, and the sampling ratio is the number
of minority class instances after resampling over the number of majority class instances.

A. Data Pre-Processing
The first step in the experimentation is data pre-processing. In this step, all three datasets are
explored in detail by inspecting the data manually and applying statistical operations. The purpose
of data pre-processing is to provide a refined input to the classifiers to achieve the best
possible output. Missing values, categorical features, variable scale and high dimensionality can
all affect the performance of the classifier. The three pre-processing steps involved in this study
are data exploration, data scaling and test-train split.

Table IV provides key information on the data exploration process in this study. For ECD, all the
features were numeric, no missing value was found, and no feature was dropped to clean the data.
All the categorical features in SCD were changed to numeric, and the 'Transaction Date' feature was
dropped from the dataset as it consisted entirely of missing values. Both the ECD and SCD datasets
were used as a whole, while for TCD a smaller fraction of the actual dataset was used. For TCD, all
the values were numeric, and the 'custID' feature was dropped as it had all unique values and added
no information to the dataset.

TABLE IV. DATA EXPLORATION
ECD
    Number of Rows            284807
    Number of Columns         31
    Feature Type              Numeric
    Missing Values            None
    Dropped Features          None
    Categorical to Numeric    None
    Smaller Sample Used       No
SCD
    Number of Rows            3075
    Number of Columns         12
    Feature Type              Numeric + Categorical
    Missing Values            3075
    Dropped Features          'Transaction date'
    Categorical to Numeric    'Merchant_id', 'Is declined', 'isForeignTransaction',
                              'isHighRiskCountry', 'isFradulent'
    Smaller Sample Used       No
TCD
    Number of Columns         9
    Feature Type              Numeric
    Missing Values            None
    Dropped Features          'custID'
    Categorical to Numeric    None
    Smaller Sample Used       Yes

Next in data exploration, the correlation between the features is examined for each dataset.
Correlation is a statistical measure that helps to establish the dependency between variables. It
is a number between -1 and 1, where 0 means no relation at all, a negative correlation means the
variables move in opposite directions and a positive correlation means they move in the same
direction. Finding correlation can help eliminate features that behave similarly in the data,
reducing the dimensions of the data. Lower dimensional data helps improve training times and
classification performance. Fig. 3 shows the correlation between features in the SCD dataset. It
can be observed that there is no negative correlation between the features and there is no uniform
correlation throughout the SCD dataset. A higher average number of transactions per day is
associated with higher transaction amounts. Daily chargeback amount, six-month chargeback amount
and six-month chargeback frequency are also highly associated with each other. Finally, the
high-risk country feature is the most significant in determining the fraud class. As this dataset
is already very small, no dimensionality reduction is performed on the data.

Fig. 3. Correlation Heatmap of the SCD dataset
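The correlation values behind a heatmap such as Fig. 3 can be computed as Pearson coefficients. The sketch below is a minimal stdlib illustration, assuming Pearson correlation is the measure used; the function name `pearson`, the column names and all values are made up for the example and are not taken from the SCD dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns, invented for illustration only.
columns = {
    "daily_chargeback": [0.0, 10.0, 20.0, 30.0, 40.0],
    "six_month_chargeback": [5.0, 25.0, 45.0, 65.0, 85.0],  # rises with the first
    "spare_credit": [4.0, 3.0, 2.0, 1.0, 0.0],              # falls as the first rises
}

names = list(columns)
for a_idx, a in enumerate(names):
    for b in names[a_idx + 1:]:
        print(f"{a} vs {b}: {pearson(columns[a], columns[b]):+.3f}")
```

Pairs whose coefficient is close to +1 or -1 carry largely redundant information, so one feature of such a pair is a candidate for removal when dimensionality reduction is needed.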
B. Data Scaling and Standardization
Normally the features in a dataset are on different scales. For example, the features 'Amount' and
'V1' in the ECD dataset have means of 94826.6 and 0.000639 respectively. Deep learning algorithms
do not perform very well if the input features are not on a fairly similar scale. Scaling and
standardization methods bring the features to almost the same scale to make the input more
comprehensible for the classifier. In this study, the StandardScaler class from the
sklearn.preprocessing module in Python is used. The standard scaler transforms each feature in the
dataset such that the mean is 0 and the standard deviation is 1. After applying the standard scaler
to the dataset, the mean of the two features discussed above changed to 0.

C. Test, Train and Validation Split
All three datasets are further divided into Test, Training and Validation data. Test Data is a
small chunk of data obtained randomly from the dataset, which occupies 3.5% of each dataset.
Training Data is the 80% of the remaining data used for training [...] performance.

Fig. 4. Class Imbalance Comparison (Test Data ECD)
[Chart: Accuracy, Precision, Recall and F1 Score at class ratios 1, 0.1, 0.04, 0.02, 0.01, 0.005
and 0.00173]

It can be seen that initially the recall is at its maximum and all other metrics are at their
minimum; as the class imbalance is increased, recall begins to decrease and accuracy, precision and
F1 score start to increase. When there is no class imbalance at ratio 1, the model is predicting
all the fraud cases correctly but [...] fraud cases is much larger than fraud cases in Test data.
When [...] over total predictions. F1 score and precision rise with the increase in class
imbalance. This is because with an increase in class imbalance the model gets more training
instances and is able to generalize well.

Fig. 5 shows the performance comparison between the three datasets. The performance is calculated
using both Validation and Test data, and the performance measure selected here is the F1 score. It
can be observed that, in general, validation performance decreases as the dataset size increases.
SCD gives the best prediction performance on the examples that the system has already seen but
drops on new, unseen examples. TCD has the lowest validation and test performance amongst all
datasets, possibly due to the smaller number of features and little correlation between the
features. The performance on ECD is stable, with little variation between validation and test
performance. The missing values of 2DCNN for SCD and TCD are because of the smaller number of
features in those datasets, as it was not possible to create a feature matrix. The missing values
of RF (i.e. random forest) and SVM in TCD are due to the higher training times associated with
these classifiers.

Fig. 6. Sampling Method Comparison (Validation Data ECD)
[Chart: Accuracy, Precision, Recall and F1 Score for each sampling method]

The comparison of the sampling methods discussed earlier, tested on Validation Data, is presented
in Fig. 6. It can be observed that overall performance increases with sampling. The SMOTE method
provides the best results, whereas Near Miss performs slightly better than the Random Under
Sampler. Finding the best sampling method leads to the next experiment.

Fig. 7. SMOTE vs Normal Distribution (Test Data ECD)
[Chart: Accuracy, Precision, Recall and F1 Score for SMOTE and No Sampling]
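The four metrics reported in the figures can be computed from confusion-matrix counts, and a short sketch shows why accuracy alone is misleading on imbalanced fraud data. The formulas are the standard definitions, with recall matching the one given earlier in the text; the function name `classification_metrics` and all counts are hypothetical and do not come from the paper's experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts (fraud = positive)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical test set: 10,000 transactions, of which 20 are fraudulent.
# A model that labels everything as legitimate:
always_legit = classification_metrics(tp=0, fp=0, tn=9980, fn=20)
# A model that catches 16 of the 20 frauds with 4 false alarms:
detector = classification_metrics(tp=16, fp=4, tn=9976, fn=4)

for name, metrics in [("always legitimate", always_legit), ("detector", detector)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

The trivial model reaches an accuracy above 99% while detecting no fraud at all, whereas recall and F1 separate the two models clearly. This is the reason F1 score is preferred over accuracy when comparing classifiers on these datasets.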
