Predicting Credit Risk For Unsecured Lending

Predicting Credit Risk for Unsecured Lending: A Machine Learning Approach
A Preprint
K.S. Naik
NMIMS MPSTME, Mumbai, India
arXiv:2110.02206v1 [q-fin.RM] 5 Oct 2021
Guided by Professor Dr. Lakshmi Gorty, NMIMS MPSTME, Mumbai, India
October 6, 2021
Abstract
Since the 1990s, there have been significant advances in the technology space and the e-Commerce
area, leading to an exponential increase in demand for cashless payment solutions. This has led to
increased demand for credit cards, bringing along with it the possibility of higher credit defaults and
hence higher delinquency rates, over a period of time. The purpose of this research paper is to build a
contemporary credit scoring model to forecast credit defaults for unsecured lending (credit cards), by
employing machine learning techniques. Much of the customer payments data available to lenders
for forecasting credit defaults is imbalanced (skewed), because default instances form only a small
subset of the records, and this poses a challenge for predictive modelling. In this research, this challenge is addressed
by deploying the Synthetic Minority Oversampling Technique (SMOTE), an established technique for correcting
such imbalances in a given dataset. On running the research dataset through seven different
machine learning models, the results indicate that the Light Gradient Boosting Machine (LGBM)
Classifier model outperforms the other six classification techniques. Thus, our research indicates that
the LGBM classifier model is better equipped to deliver higher learning speeds, better efficiencies
and manage larger data volumes. We expect that deployment of this model will enable better and
timely prediction of credit defaults for decision-makers in commercial lending institutions and banks.
Keywords Machine Learning, Credit Cards, Unsecured Lending, Credit Risk, Loans, Payment Services
1 Introduction
Consume now, pay later.
Credit cards are one of the most popular modes of payment for electronic transactions and make online transactions
comfortable and convenient. However, as the number of credit card users has expanded rapidly over the
years, banks have needed a way to determine credit risk, and they do so based on an individual’s credit history. After the technology boom in the
mid-1990s, companies adopted the same technique to determine credit risk and prevent defaults,
namely using credit history data.
Credit risk is defined as the risk of financial loss when a borrower fails to pay the lender within a given period of time.
With the rapid development of the credit cards industry, there has been a rise in credit card delinquency rates, which
imposes a financial risk for the lending institutions. Credit risk is the oldest form of risk in the financial markets and has
shown exponential growth in the 1990s against the backdrop of dramatic economic and technological change.
In the past few years, the number of defaults has risen significantly and has cost commercial banks millions of dollars.
Therefore, it has become critical that banks and lending financial institutions use robust mechanisms to forecast
probabilities of credit defaults before lending. With the rapid increase in customers, credit risk
must often be analyzed for customers who have limited or no credit history.
Moreover, the credit card usage database is by and large imbalanced, since the majority of customers pay their dues on
time, barring a certain percentage of customers that default on payments. Machine learning algorithms have
proven their ability to estimate delinquency rates accurately.
There has been immense development in the area of machine learning (ML) since the early 2000s. With greater access
to customer data and increased computing power, credit scoring agencies are now in a better position to enable banks,
by providing extensive credit analysis of their customers and prospects. Researchers are trying to determine better and
efficient methods for credit risk evaluation. Financial institutions are in the process of exploring and deploying machine
learning techniques that can help better decision making and enable mass customisation of product offerings.
In this paper, the focus will be on evaluating and comparing popular machine learning classification models, namely
Logistic Regression, Support Vector Machines, K-Nearest Neighbors, Decision Tree, Random Forest, XGBoost and LGBM classifiers, to
recognize patterns in the customer data (with a high degree of accuracy) for credit risk evaluation. Using publicly
available datasets, these models have been evaluated on how well they identify credit defaults,
with the aim of predicting delinquency efficiently. The same models can be deployed for automated
processing of new credit card applications.
2 Industry overview
A credit card is typically issued by a commercial bank or a financial institution. It allows customers to borrow funds
within a stipulated credit limit, for a given period of time, to pay for goods and services, at various points of sale, on
credit in lieu of cash. These credit charges accrue in the customer’s account as a balance, which must be settled on a
periodic billing cycle basis, enabling customers to better manage their cash flows. Technological development
and the rise of e-Commerce have created strong demand for payment solutions as cash alternatives. Availability of
affordable credit has given a fillip to the growth of the global credit card industry.
With increased deployment of unsecured credit through credit cards, delinquencies and personal bankruptcy rates also
increased during the mid-1990s. As measured by the Federal Reserve Bank of New York, outstanding card balances
stayed relatively flat in the years after an all-time peak in the fourth quarter of 2009 (during the financial crisis) but
began to rise as the economy slowly rebounded at the beginning of 2014.
Figure 1: Delinquency Rate on Credit card loans 1
Delinquency rates have dropped in the past year due to the pandemic and are at 1.58% as of the second quarter of 2021.
Managing credit risk through prediction of credit defaults still continues to be a top priority for lenders in the unsecured
lending market, in order to manage profitability and remain competitive.
2.1 Research Problem
With increased usage of credit cards, delinquency rates are on the rise. Market research attributes this rise in delinquencies
to factors such as overspending, stagnant wages, increased lifestyle costs and poor financial planning, amongst others.
To mitigate and limit the risk of credit defaults for unsecured lending through credit cards, a powerful mechanism is
required to determine the customers’ creditworthiness.
1 Source: Board of Governors of the Federal Reserve System (US)
3 Research Methodology
To study the probability of credit default risk, extensive analysis will be performed on two datasets: the customers’ personal
details (Table 1) and their credit history (Table 2). Using this data, the machine learning
classification models will evaluate the likelihood of credit risk defaults. The datasets are defined below.
3.1 Research Variables
The application set contains the customers’ personal details, each customer being characterized by 18 labeled variables. The
credit history set contains the customers’ credit history, comprising 1,048,575 rows and 3 variables that capture the
status of each customer’s monthly dues. Detailed descriptions of both datasets are shown in Table 1 and
Table 2.
Table 1: Customer Details
Feature Name           Explanation
ID                     Customer ID
CODE_GENDER            Gender of customer
FLAG_OWN_CAR           Car ownership
FLAG_OWN_REALTY        Property ownership
CNT_CHILDREN           Number of children
AMT_INCOME_TOTAL       Annual income
NAME_INCOME_TYPE       Income category (Working/Pensioner)
NAME_EDUCATION_TYPE    Education Level (Higher education/Secondary)
NAME_FAMILY_STATUS     Marital status
NAME_HOUSING_TYPE      Type of House (Rented/With parents)
DAYS_BIRTH             Birthday
DAYS_EMPLOYED          Duration of employment
FLAG_MOBIL             Mobile ownership
FLAG_WORK_PHONE        Work phone ownership
FLAG_PHONE             Phone ownership
FLAG_EMAIL             Email
OCCUPATION_TYPE        Type of Occupation
CNT_FAM_MEMBERS        Family Size
Table 2: Credit Database Details
Feature Name           Explanation
ID                     Customer ID
MONTHS_BALANCE         Monthly Balance
STATUS                 Status of Monthly Payment
A credit card balance is the total amount of money that the customer owes to the lending institution. In Table 2, Monthly
Balance identifies the month to which each record refers, counted backwards from the current month (0 = current month, -1 = previous month, -2 = two
months earlier, -3 = three months earlier and so on), while Status records how long the payment for that month is overdue.
Monthly Balance:
• 0: Current Month
• -1: Previous month and so on
Status of Monthly Payment:
• 0: 1-29 days past due
• 1: 30-59 days past due
• 2: 60-89 days overdue
• 3: 90-119 days overdue
• 4: 120-149 days overdue
• 5: overdue or bad debts, write-offs for more than 150 days
• C: paid off that month
• X: No loan for the month
3.2 Data Modification
In this dataset, there exist 7 categorical variables which can be further classified into smaller groups. As these features
are critical to the machine learning models, they need to be converted into numerical values. This is done using Label
Encoder.
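The encoding step can be illustrated with a minimal sketch that maps each distinct category to an integer in sorted order, which mirrors what a label encoder does. The function names and the sample column values below are hypothetical and are used only for illustration; they are not rows from the research dataset.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def transform(values, mapping):
    """Replace every category with its integer code."""
    return [mapping[v] for v in values]


# Hypothetical values for the NAME_INCOME_TYPE column (assumption, not real data).
income_type = ["Working", "Pensioner", "Working", "State servant"]
mapping = fit_label_encoder(income_type)
encoded = transform(income_type, mapping)
print(mapping)
print(encoded)
```

Note that the integer codes impose an arbitrary ordering on the categories, which tree-based models tolerate better than linear ones.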
3.2.1 Processing imbalanced data

Out of a large customer base, only a small percentage of customers default on their payments, resulting in an imbalanced
(skewed) dataset available for predictive modelling. As the original dataset is imbalanced, an oversampling technique is
employed. Usually, oversampling is preferred over an undersampling technique with a view to prevent loss of critical
data. Thus, to deal with imbalanced datasets, an algorithm is used to generate synthetic data. SMOTE (Synthetic
Minority Oversampling Technique), an oversampling technique that generates synthetic samples for a minority category
(such as defaulting customers), is employed on the dataset. This technique is based on the K-Nearest Neighbors
algorithm. The models are then fitted using the balanced training dataset.
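The core idea of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The sketch below is a simplified illustration under stated assumptions (numeric features only, a hypothetical two-feature minority class, helper name `smote` chosen here); it is not the implementation used in this research.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority class (defaulting customers) with two numeric features.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region already occupied by the minority class.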
Merging the application dataset and credit dataset, a new dataset with 17 features is created. After the completion of
feature engineering, 3 features are extracted and the merged dataset is split into two subsets, 70% for the training
dataset and 30% for test purposes. The training dataset is then checked to confirm that it is balanced, in order to avoid skewed
outcomes.
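A 70/30 split followed by a class-balance check can be sketched as follows. The helper `train_test_split` defined here is a hypothetical stand-in written for illustration, and the rows are made-up (features, label) pairs, with label 1 marking a default.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and split them into training and test subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (features, label) rows; label 1 marks a default.
rows = [((i,), i % 2) for i in range(10)]
train, test = train_test_split(rows)
print(len(train), len(test))
# Count the labels in the training subset to check how balanced it is.
print(Counter(label for _, label in train))
```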
The individual variables aid in understanding the credit viability of a customer. The customer data provides a plethora
of demographic information that is critical in instances wherein the customer has no previous credit record. Customer’s
demographic data along with financial information helps lenders profile their new customers and drive acquisition.
4 Research Design
4.1 Design overview
In this paper, credit card default prediction models will be created to estimate the expected financial loss that a lending
institution may suffer, if borrowers default on paying back their credit consumption. Classification models are a key
predictive modeling tool for this purpose. A classification model predicts the class to which a given data
point belongs. A total of 7 machine learning classification models are used to determine credit defaults.
To assess the performance of these machine learning classification models, we create a classification report for each
model. This report displays the model’s precision, recall and F1 score. It provides a better understanding of the overall
performance of the trained model. To help get a better understanding, the above metrics are defined below:
• Precision is the fraction of the cases flagged as positive that are truly positive. In the case of this paper, it can be
translated into how reliable the model is when it predicts a customer default. For example, a precision of 100% means
that every customer flagged as a defaulter is in fact a defaulter, with no false alarms.
\[ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} \tag{1} \]
• Recall is closely related to precision. It is a measure of how reliably a classification model
can identify all truly positive samples. For instance, a recall of 50% would indicate that half the defaulting
customers have been found, while the other half were missed by the model. Ideally, a good classification
model should maximize both precision and recall, but in reality, there is often a trade-off that one has to make
while training the model.
\[ \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} \tag{2} \]
• F1 score is defined as the harmonic mean of precision and recall. The F1 score enables model
comparison and thereby helps the decision making process. A perfect model would have an F1 score of 100%,
which corresponds to 100% precision and 100% recall.
\[ \text{F1 Score} = \frac{2 \cdot \text{Precision (P)} \cdot \text{Recall (R)}}{\text{Precision (P)} + \text{Recall (R)}} \tag{3} \]
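As a worked example of the three metrics defined above, the sketch below computes them from confusion-matrix counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(tp, fp, fn):
    """Return precision, recall and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 40 defaulters caught, 10 false alarms, 40 defaulters missed.
precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=40)
print(precision, recall, f1)
```

Here precision is 40/50 and recall is 40/80, so the model is fairly trustworthy when it raises a flag but still misses half of the defaulters; the F1 score sits between the two and is pulled towards the lower value.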
To visualize the trade-off between detecting defaulters and raising false alarms, we plot two quantities against each other as the classification threshold varies, namely the
True Positive Rate (TPR), also known as probability of detection, and the False Positive Rate (FPR), also known as probability
of false alarm. The resultant curve is called the Receiver Operating Characteristic (ROC) curve. Performance
comparison of multiple machine learning models can be made by computing the area under the ROC curve for each
model. For a particular model, the closer the area under the curve is to 1, the better the performance of the
model. The best model will have a curve that passes close to the upper left corner of the plot, representing a very
high recall (few False Negatives) and very few False Positives.
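\[ \text{FPR} = \frac{\text{False Positives (FP)}}{\text{False Positives (FP)} + \text{True Negatives (TN)}} \tag{4} \]
\[ \text{TPR} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} \tag{5} \]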
An ideal ROC curve rises along the Y-axis to a TPR of 1 before running along the top of the plot, but that is rarely possible to achieve. As such an ideal
performance for a model cannot be achieved, the model whose curve is closest to it is chosen. The higher the area under the ROC
curve, the better the performance of a model.
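To make the construction of the ROC curve concrete, the sketch below sweeps a threshold over a set of predicted scores, records the (FPR, TPR) pair at each step, and integrates the curve with the trapezoidal rule. The labels and scores are hypothetical, and the function names are chosen here for illustration.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the threshold over
    every distinct score, from the highest to the lowest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical true labels (1 = default) and predicted default scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
area = auc(points)
print(points)
print(area)
```

In this example the area equals 8/9, which matches the fraction of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score.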
4.2 Model Description
4.2.1 Logistic Regression Classifier

Logistic Regression is heavily used in initial credit scoring studies. It is easy to implement, and has a well-established
history with credit card delinquency. However, it has limited power when dealing with non-linear data, which makes it
less suitable for complex default detection problems. A Logistic Regression model can also overfit the training data,
particularly when the number of features is large relative to the number of training observations.
\[ \log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + x\beta \tag{6} \]
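Equation (6) can be inverted to obtain the predicted default probability from the linear score. The sketch below does this for a single customer; the coefficients and feature values are hypothetical and do not come from the fitted model.

```python
import math


def predict_default_probability(x, beta0, beta):
    """Invert the log-odds equation: p(x) = 1 / (1 + exp(-(beta0 + x.beta)))."""
    log_odds = beta0 + sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical intercept, coefficients and feature vector.
p = predict_default_probability(x=[1.0, 2.0], beta0=-1.0, beta=[0.5, 0.25])
print(p)
```

With these values the log-odds are exactly zero, so the predicted probability is 0.5, which is the decision boundary of the classifier.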
4.2.2 Support Vector Machine Classifier

Support Vector Machine (SVM) is, in its basic form, a linear model for classification. It creates a line or a hyper-plane which separates
the data into classes. Whereas logistic regression maximizes the likelihood of the observed class labels,
SVM maximizes the separation (margin) between the classes around the hyper-plane, which in turn improves the classification
accuracy (i.e., reduces the generalization error). In this model, two hyper-parameters are used, namely:
• Gamma: Determines the curvature of the decision boundary. A low value means that each training point influences a wide
region, so a lot of points are grouped together and the boundary is smoother; a high value has the opposite effect.
• C: Controls the trade-off between a wide margin and classifying training points correctly. A greater value of C
penalizes misclassified training points more heavily, so more training points are classified correctly at the cost of a narrower margin.
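The effect of Gamma can be illustrated with the radial basis function (RBF) kernel, k(x, z) = exp(-gamma * ||x - z||^2). That the RBF kernel is the one used here is an assumption, since the text does not name the kernel; the two points below are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma):
    """Similarity between two points under the RBF kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Two hypothetical points at squared distance 2 from each other.
x, z = (0.0, 0.0), (1.0, 1.0)
low_gamma_similarity = rbf_kernel(x, z, gamma=0.1)
high_gamma_similarity = rbf_kernel(x, z, gamma=10.0)
print(low_gamma_similarity, high_gamma_similarity)
```

With a low Gamma the two points still look similar to the model, so they tend to be grouped together; with a high Gamma their similarity is almost zero and each point shapes only its immediate neighbourhood.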
4.2.3 K-Neighbors Classifier

K-Nearest Neighbors (KNN) is a non-parametric classification model, and is also known as a lazy algorithm, which
means that no explicit model is built during training; instead, the entire training dataset is stored and consulted at prediction time. The testing phase thus requires greater time consumption,
more memory and higher cost.
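A minimal sketch of the KNN idea follows: nothing is learned up front, and each prediction scans the stored training set for the k closest points and takes a majority vote. The training rows and the query points are hypothetical.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training rows closest to query.
    train is a list of (features, label) pairs."""
    by_distance = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training rows: label 1 marks a defaulting customer.
train = [((1.0, 1.0), 0), ((1.2, 0.8), 0), ((0.9, 1.1), 0),
         ((5.0, 5.0), 1), ((5.2, 4.9), 1), ((4.8, 5.1), 1)]
prediction_near_defaulters = knn_predict(train, (4.9, 5.0))
prediction_near_good = knn_predict(train, (1.0, 0.9))
print(prediction_near_defaulters, prediction_near_good)
```

The full sort over the training set at every call is what makes the testing phase expensive in time and memory.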
4.2.4 Decision Tree Classifier
A Decision Tree is a supervised machine learning model with a binary tree structure. Beginning with the training data
on a single root node, the data is split into two child nodes. Each split is made by answering an ‘if-else’ question about one feature.
This process repeats until the tree reaches a leaf node that holds
the predicted class value. To prevent overfitting, a default value is chosen for the depth of the tree, and splitting stops
once the stopping criteria have been satisfied. As the number of features in the data grows, the model grows more complex. To reduce
overfitting further, a Random Forest classification model is used.
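A single ‘if-else’ split can be sketched as a search for the threshold on one feature that gives the purest pair of child nodes. Gini impurity is used here as the purity measure, which is an assumption, since the text does not name the splitting criterion; the income values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 = pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must send rows to both children
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical annual incomes (in thousands) and default labels (1 = default).
incomes = [20, 35, 50, 80, 120, 150]
defaults = [1, 1, 1, 0, 0, 0]
threshold, impurity = best_threshold(incomes, defaults)
print(threshold, impurity)
```

A full tree repeats this search on each child node until a leaf is reached or the depth limit stops the growth.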
4.2.5 Random Forest Classifier
A Random Forest classification model is a supervised learning model consisting of many decision trees, each of which
generates a class prediction. These class predictions are then combined, by majority vote or by averaging the predicted
class probabilities, to give the final prediction of the classification model.
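One common way to combine the predictions of the individual trees is a majority vote, sketched below. The votes are hypothetical outputs from five trees for a single customer.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical class predictions from five trees (1 = default).
tree_predictions = [1, 0, 1, 1, 0]
final_prediction = majority_vote(tree_predictions)
print(final_prediction)
```

Because each tree is trained on a different random sample of rows and features, individual errors tend to cancel out in the vote, which is why the ensemble overfits less than a single deep tree.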
4.2.6 XGBoost Classifier
Extreme Gradient Boosting (XGBoost) is a supervised machine learning model and an optimization technique used for
classification and prediction. It is based on a combination of tree models with gradient boosting. As shown by Chen
and Guestrin (2016), XGBoost is faster than other popular tree boosting implementations.
4.2.7 Light Gradient Boosting Machine Classifier
Light Gradient Boosting Machine (LGBM) is called so on account of its high speed. LGBM is usually applied to large
datasets (typically containing more than 10,000 data values). It is advantageous in comparison to other classification
models due to lower memory utilization during execution, while still focusing on accuracy of results. As shown in Figure 2,
an LGBM tree grows vertically (leaf-wise). When growing the same leaf, the loss associated
with it is much lower compared to other boosting algorithms that use a horizontal (level-wise) approach.
Figure 2: Leaf-wise growth of LGBM tree
Since the credit dataset is not large enough for an LGBM classifier, the following hyper-parameters
can be tuned to improve the efficiency of this classifier:
• max_depth: Defines the maximum depth of the tree and is responsible for managing model overfitting. Deeper
trees are more prone to overfitting, so if overfitting is detected in the model, the max_depth value should be
reduced.
• learning_rate: Controls the impact of each tree on the final prediction. Usually chosen values: 0.001,
0.003.
• num_leaves: This hyper-parameter defines the maximum number of leaves in a tree. It controls the complexity of
the tree model. The value should be less than or equal to 2^max_depth; a larger value is likely to result in
overfitting. The default value is 31.
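The rule relating num_leaves to max_depth can be checked before training, as in the sketch below. The parameter names match those listed above; the specific values and the helper name `num_leaves_is_safe` are hypothetical and serve only as an illustration.

```python
def num_leaves_is_safe(params):
    """Check the rule of thumb num_leaves <= 2 ** max_depth."""
    return params["num_leaves"] <= 2 ** params["max_depth"]


# Hypothetical configuration using the three hyper-parameters discussed above.
params = {"max_depth": 5, "learning_rate": 0.003, "num_leaves": 31}
print(num_leaves_is_safe(params))

# Reducing max_depth without reducing num_leaves breaks the rule.
shallow = {"max_depth": 4, "learning_rate": 0.003, "num_leaves": 31}
print(num_leaves_is_safe(shallow))
```

A tree of depth 5 can hold at most 32 leaves, so the default of 31 leaves fits; at depth 4 the limit is 16, and num_leaves would need to be lowered accordingly.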
5 Research Findings: Model Performance Evaluation

5.1 Area under ROC curve
After experimenting with various classifiers, ROC curves have been created for each model. As mentioned earlier, the
ideal ROC curve will coincide with the Y axis. The area under the Receiver Operating Characteristic curve (AUROC/AUC)
is thus a useful measure of “discrimination”: it indicates the model’s ability to discriminate between
positive examples (cases) and negative examples (non-cases). In this case, it discriminates customers with a higher credit risk propensity
from the good ones. A limitation of the AUROC is that it can give an overly optimistic picture of performance on
datasets with a much larger quantity of negative examples than positive examples.
Figure 3: ROC curve
Higher AUC values indicate a better fit. The best fits are XGBoost and LGBM classifier in this case.
5.2 Accuracy score of models
LGBM has the highest accuracy score and is thus the most preferred model for credit scoring. Some advantages of this
model include:
• Compatibility with large datasets (the LGBM classifier is able to classify large datasets with lower memory
usage)
• Efficient usage of memory
• Higher accuracy and better performance than other models (leaf-wise vertical growth)
• Greater efficiency and training speed
6 Conclusion
To summarize, two datasets were introduced and described. Research variables were finalized after completion of
feature engineering. Seven contemporary machine learning models were compared to identify the most efficient and best
performing model. After giving an overview of the machine learning classification models, each model was compared
on the basis of two evaluation metrics:
• Accuracy Score
• AUC (Area under ROC)
Figure 4: Accuracy Score of machine learning classification models
Table 3: Accuracy Scoreboard
Models                    Accuracy Score
Logistic Regression       0.5578630
Support Vector Machine    0.6947107
K Nearest Neighbors       0.7676606
Decision Tree             0.8145190
Random Forest             0.8587149
XGBoost                   0.9195953
LGBM                      0.9552716
As observed in the ROC plot, the LGBM classifier performed better than the other classifiers and was closest
to the Y-axis, with an AUC of 0.99 and an accuracy of 95.53% (Table 3). Based on the test results, it was concluded that the
LGBM model is the most favourable classification model since it gives the highest accuracy in forecasting and best
performance in identification of credit card defaults. This model is also best suited for deployment on larger datasets.
Typically, the proposed model can be further optimized by fine tuning the following hyperparameters:
• max_depth values
• learning_rate
• num_leaves
In conclusion, machine learning is a powerful tool that can be employed by lending institutions to forecast and discover
patterns in customer data, bringing in a high degree of rigor. This model can be implemented for a better and more
favorable outcome to determine credit card defaults. Its computed predictions can be of great help to lenders in
determining borrowers’ credit repayment abilities. Therefore, it is an efficient technique to assess financial risks and make
appropriate financial decisions.
7 Acknowledgments
I would like to thank Professor Dr. Lakshmi Gorty for her invaluable guidance.
References
[1] Board of Governors of the Federal Reserve System (US), Delinquency Rate on Credit Card Loans,
All Commercial Banks [DRCCLACBS], retrieved from FRED, Federal Reserve Bank of St. Louis;
https://fred.stlouisfed.org/series/DRCCLACBS, August 30, 2021.
[2] Yang, S. and Zhang, H. (2018) Comparison of Several Data Mining Methods in Credit Card Default Prediction.
Intelligent Information Management, 10, 115-122. doi: 10.4236/iim.2018.105010.
[3] Dataset: https://www.kaggle.com/datasets
[4] Adewumi, A.O., Akinyelu, A.A. A survey of machine-learning and nature-inspired based credit card fraud
detection techniques. Int J Syst Assur Eng Manag 8, 937–953 (2017). https://doi.org/10.1007/s13198-016-0551-y
[5] Wang Bao, Ning Lianju, Kong Yue, Integration of unsupervised and supervised machine learning algorithms for
credit risk assessment, Expert Systems with Applications, Volume 128, 2019, Pages 301-315, ISSN 0957-4174,
https://doi.org/10.1016/j.eswa.2019.02.033.
[6] Ma, Y.H. (2020) Prediction of Default Probability of Credit-Card Bills. Open Journal of Business and Management,
8, 231-244. https://doi.org/10.4236/ojbm.2020.81014
[7] Bhatore, S., Mohan, L. & Reddy, Y.R. Machine learning techniques for credit risk evaluation: a systematic
literature review. J BANK FINANC TECHNOL 4, 111–138 (2020). https://doi.org/10.1007/s42786-020-00020-3
[8] Dalianis H. (2018) Evaluation Metrics and Evaluation. In: Clinical Text Mining. Springer, Cham.
https://doi.org/10.1007/978-3-319-78503-5_6
[9] Y. Yu, "The Application of Machine Learning Algorithms in Credit Card Default Prediction," 2020 International
Conference on Computing and Data Science (CDS), 2020, pp. 212-218, doi: 10.1109/CDS49703.2020.00050.
[10] Liu, R. (2018) Machine Learning Approaches to Predict Default of Credit Card Clients. Modern Economy, 9,
1828-1838. doi:10.4236/me.2018.911115.
[11] Shuangshuang Fan, Yanbo Shen, Shengnan Peng, "Improved ML-Based Technique for Credit Card Scoring
in Internet Financial Risk Control", Complexity, vol. 2020, Article ID 8706285, 14 pages, 2020.
https://doi.org/10.1155/2020/8706285
[12] Zhou Fan, Stanford University, Autumn, 2016, Statistics 200: Introduction to Statistical Inference.
https://web.stanford.edu/class/archive/stats/stats200/stats200.1172/Lecture26.pdf
[13] Ala’raj, M., Abbod, M.F. & Majdalawieh, M. Modelling customers credit card behaviour using bidirectional
LSTM neural networks. J Big Data 8, 69 (2021). https://doi.org/10.1186/s40537-021-00461-7
[14] Adnan Khashman, Neural networks for credit risk evaluation: Investigation of different neural models and learning
schemes, Expert Systems with Applications, Volume 37, Issue 9, 2010, Pages 6233-6239, ISSN 0957-4174,
https://doi.org/10.1016/j.eswa.2010.02.101.
[15] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). Association for
Computing Machinery, New York, NY, USA, 785–794. DOI:https://doi.org/10.1145/2939672.2939785
[16] Chow, Jacky. (2017). Analysis of Financial Credit Risk Using Machine Learning. 10.13140/RG.2.2.30242.53449.
[17] Trivedi, Naresh & Simaiya, Sarita & Kumar, Dr & Sharma, Sanjeev. (2020). An Efficient Credit Card Fraud Detection
Model Based on Machine Learning Methods. MATTER: International Journal of Science and Technology.
[18] "Detector Performance Analysis Using ROC Curves - MATLAB & Simulink Example". www.mathworks.com.
Retrieved 11 August 2016.
[19] LightGBM Documentation. https://lightgbm.readthedocs.io/
[20] Addo, P.M.; Guegan, D.; Hassani, B. Credit Risk Analysis Using Machine and Deep Learning Models. Risks
2018, 6, 38. https://doi.org/10.3390/risks6020038
[21] Goutte, Cyril & Gaussier, Eric. (2005). A Probabilistic Interpretation of Precision, Recall and F-Score, with
Implication for Evaluation. Lecture Notes in Computer Science. 3408. 345-359. 10.1007/978-3-540-31865-1_25.
