


Retail Credit Scoring for

Auto Finance Ltd.

Group-7
Section A

DAUR Assignment 3
Method:
- For ease of calculation, and to remove fields unlikely to contribute to the
analysis, we dropped several columns from the data frame
- We used logistic regression to build a model classifying defaulters
(only 2 classes: defaulters and non-defaulters)
- After building the model, we evaluated it with a validation-set approach: since
the assignment asks for the data to be split into two halves, we used an equal
number (14453) of observations for the training and test sets
- We then fitted a classification tree to the same data
- For both methods, we built a confusion matrix and reported the correct and
incorrect prediction rates
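
The 50/50 split described above can be sketched as follows (a minimal sketch;
the variable names are illustrative, and the complete code appears in the Code
section):

```r
# Minimal sketch of the 50/50 validation-set split described above.
# Names here are illustrative; the full analysis code is in the Code section.
set.seed(1)
n <- 28906                     # total number of observations in the data set
train <- sample(1:n, n / 2)    # 14453 randomly chosen training indices
test  <- setdiff(1:n, train)   # the remaining 14453 indices form the test half
length(train)                  # 14453
length(test)                   # 14453
```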

Analysis
Since monthly income in thousands (“MTHINCTH”) turned out not to be significant in the
model, we removed this field and ran the logistic regression again without it.

Odds Ratio, ROC, Confusion Matrix, Hosmer-Lemeshow Test


Output

[Model output omitted]

Test and training set

Output

[ROC plot omitted]

Fraction of observations for which the prediction was correct: 0.573

Misclassification rate: 0.43
Classification Tree

Output

[Fitted tree plot omitted]

Fraction of observations for which the prediction was correct: 0.73

Misclassification rate: 0.27
Conclusion

1. “MTHINCTH” does not have a significant impact on credit defaulting

2. On the basis of the confusion-matrix results (the classification and
misclassification rates), the fraction of correct classifications increased with
the tree model, so a classification tree is a better approach than logistic
regression in this case

3. The analysis can be further extended by using 3 classes of defaulters
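
To make point 2 concrete, both rates come straight from a 2x2 confusion matrix.
A minimal sketch with made-up counts (illustrative only, not the numbers from
our run):

```r
# Illustrative only: reading the classification and misclassification rates
# off a 2x2 confusion matrix. The counts below are made up, not from our run.
cm <- matrix(c(9000, 1500, 2400, 1553), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))
accuracy <- sum(diag(cm)) / sum(cm)   # fraction correctly classified
misclass <- 1 - accuracy              # misclassification rate
```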
Code
# Load the R library "ISLR"

library(ISLR)

library(tree)

cr<-read.csv("Downloads/DAUR/Retail credit.csv",header=T)

# Attach the data frame to the search path (detach() before any attach() would error)
attach(cr)

names(cr)

dim(cr)

cr <- subset( cr, select = -c(1:3,21) )

names(cr)

dim(cr)

mod_1=glm(DefaulterFlag~.-DefaulterType, data=cr, family=binomial)

summary(mod_1)

mod_2=glm(DefaulterFlag~. -MTHINCTH-DefaulterType, data=cr,family=binomial)

summary(mod_2)

require(MASS)

exp(cbind(Odds_Ratio=coef(mod_2), confint(mod_2)))

# Using the "predict()" function to obtain probabilities of the form P(Y=1|X)
# "type=response" returns P(Y=1|X) rather than other information such as the logit

mod_2.probs=predict(mod_2,type="response")

# ROC plot and optimal cut-off point

library(pROC)

R=roc(DefaulterFlag,mod_2.probs)

plot(R,col="blue",legacy.axes = TRUE)

coords(R, "best", ret = "threshold")


# Conversion of probabilities into class labels using the optimal cut-off

mod_2.pred=rep("No",28906)

mod_2.pred[mod_2.probs>.6845515]="Yes"

mod_2.predict=ifelse(mod_2.pred=="Yes",1,0)

# Creating the confusion matrix to check how many observations are correctly or incorrectly classified

table(mod_2.predict,DefaulterFlag)

# Calculating the fraction of observations for which the prediction was correct

mean(mod_2.predict==DefaulterFlag)

# Calculating the misclassification rate

mean(mod_2.predict!=DefaulterFlag)

# Hosmer-Lemeshow goodness-of-fit test for the model
# DefaulterFlag is already coded 0/1, so it can be passed directly

library(ResourceSelection)

hoslem.test(DefaulterFlag, fitted(mod_2))

set.seed(1)

# Creating a hold-out data set

# Creating the Training Data Set

train=sample(1:28906,14453)

# Training Data

cr_train=cr[train,]

dim(cr_train)

# Test Data

cr_test=cr[-train,]
dim(cr_test)

# Creating an array of "Status" variable for Training Data

df_train=DefaulterFlag[train]

# Creating an array of "Status" variable for Test Data

df_test=DefaulterFlag[-train]

# Fitting a new logistic regression model based on the training data set

mod_train=glm(DefaulterFlag~.-DefaulterType, data=cr, subset=train, family=binomial)

# Predicting P(Y=1|X) for the test data set based on the fitted logistic regression model

mod_probs_test=predict(mod_train,cr_test,type="response")

names(cr_train)

dim(cr_test)

names(cr_test)

# ROC plot (the predictions are for the test set, so they are compared with df_test)

R=roc(df_test,mod_probs_test)

plot(R,col="blue",legacy.axes = TRUE)

coords(R, "best", ret = "threshold")

# Conversion of probabilities into class labels

mod_pred_test=rep("No",14453)

mod_pred_test[mod_probs_test>.644594]="Yes"

mod_test.predict=ifelse(mod_pred_test=="Yes",1,0)

# Creating the confusion matrix for the test set
table(mod_test.predict,df_test)

# Calculating the fraction of observations for which the prediction was correct

mean(mod_test.predict==df_test)

# Calculating the misclassification rate

mean(mod_test.predict!=df_test)

dim(cr_train)

require(rpart.plot)

require(rpart)

r <- rpart(DefaulterFlag~.-DefaulterType,data=cr, subset=train, method = "class")

rpart.plot(r, type=3, extra=101, fallen.leaves = T)

p <-predict(r,cr_test,type = "class")

# Confusion Matrix

table(df_test,p)

# Calculating the fraction of test observations for which the prediction was correct

mean(df_test==p)

# Calculating the misclassification rate

mean(df_test!=p)
