


Retail Credit Scoring for

Auto Finance Ltd.

Group-7
Section A

DAUR Assignment 3
Method:
- For ease of calculation, and to remove fields unlikely to contribute to the
analysis, we dropped several columns from the data frame
- We used logistic regression to build a model classifying defaulters
(only 2 classes: defaulters and non-defaulters)
- After building the model, we evaluated it with a validation-set approach: since
the assignment asks for the data to be split into two halves, we used an equal
number (14453) of observations for the training and test sets
- We then fitted a classification tree to the same data
- For both methods, we built a confusion matrix and reported the correct and
incorrect prediction rates
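
The 50/50 split described above can be sketched as follows (a minimal sketch;
the variable names are illustrative, and the complete code appears in the Code
section):

```r
# Minimal sketch of the 50/50 validation-set split described above.
# Names here are illustrative; the full analysis code is in the Code section.
set.seed(1)
n <- 28906                     # total number of observations in the data set
train <- sample(1:n, n / 2)    # 14453 randomly chosen training indices
test  <- setdiff(1:n, train)   # the remaining 14453 indices form the test half
length(train)                  # 14453
length(test)                   # 14453
```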

Analysis
Since monthly income in thousands (“MTHINCTH”) turned out not to be significant in the
model, we removed this field and ran the logistic regression again without it.

Odds Ratio, ROC, Confusion Matrix, Hosmer-Lemeshow Test


Output

[Model output omitted]

Test and training set

Output

[ROC plot omitted]

Fraction of observations for which the prediction was correct: 0.573

Misclassification rate: 0.43
Classification Tree

Output

[Fitted tree plot omitted]

Fraction of observations for which the prediction was correct: 0.73

Misclassification rate: 0.27
Conclusion

1. “MTHINCTH” does not have a significant impact on credit defaulting

2. On the basis of the confusion-matrix results (the classification and
misclassification rates), the fraction of correct classifications increased with
the tree model, so a classification tree is a better approach than logistic
regression in this case

3. The analysis can be further extended by using 3 classes of defaulters
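
To make point 2 concrete, both rates come straight from a 2x2 confusion matrix.
A minimal sketch with made-up counts (illustrative only, not the numbers from
our run):

```r
# Illustrative only: reading the classification and misclassification rates
# off a 2x2 confusion matrix. The counts below are made up, not from our run.
cm <- matrix(c(9000, 1500, 2400, 1553), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))
accuracy <- sum(diag(cm)) / sum(cm)   # fraction correctly classified
misclass <- 1 - accuracy              # misclassification rate
```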
Code
# Load the R library "ISLR"

library(ISLR)

library(tree)

cr<-read.csv("Downloads/DAUR/Retail credit.csv",header=T)

# Attach the data frame to the search path (detach() before any attach() would error)
attach(cr)

names(cr)

dim(cr)

cr <- subset( cr, select = -c(1:3,21) )

names(cr)

dim(cr)

mod_1=glm(DefaulterFlag~.-DefaulterType, data=cr, family=binomial)

summary(mod_1)

mod_2=glm(DefaulterFlag~. -MTHINCTH-DefaulterType, data=cr,family=binomial)

summary(mod_2)

require(MASS)

exp(cbind(Odds_Ratio=coef(mod_2), confint(mod_2)))

# Using the "predict()" function to obtain probabilities of the form P(Y=1|X)
# "type=response" returns P(Y=1|X) rather than other information such as the logit

mod_2.probs=predict(mod_2,type="response")

# ROC plot and optimal cut-off point

library(pROC)

R=roc(DefaulterFlag,mod_2.probs)

plot(R,col="blue",legacy.axes = TRUE)

coords(R, "best", ret = "threshold")


# Conversion of probabilities into class labels using the optimal cut-off

mod_2.pred=rep("No",28906)

mod_2.pred[mod_2.probs>.6845515]="Yes"

mod_2.predict=ifelse(mod_2.pred=="Yes",1,0)

# Creating the confusion matrix to check how many observations are correctly or incorrectly classified

table(mod_2.predict,DefaulterFlag)

# Calculating the fraction of observations for which the prediction was correct

mean(mod_2.predict==DefaulterFlag)

# Calculating the misclassification rate

mean(mod_2.predict!=DefaulterFlag)

# Hosmer-Lemeshow goodness-of-fit test for the model
# DefaulterFlag is already coded 0/1, so it can be passed directly

library(ResourceSelection)

hoslem.test(DefaulterFlag, fitted(mod_2))

set.seed(1)

# Creating a hold-out data set

# Creating the Training Data Set

train=sample(1:28906,14453)

# Training Data

cr_train=cr[train,]

dim(cr_train)

# Test Data

cr_test=cr[-train,]
dim(cr_test)

# Creating an array of "Status" variable for Training Data

df_train=DefaulterFlag[train]

# Creating an array of "Status" variable for Test Data

df_test=DefaulterFlag[-train]

# Fitting a new logistic regression model based on the training data set

mod_train=glm(DefaulterFlag~.-DefaulterType, data=cr, subset=train, family=binomial)

# Predicting P(Y=1|X) for the test data set based on the fitted logistic regression model

mod_probs_test=predict(mod_train,cr_test,type="response")

names(cr_train)

dim(cr_test)

names(cr_test)

# ROC plot (the predictions are for the test set, so they are compared with df_test)

R=roc(df_test,mod_probs_test)

plot(R,col="blue",legacy.axes = TRUE)

coords(R, "best", ret = "threshold")

# Conversion of probabilities into class labels

mod_pred_test=rep("No",14453)

mod_pred_test[mod_probs_test>.644594]="Yes"

mod_test.predict=ifelse(mod_pred_test=="Yes",1,0)

# Creating the confusion matrix for the test set
table(mod_test.predict,df_test)

# Calculating the fraction of observations for which the prediction was correct

mean(mod_test.predict==df_test)

# Calculating the misclassification rate

mean(mod_test.predict!=df_test)

dim(cr_train)

require(rpart.plot)

require(rpart)

r <- rpart(DefaulterFlag~.-DefaulterType,data=cr, subset=train, method = "class")

rpart.plot(r, type=3, extra=101, fallen.leaves = T)

p <-predict(r,cr_test,type = "class")

# Confusion Matrix

table(df_test,p)

# Calculating the fraction of test observations for which the prediction was correct

mean(df_test==p)

# Calculating the misclassification rate

mean(df_test!=p)
