Data Mining Project Report
Contents:
Problem 1: Clustering
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
1.2 Do you think scaling is necessary for clustering in this case? Justify
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters
Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
2.2 Data Split: Split the data into test and train, build classification models CART, Random Forest, Artificial Neural Network
2.3 Performance Metrics: Comment and check the performance of predictions on train and test sets using accuracy, confusion matrix, ROC curve, ROC_AUC score, and classification reports for each model
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized
2.5 Inference: Based on the whole analysis, what are the business insights and recommendations
List of Tables:
Table1 Dataset Sample
Table2 Description of the data
Table3 Correlation between variables
Table4 Sample of the data before scaling
Table5 Sample of the data after scaling
Table6 Summary of the scaled data
Table7 Sample of the data after scaling
Table8 Sample of the data
Table9 Cluster Profile
Table10 Sample of the data
Table11 Cluster Profile
Table12 Sample of the data
Table13 Sample of the data
Table14 Cluster Profile
Table15 3 Group Cluster
Table16 3 Group Cluster
Table17 3 Group Cluster
Table18 Sample of the data
Table19 Description of the data
Table20 Description of the data
Table21 Sample of the duplicate data
Table22 Sample of the data
Table23 Sample of train data
Table24 Sample of train data
Table25 Sample of train data
Table26 Performance Metrics
List of Figures:
Fig1 Histogram and box plot
Fig2 Histogram and box plot
Fig3 Histogram and box plot
Fig4 Scatter Plot
Fig5 Scatter Plot
Fig6 Scatter Plot
Fig7 Scatter Plot
Fig8 Scatter Plot
Fig9 Pair Plot
Fig10 Correlation heat map
Fig11 Box Plots
Fig12 Box Plots
Fig13 Histograms
Fig14 Dendrogram
Fig15 Dendrogram
Fig16 Dendrogram
Fig17 Dendrogram
Fig18 WSS Plot
Fig19 WSS Plot
Fig20 Box Plot
Fig21 Histogram
Fig22 Box Plot
Fig23 Histogram
Fig24 Box Plot
Fig25 Histogram
Fig26 Box Plot
Fig27 Histogram
Fig28 Bar Plot
Fig29 Box Plot
Fig30 Swarm Plot
Fig31 Bar Plot
Fig32 Swarm Plot
Fig33 Box Plot
Fig34 Bar Plot
Fig35 Swarm Plot
Fig36 Box Plot
Fig37 Bar Plot
Fig38 Swarm Plot
Fig39 Box Plot
Fig40 Bar Plot
Fig41 Swarm Plot
Fig42 Box Plot
Fig43 Pair Plot
Fig44 Correlation heat map
Fig45 Decision Tree
Fig46 ROC curve for train data
Fig47 ROC curve for test data
Fig48 ROC curve for train data
Fig49 ROC curve for test data
Fig50 ROC curve for train data
Fig51 ROC curve for test data
Fig52 ROC curve for 3 models on train data
Fig53 ROC curve for 3 models on test data
Problem 1: Clustering - Statement:
A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past
few months. You are given the task to identify the segments based on credit card usage.
Q1.1 Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis)
Solution:
Data Description:
The data at hand contains different features of customers such as the nature of spending, advance_payments, probability of full payment, the customer's current balance, credit limit, minimum payment amount, and maximum amount spent in a single shopping trip, for one of the leading banks.
Attribute Information:
From the above table we can say that there are seven variables present in the data. All the given variables are of float data type.
Below are the variables and data types for the given dataset:
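As a minimal sketch of how this listing and the summary statistics were obtained (the file name below is an assumption, not stated in the report):

import pandas as pd

# Assumed file name; replace with the actual dataset path
df = pd.read_csv("bank_marketing_part1_Data.csv")

print(df.shape)            # expected (210, 7)
print(df.dtypes)           # all seven variables are float64
print(df.describe().T)     # count, mean, std, min, quartiles, max per variable
print(df.isnull().sum())   # check for missing values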
Observation:
There are a total of 210 observations in the data, indexed 0 to 209.
The spending variable has 210 records. The minimum amount spent by a customer per month (in 1000s) is 10.59, i.e. 10,590/-, and the maximum is 21.18, i.e. 21,180/-. The 25th percentile is 12.27 (12,270/-), the median is 14.355 (14,355/-), and the 75th percentile is 17.305 (17,305/-). The average spend per month is 14.847524 (14,847.524/-) with a standard deviation of 2.909699 (2,909.699/-).
The advance_payments variable has 210 records. The minimum amount paid by a customer in advance by cash (in 100s) is 12.41, i.e. 1,241/-, and the maximum is 17.25, i.e. 1,725/-. The 25th percentile is 13.45 (1,345/-), the median is 14.32 (1,432/-), and the 75th percentile is 15.715 (1,571.5/-). The average is 14.559286 (1,455.9286/-) with a standard deviation of 1.305959 (130.5959/-).
The probability_of_full_payment variable has 210 records. The minimum probability of payment made in full to the bank is 0.8081 and the maximum is 0.9183. The 25th percentile is 0.85690, the median is 0.87345, and the 75th percentile is 0.887775. The average is 0.870999 with a standard deviation of 0.023629.
The current_balance variable has 210 records. The minimum balance left in the account to make purchases (in 1000s) is 4.899, i.e. 4,899/-, and the maximum is 6.675, i.e. 6,675/-. The 25th percentile is 5.26225 (5,262.25/-), the median is 5.5235 (5,523.5/-), and the 75th percentile is 5.97975 (5,979.75/-). The average is 5.628533 (5,628.533/-) with a standard deviation of 0.443063 (443.063/-).
The credit_limit variable has 210 records. The minimum credit card limit (in 10000s) is 2.63, i.e. 26,300/-, and the maximum is 4.033, i.e. 40,330/-. The 25th percentile is 2.944 (29,440/-), the median is 3.237 (32,370/-), and the 75th percentile is 3.56175 (35,617.5/-). The average is 3.258605 (32,586.05/-) with a standard deviation of 0.377714 (3,777.14/-).
The min_payment_amt variable has 210 records. The minimum amount paid by a customer while making monthly payments for purchases (in 100s) is 0.7651, i.e. 76.51/-, and the maximum is 8.456, i.e. 845.6/-. The 25th percentile is 2.5615 (256.15/-), the median is 3.599 (359.9/-), and the 75th percentile is 4.76875 (476.875/-). The average is 3.700201 (370.02/-) with a standard deviation of 1.503557 (150.35/-).
The max_spent_in_single_shopping variable has 210 records. The minimum of the maximum amount spent in one purchase (in 1000s) is 4.519, i.e. 4,519/-, and the maximum is 6.55, i.e. 6,550/-. The 25th percentile is 5.045 (5,045/-), the median is 5.233 (5,233/-), and the 75th percentile is 5.877 (5,877/-). The average is 5.408071 (5,408.07/-) with a standard deviation of 0.49148 (491.48/-).
Kurtosis:
Insights:
From the skewness values we can see that most variables are close to 0 (though not exactly 0), which indicates that they have an approximately normal/symmetrical or slightly right-skewed distribution, while the probability_of_full_payment variable has negative skewness, indicating a left-skewed distribution.
Kurtosis: Kurtosis helps us understand the sharpness of the peak of a distribution. Excess kurtosis is 0 for a normal distribution; positive values indicate a sharper peak with heavier tails, and negative values a flatter peak.
Data Visualization:
We can see each variable's distribution in the data through visualization as well.
Univariate analysis refers to the analysis of a single variable. The main purpose of univariate analysis is to summarize and find patterns in the data. The key point is that only one variable is involved in the analysis.
Bi-variate analysis refers to the analysis of two variables and finding the relationship between them.
Multivariate analysis refers to the analysis of more than two variables in order to find patterns, relationships, and distributions among the variables.
In the plots below we can see each variable's distribution, along with a box plot for each variable to verify whether outliers are present.
Observation:
The spending variable has 210 records. The minimum amount spent by a customer per month (in 1000s) is 10.59, i.e. 10,590/-, with a maximum of 21,180/-. The median is 14,355/- and the average is 14,847.524/-, with a standard deviation of 2.909699, i.e. 2,909.699/-.
The advance_payments variable has 210 records. The minimum amount paid by a customer in advance by cash (in 100s) is 12.41, i.e. 1,241/-, with a maximum of 1,725/-. The median is 1,432/- and the average is 1,455.9286/-, with a standard deviation of 130.5959/-.
Distribution of spending and advance_payments variables
From the above plot we can say that the spending variable has a symmetrical distribution and there are no outliers.
The advance_payments variable has a slightly right-skewed distribution and there are no outliers.
Observation:
The minimum probability of full payment is 0.8081 and the maximum is 0.9183.
The average probability of full payment is 0.870999 with a standard deviation of 0.023629.
Insights:
From the above plot we can say that the probability_of_full_payment variable has a left-skewed distribution and there are outliers on the lower whisker side.
The current_balance variable has a right-skewed distribution and there are no outliers in this variable.
probability_of_full_payment variable:
current_balance variable:
Observation:
The credit_limit variable has 210 records. The minimum credit card limit (in 10000s) is 2.63, i.e. 26,300/-, with a maximum of 40,330/-; the median is 32,370/- and the average is 32,586.05/-.
The min_payment_amt variable has 210 records. The minimum amount paid by a customer while making monthly payments for purchases (in 100s) is 0.7651, i.e. 76.51/-, with a maximum of 845.6/-; the median is 359.9/- and the average is 370.02/-, with a standard deviation of 150.35/-.
The max_spent_in_single_shopping variable has 210 records. The minimum of the maximum amount spent in one purchase (in 1000s) is 4.519, i.e. 4,519/-, with a maximum of 6,550/-; the median is 5,233/- and the average is 5,408.07/-, with a standard deviation of 491.48/-.
The min_payment_amt variable has a roughly normal distribution and there are two outliers on the upper whisker side.
The max_spent_in_single_shopping variable has a right-skewed distribution and there are no outliers.
credit_limit variable:
min_payment_amt variable:
In the below plots we can see the Numeric vs Numeric variable distribution:
Fig4.Scatter plot
Insights:
From the above plot we can say that the amount spent by the customer per month (in 1000s) is
strongly correlated with the advance_payments variable.
Fig5.Scatter plot
Insights:
From the above plot we can say that the amount spent by the customer per month (in 1000s) is
strongly correlated with the current_balance variable as well
Fig6.Scatter plot
Insights:
From the above plot we can say that the amount spent by the customer per month (in 1000s) is
strongly correlated with the credit_limit variable as well.
Fig7.Scatter plot
Insights:
From the above plot we can say that the amount spent by the customer per month (in 1000s) is
strongly correlated with the max_spent_in_single_shopping variable as well
Fig8.Scatter plot
Insights:
From the above plot we can say that the amount paid by the customer in advance by cash (in
100s) is strongly correlated with the current_balance variable.
Pair Plot:
Fig9. Pair plot
Insights:
From the above pair plot and the correlation heat map we can say that the strongest correlation is between the spending and advance_payments variables, at 0.99.
We can see the correlation between all the variables in the table below.
Table3. Correlation between variables
Insights:
From the above table we can again see that the strongest correlation is between the spending and advance_payments variables, at 0.99.
Outliers treatment:
As clustering techniques are very sensitive to outliers, we need to treat them before proceeding with any clustering problem.
Strategy to remove outliers:
Looking at the box plots, it seems that two variables, probability_of_full_payment and min_payment_amt, have outliers.
These outlier values need to be treated, and there are several ways of treating them.
One strategy is to cap the outlier values using the IQR instead of dropping the rows, as dropping would lose information from the other columns; moreover, the outliers are present in only two variables and within about 5 records.
From the box plots above we can say that we have identified outliers in two variables.
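A minimal sketch of the IQR-based capping described above (assuming the frame holding the data is named df, as in the earlier sketch):

def cap_outliers_iqr(frame, col):
    # Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at the whisker limits
    q1, q3 = frame[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    frame[col] = frame[col].clip(lower=lower, upper=upper)

for col in ["probability_of_full_payment", "min_payment_amt"]:
    cap_outliers_iqr(df, col)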
Please find below the box plots after treating the outliers.
Observation:
Though we have treated the outliers, we still see one point in the box plot; this is acceptable, as it is not extreme and lies on the lower band.
Let's check the description of the data after outlier treatment:
There is not much change in the description of the data for the min_payment_amt variable: the maximum value before outlier treatment was 8.45, and after treatment it is 8.079.
Q1.2 Do you think scaling is necessary for clustering in this case? Justify
Solution:
Distance calculations are done to find similarity and dissimilarity in clustering problems.
If the variance is large for one column and very small for another, then we go for normalization.
If the variance between the columns is more or less the same but the magnitudes are different, then we go for the z-score method.
Yes, it is necessary to perform scaling for clustering. For instance, in the given dataset the spending and advance_payments variables have values in two digits, the other variables have values in a single digit, and the probability_of_full_payment variable has values less than one. Since the data in these variables is on different scales, it is tough to compare the variables directly; by performing scaling, we can easily compare them.
The magnitude of each variable is different. So, in order to bring the variables onto a single measurement scale, we need to scale the data so that all maximum and minimum values are comparable and further processing gives unbiased output. We will perform z-score scaling, in which the values mostly lie between -3 and +3. The scaled data is shown below:
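A minimal sketch of the z-score scaling (assuming the outlier-treated frame is named df, as above):

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled_df.describe().round(3))   # means ~0, standard deviations ~1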
Observation: Before scaling we can see that the scales of the numerical features are different.
Scaling brings all the numerical features onto the same scale, with the mean of each feature tending to zero and the standard deviation tending to one.
After scaling, all axes have the same variance, so the data is centralized.
Because all variables then have the same standard deviation, they all carry the same weight. Scaled data will have a mean tending to 0 and a standard deviation tending to 1.
From the above table we can see that the scaled values are centred around 0 with unit standard deviation.
Fig13. Histograms
Insights:
From the above plot we can see the spending variable's distribution before and after scaling: it has a slightly right-skewed distribution, and if we observe the x-axis there is a clear difference between the unscaled and scaled versions.
Z-score scaling transforms the data in such a way that the mean of each feature tends to 0 and the standard deviation tends to 1.
The Min-Max method, in contrast, ensures that the data is scaled to values in the range 0 to 1.
We need scaling for this data because there are 7 numerical features present and their scales are different.
The scale of the spending and advance_payments variables is different from that of the remaining variables. So if we scale by the z-score (standardization) method we can remove this difference; the data will be centred, with the mean tending to zero and the standard deviation tending to one.
Feature scaling (also known as data normalization) is the method used to standardize the range of features of the data. Since the ranges of values may vary widely, it becomes a necessary preprocessing step when using machine learning algorithms.
In this method, we convert variables with different scales of measurement onto a single scale. In distance-based algorithms it is recommended to transform the features so that all features are on the same scale.
Z-score: Z = (X - μ) / σ. Scaled data will have a mean tending to 0 and a standard deviation tending to 1. Used in weight-based techniques (PCA, neural networks, etc.).
Min-Max: (X - Xmin) / (Xmax - Xmin). Scaled data will range between 0 and 1. Used in distance-based techniques (clustering, KNN, etc.).
Q1.3 Apply hierarchical clustering to scaled data. Identify the number of
optimum clusters using Dendrogram and briefly describe them.
Solution:
1. Hierarchical clustering.
Hierarchical clustering also produces a useful graphical display of the clustering process and results, called a dendrogram. First, we will perform plain hierarchical clustering, i.e. without specifying the optimum number of clusters, as shown in the image below.
Dendrogram: the least distance observed for each merge is plotted in a visual form, which we call a dendrogram.
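A minimal sketch of building the linkage matrix and plotting the dendrogram on the scaled data (scaled_df as above); Ward linkage is shown as one common choice, since the exact linkage method is not reproduced here:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

link_matrix = linkage(scaled_df, method="ward")

plt.figure(figsize=(12, 5))
dendrogram(link_matrix)                               # full dendrogram
plt.title("Dendrogram")
plt.show()

plt.figure(figsize=(12, 5))
dendrogram(link_matrix, truncate_mode="lastp", p=10)  # truncated, neater view
plt.show()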
Dendrogram
Fig14. Dendrogram
Observation:
We can see a tree-like structure with different colour codes. We cannot see the number of records on the x-axis because the dendrogram contains all the records, so we will use the cut-tree (truncate) option to visualize it more clearly.
Then, we will create clusters using the cut-tree command and create a new dataset by combining the old data with the cluster labels.
The dendrogram has created partitions for us and given each a different colour code. The image above is a little confusing, so by passing additional parameters which truncate the dendrogram we get the neater output shown below.
Dendrogram
Fig15. Dendrogram
Insights:
We need to identify a cutoff point on the y-axis. When we choose this cutoff point we arrive at 3 clusters; we say 3 clusters because that is the number of vertical lines that intercept the cutoff.
This cutoff is the most suitable for this data for the following reasons:
The vertical lines passing through this cutoff have the greatest length, owing to the agglomerative property and not because the merged groups are similar to each other; hence this is the optimal place to draw the cutoff.
Every horizontal line we see is nothing but two clusters merging together and becoming one.
The vertical lines represent the height at which the merging happens.
It is clear that we can obtain three clusters from this data.
Then, we will create clusters using the cut-tree command and create a new dataset by combining the old data with the cluster labels. Any desired number of clusters can be obtained by cutting the dendrogram at the proper level; here we will go for three clusters.
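A minimal sketch of cutting the tree into 3 clusters and attaching the labels to the original data (fcluster is scipy's equivalent of a cut-tree command; df and link_matrix as in the earlier sketches):

from scipy.cluster.hierarchy import fcluster

h_clusters = fcluster(link_matrix, t=3, criterion="maxclust")
df["h_cluster"] = h_clusters                 # append cluster labels to the original data
print(df["h_cluster"].value_counts())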
If the difference in the height of the vertical lines of the dendrogram is small then the clusters
that are formed will be similar.
Created 3 clusters
Observation:
From the above output we can say that we have created 3 clusters for the entire dataset.
From the above table we can see that each record is mapped to its respective cluster based on the distance calculations.
Total counts for each cluster:
Observation:
Cluster one has 75 records, cluster two has 70 records, and cluster three has 65 records.
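A minimal sketch of how the cluster counts and the cluster profile table (cluster-wise means) can be produced (the column name h_cluster is from the sketch above):

cluster_profile = df.groupby("h_cluster").mean()
cluster_profile["freq"] = df["h_cluster"].value_counts()
print(cluster_profile.round(2))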
Cluster profiles:
In the first cluster the frequency is 75. The average spending is 18.12, average
advance_payments is 16.05, average probability_of_full_payment is 0.88, average
current balance is 6.13, average credit_limit is 3.64, average minimum payment amount
is 3.65 and average max_spent_in_single_shopping is 5.98
In the second cluster the frequency is 70. The average spending is 11.91, average
advance_payments is 13.29, average probability_of_full_payment is 0.84, average
current balance is 5.25, average credit_limit is 2.84, average minimum payment amount
is 4.61 and average max_spent_in_single_shopping is 5.11
In the third cluster the frequency is 65. The average spending is 14.21, average
advance_payments is 14.19, average probability_of_full_payment is 0.88, average
current balance is 5.44, average credit_limit is 3.25, average minimum payment amount
is 2.76 and average max_spent_in_single_shopping is 5.055
Fig16. Dendrogram
Observation:
From the above picture we can see a tree-like structure with different colour codes. We cannot see the number of records on the x-axis because the dendrogram contains all the records, so we will use the cut-tree command to visualize it more clearly.
Then, we will create clusters using the cut-tree command and create a new dataset by combining the old data with the cluster labels.
Dendrogram
Fig17. Dendrogram
Observation:
From the above picture we can see the records on the x-axis mapped to their respective clusters based on their distance (height) on the y-axis; we have used the cut-tree command so that this is clearly visible.
Created 3 clusters
Observation:
From the above output we can say that we have created 3 clusters for the entire dataset.
From the above table we can see that each record is mapped to its respective cluster based on the distance calculations.
Observation:
Cluster 1 has a total of 70 records, cluster 2 has 67 records, and cluster 3 has 73 records.
Cluster profiles:
Both methods give almost similar means, with minor variation, which we know can occur.
For cluster grouping based on the dendrogram, 3 or 4 clusters look good. After further analysis, and based on the dataset, we went for a 3-group cluster solution from the hierarchical clustering.
The three-group cluster solution gives a pattern based on high/medium/low spending, together with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
In the first cluster the frequency is 70. The average spending is 18.37, average
advance_payments is 16.14, average probability_of_full_payment is 0.88, average
current balance is 6.15, average credit_limit is 3.68, average minimum payment amount
is 3.63 and average max_spent_in_single_shopping is 6.01
In the second cluster the frequency is 67. The average spending is 11.87, average
advance_payments is 13.25, average probability_of_full_payment is 0.84, average
current balance is 5.23, average credit_limit is 2.84, average minimum payment amount
is 4.94 and average max_spent_in_single_shopping is 5.12
In the third cluster the frequency is 73. The average spending is 14.19, average
advance_payments is 14.23, average probability_of_full_payment is 0.879, average
current balance is 5.47, average credit_limit is 3.22, average minimum payment amount
is 2.61 and average max_spent_in_single_shopping is 5.086
A total of 70 observations are grouped into this cluster; the amount spent by the customer per month is high compared with the other two clusters.
High credit limit, high spending. The number of consumers falling under this cluster is medium.
The amount paid by the customer in advance by cash is also higher than in the other two clusters.
The balance amount left in the account to make purchases is also high.
The average current balance is good, which leads to a good maximum amount spent in single shopping.
A total of 73 observations are grouped into this cluster; the amount spent by the customer per month is not high, and customers spend in the medium range.
The credit limit is a little below the maximum, but the spending is less. The number of consumers falling under this cluster is the highest. The average minimum amount paid is also less.
The minimum amount paid by the customer while making monthly payments for purchases is lower in this cluster.
The maximum amount spent in one purchase is also lower than in the other two clusters.
A total of 67 observations are grouped into this cluster; the amount spent by the customer per month is very low compared to the remaining two cluster groups.
Low credit limit, low spending. The number of consumers falling under this cluster is the lowest.
The minimum amount paid by the customer while making monthly payments for purchases is higher compared with the other two clusters.
The maximum amount spent in one purchase is higher than in the medium-range cluster group but lower than in the high-spending cluster group.
Spending, advance_payments, probability of full payment and current balance are all in the lower range compared with the other two clusters.
Q1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
Solution:
We will perform K-means clustering on our scaled data using the K-means function and get the following output:
Observation:
From the above output we can say that we have created 3 clusters for the entire dataset.
The within-cluster sum of squares for the 3-cluster solution is 430.445.
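A minimal sketch of fitting K-means with 3 clusters on the scaled data and reading the within-cluster sum of squares (inertia); the random_state value is an assumption:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=1)
kmeans.fit(scaled_df)

print(kmeans.labels_[:10])   # cluster label for each record
print(kmeans.inertia_)       # within-cluster sum of squares for k = 3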
Observation:
WSS reduces as K keeps increasing.
Observation:
WSS is calculated for K values ranging from 1 to 10.
As the value of K increases, the WSS (within sum of squares) reduces.
From K = 1 to K = 2 there is a significant drop, and from K = 2 to K = 3 there is another noticeable drop, but from K = 3 to K = 4 there is no significant drop, only a mild difference. So we can consider K = 3 as the suitable number of clusters for the data.
We can visualize the same via a point plot as well, which is also called the elbow method.
WSS plot: the WSS plot, also called the distortion plot or error plot, helps us decide how many clusters are needed as output in K-means clustering.
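A minimal sketch of computing the WSS for k = 1 to 10 and drawing the elbow (WSS) plot:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(scaled_df)
    wss.append(km.inertia_)

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares")
plt.title("WSS / elbow plot")
plt.show()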
Fig18. WSS plot
Observation:
In the WSS plot we can see the significant drops; once we identify where the drop stops being significant, we obtain the optimal number of clusters for the K-means algorithm. Here we have chosen 3 as the optimal number of clusters from the WSS plot, as there is no significant drop after 3.
If the drop is not significant, then additional clusters are not useful.
This is the same WSS plot, but without the points marked for each cluster; here also we can see the significant drops from 1 to 2 and from 2 to 3.
After 3 there is no significant drop, so we have chosen 3 as the optimal number of clusters for the data.
Compute the clustering algorithm (k-means) for different values of k, for instance by varying k from 1 to 10 clusters.
The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
The silhouette score is an indirect model evaluation technique which we can apply once the clustering procedure, namely the distance-based K-means model, is completed.
For all the observations we calculate the sil-width and average them; the resulting output is called the silhouette_score.
If the silhouette_score is close to a negative value, then the model has made a blunder in clustering the data.
Since our score is nearer to 0.5, we can say this is a well-distinguished set of clusters: the 3 clusters that were created have, on average, a silhouette score of about 0.4, which is on the positive side.
Validating the clusters: the resulting clusters should be valid in order to generate insights.
Sil-width:
The silhouette_score function computes the average of all the silhouette widths.
silhouette_samples computes the silhouette width for each and every row.
We can check the minimum value returned by silhouette_samples, i.e. the smallest silhouette width; if it is positive, it indicates that no observation is incorrectly mapped to a cluster, since all silhouette widths are on the positive side.
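A minimal sketch of computing the average silhouette score and the per-row silhouette widths for the 3-cluster K-means solution (kmeans and scaled_df as above):

from sklearn.metrics import silhouette_score, silhouette_samples

labels = kmeans.labels_
print(silhouette_score(scaled_df, labels))         # average silhouette width (about 0.4 here)

sil_width = silhouette_samples(scaled_df, labels)  # silhouette width for every row
df["kmeans_cluster"] = labels
df["sil_width"] = sil_width
print(df["sil_width"].min())                       # positive => no clearly misassigned record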
From the above table we can see that each record has been assigned to its respective cluster based on distance, and the cluster column has been added to our original dataset, so that we can easily identify which record is mapped to which cluster and analyze the cluster properties.
Appending sil_width to the original dataset
From the above table we can see that the silhouette width has been calculated for each and every row in the data.
Observation:
Cluster 1 has a total of 67 records, cluster 2 has 72 records, and cluster 3 has 71 records.
Cluster Profiling:
Observation:
In the first cluster we have the frequency of 67, average spending in this cluster is 18.5,
average advance_payments is 16.2, average probability of full payment is 0.88, average
current balance is 6.17, average credit_limit is 3.69, min_payment_amt is 3.62 and the
average max_spent_in_single_shopping is 6.04
In the second cluster we have the frequency of 72, average spending in this cluster is
11.85, average advance_payments is 13.24, average probability of full payment is 0.84,
average current balance is 5.23, average credit_limit is 2.84, min_payment_amt is 4.74
and the average max_spent_in_single_shopping is 5.10
In the third cluster we have the frequency of 71, average spending in this cluster is
14.43, average advance_payments is 14.33, average probability of full payment is 0.88,
average current balance is 5.51, average credit_limit is 3.25, min_payment_amt is 2.70
and the average max_spent_in_single_shopping is 5.12
In the first cluster the frequency is 70. The average spending is 18.37, average
advance_payments is 16.14, average probability_of_full_payment is 0.88, average
current balance is 6.15, average credit_limit is 3.68, average minimum payment amount
is 3.63 and average max_spent_in_single_shopping is 6.01
In the second cluster the frequency is 67. The average spending is 11.87, average
advance_payments is 13.25, average probability_of_full_payment is 0.84, average
current balance is 5.23, average credit_limit is 2.84, average minimum payment amount
is 4.94 and average max_spent_in_single_shopping is 5.12
In the third cluster the frequency is 73. The average spending is 14.19, average
advance_payments is 14.23, average probability_of_full_payment is 0.879, average
current balance is 5.47, average credit_limit is 3.22, average minimum payment amount
is 2.61 and average max_spent_in_single_shopping is 5.086
Or
In the first cluster we have the frequency of 67, average spending in this cluster is 18.5,
average advance_payments is 16.2, average probability of full payment is 0.88, average
current balance is 6.17, average credit_limit is 3.69, min_payment_amt is 3.62 and the
average max_spent_in_single_shopping is 6.04
In the second cluster we have the frequency of 72, average spending in this cluster is
11.85, average advance_payments is 13.24, average probability of full payment is 0.84,
average current balance is 5.23, average credit_limit is 2.84, min_payment_amt is 4.74
and the average max_spent_in_single_shopping is 5.10
In the third cluster we have the frequency of 71, average spending in this cluster is
14.43, average advance_payments is 14.33, average probability of full payment is 0.88,
average current balance is 5.51, average credit_limit is 3.25, min_payment_amt is 2.70
and the average max_spent_in_single_shopping is 5.12
Q1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.
Solution:
The objective of any clustering algorithm is to ensure that the distance between data points within a cluster is very low compared to the distance between two clusters, i.e. members of a group are very similar, and members of different groups are extremely dissimilar.
A total of 70 observations are grouped into this cluster; the amount spent by the customer per month is high compared with the other two clusters.
High credit limit, high spending. The number of consumers falling under this cluster is medium.
The amount paid by the customer in advance by cash is also higher than in the other two clusters.
The balance amount left in the account to make purchases is also high.
The average current balance is good, which leads to a good maximum amount spent in single shopping.
A total of 73 observations are grouped into this cluster; the amount spent by the customer per month is not high, and customers spend in the medium range.
The credit limit is a little below the maximum, but the spending is less. The number of consumers falling under this cluster is the highest. The average minimum amount paid is also less.
The minimum amount paid by the customer while making monthly payments for purchases is lower in this cluster.
The maximum amount spent in one purchase is also lower than in the other two clusters.
A total of 67 observations are grouped into this cluster; the amount spent by the customer per month is very low compared to the remaining two cluster groups.
Low credit limit, low spending. The number of consumers falling under this cluster is the lowest.
The minimum amount paid by the customer while making monthly payments for purchases is higher compared with the other two clusters.
The maximum amount spent in one purchase is higher than in the medium-range cluster group but lower than in the high-spending cluster group.
Spending, advance_payments, probability of full payment and current balance are all in the lower range compared with the other two clusters.
Give loans against the credit card, as these are customers with a good repayment record.
Tie up with luxury brands, which will drive more one-time maximum spending.
They are potential target customers who are paying their bills, making purchases, and maintaining a comparatively good credit score. So we can increase their credit limit or lower the interest rate.
Increase spending habits by tying up with premium e-commerce sites, travel portals, and airlines/hotels, as this will encourage them to spend more.
Customers should be given reminders for payments. Offers can be provided on early payments to improve their payment rate.
Increase their spending habits by tying up with grocery stores and utilities (electricity, phone, gas, others).
Problem 2: CART-RF-ANN:
An Insurance firm providing tour insurance is facing higher claim frequency. The management
decides to collect data from the past few years. You are assigned the task to make a model
which predicts the claim status and provide recommendations to management. Use CART, RF
& ANN and compare the models' performances in train and test sets.
Q2.1 Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis).
Solution:
Data Description:
The data at hand contains different features of customers such as the type of tour insurance firm, channel, product, duration, amount of sales of tour insurance policies, the commission received by the tour insurance firm, age of the insured, agency code, and claim status, for one of the insurance firms.
Attribute Information:
The data has been loaded properly and we can see in the table that it has 10 variables.
Data dimensions:
Age, Commision, Duration and Sales are numeric variables; the remaining variables are of object data type.
Observation:
The above table shows the description of all the variables present in the dataset.
Observation:
We are not removing the duplicate rows because there is no unique identifier; they can belong to different customers. Though 139 records appear duplicated, they may be from different customers, and since there is no customer ID or any unique identifier, I am not dropping them.
Data Visualization:
We can see each variable's distribution in the data through visualization as well.
Univariate analysis refers to the analysis of a single variable. The main purpose of univariate analysis is to summarize and find patterns in the data. The key point is that only one variable is involved in the analysis.
In the plots below we can see each variable's distribution, along with a box plot for each variable to verify whether outliers are present.
Univariate Analysis:
From the above plot we can say that the Age variable has a slightly right-skewed distribution.
There are many outliers in the variable on both the upper and lower sides, with more on the upper whisker side than on the lower whisker side.
Distribution of age variable
Fig21. Histogram
Insights:
From the above plot we can see the age variable's distribution; it has a slightly right-skewed distribution.
Commision variable:
Fig23. Histogram
Insights:
From the above plot we can say that the Commision variable has a right-skewed distribution.
Insights:
The Duration variable has many outliers on the upper whisker side, and we can observe that one data point is an extreme outlier on the upper whisker side.
Fig25. Histogram
Insights: The Duration variable has a heavily right-skewed distribution.
Sales variable: Range of values: 539
The Sales variable has a right-skewed distribution and there are many outliers on the upper whisker side.
Fig27. Histogram
Insights:
Categorical Variables:
Agency_Code variable:
Sales are higher for the C2B agency code, followed by CWT and EPX.
The JZI agency code has the least sales.
Type variable:
Travel Agency type tour insurance firms are present in a higher count than Airlines type tour insurance firms.
Sales are higher for the Airlines type of tour insurance firm compared with the Travel Agency type.
Channel variable:
The online distribution channel of tour insurance agencies has a higher count than the offline distribution channel.
The amount of sales of tour insurance policies is higher in the online mode of the distribution channel than in the offline mode.
Distribution of sales across the channel and claim status:
Sales are higher in the online distribution channel of tour insurance agencies, where the majority are claimed, followed by the offline mode, where the majority are also claimed.
Except for the offline mode among those who claimed, there are many outliers.
The Customised Plan is in the top position among the tour insurance products, followed by the Cancellation Plan, Bronze Plan and Silver Plan.
The Gold Plan product has the lowest count among the products.
The amount of sales of tour insurance policies is higher for the Gold Plan product, followed by the Silver Plan.
The Bronze Plan product is in the lowest position in terms of sales.
Destination variable:
ASIA is the most common destination of the tour, followed by the Americas destination.
The Europe region is in the lowest position among the destinations of the tour.
The claim status in the Americas destination is balanced, i.e. there is roughly an equal split between those who have claimed and those who have not.
In the ASIA destination the sales are higher and many have not claimed, but there is only a minor difference between the customers who have claimed and those who have not.
In the Europe destination many have not claimed, and the amount of sales of tour insurance policies is lower.
We can see that there are many outliers in all the destinations.
We can see the same in the above table for the claim status with respect to sales.
Checking pairwise distribution of the continuous variables:
Insights:
From the above plot we can see that there is a strong relationship between the Sales and Commision variables in the data. As the amount of sales of tour insurance policies increases, the commission received by the tour insurance firm also increases.
We cannot find any strong correlation except between the Sales and Commision variables.
We can say that there is no negative correlation between any variables.
Checking for correlations: heat map with only continuous variables:
Insights:
From the above correlation heat map we can see a strong correlation between the Sales and Commision variables, at 0.77.
We can see that there is no negative correlation between any two variables.
All the variables have positive correlations, but they are very weak except between Sales and Commision; Duration and Sales have a correlation of 0.56, Duration and Commision have a correlation of 0.47, and the remaining variables are not correlated with each other.
Converting all objects to categorical codes:
We have converted the object data type variables into integer categorical codes, as the models we will build here (CART, Random Forest and ANN in scikit-learn) require numeric inputs only.
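A minimal sketch of this conversion (the file name and the frame name df2 are assumptions):

import pandas as pd

df2 = pd.read_csv("insurance_part2_data.csv")   # assumed file name

# Convert every object-type column to integer category codes
for col in df2.select_dtypes(include="object").columns:
    df2[col] = pd.Categorical(df2[col]).codes

print(df2.dtypes)   # all columns should now be numeric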
We can see from the above information that all the variables have been changed into integer data types.
For building a decision tree (CART) model in Python we have to ensure that there are no object data types; both the dependent and independent variables must be numeric.
From the above table we can see that all the variables are of integer data type only; there are no object data types, as we have converted them into categorical codes.
Earlier the variables Agency_Code, Type, Claimed, Channel, Product Name and Destination were of object data type; once converted into categorical codes, all these variables are of integer data type.
Proportion of 1s and 0s
Insights:
In this data most customers have not claimed. It is class-imbalanced data, but not highly imbalanced.
About 69% of the customers have not claimed from the insurance firm, and only about 30% of the customers have claimed.
Q2.2 Data Split: Split the data into test and train, build classification model
CART, Random Forest, Artificial Neural Network
Solution:
CART (Classification and Regression Tree): it is a binary decision tree; classification has a categorical output variable and regression has a continuous output variable.
We have converted all object data types into categorical codes; below is a sample of the same.
Extracting the target column into separate vectors for the training set and test set:
From the above table we can say that all the columns except the target column are stored in a separate object, and the table above is a sample of it. We can see that the object data types have been changed into categorical codes.
We have all the variables except the target column, i.e. the Claimed variable.
Target column:
Insight:
We have extracted the target column into a separate vector; the output above is a sample of it, and it is the target column for the data.
Splitting data into training and test sets:
Data split: we split the data into train and test datasets in a 70:30 proportion. The splitting output is shown below:
Insights:
X_train and X_test contain the independent variables, divided into train and test in a 70:30 proportion.
Here we train the model on the train dataset and evaluate it on the test dataset, so we have 2100 observations of 9 independent variables in the training set, along with 2100 observations of the dependent variable used for training.
In the test data we have 900 observations of the 9 independent variables and 900 observations of the dependent variable.
70% of the data is used for training and 30% of the data is used as the test set.
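A minimal sketch of the 70:30 split (the target column name Claimed and the random_state value are assumptions; df2 as above):

from sklearn.model_selection import train_test_split

X = df2.drop("Claimed", axis=1)   # the 9 independent variables
y = df2["Claimed"]                # the target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

print(X_train.shape, X_test.shape)   # (2100, 9) and (900, 9)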
From the above tree we can say that the Agency_Code variable has the highest Gini gain, 0.426, so it is the most relevant variable for separating the 0s from the 1s.
The next variables used for splitting the data are the Sales and Product Name variables.
After that, the Age, Commision and Duration variables are used for splitting the data.
The Type, Channel and Destination variables are not used at all for splitting the data.
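A minimal sketch of fitting the CART model and reading the feature importances (the hyperparameters shown are illustrative assumptions, not the exact ones used in the report):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

cart = DecisionTreeClassifier(criterion="gini", max_depth=5,
                              min_samples_leaf=10, random_state=1)  # assumed pruning settings
cart.fit(X_train, y_train)

importances = pd.Series(cart.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))   # Agency_Code expected at the top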
Observation:
The independent variable Agency_Code separates the 0s from the 1s in the best possible manner, which is why it is chosen for the first split into child nodes. It has higher importance than the other variables.
The second variable in order of importance for separating the 0s from the 1s is the Sales variable.
The Type, Channel and Destination variables were not used in splitting the data, so they carry no importance.
We have predicted on the train and test datasets, with 2100 and 900 records respectively.
Getting the predicted classes:
We can see the predicted classes on the test data from the above output.
From the above table we can see the predicted classes and probabilities.
A key strength of ensembling is that every model we build should be independent of the others.
Now we will split the data into train and test sets in order to build the Random Forest model.
Extracting the target column into separate vectors for the training set and test set:
From the above table we can say that all the columns except the target column are stored in a separate object, and the table above is a sample of it.
We have all the variables except the target column, i.e. the Claimed variable.
Target column:
Insight:
We have extracted the target column into a separate vector; the output above is a sample of it, and it is the target column for the data.
Splitting data into training and test sets:
Data split: we split the data into train and test datasets in a 70:30 proportion. The splitting output is shown below:
Insights:
X_train and X_test contain the independent variables, divided into train and test in a 70:30 proportion.
Here we train the model on the train dataset and evaluate it on the test dataset, so we have 2100 observations of 9 independent variables in the training set, along with 2100 observations of the dependent variable used for training.
In the test data we have 900 observations of the 9 independent variables and 900 observations of the dependent variable.
70% of the data is used for training and 30% of the data is used as the test set.
The important thing is that whenever we fix the random state to some natural number, we always get the same output each time we build the model, similar to using random_state in the train-test split. Here I have used a random state of 1.
In order to build a good random forest model I have taken multiple parameters into consideration and tried multiple values for each parameter. The parameters below were used to build the model:
From the above-mentioned parameter combinations I have taken the second one into consideration in order to build my random forest classifier.
Observation:
Using the above-mentioned parameters, I have predicted on the train and test data.
We can see the predicted classes on the test data from the above output.
Grid search cross-validation gives us the best parameters.
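A minimal sketch of tuning the random forest with grid search cross-validation; the grid values below are illustrative assumptions, since the exact grid from the report is not reproduced here:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],        # assumed values
    "max_depth": [5, 7, 10],           # assumed values
    "min_samples_leaf": [10, 25, 50],  # assumed values
    "max_features": [3, 5],            # assumed values
}

rf = RandomForestClassifier(random_state=1)
grid = GridSearchCV(rf, param_grid=param_grid, cv=3, scoring="roc_auc")
grid.fit(X_train, y_train)

best_rf = grid.best_estimator_
print(grid.best_params_)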
Observation:
From the above table we can see the predicted classes and probabilities.
Observation:
The independent variable Agency_Code separates the 0s from the 1s in the best possible manner, which is why it is chosen for the first split into child nodes. It has higher importance than the other variables.
The second variable in order of importance for separating the 0s from the 1s is the Product Name variable.
The third and fourth variables used for the splitting criteria are the Sales and Commision variables.
The Type, Age, Destination and Channel variables are not used much in splitting the data, so these variables have very little importance.
The Commision, Age and Duration variables also have comparatively low importance.
An Artificial Neural Network is an attempt to mimic how the human brain processes information.
We have to scale the data in order to proceed with this model; before that, we assign the target variable to a separate object and the remaining independent variables to another object.
Then we split the data into train and test sets; here I have divided the data in a 70:30 ratio.
Extracting the target column into separate vectors for the training set and test set:
We have all the variables except the target column, i.e. the Claimed variable.
Target column:
Insight:
We have extracted the target column into a separate vector; the output above is a sample of it, and it is the target column for the data.
Data split: we split the data into train and test datasets in a 70:30 proportion. The splitting output is shown below:
Insights:
X_train and X_test contain the independent variables, divided into train and test in a 70:30 proportion.
Here we train the model on the train dataset and evaluate it on the test dataset, so we have 2100 observations of 9 independent variables in the training set, along with 2100 observations of the dependent variable used for training.
In the test data we have 900 observations of the 9 independent variables and 900 observations of the dependent variable.
70% of the data is used for training and 30% of the data is used as the test set.
Observation:
We have fit and transformed the train data so that the mean tends to 0 and the standard deviation to 1; the output above shows the training data after scaling.
Observation:
We have scaled the test data using transform only, to ensure scaling is uniform for both the train and test data: the test data must be scaled with the same mean and standard deviation as the train data. The output above shows the result.
In the hidden layer we have tried 50, 100 and 200 neurons; for the maximum number of iterations we have tried 2500, 3000 and 4000; the solver is adam and the tolerance is 0.01.
From the above parameters, grid search cross-validation has given the best parameters for my model, shown below.
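A minimal sketch of the ANN set-up with the scaling and the grid described above (hidden layer sizes 50/100/200, max_iter 2500/3000/4000, solver adam, tol 0.01); other settings such as cv and random_state are assumptions:

from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # fit on the train data only
X_test_s = scaler.transform(X_test)         # reuse the train mean/std on the test data

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (200,)],
    "max_iter": [2500, 3000, 4000],
    "solver": ["adam"],
    "tol": [0.01],
}

ann = MLPClassifier(random_state=1)
grid = GridSearchCV(ann, param_grid=param_grid, cv=3)
grid.fit(X_train_s, y_train)

best_ann = grid.best_estimator_
print(grid.best_params_)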
We can see the predicted classes on the test data from the above output.
From the above table we can see the predicted classes and probabilities.
Observation:
On the test data we have an area under the curve of 0.801, so the model is neither overfitted nor underfitted; it is a good model.
Q2.3 Performance Metrics: Comment and check the performance of predictions on train and test sets using accuracy, confusion matrix, ROC curve, ROC_AUC score, and classification reports for each model.
Solution:
CART Confusion Matrix and Classification Report for the training data
Confusion Matrix:
Observation:
For the users who have claimed, the precision is 0.70: out of all the cases my model predicted as claimed, 70% are actually positive cases.
For the users who have claimed, the recall is 0.53: out of all the actually positive cases, my model correctly predicted 53% of them.
We have an f1-score of 0.60 for the users who have claimed.
We have an accuracy of 0.79. All these results are for the train data.
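A minimal sketch of how these metrics are obtained for the CART model on the train data (the same pattern is repeated on the test data and for the other models):

import matplotlib.pyplot as plt
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_auc_score, roc_curve)

y_train_pred = cart.predict(X_train)
y_train_prob = cart.predict_proba(X_train)[:, 1]

print(confusion_matrix(y_train, y_train_pred))
print(classification_report(y_train, y_train_pred))   # precision, recall, f1, accuracy
print(roc_auc_score(y_train, y_train_prob))

fpr, tpr, _ = roc_curve(y_train, y_train_prob)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()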
CART Confusion Matrix and Classification Report for the testing data:
Confusion Matrix:
Observation:
For the users who have claimed, the precision is 0.67: out of all the cases my model predicted as claimed, 67% are actually positive cases.
For the users who have claimed, the recall is 0.51: out of all the actually positive cases, my model correctly predicted 51% of them.
We have an f1-score of 0.58 for the users who have claimed.
We have an accuracy of 0.77. All these results are for the test data.
Cart Conclusion:
Train Data:
AUC: 82%
Accuracy: 79%
Precision: 70%
f1-Score: 60%
Test Data:
AUC: 80%
Accuracy: 77%
Precision: 67%
f1-Score: 58%
The training and test set results are quite similar; the model is neither overfitted nor underfitted, and with the overall measures reasonably high, it is a good model.
For the users who have claimed, the precision is 0.72: out of all the cases my model predicted as claimed, 72% are actually positive cases.
For the users who have claimed, the recall is 0.61: out of all the actually positive cases, my model correctly predicted 61% of them.
We have an f1-score of 0.66 for the users who have claimed.
We have an accuracy of 0.80. All these results are for the train data.
Random Forest -AUC and ROC for the train data:
Observation:
On the train data we have got an Area Under the Curve of 0.856.
Random Forest Confusion Matrix and Classification Report for the testing data:
Confusion Matrix:
Observation:
For the users who have claimed, the precision is 0.68: out of the total number of positive predictions the model has made, 68% are actually positive cases.
For the users who have claimed, the recall is 0.56: out of the total number of actually positive cases, the model has correctly predicted 56%.
We have an f1-score of 0.62 for the users who have claimed.
We have an accuracy of 0.78. All these results are for the test data.
Random Forest -AUC and ROC for the test data:
Observation:
On the test data we have got an Area Under the Curve of 0.818.
Neural Network Confusion Matrix and Classification Report for the training data:
Observation:
For the users who have claimed, the precision is 0.68: out of the total number of positive predictions the model has made, 68% are actually positive cases.
For the users who have claimed, the recall is 0.51: out of the total number of actually positive cases, the model has correctly predicted 51%.
We have an f1-score of 0.59 for the users who have claimed.
We have an accuracy of 0.78. All these results are for the train data.
Neural Network -AUC and ROC for the train data:
Observation:
On the train data we have got an Area Under the Curve of 0.816.
Neural Network Confusion Matrix and Classification Report for the testing data:
Confusion Matrix:
Observation:
For the users who have claimed, the precision is 0.67: out of the total number of positive predictions the model has made, 67% are actually positive cases.
For the users who have claimed, the recall is 0.50: out of the total number of actually positive cases, the model has correctly predicted 50%.
We have an f1-score of 0.57 for the users who have claimed.
We have an accuracy of 0.77. All these results are for the test data.
Neural Network -AUC and ROC for the test data:
Observation:
On the test data we have got an Area Under the Curve of 0.804. The larger the area under the curve, the better the model, so an AUC of 0.804 on the test data indicates a reasonably strong fit.
The training and test set results are very similar, and with the overall measures reasonably high the model is neither overfitted nor underfitted; it is a good model.
Q2.4 Final Model: Compare all the models and write an inference which model
is best/optimized.
Solution:
Observation:
The accuracy of the Random Forest is the highest on both the train and test data when compared with CART and the Artificial Neural Network: 0.80 on train and 0.78 on test.
The AUC of the Random Forest is also the highest on both the train and test data compared with CART and the Artificial Neural Network: 0.86 on train and 0.82 on test.
The recall of the Random Forest is also the highest on both the train and test data compared with CART and the Artificial Neural Network: 0.61 on train and 0.56 on test.
The precision of the Random Forest is also the highest on both the train and test data compared with CART and the Artificial Neural Network: 0.72 on train and 0.68 on test.
The F1-score of the Random Forest is also the highest on both the train and test data compared with CART and the Artificial Neural Network: 0.66 on train and 0.62 on test.
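For reference, the comparison can be laid out side by side as in the sketch below; the figures are copied from the values reported in this section (rounded to two decimals), not recomputed.

# Performance comparison assembled from the figures quoted in this report.
import pandas as pd

comparison = pd.DataFrame(
    {
        'CART Train': [0.79, 0.82, 0.53, 0.70, 0.60],
        'CART Test':  [0.77, 0.80, 0.51, 0.67, 0.58],
        'RF Train':   [0.80, 0.86, 0.61, 0.72, 0.66],
        'RF Test':    [0.78, 0.82, 0.56, 0.68, 0.62],
        'ANN Train':  [0.78, 0.82, 0.51, 0.68, 0.59],
        'ANN Test':   [0.77, 0.80, 0.50, 0.67, 0.57],
    },
    index=['Accuracy', 'AUC', 'Recall', 'Precision', 'F1-Score'],
)
print(comparison)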
Observation:
From the graph above, the ROC curve for the Random Forest is slightly steeper than those for CART and the Artificial Neural Network.
The ROC curves for CART (the decision tree) and the Artificial Neural Network overlap and are almost identical.
The steeper the ROC curve, the stronger the model, so compared with CART and the Artificial Neural Network, the Random Forest has the steeper ROC curve and is the stronger model for this dataset.
Observation:
From the graph above, the ROC curves of all three models overlap, but the Random Forest curve is slightly steeper than those of CART and the Artificial Neural Network.
The curves for CART and the Artificial Neural Network overlap completely.
So on the test data as well, the Random Forest model gives the best results and is stronger than the other two models.
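The overlaid ROC curves discussed above can be drawn along these lines; cart_model, rf_model and best_ann are illustrative names for the three fitted models.

# Overlay the ROC curves of the three models on the test data.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

plt.figure()
for name, mdl, features in [('CART', cart_model, X_test),
                            ('Random Forest', rf_model, X_test),
                            ('Neural Network', best_ann, X_test_scaled)]:
    prob = mdl.predict_proba(features)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, prob)
    plt.plot(fpr, tpr, label=name)

plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')  # diagonal reference line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()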
CONCLUSION:
I am selecting the Random Forest model, as it has better accuracy, precision, recall and f1-score than the other two models, CART and the Artificial Neural Network.
Out of the three models, Random Forest performs slightly better than the CART and Neural Network models.
Random forests consist of multiple single trees, each based on a random sample of the training data. They are typically more accurate than single decision trees. The following figure shows that the decision boundary becomes more accurate and stable as more trees are added.
Trees are unpruned. While a single decision tree like CART is often pruned, a random
forest tree is fully grown and unpruned, and so, naturally, the feature space is split into
more and smaller regions.
Trees are diverse. Each random forest tree is learned on a random sample, and at each node a random set of features is considered for splitting. Both mechanisms create diversity among the trees.
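A minimal sketch of a forest with these two properties; the hyperparameter values are illustrative, not the ones actually tuned in this project.

# Random forest: unpruned trees grown on bootstrap samples, with a random
# subset of features considered at each split.
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=300,      # many trees, each trained on a bootstrap sample
    max_depth=None,        # trees are grown fully, i.e. left unpruned
    max_features='sqrt',   # random subset of features at each split (adds diversity)
    random_state=42,
)
rf_model.fit(X_train, y_train)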
CART also has some limitations: it is vulnerable to overfitting, and it is a greedy algorithm.
The remedy for overfitting is pruning, and the remedy for the greedy behaviour is the cross-validation technique.
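As an illustration of both remedies, a pruned CART can be selected by cross-validation; the parameter ranges below are assumptions, not the grid used in this report.

# Cross-validated search for a pruned decision tree (cost-complexity pruning via ccp_alpha).
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

cart_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={
        'max_depth': [3, 5, 7, 10],
        'min_samples_leaf': [5, 10, 20],
        'ccp_alpha': [0.0, 0.001, 0.01],
    },
    cv=5,
)
cart_grid.fit(X_train, y_train)
cart_model = cart_grid.best_estimator_   # pruned tree chosen by cross-validation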
From the CART and Random Forest models, the variable change is found to be the most useful feature amongst all other features for predicting whether a person has claimed or not.
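The feature importance behind this statement can be read off the fitted models, for example as follows (rf_model being the illustrative forest from the sketch above):

# Rank the features by their importance in the random forest.
import pandas as pd

importances = pd.Series(rf_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))   # most useful feature first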
Finally, after comparing all the models, I have chosen Random Forest as the best or optimized model for this data.
Q2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations
Solution:
From the whole analysis I have observed the below mentioned business insights and
recommendations:
Business Insights:
1. Sales are highest for the C2B code of tour firm, and the majority of these are claimed.
2. The CWT code of tour firm is in second position in sales, and the majority are claimed.
3. The EPX code of tour firm is in third position in sales, and the majority are claimed.
4. The JZI code of tour firm has the least sales, and here too the majority are claimed.
5. Sales are highest for the C2B code of tour firm, followed by CWT and EPX.
6. The JZI code of tour firm has the least sales.
7. Travel Agency type tour insurance firms are present in a higher count than Airlines type tour insurance firms.
8. Sales are higher for the Airlines type of tour insurance firm compared with the Travel Agency type.
9. Sales are higher for the Airlines type of tour insurance firm, where the majority are claimed, followed by the Travel Agency type, where the majority are also claimed.
10. The online distribution channel of tour insurance agencies has a higher count than the offline distribution channel.
11. The amount of sales of tour insurance policies is higher in the online mode of the distribution channel than in the offline mode.
12. Sales are higher in the online distribution channel, where the majority are claimed, followed by the offline mode, where the majority are also claimed.
13. The Customised Plan is in the top position among the tour insurance products, followed by the Cancellation Plan, Bronze Plan and Silver Plan.
14. The Gold Plan product has the lowest count among the products.
15. The amount of sales of tour insurance policies is highest for the Gold Plan product, followed by the Silver Plan.
17. The higher range of sales occurs in the Gold Plan, followed by the Silver Plan type of products.
18. The lower range of sales occurs in the Bronze Plan, followed by the Customised Plan, and the lowest range is in the Cancellation Plan.
19. In the Gold Plan the majority have not claimed, while in the Silver Plan the majority of customers have claimed their insurance.
20. In the Customised Plan and Cancellation Plan the majority have not claimed, while in the Bronze Plan the majority have claimed their insurance.
21. We have not found any outliers in the Gold Plan and Silver Plan; in the remaining product plans many outliers can be seen.
22. The amount of sales of tour insurance policies is highest for the ASIA destination.
23. The Americas destination is in second position in the amount of sales of tour insurance policies.
24. For the Europe destination the amount of sales of tour insurance policies is lower than for the other two destinations.
25. The claim status in the Americas destination is roughly equal, i.e. about as many customers have claimed as have not claimed.
26. In the ASIA destination sales are highest and many customers have not claimed, but there is only a minor difference between the customers who have claimed and those who have not.
27. In the Europe destination many customers have not claimed, and the amount of sales of tour insurance policies is lower.
28. There is a strong relationship between the Sales and Commission variables: as the amount of sales of tour insurance policies increases, the commission received by the tour insurance firm also increases.
29. We cannot find any strong correlation except between the Sales and Commission variables.
30. There is no negative correlation between any of the variables.
Recommendations:
This is understood by looking at the insurance data and drawing relations between different variables such as the day of the incident, the time and the age group, and associating them with other external information such as location, behaviour patterns, weather information, airline/vehicle types, etc.
• Another interesting fact is that almost all the offline business has a claim associated with it; we need to find out why.
• We need to train the JZI agency resources to pick up sales, as they are at the bottom; we should run a promotional marketing campaign or evaluate whether we need to tie up with an alternate agency.
• Also, based on the model we are getting about 80% accuracy, so when a customer books airline tickets or plans, we can cross-sell the insurance based on the claim data pattern.
• Another interesting fact is that more sales happen via Agency than via Airlines, yet the trend shows that more claims are processed on the Airline side. We may need to dive deeper into the process to understand the workflow and the reason why.
• Combat fraud.
• Reduce claim handling costs.
• Insights gained from data and AI-powered analytics could expand the boundaries of insurability.
• Extend existing products and give rise to new risk transfer solutions in areas like non-damage business interruption and reputational damage.
The END!