Data Mining Project Report
Contents:
Problem 1: Clustering
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
1.2 Do you think scaling is necessary for clustering in this case? Justify
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters
Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
2.2 Data Split: Split the data into test and train, build classification models CART, Random Forest, Artificial Neural Network
2.3 Performance Metrics: Comment and check the performance of predictions on train and test sets using accuracy, confusion matrix, ROC curve, ROC_AUC score, and classification reports for each model
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized
2.5 Inference: Based on the whole analysis, what are the business insights and recommendations
List of Tables:
Table1 Dataset Sample
Table2 Description of the data
Table3 Correlation between variables
Table4 Sample of the data before scaling
Table5 Sample of the data after scaling
Table6 Summary of the scaled data
Table7 Sample of the data after scaling
Table8 Sample of the data
Table9 Cluster Profile
Table10 Sample of the data
Table11 Cluster Profile
Table12 Sample of the data
Table13 Sample of the data
Table14 Cluster Profile
Table15 3 Group Cluster
Table16 3 Group Cluster
Table17 3 Group Cluster
Table18 Sample of the data
Table19 Description of the data
Table20 Description of the data
Table21 Sample of the duplicate data
Table22 Sample of the data
Table23 Sample of train data
Table24 Sample of train data
Table25 Sample of train data
Table26 Performance Metrics
List of Figures:
Fig1 Histogram and box plot
Fig2 Histogram and box plot
Fig3 Histogram and box plot
Fig4 Scatter Plot
Fig5 Scatter Plot
Fig6 Scatter Plot
Fig7 Scatter Plot
Fig8 Scatter Plot
Fig9 Pair Plot
Fig10 Correlation heat map
Fig11 Box Plots
Fig12 Box Plots
Fig13 Histograms
Fig14 Dendrogram
Fig15 Dendrogram
Fig16 Dendrogram
Fig17 Dendrogram
Fig18 WSS Plot
Fig19 WSS Plot
Fig20 Box Plot
Fig21 Histogram
Fig22 Box Plot
Fig23 Histogram
Fig24 Box Plot
Fig25 Histogram
Fig26 Box Plot
Fig27 Histogram
Fig28 Bar Plot
Fig29 Box Plot
Fig30 Swarm Plot
Fig31 Bar Plot
Fig32 Swarm Plot
Fig33 Box Plot
Fig34 Bar Plot
Fig35 Swarm Plot
Fig36 Box Plot
Fig37 Bar Plot
Fig38 Swarm Plot
Fig39 Box Plot
Fig40 Bar Plot
Fig41 Swarm Plot
Fig42 Box Plot
Fig43 Pair Plot
Fig44 Correlation heat map
Fig45 Decision Tree
Fig46 ROC curve for train data
Fig47 ROC curve for test data
Fig48 ROC curve for train data
Fig49 ROC curve for test data
Fig50 ROC curve for train data
Fig51 ROC curve for test data
Fig52 ROC curve for 3 models on train data
Fig53 ROC curve for 3 models on test data
Problem 1: Clustering - Statement:
A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past
few months. You are given the task to identify the segments based on credit card usage.
Q1.1 Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis)
Solution:
Data Description:
The data at hand contains different features of customers such as the nature of spending, advance_payments, probability of full payment, the customer's current balance, credit limit, minimum payment amount, and maximum amount spent in a single shopping trip, for one of the leading banks.
Attribute Information:
From the above table we can say that there are seven variables present in the data. All the given variables are of float data type.
Below are the variables and data types for the given dataset:
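As a minimal sketch of how this listing and the summary statistics were obtained (the file name below is an assumption, not stated in the report):

import pandas as pd

# Assumed file name; replace with the actual dataset path
df = pd.read_csv("bank_marketing_part1_Data.csv")

print(df.shape)            # expected (210, 7)
print(df.dtypes)           # all seven variables are float64
print(df.describe().T)     # count, mean, std, min, quartiles, max per variable
print(df.isnull().sum())   # check for missing values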
Observation:
There are a total of 210 observations in the data, indexed 0 to 209.
The spending variable has 210 records. The minimum amount spent by a customer per month (in 1000s) is 10.59, i.e. 10,590/-, and the maximum is 21.18, i.e. 21,180/-. The 25th percentile is 12.27 (12,270/-), the median is 14.355 (14,355/-), and the 75th percentile is 17.305 (17,305/-). The average spend per month is 14.847524 (14,847.524/-) with a standard deviation of 2.909699 (2,909.699/-).
The advance_payments variable has 210 records. The minimum amount paid by a customer in advance by cash (in 100s) is 12.41, i.e. 1,241/-, and the maximum is 17.25, i.e. 1,725/-. The 25th percentile is 13.45 (1,345/-), the median is 14.32 (1,432/-), and the 75th percentile is 15.715 (1,571.5/-). The average is 14.559286 (1,455.9286/-) with a standard deviation of 1.305959 (130.5959/-).
The probability_of_full_payment variable has 210 records. The minimum probability of payment made in full to the bank is 0.8081 and the maximum is 0.9183. The 25th percentile is 0.85690, the median is 0.87345, and the 75th percentile is 0.887775. The average is 0.870999 with a standard deviation of 0.023629.
The current_balance variable has 210 records. The minimum balance left in the account to make purchases (in 1000s) is 4.899, i.e. 4,899/-, and the maximum is 6.675, i.e. 6,675/-. The 25th percentile is 5.26225 (5,262.25/-), the median is 5.5235 (5,523.5/-), and the 75th percentile is 5.97975 (5,979.75/-). The average is 5.628533 (5,628.533/-) with a standard deviation of 0.443063 (443.063/-).
The credit_limit variable has 210 records. The minimum credit card limit (in 10000s) is 2.63, i.e. 26,300/-, and the maximum is 4.033, i.e. 40,330/-. The 25th percentile is 2.944 (29,440/-), the median is 3.237 (32,370/-), and the 75th percentile is 3.56175 (35,617.5/-). The average is 3.258605 (32,586.05/-) with a standard deviation of 0.377714 (3,777.14/-).
The min_payment_amt variable has 210 records. The minimum amount paid by a customer while making monthly payments for purchases (in 100s) is 0.7651, i.e. 76.51/-, and the maximum is 8.456, i.e. 845.6/-. The 25th percentile is 2.5615 (256.15/-), the median is 3.599 (359.9/-), and the 75th percentile is 4.76875 (476.875/-). The average is 3.700201 (370.02/-) with a standard deviation of 1.503557 (150.35/-).
The max_spent_in_single_shopping variable has 210 records. The minimum of the maximum amount spent in one purchase (in 1000s) is 4.519, i.e. 4,519/-, and the maximum is 6.55, i.e. 6,550/-. The 25th percentile is 5.045 (5,045/-), the median is 5.233 (5,233/-), and the 75th percentile is 5.877 (5,877/-). The average is 5.408071 (5,408.07/-) with a standard deviation of 0.49148 (491.48/-).
Kurtosis:
Insights:
From the skewness values we can see that most variables are close to 0 (though not exactly 0), which indicates that they have an approximately normal/symmetrical or slightly right-skewed distribution, while the probability_of_full_payment variable has negative skewness, indicating a left-skewed distribution.
Kurtosis: Kurtosis helps us understand the sharpness of the peak of a distribution. Excess kurtosis is 0 for a normal distribution; positive values indicate a sharper peak with heavier tails, and negative values a flatter peak.
Data Visualization:
We can see each variable's distribution in the data through visualization as well.
Univariate analysis refers to the analysis of a single variable. The main purpose of univariate analysis is to summarize and find patterns in the data. The key point is that only one variable is involved in the analysis.
Bi-variate analysis refers to the analysis of two variables and finding the relationship between them.
Multivariate analysis refers to the analysis of more than two variables in order to find patterns, relationships, and distributions among the variables.
In the plots below we can see each variable's distribution, along with a box plot for each variable to verify whether outliers are present.
Observation:
The spending variable has 210 records. The minimum amount spent by a customer per month (in 1000s) is 10.59, i.e. 10,590/-, with a maximum of 21,180/-. The median is 14,355/- and the average is 14,847.524/-, with a standard deviation of 2.909699, i.e. 2,909.699/-.
The advance_payments variable has 210 records. The minimum amount paid by a customer in advance by cash (in 100s) is 12.41, i.e. 1,241/-, with a maximum of 1,725/-. The median is 1,432/- and the average is 1,455.9286/-, with a standard deviation of 130.5959/-.
Distribution of spending and advance_payments variables
From the above plot we can say that the spending variable has a symmetrical distribution and there are no outliers.
The advance_payments variable has a slightly right-skewed distribution and there are no outliers.
Observation:
The minimum probability of full payment is 0.8081 and the maximum is 0.9183.
The average probability of full payment is 0.870999 with a standard deviation of 0.023629.
Insights:
From the above plot we can say that the probability_of_full_payment variable has a left-skewed distribution and there are outliers on the lower whisker side.
The current_balance variable has a right-skewed distribution and there are no outliers in this variable.
probability_of_full_payment variable:
current_balance variable:
Observation:
The credit_limit variable has 210 records. The minimum credit card limit (in 10000s) is 2.63, i.e. 26,300/-, with a maximum of 40,330/-; the median is 32,370/- and the average is 32,586.05/-.
The min_payment_amt variable has 210 records. The minimum amount paid by a customer while making monthly payments for purchases (in 100s) is 0.7651, i.e. 76.51/-, with a maximum of 845.6/-; the median is 359.9/- and the average is 370.02/-, with a standard deviation of 150.35/-.
The max_spent_in_single_shopping variable has 210 records. The minimum of the maximum amount spent in one purchase (in 1000s) is 4.519, i.e. 4,519/-, with a maximum of 6,550/-; the median is 5,233/- and the average is 5,408.07/-, with a standard deviation of 491.48/-.
The min_payment_amt variable has a roughly normal distribution and there are two outliers on the upper whisker side.
The max_spent_in_single_shopping variable has a right-skewed distribution and there are no outliers.
credit_limit variable:
min_payment_amt variable:
In the below plots we can see the Numeric vs Numeric variable distribution:
Fig4.Scatter plot
Insights:
From the above plot we can say that the amount spent by the customer per month (in 1000s) is
strongly correlated with the advance_payments variable.
Fig5.Scatter plot
Insights:
From the above plot we can say that the amount spent by the customer per month (in 1000s) is
strongly correlated with the current_balance variable as well
Fig6.Scatter plot
Insights:
From the above plot we can say that the amount spent by the customer per month (in 1000s) is
strongly correlated with the credit_limit variable as well.
Fig7.Scatter plot
Insights:
From the above plot we can say that the amount spent by the customer per month (in 1000s) is
strongly correlated with the max_spent_in_single_shopping variable as well
Fig8.Scatter plot
Insights:
From the above plot we can say that the amount paid by the customer in advance by cash (in
100s) is strongly correlated with the current_balance variable.
Pair Plot:
Fig9. Pair plot
Insights:
From the above pair plot and the correlation heat map we can say that the strongest correlation is between the spending and advance_payments variables, at 0.99.
We can see the correlation between all the variables in the table below.
Table3. Correlation between variables
Insights:
From the above table we can again see that the strongest correlation is between the spending and advance_payments variables, at 0.99.
Outliers treatment:
As clustering techniques are very sensitive to outliers, we need to treat them before proceeding with any clustering problem.
Strategy to remove outliers:
Looking at the box plots, it seems that two variables, probability_of_full_payment and min_payment_amt, have outliers.
These outlier values need to be treated, and there are several ways of treating them.
One strategy is to cap the outlier values using the IQR instead of dropping the rows, as dropping would lose information from the other columns; moreover, the outliers are present in only two variables and within about 5 records.
From the box plots above we can say that we have identified outliers in two variables.
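A minimal sketch of the IQR-based capping described above (assuming the frame holding the data is named df, as in the earlier sketch):

def cap_outliers_iqr(frame, col):
    # Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at the whisker limits
    q1, q3 = frame[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    frame[col] = frame[col].clip(lower=lower, upper=upper)

for col in ["probability_of_full_payment", "min_payment_amt"]:
    cap_outliers_iqr(df, col)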
Please find below the box plots after treating the outliers.
Observation:
Though we have treated the outliers, we still see one point in the box plot; this is acceptable, as it is not extreme and lies on the lower band.
Let's check the description of the data after outlier treatment:
There is not much change in the description of the data for the min_payment_amt variable: the maximum value before outlier treatment was 8.45, and after treatment it is 8.079.
Q1.2 Do you think scaling is necessary for clustering in this case? Justify
Solution:
Distance calculations are done to find similarity and dissimilarity in clustering problems.
If the variance is large for one column and very small for another, then we go for normalization.
If the variance between the columns is more or less the same but the magnitudes are different, then we go for the z-score method.
Yes, it is necessary to perform scaling for clustering. For instance, in the given dataset the spending and advance_payments variables have values in two digits, the other variables have values in a single digit, and the probability_of_full_payment variable has values less than one. Since the data in these variables is on different scales, it is tough to compare the variables directly; by performing scaling, we can easily compare them.
The magnitude of each variable is different. So, in order to bring the variables onto a single measurement scale, we need to scale the data so that all maximum and minimum values are comparable and further processing gives unbiased output. We will perform z-score scaling, in which the values mostly lie between -3 and +3. The scaled data is shown below:
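A minimal sketch of the z-score scaling (assuming the outlier-treated frame is named df, as above):

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled_df.describe().round(3))   # means ~0, standard deviations ~1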
Observation: Before scaling we can see that the scales of the numerical features are different.
Scaling brings all the numerical features onto the same scale, with the mean of each feature tending to zero and the standard deviation tending to one.
After scaling, all axes have the same variance, so the data is centralized.
Because all variables then have the same standard deviation, they all carry the same weight. Scaled data will have a mean tending to 0 and a standard deviation tending to 1.
From the above table we can see that the scaled values are centred around 0 with unit standard deviation.
Fig13. Histograms
Insights:
From the above plot we can see the spending variable's distribution before and after scaling: it has a slightly right-skewed distribution, and if we observe the x-axis there is a clear difference between the unscaled and scaled versions.
Z-score scaling transforms the data in such a way that the mean of each feature tends to 0 and the standard deviation tends to 1.
The Min-Max method, in contrast, ensures that the data is scaled to values in the range 0 to 1.
We need scaling for this data because there are 7 numerical features present and their scales are different.
The scale of the spending and advance_payments variables is different from that of the remaining variables. So if we scale by the z-score (standardization) method we can remove this difference; the data will be centred, with the mean tending to zero and the standard deviation tending to one.
Feature scaling (also known as data normalization) is the method used to standardize the range of features of the data. Since the ranges of values may vary widely, it becomes a necessary preprocessing step when using machine learning algorithms.
In this method, we convert variables with different scales of measurement onto a single scale. In distance-based algorithms it is recommended to transform the features so that all features are on the same scale.
Z-score: Z = (X - μ) / σ. Scaled data will have a mean tending to 0 and a standard deviation tending to 1. Used in weight-based techniques (PCA, neural networks, etc.).
Min-Max: (X - Xmin) / (Xmax - Xmin). Scaled data will range between 0 and 1. Used in distance-based techniques (clustering, KNN, etc.).
Q1.3 Apply hierarchical clustering to scaled data. Identify the number of
optimum clusters using Dendrogram and briefly describe them.
Solution:
1. Hierarchical clustering.
Hierarchical clustering also produces a useful graphical display of the clustering process and results, called a dendrogram. First, we will perform plain hierarchical clustering, i.e. without specifying the optimum number of clusters, as shown in the image below.
Dendrogram: the least distance observed for each merge is plotted in a visual form, which we call a dendrogram.
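A minimal sketch of building the linkage matrix and plotting the dendrogram on the scaled data (scaled_df as above); Ward linkage is shown as one common choice, since the exact linkage method is not reproduced here:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

link_matrix = linkage(scaled_df, method="ward")

plt.figure(figsize=(12, 5))
dendrogram(link_matrix)                               # full dendrogram
plt.title("Dendrogram")
plt.show()

plt.figure(figsize=(12, 5))
dendrogram(link_matrix, truncate_mode="lastp", p=10)  # truncated, neater view
plt.show()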
Dendrogram
Fig14. Dendrogram
Observation:
We can see a tree-like structure with different colour codes. We cannot see the number of records on the x-axis because the dendrogram contains all the records, so we will use the cut-tree (truncate) option to visualize it more clearly.
Then, we will create clusters using the cut-tree command and create a new dataset by combining the old data with the cluster labels.
The dendrogram has created partitions for us and given each a different colour code. The image above is a little confusing, so by passing additional parameters which truncate the dendrogram we get the neater output shown below.
Dendrogram
Fig15. Dendrogram
Insights:
We need to identify a cutoff point on the y-axis. When we choose this cutoff point we arrive at 3 clusters; we say 3 clusters because that is the number of vertical lines that intercept the cutoff.
This cutoff is the most suitable for this data for the following reasons:
The vertical lines passing through this cutoff have the greatest length, owing to the agglomerative property and not because the merged groups are similar to each other; hence this is the optimal place to draw the cutoff.
Every horizontal line we see is nothing but two clusters merging together and becoming one.
The vertical lines represent the height at which the merging happens.
It is clear that we can obtain three clusters from this data.
Then, we will create clusters using the cut-tree command and create a new dataset by combining the old data with the cluster labels. Any desired number of clusters can be obtained by cutting the dendrogram at the proper level; here we will go for three clusters.
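A minimal sketch of cutting the tree into 3 clusters and attaching the labels to the original data (fcluster is scipy's equivalent of a cut-tree command; df and link_matrix as in the earlier sketches):

from scipy.cluster.hierarchy import fcluster

h_clusters = fcluster(link_matrix, t=3, criterion="maxclust")
df["h_cluster"] = h_clusters                 # append cluster labels to the original data
print(df["h_cluster"].value_counts())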
If the difference in the height of the vertical lines of the dendrogram is small then the clusters
that are formed will be similar.
Created 3 clusters
Observation:
From the above output we can say that we have created 3 clusters for the entire dataset.
From the above table we can see that each record is mapped to its respective cluster based on the distance calculations.
Total counts for each cluster:
Observation:
Cluster one has 75 records, cluster two has 70 records, and cluster three has 65 records.
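A minimal sketch of how the cluster counts and the cluster profile table (cluster-wise means) can be produced (the column name h_cluster is from the sketch above):

cluster_profile = df.groupby("h_cluster").mean()
cluster_profile["freq"] = df["h_cluster"].value_counts()
print(cluster_profile.round(2))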
Cluster profiles:
In the first cluster the frequency is 75. The average spending is 18.12, average
advance_payments is 16.05, average probability_of_full_payment is 0.88, average
current balance is 6.13, average credit_limit is 3.64, average minimum payment amount
is 3.65 and average max_spent_in_single_shopping is 5.98
In the second cluster the frequency is 70. The average spending is 11.91, average
advance_payments is 13.29, average probability_of_full_payment is 0.84, average
current balance is 5.25, average credit_limit is 2.84, average minimum payment amount
is 4.61 and average max_spent_in_single_shopping is 5.11
In the third cluster the frequency is 65. The average spending is 14.21, average
advance_payments is 14.19, average probability_of_full_payment is 0.88, average
current balance is 5.44, average credit_limit is 3.25, average minimum payment amount
is 2.76 and average max_spent_in_single_shopping is 5.055
Fig16. Dendrogram
Observation:
From the above picture we can see a tree-like structure with different colour codes. We cannot see the number of records on the x-axis because the dendrogram contains all the records, so we will use the cut-tree command to visualize it more clearly.
Then, we will create clusters using the cut-tree command and create a new dataset by combining the old data with the cluster labels.
Dendrogram
Fig17. Dendrogram
Observation:
From the above picture we can see the records on the x-axis mapped to their respective clusters based on their distance (height) on the y-axis; we have used the cut-tree command so that this is clearly visible.
Created 3 clusters
Observation:
From the above output we can say that we have created 3 clusters for the entire dataset.
From the above table we can see that each record is mapped to its respective cluster based on the distance calculations.
Observation:
Cluster 1 has a total of 70 records, cluster 2 has 67 records, and cluster 3 has 73 records.
Cluster profiles:
Both methods give almost similar means, with minor variation, which we know can occur.
For cluster grouping based on the dendrogram, 3 or 4 clusters look good. After further analysis, and based on the dataset, we went for a 3-group cluster solution from the hierarchical clustering.
The three-group cluster solution gives a pattern based on high/medium/low spending, together with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
In the first cluster the frequency is 70. The average spending is 18.37, average
advance_payments is 16.14, average probability_of_full_payment is 0.88, average
current balance is 6.15, average credit_limit is 3.68, average minimum payment amount
is 3.63 and average max_spent_in_single_shopping is 6.01
In the second cluster the frequency is 67. The average spending is 11.87, average
advance_payments is 13.25, average probability_of_full_payment is 0.84, average
current balance is 5.23, average credit_limit is 2.84, average minimum payment amount
is 4.94 and average max_spent_in_single_shopping is 5.12
In the third cluster the frequency is 73. The average spending is 14.19, average
advance_payments is 14.23, average probability_of_full_payment is 0.879, average
current balance is 5.47, average credit_limit is 3.22, average minimum payment amount
is 2.61 and average max_spent_in_single_shopping is 5.086
A total of 70 observations are grouped into this cluster; the amount spent by the customer per month is high compared with the other two clusters.
High credit limit, high spending. The number of consumers falling under this cluster is medium.
The amount paid by the customer in advance by cash is also higher than in the other two clusters.
The balance amount left in the account to make purchases is also high.
The average current balance is good, which leads to a good maximum amount spent in single shopping.
A total of 73 observations are grouped into this cluster; the amount spent by the customer per month is not high, and customers spend in the medium range.
The credit limit is a little below the maximum, but the spending is less. The number of consumers falling under this cluster is the highest. The average minimum amount paid is also less.
The minimum amount paid by the customer while making monthly payments for purchases is lower in this cluster.
The maximum amount spent in one purchase is also lower than in the other two clusters.
A total of 67 observations are grouped into this cluster; the amount spent by the customer per month is very low compared to the remaining two cluster groups.
Low credit limit, low spending. The number of consumers falling under this cluster is the lowest.
The minimum amount paid by the customer while making monthly payments for purchases is higher compared with the other two clusters.
The maximum amount spent in one purchase is higher than in the medium-range cluster group but lower than in the high-spending cluster group.
Spending, advance_payments, probability of full payment and current balance are all in the lower range compared with the other two clusters.
Q1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
Solution:
We will perform K-means clustering on our scaled data using the K-means function and get the following output:
Observation:
From the above output we can say that we have created 3 clusters for the entire dataset.
The within-cluster sum of squares for the 3-cluster solution is 430.445.
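A minimal sketch of fitting K-means with 3 clusters on the scaled data and reading the within-cluster sum of squares (inertia); the random_state value is an assumption:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=1)
kmeans.fit(scaled_df)

print(kmeans.labels_[:10])   # cluster label for each record
print(kmeans.inertia_)       # within-cluster sum of squares for k = 3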
Observation:
WSS reduces as K keeps increasing.
Observation:
WSS is calculated for K values ranging from 1 to 10.
As the value of K increases, the WSS (within sum of squares) reduces.
From K = 1 to K = 2 there is a significant drop, and from K = 2 to K = 3 there is another noticeable drop, but from K = 3 to K = 4 there is no significant drop, only a mild difference. So we can consider K = 3 as the suitable number of clusters for the data.
We can visualize the same via a point plot as well, which is also called the elbow method.
WSS plot: the WSS plot, also called the distortion plot or error plot, helps us decide how many clusters are needed as output in K-means clustering.
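A minimal sketch of computing the WSS for k = 1 to 10 and drawing the elbow (WSS) plot:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(scaled_df)
    wss.append(km.inertia_)

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares")
plt.title("WSS / elbow plot")
plt.show()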
Fig18. WSS plot
Observation:
In the WSS plot we can see the significant drops; once we identify where the drop stops being significant, we obtain the optimal number of clusters for the K-means algorithm. Here we have chosen 3 as the optimal number of clusters from the WSS plot, as there is no significant drop after 3.
If the drop is not significant, then additional clusters are not useful.
This is the same WSS plot, but without the points marked for each cluster; here also we can see the significant drops from 1 to 2 and from 2 to 3.
After 3 there is no significant drop, so we have chosen 3 as the optimal number of clusters for the data.
Compute the clustering algorithm (k-means) for different values of k, for instance by varying k from 1 to 10 clusters.
The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
The silhouette score is an indirect model evaluation technique which we can apply once the clustering procedure, namely the distance-based K-means model, is completed.
For all the observations we calculate the sil-width and average them; the resulting output is called the silhouette_score.
If the silhouette_score is close to a negative value, then the model has made a blunder in clustering the data.
Since our score is nearer to 0.5, we can say this is a well-distinguished set of clusters: the 3 clusters that were created have, on average, a silhouette score of about 0.4, which is on the positive side.
Validating the clusters: the resulting clusters should be valid in order to generate insights.
Sil-width:
The silhouette_score function computes the average of all the silhouette widths.
silhouette_samples computes the silhouette width for each and every row.
We can check the minimum value returned by silhouette_samples, i.e. the smallest silhouette width; if it is positive, it indicates that no observation is incorrectly mapped to a cluster, since all silhouette widths are on the positive side.
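A minimal sketch of computing the average silhouette score and the per-row silhouette widths for the 3-cluster K-means solution (kmeans and scaled_df as above):

from sklearn.metrics import silhouette_score, silhouette_samples

labels = kmeans.labels_
print(silhouette_score(scaled_df, labels))         # average silhouette width (about 0.4 here)

sil_width = silhouette_samples(scaled_df, labels)  # silhouette width for every row
df["kmeans_cluster"] = labels
df["sil_width"] = sil_width
print(df["sil_width"].min())                       # positive => no clearly misassigned record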
From the above table we can see that each record has been assigned to its respective cluster based on distance, and the cluster column has been added to our original dataset, so that we can easily identify which record is mapped to which cluster and analyze the cluster properties.
Appending sil_width to the original dataset
From the above table we can see that the silhouette width has been calculated for each and every row in the data.
Observation:
Cluster 1 has a total of 67 records, cluster 2 has 72 records, and cluster 3 has 71 records.
Cluster Profiling:
Observation:
In the first cluster we have the frequency of 67, average spending in this cluster is 18.5,
average advance_payments is 16.2, average probability of full payment is 0.88, average
current balance is 6.17, average credit_limit is 3.69, min_payment_amt is 3.62 and the
average max_spent_in_single_shopping is 6.04
In the second cluster we have the frequency of 72, average spending in this cluster is
11.85, average advance_payments is 13.24, average probability of full payment is 0.84,
average current balance is 5.23, average credit_limit is 2.84, min_payment_amt is 4.74
and the average max_spent_in_single_shopping is 5.10
In the third cluster we have the frequency of 71, average spending in this cluster is
14.43, average advance_payments is 14.33, average probability of full payment is 0.88,
average current balance is 5.51, average credit_limit is 3.25, min_payment_amt is 2.70
and the average max_spent_in_single_shopping is 5.12
In the first cluster the frequency is 70. The average spending is 18.37, average
advance_payments is 16.14, average probability_of_full_payment is 0.88, average
current balance is 6.15, average credit_limit is 3.68, average minimum payment amount
is 3.63 and average max_spent_in_single_shopping is 6.01
In the second cluster the frequency is 67. The average spending is 11.87, average
advance_payments is 13.25, average probability_of_full_payment is 0.84, average
current balance is 5.23, average credit_limit is 2.84, average minimum payment amount
is 4.94 and average max_spent_in_single_shopping is 5.12
In the third cluster the frequency is 73. The average spending is 14.19, average
advance_payments is 14.23, average probability_of_full_payment is 0.879, average
current balance is 5.47, average credit_limit is 3.22, average minimum payment amount
is 2.61 and average max_spent_in_single_shopping is 5.086
Or
In the first cluster we have the frequency of 67, average spending in this cluster is 18.5,
average advance_payments is 16.2, average probability of full payment is 0.88, average
current balance is 6.17, average credit_limit is 3.69, min_payment_amt is 3.62 and the
average max_spent_in_single_shopping is 6.04
In the second cluster we have the frequency of 72, average spending in this cluster is
11.85, average advance_payments is 13.24, average probability of full payment is 0.84,
average current balance is 5.23, average credit_limit is 2.84, min_payment_amt is 4.74
and the average max_spent_in_single_shopping is 5.10
In the third cluster we have the frequency of 71, average spending in this cluster is
14.43, average advance_payments is 14.33, average probability of full payment is 0.88,
average current balance is 5.51, average credit_limit is 3.25, min_payment_amt is 2.70
and the average max_spent_in_single_shopping is 5.12
Q1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.
Solution:
The objective of any clustering algorithm is to ensure that the distance between data points within a cluster is very low compared to the distance between two clusters, i.e. members of a group are very similar, and members of different groups are extremely dissimilar.
A total of 70 observations are grouped into this cluster; the amount spent by the customer per month is high compared with the other two clusters.
High credit limit, high spending. The number of consumers falling under this cluster is medium.
The amount paid by the customer in advance by cash is also higher than in the other two clusters.
The balance amount left in the account to make purchases is also high.
The average current balance is good, which leads to a good maximum amount spent in single shopping.
A total of 73 observations are grouped into this cluster; the amount spent by the customer per month is not high, and customers spend in the medium range.
The credit limit is a little below the maximum, but the spending is less. The number of consumers falling under this cluster is the highest. The average minimum amount paid is also less.
The minimum amount paid by the customer while making monthly payments for purchases is lower in this cluster.
The maximum amount spent in one purchase is also lower than in the other two clusters.
A total of 67 observations are grouped into this cluster; the amount spent by the customer per month is very low compared to the remaining two cluster groups.
Low credit limit, low spending. The number of consumers falling under this cluster is the lowest.
The minimum amount paid by the customer while making monthly payments for purchases is higher compared with the other two clusters.
The maximum amount spent in one purchase is higher than in the medium-range cluster group but lower than in the high-spending cluster group.
Spending, advance_payments, probability of full payment and current balance are all in the lower range compared with the other two clusters.
Give loans against the credit card, as these are customers with a good repayment record.
Tie up with luxury brands, which will drive more one-time maximum spending.
They are potential target customers who are paying their bills, making purchases, and maintaining a comparatively good credit score. So we can increase their credit limit or lower the interest rate.
Increase spending habits by tying up with premium e-commerce sites, travel portals, and airlines/hotels, as this will encourage them to spend more.
Customers should be given reminders for payments. Offers can be provided on early payments to improve their payment rate.
Increase their spending habits by tying up with grocery stores and utilities (electricity, phone, gas, others).
Problem 2: CART-RF-ANN:
An Insurance firm providing tour insurance is facing higher claim frequency. The management
decides to collect data from the past few years. You are assigned the task to make a model
which predicts the claim status and provide recommendations to management. Use CART, RF
& ANN and compare the models' performances in train and test sets.
Q2.1 Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis).
Solution:
Data Description:
The data at hand contains different features of customers such as the type of tour insurance firm, channel, product, duration, amount of sales of tour insurance policies, the commission received by the tour insurance firm, age of the insured, agency code, and claim status, for one of the insurance firms.
Attribute Information:
The data has been loaded properly and we can see in the table that it has 10 variables.
Data dimensions:
Age, Commision, Duration and Sales are numeric variables; the remaining variables are of object data type.
Observation:
The above table shows the description of all the variables present in the dataset.
Observation:
We are not removing the duplicate rows because there is no unique identifier; they can belong to different customers. Though 139 records appear duplicated, they may be from different customers, and since there is no customer ID or any unique identifier, I am not dropping them.
Data Visualization:
We can see each variable's distribution in the data through visualization as well.
Univariate analysis refers to the analysis of a single variable. The main purpose of univariate analysis is to summarize and find patterns in the data. The key point is that only one variable is involved in the analysis.
In the plots below we can see each variable's distribution, along with a box plot for each variable to verify whether outliers are present.
Univariate Analysis:
From the above plot we can say that the Age variable has a slightly right-skewed distribution.
There are many outliers in the variable on both the upper and lower sides, with more on the upper whisker side than on the lower whisker side.
Distribution of age variable
Fig21. Histogram
Insights:
From the above plot we can see the age variable's distribution; it has a slightly right-skewed distribution.
Commision variable:
Fig23. Histogram
Insights:
From the above plot we can say that the Commision variable has a right-skewed distribution.
Insights:
The Duration variable has many outliers on the upper whisker side, and we can observe that one data point is an extreme outlier on the upper whisker side.
Fig25. Histogram
Insights: The Duration variable has a heavily right-skewed distribution.
Sales variable: Range of values: 539
The Sales variable has a right-skewed distribution and there are many outliers on the upper whisker side.
Fig27. Histogram
Insights:
Categorical Variables:
Agency_Code variable:
Sales are higher for the C2B agency code, followed by CWT and EPX.
The JZI agency code has the least sales.
Type variable:
Travel Agency type tour insurance firms are present in a higher count than Airlines type tour insurance firms.
Sales are higher for the Airlines type of tour insurance firm compared with the Travel Agency type.
Channel variable:
The online distribution channel of tour insurance agencies has a higher count than the offline distribution channel.
The amount of sales of tour insurance policies is higher in the online mode of the distribution channel than in the offline mode.
Distribution of sales across the channel and claim status:
Sales are higher in the online distribution channel of tour insurance agencies, where the majority are claimed, followed by the offline mode, where the majority are also claimed.
Except for the offline mode among those who claimed, there are many outliers.
The Customised Plan is in the top position among the tour insurance products, followed by the Cancellation Plan, Bronze Plan and Silver Plan.
The Gold Plan product has the lowest count among the products.
The amount of sales of tour insurance policies is higher for the Gold Plan product, followed by the Silver Plan.
The Bronze Plan product is in the lowest position in terms of sales.
Destination variable:
ASIA is the most common destination of the tour, followed by the Americas destination.
The Europe region is in the lowest position among the destinations of the tour.
The claim status in the Americas destination is balanced, i.e. there is roughly an equal split between those who have claimed and those who have not.
In the ASIA destination the sales are higher and many have not claimed, but there is only a minor difference between the customers who have claimed and those who have not.
In the Europe destination many have not claimed, and the amount of sales of tour insurance policies is lower.
We can see that there are many outliers in all the destinations.
We can see the same in the above table for the claim status with respect to sales.
Checking pairwise distribution of the continuous variables:
Insights:
From the above plot we can see that there is a strong relationship between the Sales and Commision variables in the data. As the amount of sales of tour insurance policies increases, the commission received by the tour insurance firm also increases.
We cannot find any strong correlation except between the Sales and Commision variables.
We can say that there is no negative correlation between any variables.
Checking for correlations: heat map with only continuous variables:
Insights:
From the above correlation heat map we can see a strong correlation between the Sales and Commision variables, at 0.77.
We can see that there is no negative correlation between any two variables.
All the variables have positive correlations, but they are very weak except between Sales and Commision; Duration and Sales have a correlation of 0.56, Duration and Commision have a correlation of 0.47, and the remaining variables are not correlated with each other.
Converting all objects to categorical codes:
We have converted the object data type variables into integer categorical codes, as the models we will build here (CART, Random Forest and ANN in scikit-learn) require numeric inputs only.
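A minimal sketch of this conversion (the file name and the frame name df2 are assumptions):

import pandas as pd

df2 = pd.read_csv("insurance_part2_data.csv")   # assumed file name

# Convert every object-type column to integer category codes
for col in df2.select_dtypes(include="object").columns:
    df2[col] = pd.Categorical(df2[col]).codes

print(df2.dtypes)   # all columns should now be numeric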
We can see from the above information that all the variables have been changed into integer data types.
For building a decision tree (CART) model in Python we have to ensure that there are no object data types; both the dependent and independent variables must be numeric.
From the above table we can see that all the variables are of integer data type only; there are no object data types, as we have converted them into categorical codes.
Earlier the variables Agency_Code, Type, Claimed, Channel, Product Name and Destination were of object data type; once converted into categorical codes, all these variables are of integer data type.
Proportion of 1s and 0s
Insights:
In this data most customers have not claimed. It is class-imbalanced data, but not highly imbalanced.
About 69% of the customers have not claimed from the insurance firm, and only about 30% of the customers have claimed.
Q2.2 Data Split: Split the data into test and train, build classification model
CART, Random Forest, Artificial Neural Network
Solution:
CART (Classification and Regression Tree): it is a binary decision tree; classification has a categorical output variable and regression has a continuous output variable.
We have converted all object data types into categorical codes; below is a sample of the same.
Extracting the target column into separate vectors for the training set and test set:
From the above table we can say that all the columns except the target column are stored in a separate object, and the table above is a sample of it. We can see that the object data types have been changed into categorical codes.
We have all the variables except the target column, i.e. the Claimed variable.
Target column:
Insight:
We have extracted the target column into a separate vector; the output above is a sample of it, and it is the target column for the data.
Splitting data into training and test sets:
Data split: we split the data into train and test datasets in a 70:30 proportion. The splitting output is shown below:
Insights:
X_train and X_test contain the independent variables, divided into train and test in a 70:30 proportion.
Here we train the model on the train dataset and evaluate it on the test dataset, so we have 2100 observations of 9 independent variables in the training set, along with 2100 observations of the dependent variable used for training.
In the test data we have 900 observations of the 9 independent variables and 900 observations of the dependent variable.
70% of the data is used for training and 30% of the data is used as the test set.
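A minimal sketch of the 70:30 split (the target column name Claimed and the random_state value are assumptions; df2 as above):

from sklearn.model_selection import train_test_split

X = df2.drop("Claimed", axis=1)   # the 9 independent variables
y = df2["Claimed"]                # the target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

print(X_train.shape, X_test.shape)   # (2100, 9) and (900, 9)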
From the above tree we can say that the Agency_Code variable has the highest Gini gain, 0.426, so it is the most relevant variable for separating the 0s from the 1s.
The next variables used for splitting the data are the Sales and Product Name variables.
After that, the Age, Commision and Duration variables are used for splitting the data.
The Type, Channel and Destination variables are not used at all for splitting the data.
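A minimal sketch of fitting the CART model and reading the feature importances (the hyperparameters shown are illustrative assumptions, not the exact ones used in the report):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

cart = DecisionTreeClassifier(criterion="gini", max_depth=5,
                              min_samples_leaf=10, random_state=1)  # assumed pruning settings
cart.fit(X_train, y_train)

importances = pd.Series(cart.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))   # Agency_Code expected at the top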
Observation:
The independent variable Agency_Code separates the 0s from the 1s in the best possible manner, which is why it is chosen for the first split into child nodes. It has higher importance than the other variables.
The second variable in order of importance for separating the 0s from the 1s is the Sales variable.
The Type, Channel and Destination variables were not used in splitting the data, so they carry no importance.
We have predicted on the train and test datasets, with 2100 and 900 records respectively.
Getting the predicted classes:
We can see the predicted classes on the test data from the above output.
From the above table we can see the predicted classes and probabilities.
A key strength of ensembling is that every model we build should be independent of the others.
Now we will split the data into train and test sets in order to build the Random Forest model.
Extracting the target column into separate vectors for the training set and test set:
From the above table we can say that all the columns except the target column are stored in a separate object, and the table above is a sample of it.
We have all the variables except the target column, i.e. the Claimed variable.
Target column:
Insight:
We have extracted the target column into a separate vector; the output above is a sample of it, and it is the target column for the data.
Splitting data into training and test sets:
Data split: we split the data into train and test datasets in a 70:30 proportion. The splitting output is shown below:
Insights:
X_train and X_test contain the independent variables, divided into train and test in a 70:30 proportion.
Here we train the model on the train dataset and evaluate it on the test dataset, so we have 2100 observations of 9 independent variables in the training set, along with 2100 observations of the dependent variable used for training.
In the test data we have 900 observations of the 9 independent variables and 900 observations of the dependent variable.
70% of the data is used for training and 30% of the data is used as the test set.
The important thing is that whenever we fix the random state to some natural number, we always get the same output each time we build the model, similar to using random_state in the train-test split. Here I have used a random state of 1.
In order to build a good random forest model I have taken multiple parameters into consideration and tried multiple values for each parameter. The parameters below were used to build the model:
From the above-mentioned parameter combinations I have taken the second one into consideration in order to build my random forest classifier.
Observation:
Using the above-mentioned parameters, I have predicted on the train and test data.
We can see the predicted classes on the test data from the above output.
Grid search cross-validation gives us the best parameters.
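A minimal sketch of tuning the random forest with grid search cross-validation; the grid values below are illustrative assumptions, since the exact grid from the report is not reproduced here:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],        # assumed values
    "max_depth": [5, 7, 10],           # assumed values
    "min_samples_leaf": [10, 25, 50],  # assumed values
    "max_features": [3, 5],            # assumed values
}

rf = RandomForestClassifier(random_state=1)
grid = GridSearchCV(rf, param_grid=param_grid, cv=3, scoring="roc_auc")
grid.fit(X_train, y_train)

best_rf = grid.best_estimator_
print(grid.best_params_)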
Observation:
From the above table we can see the predicted classes and probabilities.
Observation:
The independent variable Agency_Code separates the 0s from the 1s in the best possible manner, which is why it is chosen for the first split into child nodes. It has higher importance than the other variables.
The second variable in order of importance for separating the 0s from the 1s is the Product Name variable.
The third and fourth variables used for the splitting criteria are the Sales and Commision variables.
The Type, Age, Destination and Channel variables are not used much in splitting the data, so these variables have very little importance.
The Commision, Age and Duration variables also have comparatively low importance.
An Artificial Neural Network is an attempt to mimic how the human brain processes information.
We have to scale the data in order to proceed with this model; before that, we assign the target variable to a separate object and the remaining independent variables to another object.
Then we split the data into train and test sets; here I have divided the data in a 70:30 ratio.
Extracting the target column into separate vectors for the training set and test set:
We have all the variables except the target column, i.e. the Claimed variable.
Target column:
Insight:
We have extracted the target column into a separate vector; the output above is a sample of it, and it is the target column for the data.
Data split: we split the data into train and test datasets in a 70:30 proportion. The splitting output is shown below:
Insights:
X_train and X_test contain the independent variables, divided into train and test in a 70:30 proportion.
Here we train the model on the train dataset and evaluate it on the test dataset, so we have 2100 observations of 9 independent variables in the training set, along with 2100 observations of the dependent variable used for training.
In the test data we have 900 observations of the 9 independent variables and 900 observations of the dependent variable.
70% of the data is used for training and 30% of the data is used as the test set.
Observation:
We have fit and transformed the train data so that the mean tends to 0 and the standard deviation to 1; the output above shows the training data after scaling.
Observation:
We have scaled the test data using transform only, to ensure scaling is uniform for both the train and test data: the test data must be scaled with the same mean and standard deviation as the train data. The output above shows the result.
In the hidden layer we have tried 50, 100 and 200 neurons; for the maximum number of iterations we have tried 2500, 3000 and 4000; the solver is adam and the tolerance is 0.01.
From the above parameters, grid search cross-validation has given the best parameters for my model, shown below.
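A minimal sketch of the ANN set-up with the scaling and the grid described above (hidden layer sizes 50/100/200, max_iter 2500/3000/4000, solver adam, tol 0.01); other settings such as cv and random_state are assumptions:

from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # fit on the train data only
X_test_s = scaler.transform(X_test)         # reuse the train mean/std on the test data

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (200,)],
    "max_iter": [2500, 3000, 4000],
    "solver": ["adam"],
    "tol": [0.01],
}

ann = MLPClassifier(random_state=1)
grid = GridSearchCV(ann, param_grid=param_grid, cv=3)
grid.fit(X_train_s, y_train)

best_ann = grid.best_estimator_
print(grid.best_params_)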
We can see the predicted classes on the test data from the above output.
From the above table we can see the predicted classes and probabilities.
Observation:
On the test data we have an area under the curve of 0.801, so the model is neither overfitted nor underfitted; it is a good model.
Q2.3 Performance Metrics: Comment and check the performance of predictions on train and test sets using accuracy, confusion matrix, ROC curve, ROC_AUC score, and classification reports for each model.
Solution:
CART Confusion Matrix and Classification Report for the training data
Confusion Matrix:
Observation:
For the users who have claimed, the precision is 0.70: out of all the cases my model predicted as claimed, 70% are actually positive cases.
For the users who have claimed, the recall is 0.53: out of all the actually positive cases, my model correctly predicted 53% of them.
We have an f1-score of 0.60 for the users who have claimed.
We have an accuracy of 0.79. All these results are for the train data.
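A minimal sketch of how these metrics are obtained for the CART model on the train data (the same pattern is repeated on the test data and for the other models):

import matplotlib.pyplot as plt
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_auc_score, roc_curve)

y_train_pred = cart.predict(X_train)
y_train_prob = cart.predict_proba(X_train)[:, 1]

print(confusion_matrix(y_train, y_train_pred))
print(classification_report(y_train, y_train_pred))   # precision, recall, f1, accuracy
print(roc_auc_score(y_train, y_train_prob))

fpr, tpr, _ = roc_curve(y_train, y_train_prob)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()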
CART Confusion Matrix and Classification Report for the testing data:
Confusion Matrix:
Observation:
For the users who have claimed, the precision is 0.67: out of all the cases my model predicted as claimed, 67% are actually positive cases.
For the users who have claimed, the recall is 0.51: out of all the actually positive cases, my model correctly predicted 51% of them.
We have an f1-score of 0.58 for the users who have claimed.
We have an accuracy of 0.77. All these results are for the test data.
Cart Conclusion:
Train Data:
AUC: 82%
Accuracy: 79%
Precision: 70%
f1-Score: 60%
Test Data:
AUC: 80%
Accuracy: 77%
Precision: 67%
f1-Score: 58%
The training and test set results are quite similar; the model is neither overfitted nor underfitted, and with the overall measures reasonably high, it is a good model.
For the users who have claimed, the precision is 0.72: out of all the cases my model predicted as claimed, 72% are actually positive cases.
For the users who have claimed, the recall is 0.61: out of all the actually positive cases, my model correctly predicted 61% of them.
We have an f1-score of 0.66 for the users who have claimed.
We have an accuracy of 0.80. All these results are for the train data.
Random Forest -AUC and ROC for the train data:
Observation:
On the train data we have got an Area Under the Curve of 0.856.
Random Forest Confusion Matrix and Classification Report for the testing data:
Confusion Matrix:
Observation:
For the users who have claimed, the precision is 0.68: out of the total number of positive predictions the model has made, 68% are actually positive cases.
For the users who have claimed, the recall is 0.56: out of the total number of actually positive cases, the model has correctly predicted 56%.
We have an f1-score of 0.62 for the users who have claimed.
We have an accuracy of 0.78. All these results are for the test data.
Random Forest -AUC and ROC for the test data:
Observation:
On the test data we have got an Area Under the Curve of 0.818.
Neural Network Confusion Matrix and Classification Report for the training data:
Observation:
For the users who have claimed, the precision is 0.68: out of the total number of positive predictions the model has made, 68% are actually positive cases.
For the users who have claimed, the recall is 0.51: out of the total number of actually positive cases, the model has correctly predicted 51%.
We have an f1-score of 0.59 for the users who have claimed.
We have an accuracy of 0.78. All these results are for the train data.
Neural Network -AUC and ROC for the train data:
Observation:
On the train data we have got an Area Under the Curve of 0.816.
Neural Network Confusion Matrix and Classification Report for the testing data:
Confusion Matrix:
Observation:
For the users who have claimed, the precision is 0.67: out of the total number of positive predictions the model has made, 67% are actually positive cases.
For the users who have claimed, the recall is 0.50: out of the total number of actually positive cases, the model has correctly predicted 50%.
We have an f1-score of 0.57 for the users who have claimed.
We have an accuracy of 0.77. All these results are for the test data.
Neural Network -AUC and ROC for the test data:
Observation:
On the test data we have got an Area Under the Curve of 0.804. The larger the area under the curve, the better the model, so an AUC of 0.804 on the test data indicates a reasonably strong fit.
The training and test set results are very similar, and with the overall measures reasonably high the model is neither overfitted nor underfitted; it is a good model.
Q2.4 Final Model: Compare all the models and write an inference which model
is best/optimized.
Solution:
Observation:
The accuracy of the Random Forest is the highest on both the train and test data when compared with CART and the Artificial Neural Network: 0.80 on train and 0.78 on test.
The AUC of the Random Forest is also the highest on both the train and test data compared with CART and the Artificial Neural Network: 0.86 on train and 0.82 on test.
The recall of the Random Forest is also the highest on both the train and test data compared with CART and the Artificial Neural Network: 0.61 on train and 0.56 on test.
The precision of the Random Forest is also the highest on both the train and test data compared with CART and the Artificial Neural Network: 0.72 on train and 0.68 on test.
The F1-score of the Random Forest is also the highest on both the train and test data compared with CART and the Artificial Neural Network: 0.66 on train and 0.62 on test.
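For reference, the comparison can be laid out side by side as in the sketch below; the figures are copied from the values reported in this section (rounded to two decimals), not recomputed.

# Performance comparison assembled from the figures quoted in this report.
import pandas as pd

comparison = pd.DataFrame(
    {
        'CART Train': [0.79, 0.82, 0.53, 0.70, 0.60],
        'CART Test':  [0.77, 0.80, 0.51, 0.67, 0.58],
        'RF Train':   [0.80, 0.86, 0.61, 0.72, 0.66],
        'RF Test':    [0.78, 0.82, 0.56, 0.68, 0.62],
        'ANN Train':  [0.78, 0.82, 0.51, 0.68, 0.59],
        'ANN Test':   [0.77, 0.80, 0.50, 0.67, 0.57],
    },
    index=['Accuracy', 'AUC', 'Recall', 'Precision', 'F1-Score'],
)
print(comparison)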
Observation:
From the graph above, the ROC curve for the Random Forest is slightly steeper than those for CART and the Artificial Neural Network.
The ROC curves for CART (the decision tree) and the Artificial Neural Network overlap and are almost identical.
The steeper the ROC curve, the stronger the model, so compared with CART and the Artificial Neural Network, the Random Forest has the steeper ROC curve and is the stronger model for this dataset.
Observation:
From the graph above, the ROC curves of all three models overlap, but the Random Forest curve is slightly steeper than those of CART and the Artificial Neural Network.
The curves for CART and the Artificial Neural Network overlap completely.
So on the test data as well, the Random Forest model gives the best results and is stronger than the other two models.
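The overlaid ROC curves discussed above can be drawn along these lines; cart_model, rf_model and best_ann are illustrative names for the three fitted models.

# Overlay the ROC curves of the three models on the test data.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

plt.figure()
for name, mdl, features in [('CART', cart_model, X_test),
                            ('Random Forest', rf_model, X_test),
                            ('Neural Network', best_ann, X_test_scaled)]:
    prob = mdl.predict_proba(features)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, prob)
    plt.plot(fpr, tpr, label=name)

plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')  # diagonal reference line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()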
CONCLUSION:
I am selecting the Random Forest model, as it has better accuracy, precision, recall and f1-score than the other two models, CART and the Artificial Neural Network.
Out of the three models, Random Forest performs slightly better than the CART and Neural Network models.
Random forests consist of multiple single trees, each based on a random sample of the training data. They are typically more accurate than single decision trees. The following figure shows that the decision boundary becomes more accurate and stable as more trees are added.
Trees are unpruned. While a single decision tree like CART is often pruned, a random
forest tree is fully grown and unpruned, and so, naturally, the feature space is split into
more and smaller regions.
Trees are diverse. Each random forest tree is learned on a random sample, and at each node a random set of features is considered for splitting. Both mechanisms create diversity among the trees.
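A minimal sketch of a forest with these two properties; the hyperparameter values are illustrative, not the ones actually tuned in this project.

# Random forest: unpruned trees grown on bootstrap samples, with a random
# subset of features considered at each split.
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=300,      # many trees, each trained on a bootstrap sample
    max_depth=None,        # trees are grown fully, i.e. left unpruned
    max_features='sqrt',   # random subset of features at each split (adds diversity)
    random_state=42,
)
rf_model.fit(X_train, y_train)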
CART also has some limitations: it is vulnerable to overfitting, and it is a greedy algorithm.
The remedy for overfitting is pruning, and the remedy for the greedy behaviour is the cross-validation technique.
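As an illustration of both remedies, a pruned CART can be selected by cross-validation; the parameter ranges below are assumptions, not the grid used in this report.

# Cross-validated search for a pruned decision tree (cost-complexity pruning via ccp_alpha).
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

cart_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={
        'max_depth': [3, 5, 7, 10],
        'min_samples_leaf': [5, 10, 20],
        'ccp_alpha': [0.0, 0.001, 0.01],
    },
    cv=5,
)
cart_grid.fit(X_train, y_train)
cart_model = cart_grid.best_estimator_   # pruned tree chosen by cross-validation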
From the CART and Random Forest models, the variable change is found to be the most useful feature amongst all other features for predicting whether a person has claimed or not.
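The feature importance behind this statement can be read off the fitted models, for example as follows (rf_model being the illustrative forest from the sketch above):

# Rank the features by their importance in the random forest.
import pandas as pd

importances = pd.Series(rf_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))   # most useful feature first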
Finally, after comparing all the models, I have chosen Random Forest as the best or optimized model for this data.
Q2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations
Solution:
From the whole analysis I have observed the below mentioned business insights and
recommendations:
Business Insights:
1. Sales are highest for the C2B code of tour firm, and the majority of these are claimed.
2. The CWT code of tour firm is in second position in sales, and the majority are claimed.
3. The EPX code of tour firm is in third position in sales, and the majority are claimed.
4. The JZI code of tour firm has the least sales, and here too the majority are claimed.
5. Sales are highest for the C2B code of tour firm, followed by CWT and EPX.
6. The JZI code of tour firm has the least sales.
7. Travel Agency type tour insurance firms are present in a higher count than Airlines type tour insurance firms.
8. Sales are higher for the Airlines type of tour insurance firm compared with the Travel Agency type.
9. Sales are higher for the Airlines type of tour insurance firm, where the majority are claimed, followed by the Travel Agency type, where the majority are also claimed.
10. The online distribution channel of tour insurance agencies has a higher count than the offline distribution channel.
11. The amount of sales of tour insurance policies is higher in the online mode of the distribution channel than in the offline mode.
12. Sales are higher in the online distribution channel, where the majority are claimed, followed by the offline mode, where the majority are also claimed.
13. The Customised Plan is in the top position among the tour insurance products, followed by the Cancellation Plan, Bronze Plan and Silver Plan.
14. The Gold Plan product has the lowest count among the products.
15. The amount of sales of tour insurance policies is highest for the Gold Plan product, followed by the Silver Plan.
17. The higher range of sales occurs in the Gold Plan, followed by the Silver Plan type of products.
18. The lower range of sales occurs in the Bronze Plan, followed by the Customised Plan, and the lowest range is in the Cancellation Plan.
19. In the Gold Plan the majority have not claimed, while in the Silver Plan the majority of customers have claimed their insurance.
20. In the Customised Plan and Cancellation Plan the majority have not claimed, while in the Bronze Plan the majority have claimed their insurance.
21. We have not found any outliers in the Gold Plan and Silver Plan; in the remaining product plans many outliers can be seen.
22. The amount of sales of tour insurance policies is highest for the ASIA destination.
23. The Americas destination is in second position in the amount of sales of tour insurance policies.
24. For the Europe destination the amount of sales of tour insurance policies is lower than for the other two destinations.
25. The claim status in the Americas destination is roughly equal, i.e. about as many customers have claimed as have not claimed.
26. In the ASIA destination sales are highest and many customers have not claimed, but there is only a minor difference between the customers who have claimed and those who have not.
27. In the Europe destination many customers have not claimed, and the amount of sales of tour insurance policies is lower.
28. There is a strong relationship between the Sales and Commission variables: as the amount of sales of tour insurance policies increases, the commission received by the tour insurance firm also increases.
29. We cannot find any strong correlation except between the Sales and Commission variables.
30. There is no negative correlation between any of the variables.
Recommendations:
This is understood by looking at the insurance data and drawing relations between different variables such as the day of the incident, the time and the age group, and associating them with other external information such as location, behaviour patterns, weather information, airline/vehicle types, etc.
• Another interesting fact is that almost all the offline business has a claim associated with it; we need to find out why.
• We need to train the JZI agency resources to pick up sales, as they are at the bottom; we should run a promotional marketing campaign or evaluate whether we need to tie up with an alternate agency.
• Also, based on the model we are getting about 80% accuracy, so when a customer books airline tickets or plans, we can cross-sell the insurance based on the claim data pattern.
• Another interesting fact is that more sales happen via Agency than via Airlines, yet the trend shows that more claims are processed on the Airline side. We may need to dive deeper into the process to understand the workflow and the reason why.
• Combat fraud.
• Reduce claim handling costs.
• Insights gained from data and AI-powered analytics could expand the boundaries of insurability.
• Extend existing products and give rise to new risk transfer solutions in areas like non-damage business interruption and reputational damage.
The END!