Data Mining - Assignment: Girish Nayak
Girish Nayak
nykgirish@gmail.com
Contents
1. Problem 1: Clustering
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
1.2 Do you think scaling is necessary for clustering in this case? Justify
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters
Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
1. Problem 1: Clustering
1.1 Read the data, do the necessary initial steps, and exploratory
data analysis (Univariate, Bi-variate, and multivariate analysis)
From the results above, we can see that there are no missing values in the dataset.
Correlation Plot – shows how the different attributes in the bank data are correlated with each other.
Pairplot:
The pairplot shows the relationships between the variables as scatterplots and the distribution of each variable as a histogram. From the graph, we can see a positive linear relationship between variables such as spending and advance_payments.
To check whether the data has outliers: as evident from the figure below, only min_payment_amt has outliers; the rest are fine.
Data values: the values for spending, current_balance and max_spent_in_single_shopping are in the 1000s, whereas the others are in the 100s or 10000s. So we will bring them onto the same scale in the next question.
1.2 Do you think scaling is necessary for clustering in this case? Justify
Yes, scaling is necessary for the provided dataset because, as described in the data dictionary, the variables are continuous but on different scales.
In the table below, we can see the differences between the std, min/max etc. for the different features.
z = (x - u) / s
where u is the mean of the training samples (or zero, if centring is disabled) and s is the standard deviation of the training samples.
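A minimal sketch of this standardisation using scikit-learn's StandardScaler (the values below are illustrative only, not the actual dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data on very different scales (values are illustrative only)
X = np.array([[19.0, 6100.0, 0.58],
              [15.2, 5200.0, 0.42],
              [11.1, 4300.0, 0.39]])

scaler = StandardScaler()               # applies z = (x - u) / s per column
X_scaled = scaler.fit_transform(X)

# After scaling, every column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(6))   # ~[0, 0, 0]
print(X_scaled.std(axis=0).round(6))    # ~[1, 1, 1]
```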
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
There is no single rule for reading the optimum number of clusters off a dendrogram. However, looking at the dendrogram below and the number of records in the dataset, I am going ahead with 3 clusters.
The 3 clusters being –
Below are some plots showing the clusters with respect to some of the features.
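The hierarchical clustering step itself can be sketched as follows. This is a minimal sketch, assuming Ward linkage (a common default for this kind of analysis) and using synthetic data as a stand-in for the scaled feature matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
# Synthetic stand-in for the scaled feature matrix
X_scaled = rng.normal(size=(200, 7))

Z = linkage(X_scaled, method='ward')             # linkage matrix for the dendrogram
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters

print(sorted(set(labels.tolist())))              # [1, 2, 3]
```

Passing Z to scipy.cluster.hierarchy.dendrogram produces the tree plot used to pick the cut.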
1.4 Apply K-Means clustering on scaled data and determine
optimum clusters. Apply elbow curve and silhouette score. Explain
the results properly. Interpret and write inferences on the finalized
clusters.
As per the plot of the within-cluster sum of squares (WSS) below, the change is not drastic beyond 3 clusters, hence the optimum number of clusters = 3.
The silhouette score for 3 clusters = 0.401, and the minimum silhouette width is 0.0027. The positive values show that no observations are incorrectly assigned to a cluster, which also means the clusters are well apart from each other and clearly distinguished.
Below is a plot of the elbow curve using distortions, where the optimum number of clusters also appears to be 3. The average silhouette scores for different numbers of clusters are:
2 : 0.46577247686580914
3 : 0.40072705527512986
4 : 0.3291966792017613
5 : 0.28722184455759475
6 : 0.29127768970444345
7 : 0.2796045365286959
8 : 0.2554830824906814
9 : 0.2539488265085003
However, the minimum silhouette widths for different numbers of clusters are given below. As we can see, apart from 3 clusters, there is interference among the clusters, evident from the negative silhouette widths.
2 : -0.0061712389274612344
3 : 0.002713089347678376
4 : -0.053840826993600814
5 : -0.08545150449435547
6 : -0.06438844854564076
7 : -0.11950241847834445
8 : -0.048959286018583625
9 : -0.11950241847834445
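The numbers above can be computed in one pass over the candidate cluster counts. A sketch, again with synthetic data standing in for the scaled matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(7)
X_scaled = rng.normal(size=(200, 7))   # stand-in for the scaled feature matrix

for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X_scaled)
    labels = km.labels_
    wss = km.inertia_                                   # within-cluster sum of squares (elbow)
    avg_sil = silhouette_score(X_scaled, labels)        # average silhouette score
    min_sil = silhouette_samples(X_scaled, labels).min()  # minimum silhouette width
    print(f"{k}: WSS={wss:.1f}  silhouette={avg_sil:.4f}  min width={min_sil:.4f}")
```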
The cluster numbers are assigned to the original data frame; a sample is provided below, with the new columns as follows:
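The label assignment can be sketched as below (the column and label names here are illustrative, not the exact ones in the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Illustrative frame with a few of the dataset's features
df = pd.DataFrame(rng.normal(size=(200, 3)),
                  columns=['spending', 'advance_payments', 'current_balance'])

km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(df)
df['KMeans_Cluster'] = km.labels_      # new column holding the cluster number

# High-level cluster profile: mean of each feature per cluster
print(df.groupby('KMeans_Cluster').mean().round(2))
```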
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters
The clusters defined by KMeans are considered here. Below are the profiles at a high level:
Cluster 0: These are the groups of low spenders. Below is an insight of the data in cluster 0.
Cluster 1: These are the groups of medium spenders. Below is an insight of the data in cluster 1.
Cluster 2: These are the groups of high spenders. Below is an insight of the data in cluster 2.
From the above 3 groups, we can deduce that the spread of the data is almost equal in all 3 clusters.
A few recommendations based on the data analysis:
1. The low-spender group of customers can be approached with marketing schemes that encourage more spending – such as a bonus for a higher number of transactions within a month, or a gift when total spend reaches X amount.
2. The probability of full payment is lower among low spenders than among medium and high spenders. Since the bank earns when customers carry revolving credit, it may want to look into this.
3. The minimum payment amount is higher for the low spenders and lower for clusters 1 and 2. These customers can be encouraged to make use of their credit limit, and, based on their transactions, can have their credit limits enhanced if required.
Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory
data analysis (Univariate, Bi-variate, and multivariate analysis).
From the results above, we can see that there are no missing values in the dataset.
Correlation Plot – shows how the different attributes in the data are correlated with each other.
Pairplot:
The pairplot shows the relationships between the variables as scatterplots and the distribution of each variable as a histogram.
To check whether the data has outliers: as evident from the figure below, there are many outliers.
After removing the single outlier with duration > 4000:
Data values
2.2 Data Split: Split the data into test and train, build classification
model CART, Random Forest, Artificial Neural Network
The data is split into training and testing sets using random_state = 7, so that the same split can be reproduced if required. The split is as follows:
X_train (2100, 9)
X_test (900, 9)
train_labels (2100,)
test_labels (900,)
Total Obs 3000
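The split above corresponds to a 70/30 partition of the 3000 observations. A minimal sketch with synthetic stand-in data of the same shape:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
# Stand-in for the 3000-row dataset with 9 predictors and a binary target
X = pd.DataFrame(rng.normal(size=(3000, 9)))
y = pd.Series(rng.integers(0, 2, size=3000))

# 70/30 split; random_state=7 makes the split reproducible
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=7)

print(X_train.shape, X_test.shape)  # (2100, 9) (900, 9)
```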
For Random Forest, the same split is used, and Grid Search gives the following best parameter combination:
{'max_depth': 7,
'max_features': 4,
'min_samples_leaf': 8,
'min_samples_split': 36}
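The grid search step can be sketched as follows. This uses synthetic data and a deliberately small grid around the best combination reported above; the actual notebook's grid was presumably wider:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 9-predictor training data
X, y = make_classification(n_samples=300, n_features=9, random_state=7)

# Small illustrative grid around the reported best parameters
param_grid = {'max_depth': [5, 7],
              'max_features': [4],
              'min_samples_leaf': [8],
              'min_samples_split': [36]}

grid = GridSearchCV(RandomForestClassifier(random_state=7),
                    param_grid, cv=3, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_)
```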
For the Artificial Neural Network, Grid Search gives the following best parameter combination:
{'activation': 'relu',
'hidden_layer_sizes': (100, 100, 100),
'max_iter': 10000,
'solver': 'adam',
'tol': 0.1}
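Fitting the network with those parameters can be sketched as below, assuming scikit-learn's MLPClassifier (the estimator these parameter names belong to) and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 9-predictor training data
X, y = make_classification(n_samples=300, n_features=9, random_state=7)

# ANN with the best parameters found by the grid search above
ann = MLPClassifier(hidden_layer_sizes=(100, 100, 100),
                    activation='relu', solver='adam',
                    max_iter=10000, tol=0.1, random_state=7)
ann.fit(X, y)
print(round(ann.score(X, y), 3))   # training accuracy
```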
CART Model:
AUC and ROC for Training Data:
RF Model:
AUC and ROC for Training Data:
AUC and ROC for Test Data:
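The AUC/ROC numbers for each model come from the predicted probabilities of the positive class. A sketch of the test-set computation, using synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=9, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=7)

rf = RandomForestClassifier(random_state=7).fit(X_train, y_train)

# Probability of the positive class drives both the ROC curve and the AUC
probs = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)          # points for the ROC plot
print('Test AUC:', round(roc_auc_score(y_test, probs), 3))
```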
2.4 Final Model: Compare all the models and write an inference
which model is best/optimized.
As per the details mentioned in the answer above, Random Forest is the final model.
2.5 Inference: Based on the whole Analysis, what are the business
insights and recommendations
Recommendations:
1. Claims from agency C2B are very high. Since C2B's only destination is ASIA, the company can look into these.
2. The lowest claims are from JZI, where most trips are to Europe, which seems to be a profitable destination for the company.