Cluster Training PDF (Compatibility Mode)
Solution: Identify segments in which people share similar characteristics, and target each
of these segments in a different way.
Approach to Segmentation
Segmentation is of 2 types:
1. Clear objective to divide the population, for example:
• Response rate
• Increase in Sales
• Conversion proportion
2. First level analysis to see what lies within, for example:
• Who are my customers?
• Who buys what?
• When do they buy?
Clusters are groups within a Population.
These groups are HOMOGENEOUS within themselves.
And these groups are HETEROGENEOUS among each other.
[Figure: pie chart of the population split across Cluster 1 to Cluster 6, with shares (%) of 17.76, 20.61, 5.80, 16.19, 12.15 and 27.49]
Example
[Figure: Cluster 1 and Cluster 2 plotted by Income and Current Balance, each on a Low / Medium / High scale]
Cluster 1 and Cluster 2 are differentiated by Income and Current Balance. The objects in
Cluster 1 share similar characteristics (High Income and Low Balance), while the objects in
Cluster 2 share a different set of characteristics (High Balance and Low Income).
An object in Cluster 1 therefore differs considerably from an object in Cluster 2.
Cluster Methodology
Methodology – Cluster Development
Development Sample
Validation Sample
Outlier Treatment
Variables Standardization
Outlier
[Figure: scatter plot of Var 2 against Var 1, highlighting an outlier]
To identify them:
• Univariate and Frequency analysis
To tackle them:
1. The outliers can be deleted from analysis if they are very small in number.
2. The variables selected can be trimmed or capped.
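The capping option in point 2 can be sketched as follows. This is a minimal Python illustration, not the method used in the training material: the percentile cut-offs, the sample values and the function name cap_outliers are all assumptions made for the example.

```python
import statistics

def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Cap values outside the chosen percentiles (cut-offs are assumed)."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    # Values below `low` are raised to it, values above `high` are lowered to it
    return [min(max(v, low), high) for v in values]

# Hypothetical variable with one obvious outlier (90)
var1 = [5, 7, 8, 9, 10, 11, 12, 13, 14, 90]
capped = cap_outliers(var1, lower_pct=10, upper_pct=90)
```

Capping keeps every observation in the sample, which matters when the outliers are too many to delete.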
Methodology – Missing Value Treatment
Variables with a large share of missing values (about 15%) should not be used for clustering
unless ‘Missing’ has a special significance and can be replaced by some meaningful number.
% of Missing      Treatments
5-10%             • Regression Imputation
                  • Mean Imputation
More than 10%     • Regression Imputation
                  • Try to use some proxy Variable
Note: SAS does not include observations with missing values in the Clustering Process.
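As a rough sketch of the mean imputation treatment, the Python snippet below measures the share of missing values and fills the gaps with the mean of the observed values. The variable values are hypothetical, missing entries are assumed to be stored as None, and the 15% cut-off is taken from the guideline above.

```python
import statistics

def missing_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def mean_impute(values):
    """Replace missing entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

# Hypothetical variable with 1 missing value out of 20 (5%)
tenor = [12, 24, 36, 24, None, 12, 24, 36, 48, 24,
         12, 36, 24, 24, 12, 36, 48, 24, 12, 24]

share = missing_share(tenor)
usable = share < 0.15                      # about 15% is the stated cut-off
imputed = mean_impute(tenor) if usable else None
```

Regression imputation would replace the single mean with a prediction from the other variables; it is not shown here.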
Methodology – Multicollinearity Treatment
What is ‘Multicollinearity’?
Factor Analysis:
Use Factor Analysis to select the factors that together explain about
90-95% of the total variation. Then select the variables that
have high loadings on those factors.
Generally we standardize each variable to mean = 0 and variance = 1. This removes the units
of the variables and brings them onto a common scale for analysis.
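A minimal Python sketch of this standardization is shown below. The income values are hypothetical, and the population variance is assumed; a tool that divides by n - 1 instead would give slightly different values.

```python
import statistics

def standardize(values):
    """Rescale a variable to mean 0 and variance 1 (population variance assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical variable measured in currency units
income = [20000, 35000, 50000, 65000, 80000]
z_income = standardize(income)
```

After this step a variable measured in thousands no longer dominates the distance calculation over a variable measured in single digits.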
After all the data treatment steps, the “Cluster Development Process” begins.
After Cluster Development, “Cluster Validation” is done on the validation sample
to establish that the cluster solution is not sample dependent.
Cluster Building
Cluster Building – Types
Hierarchical Clustering is not suitable for large datasets because the number of calculations
involved becomes prohibitively large. K-Means clustering is therefore the most commonly used
method of clustering.
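To make the K-Means idea concrete, here is a small Python sketch of the standard iteration (assign each observation to the nearest seed, recompute the cluster means, repeat). It is an illustration only: the points and starting seeds are hypothetical, and production tools choose seeds and handle edge cases in their own way.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Basic K-Means: alternate nearest-seed assignment and mean update."""
    centers = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        # Assignment step: index of the closest center for each point
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: each center becomes the mean of its members
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        if new_centers == centers:       # no change, so the solution has converged
            break
        centers = new_centers
    return labels, centers

# Hypothetical two-variable data with two well separated groups
points = [(1, 1), (2, 1), (1, 2), (10, 10), (11, 10), (10, 11)]
labels, centers = kmeans(points, seeds=[(0, 0), (12, 12)])
```

Each pass costs one distance calculation per observation per cluster, which is why the method scales to large samples where hierarchical clustering does not.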
Cluster Building – K-Means Clustering
OPTIMALITY CHECKS
The Potential Cluster Solution should satisfy all the Optimality Checks without fail.
• # of clusters: 4 to 15
• Maximum Cluster size < 35 %
• Minimum Cluster size > 3 %
• Max RMSSTD < 1.4 %
• Maximum distance from seed to observation < 100
• Maximum distance from seed to observation between 30 to 100, but ([Max dist - Min dist] / Min dist) < 5
• Distance from the nearest cluster > 1.4
• Minimum Variable R-square > 0.25
• Overall R-square > 0.5
• Approximate Expected overall R-square > 0.3
• [App. Exp. Overall R-square - R-square] < 0.2
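The size-related checks in this list can be automated along the following lines. This Python sketch covers only the number of clusters and the maximum and minimum cluster size; the function name size_checks is an assumption, and the frequencies are the validation-sample counts reported later in this document, used here purely as an illustration.

```python
def size_checks(frequencies, min_k=4, max_k=15, max_share=0.35, min_share=0.03):
    """Evaluate the cluster-count and cluster-size checks from the checklist."""
    total = sum(frequencies)
    shares = [f / total for f in frequencies]
    return {
        "number_of_clusters": min_k <= len(frequencies) <= max_k,
        "max_cluster_size": max(shares) < max_share,
        "min_cluster_size": min(shares) > min_share,
    }

# Cluster frequencies for a 7-cluster solution (validation-sample counts)
freq = [69899, 164653, 84837, 53625, 112250, 172084, 60320]
checks = size_checks(freq)
passed = all(checks.values())
```

The remaining checks (RMSSTD, distances and R-square values) would be read from the clustering output and compared against their thresholds in the same way.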
Cluster Building – Cluster Solution
[Table: cluster summary with columns Cluster, Frequency, RMS Std Deviation, Max Distance - Seed to Observation, Distance Between Cluster Centroids]
[Table: Cluster Means]

Variable                      R-Square
CNT_LAN_MAT_TW                0.923282
Loanno                        0.572698
NO_ADV_EMI                    0.694306
MONTHS_SINCE_LOAN_MATURITY    0.590897
TENOR                         0.629882
OVER-ALL                      0.682213
[Figure: scatter plot of Var 2 against Var 1 showing Cluster 1, Cluster 2 and Cluster 3]
R-Square = Between Variation / Total Variation
A higher R-Square signifies high “between” variation and low “within” variation. Thus, the
higher the R-Square, the better the cluster solution.
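The ratio of between variation to total variation can be computed directly from the data and the cluster labels. The Python sketch below uses sums of squared distances and hypothetical points; statistical packages may apply their own degrees-of-freedom conventions, so treat it as an illustration of the idea and not as a reproduction of any tool's output.

```python
def r_square(points, labels):
    """Overall R-Square = between-cluster variation / total variation."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand_mean = mean(points)
    total = sum(sq_dist(p, grand_mean) for p in points)

    within = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = mean(members)
        within += sum(sq_dist(p, center) for p in members)

    between = total - within      # total variation splits into between + within
    return between / total

# Hypothetical data: two tight, well separated clusters
points = [(1, 1), (2, 1), (1, 2), (10, 10), (11, 10), (10, 11)]
labels = [0, 0, 0, 1, 1, 1]
r2 = r_square(points, labels)
```

With tight, well separated clusters almost all of the variation is between clusters, so the value comes out close to 1.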
Understanding Cluster Solution: Other Metrics
Approximate Expected Overall R-Square is calculated based on the hypothesis that all the
explanatory variables used for Clustering are independent.
Hence, if the Observed Overall R-square differs substantially from the Approximate
Expected Overall R-square, we can suspect high correlation among the explanatory
variables.
RMSSTD
Assuming p variables were used for Clustering:
RMSSTD within a cluster = Square root of [(Variance of variable 1 in that cluster +
Variance of variable 2 in that cluster + … + Variance of variable p in that cluster) / p]
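The definition above translates into a few lines of Python. The cluster rows are hypothetical, and the sample variance (dividing by n - 1) is an assumption made for this sketch.

```python
import math
import statistics

def rmsstd(cluster_rows):
    """Square root of the average per-variable variance within one cluster."""
    # zip(*rows) turns the rows into one column per clustering variable
    variances = [statistics.variance(col) for col in zip(*cluster_rows)]
    return math.sqrt(sum(variances) / len(variances))

# Hypothetical cluster with p = 2 variables and 3 observations
cluster = [(1, 1), (2, 1), (1, 2)]
value = rmsstd(cluster)
```

A smaller value means the observations in the cluster sit closer together on the variables used for clustering.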
The Cluster Solution is Validated on the “Validation Sample” using the Minimum Euclidean
Distance Method. Validation is done by calculating the distance of each observation in the
Validation sample from the Cluster Seed & assigning it to the closest cluster.
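Scoring one validation observation with the Minimum Euclidean Distance Method can be sketched in Python as below. The seed coordinates and the new observation are hypothetical values chosen for illustration, and the function name is an assumption.

```python
import math

def assign_to_nearest_seed(observation, seeds):
    """Return the name of the cluster whose seed is closest (Euclidean distance)."""
    distances = {name: math.dist(observation, seed) for name, seed in seeds.items()}
    return min(distances, key=distances.get)

# Hypothetical cluster seeds from the development sample (Var 1, Var 2)
seeds = {
    "Cluster 1": (10, 45),
    "Cluster 2": (30, 15),
    "Cluster 3": (40, 35),
}

new_observation = (12, 70)
assigned = assign_to_nearest_seed(new_observation, seeds)
```

Repeating this for every observation in the validation sample gives the cluster frequencies that are compared with the development sample.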
[Figure: scatter plot of Var 2 against Var 1 showing a new observation relative to Cluster 1, Cluster 2 and Cluster 3]
Cluster Population (%): Validation Sample

Cluster    Frequency        %
1             69,899     9.74
2            164,653    22.94
3             84,837    11.82
4             53,625     7.47
5            112,250    15.64
6            172,084    23.98
7             60,320     8.41
Total        717,668      100

[Figure: pie chart of the validation-sample share of Cluster 1 to Cluster 7, matching the % column]
The Validation sample was scored using the cluster solution. The frequency plot shows
a similar distribution on the Validation sample as in the Development sample.
Cluster Profiling with Example
The Cluster Solution is profiled against the variables to identify and assign the character
of each individual cluster.
[Figure: profiling of the Cluster Solution]