Cluster Training PDF (Compatibility Mode)

This document discusses segmentation and cluster analysis. Segmentation involves dividing a population into homogeneous subgroups based on common characteristics. The goal is to identify segments that can be targeted differently. Cluster analysis is an approach used for segmentation. It groups individuals into clusters such that individuals within a cluster are similar to each other but dissimilar to individuals in other clusters. The document outlines the methodology for cluster analysis, including outlier treatment, handling missing values, addressing multicollinearity, standardizing variables, and using k-means clustering to build the optimal cluster solution.

Uploaded by

Sarbani Dasgupts


Why Segmentation ?

Each individual is different, so ideally we would want to reach out to each one of them in a different way.

Problem: The volume is too large for customization at the individual level.

Solution: Identify segments where people share the same characteristics, and target each of these segments in a different way.
Approach to Segmentation

Segmentation is of 2 types:

Objective Segmentation
• Clear objective to divide the population, e.g. response rate, increase in sales, conversion proportion.
• Objective-defined analysis: identify the desired segment within the population, then devise a strategy to tap the potential within it.
• Technique: CHAID Analysis

Subjective Segmentation
• First-level analysis to see what lies within: Who are my customers? Who buys what? When do they buy?
• Initial analysis to understand and define the population, followed by objective-based analysis built on that initial understanding.
• Technique: Cluster Analysis


Cluster Analysis
What are Clusters ?

Clusters are groups within a population.

These groups are HOMOGENEOUS within themselves and HETEROGENEOUS among each other.

[Pie chart: Cluster Size (%) for Clusters 1 to 6, with slices of 17.76, 20.61, 5.80, 16.19, 12.15 and 27.49. The mapping of values to clusters is not recoverable from the extracted text.]

Homogeneity within a segment makes it possible to group people of similar characteristics.

Heterogeneity between segments makes it possible to differentiate segments within the population.
Example of Clusters

[Scatter plot: Current Balance (Low / Medium / High) against Gross Monthly Income (Low / Medium / High). Example Cluster 1: High Balance, Low Income. Example Cluster 2: High Income, Low Balance.]

Cluster 1 and Cluster 2 are differentiated by Income and Current Balance. The objects in Cluster 1 share the same characteristics (High Balance and Low Income), and the objects in Cluster 2 share their own (High Income and Low Balance). But there are large differences between an object in Cluster 1 and an object in Cluster 2.
Cluster Methodology
Methodology – Cluster Development

[Flow diagram]

Population → Variables Creation → Final Dataset → Development Sample and Validation Sample

Data Analysis:
1. Outlier Treatment
2. Missing Value Treatment
3. Multicollinearity Treatment using Factor Analysis
4. Variables Standardization

Cluster Solution Development & Validation

Cluster Profiles


Methodology – Outlier Treatment
What is an outlier?
An observation is said to be an outlier with respect to a variable if it lies far away from the remaining observations.

[Scatter plot of Var 2 against Var 1, with one point far from the rest labelled "Outlier".]

To identify them:
• Univariate and frequency analysis
• Histogram and box plot

To tackle them:
1. The outliers can be deleted from the analysis if they are very few in number.
2. The variables selected can be trimmed or capped.
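
The capping option can be sketched in Python. This is only an illustration: the slides do not say which cut-offs to use, so the box-plot fences (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR), the helper name `cap_outliers` and the data are all assumptions.

```python
import statistics

def cap_outliers(values, k=1.5):
    """Cap values outside the box-plot fences Q1 - k*IQR and Q3 + k*IQR.

    The fence rule and k=1.5 are assumptions, not taken from the slides.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]

# Made-up data: 80 lies far away from the remaining observations.
values = [12, 15, 14, 13, 16, 15, 14, 13, 15, 80]
capped = cap_outliers(values)
print(capped)
```

Only the extreme value is pulled back to the upper fence; the other observations are left unchanged.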
Methodology – Missing Value Treatment

Variables with a large share of missing values (about 15%) should not be used for clustering unless
‘Missing’ has a special significance and can be replaced by some meaningful number.

% of Missing Treatments

• Delete those Observations

Less than 1%
• Mean Imputation

1-5% • Mean Imputation

• Regression Imputation
5-10% • Mean Imputation

• Regression Imputation
More than 10% • Try to use some proxy Variable

Note: SAS does not include observations with missing values in the clustering process.
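
As a rough illustration of mean imputation, the sketch below replaces missing entries with the mean of the observed values. The helper names `missing_share` and `impute_mean`, the use of `None` for a missing value, and the data are assumptions made for the example.

```python
import statistics

def missing_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def impute_mean(values):
    """Replace missing entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

# Made-up variable: 25 observations, one of them missing (4%).
income = [40.0, 55.0, 45.0, 60.0] * 6 + [None]
print(f"missing: {missing_share(income):.0%}")
imputed = impute_mean(income)
print(imputed[-1])
```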
Methodology – Multicollinearity Treatment

What is ‘Multi-collinearity’?

A set of independent or explanatory variables is said to have ‘Multi-collinearity’ if there is any linear relation between them.

Ways to tackle ‘Multi-collinearity’:

Factor Analysis:
Using Factor Analysis, select the factors that together explain almost 90-95% of the total variation. Then select the variables that have high loadings on those factors.

VIF (Variance Inflation Factor):
Variables with a VIF of more than 2 should be dropped.
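
For reference, the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on the other explanatory variables. The sketch below covers only the simplest case of two explanatory variables, where R² equals the squared Pearson correlation. The function names and data are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_variables(x, y):
    """VIF when there are exactly two explanatory variables: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)

# Made-up data: var_b is almost a linear function of var_a.
var_a = [12, 24, 36, 48, 60]
var_b = [11, 25, 35, 49, 58]
# Made-up data: var_c is uncorrelated with var_d.
var_c = [1, 2, 3, 4, 5]
var_d = [2, 1, 3, 1, 2]

print(vif_two_variables(var_a, var_b))  # far above the threshold of 2
print(vif_two_variables(var_c, var_d))  # close to 1
```

With more than two variables, each R² requires a full multiple regression, which this sketch does not attempt.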
Methodology – Variable Standardization

Why do we need ‘Standardization’?

Since the units of measurement differ from variable to variable, standardization is a must.

E.g.: Consider two variables, Age and Income.

The unit of Age is ‘Year’ and the unit of Income is, say, ‘Rs’.
Hence they are not comparable.
In that case there would be no common unit of measurement for the distance between two clusters.

Generally we standardize by making the mean = 0 and the variance = 1, thus removing the units from the variables and bringing them onto a common platform for analysis.
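
A minimal sketch of this standardization, using the Age and Income example with made-up values. It divides by the population standard deviation so that the variance comes out as exactly 1. Statistical packages often use the sample standard deviation instead, so treat that choice as an assumption.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and (population) variance 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

age = [25, 35, 45, 55, 65]                      # years (made-up)
income = [20000, 35000, 50000, 80000, 120000]   # Rs (made-up)

age_z = standardize(age)
income_z = standardize(income)
print(age_z)
print(income_z)
```

After the transformation both variables are unit-free, so a distance computed across them no longer depends on whether income is recorded in rupees or thousands of rupees.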

After all the data treatment steps, the “Cluster Development Process” begins.
After cluster development, “Cluster Validation” is done on the validation sample
to establish that the cluster solution is not sample dependent.
Cluster Building
Cluster Building – Types

There are 2 ways in which cluster solutions can be built.

Hierarchical Clustering
Each observation is initially considered an individual cluster. The distance from each observation to all others is calculated, and the nearest observations are clubbed together to form clusters. The intensive distance calculations required make it difficult to implement.

K-Means Clustering
K distinct observations are selected as seeds, as far from each other as possible. Each observation is then considered one by one and clubbed to the nearest cluster. If two clusters come significantly close to each other, they are merged to form a new cluster.

Hierarchical Clustering is not suitable for large datasets, because the number of distance calculations involved becomes prohibitively large. Thus K-Means clustering is the most used method of clustering.
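
The K-Means idea can be illustrated with a small Python sketch. It is a textbook-style iteration (assign each observation to the nearest centre, then recompute the centres) with a farthest-first choice of seeds. It is not the SAS procedure used in this deck, and it leaves out the merging of close clusters mentioned above. The function names and data are assumptions.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first_seeds(points, k):
    """Pick k seeds: start from the first point, then repeatedly add the
    point that is farthest from its nearest already-chosen seed."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds

def k_means(points, k, max_iter=100):
    """Return (labels, centers) after iterating until assignments stop changing."""
    centers = farthest_first_seeds(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centre.
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Made-up, already comparable 2-D data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```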
Cluster Building – K-Means Clustering

K-Means Clustering SAS Code

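rsubmit;  /* SAS/CONNECT: submit the block to the remote session */
proc fastclus data=out.inactive  /* input dataset */
              maxc=200           /* maximum number of clusters */
              maxiter=100        /* maximum number of iterations */
              delete=25000       /* drop seeds with 25000 or fewer observations */
              out=out.final;     /* output dataset with cluster assignments */
   var CNT_LAN_MAT_TW
       Loanno
       NO_ADV_EMI
       MONTHS_SINCE_LOAN_MATURITY
       TENOR;
run;
endrsubmit;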

OPTIMALITY CHECKS

The potential cluster solution should satisfy all the optimality checks without fail:

• # of clusters: 4 to 15
• Maximum cluster size < 35%
• Minimum cluster size > 3%
• Max RMSSTD < 1.4 %
• Maximum distance from seed to observation < 100
• Maximum distance from seed to observation between 30 and 100, but ([Max dist - Min dist] / Min dist) < 5
• Distance from the nearest cluster > 1.4
• Minimum variable R-square > 0.25
• Overall R-square > 0.5
• Approximate expected overall R-square > 0.3
• [App. Exp. Overall R-square - R-square] < 0.2
Cluster Building – Cluster Solution

Cluster  Frequency  RMS Std Deviation  Max Distance - Seed to Observation  Distance Between Cluster Centroids
1        69696      0.8642             14.6487                             2.7342
2        164495     0.3587             3.7355                              1.7221
3        84576      0.7891             15.5323                             3.2326
4        53434      0.6266             4.471                               1.9309
5        111923     0.4809             8.6794                              1.8346
6        171323     0.3729             2.4891                              1.7221
7        61126      0.7138             12.3443                             2.4533

Cluster Means

Cluster  CNT_LAN_MAT_TW  Loanno       NO_ADV_EMI    MONTHS_SINCE_LOAN_MATURITY  TENOR
1        -0.197450101    2.27108366   -1.046054301  -0.509641873                0.312811446
2        -0.375048706    -0.34924125  0.200597434   0.994222204                 -0.677162416
3        2.622903928     -0.09743388  -0.435001209  -0.046890848                0.113553959
4        -0.375048706    -0.35414491  -0.97960221   1.463155948                 0.777377199
5        -0.355935079    -0.28306493  1.567501549   -0.354749804                0.347933497
6        -0.375046517    -0.28265108  0.111602877   -0.724103839                -0.705347005
7        -0.363966798    0.1052424    -1.071824808  -0.629530996                1.968822045

Variable                    R-Square
CNT_LAN_MAT_TW              0.923282
Loanno                      0.572698
NO_ADV_EMI                  0.694306
MONTHS_SINCE_LOAN_MATURITY  0.590897
TENOR                       0.629882
OVER-ALL                    0.682213

Approximate Expected Over-All R-Squared = 0.54085
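
As an illustration, the reported figures can be run through the optimality checks in a few lines of Python. Several readings here are assumptions: the RMSSTD limit is taken as 1.4, the bracketed difference of R-squares is taken as an absolute difference, and the "Distance Between Cluster Centroids" column is taken as the distance to the nearest cluster. The ratio check for maximum distances between 30 and 100 is left out because no maximum distance in this solution reaches 30.

```python
# Figures copied from the cluster solution output above.
frequency = [69696, 164495, 84576, 53434, 111923, 171323, 61126]
rms_std = [0.8642, 0.3587, 0.7891, 0.6266, 0.4809, 0.3729, 0.7138]
max_dist = [14.6487, 3.7355, 15.5323, 4.471, 8.6794, 2.4891, 12.3443]
centroid_dist = [2.7342, 1.7221, 3.2326, 1.9309, 1.8346, 1.7221, 2.4533]
variable_r2 = {
    "CNT_LAN_MAT_TW": 0.923282,
    "Loanno": 0.572698,
    "NO_ADV_EMI": 0.694306,
    "MONTHS_SINCE_LOAN_MATURITY": 0.590897,
    "TENOR": 0.629882,
}
overall_r2 = 0.682213
expected_r2 = 0.54085

total = sum(frequency)
shares = [f / total for f in frequency]

# Thresholds follow the optimality checks; see the stated assumptions.
checks = {
    "number of clusters 4 to 15": 4 <= len(frequency) <= 15,
    "maximum cluster size < 35%": max(shares) < 0.35,
    "minimum cluster size > 3%": min(shares) > 0.03,
    "max RMSSTD < 1.4": max(rms_std) < 1.4,
    "max seed-to-observation distance < 100": max(max_dist) < 100,
    "distance from nearest cluster > 1.4": min(centroid_dist) > 1.4,
    "minimum variable R-square > 0.25": min(variable_r2.values()) > 0.25,
    "overall R-square > 0.5": overall_r2 > 0.5,
    "expected overall R-square > 0.3": expected_r2 > 0.3,
    "|expected - observed R-square| < 0.2": abs(expected_r2 - overall_r2) < 0.2,
}

for name, passed in checks.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```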

Understanding Cluster Solution: R-Square

For a given data set, the “Total amount of Variation” is fixed.

If there are k clusters in the solution, then
Total Variation = Within Variation + Between Variation
Within Variation = (Variation within Cluster 1) + (Variation within Cluster 2) + … + (Variation within Cluster k)
Between Variation = Variation from one cluster to another (i.e. variation of cluster means).

R-Square = Between Variation / Total Variation

[Scatter plot of Var 2 against Var 1 showing Cluster 1, Cluster 2 and Cluster 3.]

A higher R-Square signifies high “between” variation and low “within” variation. Thus, the higher the R-Square, the better.
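
The decomposition can be checked numerically. The sketch below uses a single variable and made-up data to keep the arithmetic visible, and measures variation as sums of squared deviations, which is an assumption about what the slides mean by "variation".

```python
def variation_breakdown(clusters):
    """clusters: list of clusters, each a list of values of one variable.

    Returns (total, within, between, r_square), with variation measured
    as a sum of squared deviations.
    """
    all_values = [v for cluster in clusters for v in cluster]
    grand_mean = sum(all_values) / len(all_values)
    total = sum((v - grand_mean) ** 2 for v in all_values)
    within = 0.0
    between = 0.0
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        within += sum((v - mean) ** 2 for v in cluster)
        between += len(cluster) * (mean - grand_mean) ** 2
    return total, within, between, between / total

# Made-up data: two tight clusters that lie far apart.
clusters = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]
total, within, between, r_square = variation_breakdown(clusters)
print(total, within, between, round(r_square, 4))
```

Here the within variation is small compared with the between variation, so the R-Square is close to 1.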
Understanding Cluster Solution: Other Metrics

Approximate Expected Overall R-square

Approximate Expected Overall R-Square is calculated under the hypothesis that all the explanatory variables used for clustering are independent.
Hence, if there is a large difference between the Observed Overall R-square and the Approximate Expected Overall R-square, we can suspect high correlation among the independent variables.

RMSSTD

RMSSTD within a cluster = Square root of the Average of (Variance of variable 1 in that cluster, Variance of variable 2 in that cluster, … , Variance of variable p in that cluster), assuming p variables were used for clustering.
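
A small sketch of this calculation for one cluster with p = 2 variables. The data are made up, and the use of the sample variance (divisor n - 1) is an assumption, since the slides do not say which variance is meant.

```python
import math
import statistics

def rmsstd(cluster_rows):
    """Root mean square standard deviation of one cluster.

    cluster_rows: observations of the cluster, each a tuple of p variables.
    """
    columns = list(zip(*cluster_rows))
    variances = [statistics.variance(col) for col in columns]  # sample variance
    return math.sqrt(sum(variances) / len(variances))

# Made-up cluster with 3 observations and 2 variables.
cluster = [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0)]
print(rmsstd(cluster))  # variances are 1 and 4, so this is sqrt(2.5)
```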

There is no hard restriction on the number of clusters, but as a guideline it should be between 5 and 15.

Care should be taken over the number of observations in each cluster. A good rule of thumb is to
have >= 5% of the population in each cluster.
Cluster Validation & Profiling
Cluster Validation

The cluster solution is validated on the “Validation Sample” using the Minimum Euclidean Distance Method: the distance of each observation in the validation sample from every cluster seed is calculated, and the observation is assigned to the closest cluster.

[Scatter plot of Var 2 against Var 1 showing Cluster 1, Cluster 2, Cluster 3 and a new observation lying closest to Cluster 1.]

The new observation will be a member of Cluster 1.
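
The assignment rule can be sketched as follows. As an assumption, the cluster means reported in the cluster solution stand in for the cluster seeds, and the new observation is a made-up, already standardized record. Variable order is CNT_LAN_MAT_TW, Loanno, NO_ADV_EMI, MONTHS_SINCE_LOAN_MATURITY, TENOR.

```python
import math

# Cluster means from the cluster solution, used here as the seeds.
seeds = {
    1: (-0.197450101, 2.27108366, -1.046054301, -0.509641873, 0.312811446),
    2: (-0.375048706, -0.34924125, 0.200597434, 0.994222204, -0.677162416),
    3: (2.622903928, -0.09743388, -0.435001209, -0.046890848, 0.113553959),
    4: (-0.375048706, -0.35414491, -0.97960221, 1.463155948, 0.777377199),
    5: (-0.355935079, -0.28306493, 1.567501549, -0.354749804, 0.347933497),
    6: (-0.375046517, -0.28265108, 0.111602877, -0.724103839, -0.705347005),
    7: (-0.363966798, 0.1052424, -1.071824808, -0.629530996, 1.968822045),
}

def assign_cluster(observation, seeds):
    """Return the id of the seed with the minimum Euclidean distance."""
    return min(seeds, key=lambda c: math.dist(observation, seeds[c]))

# Hypothetical standardized observation from a validation sample.
new_observation = (-0.2, 2.0, -1.0, -0.5, 0.3)
print(assign_cluster(new_observation, seeds))
```

The validation observations must be standardized with the same means and standard deviations as the development sample, otherwise the distances are not comparable.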

Cluster Validation: Sample Example
Development Sample

Cluster  Frequency  %
1        69,696     9.73
2        164,495    22.96
3        84,576     11.80
4        53,434     7.46
5        111,923    15.62
6        171,323    23.91
7        61,126     8.53
Total    716,573    100

Validation Sample

Cluster  Frequency  %
1        69,899     9.74
2        164,653    22.94
3        84,837     11.82
4        53,625     7.47
5        112,250    15.64
6        172,084    23.98
7        60,320     8.41
Total    717,668    100

[Pie charts: Cluster Population (%) for the Development Sample and the Validation Sample, showing the same percentages as the tables.]

The Validation sample was scored using the cluster solution. The frequency plot shows
a similar distribution on the Validation sample as in the Development sample.
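
The similarity of the two distributions can be quantified with a short calculation on the reported frequencies. Using the largest gap in percentage points is a choice made for this sketch, not a criterion stated in the slides.

```python
# Cluster frequencies reported for the two samples.
development = {1: 69696, 2: 164495, 3: 84576, 4: 53434,
               5: 111923, 6: 171323, 7: 61126}
validation = {1: 69899, 2: 164653, 3: 84837, 4: 53625,
              5: 112250, 6: 172084, 7: 60320}

def shares(freq):
    """Percentage of the sample that falls in each cluster."""
    total = sum(freq.values())
    return {c: 100.0 * n / total for c, n in freq.items()}

dev_pct = shares(development)
val_pct = shares(validation)

# Absolute gap, in percentage points, for every cluster.
diff = {c: abs(dev_pct[c] - val_pct[c]) for c in development}
worst = max(diff, key=diff.get)

for c in development:
    print(f"Cluster {c}: {dev_pct[c]:.2f}% vs {val_pct[c]:.2f}%")
print(f"Largest gap: cluster {worst}, {diff[worst]:.2f} percentage points")
```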
Cluster Profiling with Example

The cluster solution is profiled against variables to identify and assign the character of individual clusters.

[Diagram: PROFILING of the Cluster Solution, with two embedded Microsoft Excel worksheets: a data file of continuous numeric variables and a data file of categorical variables.]
