
Lectures 5 and 6 - Data Analysis in Management - MBM


Data Analysis in Management

Programme: Modern Business Management


2022/2023

Roman Huptas, Department of Statistics,
Cracow University of Economics
Lectures 5 and 6

Cluster analysis
Outline of the lecture

 Introduction to cluster analysis

 Procedure of cluster analysis

 Clustering methods (algorithms)

 Example of clustering

Learning objectives
Upon completing this lecture, you should be able to:

 define cluster analysis and its roles,

 identify the types of research questions addressed by cluster analysis,

 understand the key terms and basics of cluster analysis,

 understand the differences between hierarchical and nonhierarchical
clustering techniques,

 know how to interpret results from cluster analysis.


What is cluster analysis?
 When considering groups of objects in a multivariate data
set, two situations can arise.

 Given a data set containing measurements on individuals,
  in some cases we want to see if some natural groups or
classes of individuals exist, and
  in other cases, we want to classify the individuals according to
a set of existing groups.

 Cluster analysis develops tools and methods concerning
the former case.
What is cluster analysis?
 That is, given data containing multivariate measurements
on a large number of individuals (or objects), the
objective is to build some natural subgroups or clusters
of individuals.

 This is done by grouping individuals that are “similar”
according to some appropriate criterion.
What is cluster analysis?
 Cluster analysis is a group of multivariate
techniques whose primary purpose is to
group objects (e.g., respondents, products,
or other entities) based on the
characteristics they possess (on a set of
user-selected characteristics).
What is cluster analysis?
An example of a simple cluster analysis:
What is cluster analysis?
 Clustering refers to the grouping of records,
observations, or cases into classes of similar objects.

 Thus, cluster analysis groups individuals or objects into
clusters so that objects in the same cluster are more
similar to one another than they are to objects in other
clusters.
What is cluster analysis?
 A cluster is a collection of records that are similar to
one another and dissimilar to records in other clusters.

 Cluster analysis attempts to maximize the
homogeneity of objects within the clusters while also
maximizing the heterogeneity between the clusters.
What is cluster analysis?
 Cluster analysis has also been referred to as typology
construction, classification analysis, and numerical
taxonomy.

 This variety of names is due to the usage of clustering
methods in such diverse disciplines as psychology, biology,
sociology, economics, engineering, and business.

 Although the names differ across disciplines, the methods
all have a common dimension: classification according to
relationships among the objects being clustered.
What is cluster analysis?
 Clustering differs from classification in that there is no
target variable for clustering.

 The clustering task does not try to classify, estimate, or
predict the value of a target variable.

 Instead, clustering algorithms seek to segment the entire
data set into relatively homogeneous subgroups or
clusters, where the similarity of the records within the
cluster is maximized, and the similarity to records outside
this cluster is minimized.
What is cluster analysis?
 Clustering is a form of so-called unsupervised learning:
no predefined classes !!!

 “Learning” refers to the fact that the statistician calibrates
models to data or vice versa.

 The term “unsupervised” points to the act of creating a
model (here the separation of individuals into groups) out
of the observed data.
What is cluster analysis?
Cluster analysis is applied in many fields such as:
 economics,
 marketing,
 the natural sciences,
 the medical sciences,
 the biological and behavioral sciences.
What is cluster analysis?
Examples and applications of clustering include:
 Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
 Marketing researchers use cluster analysis as a customer-
segmentation strategy. Customers are arranged into
clusters based on the similarity of their demographics and
buying behaviors. Marketing campaigns are then tailored
to appeal to one or more of these subgroups.
 Target marketing of a niche product for a small-
capitalization business that does not have a large
marketing budget
What is cluster analysis?
Examples and applications of clustering include:
 In marketing, it is useful to build and describe the
different segments of a market from a survey on potential
consumers.
 For accounting auditing purposes, to segment financial
behavior into benign and suspicious categories
 Land use: Identification of areas of similar land use in an
earth observation database
 City-planning: Identifying groups of houses according to
their house type, value, and geographical location
What is cluster analysis?
Examples and applications of clustering include:
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
 An insurance company, on the other hand, might be
interested in the distinction among classes of potential
customers so that it can derive optimal prices for its
services.
 The classification of companies according to their
organizational structures, technologies, and types.
 Clustering of weblog data to discover groups of similar
access patterns.
What is cluster analysis?
Examples and applications of clustering include:
 As a dimension-reduction tool when a data set has
hundreds of attributes
 For gene expression clustering, where very large
quantities of genes may exhibit similar behavior. Medical
researchers use cluster analysis to help catalog gene-
expression patterns obtained from DNA microarray data.
This can help them to understand normal growth and
development and the underlying causes of many human
diseases.
What is cluster analysis?
 Clustering is often performed as a preliminary step in a
data mining process, with the resulting clusters being
used as further inputs into a different technique
downstream, such as neural networks.

 Due to the enormous size of many present-day databases,
it is often helpful to apply clustering analysis first, to
reduce the search space for the downstream algorithms.
What is cluster analysis?
Cluster analysis encounters the following
problems:
 How to measure similarity?
 How to recode categorical variables?
 How to standardize or normalize numerical
variables?
 How many clusters do we expect to uncover?
What is cluster analysis?
Cluster analysis can be divided into two fundamental steps:
 Choice of a proximity measure:
One checks each pair of observations (objects) for the
similarity of their values. A similarity (proximity) measure is
defined to measure the “closeness” of the objects. The
“closer” they are, the more homogeneous they are.

 Choice of group-building algorithm:
On the basis of the proximity measures, the objects are
assigned to groups so that differences between groups
become large and observations in a group become as close
as possible.
What is a good clustering? – quality of clustering
 A good clustering method will produce high quality clusters
with
 high intra-class similarity,
 low inter-class similarity.

 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.

 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Common steps in cluster analysis
An effective cluster analysis is a multistep process with
numerous decision points. Each decision can affect the quality
and usefulness of the results.
The typical steps in a comprehensive cluster analysis:
 Choose appropriate attributes.
 Scale the data.
 Select a similarity measure. Calculate distances.
 Select a clustering algorithm (procedure).
 Determine the number of clusters present.
 Obtain a final clustering solution.
 Visualize the results.
 Interpret the clusters.
 Validate the results.
Common steps in cluster analysis
Choose appropriate attributes. Selection of clustering variables.
 The first (and perhaps most important) step is to select
variables that you feel may be important for identifying and
understanding differences among groups of observations
within the data.
 The objectives of cluster analysis cannot be separated from
the selection of variables used to characterize the objects
being clustered.
 The researcher effectively constrains the possible results by
the variables selected for use. The derived clusters reflect the
inherent structure of the data and are defined only by the
variables.
Common steps in cluster analysis
Scale the data. Standardizing the variables.
 If the variables in the analysis vary in range, the variables with
the largest range will have the greatest impact on the results.

 This is often undesirable, and analysts scale the data before
continuing.

 For optimal performance, clustering algorithms require the
data to be normalized so that no particular variable or subset
of variables dominates the analysis.
Common steps in cluster analysis
Scale the data. Standardizing the variables.
 The most common form of standardization is the conversion
of each variable to standard scores (also known as Z scores) by
subtracting the mean and dividing by the standard deviation
for each variable. This option can be found in all computer
programs and many times is even directly included in the
cluster analysis procedure.

 Other alternatives include dividing each variable by its
maximum value or subtracting the variable’s mean and
dividing by the variable’s median absolute deviation.
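The z-score standardization described above can be sketched in a few lines of Python. NumPy is assumed here (the lecture does not prescribe a tool), and the data values are invented purely for illustration:

```python
import numpy as np

# Hypothetical data: 5 objects measured on 2 variables with very
# different ranges (e.g., income in thousands vs. a 1-5 rating).
X = np.array([[42.0, 3.1],
              [95.0, 1.2],
              [58.0, 4.5],
              [120.0, 2.0],
              [77.0, 3.8]])

# Z-score standardization: subtract each variable's mean and divide
# by its standard deviation, so no variable dominates by scale alone.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # each column now has mean ~0
print(Z.std(axis=0))   # and standard deviation 1
```

After this transformation every variable contributes on a comparable scale, which is exactly the property the clustering algorithms below rely on.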
Common steps in cluster analysis
Select a similarity measure. Calculate distances.
 The objective of clustering is to group similar objects
together. Some measure is needed to assess how similar or
different the objects are.

 Dissimilarity/Similarity metric: similarity is expressed in terms
of a distance function, which is typically metric.

 Although clustering algorithms vary widely, they typically
require a measure of the distance among the entities to be
clustered.
Common steps in cluster analysis
Select a similarity measure. Calculate distances.
 The starting point of a cluster analysis is a data matrix X
(n × p) with n measurements (objects) of p variables. The
similarity (proximity) among the objects is described by a matrix
D (n × n).
Common steps in cluster analysis
Select a similarity measure. Calculate distances.
 The matrix D contains measures of similarity or dissimilarity
among the n objects.

 The most commonly used measures of similarity are
distance measures.

 If the values dij are distances, then they measure dissimilarity.

 The greater the distance, the less similar the objects are.

 The matrix D is then called the distance matrix.
Common steps in cluster analysis
Select a similarity measure. Calculate distances.
 The most popular measure of the distance between two
observations is the Euclidean distance, but the Manhattan,
Canberra, asymmetric binary, maximum, and Minkowski
distance measures are also available.
 Euclidean distance is the most commonly recognized
measure of distance.
 The Euclidean distance between two observations is given by:

   d(i, j) = sqrt[ (x_i1 − x_j1)² + (x_i2 − x_j2)² + … + (x_iP − x_jP)² ]

where i and j are observations and P is the number of variables.


Common steps in cluster analysis
Select a similarity measure. Calculate distances.
 An example of how Euclidean distance is obtained in the case
of two dimensions is shown geometrically in the figure below:
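As a sketch of how the distance matrix D is built from the data matrix X, the following Python fragment computes all pairwise Euclidean distances. NumPy is assumed, and the numbers are invented for illustration:

```python
import numpy as np

# Hypothetical data matrix X: n = 4 objects, P = 2 variables.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [5.0, 1.0],
              [6.0, 3.0]])

n = X.shape[0]
D = np.zeros((n, n))  # the n x n distance matrix
for i in range(n):
    for j in range(n):
        # Euclidean distance: root of the summed squared differences.
        D[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))

print(np.round(D, 2))  # symmetric, with zeros on the diagonal
```

Since d(i, j) = d(j, i) and d(i, i) = 0, the matrix D is symmetric with a zero diagonal, so in practice only the lower (or upper) triangle needs to be stored.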
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 The two most popular clustering approaches are:
 hierarchical (agglomerative and divisive) clustering,
 non-hierarchical (partitioning) clustering.

 Hierarchical clustering is useful for smaller problems (say, 300
observations or less) and where a nested hierarchy of
groupings is desired.

 The partitioning method can handle much larger problems
but requires that the number of clusters be specified in
advance.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
Hierarchical clustering

 In hierarchical clustering, a treelike cluster structure
(dendrogram) is created through
  recursive partitioning (divisive algorithms/methods), or
  combining of existing clusters (agglomerative algorithms/methods).
Common steps in cluster analysis
Select a clustering algorithm (procedure).
Hierarchical clustering
 In agglomerative hierarchical clustering, each object or
observation starts as its own cluster. Then, in each subsequent
step, the two closest clusters are combined, and the process is
repeated until all clusters are merged into a single cluster.

 In this way, the number of clusters in the data set is reduced
by one at each step.

 Agglomerative clustering algorithm – from n clusters to 1 cluster.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
Hierarchical clustering
 Divisive clustering algorithms begin with all the records
in one big cluster, with the most dissimilar records being split
off recursively, into a separate cluster, until each record
represents its own cluster.

 Divisive clustering algorithm – from 1 cluster to n sub-clusters.

 NOTE! Most computer programs that apply
hierarchical clustering use agglomerative methods!
Common steps in cluster analysis
Select a clustering algorithm (procedure).
Graph illustrating hierarchical clustering – agglomerative
methods move from left to right, and divisive methods move
from right to left:
Common steps in cluster analysis
Select a clustering algorithm (procedure).
The agglomerative clustering algorithm is as follows:
1. Define each observation (row, case) as a cluster.
2. Calculate the distances between every cluster and every
other cluster.
3. Combine the two clusters that have the smallest distance.
This reduces the number of clusters by one.
4. Repeat steps 2 and 3 until all clusters have been merged into
a single cluster containing all observations.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 The primary difference among hierarchical clustering
algorithms is their definitions of cluster distances (step 2).

 There are many clustering algorithms to choose from.

 For hierarchical clustering, the most popular are:
  single linkage,
  complete linkage,
  average linkage,
  centroid,
  Ward’s method.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 Five of the most common hierarchical clustering methods and
their definitions of the distance between two clusters:
  Single linkage: the minimum distance between an observation
in one cluster and an observation in the other cluster.
  Complete linkage: the maximum distance between an observation
in one cluster and an observation in the other cluster.
  Average linkage: the average distance over all pairs of
observations, one from each of the two clusters.
  Centroid: the distance between the centroids (mean vectors)
of the two clusters.
  Ward’s method: the increase in the total within-cluster sum of
squares that results from merging the two clusters.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 Comparison of distance measures for single-linkage and
complete-linkage cluster methods
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 Average-linkage cluster method
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 Centroid cluster method
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 Ward’s method differs from the previous methods in
that the similarity between two clusters is not a single
measure of similarity, but rather the sum of squares within the
clusters summed over all variables.
 At each step, the two clusters combined are those that
minimize the increase in the total sum of squares across all
variables in all clusters.
 This procedure tends to combine clusters with a small
number of observations, because the sum of squares is
directly related to the number of observations involved.
 Moreover, Ward’s method also tends to produce clusters
with approximately the same number of observations.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
Nonhierarchical clustering
 In contrast to hierarchical methods, nonhierarchical
procedures do not involve the treelike construction process.

 In the partitioning approach, you specify K: the
number of clusters. Observations are then randomly
divided into K groups and reshuffled to form cohesive
clusters.

 For partitioning clustering, the two most popular methods are
the k-means method and the partitioning around medoids (PAM)
method.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
Nonhierarchical clustering
 K-means clustering can handle larger datasets than
hierarchical cluster approaches.

 Additionally, observations are not permanently committed to
a cluster. They are moved when doing so improves the overall
solution.

 But the use of means implies that all variables must be
continuous.

 The approach can be severely affected by outliers.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
The most common partitioning method is the k-means cluster
analysis. Conceptually, the k-means algorithm is as follows:
1. Select k centroids (k rows chosen at random).
2. Assign each data point to its closest centroid.
3. Recalculate the centroids as the average of all data points in
a cluster (that is, the centroids are p-length mean vectors,
where p is the number of variables).
4. Assign data points to their closest centroids.
5. Continue steps 3 and 4 until the observations are not
reassigned or the maximum number of iterations (R uses 10
as a default) is reached.
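The five k-means steps above can be sketched in Python. NumPy is assumed, the data are simulated purely for illustration, and the loop cap of 10 iterations mirrors the default the lecture notes for R:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: two well-separated groups of 20 points in 2-D.
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

k = 2
# Step 1: select k centroids (k rows chosen at random).
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(10):  # cap on iterations
    # Steps 2 and 4: assign each point to its closest centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Step 3: recalculate each centroid as the mean of its points.
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 5: stop once the centroids (and hence assignments) settle.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)
```

Because the two simulated groups are far apart relative to their spread, the algorithm settles quickly on one centroid per group; with less separated data the result can depend on the random starting centroids, which is why implementations typically rerun the algorithm from several starts.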
Common steps in cluster analysis
Choice of the number of clusters k in k-means method.
The Silhouette coefficient

 Kaufman and Rousseeuw (1990)

 A measure of the quality of the k-means clustering method that is
independent of k.

 The silhouette coefficient is a measure of how similar an
object is to its own cluster (cohesion) compared to other
clusters (separation).
Common steps in cluster analysis
Choice of the number of clusters k in k-means method.
The Silhouette coefficient
 For the i-th object, the value of the silhouette coefficient is
defined as:

   s(i) = (b(i) − a(i)) / max{a(i), b(i)}

 a(i) is the average distance between the i-th object and all
other objects in the same cluster,
 b(i) is the smallest average distance between the i-th object
and all the objects in any other cluster not containing this
object.
Common steps in cluster analysis
Choice of the number of clusters k in k-means method.
The Silhouette coefficient
 The silhouette coefficient ranges from −1 to +1.

 A high value indicates that the object is well matched to its
own cluster and poorly matched to neighbouring clusters.

 If s(i) is close to 1, it means that the object is appropriately
clustered.
 If s(i) is close to −1, it means that the object would fit better in its
neighbouring cluster.
 If s(i) is close to 0, it means that the object is between two
clusters.
Common steps in cluster analysis
Choice of the number of clusters k in k-means method.
The Silhouette coefficient
 The average s(i) over all objects (points) of a cluster is a
measure of how tightly grouped all the objects (points) in the
cluster are.

 An overall measure of the goodness of clustering can be
obtained by computing the average silhouette coefficient of all
objects (points).
Common steps in cluster analysis
Choice of the number of clusters k in k-means method.
The Silhouette coefficient

 Thus, the average s(i) over all data of the entire dataset is a
measure of how appropriately the data have been clustered.

 The silhouette coefficient can thus be adopted to determine the
best clustering scheme, i.e., the one with the maximum average
silhouette coefficient.
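The definitions of a(i), b(i), and s(i) above can be sketched as follows. NumPy is assumed, and the toy data and hand-assigned labels are illustrative only:

```python
import numpy as np

# Toy data with an obvious two-cluster structure, and labels as they
# might come from a clustering algorithm (assigned by hand here).
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

def average_silhouette(X, labels):
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    s = np.zeros(n)
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()  # a(i): mean distance within own cluster
        # b(i): smallest mean distance to the members of any other cluster
        b = min(D[i, labels == c].mean() for c in set(labels) - {labels[i]})
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Well-separated clusters give an average silhouette close to +1.
print(round(average_silhouette(X, labels), 3))
```

In practice one would repeat the clustering for several values of k, compute the average silhouette for each, and keep the k with the largest value, as the slide above suggests.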
Common steps in cluster analysis
Obtain a final clustering solution.
 No single objective procedure is available to determine the
correct number of clusters; rather, the researcher must
evaluate alternative cluster solutions on the following
considerations to select the optimal solution.

 Single-member or extremely small clusters are generally not
acceptable and should be eliminated.

 For hierarchical methods, ad hoc stopping rules, based on the
rate of change in a total heterogeneity measure as the
number of clusters increases or decreases, are an indication
of the number of clusters.
Common steps in cluster analysis
Obtain a final clustering solution.

 All clusters should be significantly different across the set of
clustering variables.

 Cluster solutions ultimately must have theoretical validity
assessed through external validation.
Common steps in cluster analysis
Interpreting, profiling, and validating clusters.

 Once a cluster solution has been obtained, you must interpret
(and possibly name) the clusters.

 What do the observations in a cluster have in common?

 How do they differ from the observations in other clusters?

 This step is typically accomplished by obtaining summary
statistics for each variable by cluster.
Common steps in cluster analysis
Interpreting, profiling, and validating clusters.

 The cluster centroid, a mean profile of the cluster on each
clustering variable, is particularly useful in the interpretation
stage.

 Interpretation involves examining the distinguishing
characteristics of each cluster’s profile and identifying
substantial differences between clusters.
Common steps in cluster analysis
Interpreting, profiling, and validating clusters.

 Cluster solutions failing to show substantial variation indicate
that other cluster solutions should be examined.

 The cluster centroid should also be assessed for
correspondence with the researcher’s prior expectations
based on theory or practical experience.
Common steps in cluster analysis
Interpreting, profiling, and validating clusters.

 Validating the cluster solution involves asking the question
“Are these groupings in some sense real, and not a
manifestation of unique aspects of this dataset or statistical
technique?”

 If a different cluster method or a different sample were employed,
would the same clusters be obtained?
Common steps in cluster analysis
Interpreting, profiling, and validating clusters.
 Validation is essential in cluster analysis because the clusters
are descriptive of structure and require additional support for
their relevance:
 Cross-validation empirically validates a cluster solution by
creating two subsamples (randomly splitting the sample)
and then comparing the two cluster solutions for
consistency with respect to number of clusters and the
cluster profiles.
 Validation is also achieved by examining differences on
variables not included in the cluster analysis but for which a
theoretical and relevant reason enables the expectation of
variation across the clusters.
Example
to illustrate the cluster analysis
Example 1
The data set consists of chest, waist, and hip measurements for a
sample of 20 men and women. These measurements can be found in the
“Example_clusters_lecture.txt” file.
Conduct a simple cluster analysis and interpret its results.
