
Lectures 5 and 6 - Data Analysis in Management - MBM


Data Analysis in Management

Programme: Modern Business Management


2022/2023

Roman Huptas, Department of Statistics,
Cracow University of Economics
Lectures 5 and 6

Cluster analysis
Outline of the lecture

 Introduction to cluster analysis

 Procedure of cluster analysis

 Clustering methods (algorithms)

 Example of clustering

Learning objectives
Upon completing this lecture, you should be able to:

 define cluster analysis and its roles,

 identify the types of research questions addressed by cluster analysis,

 understand the key terms and basics of cluster analysis,

 understand the differences between hierarchical and nonhierarchical
clustering techniques,

 know how to interpret results from cluster analysis.


What is cluster analysis?
 When considering groups of objects in a multivariate data
set, two situations can arise.

 Given a data set containing measurements on individuals,
  in some cases we want to see if some natural groups or
classes of individuals exist, and
  in other cases, we want to classify the individuals according to
a set of existing groups.

 Cluster analysis develops tools and methods concerning
the former case.
What is cluster analysis?
 That is, given data containing multivariate measurements
on a large number of individuals (or objects), the
objective is to build some natural subgroups or clusters
of individuals.

 This is done by grouping individuals that are “similar”
according to some appropriate criterion.
What is cluster analysis?
 Cluster analysis is a group of multivariate
techniques whose primary purpose is to
group objects (e.g., respondents, products,
or other entities) based on the
characteristics they possess (on a set of
user-selected characteristics).
What is cluster analysis?
An example of a simple cluster analysis:
What is cluster analysis?
 Clustering refers to the grouping of records,
observations, or cases into classes of similar objects.

 Thus, cluster analysis groups individuals or objects into
clusters so that objects in the same cluster are more
similar to one another than they are to objects in other
clusters.
What is cluster analysis?
 A cluster is a collection of records that are similar to
one another and dissimilar to records in other clusters.

 Cluster analysis attempts to maximize the
homogeneity of objects within the clusters while also
maximizing the heterogeneity between the clusters.
What is cluster analysis?
 Cluster analysis has also been referred to as typology
construction, classification analysis, and numerical
taxonomy.

 This variety of names is due to the usage of clustering
methods in such diverse disciplines as psychology, biology,
sociology, economics, engineering, and business.

 Although the names differ across disciplines, the methods
all have a common dimension: classification according to
relationships among the objects being clustered.
What is cluster analysis?
 Clustering differs from classification in that there is no
target variable for clustering.

 The clustering task does not try to classify, estimate, or
predict the value of a target variable.

 Instead, clustering algorithms seek to segment the entire
data set into relatively homogeneous subgroups or
clusters, where the similarity of the records within the
cluster is maximized, and the similarity to records outside
this cluster is minimized.
What is cluster analysis?
 Clustering is a form of so-called unsupervised learning:
no predefined classes !!!

 “Learning” refers to the fact that the statistician calibrates
models to data or vice versa.

 The term “unsupervised” points to the act of creating a
model (here the separation of individuals into groups) out
of the observed data.
What is cluster analysis?
Cluster analysis is applied in many fields such as:
 economics,
 marketing,
 the natural sciences,
 the medical sciences,
 the biological and behavioral sciences.
What is cluster analysis?
Examples and applications of clustering include:
 Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
 Marketing researchers use cluster analysis as a customer-
segmentation strategy. Customers are arranged into
clusters based on the similarity of their demographics and
buying behaviors. Marketing campaigns are then tailored
to appeal to one or more of these subgroups.
 Target marketing of a niche product for a small-
capitalization business that does not have a large
marketing budget
What is cluster analysis?
Examples and applications of clustering include:
 In marketing, it is useful to build and describe the
different segments of a market from a survey on potential
consumers.
 For accounting auditing purposes, to segment financial
behavior into benign and suspicious categories
 Land use: Identification of areas of similar land use in an
earth observation database
 City-planning: Identifying groups of houses according to
their house type, value, and geographical location
What is cluster analysis?
Examples and applications of clustering include:
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
 An insurance company, on the other hand, might be
interested in the distinction among classes of potential
customers so that it can derive optimal prices for its
services.
 The classification of companies according to their
organizational structures, technologies, and types.
 Clustering of weblog data to discover groups of similar
access patterns.
What is cluster analysis?
Examples and applications of clustering include:
 As a dimension-reduction tool when a data set has
hundreds of attributes
 For gene expression clustering, where very large
quantities of genes may exhibit similar behavior. Medical
researchers use cluster analysis to help catalog gene-
expression patterns obtained from DNA microarray data.
This can help them to understand normal growth and
development and the underlying causes of many human
diseases.
What is cluster analysis?
 Clustering is often performed as a preliminary step in a
data mining process, with the resulting clusters being
used as further inputs into a different technique
downstream, such as neural networks.

 Due to the enormous size of many present-day databases,
it is often helpful to apply clustering analysis first, to
reduce the search space for the downstream algorithms.
What is cluster analysis?
Cluster analysis encounters the following
problems:
 How to measure similarity?
 How to recode categorical variables?
 How to standardize or normalize numerical
variables?
 How many clusters do we expect to uncover?
What is cluster analysis?
Cluster analysis can be divided into two fundamental steps:
 Choice of a proximity measure:
One checks each pair of observations (objects) for the
similarity of their values. A similarity (proximity) measure is
defined to measure the “closeness” of the objects. The
“closer” they are, the more homogeneous they are.

 Choice of group-building algorithm:
On the basis of the proximity measures, the objects are
assigned to groups so that differences between groups
become large and observations in a group become as close
as possible.
What is a good clustering? – quality of clustering
 A good clustering method will produce high quality clusters
with
 high intra-class similarity,
 low inter-class similarity.

 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.

 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Common steps in cluster analysis
An effective cluster analysis is a multistep process with
numerous decision points. Each decision can affect the quality
and usefulness of the results.
The typical steps in a comprehensive cluster analysis:
 Choose appropriate attributes.
 Scale the data.
 Select a similarity measure. Calculate distances.
 Select a clustering algorithm (procedure).
 Determine the number of clusters present.
 Obtain a final clustering solution.
 Visualize the results.
 Interpret the clusters.
 Validate the results.
Common steps in cluster analysis
Choose appropriate attributes. Selection of clustering variables.
 The first (and perhaps most important) step is to select
variables that you feel may be important for identifying and
understanding differences among groups of observations
within the data.
 The objectives of cluster analysis cannot be separated from
the selection of variables used to characterize the objects
being clustered.
 The researcher effectively constrains the possible results by
the variables selected for use. The derived clusters reflect the
inherent structure of the data and are defined only by the
variables.
Common steps in cluster analysis
Scale the data. Standardizing the variables.
 If the variables in the analysis vary in range, the variables with
the largest range will have the greatest impact on the results.

 This is often undesirable, and analysts scale the data before
continuing.

 For optimal performance, clustering algorithms require the
data to be normalized so that no particular variable or subset
of variables dominates the analysis.
Common steps in cluster analysis
Scale the data. Standardizing the variables.
 The most common form of standardization is the conversion
of each variable to standard scores (also known as Z scores) by
subtracting the mean and dividing by the standard deviation
for each variable. This option can be found in all computer
programs and many times is even directly included in the
cluster analysis procedure.

 Other alternatives include dividing each variable by its
maximum value or subtracting the variable’s mean and
dividing by the variable’s median absolute deviation.
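The z-score standardization described above can be sketched in a few lines of Python. NumPy is assumed here (the lecture does not prescribe a tool), and the data values are invented purely for illustration:

```python
import numpy as np

# Hypothetical data: 5 objects measured on 2 variables with very
# different ranges (e.g., income in thousands vs. a 1-5 rating).
X = np.array([[42.0, 3.1],
              [95.0, 1.2],
              [58.0, 4.5],
              [120.0, 2.0],
              [77.0, 3.8]])

# Z-score standardization: subtract each variable's mean and divide
# by its standard deviation, so no variable dominates by scale alone.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # each column now has mean ~0
print(Z.std(axis=0))   # and standard deviation 1
```

After this transformation every variable contributes on a comparable scale, which is exactly the property the clustering algorithms below rely on.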
Common steps in cluster analysis
Select a similarity measure. Calculate distances.
 The objective of clustering is to group similar objects
together. Some measure is needed to assess how similar or
different the objects are.

 Dissimilarity/Similarity metric: similarity is expressed in terms
of a distance function, which is typically metric.

 Although clustering algorithms vary widely, they typically
require a measure of the distance among the entities to be
clustered.
Common steps in cluster analysis
Select a similarity measure. Calculate distances.
 The starting point of a cluster analysis is a data matrix X
(n × p) with n measurements (objects) of p variables. The
similarity (proximity) among the objects is described by a matrix
D (n × n).
Common steps in cluster analysis
Select a similarity measure. Calculate distances.
 The matrix D contains measures of similarity or dissimilarity
among the n objects.

 The most commonly used measures of similarity are
distance measures.

 If the values dij are distances, then they measure dissimilarity.

 The greater the distance, the less similar the objects are.

 The matrix D is then called the distance matrix.
Common steps in cluster analysis
Select a similarity measure. Calculate distances.
 The most popular measure of the distance between two
observations is the Euclidean distance, but the Manhattan,
Canberra, asymmetric binary, maximum, and Minkowski
distance measures are also available.
 Euclidean distance is the most commonly recognized
measure of distance.
 The Euclidean distance between two observations is given by:

   d(i, j) = sqrt[ (x_i1 − x_j1)² + (x_i2 − x_j2)² + … + (x_iP − x_jP)² ]

where i and j are observations and P is the number of variables.


Common steps in cluster analysis
Select a similarity measure. Calculate distances.
 An example of how Euclidean distance is obtained in the case
of two dimensions is shown geometrically in the figure below:
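As a sketch of how the distance matrix D is built from the data matrix X, the following Python fragment computes all pairwise Euclidean distances. NumPy is assumed, and the numbers are invented for illustration:

```python
import numpy as np

# Hypothetical data matrix X: n = 4 objects, P = 2 variables.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [5.0, 1.0],
              [6.0, 3.0]])

n = X.shape[0]
D = np.zeros((n, n))  # the n x n distance matrix
for i in range(n):
    for j in range(n):
        # Euclidean distance: root of the summed squared differences.
        D[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))

print(np.round(D, 2))  # symmetric, with zeros on the diagonal
```

Since d(i, j) = d(j, i) and d(i, i) = 0, the matrix D is symmetric with a zero diagonal, so in practice only the lower (or upper) triangle needs to be stored.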
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 The two most popular clustering approaches are:
 hierarchical (agglomerative and divisive) clustering,
 non-hierarchical (partitioning) clustering.

 Hierarchical clustering is useful for smaller problems (say, 300
observations or less) and where a nested hierarchy of
groupings is desired.

 The partitioning method can handle much larger problems
but requires that the number of clusters be specified in
advance.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
Hierarchical clustering

 In hierarchical clustering, a treelike cluster structure
(dendrogram) is created through
  recursive partitioning (divisive algorithms/methods), or
  combining of existing clusters (agglomerative algorithms/methods).
Common steps in cluster analysis
Select a clustering algorithm (procedure).
Hierarchical clustering
 In agglomerative hierarchical clustering, each object or
observation starts as its own cluster. Then, in each subsequent
step, the two closest clusters are combined, and the process is
repeated until all clusters are merged into a single cluster.

 In this way, the number of clusters in the data set is reduced
by one at each step.

 Agglomerative clustering algorithm – from n clusters to 1 cluster.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
Hierarchical clustering
 Divisive clustering algorithms begin with all the records
in one big cluster, with the most dissimilar records being split
off recursively, into a separate cluster, until each record
represents its own cluster.

 Divisive clustering algorithm – from 1 cluster to n sub-clusters.

 NOTE! Most computer programs that apply
hierarchical clustering use agglomerative methods!
Common steps in cluster analysis
Select a clustering algorithm (procedure).
Graph illustrating hierarchical clustering – agglomerative
methods move from left to right, and divisive methods move
from right to left:
Common steps in cluster analysis
Select a clustering algorithm (procedure).
The agglomerative clustering algorithm is as follows:
1. Define each observation (row, case) as a cluster.
2. Calculate the distances between every cluster and every
other cluster.
3. Combine the two clusters that have the smallest distance.
This reduces the number of clusters by one.
4. Repeat steps 2 and 3 until all clusters have been merged into
a single cluster containing all observations.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 The primary difference among hierarchical clustering
algorithms is their definitions of cluster distances (step 2).

 There are many clustering algorithms to choose from.

 For hierarchical clustering, the most popular are:
  single linkage,
  complete linkage,
  average linkage,
  centroid,
  Ward’s method.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 Five of the most common hierarchical clustering methods and
their definitions of the distance between two clusters:
  Single linkage: the minimum distance between an observation
in one cluster and an observation in the other cluster.
  Complete linkage: the maximum distance between an observation
in one cluster and an observation in the other cluster.
  Average linkage: the average distance over all pairs of
observations, one from each of the two clusters.
  Centroid: the distance between the centroids (mean vectors)
of the two clusters.
  Ward’s method: the increase in the total within-cluster sum of
squares that results from merging the two clusters.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 Comparison of distance measures for single-linkage and
complete-linkage cluster methods
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 Average-linkage cluster method
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 Centroid cluster method
Common steps in cluster analysis
Select a clustering algorithm (procedure).
 Ward’s method differs from the previous methods in
that the similarity between two clusters is not a single
measure of similarity, but rather the sum of squares within the
clusters summed over all variables.
 At each step, the two clusters combined are those that
minimize the increase in the total sum of squares across all
variables in all clusters.
 This procedure tends to combine clusters with a small
number of observations, because the sum of squares is
directly related to the number of observations involved.
 Moreover, Ward’s method also tends to produce clusters
with approximately the same number of observations.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
Nonhierarchical clustering
 In contrast to hierarchical methods, nonhierarchical
procedures do not involve the treelike construction process.

 In the partitioning approach, you specify K: the
number of clusters. Observations are then randomly
divided into K groups and reshuffled to form cohesive
clusters.

 For partitioning clustering, the two most popular methods are
the k-means method and the partitioning around medoids (PAM)
method.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
Nonhierarchical clustering
 K-means clustering can handle larger datasets than
hierarchical cluster approaches.

 Additionally, observations are not permanently committed to
a cluster. They are moved when doing so improves the overall
solution.

 But the use of means implies that all variables must be
continuous.

 The approach can be severely affected by outliers.
Common steps in cluster analysis
Select a clustering algorithm (procedure).
The most common partitioning method is the k-means cluster
analysis. Conceptually, the k-means algorithm is as follows:
1. Select k centroids (k rows chosen at random).
2. Assign each data point to its closest centroid.
3. Recalculate the centroids as the average of all data points in
a cluster (that is, the centroids are p-length mean vectors,
where p is the number of variables).
4. Assign data points to their closest centroids.
5. Continue steps 3 and 4 until the observations are not
reassigned or the maximum number of iterations (R uses 10
as a default) is reached.
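The five k-means steps above can be sketched in Python. NumPy is assumed, the data are simulated purely for illustration, and the loop cap of 10 iterations mirrors the default the lecture notes for R:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: two well-separated groups of 20 points in 2-D.
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

k = 2
# Step 1: select k centroids (k rows chosen at random).
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(10):  # cap on iterations
    # Steps 2 and 4: assign each point to its closest centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Step 3: recalculate each centroid as the mean of its points.
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 5: stop once the centroids (and hence assignments) settle.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)
```

Because the two simulated groups are far apart relative to their spread, the algorithm settles quickly on one centroid per group; with less separated data the result can depend on the random starting centroids, which is why implementations typically rerun the algorithm from several starts.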
Common steps in cluster analysis
Choice of the number of clusters k in k-means method.
The Silhouette coefficient

 Kaufman and Rousseeuw (1990)

 A measure of the quality of the k-means clustering method that is
independent of k.

 The silhouette coefficient is a measure of how similar an
object is to its own cluster (cohesion) compared to other
clusters (separation).
Common steps in cluster analysis
Choice of the number of clusters k in k-means method.
The Silhouette coefficient
 For the i-th object, the value of the silhouette coefficient is
defined as:

   s(i) = (b(i) − a(i)) / max{a(i), b(i)}

 a(i) is the average distance between the i-th object and all
other objects in the same cluster,
 b(i) is the smallest average distance between the i-th object
and all the objects in any other cluster not containing this
object.
Common steps in cluster analysis
Choice of the number of clusters k in k-means method.
The Silhouette coefficient
 The silhouette coefficient ranges from −1 to +1.

 A high value indicates that the object is well matched to its
own cluster and poorly matched to neighbouring clusters.

 If s(i) is close to 1, it means that the object is appropriately
clustered.
 If s(i) is close to −1, it means that the object would fit better in its
neighbouring cluster.
 If s(i) is close to 0, it means that the object is between two
clusters.
Common steps in cluster analysis
Choice of the number of clusters k in k-means method.
The Silhouette coefficient
 The average s(i) over all objects (points) of a cluster is a
measure of how tightly grouped all the objects (points) in the
cluster are.

 An overall measure of the goodness of clustering can be
obtained by computing the average silhouette coefficient of all
objects (points).
Common steps in cluster analysis
Choice of the number of clusters k in k-means method.
The Silhouette coefficient

 Thus, the average s(i) over all data of the entire dataset is a
measure of how appropriately the data have been clustered.

 The silhouette coefficient can thus be adopted to determine the
best clustering scheme, i.e., the one with the maximum average
silhouette coefficient.
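The definitions of a(i), b(i), and s(i) above can be sketched as follows. NumPy is assumed, and the toy data and hand-assigned labels are illustrative only:

```python
import numpy as np

# Toy data with an obvious two-cluster structure, and labels as they
# might come from a clustering algorithm (assigned by hand here).
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

def average_silhouette(X, labels):
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    s = np.zeros(n)
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()  # a(i): mean distance within own cluster
        # b(i): smallest mean distance to the members of any other cluster
        b = min(D[i, labels == c].mean() for c in set(labels) - {labels[i]})
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Well-separated clusters give an average silhouette close to +1.
print(round(average_silhouette(X, labels), 3))
```

In practice one would repeat the clustering for several values of k, compute the average silhouette for each, and keep the k with the largest value, as the slide above suggests.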
Common steps in cluster analysis
Obtain a final clustering solution.
 No single objective procedure is available to determine the
correct number of clusters; rather, the researcher must
evaluate alternative cluster solutions on the following
considerations to select the optimal solution.

 Single-member or extremely small clusters are generally not
acceptable and should be eliminated.

 For hierarchical methods, ad hoc stopping rules, based on the
rate of change in a total heterogeneity measure as the
number of clusters increases or decreases, are an indication
of the number of clusters.
Common steps in cluster analysis
Obtain a final clustering solution.

 All clusters should be significantly different across the set of
clustering variables.

 Cluster solutions ultimately must have theoretical validity
assessed through external validation.
Common steps in cluster analysis
Interpreting, profiling, and validating clusters.

 Once a cluster solution has been obtained, you must interpret
(and possibly name) the clusters.

 What do the observations in a cluster have in common?

 How do they differ from the observations in other clusters?

 This step is typically accomplished by obtaining summary
statistics for each variable by cluster.
Common steps in cluster analysis
Interpreting, profiling, and validating clusters.

 The cluster centroid, a mean profile of the cluster on each
clustering variable, is particularly useful in the interpretation
stage.

 Interpretation involves examining the distinguishing
characteristics of each cluster’s profile and identifying
substantial differences between clusters.
Common steps in cluster analysis
Interpreting, profiling, and validating clusters.

 Cluster solutions failing to show substantial variation indicate
that other cluster solutions should be examined.

 The cluster centroid should also be assessed for
correspondence with the researcher’s prior expectations
based on theory or practical experience.
Common steps in cluster analysis
Interpreting, profiling, and validating clusters.

 Validating the cluster solution involves asking the question
“Are these groupings in some sense real, and not a
manifestation of unique aspects of this dataset or statistical
technique?”

 If a different cluster method or a different sample were employed,
would the same clusters be obtained?
Common steps in cluster analysis
Interpreting, profiling, and validating clusters.
 Validation is essential in cluster analysis because the clusters
are descriptive of structure and require additional support for
their relevance:
 Cross-validation empirically validates a cluster solution by
creating two subsamples (randomly splitting the sample)
and then comparing the two cluster solutions for
consistency with respect to number of clusters and the
cluster profiles.
 Validation is also achieved by examining differences on
variables not included in the cluster analysis but for which a
theoretical and relevant reason enables the expectation of
variation across the clusters.
Example
to illustrate the cluster analysis
Example 1
The data set consists of chest, waist, and hip measurements for a
sample of 20 men and women. These measurements can be found in the
“Example_clusters_lecture.txt” file.
Conduct a simple cluster analysis and interpret its results.
