Cluster
Advanced Statistics
Universidad de Deusto
1 Introduction
2 Ward’s algorithm
3 K-means algorithm
4 Real practice
5 Implementation in R
1 Introduction
The most typical way of representing a set of objects is a cloud of points (each point is an object) lying in a Euclidean space.
Euclidean refers to the fact that the distances between points are interpreted in terms of
similarities for the individuals (PCA) or categories (CA).
Another way of representing a set of objects and illustrating the links (similarities) between them is with a hierarchical tree.
In this unit (as in PCA) we try to analyse a data table without prior judgements.
The aim is to construct a hierarchical tree (rather than a principal component map) to visualise the links between objects, that is, to study the variability within the table.
The algorithms used to construct such trees are known as agglomerative hierarchical clustering (AHC) algorithms.
We will study one of the most widely used of these AHCs: Ward's algorithm.
Another way of representing the links between objects is the partition obtained by dividing the objects into groups (each object belongs to exactly one group).
However, there exist many other distances apart from the Euclidean one.
Manhattan distance (city block distance),
$d(i, j) = \sum_{k=1}^{K} |x_{ik} - x_{jk}|$
Example: three individuals described by three variables, with the corresponding Euclidean and Manhattan distance matrices.

Data:
   V1 V2 V3
a   1  1  3
b   1  1  1
c   2  2  2

Euclidean distances:
    a   b   c
a   0
b   2   0
c  √3  √3   0

Manhattan distances:
    a   b   c
a   0
b   2   0
c   3   3   0
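As a quick check, both distance matrices above can be reproduced with base R's dist() function (a minimal sketch; the object name X.toy is illustrative):

# Toy data table from the example above
X.toy <- rbind(a = c(1, 1, 3),
               b = c(1, 1, 1),
               c = c(2, 2, 2))
colnames(X.toy) <- c("V1", "V2", "V3")

dist(X.toy, method = "euclidean")  # d(a,b) = 2, d(a,c) = d(b,c) = sqrt(3)
dist(X.toy, method = "manhattan")  # d(a,b) = 2, d(a,c) = d(b,c) = 3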
2 Ward's algorithm
1- Create the matrix D whose general term d(i, l) indicates the dissimilarity between individuals i and l.
I It is a symmetric matrix with 0s on the diagonal.
2- Look for the two closest elements i and l in D and agglomerate them into a new element (i, l).
3- Update D to D(1) by deleting the rows and columns of i and l and adding a new row and column for the pair (i, l).
4- Look for the closest elements in D(1), agglomerate them, and so on.
The points where the branches corresponding to the elements being grouped meet are known as nodes.
The individuals being classified are referred to as leaf nodes.
By tracing a horizontal line at a given index, a partition is defined.
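In R, the whole agglomerative procedure and the horizontal cut can be sketched with base functions (a sketch assuming X is an individuals × variables table and using Ward's criterion, described next; object names are illustrative):

D    <- dist(scale(X))                 # step 1: dissimilarity matrix on standardised data
tree <- hclust(D, method = "ward.D2")  # steps 2-4: successive agglomerations (Ward's criterion)

plot(tree)                   # dendrogram: leaves are the individuals, junctions are the nodes
part <- cutree(tree, k = 3)  # horizontal cut defining a 3-cluster partition
table(part)                  # cluster sizes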
This agglomerative method consists, at each stage of the process, of regrouping the two elements that maximise the quality of the resulting partition.
A partition is said to be of high quality when
I individuals within a cluster are homogeneous (small within-cluster variability);
I individuals differ from one cluster to the next (high between-cluster variability).
The already mentioned Huygens' theorem provides the framework for this analysis:

total inertia = between-cluster inertia + within-cluster inertia.

If we use this decomposition as a framework for the analysis, then when assessing the quality of a partition, maximising the between-cluster variability is equivalent to minimising the within-cluster variability (the total variability is the same).
Partition quality can be measured by the ratio

between-cluster inertia / total inertia

(the percentage of variability attributed to the partition).
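The decomposition and the quality measure can be checked numerically; a minimal sketch, reusing the standardised table and the partition part from the previous sketch (the 1/I weighting is omitted since it cancels in the ratio):

X.std <- scale(X)                          # centred, standardised cloud of individuals

total.inertia <- sum(X.std^2)              # total variability (up to a constant factor)

# within-cluster inertia: squared distances of individuals to their cluster centroid
centroids      <- apply(X.std, 2, function(v) tapply(v, part, mean))
within.inertia <- sum((X.std - centroids[part, ])^2)

# Huygens: total = between + within
between.inertia <- total.inertia - within.inertia
between.inertia / total.inertia            # quality: share of variability due to the partition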
Defining a methodology that, in the agglomerative process, tries to minimise the within-cluster inertia means that at each step we will try to agglomerate the two clusters whose merger causes the smallest increase in within-cluster inertia, i.e. the two most homogeneous clusters.
Thus an indexed hierarchy proposes a decomposition of the total inertia (the variability of the data) and fits into the overall approach of principal component methods.
The difference is that the decomposition is conducted by clusters in one case and by
components in the other.
A hierarchy is extremely useful for justifying the choice of a partition, as we can account for the percentage of variability explained by the clusters.
3 K-means algorithm
The data are the same as for principal components methods: individuals × variables table and a
Euclidean distance.
Indexed hierarchies are often used as tools for obtaining partitions. Would there not be a
number of advantages in searching for a partition directly?
When dealing with a great number of individuals, the calculation time required to construct an indexed hierarchy can be very long. Might we not achieve shorter calculation times with algorithms that search for a partition directly?
Although there are many partitioning algorithms, we will only explain one, the
K-means algorithm.
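As an illustration of what such a partitioning algorithm produces, base R's kmeans() can be applied to a standardised individuals × variables table X (a sketch; the choice of 4 clusters is arbitrary):

set.seed(123)                                     # K-means starts from random centres
km <- kmeans(scale(X), centers = 4, nstart = 25)  # 25 random starts, keep the best solution

km$cluster                # the partition: one cluster label per individual
km$betweenss / km$totss   # quality: share of the variability explained by the partition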
4 Real practice
This is the origin of the idea of combining the two approaches to obtain a methodology that includes the advantages of each.
When there are too many individuals to conduct an Agglomerative Hierarchical Clustering (AHC) directly, the following two-phase methodology can be implemented: first, partition the individuals into a large number of small clusters with the K-means algorithm; then, apply the AHC to the resulting cluster centres.
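HCPC automates these two phases through its kk argument (see the note in the last section); by hand, and for a large illustrative table X, the idea can be sketched as:

set.seed(123)
# Phase 1: compress the data into many small homogeneous clusters with K-means
km <- kmeans(scale(X), centers = 100, nstart = 10)

# Phase 2: AHC (Ward) on the cluster centres, weighted by the cluster sizes
tree <- hclust(dist(km$centers), method = "ward.D2", members = km$size)
plot(tree)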
Clustering and principal component methods use similar approaches: the exploratory analysis of the same kind of data table.
They differ, however, in their representation methods (Euclidean clouds versus indexed hierarchies or partitions).
We can therefore combine both approaches to obtain a richer methodology.
Let us consider a table X (of dimensions I × K) whose rows we want to classify (see the sketch after the list):
1. We perform a principal component method on X (PCA or CA).
2. We retain the components that account for a high percentage of the inertia (80% or 90%) and that we know how to interpret.
3. We create a table F with the coordinates of the individuals on those components (if we had included all the components, F and X would be equivalent, as they define the same distances between individuals).
4. We apply the AHC to table F.
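With FactoMineR, steps 1 to 4 can be sketched as follows (keeping 5 components is only an example; HCPC, used in the next section, chains these steps automatically):

library(FactoMineR)

res.pc  <- PCA(X, ncp = 5, graph = FALSE)             # steps 1-2: PCA keeping 5 components
F.coord <- res.pc$ind$coord                           # step 3: table F of individual coordinates
tree    <- hclust(dist(F.coord), method = "ward.D2")  # step 4: AHC (Ward) applied to F
plot(tree)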
5 Implementation in R
We will be using the Violent Crime Rates by US State dataset analysed in the PCA unit.
Once the clusters have been defined, it is important to describe them using variables or specific
individuals.
We will use the Euclidean distance, so the most suitable option (in most cases), as done in PCA, is to standardise the variables.
We will use the results of the PCA in order to build clusters of states.
We import the dataset and perform the PCA; we use ncp=Inf in order to specify that we will retain all the components in the clustering analysis.
> library(FactoMineR)
> library(tidyverse)
> data("USArrests") # Read the data
> pca.USA <- PCA(USArrests, ncp = Inf, graph = FALSE) # standardised PCA keeping all components (object name pca.USA is illustrative)
Note: As commented in the unit, if the number of individuals is large, it is possible to create clusters with the K-means algorithm before constructing the agglomerative hierarchical clustering (kk parameter of the HCPC function).
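The clustering itself is obtained by calling HCPC on the PCA result; a minimal call consistent with the output shown below (nb.clust = 4 fixes four clusters, while nb.clust = -1 would let the function cut the tree automatically):

> hcpc.USA <- HCPC(pca.USA, nb.clust = 4, graph = FALSE)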
The object data.clust gives back the original table together with the cluster assignment of each state.
> head(hcpc.USA$data.clust,n = 10)
Murder Assault UrbanPop Rape clust
Alabama 13.2 236 58 21.2 3
Alaska 10.0 263 48 44.5 4
Arizona 8.1 294 80 31.0 4
Arkansas 8.8 190 50 19.5 3
California 9.0 276 91 40.6 4
Colorado 7.9 204 78 38.7 4
Connecticut 3.3 110 77 11.1 2
Delaware 5.9 238 72 15.8 2
Florida 15.4 335 80 31.9 4
Georgia 17.4 211 60 25.8 3
> hcpc.USA$call$t$tree
Call:
flashClust::hclust(d = dissi, method = method, members = weight)
Remember, the first dimension is a scale of criminality and the second dimension is defined as a scale of urban population.
The HCPC function generates three different graphs by default. Different graphs can be obtained with the choice parameter of the plot function:
choice="tree"
choice="bar"
choice="3D.map"
choice="map"
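In the current session these graphs can be requested explicitly; the comments summarise what each option draws:

> plot(hcpc.USA, choice = "tree")    # dendrogram
> plot(hcpc.USA, choice = "bar")     # bar chart of the inertia gains
> plot(hcpc.USA, choice = "map")     # factor map with individuals coloured by cluster
> plot(hcpc.USA, choice = "3D.map")  # dendrogram drawn on top of the factor map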
It can also be interesting to illustrate each cluster by the individuals specific to that class.
We can calculate:
paragon individuals: those that are closest to the centre of their cluster.
Individuals are sorted by cluster, together with the distance between each individual and the centre of its class.
> hcpc.USA$desc.ind$para
South Dakota is the state that best represents the states in cluster 1, while Oklahoma, Alabama and Michigan are the paragons of clusters 2, 3 and 4, respectively.
specific individuals: those furthest from the centres of the other clusters.
Individuals are sorted by cluster, together with the distance between each individual and the closest centre among the other clusters.
> hcpc.USA$desc.ind$dist
Vermont is specific to cluster 1 because it is the state furthest from the centres of clusters 2, 3 and 4, so we can consider it to be the most specific to cluster 1. Rhode Island, Mississippi and Nevada are the most specific to clusters 2, 3 and 4, respectively.