Cluster
Advanced Statistics
Universidad de Deusto
1 Introduction
2 Ward’s algorithm
3 K-means algorithm
4 Real practice
5 Implementation in R
1 Introduction
The most typical way of representing a set of objects is a cloud of points (each point is an object) lying in a Euclidean space.
Euclidean refers to the fact that the distances between points are interpreted in terms of
similarities for the individuals (PCA) or categories (CA).
Another way of representing a set of objects and illustrating the links (similarities) between them is with a hierarchical tree.
In this unit (as in PCA) we try to analyse a data table without prior judgements.
The aim is to construct a hierarchical tree (rather than a principal component map) to visualise the links between objects, that is, to study the variability within the table.
The algorithms used to construct such trees are known as agglomerative hierarchical clustering (AHC) algorithms.
We will study one of the most widely used of these AHCs: Ward's algorithm.
Another way of representing the links between objects is the partition obtained by dividing the objects into groups (each object belongs to exactly one group).
However, there exist many other distances apart from the Euclidean one.
Manhattan distance (city block distance),
$d(i, j) = \sum_{k=1}^{K} |x_{ik} - x_{jk}|$
Example: three individuals described by three variables, with the corresponding Euclidean and Manhattan distance matrices.

Data:
   V1 V2 V3
a   1  1  3
b   1  1  1
c   2  2  2

Euclidean distances:
    a   b   c
a   0
b   2   0
c  √3  √3   0

Manhattan distances:
    a   b   c
a   0
b   2   0
c   3   3   0
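As a quick check, both distance matrices above can be reproduced with base R's dist() function (a minimal sketch; the object name X.toy is illustrative):

# Toy data table from the example above
X.toy <- rbind(a = c(1, 1, 3),
               b = c(1, 1, 1),
               c = c(2, 2, 2))
colnames(X.toy) <- c("V1", "V2", "V3")

dist(X.toy, method = "euclidean")  # d(a,b) = 2, d(a,c) = d(b,c) = sqrt(3)
dist(X.toy, method = "manhattan")  # d(a,b) = 2, d(a,c) = d(b,c) = 3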
2 Ward's algorithm
1- Create the matrix D whose general term d(i, l) indicates the dissimilarity between individuals i and l.
I It is a symmetric matrix with 0s on the diagonal.
2- Look for the two closest elements i and l in D and agglomerate them into a new element (i, l).
3- Update D to D(1) by deleting the rows and columns of i and l and adding a new row and column for the pair (i, l).
4- Look for the closest elements in D(1), agglomerate them, and so on.
The points where the branches corresponding to the elements being grouped meet are known as nodes.
The individuals being classified are referred to as leaf nodes.
By tracing a horizontal line at a given index, a partition is defined.
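In R, the whole agglomerative procedure and the horizontal cut can be sketched with base functions (a sketch assuming X is an individuals × variables table and using Ward's criterion, described next; object names are illustrative):

D    <- dist(scale(X))                 # step 1: dissimilarity matrix on standardised data
tree <- hclust(D, method = "ward.D2")  # steps 2-4: successive agglomerations (Ward's criterion)

plot(tree)                   # dendrogram: leaves are the individuals, junctions are the nodes
part <- cutree(tree, k = 3)  # horizontal cut defining a 3-cluster partition
table(part)                  # cluster sizes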
This agglomerative method consists, at each stage of the process, of regrouping the two elements that maximise the quality of the resulting partition.
A partition is said to be of high quality when
I individuals within a cluster are homogeneous (small within-cluster variability);
I individuals differ from one cluster to the next (high between-cluster variability).
The already mentioned Huygens' theorem provides the framework for this analysis:

total inertia = between-cluster inertia + within-cluster inertia.

If we use this decomposition as a framework for the analysis, then when assessing the quality of a partition, maximising the between-cluster variability is equivalent to minimising the within-cluster variability (the total variability is the same).
Partition quality can be measured by the ratio

between-cluster inertia / total inertia

(the percentage of variability attributed to the partition).
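The decomposition and the quality measure can be checked numerically; a minimal sketch, reusing the standardised table and the partition part from the previous sketch (the 1/I weighting is omitted since it cancels in the ratio):

X.std <- scale(X)                          # centred, standardised cloud of individuals

total.inertia <- sum(X.std^2)              # total variability (up to a constant factor)

# within-cluster inertia: squared distances of individuals to their cluster centroid
centroids      <- apply(X.std, 2, function(v) tapply(v, part, mean))
within.inertia <- sum((X.std - centroids[part, ])^2)

# Huygens: total = between + within
between.inertia <- total.inertia - within.inertia
between.inertia / total.inertia            # quality: share of variability due to the partition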
Defining a methodology that, in the agglomerative process, tries to minimise the within-cluster inertia means that at each step we will try to agglomerate the two clusters whose merger causes the smallest increase in within-cluster inertia, i.e. the two most homogeneous clusters.
Thus an indexed hierarchy proposes a decomposition of the total inertia (the variability of the data) and fits into the overall approach of principal component methods.
The difference is that the decomposition is conducted by clusters in one case and by
components in the other.
A hierarchy is extremely useful for justifying the choice of a partition, as we can account for the percentage of variability explained by the clusters.
3 K-means algorithm
The data are the same as for principal components methods: individuals × variables table and a
Euclidean distance.
Indexed hierarchies are often used as tools for obtaining partitions. Would there not be a
number of advantages in searching for a partition directly?
When dealing with a great number of individuals, the calculation time required to construct an indexed hierarchy can be very long. Might we not achieve shorter calculation times with algorithms that search for a partition directly?
Although there are many partitioning algorithms, we will only explain one, the
K-means algorithm.
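As an illustration of what such a partitioning algorithm produces, base R's kmeans() can be applied to a standardised individuals × variables table X (a sketch; the choice of 4 clusters is arbitrary):

set.seed(123)                                     # K-means starts from random centres
km <- kmeans(scale(X), centers = 4, nstart = 25)  # 25 random starts, keep the best solution

km$cluster                # the partition: one cluster label per individual
km$betweenss / km$totss   # quality: share of the variability explained by the partition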
4 Real practice
This is the origin of the idea of combining the two approaches to obtain a methodology that includes the advantages of each.
When there are too many individuals to conduct an Agglomerative Hierarchical Clustering (AHC) directly, the following two-phase methodology can be implemented: first, partition the individuals into a large number of small clusters with the K-means algorithm; then, apply the AHC to the resulting cluster centres.
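HCPC automates these two phases through its kk argument (see the note in the last section); by hand, and for a large illustrative table X, the idea can be sketched as:

set.seed(123)
# Phase 1: compress the data into many small homogeneous clusters with K-means
km <- kmeans(scale(X), centers = 100, nstart = 10)

# Phase 2: AHC (Ward) on the cluster centres, weighted by the cluster sizes
tree <- hclust(dist(km$centers), method = "ward.D2", members = km$size)
plot(tree)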
Clustering and principal component methods use similar approaches: the exploratory analysis of the same kind of data table.
They differ, however, in their representation methods (Euclidean clouds versus indexed hierarchies or partitions).
We can therefore combine both approaches to obtain a richer methodology.
Let us consider a table X (of dimensions I × K) whose rows we want to classify (see the sketch after the list):
1. We perform a principal component method on X (PCA or CA).
2. We retain the components that account for a high percentage of the inertia (80% or 90%) and that we know how to interpret.
3. We create a table F with the coordinates of the individuals on those components (if we had included all the components, F and X would be equivalent, as they define the same distances between individuals).
4. We apply the AHC to table F.
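With FactoMineR, steps 1 to 4 can be sketched as follows (keeping 5 components is only an example; HCPC, used in the next section, chains these steps automatically):

library(FactoMineR)

res.pc  <- PCA(X, ncp = 5, graph = FALSE)             # steps 1-2: PCA keeping 5 components
F.coord <- res.pc$ind$coord                           # step 3: table F of individual coordinates
tree    <- hclust(dist(F.coord), method = "ward.D2")  # step 4: AHC (Ward) applied to F
plot(tree)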
5 Implementation in R
We will be using the Violent Crime Rates by US State dataset analysed in the PCA unit.
Once the clusters have been defined, it is important to describe them using variables or specific
individuals.
We will use the Euclidean distance, so the most suitable option (in most cases), as done in PCA, is to standardise the variables.
We will use the results of the PCA in order to build clusters of states.
We import the dataset and perform the PCA; we use ncp=Inf in order to specify that we will retain all the components in the clustering analysis.
> library(FactoMineR)
> library(tidyverse)
> data("USArrests") # Read the data
> pca.USA <- PCA(USArrests, ncp = Inf, graph = FALSE) # standardised PCA keeping all components (object name pca.USA is illustrative)
Note: As commented in the unit, if the number of individuals is large, it is possible to create clusters with the K-means algorithm before constructing the agglomerative hierarchical clustering (kk parameter of the HCPC function).
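The clustering itself is obtained by calling HCPC on the PCA result; a minimal call consistent with the output shown below (nb.clust = 4 fixes four clusters, while nb.clust = -1 would let the function cut the tree automatically):

> hcpc.USA <- HCPC(pca.USA, nb.clust = 4, graph = FALSE)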
The object data.clust gives back the original table together with the cluster assignment of each state.
> head(hcpc.USA$data.clust,n = 10)
Murder Assault UrbanPop Rape clust
Alabama 13.2 236 58 21.2 3
Alaska 10.0 263 48 44.5 4
Arizona 8.1 294 80 31.0 4
Arkansas 8.8 190 50 19.5 3
California 9.0 276 91 40.6 4
Colorado 7.9 204 78 38.7 4
Connecticut 3.3 110 77 11.1 2
Delaware 5.9 238 72 15.8 2
Florida 15.4 335 80 31.9 4
Georgia 17.4 211 60 25.8 3
> hcpc.USA$call$t$tree
Call:
flashClust::hclust(d = dissi, method = method, members = weight)
Remember, the first dimension is a scale of criminality and the second dimension is defined as a scale of urban population.
The HCPC function generates three different graphs by default. Different graphs can be obtained with the choice parameter of the plot function:
choice="tree"
choice="bar"
choice="3D.map"
choice="map"
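In the current session these graphs can be requested explicitly; the comments summarise what each option draws:

> plot(hcpc.USA, choice = "tree")    # dendrogram
> plot(hcpc.USA, choice = "bar")     # bar chart of the inertia gains
> plot(hcpc.USA, choice = "map")     # factor map with individuals coloured by cluster
> plot(hcpc.USA, choice = "3D.map")  # dendrogram drawn on top of the factor map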
It can also be interesting to illustrate each cluster by the individuals specific to that class.
We can calculate:
paragon individuals: those that are closest to the centre of their cluster.
Individuals are sorted by cluster, together with the distance between each individual and the centre of its class.
> hcpc.USA$desc.ind$para
South Dakota is the state that best represents the states in cluster 1, while Oklahoma, Alabama and Michigan are the paragons of clusters 2, 3 and 4, respectively.
specific individuals: those furthest from the centres of the other clusters.
Individuals are sorted by cluster, together with the distance between each individual and the closest centre among the other clusters.
> hcpc.USA$desc.ind$dist
Vermont is specific to cluster 1 because it is the state furthest from the centres of clusters 2, 3 and 4, so we can consider it to be the most specific to cluster 1. Rhode Island, Mississippi and Nevada are the most specific to clusters 2, 3 and 4, respectively.