Clustering and Visualisation of Data - 2020
Hiroshi Shimodaira∗
January-March 2020
Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances
between data points. In some cases the aim of cluster analysis is to obtain greater understanding of the
data, and it is hoped that the clusters capture the natural structure of the data. In other cases cluster
analysis does not necessarily add to understanding of the data, but enables it to be processed more
efficiently.
Cluster analysis can be contrasted with classification. Classification is a supervised learning process:
there is a training set in which each data item has a label. For example, in handwriting recognition the
training set may correspond to a set of images of handwritten digits, together with a label (‘zero’ to
‘nine’) for each digit. In the test set the label for each image is unknown: it is the job of the classifier to
predict a label for each test item.
Clustering, on the other hand, is an unsupervised procedure in which the training set does not contain
any labels. The aim of a clustering algorithm is to group such a data set into clusters, based on the
unlabelled data alone. In many situations there is no ‘true’ set of clusters. For example consider
the twenty data points shown in Figure 3.1 (a). It is reasonable to divide this set into two clusters
(Figure 3.1 (b)), four clusters (Figure 3.1 (c)) or five clusters (Figure 3.1 (d)).
There are many reasons to perform clustering. Most commonly it is done to better understand the data
(data interpretation), or to efficiently code the data set (data compression).
Data interpretation: Automatically dividing a set of data items into groups is an important way to
analyse and describe the world. Automatic clustering has been used to cluster documents (such as
web pages), user preference data (of the type discussed in the previous chapter), and many forms of
scientific observational data in fields ranging from astronomy to psychology to biology.
Data compression: Clustering may be used to compress data by representing each data item in a
cluster by a single cluster prototype, typically at the centre of the cluster. Consider D-dimensional data
which has been clustered into K clusters. Rather than representing a data item as a D-dimensional
vector, we could store just its cluster index (an integer from 1 to K). This representation, known as
vector quantisation, reduces the required storage for a large data set at the cost of some information
loss. Vector quantisation is used in image, video and audio compression.

Figure 3.1: Clustering a set of 20 two-dimensional data points: (a) original data; (b) two clusters; (c) four clusters; (d) five clusters.
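As a minimal sketch of vector quantisation (illustrative Python/numpy code; the function names here are not from these notes), each data item is encoded as the index of its nearest cluster centre, and decoded, with some loss, as that centre:

import numpy as np

def vq_encode(X, centres):
    """Replace each row of X by the index of its nearest cluster centre."""
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return dists.argmin(axis=1)          # one integer per item (0..K-1 here, rather than 1..K)

def vq_decode(codes, centres):
    """Reconstruct each item as the prototype of its cluster (lossy)."""
    return centres[codes]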
There are two main approaches to clustering: hierarchical and partitional. Hierarchical clustering
forms a tree of nested clusters in which, at each level in the tree, a cluster is the union of its children.
∗ © 2014-2020 University of Edinburgh. All rights reserved. This note is heavily based on notes inherited from Steve Renals and Iain Murray.
Partitional clustering does not have a nested or hierarchical structure: it simply divides the data set
into a fixed number of non-overlapping clusters, with each data point assigned to exactly one cluster.
The most commonly employed partitional clustering algorithm, K-means clustering, is discussed
below. Whatever approach to clustering is employed, the core operations are distance computations:
computing the distance between two data points, between a data point and a cluster prototype, or
between two cluster prototypes.
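For instance, assuming the data points and cluster prototypes are stored as numpy arrays (an illustrative sketch with made-up coordinates, not code from these notes), all three kinds of distance reduce to the same Euclidean computation:

import numpy as np

x = np.array([2.0, 9.0])          # a data point (made-up coordinates)
y = np.array([7.0, 8.0])          # another data point
c1 = np.array([4.0, 10.0])        # a cluster prototype (centre)
c2 = np.array([8.0, 4.0])         # another cluster prototype

d_point_point = np.linalg.norm(x - y)     # between two data points
d_point_proto = np.linalg.norm(x - c1)    # between a data point and a prototype
d_proto_proto = np.linalg.norm(c1 - c2)   # between two cluster prototypes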
There are two main approaches to hierarchical clustering. In top-down clustering algorithms, all the data points are initially collected in a single top-level cluster. This cluster is then split into two (or more) sub-clusters, and each of these sub-clusters is further split. The algorithm continues to build a tree structure in a top-down fashion, until the leaves of the tree contain individual data points. An alternative approach is agglomerative hierarchical clustering, which acts in a bottom-up way. An agglomerative clustering algorithm starts with each data point defining a one-element cluster. Such an algorithm operates by repeatedly merging the two closest clusters until a single cluster is obtained.
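The following minimal Python/numpy sketch (illustrative code, not part of the original notes) makes the bottom-up procedure concrete. It measures the distance between two clusters by the distance between their centroids, which is just one of several possible choices, and it stops when a target number of clusters remains rather than recording the full tree:

import numpy as np

def agglomerative(X, target_K=1):
    """Bottom-up clustering: start with one cluster per data point and repeatedly
    merge the two clusters whose centroids are closest, until target_K remain."""
    X = np.asarray(X, dtype=float)
    clusters = [[n] for n in range(len(X))]          # each point is its own cluster
    while len(clusters) > target_K:
        centroids = np.array([X[c].mean(axis=0) for c in clusters])
        best = None                                  # (distance, i, j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]      # merge cluster j into cluster i
        del clusters[j]
    return clusters                                  # lists of data-point indices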
3.2 K-means clustering
K-means clustering aims to divide a set of D-dimensional data points into K clusters. The number of clusters, K, must be specified; it is not determined by the clustering: the algorithm will thus always attempt to find K clusters in the data, whether they really exist or not.
Each cluster is defined by its cluster centre, and clustering proceeds by assigning each of the input
data points to the cluster with the closest centre, using a Euclidean distance metric. The centre of each
cluster is then re-estimated as the centroid of the points assigned to it. The process is then iterated.
The algorithm is:

– Assign each data vector x_n (1 ≤ n ≤ N) to the closest cluster centre;
– Recompute each cluster mean as the mean of the vectors assigned to that cluster.
The algorithm requires a distance measure to be defined in the data space, and the Euclidean distance is often used.
The initialisation method needs to be further specified. There are several possible ways to initialise the cluster centres, including:
• Choose random data points as cluster centres
• Randomly assign data points to K clusters and compute means as initial centres
• Choose data points with extreme values
• Find the mean for the whole data set then perturb into K means
All of these work reasonably well, and there is no ‘best’ way. However, as discussed below, the initialisation
has an effect on the final clustering: different initialisations lead to different cluster solutions.
The algorithm iterates until it converges. Convergence is reached when the assignment of points to clusters does not change after an iteration. An attractive feature of K-means is that convergence is guaranteed: the algorithm always terminates after a finite number of iterations.

Figure 3.2: Example of the K-means algorithm applied to 14 data points, K = 3: (a) initialisation; (b) after one iteration; (c) after two iterations. The lines indicate the distances from each point to the centre of the cluster to which it is assigned. Here only one point (6,6) moves cluster after updating the means. In general, multiple points can be reassigned after each update of the centre positions.
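A minimal Python/numpy sketch of this procedure is given below (illustrative code; the function name kmeans and its interface are not part of these notes). It initialises the centres with randomly chosen data points, one of the options listed above, and iterates until the assignment of points to clusters no longer changes:

import numpy as np

def kmeans(X, K, seed=0):
    """Cluster the rows of the N x D array X into K clusters."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialise by choosing K distinct data points as the initial cluster centres
    centres = X[rng.choice(len(X), size=K, replace=False)]
    assignment = np.full(len(X), -1)
    while True:
        # Assignment step: each point goes to the cluster with the closest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            return assignment, centres               # converged: no point changed cluster
        assignment = new_assignment
        # Update step: recompute each centre as the mean of its assigned points
        for k in range(K):
            if np.any(assignment == k):              # keep the old centre if a cluster is empty
                centres[k] = X[assignment == k].mean(axis=0)

For example, assignment, centres = kmeans(X, K=3) would cluster an N x 2 array X of points into three clusters, as in the example of Figure 3.2.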
s_{ij} = \frac{1}{N-1} \sum_{n=1}^{N} (x_{ni} - m_i)(x_{nj} - m_j)    (3.7)

m_i = \frac{1}{N} \sum_{n=1}^{N} x_{ni}.    (3.8)

To be explicit, let {λ_1, . . . , λ_D} be the eigenvalues sorted in decreasing order so that λ_1 is the largest and λ_D is the smallest; the optimal projection vectors u∗ and v∗ are the eigenvectors that correspond to the two largest eigenvalues λ_1 and λ_2. Figure 3.4 depicts the scatter plot with the method for the example shown in Note 2.³

Figure 3.4: Plot of the critics in the 2-dimensional space defined by the first two principal components (see Note 2).
Generally speaking, if we would like to effectively transform x_n in a D-dimensional space to y_n in a lower ℓ-dimensional space (ℓ < D), it can be done with the eigenvectors, p_1, . . . , p_ℓ, that correspond to the ℓ largest eigenvalues.⁴
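The following minimal Python/numpy sketch (illustrative code, not part of the original notes) makes the projection concrete: it forms the covariance matrix of equation (3.7) from the centred data, sorts the eigenvectors by decreasing eigenvalue, and projects the data onto the first ℓ of them. The returned ratio corresponds to the retained-information measure of footnote 4.

import numpy as np

def pca_project(X, ell):
    """Project the N x D data matrix X onto its first ell principal components."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                       # mean vector m_i, as in (3.8)
    Xc = X - m                               # centre the data
    S = Xc.T @ Xc / (len(X) - 1)             # covariance matrix s_ij, as in (3.7)
    evals, evecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]          # indices sorted by decreasing eigenvalue
    P = evecs[:, order[:ell]]                # D x ell matrix of leading eigenvectors
    Y = Xc @ P                               # N x ell projected data y_n
    retained = evals[order[:ell]].sum() / evals.sum()   # ratio from footnote 4
    return Y, P, retained

For instance, Y, P, r = pca_project(X, ell=2) gives the two-dimensional coordinates used for a scatter plot such as Figure 3.4.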
² Covariance matrix is an extension of the variance for the univariate case to the multivariate (multi-dimensional) case. We will see more details about covariance matrices in Note 8, where we consider Gaussian distributions.

³ Someone might wonder how each film (review scores) contributes to the principal components. This can be confirmed with ‘factor loading(s)’, which is the correlation coefficient between the principal component and the review scores of the film.

⁴ It is almost always the case that reducing dimensionality involves degradation, i.e., loss of information. The ratio \sum_{i=1}^{\ell} \lambda_i / \sum_{i=1}^{D} \lambda_i indicates how much information is retained after the conversion.

3.5 Summary

In this chapter we have:

1. introduced the notion of clustering a data set into non-overlapping groups using hierarchical or partitional approaches;

2. described the most important partitional algorithm, K-means clustering, which:

• Is guaranteed to converge (eventually, in a finite number of steps).
• Provides a locally optimal solution, dependent on the initialisation.

3.6 Reading

Further reading

• Bishop, section 9.1 (on clustering)
• Bishop, section 12.1 (on principal component analysis)
• Segaran, chapter 3 (on clustering)
Exercises
(a) Cluster the data set into two clusters using k-means clustering, with the cluster centres initialised to 2 and 7.
(b) What if the initial cluster centres are 6 and 8?
2. For k-means clustering, give an intuitive proof that the mean squared error does not increase
after each iteration.
3. Assume we have calculated a covariance matrix S for a set of samples in a 3D space, and obtained eigenvectors and eigenvalues of S, which are given as follows:
Eigenvalues Eigenvectors
λ1 = 1.0 p1 = (0.492404, −0.809758, 0.319109)T
λ2 = 0.9 p2 = (−0.086824, 0.319109, 0.943732)T
λ3 = 0.001 p3 = (−0.866025, −0.492404, 0.086824)T
Using dimensionality reduction with PCA, plot the following four samples on a 2D plane.