Note 3 Informatics 2B - Learning

Clustering and Visualisation of Data

Hiroshi Shimodaira∗

January-March 2020

Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances
between data points. In some cases the aim of cluster analysis is to obtain greater understanding of the
data, and it is hoped that the clusters capture the natural structure of the data. In other cases cluster
analysis does not necessarily add to understanding of the data, but enables it to be processed more
efficiently.

Cluster analysis can be contrasted with classification. Classification is a supervised learning process:
there is a training set in which each data item has a label. For example, in handwriting recognition the
training set may correspond to a set of images of handwritten digits, together with a label (‘zero’ to
‘nine’) for each digit. In the test set the label for each image is unknown: it is the job of the classifier to
predict a label for each test item.

Clustering, on the other hand, is an unsupervised procedure in which the training set does not contain
any labels. The aim of a clustering algorithm is to group such a data set into clusters, based on the
unlabelled data alone. In many situations there is no ‘true’ set of clusters. For example consider
the twenty data points shown in Figure 3.1 (a). It is reasonable to divide this set into two clusters
(Figure 3.1 (b)), four clusters (Figure 3.1 (c)) or five clusters (Figure 3.1 (d)).

[Figure 3.1: Clustering a set of 20 two-dimensional data points. Panels: (a) original data, (b) two
clusters, (c) four clusters, (d) five clusters.]

There are many reasons to perform clustering. Most commonly it is done to better understand the data
(data interpretation), or to efficiently code the data set (data compression).

Data interpretation: Automatically dividing a set of data items into groups is an important way to
analyse and describe the world. Automatic clustering has been used to cluster documents (such as
web pages), user preference data (of the type discussed in the previous chapter), and many forms of
scientific observational data in fields ranging from astronomy to psychology to biology.

Data compression: Clustering may be used to compress data by representing each data item in a
cluster by a single cluster prototype, typically at the centre of the cluster. Consider D-dimensional data
which has been clustered into K clusters. Rather than representing a data item as a D-dimensional
vector, we could store just its cluster index (an integer from 1 to K). This representation, known as
vector quantisation, reduces the required storage for a large data set at the cost of some information
loss. Vector quantisation is used in image, video and audio compression.
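The note gives no code for vector quantisation, but the idea is easy to sketch. The following illustrative
Python/NumPy snippet (the function names quantise and reconstruct are ours, and the prototypes here are
just randomly chosen stand-ins for learned cluster centres) stores one integer index per data point and
rebuilds an approximation of the data from the K prototypes.

import numpy as np

def quantise(X, prototypes):
    """Replace each D-dimensional row of X by the index of its nearest prototype."""
    # Squared Euclidean distance from every point to every prototype: shape (N, K).
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)            # one integer per data point

def reconstruct(indices, prototypes):
    """Approximate each original point by the prototype of its cluster."""
    return prototypes[indices]

# Example: 1000 two-dimensional points stored as 1000 integers plus K prototypes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
prototypes = X[rng.choice(len(X), size=4, replace=False)]   # stand-in for learned centres
codes = quantise(X, prototypes)
X_approx = reconstruct(codes, prototypes)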

3.1 Types of clustering

There are two main approaches to clustering: hierarchical and partitional. Hierarchical clustering
forms a tree of nested clusters in which, at each level in the tree, a cluster is the union of its children.
∗ © 2014-2020 University of Edinburgh. All rights reserved. This note is heavily based on notes inherited from Steve
Renals and Iain Murray.


Partitional clustering does not have a nested or hierarchical structure; it simply divides the data set
into a fixed number of non-overlapping clusters, with each data point assigned to exactly one cluster.
The most commonly employed partitional clustering algorithm, K-means clustering, is discussed
below. Whatever approach to clustering is employed, the core operations are distance computations:
computing the distance between two data points, between a data point and a cluster prototype, or
between two cluster prototypes.
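As a minimal illustration of these distance computations (not from the note; NumPy is assumed, and the two
example vectors are coordinates borrowed from Figure 3.2), a single Euclidean distance function covers all
three cases:

import numpy as np

def euclidean(a, b):
    """Euclidean (L2) distance between two vectors of equal dimension."""
    return np.sqrt(np.sum((a - b) ** 2))

x = np.array([4.0, 13.0])     # a data point (coordinates borrowed from Figure 3.2)
m = np.array([4.33, 10.0])    # a cluster prototype (centre)
d = euclidean(x, m)           # the same function serves point-to-point, point-to-prototype
                              # and prototype-to-prototype distances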
There are two main approaches to hierarchical clustering. In top-down clustering algorithms, all the
data points are initially collected in a single top-level cluster. This cluster is then split into two (or
more) sub-clusters, and each of these sub-clusters is further split. The algorithm continues to build a
tree structure in a top-down fashion, until the leaves of the tree contain individual data points. An
alternative approach is agglomerative hierarchical clustering, which acts in a bottom-up way. An
agglomerative clustering algorithm starts with each data point defining a one-element cluster. Such an
algorithm operates by repeatedly merging the two closest clusters until a single cluster is obtained.
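A hedged sketch of the agglomerative procedure is given below. It is illustrative code rather than anything
specified in the note: clusters start as single points, and the two clusters whose centroids are closest are
merged repeatedly (centroid distance is just one possible linkage criterion).

import numpy as np

def agglomerative(X, target_k=1):
    """Merge clusters bottom-up until only target_k clusters remain.
    Returns a list of clusters, each a list of row indices into X."""
    clusters = [[n] for n in range(len(X))]          # one-element clusters
    while len(clusters) > target_k:
        centroids = [X[c].mean(axis=0) for c in clusters]
        best, best_d = None, np.inf
        for i in range(len(clusters)):               # find the two closest clusters
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if d < best_d:
                    best, best_d = (i, j), d
        i, j = best
        clusters[i] = clusters[i] + clusters[j]      # merge cluster j into cluster i
        del clusters[j]
    return clusters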
3.2 K-means clustering

K-means clustering aims to divide a set of D-dimensional data points into K clusters. The number of
clusters, K, must be specified; it is not determined by the clustering: thus it will always attempt to find
K clusters in the data, whether they really exist or not.

Each cluster is defined by its cluster centre, and clustering proceeds by assigning each of the input
data points to the cluster with the closest centre, using a Euclidean distance metric. The centre of each
cluster is then re-estimated as the centroid of the points assigned to it. The process is then iterated.
The algorithm is:

• Initialise K cluster centres, {m_k}_{k=1}^{K}

• While not converged:

  – Assign each data vector x_n (1 ≤ n ≤ N) to the closest cluster centre;

  – Recompute each cluster mean m_k as the mean of the vectors assigned to that cluster.
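A compact sketch of this batch algorithm in NumPy follows (illustrative code, not taken from the note; the
function name kmeans is ours). It initialises the centres with randomly chosen data points, which is the first
of the initialisation options listed below.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Batch K-means for an (N, D) data array X and a pre-specified K."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # init: random data points
    assign = None
    for _ in range(max_iter):
        # Assignment step: give each point to the closest centre (Euclidean distance).
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)     # shape (N, K)
        new_assign = d2.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                      # converged: no assignment changed
        assign = new_assign
        # Update step: recompute each centre as the mean of its assigned points.
        for k in range(K):
            if np.any(assign == k):                    # guard against an empty cluster
                centres[k] = X[assign == k].mean(axis=0)
    return centres, assign

Empty clusters are handled crudely here; library implementations (for example scikit-learn's KMeans) typically
restart from several initialisations and keep the lowest-error solution, for the reasons discussed below.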
The algorithm requires a distance measure to be defined in the data space, and the Euclidean distance
is often used.

The initialisation method needs to be further specified. There are several possible ways to initialise the
cluster centres, including:

• Choose random data points as cluster centres

• Randomly assign data points to K clusters and compute means as initial centres

• Choose data points with extreme values

• Find the mean for the whole data set then perturb into K means

[Figure 3.2: Example of K-means algorithm applied to 14 data points, K = 3. The lines indicate the
distances from each point to the centre of the cluster to which it is assigned. Panels: (a) Initialisation,
(b) After one iteration, (c) After two iterations. Here only one point (6,6) moves cluster after updating
the means. In general, multiple points can be reassigned after each update of the centre positions.]

All of these work reasonably, and there is no ‘best’ way. However, as discussed below, the initialisation
has an effect on the final clustering: different initialisations lead to different cluster solutions.

The algorithm iterates until it converges. Convergence is reached when the assignment of points to
clusters does not change after an iteration. An attractive feature of K-means is that convergence is
guaranteed. However, the number of iterations required to reach convergence is not guaranteed. For
large datasets it is often sensible to specify a maximum number of iterations, especially since a good
clustering solution is often reached after a few iterations. Figure 3.2 illustrates the K-means clustering
process. Figure 3.3 illustrates how different initialisations can lead to different solutions.

3.3 Mean squared error function

K-means clustering is an intuitively sensible algorithm, but is it possible to get a better mathematical
description of what is going on? To compare two different clusterings into K clusters we can use the
mean squared error function, which can be thought of as measuring the scatter of the data points
relative to their cluster centres. Thus if we have two sets of K clusters, it makes sense to prefer the one
with the smallest mean squared error: the clustering with the lowest scatter of data points relative to
their cluster centres.

Let us define an indicator variable z_nk, such that z_nk = 1 if the n-th data point x_n belongs to cluster k
and z_nk = 0 otherwise. Then we can write the mean squared error as

    E = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{nk} \, \| x_n - m_k \|^2 ,     (3.1)

where m_k is the centre of cluster k, N is the number of data points in total, and \|\cdot\| denotes the
Euclidean norm (i.e. L2-norm) of a vector. Another way of viewing this error function is in terms
of the squared deviation of the data points belonging to a cluster from the cluster centre, summed
over all centres. This may be regarded as a variance measure for each cluster, and hence K-means
is sometimes referred to as minimum variance clustering. Like all variance-based approaches this
criterion is dependent on the scale of the data. In the introduction we said that the aim of clustering
was to discover a structure of clusters such that points in the same group are close to each other, and
far from points in other clusters. The mean squared error only addresses the first of these: it does not
include a between-clusters term.
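Equation (3.1) is straightforward to evaluate for any clustering. The sketch below (illustrative, assuming
NumPy and centre/assignment arrays such as those returned by the kmeans sketch earlier) can be used to
compare two clusterings and prefer the one with the smaller error.

import numpy as np

def mean_squared_error(X, centres, assign):
    """Within-cluster mean squared error E of equation (3.1): the average squared
    Euclidean distance from each point x_n to the centre m_k of its cluster."""
    diffs = X - centres[assign]          # x_n - m_k, where k is the cluster that owns x_n
    return (diffs ** 2).sum() / len(X)

# To compare two converged clusterings of the same data, prefer the smaller value:
# E_a = mean_squared_error(X, centres_a, assign_a)
# E_b = mean_squared_error(X, centres_b, assign_b)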
Depending on the initialisation of the cluster centres, K-means can converge to different solutions; this
is illustrated in Figure 3.3. The same data set of 4 points can have two different clusterings, depending
on where the initial cluster centres are placed. Both these solutions are local minima of the error
function, but they have different error values. For the solution in Figure 3.3a, the error is:

    E = \frac{(4 + 4) + (4 + 4)}{4} = 4 .

The second solution (Figure 3.3b) has a lower error:

    E = \frac{0 + (32/9 + 20/9 + 68/9)}{4} = \frac{10}{3} < 4 .

[Figure 3.3: Two different converged clusterings for the same data set, but starting from different
initialisations. (a) Within-cluster sum-squared error = 4; (b) Within-cluster sum-squared error = 3.33.]

We have discussed the batch version of K-means. The online (sample-by-sample) version, in which a
point is assigned to a cluster and the cluster centre is immediately updated, is less vulnerable to local
minima.

There are many variants of K-means clustering that have been designed to improve the efficiency and
to find lower error solutions.
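For concreteness, here is a hedged sketch of such an online update (not code from the note; the running-mean
update shown is one common choice): each sample immediately moves the centre that wins it.

import numpy as np

def kmeans_online(X, centres, n_passes=10):
    """Online K-means: assign one sample at a time and move its centre immediately.
    The per-cluster count keeps each centre equal to the running mean of the samples
    it has won so far."""
    centres = centres.astype(float)
    counts = np.zeros(len(centres))
    for _ in range(n_passes):
        for x in X:
            k = ((centres - x) ** 2).sum(axis=1).argmin()   # closest centre
            counts[k] += 1
            centres[k] += (x - centres[k]) / counts[k]      # immediate update towards x
    return centres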
3.4 Dimensionality reduction and data visualisation

Another way of obtaining a better understanding of the data is to visualise it. For example, we could
easily see the shapes of data clusters or spot outliers if they were plotted in a two or three dimensional
space. It is, however, not straightforward to visualise high-dimensional data.

A simple solution would be to pick just two of the D components and plot them in the same manner as
we did for Figure 1 in Note 2, in which we picked two films, ‘Hancock’ and ‘Revolutionary Road’, out
of the 6 films to plot the critics. We would, however, need to draw many more plots with different
combinations of films to grasp the distribution of the data.

A more general way of visualising high-dimensional data is to transform it into a two-dimensional
space. For example, we can apply a linear transformation or mapping using a unit vector
u = (u_1, . . . , u_D)^T in the original D-dimensional vector space.¹ Calculating a dot-product
(Euclidean inner-product) between u and x_n gives a scalar y_n:

    y_n = u \cdot x_n = u^T x_n = u_1 x_{n1} + \cdots + u_D x_{nD}     (3.2)

which can be regarded as the orthogonal projection of x_n on the axis defined by u.

¹ NB: \|u\| = 1 by definition.

We now consider another unit vector v that is orthogonal to u, and project x_n orthogonally on it to get
another scalar z_n:

    z_n = v \cdot x_n = v^T x_n = v_1 x_{n1} + \cdots + v_D x_{nD} .     (3.3)

You will see that x_n is mapped to a point (y_n, z_n)^T in a two-dimensional space, and the whole set
{x_n}_{n=1}^{N} can be mapped to {(y_n, z_n)^T}_{n=1}^{N} in the same manner. It is easy to see that
the resultant plots depend on the choice of u and v. One option is to choose the pair of vectors that
maximise the total variance of the projected data:

    \max_{u, v} \; \mathrm{Var}(y) + \mathrm{Var}(z)
    \text{subject to } \|u\| = 1, \|v\| = 1, u \perp v .     (3.4)

This means that we try to find a two-dimensional space, i.e. a plane, such that the projected data on the
plane spread as widely as possible, rather than a plane on which the data are concentrated in a small area.
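The projections (3.2) and (3.3) and the objective (3.4) translate directly into code. The snippet below is an
illustration only (NumPy assumed; the function names are ours): it maps the data onto two given orthonormal
directions u and v and evaluates the total projected variance that (3.4) seeks to maximise.

import numpy as np

def project_2d(X, u, v):
    """Map each D-dimensional row of X to (y_n, z_n) using equations (3.2) and (3.3)."""
    y = X @ u                      # y_n = u . x_n
    z = X @ v                      # z_n = v . x_n
    return y, z

def total_projected_variance(X, u, v):
    """The objective maximised in (3.4): Var(y) + Var(z), for unit vectors u orthogonal to v."""
    y, z = project_2d(X, u, v)
    return y.var(ddof=1) + z.var(ddof=1)   # ddof=1 matches the 1/(N-1) sample variance used below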
It is known that the optimal projection vectors are given by the eigenvectors of the sample covariance
matrix², S, defined as

    S = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T     (3.5)

    \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n .     (3.6)

² Covariance matrix is an extension of the variance for the univariate case to the multivariate
(multi-dimensional) case. We will see more details about covariance matrices in Note 8, where we
consider Gaussian distributions.

If a scalar representation is preferred to the matrix one above, the element s_ij at the i’th row and j’th
column of S is given as:

    s_{ij} = \frac{1}{N-1} \sum_{n=1}^{N} (x_{ni} - m_i)(x_{nj} - m_j)     (3.7)

    m_i = \frac{1}{N} \sum_{n=1}^{N} x_{ni} .     (3.8)

To be explicit, let {λ_1, . . . , λ_D} be the eigenvalues sorted in decreasing order, so that λ_1 is the largest
and λ_D is the smallest; the optimal projection vectors u∗ and v∗ are the eigenvectors that correspond to
the two largest eigenvalues λ_1 and λ_2. Figure 3.4 depicts the scatter plot obtained with this method for
the example shown in Note 2.³

³ Someone might wonder how each film (review scores) contributes to the principal components. This
can be confirmed with ‘factor loading(s)’, which is the correlation coefficient between the principal
component and the review scores of the film.

[Figure 3.4: Plot of the critics in the 2-dimensional space defined by the first two principal components
(see Note 2). The horizontal axis is the 1st principal component and the vertical axis is the 2nd principal
component; the critics shown are Turan, Travers, Puig, McCarthy, Morgenstern and Denby.]

Generally speaking, if we would like to effectively transform x_n in a D-dimensional space to y_n in a
lower ℓ-dimensional space (ℓ < D), it can be done with the eigenvectors, p_1, . . . , p_ℓ, that correspond to
the ℓ largest eigenvalues λ_1, . . . , λ_ℓ:

    y_n = \begin{pmatrix} p_1^T \\ \vdots \\ p_\ell^T \end{pmatrix} x_n
        = \begin{pmatrix} p_1^T x_n \\ \vdots \\ p_\ell^T x_n \end{pmatrix} .     (3.9)

This technique is called principal component analysis (PCA) and is widely used for data visualisation,
dimensionality reduction, data compression, and feature extraction.⁴

⁴ It is almost always the case that reducing dimensionality involves degradation, i.e. loss of information.
The ratio Σ_{i=1}^{ℓ} λ_i / Σ_{i=1}^{D} λ_i indicates how much information is retained after the
conversion.
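Putting equations (3.5) to (3.9) together, the following sketch (illustrative code, not part of the note; NumPy
assumed, and the function name pca_project is ours) computes the sample covariance matrix, keeps the
eigenvectors with the largest eigenvalues, and projects the data to ℓ dimensions. Centring the data before
projecting would only shift the plotted points.

import numpy as np

def pca_project(X, ell=2):
    """Project the (N, D) data X onto its first ell principal components."""
    x_bar = X.mean(axis=0)                      # equation (3.6)
    Xc = X - x_bar
    S = Xc.T @ Xc / (len(X) - 1)                # sample covariance matrix, equation (3.5)
    eigvals, eigvecs = np.linalg.eigh(S)        # eigh, since S is symmetric
    order = np.argsort(eigvals)[::-1]           # eigenvalues in decreasing order
    P = eigvecs[:, order[:ell]]                 # columns are p_1, ..., p_ell
    retained = eigvals[order[:ell]].sum() / eigvals.sum()   # the ratio from footnote 4
    Y = X @ P                                   # equation (3.9) applied to every row
    return Y, retained

For larger problems one would normally rely on a library routine (for example numpy.linalg.svd applied to the
centred data) rather than forming S explicitly.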
3.5 Summary

In this chapter we have:

1. introduced the notion of clustering a data set into non-overlapping groups using hierarchical or
partitional approaches;

2. described the most important partitional algorithm, K-means clustering;

3. defined a within-cluster mean squared error function for K-means clustering;

4. introduced the notion of dimensionality reduction for data visualisation;

5. described the technique, principal component analysis (PCA), for dimensionality reduction.

The key properties of K-means are that it:

• Is an automatic procedure for clustering unlabelled data.

• Requires a pre-specified number of clusters.

• Chooses a set of clusters with the minimum within-cluster variance.

• Is guaranteed to converge (eventually, in a finite number of steps).

• Provides a locally optimal solution, dependent on the initialisation.

3.6 Reading

Further reading

• Bishop, section 9.1 (on clustering)

• Bishop, section 12.1 (on principal component analysis)

• Segaran, chapter 3 (on clustering)


Exercises

1. Consider a data set of one-dimensional samples {1, 3, 5, 6, 8, 9}.

(a) Cluster the data set into two clusters using k-means clustering, with the cluster centres initialised
to 2 and 7.
(b) What if the initial cluster centres are 6 and 8?

2. For k-means clustering, give an intuitive proof that the mean squared error does not increase
after each iteration.

3. Assume we have calculated a covariance matrix S for a set of samples in a 3D space, and
obtained the eigenvectors and eigenvalues of S, which are given as follows:

Eigenvalues Eigenvectors
λ1 = 1.0 p1 = (0.492404, −0.809758, 0.319109)T
λ2 = 0.9 p2 = (−0.086824, 0.319109, 0.943732)T
λ3 = 0.001 p3 = (−0.866025, −0.492404, 0.086824)T

Using dimensionality reduction with PCA, plot the following four samples on a 2D plane.

x1 = (0.88932, −1.30533, 1.58282)T
x2 = (1.07097, −0.83358, 2.49964)T
x3 = (0.14555, −0.27002, 2.22394)T
x4 = (0.49218, −0.44141, 1.25416)T
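These exercises are intended to be worked by hand, but if you want to sanity-check your answers numerically,
a throwaway script along the following lines (assuming NumPy; not part of the original note) will do; treat its
output as a check rather than as the intended solution.

import numpy as np

# Exercise 1: one-dimensional K-means with the given initial centres.
X1 = np.array([[1.0], [3.0], [5.0], [6.0], [8.0], [9.0]])
for init in ([2.0, 7.0], [6.0, 8.0]):
    centres = np.array(init).reshape(-1, 1)
    for _ in range(100):
        assign = np.abs(X1 - centres.T).argmin(axis=1)          # nearest centre per sample
        new_centres = np.array([X1[assign == k].mean(axis=0) for k in range(2)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    print(init, "->", centres.ravel(), assign)

# Exercise 3: project the four samples onto the two leading eigenvectors given above.
p1 = np.array([0.492404, -0.809758, 0.319109])
p2 = np.array([-0.086824, 0.319109, 0.943732])
X3 = np.array([[0.88932, -1.30533, 1.58282],
               [1.07097, -0.83358, 2.49964],
               [0.14555, -0.27002, 2.22394],
               [0.49218, -0.44141, 1.25416]])
print(X3 @ np.column_stack([p1, p2]))   # the 2-D coordinates to plot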
