Data Clustering (Part I)
Outline
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
What is data clustering?
Definition from Data Mining: Concepts and Techniques, J. Han et al. [1]
Cluster analysis or simply clustering is the process of partitioning a set of data objects (or
observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to
one another, yet dissimilar to objects in other clusters.
Data clustering (cont’d)
Data clustering problem
Example of data points in 2–dimensional space [3]
Observations about the clustering process and the results
Requirements for data clustering
Scalability: clustering algorithms should be able to work with small, medium, and
large datasets with consistent performance.
Ability to deal with different types of attributes: clustering algorithms can
work with different data types like binary, nominal (categorical), ordinal, numeric,
or mixtures of those data types.
Discovery of clusters with arbitrary shape: algorithms based on distance
measures such as the Euclidean or Manhattan distance tend to find spherical clusters
with similar size and density. However, a cluster could be of any shape, so it is
important to develop algorithms that can detect clusters of arbitrary shape.
Requirements for domain knowledge to determine input parameters:
clustering should be as automatic as possible, avoiding reliance on (possibly biased)
domain knowledge to set input parameters.
Ability to deal with noisy data: most real–world data sets contain outliers
and/or missing, unknown, or erroneous data. Clustering algorithms can be sensitive
to such noise and may produce poor–quality clusters. Therefore, we need clustering
methods that are robust to noise.
Requirements for data clustering (cont’d)
Clustering approaches (cont’d)
Clustering approaches (cont’d)
Grid–based methods:
Quantize the object space into a finite number of cells that form a grid structure.
All the clustering operations are performed on the grid structure.
Advantage: fast processing time, which typically depends only on the number of cells rather than on the number of data objects.
Efficient for spatial data clustering; can be combined with density-based methods, etc.
Other approaches:
Graph–based methods:
finding clusters based on dense sub–graph mining like cliques or quasi–cliques.
Subspace models:
clusters are modeled with both cluster members and relevant attributes.
Neural models:
the most well known unsupervised neural network is the self–organizing map.
Clustering approaches (cont’d)
Challenges in data clustering
Clustering in high–dimensional space: the curse of dimensionality
Clustering in high–dimensional space: the curse of dimensionality (2)
cos(x, y) = (x · y) / (‖x‖ ‖y‖)    (2)

When d is very large, the numerator is much smaller than the denominator, and the
cosine between the two vectors is very close to zero.
If most pairs of data points are (nearly) orthogonal, it is very hard to perform clustering;
the clustering results are normally very poor.
One of the solutions is dimensionality reduction with popular techniques like PCA,
SVD, topic analysis, or word embeddings (for text data).
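The effect can be checked empirically. Below is a small NumPy sketch (not from the slides; it assumes randomly generated Gaussian vectors) that estimates the average absolute cosine similarity between random pairs of points as the dimensionality d grows; the value shrinks toward zero, matching the observation above.

```python
import numpy as np

# Illustration of the curse of dimensionality: for random points, pairwise
# cosine similarities concentrate around zero as the dimensionality d grows.
rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000, 10000):
    X = rng.standard_normal((200, d))              # 200 random points in d dimensions
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize each vector
    cos = X @ X.T                                  # all pairwise cosines
    off_diag = cos[~np.eye(len(X), dtype=bool)]    # drop the trivial self-similarities
    print(f"d = {d:>5}   mean |cos| = {np.abs(off_diag).mean():.4f}")
```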
Applications of data clustering
Applications of data clustering (cont’d)
Outline
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
Understanding of data distribution
Clustering tendency identification methods
Validating cluster tendency with spatial histogram
Divide each dimension X_j of the input dataset D into b equal-width bins, and count the
number of points falling in each of the resulting d-dimensional cells to obtain the
empirical joint probability mass function (EPMF) of D:

f(i) = P(x_j ∈ cell i) = |{x_j ∈ cell i}| / n    (3)

where i = (i_1, i_2, . . . , i_d) denotes a cell index, with i_j denoting the bin index along
dimension X_j, and n = |D|.
Validating cluster tendency with spatial histogram (cont’d)
Next, we generate t random samples, each comprising n points within the same
d-dimensional space as the input dataset D. That is, for each dimension X_j, we
compute its range [min(X_j), max(X_j)] and generate values uniformly at random
within that range. Let R_j denote the j-th such random sample.
Compute the corresponding EPMF g_j(i) for each R_j, 1 ≤ j ≤ t.
Compute how much the distribution f differs from g_j (for j = 1..t) using the
Kullback-Leibler (KL) divergence from f to g_j, defined as:

KL(f | g_j) = Σ_i f(i) log2 ( f(i) / g_j(i) )    (4)

The KL divergence is zero only when f and g_j are the same distribution. Using
these divergence values, we can compute how much the dataset D differs from a
random dataset.
Compute the expectation and the variance of KL(f | g_j) (for j = 1..t). A minimal
code sketch of this procedure is given below.
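The following is a minimal NumPy sketch of this procedure, not taken from the slides: it assumes b equal-width bins per dimension, adds a small smoothing constant eps so that empty cells do not break the logarithm, and uses base-2 logarithms as in the example that follows. The function name spatial_histogram_test and its parameters are illustrative.

```python
import numpy as np

def epmf(X, edges):
    """Empirical joint PMF of the points in X over the grid cells defined by `edges`."""
    counts, _ = np.histogramdd(X, bins=edges)
    return counts / len(X)

def spatial_histogram_test(D, b=5, t=500, seed=0, eps=1e-12):
    """Mean and std of KL(f | g_j) between the EPMF of D and t uniformly random datasets."""
    rng = np.random.default_rng(seed)
    lo, hi = D.min(axis=0), D.max(axis=0)
    edges = [np.linspace(lo[j], hi[j], b + 1) for j in range(D.shape[1])]
    f = epmf(D, edges).ravel() + eps              # smoothing is a practical choice, not from the slides
    kls = []
    for _ in range(t):
        R = rng.uniform(lo, hi, size=D.shape)     # n random points in the same data space as D
        g = epmf(R, edges).ravel() + eps
        kls.append(np.sum(f * np.log2(f / g)))    # KL divergence from f to g_j (base-2 logs)
    kls = np.asarray(kls)
    return kls.mean(), kls.std()

# Usage: mu, sigma = spatial_histogram_test(D)    # a large mean KL suggests D is far from random
```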
Example of spatial histogram [4]
The main limitation of this approach is that as dimensionality increases, the number
of cells (b^d) increases exponentially, and with a fixed sample size n, most of the cells
will be empty, or will have only one point, making it hard to estimate the
divergence. The method is also sensitive to the choice of parameter b.
The example in the next slide shows the empirical joint probability mass function for
the Iris principal components dataset that has n = 150 points in d = 2 dimensions.
It also shows the EPMF for one of the datasets generated uniformly at random in
the same data space. Both EPMFs were computed using b = 5 bins in each
dimension, for a total of 25 spatial cells.
We generated t = 500 random samples and computed the KL divergence from f to g_j
for each 1 ≤ j ≤ t (using logarithms with base 2).
The mean KL value was µ_KL = 1.17, with a standard deviation of σ_KL = 0.18,
indicating that the Iris data is indeed far from the randomly generated data, and
thus is clusterable.
Example of spatial histogram [4] (cont’d)
Validating cluster tendency with cell–based entropy
The data space is divided into a grid of k × k cells. For instance, with k = 10, the
total number of cells is m = 100.
Count the number of data points in each cell for the three cases (a), (b), and (c).
Validating cluster tendency with cell–based entropy (cont’d)
H = − Σ_{i=1..m} p_i log2 p_i    (5)

where p_i = c_i / n, with c_i the number of data points in the i-th cell and n the
total number of data points in all cells.
With m = 100 cells, the maximum entropy value is log2 m = log2 100 ≈ 6.6439. The
entropy can be normalized to [0, 1] by using H / log2 m.
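A minimal sketch of this measure for 2-D data, assuming NumPy and a k × k grid over the bounding box of the data; the function name grid_entropy is illustrative.

```python
import numpy as np

def grid_entropy(X, k=10):
    """Entropy and normalized entropy of 2-D data X over a k x k grid of cells."""
    xe = np.linspace(X[:, 0].min(), X[:, 0].max(), k + 1)
    ye = np.linspace(X[:, 1].min(), X[:, 1].max(), k + 1)
    counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=[xe, ye])
    p = counts.ravel() / counts.sum()
    p = p[p > 0]                                   # 0 * log(0) is treated as 0
    H = -(p * np.log2(p)).sum()
    return H, H / np.log2(k * k)                   # normalized entropy lies in [0, 1]
```

Lower normalized entropy indicates a more clustered spatial distribution, as in cases (b) and (c) discussed next.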
Validating cluster tendency with cell–based entropy (cont’d)
Case (a):
Entropy = 6.5539
Normalized entropy = 0.9864 ≈ 1.0
Case (b):
Entropy = 5.5318
Normalized entropy = 0.8326
Case (c):
Entropy = 4.8118
Normalized entropy = 0.7242
The smaller the entropy value, the more clustered the data is. Entropy = 0 when all
data points fall into one cell.
This method also depends on how we divide the data space into cells, i.e., the
total number of cells.
Validating cluster tendency with distance distribution
(7)
Validating cluster tendency with Hopkins statistic (cont’d)
(9)
Example of Hopkins statistic with uniformly distributed data
Example of Hopkins statistic with normal distribution clusters
Example of Hopkins statistic with normal distribution data (cont’d)
Outline
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
Hierarchical clustering
Hierarchical clustering (cont’d)
The dendrogram and nested clustering solutions [4]
Agglomerative hierarchical clustering
Agglomerative hierarchical clustering: the pseudo code [4]
Distance between clusters: different ways to merge clusters
The main step in the algorithm is to determine the closest pair of clusters.
The cluster–cluster distances are ultimately based on the distance between two
points, which is typically computed using the Euclidean distance or L2–norm,
defined as
δ(x, y) = ‖x − y‖₂ = √( Σ_{a=1..d} (x_a − y_a)² )    (10)
There are several ways to measure the proximity between two clusters: single link,
complete link, average link, centroid link, radius, and diameter.
Distance between clusters: different ways to merge clusters (cont’d)
Single link:
Given two clusters C_i and C_j, the distance between them, denoted δ(C_i, C_j), is defined
as the minimum distance between a point in C_i and a point in C_j:
δ(C_i, C_j) = min{δ(x, y) | x ∈ C_i, y ∈ C_j}    (11)
At each iteration, the two clusters with the smallest single-link distance are merged.
Complete link:
The distance between two clusters is defined as the maximum distance between a
point in C_i and a point in C_j:
δ(C_i, C_j) = max{δ(x, y) | x ∈ C_i, y ∈ C_j}    (12)
At each iteration, the two clusters with the smallest complete-link distance are merged.
Distance between clusters: different ways to merge clusters (cont’d)
Distance between clusters: different ways to merge clusters (cont’d)
Radius:
The radius of a cluster is the distance from its centroid (mean) µ_C to the furthest point in
the cluster:
r(C) = max{δ(µ_C, x) | x ∈ C}    (15)
At each iteration, the two clusters whose merger would yield the new cluster with the
smallest radius are merged.
Diameter:
The diameter of a cluster is the distance between the two furthest points in the cluster:
d(C) = max{δ(x, y) | x, y ∈ C}    (16)
At each iteration, the two clusters whose merger would yield the new cluster with the
smallest diameter are merged.
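As a quick illustration of these criteria, the sketch below uses SciPy's hierarchical clustering, where the method argument ('single', 'complete', 'average', 'centroid') corresponds to the cluster distances defined above; the toy data are synthetic and only for demonstration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three well-separated Gaussian blobs in 2-D as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in ((0, 0), (5, 5), (0, 5))])

# Build the merge tree (dendrogram) with a chosen linkage criterion.
Z = linkage(X, method='single', metric='euclidean')   # try 'complete', 'average', 'centroid'

# Cut the dendrogram into k = 3 flat clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```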
Example of agglomerative hierarchical clustering
Closest pairs of points: δ((10, 5), (11, 4)) = δ((11, 4), (12, 3)) = 2.
Example of agglomerative hierarchical clustering: cluster merging
Example of agglomerative hierarchical clustering: the results
When should we stop merging?
When we have prior knowledge about the number of potential clusters in the data.
When the merging starts to produce low-quality clusters (e.g., the average distance
from points in a cluster to its mean is larger than a given threshold).
When we want the algorithm to produce the whole dendrogram, e.g., an evolutionary tree,
we do not stop merging at all.
Agglomerative clustering: computational complexity
Compute the distance of each cluster to all other clusters, and at each step the
number of clusters decreases by one. Initially it takes O(n²) time to create the
pairwise distance matrix, unless it is specified as an input to the algorithm.
At each merge step, the distances from the merged cluster to the other clusters have
to be recomputed, whereas the distances between the other clusters remain the
same. This means that in step t, we compute O(n − t) distances.
The other main operation is to find the closest pair in the distance matrix. For this
we can keep the n² distances in a heap data structure, which allows us to find the
minimum distance in O(1) time; creating the heap takes O(n²) time.
Deleting/updating distances from the merged cluster takes O(log n) time for each
operation, for a total time across all merge steps of O(n² log n).
Thus, the computational complexity of hierarchical clustering is O(n² log n).
Outline
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
Partitioning clustering methods
The simplest and most fundamental version of cluster analysis is partitioning, which
organizes the objects of a set into several exclusive groups or clusters.
We can assume that the number of clusters is given as background knowledge. This
parameter is the starting point for partitioning methods.
Formally, given a data set, D, of n objects, and k, the number of clusters to form, a
partitioning algorithm organizes the objects into k partitions (k ≤ n), where each
partition represents a cluster.
The clusters are formed to optimize an objective partitioning criterion, such as
a dissimilarity function based on distance, so that the objects within a cluster are
“similar” to one another and “dissimilar” to objects in other clusters in terms of the
data set attributes.
The most popular partitioning algorithms are k-means, k-medoids, and k-medians.
These methods use a representative point (a centroid or an actual object) to represent
each cluster, and thus they are also called representative-based or centroid-based methods.
Data clustering problem revisited
K–means algorithm
Let C = {C_1, C_2, . . . , C_k} be a clustering solution; we need some scoring function
that evaluates its quality or goodness on D. The sum of squared errors (SSE) scoring
function is defined as:

SSE(C) = Σ_{i=1..k} Σ_{x_j ∈ C_i} ‖x_j − µ_i‖²    (17)

The goal is to find the clustering solution C* that minimizes the SSE score:
C* = arg min_C SSE(C)    (18)
K–means initializes the cluster means by randomly generating k points in the data
space. This is typically done by generating a value uniformly at random within the
range for each dimension.
Each iteration of k–means consists of two steps:
Cluster assignment, and
Centroid or mean update.
Given the k cluster means, in the cluster assignment step, each point x_j ∈ D is
assigned to the closest mean, which induces a clustering, with each cluster C_i
comprising points that are closer to µ_i than to any other cluster mean. That is, each
point x_j is assigned to cluster C_{j*}, where

j* = arg min_{i=1..k} ‖x_j − µ_i‖²    (19)
K–means algorithm (cont’d)
Given a set of clusters C_i, i = 1..k, in the centroid update step, new mean values
are computed for each cluster from the points in C_i.
The cluster assignment and centroid update steps are carried out iteratively until we
reach a fixed point or a local minimum.
Practically speaking, one can assume that k-means has converged if the centroids do
not change from one iteration to the next. For instance, we can stop if

Σ_{i=1..k} ‖µ_i^(t) − µ_i^(t−1)‖² ≤ ε    (20)

where ε > 0 is the convergence threshold and t denotes the current iteration.
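Putting the two steps together, here is a minimal from-scratch NumPy sketch of k-means as described above (random uniform initialization, assignment, mean update, and the convergence test of Eq. (20)); the function name kmeans and the handling of empty clusters are illustrative choices.

```python
import numpy as np

def kmeans(X, k, eps=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the k means uniformly at random within the range of each dimension.
    mu = rng.uniform(X.min(axis=0), X.max(axis=0), size=(k, X.shape[1]))
    for _ in range(max_iter):
        # Cluster assignment: each point goes to its closest mean (squared Euclidean distance).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Centroid update: recompute each mean from the points assigned to it
        # (an empty cluster keeps its previous mean, which is one practical convention).
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        # Convergence test: stop when the means move less than the threshold eps.
        if ((new_mu - mu) ** 2).sum() <= eps:
            mu = new_mu
            break
        mu = new_mu
    sse = ((X - mu[labels]) ** 2).sum()
    return labels, mu, sse
```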
K–means algorithm: the pseudo code [4]
K–means algorithm: computational complexity
The cluster assignment step takes O(nkd) time, since for each of the n points we
have to compute its distance to each of the k cluster means, which takes d operations
in d dimensions.
The centroid re-computation step takes O(nd) time, since we have to add a total of
n d-dimensional points.
Assuming that there are t iterations, the total time for k–means is O(tnkd).
In terms of the I/O cost it requires O(t) full database scans, since we have to read
the entire database in each iteration.
K–means algorithm: example 1
K–means algorithm: example 2
Clustering with k–means [from Pattern Recognition and Machine Learning by C.M. Bishop]
K–means algorithm: example 3 (image segmentation)
Image segmentation with k–means [from Pattern Recognition and Machine Learning by C.M. Bishop]
Initialization for k mean vectors µ i
The initial means should lie in different clusters. There are two approaches:
Pick points that are as far away from one another as possible.
Cluster a (small) sample of the data, perhaps hierarchically, so there are k clusters.
Pick a point from each cluster, perhaps that point closest to the centroid of the cluster.
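The first approach (picking points that are far apart) can be sketched as a farthest-point heuristic, shown below under the assumption of Euclidean distance; the function name farthest_point_init is illustrative.

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick k initial means that lie far apart: start from a random point, then repeatedly
    add the point whose minimum distance to the already chosen points is largest."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance of every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    return np.array(centers)
```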
Initialization for k mean vectors µ i : example
Initial selection for mean values [from Mining of Massive Datasets by J. Leskovec et al.]
Initialization for k mean vectors µ i : example (cont’d)
K–means is sensitive to outliers
The k–means algorithm is sensitive to outliers because such objects are far away
from the majority of the data, and thus, when assigned to a cluster, they can
dramatically distort the mean value of the cluster. This inadvertently affects the
assignment of other objects to clusters. This effect is more serious due to the use of
the squared error.
Example: consider 7 data points in the 1–d space: 1, 2, 3, 8, 9, 10, 25, with k = 2.
Intuitively, by visual inspection we may imagine the points partitioned into the clusters
{1, 2, 3} and {8, 9, 10}, where point 25 is excluded because it appears to be an outlier.
How would k–means partition the values with k = 2?
Solution 1: {1, 2, 3} with mean = 2 and {8, 9, 10, 25} with mean = 13. The squared error is:
(1 − 2)² + (2 − 2)² + (3 − 2)² + (8 − 13)² + (9 − 13)² + (10 − 13)² + (25 − 13)² = 196
Solution 2: {1, 2, 3, 8} with mean = 3.5 and {9, 10, 25} with mean = 14.67. The squared error is:
(1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (8 − 3.5)² + (9 − 14.67)² + (10 − 14.67)² + (25 − 14.67)² = 189.67
Solution 2 is chosen because it has the lower squared error. However, 8 should not
be in the first cluster, and the mean of the second cluster (14.67) is quite far from
9 and 10 due to the outlier 25.
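The two squared errors can be verified with a few lines of Python (a quick check of the arithmetic above):

```python
def sse(*clusters):
    """Sum of squared errors of a partition, using the mean of each cluster."""
    return sum(sum((x - sum(c) / len(c)) ** 2 for x in c) for c in clusters)

print(round(sse([1, 2, 3], [8, 9, 10, 25]), 2))   # 196.0
print(round(sse([1, 2, 3, 8], [9, 10, 25]), 2))   # 189.67
```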
K–medoids clustering algorithm
Rather than using mean values, k-medoids picks actual data objects in the dataset to
represent the clusters, using one representative object per cluster.
Each remaining object is assigned to the cluster whose representative object it is
most similar to.
The partitioning method is then performed based on the principle of minimizing the
sum of the dissimilarities between each object x and its corresponding representative
object o_i. That is, an absolute-error criterion is used, defined as:

E = Σ_{i=1..k} Σ_{x ∈ C_i} δ(x, o_i)    (21)

This is the basis for the k-medoids method, which groups n objects into k clusters
by minimizing the absolute error.
When k = 1, we can find the exact median in O(n²) time. However, when k is a
general positive number, the k-medoid problem is NP-hard.
K-medoids: partitioning around medoids (PAM) algorithm
K-medoids: partitioning around medoids (PAM) algorithm (cont’d)
Specifically, let o_1, o_2, . . . , o_k be the current set of representative objects (i.e.,
medoids) of the k clusters.
To determine whether a non-representative object, denoted by o_random, is a good
replacement for a current medoid o_j (1 ≤ j ≤ k), we calculate the distance from
every object x to the closest object in the set {o_1, . . . , o_{j−1}, o_random, o_{j+1}, . . . , o_k},
and use the distance to update the cost function.
The reassignments of objects to {o_1, . . . , o_{j−1}, o_random, o_{j+1}, . . . , o_k} are simple:
Suppose an object x is currently assigned to the cluster represented by medoid o_j: x
needs to be reassigned to either o_random or some other cluster represented by o_i
(i ≠ j), whichever is the closest.
Suppose an object x is currently assigned to a cluster represented by some other o_i
(i ≠ j): x remains assigned to o_i as long as x is still closer to o_i than to o_random.
Otherwise, x is reassigned to o_random.
The k–medoids method is more robust than k–means in the presence of noise and
outliers because a medoid is less influenced by outliers or other extreme values than
a mean.
However, the complexity of each iteration in the k-medoids algorithm is O(k(n − k)²).
For large values of n and k, such computation becomes very costly, and much
more costly than the k-means method.
Both methods require the user to specify k, the number of clusters.
A typical k–medoids partitioning algorithm like PAM works effectively for small
data sets, but does not scale well for large data sets. How can we scale up the
k–medoids method? To deal with larger data sets, a sampling–based method called
CLARA (Clustering LARge Applications) can be used.
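A minimal from-scratch sketch of the PAM idea described above, assuming a precomputed n × n pairwise distance matrix; it greedily accepts any medoid/non-medoid swap that lowers the absolute-error cost. Function names and the stopping rule are illustrative, and this naive version exhibits the O(k(n − k)²) per-iteration cost mentioned above.

```python
import numpy as np

def swap_cost(dist, medoids, j, o_random):
    """Absolute-error cost after replacing the j-th medoid with candidate object o_random."""
    trial = list(medoids)
    trial[j] = o_random
    # Every object is assigned to its closest representative; the cost is the sum of distances.
    return dist[:, trial].min(axis=1).sum()

def pam(dist, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))
    cost = dist[:, medoids].min(axis=1).sum()
    for _ in range(max_iter):
        improved = False
        for j in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                c = swap_cost(dist, medoids, j, o)
                if c < cost:                      # accept the swap if it lowers the cost
                    medoids[j], cost, improved = o, c, True
        if not improved:
            break
    labels = dist[:, medoids].argmin(axis=1)      # cluster index of each object
    return medoids, labels
```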
K–medians clustering algorithm
In the k-medians algorithm, the Manhattan distance (L1 distance) is used in the
objective function rather than the Euclidean distance (L2 distance). The objective
function in k-medians is:

O = Σ_{i=1..k} Σ_{x_j ∈ C_i} ‖x_j − m_i‖₁    (22)

where m_i is the median of the data points along each dimension in cluster C_i. This
is because the point that has the minimum sum of L1-distances to a set of points
distributed on a line is the median of that set.
As the median is chosen independently along each dimension, the resulting
d–dimensional representative will (typically) not belong to the original dataset D.
The k–medians approach is sometimes confused with the k–medoids approach,
which chooses these representatives from the original database D.
The k–medians approach generally selects cluster representatives in a more robust
way than k–means, because the median is not as sensitive to the presence of outliers
in the cluster as the mean.
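A minimal NumPy sketch of k-medians as described above: L1 (Manhattan) distance for assignment and a coordinate-wise median update; the function name kmedians and the initialization from sampled data points are illustrative choices.

```python
import numpy as np

def kmedians(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)].astype(float)   # initial representatives
    for _ in range(max_iter):
        # Assign each point to the representative with the smallest L1 (Manhattan) distance.
        d1 = np.abs(X[:, None, :] - m[None, :, :]).sum(axis=2)
        labels = d1.argmin(axis=1)
        # Update each representative as the coordinate-wise median of its cluster.
        new_m = np.array([np.median(X[labels == i], axis=0) if np.any(labels == i) else m[i]
                          for i in range(k)])
        if np.allclose(new_m, m):
            break
        m = new_m
    return labels, m
```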
Outline
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
References
1 J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan
Kaufmann, Elsevier, 2012 [Book1].
2 C. Aggarwal. Data Mining: The Textbook. Springer, 2015 [Book2].
3 J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets.
Cambridge University Press, 2014 [Book3].
4 M. J. Zaki and W. Meira Jr. Data Mining and Analysis: Fundamental Concepts and
Algorithms. Cambridge University Press, 2013 [Book4].
5 D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning About a
Highly Connected World. Cambridge University Press, 2010 [Book5].
6 J. VanderPlas. Python Data Science Handbook: Essential Tools for Working with
Data. O’Reilly, 2017 [Book6].
7 J. Grus. Data Science from Scratch: First Principles with Python. O’Reilly, 2015
[Book7].
Summary
Introducing important concepts of clustering: definitions, types of clustering (hard
vs. soft), main requirements for clustering, clustering approaches, challenges in
clustering, and clustering applications.
Main techniques for understanding the data distribution before clustering: spatial
histogram, cell-based entropy, distance distribution, and Hopkins statistic.
The hierarchical clustering approach with agglomerative method (bottom–up),
dendrogram, different ways to merge clusters (single link, complete link, average
link, centroid link, radius, and diameter).
The partitioning approach with the k-means algorithm, the initialization of the k centroids,
and the variants of k-means including k-medoids (the PAM algorithm) and k-medians.