Clustering algorithms group similar data points together. K-means clustering partitions data into k clusters by minimizing the distances between data points and their cluster centers. Hierarchical clustering builds nested clusters by merging or splitting clusters based on distance metrics. Density-based clustering identifies clusters as areas of high density separated by areas of low density, as in DBSCAN, which uses a minimum-points parameter and an epsilon distance.
2. Introduction
Spam Detection (Mail)
Handwriting Recognition (Mobile)
Voice Recognition (Mobile)
Recommendation (Youtube)
Used for finding outliers in data
Used for imputing missing values
Used for feature generation/representation
Used for exploratory analysis and/or as a component of a hierarchical supervised learning pipeline (distinct classifiers or regression models are trained for each cluster)
Used for data compression
3. Introduction
Measures
Norms
Distance
Algorithms
Partition Based
K-Means
Mini Batch K-Means
K-Medoids
CLARA
Hierarchical
Birch
Density Based
DBSCAN
OPTICS
Model/Distr Based
FCM
GMM
LDA
Evaluation
Internal
WSSSE
GAP
Silhouette
Dunn index
Davies-Bouldin index
External
Rand index
Jaccard index
FM index
F-measure
V Measure
Entropy and Purity
MI Index
Comparison
References
4. Norms
$\ell_p$ norm (Minkowski): $\|x\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}$
$\ell_0$ norm: number of non-zero elements in a vector
(text classification, sparse, discrete data)
$\|x\|_0 = \#\{i : x_i \neq 0\}$, e.g. $[3, 4] \to 2$
$\ell_1$ norm: Manhattan norm (sum of absolute values)
(lasso regression, convex, non-differentiable)
$\|x\|_1 = \sum_{i=1}^{n} |x_i|$, e.g. $[3, 4] \to 7$
$\ell_2$ norm: Euclidean norm (vector length)
(ridge regression)
$\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2} = \sqrt{x^T x}$, e.g. $[3, 4] \to 5$
$\ell_\infty$ norm: Chebyshev norm (king moves in chess)
$\|x\|_\infty = \max_i |x_i|$, e.g. $[3, 4] \to 4$
Relaxing $\ell_0$ to $\ell_1$ for sparsity:
$\ell_0$: results in an NP-hard problem
$\ell_{p<1}$: results in a non-convex problem
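As a quick check of the norm definitions above, here is a minimal NumPy sketch (NumPy is an assumption, not part of the original notes):

```python
# Minimal sketch of the norms above using NumPy (assumed available).
import numpy as np

x = np.array([3.0, 4.0])

l0 = np.count_nonzero(x)                 # l0 "norm": number of non-zero entries -> 2
l1 = np.linalg.norm(x, ord=1)            # Manhattan norm -> 7.0
l2 = np.linalg.norm(x, ord=2)            # Euclidean norm -> 5.0
linf = np.linalg.norm(x, ord=np.inf)     # Chebyshev norm -> 4.0
lp = np.sum(np.abs(x) ** 3) ** (1 / 3)   # general Minkowski norm with p = 3

print(l0, l1, l2, linf, lp)
```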
5. Distance Measures: $d(x, y) = \|x - y\|$
Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$, where $S$ is the covariance matrix.
It accounts for the fact that the variances in each direction are different, normalizing by the covariance matrix (between variables) to make the distance metric scale-invariant.
Euclidean distance ($\ell_2$): $d_e(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Equals the Mahalanobis distance for a Gaussian with uncorrelated variables of unit variance (covariance: identity matrix).
Normalized Euclidean distance: $d_M(x, y) = \sqrt{\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{s_i^2}}$
$s_i$: SD of the $i$-th component over the sample (covariance: diagonal matrix).
Sum of squared distances: $d_s(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2$
Hamming distance ($\ell_0$): number of positions at which the corresponding symbols differ.
6. Distance Measures (2)
Manhattan distance ($\ell_1$): $d_m(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
Often good for sparse features, e.g. text.
Chebyshev/chessboard distance ($\ell_\infty$): $d_\infty(x, y) = \max_i |x_i - y_i|$
Assumes only the most significant dimension is relevant.
Vector/cosine distance: for sparse text mining
$d_v(x, y) = 1 - \frac{x \cdot y}{\|x\|\,\|y\|} = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$, where $\|x\|$ is the Euclidean norm.
Invariant to global scalings of the signal.
Tanimoto/Jaccard distance: $d_T(x, y) = 1 - \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}$
Pearson correlation distance: $d_{cor}(x, y) = 1 - \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
KL divergence: $KL(p, q) = \sum_i p_i \log_2 (p_i / q_i)$
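Most of the distance measures above are available in SciPy; a hedged sketch, assuming scipy and numpy are installed:

```python
# Sketch of the distance measures above using SciPy (assumed available).
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(x, y))       # l2 distance
print(distance.cityblock(x, y))       # Manhattan (l1) distance
print(distance.chebyshev(x, y))       # l-infinity distance
print(distance.cosine(x, y))          # 1 - cos(x, y); ~0 here since y = 2x
print(distance.correlation(x, y))     # 1 - Pearson correlation
print(distance.hamming([1, 0, 1], [1, 1, 0]))  # fraction of differing positions

# Mahalanobis distance needs the inverse covariance matrix of the data
data = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```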
7. Dissimilarity Matrix
Expresses the pairwise (dis)similarity between two sets of objects.
The placement of clusters and the within-cluster order is obtained by a seriation algorithm, which tries to place large similarities/small dissimilarities close to the diagonal.
Cophenetic distance: height of the dendrogram where the two
branches that include the two objects merge into a single branch.
9. K-Means
K-Means/centroids: minimize the $\ell_2$/Euclidean sum of squared distances to the cluster centers/means.
Objective: $\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2$
Algo steps: $O(I \cdot K \cdot m \cdot n)$
1. Initialize $\mu_1, \dots, \mu_k$ (randomly (keep the lowest cost) or manually)
2. Repeat until no change (less than a threshold) in $\mu_1, \dots, \mu_k$:
assign each of $x_1, \dots, x_n$ to its nearest $\mu_i$
3. Recalculate $\mu_1, \dots, \mu_k$ (coordinate descent)
Initialization:
Random: Choose random data index, Cons: May choose
nearby points
Distance Based: Start at random, find next point farthest,
Cons: May choose outliers
k-means++ (Random + Distance): Find next point far but
randomly, Cons: May choose outliers
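A minimal scikit-learn sketch of the k-means procedure above (scikit-learn and the make_blobs toy data are assumptions, not part of the original notes):

```python
# Sketch of k-means with k-means++ initialization (scikit-learn assumed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# init="k-means++" implements the Random + Distance initialization described above;
# n_init restarts the algorithm and keeps the lowest-cost solution.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)

print(km.cluster_centers_)   # the final means mu_1..mu_k
print(km.inertia_)           # within-cluster sum of squared distances (the objective)
```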
10. K-Means(2)
Pros:
Fast for low dimensional data.
Can find subclusters for large k.
Cons:
Restricted to globular data (spherical/axis-aligned elliptical clusters of similar size) that has the notion of a center.
Assignments are discrete, with no overlap.
Not guaranteed to find a globally optimal solution.
Final clusters are sensitive to the selection of initial centroids.
Can produce empty clusters
Sensitive to Outliers
11. Mini Batch K-Means
Subsets of the input data are randomly sampled in each training iteration.
Centroids are updated with batches of sampled data (taking a streaming average of the sample and all previously assigned samples), instead of with all individual points.
Steps:
1. Samples are drawn randomly and assigned to the nearest centroid.
2. For each example: increment the per-center count and compute the per-center learning rate $lr = \frac{1}{\text{per-center count}}$.
3. $c = (1 - lr)\,c + lr \cdot x$ (gradient descent step for example $x$)
Pros: for huge amounts of data;
single pass over the data (no need to read similar points many times and compare their distances to each centroid at every iteration).
Cons: results are generally slightly worse.
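A hedged mini-batch k-means sketch with scikit-learn (assumed available); batch_size and the streaming loop are illustrative choices:

```python
# Sketch of mini-batch k-means (scikit-learn assumed); batch_size controls the
# number of samples drawn per iteration, partial_fit supports streaming chunks.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=8, random_state=0)

mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=0)
mbk.fit(X)                      # fits on random mini-batches
print(mbk.inertia_)

# streaming variant: feed the data chunk by chunk
mbk2 = MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=0)
for chunk in np.array_split(X, 100):
    mbk2.partial_fit(chunk)
```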
12. K-Medoids (PAM: Partitioning around medoids)
Medoid(representative): Cluster center is chosen from original
datapoints(object closest to the center).
Minimize manhattan(l1)/pearson-correlation/any-other
distance to cluster medoids instead of sum of squares.
Analogous to finding medians instead of means
Steps:
Build: initialize by selecting k of the n data points as the medoids.
Associate each data point to the closest medoid.
Swap: For each medoid m, and each non-medoid data point
o, while the cost of the configuration decreases:
Swap m and o, recompute the cost (sum of distances of points
to their medoid)
If the total cost of the configuration increased in the previous
step, undo the swap
13. K-Medoids (2)
Pros:
Used when a mean or centroid cannot be defined (eg: 3-D
trajectories, gene expression context)
For sparse cases.
More robust than k-means for noise and outliers
Cons:
More expensive, $O(n^2 \cdot k \cdot i)$ vs $O(n \cdot k \cdot i)$, due to pairwise distance calculations.
Precomputing the distance matrix speeds this up at the cost of $O(n^2)$ memory.
Ref: http://www.cs.umb.edu/cs738/pam1.pdf
14. CLARA: Clustering for Large Applications
Draws multiple samples of the data (each containing the best medoids found so far), applies PAM to each sample, assigns the entire dataset to the nearest medoids, and returns the best clustering (minimal sum/average of dissimilarities).
15. Hierarchical
Create a hierarchical decomposition of the set of data (or objects)
using some criterion. Produce nested sets of clusters.
Agglomerative: bottom-up approach (UPGMA: Unweighted Pair Group Method with Arithmetic Mean).
Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: top-down approach (DIANA: DIvisive ANAlysis clustering).
All observations start in one cluster, which is split recursively.
Algo steps:
1. Compute the proximity matrix and sort: $O(m^2 \log m)$
2. REPEAT until 1 cluster remains: $O(m - 1)$ iterations
3. Merge the closest 2 clusters (linkage criteria): $O((m - i + 1)^2)$
4. Update the proximity matrix to reflect proximity between the new and old clusters: $O(m - i + 1)$
Dendrogram: sequence of merges and their distances (heights).
Clustergram: clusters on rows (data) and columns (features); reorder data and features to expose behavior among groups.
16. Hierarchical (2)
Linkage criteria: determines the metric used for the merge strategy
Ward: minimizes the sum of squared differences within all
clusters (a variance-minimizing approach, similar to k-means)
single, complete, average, centroid:
minimum/maximum/average/mean pair-wise distance
between observations of pairs of clusters.
(Min/single/MST: produces long connected or small isolated clusters; susceptible to noise/outliers.
Max/average: produces spatially grouped clusters and avoids elongated clusters, but may not work well with non-globular clusters.)
Ward & centroid: For numerical variables
Dissimilarity measure: Euclidean or Correlation-based distance
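A small SciPy sketch of agglomerative clustering with the linkage criteria above (SciPy is assumed; the random toy data is illustrative):

```python
# Sketch of agglomerative clustering and cophenetic distances with SciPy (assumed).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster, cophenet
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(30, 2))

Z = linkage(X, method="ward")        # other options: "single", "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters

# cophenetic correlation: how faithfully the dendrogram preserves pairwise distances
c, _ = cophenet(Z, pdist(X))
print(c)

# dendrogram(Z) draws the merge heights described above (needs matplotlib)
```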
17. Birch: balanced iterative reducing and clustering using
hierarchies
Steps:
1. Every new sample is inserted into the root of the Clustering
Feature Tree(height-balanced tree).
2. It is then merged with the subcluster whose centroid is closest to the new sample.
3. This is done recursively until it reaches the leaf subcluster with the closest centroid.
Pros: useful for reducing a large number of observations to a large number of (sub)clusters.
Cons: does not scale to high-dimensional datasets.
19. Density Based
Density: number of points within a specified radius (Eps).
Clusters: areas of high density separated by areas of low density.
Points: core (more than MinPts within Eps), border (fewer than MinPts within Eps but in the Eps-neighborhood of a core point), and outlier (neither).
Reachability:
Directly density-reachable (p from q): q is a core point and p is in the neighborhood of q (asymmetric).
Density-reachable [indirectly] (p from q): there is a chain q to p1, p1 to p2, ..., pn to p of directly density-reachable steps.
Density-connected: p and q are density-connected if there is a core point o such that both p and q are density-reachable from o.
20. DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Flat clustering with a single level (like k-means).
Cluster: a maximal set of density-connected points.
Used when clusters are irregular or intertwined, and when noise and outliers are present.
Steps:
1. For each point $x_i$ (not yet classified), compute the distance to all other points.
Find all neighbors within distance Eps of $x_i$.
A point with neighbor count $\geq$ MinPts is marked as a core point, otherwise as a border point.
2. For each unassigned core point, create a new cluster.
Recursively find all its density-connected points and assign them to the same cluster as the core point.
3. If p is a border point, no points are density-reachable from p; continue with the remaining unvisited points in the dataset.
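A minimal DBSCAN sketch with scikit-learn (assumed available); eps and min_samples correspond to the Eps and MinPts parameters above:

```python
# Sketch of DBSCAN on a non-globular toy dataset (scikit-learn assumed).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                      # -1 marks noise/outlier points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, np.sum(labels == -1))  # number of clusters, number of noise points
```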
21. DBSCAN: Determining Eps (ε) and MinPts (k)
MinPts: ≥ number of dimensions + 1
k too small: Noise will be classified as cluster
k too large: Small cluster will be classified as noise
Eps: For points in a cluster, their kth-nearest neighbors are at
roughly the same distance, and for noise points at farther distance
Find distance of every point to its kth (minPts) nearest neighbor
Plot distances in an ascending order
Find the knee for optimal eps
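A hedged sketch of the k-distance ("knee") heuristic above, assuming scikit-learn and matplotlib are available; k plays the role of MinPts:

```python
# Sketch of the k-distance plot used to pick eps (scikit-learn, matplotlib assumed).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
k = 5  # MinPts

# +1 because each point is returned as its own nearest neighbor (distance 0)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
kth = np.sort(dists[:, -1])              # each point's k-th nearest-neighbor distance

plt.plot(kth)
plt.xlabel("points sorted by k-th NN distance")
plt.ylabel("k-th nearest-neighbor distance")
plt.show()                               # read eps off the knee of this curve
```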
22. DBSCAN: Observations
Pros:
1. Handles data of any shape/size
2. Resistant to noise
3. One scan
4. The number of clusters is determined automatically.
Cons:
1. Non-deterministic: a non-core sample that lies within eps of core samples in two different clusters is assigned to whichever cluster is generated first.
2. Fails on clusters of varying density:
eps too small: sparser clusters will be labeled as noise.
eps too large: denser clusters may be merged together.
3. Sensitive to parameters (hard to determine the correct set of parameters).
23. OPTICS: Ordering Points To Identify the Clustering Structure
Identifies clusters with different densities.
Index-based: k dimensions, N points
Core distance of an object p: the smallest value ε' such that the ε'-neighborhood of p has at least MinPts objects.
Reachability distance of object p from core object q: the minimum radius value that makes p density-reachable from q:
$rd_{\varepsilon, MinPts}(p, q) = \begin{cases} \text{UNDEFINED} & \text{if } q \text{ is not a core object} \\ \max(\text{core-dist}_{\varepsilon, MinPts}(q),\ dist(q, p)) & \text{otherwise} \end{cases}$
24. OPTICS Algorithm
Steps:
1. Choose p at random or from the queue; output p and its core/reachability distances.
2. Find p's core distance: the smallest distance ε' ≤ ε such that p is a core object.
3. Compute the reachability distance for each other point in p's ε-neighborhood.
4. Visit the points in order of their reachability distance, repeating steps 1-3 as necessary.
Points are linearly ordered such that spatially closest points become neighbors in the ordering (analogous to a single-link dendrogram).
Only MinPts is fixed; plot the radius at which each object would be considered dense by DBSCAN.
The resulting clusters are nearly indistinguishable from those found by DBSCAN.
Reachability plot: priority heap; nearby objects are nearby in the plot.
Ref Algo: http://www.osl.iu.edu/~chemuell/projects/
presentations/optics-v1.pdf
25. OPTICS Observations
Pros:
No strict partitioning.
No need to specify Eps.
Hierarchical partitioning (by detecting steep areas of the reachability plot).
Cons:
Slower (about 1.6x) than DBSCAN (because nearest-neighbor queries are more complicated than radius queries).
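A short OPTICS sketch with scikit-learn (assumed available); only min_samples is fixed, and a DBSCAN-like flat clustering can be extracted afterwards at any eps:

```python
# Sketch of OPTICS and a DBSCAN-style cut of its ordering (scikit-learn assumed).
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[0.5, 1.5, 3.0], random_state=0)

opt = OPTICS(min_samples=5).fit(X)
print(opt.reachability_[opt.ordering_])   # reachability-plot values, in cluster order

# Extract a DBSCAN-like flat clustering at a chosen eps from the same ordering
labels = cluster_optics_dbscan(
    reachability=opt.reachability_,
    core_distances=opt.core_distances_,
    ordering=opt.ordering_,
    eps=1.0,
)
print(labels[:20])
```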
26. FCM: Fuzzy C-Means
Each element has a set of membership coefficients corresponding
to degree of belonging to a cluster.
Objective: $\sum_{j=1}^{k} \sum_{x_i \in C_j} u_{ij}^m \, (x_i - c_j)^2$
$u_{ij}$ is the degree to which observation $x_i$ belongs to cluster $c_j$, linked inversely to the distance from $x_i$ to the cluster center.
$c_j = \frac{\sum_{x_i \in C_j} u_{ij}^m \, x_i}{\sum_{x_i \in C_j} u_{ij}^m}$
$u_{ij} = \frac{1}{\sum_{l=1}^{k} \left( \frac{|x_i - c_j|}{|x_i - c_l|} \right)^{\frac{2}{m-1}}}$
$m$ is the fuzzifier/level of fuzziness: $1 < m < \infty$
$m \to 1$: memberships become crisp (0 or 1), recovering hard k-means.
$m \to \infty$: $u_{ij} \to 1/k$ (complete fuzziness).
27. FCM: Steps
Specify the number of clusters k (chosen by the analyst).
Randomly assign to each point coefficients for being in the clusters: matrix $U = [u_{ij}]$, initialized as $U^{(0)}$.
At step k: calculate the centroid vectors $C^{(k)} = [c_j]$ using $U^{(k)}$.
Update $U^{(k)} \to U^{(k+1)}$: the coefficients of being in the clusters.
Repeat until the maximum number of iterations or until the algorithm has converged, i.e. $\|U^{(k+1)} - U^{(k)}\| < \varepsilon$ (sensitivity threshold).
28. FCM: Observations
The minimum found is a local minimum, and the results depend on the initial choice of weights.
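A minimal NumPy sketch of the FCM updates above (not a library implementation; the function name and default parameters are illustrative assumptions):

```python
# Sketch of the fuzzy c-means centroid and membership updates (NumPy assumed).
import numpy as np

def fuzzy_c_means(X, k, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # random initial membership matrix U, rows sum to 1
    U = rng.random((n, k))
    U /= U.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        Um = U ** m
        # centroid update: c_j = sum_i u_ij^m x_i / sum_i u_ij^m
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)), axis=2)
    return C, U
```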
29. GMM: Gaussian Mixture Models
Composite distribution whereby points are drawn from one of k
Gaussian sub-distributions, each with its own probability.
Mixture models are a semi-parametric alternative to non-parametric histograms (density estimates), providing greater flexibility and precision; Bayes' theorem is used to perform classification.
1D Gaussian: $N(x \mid \mu, \sigma^2)$
nD Gaussian: $N(x \mid \mu, \Sigma)$, with $x$ and $\mu$ vectors
1D mixture of Gaussians: $p(x) = \sum_k \pi_k \, N(x \mid \mu_k, \sigma_k^2)$
Each component has:
$\mu_k$: mean vector
$\Sigma_k$: covariance matrix (volume, shape, orientation)
$\pi_k$: mixture weight/probability/prior (relative proportion of data from group k in the full data)
Prior: $p(z_i = k) = \pi_k$
Likelihood: $p(x_i \mid z_i = k, \mu_k, \Sigma_k) = N(x_i \mid \mu_k, \Sigma_k)$. Each cluster k is centered at its mean $\mu_k$, with increased density for points near the mean.
Model naming (e.g. EVI): Equal (volume), Variable (shape), Identity (orientation).
30. EM: Expectation Maximization
E stage: given the parameters, finding the right mixture (where
does each sample come from?)
M stage: given the mixtures, find the right parameters for each
class
Uses the EM algorithm to induce the maximum-likelihood model given a set of samples. EM requires an a priori selection of the model order, namely the number of mixture components M (analogous to k in k-means) to be incorporated into the model.
initialModel: starting point/random
convergenceTol: Maximum change in log-likelihood at which we
consider convergence achieved.
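A hedged scikit-learn sketch of fitting a GMM by EM (scikit-learn is assumed; covariance_type and tol map loosely to the model constraints and convergence tolerance described above):

```python
# Sketch of a Gaussian mixture fit by EM (scikit-learn assumed).
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

# covariance_type constrains volume/shape/orientation ("full", "tied", "diag",
# "spherical"); tol is the convergence threshold on the log-likelihood.
gmm = GaussianMixture(n_components=3, covariance_type="full", tol=1e-3, random_state=0)
gmm.fit(X)

print(gmm.weights_)                 # mixture weights pi_k
print(gmm.means_)                   # means mu_k
soft = gmm.predict_proba(X[:5])     # E-step responsibilities p(z = k | x)
hard = gmm.predict(X[:5])           # hard assignment to the most probable component
```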
32. LDA: Latent Dirichlet allocation
Assigns query data points to the multinomial/categorical components that maximize the component posterior probability given the data.
Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
Each document is a mixture of topics (generated from a Dirichlet distribution), each of which emits words with a certain probability.
Steps:
Fix K, the number of topics to discover.
Gibbs sampling:
Randomly assign each word in each document to one of the K topics (this gives topic representations of all the documents and word distributions of all the topics, poor initially).
33. LDA Algo
To improve on them:
for each document d, each word w in d, and for each topic t:
compute two things:
1. p(topic t | document d) = the proportion of words in
document d that are currently assigned to topic t.
2. p(word w | topic t) = the proportion of assignments to
topic t over all documents that come from this word w.
Reassign w to a new topic, choosing topic t with probability p(topic t | document d) * p(word w | topic t) (the probability that topic t generated word w).
i.e., assume that all topic assignments except for the current word in question are correct (priors), and update the assignment of the current word using our model of how documents are generated.
Repeating this converges to an estimate of the topic mixture of each document and the words associated with each topic.
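A small scikit-learn LDA sketch (assumed available); note that scikit-learn's implementation uses variational inference rather than the Gibbs sampling described above, and the toy corpus is illustrative:

```python
# Sketch of LDA topic modelling on word-count vectors (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors traded shares on the market",
]

counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)                      # documents as word-count vectors

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))                         # per-document topic mixtures
print(lda.components_.shape)                    # per-topic word weights (topics x vocab)
```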
35. Internal Evaluation
Compactness, connectedness, and separation
Desc:
Based on the data that was clustered itself.
Assign the best score to the algorithm that produces clusters
with high similarity within a cluster and low similarity between
clusters.
Validity as measured by such an index/criterion depends on the claim that some kind of structure exists in the data (e.g. convex clusters for k-means).
Cons:
Biased towards algorithms that use the same cluster model (e.g. k-means is distance based, so distance-based criteria will overrate it).
Can show that one algorithm performs better than another, but not that it produces more valid results.
High scores on an internal measure do not necessarily translate into effective IR applications.
36. WSSSE (Within Set Sum of Squared Error)
$D_k = \sum_{x_i \in C_k} \sum_{x_j \in C_k} \|x_i - x_j\|^2 = 2 n_k \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$
Sum of intra-cluster distances between the $n_k$ points in cluster $C_k$.
$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$ (normalized intra-cluster sums of squares): within-cluster dispersion.
Elbow method: naive solution based on intra-cluster variance.
Plot the percentage of variance explained (F-test: between-group variance / total variance) vs the number of clusters.
The optimal k is usually the one at the elbow of the WSSSE graph.
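A minimal elbow-method sketch using k-means inertia (WSSSE) from scikit-learn (scikit-learn and matplotlib are assumptions):

```python
# Sketch of the elbow method: plot WSSSE (inertia) against k (scikit-learn assumed).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

ks = range(1, 11)
wssse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wssse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WSSSE (inertia)")
plt.show()   # pick k at the elbow of the curve
```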
37. GAP statistics
A way to standardize the elbow graph by comparing $\log W_k$ with its expectation $E_n^*\{\log W_k\}$ under a null reference distribution.
$\mathrm{Gap}_n(k) = E_n^*\{\log W_k\} - \log W_k$
$E_n^*$: average of $\log W_k^*$ over $x$ Monte Carlo samples from the reference distribution.
Choose the smallest k such that $\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$,
i.e. the k for which $\log W_k$ falls the farthest below the reference curve $E_n^*$.
$s_k = \mathrm{sd}(k)\sqrt{1 + 1/x}$: simulation error.
Intended for globular, Gaussian-distributed, mildly disjoint distributions.
Ref:
http://www.web.stanford.edu/~hastie/Papers/gap.pdf
38. Silhouette
$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} = \begin{cases} 1 - a(i)/b(i) & \text{if } a(i) < b(i) \\ 0 & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1 & \text{if } a(i) > b(i) \end{cases}$
$a(i)$: average distance from the $i$-th point to the other points in the same cluster as $i$.
$b(i)$: minimum average distance from the $i$-th point to points in a different cluster, minimized over clusters.
$s(i)$ ranges from $-1$ to $+1$.
A high value indicates that $i$ is well matched to its own cluster and poorly matched to neighboring clusters.
The number of clusters is appropriate if most points have a high/positive silhouette value; it is likely too large or too small if many points have a low/negative value.
39. Dunn index
$D = \frac{\min_{1 \le i < j \le n} d(i, j)}{\max_{1 \le k \le n} d'(k)}$
d(i, j): inter-cluster distance between clusters i and j
d'(k): intra-cluster distance of cluster k
Higher score indicates a better model.
40. Davies-Bouldin index
$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$
n: number of clusters
$c_x$: centroid of cluster x
$\sigma_x$: average distance of all elements in cluster x to centroid $c_x$
$d(c_i, c_j)$: distance between centroids $c_i$ and $c_j$
Lower score indicates a better model.
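A hedged sketch of internal evaluation with scikit-learn (assumed available), covering the silhouette and Davies-Bouldin indices above:

```python
# Sketch of internal evaluation: compare k values by silhouette and Davies-Bouldin.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          silhouette_score(X, labels),        # higher is better, in [-1, 1]
          davies_bouldin_score(X, labels))    # lower is better
```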
41. External Evaluation
Desc:
Based on data that was not used for clustering(external
benchmarks/known class labels: gold standard, created by
humans).
Measure how similar the resulting clusters are to the
benchmark classes/clusters
No assumption made on cluster structure.
A form of semi-supervised evaluation (requires ground-truth labels).
42. Rand Index (Adjusted)
$RI = \frac{TP + TN}{TP + FP + FN + TN}$
Measure of the percentage of correct decisions made by the
algorithm.
Cons:
False positives and false negatives are equally weighted.
The raw RI is not centered: random (uniform) label assignments are not guaranteed to score close to 0.0 (e.g. when the number of clusters is of the same order as the number of samples).
$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$
Can yield negative values if the index is less than the expected index.
43. Jaccard index
$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN}$, with $0 \le J \le 1$
J = 1: identical clusterings.
J = 0: no common elements.
Number of unique elements common to both sets divided by the total number of unique elements in both sets.
44. Fowlkes-Mallows index
$FM = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}$
The FM index is the geometric mean of precision and recall.
Performs well on noisy data.
Higher score is better.
45. F-measure
$F_\beta = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}$, where $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, $\beta \ge 0$
$\beta = 0 \Rightarrow F_0 = P$, i.e. recall has no impact on the F-measure.
Increasing $\beta$ allocates an increasing amount of weight to recall in the final F-measure.
The F-measure is the harmonic mean of precision and recall (when $\beta = 1$).
46. V Measure
homogeneity/purity: each cluster contains only members of a
single class.
completeness: all members of a given class are assigned to the
same cluster.
$v = 2 \cdot \frac{h \cdot c}{h + c}$
$h = 1 - \frac{H(C|K)}{H(C)}$, $c = 1 - \frac{H(K|C)}{H(K)}$
$H(C|K) = -\sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{n_{c,k}}{n} \log \frac{n_{c,k}}{n_k}$
Pros: the score is between 0 (bad) and 1 (perfect).
A bad score can be qualitatively analyzed via h and c.
Cons: random labeling won't yield zero scores, especially when the number of clusters is large.
47. Entropy & Purity
Entropy: amount of uncertainty for a partition set
$H(X) = E[I(X)] = E[-\ln(P(X))] = -\sum_{x=1}^{|X|} P(x) \log P(x) = -\sum_{x=1}^{|X|} \frac{N_x}{N} \log \frac{N_x}{N}$
P(x): probability that an object from X falls into class $X_x$
48. Mutual Information
Information theoretic measure of how much information is shared
in ’bits’ between a clustering and a ground-truth classification.
Can detect a non-linear similarity between two clusterings.
$MI(U, V) = \sum_{u=1}^{|U|} \sum_{v=1}^{|V|} P(u, v) \log \frac{P(u, v)}{P(u)\,P(v)}$
P(u): probability that an object falls into cluster $U_u$
$P(u, v) = |U_u \cap V_v| / N$: probability that an object falls into both $U_u$ and $V_v$
MI = 0 if U and V are independent.
49. Normalized, Adjusted & Standardized Mutual Information
$NMI = \frac{MI}{\sqrt{H(U) \cdot H(V)}}$
$AMI = \frac{MI - E[MI]}{N(H(U), H(V)) - E[MI]}$, where the normalizer $N$ is e.g. $\sqrt{H(U) \cdot H(V)}$ or $\max(H(U), H(V))$
$SMI = \frac{MI - E[MI]}{\sqrt{\mathrm{Var}(MI)}}$
Ref: http://www.jmlr.org/proceedings/papers/v32/romano14.pdf
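A minimal scikit-learn sketch of the external measures above (scikit-learn is assumed; the ground-truth and predicted labels are toy values):

```python
# Sketch of external evaluation against known labels (scikit-learn assumed).
from sklearn.metrics import (adjusted_rand_score, adjusted_mutual_info_score,
                             normalized_mutual_info_score, v_measure_score,
                             fowlkes_mallows_score)

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [0, 0, 1, 1, 1, 1, 2, 2, 0]   # a hypothetical clustering to score

print(adjusted_rand_score(truth, pred))            # ARI
print(adjusted_mutual_info_score(truth, pred))     # AMI
print(normalized_mutual_info_score(truth, pred))   # NMI
print(v_measure_score(truth, pred))                # V-measure
print(fowlkes_mallows_score(truth, pred))          # FM index
```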
50. Comparison
Algorithm      Density  Size  Shape  Noise  Complexity
k-means        N        N     N      N      O(I*K*m*n)
k-medoids      N        N     N      N      O(I*K*m^2)
Hierarchical   N        Y     N      Y      O(m^3)
DBSCAN         N        Y     Y      Y      O(m^2)
OPTICS         Y        Y     Y      Y      O(m*log m)
GMM            Y        Y     Y      N