Clustering
Smrutiranjan Sahu
Talentica Software
Introduction
Spam Detection (Mail)
Handwriting Recognition (Mobile)
Voice Recognition (Mobile)
Recommendation (Youtube)
Used for finding outliers in data.
Used to impute missing values.
Used for Feature Generation/Representation
Used for exploratory analysis and/or as a component of a
hierarchical supervised learning pipeline. Distinct classifiers or
regression models are trained for each cluster.
Used for Data Compression
Introduction
Measures
Norms
Distance
Algorithms
Partition Based
K-Means
Mini Batch K-Means
K-Medoids
CLARA
Hierarchical
Birch
Density Based
DBScan
Optics
Model/Distr Based
FCM
GMM
LDA
Evaluation
Internal
WSSSE
GAP
Silhouette
Dunn index
Davies-Bouldin index
External
Rand index
Jaccard index
FM index
F-measure
V Measure
Entropy and Purity
MI Index
Comparison
References
Norms
lp norm (Minkowski): ||x||_p = (Σ_{i=1}^{n} |x_i|^p)^{1/p}
l0 norm: number of non-zero elements in a vector
text classification, sparse, discrete data
||x||_0 = #{i : x_i ≠ 0}, e.g. [3, 4] → 2
l1 norm: Manhattan norm (sum of absolute values)
lasso regression, convex, non-differentiable at 0
||x||_1 = Σ_{i=1}^{n} |x_i|, e.g. [3, 4] → 7
l2 norm: Euclidean norm (vector length)
ridge regression
||x||_2 = √(x_1² + x_2² + ... + x_n²) = √(xᵀx), e.g. [3, 4] → 5
l∞ norm: Chebyshev norm (king moves in chess)
||x||_∞ = max_i |x_i|, e.g. [3, 4] → 4
Relaxing l0 towards l1 for sparsity:
l0: results in an NP-hard problem
l_p with p < 1: results in a non-convex problem
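As a quick sanity check of the examples above, the norms can be computed with numpy (numpy itself is an assumption here; it is not part of the original deck):

```python
import numpy as np

x = np.array([3.0, 4.0])

print(np.count_nonzero(x))            # l0 "norm": count of non-zero elements -> 2
print(np.linalg.norm(x, ord=1))       # l1 / Manhattan: sum of absolute values -> 7.0
print(np.linalg.norm(x, ord=2))       # l2 / Euclidean: sqrt(x^T x) -> 5.0
print(np.linalg.norm(x, ord=np.inf))  # l-infinity / Chebyshev: max |x_i| -> 4.0

p = 3                                  # general lp (Minkowski) norm
print(np.sum(np.abs(x) ** p) ** (1.0 / p))
```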
Distance Measures
d(x, y) = ||x − y||
Mahalanobis distance: d(x, y) = √((x − y)ᵀ S⁻¹ (x − y))
S is the covariance matrix.
Accounts for the fact that the variances in each direction differ; normalizing by the covariance matrix (between variables) makes the distance scale-invariant.
Euclidean distance (l2): d_e(x, y) = √(Σ_{i=1}^{n} (x_i − y_i)²)
Equals the Mahalanobis distance for uncorrelated variables with unit variance (covariance = identity matrix).
Normalized Euclidean distance: d_M(x, y) = √(Σ_{i=1}^{n} (x_i − y_i)² / s_i²)
s_i: SD of the x_i and y_i over the sample.
Equals the Mahalanobis distance with a diagonal covariance matrix.
Sum of squared distance: d_s(x, y) = Σ_{i=1}^{n} (x_i − y_i)²
Hamming distance (l0): number of positions at which the corresponding values differ.
Distance Measures (2)
Manhattan distance (l1): d_m(x, y) = Σ_{i=1}^{n} |x_i − y_i|
Often good for sparse features, e.g. text.
Chebyshev/chessboard distance (l∞): d_∞(x, y) = max_i |x_i − y_i|
Assumes only the most significant dimension is relevant.
Vector/cosine distance: for sparse text mining
d_v(x, y) = 1 − x·y / (||x|| ||y||) = 1 − Σ_{i=1}^{n} x_i y_i / (√(Σ_{i=1}^{n} x_i²) √(Σ_{i=1}^{n} y_i²))
where ||x|| is the Euclidean norm.
Invariant to global scalings of the signal.
Tanimoto/Jaccard distance: d_T(x, y) = 1 − x·y / (||x||² + ||y||² − x·y)
Pearson correlation distance:
d_cor(x, y) = 1 − cov(X, Y) / (σ_X σ_Y) = 1 − Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / (√(Σ_{i=1}^{n} (x_i − x̄)²) √(Σ_{i=1}^{n} (y_i − ȳ)²))
KL divergence: KL(p, q) = Σ_i p_i log₂(p_i / q_i)
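Most of these distances are available in scipy (scipy/numpy are assumed here, not part of the deck); a minimal sketch:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(x, y))     # l2
print(distance.cityblock(x, y))     # l1 / Manhattan
print(distance.chebyshev(x, y))     # l-infinity / chessboard
print(distance.cosine(x, y))        # 1 - cosine similarity (0 here: same direction)
print(distance.correlation(x, y))   # 1 - Pearson correlation
print(distance.hamming([1, 0, 1], [1, 1, 1]))  # note: scipy returns the *fraction* of differing positions

# Mahalanobis distance needs the inverse covariance matrix of a data sample
X = np.random.RandomState(0).randn(100, 3)
VI = np.linalg.inv(np.cov(X, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```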
Dissimilarity Matrix
Expresses the pairwise (dis)similarity between two sets of objects.
The placement of clusters and the within-cluster order is obtained by a seriation algorithm, which tries to place large similarities / small dissimilarities close to the diagonal.
Cophenetic distance: height of the dendrogram where the two
branches that include the two objects merge into a single branch.
Introduction
Measures
Norms
Distance
Algorithms
Partition Based
K-Means
Mini Batch K-Means
K-Medoids
CLARA
Hierarchical
Birch
Density Based
DBScan
Optics
Model/Distr Based
FCM
GMM
LDA
Evaluation
Internal
WSSSE
GAP
Silhouette
Dunn index
Davies-Bouldin index
External
Rand index
Jaccard index
FM index
F-measure
V Measure
Entropy and Purity
MI Index
Comparison
References
K-Means
K-Means/Centroids: minimize the l2/Euclidean (or Mahalanobis) sum-of-squares distance to the cluster centers/means.
Objective: arg min_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ||x − µ_i||²
Algo steps: O(I · K · m · n)
1. Initialize µ_1, ..., µ_k (randomly (keep the lowest-cost run) or manually)
2. Repeat until no change (less than a threshold) in µ_1, ..., µ_k:
assign x_1, ..., x_n each to its nearest µ_i
3. recalculate µ_1, ..., µ_k (coordinate descent)
Initialization:
Random: Choose random data index, Cons: May choose
nearby points
Distance Based: Start at random, find next point farthest,
Cons: May choose outliers
k-means++ (Random + Distance): Find next point far but
randomly, Cons: May choose outliers
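A minimal scikit-learn sketch of the above (the deck's references point to scikit-learn; dataset and parameter values below are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# init='k-means++' is the (Random + Distance) scheme above; n_init restarts
# the algorithm several times and keeps the lowest-cost solution.
km = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)   # the final means mu_1..mu_k
print(km.inertia_)           # within-cluster sum of squared distances (the objective)
```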
K-Means(2)
Pros:
Fast for low dimensional data.
Can find subclusters for large k.
Cons:
Restricted to globular data (spherical / axis-aligned elliptical shapes of similar size) that has a notion of a center.
Produces hard assignments: discrete clusters with no overlap.
Not guaranteed to find a globally optimal solution.
Final clusters are sensitive to the selection of initial centroids.
Can produce empty clusters
Sensitive to Outliers
Mini Batch K-Means
Subsets of the input data, randomly sampled in each training
iteration.
Update centroids with batches of sample data(taking streaming
average of the sample and all previous samples assigned), instead
of all individuals.
Steps:
1. Samples drawn randomly and assigned to nearest centroid
2. Repeat for each example: increment the per-center count and compute the per-center learning rate lr = 1/count
3. c = (1 − lr)·c + lr·x (a gradient-descent step for example x)
Pros: for huge amounts of data.
A single pass over the data (no need to read similar points many times and compare their distances with every centroid at every iteration).
Cons: Results are generally slightly worse.
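A minimal numpy sketch of the streaming update rule just described (function name and parameters are illustrative; scikit-learn's MiniBatchKMeans implements the same idea):

```python
import numpy as np

def minibatch_kmeans_sketch(X, k, batch_size=100, n_iter=50, seed=0):
    """Illustrative sketch of the per-center streaming update c = (1 - lr)c + lr*x."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    counts = np.zeros(k)                                  # per-center counts
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), batch_size, replace=False)]
        # assign each batch sample to its nearest centroid
        d = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        for x, c in zip(batch, assign):
            counts[c] += 1
            lr = 1.0 / counts[c]                          # per-center learning rate
            centers[c] = (1 - lr) * centers[c] + lr * x   # streaming average
    return centers

X = np.random.default_rng(1).normal(size=(1000, 2))
print(minibatch_kmeans_sketch(X, k=3))
```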
K-Medoids (PAM: Partitioning around medoids)
Medoid(representative): Cluster center is chosen from original
datapoints(object closest to the center).
Minimize manhattan(l1)/pearson-correlation/any-other
distance to cluster medoids instead of sum of squares.
Analogous to finding medians instead of means
Steps:
Build: initially select k of the n data points as the medoids.
Associate each data point to the closest medoid.
Swap: For each medoid m, and each non-medoid data point
o, while the cost of the configuration decreases:
Swap m and o, recompute the cost (sum of distances of points
to their medoid)
If the total cost of the configuration increased in the previous
step, undo the swap
K-Medoids (2)
Pros:
Used when a mean or centroid cannot be defined (eg: 3-D
trajectories, gene expression context)
For sparse cases.
More robust than k-means for noise and outliers
Cons:
More expensive, O(n² · k · i) vs O(n · k · i), due to pairwise distance calculations.
Precomputing the distance matrix speeds it up, but needs O(n²) memory.
Ref: http://www.cs.umb.edu/cs738/pam1.pdf
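A minimal build-and-swap sketch in numpy (a random build is used for brevity; true PAM builds the initial medoids greedily, and the function name and default l1 metric are illustrative):

```python
import numpy as np

def pam_sketch(X, k, metric=lambda a, b: np.abs(a - b).sum(), max_iter=100, seed=0):
    """Minimal PAM-style sketch: build + swap on a precomputed distance matrix."""
    n = len(X)
    # Precompute the full pairwise distance matrix (the O(n^2) memory cost noted above).
    D = np.array([[metric(X[i], X[j]) for j in range(n)] for i in range(n)])
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, k, replace=False))       # Build (random here)

    def cost(meds):
        return D[:, meds].min(axis=1).sum()               # sum of distances to nearest medoid

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(k):                               # Swap phase
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = o                             # try swapping medoid mi with point o
                c = cost(trial)
                if c < best:                              # keep only cost-decreasing swaps
                    medoids, best, improved = trial, c, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.random.default_rng(1).normal(size=(60, 2))
print(pam_sketch(X, k=3)[0])
```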
CLARA: Clustering for Large Applications
Draws multiple samples of the data (each containing the best medoids found so far), applies PAM to each sample, assigns the entire dataset to the nearest medoids, and returns the best clustering (minimal sum/average of dissimilarities).
Hierarchical
Create a hierarchical decomposition of the set of data (or objects)
using some criterion. Produce nested sets of clusters.
Agglomerative: bottom-up approach (UPGMA: Unweighted Pair Group Method with Arithmetic Mean)
Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: top-down approach (DIANA: DIvisive ANAlysis Clustering)
All observations start in one cluster, which is split recursively.
Algo steps:
1. Compute the proximity matrix and sort it: O(m² log m)
2. REPEAT until 1 cluster remains: O(m − 1) iterations
3. Merge the closest 2 clusters (linkage criteria): O((m − i + 1)²)
4. Update the proximity matrix to reflect the proximity between the new and old clusters: O(m − i + 1)
Dendrogram: the sequence of merges and their distances (heights).
Clustergram: cluster on rows (data) and columns (features); reorder data and features to expose behavior among groups.
Hierarchical (2)
Linkage criteria: determines the metric used for the merge strategy
Ward: minimizes the sum of squared differences within all
clusters (a variance-minimizing approach, similar to k-means)
single, complete, average, centroid: minimum / maximum / average / mean-centroid pairwise distance between observations of two clusters.
Single (min, MST-like): produces long connected or small isolated clusters; susceptible to noise/outliers.
Complete (max) / average: produce spatially compact clusters and avoid elongated ones, but may not work well with non-globular clusters.
Ward & centroid: For numerical variables
Dissimilarity measure: Euclidean or Correlation-based distance
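A minimal scipy sketch of agglomerative clustering with the linkage criteria and cophenetic distance above (scipy is assumed; data and cut level are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster, cophenet
from scipy.spatial.distance import pdist

X = np.random.RandomState(0).randn(50, 2)

Z = linkage(X, method='ward')          # also: 'single', 'complete', 'average', 'centroid'

# Cophenetic correlation: how well the dendrogram heights preserve the
# original pairwise distances (relates to the cophenetic distance above).
c, _ = cophenet(Z, pdist(X))
print(c)

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
# dendrogram(Z)  # plots the merge sequence and heights (requires matplotlib)
```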
Birch: balanced iterative reducing and clustering using
hierarchies
Steps:
1. Every new sample is inserted into the root of the Clustering
Feature Tree(height-balanced tree).
2. It is then clubbed together with the subcluster that has the
centroid closest to the new sample.
3. This is done recursively until the sample reaches the leaf subcluster with the closest centroid.
Pros: good for reducing a large number of observations to a large number of subclusters (e.g. as a preprocessing step).
Cons: does not scale well to high-dimensional data.
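A minimal scikit-learn sketch of BIRCH (parameter values are illustrative): threshold bounds the subcluster radius in the CF-tree, branching_factor bounds the subclusters per node, and n_clusters optionally runs a global clustering step on the leaf subclusters.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=8, random_state=0)

model = Birch(threshold=0.5, branching_factor=50, n_clusters=8)
labels = model.fit_predict(X)

print(len(model.subcluster_centers_))   # centroids of the leaf subclusters
```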
Introduction
Measures
Norms
Distance
Algorithms
Partition Based
K-Means
Mini Batch K-Means
K-Medoids
CLARA
Hierarchical
Birch
Density Based
DBScan
Optics
Model/Distr Based
FCM
GMM
LDA
Evaluation
Internal
WSSSE
GAP
Silhouette
Dunn index
Davies-Bouldin index
External
Rand index
Jaccard index
FM index
F-measure
V Measure
Entropy and Purity
MI Index
Comparison
References
Density Based
Density: number of points within a specified radius (Eps).
Clusters: areas of high density separated by areas of low density.
Points: core (more than MinPts within Eps), border (fewer than MinPts within Eps, but in the Eps-neighborhood of a core point) and outlier.
Reachability:
Directly density-reachable (q to p): if q is a core point and p is in the neighborhood of q (asymmetric).
Density-reachable [indirectly] (q to p): if q to p1, p1 to p2, ..., pn to p are each directly density-reachable.
Density-connected: p and q are density-connected if there is a core point o such that both p and q are density-reachable from o.
DBScan: Density-based spatial clustering of applications
with noise
Flat clustering of 1 level (like kmeans)
Cluster: maximal set of density connected points
Used when: Irregular or intertwined, and when noise and outliers
are present
Steps:
1. For each point x_i (not yet classified), compute the distance to all other points.
Find all neighbors within distance Eps of x_i.
A point with neighbor count ≥ MinPts is marked as a core point, otherwise as a border point.
2. For each unassigned core point, create a new cluster.
Recursively find all of its density-connected points and assign them to the same cluster as the core point.
3. If p is a border point, no points are density-reachable from p; move on and iterate through the remaining unvisited points in the dataset.
DBScan: Determining Eps (ε) and MinPts (k)
MinPts: ≥ dimensions + 1
k too small: noise will be classified as clusters
k too large: small clusters will be classified as noise
Eps: For points in a cluster, their kth-nearest neighbors are at
roughly the same distance, and for noise points at farther distance
Find distance of every point to its kth (minPts) nearest neighbor
Plot distances in an ascending order
Find the knee for optimal eps
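A minimal sketch of that k-distance plot with scikit-learn (dataset and k below are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
k = 4                                              # MinPts (>= dimensions + 1)

# +1 because the nearest neighbour of a training point is the point itself
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
kth = np.sort(dist[:, -1])                         # each point's distance to its k-th neighbour

plt.plot(kth)                                      # ascending k-distance plot
plt.xlabel('points sorted by distance')
plt.ylabel(f'distance to {k}-th nearest neighbour')
plt.show()                                         # pick Eps at the knee of this curve
```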
DBScan: Observations
Pros:
1. Handles data of any shape/size
2. Resistant to noise
3. One scan
4. No of clusters are automatically determined.
Cons:
1. Non-deterministic: a non-core sample within Eps of two core samples in different clusters is assigned to whichever cluster is generated first.
2. Fails on clusters of varying density:
Eps too small: sparser clusters will be labeled as noise.
Eps too large: denser clusters may be merged together.
3. Sensitive to parameters (hard to determine the correct set of
parameters)
Optics: Ordering points to identify the clustering structure
Identifies clusters with different densities.
Index-based: k dimensions, N points
Core distance of an object p: the smallest value ε' such that the ε'-neighborhood of p has at least MinPts objects.
Reachability distance of object p from core object q: the minimum radius that makes p density-reachable from q:
rd_{ε,MinPts}(p, q) = UNDEFINED if q is not a core object; max(core-dist_{ε,MinPts}(q), dist(q, p)) otherwise.
Optics Algo
Steps:
1. Choose p at random or from the queue, output p and its
core/reachability distances
2. Find p's core-distance, the smallest distance ε' ≤ ε such that p is a core object.
3. Compute the reachability-distance for each other point in p's ε-neighborhood.
4. Visit the points in order of their reachability-distance,
repeating steps 1,2,3 as necessary
Points linearly ordered such that points which are spatially closest
become neighbors in the ordering(Analogous to single-link
dendrogram).
Only fix minpts, and plot the radius at which an object would be
considered dense by DBSCAN.
Cluster nearly indistinguishable from DBSCAN.
Reachability Plot: priority heap, nearby objects are nearby in plot
Ref Algo: http://www.osl.iu.edu/~chemuell/projects/presentations/optics-v1.pdf
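A minimal scikit-learn sketch (parameter values and the varying-density dataset are illustrative): only min_samples (MinPts) is fixed, and the reachability plot is read off the cluster ordering.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3,
                  cluster_std=[0.4, 1.0, 2.0], random_state=0)

opt = OPTICS(min_samples=10).fit(X)        # max_eps defaults to infinity

# Reachability plot: reachability distances in the cluster ordering.
reachability = opt.reachability_[opt.ordering_]
print(reachability[:10])
print(np.unique(opt.labels_))              # -1 marks noise, as in DBSCAN
```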
Optics Observations
Pros:
No strict partitioning
No need to specify Eps.
Hierarchical partitioning (by detecting steep areas in the reachability plot).
Cons:
Slower (about 1.6×) than DBSCAN (because nearest-neighbor queries are more complex than radius queries).
FCM: Fuzzy C-Means
Each element has a set of membership coefficients corresponding
to degree of belonging to a cluster.
Objective: J_m = Σ_{j=1}^{k} Σ_{i=1}^{n} u_ij^m ||x_i − c_j||²
u_ij is the degree to which observation x_i belongs to cluster c_j, linked inversely to the distance from x_i to the cluster center.
c_j = Σ_{i=1}^{n} u_ij^m x_i / Σ_{i=1}^{n} u_ij^m
u_ij = 1 / Σ_{l=1}^{k} (||x_i − c_j|| / ||x_i − c_l||)^{2/(m−1)}
m is the fuzzifier / level of fuzziness: 1 < m < ∞
m → 1: memberships become crisp (0 or 1, as in k-means)
m → ∞: u_ij → 1/k
FCM: Steps
Specify the number of clusters k (by the analyst).
Randomly assign to each point coefficients for being in each cluster: the membership matrix U = [u_ij], starting at U(0).
At step t: calculate the centroid vectors C(t) = [c_j] from U(t).
Update the memberships U(t) → U(t+1).
Repeat until max-iteration, or until the algorithm has converged: ||U(t+1) − U(t)|| < ε (sensitivity threshold).
FCM: Observations
The minimum found is a local minimum, and the results depend on the initial choice of weights.
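A minimal numpy sketch of the two update formulas above (function name and defaults are illustrative):

```python
import numpy as np

def fcm_sketch(X, k, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means sketch following the centroid and membership updates above."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)                  # random memberships U(0)
    for _ in range(n_iter):
        Um = U ** m
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]       # c_j = sum_i u_ij^m x_i / sum_i u_ij^m
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        ratio = d[:, :, None] / d[:, None, :]          # ratio[i, j, l] = |x_i - c_j| / |x_i - c_l|
        U_new = 1.0 / (ratio ** (2.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:              # ||U(t+1) - U(t)|| < epsilon
            return C, U_new
        U = U_new
    return C, U

X = np.random.default_rng(1).normal(size=(200, 2))
C, U = fcm_sketch(X, k=3)
labels = U.argmax(axis=1)    # hard labels, if needed
```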
GMM: Gaussian Mixture Models
Composite distribution whereby points are drawn from one of k
Gaussian sub-distributions, each with its own probability.
Mixture models are a semi-parametric alternative to non-parametric histograms (for density estimation), providing greater flexibility and precision; Bayes' theorem is used to perform classification.
1D Gaussian: N(x | µ, σ²)
nD Gaussian: N(x | µ, Σ), with x and µ vectors and Σ a covariance matrix
1D mixture of Gaussians: Σ_k π_k N(x | µ_k, σ_k²)
Each component has:
µ_k: mean vector
Σ_k: covariance matrix (volume, shape, orientation)
π_k: mixture weight/probability/prior (relative proportion of data from group k in the full data)
Prior: p(z_i = k) = π_k
Likelihood: p(x_i | z_i = k, µ_k, Σ_k) = N(x_i | µ_k, Σ_k). Each cluster k is centered at its mean µ_k, with increased density for points near the mean.
Model naming (e.g. EVI): Equal (volume), Variable (shape), Identity (orientation)
EM: Expectation Maximization
E stage: given the parameters, finding the right mixture (where
does each sample come from?)
M stage: given the mixtures, find the right parameters for each
class
Uses EM algorithm to induce the maximum-likelihood model given
a set of samples. EM requires an a priori selection of the model order, namely the number M of mixture components (analogous to k in k-means) to be incorporated into the model.
initialModel: starting point/random
convergenceTol: Maximum change in log-likelihood at which we
consider convergence achieved.
GMM: Observation
BIC = −2·ln(likelihood) + k·ln(N)
A lower BIC indicates stronger evidence for the model (used to choose the number of clusters).
Pros: automatically suggests the number of clusters.
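A minimal scikit-learn sketch of GMM fitting by EM plus BIC-based selection of the component count (component range and tolerance below are illustrative):

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Fit GMMs for several component counts and keep the lowest BIC.
bics = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          tol=1e-3, random_state=0).fit(X)
    bics[k] = gmm.bic(X)

best_k = min(bics, key=bics.get)
print(bics, best_k)

gmm = GaussianMixture(n_components=best_k, random_state=0).fit(X)
print(gmm.weights_)             # mixture weights pi_k
print(gmm.predict_proba(X)[:2]) # posterior responsibilities (E-step output)
```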
LDA: Latent Dirichlet allocation
Clusters by assigning query data points to the multinomial/categorical components that maximize the component posterior probability given the data.
Topics and documents both exist in a feature space, where feature
vectors are vectors of word counts.
Each document a mixture of topics (generated from a Dirichlet
distribution), each of which emits words with a certain probability.
Steps:
Fix K as no of topics to discover
Gibbs Sampling:
Randomly assign each word in each document to one of the K
topics (gives topic representations of all the documents and
word distributions of all the topics <poor initially>)
LDA Algo
To improve on them:
for each document d, each word w in d, and for each topic t:
compute two things:
1. p(topic t | document d) = the proportion of words in
document d that are currently assigned to topic t.
2. p(word w | topic t) = the proportion of assignments to
topic t over all documents that come from this word w.
Reassign w a new topic, where we choose topic t with
probability: p(topic t | document d) * p(word w | topic t)
(probability that topic t generated word w)
ie: Assume that all topic assignments except for the current
word in question are correct(priors), and update the
assignment of the current word using our model of how
documents are generated.
to estimate the topic mixtures of each document and the
words associated to each topic
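A minimal scikit-learn sketch of topic modelling on word counts (the tiny corpus is illustrative; note that scikit-learn's LDA uses online variational Bayes rather than the collapsed Gibbs sampling described above, and get_feature_names_out assumes a recent scikit-learn version):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats and dogs are pets",
        "dogs chase cats",
        "stocks and bonds are investments",
        "bond markets fell today"]

counts = CountVectorizer().fit(docs)           # topics live in word-count space
X = counts.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

print(lda.transform(X))                        # per-document topic mixtures
vocab = counts.get_feature_names_out()
for t, weights in enumerate(lda.components_):  # per-topic word weights
    top = weights.argsort()[-3:][::-1]
    print(t, [vocab[i] for i in top])
```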
Evaluation
Internal
External
Internal Evaluation
Compactness, connectedness, and separation
Desc:
Based on the data that was clustered itself.
Assign the best score to the algorithm that produces clusters
with high similarity within a cluster and low similarity between
clusters.
Validity as measured by such an index/criterion depends on
the claim that some kind of structure exists(eg: k-means
convex).
Cons:
Biased towards algorithms that use the same cluster model, e.g. k-means is distance-based, and distance-based criteria will overrate it.
Shows that one algorithm performs better than another, but not that it produces more valid results.
High scores on an internal measure do not necessarily result in
effective IR applications
WSSSE (Within Set Sum of Squared Error)
D_k = Σ_{x_i ∈ C_k} Σ_{x_j ∈ C_k} ||x_i − x_j||² = 2 n_k Σ_{x_i ∈ C_k} ||x_i − µ_k||²
Sum of pairwise squared distances between the n_k points in cluster C_k.
W_K = Σ_{k=1}^{K} (1 / (2 n_k)) D_k (pooled, normalized intra-cluster sums of squares)
Within-cluster dispersion.
Elbow method: naive solution based on intra-cluster variance
Plot: percentage of variance explained(F-test=grp-var/tot-var) vs
number of clusters
Optimal k is usually one where there is an elbow in the WSSSE
graph.
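A minimal elbow-plot sketch: scikit-learn's inertia_ is exactly this within-cluster sum of squares (dataset and k range are illustrative).

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 10)
wssse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
         for k in ks]

plt.plot(list(ks), wssse, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('WSSSE (inertia)')
plt.show()    # choose k at the elbow of this curve
```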
GAP statistics
A way to standardize the elbow curve by comparing log W_k with its expectation E*_n{log W_k} under a null reference distribution.
Gap_n(k) = E*_n{log W_k} − log W_k
E*_n: average of log W*_k over x Monte Carlo reference samples.
Choose the smallest k such that Gap(k) ≥ Gap(k + 1) − s_{k+1},
i.e. the value of k for which log W_k falls the farthest below the reference curve E*_n.
s_k = sd(k) · √(1 + 1/x): simulation error.
Works best for globular, Gaussian-distributed, mildly disjoint distributions.
Ref:
http://www.web.stanford.edu/~hastie/Papers/gap.pdf
Silhouette
s(i) = (b(i) − a(i)) / max{a(i), b(i)}
s(i) = 1 − a(i)/b(i) if a(i) < b(i); 0 if a(i) = b(i); b(i)/a(i) − 1 if a(i) > b(i)
ai : average distance from the ith point to the other points in the
same cluster as i
bi : minimum average distance from the ith point to points in a
different cluster, minimized over clusters.
s(i) ranges from −1 to +1.
A high value indicates that i is well matched to its own cluster and poorly matched to neighboring clusters.
k is appropriate: if most points have a high/positive silhouette value.
k is too large or too small: if many points have a low/negative silhouette value.
Dunn index
D = min_{1≤i<j≤n} d(i, j) / max_{1≤k≤n} d'(k)
d(i, j): inter-cluster distance between clusters i and j
d'(k): intra-cluster distance (diameter) of cluster k
Higher score, better model (compact, well-separated clusters).
Davies-Bouldin index
DB = (1/n) Σ_{i=1}^{n} max_{j≠i} (σ_i + σ_j) / d(c_i, c_j)
n: number of clusters
c_x: centroid of cluster x
σ_x: average distance of all elements in cluster x to centroid c_x
d(c_i, c_j): distance between centroids c_i and c_j
Lower score, better model.
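Silhouette and Davies-Bouldin are available in scikit-learn; the Dunn index is not, so a naive sketch is included (helper name and the min/max formulation are as defined above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score
from scipy.spatial.distance import cdist

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))        # higher is better (-1 .. +1)
print(davies_bouldin_score(X, labels))    # lower is better

def dunn_index(X, labels):
    """Naive Dunn index: min inter-cluster distance / max cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    inter = min(cdist(a, b).min()
                for i, a in enumerate(clusters)
                for b in clusters[i + 1:])
    intra = max(cdist(c, c).max() for c in clusters)
    return inter / intra

print(dunn_index(X, labels))              # higher is better
```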
External Evaluation
Desc:
Based on data that was not used for clustering(external
benchmarks/known class labels: gold standard, created by
humans).
Measure how similar the resulting clusters are to the
benchmark classes/clusters
No assumption made on cluster structure.
semisupervised
Rand index (Adjusted)
RI = (TP + TN) / (TP + FP + FN + TN)
Measures the percentage of correct pairwise decisions made by the algorithm.
Cons:
False positives and false negatives are equally weighted.
Not adjusted for chance: random (uniform) label assignments are not guaranteed to get an RI score close to 0.0 (e.g. when the number of clusters is of the same order as the number of samples).
ARI = (RI − E[RI]) / (max(RI) − E[RI])
Can yield negative values if the index is less than the expected index.
Jaccard index
J(A, B) = |A ∩ B| / |A ∪ B| = TP / (TP + FP + FN)
0 ≤ J ≤ 1
J = 1: identical; J = 0: no common elements.
The number of unique elements common to both sets divided by the total number of unique elements in both sets.
Fowlkes-Mallows index
FM = √( (TP / (TP + FP)) · (TP / (TP + FN)) )
The FM index is the geometric mean of precision and recall.
Performs well on noisy data.
Higher score is better.
F-measure
Fβ = ((β² + 1) · P · R) / (β² · P + R)
P = TP / (TP + FP)
R = TP / (TP + FN)
β ≥ 0
β = 0 =⇒ F0 = P. i.e: recall has no impact on the F-measure.
Increasing β allocates an increasing amount of weight to recall in
the final F-measure
F-measure is the harmonic mean of precision and recall
V Measure
homogeneity/purity: each cluster contains only members of a
single class.
completeness: all members of a given class are assigned to the
same cluster.
v = 2·h·c / (h + c)
h = 1 − H(C|K) / H(C)
c = 1 − H(K|C) / H(K)
H(C|K) = −Σ_{c=1}^{|C|} Σ_{k=1}^{|K|} (n_{c,k} / n) · log(n_{c,k} / n_k)
Pros: the score is between 0 (bad) and 1 (perfect).
A bad score can be qualitatively analyzed through h and c.
Cons: random labeling won't yield zero scores, especially when the number of clusters is large.
Entropy & Purity
Entropy: amount of uncertainty for a partition set
H(X) = E[I(X)] = E[−ln P(X)] = −Σ_{x=1}^{|X|} P(x) log P(x) = −Σ_x (N_x / N) log(N_x / N)
P(x): probability that an object from X falls into class X_x
Mutual Information
Information theoretic measure of how much information is shared
in ’bits’ between a clustering and a ground-truth classification.
Can detect a non-linear similarity between two clusterings.
MI(U, V) = Σ_{u=1}^{|U|} Σ_{v=1}^{|V|} P(u, v) log( P(u, v) / (P(u) P(v)) )
P(u): probability that an object falls into cluster U_u; P(v): probability that it falls into class V_v
P(u, v) = |U_u ∩ V_v| / N: probability that an object falls into both U_u and V_v
MI = 0 if U and V are independent.
Normalized, Adjusted & Standardized Mutual Information
NMI = MI / √(H(U) · H(V))
AMI = (MI − E[MI]) / (√(H(U) · H(V)) − E[MI]), or (MI − E[MI]) / (max(H(U), H(V)) − E[MI]), depending on the chosen normalization
SMI = (MI − E[MI]) / √(Var(MI))
Ref: http://www.jmlr.org/proceedings/papers/v32/romano14.pdf
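Most of these external measures are available in scikit-learn; a minimal sketch with illustrative labels:

```python
from sklearn.metrics import (adjusted_rand_score, fowlkes_mallows_score,
                             homogeneity_score, completeness_score,
                             v_measure_score, normalized_mutual_info_score,
                             adjusted_mutual_info_score)

truth = [0, 0, 0, 1, 1, 1, 2, 2]      # ground-truth classes
pred  = [0, 0, 1, 1, 1, 1, 2, 2]      # cluster labels from some algorithm

print(adjusted_rand_score(truth, pred))           # ARI, chance-corrected
print(fowlkes_mallows_score(truth, pred))         # FM index
print(homogeneity_score(truth, pred),
      completeness_score(truth, pred),
      v_measure_score(truth, pred))               # h, c, and V-measure
print(normalized_mutual_info_score(truth, pred))  # NMI
print(adjusted_mutual_info_score(truth, pred))    # AMI, chance-corrected
```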
Comparison
Algorithm      Density  Size  Shape  Noise  Complexity
k-means        N        N     N      N      O(I · K · m · n)
k-medoids      N        N     N      N      O(I · K · m²)
Hierarchical   N        Y     N      Y      O(m³)
DBSCAN         N        Y     Y      Y      O(m²)
OPTICS         Y        Y     Y      Y      O(N · log N)
GMM            Y        Y     Y      N
References
1. https://rorasa.wordpress.com/2012/05/13/l0-norm-l1-norm-l2-norm-l-infinity-norm
2. http://www.statmethods.net/advstats/cluster.html
3. https://www.r-bloggers.com/k-means-clustering-from-r-in-action
4. http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf
5. http://scikit-learn.org/stable/modules/clustering.html
