Clustering for new discovery in data
Mean shift clustering
Hierarchical clustering
- Kunal Parmar
Houston Machine
Learning Meetup
1/21/2017
Clustering : A world
without labels
• Finding hidden structure in data when we don’t
have labels/classes for the data
• We group data together based on some notion of similarity in the feature space
Clustering approaches
covered in previous lecture
• k-means clustering
o Iterative partitioning into k clusters based on proximity of an observation to
the cluster mean
Clustering approaches
covered in previous lecture
• DBSCAN
o Partition the feature space based on density
In this segment:
• Mean shift clustering
• Hierarchical clustering
Mean shift clustering
• Mean shift clustering is a non-parametric iterative
mode-based clustering technique based on kernel
density estimation.
• It is very commonly used in the field of computer vision because of its high efficiency in image segmentation.
Mean shift clustering
• It assumes that our data is sampled from an
underlying probability distribution
• The algorithm finds the modes (peaks) of the probability distribution. The underlying kernel density around each mode corresponds to a cluster
Kernel density estimation
[Figure: a set of points and the corresponding KDE surface]
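To make the figure concrete, here is a minimal sketch of kernel density estimation on a toy 2-D point set, assuming SciPy is available; the two-blob data and the use of gaussian_kde are illustrative choices, not part of the original slides.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# A toy "set of points": two 2-D blobs
points = np.vstack([rng.normal(0.0, 0.7, size=(50, 2)),
                    rng.normal(4.0, 0.7, size=(50, 2))])

# gaussian_kde expects data with shape (n_dims, n_samples)
kde = gaussian_kde(points.T)

# Evaluate the estimated density (the "KDE surface") on a grid
xs, ys = np.meshgrid(np.linspace(-3, 7, 100), np.linspace(-3, 7, 100))
density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)
print(density.shape)  # the surface has two peaks, one near each blob centre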
Algorithm: Mean shift
1. Define a window (bandwidth of the kernel to be
used for estimation) and place the window on a
data point
2. Calculate mean of all the points within the window
3. Move the window to the location of the mean
4. Repeat steps 2-3 until convergence
• On convergence, all data points within that window form a cluster (a minimal sketch follows below).
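A minimal NumPy sketch of the loop above, assuming a flat (uniform) kernel of radius bandwidth; the toy data, tolerance and iteration cap are illustrative, and this is not an optimized implementation.

import numpy as np

def shift_to_mode(x, data, bandwidth, tol=1e-4, max_iter=300):
    """Shift a single window, started at x, until it settles on a mode."""
    for _ in range(max_iter):
        # Step 2: mean of all points inside the window centred at x
        in_window = data[np.linalg.norm(data - x, axis=1) < bandwidth]
        new_x = in_window.mean(axis=0)
        if np.linalg.norm(new_x - x) < tol:   # Step 4: converged
            break
        x = new_x                             # Step 3: move the window to the mean
    return x

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                  rng.normal(5.0, 0.5, size=(50, 2))])

# Points whose windows converge to (nearly) the same mode form one cluster
modes = np.array([shift_to_mode(p, data, bandwidth=2.0) for p in data])
print(np.unique(np.round(modes, 1), axis=0))  # roughly two distinct modes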
Example: Mean shift
[Sequence of figure slides showing the window shifting toward a mode]
Types of kernels
• Generally, a Gaussian kernel is used for the density estimation in mean shift clustering.
• However, other kernels can also be used, such as:
o Rectangular kernel
o Flat kernel, etc.
• The choice of kernel affects the clustering result
Types of kernels
• The choice of the bandwidth of the kernel (window) will also impact the clustering result (illustrated in the sketch below):
o Small kernels will result in lots of clusters, some even being individual data
points
o Big kernels will result in one or two huge clusters
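A hedged illustration of this bandwidth effect using scikit-learn's MeanShift (assuming scikit-learn is installed); the blob data and the three bandwidth values are arbitrary choices for demonstration.

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

for bw in (0.3, estimate_bandwidth(X, quantile=0.2), 10.0):
    labels = MeanShift(bandwidth=bw).fit_predict(X)
    print(f"bandwidth={bw:.2f} -> {len(np.unique(labels))} clusters")
# Small bandwidths fragment the data; very large ones merge everything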
Pros and cons : Mean Shift
• Pros
o Model-free, doesn’t assume predefined shape of clusters
o Only relies on one parameter: kernel bandwidth h
o Robust to outliers
• Cons
o The selection of window size is not trivial
o Computationally expensive: O(n²)
o Sensitive to selection of kernel bandwidth; small h will slow down convergence,
large h speeds it up but might merge two modes
Applications : Mean Shift
• Clustering and segmentation
Applications : Mean Shift
• Clustering and Segmentation
Hierarchical Clustering
• Hierarchical clustering creates a hierarchy of clusters, ordered from top to bottom.
• There are two types of hierarchical clustering:
o Divisive
• Top-down approach
o Agglomerative
• Bottom-up approach
Algorithm:
Hierarchical agglomerative clustering
1. Place each data point in its own singleton group
2. Iteratively merge the two closest groups
3. Repeat step 2 until all the data points are merged
into a single cluster
• We obtain a dendrogram (tree-like structure) at the final step. We cut the dendrogram at a certain level to obtain the final set of clusters (a minimal sketch follows below).
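A minimal sketch of agglomerative clustering with SciPy, assuming a small made-up 2-D dataset; linkage performs the iterative merging, while dendrogram and fcluster handle the tree and the cut.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Six toy 2-D points forming two obvious groups
X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 5], [5, 6], [6, 5]], dtype=float)

Z = linkage(X, method='complete', metric='euclidean')  # merge history (the tree)
labels = fcluster(Z, t=2, criterion='maxclust')        # cut into 2 clusters
print(labels)                                          # e.g. [1 1 1 2 2 2]

# dendrogram(Z) would draw the tree if matplotlib is available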
Cluster similarity or
dissimilarity
• Distance metric
o Euclidean distance
o Manhattan distance
o Jaccard index, etc.
• Linkage criteria
o Single linkage
o Complete linkage
o Average linkage
Linkage criteria
• The linkage criterion quantifies the distance between sets of observations (the intermediate clusters formed during the agglomeration process)
Single linkage
• The distance between two clusters is the shortest distance between a point in one cluster and a point in the other
Complete linkage
• The distance between two clusters is the longest distance between a point in one cluster and a point in the other
Average linkage
• The distance between two clusters is the average distance from each point in one cluster to every point in the other cluster (the three criteria are compared in the sketch below)
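A small sketch, using two made-up clusters, of how the three linkage criteria quantify the distance between the same pair of clusters differently.

import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

pairwise = cdist(cluster_a, cluster_b)   # all point-to-point distances
print("single  :", pairwise.min())       # shortest pair -> 3.0
print("complete:", pairwise.max())       # longest pair  -> 6.0
print("average :", pairwise.mean())      # mean over all pairs -> 4.5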
Example: Hierarchical
clustering
• We consider a small dataset with seven samples:
o (A, B, C, D, E, F, G)
• Metrics used in this example
o Distance metric: Jaccard index
o Linkage criteria: Complete linkage
Example: Hierarchical
clustering
• We construct a dissimilarity matrix based on the Jaccard index (a sketch of building such a matrix follows below)
• B and F are merged in the first step as they have the lowest dissimilarity
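The slide's actual dissimilarity matrix is not reproduced here; as a stand-in, this sketch shows how such a matrix could be built, assuming each sample A-G is a binary feature vector (the vectors below are made up purely for illustration).

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical binary feature vectors for the seven samples A..G
samples = np.array([[1, 0, 1, 1, 0],   # A
                    [1, 1, 0, 1, 0],   # B
                    [0, 1, 1, 0, 1],   # C
                    [0, 0, 0, 1, 1],   # D
                    [1, 0, 1, 0, 0],   # E
                    [1, 1, 0, 1, 1],   # F
                    [0, 1, 1, 0, 0]],  # G
                   dtype=bool)

# pdist with metric='jaccard' gives Jaccard *dissimilarities* (1 - Jaccard index)
D = squareform(pdist(samples, metric='jaccard'))
print(np.round(D, 4))  # the pair with the smallest off-diagonal entry merges first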
Example: Hierarchical
clustering
• How do we calculate the distance of (B,F) to the other clusters?
o This is where the choice of linkage criteria comes in
o Since we are using complete linkage, we use the maximum distance between members of the two clusters
o So,
o So,
• Dissimilarity(B, A) : 0.5000
• Dissimilarity(F, A) : 0.6250
• Hence, Dissimilarity((B,F), A) : 0.6250
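In code form, the complete-linkage update above is just a maximum over the merged members' dissimilarities (values copied from the slide).

# Complete-linkage update for the merged cluster (B,F) against A
d_B_A = 0.5000             # Dissimilarity(B, A)
d_F_A = 0.6250             # Dissimilarity(F, A)
d_BF_A = max(d_B_A, d_F_A)
print(d_BF_A)              # 0.625, i.e. Dissimilarity((B,F), A)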
Example: Hierarchical
clustering
• We iteratively merge clusters at each step until all the data points are covered:
i. merge the two clusters with the lowest dissimilarity
ii. update the dissimilarity matrix based on the merged clusters
Dendrogram
• At the end of the agglomeration process, we obtain a dendrogram that looks like this:
[Figure: dendrogram of the seven samples]
Cutting the tree
• We cut the dendrogram at a level where there is a jump in the merge dissimilarities
Cutting the tree
• If we cut the tree at 0.5, then we can say that within each cluster the samples have more than 50% similarity (a SciPy sketch of this cut follows below)
• So our final set of clusters is,
i. (B,F),
ii. (A,E,C,G) and
iii. (D)
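A minimal SciPy sketch of cutting a tree at height 0.5, assuming a precomputed dissimilarity matrix; the 4x4 matrix below is hypothetical and only demonstrates the mechanics, not the slide's actual seven-sample matrix.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric dissimilarity matrix for four samples
D = np.array([[0.0, 0.5, 0.3, 0.8],
              [0.5, 0.0, 0.6, 0.9],
              [0.3, 0.6, 0.0, 0.7],
              [0.8, 0.9, 0.7, 0.0]])

Z = linkage(squareform(D), method='complete')       # build the dendrogram
labels = fcluster(Z, t=0.5, criterion='distance')   # cut the tree at 0.5
print(labels)  # samples merged at a height <= 0.5 share a cluster label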
Final set of clusters
Impact of metrics
• The metrics chosen for hierarchical clustering can
lead to vastly different clusters.
• Distance metric
o In a 2-dimensional space, the distance between the point (1,1) and the origin (0,0) is 2 under Manhattan distance but √2 (≈ 1.41) under Euclidean distance (a quick check follows below).
• Linkage criteria
o Distance between two clusters can be different based on linkage criteria
used
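A quick check of the distance-metric example above for the point (1,1) and the origin.

import numpy as np

p = np.array([1.0, 1.0])
print(np.abs(p).sum())    # Manhattan distance from the origin: 2.0
print(np.linalg.norm(p))  # Euclidean distance from the origin: ~1.414 (sqrt 2)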
Linkage criteria
• Complete linkage is the most popular linkage criterion used for hierarchical clustering. It is less sensitive to outliers.
• Single linkage can handle non-elliptical shapes, but it can lead to clusters that are quite heterogeneous internally and it is more sensitive to outliers and noise
Pros and Cons :
Hierarchical Clustering
• Pros
o No assumption of a particular number of clusters
o May correspond to meaningful taxonomies
• Cons
o Once a decision is made to combine two clusters, it can’t be undone
o Too slow for large data sets: O(n² log n)
Thank you!
