Clustering Slides
◼ The choice of an appropriate value of 𝑘 in the Minkowski metric (𝐿𝑘 norm) depends on how much emphasis
you would like to give to the larger differences between dimensions
◼ Manhattan or city-block distance (𝐿1 norm)
◼ When used with binary vectors, the L1 norm is known as the Hamming distance
◼ Euclidean norm (𝐿2 norm)
◼ Non-linear distance
and that 𝑇 satisfies the unbiasedness and consistency conditions of the Parzen estimator: 𝑇^𝑃·𝑁 → ∞ and 𝑇 → 0 as 𝑁 → ∞
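As a minimal sketch of these metrics (assuming NumPy, which the slides do not prescribe), the Minkowski family and the non-linear distance can be written as follows; the threshold 𝑇 and penalty 𝐻 values are illustrative placeholders only.

```python
import numpy as np

def minkowski(x, y, k):
    """L_k norm: larger k gives more weight to the larger per-dimension differences."""
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

def manhattan(x, y):
    """L1 (city-block) norm; equals the Hamming distance for binary vectors."""
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    """L2 norm."""
    return np.sqrt(np.sum((x - y) ** 2))

def nonlinear(x, y, T=1.0, H=10.0):
    """Non-linear distance: 0 if the points fall within threshold T, a constant H otherwise.
    T=1.0 and H=10.0 are placeholder values chosen only for illustration."""
    return 0.0 if euclidean(x, y) < T else H

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.5])
print(manhattan(x, y), euclidean(x, y), minkowski(x, y, k=4))
```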
Distance Metrics
The above distance metrics are measures of dissimilarity; some measures of similarity also exist
◼ Inner product
The inner product is appropriate when the vectors 𝑥 and 𝑦 are normalized so that they have the same length
◼ Correlation coefficient
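A small sketch of these two similarity measures (again assuming NumPy; not part of the original slides):

```python
import numpy as np

def inner_product(x, y):
    """Similarity as the inner product; meaningful when x and y are normalized to the same length."""
    return np.dot(x, y)

def correlation(x, y):
    """Pearson correlation coefficient between two feature vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)   # normalize before using the inner product
print(inner_product(x, y), correlation(x, y))
```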
Criterion Function
The sum-of-squared-error criterion measures how well the data set 𝑋 = {𝑥1 … 𝑥𝑁} is represented by the
cluster centers 𝜇 = {𝜇1 … 𝜇𝐶} (𝐶 < 𝑁)
Clustering methods that minimize this criterion are called minimum-variance methods
Other criterion functions exist, based on the scatter matrices used in Linear
Discriminant Analysis
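A minimal sketch of the sum-of-squared-error criterion (NumPy, assuming hard assignments of each example to one cluster center):

```python
import numpy as np

def sum_squared_error(X, mu, labels):
    """J = sum over clusters i of sum over x in cluster i of ||x - mu_i||^2.
    X: (N, d) examples, mu: (C, d) cluster centers, labels: (N,) cluster index per example."""
    return sum(np.sum((X[labels == i] - mu[i]) ** 2) for i in range(len(mu)))

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
mu = np.array([[0.0, 0.5], [5.5, 5.0]])
labels = np.array([0, 0, 1, 1])
print(sum_squared_error(X, mu, labels))   # 4 * 0.25 = 1.0
```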
Cluster Validity
◼ The validity of the final cluster solution is highly subjective
◼ This is in contrast with supervised training, where a clear objective function is known: Bayes risk
◼ Note that the choice of (dis)similarity measure and criterion function will have a major impact on the
final clustering produced by the algorithms
◼ Example
◼ Which are the meaningful clusters in these cases?
◼ How many clusters should be considered?
◼ Divisive
◼ Also known as top-down or splitting
◼ Starting with a single cluster, successively split clusters until 𝑁 singleton examples are left
Dendrograms
◼ A binary tree that shows the structure of the clusters
◼ Dendrograms are the preferred representation for hierarchical clusters
◼ In addition to the binary tree, the dendrogram provides the similarity measure between clusters (the
vertical axis)
◼ An alternative representation is based on sets
◼ {{𝑥1,{𝑥2,𝑥3}},{{{𝑥4,𝑥5},{𝑥6,𝑥7}},𝑥8}}
◼ However, unlike the dendrogram, sets cannot express quantitative information
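As an illustration (the library choice is an assumption, not part of the slides), SciPy can build and plot a dendrogram of this kind from synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                 # eight synthetic examples x1 ... x8

Z = linkage(X, method="average")            # agglomerative merge tree
dendrogram(Z, labels=[f"x{i+1}" for i in range(len(X))])
plt.ylabel("dissimilarity")                 # the vertical axis encodes the merge distance
plt.show()
```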
Divisive Clustering
◼ Define
◼ 𝑁𝐶 - Number of clusters
◼ 𝑁𝐸𝑋 - Number of examples
1. Start with one large cluster
2. Find “worst” cluster
3. Split it
4. If 𝑁𝐶 < 𝑁𝐸𝑋 go to 2
◼ How to choose the “worst” cluster
◼ Largest number of examples
◼ Largest variance
◼ Largest sum-squared-error…
◼ How to split clusters
◼ Mean-median in one feature direction
◼ Perpendicular to the direction of largest variance…
◼ The computations required by divisive clustering are more intensive than for
agglomerative clustering methods
◼ For this reason, agglomerative approaches are more popular
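A compact sketch of the divisive procedure above, choosing the “worst” cluster by largest sum-squared error and splitting at the median along the direction of largest variance (two of the heuristics listed); this is an illustrative NumPy implementation, not the slides' reference code:

```python
import numpy as np

def divisive(X, n_clusters):
    """Top-down clustering: repeatedly split the cluster with the largest
    sum-squared error at the median of its highest-variance feature."""
    clusters = [X]                                      # 1. start with one large cluster
    while len(clusters) < min(n_clusters, len(X)):      # stop at a target instead of N_EX singletons
        sse = [np.sum((c - c.mean(axis=0)) ** 2) if len(c) > 1 else -1.0
               for c in clusters]
        worst = clusters.pop(int(np.argmax(sse)))       # 2. find the "worst" cluster
        dim = int(np.argmax(worst.var(axis=0)))         # split direction: largest variance
        median = np.median(worst[:, dim])
        left, right = worst[worst[:, dim] <= median], worst[worst[:, dim] > median]
        if len(right) == 0:                             # degenerate split (duplicate points)
            left, right = worst[:-1], worst[-1:]
        clusters.extend([left, right])                  # 3. split it; 4. repeat while N_C < target
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
print([len(c) for c in divisive(X, n_clusters=2)])
```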
Agglomerative Clustering
◼ Define
◼ 𝑁𝐶 - Number of clusters
◼ 𝑁𝐸𝑋 - Number of examples
1. Start with 𝑁𝐸𝑋 singleton clusters
2. Find nearest clusters
3. Merge them
4. If 𝑁𝐶>1 go to 2
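A minimal sketch of the agglomerative procedure with a single-link (nearest-neighbour) cluster distance; the linkage choice is an assumption made for illustration:

```python
import numpy as np

def agglomerative(X, n_clusters=1):
    """Bottom-up clustering: start from singletons and repeatedly merge the nearest pair."""
    clusters = [[i] for i in range(len(X))]            # 1. start with N_EX singleton clusters
    while len(clusters) > n_clusters:                  # 4. repeat while N_C > 1 (or a chosen target)
        best, pair = np.inf, (0, 1)
        for a in range(len(clusters)):                 # 2. find the nearest pair of clusters
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])    # single-link: closest pair of members
                        for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)                 # 3. merge them (b > a, so index a stays valid)
    return clusters                                    # lists of example indices per cluster

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(agglomerative(X, n_clusters=2))                  # [[0, 1], [2, 3]]
```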