
Machine Learning and Network Analysis
MA4207
Unsupervised Learning
◼ Parametric unsupervised learning
–Equivalent to density estimation with a mixture of (Gaussian) components
–Through the use of algorithms like EM, the identity of the component that generated each data point is treated as a missing feature
◼ Non-parametric unsupervised learning
–No density functions are considered in these methods
–Instead, we are concerned with finding natural groupings (clusters) in a dataset
◼ Non-parametric clustering involves three steps
–Defining a measure of (dis)similarity between examples
–Defining a criterion function for clustering
–Defining an algorithm to minimize (or maximize) the criterion function
Statistical Clustering
◼ Similarity measures
◼ Criterion functions
◼ Cluster validity
◼ Flat clustering algorithms
–k-means
–ISODATA
◼ Hierarchical clustering algorithms
–Divisive
–Agglomerative
Similarity Measure
Definition of metric
A measuring rule 𝑑(𝑥,𝑦) for the distance between two vectors 𝑥 and 𝑦 is considered a
metric if it satisfies the following properties
𝑑(𝑥,𝑦)≥0
𝑑(𝑥,𝑦)=0 iff 𝑥=𝑦
𝑑(𝑥,𝑦)=𝑑(𝑦,𝑥)
𝑑(𝑥,𝑦)≤𝑑(𝑥,𝑧)+ 𝑑(𝑧,𝑦)
If the metric has the property 𝑑(𝑎𝑥,𝑎𝑦)=|𝑎|𝑑(𝑥,𝑦) then it is called a norm and
denoted 𝑑(𝑥,𝑦)=||𝑥−𝑦||
The most general form of distance metric is the power norm
𝑑(𝑥,𝑦) = (Σ𝑘 |𝑥𝑘 − 𝑦𝑘|^𝑝)^(1/𝑟)
◼ 𝑝 controls the weight placed on each dimension's dissimilarity, whereas 𝑟 controls the distance growth of patterns that are further apart
–Notice that the definition of norm must be relaxed, allowing a power factor for |𝑎|
Distance Metrics
◼ Minkowski metric (𝐿𝑘 norm): 𝑑(𝑥,𝑦) = (Σ𝑖 |𝑥𝑖 − 𝑦𝑖|^𝑘)^(1/𝑘)
◼ The choice of an appropriate value of 𝑘 depends on the amount of emphasis that you would like to
give to the larger differences between dimensions
◼ Manhattan or city-block distance (𝐿1 norm): 𝑑(𝑥,𝑦) = Σ𝑖 |𝑥𝑖 − 𝑦𝑖|
◼ When used with binary vectors, the L1 norm is known as the Hamming distance
◼ Euclidean norm (𝐿2 norm): 𝑑(𝑥,𝑦) = (Σ𝑖 (𝑥𝑖 − 𝑦𝑖)²)^(1/2)
◼ Chebyshev distance (𝐿∞ norm): 𝑑(𝑥,𝑦) = max𝑖 |𝑥𝑖 − 𝑦𝑖|

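As a concrete illustration of the Minkowski family above, here is a minimal NumPy sketch (the example vectors are arbitrary):

```python
import numpy as np

def minkowski(x, y, k):
    # L_k norm: larger k puts more emphasis on the largest per-dimension differences
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, 1))      # Manhattan / city-block distance (L1) = 5.0
print(minkowski(x, y, 2))      # Euclidean distance (L2) ≈ 3.61
print(np.max(np.abs(x - y)))   # Chebyshev distance (L_inf), the limit as k grows = 3.0
```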
Distance Metrics
◼ Quadratic distance: 𝑑(𝑥,𝑦) = (𝑥 − 𝑦)ᵀ𝐵(𝑥 − 𝑦), where 𝐵 is a symmetric positive-definite matrix
The Mahalanobis distance is a particular case of this distance, obtained when 𝐵 is the inverse of the covariance matrix


◼ Canberra metric (for non-negative features): 𝑑(𝑥,𝑦) = Σ𝑖 |𝑥𝑖 − 𝑦𝑖| / (𝑥𝑖 + 𝑦𝑖)
◼ Non-linear distance: 𝑑𝑁(𝑥,𝑦) = 0 if 𝑑𝐸(𝑥,𝑦) < 𝑇, and 𝐻 otherwise
where 𝑇 is a threshold and 𝐻 is a distance


An appropriate choice for 𝐻 and 𝑇 for feature selection is that they should satisfy

and that 𝑇 satisfies the unbiasedness and consistency conditions of the Parzen estimator: 𝑇^𝑃·𝑁 → ∞ and 𝑇 → 0 as 𝑁 → ∞
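A short sketch of the quadratic and Canberra distances; the matrix 𝐵 here is assumed to be the inverse covariance of some toy data, which makes the quadratic form a (squared) Mahalanobis distance:

```python
import numpy as np

def quadratic(x, y, B):
    # d(x,y) = (x - y)^T B (x - y); Mahalanobis when B is the inverse covariance matrix
    d = x - y
    return float(d @ B @ d)

def canberra(x, y):
    # sum of |x_i - y_i| / (x_i + y_i), defined for non-negative features
    return float(np.sum(np.abs(x - y) / (x + y)))

X = np.abs(np.random.default_rng(0).normal(size=(100, 3)))   # toy non-negative data
B = np.linalg.inv(np.cov(X, rowvar=False))
print(quadratic(X[0], X[1], B))   # squared Mahalanobis distance between two samples
print(canberra(X[0], X[1]))
```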
Distance Metrics
The distance metrics above are measures of dissimilarity; some measures of similarity
also exist
◼ Inner product: 𝑠(𝑥,𝑦) = 𝑥ᵀ𝑦
The inner product is used when the vectors 𝑥 and 𝑦 are normalized, so that they have the same length

◼ Correlation coefficient: 𝑟(𝑥,𝑦) = Σ𝑖 (𝑥𝑖 − 𝑥̄)(𝑦𝑖 − 𝑦̄) / (Σ𝑖 (𝑥𝑖 − 𝑥̄)² · Σ𝑖 (𝑦𝑖 − 𝑦̄)²)^(1/2)
◼ Tanimoto measure (for binary-valued vectors): 𝑠(𝑥,𝑦) = 𝑥ᵀ𝑦 / (𝑥ᵀ𝑥 + 𝑦ᵀ𝑦 − 𝑥ᵀ𝑦)
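A minimal NumPy sketch of these similarity measures (the binary vectors in the usage example are arbitrary):

```python
import numpy as np

def inner_product(x, y):
    # meaningful as a similarity when x and y are normalized to the same length
    return float(x @ y)

def correlation(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def tanimoto(x, y):
    # for binary vectors: shared components over the total number of distinct components
    xy = float(x @ y)
    return xy / (float(x @ x) + float(y @ y) - xy)

a = np.array([1, 0, 1, 1], dtype=float)
b = np.array([1, 1, 1, 0], dtype=float)
print(tanimoto(a, b))   # 2 / (3 + 3 - 2) = 0.5
```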

Criterion Function
After computing a (dis)similarity measure, a criterion function needs to be optimized
The most widely used clustering criterion is the sum-of-square-error
𝐽𝑀𝑆𝐸 = Σ𝑖=1..𝐶 Σ𝑥∈𝜔𝑖 ||𝑥 − 𝜇𝑖||², where 𝜇𝑖 is the sample mean of cluster 𝜔𝑖
This criterion measures how well the data set 𝑋={𝑥1…𝑥𝑁} is represented by the
cluster centers 𝜇={𝜇1…𝜇𝐶} (𝐶<𝑁)
Clustering methods that use this criterion are called minimum-variance methods
Other criterion functions exist, based on the scatter matrices used in Linear
Discriminant Analysis
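A small NumPy sketch of the sum-of-square-error criterion, assuming the data matrix, the cluster labels, and the cluster centers are given:

```python
import numpy as np

def sum_of_square_error(X, labels, centers):
    # J_MSE: sum over clusters of the squared distances to each cluster's center
    return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centers))

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([X[labels == i].mean(axis=0) for i in range(2)])
print(sum_of_square_error(X, labels, centers))   # 0.5 + 0.5 = 1.0
```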
Cluster Validity
◼ The validity of the final cluster solution is highly subjective
◼ This is in contrast with supervised training, where a clear objective function is known: Bayes risk
◼ Note that the choice of (dis)similarity measure and criterion function will have a major impact on the
final clustering produced by the algorithms
◼ Example
◼ Which are the meaningful clusters in these cases?
◼ How many clusters should be considered?

A number of quantitative methods for cluster validity are proposed in [Theodoridis and Koutroumbas, 1999]
Optimization
Find the partition of the data set that minimizes the criterion function
◼ Exhaustive enumeration of all partitions, which guarantees the optimal solution, is infeasible
For example, a problem with 5 clusters and 100 examples yields on the order of 10^67 partitions
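As a quick sanity check of this count: the number of ways to partition 𝑁 examples into 𝐶 clusters (a Stirling number of the second kind) is well approximated by 𝐶^𝑁/𝐶! for large 𝑁, and a couple of lines of Python confirm the order of magnitude:

```python
from math import factorial

C, N = 5, 100
approx_partitions = C ** N // factorial(C)   # ~ 5^100 / 5!
print(f"{approx_partitions:.2e}")            # about 6.6e+67 partitions
```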
The common approach is to proceed in an iterative fashion
◼ Find some reasonable initial partition
◼ Move samples from one cluster to another in order to reduce the criterion function
These iterative methods produce sub-optimal solutions but are computationally
tractable
◼ Flat clustering algorithms
◼ These algorithms produce a set of disjoint clusters
◼ Two algorithms are widely used: k-means and ISODATA
◼ Hierarchical clustering algorithms:
◼ The result is a hierarchy of nested clusters
◼ These algorithms can be broadly divided into agglomerative and divisive approaches
K-Means Algorithm
k-means is a simple clustering procedure that attempts to minimize the criterion
function 𝐽𝑀𝑆𝐸 in an iterative fashion

1. Define the number of clusters


2. Initialize clusters by
◼ an arbitrary assignment of examples to clusters or
◼ an arbitrary set of cluster centers (some examples used as centers)
3. Compute the sample mean of each cluster
4. Reassign each example to the cluster with the nearest mean
5. If the classification of all samples has not changed, stop, else go to step 3

k-means is a particular case of the EM algorithm for mixture models
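A minimal NumPy sketch of these steps (illustrative only, not an optimized implementation; it initializes the centers with randomly chosen examples):

```python
import numpy as np

def kmeans(X, n_clusters, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]   # step 2: init
    labels = None
    for _ in range(max_iter):
        # step 4: assign each example to the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                                     # step 5: converged
        labels = new_labels
        # step 3: recompute the sample mean of each cluster (keep old center if empty)
        centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                            else centers[i] for i in range(n_clusters)])
    return labels, centers

X = np.vstack([np.random.default_rng(1).normal(loc=m, size=(50, 2)) for m in (0, 5)])
labels, centers = kmeans(X, n_clusters=2)
print(centers)   # typically recovers means near (0, 0) and (5, 5)
```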


K-Means Demo
Vector Quantization
◼ An application of k-means to signal
processing and communication
◼ Univariate signal values are usually
quantized into a number of levels
◼ Typically a power of 2 so the signal can be
transmitted in binary format
◼ Same idea can be extended for multiple
channels
◼ We could quantize each separate channel
◼ Instead, we can obtain a more efficient coding
if we quantize the overall multidimensional
vector by finding a number of multidimensional
prototypes (cluster centers)
◼ The set of cluster centers is called a
codebook, and the problem of finding
this codebook is normally solved using
the k-means algorithm
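A hedged sketch of the idea using SciPy's vector-quantization utilities: a 16-entry codebook (4 bits per transmitted sample) is learned with k-means from a hypothetical 2-channel signal, and each sample is then replaced by the index of its nearest prototype:

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

rng = np.random.default_rng(0)
signal = rng.normal(size=(10_000, 2))      # hypothetical 2-channel signal samples

# learn a codebook of 16 multidimensional prototypes with k-means
codebook, _ = kmeans2(signal, 16, minit='++', seed=0)

# quantize: each sample is encoded as the index of its nearest prototype
indices, _ = vq(signal, codebook)
reconstructed = codebook[indices]
print(np.mean(np.sum((signal - reconstructed) ** 2, axis=1)))   # mean quantization error
```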
ISODATA
◼ Iterative Self-Organizing Data Analysis (ISODATA)
◼ An extension to the k-means algorithm with some heuristics to automatically select the
number of clusters
◼ ISODATA requires the user to select a number of parameters
◼ N𝑀𝐼𝑁_𝐸𝑋 minimum number of examples per cluster
◼ 𝑁𝐷 desired (approximate) number of clusters
◼ 𝜎𝑆² maximum spread parameter for splitting
◼ 𝐷𝑀𝐸𝑅𝐺𝐸 maximum distance separation for merging
◼ 𝑁𝑀𝐸𝑅𝐺𝐸 maximum number of clusters that can be merged
◼ The algorithm works in an iterative fashion
1. Perform k-means clustering
2. Split any clusters whose samples are sufficiently dissimilar
3. Merge any two clusters sufficiently close
4. Go to 1
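A deliberately simplified sketch of this loop (not the full ISODATA heuristic: it merges at most one pair of clusters per pass, simply discards clusters that are too small, and omits several of the original safeguards; the parameter names mirror the list above):

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def isodata_sketch(X, n_desired, sigma_split, d_merge, n_min_ex, n_iter=5, seed=0):
    centers, _ = kmeans2(X, n_desired, minit='++', seed=seed)   # step 1: k-means
    for _ in range(n_iter):
        labels, _ = vq(X, centers)
        new_centers = []
        for i in range(len(centers)):
            pts = X[labels == i]
            if len(pts) < n_min_ex:                    # drop clusters with too few examples
                continue
            if pts.std(axis=0).max() > sigma_split:    # step 2: split clusters too spread out
                sub_centers, _ = kmeans2(pts, 2, minit='++', seed=seed)
                new_centers.extend(sub_centers)
            else:
                new_centers.append(pts.mean(axis=0))
        centers = np.array(new_centers)
        # step 3: merge the closest pair of centers if they are nearer than d_merge
        if len(centers) > 1:
            d = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
            np.fill_diagonal(d, np.inf)
            i, j = np.unravel_index(d.argmin(), d.shape)
            if d[i, j] < d_merge:
                merged = (centers[i] + centers[j]) / 2
                centers = np.vstack([np.delete(centers, [i, j], axis=0), merged])
    labels, _ = vq(X, centers)
    return labels, centers
```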
ISODATA
ISODATA has been shown to be an extremely powerful heuristic
◼ Advantages are
◼ Self-organizing capabilities
◼ Flexibility in eliminating clusters that have very few examples
◼ Ability to divide clusters that are too dissimilar
◼ Ability to merge clusters that are sufficiently similar
◼ Limitations
◼ Data must be linearly separable (long narrow or curved clusters are not handled properly)
◼ It is difficult to know a priori the “optimal” parameters
◼ Performance is highly dependent on these parameters
◼ For large datasets and a large number of clusters, ISODATA is less efficient than other linear methods
◼ Convergence is unknown, although it appears to work well for non-overlapping clusters
In practice, ISODATA is run multiple times with different values of the parameters
and the clustering with minimum SSE is selected
Hierarchical Clustering
k-means and ISODATA create disjoint clusters, resulting in a flat
data representation
Often a hierarchical representation of the data, with clusters and
sub-clusters arranged in a tree-structured fashion, is required
Hierarchical representations are commonly used in the sciences
(e.g., biological taxonomy)
◼ Hierarchical clustering methods can be grouped in two
general classes
◼ Agglomerative
◼ Also known as bottom-up or merging
◼ Starting with N singleton clusters, successively merge clusters until one cluster is left

◼ Divisive
◼ Also known as top-down or splitting
◼ Starting with a unique cluster, successively split the clusters until N singleton examples are left
Dendrograms
◼ A binary tree that shows the structure of the clusters
◼ Dendrograms are the preferred representation for hierarchical clusters
◼ In addition to the binary tree, the dendrogram provides the similarity measure between clusters (the
vertical axis)
◼ An alternative representation is based on sets
◼ {{𝑥1,{𝑥2,𝑥3}},{{{𝑥4,𝑥5},{𝑥6,𝑥7}},𝑥8}}
◼ However, unlike the dendrogram, sets cannot express quantitative information
Divisive Clustering
◼ Define
◼ 𝑁𝐶 - Number of clusters
◼ 𝑁𝐸𝑋 - Number of examples
1. Start with one large cluster
2. Find “worst” cluster
3. Split it
4. If 𝑁𝐶 < 𝑁𝐸𝑋 go to 2
◼ How to choose the “worst” cluster
◼ Largest number of examples
◼ Largest variance
◼ Largest sum-squared-error…
◼ How to split clusters
◼ Mean-median in one feature direction
◼ Perpendicular to the direction of largest variance…
◼ The computations required by divisive clustering are more intensive than for
agglomerative clustering methods
◼ For this reason, agglomerative approaches are more popular
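A brief sketch of divisive clustering under one set of these choices: the “worst” cluster is taken to be the one with the largest sum-squared-error, and it is split with a 2-means run (only one of the possible splitting rules listed above):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def divisive(X, n_clusters, seed=0):
    labels = np.zeros(len(X), dtype=int)
    while labels.max() + 1 < n_clusters:
        # find the "worst" cluster: the one with the largest sum-squared-error
        sse = [np.sum((X[labels == i] - X[labels == i].mean(axis=0)) ** 2)
               for i in range(labels.max() + 1)]
        worst = int(np.argmax(sse))
        idx = np.flatnonzero(labels == worst)
        _, sub = kmeans2(X[idx], 2, minit='++', seed=seed)   # split it with 2-means
        labels[idx[sub == 1]] = labels.max() + 1             # one half becomes a new cluster
    return labels

X = np.vstack([np.random.default_rng(i).normal(loc=3 * i, size=(30, 2)) for i in range(3)])
print(np.bincount(divisive(X, 3)))   # typically three clusters of roughly 30 examples each
```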
Agglomerative Clustering
◼ Define
◼ 𝑁𝐶 - Number of clusters
◼ 𝑁𝐸𝑋 - Number of examples
1. Start with 𝑁𝐸𝑋 singleton clusters
2. Find nearest clusters
3. Merge them
4. If 𝑁𝐶>1 go to 2

◼ How to find the “nearest” pair of clusters


◼ Minimum distance: 𝑑𝑚𝑖𝑛(𝜔𝑖,𝜔𝑗) = min ||𝑥−𝑦||, 𝑥∈𝜔𝑖, 𝑦∈𝜔𝑗
◼ Maximum distance: 𝑑𝑚𝑎𝑥(𝜔𝑖,𝜔𝑗) = max ||𝑥−𝑦||, 𝑥∈𝜔𝑖, 𝑦∈𝜔𝑗
◼ Average distance: 𝑑𝑎𝑣𝑔(𝜔𝑖,𝜔𝑗) = (1/(𝑁𝑖𝑁𝑗)) Σ𝑥∈𝜔𝑖 Σ𝑦∈𝜔𝑗 ||𝑥−𝑦||
◼ Mean distance: 𝑑𝑚𝑒𝑎𝑛(𝜔𝑖,𝜔𝑗) = ||𝑚𝑖 − 𝑚𝑗||, where 𝑚𝑖 is the mean of cluster 𝜔𝑖

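A compact sketch of these four inter-cluster distances, assuming each cluster is given as a NumPy array of points:

```python
import numpy as np
from scipy.spatial.distance import cdist

def d_min(A, B):  return cdist(A, B).min()    # single linkage (nearest neighbor)
def d_max(A, B):  return cdist(A, B).max()    # complete linkage (farthest neighbor)
def d_avg(A, B):  return cdist(A, B).mean()   # average of all N_i * N_j pairwise distances
def d_mean(A, B): return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])
print(d_min(A, B), d_max(A, B), d_avg(A, B), d_mean(A, B))   # 3.0 6.0 4.5 4.5
```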
Agglomerative Clustering
◼ Minimum distance
◼ When 𝑑𝑚𝑖𝑛 is used to measure distance between clusters, the algorithm is called the nearest-
neighbor or single-linkage clustering algorithm
◼ If the algorithm is allowed to run until only one cluster remains, the result is a minimum spanning
tree (MST)
◼ This algorithm favors elongated classes
◼ Maximum distance
◼ When 𝑑𝑚𝑎𝑥 is used to measure distance between clusters, the algorithm is called the farthest-
neighbor or complete-linkage clustering algorithm
◼ From a graph-theoretic point of view, each cluster constitutes a complete sub-graph
◼ This algorithm favors compact classes
◼ Average and mean distance
◼ 𝑑𝑚𝑖𝑛 and 𝑑𝑚𝑎𝑥 are extremely sensitive to outliers since their measurement of between-cluster
distance involves minima or maxima
◼ 𝑑𝑎𝑣𝑔 and 𝑑𝑚𝑒𝑎𝑛 are more robust to outliers
◼ Of the two, 𝑑𝑚𝑒𝑎𝑛 is more attractive computationally
Notice that 𝑑𝑎𝑣𝑔 involves the computation of 𝑁𝑖𝑁𝑗 pairwise distances
Agglomerative Clustering Example
◼ Perform agglomerative clustering on 𝑋 using the single-linkage metric
𝑋 = {1,3,4,9,10,13,21,23,28,29}
In case of ties, always merge the pair of clusters with the largest mean
Indicate the order in which the merging operations occur

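One hedged way to check this exercise is with SciPy's hierarchical clustering routines; note that SciPy applies its own tie-breaking rule, which need not coincide with the "largest mean" convention requested above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([1, 3, 4, 9, 10, 13, 21, 23, 28, 29], dtype=float).reshape(-1, 1)

# single-linkage (nearest-neighbor) agglomerative clustering
Z = linkage(X, method='single')
print(Z[:, 2])     # merge distances, listed in the order the merges occur
# dendrogram(Z)    # would draw the binary tree, with merge distance on the vertical axis
```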
