DWDM Unit-5
Cluster Analysis
Clustering is the process of grouping a set of data objects into multiple groups or clusters so that
objects within a cluster have high similarity, but are very dissimilar to objects in other clusters.
Cluster analysis or simply clustering is the process of partitioning a set of data objects (or observations)
into subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters. The set of clusters resulting from a cluster analysis can be
referred to as a clustering.
Because a cluster is a collection of data objects that are similar to one another within the cluster and
dissimilar to objects in other clusters, a cluster of data objects can be treated as an implicit class. In
this sense, clustering is sometimes called automatic classification.
Clustering is also called data segmentation in some applications because clustering partitions large data
sets into groups according to their similarity. Clustering can also be used for outlier detection, where
outliers (values that are “far away” from any cluster) may be more interesting than common cases.
Cluster analysis has been widely used in many applications such as business intelligence, image
pattern recognition, Web search, biology, and security.
In business intelligence, clustering can be used to organize a large number of customers into groups, where customers within a group share similar characteristics. This facilitates the development
of business strategies for enhanced customer relationship management. Moreover, consider a
consultant company with a large number of projects.
To improve project management, clustering can be applied to partition projects into categories based
on similarity so that project auditing and diagnosis (to improve project delivery and outcomes) can be
conducted effectively.
In image recognition, clustering can be used to discover clusters or “subclasses” in handwritten
character recognition systems.
The following are typical requirements of clustering in data mining:
Ability to deal with different types of attributes: Many algorithms are designed to cluster numeric (interval-based) data. However, applications may require clustering other data types, such as binary, nominal (categorical), and ordinal data, or mixtures of these data types.
Capability of clustering high-dimensional data: A data set can contain numerous dimensions
or attributes. Most clustering algorithms are good at handling low-dimensional data such
as data sets involving only two or three dimensions. Finding clusters of data objects in a
high-dimensional space is challenging, especially considering that such data can be very sparse
and highly skewed.
There are many clustering algorithms. In general, the major fundamental clustering methods can
be classified into the following categories:
Partitioning methods: Given a set of n objects, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k ≤ n. That is, it divides the data into k groups such
that each group must contain at least one object.
Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of
data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up
approach, starts with each object forming a separate group. It successively merges the objects or
groups close to one another, until all the groups are merged into one (the topmost level of the
hierarchy), or a termination condition holds. The divisive approach, also called the top-down approach,
starts with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller
clusters, until eventually each object is in one cluster, or a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never
be undone.
Density-based methods: Their general idea is to continue growing a given cluster as long as the density (number of objects or
data points) in the “neighborhood” exceeds some threshold. For example, for each data point within a
given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that
form a grid structure. All the clustering operations are performed on the grid structure (i.e., on
the quantized space). The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the number of
cells in each dimension in the quantized space.
1) Partitioning Methods:
Given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes
the objects into k partitions (k ≤ n), where each partition represents a cluster. The clusters are formed to
optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that
the objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters in
terms of the data set attributes.
The most common partitioning method, k-means, proceeds as follows:
First, it randomly selects k of the objects in D, each of which initially represents a cluster mean or
center.
For each of the remaining objects, an object is assigned to the cluster to which it is the most similar,
based on the Euclidean distance between the object and the cluster mean.
For each cluster, it computes the new mean using the objects assigned to the cluster in the
previous iteration.
All the objects are then reassigned using the updated means as the new cluster centers.
The iterations continue until the assignment is stable, that is, the clusters formed in the current round
are the same as those formed in the previous round.
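As an illustration, the following is a minimal sketch of this k-means procedure in Python (the NumPy implementation, the sample data, and the convergence test are assumptions for the example, not part of the text above):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means: random initial centers, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k objects as the initial cluster means.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each cluster mean from the objects assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Stop when the means (and hence the assignment) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical 2-D data: two loose groups around (0, 0) and (5, 5).
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centers = kmeans(X, k=2)
print(centers)
```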
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster; this is the k-medoids approach. Each remaining object is assigned to the cluster whose representative object is the most similar.
Hierarchical Methods:
In either agglomerative or divisive hierarchical clustering, a user can specify the desired number
of clusters as a termination condition.
Initially, AGNES, the agglomerative method, places each object into a cluster of its own.
For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters.
This is a single-linkage approach in that each cluster is represented by all the objects in the cluster, and
the similarity between two clusters is measured by the similarity of the closest pair of data
points belonging to different clusters.
The cluster-merging process repeats until all the objects are eventually merged to form one cluster.
DIANA, the divisive method, proceeds in the contrasting way. All the objects are used to form one
initial cluster. The cluster is split according to some principle such as the maximum Euclidean
distance between the closest neighboring objects in the cluster.
The cluster-splitting process repeats until, eventually, each new cluster contains only a single object.
A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering, with a vertical axis showing the similarity (or distance) scale between clusters. For example, when the similarity of two groups of objects, {a, b} and {c, d, e}, is roughly 0.16, they are merged together to form a single cluster.
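For illustration, the following sketch uses SciPy's agglomerative single-linkage clustering rather than the original AGNES/DIANA programs; the five sample points labeled a–e are made up to loosely mirror the {a, b} and {c, d, e} example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Hypothetical 2-D objects; labels a-e loosely mirror the {a, b} / {c, d, e} example.
points = np.array([[1.0, 1.0], [1.2, 1.1],                 # a, b
                   [5.0, 5.0], [5.1, 4.9], [5.3, 5.2]])    # c, d, e

# Single linkage: distance between clusters = distance of the closest pair.
Z = linkage(points, method="single", metric="euclidean")

# Cut the hierarchy into 2 clusters (a user-specified termination condition).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# The dendrogram's vertical axis shows the distance at which each merge happens.
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.show()
```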
Whether using an agglomerative method or a divisive method, a core need is to measure the distance
between two clusters, where each cluster is generally a set of objects.
Four widely used measures for distance between clusters are as follows, where |p − p′| is the distance between two objects p and p′, mi is the mean for cluster Ci, and ni is the number of objects in Ci:
Minimum distance: dmin(Ci, Cj) = min |p − p′|, over all p in Ci and p′ in Cj
Maximum distance: dmax(Ci, Cj) = max |p − p′|, over all p in Ci and p′ in Cj
Mean distance: dmean(Ci, Cj) = |mi − mj|
Average distance: davg(Ci, Cj) = (1 / (ni nj)) Σ Σ |p − p′|, summed over all p in Ci and p′ in Cj
When an algorithm uses the minimum distance, dmin(Ci, Cj), to measure the distance between clusters, it is sometimes called a nearest-neighbor clustering algorithm or minimal spanning tree algorithm. Moreover, if the clustering process is terminated when the distance between nearest clusters exceeds a user-defined threshold, it is called a single-linkage algorithm.
When an algorithm uses the maximum distance, dmax(Ci, Cj), to measure the distance between clusters, it is sometimes called a farthest-neighbor clustering algorithm. If the clustering process is terminated when the maximum distance between nearest clusters exceeds a user-defined threshold, it is called a complete-linkage algorithm.
The previous minimum and maximum measures represent two extremes in measuring the
distance between clusters. They tend to be overly sensitive to outliers or noisy data. The use of mean
or average distance is a compromise between the minimum and maximum distances and overcomes
the outlier sensitivity problem. Whereas the mean distance is the simplest to compute, the
average distance is advantageous in that it can handle categorical as well as numeric data.
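A short NumPy sketch of these four inter-cluster distance measures (the two small clusters are made-up data for illustration):

```python
import numpy as np

Ci = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # cluster Ci (ni = 3)
Cj = np.array([[5.0, 5.0], [6.0, 5.0]])               # cluster Cj (nj = 2)

# All pairwise distances |p - p'| between objects of the two clusters.
pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

d_min = pairwise.min()                                       # nearest-neighbor / single linkage
d_max = pairwise.max()                                       # farthest-neighbor / complete linkage
d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))   # |mi - mj|
d_avg = pairwise.mean()                                      # average over all ni*nj pairs

print(d_min, d_max, d_mean, d_avg)
```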
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is designed for clustering a large
amount of numeric data by integrating hierarchical clustering (at the initial micro clustering stage) and
other clustering methods such as iterative partitioning (at the later macro clustering stage). It
overcomes the two difficulties in agglomerative clustering methods: (1) scalability and (2) the inability
to undo what was done in the previous step.
BIRCH uses the notions of clustering feature to summarize a cluster, and clustering feature tree (CF-
tree) to represent a cluster hierarchy. These structures help the clustering method achieve good speed
and scalability in large databases, and also make it effective for incremental and dynamic
clustering of incoming objects.
Consider a cluster of n d-dimensional data objects or points x1, x2, …, xn. The clustering feature (CF) of the cluster is a 3-D vector summarizing information about the cluster. It is defined as CF = <n, LS, SS>, where LS is the linear sum of the n points (the vector sum of x1 through xn) and SS is the square sum of the points (the sum of their squared lengths).
A clustering feature is essentially a summary of the statistics for the given cluster. Using a clustering feature, we can easily derive many useful statistics of a cluster. For example, the cluster's centroid, x0, radius, R, and diameter, D, are
x0 = LS / n,
R = sqrt( (Σi |xi − x0|²) / n ),
D = sqrt( (Σi Σj |xi − xj|²) / (n (n − 1)) ).
Here, R is the average distance from member objects to the centroid, and D is the average pairwise
distance within a cluster. Both R and D reflect the tightness of the cluster around the centroid.
Summarizing a cluster using the clustering feature can avoid storing the detailed information
about individual objects or points. Instead, we only need a constant size of space to store the
clustering feature.
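A small sketch of the clustering-feature idea (the class name ClusteringFeature and the sample points are illustrative, not BIRCH's actual code): only n, LS, and SS are stored, yet the centroid and radius can be recovered from them, and two CFs can be merged by simple addition.

```python
import numpy as np

class ClusteringFeature:
    """CF = <n, LS, SS>: count, linear sum, and square sum of the points."""
    def __init__(self, points):
        pts = np.asarray(points, dtype=float)
        self.n = len(pts)
        self.LS = pts.sum(axis=0)            # linear sum, a d-dimensional vector
        self.SS = (pts ** 2).sum()           # square sum, a scalar

    def centroid(self):
        return self.LS / self.n              # x0 = LS / n

    def radius(self):
        # R = sqrt( (sum_i |x_i - x0|^2) / n ) = sqrt( SS/n - |LS/n|^2 )
        return np.sqrt(self.SS / self.n - np.sum((self.LS / self.n) ** 2))

    def merge(self, other):
        # CF additivity: CF1 + CF2 = <n1 + n2, LS1 + LS2, SS1 + SS2>
        merged = ClusteringFeature.__new__(ClusteringFeature)
        merged.n = self.n + other.n
        merged.LS = self.LS + other.LS
        merged.SS = self.SS + other.SS
        return merged

cf1 = ClusteringFeature([[2, 5], [3, 2], [4, 3]])
cf2 = ClusteringFeature([[3, 2], [4, 3]])
print(cf1.centroid(), cf1.radius())
print(cf1.merge(cf2).n)
```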
A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. An example is shown in the figure.
The nonleaf nodes store sums of the CFs of their children, and thus summarize clustering information
about their children. A CF-tree has two parameters: branching factor, B, and threshold, T. The branching
factor specifies the maximum number of children per nonleaf node. The threshold parameter specifies
the maximum diameter of subclusters stored at the leaf nodes of the tree. These two
parameters implicitly control the resulting tree’s size.
BIRCH applies a multiphase clustering technique: A single scan of the data set yields a basic,
good clustering, and one or more additional scans can optionally be used to further improve the
quality. The primary phases are:
Phase 1: BIRCH scans the database to build an initial in-memory CF-tree, which can be viewed as
a multilevel compression of the data that tries to preserve the data’s inherent clustering structure.
Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF-tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.
For Phase 1, the CF-tree is built dynamically as objects are inserted. Thus, the method is incremental.
If the diameter of the subcluster stored in the leaf node after insertion is larger than the threshold
value, then the leaf node and possibly other nodes are split.
After the insertion of the new object, information about the object is passed toward the root of the
tree. The size of the CF-tree can be changed by modifying the threshold.
If the size of the memory that is needed for storing the CF-tree is larger than the size of the
main memory, then a larger threshold value can be specified and the CF-tree is rebuilt.
The rebuild process is performed by building a new tree from the leaf nodes of the old tree. Thus, the
process of rebuilding the tree is done without the necessity of rereading all the objects or points. This is
similar to the insertion and node split in the construction of B+-trees. Therefore, for building the tree, data has to be read just once.
Once the CF-tree is built, any clustering algorithm, such as a typical partitioning algorithm, can be used
with the CF-tree in Phase 2.
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters. It uses a graph partitioning algorithm to partition the k-nearest-neighbor graph of the data into a large number of relatively small subclusters.
Chameleon then uses an agglomerative hierarchical clustering algorithm that iteratively merges subclusters based on their similarity. To determine the pairs of most similar subclusters, it takes into account both the interconnectivity and the closeness of the clusters.
The relative interconnectivity, RI(Ci, Cj), between two clusters Ci and Cj is defined as the absolute interconnectivity between Ci and Cj, normalized with respect to the internal interconnectivity of the two clusters. That is,
RI(Ci, Cj) = |EC(Ci, Cj)| / ((|EC(Ci)| + |EC(Cj)|) / 2),
where EC(Ci, Cj) is the edge cut between Ci and Cj (the sum of the weights of the edges connecting the two clusters), and EC(Ci) (or EC(Cj)) is the minimum sum of the cut edges that partition Ci (or Cj) into two roughly equal parts.
Density-Based Methods:
Partitioning and hierarchical methods are designed to find spherical-shaped clusters; they have difficulty finding clusters of arbitrary shape, such as the “S”-shaped and oval clusters. Density-based methods instead model clusters as dense regions in the data space, separated by sparse regions.
The ε-neighborhood of an object p is the space within a radius ε centered at p. An object is a core object if its ε-neighborhood contains at least MinPts objects.
Given a set, D, of objects, we can identify all core objects with respect to the given parameters, ε and MinPts. The clustering task is thereby reduced to using core objects and their neighborhoods to form dense regions, where the dense regions are clusters.
For a core object q and an object p, we say that p is directly density-reachable from q (with respect to ε and MinPts) if p is within the ε-neighborhood of q. Clearly, an object p is directly density-reachable from another object q if and only if q is a core object and p is in the ε-neighborhood of q. Using the directly density-reachable relation, a core object can “bring” all objects from its ε-neighborhood into a dense region.
In DBSCAN, p is density-reachable from q (with respect to ε and MinPts in D) if there is a chain of objects p1, …, pn in D such that p1 = q, pn = p, and pi+1 is directly density-reachable from pi with respect to ε and MinPts, for 1 ≤ i < n.
To connect core objects as well as their neighbors in a dense region, DBSCAN uses the notion of density connectedness. Two objects p1, p2 in D are density-connected with respect to ε and MinPts if there is an object q in D such that both p1 and p2 are density-reachable from q with respect to ε and MinPts.
To illustrate, let ε be represented by the radius of the circles in the figure, and, say, let MinPts = 3. Initially, all objects in D are marked as “unvisited.” DBSCAN randomly selects an unvisited object p, marks it as visited, and checks whether the ε-neighborhood of p contains at least MinPts objects. If not, p is marked (for now) as a noise point. Otherwise, a new cluster C is created for p, and all the objects in the ε-neighborhood of p are added to a candidate set, N.
DBSCAN iteratively adds to C those objects in N that do not belong to any cluster.
In this process, for an object p′ in N that carries the label “unvisited,” DBSCAN marks it as “visited” and checks its ε-neighborhood. If the ε-neighborhood of p′ has at least MinPts objects, those objects in the ε-neighborhood of p′ are added to N.
DBSCAN continues adding objects to C until C can no longer be expanded, that is, N is empty. At this
time, cluster C is completed, and thus is output.
To find the next cluster, DBSCAN randomly selects an unvisited object from the remaining ones. The
clustering process continues until all objects are visited.
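A compact sketch of this DBSCAN procedure (the implementation details, parameter values, and sample data are assumptions for illustration; label -1 stands for noise):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Compact DBSCAN sketch: -1 marks noise; 0, 1, ... mark clusters."""
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / not yet in a cluster
    visited = np.zeros(n, dtype=bool)
    # Pairwise distances; neighborhood(i) = indices within radius eps of point i.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighborhood = lambda i: np.flatnonzero(dist[i] <= eps)

    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        N = list(neighborhood(p))
        if len(N) < min_pts:
            continue                    # p is (for now) noise, not a core object
        labels[p] = cluster_id          # start a new cluster C around core object p
        i = 0
        while i < len(N):               # expand C via the candidate set N
            q = N[i]
            if not visited[q]:
                visited[q] = True
                q_neighbors = neighborhood(q)
                if len(q_neighbors) >= min_pts:          # q is also a core object
                    N.extend(j for j in q_neighbors if j not in N)
            if labels[q] == -1:         # add q to C if it belongs to no cluster yet
                labels[q] = cluster_id
            i += 1
        cluster_id += 1
    return labels

# Hypothetical data: two dense blobs plus one isolated (noise) point.
X = np.vstack([np.random.randn(30, 2) * 0.3,
               np.random.randn(30, 2) * 0.3 + 5,
               [[10.0, 10.0]]])
print(dbscan(X, eps=1.0, min_pts=3))
```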
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object
space into a finite number of cells that form a grid structure on which all of the operations for
clustering are performed.
STING (STatistical INformation Grid) is a typical grid-based method: the spatial area is divided into rectangular cells at several levels of resolution. Statistical information for each cell is calculated and stored beforehand and is used to answer queries. The parameters of higher-level cells can easily be calculated from the parameters of the lower-level cells.
STING uses a top-down approach to answer spatial data queries. The STING algorithm procedure is:
• Start from a pre-selected layer—typically one with a small number of cells.
• From the pre-selected layer until you reach the bottom layer, do the following: for each cell in the current level, compute the confidence interval indicating the cell's relevance to the given query.
• Combine relevant cells into relevant regions (based on grid-neighborhood) and return the so-obtained clusters as the answer.
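A toy sketch of the grid idea behind STING (the two-level grid, the per-cell statistic, and the query threshold are illustrative assumptions): statistics are precomputed per cell, higher-level cells aggregate their children, and a query is answered from cell statistics rather than from individual points.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D points: a dense group in the lower-left plus a sparse background.
points = np.vstack([rng.random((300, 2)) * 0.5,
                    rng.random((60, 2))])

# Bottom layer: a 4x4 grid of cells; store a simple statistic (the count) per cell.
cells = 4
idx = np.minimum((points * cells).astype(int), cells - 1)   # cell index per point
counts = np.zeros((cells, cells))
for ix, iy in idx:
    counts[ix, iy] += 1

# Higher layer: each 2x2 block of bottom-level cells rolls up into one parent cell,
# so parent statistics come from child statistics, not from rescanning the points.
parent_counts = counts.reshape(2, 2, 2, 2).sum(axis=(1, 3))

# Answer a density query top-down: keep only higher-level cells above a threshold;
# the full method would then drill down into the children of the relevant cells only.
threshold = 100
relevant_parents = np.argwhere(parent_counts > threshold)
print(parent_counts)
print("relevant higher-level cells:", relevant_parents.tolist())
```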
CLIQUE: The Major Steps
Partition the data space and find the number of points that lie inside each cell of the partition.
Identify the subspaces that contain clusters using the Apriori principle
Identify clusters:
Determine dense units in all subspaces of interest and connect them.
Determine maximal regions that cover a cluster of connected dense units for each cluster.
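A rough sketch of the dense-unit idea (the grid resolution, density threshold, and data are assumptions; the full CLIQUE algorithm generates candidate higher-dimensional units Apriori-style and also produces minimal cluster descriptions):

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 3-D data with a cluster living in the (x, y) subspace only.
X = np.column_stack([rng.normal(0.3, 0.03, 200),   # x: concentrated
                     rng.normal(0.7, 0.03, 200),   # y: concentrated
                     rng.random(200)])             # z: uniform (irrelevant dimension)

xi = 10           # number of grid intervals per dimension
tau = 30          # density threshold: a unit is dense if it holds > tau points
cells = np.minimum((X * xi).astype(int), xi - 1)   # 1-D interval index per dimension

# Step 1: find dense 1-D units in every dimension.
dense_1d = {d: {u for u in range(xi)
                if np.sum(cells[:, d] == u) > tau}
            for d in range(X.shape[1])}

# Step 2 (Apriori idea): a 2-D unit can only be dense if both of its
# 1-D projections are dense, so candidates are built from dense 1-D units.
dense_2d = {}
for d1, d2 in combinations(range(X.shape[1]), 2):
    for u1 in dense_1d[d1]:
        for u2 in dense_1d[d2]:
            count = np.sum((cells[:, d1] == u1) & (cells[:, d2] == u2))
            if count > tau:
                dense_2d[(d1, d2, u1, u2)] = count

print("dense 1-D units per dimension:", {d: sorted(u) for d, u in dense_1d.items()})
print("dense 2-D units (subspace clusters):", dense_2d)
```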
Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.
WaveCluster
It was proposed by Sheikholeslami, Chatterjee, and Zhang (VLDB'98).
Input parameters: the number of grid cells for each dimension and the wavelet transform to apply.
To find clusters, it imposes a grid structure onto the data space and applies the wavelet transform to the quantized feature space, in which dense regions appear as clusters.
Major features:
It is insensitive to the order of records in the input and does not presume any canonical data distribution.
It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.
Disadvantages:
The accuracy of the clustering result may be sacrificed in exchange for the simplicity and speed of the method.
Summary
Grid-Based Clustering -> It is one of the methods of cluster analysis which uses a
multi-resolution grid data structure.
Clustering: Model-Based Techniques and Handling High-Dimensional Data
Model-Based Clustering Methods
Model-based methods attempt to optimize the fit between the data and some mathematical model. Common techniques include Expectation-Maximization, conceptual clustering, and the neural network approach.
Expectation-Maximization (EM)
In EM, each cluster is represented mathematically by a parametric probability distribution, called a component distribution, and the data set is modeled as a mixture of these component distributions.
EM is an iterative refinement algorithm used to find the parameter estimates. It can be viewed as an extension of k-means: it assigns an object to a cluster according to a weight representing its probability of membership. Starting from an initial estimate of the parameters, it iteratively reassigns the membership scores and refines the parameters.
The algorithm makes an initial guess for the parameters by randomly selecting k objects to represent the cluster means or centers, and then iteratively refines the parameters (and hence the clusters) in two steps:
Expectation step: Assign each object xi to cluster Ck with probability P(xi ∈ Ck) = p(Ck | xi) = p(Ck) p(xi | Ck) / p(xi), where p(xi | Ck) is the density of component Ck at xi (for example, a Gaussian centered at mean mk).
Maximization step: Use the probability estimates from the expectation step to re-estimate (refine) the model parameters.
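A minimal EM sketch for a one-dimensional mixture of two Gaussians (the data, the choice of 1-D Gaussian components, and the stopping rule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D data drawn from two Gaussian components.
x = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(6.0, 1.5, 100)])

k = 2
# Initial guess: pick k objects as means; unit variances; equal mixing weights.
means = rng.choice(x, size=k, replace=False)
variances = np.ones(k)
weights = np.full(k, 1.0 / k)

def gaussian(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # Expectation step: membership weight P(x_i in C_k) via Bayes' rule.
    resp = np.stack([w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)])
    resp /= resp.sum(axis=0, keepdims=True)

    # Maximization step: re-estimate the parameters from the weighted objects.
    nk = resp.sum(axis=1)
    new_means = (resp * x).sum(axis=1) / nk
    variances = (resp * (x - new_means[:, None]) ** 2).sum(axis=1) / nk
    weights = nk / len(x)
    if np.allclose(new_means, means, atol=1e-6):
        break
    means = new_means

print("estimated means:", means, "variances:", variances, "weights:", weights)
```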
Conceptual Clustering
Conceptual clustering is a form of clustering in machine learning that produces a classification scheme for a set of unlabeled objects and finds a characteristic description for each concept (class).
COBWEB is a popular and simple method of incremental conceptual learning. It creates a hierarchical clustering in the form of a classification tree, where each node refers to a concept and contains a probabilistic description of that concept.
COBWEB Clustering Method: a classification tree (figure).
COBWEB
In the classification tree, each node represents a concept and its probabilistic distribution (a summary of the objects under that node). The description consists of conditional probabilities of the form P(Ai = vij | Ck). Sibling nodes at a given level form a partition. To decide where to place a new object, COBWEB uses a heuristic evaluation measure called category utility.
Category utility rewards:
Intra-class similarity, P(Ai = vij | Ck): a high value indicates that many class members share this attribute–value pair.
Inter-class dissimilarity, P(Ck | Ai = vij): a high value indicates that few objects in contrasting classes share this attribute–value pair.
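For reference, a commonly used form of the category utility measure (reconstructed from the standard formulation; n here is the number of concepts forming the partition at the given level) is:

\[
CU = \frac{1}{n}\sum_{k=1}^{n} P(C_k)\left[\sum_{i}\sum_{j} P(A_i = v_{ij} \mid C_k)^2 \;-\; \sum_{i}\sum_{j} P(A_i = v_{ij})^2\right]
\]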
To insert a new object, COBWEB descends the tree to identify the best host: it temporarily places the object in each node and computes the category utility of the resulting partition. The placement that gives the highest category utility is chosen. COBWEB may also form a new node if the object does not fit well into the existing tree.
COBWEB is sensitive to the order of the records. It uses two additional operations, merging and splitting, to guard against this: the two best hosts are considered for merging, and the best host is considered for splitting.
Limitations: the assumption that the attributes are independent of each other is often too strong because correlations may exist, and COBWEB is not suitable for clustering large database data.
CLASSIT is an extension of COBWEB for incremental clustering of continuous data.
Neural Network Approach
The neural network approach represents each cluster as an exemplar, which acts as a “prototype” of the cluster. New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure.
Self-Organizing Map (SOM): SOMs perform clustering by competitive learning. They involve a hierarchical architecture of several units (neurons), and the neurons compete in a “winner-takes-all” fashion for the object currently being presented. The organization of the units forms a feature map. SOMs, such as the Kohonen SOM, have been used for applications such as Web document clustering.
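A compact sketch of the competitive-learning update behind a SOM (the map size, learning-rate and neighborhood schedules, and the random 3-D data are illustrative assumptions, not a full Kohonen SOM implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 3))                 # hypothetical 3-D input objects

rows, cols, dim = 5, 5, X.shape[1]
weights = rng.random((rows, cols, dim))  # one exemplar ("prototype") per map unit
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

n_iter, lr0, sigma0 = 2000, 0.5, 2.0
for t in range(n_iter):
    x = X[rng.integers(len(X))]
    # Competition: the winner is the unit whose weight vector is closest to x.
    dists = np.linalg.norm(weights - x, axis=2)
    winner = np.unravel_index(dists.argmin(), dists.shape)
    # Cooperation: units near the winner on the map are pulled toward x too,
    # with the learning rate and neighborhood radius decaying over time.
    lr = lr0 * np.exp(-t / n_iter)
    sigma = sigma0 * np.exp(-t / n_iter)
    grid_dist2 = ((grid - np.array(winner)) ** 2).sum(axis=-1)
    h = np.exp(-grid_dist2 / (2 * sigma ** 2))[:, :, None]
    weights += lr * h * (x - weights)

# Each input maps to its best-matching unit; similar inputs land on nearby units.
bmu = np.array([np.unravel_index(np.linalg.norm(weights - x, axis=2).argmin(),
                                 (rows, cols)) for x in X])
print(bmu[:10])
```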
Clustering High-Dimensional Data
As dimensionality increases, the number of irrelevant dimensions may produce noise and mask the real clusters. Techniques for handling this include:
Feature transformation methods (e.g., PCA and SVD), which summarize the data by creating linear combinations of the attributes. They do not remove any attributes, however, and the transformed attributes can be complex to interpret.
Feature selection methods, which find the most relevant set of attributes with respect to the class labels, for example using entropy analysis.
Subspace clustering, which searches for groups of clusters within different subspaces of the same data set.
CLIQUE (CLustering In QUest)
CLIQUE is a dimension-growth subspace clustering method: it starts at 1-D and grows upward to higher dimensions. It partitions each dimension into a grid and determines whether a cell is dense.
CLIQUE first partitions the d-dimensional data space into non-overlapping units, performed in 1-D for each dimension. It is based on the Apriori property: if a k-dimensional unit is dense, so are its projections in (k−1)-dimensional space. This reduces the size of the search space.
CLIQUE finds the subspaces of the highest dimensionality in which dense units exist. It is insensitive to the order of the inputs, and its performance depends on the grid size and the density threshold.
PROCLUS (PROjected CLUStering)
PROCLUS is a dimension-reduction subspace clustering technique. It finds an initial approximation of the clusters in the high-dimensional space, which avoids generating a large number of overlapped clusters of lower dimensionality. It finds the best set of medoids by a hill-climbing process (similar to CLARANS) and uses the Manhattan segmental distance measure.
PROCLUS has three phases:
Initialization phase: a greedy algorithm selects a set of initial medoids that are far apart.
Iteration phase: iteratively improves the set of medoids (replacing poor medoids with better ones) and determines the relevant dimensions for each medoid.
Refinement phase: computes new dimensions for each medoid based on the clusters found, reassigns points to medoids, and removes outliers.
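A quick sketch of the Manhattan segmental distance used by PROCLUS (the points and the chosen dimension set are made up): the Manhattan distance is computed only over a cluster's relevant dimensions D and then averaged by |D|.

```python
import numpy as np

def manhattan_segmental(x, y, dims):
    """Manhattan distance restricted to the dimension set `dims`, averaged by |dims|."""
    dims = list(dims)
    return np.abs(x[dims] - y[dims]).sum() / len(dims)

x = np.array([1.0, 8.0, 3.0, 0.5])
y = np.array([2.0, 1.0, 3.5, 0.0])

# If a projected cluster is relevant only in dimensions 0, 2, and 3,
# the distance ignores the (noisy) dimension 1.
print(manhattan_segmental(x, y, dims=[0, 2, 3]))   # (1.0 + 0.5 + 0.5) / 3
print(manhattan_segmental(x, y, dims=range(4)))    # full Manhattan distance / 4
```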
Frequent Pattern-Based Clustering
Frequent patterns may also form clusters. Instead of growing clusters dimension by dimension, sets of frequent itemsets are determined. Two common techniques are frequent-term-based text clustering and clustering by pattern similarity.
9
Frequent-term based text
clustering
Text documents are clustered based on
frequent terms they contain
Documents – terms Dimensionality is very high
Frequent term based analysis
5
0
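A toy sketch of the frequent-term idea (the documents, the minimum-support threshold, and the grouping rule are illustrative assumptions, not the full frequent-term-based clustering algorithm): frequent term sets are found first, and documents are grouped by the frequent term sets they contain rather than by their full high-dimensional term vectors.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tiny corpus, already tokenized and lower-cased.
docs = [
    {"data", "mining", "cluster", "analysis"},
    {"data", "mining", "association", "rules"},
    {"cluster", "analysis", "partitioning"},
    {"web", "search", "ranking"},
    {"web", "search", "index", "ranking"},
]
min_support = 2   # a term (or term set) is frequent if it appears in >= 2 documents

# Frequent single terms.
term_counts = Counter(t for d in docs for t in d)
frequent_terms = {t for t, c in term_counts.items() if c >= min_support}

# Frequent term pairs, built only from frequent single terms (Apriori idea).
pair_counts = Counter(p for d in docs
                      for p in combinations(sorted(d & frequent_terms), 2))
frequent_pairs = [set(p) for p, c in pair_counts.items() if c >= min_support]

# Group documents by the frequent term sets they cover; each set acts as a cluster label.
clusters = {tuple(sorted(ts)): [i for i, d in enumerate(docs) if ts <= d]
            for ts in frequent_pairs}
print(clusters)
```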
Constraint-Based Cluster Analysis