
DWDM Unit-5


V. Cluster Analysis

Clustering is the process of grouping a set of data objects into multiple groups or clusters so that
objects within a cluster have high similarity, but are very dissimilar to objects in other clusters.

Cluster analysis or simply clustering is the process of partitioning a set of data objects (or observations)
into subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters. The set of clusters resulting from a cluster analysis can be
referred to as a clustering.

Because a cluster is a collection of data objects that are similar to one another within the cluster and
dissimilar to objects in other clusters, a cluster of data objects can be treated as an implicit class. In
this sense, clustering is sometimes called automatic classification.

Clustering is also called data segmentation in some applications because clustering partitions large data
sets into groups according to their similarity. Clustering can also be used for outlier detection, where
outliers (values that are “far away” from any cluster) may be more interesting than common cases.

APPLICATIONS OF CLUSTER ANALYSIS:

Cluster analysis has been widely used in many applications such as business intelligence, image
pattern recognition, Web search, biology, and security.

In business intelligence, clustering can be used to organize a large number of customers into groups,
where customers within a group share strongly similar characteristics. This facilitates the development
of business strategies for enhanced customer relationship management. Moreover, consider a
consulting company with a large number of projects.

To improve project management, clustering can be applied to partition projects into categories based
on similarity so that project auditing and diagnosis (to improve project delivery and outcomes) can be
conducted effectively.

In image recognition, clustering can be used to discover clusters or “subclasses” in handwritten
character recognition systems.

Clustering has also found many applications in Web search.

REQUIREMENTS FOR CLUSTER ANALYSIS:

The following are typical requirements of clustering in data mining:
 Scalability: Many clustering algorithms work well on small data sets containing fewer
than several hundred data objects; however, a large database may contain millions or even
billions of objects, particularly in Web search scenarios. Clustering on only a sample of a given
large data set may lead to biased results. Therefore, highly scalable clustering algorithms are
needed.

 Ability to deal with different types of attributes: Many algorithms are designed to
cluster numeric (interval-based) data. However, applications may require clustering other data
types, such as binary, nominal (categorical), and ordinal data, or mixtures of these data types.

 Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on
Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find
spherical clusters with similar size and density. However, a cluster could be of any shape. It is important
to develop algorithms that can detect clusters of arbitrary shape.

 Requirements for domain knowledge to determine input parameters: Many clustering algorithms
require users to provide domain knowledge in the form of input parameters such as the desired number
of clusters. Consequently, the clustering results may be sensitive to such parameters. Requiring the
specification of domain knowledge not only burdens users, but also makes the quality of clustering
difficult to control.

 Incremental clustering and insensitivity to input order: In many applications, incremental updates
(representing newer data) may arrive at any time. Some clustering algorithms cannot incorporate
incremental updates into existing clustering structures and, instead, have to recompute a new clustering
from scratch. Clustering algorithms may also be sensitive to the input data order. That is, given a set of
data objects, clustering algorithms may return dramatically different clusterings depending on the order
in which the objects are presented. Incremental clustering algorithms and algorithms that are insensitive
to the input order are needed.

 Capability of clustering high-dimensionality data: A data set can contain numerous dimensions
or attributes. Most clustering algorithms are good at handling low-dimensional data such
as data sets involving only two or three dimensions. Finding clusters of data objects in a
high-dimensional space is challenging, especially considering that such data can be very sparse
and highly skewed.

 Constraint-based clustering: Real-world applications may need to perform clustering under various
kinds of constraints. Suppose that your job is to choose the locations for a given number of new
automatic teller machines (ATMs) in a city. To decide upon this, you may cluster households while
considering constraints such as the city’s rivers and highway networks and the types and number of
customers per cluster. A challenging task is to find data groups with good clustering behavior that
satisfy specified constraints.

 Interpretability and usability: Users want clustering results to be interpretable, comprehensible, and
usable.

CATEGORIZATION OF VARIOUS CLUSTERING METHODS:

There are many clustering algorithms. In general, the major fundamental clustering methods can
be classified into the following categories:

Partitioning methods, hierarchical methods, density-based methods, and grid-based methods.

Partitioning methods: Given a set of n objects, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k ≤ n. That is, it divides the data into k groups such
that each group must contain at least one object.

Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of
data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up
approach, starts with each object forming a separate group. It successively merges the objects or
groups close to one another, until all the groups are merged into one (the topmost level of the
hierarchy), or a termination condition holds. The divisive approach, also called the top-down approach,
starts with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller
clusters, until eventually each object is in one cluster, or a termination condition holds.

Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never
be undone.

Density-based methods: Most partitioning and hierarchical methods find only spherical-shaped clusters; density-based methods, in contrast, can discover clusters of arbitrary shape.

Their general idea is to continue growing a given cluster as long as the density (number of objects or
data points) in the “neighborhood” exceeds some threshold. For example, for each data point within a
given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that
form a grid structure. All the clustering operations are performed on the grid structure (i.e., on
the quantized space). The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the number of
cells in each dimension in the quantized space.

1) Partitioning Methods:

Given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes
the objects into k partitions (k ≤ n), where each partition represents a cluster. The clusters are formed to
optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that
the objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters in
terms of the data set attributes.

A) k-Means: A Centroid-Based Technique:

It proceeds as follows:

First, it randomly selects k of the objects in D, each of which initially represents a cluster mean or
center.

For each of the remaining objects, an object is assigned to the cluster to which it is the most similar,
based on the Euclidean distance between the object and the cluster mean.

The k-means algorithm then iteratively improves the within-cluster variation.

For each cluster, it computes the new mean using the objects assigned to the cluster in the
previous iteration.

All the objects are then reassigned using the updated means as the new cluster centers.

The iterations continue until the assignment is stable, that is, the clusters formed in the current round
are the same as those formed in the previous round.
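The iteration just described can be sketched compactly in code. The following is a minimal Python sketch of the k-means loop (function and variable names are illustrative, not from these notes), assuming the objects are stored as rows of a numeric NumPy array:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array of numeric objects."""
    rng = np.random.default_rng(seed)
    # Randomly select k objects from D as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each object to the nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each mean from the objects assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Stop when the means (and hence the assignment) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

Calling kmeans(X, k=3), for instance, mirrors the three-cluster example that follows.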

Example: Clustering by k-means partitioning:


Consider a set of objects located in 2-D space, as depicted in Figure (a).
Let k =3, that is, the user would like the objects to be partitioned into three clusters.
According to the algorithm, we arbitrarily choose three objects as the three initial
cluster centers, where cluster centers are marked by a +.
Each object is assigned to a cluster based on the cluster center to which it is the nearest.
Next, the cluster centers are updated. That is, the mean value of each cluster is recalculated based on
the current objects in the cluster. Using the new cluster centers, the objects are redistributed to the
clusters based on which cluster center is the nearest.
This process iterates, leading to Figure (c). The process of iteratively reassigning objects to clusters
to improve the partitioning is referred to as iterative relocation.
Eventually, no reassignment of the objects in any cluster occurs and so the process terminates.
The resulting clusters are returned by the clustering process.

k-Medoids: A Representative Object-Based Technique:

Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual
objects to represent the clusters, using one representative object (medoid) per cluster. Each remaining
object is assigned to the cluster whose representative object it is most similar to.
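The quality of such a partitioning is typically measured with an absolute-error criterion; the standard form used by PAM-style k-medoids methods (stated here from the usual textbook formulation, not quoted from these notes) is

E = \sum_{i=1}^{k} \sum_{p \in C_i} \operatorname{dist}(p, o_i)

where o_i is the representative object (medoid) of cluster C_i.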
Hierarchical Methods:

A hierarchical clustering method works by grouping data objects into a hierarchy or “tree” of clusters.
There are several orthogonal ways to categorize hierarchical clustering methods. For instance, they may
be categorized into agglomerative and divisive methods.

An agglomerative hierarchical clustering method uses a bottom-up strategy. It typically starts by letting
each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all
the objects are in a single cluster or certain termination conditions are satisfied.

A divisive hierarchical clustering method employs a top-down strategy. It starts by placing all objects in
one cluster, which is the hierarchy’s root. It then divides the root cluster into several smaller subclusters,
and recursively partitions those clusters into smaller ones. The partitioning process continues until each
cluster at the lowest level contains only one object, or the objects within a cluster are sufficiently similar
to each other.

In either agglomerative or divisive hierarchical clustering, a user can specify the desired number
of clusters as a termination condition.

Example: the application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering
method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, on a data set of five
objects, {a, b, c, d, e}.

Initially, AGNES, the agglomerative method, places each object into a cluster of its own.

The clusters are then merged step-by-step according to some criterion.

For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form
the minimum Euclidean distance between any two objects from different clusters.

This is a single-linkage approach in that each cluster is represented by all the objects in the cluster, and
the similarity between two clusters is measured by the similarity of the closest pair of data
points belonging to different clusters.

The cluster-merging process repeats until all the objects are eventually merged to form one cluster.

DIANA, the divisive method, proceeds in the contrasting way. All the objects are used to form one
initial cluster. The cluster is split according to some principle such as the maximum Euclidean
distance between the closest neighboring objects in the cluster.

The cluster-splitting process repeats until, eventually, each new cluster contains only a single object.

A tree structure called a dendrogram is commonly used to represent the process of hierarchical
clustering. It shows how objects are grouped together (in an agglomerative method) or partitioned (in a
divisive method) step-by-step.
Above Figure shows a dendrogram for the five objects {a,b,c,d,e}, where l =0 shows the five objects as
singleton clusters at level 0. At l = 1, objects a and b are grouped together to form the first cluster, and
they stay together at all subsequent levels.

We can also use a vertical axis to show the similarity scale between clusters.

For example, when the similarity of two groups of objects, {a, b} and {c, d, e}, is roughly 0.16, they are
merged together to form a single cluster.

Distance measures used:

Whether using an agglomerative method or a divisive method, a core need is to measure the distance
between two clusters, where each cluster is generally a set of objects.

Four widely used measures for distance between clusters are as follows, where |p − p′| is the distance
between two objects or points p and p′, mi is the mean for cluster Ci, and ni is the number of objects in Ci.

They are also known as linkage measures.
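The formulas themselves are omitted above; the standard definitions of the four measures are:

\begin{aligned}
\text{Minimum distance:} \quad & d_{\min}(C_i, C_j) = \min_{p \in C_i,\; p' \in C_j} |p - p'| \\
\text{Maximum distance:} \quad & d_{\max}(C_i, C_j) = \max_{p \in C_i,\; p' \in C_j} |p - p'| \\
\text{Mean distance:} \quad & d_{\mathrm{mean}}(C_i, C_j) = |m_i - m_j| \\
\text{Average distance:} \quad & d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|
\end{aligned}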

When an algorithm uses the minimum distance, dmin(Ci, Cj), to measure the distance between clusters, it
is sometimes called a nearest-neighbor clustering algorithm or minimal spanning tree algorithm.
Moreover,if the clustering process is terminated when the distance between nearest clusters exceeds a
user-defined threshold, it is called a single-linkage algorithm.

When an algorithm uses the maximum distance, dmax(Ci, Cj), to measure the distance between clusters,
it is sometimes called a farthest-neighbor clustering algorithm. If the clustering process is
terminated when the maximum distance between nearest clusters exceeds a user-defined threshold, it
is called a complete-linkage algorithm.

The previous minimum and maximum measures represent two extremes in measuring the
distance between clusters. They tend to be overly sensitive to outliers or noisy data. The use of mean
or average distance is a compromise between the minimum and maximum distances and overcomes
the outlier sensitivity problem. Whereas the mean distance is the simplest to compute, the
average distance is advantageous in that it can handle categorical as well as numeric data.

BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees :

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is designed for clustering a large
amount of numeric data by integrating hierarchical clustering (at the initial micro clustering stage) and
other clustering methods such as iterative partitioning (at the later macro clustering stage). It
overcomes the two difficulties in agglomerative clustering methods: (1) scalability and (2) the inability
to undo what was done in the previous step.

BIRCH uses the notions of clustering feature to summarize a cluster, and clustering feature tree (CF-
tree) to represent a cluster hierarchy. These structures help the clustering method achieve good speed
and scalability in large databases, and also make it effective for incremental and dynamic
clustering of incoming objects.

Consider a cluster of n d-dimensional data objects or points. The clustering feature (CF) of the cluster is
a 3-D vector summarizing information about clusters of objects. It is defined as follows.
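The definition itself is not reproduced in these notes; the standard BIRCH clustering feature is

CF = \langle n, LS, SS \rangle, \qquad LS = \sum_{i=1}^{n} x_i, \qquad SS = \sum_{i=1}^{n} x_i^{2}

where LS is the linear sum of the n points and SS is their square sum.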

A clustering feature is essentially a summary of the statistics for the given cluster. Using a clustering
feature, we can easily derive many useful statistics of a cluster, for example, the cluster’s centroid, x0,
radius, R, and diameter, D, which are given below. Here, R is the average distance from member objects
to the centroid, and D is the average pairwise distance within a cluster. Both R and D reflect the
tightness of the cluster around the centroid.
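The formulas referenced above are omitted from these notes; the standard BIRCH definitions are

x_0 = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{LS}{n}, \qquad
R = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x_0)^2}{n}}, \qquad
D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}}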

Summarizing a cluster using the clustering feature can avoid storing the detailed information
about individual objects or points. Instead, we only need a constant size of space to store the
clustering feature.
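One reason this constant-size summary works well is that clustering features are additive: when two disjoint clusters are merged, the CF of the merged cluster is simply the component-wise sum of their CFs. A minimal Python sketch of this idea (class and method names are illustrative, not BIRCH's actual implementation):

import numpy as np

class ClusteringFeature:
    """CF = <n, LS, SS>: count, linear sum, and square sum of the points."""
    def __init__(self, n=0, ls=None, ss=0.0, dim=2):
        self.n = n
        self.ls = np.zeros(dim) if ls is None else np.asarray(ls, dtype=float)
        self.ss = ss

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(np.dot(x, x))

    def merge(self, other):
        # CF additivity: CF1 + CF2 = <n1 + n2, LS1 + LS2, SS1 + SS2>
        return ClusteringFeature(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R = sqrt(SS/n - ||LS/n||^2): the average distance of members to the centroid
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - np.dot(c, c), 0.0)))

The merge method is exactly what a CF-tree nonleaf node relies on when it stores the sums of the CFs of its children.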

A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. An
example is shown in Figure.

By definition, a nonleaf node in a tree has descendants or “children.”

The nonleaf nodes store sums of the CFs of their children, and thus summarize clustering information
about their children. A CF-tree has two parameters: branching factor, B, and threshold, T. The branching
factor specifies the maximum number of children per nonleaf node. The threshold parameter specifies
the maximum diameter of subclusters stored at the leaf nodes of the tree. These two
parameters implicitly control the resulting tree’s size.

BIRCH applies a multiphase clustering technique: a single scan of the data set yields a basic, good
clustering, and one or more additional scans can optionally be used to further improve the quality. The
primary phases are:

Phase 1: BIRCH scans the database to build an initial in-memory CF-tree, which can be viewed as
a multilevel compression of the data that tries to preserve the data’s inherent clustering structure.

Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF-tree, which
removes sparse clusters as outliers and groups dense clusters into larger ones.
For Phase 1, the CF-tree is built dynamically as objects are inserted. Thus, the method is incremental.

An object is inserted into the closest leaf entry (subcluster).

If the diameter of the subcluster stored in the leaf node after insertion is larger than the threshold
value, then the leaf node and possibly other nodes are split.

After the insertion of the new object, information about the object is passed toward the root of the
tree. The size of the CF-tree can be changed by modifying the threshold.

If the size of the memory that is needed for storing the CF-tree is larger than the size of the
main memory, then a larger threshold value can be specified and the CF-tree is rebuilt.

The rebuild process is performed by building a new tree from the leaf nodes of the old tree. Thus, the
process of rebuilding the tree is done without the necessity of rereading all the objects or points. This is
similar to the insertion and node split in the construction of B+-trees. Therefore, for building the tree,
data has to be read just once.

Once the CF-tree is built, any clustering algorithm, such as a typical partitioning algorithm, can be used
with the CF-tree in Phase 2.

Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling:


Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph, where each vertex
of the graph represents a data object, and there exists an edge between two vertices (objects) if one
object is among the k-most similar objects to the other.

The edges are weighted to reflect the similarity between objects.

Chameleon uses a graph partitioning algorithm to partition the k-nearest-neighbor graph into a large
number of relatively small subclusters.

Chameleon then uses an agglomerative hierarchical clustering algorithm that iteratively merges
subclusters based on their similarity.

To determine the pairs of most similar subclusters, it takes into account both the interconnectivity and
the closeness of the clusters.

The relative interconnectivity, RI(Ci, Cj), between two clusters Ci and Cj is defined as the absolute
interconnectivity between Ci and Cj, normalized with respect to the internal interconnectivity of the two
clusters, Ci and Cj. That is,
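The formula does not appear in these notes; in the standard Chameleon formulation (stated here as an assumption rather than quoted from this document) it is

RI(C_i, C_j) = \frac{\left|EC_{\{C_i, C_j\}}\right|}{\tfrac{1}{2}\left(\left|EC_{C_i}\right| + \left|EC_{C_j}\right|\right)}

where EC{Ci,Cj} is the edge cut connecting Ci and Cj, and EC_Ci is the minimum-weight edge cut that partitions Ci into two roughly equal parts (its internal interconnectivity).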
Density-Based Methods:

Partitioning and hierarchical methods are designed to find spherical-shaped clusters. They have
difficulty finding clusters of arbitrary shape, such as “S”-shaped and oval clusters.

The main strategy behind density-based clustering methods is to discover clusters of nonspherical
shape. The basic density-based clustering methods are DBSCAN, OPTICS, and DENCLUE.
DBSCAN: Density-Based Clustering Based on Connected Regions with High Density:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core objects, that is, objects
that have dense neighborhoods. It connects core objects and their neighborhoods to form dense regions as
clusters.

An object is a core object if the ε-neighborhood of the object contains at least MinPts objects, where:

the ε-neighborhood of an object o is the space within a radius ε centered at o, and

MinPts specifies the density threshold of dense regions.

Given a set, D, of objects, we can identify all core objects with respect to the given parameters, ε and
MinPts. The clustering task is therein reduced to using core objects and their neighborhoods to form
dense regions, where the dense regions are clusters.

For a core object q and an object p, we say that p is directly density-reachable from q (with respect to
ε and MinPts) if p is within the ε-neighborhood of q. Clearly, an object p is directly density-reachable
from another object q if and only if q is a core object and p is in the ε-neighborhood of q. Using the
directly density-reachable relation, a core object can “bring” all objects from its ε-neighborhood into a
dense region.

In DBSCAN, p is density-reachable from q (with respect to ε and MinPts in D) if there is a chain of objects
p1, …, pn, such that p1 = q, pn = p, and pi+1 is directly density-reachable from pi with respect to ε and
MinPts, for 1 ≤ i < n, where each pi belongs to D.

To connect core objects as well as their neighbors in a dense region, DBSCAN uses the notion of density-
connectedness. Two objects p1, p2 belonging to D are density-connected with respect to ε and MinPts if
there is an object q belonging to D such that both p1 and p2 are density-reachable from q with respect
to ε and MinPts.

Example: density-reachability and density-connectivity. Consider the figure “Density-reachability and
density-connectivity in density-based clustering”:

For a given ε, represented by the radius of the circles, let MinPts = 3.

Of the labeled points, m, p, o, and r are core objects because each is in an ε-neighborhood containing at
least three points.

Object q is directly density-reachable from m.

Object m is directly density-reachable from p and vice versa.

Object q is (indirectly) density-reachable from p because q is directly density-reachable from m and m is
directly density-reachable from p. However, p is not density-reachable from q because q is not a core
object. Similarly, r and s are density-reachable from o, and o is density-reachable from r. Thus, o, r, and s
are all density-connected.

DBSCAN finds clusters by using the following procedure :

Initially, all objects in a given data set D are marked as “unvisited.”

DBSCAN randomly selects an unvisited object p, marks p as “visited,” and checks whether the
ε-neighborhood of p contains at least MinPts objects.

If not, p is marked as a noise point.

Otherwise, a new cluster C is created for p, and all the objects in the ε-neighborhood of p are added to a
candidate set, N.

DBSCAN iteratively adds to C those objects in N that do not belong to any cluster. In this process, for an
object p′ in N that carries the label “unvisited,” DBSCAN marks it as “visited” and checks its
ε-neighborhood. If the ε-neighborhood of p′ has at least MinPts objects, those objects in the
ε-neighborhood of p′ are added to N.

DBSCAN continues adding objects to C until C can no longer be expanded, that is, until N is empty. At
this time, cluster C is complete, and thus is output.

To find the next cluster, DBSCAN randomly selects an unvisited object from the remaining ones. The
clustering process continues until all objects are visited.

The DBSCAN procedure described above can be summarized in code as follows.
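The original pseudocode figure is not reproduced here; the following is a minimal Python sketch of the same procedure (the brute-force neighborhood search and the function names are illustrative assumptions, not an optimized implementation):

import numpy as np

NOISE, UNVISITED = -1, None

def region_query(X, i, eps):
    """Return indices of all points within the eps-neighborhood of point i."""
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = [UNVISITED] * n          # cluster id per point; -1 marks noise
    cluster_id = 0
    for p in range(n):
        if labels[p] is not UNVISITED:
            continue                  # p has already been visited
        neighbors = region_query(X, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE         # not a core object: tentatively noise
            continue
        labels[p] = cluster_id        # start a new cluster C for p
        seeds = list(neighbors)       # candidate set N
        k = 0
        while k < len(seeds):
            q = seeds[k]
            if labels[q] is UNVISITED or labels[q] == NOISE:
                if labels[q] is UNVISITED:
                    q_neighbors = region_query(X, q, eps)
                    if len(q_neighbors) >= min_pts:
                        seeds.extend(q_neighbors)   # q is a core object: grow N
                labels[q] = cluster_id              # add q to cluster C
            k += 1
        cluster_id += 1               # C can no longer be expanded
    return labels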

If a spatial index is used, the computational complexity of DBSCAN is O(n log n); otherwise, it is O(n²).

OPTICS: Ordering Points to Identify the Clustering Structure:

OPTICS does not explicitly produce a clustering of the data set; instead, it outputs a cluster ordering of
the objects, from which clusters with respect to different density parameters can be extracted.
Grid-based Methods:

The grid-based clustering approach uses a multi resolution grid data structure. It quantizes the object
space into a finite number of cells that form a grid structure on which all of the operations for
clustering are performed.

Three typical examples of grid-based methods are:

1. STING, which explores statistical information stored in the grid cells;
2. WaveCluster, which clusters objects using a wavelet transform method; and
3. CLIQUE, which represents a grid- and density-based approach for subspace clustering in a
high-dimensional data space.
STING - A Statistical Information Grid Approach
STING was proposed by Wang, Yang, and Muntz (VLDB’97).

 The spatial area is divided into rectangular cells.

 There are several levels of cells corresponding to different levels of resolution.

 Each cell at a high level is partitioned into a number of smaller cells in the next lower level.

 Statistical information about each cell is calculated and stored beforehand and is used to answer
queries.

 Parameters of higher-level cells can easily be calculated from the parameters of lower-level cells:

 count, mean, standard deviation, min, max

 type of distribution—normal, uniform, etc.

 A top-down approach is used to answer spatial data queries.

STING algorithm procedure:

• Start from a pre-selected layer—typically one with a small number of cells.

• From the pre-selected layer down to the bottom layer, do the following:

 For each cell in the current level, compute the confidence interval indicating the cell’s relevance
to the given query.

 If the cell is relevant, include it in a cluster (at the bottom layer) or look for relevant cells at the
next lower layer.

 If the cell is irrelevant, remove it from further consideration.

• Combine the relevant cells into relevant regions (based on grid neighborhood) and return the
resulting clusters as the answer.
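As a rough illustration of the precomputed-statistics idea (a simplified sketch only, not the published STING algorithm; the two-level 2-D grid, the chosen statistics, and the relevance test are all assumptions), the snippet below builds per-cell statistics at a fine level, aggregates counts to a coarser parent level, and answers a simple density query top-down:

import numpy as np

def build_grid_stats(X, cells_per_dim):
    """Per-cell count and mean for a 2-D grid over the data (X has two columns)."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    idx = np.floor((X - mins) / (maxs - mins + 1e-12) * cells_per_dim).astype(int)
    idx = np.clip(idx, 0, cells_per_dim - 1)
    count = np.zeros((cells_per_dim, cells_per_dim))
    total = np.zeros((cells_per_dim, cells_per_dim, X.shape[1]))
    for (i, j), x in zip(idx, X):
        count[i, j] += 1
        total[i, j] += x
    mean = total / np.maximum(count[..., None], 1)
    return count, mean

def relevant_bottom_cells(X, fine=8, min_count=10):
    """Top-down query: only descend into coarse cells that are dense enough."""
    fine_count, _ = build_grid_stats(X, fine)
    # Parent (coarse) counts are just the sums of their four children.
    coarse_count = fine_count.reshape(fine // 2, 2, fine // 2, 2).sum(axis=(1, 3))
    relevant = []
    for ci in range(fine // 2):
        for cj in range(fine // 2):
            if coarse_count[ci, cj] < min_count:
                continue  # irrelevant coarse cell: skip its children entirely
            for di in range(2):
                for dj in range(2):
                    i, j = 2 * ci + di, 2 * cj + dj
                    if fine_count[i, j] >= min_count / 4:
                        relevant.append((i, j))
    return relevant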

Advantages of STING:

It is query-independent, easy to parallelize, and supports incremental update.

Its complexity is O(K), where K is the number of grid cells at the lowest level.

Disadvantages of STING:

All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected.
WaveCluster
It was proposed by Sheikholeslami, Chatterjee, and Zhang (VLDB’98).

It is a multi-resolution clustering approach that applies a wavelet transform to the feature space. A
wavelet transform is a signal-processing technique that decomposes a signal into different frequency
sub-bands.

It can be considered both a grid-based and a density-based method.

Input parameters:

The number of grid cells for each dimension.

The wavelet, and the number of applications of the wavelet transform.
How to apply the wavelet transform to find clusters:

It summarizes the data by imposing a multidimensional grid structure onto the data space. These
multidimensional spatial data objects are represented in an n-dimensional feature space. A wavelet
transform is then applied to the feature space to find the dense regions in the feature space. Applying
the wavelet transform multiple times results in clusters at different scales, from fine to coarse.
Why the wavelet transform is useful for clustering:

It uses hat-shaped filters to emphasize regions where points cluster while simultaneously suppressing
weaker information on their boundaries.
It is effective at removing outliers.
It is a multi-resolution method.
It is cost-efficient.

Major features:

The time complexity of this method is O(N).
It detects arbitrarily shaped clusters at different scales.
It is not sensitive to noise and not sensitive to input order.
It is only applicable to low-dimensional data.
CLIQUE - Clustering In QUEst
It was proposed by Agrawal, Gehrke, Gunopulos, and Raghavan (SIGMOD’98).
It is based on automatically identifying the subspaces of a high-dimensional data space that allow better
clustering than the original space.
CLIQUE can be considered both density-based and grid-based:
It partitions each dimension into the same number of equal-length intervals.
It partitions an m-dimensional data space into non-overlapping rectangular units.
A unit is dense if the fraction of the total data points contained in the unit exceeds an input model
parameter.
A cluster is a maximal set of connected dense units within a subspace.
The major steps of CLIQUE are as follows (a sketch of the dense-unit step follows this list):

1. Partition the data space and find the number of points that lie inside each cell of the partition.

2. Identify the subspaces that contain clusters using the Apriori principle.

3. Identify clusters: determine the dense units in all subspaces of interest, then determine the
connected dense units in all subspaces of interest.

4. Generate a minimal description for the clusters: determine the maximal regions that cover a cluster
of connected dense units for each cluster, then determine the minimal cover for each cluster.
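The following minimal Python sketch illustrates the dense-unit and Apriori-pruning ideas for 1-D and 2-D subspaces only (the interval count, the density threshold tau, and the helper names are illustrative assumptions, not CLIQUE's actual implementation):

import numpy as np
from itertools import combinations

def dense_units_1d(X, n_intervals=10, tau=0.05):
    """For each dimension, find the intervals (units) whose fraction of points exceeds tau."""
    n, d = X.shape
    dense = {}
    for dim in range(d):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        cell = np.clip(((X[:, dim] - lo) / (hi - lo + 1e-12) * n_intervals).astype(int),
                       0, n_intervals - 1)
        counts = np.bincount(cell, minlength=n_intervals)
        dense[dim] = {i for i in range(n_intervals) if counts[i] / n > tau}
    return dense

def dense_units_2d(X, dense_1d, n_intervals=10, tau=0.05):
    """Apriori-style step: a 2-D unit can only be dense if both of its 1-D projections are dense."""
    n, d = X.shape
    dense = {}
    for d1, d2 in combinations(range(d), 2):
        cells = []
        for dim in (d1, d2):
            lo, hi = X[:, dim].min(), X[:, dim].max()
            cells.append(np.clip(((X[:, dim] - lo) / (hi - lo + 1e-12) * n_intervals).astype(int),
                                 0, n_intervals - 1))
        candidates = {(i, j) for i in dense_1d[d1] for j in dense_1d[d2]}
        counts = {}
        for i, j in zip(cells[0], cells[1]):
            if (i, j) in candidates:
                counts[(i, j)] = counts.get((i, j), 0) + 1
        dense[(d1, d2)] = {u for u, c in counts.items() if c / n > tau}
    return dense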
Advantages:
It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in
those subspaces.

It is insensitive to the order of records in the input and does not presume any canonical data
distribution.

It scales linearly with the size of the input and has good scalability as the number of dimensions in the
data increases.

Disadvantages:
The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.

Summary
Grid-based clustering is one of the methods of cluster analysis; it uses a multi-resolution grid data
structure.

Clustering: Model-Based Techniques and Handling High-Dimensional Data
Model-Based Clustering Methods
 These methods attempt to optimize the fit between the data and some mathematical model.
 Assumption: the data are generated by a mixture of underlying probability distributions.
 Techniques:
 Expectation-Maximization
 Conceptual Clustering
 Neural Network Approach

Expectation-Maximization
 Each cluster is represented mathematically by a parametric probability distribution, called a
component distribution.
 The data are a mixture of these distributions (a mixture density model).
 Problem: to estimate the parameters of the probability distributions.
 EM is an iterative refinement algorithm used to find the parameter estimates.
 It can be viewed as an extension of k-means: it assigns an object to a cluster according to a weight
representing its probability of membership.
 It starts with an initial estimate of the parameters and then iteratively reassigns scores.
 Make an initial guess for the parameters: randomly select k objects to represent the cluster means or
centers. Then iteratively refine the parameters/clusters:
 Expectation step: assign each object xi to cluster Ck with the probability given below.
 Maximization step: re-estimate the model parameters.
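The probability referred to above is not reproduced in these notes. Assuming the usual Gaussian mixture formulation (a standard textbook form, not specific to this document), the two steps can be written as

\text{E-step:}\quad P(x_i \in C_k) = p(C_k \mid x_i) = \frac{p(C_k)\, p(x_i \mid C_k)}{p(x_i)}

\text{M-step:}\quad m_k = \frac{\sum_{i=1}^{n} P(x_i \in C_k)\, x_i}{\sum_{i=1}^{n} P(x_i \in C_k)}

where p(x_i | C_k) is, for example, a Gaussian density around the cluster mean m_k.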

 EM is simple and easy to implement; its complexity depends on the number of features, objects, and
iterations.
Conceptual Clustering
 Conceptual clustering is a form of clustering in machine learning.
 It produces a classification scheme for a set of unlabeled objects.
 It finds a characteristic description for each concept (class).

 COBWEB
 A popular and simple method of incremental conceptual learning.
 It creates a hierarchical clustering in the form of a classification tree.
 Each node refers to a concept and contains a probabilistic description of that concept.

COBWEB clustering method: a classification tree (figure).
COBWEB
 Classification tree:
 Each node represents a concept and its probabilistic distribution (a summary of the objects under
that node).
 The description consists of conditional probabilities P(Ai = vij | Ck).
 Sibling nodes at a given level form a partition.
 Category utility: the increase in the expected number of attribute values that can be correctly
guessed given a partition. A standard form of the formula is sketched below.
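The category utility formula itself is not given in these notes; a commonly cited form (an assumption based on the standard COBWEB literature, for a partition into classes C1, ..., Cn) is

CU = \frac{1}{n} \sum_{k=1}^{n} P(C_k) \left[ \sum_{i} \sum_{j} P(A_i = v_{ij} \mid C_k)^2 \;-\; \sum_{i} \sum_{j} P(A_i = v_{ij})^2 \right]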
 Category utility rewards:
 Intra-class similarity, P(Ai = vij | Ck): a high value indicates that many class members share this
attribute–value pair.
 Inter-class dissimilarity, P(Ck | Ai = vij): a high value indicates that fewer objects in different classes
share this attribute–value pair.
 Placement of new objects:
 Descend the tree to identify the best host.
 Temporarily place the object in each node and compute the category utility of the resulting
partition; the placement with the highest CU is chosen.
 COBWEB may also form new nodes if the object does not fit into the existing tree.
 COBWEB is sensitive to the order of records.
 Additional operations: merging and splitting.
 The two best hosts are considered for merging; the best host is considered for splitting.

Limitations:
 The assumption that the attributes are independent of each other is often too strong, because
correlations may exist.
 It is not suitable for clustering large database data.
 CLASSIT is an extension of COBWEB for incremental clustering of continuous data.
Neural Network Approach
 Each cluster is represented as an exemplar, acting as a “prototype” of the cluster.
 New objects are distributed to the cluster whose exemplar is the most similar, according to some
distance measure.

Self-Organizing Map (SOM):
 Competitive learning: involves a hierarchical architecture of several units (neurons).
 Neurons compete in a “winner-takes-all” fashion for the object currently being presented; a minimal
sketch of this update follows the list.
 The organization of the units forms a feature map.
 A typical application is Web document clustering.

Kohonen SOM (figure).
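The following is a minimal Python sketch of one winner-takes-all training step for a one-dimensional Kohonen map (the map size, learning rate, and Gaussian neighborhood are illustrative assumptions):

import numpy as np

def som_train_step(weights, x, lr=0.1, sigma=1.0):
    """One competitive-learning update for a 1-D SOM.

    weights: (m, d) array, one weight vector (exemplar) per map unit.
    x:       (d,) input object currently being presented.
    """
    # Competition: the unit whose weight vector is closest to x wins.
    bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
    # Cooperation: neighbors of the winner (on the 1-D map) are also updated,
    # with influence decaying with map distance (Gaussian neighborhood).
    map_dist = np.abs(np.arange(len(weights)) - bmu)
    influence = np.exp(-(map_dist ** 2) / (2 * sigma ** 2))
    # Adaptation: move the winner and its neighbors toward the input.
    weights += lr * influence[:, None] * (x - weights)
    return weights

# Usage: repeatedly present objects, typically shrinking lr and sigma over time.
rng = np.random.default_rng(0)
W = rng.random((10, 2))           # 10 map units for 2-D data
for x in rng.random((200, 2)):    # 200 training objects
    W = som_train_step(W, x)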
Clustering High-Dimensional Data
 As dimensionality increases:
 the number of irrelevant dimensions may produce noise and mask the real clusters;
 the data become sparse;
 distance measures become increasingly meaningless.
 Feature transformation methods:
 PCA and SVD summarize the data by creating linear combinations of the attributes (a small sketch
follows this list);
 but they do not remove any attributes, and the transformed attributes can be complex to interpret.
 Feature selection methods:
 find the most relevant set of attributes with respect to the class labels, e.g., by entropy analysis.
 Subspace clustering searches for groups of clusters within different subspaces of the same data set.
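As a small illustration of the feature-transformation idea, here is a generic PCA-via-SVD sketch in Python (it assumes numeric data and is not tied to any specific method in these notes):

import numpy as np

def pca_transform(X, n_components=2):
    """Project the data onto the top principal components using SVD."""
    Xc = X - X.mean(axis=0)              # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]       # linear combinations of the original attributes
    return Xc @ components.T             # lower-dimensional representation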
CLIQUE: CLustering In QUest
 CLIQUE is a dimension-growth subspace clustering method: it starts at 1-D and grows upward to
higher dimensions.
 It partitions each dimension into a grid and determines whether a cell is dense.
 CLIQUE determines sparse and crowded units:
 a dense unit is one whose fraction of data points exceeds a threshold;
 a cluster is a maximal set of connected dense units.

 CLIQUE first partitions the d-dimensional space into non-overlapping units; this is performed in 1-D.
 It is based on the Apriori property: if a k-dimensional unit is dense, so are its projections in
(k−1)-dimensional space. This reduces the size of the search space.
 It then determines the maximal dense regions and generates a minimal description of them.

 CLIQUE finds the subspaces of the highest dimensionality and is insensitive to the order of the inputs.
 Its performance depends on the grid size and the density threshold, which are difficult to determine
across all dimensions.
 Several lower-dimensional subspaces may have to be processed; an adaptive strategy can be used.
PROCLUS – PROjected CLUStering
 PROCLUS is a dimension-reduction subspace clustering technique.
 It finds an initial approximation of the clusters in the high-dimensional space.
 It avoids generating a large number of overlapped clusters of lower dimensionality.
 It finds the best set of medoids by a hill-climbing process (similar to CLARANS).
 It uses the Manhattan segmental distance measure.

 Initialization phase: a greedy algorithm selects a set of initial medoids that are far apart.
 Iteration phase: selects a random set of k medoids and replaces bad medoids; for each medoid, a set
of dimensions is chosen whose average distances are small.
 Refinement phase: computes new dimensions for each medoid based on the clusters found, reassigns
points to medoids, and removes outliers.
Frequent Pattern-Based Clustering
 Frequent patterns may also form clusters.
 Instead of growing clusters dimension by dimension, sets of frequent itemsets are determined.
 Two common techniques: frequent term-based text clustering, and clustering by pattern similarity.

Frequent term-based text clustering
 Text documents are clustered based on the frequent terms they contain.
 Documents are represented by their terms, so the dimensionality is very high.
 Frequent term-based analysis: a well-selected subset of the set of all frequent terms must be
discovered. Let Fi be a set of frequent term sets and cov(Fi) the set of documents covered by Fi. The
requirement is that the union of cov(F1), ..., cov(Fk) equals D, while the overlap between Fi and Fj is
minimized.
 The description of the clusters is given by their frequent term sets.
Constraint-based cluster Analysis

Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints.
Depending on the nature of the constraints, constraint-based clustering may adopt rather different
approaches. There are a few categories of constraints.

Constraints on individual objects:

We can specify constraints on the objects to be clustered. In a real estate application, for example, one
may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint
confines the set of objects to be clustered. It can easily be handled by preprocessing, after which the
problem reduces to an instance of unconstrained clustering.

Constraints on the selection of clustering parameters:

A user may like to set a desired range for each clustering parameter. Clustering parameters are usually
quite specific to the given clustering algorithm. Examples of parameters include k, the desired number
of clusters in a k-means algorithm, or ε (the radius) and the minimum number of points in the DBSCAN
algorithm. Although such user-specified parameters may strongly influence the clustering results, they
are usually confined to the algorithm itself. Thus, their fine-tuning and processing are usually not
considered a form of constraint-based clustering.
Constraints on distance or similarity functions:
We can specify different distance or similarity functions for specific
attributes of the objects to be clustered, or different distance measures
for specific pairs of objects. When clustering sportsmen, for example,
we may use different weighting schemes for height, body weight, age,
and skill level. Although this will likely change the mining results, it may
not alter the clustering process per se. However, in some cases, such
changes may make the evaluation of the distance function nontrivial,
especially when it is tightly intertwined with the clustering process.

User-specified constraints on the properties of individual clusters:

A user may like to specify desired characteristics of the resulting clusters, which may strongly influence
the clustering process.
