Prasanna Hebbar, Govt. First Grade College, Honnavar


Clustering

Clustering is the process of grouping the data into classes or clusters, so that objects
within a cluster have high similarity in comparison to one another but are very dissimilar to
objects in other clusters.

Cluster Analysis
The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering. A cluster is a collection of data objects that are similar to one
another within the same cluster and are dissimilar to the objects in other clusters. A cluster
of data objects can be treated collectively as one group and so may be considered as a form
of data compression.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity. Clustering and unsupervised
learning do not rely on predefined classes and class-labeled training examples. For this
reason, clustering is a form of learning by observation, rather than learning by examples.
• By automated clustering, we can identify dense and sparse regions in object space and,
therefore, discover overall distribution patterns and interesting correlations among data
attributes.
• Cluster analysis has been widely used in numerous applications, including market
research, pattern recognition, data analysis, and image processing.
• In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
• In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
• Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, and geographic location.
• The identification of groups of automobile insurance policy holders with a high average
claim cost.
• It can also be used to help classify documents on the Web for information discovery.
• Clustering can also be used for outlier detection, where outliers (values that are “far away”
from any cluster) may be more interesting than common cases.
• As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight
into the distribution of data, to observe the characteristics of each cluster, and to focus on
a particular set of clusters for further analysis.

➢ The following are typical requirements of clustering in data mining:


• Scalability: Many clustering algorithms work well on small data sets, but large databases may contain millions of objects, and clustering on only a sample of such a data set may lead to biased results. Highly scalable clustering algorithms are therefore needed.
• Ability to deal with different types of attributes: Many algorithms are designed to cluster
interval-based (numerical) data. However, applications may require clustering other types
of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data
types.
• Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters
based on Euclidean or Manhattan distance measures. Algorithms based on such distance
measures tend to find spherical clusters with similar size and density. However, a cluster
could be of any shape. It is important to develop algorithms that can detect clusters of
arbitrary shape.
• Minimal requirements for domain knowledge to determine input parameters: Many
clustering algorithms require users to input certain parameters in cluster analysis (such as
the number of desired clusters). The clustering results can be quite sensitive to input
parameters. Parameters are often difficult to determine, especially for data sets containing
high-dimensional objects.
• Ability to deal with noisy data: Most real-world databases contain outliers or missing,
unknown, or erroneous data. Some clustering algorithms are sensitive to such data and
may lead to clusters of poor quality.
• Incremental clustering and insensitivity to the order of input records: Some clustering
algorithms cannot incorporate newly inserted data (i.e., database updates) into existing
clustering structures and, instead, must determine a new clustering from scratch. Some
clustering algorithms are sensitive to the order of input data. It is important to develop
incremental clustering algorithms and algorithms that are insensitive to the order of input.
• High dimensionality: A database or a data warehouse can contain several dimensions or
attributes. Finding clusters of data objects in high dimensional space is challenging,
especially considering that such data can be sparse and highly skewed.
• Constraint-based clustering: Real-world applications may need to perform clustering
under various kinds of constraints. Suppose that your job is to choose the locations for a
given number of new automatic banking machines (ATMs) in a city. To decide upon this,
you may cluster households while considering constraints such as the city’s rivers and
highway networks, and the type and number of customers per cluster. A challenging task
is to find groups of data with good clustering behaviour that satisfy specified constraints.
• Interpretability and usability: Users expect clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied to specific semantic
interpretations and applications. It is important to study how an application goal may
influence the selection of clustering features and methods.

Clustering Methods
1. Partitioning methods
Given D, a data set of n objects, and k, the number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a
cluster.
The clusters are formed to optimize an objective partitioning criterion, such as a
dissimilarity function based on distance, so that the objects within a cluster are “similar,”
whereas the objects of different clusters are “dissimilar” in terms of the data set attributes.

Centroid-Based Technique: The k-Means Method


The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster’s centroid or center of gravity.

The k-means algorithm proceeds as follows:


• First, it randomly selects k of the objects, each of which initially represents a cluster mean
or center.
• For each of the remaining objects, an object is assigned to the cluster to which it is the
most similar, based on the distance between the object and the cluster mean.
• It then computes the new mean for each cluster.
• This process iterates until the criterion function converges.
➢ Typically, the square-error criterion is used, defined as E = Σ_{i=1}^{k} Σ_{p ∈ C_i} |p − m_i|²,
where E is the sum of the square error for all objects in the data set,
p is the point in space representing a given object, and
m_i is the mean of cluster C_i.
In other words, for each object in each cluster, the distance from the object to its cluster
center is squared, and the distances are summed.
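For instance, the criterion can be computed directly from this definition. Below is a small illustrative sketch in Python; the function name and the sample points are ours, not part of the method.

```python
import numpy as np

def square_error(points, labels, means):
    """E: sum of squared distances from each object to its cluster mean."""
    return sum(np.sum((points[labels == i] - m) ** 2)
               for i, m in enumerate(means))

# Tiny illustration: four 2-D points split into two clusters.
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
means = np.array([points[labels == i].mean(axis=0) for i in range(2)])
print(square_error(points, labels, means))  # E for this partition
```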

Algorithm: k-means.
//The k-means algorithm for partitioning, where each cluster’s center is represented by the
//mean value of the objects in the cluster.
Input: k: The number of clusters, D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e.,
calculate the mean value of the objects for each cluster;
(5) until no change;
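A minimal NumPy sketch of these steps is shown below; the function name, the convergence test, and the random choice of initial centers are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # (3) (re)assign each object to the cluster with the nearest center
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (4) update the cluster means
        new_centers = np.array([D[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        # (5) stop when the means no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

For the example that follows, calling k_means(D, 3) on the points of Figure (a) carries out the iterative relocation described there.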

Example
Suppose that there is a set of objects located in space as depicted in the rectangle
shown in Figure (a). Let k = 3; that is, the user would like the objects to be partitioned into
three clusters.
According to the algorithm we arbitrarily choose three objects as the three initial
cluster centers, where cluster centers are marked by a “+”. Each object is distributed to a
cluster based on the cluster center to which it is the nearest. Such a distribution forms
silhouettes encircled by dotted curves, as shown in Figure (a).
Next, the cluster centers are updated. That is, the mean value of each cluster is
recalculated based on the current objects in the cluster. Using the new cluster centers, the
objects are redistributed to the clusters based on which cluster center is the nearest. Such a
redistribution forms new silhouettes encircled by dashed curves, as shown in Figure (b).
This process iterates, leading to Figure (c). The process of iteratively reassigning objects
to clusters to improve the partitioning is referred to as iterative relocation. Eventually, no
redistribution of the objects in any cluster occurs, and so the process terminates. The resulting
clusters are returned by the clustering process.

PRASANNA HEBBAR @GOVT FIRST GRADE COLLEGE HONNAVAR 3


Clustering

➢ The k-means method, however, can be applied only when the mean of a cluster is defined.

Representative Object-Based Technique: The k-Medoids Method


Instead of taking the mean value of the objects in a cluster as a reference point, we can
pick actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the
sum of the dissimilarities between each object and its corresponding reference point. That is,
an absolute-error criterion is used, defined as E = Σ_{j=1}^{k} Σ_{p ∈ C_j} |p − o_j|, where
• E is the sum of the absolute error for all objects in the data set,
• p is the point in space representing a given object in cluster C_j, and
• o_j is the representative object of C_j.
In general, the algorithm iterates until, eventually, each representative object is actually
the medoid, or most centrally located object, of its cluster. This is the basis of the k-medoids
method for grouping n objects into k clusters.
The initial representative objects (or seeds) are chosen arbitrarily. The iterative process
of replacing representative objects by nonrepresentative objects continues as long as the
quality of the resulting clustering is improved. This quality is estimated using a cost function
that measures the average dissimilarity between an object and the representative object of
its cluster.
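This cost function follows directly from the absolute-error definition above. A small sketch (the helper name is ours) that computes the total cost of a candidate set of medoids:

```python
import numpy as np

def absolute_error(D, medoid_idx):
    """E: sum, over all objects, of the distance to the nearest medoid."""
    medoids = D[medoid_idx]
    dists = np.linalg.norm(D[:, None, :] - medoids[None, :, :], axis=2)
    return dists.min(axis=1).sum()
```

The swap test in the algorithm below compares this cost before and after replacing one representative object.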
To determine whether a nonrepresentative object, o_random, is a good replacement for
a current representative object, o_j, the following four cases are examined for each of the
nonrepresentative objects, p, as illustrated in Figure.
• Case 1: p currently belongs to representative object o_j. If o_j is replaced by o_random as a
representative object and p is closest to one of the other representative objects, o_i, i ≠ j,
then p is reassigned to o_i.

• Case 2: p currently belongs to representative object o_j. If o_j is replaced by o_random as a
representative object and p is closest to o_random, then p is reassigned to o_random.
• Case 3: p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random
as a representative object and p is still closest to o_i, then the assignment does not change.
• Case 4: p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random
as a representative object and p is closest to o_random, then p is reassigned to o_random.

Algorithm: k-medoids.
// PAM, a k-medoids algorithm for partitioning based on medoid or central objects.
Input: k: the number of clusters, D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster
with the nearest representative object;
(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost, S, of swapping representative object, o_j, with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of
k representative objects;
(7) until no change;
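A compact Python sketch of this loop, reusing the absolute_error helper from the earlier sketch; the one-random-swap-per-iteration strategy mirrors steps (4)–(6) but is a simplification of the full PAM search, not a reference implementation.

```python
import numpy as np

def k_medoids(D, k, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects as the initial representative objects (seeds)
    medoid_idx = rng.choice(len(D), size=k, replace=False)
    cost = absolute_error(D, medoid_idx)
    for _ in range(max_iter):
        # (4) randomly pick a nonrepresentative object o_random and a medoid o_j to swap
        o_random = rng.choice(np.setdiff1d(np.arange(len(D)), medoid_idx))
        j = rng.integers(k)
        candidate = medoid_idx.copy()
        candidate[j] = o_random
        # (5)-(6) keep the swap only if it lowers the total cost
        new_cost = absolute_error(D, candidate)
        if new_cost < cost:
            medoid_idx, cost = candidate, new_cost
    # (3) final assignment of each remaining object to its nearest medoid
    dists = np.linalg.norm(D[:, None, :] - D[medoid_idx][None, :, :], axis=2)
    return dists.argmin(axis=1), medoid_idx
```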

Hierarchical methods
These methods work by grouping data objects into a tree of clusters. There are two types.
Agglomerative hierarchical clustering
This bottom-up strategy starts by placing each object in its own cluster and then merges
these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster
or until certain termination conditions are satisfied. Most hierarchical clustering methods
belong to this category. They differ only in their definition of intercluster similarity.
Divisive hierarchical clustering
This top-down strategy does the reverse of agglomerative hierarchical clustering by
starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces,
until each object forms a cluster on its own or until certain termination conditions are satisfied,
such as a desired number of clusters being obtained or the diameter of each cluster falling
within a certain threshold.
Example
Figure shows the application of AGNES (AGglomerative NESting), an agglomerative
hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical
clustering method, to a data set of five objects, {a, b, c, d, e}.
Initially, AGNES places each object into a cluster of its own. The clusters are then
merged step-by-step according to some criterion. For example, clusters C1 and C2 may be
merged if an object in C1 and an object in C2 form the minimum Euclidean distance between
any two objects from different clusters. This is a single-linkage approach in that each cluster
is represented by all of the objects in the cluster, and the similarity between two clusters is
measured by the similarity of the closest pair of data points belonging to different clusters.
The cluster merging process repeats until all of the objects are eventually merged to form one
cluster.
In DIANA, all of the objects are used to form one initial cluster. The cluster is split
according to some principle, such as the maximum Euclidean distance between the closest
neighbouring objects in the cluster. The cluster splitting process repeats until, eventually, each
new cluster contains only a single object.

Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}.

In either agglomerative or divisive hierarchical clustering, the user can specify the
desired number of clusters as a termination condition.
A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering. It shows how objects are grouped together step by step. Figure shows
a dendrogram for the five objects presented in Figure, where l = 0 shows the five objects as
singleton clusters at level 0. At l = 1, objects a and b are grouped together to form the first
cluster, and they stay together at all subsequent levels.
We can also use a vertical axis to show the similarity scale between clusters. For
example, when the similarity of two groups of objects, {a, b} and {c, d, e}, is roughly 0.16, they
are merged together to form a single cluster.

Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}.
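As a concrete illustration, SciPy's hierarchical-clustering routines can reproduce this behaviour on a small data set; the coordinates below are invented stand-ins for the objects {a, b, c, d, e}.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Five illustrative 2-D objects standing in for {a, b, c, d, e}.
X = np.array([[0.0, 0.0], [0.2, 0.1],                 # a, b: close together
              [3.0, 3.0], [3.2, 3.1], [3.1, 3.4]])    # c, d, e

Z = linkage(X, method='single')                   # AGNES-style merging with minimum distance
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into two clusters
print(labels)                                     # e.g. [1 1 2 2 2]
dendrogram(Z)                                     # draws the step-by-step merge history (needs matplotlib)
```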

Four widely used measures for the distance between clusters are:
1. Minimum distance
2. Maximum distance
3. Mean distance
4. Average distance
• When an algorithm uses the minimum distance, to measure the distance between clusters,
it is sometimes called a nearest-neighbour clustering algorithm.
• If the clustering process is terminated when the distance between nearest clusters exceeds
an arbitrary threshold, it is called a single-linkage algorithm.
• An agglomerative hierarchical clustering algorithm that uses the minimum distance
measure is also called a minimal spanning tree algorithm.
• When an algorithm uses the maximum distance, to measure the distance between
clusters, it is sometimes called a farthest-neighbour clustering algorithm.
• If the clustering process is terminated when the maximum distance between nearest
clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm.
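The four measures listed above can be written down directly; the sketch below (the function name is ours) computes all of them for two clusters given as point arrays.

```python
import numpy as np

def cluster_distances(C1, C2):
    """The four common inter-cluster distance measures for point arrays C1 and C2."""
    d = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)  # all pairwise distances
    return {
        'minimum': d.min(),   # nearest-neighbour / single-linkage distance
        'maximum': d.max(),   # farthest-neighbour / complete-linkage distance
        'mean':    np.linalg.norm(C1.mean(axis=0) - C2.mean(axis=0)),  # distance between cluster means
        'average': d.mean(),  # average over all pairs of objects from different clusters
    }
```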

What are some of the difficulties with hierarchical clustering?


• It encounters difficulties regarding the selection of merge or split points. Such a decision
is critical because once a group of objects is merged or split, the process at the next step
will operate on the newly generated clusters.
• It will neither undo what was done previously nor perform object swapping between
clusters. Thus merge or split decisions, if not well chosen at some step, may lead to low-
quality clusters.
• The method does not scale well, because each decision to merge or split requires the
examination and evaluation of a good number of objects or clusters.

Density-based methods
To discover clusters with arbitrary shape, density-based clustering methods have been
developed. These typically regard clusters as dense regions of objects in the data space that
are separated by regions of low density (representing noise).
DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently
High Density
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based
clustering algorithm. The algorithm grows regions with sufficiently high density into clusters
and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster
as a maximal set of density-connected points.
The basic ideas of density-based clustering involve a number of new definitions:
• The neighbourhood within a radius ε of a given object is called the ε-neighbourhood of
the object.
• If the ε-neighbourhood of an object contains at least a minimum number, MinPts, of
objects, then the object is called a core object.
• Given a set of objects, D, we say that an object p is directly density-reachable from object
q if p is within the ε-neighbourhood of q, and q is a core object.
• An object p is density-reachable from object q with respect to ε and MinPts in a set of
objects, D, if there is a chain of objects p_1, ..., p_n, where p_1 = q and p_n = p, such that
p_{i+1} is directly density-reachable from p_i with respect to ε and MinPts, for 1 ≤ i < n, p_i ∈ D.
• An object p is density-connected to object q with respect to ε and MinPts in a set of
objects, D, if there is an object o ∈ D such that both p and q are density-reachable from o
with respect to ε and MinPts.
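These definitions translate almost directly into code. A minimal sketch (the function names are ours):

```python
import numpy as np

def eps_neighbourhood(D, i, eps):
    """Indices of all objects within radius eps of object i (including i itself)."""
    return np.where(np.linalg.norm(D - D[i], axis=1) <= eps)[0]

def is_core(D, i, eps, min_pts):
    """Object i is a core object if its eps-neighbourhood contains at least min_pts objects."""
    return len(eps_neighbourhood(D, i, eps)) >= min_pts

def directly_density_reachable(D, p, q, eps, min_pts):
    """p is directly density-reachable from q if q is a core object
    and p lies within q's eps-neighbourhood."""
    return is_core(D, q, eps, min_pts) and p in eps_neighbourhood(D, q, eps)
```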


Density reachability is the transitive closure of direct density reachability, and this
relationship is asymmetric. Only core objects are mutually density reachable. Density
connectivity, however, is a symmetric relation.

Example: Density-reachability and density connectivity.


Consider Figure for a given ε represented by the radius of the circles, and, say, let
MinPts = 3. Based on the above definitions:
• Of the labeled points, m, p, o, and r are core objects because each is in an ε-neighbourhood
containing at least three points.
• q is directly density-reachable from m. m is directly density-reachable from p and vice
versa.
• q is (indirectly) density-reachable from p because q is directly density-reachable from m
and m is directly density-reachable from p. However, p is not density-reachable from q
because q is not a core object. Similarly, r and s are density-reachable from o, and o is
density-reachable from r.
• o, r, and s are all density-connected.

A density-based cluster is a set of density-connected objects that is maximal with respect to
density-reachability. Every object not contained in any cluster is considered to be noise.

How does DBSCAN find clusters?


DBSCAN searches for clusters by checking the ε-neighbourhood of each point in the
database. If the ε-neighbourhood of a point p contains at least MinPts points, a new cluster with
p as a core object is created. DBSCAN then iteratively collects directly density-reachable
objects from these core objects, which may involve the merge of a few density-reachable
clusters. The process terminates when no new point can be added to any cluster.
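A simplified sketch of this cluster-growing procedure, reusing the eps_neighbourhood helper above; it illustrates the idea rather than a full reference implementation.

```python
import numpy as np

def dbscan(D, eps, min_pts):
    UNVISITED, NOISE = -2, -1
    labels = np.full(len(D), UNVISITED)
    cluster_id = 0
    for i in range(len(D)):
        if labels[i] != UNVISITED:
            continue
        neighbours = eps_neighbourhood(D, i, eps)
        if len(neighbours) < min_pts:          # i is not a core object
            labels[i] = NOISE
            continue
        labels[i] = cluster_id                 # start a new cluster at core object i
        seeds = list(neighbours)
        while seeds:                           # collect density-reachable objects
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id         # border object joins the cluster
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbours = eps_neighbourhood(D, j, eps)
            if len(j_neighbours) >= min_pts:   # j is also a core object: keep expanding
                seeds.extend(j_neighbours)
        cluster_id += 1
    return labels                              # cluster id per object, -1 marks noise
```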

Density reachability and density connectivity in density-based clustering.

Grid-based methods
The grid-based clustering approach uses a multiresolution grid data structure. It
quantizes the object space into a finite number of cells that form a grid structure on which all
of the operations for clustering are performed. The main advantage of the approach is its fast
processing time, which is typically independent of the number of data objects and dependent
only on the number of cells in each dimension of the quantized space. Some typical examples
of grid-based approaches are:
• STING: explores statistical information stored in the grid cells.
• WaveCluster: clusters objects using a wavelet transform method.
• CLIQUE: represents a grid- and density-based approach for clustering in high-dimensional
data space.

STING (STatistical INformation Grid)


STING is a grid-based multiresolution clustering technique in which the spatial area is
divided into rectangular cells. There are usually several levels of such rectangular cells
corresponding to different levels of resolution, and these cells form a hierarchical structure:
each cell at a high level is partitioned to form a number of cells at the next lower level.
Statistical information regarding the attributes in each grid cell (such as the mean, maximum,
and minimum values) is precomputed and stored. These statistical parameters are useful for
query processing.
Figure shows a hierarchical structure for STING clustering. Statistical parameters of
higher-level cells can easily be computed from the parameters of the lower-level cells. These
parameters include the following: the attribute-independent parameter, count; the attribute-
dependent parameters, mean, stdev (standard deviation), min (minimum), max (maximum);
and the type of distribution that the attribute value in the cell follows, such as normal,
uniform, exponential, or none (if the distribution is unknown).
When the data are loaded into the database, the parameters count, mean, stdev, min,
and max of the bottom-level cells are calculated directly from the data. The value of
distribution may either be assigned by the user if the distribution type is known beforehand
or obtained by hypothesis tests such as the χ2 test. The type of distribution of a higher-level
cell can be computed based on the majority of distribution types of its corresponding lower-
level cells in conjunction with a threshold filtering process. If the distributions of the lower
level cells disagree with each other and fail the threshold test, the distribution type of the
high-level cell is set to none.

A hierarchical structure for STING clustering.
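For example, the count, mean, standard deviation, minimum, and maximum of a higher-level cell can be derived from its child cells without revisiting the raw data. A minimal sketch (the Cell structure is an illustrative assumption):

```python
import math
from dataclasses import dataclass

@dataclass
class Cell:
    count: int
    mean: float
    stdev: float   # treated here as a population standard deviation
    min: float
    max: float

def merge_cells(children):
    """Derive a higher-level cell's statistics from its lower-level cells."""
    n = sum(c.count for c in children)
    mean = sum(c.count * c.mean for c in children) / n
    # combined variance via E[x^2] - mean^2, aggregated from the children
    ex2 = sum(c.count * (c.stdev ** 2 + c.mean ** 2) for c in children) / n
    stdev = math.sqrt(max(ex2 - mean ** 2, 0.0))
    return Cell(n, mean, stdev,
                min(c.min for c in children),
                max(c.max for c in children))
```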


How is this statistical information useful for query answering?


First, a layer within the hierarchical structure is determined from which the query-
answering process is to start. This layer typically contains a small number of cells. For each
cell in the current layer, we compute the confidence interval (or estimated range of
probability) reflecting the cell’s relevancy to the given query. The irrelevant cells are removed
from further consideration.
Processing of the next lower level examines only the remaining relevant cells. This
process is repeated until the bottom layer is reached. At this time,
• If the query specification is met, the regions of relevant cells that satisfy the query are
returned.
• Otherwise, the data that fall into the relevant cells are retrieved and further processed
until they meet the requirements of the query.
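A rough sketch of this top-down traversal; is_relevant, children_of, and is_bottom are placeholders for the application-specific confidence-interval test and the grid hierarchy, not part of STING itself.

```python
def answer_query(layer_cells, is_relevant, children_of, is_bottom):
    """Top-down STING query processing: keep only relevant cells at each level."""
    current = [c for c in layer_cells if is_relevant(c)]
    while current and not is_bottom(current[0]):
        # the next lower level examines only children of the remaining relevant cells
        current = [child for cell in current
                   for child in children_of(cell)
                   if is_relevant(child)]
    # the caller then returns these regions, or retrieves and further
    # processes the underlying data, as described in the two cases above
    return current
```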

What advantages does STING offer over other clustering methods?


• The grid-based computation is query-independent, because the statistical information
stored in each cell represents the summary information of the data in the grid cell,
independent of the query.
• The grid structure facilitates parallel processing and incremental updating.
• The time complexity of generating clusters is O(n), where n is the total number of objects.
After generating the hierarchical structure, the query processing time is O(g), where g is
the total number of grid cells at the lowest level, which is usually much smaller than n.

WaveCluster: Clustering Using Wavelet Transformation


WaveCluster is a multiresolution clustering algorithm that first summarizes the data by
imposing a multidimensional grid structure onto the data space. It then uses a wavelet
transformation to transform the original feature space, finding dense regions in the
transformed space.
In this approach, each grid cell summarizes the information of a group of points that
map into the cell. This summary information typically fits into main memory for use by the
multiresolution wavelet transform and the subsequent cluster analysis.
A wavelet transform is a signal processing technique that decomposes a signal into
different frequency subbands. The wavelet model can be applied to d-dimensional signals by
applying a one-dimensional wavelet transform d times.
In applying a wavelet transform, data are transformed so as to preserve the relative
distance between objects at different levels of resolution. This allows the natural clusters in
the data to become more distinguishable. Clusters can then be identified by searching for
dense regions in the new domain.
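A rough two-dimensional sketch of the idea, using a single level of Haar-style averaging in place of a full multiresolution wavelet transform; the grid size and density threshold are arbitrary illustrative choices.

```python
import numpy as np
from scipy.ndimage import label

def wave_cluster_2d(points, grid_size=64, threshold=2.0):
    # Quantize the feature space into a grid of cells and count points per cell.
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=grid_size)
    # One level of 2-D Haar-style averaging: each coarse cell is the mean of a 2x2 block.
    coarse = counts.reshape(grid_size // 2, 2, grid_size // 2, 2).mean(axis=(1, 3))
    # Dense cells in the transformed (smoothed) space form the cluster regions.
    dense = coarse > threshold
    regions, n_clusters = label(dense)   # connected dense regions = clusters
    return regions, n_clusters
```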

Why is wavelet transformation useful for clustering?


• It provides unsupervised clustering. It uses hat-shaped filters that emphasize regions
where the points cluster, while suppressing weaker information outside of the cluster
boundaries.
• The multiresolution property of wavelet transformations can help detect clusters at
varying levels of accuracy.

• Wavelet-based clustering is very fast, with a computational complexity of O(n), where n is
the number of objects in the database.

Model-based methods
Model-based clustering methods attempt to optimize the fit between the given data
and some mathematical model. Such methods are often based on the assumption that the
data are generated by a mixture of underlying probability distributions.
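For example, a Gaussian mixture fitted with the EM algorithm is a common model-based clustering approach. A minimal sketch using scikit-learn (assuming it is installed); the sample data are synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two illustrative Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)   # each object is assigned to its most likely mixture component
print(gm.means_)         # estimated component means (the fitted cluster "models")
```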

Evaluation of Clustering
Questions
2 Marks
1. Define Cluster & clustering.
2. Define absolute error in k – medoids method.
3. Define square error in k – means method.
4. What is agglomerative hierarchical clustering?
5. What is divisive hierarchical clustering?
6. List the different names given to clustering algorithms based on the distance measure used.
7. What are the important measures for distance between the clusters?
8. What are the difficulties in hierarchical clustering?
9. How does DBSCAN find clusters?
10. What advantages does STING offer over other clustering methods?
11. Why is wavelet transformation useful for clustering?
5 Marks
1. List the applications of clustering.
2. Explain the k-means method of clustering.
3. Write the algorithm of the k-medoids method.
4. Write a note on density-based clustering methods.
5. Write a note on grid-based clustering methods.
10 Marks
1. Explain the requirements of clustering in Data Mining.
2. Explain the k-means method of clustering with an example.
3. Explain the k-medoids method of clustering.
4. Explain hierarchical clustering.
5. Explain density-based clustering methods.
6. Explain grid-based clustering methods.
