Prasanna Hebbar @govt First Grade College Honnavar
Clustering is the process of grouping the data into classes or clusters, so that objects
within a cluster have high similarity in comparison to one another but are very dissimilar to
objects in other clusters.
Cluster Analysis
The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering. A cluster is a collection of data objects that are similar to one
another within the same cluster and are dissimilar to the objects in other clusters. A cluster
of data objects can be treated collectively as one group and so may be considered as a form
of data compression.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity. Clustering and unsupervised
learning do not rely on predefined classes and class-labeled training examples. For this
reason, clustering is a form of learning by observation, rather than learning by examples.
• By automated clustering, we can identify dense and sparse regions in object space and,
therefore, discover overall distribution patterns and interesting correlations among data
attributes.
• Cluster analysis has been widely used in numerous applications, including market
research, pattern recognition, data analysis, and image processing.
• In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
• In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
• Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, and geographic location.
• Clustering can help identify groups of automobile insurance policy holders with a high average
claim cost.
• It can also be used to help classify documents on the Web for information discovery.
• Clustering can also be used for outlier detection, where outliers (values that are “far away”
from any cluster) may be more interesting than common cases.
• As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight
into the distribution of data, to observe the characteristics of each cluster, and to focus on
a particular set of clusters for further analysis.
Requirements of Clustering in Data Mining
• Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based
on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to
find spherical clusters with similar size and density. However, a cluster could be of any shape.
It is important to develop algorithms that can detect clusters of arbitrary shape.
• Minimal requirements for domain knowledge to determine input parameters: Many
clustering algorithms require users to input certain parameters in cluster analysis (such as
the number of desired clusters). The clustering results can be quite sensitive to input
parameters. Parameters are often difficult to determine, especially for data sets containing
high-dimensional objects.
• Ability to deal with noisy data: Most real-world databases contain outliers or missing,
unknown, or erroneous data. Some clustering algorithms are sensitive to such data and
may lead to clusters of poor quality.
• Incremental clustering and insensitivity to the order of input records: Some clustering
algorithms cannot incorporate newly inserted data (i.e., database updates) into existing
clustering structures and, instead, must determine a new clustering from scratch. Some
clustering algorithms are sensitive to the order of input data. It is important to develop
incremental clustering algorithms and algorithms that are insensitive to the order of input.
• High dimensionality: A database or a data warehouse can contain several dimensions or
attributes. Finding clusters of data objects in high dimensional space is challenging,
especially considering that such data can be sparse and highly skewed.
• Constraint-based clustering: Real-world applications may need to perform clustering
under various kinds of constraints. Suppose that your job is to choose the locations for a
given number of new automatic banking machines (ATMs) in a city. To decide upon this,
you may cluster households while considering constraints such as the city’s rivers and
highway networks, and the type and number of customers per cluster. A challenging task
is to find groups of data with good clustering behaviour that satisfy specified constraints.
• Interpretability and usability: Users expect clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied to specific semantic
interpretations and applications. It is important to study how an application goal may
influence the selection of clustering features and methods.
Clustering Methods
1. Partitioning methods
Given D, a data set of n objects, and k, the number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a
cluster.
The clusters are formed to optimize an objective partitioning criterion, such as a
dissimilarity function based on distance, so that the objects within a cluster are “similar,”
whereas the objects of different clusters are “dissimilar” in terms of the data set attributes.
Algorithm: k-means.
//The k-means algorithm for partitioning, where each cluster’s center is represented by the
//mean value of the objects in the cluster.
Input: k: The number of clusters, D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e.,
calculate the mean value of the objects for each cluster;
(5) until no change;
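For illustration only, the following is a minimal Python/NumPy sketch of the algorithm above; the function name k_means and the parameters max_iter and seed are illustrative choices, not part of the notes. Iteration stops when the cluster centers no longer change, i.e., when the square-error criterion E = sum over clusters Ci of sum over objects p in Ci of |p - mi|^2 (mi being the mean of Ci) can no longer be reduced.

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    # X is an (n, d) array of objects; k is the number of clusters.
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects from X as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (3) (re)assign each object to the cluster whose center is nearest (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (4) update the cluster means: each center becomes the mean of its assigned objects
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # (5) stop when no center changes
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Example on invented data: three well-separated groups of 2-D points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (20, 2)) for c in ([0, 0], [5, 5], [0, 5])])
labels, centers = k_means(X, k=3)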
Example
Suppose that there is a set of objects located in space as depicted in the rectangle
shown in Figure (a). Let k = 3; that is, the user would like the objects to be partitioned into
three clusters.
According to the algorithm we arbitrarily choose three objects as the three initial
cluster centers, where cluster centers are marked by a “+”. Each object is distributed to a
cluster based on the cluster center to which it is the nearest. Such a distribution forms
silhouettes encircled by dotted curves, as shown in Figure (a).
Next, the cluster centers are updated. That is, the mean value of each cluster is
recalculated based on the current objects in the cluster. Using the new cluster centers, the
objects are redistributed to the clusters based on which cluster center is the nearest. Such a
redistribution forms new silhouettes encircled by dashed curves, as shown in Figure (b).
This process iterates, leading to Figure (c). The process of iteratively reassigning objects
to clusters to improve the partitioning is referred to as iterative relocation. Eventually, no
redistribution of the objects in any cluster occurs, and so the process terminates. The resulting
clusters are returned by the clustering process.
➢ The k-means method, however, can be applied only when the mean of a cluster is defined. It is
also sensitive to noise and outliers, because an object with an extremely large value can
substantially distort the cluster mean. The k-medoids method addresses this by using an actual
object in each cluster (a medoid) as the representative, and by minimizing the sum of
dissimilarities (the absolute error) between the objects and their medoid.
Algorithm: k-medoids.
// PAM, a k-medoids algorithm for partitioning based on medoid or central objects.
Input: k: the number of clusters, D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster
with the nearest representative object;
(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost, S, of swapping representative object, o_j, with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of
k representative objects;
(7) until no change;
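As an illustration (an assumption, not the textbook's exact procedure), the sketch below follows the spirit of PAM in Python/NumPy: instead of trying a single random non-representative object per iteration as in step (4), it greedily evaluates every candidate swap and keeps the one that most reduces the total cost, where the cost (the absolute error) is the sum of distances from each object to the medoid of its cluster.

import numpy as np

def pam(X, k, max_swaps=100, seed=0):
    # X is an (n, d) array of objects; the medoids are actual objects of X.
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))           # (1) arbitrary seeds

    def total_cost(meds):
        # each object contributes its distance to the nearest medoid (absolute error)
        return dist[:, meds].min(axis=1).sum()

    for _ in range(max_swaps):
        best_cost, best_swap = total_cost(medoids), None
        # (4)-(6): consider swapping each medoid o_j with every non-medoid o_random
        for j in range(k):
            for o_rand in range(n):
                if o_rand in medoids:
                    continue
                trial = medoids[:j] + [o_rand] + medoids[j + 1:]
                cost = total_cost(trial)
                if cost < best_cost:          # keep the swap only if the cost decreases
                    best_cost, best_swap = cost, trial
        if best_swap is None:                 # (7) no improving swap exists: stop
            break
        medoids = best_swap
    labels = dist[:, medoids].argmin(axis=1)  # (3) assign objects to the nearest medoid
    return labels, [X[m] for m in medoids]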
2. Hierarchical methods
Hierarchical methods work by grouping data objects into a tree of clusters. There are two types.
Agglomerative hierarchical clustering
This bottom-up strategy starts by placing each object in its own cluster and then merges
these atomic clusters into larger and larger clusters, until all of the objects are in a single
cluster or until certain termination conditions are satisfied. Most hierarchical clustering
methods belong to this category. They differ only in their definition of inter-cluster similarity.
Divisive hierarchical clustering
This top-down strategy does the reverse of agglomerative hierarchical clustering by
starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces,
until each object forms a cluster on its own or until certain termination conditions are
satisfied, such as a desired number of clusters being obtained or the diameter of each cluster
falling within a certain threshold.
Example
Figure shows the application of AGNES (AGglomerative NESting), an agglomerative
hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical
clustering method, to a data set of five objects, {a, b, c, d, e}.
Initially, AGNES places each object into a cluster of its own. The clusters are then
merged step-by-step according to some criterion. For example, clusters C1 and C2 may be
merged if an object in C1 and an object in C2 form the minimum Euclidean distance between
any two objects from different clusters. This is a single-linkage approach in that each cluster
is represented by all of the objects in the cluster, and the similarity between two clusters is
measured by the similarity of the closest pair of data points belonging to different clusters.
The cluster merging process repeats until all of the objects are eventually merged to form one
cluster.
In DIANA, all of the objects are used to form one initial cluster. The cluster is split
according to some principle, such as the maximum Euclidean distance between the closest
neighbouring objects in the cluster. The cluster splitting process repeats until, eventually, each
new cluster contains only a single object.
In either agglomerative or divisive hierarchical clustering, the user can specify the
desired number of clusters as a termination condition.
A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering. It shows how objects are grouped together step by step. Figure shows
a dendrogram for the five objects presented in Figure, where l = 0 shows the five objects as
singleton clusters at level 0. At l = 1, objects a and b are grouped together to form the first
cluster, and they stay together at all subsequent levels.
We can also use a vertical axis to show the similarity scale between clusters. For
example, when the similarity of two groups of objects, {a, b} and {c, d, e}, is roughly 0.16, they
are merged together to form a single cluster.
Four widely used measures for the distance between clusters are the minimum distance, the
maximum distance, the mean distance, and the average distance.
• When an algorithm uses the minimum distance to measure the distance between clusters,
it is sometimes called a nearest-neighbour clustering algorithm.
• If the clustering process is terminated when the distance between nearest clusters exceeds
an arbitrary threshold, it is called a single-linkage algorithm.
• An agglomerative hierarchical clustering algorithm that uses the minimum distance
measure is also called a minimal spanning tree algorithm.
• When an algorithm uses the maximum distance to measure the distance between
clusters, it is sometimes called a farthest-neighbour clustering algorithm.
• If the clustering process is terminated when the maximum distance between nearest
clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm; a short
SciPy sketch of the single- and complete-linkage criteria follows this list.
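A minimal sketch using SciPy (an assumed tool, not part of the original notes): method='single' implements the single-linkage (minimum-distance) criterion, method='complete' the complete-linkage (maximum-distance) criterion, and fcluster cuts the tree once the desired number of clusters is reached.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Invented data: two well-separated groups of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (10, 2)) for c in ([0, 0], [6, 6])])

# Agglomerative clustering under the two linkage criteria discussed above
Z_single = linkage(X, method='single')      # nearest-neighbour / minimum distance
Z_complete = linkage(X, method='complete')  # farthest-neighbour / maximum distance

# Terminate by asking for a desired number of clusters (here, 2)
labels = fcluster(Z_single, t=2, criterion='maxclust')

# dendrogram(Z_single) draws the merge tree step by step (plotting requires matplotlib)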
3. Density-based methods
To discover clusters with arbitrary shape, density-based clustering methods have been
developed. These typically regard clusters as dense regions of objects in the data space that
are separated by regions of low density (representing noise).
DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently
High Density
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based
clustering algorithm. The algorithm grows regions with sufficiently high density into clusters
and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster
as a maximal set of density-connected points.
The basic ideas of density-based clustering involve a number of new definitions:
• The neighbourhood within a radius ε of a given object is called the ε-neighbourhood of
the object.
• If the ε-neighbourhood of an object contains at least a minimum number, MinPts, of
objects, then the object is called a core object.
• Given a set of objects, D, we say that an object p is directly density-reachable from object
q if p is within the ε-neighbourhood of q, and q is a core object.
• An object p is density-reachable from object q with respect to ε and MinPts in a set of
objects, D, if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that pi+1 is
directly density-reachable from pi with respect to ε and MinPts, for 1 ≤ i < n, pi ∈ D.
• An object p is density-connected to object q with respect to ε and MinPts in a set of
objects, D, if there is an object o ∈ D such that both p and q are density-reachable from o
with respect to ε and MinPts.
Density reachability is the transitive closure of direct density reachability, and this
relationship is asymmetric. Only core objects are mutually density reachable. Density
connectivity, however, is a symmetric relation.
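For illustration, scikit-learn's DBSCAN implementation can be used as follows on an invented ring-shaped data set (a shape that distance-based partitioning methods handle poorly); eps plays the role of the radius ε and min_samples the role of MinPts, and objects labelled -1 are treated as noise.

import numpy as np
from sklearn.cluster import DBSCAN

# Invented data: two ring-shaped clusters plus a few scattered noise points
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
X = np.vstack([
    np.c_[np.cos(angles[:100]), np.sin(angles[:100])],        # inner ring, radius 1
    3 * np.c_[np.cos(angles[100:]), np.sin(angles[100:])],    # outer ring, radius 3
    rng.uniform(-4, 4, (10, 2)),                               # noise objects
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # eps = ε-neighbourhood radius, min_samples = MinPts

labels = db.labels_                          # cluster label per object; -1 marks noise
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True         # True for the core objects found by the algorithm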
4. Grid-based methods
The grid-based clustering approach uses a multiresolution grid data structure. It
quantizes the object space into a finite number of cells that form a grid structure on which all
of the operations for clustering are performed. The main advantage of the approach is its fast
processing time, which is typically independent of the number of data objects and dependent
only on the number of cells in each dimension of the quantized space; a small sketch of this
quantization idea is given after the examples below. Some typical examples of grid-based
approaches are:
• STING: which explores statistical information stored in the grid cells.
• WaveCluster: which clusters objects using a wavelet transform method.
• CLIQUE: which represents a grid- and density-based approach for clustering in high-
dimensional data space.
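The sketch below (invented for illustration; it is not STING, WaveCluster, or CLIQUE) shows only the basic quantization idea shared by these methods: objects are mapped into grid cells, cells are counted, and "dense" cells are kept; the cell resolution and density threshold are arbitrary parameters.

import numpy as np

def dense_grid_cells(X, cells_per_dim=10, density_threshold=5):
    # Quantize the object space into a finite number of cells and keep the cells
    # containing at least density_threshold objects; once the counts are built,
    # the work depends on the number of cells rather than the number of objects.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cell_ids = np.floor((X - mins) / (maxs - mins + 1e-12) * cells_per_dim).astype(int)
    cell_ids = np.clip(cell_ids, 0, cells_per_dim - 1)          # cell index for each object
    cells, counts = np.unique(cell_ids, axis=0, return_counts=True)
    return cells[counts >= density_threshold]                   # the dense cells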
5. Model-based methods
Model-based clustering methods attempt to optimize the fit between the given data
and some mathematical model. Such methods are often based on the assumption that the
data are generated by a mixture of underlying probability distributions.
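As a concrete illustrative example of this idea (invented data, an assumed scikit-learn dependency), GaussianMixture fits a mixture of Gaussian distributions with the EM algorithm and assigns each object to the component that most probably generated it.

import numpy as np
from sklearn.mixture import GaussianMixture

# Invented data assumed to come from a mixture of two Gaussian distributions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([5, 5], 0.5, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)        # hard assignment of each object to a component (cluster)
probs = gm.predict_proba(X)   # soft, probabilistic cluster memberships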
Evaluation of Clustering
Questions
2 Marks
1. Define cluster and clustering.
2. Define absolute error in the k-medoids method.
3. Define square error in the k-means method.
4. What is agglomerative hierarchical clustering?
5. What is divisive hierarchical clustering?
6. List the different names given to clustering algorithms based on the distance measure used.
7. What are the important measures for the distance between clusters?
8. What are the difficulties in hierarchical clustering?
9. How does DBSCAN find clusters?
10. What advantages does STING offer over other clustering methods?
11. Why is wavelet transformation useful for clustering?
5 Marks
1. List the applications of clustering.
2. Explain the k-means method of clustering.
3. Write the algorithm of the k-medoids method.
4. Write a note on density-based clustering methods.
5. Write a note on grid-based clustering methods.
10 Marks
1. Explain the requirements of clustering in data mining.
2. Explain the k-means method of clustering with an example.
3. Explain the k-medoids method of clustering.
4. Explain hierarchical clustering.
5. Explain density-based clustering methods.
6. Explain grid-based clustering methods.