
Clustering in Python

Vijay Kumar Dwivedi


Clustering: Concept
• Given a set of records (instances, examples,
objects, observations, …), organize them into
clusters (groups, classes)
• Clustering: the process of grouping physical or
abstract objects into classes of similar objects
What is a Cluster?
• A cluster is a subset of objects which are
“similar” .
• A subset of objects such that the distance
between any two objects in the cluster is less
than the distance between any object in the
cluster and any object not located inside it.
• A connected region of a multidimensional
space containing a relatively high density of
objects
What is Clustering?
• Clustering is a process of partitioning a set of
data (or objects) into a set of meaningful sub-
classes, called clusters.
• Helps users understand the natural grouping or
structure in a data set.
• Clustering: unsupervised classification: no
predefined classes.
• Used either as a stand-alone tool to get insight
into data distribution or as a preprocessing
step for other algorithms.
What is Good Clustering?
• A good clustering method will produce high
quality clusters in which:
– the intra-class (that is, intra-cluster) similarity is
high.
– the inter-class similarity is low.
• The quality of a clustering result also depends
on both the similarity measure used by the
method and its implementation.
• The quality of a clustering method is also
measured by its ability to discover some or all
of the hidden patterns.
• However, objective evaluation of clustering quality is problematic.
Clustering: Applications
• Economic Science (especially market research).
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar
access patterns
• Pattern Recognition
• Spatial Data Analysis
• Image Processing
Main Categories of Clustering Methods
• Partitioning algorithms: Construct various
partitions and then evaluate them by some
criterion
• Hierarchy algorithms: Create a hierarchical
decomposition of the set of data (or objects)
using some criterion.
• Density-based: based on connectivity and
density functions
• Grid-based: based on a multiple-level
granularity structure
• Model-based: A model is hypothesized for
each of the clusters and the idea is to find the
best fit of the data to the given model.
Partitioning Algorithms: Basic
Concept
• Partitioning method: Construct a partition of a
database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that
optimizes the chosen partitioning criterion.
– Global optimal: exhaustively enumerate all
partitions.
– Heuristic methods: k-means and k-medoids
algorithms.
– k-means (MacQueen’67): Each cluster is
represented by the center of the cluster
– k-medoids or PAM (Partition around medoids)
(Kaufman & Rousseeuw’87): Each cluster is
represented by one of the objects in the cluster.
Simple Clustering: K-means
• Basic version works with numeric data only
• Pick a number (K) of cluster centers - centroids
(at random)
• Assign every item to its nearest cluster center
(e.g. using Euclidean distance)
• Move each cluster center to the mean of its
assigned items
• Repeat the assignment and update steps until
convergence (e.g., cluster assignments change by
less than a threshold); a minimal sketch of this
loop is given below.
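The loop described above can be written directly in NumPy. The following is a minimal sketch (not a library implementation); the function name, the random seed, and the assumption that no cluster becomes empty during the iterations are illustrative choices.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means (Lloyd's algorithm) on a numeric array X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K items at random as the initial cluster centers (centroids)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every item to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned items
        # (assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels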
Illustrating K-Means: Working
K-means: Numerical Example
• Cluster the following eight points (with (x, y)
representing locations) into three clusters:
– A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6,
4), A7(1, 2), A8(4, 9)
• Solution:
– Initial cluster centers are:
• A1(2, 10), A4(5, 8) and A7(1, 2). (Randomly chosen)
• Select the cluster centers in such a way that they are as
far apart from each other as possible.
– Calculate the distance between each data point and
each cluster center.
– The distance may be calculated either by using a
given distance function or by using the Manhattan
distance formula.
• Calculating the distance between A1(2, 10) and C1(2, 10):
– ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0
• Calculating the distance between A1(2, 10) and C2(5, 8):
– ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |5 – 2| + |8 – 10| = 3 + 2 = 5
• Calculating the distance between A1(2, 10) and C3(1, 2):
– ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1 – 2| + |2 – 10| = 1 + 8 = 9
• Similarly, compute the distances for the remaining points (a small code sketch of this computation is given below).
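The whole distance table can be computed at once. This is an illustrative sketch (not part of the slides) using SciPy's city-block (Manhattan) distance:

import numpy as np
from scipy.spatial.distance import cdist

# A1..A8 and the initial centers C1=A1, C2=A4, C3=A7
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]])
centers = np.array([[2, 10], [5, 8], [1, 2]])

dist = cdist(points, centers, metric='cityblock')  # 8 x 3 Manhattan distances
print(dist[0])              # distances of A1 to C1, C2, C3 -> [0. 5. 9.]
print(dist.argmin(axis=1))  # index of the nearest center for every point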

Three Clusters are:

• C1: A1(2, 10)

• C2:
– A3(8, 4)
– A4(5, 8)
– A5(7, 5)
– A6(6, 4)
– A8(4, 9)

• C3: A2(2, 5), A7(1, 2)

Recompute New Cluster Centers
• The new cluster center is computed by taking the mean of all
the points contained in that cluster.
• Cluster-01 contains only one point, A1(2, 10), so its center
remains the same.
• Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
• Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
Iteration 2
• Again compute the distance of every point to the newly
computed cluster centers: C1(2, 10), C2(6, 6), C3(1.5, 3.5).
• After the second iteration, the cluster centers are:
– C1(3, 9.5)
– C2(6.5, 5.25)
– C3(1.5, 3.5)
• This process is continued until no point changes its cluster
assignment, i.e., until the clustering converges (a code sketch
of these iterations is given below).
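The assign/update steps of this worked example can be repeated in a few lines of code. This is an illustrative sketch (not part of the slides); it uses the same Manhattan distance and prints the centers after every iteration, so the values after iteration 2 can be compared with the slide.

import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)   # A1..A8
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)         # initial C1, C2, C3

for iteration in range(1, 11):
    # Assign each point to its nearest center (Manhattan distance)
    labels = cdist(points, centers, metric='cityblock').argmin(axis=1)
    # Recompute each center as the mean of its assigned points
    new_centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    print(f"iteration {iteration}: centers =\n{new_centers}")
    if np.array_equal(new_centers, centers):   # no change -> converged
        break
    centers = new_centers
# After iteration 2 the centers are (3, 9.5), (6.5, 5.25), (1.5, 3.5),
# matching the values on the slide.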
K-means: Advantages
K-means: Disadvantages
• Choosing K manually.
• Being dependent on initial values.
• Clustering data of varying sizes and density.
• Clustering outliers.
• Scaling with the number of dimensions.
K-means Clustering
• Importing KMeans (pandas is assumed to be imported as pd)
– from sklearn.cluster import KMeans
• # Loading the data set
– df = pd.read_csv('iris.csv')
• Extracting columns
– X = df[['sepal_length', 'sepal_width']]
– Y = df['species']
• Creating a K-means based clustering model
– kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
K-means Clustering Contd…
• Printing the cluster centroids
– print(kmeans.cluster_centers_)
• # Get the cluster labels
– print(kmeans.labels_)
• # Plotting the cluster centers and the data
points on a 2D plane
– plt.scatter(X['sepal_length'], X['sepal_width'])
– plt.scatter(kmeans.cluster_centers_[:, 0],
kmeans.cluster_centers_[:, 1], c='red', marker='x')
– plt.title('Data points and cluster centroids')
– plt.show()
K-means Clustering Contd…
• Evaluating cluster quality using the silhouette score (a
consolidated, runnable sketch of the K-means snippets is given below)
– from sklearn.metrics import silhouette_score
– print(silhouette_score(X, kmeans.labels_))
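For convenience, the fragments on the three K-means slides above can be combined into a single runnable script. This is a minimal sketch: the iris.csv file name and its column names are assumptions carried over from the slides, pandas and matplotlib are imported explicitly, and the points are coloured by their cluster label here (the slide plots them in a single colour).

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load the data set (assumed to contain sepal_length, sepal_width, species)
df = pd.read_csv('iris.csv')
X = df[['sepal_length', 'sepal_width']]   # features used for clustering
Y = df['species']                         # true labels (not used for fitting)

# Fit a 3-cluster K-means model
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

print(kmeans.cluster_centers_)                 # centroid coordinates
print(kmeans.labels_)                          # cluster label of each sample
print(silhouette_score(X, kmeans.labels_))     # cluster quality, in [-1, 1]

# Plot the data points and the fitted centroids
plt.scatter(X['sepal_length'], X['sepal_width'], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x')
plt.title('Data points and cluster centroids')
plt.show()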
Hierarchical Clustering: Agglomerative
• In this technique, each data point is initially
considered as an individual cluster.
• At each iteration, the most similar clusters merge
with other clusters until one cluster or K
clusters are formed.
Hierarchical Clustering: Agglomerative
• The basic agglomerative algorithm is as follows (a
small from-scratch sketch is given below):
– Compute the proximity matrix
– Let each data point be a cluster
– Repeat: merge the two closest clusters and
update the proximity matrix
– Until only a single cluster remains
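The merge loop above can be written from scratch in a few lines. The following is an illustrative sketch using single linkage (closest pair of points) and a small made-up data set; it is not the scikit-learn implementation used later.

import numpy as np
from scipy.spatial.distance import cdist

# Small made-up data set (illustrative)
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

prox = cdist(X, X)                        # proximity matrix between all points
clusters = [[i] for i in range(len(X))]   # let each data point be a cluster

def linkage_dist(a, b):
    # Single-linkage distance: the closest pair of points across two clusters
    return min(prox[p, q] for p in a for q in b)

while len(clusters) > 1:
    # Find the two closest clusters ...
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    i, j = min(pairs, key=lambda ij: linkage_dist(clusters[ij[0]], clusters[ij[1]]))
    print(f"merge {clusters[i]} + {clusters[j]}")
    # ... merge them, and repeat until only a single cluster remains
    clusters[i] += clusters[j]
    del clusters[j]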
Hierarchical Clustering: Agglomerative Example
Hierarchical Clustering: Visualization
• The hierarchical clustering technique can
be visualized using a dendrogram.
• A dendrogram is a tree-like diagram that
records the sequences of merges or splits.

Agglomerative Clustering
• Importing packages
– import scipy.cluster.hierarchy as shc
– from sklearn.cluster import AgglomerativeClustering
• Plotting the dendrogram
– dend = shc.dendrogram(shc.linkage(X, method='ward'))
• Creating the model
– cluster = AgglomerativeClustering(n_clusters=5,
affinity='euclidean', linkage='ward')
Agglomerative Clustering Contd…
• Performing predictions
– cluster.fit_predict(X)
• Plotting the clusters (a consolidated, runnable sketch is given below)
– plt.scatter(X['sepal_length'], X['sepal_width'],
c=cluster.labels_, cmap='rainbow')
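The agglomerative fragments above can likewise be combined into one runnable sketch. It assumes the same X (the sepal_length and sepal_width columns loaded from iris.csv) as in the K-means example. The affinity='euclidean' argument is omitted here because Ward linkage works with Euclidean distances by default, and recent scikit-learn releases have renamed this parameter to metric.

import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering

# Dendrogram of the Ward-linkage hierarchy
plt.figure()
dend = shc.dendrogram(shc.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.show()

# Flat clustering into 5 clusters, as on the slide
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = cluster.fit_predict(X)

# Scatter plot of the points coloured by cluster label
plt.scatter(X['sepal_length'], X['sepal_width'], c=labels, cmap='rainbow')
plt.title('Agglomerative clustering')
plt.show()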