Clustering in Python
Vijay Kumar Dwivedi
Clustering: Concept
• Given a set of records (instances, examples, objects, observations, …), organize them into clusters (groups, classes).
• Clustering: the process of grouping physical or abstract objects into classes of similar objects.

What is a Cluster?
• A cluster is a subset of objects which are "similar".
• A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object outside it.
• A connected region of a multidimensional space containing a relatively high density of objects.

What is Clustering?
• Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
• It helps users understand the natural grouping or structure in a data set.
• Clustering is unsupervised classification: there are no predefined classes.
• It is used either as a stand-alone tool to get insight into the data distribution or as a preprocessing step for other algorithms.

What is Good Clustering?
• A good clustering method will produce high-quality clusters in which:
 – the intra-class (that is, intra-cluster) similarity is high.
 – the inter-class similarity is low.
• The quality of a clustering result also depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
• However, objective evaluation of clustering quality is problematic.

Clustering: Applications
• Economic science (especially market research)
• WWW
 – Document classification
 – Clustering weblog data to discover groups of similar access patterns
• Pattern recognition
• Spatial data analysis
• Image processing

Main Categories of Clustering Methods
• Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
• Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
• Density-based: based on connectivity and density functions.
• Grid-based: based on a multiple-level granularity structure.
• Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model.

Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
 – Global optimum: exhaustively enumerate all partitions.
 – Heuristic methods: k-means and k-medoids algorithms.
 – k-means (MacQueen, 1967): each cluster is represented by the center of the cluster.
 – k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster.

Simple Clustering: K-means
• The basic version works with numeric data only.
 1. Pick a number (K) of cluster centers (centroids) at random.
 2. Assign every item to its nearest cluster center (e.g. using Euclidean distance).
 3. Move each cluster center to the mean of its assigned items.
 4. Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold).

Illustrating K-Means: Working
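The four numbered steps above map directly onto a few lines of code. The following is a minimal NumPy sketch of the basic K-means loop (not scikit-learn's implementation); the kmeans function name and its arguments are illustrative only.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: pick centroids, assign points, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels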
KMeans: Numerical Example
• Cluster the following eight points (with (x, y) representing locations) into three clusters:
 – A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
• Solution:
 – Initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2), chosen at random.
 – In practice, select the initial cluster centers so that they are as far apart from each other as possible.
 – Calculate the distance between each data point and each cluster center.
 – The distance may be calculated either by using a given distance function or by using the Manhattan distance formula.
• Calculating the distance between A1(2, 10) and C1(2, 10):
 – ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0
• Calculating the distance between A1(2, 10) and C2(5, 8):
 – ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |5 – 2| + |8 – 10| = 3 + 2 = 5
• Calculating the distance between A1(2, 10) and C3(1, 2):
 – ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1 – 2| + |2 – 10| = 1 + 8 = 9
• Similarly, compute the distances for the remaining points.
• After the first assignment, the three clusters are:
 – C1: A1(2, 10)
 – C2: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)
 – C3: A2(2, 5), A7(1, 2)

Recompute New Cluster Centers
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
• Cluster-01 contains only one point, A1(2, 10), so its center remains the same.
• Center of Cluster-02 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
• Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

Iteration 2
• Again compute the distance of all points to the newly computed cluster centers C1(2, 10), C2(6, 6), C3(1.5, 3.5).
• After the second iteration, the cluster centers are:
 – C1(3, 9.5)
 – C2(6.5, 5.25)
 – C3(1.5, 3.5)
• This process continues until no point changes its cluster assignment.
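For reference, these two iterations can be reproduced with a short NumPy sketch using the same Manhattan (city-block) distance; the variable names here are illustrative.

import numpy as np

# The eight points A1..A8 and the initial centers A1, A4, A7
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)

for it in (1, 2):
    # Manhattan distance from every point to every center
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)   # nearest center for each point
    # Recompute each center as the mean of the points assigned to it
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    print(f"After iteration {it}: centers = {centers.tolist()}")

# Matches the hand calculation above: (2, 10), (6, 6), (1.5, 3.5) after iteration 1
# and (3, 9.5), (6.5, 5.25), (1.5, 3.5) after iteration 2.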
Kmeans: Advantages

Kmeans: Disadvantages
• Choosing K manually.
• Being dependent on initial values.
• Clustering data of varying sizes and density.
• Clustering outliers.
• Scaling with the number of dimensions.

K-means Clustering
• Importing packages
 – import pandas as pd
 – from sklearn.cluster import KMeans
• Loading the data set
 – df = pd.read_csv('iris.csv')
• Extracting columns
 – X = df[['sepal_length', 'sepal_width']]
 – Y = df['species']
• Creating a K-means based clustering model
 – kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

K-means Clustering Contd…
• Printing the cluster centroids
 – print(kmeans.cluster_centers_)
• Getting the cluster labels
 – print(kmeans.labels_)
• Plotting the cluster centers and the data points on a 2D plane
 – import matplotlib.pyplot as plt
 – plt.scatter(X['sepal_length'], X['sepal_width'])
 – plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x')
 – plt.title('Data points and cluster centroids')
 – plt.show()

K-Means Clustering Contd…
• Checking cluster quality using the silhouette score
 – from sklearn.metrics import silhouette_score
 – print(silhouette_score(X, kmeans.labels_))

Hierarchical Clustering: Agglomerative
• In this technique, each data point is initially considered an individual cluster.
• At each iteration, the most similar clusters are merged until one cluster (or K clusters) remains.

Hierarchical Clustering: Agglomerative
• The basic agglomerative algorithm is as follows:
 – Compute the proximity matrix.
 – Let each data point be a cluster.
 – Repeat: merge the two closest clusters and update the proximity matrix.
 – Until only a single cluster remains.

Hierarchical Clustering: Agglomerative Example

Hierarchical Clustering: Visualization
• The hierarchical clustering technique can be visualized using a dendrogram.
• A dendrogram is a tree-like diagram that records the sequences of merges or splits.

Agglomerative Clustering
• Importing packages
 – import scipy.cluster.hierarchy as shc
 – from sklearn.cluster import AgglomerativeClustering
• Plotting dendrograms
 – dend = shc.dendrogram(shc.linkage(X, method='ward'))
• Creating the model
 – cluster = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')

Agglomerative Clustering Contd…
• Performing predictions
 – cluster.fit_predict(X)
• Plotting clusters
 – plt.scatter(X['sepal_length'], X['sepal_width'], c=cluster.labels_, cmap='rainbow')
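Putting the agglomerative fragments together, one possible runnable sketch is shown below. It assumes the same iris.csv file and sepal columns used for K-means above; note that recent scikit-learn releases rename the affinity parameter to metric (with ward linkage the distance is Euclidean either way), so the sketch simply omits it.

import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering

# Assumes an iris.csv with sepal_length and sepal_width columns, as in the K-means slides
df = pd.read_csv('iris.csv')
X = df[['sepal_length', 'sepal_width']]

# Dendrogram showing the sequence of Ward-linkage merges
plt.figure(figsize=(8, 5))
shc.dendrogram(shc.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.show()

# Agglomerative model with Ward linkage (Euclidean distance is implied)
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = cluster.fit_predict(X)

# Scatter plot of the points coloured by cluster label
plt.scatter(X['sepal_length'], X['sepal_width'], c=labels, cmap='rainbow')
plt.title('Agglomerative clusters')
plt.show()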