Cluster Analysis in Python Chapter2 PDF

This document provides an overview of hierarchical clustering techniques in Python. It discusses different linkage methods for calculating distances between clusters like single, complete, average, centroid and ward. It also covers creating cluster labels using fcluster and visualizing clusters using matplotlib and seaborn. Dendrograms are introduced as a way to determine the number of clusters by showing how clusters are merged. Limitations like the quadratic increase in runtime with data points, making it unsuitable for large datasets, are also covered.

Basics of hierarchical clustering
CLUSTER ANALYSIS IN PYTHON

Shaumik Daityari
Business Analyst
Creating a distance matrix using linkage
scipy.cluster.hierarchy.linkage(observations,
                                method='single',
                                metric='euclidean',
                                optimal_ordering=False)

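method : how to calculate the proximity between clusters

metric : distance metric used between observations

optimal_ordering : whether to reorder the linkage so that the distance between successive leaves is minimal (default False)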
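
As a minimal sketch of the call above, the snippet below runs linkage() on a small, made-up array of 2-D points (the data is an assumption for illustration, not taken from the course). It shows that the result has one row per merge.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: two loose groups of 2-D points
observations = np.array([[2.0, 1.0], [3.0, 1.0], [2.0, 2.0],
                         [5.0, 5.0], [6.0, 5.0]])

Z = linkage(observations,
            method='single',
            metric='euclidean',
            optimal_ordering=False)

# Each row of Z describes one merge:
# [index of cluster 1, index of cluster 2, distance, size of new cluster]
print(Z.shape)
print(Z)
```

With n observations the result has n - 1 rows, because every merge reduces the number of clusters by one.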

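Which method should you use?
single: based on the two closest objects of the two clusters

complete: based on the two farthest objects of the two clusters

average: based on the arithmetic mean of the distances between all pairs of objects

centroid: based on the distance between the centroids (geometric centers) of the clusters

median: like centroid, but the center of a merged cluster is the midpoint of the two centers it was formed from

ward: based on the increase in the within-cluster sum of squares when clusters are merged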
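
To make the difference between the methods concrete, here is a small sketch that runs each method on the same made-up points and records the distance at which the last two clusters are merged. The data and the name final_heights are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: two loose groups of 2-D points
observations = np.array([[2.0, 1.0], [3.0, 1.0], [2.0, 2.0],
                         [5.0, 5.0], [6.0, 5.0]])

final_heights = {}
for method in ['single', 'complete', 'average', 'centroid', 'median', 'ward']:
    Z = linkage(observations, method=method, metric='euclidean')
    # Column 2 of the last row is the distance of the final merge
    final_heights[method] = Z[-1, 2]
    print(f"{method:>8}: {final_heights[method]:.3f}")
```

On this data the final merge always joins the same two groups, so the printed values differ only because each method measures the distance between those groups differently.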

Create cluster labels with fcluster
scipy.cluster.hierarchy.fcluster(distance_matrix,
                                 num_clusters,
                                 criterion)

distance_matrix : output of the linkage() function

num_clusters : number of clusters (used as the maximum when criterion='maxclust')

criterion : how to decide thresholds to form clusters
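
A minimal sketch of the two steps together, assuming a small made-up dataset and the 'maxclust' criterion: linkage() builds the hierarchy and fcluster() cuts it into flat cluster labels.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy data: two loose groups of 2-D points
observations = np.array([[2.0, 1.0], [3.0, 1.0], [2.0, 2.0],
                         [5.0, 5.0], [6.0, 5.0]])

# Step 1: build the hierarchy
distance_matrix = linkage(observations, method='ward', metric='euclidean')

# Step 2: cut the hierarchy into at most 2 flat clusters
labels = fcluster(distance_matrix, 2, criterion='maxclust')

# labels holds one integer cluster id per observation, starting at 1
print(labels)
```

The labels array lines up with the rows of the input, so it can be assigned directly to a new DataFrame column for plotting.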

Hierarchical clustering with ward method
[Figure: clusters formed with the ward method]

Hierarchical clustering with single method
[Figure: clusters formed with the single method]

Hierarchical clustering with complete method
[Figure: clusters formed with the complete method]

Final thoughts on selecting a method
No single method is right for all datasets

You need to carefully understand the distribution of the data

Visualize clusters
Why visualize clusters?
Try to make sense of the clusters formed

An additional step in validation of clusters

Spot trends in data

An introduction to seaborn
seaborn : a Python data visualization library based on matplotlib

Has better, easily modifiable aesthetics than matplotlib!

Contains functions that make data visualization tasks easy in the context of data analytics

Use case for clustering: hue parameter for plots

Visualize clusters with matplotlib
import pandas as pd
from matplotlib import pyplot as plt

# Sample points with a cluster label for each row
df = pd.DataFrame({'x': [2, 3, 5, 6, 2],
                   'y': [1, 1, 5, 5, 2],
                   'labels': ['A', 'A', 'B', 'B', 'A']})

# Map each cluster label to a color
colors = {'A': 'red', 'B': 'blue'}

df.plot.scatter(x='x',
                y='y',
                c=df['labels'].apply(lambda x: colors[x]))
plt.show()

Visualize clusters with seaborn
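import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Sample points with a cluster label for each row
df = pd.DataFrame({'x': [2, 3, 5, 6, 2],
                   'y': [1, 1, 5, 5, 2],
                   'labels': ['A', 'A', 'B', 'B', 'A']})

# hue colors each point by its cluster label
sns.scatterplot(x='x',
                y='y',
                hue='labels',
                data=df)
plt.show()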

Comparison of both methods of visualization
[Figure: two panels, "Matplotlib plot" and "Seaborn plot"]

How many clusters?
Introduction to dendrograms
Strategy so far: decide the number of clusters by visual inspection

Dendrograms help in showing progressions as clusters are merged

A dendrogram is a branching diagram that demonstrates how each cluster is composed by branching out into its child nodes

Create a dendrogram in SciPy
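from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# df is assumed to already hold the scaled columns 'x_whiten' and 'y_whiten'
Z = linkage(df[['x_whiten', 'y_whiten']],
            method='ward',
            metric='euclidean')

# Draw the merge hierarchy stored in Z
dn = dendrogram(Z)
plt.show()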
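
One possible way to turn a dendrogram into a number is sketched below. It is a heuristic, not something the slides prescribe: it looks for the largest jump between successive merge distances and suggests cutting the tree there. The toy data and the name suggested_k are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.vq import whiten

# Hypothetical toy data: two loose groups of 2-D points
observations = np.array([[2.0, 1.0], [3.0, 1.0], [2.0, 2.0],
                         [5.0, 5.0], [6.0, 5.0]])

# Scale each column to unit variance, as the *_whiten columns imply
scaled = whiten(observations)
Z = linkage(scaled, method='ward', metric='euclidean')

# no_plot=True returns the dendrogram layout without drawing it
dn = dendrogram(Z, no_plot=True)

# Merge distances, in the order the merges happened
heights = Z[:, 2]
gaps = np.diff(heights)

# Cutting inside the largest gap undoes every merge after it
suggested_k = len(heights) - int(np.argmax(gaps))
print(suggested_k)
```

The largest-gap rule mirrors what the eye does on the plot: a long vertical stretch without merges marks a natural place to cut. It can mislead on noisy data, so it is best treated as a starting point alongside the plot itself.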

Limitations of hierarchical clustering
Measuring speed in hierarchical clustering
Use the timeit module

Measure the speed of the linkage() function

Use randomly generated points

Run several iterations to extrapolate

Use of timeit module
from scipy.cluster.hierarchy import linkage
import pandas as pd
import random, timeit

# Generate `points` random 2-D points with integer coordinates
points = 100
df = pd.DataFrame({'x': random.sample(range(0, points), points),
                   'y': random.sample(range(0, points), points)})

# %timeit is an IPython/Jupyter magic command, not plain Python
%timeit linkage(df[['x', 'y']], method='ward', metric='euclidean')

1.02 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Comparison of runtime of linkage method
Runtime increases with the number of data points

The increase in runtime is quadratic

Not feasible for large datasets
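
The trend above can be checked outside a notebook with the plain timeit module. The sketch below is one way to do it; time_linkage, the chosen sizes and the repeat count are assumptions for illustration, and the measured times depend on the machine.

```python
import random
import timeit

import pandas as pd
from scipy.cluster.hierarchy import linkage


def time_linkage(points, repeats=3):
    """Return the best time in seconds of a single linkage() call."""
    df = pd.DataFrame({'x': random.sample(range(0, points), points),
                       'y': random.sample(range(0, points), points)})
    timer = timeit.Timer(
        lambda: linkage(df[['x', 'y']], method='ward', metric='euclidean'))
    # Take the minimum over a few repeats to reduce noise
    return min(timer.repeat(repeat=repeats, number=1))


# Double the number of points each time and record the runtime
sizes = [100, 200, 400, 800]
runtimes = {n: time_linkage(n) for n in sizes}

for n, seconds in runtimes.items():
    print(f"{n:>5} points: {seconds * 1000:.2f} ms")
```

If the runtime grows quadratically, doubling the number of points should roughly quadruple the time once the dataset is large enough for fixed overheads to stop dominating.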
