K-Means in Python - Solution
Activity Overview
This activity is designed to consolidate your knowledge of k-means clustering and to teach you how to
implement it in Python using the sklearn library.
We'll use a synthetic dataset from sklearn to visualize the clusters and choose an optimal number for the
data we have.
This activity is designed to help you apply the machine learning algorithms you have learned using
Python packages. Python concepts, instructions, and starter code are embedded within this Jupyter
Notebook to help guide you as you progress through the activity. Remember to run the code in each code
cell prior to submitting the assignment. Upon completing the activity, we encourage you to compare your
work against the solution file to perform a self-assessment.
Basics of k-means
The k-means clustering method is an unsupervised machine learning technique used to identify clusters of
data objects in a dataset. There are many different types of clustering methods, but k-means is one of the
oldest and most approachable ones. These traits make implementing k-means clustering in Python
reasonably straightforward, even for novice programmers and data scientists.
Clustering is a set of techniques used to partition data into groups, or clusters. Clusters are loosely defined
as groups of data objects that are more similar to other objects in their own cluster than they are to data
objects in other clusters. In practice, clustering helps reveal meaningful structure and natural groupings in data.
There are many applications of clustering, such as document clustering and social network analysis.
Clustering applications are relevant in nearly every industry, making clustering a valuable skill for
professionals working with data in any field.
The core of the algorithm works by a two-step process called expectation-maximization. The expectation
step assigns each data point to its nearest centroid. Then the maximization step computes the mean of all
the points for each cluster and sets the new centroid to that mean.
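The two steps above can be sketched directly in NumPy. The points and initial centroids below are made-up illustrative values, not part of the activity's dataset:

```python
import numpy as np

# A tiny made-up 2-D dataset and two initial centroid guesses (illustrative values).
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Expectation step: assign each point to its nearest centroid.
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)        # -> [0, 0, 1, 1]

# Maximization step: move each centroid to the mean of its assigned points.
new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
print(new_centroids)                     # -> [[1.25, 1.5], [8.5, 8.25]]
```

The full algorithm simply repeats these two steps until the assignments stop changing.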
Setting up the Problem
In this section, we will perform this calculation using a small example on a machine learning dataset.
We can generate a small binary (2-class) classification problem using the make_blobs() function from the
sklearn library, much as we did in the notebook about Bayesian probability.
In the code cell below, we've imported the function make_blobs from sklearn and used that to
generate 100 synthetic cluster points with two numerical input variables, each assigned one of the two
classes.
Modify the code cell below to generate 300 samples and set the number of centers to 3.
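A minimal sketch of the modified call might look like the following; the random_state value is an assumption added here for reproducibility:

```python
from sklearn.datasets import make_blobs

# Generate 300 synthetic samples with two numerical features,
# grouped around 3 centers, as the modification asks for.
features, true_labels = make_blobs(
    n_samples=300, centers=3, n_features=2, random_state=42
)
print(features.shape)     # -> (300, 2)
print(true_labels.shape)  # -> (300,)
```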
Here’s a look at the first five elements for each of the variables returned by make_blobs() :
In [ ]: features[:5]
In [ ]: true_labels[:5]
In the code cell below, we import the plotting library matplotlib to visualize the points we've created
above in a scatter plot.
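A sketch of such a scatter plot, coloring each point by its true label (the backend line is only needed when running outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

features, true_labels = make_blobs(n_samples=300, centers=3, random_state=42)

# Scatter plot of the raw features, colored by true cluster label.
fig, ax = plt.subplots()
ax.scatter(features[:, 0], features[:, 1], c=true_labels, s=20)
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
fig.savefig("blobs.png")
```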
Machine learning algorithms need to consider all features on an even playing field. Thus, the values for all
features must be transformed to the same scale.
The process of transforming numerical features to use the same scale is known as feature scaling.
There are several approaches to implementing feature scaling. A great resource for determining which
technique is appropriate for your dataset is sklearn's preprocessing documentation
(https://scikit-learn.org/stable/modules/preprocessing.html).
In this example, you’ll use the StandardScaler class from sklearn . This class implements a type of
feature scaling called standardization. Standardization scales, or shifts, the values for each numerical feature
in your dataset so that the features have a mean of 0 and standard deviation of 1.
In [ ]: from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
Run the code cell below to take a look at how the values have been scaled in scaled_features:
In [ ]: scaled_features[:5]
We can also plot the scaled data: it has the same shape as our original feature data, but is now shifted
and scaled (note the different axes):
init controls the initialization technique. The standard version of the k-means algorithm is
implemented by setting init to "random" .
n_clusters sets k for the clustering step.
In the code cell below, instantiate the KMeans class with the following arguments: init equal to
"random" , n_clusters equal to 4 , and random_state equal to 42 .
Now that the k-means class is ready, the next step is to fit it and predict the data in scaled_features .
In [ ]: kmeans.fit(scaled_features)
y_kmeans = kmeans.predict(scaled_features)
Let's visualize the results by plotting the data colored by these labels. We will also plot the cluster centers as
determined by the k-means estimator:
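One way to sketch that plot end-to-end, marking the centroids stored in the fitted estimator's cluster_centers_ attribute (the backend line is only needed outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

features, _ = make_blobs(n_samples=300, centers=3, random_state=42)
scaled_features = StandardScaler().fit_transform(features)

kmeans = KMeans(init="random", n_clusters=4, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(scaled_features)

# Points colored by predicted cluster; centroids drawn as black crosses.
fig, ax = plt.subplots()
ax.scatter(scaled_features[:, 0], scaled_features[:, 1], c=y_kmeans, s=20)
centers = kmeans.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], c="black", marker="x", s=100)
fig.savefig("clusters.png")
```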
A common performance metric for k-means is the sum of squared errors (SSE): the sum of the squared
distances from each point to its closest centroid (sklearn exposes this as the inertia_ attribute). Since this
is a measure of error, the objective of k-means is to minimize this value.
In this section, you’ll look at one method that is commonly used to choose an appropriate number of
clusters: the elbow method. While a larger k can only result in a smaller SSE, at some value of k we start to
see diminishing returns: the SSE decreases only a little, at the cost of an extra cluster. To perform the elbow
method, we run k-means several times, incrementing k with each run, and record the SSE. There is usually a
sweet spot where the SSE curve starts to bend, known as the elbow point.
In the code cell below, we compute the SSE for each number of clusters from 1 to 10.
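A self-contained sketch of that loop, recording the SSE (inertia_) for each value of k:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

features, _ = make_blobs(n_samples=300, centers=3, random_state=42)
scaled_features = StandardScaler().fit_transform(features)

# Run k-means for k = 1..10 and record the SSE (sklearn's inertia_).
sse = []
for k in range(1, 11):
    km = KMeans(init="random", n_clusters=k, random_state=42, n_init=10)
    km.fit(scaled_features)
    sse.append(km.inertia_)

print(sse)  # SSE shrinks as k grows; look for the bend ("elbow") in the curve
```

Plotting sse against range(1, 11) makes the elbow visible; for this dataset, the bend appears at the answer given below.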
Correct answer: 3