K Means Clustering
Import NumPy for numerical computation, Matplotlib for plotting, and make_blobs
from sklearn.datasets to generate a sample dataset.
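Python3
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# Assumed make_blobs arguments; the original call is missing from the source
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)
fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:,0],X[:,1])
plt.show()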
Output:
Clustering dataset
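Initialize the random centers
Python3
k = 3
clusters = {}
np.random.seed(23)
# Draw each center uniformly from [-2, 2) in every feature dimension
for idx in range(k):
    center = 2*(2*np.random.random((X.shape[1],))-1)
    cluster = {
        'center' : center,
        'points' : []
    }
    clusters[idx] = cluster
clusters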
Output:
{0: {'center': array([0.06919154, 1.78785042]), 'points': []},
1: {'center': array([ 1.06183904, -0.87041662]), 'points': []},
2: {'center': array([-1.11581855, 0.74488834]), 'points': []}}
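Plot the randomly initialized centers with the data points
Python3
plt.scatter(X[:,0],X[:,1])
plt.grid(True)
# Mark each initial center with a red star
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0],center[1],marker = '*',c = 'red')
plt.show()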
Output:
Data points with random center
Define the Euclidean distance
Python3
def distance(p1,p2):
    return np.sqrt(np.sum((p1-p2)**2))
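Create the functions to assign points to clusters and update the cluster centers
Python3
#Implementing E step: assign each point to its nearest center
def assign_clusters(X, clusters):
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x,clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

#Implementing M step: move each center to the mean of its assigned points
def update_clusters(X, clusters):
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis=0)
            clusters[i]['center'] = new_center
        # Reset the point list for the next assignment pass
        clusters[i]['points'] = []
    return clusters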
Create the function to predict the cluster for each data point
Python3
def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i],clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred
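The nested loops in pred_cluster can also be written with NumPy broadcasting. The following is a minimal sketch rather than part of the original walkthrough: the helper name pred_cluster_vectorized and the small arrays are hypothetical, and it assumes the centers have been stacked into a single array of shape (k, n_features).

```python
import numpy as np

def pred_cluster_vectorized(X, centers):
    """Return the index of the nearest center for every row of X.

    X has shape (n_samples, n_features); centers has shape (k, n_features).
    """
    # Broadcasting gives an (n_samples, k, n_features) array of differences.
    diff = X[:, None, :] - centers[None, :, :]
    # Euclidean distance from every point to every center: (n_samples, k).
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Index of the closest center for each point.
    return dist.argmin(axis=1)

# Hypothetical toy data: two points near (0, 0) and one near (5, 5).
X_demo = np.array([[0.0, 0.1], [0.2, -0.1], [5.0, 5.2]])
centers_demo = np.array([[0.0, 0.0], [5.0, 5.0]])
labels_demo = pred_cluster_vectorized(X_demo, centers_demo)
print(labels_demo)
```

With the dictionary used in this article, the centers array could be built as np.array([clusters[i]['center'] for i in range(k)]).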
Assign points, update the centers, and predict the clusters
Python3
clusters = assign_clusters(X,clusters)
clusters = update_clusters(X,clusters)
pred = pred_cluster(X,clusters)
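The calls above run the assignment and update steps once. K-means is usually run by repeating the two steps until the centers stop moving. A minimal self-contained sketch of that loop follows. The function name run_kmeans, the max_iter and tol parameters, the choice of initial centers from the data points, and the toy data are all assumptions made for illustration, not the article's code.

```python
import numpy as np

def run_kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Repeat the assignment (E) and update (M) steps until the centers settle."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centers = X[rng.choice(X.shape[0], size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # E step: assign each point to its nearest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # M step: move each center to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if members.shape[0] > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        # Stop once no center moved by more than tol.
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift < tol:
            break
    return centers, labels

# Hypothetical toy data: two well-separated groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers_demo, labels_demo = run_kmeans(X_demo, k=2)
print(centers_demo)
print(labels_demo)
```

On data this cleanly separated the loop settles after a few passes. On real data the result can depend on the starting centers, which is why a fixed seed is used here.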
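Plot the data points with their predicted clusters and the updated centers
Python3
plt.scatter(X[:,0],X[:,1],c = pred)
# Mark each updated center with a red triangle
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0],center[1],marker = '^',c = 'red')
plt.show()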
Output:
K-means Clustering
Example 2:
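Import the necessary libraries and load the Iris dataset
Python3
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)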
Elbow Method
Choosing the number of clusters is a basic step in any clustering algorithm. One
of the most common techniques for choosing k is the elbow method: fit the model
for a range of k values, plot the sum of squared errors against k, and look for
the point where the curve bends.
Python3
#Find optimum number of cluster
sse = [] #Sum of squared errors for each k
for k in range(1,11):
    km = KMeans(n_clusters=k, random_state=2)
    km.fit(X)
    sse.append(km.inertia_)
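The inertia_ value collected above is documented by scikit-learn as the sum of squared distances of samples to their closest cluster center. To make that concrete, here is a minimal NumPy sketch of the same quantity. The function name sum_squared_error and the toy arrays are hypothetical and serve only as an illustration.

```python
import numpy as np

def sum_squared_error(X, centers):
    """Sum of squared Euclidean distances from each row of X to its nearest center."""
    # Squared distance from every point to every center: (n_samples, k).
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Keep only the distance to the closest center, then add them up.
    return sq_dist.min(axis=1).sum()

# Hypothetical toy data: each point lies exactly 1 unit from its nearest center.
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [4.0, 5.0], [6.0, 5.0]])
centers_demo = np.array([[0.0, 0.0], [5.0, 5.0]])
sse_demo = sum_squared_error(X_demo, centers_demo)
print(sse_demo)  # 4.0
```

This value generally shrinks as k grows, which is why the elbow method looks for the point where the decrease levels off rather than for the minimum.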
Python3
sns.set_style("whitegrid")
g=sns.lineplot(x=range(1,11), y=sse)
plt.show()
Output:
Elbow Method
In the graph above, the curve bends at both k=2 and k=3, so either could be read
as the elbow. We choose k=3.
Build the K-means clustering model
Python3
kmeans = KMeans(n_clusters = 3, random_state = 2)
kmeans.fit(X)
Output:
KMeans(n_clusters=3, random_state=2)
Find the cluster centers
Python3
kmeans.cluster_centers_
Output:
array([[5.006 , 3.428 , 1.462 , 0.246 ],
[5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
[6.85 , 3.07368421, 5.74210526, 2.07105263]])
Predict the cluster group:
Python3
pred = kmeans.fit_predict(X)
pred
Output:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1], dtype=int32)
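Plot the cluster centers with the data points
Python3
plt.figure(figsize=(12,5))
# Left panel: features 0 and 1 (sepal length and sepal width)
plt.subplot(1,2,1)
plt.scatter(X[:,0],X[:,1],c = pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[:2]
    plt.scatter(center[0],center[1],marker = '^',c = 'red')
# Right panel: features 2 and 3 (petal length and petal width)
plt.subplot(1,2,2)
plt.scatter(X[:,2],X[:,3],c = pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[2:4]
    plt.scatter(center[0],center[1],marker = '^',c = 'red')
plt.show()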
Output:
K-means clustering