K Mean Clustering
http://people.revoledu.com/kardi/tutorial/kMean/index.html
Kardi Teknomo – K Mean Clustering Tutorial 2
Suppose we have several objects (four types of medicine), and each object has two attributes, or features, as
shown in the table below. Our goal is to group these objects into K = 2 groups of medicine based on the two
features (weight index and pH).
Object       Feature 1 (X): weight index   Feature 2 (Y): pH
Medicine A   1                             1
Medicine B   2                             1
Medicine C   4                             3
Medicine D   5                             4
Each medicine is one point with two features (X, Y), which we can plot as a coordinate in the
feature space, as shown in the figure below.
[Figure: iteration 0]
1. Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids. Let $c_1$
and $c_2$ denote the coordinates of the centroids; then $c_1 = (1, 1)$ and $c_2 = (2, 1)$.
2. Objects-centroids distance: We calculate the distance from each cluster centroid to each object. Using
the Euclidean distance, the distance matrix at iteration 0 is
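\[
D_0 =
\begin{bmatrix}
0 & 1 & 3.61 & 5 \\
1 & 0 & 2.83 & 4.24
\end{bmatrix}
\begin{array}{l}
c_1 = (1, 1) \quad \text{group 1} \\
c_2 = (2, 1) \quad \text{group 2}
\end{array}
\]
The columns correspond to the objects A, B, C, D, whose coordinates are
\[
\begin{array}{c|cccc}
  & A & B & C & D \\ \hline
X & 1 & 2 & 4 & 5 \\
Y & 1 & 1 & 3 & 4
\end{array}
\]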
Each column in the distance matrix corresponds to one object. The first row holds the distance of
each object to the first centroid, and the second row holds the distance of each object to the second
centroid. For example, the distance from medicine C = (4, 3) to the first centroid $c_1 = (1, 1)$ is
$\sqrt{(4-1)^2 + (3-1)^2} = 3.61$, and its distance to the second centroid $c_2 = (2, 1)$ is
$\sqrt{(4-2)^2 + (3-1)^2} = 2.83$, and so on.
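The distance computation in step 2 can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original program; the names `objects`, `centroids`, and `distance_matrix` are chosen here for illustration only.

```python
import math

# Objects from the table: (weight index, pH)
objects = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
# Initial centroids from step 1 (medicines A and B)
centroids = [(1, 1), (2, 1)]

def distance_matrix(points, centers):
    """Return one row per centroid and one column per object."""
    return [[math.dist(c, p) for p in points] for c in centers]

D0 = distance_matrix(list(objects.values()), centroids)
for row in D0:
    print([round(d, 2) for d in row])
# [0.0, 1.0, 3.61, 5.0]
# [1.0, 0.0, 2.83, 4.24]
```

The two printed rows match the two rows of the distance matrix at iteration 0.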
3. Objects clustering: We assign each object to the group whose centroid is at the minimum distance. Thus,
medicine A is assigned to group 1, and medicines B, C, and D are assigned to group 2. An
element of the group matrix below is 1 if and only if the object is assigned to that group.
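\[
G_0 =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1
\end{bmatrix}
\begin{array}{l}
\text{group 1} \\
\text{group 2}
\end{array}
\qquad \text{(columns: A, B, C, D)}
\]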
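The assignment rule in step 3 amounts to taking, for each column of the distance matrix, the row with the smallest entry. The sketch below is an illustrative Python version; the function name `group_matrix` and the tie-breaking rule (the lower-numbered group wins) are assumptions, since the tutorial does not say how ties are resolved.

```python
# Distance matrix from step 2: rows are centroids, columns are objects A-D
D0 = [[0.0, 1.0, 3.61, 5.0],
      [1.0, 0.0, 2.83, 4.24]]

def group_matrix(distances):
    """Return a 0/1 matrix with a 1 where a centroid is nearest to an object."""
    n_groups, n_objects = len(distances), len(distances[0])
    G = [[0] * n_objects for _ in range(n_groups)]
    for j in range(n_objects):
        column = [distances[i][j] for i in range(n_groups)]
        nearest = column.index(min(column))  # assumption: ties go to the lower group
        G[nearest][j] = 1
    return G

G0 = group_matrix(D0)
print(G0)  # [[1, 0, 0, 0], [0, 1, 1, 1]]
```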
4. Iteration 1, determine centroids: Knowing the members of each group, we now compute the new
centroid of each group based on these memberships. Group 1 has only one member, so its
centroid remains at $c_1 = (1, 1)$. Group 2 now has three members, so its centroid is the average
coordinate of the three members:
\[
c_2 = \left( \frac{2 + 4 + 5}{3}, \frac{1 + 3 + 4}{3} \right) = \left( \frac{11}{3}, \frac{8}{3} \right).
\]
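The centroid update in step 4 is a per-group average of the member coordinates. Here is a small Python sketch using exact fractions so that the result can be compared with 11/3 and 8/3 directly; `update_centroids` is a name introduced here, and the sketch assumes every group has at least one member.

```python
from fractions import Fraction

points = [(1, 1), (2, 1), (4, 3), (5, 4)]   # medicines A, B, C, D
G0 = [[1, 0, 0, 0],                         # group 1 membership
      [0, 1, 1, 1]]                         # group 2 membership

def update_centroids(points, G):
    """Average the coordinates of the members of each group."""
    centroids = []
    for row in G:
        members = [p for p, flag in zip(points, row) if flag]
        n = len(members)  # assumed to be nonzero
        centroids.append(tuple(Fraction(sum(coord), n) for coord in zip(*members)))
    return centroids

c1, c2 = update_centroids(points, G0)
print(c1)  # (Fraction(1, 1), Fraction(1, 1))
print(c2)  # (Fraction(11, 3), Fraction(8, 3))
```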
[Figure: iteration 1; vertical axis: attribute 2 (Y): pH]
5. Iteration 1, objects-centroids distances: The next step is to compute the distance of all objects to
the new centroids. As in step 2, the distance matrix at iteration 1 is
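\[
D_1 =
\begin{bmatrix}
0 & 1 & 3.61 & 5 \\
3.14 & 2.36 & 0.47 & 1.89
\end{bmatrix}
\begin{array}{l}
c_1 = (1, 1) \quad \text{group 1} \\
c_2 = \left( \frac{11}{3}, \frac{8}{3} \right) \quad \text{group 2}
\end{array}
\]
The columns correspond to the objects A, B, C, D, whose coordinates are
\[
\begin{array}{c|cccc}
  & A & B & C & D \\ \hline
X & 1 & 2 & 4 & 5 \\
Y & 1 & 1 & 3 & 4
\end{array}
\]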
6. Iteration 1, objects clustering: As in step 3, we assign each object based on the minimum
distance. Based on the new distance matrix, we move medicine B to group 1 while all the other
objects stay in their groups. The group matrix is shown below.
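\[
G_1 =
\begin{bmatrix}
1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1
\end{bmatrix}
\begin{array}{l}
\text{group 1} \\
\text{group 2}
\end{array}
\qquad \text{(columns: A, B, C, D)}
\]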
7. Iteration 2, determine centroids: We now repeat step 4 to calculate the new centroid coordinates
based on the clustering of the previous iteration. Group 1 and group 2 each have two members, so the
new centroids are
\[
c_1 = \left( \frac{1 + 2}{2}, \frac{1 + 1}{2} \right) = \left( 1\tfrac{1}{2}, 1 \right)
\quad \text{and} \quad
c_2 = \left( \frac{4 + 5}{2}, \frac{3 + 4}{2} \right) = \left( 4\tfrac{1}{2}, 3\tfrac{1}{2} \right).
\]
[Figure: iteration 2; vertical axis: attribute 2 (Y): pH]
8. Iteration 2, objects-centroids distances: Repeating step 2, the new distance matrix at
iteration 2 is
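\[
D_2 =
\begin{bmatrix}
0.5 & 0.5 & 3.20 & 4.61 \\
4.30 & 3.54 & 0.71 & 0.71
\end{bmatrix}
\begin{array}{l}
c_1 = \left( 1\tfrac{1}{2}, 1 \right) \quad \text{group 1} \\
c_2 = \left( 4\tfrac{1}{2}, 3\tfrac{1}{2} \right) \quad \text{group 2}
\end{array}
\]
The columns correspond to the objects A, B, C, D, whose coordinates are
\[
\begin{array}{c|cccc}
  & A & B & C & D \\ \hline
X & 1 & 2 & 4 & 5 \\
Y & 1 & 1 & 3 & 4
\end{array}
\]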
9. Iteration-2, Objects clustering: Again, we assign each object based on the minimum distance.
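\[
G_2 =
\begin{bmatrix}
1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1
\end{bmatrix}
\begin{array}{l}
\text{group 1} \\
\text{group 2}
\end{array}
\qquad \text{(columns: A, B, C, D)}
\]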
We obtain $G_2 = G_1$. Comparing the grouping of the last iteration with that of this iteration shows
that the objects no longer change groups. Thus, the k-means computation has
reached stability and no more iterations are needed. The final grouping is:
Object       Feature 1 (X): weight index   Feature 2 (Y): pH   Group (result)
Medicine A   1                             1                   1
Medicine B   2                             1                   1
Medicine C   4                             3                   2
Medicine D   5                             4                   2
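The whole walk-through can be reproduced by alternating the assignment and centroid-update steps until the grouping stops changing. The following Python sketch is an illustration written for this example, not the author's Visual Basic or Matlab program; the function name `k_means` and the choice to keep the old centroid when a group becomes empty are assumptions.

```python
import math

def k_means(points, centroids):
    """Alternate assignment and centroid update until no object changes group."""
    groups = None
    iterations = 0
    while True:
        # Assignment: index of the nearest centroid for each point
        new_groups = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_groups == groups:
            return groups, centroids, iterations
        groups = new_groups
        iterations += 1
        # Update: each centroid becomes the mean of its members
        for i in range(len(centroids)):
            members = [p for p, g in zip(points, groups) if g == i]
            if members:  # assumption: keep the old centroid if a group is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))

points = [(1, 1), (2, 1), (4, 3), (5, 4)]          # medicines A, B, C, D
groups, centroids, iterations = k_means(points, [(1, 1), (2, 1)])
print([g + 1 for g in groups])  # [1, 1, 2, 2]
print(centroids)                # [(1.5, 1.0), (4.5, 3.5)]
print(iterations)               # 2
```

The group labels are zero-based inside the function, so 1 is added when printing to match the table above.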
Since we are not sure about the location of the centroids, we need to adjust the centroid locations based on the
current data. Then we assign every data point to its nearest new centroid. This process is repeated until no data
point moves to another cluster. Mathematically, this loop can be proved to converge.
As an example, I have written Visual Basic and Matlab code. You may download the complete program from
http://www.planetsourcecode.com/xq/ASP/txtCodeId.26983/lngWId.1/qx/vb/scripts/ShowCode.htm or from its
official web page at http://people.revoledu.com/kardi/tutorial/kMean/download.htm
The number of features is limited to two, but you may extend it to any number of features. The main code
is shown here.
isStillMoving = True
Do While isStillMoving
' this loop is guaranteed to converge
When the user clicks the picture box to input new data (X, Y), the program groups the data by
minimizing the sum of squared distances between each data point and its cluster centroid. Each dot
represents an object, and its coordinate (X, Y) represents the two attributes of the object. The color of the dot
and the label number represent the cluster. You may try to see how the clusters change when additional data is
entered.
For those who like to use Matlab, the Matlab Statistics Toolbox contains a function named kmeans. If you do not
have the toolbox, you may use my code below. The kMeansCluster and DistMatrix functions can be
downloaded as text files from http://people.revoledu.com/kardi/tutorial/kMean/matlab_kMeans.htm.
Alternatively, you may simply type the code below.
function y=kMeansCluster(m,k)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% kMeansCluster - Simple k means clustering algorithm
% Author: Kardi Teknomo, Ph.D.
%
% Purpose: classify the objects in data matrix based on the attributes
% Criteria: minimize Euclidean distance between centroids and object points
% For more explanation of the algorithm, see http://people.revoledu.com/kardi/tutorial/kMean/index.html
% Output: matrix data plus an additional column representing the group of each object
%
% Example: m = [ 1 1; 2 1; 4 3; 5 4] or in a nice form
% m = [ 1 1;
% 2 1;
% 4 3;
% 5 4]
% k=2
% kMeansCluster(m,k) produces m = [ 1 1 1;
% 2 1 1;
% 4 3 2;
% 5 4 2]
% Input:
% m - matrix data: objects in rows and attributes in columns
% k - number of groups
%
% Local Variables
% c - centroid coordinate size (1:k, 1:maxCol)
% g - current iteration group matrix size (1:maxRow)
% i - scalar iterator
% maxCol - scalar number of columns in the data matrix m = number of attributes
% maxRow - scalar number of rows in the data matrix m = number of objects
% temp - previous iteration group matrix size (1:maxRow)
% z - minimum value (not needed)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
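[maxRow, maxCol]=size(m);
if maxRow<=k,
y=[m, (1:maxRow)']; % no more objects than groups: each object is its own group
else
% initial value of centroids: the first k objects, as in the worked example
% (the initialization is missing from this listing; restored here as an assumption)
c=m(1:k,:);
temp=zeros(maxRow,1); % previous group vector, initialized to zeros
while 1,
d=DistMatrix(m,c); % calculate objects-centroid distances
[z,g]=min(d,[],2); % find group matrix g
if isequal(g,temp),
break; % stop the iteration
else
temp=g; % copy group matrix to temporary variable
end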
for i=1:k
c(i,:)=mean(m(g==i,:),1); % average down the rows, also correct for a one-member group
end
end
y=[m,g];
end
The Matlab function kMeansCluster above calls the function DistMatrix, shown in the code below.
function d=DistMatrix(A,B)
%%%%%%%%%%%%%%%%%%%%%%%%%
% DISTMATRIX returns the distance matrix between points in A=[x1 y1] and B=[x2 y2]
% Author: Kardi Teknomo, Ph.D.
% see http://people.revoledu.com/kardi/
%
% The numbers of points in A and B are not necessarily the same.
% It can be used for distance-in-a-slice (Spacing) or distance-between-slice (Headway).
%
% A and B must each contain two columns:
% first column is the X coordinates
% second column is the Y coordinates
% The distance matrix holds the distances between points in A as rows
% and points in B as columns.
% example: Spacing = DistMatrix(A,A)
% Headway = DistMatrix(A,B), with hA ~= hB or hA=hB
% A=[1 2; 3 4; 5 6]; B=[4 5; 6 2; 1 5; 5 8]
% DistMatrix(A,B)= [ 4.24 5.00 3.00 7.21;
% 1.41 3.61 2.24 4.47;
% 1.41 4.12 4.12 2.00 ]
%%%%%%%%%%%%%%%%%%%%%%%%%%%
[hA,wA]=size(A);
[hB,wB]=size(B);
if hA==1 && hB==1
d=sqrt(dot((A-B),(A-B)));
else
C=[ones(1,hB);zeros(1,hB)];
D=flipud(C);
E=[ones(1,hA);zeros(1,hA)];
F=flipud(E);
G=A*C;
H=A*D;
I=B*E;
J=B*F;
d=sqrt((G-I').^2+(H-J').^2);
end
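To check the example given in the DistMatrix header comment, and to make the orientation of the result explicit, here is a short Python sketch. It is an illustrative equivalent, not a translation of the Matlab code above, and the name `dist_matrix` is introduced here. Note that kMeansCluster calls DistMatrix(m,c), so objects run along the rows and centroids along the columns, which is the transpose of the distance matrices shown in the worked example.

```python
import math

def dist_matrix(A, B):
    """Rows follow the points in A, columns follow the points in B."""
    return [[math.dist(a, b) for b in B] for a in A]

A = [(1, 2), (3, 4), (5, 6)]
B = [(4, 5), (6, 2), (1, 5), (5, 8)]
result = dist_matrix(A, B)
for row in result:
    print([round(d, 2) for d in row])
# [4.24, 5.0, 3.0, 7.21]
# [1.41, 3.61, 2.24, 4.47]
# [1.41, 4.12, 4.12, 2.0]

# Objects as rows and centroids as columns, as in DistMatrix(m,c)
objects = [(1, 1), (2, 1), (4, 3), (5, 4)]
centroids = [(1, 1), (2, 1)]
by_object = dist_matrix(objects, centroids)
print(len(by_object), len(by_object[0]))  # 4 2
```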
Step 3. Repeat step 2 until convergence is achieved, that is until a pass through the training examples causes
no new assignments.
Convergence is guaranteed if the following conditions are satisfied:
1. Each switch in step 2 decreases the sum of the squared distances from each training example to that
training example's group centroid.
2. There are only finitely many partitions of the training examples into k clusters.
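The first condition can be observed numerically on the four-medicine example by recording the sum of squared distances after every reassignment and every centroid update. The sketch below is a Python illustration under the same assumptions as the worked example (Euclidean distance, medicines A and B as initial centroids); the names `sse` and `history` are introduced here.

```python
import math

points = [(1, 1), (2, 1), (4, 3), (5, 4)]      # medicines A, B, C, D
centroids = [(1.0, 1.0), (2.0, 1.0)]           # initial centroids: A and B

def sse(points, groups, centroids):
    """Sum of squared distances from each point to its group centroid."""
    return sum(math.dist(p, centroids[g]) ** 2 for p, g in zip(points, groups))

history = []
groups = None
while True:
    new_groups = [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]
    if new_groups == groups:
        break
    groups = new_groups
    history.append(sse(points, groups, centroids))   # after reassignment
    for i in range(len(centroids)):
        members = [p for p, g in zip(points, groups) if g == i]
        if members:  # assumption: an empty group keeps its old centroid
            centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    history.append(sse(points, groups, centroids))   # after centroid update

print([round(v, 2) for v in history])  # [26.0, 9.33, 4.78, 1.5]
```

Each recorded value is no larger than the one before it, and because only finitely many groupings of four objects exist, the loop must stop.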
For more up-to-date information about this tutorial, visit its official page:
http://people.revoledu.com/kardi/tutorial/kMean/index.html