Standardization and Its Effects On K-Means Clustering Algorithm
Abstract: Data clustering is an important data exploration technique with many applications in data mining. K-means is one of the best known data mining methods for partitioning a dataset into groups of patterns, and many methods have been proposed to improve its performance. Standardization is the central preprocessing step in data mining, used to rescale the values of features or attributes from different dynamic ranges into a specific range. In this paper, we analyze the performance of three standardization methods on the conventional K-means algorithm. By comparing the results on infectious diseases datasets, it was found that the z-score standardization method is more effective and efficient than the min-max and decimal scaling standardization methods.
One of the easiest and most widely used techniques for creating groupings by optimizing a qualifying criterion function, defined either globally (over the total designs) or locally (on a subset of the designs), is the K-means technique (Vaishali and Rupa, 2011). K-means clustering is one of the older predictive methods: n observations in d-dimensional space (an integer d) are given and the problem is to determine a set of c points that minimizes the mean squared distance from each data point to the nearest center to which it belongs. No exact polynomial-time algorithms are known for this problem. The problem can be set up as an integer programming problem but, because solving integer programs with a large number of variables is time consuming, clusters are often computed using a fast, heuristic method that generally produces good (but not necessarily optimal) solutions (Jain et al., 1999).
The K-means algorithm is one such method, where clustering requires less effort. In the beginning, the number of clusters c is determined and the centres of these clusters are assumed: any random objects can be taken as the initial centroids, or the first k objects in sequence can also serve as the initial centroids. However, if there are some features with a large size or great variability, these features will strongly affect the clustering result. In this case, data standardization is an important preprocessing task to scale or control the variability of the datasets.
The K-means algorithm repeats the three steps below until convergence:
• Determine the centroid coordinates
• Determine the distance of each object to the centroids
• Group the objects based on minimum distance
The aim of clustering is to discover commonalities and patterns in large data sets by splitting the data into groups. Since the data sets are assumed to be unlabeled, clustering is frequently regarded as the most valuable unsupervised learning problem (Cios et al., 2007).
A direct application of geometrical measures (distances) to features having large ranges will implicitly assign greater weight in the metric than to features having smaller ranges. Furthermore, the features need to be dimensionless, since the numerical values of the ranges of dimensional features depend on the units of measurement and, hence, a choice of units of measurement may significantly alter the outcome of clustering. Therefore, one should not employ distance measures like the Euclidean distance without normalization of the data sets (Aksoy and Haralick, 2001; Larose, 2005).
Preprocessing (Luai et al., 2006) is essential before applying any data exploration algorithm, to enhance the performance of the results. Normalization of the dataset is among the preprocessing steps in data exploration, in which the attribute data are scaled to fall within a small specified range. Normalization before clustering is specifically needed for distance metrics,
Corresponding Author: Dauda Usman, Department of Mathematical Sciences, Faculty of Science, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor Darul Ta'azim, Malaysia
Res. J. App. Sci. Eng. Technol., 6(17): 3299-3303, 2013
like the Euclidean distance, which are sensitive to variations in the magnitude or scale of the attributes. In actual applications, due to variations in the selection of attribute values, one attribute might overpower another. Normalization prevents features with large numbers from outweighing features with smaller numbers. The aim is to equalize the size or magnitude and the variability of the features.
Data preprocessing techniques (Vaishali and Rupa, 2011) are applied to raw data to make the data clean, noise free and consistent. Data normalization standardizes the raw data by converting them into a specific range using a linear transformation, which can generate good quality clusters and improve the accuracy of clustering algorithms.
There is no universally defined rule for normalizing datasets and thus the choice of a particular normalization rule is largely left to the discretion of the user (Karthikeyani and Thangavel, 2009). The data normalization methods considered here are Z-score, Min-Max and decimal scaling. In the Z-score, the values of an attribute X are standardized based on the mean and standard deviation of X; this method is useful when the actual minimum and maximum of attribute X are unknown. Decimal scaling standardizes by moving the decimal point of the values of attribute X, where the number of decimal places moved depends on the maximum absolute value of X. Min-Max transforms the data set to values between 0.0 and 1.0 by subtracting the minimum value from each value and dividing by the range of the values.

MATERIALS AND METHODS

Let Y = {X_1, X_2, ..., X_n} denote the d-dimensional raw data set. Then the data matrix is an n×d matrix given by:

X_1, X_2, ..., X_n = [ a_11 ... a_1d
                        .         .
                       a_n1 ... a_nd ]                    (1)

Z-score: The Z-score is a form of standardization used for transforming normal variates to standard score form. Given a set of raw data Y, the Z-score standardization formula is defined as:

Z(x_ij) = (x_ij - x̄_j) / σ_j                             (2)

where x̄_j and σ_j are the sample mean and standard deviation of the jth attribute, respectively. The transformed variable will have a mean of 0 and a variance of 1; the location and scale information of the original variable is lost (Jain and Dubes, 1988). One important restriction of the Z-score standardization is that it must be applied in global standardization and not in within-cluster standardization (Milligan and Cooper, 1988).

Min-max: Min-Max normalization is the process of taking data measured in its engineering units and transforming it to a value between 0.0 and 1.0, whereby the lowest (min) value is set to 0.0 and the highest (max) value is set to 1.0. This provides an easy way to compare values that are measured using different scales or different units of measure. The normalized value is defined as:

MM(X_ij) = (X_ij - X_min) / (X_max - X_min)               (3)

Decimal scaling: Normalization by decimal scaling moves the decimal point of the values of feature X. The number of decimal places moved depends on the maximum absolute value of X. A modified value DS(X) corresponding to X is obtained using:

DS(X_ij) = X_ij / 10^c                                    (4)

where c is the smallest integer such that max[|DS(X_ij)|] < 1.

K-means clustering: Given a set of observations (x_1, x_2, ..., x_n), where each observation is a d-dimensional real vector, K-means clustering aims to partition the n observations into k sets (k ≤ n), S = {S_1, S_2, ..., S_k}, so as to minimize the Within-Cluster Sum of Squares (WCSS):

arg min_S Σ_{i=1}^{k} Σ_{x_j ∈ S_i} ||x_j - μ_i||^2       (5)

where μ_i is the mean of the points in S_i.

RESULTS AND DISCUSSION

In this section, details of the overall results are discussed. A complete program using MATLAB was developed to find the optimal solution. Experiments were conducted on the three standardization procedures to compare their performance with the K-means clustering algorithm on an infectious diseases dataset having 15 data objects and 8 attributes, as shown in Table 1. Eight datasets, the Malaria, Typhoid fever, Cholera, Measles, Chickenpox, Tuberculosis, Tetanus and Leprosy datasets for X1 to X8 respectively, are used to test the performance of the three standardization methods on the K-means clustering technique. The sum of squares error representing
distances between data points and their cluster centers, together with the points attached to each cluster, was used to measure the clustering quality among the three different standardization methods: the smaller the value of the sum of squares error, the higher the accuracy and the better the result.

Table 1: The original dataset with 15 data objects and 8 attributes
        X1  X2  X3  X4  X5  X6  X7  X8
Day 1    7   1   1   1   1   2  10   3
Day 2    8   2   1   2   1   2   1   3
Day 3    9   2   1   1   1   2   1   1
Day 4   10   4   2   1   1   2   1   2
Day 5    1   5   1   1   1   2   1   3
Day 6    2   5   4   4   5   7  10   3
Day 7    1   5   1   1   1   2   1   3
Day 8    2   5   4   4   5   4   3   3
Day 9    3   3   1   1   1   2   2   3
Day 10   4   6   8   8   1   3   4   3
Day 11   3   3   1   1   1   2   2   3
Day 12   4   6   8   8   1   3   4   3
Day 13   5   4   1   1   3   2   1   3
Day 14   6   8  10  10   8   7  10   9
Day 15   3   3   1   1   1   2   2   3

Figure 1 presents the result of the conventional K-means algorithm using the original dataset having 15 data objects and 8 attributes, as shown in Table 1. Some points attached to cluster one and one point attached to cluster two are outside the cluster formation, with the error sum of squares equal to 141.00.

[Fig. 1: Conventional K-means algorithm]

Z-score analysis: Figure 2 presents the result of the K-means algorithm using the dataset rescaled with the Z-score standardization method, having 15 data objects and 8 attributes as shown in Table 2. All the points attached to cluster one and cluster two are within the cluster formation, with the error sum of squares equal to 49.42.

[Fig. 2: K-means algorithm with a Z-score standardized data set]

Decimal scaling analysis: Figure 3 presents the result of the K-means algorithm using the dataset rescaled with the decimal scaling method of data standardization, having 15 data objects and 8 attributes as shown in Table 3. Some points attached to cluster one and one point attached to cluster two are outside the cluster formation, with the error sum of squares equal to 0.14, which converts to 140.00.

[Fig. 3: K-means algorithm with a decimal scaling standardized data set]

[Fig. 4: K-means algorithm with min-max standardization data set]
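The error-sum-of-squares quality measure used in these comparisons can be computed as in the following minimal Python sketch. This is an illustration only, not the paper's MATLAB program; the 2-D points and centroids below are made up for demonstration.

```python
# Error sum of squares (ESS/WCSS): the total squared Euclidean distance from
# each data point to the centre of the cluster it is assigned to.
# A smaller value indicates tighter, more accurate clusters.

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def error_sum_of_squares(points, centroids):
    # Each point contributes its squared distance to its nearest centroid.
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Made-up example: two tight clusters around (1, 1.5) and (8.5, 8).
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(1.0, 1.5), (8.5, 8.0)]
print(error_sum_of_squares(points, centroids))  # -> 1.0
```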
Table 5: Summary of the results for cluster formations
                               Cluster 1   Cluster 2
                               points out  points out  ESSs
Conventional K-means           2           2           159.00
K-means with Z-score           0           0           45.32
K-means with decimal scaling   3           1           130.00
K-means with Min-Max           4           1           09.21

CONCLUSION

... conventional K-means clustering algorithm. It can be concluded that standardization before the clustering algorithm leads to a better quality, more efficient and more accurate cluster result. It is also important to select a specific standardization procedure according to the nature of the datasets under analysis. In this analysis we propose the Z-score as the most powerful method, giving more accurate and efficient results than the other two methods for the K-means clustering algorithm.
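The experimental pipeline described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' MATLAB program: all function names are ours, the decimal-scaling helper assumes attribute values of at least 1 (as in Table 1), and the sample column is attribute X1 from Table 1.

```python
# Sketch of the three standardization methods compared in the paper, plus a
# small Lloyd-style K-means that reports the error sum of squares (ESS).
import math
import random

def z_score(col):
    # Eq. (2): (x - mean) / standard deviation; result has mean 0, variance 1.
    m = sum(col) / len(col)
    sd = math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
    return [(v - m) / sd for v in col]

def min_max(col):
    # Eq. (3): rescale so the minimum maps to 0.0 and the maximum to 1.0.
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

def decimal_scaling(col):
    # Eq. (4): divide by 10^c, with c the number of integer digits of the
    # largest magnitude, so every scaled value falls below 1.
    c = len(str(int(max(abs(v) for v in col))))
    return [v / 10 ** c for v in col]

def kmeans(points, k, iters=100, seed=0):
    # Plain Lloyd's algorithm: assign each point to its nearest centroid,
    # recompute centroids as cluster means, repeat; returns centroids and ESS.
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: _d2(p, centroids[j]))
            clusters[nearest].append(p)
        centroids = [_mean(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    ess = sum(min(_d2(p, c) for c in centroids) for p in points)
    return centroids, ess

def _d2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def _mean(pts):
    return [sum(vals) / len(pts) for vals in zip(*pts)]

x1 = [7, 8, 9, 10, 1, 2, 1, 2, 3, 4, 3, 4, 5, 6, 3]  # attribute X1, Table 1
print(min_max(x1)[:4])  # first four rescaled values, now in [0, 1]
```

In the same way, `z_score` or `decimal_scaling` can be applied column by column before calling `kmeans`, and the resulting ESS values compared across the three methods.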
REFERENCES

Cios, K.J., W. Pedrycz, R.W. Swiniarski and L.A. Kurgan, 2007. Data Mining: A Knowledge Discovery Approach. Springer, New York.
Jain, A. and R. Dubes, 1988. Algorithms for Clustering Data. Prentice Hall, NY.
Jain, A.K., M.N. Murty and P.J. Flynn, 1999. Data clustering: A review. ACM Comput. Surv., 31(3): 264-323.
Karthikeyani, V.N. and K. Thangavel, 2009. Impact of normalization in distributed K-means clustering. Int. J. Soft Comput., 4(4): 168-172.
Larose, D.T., 2005. Discovering Knowledge in Data: An Introduction to Data Mining. Wiley, Hoboken, NJ.
Luai, A.S., S. Zyad and K. Basel, 2006. Data mining: A preprocessing engine. J. Comput. Sci., 2(9): 735-739.
Milligan, G. and M. Cooper, 1988. A study of standardization of variables in cluster analysis. J. Classif., 5: 181-204.
Vaishali, R.P. and G.M. Rupa, 2011. Impact of outlier removal and normalization approach in modified k-means clustering algorithm. Int. J. Comput. Sci., 8(5): 331-336.