Standardization and Its Effects On K-Means Clustering Algorithm
Abstract: Data clustering is an important data exploration technique with many applications in data mining. K-means is one of the best known data mining methods for partitioning a dataset into groups of patterns, and many methods have been proposed to improve its performance. Standardization is the central preprocessing step in data mining, used to rescale the values of features or attributes from different dynamic ranges into a specific range. In this paper, we analyze the performance of three standardization methods on the conventional K-means algorithm. By comparing the results on infectious diseases datasets, it was found that the z-score standardization method is more effective and efficient than the min-max and decimal scaling standardization methods.
One of the easiest and most widely used techniques for creating groupings by optimizing a qualifying criterion function, defined either globally (over the total designs) or locally (on a subset of the designs), is the K-means technique (Vaishali and Rupa, 2011). K-means clustering is one of the older predictive methods: n observations in d-dimensional space (an integer d) are given and the problem is to determine a set of c points that minimizes the mean squared distance from each data point to the nearest center to which it belongs. No exact polynomial-time algorithms are known for this problem. The problem can be set up as an integer programming problem but, because solving integer programs with a large number of variables is time consuming, clusters are often computed using a fast, heuristic method that generally produces good (but not necessarily optimal) solutions (Jain et al., 1999).
The K-means algorithm is one such method, where clustering requires less effort. In the beginning, the number of clusters c is determined and the centres of these clusters are assumed: any random objects can be taken as the initial centroids, or the first k objects in sequence can also serve as the initial centroids. However, if there are some features with a large size or great variability, these features will strongly affect the clustering result. In this case, data standardization is an important preprocessing task to scale or control the variability of the datasets.
The K-means algorithm repeats the three steps below until convergence:
• Determine the centroid coordinates
• Determine the distance of each object to the centroids
• Group the objects based on minimum distance
The aim of clustering is to discover commonalities and patterns in large data sets by splitting the data into groups. Since the data sets are assumed to be unlabeled, clustering is frequently regarded as the most valuable unsupervised learning problem (Cios et al., 2007).
A direct application of geometrical measures (distances) to features having large ranges will implicitly assign greater weight in the metric than to features having smaller ranges. Furthermore, the features need to be dimensionless, since the numerical values of the ranges of dimensional features depend on the units of measurement and, hence, a choice of units of measurement may significantly alter the outcome of clustering. Therefore, one should not employ distance measures like the Euclidean distance without normalization of the data sets (Aksoy and Haralick, 2001; Larose, 2005).
Preprocessing (Luai et al., 2006) is essential before applying any data exploration algorithm, to enhance the performance of the results. Normalization of the dataset is among the preprocessing steps in data exploration, in which the attribute data are scaled to fall within a small specified range. Normalization before clustering is specifically needed for distance metrics,
Corresponding Author: Dauda Usman, Department of Mathematical Sciences, Faculty of Science, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor Darul Ta'azim, Malaysia
Res. J. App. Sci. Eng. Technol., 6(17): 3299-3303, 2013
like the Euclidean distance, which are sensitive to variations in the magnitude or scale of the attributes. In actual applications, due to variations in the selection of attribute values, one attribute might overpower another. Normalization prevents features with large numbers from outweighing features with smaller numbers. The aim is to equalize the size or magnitude and the variability of the features.
Data preprocessing techniques (Vaishali and Rupa, 2011) are applied to raw data to make the data clean, noise free and consistent. Data normalization standardizes the raw data by converting them into a specific range using a linear transformation, which can generate good quality clusters and improve the accuracy of clustering algorithms.
There is no universally defined rule for normalizing datasets and thus the choice of a particular normalization rule is largely left to the discretion of the user (Karthikeyani and Thangavel, 2009). The data normalization methods considered here are Z-score, Min-Max and decimal scaling. In the Z-score, the values of an attribute X are standardized based on the mean and standard deviation of X; this method is useful when the actual minimum and maximum of attribute X are unknown. Decimal scaling standardizes by moving the decimal point of the values of attribute X, where the number of decimal places moved depends on the maximum absolute value of X. Min-Max transforms the data set to values between 0.0 and 1.0 by subtracting the minimum value from each value and dividing by the range of the values.

MATERIALS AND METHODS

Let Y = {X_1, X_2, ..., X_n} denote the d-dimensional raw data set. Then the data matrix is an n×d matrix given by:

X_1, X_2, ..., X_n = [ a_11 ... a_1d
                        .         .
                       a_n1 ... a_nd ]                    (1)

Z-score: The Z-score is a form of standardization used for transforming normal variates to standard score form. Given a set of raw data Y, the Z-score standardization formula is defined as:

Z(x_ij) = (x_ij - x̄_j) / σ_j                             (2)

where x̄_j and σ_j are the sample mean and standard deviation of the jth attribute, respectively. The transformed variable will have a mean of 0 and a variance of 1; the location and scale information of the original variable is lost (Jain and Dubes, 1988). One important restriction of the Z-score standardization is that it must be applied in global standardization and not in within-cluster standardization (Milligan and Cooper, 1988).

Min-max: Min-Max normalization is the process of taking data measured in its engineering units and transforming it to a value between 0.0 and 1.0, whereby the lowest (min) value is set to 0.0 and the highest (max) value is set to 1.0. This provides an easy way to compare values that are measured using different scales or different units of measure. The normalized value is defined as:

MM(X_ij) = (X_ij - X_min) / (X_max - X_min)               (3)

Decimal scaling: Normalization by decimal scaling moves the decimal point of the values of feature X. The number of decimal places moved depends on the maximum absolute value of X. A modified value DS(X) corresponding to X is obtained using:

DS(X_ij) = X_ij / 10^c                                    (4)

where c is the smallest integer such that max[|DS(X_ij)|] < 1.

K-means clustering: Given a set of observations (x_1, x_2, ..., x_n), where each observation is a d-dimensional real vector, K-means clustering aims to partition the n observations into k sets (k ≤ n), S = {S_1, S_2, ..., S_k}, so as to minimize the Within-Cluster Sum of Squares (WCSS):

arg min_S Σ_{i=1}^{k} Σ_{x_j ∈ S_i} ||x_j - μ_i||^2       (5)

where μ_i is the mean of the points in S_i.

RESULTS AND DISCUSSION

In this section, details of the overall results are discussed. A complete program using MATLAB was developed to find the optimal solution. Experiments were conducted on the three standardization procedures to compare their performance with the K-means clustering algorithm on an infectious diseases dataset having 15 data objects and 8 attributes, as shown in Table 1. Eight datasets, the Malaria, Typhoid fever, Cholera, Measles, Chickenpox, Tuberculosis, Tetanus and Leprosy datasets for X1 to X8 respectively, are used to test the performance of the three standardization methods on the K-means clustering technique. The sum of squares error representing
distances between data points and their cluster centers, together with the points attached to each cluster, was used to measure the clustering quality among the three different standardization methods: the smaller the value of the sum of squares error, the higher the accuracy and the better the result.

Table 1: The original dataset with 15 data objects and 8 attributes
        X1  X2  X3  X4  X5  X6  X7  X8
Day 1    7   1   1   1   1   2  10   3
Day 2    8   2   1   2   1   2   1   3
Day 3    9   2   1   1   1   2   1   1
Day 4   10   4   2   1   1   2   1   2
Day 5    1   5   1   1   1   2   1   3
Day 6    2   5   4   4   5   7  10   3
Day 7    1   5   1   1   1   2   1   3
Day 8    2   5   4   4   5   4   3   3
Day 9    3   3   1   1   1   2   2   3
Day 10   4   6   8   8   1   3   4   3
Day 11   3   3   1   1   1   2   2   3
Day 12   4   6   8   8   1   3   4   3
Day 13   5   4   1   1   3   2   1   3
Day 14   6   8  10  10   8   7  10   9
Day 15   3   3   1   1   1   2   2   3

Figure 1 presents the result of the conventional K-means algorithm using the original dataset having 15 data objects and 8 attributes, as shown in Table 1. Some points attached to cluster one and one point attached to cluster two are outside the cluster formation, with the error sum of squares equal to 141.00.

[Fig. 1: Conventional K-means algorithm]

Z-score analysis: Figure 2 presents the result of the K-means algorithm using the dataset rescaled with the Z-score standardization method, having 15 data objects and 8 attributes as shown in Table 2. All the points attached to cluster one and cluster two are within the cluster formation, with the error sum of squares equal to 49.42.

[Fig. 2: K-means algorithm with a Z-score standardized data set]

Decimal scaling analysis: Figure 3 presents the result of the K-means algorithm using the dataset rescaled with the decimal scaling method of data standardization, having 15 data objects and 8 attributes as shown in Table 3. Some points attached to cluster one and one point attached to cluster two are outside the cluster formation, with the error sum of squares equal to 0.14, which converts to 140.00.

[Fig. 3: K-means algorithm with a decimal scaling standardized data set]

[Fig. 4: K-means algorithm with min-max standardization data set]
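The error-sum-of-squares quality measure used in these comparisons can be computed as in the following minimal Python sketch. This is an illustration only, not the paper's MATLAB program; the 2-D points and centroids below are made up for demonstration.

```python
# Error sum of squares (ESS/WCSS): the total squared Euclidean distance from
# each data point to the centre of the cluster it is assigned to.
# A smaller value indicates tighter, more accurate clusters.

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def error_sum_of_squares(points, centroids):
    # Each point contributes its squared distance to its nearest centroid.
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Made-up example: two tight clusters around (1, 1.5) and (8.5, 8).
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(1.0, 1.5), (8.5, 8.0)]
print(error_sum_of_squares(points, centroids))  # -> 1.0
```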
Table 5: Summary of the results for cluster formations
                               Cluster 1   Cluster 2
                               points out  points out  ESSs
Conventional K-means           2           2           159.00
K-means with Z-score           0           0           45.32
K-means with decimal scaling   3           1           130.00
K-means with Min-Max           4           1           09.21

CONCLUSION

... conventional K-means clustering algorithm. It can be concluded that standardization before the clustering algorithm leads to a better quality, more efficient and more accurate cluster result. It is also important to select a specific standardization procedure according to the nature of the datasets under analysis. In this analysis we propose the Z-score as the most powerful method, giving more accurate and efficient results than the other two methods for the K-means clustering algorithm.
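The experimental pipeline described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' MATLAB program: all function names are ours, the decimal-scaling helper assumes attribute values of at least 1 (as in Table 1), and the sample column is attribute X1 from Table 1.

```python
# Sketch of the three standardization methods compared in the paper, plus a
# small Lloyd-style K-means that reports the error sum of squares (ESS).
import math
import random

def z_score(col):
    # Eq. (2): (x - mean) / standard deviation; result has mean 0, variance 1.
    m = sum(col) / len(col)
    sd = math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
    return [(v - m) / sd for v in col]

def min_max(col):
    # Eq. (3): rescale so the minimum maps to 0.0 and the maximum to 1.0.
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

def decimal_scaling(col):
    # Eq. (4): divide by 10^c, with c the number of integer digits of the
    # largest magnitude, so every scaled value falls below 1.
    c = len(str(int(max(abs(v) for v in col))))
    return [v / 10 ** c for v in col]

def kmeans(points, k, iters=100, seed=0):
    # Plain Lloyd's algorithm: assign each point to its nearest centroid,
    # recompute centroids as cluster means, repeat; returns centroids and ESS.
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: _d2(p, centroids[j]))
            clusters[nearest].append(p)
        centroids = [_mean(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    ess = sum(min(_d2(p, c) for c in centroids) for p in points)
    return centroids, ess

def _d2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def _mean(pts):
    return [sum(vals) / len(pts) for vals in zip(*pts)]

x1 = [7, 8, 9, 10, 1, 2, 1, 2, 3, 4, 3, 4, 5, 6, 3]  # attribute X1, Table 1
print(min_max(x1)[:4])  # first four rescaled values, now in [0, 1]
```

In the same way, `z_score` or `decimal_scaling` can be applied column by column before calling `kmeans`, and the resulting ESS values compared across the three methods.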
REFERENCES

Cios, K.J., W. Pedrycz, R.W. Swiniarski and L.A. Kurgan, 2007. Data Mining: A Knowledge Discovery Approach. Springer, New York.
Jain, A. and R. Dubes, 1988. Algorithms for Clustering Data. Prentice Hall, NY.
Jain, A.K., M.N. Murty and P.J. Flynn, 1999. Data clustering: A review. ACM Comput. Surv., 31(3): 264-323.
Karthikeyani, V.N. and K. Thangavel, 2009. Impact of normalization in distributed K-means clustering. Int. J. Soft Comput., 4(4): 168-172.
Larose, D.T., 2005. Discovering Knowledge in Data: An Introduction to Data Mining. Wiley, Hoboken, NJ.
Luai, A.S., S. Zyad and K. Basel, 2006. Data mining: A preprocessing engine. J. Comput. Sci., 2(9): 735-739.
Milligan, G. and M. Cooper, 1988. A study of standardization of variables in cluster analysis. J. Classif., 5: 181-204.
Vaishali, R.P. and G.M. Rupa, 2011. Impact of outlier removal and normalization approach in modified k-means clustering algorithm. Int. J. Comput. Sci., 8(5): 331-336.